← Home

Bringing NAM A2 to Embedded Hardware

11 February, 2026

Note: this post is a companion to a post I wrote for the Tone3000 blog, and goes into more depth on the optimizations and results we found.

We all love NAM, and we want to run our favourite captures EVERYWHERE. That's what we have been trying to achieve with A2, especially with Slimmable models. As part of A2's development, we are working closely with partners to make sure A2 suits their hardware. We also wanted a general idea of what running NAM looks like on hardware platforms with more resources, so we decided to get our hands dirty and implement a simple NAM loader for the Electrosmith Daisy Seed, a board built around an ARM Cortex-M7 that is currently very popular for DSP applications.

The Daisy Seed board has the following specs (the ones that matter here):

- STM32H750 MCU with an ARM Cortex-M7 core running at 480 MHz
- 512 KB of SRAM plus 128 KB of DTCM (tightly-coupled memory)

The Daisy Seed optionally has an extra 64 MB of SDRAM, but that RAM is much slower than the onboard SRAM. People usually use it for loopers and delays, not for anything that requires super intense realtime math like a neural model… so we ignored it in these experiments.

To make testing and benchmarking easier, we plugged the Daisy Seed into a Daisy Pod, which is basically a powered breakout board for the Daisy that includes stereo audio I/O, a micro SD card slot, and a couple of knobs. Most important for us was the SD card, as we could load a bunch of NAM models into it and have our benchmarking software run all of them sequentially (and write the logs to the SD card as well).

Challenges

We knew from the beginning that memory limitations were strict (and that's one of the reasons for Slimmable NAM). With 1 MB for both the model weights and any buffers, we are restricted in the size of model we can use. We decided to try a model smaller than A1, known in the community as A1-nano, with a small but important tweak: replacing the tanh activation with a ReLU. Tanh is expensive to compute, which has led the deep learning community to look for cheaper alternatives; ReLU fits the bill as one of the simplest activation functions in widespread use.
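For reference, the swap really is as cheap as it sounds. A minimal sketch (not the actual Core implementation) of what the replacement activation amounts to:

```cpp
#include <algorithm>

// ReLU: a single compare/select, no transcendental math -- a good fit
// for a Cortex-M7 with no fast expf. (Sketch, not the Core code.)
inline float relu(float x) { return std::max(0.0f, x); }

// For contrast, tanh(x) = (e^x - e^-x) / (e^x + e^-x) needs at least
// one expf call plus a division per sample.
```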

While the code in NeuralAmpModelerCore has been battle-tested in the NAM plugin, it was not designed with embedded applications in mind. When we first ran it on the Daisy Seed, even with an A1-nano-ReLU model, it took over 5 seconds of compute time to process 2 seconds of audio... which clearly won't work for a pedal that is supposed to process audio in real time.

Optimization process

To tackle that issue, we first instrumented the code with profiling to measure how long it spent in each operation, so we could identify easy targets for optimization. The main culprit for the slowdown was Eigen, the library we use for linear algebra -- specifically, how it does matrix multiplications with small matrices, and how it allocates temporary memory on embedded devices without an operating system. We added specialized operations for the specific matrix sizes used in A1-nano-ReLU, as well as other unrelated optimizations.

Once we identified all the bottlenecks in the code for the A1-nano-ReLU model, we came up with different optimization tactics. I would love to say that I first implemented microbenchmarks to see which ones really mattered and then added them to the code, but the real process was the reverse: I optimized the actual Core code first, then implemented microbenchmarks to better understand what was going on 🤡.

The microbenchmarks are all available in this repository. If you are building an embedded device that runs NAM, you can use them as a reference point for your own hardware.

Results

Micro-benchmark Results: Inline Ops vs Eigen

These benchmarks measure individual operations in isolation to guide optimization decisions for the NeuralAmpModelerCore WaveNet inference engine. Each section presents results for both platforms side-by-side.

Platforms

| | Desktop | Daisy |
| --- | --- | --- |
| CPU | Apple Silicon (MacBook Pro M5) | ARM Cortex-M7 @ 480 MHz, FTZ=1, DN=1 |
| Compiler flags | `-O3 -ffast-math -march=native -DNDEBUG -std=c++17` | `-O2 -ffast-math -funroll-loops -ftree-vectorize -std=gnu++17` |
| Platform | macOS | Electrosmith Daisy Seed (STM32H750, 512 KB SRAM, 128 KB DTCM) |
| Eigen parallelization | Disabled (`-DEIGEN_DONT_PARALLELIZE`) | Disabled (`-DEIGEN_DONT_PARALLELIZE`) |
| Timing | `std::chrono::high_resolution_clock` | DWT CYCCNT (cycle-accurate, zero overhead); 1 cycle = 2.08 ns @ 480 MHz |
| Iterations | 5000 (warmup: 100) | 1000 (warmup: 100) |

1. Inline GEMM vs Eigen GEMM

The 1x1 convolution (Conv1x1) is the single hottest operation in WaveNet inference -- it runs once per layer per frame-block to mix channels. It boils down to a matrix-matrix multiply: (channels_out x channels_in) * (channels_in x frames). Eigen's general GEMM path allocates temporary panel buffers via malloc on every call, which is fine on desktop but catastrophically slow on Cortex-M7 with newlib. The "inline generic" variant replaces Eigen with a plain triple-loop; the "unrolled" variant fully unrolls the loop at compile time for known dimension pairs (e.g., 4x8), keeping weights in registers.

Matrix sizes matching typical WaveNet 1x1 convolution dimensions (channels x channels, applied across frames). Each size is measured once: Eigen is benchmarked first, then the inline variant(s) run against that same baseline with the same input data.
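To make the comparison concrete, here is a sketch of what the inline variant looks like (hypothetical names; the real kernels live in NeuralAmpModelerCore). With the dimensions fixed as template parameters, the compiler can fully unroll the inner loops and keep the weights in registers:

```cpp
#include <cstddef>

// Inline GEMM: out (M x frames) = W (M x K) * in (K x frames),
// with each frame stored contiguously. A plain triple loop -- crucially,
// no heap allocation, unlike Eigen's general GEMM path.
template <int M, int K>
void gemm_inline(const float* W, const float* in, float* out, int frames) {
  for (int f = 0; f < frames; ++f) {
    const float* x = in + f * K;  // one input frame (K channels)
    float* y = out + f * M;       // one output frame (M channels)
    for (int m = 0; m < M; ++m) {
      float acc = 0.0f;
      // M and K are compile-time constants, so the compiler can fully
      // unroll this loop for known pairs like 4x8.
      for (int k = 0; k < K; ++k) acc += W[m * K + k] * x[k];
      y[m] = acc;
    }
  }
}
```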

Desktop (48 frames)

| Size | Eigen | Inline Generic | Speedup |
| --- | --- | --- | --- |
| 2x2 | 191.1 ns | 122.7 ns | 1.56x |
| 4x4 | 196.2 ns | 519.4 ns | 0.38x |
| 4x8 | 230.2 ns | 1.0 us | 0.23x |
| 4x8 (unrolled) | 230.2 ns | 75.2 ns | 3.06x |
| 8x4 | 199.8 ns | 992.9 ns | 0.20x |
| 8x8 | 227.4 ns | 1.6 us | 0.14x |
| 8x8 (unrolled) | 227.4 ns | 111.6 ns | 2.04x |

Desktop (2048 frames)

| Size | Eigen | Inline Generic | Speedup |
| --- | --- | --- | --- |
| 2x2 | 2.1 us | 2.3 us | 0.90x |
| 4x4 | 2.0 us | 8.4 us | 0.24x |
| 4x8 | 2.4 us | 18.8 us | 0.13x |
| 4x8 (unrolled) | 2.4 us | 1.9 us | 1.28x |
| 8x4 | 2.6 us | 16.8 us | 0.15x |
| 8x8 | 3.9 us | 38.0 us | 0.10x |
| 8x8 (unrolled) | 3.9 us | 3.4 us | 1.15x |

Daisy (48 frames)

| Size | Eigen | Inline Generic | Speedup |
| --- | --- | --- | --- |
| 2x2 | 4044 cyc (8.4 us) | 2759 cyc (5.7 us) | 1.47x |
| 4x4 | 8117 cyc (16.9 us) | 7175 cyc (14.9 us) | 1.13x |
| 4x8 | 10657 cyc (22.2 us) | 9385 cyc (19.6 us) | 1.14x |
| 4x8 (unrolled) | 10657 cyc (22.2 us) | 3437 cyc (7.2 us) | 3.10x |
| 8x4 | 14314 cyc (29.8 us) | 13701 cyc (28.5 us) | 1.04x |
| 8x8 | 19578 cyc (40.8 us) | 18125 cyc (37.8 us) | 1.08x |

Desktop: Generic inline triple-loops lose badly to Eigen (0.10--0.38x) because Eigen leverages NEON SIMD. Fully unrolled compile-time specializations beat Eigen at all sizes and frame counts (1.15--3.06x), with the largest wins at small frame counts where Eigen's setup overhead is proportionally larger.

Daisy: Inline GEMM beats Eigen at every size because Eigen's general GEMM path calls malloc/free for panel buffer allocation on every invocation. The generic triple-loop wins by 1.04--1.47x. The fully unrolled 4x8 specialization (weights loaded into registers) wins by 3.10x -- the biggest single optimization in the entire WaveNet pipeline.


1b. DTCM Memory Placement Effect on GEMM (Daisy only)

The STM32H750 has 128 KB of DTCM (Data Tightly-Coupled Memory) that provides single-cycle, deterministic access -- no cache misses, no wait states. Production code copies model weights into DTCM to avoid cache eviction during the full WaveNet pipeline. This benchmark isolates the question: does DTCM placement speed up the GEMM kernel itself, or is the benefit only about avoiding interference from other data?

Generic Inline GEMM

| Size | SRAM (baseline) | DTCM weight | Speedup | DTCM both | Speedup |
| --- | --- | --- | --- | --- | --- |
| 2x2 | 2758 cyc (5.7 us) | 2763 cyc (5.8 us) | 1.00x | 2897 cyc (6.0 us) | 0.95x |
| 4x4 | 7173 cyc (14.9 us) | 7179 cyc (15.0 us) | 1.00x | 7316 cyc (15.2 us) | 0.98x |
| 4x8 | 9384 cyc (19.6 us) | 9390 cyc (19.6 us) | 1.00x | 9528 cyc (19.9 us) | 0.98x |
| 8x8 | 18125 cyc (37.8 us) | 18136 cyc (37.8 us) | 1.00x | 18285 cyc (38.1 us) | 0.99x |

Unrolled 4x8 GEMM

| Variant | Cycles | Speedup vs SRAM |
| --- | --- | --- |
| SRAM (baseline) | 3437 cyc (7.2 us) | -- |
| DTCM weight only | 3555 cyc (7.4 us) | 0.97x |
| DTCM weight + input | 3430 cyc (7.1 us) | 1.00x |

Daisy: DTCM placement has no measurable benefit for GEMM operands. In some cases it is marginally slower (up to 5%). The Cortex-M7's D-cache achieves near-single-cycle latency for hot data that fits in cache, and these small weight matrices (64--256 bytes) are always cache-hot after warmup. DTCM's advantage is deterministic latency (no cold-start misses), not throughput for steady-state computation. The production DTCM weight copy remains justified for avoiding cache eviction by other data in the full WaveNet pipeline, but the GEMM kernel itself sees no benefit in isolation. This benchmark is not super representative of real-world usage, as we'd often be juggling multiple operations and invalidating the cache.


2. Effect of __restrict__ on Inline GEMM

The __restrict__ qualifier tells the compiler that pointers don't alias, theoretically enabling more aggressive load/store reordering and vectorization in tight loops. Since our inline GEMM reads from two input matrices and writes to a third, aliasing analysis could matter. Both versions use noinline to prevent the compiler from deducing non-aliasing from calling context, isolating the effect of the qualifier itself.
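For illustration, the qualifier is applied like this (a sketch, not the benchmark code itself):

```cpp
// z = a + b with a no-aliasing promise on all three pointers.
// __restrict__ is a GCC/Clang extension (C99 spells it `restrict`).
void add_noalias(const float* __restrict__ a,
                 const float* __restrict__ b,
                 float* __restrict__ z, int n) {
  // The compiler may now reorder loads and stores across iterations,
  // since it can assume z never overlaps a or b.
  for (int i = 0; i < n; ++i) z[i] = a[i] + b[i];
}
```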

Desktop (48 frames)

| Size | Without __restrict__ | With __restrict__ | Speedup |
| --- | --- | --- | --- |
| 4x4 (generic) | 446.8 ns | 464.8 ns | 0.96x |
| 4x8 (generic) | 916.4 ns | 1.1 us | 0.83x |
| 8x8 (generic) | 1.5 us | 1.5 us | 1.00x |
| 4x8 (unrolled) | 36.4 ns | 36.3 ns | 1.00x |

Desktop (2048 frames)

| Size | Without __restrict__ | With __restrict__ | Speedup |
| --- | --- | --- | --- |
| 4x4 (generic) | 10.1 us | 8.5 us | 1.19x |
| 4x8 (generic) | 21.1 us | 20.4 us | 1.03x |
| 8x8 (generic) | 41.0 us | 41.6 us | 0.99x |
| 4x8 (unrolled) | 1.9 us | 1.9 us | 1.01x |

Daisy (48 frames)

| Size | Without __restrict__ | With __restrict__ | Speedup |
| --- | --- | --- | --- |
| 4x4 | 7172 cyc (14.9 us) | 7174 cyc (14.9 us) | 1.00x |
| 4x8 | 9384 cyc (19.5 us) | 9385 cyc (19.6 us) | 1.00x |
| 8x8 | 18125 cyc (37.8 us) | 18126 cyc (37.8 us) | 1.00x |

Desktop: __restrict__ has negligible effect in most cases. The compiler already deduces non-aliasing. The one outlier (4x4 at 2048 frames, 1.19x) is likely noise or a minor alignment effect.

Daisy: Zero measurable effect, same as desktop. The compiler already deduces non-aliasing regardless of platform.


3. std::memcpy vs Eigen Block Assignment

WaveNet layers frequently copy intermediate activation buffers -- for example, saving the pre-activation state for the residual path, or copying gated outputs into the next layer's input. These are contiguous matrix copies (all elements in one flat buffer). The question is whether a plain std::memcpy or Eigen's block assignment (.leftCols() = ...) generates faster code, given that memcpy quality varies dramatically across C libraries.
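Because the buffers are contiguous, the whole copy collapses to a single call. A sketch, assuming column-major storage with no padding between columns (hypothetical names):

```cpp
#include <cstring>

// Copy the first `cols` columns of a column-major matrix whose column
// stride equals `rows`. The region is one flat run of rows*cols floats,
// so a single memcpy is valid. (On the Daisy, newlib's memcpy lost to
// Eigen's generated stores, so this choice is platform-dependent.)
void copy_left_cols(const float* src, float* dst, int rows, int cols) {
  std::memcpy(dst, src,
              sizeof(float) * static_cast<std::size_t>(rows) * cols);
}
```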

Desktop (48 frames)

| Rows | Eigen block assign | std::memcpy | Speedup |
| --- | --- | --- | --- |
| 2 | 25.1 ns | 20.6 ns | 1.21x |
| 4 | 25.9 ns | 27.1 ns | 0.96x |
| 8 | 37.6 ns | 32.9 ns | 1.14x |
| 16 | 79.9 ns | 49.9 ns | 1.60x |

Desktop (2048 frames)

| Rows | Eigen block assign | std::memcpy | Speedup |
| --- | --- | --- | --- |
| 2 | 379.2 ns | 219.6 ns | 1.73x |
| 4 | 767.8 ns | 375.6 ns | 2.04x |
| 8 | 1.2 us | 571.1 ns | 2.15x |
| 16 | 2.0 us | 1.3 us | 1.60x |

Daisy (48 frames)

| Rows | Eigen block assign | std::memcpy | memcpy Speedup |
| --- | --- | --- | --- |
| 4 | 379 cyc (0.79 us) | 1636 cyc (3.4 us) | 0.23x |
| 8 | 715 cyc (1.5 us) | 3175 cyc (6.6 us) | 0.23x |
| 16 | 1344 cyc (2.8 us) | 6248 cyc (13.0 us) | 0.22x |

Desktop: std::memcpy consistently beats Eigen block assignment, with the gap widening at larger sizes (up to 2.15x). The platform's highly optimized memcpy implementation (NEON/AVX) outperforms Eigen's generated stores.

Daisy: The opposite of desktop -- Eigen block assignment is 4.3--4.6x faster than std::memcpy. Newlib's memcpy implementation is not optimized for the Cortex-M7's AXI bus and D-cache; Eigen's generated code likely uses more efficient register-width stores.


4. Element-wise Operations: Unrolled Loop vs Eigen

Element-wise addition and accumulation appear throughout WaveNet: skip connections sum layer outputs (z = a + b), and residual connections accumulate into a running buffer (dst += src). Eigen uses expression templates that defer evaluation and fuse operations, but for simple element-wise ops the template machinery may add overhead compared to a plain loop that the compiler can auto-vectorize directly.
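The "plain loop" variants are as simple as they sound -- a sketch (not the Core code):

```cpp
// Skip connection sum and residual accumulation as plain loops that
// the compiler can auto-vectorize directly, with no expression-template
// machinery in between.
void add_elementwise(const float* a, const float* b, float* z, int n) {
  for (int i = 0; i < n; ++i) z[i] = a[i] + b[i];  // z = a + b
}

void accumulate(float* dst, const float* src, int n) {
  for (int i = 0; i < n; ++i) dst[i] += src[i];    // dst += src
}
```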

Desktop -- Addition (z = a + b)

| Channels | Frames | Eigen | Unrolled | Speedup |
| --- | --- | --- | --- | --- |
| 4 | 48 | 34.4 ns | 32.9 ns | 1.04x |
| 8 | 48 | 63.9 ns | 60.4 ns | 1.06x |
| 16 | 48 | 100.9 ns | 73.2 ns | 1.38x |
| 4 | 2048 | 824.9 ns | 541.1 ns | 1.52x |
| 8 | 2048 | 1.4 us | 903.5 ns | 1.51x |
| 16 | 2048 | 3.2 us | 2.7 us | 1.19x |

Desktop -- Accumulation (dst += src)

| Channels | Frames | Eigen | Unrolled | Speedup |
| --- | --- | --- | --- | --- |
| 4 | 48 | 43.0 ns | 31.7 ns | 1.36x |
| 8 | 48 | 95.6 ns | 64.9 ns | 1.47x |
| 16 | 48 | 208.1 ns | 141.3 ns | 1.47x |
| 4 | 2048 | 736.8 ns | 594.0 ns | 1.24x |
| 8 | 2048 | 2.3 us | 2.3 us | 1.02x |
| 16 | 2048 | 3.9 us | 3.9 us | 1.01x |

Daisy -- Addition (z = a + b, 48 frames)

| Channels | Frames | Eigen a + b | Unrolled loop | Speedup |
| --- | --- | --- | --- | --- |
| 4 | 48 | 756 cyc (1.6 us) | 698 cyc (1.5 us) | 1.08x |
| 8 | 48 | 1454 cyc (3.0 us) | 1346 cyc (2.8 us) | 1.08x |

Desktop: Unrolled loops are 1.0--1.5x faster for element-wise ops, with bigger wins at small/medium buffer sizes typical of real-time audio. Accumulation shows larger gains at small frame counts.

Daisy: Unrolled loops are modestly faster (1.08x), consistent with desktop results at 48 frames. The savings per call are small (~100--120 cycles) but accumulate over many WaveNet layers.


5. Bias Broadcast: Unrolled vs Eigen colwise()

Every convolution layer adds a per-channel bias vector after the matrix multiply: each element of the bias is broadcast across all frames in that channel's row. Eigen provides colwise() for this pattern; the alternative is a hand-written loop that iterates over channels and uses pointer arithmetic to stride across frames.
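The hand-written alternative looks roughly like this (a sketch assuming a row-major channels x frames layout for clarity; names are hypothetical):

```cpp
// Per-channel bias broadcast: each channel's bias is added to all
// frames in that channel's row, via plain pointer arithmetic.
void add_bias(float* act, const float* bias, int channels, int frames) {
  for (int c = 0; c < channels; ++c) {
    float* row = act + c * frames;  // stride to this channel's row
    const float b = bias[c];
    for (int f = 0; f < frames; ++f) row[f] += b;
  }
}
```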

Desktop

| Channels | Frames | Eigen colwise() | Unrolled | Speedup |
| --- | --- | --- | --- | --- |
| 2 | 48 | 77.2 ns | 71.3 ns | 1.08x |
| 4 | 48 | 60.4 ns | 57.3 ns | 1.05x |
| 8 | 48 | 99.3 ns | 87.3 ns | 1.14x |
| 2 | 2048 | 1.7 us | 1.2 us | 1.38x |
| 4 | 2048 | 1.0 us | 630.5 ns | 1.66x |
| 8 | 2048 | 1.6 us | 1.3 us | 1.24x |

Daisy (48 frames)

| Channels | Frames | Eigen colwise() | Unrolled loop | Speedup |
| --- | --- | --- | --- | --- |
| 4 | 48 | 3541 cyc (7.4 us) | 3744 cyc (7.8 us) | 0.95x |
| 8 | 48 | 5721 cyc (11.9 us) | 5838 cyc (12.2 us) | 0.98x |

Desktop: Unrolled bias broadcast is 1.05--1.66x faster than Eigen's colwise(), with larger wins at higher frame counts.

Daisy: Eigen wins by 2--5%, reversing the desktop result. Eigen's colwise() likely generates tighter scalar code on ARM than the hand-written loop. Keep Eigen for bias broadcast on Daisy.


6. Hardswish: Branchy vs Branchless

Hardswish is a piecewise activation function with three regions: 0 for x <= -3, x for x >= 3, and x * (x + 3) / 6 in between. The naive implementation uses if/else branches, which cause pipeline stalls from misprediction when inputs span all three regions (as they do in typical WaveNet activations). The branchless version uses fmin/fmax to compute the same result without conditional jumps. Input values are uniformly distributed in [-5, 5] to trigger worst-case branch misprediction.
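The two variants look like this (a sketch; fmin/fmax typically compile to conditional-select instructions rather than jumps):

```cpp
#include <cmath>

// Branchy reference: three regions, two data-dependent branches.
inline float hardswish_branchy(float x) {
  if (x <= -3.0f) return 0.0f;
  if (x >= 3.0f) return x;
  return x * (x + 3.0f) / 6.0f;
}

// Branchless: clamp (x + 3) to [0, 6], then scale. For x <= -3 the
// clamp yields 0; for x >= 3 it yields 6, so x * 6 / 6 = x.
inline float hardswish_branchless(float x) {
  return x * std::fmin(std::fmax(x + 3.0f, 0.0f), 6.0f) / 6.0f;
}
```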

Desktop

| n | Branchy | Branchless | Unrolled | Branchless Speedup | Max Error |
| --- | --- | --- | --- | --- | --- |
| 192 | 109.6 ns | 35.8 ns | 49.4 ns | 3.06x | 4.77e-07 |
| 384 | 247.4 ns | 73.2 ns | 86.2 ns | 3.38x | 4.77e-07 |
| 8192 | 6.0 us | 854.4 ns | 1.2 us | 6.97x | 4.77e-07 |
| 16384 | 9.6 us | 1.7 us | 2.4 us | 5.73x | 4.77e-07 |
| 32768 | 21.2 us | 4.5 us | 5.9 us | 4.76x | 4.77e-07 |

Daisy

| n | Branchy (if/else) | Branchless + unrolled | Speedup |
| --- | --- | --- | --- |
| 192 | 6248 cyc (13.0 us) | 5209 cyc (10.9 us) | 1.20x |
| 384 | 12413 cyc (25.9 us) | 10245 cyc (21.3 us) | 1.21x |

Desktop: Branchless hardswish is 3--7x faster than the branchy version. The simple branchless scalar loop actually beats the manually unrolled variant -- the compiler auto-vectorizes the scalar version more effectively.

Daisy: Branchless hardswish is 1.20--1.21x faster, a much smaller win than desktop. The Cortex-M7 branch predictor handles the 3-region pattern better than expected, and without SIMD the branchless version can't exploit vectorization. Still worth using since it's a consistent 20% improvement.


7. Activation Loop Unrolling (1-wide vs 4-wide)

Activation functions are applied element-wise to every value in the activation buffer after each convolution. Manual 4-wide unrolling processes four elements per loop iteration, giving the compiler explicit instruction-level parallelism (ILP) and aligning with SIMD register widths (4 floats = 1 SSE/NEON register). The benefit depends on whether the activation is compute-bound (cheap ops like ReLU) or latency-bound (expensive ops like expf in Sigmoid/SiLU).
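A sketch of the 4-wide pattern for ReLU (illustrative; the real loops live in the Core activation code):

```cpp
// 4-wide manual unroll: four independent operations per iteration
// expose instruction-level parallelism and map onto one 128-bit
// SIMD register (4 floats) where the target has SIMD.
void relu_unrolled4(float* x, int n) {
  int i = 0;
  for (; i + 3 < n; i += 4) {
    x[i]     = x[i]     > 0.0f ? x[i]     : 0.0f;
    x[i + 1] = x[i + 1] > 0.0f ? x[i + 1] : 0.0f;
    x[i + 2] = x[i + 2] > 0.0f ? x[i + 2] : 0.0f;
    x[i + 3] = x[i + 3] > 0.0f ? x[i + 3] : 0.0f;
  }
  for (; i < n; ++i) x[i] = x[i] > 0.0f ? x[i] : 0.0f;  // scalar tail
}
```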

Desktop

| Activation | n | 1-wide | 4-wide | Speedup |
| --- | --- | --- | --- | --- |
| ReLU | 192 | 58.0 ns | 58.1 ns | 1.00x |
| ReLU | 384 | 105.5 ns | 88.3 ns | 1.19x |
| ReLU | 16384 | 2.9 us | 2.0 us | 1.44x |
| Sigmoid | 192 | 317.2 ns | 313.6 ns | 1.01x |
| Sigmoid | 384 | 645.5 ns | 612.8 ns | 1.05x |
| Sigmoid | 16384 | 18.0 us | 16.2 us | 1.11x |
| SiLU | 192 | 188.6 ns | 191.4 ns | 0.99x |
| SiLU | 384 | 390.5 ns | 393.5 ns | 0.99x |
| SiLU | 16384 | 16.5 us | 16.5 us | 0.99x |
| Softsign | 192 | 19.5 ns | 19.1 ns | 1.02x |
| Softsign | 384 | 35.6 ns | 34.1 ns | 1.04x |
| Softsign | 16384 | 1.6 us | 1.5 us | 1.05x |

Daisy

| Activation | n | 1-wide | 4-wide | Speedup |
| --- | --- | --- | --- | --- |
| ReLU | 384 | 4269 cyc (8.9 us) | 3918 cyc (8.2 us) | 1.09x |
| Sigmoid | 384 | 47625 cyc (99.2 us) | 47325 cyc (98.6 us) | 1.01x |

Desktop: Manual 4-wide unrolling helps ReLU (up to 1.44x) but has negligible effect on transcendental-heavy activations (Sigmoid, SiLU) where expf dominates the cost. Softsign is already fast enough that unrolling provides no meaningful benefit.

Daisy: A modest 9% win for ReLU, consistent with desktop at n=384. Sigmoid is dominated by expf cost (~124 cycles/element), making unrolling irrelevant.


8. LUT Activation vs Computed (expf)

Sigmoid, SiLU, and Tanh all depend on expf(), which is the single most expensive scalar operation in WaveNet inference (~124 cycles per element on Cortex-M7). A lookup table (LUT) pre-computes the activation over a fixed input range, then uses linear interpolation between table entries at runtime -- replacing the expf call with an array index, a multiply, and an add. The tradeoff is a small approximation error that depends on table size. The computed baseline is measured once per activation/size and shared across all LUT sizes.
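A sketch of the approach for sigmoid (the range and table size here are illustrative choices, not necessarily what Core uses):

```cpp
#include <cmath>

// Sigmoid via a lookup table plus linear interpolation. Inputs are
// clamped to [LO, HI]; outside that range sigmoid is effectively 0 or 1.
struct SigmoidLUT {
  static constexpr int N = 1024;
  static constexpr float LO = -8.0f, HI = 8.0f;
  float table[N + 1];  // N intervals need N + 1 endpoints

  SigmoidLUT() {
    for (int i = 0; i <= N; ++i) {
      float x = LO + (HI - LO) * i / N;
      table[i] = 1.0f / (1.0f + std::exp(-x));  // expf paid once, up front
    }
  }

  float operator()(float x) const {
    if (x <= LO) return table[0];
    if (x >= HI) return table[N];
    float t = (x - LO) * (N / (HI - LO));  // fractional table index
    int i = static_cast<int>(t);
    float frac = t - i;
    // Runtime cost: one index, one multiply, one add -- no expf.
    return table[i] + frac * (table[i + 1] - table[i]);
  }
};
```

A 1024-point table over this range keeps the interpolation error in the 1e-6 ballpark for sigmoid, consistent with the accuracy column in the tables below.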

Desktop (n = 192)

| Activation | LUT Size | Computed | LUT + lerp | Speedup | Max Error |
| --- | --- | --- | --- | --- | --- |
| Sigmoid | 256 | 512.1 ns | 275.2 ns | 1.86x | 7.39e-05 |
| Sigmoid | 1024 | 512.1 ns | 257.5 ns | 1.99x | 4.59e-06 |
| Sigmoid | 2048 | 512.1 ns | 256.5 ns | 2.00x | 1.19e-06 |
| SiLU | 256 | 452.6 ns | 255.8 ns | 1.77x | 3.80e-04 |
| SiLU | 1024 | 452.6 ns | 256.9 ns | 1.76x | 2.32e-05 |
| SiLU | 2048 | 452.6 ns | 227.9 ns | 1.99x | 6.24e-06 |
| Tanh | 256 | 495.8 ns | 200.0 ns | 2.48x | 5.79e-04 |
| Tanh | 1024 | 495.8 ns | 197.7 ns | 2.51x | 3.63e-05 |
| Tanh | 2048 | 495.8 ns | 197.6 ns | 2.51x | 9.00e-06 |

Desktop (n = 384)

| Activation | LUT Size | Computed | LUT + lerp | Speedup | Max Error |
| --- | --- | --- | --- | --- | --- |
| Sigmoid | 256 | 677.2 ns | 359.5 ns | 1.88x | 7.38e-05 |
| Sigmoid | 1024 | 677.2 ns | 359.6 ns | 1.88x | 4.63e-06 |
| Sigmoid | 2048 | 677.2 ns | 349.8 ns | 1.94x | 1.25e-06 |
| SiLU | 256 | 577.9 ns | 328.1 ns | 1.76x | 3.83e-04 |
| SiLU | 1024 | 577.9 ns | 307.8 ns | 1.88x | 2.38e-05 |
| SiLU | 2048 | 577.9 ns | 308.3 ns | 1.87x | 6.36e-06 |
| Tanh | 256 | 693.8 ns | 298.1 ns | 2.33x | 5.70e-04 |
| Tanh | 1024 | 693.8 ns | 286.3 ns | 2.42x | 3.71e-05 |
| Tanh | 2048 | 693.8 ns | 286.5 ns | 2.42x | 9.24e-06 |

Desktop (n = 16384)

| Activation | LUT Size | Computed | LUT + lerp | Speedup | Max Error |
| --- | --- | --- | --- | --- | --- |
| Sigmoid | 256 | 16.7 us | 9.4 us | 1.77x | 7.41e-05 |
| Sigmoid | 1024 | 16.7 us | 9.6 us | 1.74x | 4.75e-06 |
| Sigmoid | 2048 | 16.7 us | 9.7 us | 1.71x | 1.31e-06 |
| SiLU | 256 | 16.3 us | 9.5 us | 1.71x | 3.85e-04 |
| SiLU | 1024 | 16.3 us | 9.6 us | 1.69x | 2.45e-05 |
| SiLU | 2048 | 16.3 us | 9.7 us | 1.68x | 6.57e-06 |
| Tanh | 256 | 24.4 us | 9.5 us | 2.56x | 5.91e-04 |
| Tanh | 1024 | 24.4 us | 9.7 us | 2.52x | 3.71e-05 |
| Tanh | 2048 | 24.4 us | 9.7 us | 2.50x | 9.83e-06 |

Daisy (n = 384)

| Activation | n | LUT Size | Computed | LUT + lerp | Speedup |
| --- | --- | --- | --- | --- | --- |
| SiLU | 384 | 2048 | 47552 cyc (99.1 us) | 16901 cyc (35.2 us) | 2.81x |

Desktop: LUT + linear interpolation is 1.7--2.6x faster than computed expf. A 1024-point table offers a good speed/accuracy tradeoff (~5e-06 max error). Tanh benefits the most since it evaluates two expf calls per element.

Daisy: LUT + linear interpolation is 2.81x faster than computed SiLU, a larger win than desktop (1.87x for SiLU at n=384). The bigger gap reflects the higher relative cost of expf on Cortex-M7 (~124 cyc/element computed vs ~44 cyc/element LUT).


9. Strided Sub-matrix Copy (Desktop only)

WaveNet's gated activation splits a 2N-channel activation into two N-channel halves (for the sigmoid gate and tanh path). When the source matrix has 2N rows but the destination has N rows, source and destination have different column strides in memory, so a simple memcpy won't work. Eigen's .topRows().leftCols() handles this via expression templates; the manual version uses explicit pointer arithmetic to copy row-by-row with the correct strides.
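A sketch of the manual version (hypothetical names), assuming column-major storage -- the top rows of each column are contiguous, so the copy is one memcpy per column:

```cpp
#include <cstring>

// Copy the top `dst_rows` rows of a column-major (src_rows x frames)
// matrix into a dense (dst_rows x frames) matrix. The column strides
// differ (src_rows vs dst_rows), so the copy goes column by column.
void copy_top_rows(const float* src, int src_rows,
                   float* dst, int dst_rows, int frames) {
  for (int f = 0; f < frames; ++f)
    std::memcpy(dst + f * dst_rows, src + f * src_rows,
                sizeof(float) * dst_rows);
}
```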

Desktop

| Rows | Frames | Eigen .topRows() | Manual stride copy | Speedup |
| --- | --- | --- | --- | --- |
| 4->2 | 48 | 75.9 ns | 21.7 ns | 3.50x |
| 8->4 | 48 | 68.2 ns | 18.3 ns | 3.72x |
| 16->8 | 48 | 76.5 ns | 21.9 ns | 3.49x |
| 4->2 | 2048 | 2.3 us | 659.1 ns | 3.56x |
| 8->4 | 2048 | 2.2 us | 491.4 ns | 4.54x |
| 16->8 | 2048 | 2.5 us | 868.8 ns | 2.90x |

Desktop: Manual strided copy is 2.9--4.5x faster than Eigen's .topRows().leftCols(). Eigen's expression template overhead dominates for non-contiguous copies.


10. Depthwise Conv: Inline vs Eigen Diagonal

Some WaveNet variants use depthwise (channel-wise) 1x1 convolutions, where each channel is scaled independently -- equivalent to multiplying each row of the activation matrix by a scalar weight. Eigen expresses this as weight.asDiagonal() * input, which constructs a diagonal matrix view and dispatches to a general matrix multiply. The inline version skips the abstraction and directly multiplies each element by the corresponding channel weight.
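The inline version amounts to a per-channel scale (a sketch assuming a row-major channels x frames layout; hypothetical names):

```cpp
// Depthwise 1x1 conv: scale each channel's row by its own weight,
// skipping Eigen's asDiagonal() matrix-multiply dispatch entirely.
void depthwise_scale(float* act, const float* weight,
                     int channels, int frames) {
  for (int c = 0; c < channels; ++c) {
    const float w = weight[c];
    float* row = act + c * frames;
    for (int f = 0; f < frames; ++f) row[f] *= w;
  }
}
```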

Desktop

| Channels | Frames | Eigen asDiagonal() | Inline element-wise | Speedup |
| --- | --- | --- | --- | --- |
| 2 | 48 | 116.0 ns | 55.1 ns | 2.11x |
| 4 | 48 | 124.3 ns | 36.7 ns | 3.38x |
| 8 | 48 | 182.2 ns | 87.7 ns | 2.08x |
| 2 | 2048 | 3.7 us | 1.3 us | 2.89x |
| 4 | 2048 | 2.7 us | 607.0 ns | 4.49x |
| 8 | 2048 | 4.1 us | 2.2 us | 1.83x |

Daisy (48 frames)

| Channels | Frames | Eigen asDiagonal() | Inline element-wise | Speedup |
| --- | --- | --- | --- | --- |
| 4 | 48 | 4681 cyc (9.8 us) | 3175 cyc (6.6 us) | 1.47x |
| 8 | 48 | 7356 cyc (15.3 us) | 7478 cyc (15.6 us) | 0.98x |

Desktop: Inline element-wise depthwise conv is 1.8--4.5x faster than Eigen's asDiagonal() multiply across all tested sizes.

Daisy: Inline wins at 4 channels (1.47x) but ties at 8 channels. The 8-channel case may be hitting register pressure limits where Eigen's code generation produces equivalent scalar code.


11. FiLM Layer: Inline vs Eigen Array Expressions

Feature-wise Linear Modulation (FiLM) applies a learned per-channel affine transform -- output = scale * input + shift -- to condition one network's activations on another's predictions. In WaveNet, FiLM layers modulate hidden activations using conditioning signals. Eigen expresses this with .array() broadcasting; the inline version uses a nested loop over channels and frames. "Scale only" omits the shift term, reducing the operation to a per-channel multiply.
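The inline form is a fused per-channel multiply-add (a sketch assuming a row-major channels x frames layout; hypothetical names):

```cpp
// FiLM in place: act[c][f] = scale[c] * act[c][f] + shift[c].
// Hoisting scale/shift out of the inner loop leaves one fma per sample.
void film_apply(float* act, const float* scale, const float* shift,
                int channels, int frames) {
  for (int c = 0; c < channels; ++c) {
    const float s = scale[c], b = shift[c];
    float* row = act + c * frames;
    for (int f = 0; f < frames; ++f) row[f] = s * row[f] + b;
  }
}
```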

Desktop -- Scale + Shift

| Channels | Frames | Eigen .array() | Inline loop | Speedup |
| --- | --- | --- | --- | --- |
| 4 | 48 | 192.9 ns | 122.8 ns | 1.57x |
| 8 | 48 | 247.9 ns | 213.0 ns | 1.16x |
| 4 | 2048 | 3.6 us | 2.1 us | 1.68x |
| 8 | 2048 | 3.8 us | 10.3 us | 0.36x |

Desktop -- Scale Only

| Channels | Frames | Eigen .array() | Inline loop | Speedup |
| --- | --- | --- | --- | --- |
| 4 | 48 | 142.1 ns | 70.1 ns | 2.03x |
| 8 | 48 | 152.8 ns | 65.9 ns | 2.32x |
| 4 | 2048 | 2.5 us | 1.4 us | 1.84x |
| 8 | 2048 | 3.1 us | 1.9 us | 1.68x |

Daisy (48 frames)

| Channels | Frames | Eigen .array() | Inline loop | Speedup |
| --- | --- | --- | --- | --- |
| 4 | 48 | 2725 cyc (5.7 us) | 2449 cyc (5.1 us) | 1.11x |
| 8 | 48 | 4837 cyc (10.1 us) | 4464 cyc (9.3 us) | 1.08x |

Desktop: Inline FiLM is 1.2--2.3x faster in most cases. Scale-only is uniformly faster. The 8-channel x 2048-frame scale+shift case shows a regression (0.36x), likely a cache/vectorization edge case.

Daisy: Inline FiLM is 1.08--1.11x faster, a modest but consistent win across both channel counts.


12. Ring Buffer Write: Eigen Block vs Nested Loop

WaveNet uses dilated causal convolutions, which require access to past input frames. A ring buffer stores these frames in a fixed-size circular matrix, and each new block of frames must be written at the current write position. This is a contiguous multi-channel write into the middle of a larger matrix. Eigen's .middleCols() compiles to an optimized block copy; the nested-loop version writes element-by-element with explicit index arithmetic.
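To make the access pattern concrete, here is an illustrative ring-buffer write for a column-major (channels x capacity) history matrix, where each new frame is one contiguous column. This sketch also handles wraparound; names are hypothetical:

```cpp
#include <algorithm>
#include <cstring>

// Write `frames` new columns into the ring at column `write_pos`.
// When the block fits before the end of the buffer, the write is a
// single contiguous memcpy -- essentially what Eigen's
// .middleCols() = ... compiles down to. Returns the next write position.
int ring_write(float* ring, int channels, int capacity, int write_pos,
               const float* block, int frames) {
  int first = std::min(frames, capacity - write_pos);  // columns before wrap
  std::memcpy(ring + write_pos * channels, block,
              sizeof(float) * channels * first);
  std::memcpy(ring, block + channels * first,
              sizeof(float) * channels * (frames - first));  // wrapped tail
  return (write_pos + frames) % capacity;
}
```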

Desktop

| Channels | Frames | Nested loop | Eigen middleCols() | Eigen Speedup |
| --- | --- | --- | --- | --- |
| 2 | 48 | 57.1 ns | 22.4 ns | 2.56x |
| 4 | 48 | 44.2 ns | 24.0 ns | 1.84x |
| 8 | 48 | 51.5 ns | 48.7 ns | 1.06x |
| 2 | 2048 | 1.7 us | 399.6 ns | 4.25x |
| 4 | 2048 | 760.1 ns | 660.6 ns | 1.15x |
| 8 | 2048 | 954.9 ns | 1.1 us | 0.87x |

Daisy (48 frames)

| Channels | Frames | Nested loop | Eigen middleCols() | Eigen Speedup |
| --- | --- | --- | --- | --- |
| 4 | 48 | 3690 cyc (7.7 us) | 374 cyc (0.78 us) | 9.86x |
| 8 | 48 | 6713 cyc (14.0 us) | 715 cyc (1.5 us) | 9.39x |

Desktop: Eigen's block assignment wins by up to 4.25x for small channel counts. At 8 channels and large frame counts, the nested loop catches up. Eigen compiles .middleCols() = ... to vectorized contiguous writes that scalar nested loops can't match.

Daisy: Eigen is 9.4--9.9x faster -- an even larger win than desktop. The scalar nested loop generates extremely poor code on ARM, while Eigen compiles to an optimized contiguous copy. This is the strongest argument for keeping Eigen in the codebase.


Summary: Daisy vs Desktop

| Optimization | Desktop (Apple Silicon) | Daisy (Cortex-M7) | Notes |
| --- | --- | --- | --- |
| Inline generic GEMM vs Eigen | 0.14--0.38x (Eigen wins) | 1.04--1.47x | Eigen malloc kills ARM perf |
| Unrolled GEMM vs Eigen | 2.04--3.06x | 3.10x | Wins on both platforms |
| DTCM weight placement | N/A | 1.00x | No benefit in isolation |
| __restrict__ | 1.00x | 1.00x | No effect on either |
| memcpy vs Eigen block copy | 1.14--2.15x (memcpy wins) | 0.22--0.23x (Eigen wins) | Newlib memcpy is slow |
| Element-wise add | 1.04--1.06x | 1.08x | Small consistent win |
| Bias broadcast | 1.05--1.14x | 0.95--0.98x (Eigen wins) | Platform-dependent |
| Hardswish branchless | 3.06--3.38x | 1.20--1.21x | Less impact without SIMD |
| ReLU unrolling | 1.19x | 1.09x | Modest on both |
| LUT SiLU vs computed | 1.87x | 2.81x | Bigger win on ARM (expf costlier) |
| Depthwise inline | 2.08--3.38x | 1.47x (4ch only) | 8ch ties on ARM |
| FiLM inline | 1.16--1.57x | 1.08--1.11x | Consistent small win |
| Ring buffer Eigen | 1.84--4.25x | 9.39--9.86x | Eigen's biggest win, even larger on ARM |

JSON - when processing text becomes a problem

NeuralAmpModelerCore uses nlohmann::json to parse .nam files, but it requires a lot of memory to parse a model, and memory is not exactly something we have plenty of... For example, an A1 NAM model stored as a JSON file takes almost 400 KB of memory. Remember how the Cortex-M7 only has about 1 MB of RAM? Yeah, that is going to be a problem.

JSON is cool and very accessible for users: you can open it in a text editor, and most programming languages can parse it without any special dependencies. And parsing JSON is definitely not an issue on desktop (your browser does it thousands of times every day without breaking a sweat)... so we did not want to replace the JSON format.

Our approach was to refactor the factory methods in NeuralAmpModelerCore so they expect a model configuration object instead of a JSON object, and to offload parsing -- whether of JSON or of another file-based representation -- to a dedicated parser. This way, JSON and other formats share the same "entry point" for building a model from a file; they just use different parsers.

We then came up with the namb format, a compact binary format for models to be used instead of the regular nam format on embedded devices. It is meant as an exchange format between a controller application (say, an app on your phone that loads models) and the embedded device: the app converts the nam file retrieved from the T3K website into the compact format suited to your pedal, then transmits it over Bluetooth or a USB cable. There is no model conversion here -- think of namb just as a more "compressed" representation of the same model. Learn more at this repo.
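The actual namb layout is defined in that repo; purely to illustrate why a flat binary beats JSON here, a hypothetical reader for a raw float32 weight blob might look like this (this is NOT the real namb spec):

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical flat layout: a tiny fixed header followed by raw
// float32 weights. The point: the weights can be memcpy'd straight
// into place, with no text parsing and no intermediate JSON DOM.
struct BlobHeader {
  uint32_t magic;        // e.g. 'NAMB' -- illustrative only
  uint32_t num_weights;  // number of float32 values that follow
};

bool read_weights(const uint8_t* buf, size_t len, std::vector<float>& out) {
  BlobHeader h;
  if (len < sizeof(h)) return false;
  std::memcpy(&h, buf, sizeof(h));
  if (len < sizeof(h) + h.num_weights * sizeof(float)) return false;
  out.resize(h.num_weights);
  std::memcpy(out.data(), buf + sizeof(h), h.num_weights * sizeof(float));
  return true;
}
```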

Contributions

All of the optimizations that helped us run NeuralAmpModelerCore on the Daisy were merged, so they are readily available to anyone who wants to try something similar. They currently require compiling with CXXFLAGS+=USE_NAM_INLINE_GEMM.

To enable our binary loader, model construction in Core was decoupled from JSON parsing; that change was also merged. The code for nam-binary-loader is available as a separate library and tool in this repo.

Finally, we are publishing example code for the Daisy Seed board, to be used as a blueprint for porting NeuralAmpModelerCore to embedded targets. This was our target during optimization: we worked until we could run an A1-nano-ReLU model in real time on the Daisy Seed. The code and instructions for building and running it on the Daisy are in their own repo.

We hope this helps anyone interested in experimenting with NAM support on their own hardware.