Bringing NAM A2 to Embedded Hardware
11 February, 2026
Note: this post is a companion to a post I wrote for the Tone3000 blog, with a more in-depth discussion of all the optimizations and results we found.
We all love NAM, and we want to run our favourite captures EVERYWHERE. That is what we have been trying to achieve with A2, especially with Slimmable models. As part of the development of A2, we are working closely with partners to make sure A2 suits their hardware. We also wanted a general idea of what running NAM looks like on hardware platforms with more resources, so we decided to get our hands dirty and implement a simple NAM loader for the Electrosmith Daisy Seed, a board built around an ARM Cortex-M7 that is currently very popular for DSP applications.
The Daisy Seed board has the following specs:
- ARM Cortex M7 running at 480 MHz
- 1 MB of SRAM (of which 128 kB are tightly coupled RAM for data)
- 8 MB of Flash memory (for firmware/stored models)
- 96 kHz stereo codec (we used 48 kHz mono)
The Daisy Seed optionally provides an extra 64 MB of SDRAM, but that RAM is much slower than the on-board SRAM. People usually use it for loopers and delays, not for anything that requires intense real-time math like a neural model... so we ignored it in these experiments.
To make testing and benchmarking easier, we plugged the Daisy Seed into a Daisy Pod, which is basically a powered breakout board for the Daisy that includes stereo audio I/O, a micro SD card slot, and a couple of knobs. Most important for us was the SD card, as we could load a bunch of NAM models into it and have our benchmarking software run all of them sequentially (and write the logs to the SD card as well).
Challenges
We knew from the beginning that the memory constraints were strict (that is one of the reasons for Slimmable NAM). With 1 MB for both the model weights and any buffers, we are restricted in the size of model we can use. We decided to try a model smaller than A1, known in the community as A1-nano, with a small but important tweak: replacing the tanh activation with a ReLU. Tanh is computationally expensive, which long ago led the deep learning community to look for cheaper alternatives. ReLU fits the bill: it is one of the simplest activation functions in use, and it appears in lots of current AI models.
While the code in NeuralAmpModelerCore has been battle-tested in the NAM plugin, it was not designed with embedded applications in mind. When we first tried to run it on the Daisy Seed, even with an A1-nano-ReLU model, it took over 5 seconds of compute time to process 2 seconds of audio... which clearly will not work for a pedal that is supposed to process audio in real time.
Optimization process
To tackle that, we first instrumented the code with profiling to capture how long it spent on each operation, so we could identify easy targets for optimization. The main culprit for the slowdown was Eigen, the library we use for linear algebra: specifically, how it handles matrix multiplications with small matrices and how it allocates temporary memory on embedded devices without an operating system. We added specialized operations for the specific matrix sizes used in A1-nano-ReLU, as well as other unrelated optimizations.
Once we identified all the bottlenecks in the code for the A1-nano-ReLU model, we came up with different optimization tactics. I would love to say that I first implemented microbenchmarks to see which ones really mattered and then added them to the code, but the real process was actually in reverse: I started by optimizing the actual Core code, then implemented microbenchmarks to better understand what was going on 🤡.
The microbenchmarks are all available in this repository. They can serve as a reference for your own hardware if you are implementing an embedded device that runs NAM.
Results
Micro-benchmark Results: Inline Ops vs Eigen
These benchmarks measure individual operations in isolation to guide optimization decisions for the NeuralAmpModelerCore WaveNet inference engine. Each section presents results for both platforms side-by-side.
Platforms
| | Desktop | Daisy |
|---|---|---|
| CPU | Apple Silicon (Macbook Pro M5) | ARM Cortex-M7 @ 480 MHz, FTZ=1, DN=1 |
| Compiler flags | -O3 -ffast-math -march=native -DNDEBUG -std=c++17 | -O2 -ffast-math -funroll-loops -ftree-vectorize -std=gnu++17 |
| Platform | macOS | Electrosmith Daisy Seed (STM32H750, 512 KB SRAM, 128 KB DTCM) |
| Eigen parallelization | Disabled (-DEIGEN_DONT_PARALLELIZE) | Disabled (-DEIGEN_DONT_PARALLELIZE) |
| Timing | std::chrono::high_resolution_clock | DWT CYCCNT (cycle-accurate, zero overhead). 1 cycle = 2.08 ns @ 480 MHz |
| Iterations | 5000 (warmup: 100) | 1000 (warmup: 100) |
1. Inline GEMM vs Eigen GEMM
The 1x1 convolution (Conv1x1) is the single hottest operation in WaveNet inference -- it runs once per layer per frame-block to mix channels. It boils down to a matrix-matrix multiply: (channels_out x channels_in) * (channels_in x frames). Eigen's general GEMM path allocates temporary panel buffers via malloc on every call, which is fine on desktop but catastrophically slow on Cortex-M7 with newlib. The "inline generic" variant replaces Eigen with a plain triple-loop; the "unrolled" variant fully unrolls the loop at compile time for known dimension pairs (e.g., 4x8), keeping weights in registers.
Matrix sizes matching typical WaveNet 1x1 convolution dimensions (channels x channels, applied across frames). Each size is measured once: Eigen is benchmarked first, then the inline variant(s) run against that same baseline with the same input data.
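To make the comparison concrete, here is a minimal sketch of the two inline variants (illustrative code, not the actual NeuralAmpModelerCore kernels), using Eigen's default column-major layout for the (channels x frames) activation matrices:

```cpp
#include <cassert>

// Generic inline GEMM: out = W * in, with W (co x ci) stored row-major and
// in/out stored column-major (one frame per column).
void gemm_generic(const float* W, const float* in, float* out,
                  int co, int ci, int frames) {
  for (int f = 0; f < frames; ++f)
    for (int r = 0; r < co; ++r) {
      float acc = 0.0f;
      for (int c = 0; c < ci; ++c)
        acc += W[r * ci + c] * in[f * ci + c];
      out[f * co + r] = acc;
    }
}

// Fully unrolled variant: dimensions are template parameters, so every trip
// count is known at compile time. The compiler unrolls the loops completely
// and can keep the small (CO x CI) weight block in registers across frames.
template <int CO, int CI>
void gemm_unrolled(const float* W, const float* in, float* out, int frames) {
  float w[CO][CI];                      // hoist weights out of the hot loop
  for (int r = 0; r < CO; ++r)
    for (int c = 0; c < CI; ++c)
      w[r][c] = W[r * CI + c];
  for (int f = 0; f < frames; ++f)
    for (int r = 0; r < CO; ++r) {
      float acc = 0.0f;
      for (int c = 0; c < CI; ++c)      // compile-time trip count: unrolled
        acc += w[r][c] * in[f * CI + c];
      out[f * CO + r] = acc;
    }
}
```

Neither variant allocates anything, which is the key difference from Eigen's general GEMM path on the embedded target.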
Desktop (48 frames)
| Size | Eigen | Inline Generic | Speedup |
|---|---|---|---|
| 2x2 | 191.1 ns | 122.7 ns | 1.56x |
| 4x4 | 196.2 ns | 519.4 ns | 0.38x |
| 4x8 | 230.2 ns | 1.0 us | 0.23x |
| 4x8 (unrolled) | 230.2 ns | 75.2 ns | 3.06x |
| 8x4 | 199.8 ns | 992.9 ns | 0.20x |
| 8x8 | 227.4 ns | 1.6 us | 0.14x |
| 8x8 (unrolled) | 227.4 ns | 111.6 ns | 2.04x |
Desktop (2048 frames)
| Size | Eigen | Inline Generic | Speedup |
|---|---|---|---|
| 2x2 | 2.1 us | 2.3 us | 0.90x |
| 4x4 | 2.0 us | 8.4 us | 0.24x |
| 4x8 | 2.4 us | 18.8 us | 0.13x |
| 4x8 (unrolled) | 2.4 us | 1.9 us | 1.28x |
| 8x4 | 2.6 us | 16.8 us | 0.15x |
| 8x8 | 3.9 us | 38.0 us | 0.10x |
| 8x8 (unrolled) | 3.9 us | 3.4 us | 1.15x |
Daisy (48 frames)
| Size | Eigen | Inline Generic | Speedup |
|---|---|---|---|
| 2x2 | 4044 cyc (8.4 us) | 2759 cyc (5.7 us) | 1.47x |
| 4x4 | 8117 cyc (16.9 us) | 7175 cyc (14.9 us) | 1.13x |
| 4x8 | 10657 cyc (22.2 us) | 9385 cyc (19.6 us) | 1.14x |
| 4x8 (unrolled) | 10657 cyc (22.2 us) | 3437 cyc (7.2 us) | 3.10x |
| 8x4 | 14314 cyc (29.8 us) | 13701 cyc (28.5 us) | 1.04x |
| 8x8 | 19578 cyc (40.8 us) | 18125 cyc (37.8 us) | 1.08x |
Desktop: Generic inline triple-loops lose badly to Eigen (0.10--0.38x) because Eigen leverages NEON SIMD. Fully unrolled compile-time specializations beat Eigen at all sizes and frame counts (1.15--3.06x), with the largest wins at small frame counts where Eigen's setup overhead is proportionally larger.
Daisy: Inline GEMM beats Eigen at every size because Eigen's general GEMM path calls malloc/free for panel buffer allocation on every invocation. The generic triple-loop wins by 1.04--1.47x. The fully unrolled 4x8 specialization (weights loaded into registers) wins by 3.10x -- the biggest single optimization in the entire WaveNet pipeline.
1b. DTCM Memory Placement Effect on GEMM (Daisy only)
The STM32H750 has 128 KB of DTCM (Data Tightly-Coupled Memory) that provides single-cycle, deterministic access -- no cache misses, no wait states. Production code copies model weights into DTCM to avoid cache eviction during the full WaveNet pipeline. This benchmark isolates the question: does DTCM placement speed up the GEMM kernel itself, or is the benefit only about avoiding interference from other data?
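Placing a buffer in DTCM is done with a linker-section attribute. A hedged sketch follows: the section name ".dtcmram" is an assumption that must match whatever your linker script actually defines, and the host-build fallback keeps the code portable:

```cpp
// Map a buffer into DTCM when building for the STM32H750 target; fall back
// to ordinary static storage on the host. NOTE: the section name ".dtcmram"
// is illustrative only -- check your linker script.
#if defined(STM32H750xx)
#define DTCM_BSS __attribute__((section(".dtcmram")))
#else
#define DTCM_BSS
#endif

// Weights copied here once at model-load time cannot be evicted from cache
// by other traffic during the WaveNet pipeline.
DTCM_BSS static float g_dtcm_weights[8 * 8];

float* dtcm_weights() { return g_dtcm_weights; }
```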
Generic Inline GEMM
| Size | SRAM (baseline) | DTCM weight | Speedup | DTCM both | Speedup |
|---|---|---|---|---|---|
| 2x2 | 2758 cyc (5.7 us) | 2763 cyc (5.8 us) | 1.00x | 2897 cyc (6.0 us) | 0.95x |
| 4x4 | 7173 cyc (14.9 us) | 7179 cyc (15.0 us) | 1.00x | 7316 cyc (15.2 us) | 0.98x |
| 4x8 | 9384 cyc (19.6 us) | 9390 cyc (19.6 us) | 1.00x | 9528 cyc (19.9 us) | 0.98x |
| 8x8 | 18125 cyc (37.8 us) | 18136 cyc (37.8 us) | 1.00x | 18285 cyc (38.1 us) | 0.99x |
Unrolled 4x8 GEMM
| Variant | Cycles | Speedup vs SRAM |
|---|---|---|
| SRAM (baseline) | 3437 cyc (7.2 us) | -- |
| DTCM weight only | 3555 cyc (7.4 us) | 0.97x |
| DTCM weight + input | 3430 cyc (7.1 us) | 1.00x |
Daisy: DTCM placement has no measurable benefit for GEMM operands. In some cases it is marginally slower (up to 5%). The Cortex-M7's D-cache achieves near-single-cycle latency for hot data that fits in cache, and these small weight matrices (64--256 bytes) are always cache-hot after warmup. DTCM's advantage is deterministic latency (no cold-start misses), not throughput for steady-state computation. The production DTCM weight copy remains justified for avoiding cache eviction by other data in the full WaveNet pipeline, but the GEMM kernel itself sees no benefit in isolation. This benchmark is not super representative of real-world usage, as we'd often be juggling multiple operations and invalidating the cache.
2. Effect of __restrict__ on Inline GEMM
The __restrict__ qualifier tells the compiler that pointers don't alias, theoretically enabling more aggressive load/store reordering and vectorization in tight loops. Since our inline GEMM reads from two input matrices and writes to a third, aliasing analysis could matter. Both versions use noinline to prevent the compiler from deducing non-aliasing from calling context, isolating the effect of the qualifier itself.
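In code, the benchmark compares signatures like the following (a sketch; `__attribute__((noinline))` and `__restrict__` are the GCC/Clang spellings):

```cpp
#include <cassert>

// Without the qualifier, the compiler must assume `out` may alias W or in,
// which can force conservative load/store ordering.
__attribute__((noinline))
void gemm_plain(const float* W, const float* in, float* out,
                int co, int ci, int frames) {
  for (int f = 0; f < frames; ++f)
    for (int r = 0; r < co; ++r) {
      float acc = 0.0f;
      for (int c = 0; c < ci; ++c)
        acc += W[r * ci + c] * in[f * ci + c];
      out[f * co + r] = acc;
    }
}

// __restrict__ promises the three buffers never overlap. In these
// benchmarks it made essentially no difference on either platform.
__attribute__((noinline))
void gemm_restrict(const float* __restrict__ W, const float* __restrict__ in,
                   float* __restrict__ out, int co, int ci, int frames) {
  for (int f = 0; f < frames; ++f)
    for (int r = 0; r < co; ++r) {
      float acc = 0.0f;
      for (int c = 0; c < ci; ++c)
        acc += W[r * ci + c] * in[f * ci + c];
      out[f * co + r] = acc;
    }
}
```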
Desktop (48 frames)
| Size | Without __restrict__ | With __restrict__ | Speedup |
|---|---|---|---|
| 4x4 (generic) | 446.8 ns | 464.8 ns | 0.96x |
| 4x8 (generic) | 916.4 ns | 1.1 us | 0.83x |
| 8x8 (generic) | 1.5 us | 1.5 us | 1.00x |
| 4x8 (unrolled) | 36.4 ns | 36.3 ns | 1.00x |
Desktop (2048 frames)
| Size | Without __restrict__ | With __restrict__ | Speedup |
|---|---|---|---|
| 4x4 (generic) | 10.1 us | 8.5 us | 1.19x |
| 4x8 (generic) | 21.1 us | 20.4 us | 1.03x |
| 8x8 (generic) | 41.0 us | 41.6 us | 0.99x |
| 4x8 (unrolled) | 1.9 us | 1.9 us | 1.01x |
Daisy (48 frames)
| Size | Without __restrict__ | With __restrict__ | Speedup |
|---|---|---|---|
| 4x4 | 7172 cyc (14.9 us) | 7174 cyc (14.9 us) | 1.00x |
| 4x8 | 9384 cyc (19.5 us) | 9385 cyc (19.6 us) | 1.00x |
| 8x8 | 18125 cyc (37.8 us) | 18126 cyc (37.8 us) | 1.00x |
Desktop: __restrict__ has negligible effect in most cases. The compiler already deduces non-aliasing. The one outlier (4x4 at 2048 frames, 1.19x) is likely noise or a minor alignment effect.
Daisy: Zero measurable effect, same as desktop. The compiler already deduces non-aliasing regardless of platform.
3. std::memcpy vs Eigen Block Assignment
WaveNet layers frequently copy intermediate activation buffers -- for example, saving the pre-activation state for the residual path, or copying gated outputs into the next layer's input. These are contiguous matrix copies (all elements in one flat buffer). The question is whether a plain std::memcpy or Eigen's block assignment (.leftCols() = ...) generates faster code, given that memcpy quality varies dramatically across C libraries.
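Both halves of the comparison copy the same contiguous bytes; a dependency-free sketch of the two shapes (the Eigen call appears only as a comment):

```cpp
#include <cstring>

// Contiguous copy of a (channels x frames) column-major activation block.
// Eigen equivalent: dst_mat.leftCols(frames) = src_mat.leftCols(frames);
void copy_memcpy(const float* src, float* dst, int channels, int frames) {
  std::memcpy(dst, src, sizeof(float) * channels * frames);
}

// Roughly what Eigen's block assignment lowers to: a plain element loop the
// compiler can vectorize with register-width stores.
void copy_loop(const float* src, float* dst, int channels, int frames) {
  const int n = channels * frames;
  for (int i = 0; i < n; ++i) dst[i] = src[i];
}
```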
Desktop (48 frames)
| Rows | Eigen block assign | std::memcpy | Speedup |
|---|---|---|---|
| 2 | 25.1 ns | 20.6 ns | 1.21x |
| 4 | 25.9 ns | 27.1 ns | 0.96x |
| 8 | 37.6 ns | 32.9 ns | 1.14x |
| 16 | 79.9 ns | 49.9 ns | 1.60x |
Desktop (2048 frames)
| Rows | Eigen block assign | std::memcpy | Speedup |
|---|---|---|---|
| 2 | 379.2 ns | 219.6 ns | 1.73x |
| 4 | 767.8 ns | 375.6 ns | 2.04x |
| 8 | 1.2 us | 571.1 ns | 2.15x |
| 16 | 2.0 us | 1.3 us | 1.60x |
Daisy (48 frames)
| Rows | Eigen block assign | std::memcpy | memcpy Speedup |
|---|---|---|---|
| 4 | 379 cyc (0.79 us) | 1636 cyc (3.4 us) | 0.23x |
| 8 | 715 cyc (1.5 us) | 3175 cyc (6.6 us) | 0.23x |
| 16 | 1344 cyc (2.8 us) | 6248 cyc (13.0 us) | 0.22x |
Desktop: std::memcpy consistently beats Eigen block assignment, with the gap widening at larger sizes (up to 2.15x). The platform's highly optimized memcpy implementation (NEON/AVX) outperforms Eigen's generated stores.
Daisy: The opposite of desktop -- Eigen block assignment is 4.3--4.6x faster than std::memcpy. Newlib's memcpy implementation is not optimized for the Cortex-M7's AXI bus and D-cache; Eigen's generated code likely uses more efficient register-width stores.
4. Element-wise Operations: Unrolled Loop vs Eigen
Element-wise addition and accumulation appear throughout WaveNet: skip connections sum layer outputs (z = a + b), and residual connections accumulate into a running buffer (dst += src). Eigen uses expression templates that defer evaluation and fuse operations, but for simple element-wise ops the template machinery may add overhead compared to a plain loop that the compiler can auto-vectorize directly.
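A sketch of the plain-loop variants (illustrative, with 4-wide unrolling and a scalar tail):

```cpp
// Skip-connection sum (z = a + b), unrolled 4-wide.
void add4(const float* a, const float* b, float* z, int n) {
  int i = 0;
  for (; i + 4 <= n; i += 4) {
    z[i]     = a[i]     + b[i];
    z[i + 1] = a[i + 1] + b[i + 1];
    z[i + 2] = a[i + 2] + b[i + 2];
    z[i + 3] = a[i + 3] + b[i + 3];
  }
  for (; i < n; ++i) z[i] = a[i] + b[i];  // scalar tail
}

// Residual accumulation (dst += src), same shape.
void acc4(float* dst, const float* src, int n) {
  int i = 0;
  for (; i + 4 <= n; i += 4) {
    dst[i]     += src[i];
    dst[i + 1] += src[i + 1];
    dst[i + 2] += src[i + 2];
    dst[i + 3] += src[i + 3];
  }
  for (; i < n; ++i) dst[i] += src[i];
}
```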
Desktop -- Addition (z = a + b)
| Channels | Frames | Eigen | Unrolled | Speedup |
|---|---|---|---|---|
| 4 | 48 | 34.4 ns | 32.9 ns | 1.04x |
| 8 | 48 | 63.9 ns | 60.4 ns | 1.06x |
| 16 | 48 | 100.9 ns | 73.2 ns | 1.38x |
| 4 | 2048 | 824.9 ns | 541.1 ns | 1.52x |
| 8 | 2048 | 1.4 us | 903.5 ns | 1.51x |
| 16 | 2048 | 3.2 us | 2.7 us | 1.19x |
Desktop -- Accumulation (dst += src)
| Channels | Frames | Eigen | Unrolled | Speedup |
|---|---|---|---|---|
| 4 | 48 | 43.0 ns | 31.7 ns | 1.36x |
| 8 | 48 | 95.6 ns | 64.9 ns | 1.47x |
| 16 | 48 | 208.1 ns | 141.3 ns | 1.47x |
| 4 | 2048 | 736.8 ns | 594.0 ns | 1.24x |
| 8 | 2048 | 2.3 us | 2.3 us | 1.02x |
| 16 | 2048 | 3.9 us | 3.9 us | 1.01x |
Daisy -- Addition (z = a + b, 48 frames)
| Channels | Frames | Eigen a + b | Unrolled loop | Speedup |
|---|---|---|---|---|
| 4 | 48 | 756 cyc (1.6 us) | 698 cyc (1.5 us) | 1.08x |
| 8 | 48 | 1454 cyc (3.0 us) | 1346 cyc (2.8 us) | 1.08x |
Desktop: Unrolled loops are 1.0--1.5x faster for element-wise ops, with bigger wins at small/medium buffer sizes typical of real-time audio. Accumulation shows larger gains at small frame counts.
Daisy: Unrolled loops are modestly faster (1.08x), consistent with desktop results at 48 frames. The savings per call are small (~60--110 cycles) but accumulate over many WaveNet layers.
5. Bias Broadcast: Unrolled vs Eigen colwise()
Every convolution layer adds a per-channel bias vector after the matrix multiply: each element of the bias is broadcast across all frames in that channel's row. Eigen provides colwise() for this pattern; the alternative is a hand-written loop that iterates over channels and uses pointer arithmetic to stride across frames.
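The hand-written variant looks roughly like this (a sketch, assuming the column-major channels x frames layout):

```cpp
// Per-channel bias broadcast over a column-major (channels x frames) matrix:
// iterate channels, then use pointer arithmetic to stride across frames.
// Eigen equivalent: x_mat.colwise() += bias_vec;
void add_bias(float* x, const float* bias, int channels, int frames) {
  for (int c = 0; c < channels; ++c) {
    const float b = bias[c];
    float* p = x + c;              // element (c, f) lives at x[f*channels + c]
    for (int f = 0; f < frames; ++f, p += channels)
      *p += b;
  }
}
```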
Desktop
| Channels | Frames | Eigen colwise() | Unrolled | Speedup |
|---|---|---|---|---|
| 2 | 48 | 77.2 ns | 71.3 ns | 1.08x |
| 4 | 48 | 60.4 ns | 57.3 ns | 1.05x |
| 8 | 48 | 99.3 ns | 87.3 ns | 1.14x |
| 2 | 2048 | 1.7 us | 1.2 us | 1.38x |
| 4 | 2048 | 1.0 us | 630.5 ns | 1.66x |
| 8 | 2048 | 1.6 us | 1.3 us | 1.24x |
Daisy (48 frames)
| Channels | Frames | Eigen colwise() | Unrolled loop | Speedup |
|---|---|---|---|---|
| 4 | 48 | 3541 cyc (7.4 us) | 3744 cyc (7.8 us) | 0.95x |
| 8 | 48 | 5721 cyc (11.9 us) | 5838 cyc (12.2 us) | 0.98x |
Desktop: Unrolled bias broadcast is 1.05--1.66x faster than Eigen's colwise(), with larger wins at higher frame counts.
Daisy: Eigen wins by 2--5%, reversing the desktop result. Eigen's colwise() likely generates tighter scalar code on ARM than the hand-written loop. Keep Eigen for bias broadcast on Daisy.
6. Hardswish: Branchy vs Branchless
Hardswish is a piecewise activation function with three regions: 0 for x <= -3, x for x >= 3, and x * (x + 3) / 6 in between. The naive implementation uses if/else branches, which cause pipeline stalls from misprediction when inputs span all three regions (as they do in typical WaveNet activations). The branchless version uses fmin/fmax to compute the same result without conditional jumps. Input values are uniformly distributed in [-5, 5] to trigger worst-case branch misprediction.
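A sketch of the two variants; the clamp in the branchless version reproduces all three regions of the piecewise definition without a conditional jump:

```cpp
#include <cmath>

// Branchy reference implementation.
float hardswish_branchy(float x) {
  if (x <= -3.0f) return 0.0f;
  if (x >= 3.0f) return x;
  return x * (x + 3.0f) / 6.0f;
}

// Branchless: clamp (x + 3) into [0, 6], then one multiply.
//   x <= -3: clamp gives 0      -> result 0
//   x >=  3: clamp gives 6      -> result x * 6/6 = x
//   else:    clamp gives x + 3  -> result x * (x + 3) / 6
float hardswish_branchless(float x) {
  return x * std::fmin(std::fmax(x + 3.0f, 0.0f), 6.0f) * (1.0f / 6.0f);
}
```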
Desktop
| n | Branchy | Branchless | Unrolled | Branchless Speedup | Max Error |
|---|---|---|---|---|---|
| 192 | 109.6 ns | 35.8 ns | 49.4 ns | 3.06x | 4.77e-07 |
| 384 | 247.4 ns | 73.2 ns | 86.2 ns | 3.38x | 4.77e-07 |
| 8192 | 6.0 us | 854.4 ns | 1.2 us | 6.97x | 4.77e-07 |
| 16384 | 9.6 us | 1.7 us | 2.4 us | 5.73x | 4.77e-07 |
| 32768 | 21.2 us | 4.5 us | 5.9 us | 4.76x | 4.77e-07 |
Daisy
| n | Branchy (if/else) | Branchless + unrolled | Speedup |
|---|---|---|---|
| 192 | 6248 cyc (13.0 us) | 5209 cyc (10.9 us) | 1.20x |
| 384 | 12413 cyc (25.9 us) | 10245 cyc (21.3 us) | 1.21x |
Desktop: Branchless hardswish is 3--7x faster than the branchy version. The simple branchless scalar loop actually beats the manually unrolled variant -- the compiler auto-vectorizes the scalar version more effectively.
Daisy: Branchless hardswish is 1.20--1.21x faster, a much smaller win than desktop. The Cortex-M7 branch predictor handles the 3-region pattern better than expected, and without SIMD the branchless version can't exploit vectorization. Still worth using since it's a consistent 20% improvement.
7. Activation Loop Unrolling (1-wide vs 4-wide)
Activation functions are applied element-wise to every value in the activation buffer after each convolution. Manual 4-wide unrolling processes four elements per loop iteration, giving the compiler explicit instruction-level parallelism (ILP) and aligning with SIMD register widths (4 floats = 1 SSE/NEON register). The benefit depends on whether the activation is compute-bound (cheap ops like ReLU) or latency-bound (expensive ops like expf in Sigmoid/SiLU).
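For a cheap activation like ReLU, the two loop shapes compared here look like this (illustrative sketch):

```cpp
#include <cmath>

// 1-wide: one element per iteration; the compiler may still auto-vectorize.
void relu1(float* x, int n) {
  for (int i = 0; i < n; ++i) x[i] = std::fmax(x[i], 0.0f);
}

// 4-wide: four independent operations per iteration give explicit ILP and
// map onto one 4-float SIMD register where the ISA has one.
void relu4(float* x, int n) {
  int i = 0;
  for (; i + 4 <= n; i += 4) {
    x[i]     = std::fmax(x[i],     0.0f);
    x[i + 1] = std::fmax(x[i + 1], 0.0f);
    x[i + 2] = std::fmax(x[i + 2], 0.0f);
    x[i + 3] = std::fmax(x[i + 3], 0.0f);
  }
  for (; i < n; ++i) x[i] = std::fmax(x[i], 0.0f);  // scalar tail
}
```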
Desktop
| Activation | n | 1-wide | 4-wide | Speedup |
|---|---|---|---|---|
| ReLU | 192 | 58.0 ns | 58.1 ns | 1.00x |
| ReLU | 384 | 105.5 ns | 88.3 ns | 1.19x |
| ReLU | 16384 | 2.9 us | 2.0 us | 1.44x |
| Sigmoid | 192 | 317.2 ns | 313.6 ns | 1.01x |
| Sigmoid | 384 | 645.5 ns | 612.8 ns | 1.05x |
| Sigmoid | 16384 | 18.0 us | 16.2 us | 1.11x |
| SiLU | 192 | 188.6 ns | 191.4 ns | 0.99x |
| SiLU | 384 | 390.5 ns | 393.5 ns | 0.99x |
| SiLU | 16384 | 16.5 us | 16.5 us | 0.99x |
| Softsign | 192 | 19.5 ns | 19.1 ns | 1.02x |
| Softsign | 384 | 35.6 ns | 34.1 ns | 1.04x |
| Softsign | 16384 | 1.6 us | 1.5 us | 1.05x |
Daisy
| Activation | n | 1-wide | 4-wide | Speedup |
|---|---|---|---|---|
| ReLU | 384 | 4269 cyc (8.9 us) | 3918 cyc (8.2 us) | 1.09x |
| Sigmoid | 384 | 47625 cyc (99.2 us) | 47325 cyc (98.6 us) | 1.01x |
Desktop: Manual 4-wide unrolling helps ReLU (up to 1.44x) but has negligible effect on transcendental-heavy activations (Sigmoid, SiLU) where expf dominates the cost. Softsign is already fast enough that unrolling provides no meaningful benefit.
Daisy: A modest 9% win for ReLU, consistent with desktop at n=384. Sigmoid is dominated by expf cost (~124 cycles/element), making unrolling irrelevant.
8. LUT Activation vs Computed (expf)
Sigmoid, SiLU, and Tanh all depend on expf(), which is the single most expensive scalar operation in WaveNet inference (~124 cycles per element on Cortex-M7). A lookup table (LUT) pre-computes the activation over a fixed input range, then uses linear interpolation between table entries at runtime -- replacing the expf call with an array index, a multiply, and an add. The tradeoff is a small approximation error that depends on table size. The computed baseline is measured once per activation/size and shared across all LUT sizes.
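A minimal sketch of the LUT + lerp scheme for sigmoid (the [-6, 6] range and 1024-point size here are illustrative choices, not the benchmark's exact parameters):

```cpp
#include <cmath>

// Sigmoid lookup table over [-6, 6] with linear interpolation between
// entries; inputs outside the range clamp to the table ends (sigmoid is
// already saturated there).
struct SigmoidLUT {
  static constexpr int N = 1024;
  static constexpr float LO = -6.0f, HI = 6.0f;
  float table[N + 1];

  SigmoidLUT() {
    for (int i = 0; i <= N; ++i) {
      const float x = LO + (HI - LO) * i / N;
      table[i] = 1.0f / (1.0f + std::exp(-x));
    }
  }

  // Replaces expf with: one array index, one multiply, one add.
  float operator()(float x) const {
    const float t = (x - LO) * (N / (HI - LO));
    if (t <= 0.0f) return table[0];
    if (t >= (float)N) return table[N];
    const int i = (int)t;
    const float frac = t - (float)i;
    return table[i] + frac * (table[i + 1] - table[i]);
  }
};
```

Building the table once at load time costs N+1 expf calls; every activation afterwards is table lookups only.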
Desktop (n = 192)
| Activation | LUT Size | Computed | LUT + lerp | Speedup | Max Error |
|---|---|---|---|---|---|
| Sigmoid | 256 | 512.1 ns | 275.2 ns | 1.86x | 7.39e-05 |
| Sigmoid | 1024 | 512.1 ns | 257.5 ns | 1.99x | 4.59e-06 |
| Sigmoid | 2048 | 512.1 ns | 256.5 ns | 2.00x | 1.19e-06 |
| SiLU | 256 | 452.6 ns | 255.8 ns | 1.77x | 3.80e-04 |
| SiLU | 1024 | 452.6 ns | 256.9 ns | 1.76x | 2.32e-05 |
| SiLU | 2048 | 452.6 ns | 227.9 ns | 1.99x | 6.24e-06 |
| Tanh | 256 | 495.8 ns | 200.0 ns | 2.48x | 5.79e-04 |
| Tanh | 1024 | 495.8 ns | 197.7 ns | 2.51x | 3.63e-05 |
| Tanh | 2048 | 495.8 ns | 197.6 ns | 2.51x | 9.00e-06 |
Desktop (n = 384)
| Activation | LUT Size | Computed | LUT + lerp | Speedup | Max Error |
|---|---|---|---|---|---|
| Sigmoid | 256 | 677.2 ns | 359.5 ns | 1.88x | 7.38e-05 |
| Sigmoid | 1024 | 677.2 ns | 359.6 ns | 1.88x | 4.63e-06 |
| Sigmoid | 2048 | 677.2 ns | 349.8 ns | 1.94x | 1.25e-06 |
| SiLU | 256 | 577.9 ns | 328.1 ns | 1.76x | 3.83e-04 |
| SiLU | 1024 | 577.9 ns | 307.8 ns | 1.88x | 2.38e-05 |
| SiLU | 2048 | 577.9 ns | 308.3 ns | 1.87x | 6.36e-06 |
| Tanh | 256 | 693.8 ns | 298.1 ns | 2.33x | 5.70e-04 |
| Tanh | 1024 | 693.8 ns | 286.3 ns | 2.42x | 3.71e-05 |
| Tanh | 2048 | 693.8 ns | 286.5 ns | 2.42x | 9.24e-06 |
Desktop (n = 16384)
| Activation | LUT Size | Computed | LUT + lerp | Speedup | Max Error |
|---|---|---|---|---|---|
| Sigmoid | 256 | 16.7 us | 9.4 us | 1.77x | 7.41e-05 |
| Sigmoid | 1024 | 16.7 us | 9.6 us | 1.74x | 4.75e-06 |
| Sigmoid | 2048 | 16.7 us | 9.7 us | 1.71x | 1.31e-06 |
| SiLU | 256 | 16.3 us | 9.5 us | 1.71x | 3.85e-04 |
| SiLU | 1024 | 16.3 us | 9.6 us | 1.69x | 2.45e-05 |
| SiLU | 2048 | 16.3 us | 9.7 us | 1.68x | 6.57e-06 |
| Tanh | 256 | 24.4 us | 9.5 us | 2.56x | 5.91e-04 |
| Tanh | 1024 | 24.4 us | 9.7 us | 2.52x | 3.71e-05 |
| Tanh | 2048 | 24.4 us | 9.7 us | 2.50x | 9.83e-06 |
Daisy (n = 384)
| Activation | n | LUT Size | Computed | LUT + lerp | Speedup |
|---|---|---|---|---|---|
| SiLU | 384 | 2048 | 47552 cyc (99.1 us) | 16901 cyc (35.2 us) | 2.81x |
Desktop: LUT + linear interpolation is 1.7--2.6x faster than computed expf. A 1024-point table offers a good speed/accuracy tradeoff (~5e-06 max error). Tanh benefits the most since it evaluates two expf calls per element.
Daisy: LUT + linear interpolation is 2.81x faster than computed SiLU, a larger win than desktop (1.87x for SiLU at n=384). The bigger gap reflects the higher relative cost of expf on Cortex-M7 (~124 cyc/element computed vs ~44 cyc/element LUT).
9. Strided Sub-matrix Copy (Desktop only)
WaveNet's gated activation splits a 2N-channel activation into two N-channel halves (for the sigmoid gate and tanh path). When the source matrix has 2N rows but the destination has N rows, source and destination have different column strides in memory, so a simple memcpy won't work. Eigen's .topRows().leftCols() handles this via expression templates; the manual version uses explicit pointer arithmetic to copy row-by-row with the correct strides.
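The manual version exploits the fact that, in a column-major layout, the top rows of each column are contiguous, so the copy is one small memcpy per frame (illustrative sketch):

```cpp
#include <cstring>

// Copy the top dst_rows rows of a column-major (src_rows x frames) matrix
// into a dense (dst_rows x frames) destination.
// Eigen equivalent: dst_mat = src_mat.topRows(dst_rows).leftCols(frames);
void copy_top_rows(const float* src, int src_rows, float* dst, int dst_rows,
                   int frames) {
  for (int f = 0; f < frames; ++f)
    std::memcpy(dst + f * dst_rows,     // dst column f
                src + f * src_rows,     // src column f (different stride)
                sizeof(float) * dst_rows);
}
```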
Desktop
| Rows | Frames | Eigen .topRows() | Manual stride copy | Speedup |
|---|---|---|---|---|
| 4->2 | 48 | 75.9 ns | 21.7 ns | 3.50x |
| 8->4 | 48 | 68.2 ns | 18.3 ns | 3.72x |
| 16->8 | 48 | 76.5 ns | 21.9 ns | 3.49x |
| 4->2 | 2048 | 2.3 us | 659.1 ns | 3.56x |
| 8->4 | 2048 | 2.2 us | 491.4 ns | 4.54x |
| 16->8 | 2048 | 2.5 us | 868.8 ns | 2.90x |
Desktop: Manual strided copy is 2.9--4.5x faster than Eigen's .topRows().leftCols(). Eigen's expression template overhead dominates for non-contiguous copies.
10. Depthwise Conv: Inline vs Eigen Diagonal
Some WaveNet variants use depthwise (channel-wise) 1x1 convolutions, where each channel is scaled independently -- equivalent to multiplying each row of the activation matrix by a scalar weight. Eigen expresses this as weight.asDiagonal() * input, which constructs a diagonal matrix view and dispatches to a general matrix multiply. The inline version skips the abstraction and directly multiplies each element by the corresponding channel weight.
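The inline version is just a per-channel scale (illustrative sketch, column-major layout):

```cpp
// Depthwise 1x1 conv: scale channel c of a column-major (channels x frames)
// matrix by w[c]. Eigen equivalent: out_mat = w_vec.asDiagonal() * in_mat;
void depthwise_scale(const float* w, const float* in, float* out,
                     int channels, int frames) {
  for (int f = 0; f < frames; ++f)
    for (int c = 0; c < channels; ++c)
      out[f * channels + c] = w[c] * in[f * channels + c];
}
```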
Desktop
| Channels | Frames | Eigen asDiagonal() | Inline element-wise | Speedup |
|---|---|---|---|---|
| 2 | 48 | 116.0 ns | 55.1 ns | 2.11x |
| 4 | 48 | 124.3 ns | 36.7 ns | 3.38x |
| 8 | 48 | 182.2 ns | 87.7 ns | 2.08x |
| 2 | 2048 | 3.7 us | 1.3 us | 2.89x |
| 4 | 2048 | 2.7 us | 607.0 ns | 4.49x |
| 8 | 2048 | 4.1 us | 2.2 us | 1.83x |
Daisy (48 frames)
| Channels | Frames | Eigen asDiagonal() | Inline element-wise | Speedup |
|---|---|---|---|---|
| 4 | 48 | 4681 cyc (9.8 us) | 3175 cyc (6.6 us) | 1.47x |
| 8 | 48 | 7356 cyc (15.3 us) | 7478 cyc (15.6 us) | 0.98x |
Desktop: Inline element-wise depthwise conv is 1.8--4.5x faster than Eigen's asDiagonal() multiply across all tested sizes.
Daisy: Inline wins at 4 channels (1.47x) but ties at 8 channels. The 8-channel case may be hitting register pressure limits where Eigen's code generation produces equivalent scalar code.
11. FiLM Layer: Inline vs Eigen Array Expressions
Feature-wise Linear Modulation (FiLM) applies a learned per-channel affine transform -- output = scale * input + shift -- to condition one network's activations on another's predictions. In WaveNet, FiLM layers modulate hidden activations using conditioning signals. Eigen expresses this with .array() broadcasting; the inline version uses a nested loop over channels and frames. "Scale only" omits the shift term, reducing the operation to a per-channel multiply.
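The inline nested-loop version compared here looks like this (illustrative sketch; the Eigen version uses .array() broadcasting over the same layout):

```cpp
// FiLM in place: x(c, f) = scale[c] * x(c, f) + shift[c], on a column-major
// (channels x frames) matrix. Passing shift = nullptr gives "scale only".
void film(float* x, const float* scale, const float* shift,
          int channels, int frames) {
  for (int f = 0; f < frames; ++f)
    for (int c = 0; c < channels; ++c) {
      const int i = f * channels + c;
      x[i] = scale[c] * x[i] + (shift ? shift[c] : 0.0f);
    }
}
```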
Desktop -- Scale + Shift
| Channels | Frames | Eigen .array() | Inline loop | Speedup |
|---|---|---|---|---|
| 4 | 48 | 192.9 ns | 122.8 ns | 1.57x |
| 8 | 48 | 247.9 ns | 213.0 ns | 1.16x |
| 4 | 2048 | 3.6 us | 2.1 us | 1.68x |
| 8 | 2048 | 3.8 us | 10.3 us | 0.36x |
Desktop -- Scale Only
| Channels | Frames | Eigen .array() | Inline loop | Speedup |
|---|---|---|---|---|
| 4 | 48 | 142.1 ns | 70.1 ns | 2.03x |
| 8 | 48 | 152.8 ns | 65.9 ns | 2.32x |
| 4 | 2048 | 2.5 us | 1.4 us | 1.84x |
| 8 | 2048 | 3.1 us | 1.9 us | 1.68x |
Daisy (48 frames)
| Channels | Frames | Eigen .array() | Inline loop | Speedup |
|---|---|---|---|---|
| 4 | 48 | 2725 cyc (5.7 us) | 2449 cyc (5.1 us) | 1.11x |
| 8 | 48 | 4837 cyc (10.1 us) | 4464 cyc (9.3 us) | 1.08x |
Desktop: Inline FiLM is 1.2--2.3x faster in most cases. Scale-only is uniformly faster. The 8-channel x 2048-frame scale+shift case shows a regression (0.36x), likely a cache/vectorization edge case.
Daisy: Inline FiLM is 1.08--1.11x faster, a modest but consistent win across both channel counts.
12. Ring Buffer Write: Eigen Block vs Nested Loop
WaveNet uses dilated causal convolutions, which require access to past input frames. A ring buffer stores these frames in a fixed-size circular matrix, and each new block of frames must be written at the current write position. This is a contiguous multi-channel write into the middle of a larger matrix. Eigen's .middleCols() compiles to an optimized block copy; the nested-loop version writes element-by-element with explicit index arithmetic.
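A sketch of the write path, using one contiguous block copy per column (which is roughly what Eigen's .middleCols() assignment compiles to in the no-wrap case):

```cpp
#include <cstring>

// Write `frames` new columns into a column-major (channels x capacity) ring
// buffer starting at column `pos`. Each column is contiguous in memory.
// Eigen equivalent (no wrap): buf_mat.middleCols(pos, frames) = in_mat;
void ring_write(float* buf, int channels, int capacity, int pos,
                const float* in, int frames) {
  for (int f = 0; f < frames; ++f) {
    const int col = (pos + f) % capacity;   // wrap around the ring
    std::memcpy(buf + col * channels, in + f * channels,
                sizeof(float) * channels);
  }
}
```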
Desktop
| Channels | Frames | Nested loop | Eigen middleCols() | Eigen Speedup |
|---|---|---|---|---|
| 2 | 48 | 57.1 ns | 22.4 ns | 2.56x |
| 4 | 48 | 44.2 ns | 24.0 ns | 1.84x |
| 8 | 48 | 51.5 ns | 48.7 ns | 1.06x |
| 2 | 2048 | 1.7 us | 399.6 ns | 4.25x |
| 4 | 2048 | 760.1 ns | 660.6 ns | 1.15x |
| 8 | 2048 | 954.9 ns | 1.1 us | 0.87x |
Daisy (48 frames)
| Channels | Frames | Nested loop | Eigen middleCols() | Eigen Speedup |
|---|---|---|---|---|
| 4 | 48 | 3690 cyc (7.7 us) | 374 cyc (0.78 us) | 9.86x |
| 8 | 48 | 6713 cyc (14.0 us) | 715 cyc (1.5 us) | 9.39x |
Desktop: Eigen's block assignment wins by up to 4.25x for small channel counts. At 8 channels and large frame counts, the nested loop catches up. Eigen compiles .middleCols() = ... to vectorized contiguous writes that scalar nested loops can't match.
Daisy: Eigen is 9.4--9.9x faster -- an even larger win than desktop. The scalar nested loop generates extremely poor code on ARM, while Eigen compiles to an optimized contiguous copy. This is the strongest argument for keeping Eigen in the codebase.
Summary: Daisy vs Desktop
| Optimization | Desktop (Apple Silicon) | Daisy (Cortex-M7) | Notes |
|---|---|---|---|
| Inline generic GEMM vs Eigen | 0.10--0.38x (Eigen wins) | 1.04--1.47x | Eigen malloc kills ARM perf |
| Unrolled GEMM vs Eigen | 2.04--3.06x | 3.10x | Wins on both platforms |
| DTCM weight placement | N/A | 1.00x | No benefit in isolation |
| __restrict__ | 1.00x | 1.00x | No effect on either |
| memcpy vs Eigen block copy | 1.14--2.15x (memcpy wins) | 0.22--0.23x (Eigen wins) | Newlib memcpy is slow |
| Element-wise add | 1.04--1.06x | 1.08x | Small consistent win |
| Bias broadcast | 1.05--1.14x | 0.95--0.98x (Eigen wins) | Platform-dependent |
| Hardswish branchless | 3.06--3.38x | 1.20--1.21x | Less impact without SIMD |
| ReLU unrolling | 1.19x | 1.09x | Modest on both |
| LUT SiLU vs computed | 1.87x | 2.81x | Bigger win on ARM (expf costlier) |
| Depthwise inline | 2.08--3.38x | 1.47x (4ch only) | 8ch ties on ARM |
| FiLM inline | 1.16--1.57x | 1.08--1.11x | Consistent small win |
| Ring buffer Eigen | 1.84--4.25x | 9.39--9.86x | Eigen's biggest win, even larger on ARM |
JSON: when parsing text becomes a problem
NeuralAmpModelerCore uses nlohmann::json to parse .nam files, but it requires a lot of memory to parse a model, and memory is not exactly something we have plenty of... For example, an A1 NAM model stored as a JSON file takes almost 400 KB of memory. Remember when we said the Daisy Seed has 1 MB of SRAM? Yeah, that is going to be a problem.
JSON is cool and very accessible for users: you can just open it in a text editor, or parse it without any special dependencies in most programming languages. And parsing JSON is definitely not an issue on desktop (your browser does it thousands of times every day without breaking a sweat)... so we did not want to replace the JSON format.
Our approach was to refactor the factory methods in NeuralAmpModelerCore so they expect a model configuration object instead of a JSON object, and to offload the parsing of JSON (or any other file-based representation of models) to a dedicated parser. This way, both JSON and other formats use the same "entry point" to build a model from a file; they just use different parsers.
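The shape of that refactor can be sketched as follows (a hypothetical illustration: every name here is made up for the example and is not the actual Core API):

```cpp
#include <string>
#include <vector>

// Format-agnostic model description: factories consume this instead of a
// JSON object. (Illustrative only -- field names are NOT the real Core API.)
struct ModelConfig {
  std::string architecture;      // e.g. "WaveNet"
  double expected_sample_rate;   // Hz
  std::vector<float> weights;    // flat weight blob, layout fixed by arch
};

// Each on-disk format supplies its own parser producing the same config,
// so .nam (JSON) and a binary format can share one model-building entry point.
struct ModelParser {
  virtual ~ModelParser() = default;
  virtual ModelConfig parse(const std::vector<unsigned char>& bytes) = 0;
};
```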
We then came up with the namb format, a compact binary format for models to be used instead of the regular nam format on embedded devices. This format is meant to be used as an exchange format between a controller application (say, an app that runs on your phone and loads models) and the embedded device. The app converts the nam format retrieved from the T3K website into a compact format better suited to your pedal, then transmits it over Bluetooth or a USB cable. There is no model conversion here: think of namb just as a more "compressed" format. Learn more at this repo.
Contributions
All of the optimizations that helped us run NeuralAmpModelerCore on the Daisy were merged, meaning they are readily available for anyone who wants to try something similar. They currently require compiling with USE_NAM_INLINE_GEMM defined (e.g. CXXFLAGS += -DUSE_NAM_INLINE_GEMM).
To enable our binary loader, model construction in Core was decoupled from JSON parsing, which was also merged. The code for nam-binary-loader is available as a separate library and tool in this repo.
Finally, we are publishing example code for the Daisy Seed board, to be used as a blueprint for porting NeuralAmpModelerCore to embedded targets. This was our target during optimization: we worked until we were able to run an A1-nano-ReLU model in real time on the Daisy Seed. The code and instructions for building and running it on the Daisy are in their own repo.
We hope this helps anyone who is interested in experimenting with NAM support on their own hardware.