Bringing NAM A2 to Embedded Hardware
11 February, 2026
Note: this post is a companion to a post I wrote for the Tone3000 blog, with a more in-depth discussion of all the optimizations and results we found.
We all love NAM, and we want to run our favourite captures EVERYWHERE. That is what we have been trying to achieve with A2, especially with Slimmable models. As part of the development of A2, we are working closely with partners to make sure A2 suits their hardware. We also wanted a general idea of what running NAM looks like on hardware platforms with more resources, so we decided to get our hands dirty and implement a simple NAM loader for the Electrosmith Daisy Seed, a board built around an ARM Cortex-M7 that is currently very popular for DSP applications.
The Daisy Seed board has the following specs:
- ARM Cortex M7 running at 480 MHz
- 1 MB of SRAM (of which 128 kB are tightly coupled RAM for data)
- 8 MB of Flash memory (for firmware/stored models)
- 96 kHz stereo codec (we used 48 kHz mono)
The Daisy Seed optionally provides an extra 64 MB of SDRAM, but that RAM is much slower than the on-board SRAM. People usually use it for loopers and delays, not for anything that requires intense real-time math like a neural model... so we ignored it in these experiments.
To make testing and benchmarking easier, we plugged the Daisy Seed into a Daisy Pod, which is basically a powered breakout board for the Daisy that includes stereo audio I/O, a micro SD card slot, and a couple of knobs. Most important for us was the SD card, as we could load a bunch of NAM models into it and have our benchmarking software run all of them sequentially (and write the logs to the SD card as well).
Challenges
We knew from the beginning that the memory constraints were strict (that is one of the reasons for Slimmable NAM). With 1 MB for both the model weights and any buffers, we are restricted in the size of model we can use. We decided to try a model smaller than A1, known in the community as A1-nano, with a small but important tweak: replacing the tanh activation with a ReLU. Tanh is computationally expensive, which long ago led the deep learning community to look for cheaper alternatives. ReLU fits the bill: it is one of the simplest activation functions in use, and it appears in lots of current AI models.
While the code in NeuralAmpModelerCore has been battle-tested in the NAM plugin, it was not designed with embedded applications in mind. When we first tried to run it on the Daisy Seed, even with an A1-nano-ReLU model, it took over 5 seconds of compute time to process 2 seconds of audio... which clearly will not work for a pedal that is supposed to process audio in real time.
Optimization process
To tackle that, we first instrumented the code with profiling to capture how long it spent on each operation, so we could identify easy targets for optimization. The main culprit for the slowdown was Eigen, the library we use for linear algebra: specifically, how it handles matrix multiplications with small matrices and how it allocates temporary memory on embedded devices without an operating system. We added specialized operations for the specific matrix sizes used in A1-nano-ReLU, as well as other unrelated optimizations.
Once we identified all the bottlenecks in the code for the A1-nano-ReLU model, we came up with different optimization tactics. I would love to say that I first implemented microbenchmarks to see which ones really mattered and then added them to the code, but the real process was actually in reverse: I started by optimizing the actual Core code, then implemented microbenchmarks to better understand what was going on 🤡.
The microbenchmarks are all available in this repository. They can serve as a reference for your own hardware if you are implementing an embedded device that runs NAM.
Results
Micro-benchmark Results: Inline Ops vs Eigen
These benchmarks measure individual operations in isolation to guide optimization decisions for the NeuralAmpModelerCore WaveNet inference engine. Each section presents results for both platforms side-by-side.
Platforms
| | Desktop | Daisy |
|---|---|---|
| CPU | Apple Silicon (Macbook Pro M5) | ARM Cortex-M7 @ 480 MHz, FTZ=1, DN=1 |
| Compiler flags | -O3 -ffast-math -march=native -DNDEBUG -std=c++17 | -O2 -ffast-math -funroll-loops -ftree-vectorize -std=gnu++17 |
| Platform | macOS | Electrosmith Daisy Seed (STM32H750, 512 KB SRAM, 128 KB DTCM) |
| Eigen parallelization | Disabled (-DEIGEN_DONT_PARALLELIZE) | Disabled (-DEIGEN_DONT_PARALLELIZE) |
| Timing | std::chrono::high_resolution_clock | DWT CYCCNT (cycle-accurate, zero overhead). 1 cycle = 2.08 ns @ 480 MHz |
| Iterations | 5000 (warmup: 100) | 1000 (warmup: 100) |
1. Inline GEMM vs Eigen GEMM
The 1x1 convolution (Conv1x1) is the single hottest operation in WaveNet inference -- it runs once per layer per frame-block to mix channels. It boils down to a matrix-matrix multiply: (channels_out x channels_in) * (channels_in x frames). Eigen's general GEMM path allocates temporary panel buffers via malloc on every call, which is fine on desktop but catastrophically slow on Cortex-M7 with newlib. The "inline generic" variant replaces Eigen with a plain triple-loop; the "unrolled" variant fully unrolls the loop at compile time for known dimension pairs (e.g., 4x8), keeping weights in registers.
Matrix sizes matching typical WaveNet 1x1 convolution dimensions (channels x channels, applied across frames). Each size is measured once: Eigen is benchmarked first, then the inline variant(s) run against that same baseline with the same input data.
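To make the comparison concrete, here is a minimal sketch of the two inline variants (illustrative code, not the actual NeuralAmpModelerCore kernels), using Eigen's default column-major layout for the (channels x frames) activation matrices:

```cpp
#include <cassert>

// Generic inline GEMM: out = W * in, with W (co x ci) stored row-major and
// in/out stored column-major (one frame per column).
void gemm_generic(const float* W, const float* in, float* out,
                  int co, int ci, int frames) {
  for (int f = 0; f < frames; ++f)
    for (int r = 0; r < co; ++r) {
      float acc = 0.0f;
      for (int c = 0; c < ci; ++c)
        acc += W[r * ci + c] * in[f * ci + c];
      out[f * co + r] = acc;
    }
}

// Fully unrolled variant: dimensions are template parameters, so every trip
// count is known at compile time. The compiler unrolls the loops completely
// and can keep the small (CO x CI) weight block in registers across frames.
template <int CO, int CI>
void gemm_unrolled(const float* W, const float* in, float* out, int frames) {
  float w[CO][CI];                      // hoist weights out of the hot loop
  for (int r = 0; r < CO; ++r)
    for (int c = 0; c < CI; ++c)
      w[r][c] = W[r * CI + c];
  for (int f = 0; f < frames; ++f)
    for (int r = 0; r < CO; ++r) {
      float acc = 0.0f;
      for (int c = 0; c < CI; ++c)      // compile-time trip count: unrolled
        acc += w[r][c] * in[f * CI + c];
      out[f * CO + r] = acc;
    }
}
```

Neither variant allocates anything, which is the key difference from Eigen's general GEMM path on the embedded target.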
Desktop (48 frames)
| Size | Eigen | Inline Generic | Speedup |
|---|---|---|---|
| 2x2 | 191.1 ns | 122.7 ns | 1.56x |
| 4x4 | 196.2 ns | 519.4 ns | 0.38x |
| 4x8 | 230.2 ns | 1.0 us | 0.23x |
| 4x8 (unrolled) | 230.2 ns | 75.2 ns | 3.06x |
| 8x4 | 199.8 ns | 992.9 ns | 0.20x |
| 8x8 | 227.4 ns | 1.6 us | 0.14x |
| 8x8 (unrolled) | 227.4 ns | 111.6 ns | 2.04x |
Desktop (2048 frames)
| Size | Eigen | Inline Generic | Speedup |
|---|---|---|---|
| 2x2 | 2.1 us | 2.3 us | 0.90x |
| 4x4 | 2.0 us | 8.4 us | 0.24x |
| 4x8 | 2.4 us | 18.8 us | 0.13x |
| 4x8 (unrolled) | 2.4 us | 1.9 us | 1.28x |
| 8x4 | 2.6 us | 16.8 us | 0.15x |
| 8x8 | 3.9 us | 38.0 us | 0.10x |
| 8x8 (unrolled) | 3.9 us | 3.4 us | 1.15x |
Daisy (48 frames)
| Size | Eigen | Inline Generic | Speedup |
|---|---|---|---|
| 2x2 | 4044 cyc (8.4 us) | 2759 cyc (5.7 us) | 1.47x |
| 4x4 | 8117 cyc (16.9 us) | 7175 cyc (14.9 us) | 1.13x |
| 4x8 | 10657 cyc (22.2 us) | 9385 cyc (19.6 us) | 1.14x |
| 4x8 (unrolled) | 10657 cyc (22.2 us) | 3437 cyc (7.2 us) | 3.10x |
| 8x4 | 14314 cyc (29.8 us) | 13701 cyc (28.5 us) | 1.04x |
| 8x8 | 19578 cyc (40.8 us) | 18125 cyc (37.8 us) | 1.08x |
Desktop: Generic inline triple-loops lose badly to Eigen (0.10--0.38x) because Eigen leverages NEON SIMD. Fully unrolled compile-time specializations beat Eigen at all sizes and frame counts (1.15--3.06x), with the largest wins at small frame counts where Eigen's setup overhead is proportionally larger.
Daisy: Inline GEMM beats Eigen at every size because Eigen's general GEMM path calls malloc/free for panel buffer allocation on every invocation. The generic triple-loop wins by 1.04--1.47x. The fully unrolled 4x8 specialization (weights loaded into registers) wins by 3.10x -- the biggest single optimization in the entire WaveNet pipeline.
1b. DTCM Memory Placement Effect on GEMM (Daisy only)
The STM32H750 has 128 KB of DTCM (Data Tightly-Coupled Memory) that provides single-cycle, deterministic access -- no cache misses, no wait states. Production code copies model weights into DTCM to avoid cache eviction during the full WaveNet pipeline. This benchmark isolates the question: does DTCM placement speed up the GEMM kernel itself, or is the benefit only about avoiding interference from other data?
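Placing a buffer in DTCM is done with a linker-section attribute. A hedged sketch follows: the section name ".dtcmram" is an assumption that must match whatever your linker script actually defines, and the host-build fallback keeps the code portable:

```cpp
// Map a buffer into DTCM when building for the STM32H750 target; fall back
// to ordinary static storage on the host. NOTE: the section name ".dtcmram"
// is illustrative only -- check your linker script.
#if defined(STM32H750xx)
#define DTCM_BSS __attribute__((section(".dtcmram")))
#else
#define DTCM_BSS
#endif

// Weights copied here once at model-load time cannot be evicted from cache
// by other traffic during the WaveNet pipeline.
DTCM_BSS static float g_dtcm_weights[8 * 8];

float* dtcm_weights() { return g_dtcm_weights; }
```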
Generic Inline GEMM
| Size | SRAM (baseline) | DTCM weight | Speedup | DTCM both | Speedup |
|---|---|---|---|---|---|
| 2x2 | 2758 cyc (5.7 us) | 2763 cyc (5.8 us) | 1.00x | 2897 cyc (6.0 us) | 0.95x |
| 4x4 | 7173 cyc (14.9 us) | 7179 cyc (15.0 us) | 1.00x | 7316 cyc (15.2 us) | 0.98x |
| 4x8 | 9384 cyc (19.6 us) | 9390 cyc (19.6 us) | 1.00x | 9528 cyc (19.9 us) | 0.98x |
| 8x8 | 18125 cyc (37.8 us) | 18136 cyc (37.8 us) | 1.00x | 18285 cyc (38.1 us) | 0.99x |
Unrolled 4x8 GEMM
| Variant | Cycles | Speedup vs SRAM |
|---|---|---|
| SRAM (baseline) | 3437 cyc (7.2 us) | -- |
| DTCM weight only | 3555 cyc (7.4 us) | 0.97x |
| DTCM weight + input | 3430 cyc (7.1 us) | 1.00x |
Daisy: DTCM placement has no measurable benefit for GEMM operands. In some cases it is marginally slower (up to 5%). The Cortex-M7's D-cache achieves near-single-cycle latency for hot data that fits in cache, and these small weight matrices (64--256 bytes) are always cache-hot after warmup. DTCM's advantage is deterministic latency (no cold-start misses), not throughput for steady-state computation. The production DTCM weight copy remains justified for avoiding cache eviction by other data in the full WaveNet pipeline, but the GEMM kernel itself sees no benefit in isolation. This benchmark is not super representative of real-world usage, as we'd often be juggling multiple operations and invalidating the cache.
2. Effect of __restrict__ on Inline GEMM
The __restrict__ qualifier tells the compiler that pointers don't alias, theoretically enabling more aggressive load/store reordering and vectorization in tight loops. Since our inline GEMM reads from two input matrices and writes to a third, aliasing analysis could matter. Both versions use noinline to prevent the compiler from deducing non-aliasing from calling context, isolating the effect of the qualifier itself.
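In code, the benchmark compares signatures like the following (a sketch; `__attribute__((noinline))` and `__restrict__` are the GCC/Clang spellings):

```cpp
#include <cassert>

// Without the qualifier, the compiler must assume `out` may alias W or in,
// which can force conservative load/store ordering.
__attribute__((noinline))
void gemm_plain(const float* W, const float* in, float* out,
                int co, int ci, int frames) {
  for (int f = 0; f < frames; ++f)
    for (int r = 0; r < co; ++r) {
      float acc = 0.0f;
      for (int c = 0; c < ci; ++c)
        acc += W[r * ci + c] * in[f * ci + c];
      out[f * co + r] = acc;
    }
}

// __restrict__ promises the three buffers never overlap. In these
// benchmarks it made essentially no difference on either platform.
__attribute__((noinline))
void gemm_restrict(const float* __restrict__ W, const float* __restrict__ in,
                   float* __restrict__ out, int co, int ci, int frames) {
  for (int f = 0; f < frames; ++f)
    for (int r = 0; r < co; ++r) {
      float acc = 0.0f;
      for (int c = 0; c < ci; ++c)
        acc += W[r * ci + c] * in[f * ci + c];
      out[f * co + r] = acc;
    }
}
```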
Desktop (48 frames)
| Size | Without __restrict__ | With __restrict__ | Speedup |
|---|---|---|---|
| 4x4 (generic) | 446.8 ns | 464.8 ns | 0.96x |
| 4x8 (generic) | 916.4 ns | 1.1 us | 0.83x |
| 8x8 (generic) | 1.5 us | 1.5 us | 1.00x |
| 4x8 (unrolled) | 36.4 ns | 36.3 ns | 1.00x |
Desktop (2048 frames)
| Size | Without __restrict__ | With __restrict__ | Speedup |
|---|---|---|---|
| 4x4 (generic) | 10.1 us | 8.5 us | 1.19x |
| 4x8 (generic) | 21.1 us | 20.4 us | 1.03x |
| 8x8 (generic) | 41.0 us | 41.6 us | 0.99x |
| 4x8 (unrolled) | 1.9 us | 1.9 us | 1.01x |
Daisy (48 frames)
| Size | Without __restrict__ | With __restrict__ | Speedup |
|---|---|---|---|
| 4x4 | 7172 cyc (14.9 us) | 7174 cyc (14.9 us) | 1.00x |
| 4x8 | 9384 cyc (19.5 us) | 9385 cyc (19.6 us) | 1.00x |
| 8x8 | 18125 cyc (37.8 us) | 18126 cyc (37.8 us) | 1.00x |
Desktop: __restrict__ has negligible effect in most cases. The compiler already deduces non-aliasing. The one outlier (4x4 at 2048 frames, 1.19x) is likely noise or a minor alignment effect.
Daisy: Zero measurable effect, same as desktop. The compiler already deduces non-aliasing regardless of platform.
3. std::memcpy vs Eigen Block Assignment
WaveNet layers frequently copy intermediate activation buffers -- for example, saving the pre-activation state for the residual path, or copying gated outputs into the next layer's input. These are contiguous matrix copies (all elements in one flat buffer). The question is whether a plain std::memcpy or Eigen's block assignment (.leftCols() = ...) generates faster code, given that memcpy quality varies dramatically across C libraries.
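Both halves of the comparison copy the same contiguous bytes; a dependency-free sketch of the two shapes (the Eigen call appears only as a comment):

```cpp
#include <cstring>

// Contiguous copy of a (channels x frames) column-major activation block.
// Eigen equivalent: dst_mat.leftCols(frames) = src_mat.leftCols(frames);
void copy_memcpy(const float* src, float* dst, int channels, int frames) {
  std::memcpy(dst, src, sizeof(float) * channels * frames);
}

// Roughly what Eigen's block assignment lowers to: a plain element loop the
// compiler can vectorize with register-width stores.
void copy_loop(const float* src, float* dst, int channels, int frames) {
  const int n = channels * frames;
  for (int i = 0; i < n; ++i) dst[i] = src[i];
}
```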
Desktop (48 frames)
| Rows | Eigen block assign | std::memcpy | Speedup |
|---|---|---|---|
| 2 | 25.1 ns | 20.6 ns | 1.21x |
| 4 | 25.9 ns | 27.1 ns | 0.96x |
| 8 | 37.6 ns | 32.9 ns | 1.14x |
| 16 | 79.9 ns | 49.9 ns | 1.60x |
Desktop (2048 frames)
| Rows | Eigen block assign | std::memcpy | Speedup |
|---|---|---|---|
| 2 | 379.2 ns | 219.6 ns | 1.73x |
| 4 | 767.8 ns | 375.6 ns | 2.04x |
| 8 | 1.2 us | 571.1 ns | 2.15x |
| 16 | 2.0 us | 1.3 us | 1.60x |
Daisy (48 frames)
| Rows | Eigen block assign | std::memcpy | memcpy Speedup |
|---|---|---|---|
| 4 | 379 cyc (0.79 us) | 1636 cyc (3.4 us) | 0.23x |
| 8 | 715 cyc (1.5 us) | 3175 cyc (6.6 us) | 0.23x |
| 16 | 1344 cyc (2.8 us) | 6248 cyc (13.0 us) | 0.22x |
Desktop: std::memcpy consistently beats Eigen block assignment, with the gap widening at larger sizes (up to 2.15x). The platform's highly optimized memcpy implementation (NEON/AVX) outperforms Eigen's generated stores.
Daisy: The opposite of desktop -- Eigen block assignment is 4.3--4.6x faster than std::memcpy. Newlib's memcpy implementation is not optimized for the Cortex-M7's AXI bus and D-cache; Eigen's generated code likely uses more efficient register-width stores.
4. Element-wise Operations: Unrolled Loop vs Eigen
Element-wise addition and accumulation appear throughout WaveNet: skip connections sum layer outputs (z = a + b), and residual connections accumulate into a running buffer (dst += src). Eigen uses expression templates that defer evaluation and fuse operations, but for simple element-wise ops the template machinery may add overhead compared to a plain loop that the compiler can auto-vectorize directly.
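A sketch of the plain-loop variants (illustrative, with 4-wide unrolling and a scalar tail):

```cpp
// Skip-connection sum (z = a + b), unrolled 4-wide.
void add4(const float* a, const float* b, float* z, int n) {
  int i = 0;
  for (; i + 4 <= n; i += 4) {
    z[i]     = a[i]     + b[i];
    z[i + 1] = a[i + 1] + b[i + 1];
    z[i + 2] = a[i + 2] + b[i + 2];
    z[i + 3] = a[i + 3] + b[i + 3];
  }
  for (; i < n; ++i) z[i] = a[i] + b[i];  // scalar tail
}

// Residual accumulation (dst += src), same shape.
void acc4(float* dst, const float* src, int n) {
  int i = 0;
  for (; i + 4 <= n; i += 4) {
    dst[i]     += src[i];
    dst[i + 1] += src[i + 1];
    dst[i + 2] += src[i + 2];
    dst[i + 3] += src[i + 3];
  }
  for (; i < n; ++i) dst[i] += src[i];
}
```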
Desktop -- Addition (z = a + b)
| Channels | Frames | Eigen | Unrolled | Speedup |
|---|---|---|---|---|
| 4 | 48 | 34.4 ns | 32.9 ns | 1.04x |
| 8 | 48 | 63.9 ns | 60.4 ns | 1.06x |
| 16 | 48 | 100.9 ns | 73.2 ns | 1.38x |
| 4 | 2048 | 824.9 ns | 541.1 ns | 1.52x |
| 8 | 2048 | 1.4 us | 903.5 ns | 1.51x |
| 16 | 2048 | 3.2 us | 2.7 us | 1.19x |
Desktop -- Accumulation (dst += src)
| Channels | Frames | Eigen | Unrolled | Speedup |
|---|---|---|---|---|
| 4 | 48 | 43.0 ns | 31.7 ns | 1.36x |
| 8 | 48 | 95.6 ns | 64.9 ns | 1.47x |
| 16 | 48 | 208.1 ns | 141.3 ns | 1.47x |
| 4 | 2048 | 736.8 ns | 594.0 ns | 1.24x |
| 8 | 2048 | 2.3 us | 2.3 us | 1.02x |
| 16 | 2048 | 3.9 us | 3.9 us | 1.01x |
Daisy -- Addition (z = a + b, 48 frames)
| Channels | Frames | Eigen a + b | Unrolled loop | Speedup |
|---|---|---|---|---|
| 4 | 48 | 756 cyc (1.6 us) | 698 cyc (1.5 us) | 1.08x |
| 8 | 48 | 1454 cyc (3.0 us) | 1346 cyc (2.8 us) | 1.08x |
Desktop: Unrolled loops are 1.0--1.5x faster for element-wise ops, with bigger wins at small/medium buffer sizes typical of real-time audio. Accumulation shows larger gains at small frame counts.
Daisy: Unrolled loops are modestly faster (1.08x), consistent with desktop results at 48 frames. The savings per call are small (~60--110 cycles) but accumulate over many WaveNet layers.
5. Bias Broadcast: Unrolled vs Eigen colwise()
Every convolution layer adds a per-channel bias vector after the matrix multiply: each element of the bias is broadcast across all frames in that channel's row. Eigen provides colwise() for this pattern; the alternative is a hand-written loop that iterates over channels and uses pointer arithmetic to stride across frames.
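The hand-written variant looks roughly like this (a sketch, assuming the column-major channels x frames layout):

```cpp
// Per-channel bias broadcast over a column-major (channels x frames) matrix:
// iterate channels, then use pointer arithmetic to stride across frames.
// Eigen equivalent: x_mat.colwise() += bias_vec;
void add_bias(float* x, const float* bias, int channels, int frames) {
  for (int c = 0; c < channels; ++c) {
    const float b = bias[c];
    float* p = x + c;              // element (c, f) lives at x[f*channels + c]
    for (int f = 0; f < frames; ++f, p += channels)
      *p += b;
  }
}
```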
Desktop
| Channels | Frames | Eigen colwise() | Unrolled | Speedup |
|---|---|---|---|---|
| 2 | 48 | 77.2 ns | 71.3 ns | 1.08x |
| 4 | 48 | 60.4 ns | 57.3 ns | 1.05x |
| 8 | 48 | 99.3 ns | 87.3 ns | 1.14x |
| 2 | 2048 | 1.7 us | 1.2 us | 1.38x |
| 4 | 2048 | 1.0 us | 630.5 ns | 1.66x |
| 8 | 2048 | 1.6 us | 1.3 us | 1.24x |
Daisy (48 frames)
| Channels | Frames | Eigen colwise() | Unrolled loop | Speedup |
|---|---|---|---|---|
| 4 | 48 | 3541 cyc (7.4 us) | 3744 cyc (7.8 us) | 0.95x |
| 8 | 48 | 5721 cyc (11.9 us) | 5838 cyc (12.2 us) | 0.98x |
Desktop: Unrolled bias broadcast is 1.05--1.66x faster than Eigen's colwise(), with larger wins at higher frame counts.
Daisy: Eigen wins by 2--5%, reversing the desktop result. Eigen's colwise() likely generates tighter scalar code on ARM than the hand-written loop. Keep Eigen for bias broadcast on Daisy.
6. Hardswish: Branchy vs Branchless
Hardswish is a piecewise activation function with three regions: 0 for x <= -3, x for x >= 3, and x * (x + 3) / 6 in between. The naive implementation uses if/else branches, which cause pipeline stalls from misprediction when inputs span all three regions (as they do in typical WaveNet activations). The branchless version uses fmin/fmax to compute the same result without conditional jumps. Input values are uniformly distributed in [-5, 5] to trigger worst-case branch misprediction.
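A sketch of the two variants; the clamp in the branchless version reproduces all three regions of the piecewise definition without a conditional jump:

```cpp
#include <cmath>

// Branchy reference implementation.
float hardswish_branchy(float x) {
  if (x <= -3.0f) return 0.0f;
  if (x >= 3.0f) return x;
  return x * (x + 3.0f) / 6.0f;
}

// Branchless: clamp (x + 3) into [0, 6], then one multiply.
//   x <= -3: clamp gives 0      -> result 0
//   x >=  3: clamp gives 6      -> result x * 6/6 = x
//   else:    clamp gives x + 3  -> result x * (x + 3) / 6
float hardswish_branchless(float x) {
  return x * std::fmin(std::fmax(x + 3.0f, 0.0f), 6.0f) * (1.0f / 6.0f);
}
```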
Desktop
| n | Branchy | Branchless | Unrolled | Branchless Speedup | Max Error |
|---|---|---|---|---|---|
| 192 | 109.6 ns | 35.8 ns | 49.4 ns | 3.06x | 4.77e-07 |
| 384 | 247.4 ns | 73.2 ns | 86.2 ns | 3.38x | 4.77e-07 |
| 8192 | 6.0 us | 854.4 ns | 1.2 us | 6.97x | 4.77e-07 |
| 16384 | 9.6 us | 1.7 us | 2.4 us | 5.73x | 4.77e-07 |
| 32768 | 21.2 us | 4.5 us | 5.9 us | 4.76x | 4.77e-07 |
Daisy
| n | Branchy (if/else) | Branchless + unrolled | Speedup |
|---|---|---|---|
| 192 | 6248 cyc (13.0 us) | 5209 cyc (10.9 us) | 1.20x |
| 384 | 12413 cyc (25.9 us) | 10245 cyc (21.3 us) | 1.21x |
Desktop: Branchless hardswish is 3--7x faster than the branchy version. The simple branchless scalar loop actually beats the manually unrolled variant -- the compiler auto-vectorizes the scalar version more effectively.
Daisy: Branchless hardswish is 1.20--1.21x faster, a much smaller win than desktop. The Cortex-M7 branch predictor handles the 3-region pattern better than expected, and without SIMD the branchless version can't exploit vectorization. Still worth using since it's a consistent 20% improvement.
7. Activation Loop Unrolling (1-wide vs 4-wide)
Activation functions are applied element-wise to every value in the activation buffer after each convolution. Manual 4-wide unrolling processes four elements per loop iteration, giving the compiler explicit instruction-level parallelism (ILP) and aligning with SIMD register widths (4 floats = 1 SSE/NEON register). The benefit depends on whether the activation is compute-bound (cheap ops like ReLU) or latency-bound (expensive ops like expf in Sigmoid/SiLU).
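For a cheap activation like ReLU, the two loop shapes compared here look like this (illustrative sketch):

```cpp
#include <cmath>

// 1-wide: one element per iteration; the compiler may still auto-vectorize.
void relu1(float* x, int n) {
  for (int i = 0; i < n; ++i) x[i] = std::fmax(x[i], 0.0f);
}

// 4-wide: four independent operations per iteration give explicit ILP and
// map onto one 4-float SIMD register where the ISA has one.
void relu4(float* x, int n) {
  int i = 0;
  for (; i + 4 <= n; i += 4) {
    x[i]     = std::fmax(x[i],     0.0f);
    x[i + 1] = std::fmax(x[i + 1], 0.0f);
    x[i + 2] = std::fmax(x[i + 2], 0.0f);
    x[i + 3] = std::fmax(x[i + 3], 0.0f);
  }
  for (; i < n; ++i) x[i] = std::fmax(x[i], 0.0f);  // scalar tail
}
```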
Desktop
| Activation | n | 1-wide | 4-wide | Speedup |
|---|---|---|---|---|
| ReLU | 192 | 58.0 ns | 58.1 ns | 1.00x |
| ReLU | 384 | 105.5 ns | 88.3 ns | 1.19x |
| ReLU | 16384 | 2.9 us | 2.0 us | 1.44x |
| Sigmoid | 192 | 317.2 ns | 313.6 ns | 1.01x |
| Sigmoid | 384 | 645.5 ns | 612.8 ns | 1.05x |
| Sigmoid | 16384 | 18.0 us | 16.2 us | 1.11x |
| SiLU | 192 | 188.6 ns | 191.4 ns | 0.99x |
| SiLU | 384 | 390.5 ns | 393.5 ns | 0.99x |
| SiLU | 16384 | 16.5 us | 16.5 us | 0.99x |
| Softsign | 192 | 19.5 ns | 19.1 ns | 1.02x |
| Softsign | 384 | 35.6 ns | 34.1 ns | 1.04x |
| Softsign | 16384 | 1.6 us | 1.5 us | 1.05x |
Daisy
| Activation | n | 1-wide | 4-wide | Speedup |
|---|---|---|---|---|
| ReLU | 384 | 4269 cyc (8.9 us) | 3918 cyc (8.2 us) | 1.09x |
| Sigmoid | 384 | 47625 cyc (99.2 us) | 47325 cyc (98.6 us) | 1.01x |
Desktop: Manual 4-wide unrolling helps ReLU (up to 1.44x) but has negligible effect on transcendental-heavy activations (Sigmoid, SiLU) where expf dominates the cost. Softsign is already fast enough that unrolling provides no meaningful benefit.
Daisy: A modest 9% win for ReLU, consistent with desktop at n=384. Sigmoid is dominated by expf cost (~124 cycles/element), making unrolling irrelevant.
8. LUT Activation vs Computed (expf)
Sigmoid, SiLU, and Tanh all depend on expf(), which is the single most expensive scalar operation in WaveNet inference (~124 cycles per element on Cortex-M7). A lookup table (LUT) pre-computes the activation over a fixed input range, then uses linear interpolation between table entries at runtime -- replacing the expf call with an array index, a multiply, and an add. The tradeoff is a small approximation error that depends on table size. The computed baseline is measured once per activation/size and shared across all LUT sizes.
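A minimal sketch of the LUT + lerp scheme for sigmoid (the [-6, 6] range and 1024-point size here are illustrative choices, not the benchmark's exact parameters):

```cpp
#include <cmath>

// Sigmoid lookup table over [-6, 6] with linear interpolation between
// entries; inputs outside the range clamp to the table ends (sigmoid is
// already saturated there).
struct SigmoidLUT {
  static constexpr int N = 1024;
  static constexpr float LO = -6.0f, HI = 6.0f;
  float table[N + 1];

  SigmoidLUT() {
    for (int i = 0; i <= N; ++i) {
      const float x = LO + (HI - LO) * i / N;
      table[i] = 1.0f / (1.0f + std::exp(-x));
    }
  }

  // Replaces expf with: one array index, one multiply, one add.
  float operator()(float x) const {
    const float t = (x - LO) * (N / (HI - LO));
    if (t <= 0.0f) return table[0];
    if (t >= (float)N) return table[N];
    const int i = (int)t;
    const float frac = t - (float)i;
    return table[i] + frac * (table[i + 1] - table[i]);
  }
};
```

Building the table once at load time costs N+1 expf calls; every activation afterwards is table lookups only.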
Desktop (n = 192)
| Activation | LUT Size | Computed | LUT + lerp | Speedup | Max Error |
|---|---|---|---|---|---|
| Sigmoid | 256 | 512.1 ns | 275.2 ns | 1.86x | 7.39e-05 |
| Sigmoid | 1024 | 512.1 ns | 257.5 ns | 1.99x | 4.59e-06 |
| Sigmoid | 2048 | 512.1 ns | 256.5 ns | 2.00x | 1.19e-06 |
| SiLU | 256 | 452.6 ns | 255.8 ns | 1.77x | 3.80e-04 |
| SiLU | 1024 | 452.6 ns | 256.9 ns | 1.76x | 2.32e-05 |
| SiLU | 2048 | 452.6 ns | 227.9 ns | 1.99x | 6.24e-06 |
| Tanh | 256 | 495.8 ns | 200.0 ns | 2.48x | 5.79e-04 |
| Tanh | 1024 | 495.8 ns | 197.7 ns | 2.51x | 3.63e-05 |
| Tanh | 2048 | 495.8 ns | 197.6 ns | 2.51x | 9.00e-06 |
Desktop (n = 384)
| Activation | LUT Size | Computed | LUT + lerp | Speedup | Max Error |
|---|---|---|---|---|---|
| Sigmoid | 256 | 677.2 ns | 359.5 ns | 1.88x | 7.38e-05 |
| Sigmoid | 1024 | 677.2 ns | 359.6 ns | 1.88x | 4.63e-06 |
| Sigmoid | 2048 | 677.2 ns | 349.8 ns | 1.94x | 1.25e-06 |
| SiLU | 256 | 577.9 ns | 328.1 ns | 1.76x | 3.83e-04 |
| SiLU | 1024 | 577.9 ns | 307.8 ns | 1.88x | 2.38e-05 |
| SiLU | 2048 | 577.9 ns | 308.3 ns | 1.87x | 6.36e-06 |
| Tanh | 256 | 693.8 ns | 298.1 ns | 2.33x | 5.70e-04 |
| Tanh | 1024 | 693.8 ns | 286.3 ns | 2.42x | 3.71e-05 |
| Tanh | 2048 | 693.8 ns | 286.5 ns | 2.42x | 9.24e-06 |
Desktop (n = 16384)
| Activation | LUT Size | Computed | LUT + lerp | Speedup | Max Error |
|---|---|---|---|---|---|
| Sigmoid | 256 | 16.7 us | 9.4 us | 1.77x | 7.41e-05 |
| Sigmoid | 1024 | 16.7 us | 9.6 us | 1.74x | 4.75e-06 |
| Sigmoid | 2048 | 16.7 us | 9.7 us | 1.71x | 1.31e-06 |
| SiLU | 256 | 16.3 us | 9.5 us | 1.71x | 3.85e-04 |
| SiLU | 1024 | 16.3 us | 9.6 us | 1.69x | 2.45e-05 |
| SiLU | 2048 | 16.3 us | 9.7 us | 1.68x | 6.57e-06 |
| Tanh | 256 | 24.4 us | 9.5 us | 2.56x | 5.91e-04 |
| Tanh | 1024 | 24.4 us | 9.7 us | 2.52x | 3.71e-05 |
| Tanh | 2048 | 24.4 us | 9.7 us | 2.50x | 9.83e-06 |
Daisy (n = 384)
| Activation | n | LUT Size | Computed | LUT + lerp | Speedup |
|---|---|---|---|---|---|
| SiLU | 384 | 2048 | 47552 cyc (99.1 us) | 16901 cyc (35.2 us) | 2.81x |
Desktop: LUT + linear interpolation is 1.7--2.6x faster than computed expf. A 1024-point table offers a good speed/accuracy tradeoff (~5e-06 max error). Tanh benefits the most since it evaluates two expf calls per element.
Daisy: LUT + linear interpolation is 2.81x faster than computed SiLU, a larger win than desktop (1.87x for SiLU at n=384). The bigger gap reflects the higher relative cost of expf on Cortex-M7 (~124 cyc/element computed vs ~44 cyc/element LUT).
9. Strided Sub-matrix Copy (Desktop only)
WaveNet's gated activation splits a 2N-channel activation into two N-channel halves (for the sigmoid gate and tanh path). When the source matrix has 2N rows but the destination has N rows, source and destination have different column strides in memory, so a simple memcpy won't work. Eigen's .topRows().leftCols() handles this via expression templates; the manual version uses explicit pointer arithmetic to copy row-by-row with the correct strides.
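The manual version exploits the fact that, in a column-major layout, the top rows of each column are contiguous, so the copy is one small memcpy per frame (illustrative sketch):

```cpp
#include <cstring>

// Copy the top dst_rows rows of a column-major (src_rows x frames) matrix
// into a dense (dst_rows x frames) destination.
// Eigen equivalent: dst_mat = src_mat.topRows(dst_rows).leftCols(frames);
void copy_top_rows(const float* src, int src_rows, float* dst, int dst_rows,
                   int frames) {
  for (int f = 0; f < frames; ++f)
    std::memcpy(dst + f * dst_rows,     // dst column f
                src + f * src_rows,     // src column f (different stride)
                sizeof(float) * dst_rows);
}
```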
Desktop
| Rows | Frames | Eigen .topRows() | Manual stride copy | Speedup |
|---|---|---|---|---|
| 4->2 | 48 | 75.9 ns | 21.7 ns | 3.50x |
| 8->4 | 48 | 68.2 ns | 18.3 ns | 3.72x |
| 16->8 | 48 | 76.5 ns | 21.9 ns | 3.49x |
| 4->2 | 2048 | 2.3 us | 659.1 ns | 3.56x |
| 8->4 | 2048 | 2.2 us | 491.4 ns | 4.54x |
| 16->8 | 2048 | 2.5 us | 868.8 ns | 2.90x |
Desktop: Manual strided copy is 2.9--4.5x faster than Eigen's .topRows().leftCols(). Eigen's expression template overhead dominates for non-contiguous copies.
10. Depthwise Conv: Inline vs Eigen Diagonal
Some WaveNet variants use depthwise (channel-wise) 1x1 convolutions, where each channel is scaled independently -- equivalent to multiplying each row of the activation matrix by a scalar weight. Eigen expresses this as weight.asDiagonal() * input, which constructs a diagonal matrix view and dispatches to a general matrix multiply. The inline version skips the abstraction and directly multiplies each element by the corresponding channel weight.
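The inline version is just a per-channel scale (illustrative sketch, column-major layout):

```cpp
// Depthwise 1x1 conv: scale channel c of a column-major (channels x frames)
// matrix by w[c]. Eigen equivalent: out_mat = w_vec.asDiagonal() * in_mat;
void depthwise_scale(const float* w, const float* in, float* out,
                     int channels, int frames) {
  for (int f = 0; f < frames; ++f)
    for (int c = 0; c < channels; ++c)
      out[f * channels + c] = w[c] * in[f * channels + c];
}
```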
Desktop
| Channels | Frames | Eigen asDiagonal() | Inline element-wise | Speedup |
|---|---|---|---|---|
| 2 | 48 | 116.0 ns | 55.1 ns | 2.11x |
| 4 | 48 | 124.3 ns | 36.7 ns | 3.38x |
| 8 | 48 | 182.2 ns | 87.7 ns | 2.08x |
| 2 | 2048 | 3.7 us | 1.3 us | 2.89x |
| 4 | 2048 | 2.7 us | 607.0 ns | 4.49x |
| 8 | 2048 | 4.1 us | 2.2 us | 1.83x |
Daisy (48 frames)
| Channels | Frames | Eigen asDiagonal() | Inline element-wise | Speedup |
|---|---|---|---|---|
| 4 | 48 | 4681 cyc (9.8 us) | 3175 cyc (6.6 us) | 1.47x |
| 8 | 48 | 7356 cyc (15.3 us) | 7478 cyc (15.6 us) | 0.98x |
Desktop: Inline element-wise depthwise conv is 1.8--4.5x faster than Eigen's asDiagonal() multiply across all tested sizes.
Daisy: Inline wins at 4 channels (1.47x) but ties at 8 channels. The 8-channel case may be hitting register pressure limits where Eigen's code generation produces equivalent scalar code.
11. FiLM Layer: Inline vs Eigen Array Expressions
Feature-wise Linear Modulation (FiLM) applies a learned per-channel affine transform -- output = scale * input + shift -- to condition one network's activations on another's predictions. In WaveNet, FiLM layers modulate hidden activations using conditioning signals. Eigen expresses this with .array() broadcasting; the inline version uses a nested loop over channels and frames. "Scale only" omits the shift term, reducing the operation to a per-channel multiply.
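The inline nested-loop version compared here looks like this (illustrative sketch; the Eigen version uses .array() broadcasting over the same layout):

```cpp
// FiLM in place: x(c, f) = scale[c] * x(c, f) + shift[c], on a column-major
// (channels x frames) matrix. Passing shift = nullptr gives "scale only".
void film(float* x, const float* scale, const float* shift,
          int channels, int frames) {
  for (int f = 0; f < frames; ++f)
    for (int c = 0; c < channels; ++c) {
      const int i = f * channels + c;
      x[i] = scale[c] * x[i] + (shift ? shift[c] : 0.0f);
    }
}
```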
Desktop -- Scale + Shift
| Channels | Frames | Eigen .array() | Inline loop | Speedup |
|---|---|---|---|---|
| 4 | 48 | 192.9 ns | 122.8 ns | 1.57x |
| 8 | 48 | 247.9 ns | 213.0 ns | 1.16x |
| 4 | 2048 | 3.6 us | 2.1 us | 1.68x |
| 8 | 2048 | 3.8 us | 10.3 us | 0.36x |
Desktop -- Scale Only
| Channels | Frames | Eigen .array() | Inline loop | Speedup |
|---|---|---|---|---|
| 4 | 48 | 142.1 ns | 70.1 ns | 2.03x |
| 8 | 48 | 152.8 ns | 65.9 ns | 2.32x |
| 4 | 2048 | 2.5 us | 1.4 us | 1.84x |
| 8 | 2048 | 3.1 us | 1.9 us | 1.68x |
Daisy (48 frames)
| Channels | Frames | Eigen .array() | Inline loop | Speedup |
|---|---|---|---|---|
| 4 | 48 | 2725 cyc (5.7 us) | 2449 cyc (5.1 us) | 1.11x |
| 8 | 48 | 4837 cyc (10.1 us) | 4464 cyc (9.3 us) | 1.08x |
Desktop: Inline FiLM is 1.2--2.3x faster in most cases. Scale-only is uniformly faster. The 8-channel x 2048-frame scale+shift case shows a regression (0.36x), likely a cache/vectorization edge case.
Daisy: Inline FiLM is 1.08--1.11x faster, a modest but consistent win across both channel counts.
12. Ring Buffer Write: Eigen Block vs Nested Loop
WaveNet uses dilated causal convolutions, which require access to past input frames. A ring buffer stores these frames in a fixed-size circular matrix, and each new block of frames must be written at the current write position. This is a contiguous multi-channel write into the middle of a larger matrix. Eigen's .middleCols() compiles to an optimized block copy; the nested-loop version writes element-by-element with explicit index arithmetic.
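A sketch of the write path, using one contiguous block copy per column (which is roughly what Eigen's .middleCols() assignment compiles to in the no-wrap case):

```cpp
#include <cstring>

// Write `frames` new columns into a column-major (channels x capacity) ring
// buffer starting at column `pos`. Each column is contiguous in memory.
// Eigen equivalent (no wrap): buf_mat.middleCols(pos, frames) = in_mat;
void ring_write(float* buf, int channels, int capacity, int pos,
                const float* in, int frames) {
  for (int f = 0; f < frames; ++f) {
    const int col = (pos + f) % capacity;   // wrap around the ring
    std::memcpy(buf + col * channels, in + f * channels,
                sizeof(float) * channels);
  }
}
```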
Desktop
| Channels | Frames | Nested loop | Eigen middleCols() | Eigen Speedup |
|---|---|---|---|---|
| 2 | 48 | 57.1 ns | 22.4 ns | 2.56x |
| 4 | 48 | 44.2 ns | 24.0 ns | 1.84x |
| 8 | 48 | 51.5 ns | 48.7 ns | 1.06x |
| 2 | 2048 | 1.7 us | 399.6 ns | 4.25x |
| 4 | 2048 | 760.1 ns | 660.6 ns | 1.15x |
| 8 | 2048 | 954.9 ns | 1.1 us | 0.87x |
Daisy (48 frames)
| Channels | Frames | Nested loop | Eigen middleCols() | Eigen Speedup |
|---|---|---|---|---|
| 4 | 48 | 3690 cyc (7.7 us) | 374 cyc (0.78 us) | 9.86x |
| 8 | 48 | 6713 cyc (14.0 us) | 715 cyc (1.5 us) | 9.39x |
Desktop: Eigen's block assignment wins by up to 4.25x for small channel counts. At 8 channels and large frame counts, the nested loop catches up. Eigen compiles .middleCols() = ... to vectorized contiguous writes that scalar nested loops can't match.
Daisy: Eigen is 9.4--9.9x faster -- an even larger win than desktop. The scalar nested loop generates extremely poor code on ARM, while Eigen compiles to an optimized contiguous copy. This is the strongest argument for keeping Eigen in the codebase.
Summary: Daisy vs Desktop
| Optimization | Desktop (Apple Silicon) | Daisy (Cortex-M7) | Notes |
|---|---|---|---|
| Inline generic GEMM vs Eigen | 0.10--0.38x (Eigen wins) | 1.04--1.47x | Eigen malloc kills ARM perf |
| Unrolled GEMM vs Eigen | 2.04--3.06x | 3.10x | Wins on both platforms |
| DTCM weight placement | N/A | 1.00x | No benefit in isolation |
| __restrict__ | 1.00x | 1.00x | No effect on either |
| memcpy vs Eigen block copy | 1.14--2.15x (memcpy wins) | 0.22--0.23x (Eigen wins) | Newlib memcpy is slow |
| Element-wise add | 1.04--1.06x | 1.08x | Small consistent win |
| Bias broadcast | 1.05--1.14x | 0.95--0.98x (Eigen wins) | Platform-dependent |
| Hardswish branchless | 3.06--3.38x | 1.20--1.21x | Less impact without SIMD |
| ReLU unrolling | 1.19x | 1.09x | Modest on both |
| LUT SiLU vs computed | 1.87x | 2.81x | Bigger win on ARM (expf costlier) |
| Depthwise inline | 2.08--3.38x | 1.47x (4ch only) | 8ch ties on ARM |
| FiLM inline | 1.16--1.57x | 1.08--1.11x | Consistent small win |
| Ring buffer Eigen | 1.84--4.25x | 9.39--9.86x | Eigen's biggest win, even larger on ARM |
JSON: when parsing text becomes a problem
NeuralAmpModelerCore uses nlohmann::json to parse .nam files, but it requires a lot of memory to parse a model, and memory is not exactly something we have plenty of... For example, an A1 NAM model stored as a JSON file takes almost 400 KB of memory. Remember when we said the Daisy Seed has 1 MB of SRAM? Yeah, that is going to be a problem.
JSON is cool and very accessible for users: you can just open it in a text editor, or parse it without any special dependencies in most programming languages. And parsing JSON is definitely not an issue on desktop (your browser does it thousands of times every day without breaking a sweat)... so we did not want to replace the JSON format.
Our approach was to refactor the factory methods in NeuralAmpModelerCore so they expect a model configuration object instead of a JSON object, and to offload the parsing of JSON (or any other file-based representation of models) to a dedicated parser. This way, both JSON and other formats use the same "entry point" to build a model from a file; they just use different parsers.
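The shape of that refactor can be sketched as follows (a hypothetical illustration: every name here is made up for the example and is not the actual Core API):

```cpp
#include <string>
#include <vector>

// Format-agnostic model description: factories consume this instead of a
// JSON object. (Illustrative only -- field names are NOT the real Core API.)
struct ModelConfig {
  std::string architecture;      // e.g. "WaveNet"
  double expected_sample_rate;   // Hz
  std::vector<float> weights;    // flat weight blob, layout fixed by arch
};

// Each on-disk format supplies its own parser producing the same config,
// so .nam (JSON) and a binary format can share one model-building entry point.
struct ModelParser {
  virtual ~ModelParser() = default;
  virtual ModelConfig parse(const std::vector<unsigned char>& bytes) = 0;
};
```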
We then came up with the namb format, a compact binary format for models to be used instead of the regular nam format on embedded devices. This format is meant to be used as an exchange format between a controller application (say, an app that runs on your phone and loads models) and the embedded device. The app converts the nam format retrieved from the T3K website into a compact format better suited to your pedal, then transmits it over Bluetooth or a USB cable. There is no model conversion here: think of namb just as a more "compressed" format. Learn more at this repo.
Contributions
All of the optimizations that helped us run NeuralAmpModelerCore on the Daisy were merged, meaning they are readily available for anyone who wants to try something similar. They currently require compiling with USE_NAM_INLINE_GEMM defined (e.g. CXXFLAGS += -DUSE_NAM_INLINE_GEMM).
To enable our binary loader, model construction in Core was decoupled from JSON parsing, which was also merged. The code for nam-binary-loader is available as a separate library and tool in this repo.
Finally, we are publishing example code for the Daisy Seed board, to be used as a blueprint for porting NeuralAmpModelerCore to embedded targets. This was our target during optimization: we worked until we were able to run an A1-nano-ReLU model in real time on the Daisy Seed. The code and instructions for building and running it on the Daisy are in their own repo.
We hope this helps anyone who is interested in experimenting with NAM support on their own hardware.