Plonky3 Goldilocks performance

Perf-counter-driven contributions to Plonky3's Goldilocks Poseidon paths on aarch64 NEON and x86 AVX. Measured speedups, methodology write-ups, one audit that triggered a same-day upstream fix.

Apr 2026 — present · Maintainer
  • Rust
  • aarch64 NEON
  • AVX-2
  • AVX-512
  • Plonky3
  • perf-counter analysis

Poseidon2 cross-permute batching

−26.4%

Goldilocks Zen 5 AVX-2 · PR #1667 (in review)

Merkle wall time

−16.4%

Goldilocks NEON · PR #1656 (in review)

AVX Poseidon1 speedup

1.20×

audit → @Nashtare #1645 (merged)

Poseidon2 e2e prove

−2.93%

Goldilocks NEON · PR #1623 (merged)

What this is

The performance track of my zkmcu work. zkmcu’s deployment shape is prove on a Raspberry Pi 5 (Cortex-A76 NEON), verify on a Pico 2 (Cortex-M33), which means the Pi 5 prover side has to be fast or the whole pipeline isn’t useful. Plonky3 is the proving system underneath, and Goldilocks Poseidon2 is the hot path on both halves. So I have been doing perf-counter surveys of its Goldilocks kernels and submitting fixes upstream.

Shipped contributions

Plonky3 PR #1623 — add_canonical_asm (merged). Introduced a variant of the NEON Goldilocks addition kernel that skips the canonicalize-b subs/csel pair where the caller can statically guarantee b < P. Applied at three classes of provably canonical call sites in the Poseidon2 NEON inline asm: matrix-step internal sums, MDS accumulators, and round-constant additions (with a mechanical test asserting that every constant in the round-constant tables is canonical). −2.93% end-to-end on prove_goldilocks_poseidon2; −5.91% / −6.21% on the external permute kernels. Write-up: The two instructions hiding in every Goldilocks Poseidon2 add.
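A minimal scalar sketch of why the b < P precondition lets the canonicalize step go away. This is a model in plain Rust, not the NEON inline asm from the PR; the function names (canonicalize, add_canonical_b, add_general, all_canonical) are hypothetical and chosen for illustration. With b < P, the wrapped sum after a carry is at most P − 2, so adding 2^64 mod P = 2^32 − 1 cannot overflow and one correction is enough.

```rust
/// Goldilocks prime: P = 2^64 - 2^32 + 1.
const P: u64 = 0xFFFF_FFFF_0000_0001;
/// 2^64 mod P = 2^32 - 1, the correction applied after a 64-bit carry.
const EPSILON: u64 = 0xFFFF_FFFF;

/// The step the fast path skips: bring `b` into [0, P).
/// (Scalar stand-in for a compare-subtract-select pair.)
fn canonicalize(b: u64) -> u64 {
    if b >= P { b - P } else { b }
}

/// Addition under the precondition `b < P`; `a` may be any u64.
/// Returns some u64 congruent to a + b mod P (not necessarily canonical).
fn add_canonical_b(a: u64, b: u64) -> u64 {
    debug_assert!(b < P, "caller must guarantee a canonical b");
    let (sum, carry) = a.overflowing_add(b);
    // On carry, the true value is sum + 2^64, which is sum + EPSILON mod P.
    // Because b <= P - 1, the wrapped sum is <= P - 2, so this cannot overflow.
    if carry { sum + EPSILON } else { sum }
}

/// General addition: pays for the canonicalize step on every call.
fn add_general(a: u64, b: u64) -> u64 {
    add_canonical_b(a, canonicalize(b))
}

/// Mechanical check in the spirit of the round-constant test:
/// every entry of a (hypothetical) constant table must already be < P.
fn all_canonical(table: &[u64]) -> bool {
    table.iter().all(|&c| c < P)
}
```

A call site that adds a round constant could use add_canonical_b directly once all_canonical has been asserted over the table; any call site without that guarantee stays on add_general.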

Plonky3 #1642 → #1645 — packed-AVX-2 Poseidon1 audit. Perf-counter survey on Zen 5 found that Permutation<[PackedGoldilocksAVX2; N]> was running at 0.77× scalar — actively slower than the scalar path it was supposed to replace, because the MDS step dispatched through the generic apply_circulant fallback instead of the Karatsuba specialisation that the scalar path used. Filed as Issue #1642 with the counter trace and source-derived expected-multiply count. @Nashtare shipped the fix in #1645 same-day; post-merge ratio is 0.93× scalar, a 1.20× wall-time delta on the packed permutation. AVX-512 same gap, same fix. Write-up: The SIMD path that was 0.77× scalar.
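To make the multiply-count gap concrete, here is a rough sketch that counts multiplications for a naive circulant matrix-vector product against a Karatsuba-based cyclic convolution. It is an assumption-laden model, not Plonky3's apply_circulant or its Karatsuba specialisation: arithmetic is wrapping u64 rather than Goldilocks, the width must be a power of two, and all function names are hypothetical. For width 8 the naive form performs 64 multiplies and the Karatsuba form 27.

```rust
/// Naive circulant product y[i] = sum_j c[(i - j) mod n] * x[j]: n * n multiplies.
fn circulant_naive(c: &[u64], x: &[u64], muls: &mut usize) -> Vec<u64> {
    let n = c.len();
    assert_eq!(n, x.len());
    let mut y = vec![0u64; n];
    for i in 0..n {
        for j in 0..n {
            y[i] = y[i].wrapping_add(c[(i + n - j) % n].wrapping_mul(x[j]));
            *muls += 1;
        }
    }
    y
}

/// Karatsuba linear convolution for power-of-two lengths: three half-size
/// products per level instead of four. Output has length 2n - 1.
fn karatsuba(a: &[u64], b: &[u64], muls: &mut usize) -> Vec<u64> {
    let n = a.len();
    assert!(n == b.len() && n.is_power_of_two());
    if n == 1 {
        *muls += 1;
        return vec![a[0].wrapping_mul(b[0])];
    }
    let h = n / 2;
    let (a0, a1) = a.split_at(h);
    let (b0, b1) = b.split_at(h);
    let lo = karatsuba(a0, b0, muls);
    let hi = karatsuba(a1, b1, muls);
    let a_sum: Vec<u64> = a0.iter().zip(a1).map(|(p, q)| p.wrapping_add(*q)).collect();
    let b_sum: Vec<u64> = b0.iter().zip(b1).map(|(p, q)| p.wrapping_add(*q)).collect();
    let mid = karatsuba(&a_sum, &b_sum, muls);
    let mut out = vec![0u64; 2 * n - 1];
    for k in 0..(2 * h - 1) {
        // Cross term a0*b1 + a1*b0 recovered from the three products.
        let cross = mid[k].wrapping_sub(lo[k]).wrapping_sub(hi[k]);
        out[k] = out[k].wrapping_add(lo[k]);
        out[k + h] = out[k + h].wrapping_add(cross);
        out[k + 2 * h] = out[k + 2 * h].wrapping_add(hi[k]);
    }
    out
}

/// Cyclic convolution: fold the upper half of the linear result back onto the lower half.
fn circulant_karatsuba(c: &[u64], x: &[u64], muls: &mut usize) -> Vec<u64> {
    let n = c.len();
    let lin = karatsuba(c, x, muls);
    (0..n)
        .map(|i| if i + n < lin.len() { lin[i].wrapping_add(lin[i + n]) } else { lin[i] })
        .collect()
}
```

Comparing a count derived this way from the source against what the counters imply is one way to spot a dispatch that landed on the generic fallback.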

Plonky3 PR #1656 — packed-NEON Poseidon2 dispatch (in review). Discovered via the Pi 5 Merkle workload survey: Poseidon2GoldilocksFused<8>’s packed-type Permutation impl was running ~16% slower than the platform-generic Poseidon2 SIMD path on Merkle commit, despite winning per-permute microbenchmarks by 3.5%. Phase decomposition reconciled the two: out-of-order overlap masked the structural per-phase gap in the per-permute benchmark, but Merkle’s layer-wise data-dependency chain removed the OoO mask. Routed the packed dispatch through the generic path; 16.37% Merkle wall-time reduction at log2 leaves = 18 on Pi 5 Cortex-A76. Write-up: The Poseidon2 regression my microbenchmark told me wasn’t there.
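The microbenchmark-versus-workload mismatch comes down to two harness shapes, sketched below under heavy assumptions: toy_permute is a hypothetical stand-in (not Poseidon2), the fully serial chain is a simplification of Merkle's layer-wise dependencies, and wall-clock timing stands in for PMU counters. The first shape feeds independent inputs, so an out-of-order core can overlap consecutive calls; the second feeds each output into the next call, which exposes latency.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

type State = [u64; 4];

/// Hypothetical stand-in for a permutation; not Poseidon2.
fn toy_permute(s: &mut State) {
    for _ in 0..16 {
        let t = s[0].wrapping_mul(0x9E37_79B9_7F4A_7C15).rotate_left(17) ^ s[3];
        *s = [s[1], s[2], s[3].wrapping_add(s[0]), t];
    }
}

/// Microbenchmark shape: every call works on its own input, so the core
/// is free to overlap one call with the next.
fn time_independent(states: &mut [State]) -> Duration {
    let start = Instant::now();
    for s in states.iter_mut() {
        toy_permute(black_box(s));
    }
    start.elapsed()
}

/// Dependency-chain shape: each call consumes the previous output, so
/// per-call latency can no longer hide behind overlap.
fn time_chained(seed: State, calls: usize) -> (State, Duration) {
    let start = Instant::now();
    let mut s = seed;
    for _ in 0..calls {
        toy_permute(black_box(&mut s));
    }
    (s, start.elapsed())
}
```

Running two candidate kernels through both shapes, rather than only the first, is what shows whether a per-permute win survives in a dependency-bound workload.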

Plonky3 PR #1667 — cross-permute Poseidon2 batching (in review). Interleaves two Poseidon2 permutations through the per-round loop so the compiler can schedule across their independent data flow, recovering throughput the single-permute kernel doesn’t have. Validated across 5 stages on aarch64 NEON + Zen 5 AVX-2. Headline wins on Goldilocks: 26.4% on Zen 5 AVX-2 (width 8), 23% on Pi 5 NEON, 14-15% on Merkle commit (Zen 5).
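The interleaving idea can be sketched in scalar Rust, assuming a toy round function in place of the real Poseidon2 rounds. Everything here (toy_round, permute_single, permute_pair, the RC table) is hypothetical and only illustrates the loop structure: both states advance one round per iteration, which gives the scheduler two independent dependency chains to overlap. The results must match running the two permutations back to back.

```rust
const WIDTH: usize = 8;
const ROUNDS: usize = 8;
/// Hypothetical round constants, for illustration only.
const RC: [u64; ROUNDS] = [1, 2, 3, 5, 8, 13, 21, 34];

type State = [u64; WIDTH];

/// Toy round: add a constant, apply x^7 in wrapping arithmetic,
/// then mix by adding the sum of all lanes to each lane.
fn toy_round(state: &mut State, rc: u64) {
    let mut sum = 0u64;
    for x in state.iter_mut() {
        let v = x.wrapping_add(rc);
        let v2 = v.wrapping_mul(v);
        *x = v2.wrapping_mul(v2).wrapping_mul(v2).wrapping_mul(v);
        sum = sum.wrapping_add(*x);
    }
    for x in state.iter_mut() {
        *x = x.wrapping_add(sum);
    }
}

/// Baseline: one permutation, all rounds in sequence.
fn permute_single(state: &mut State) {
    for &rc in RC.iter() {
        toy_round(state, rc);
    }
}

/// Cross-permute batching: two independent states share one round loop.
fn permute_pair(a: &mut State, b: &mut State) {
    for &rc in RC.iter() {
        toy_round(a, rc);
        toy_round(b, rc);
    }
}
```

Whether the paired loop is actually faster depends on register pressure and the target core, which is why the claim has to be measured per platform rather than assumed.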

How I work

Every contribution starts with a measurement. I build a test harness around everything the code might affect, however slightly, and tend to aim for coverage nearing 300%. The pattern across all of them is: open perf annotate (or cargo bench + counter sampling) on a real workload, look for the line that doesn’t match what the algorithm should be doing, and confirm the gap with a phase-decomposition harness before touching code. The Poseidon2 piece of that harness is now public: poseidon2-harness, which provides dual-tree differential testing + PMU-cycle benches across aarch64 NEON. The write-ups document the reasoning chain so the next person doing the same survey doesn’t have to start from scratch.
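As an illustration of the differential-testing half of that workflow, here is a minimal sketch that checks a fast Goldilocks 128-bit reduction against a plain `%` reference over edge cases and pseudo-random inputs. It is not code from poseidon2-harness; the function names and the xorshift generator are assumptions made for the example, and the fast path is the standard reduction using 2^64 ≡ 2^32 − 1 and 2^96 ≡ −1 (mod P).

```rust
/// Goldilocks prime: P = 2^64 - 2^32 + 1.
const P: u64 = 0xFFFF_FFFF_0000_0001;
/// 2^64 mod P = 2^32 - 1.
const EPSILON: u64 = 0xFFFF_FFFF;

/// Reference: slow but obviously correct.
fn reduce_reference(x: u128) -> u64 {
    (x % P as u128) as u64
}

/// Candidate: x = lo + 2^64 * hi_lo + 2^96 * hi_hi, which is
/// lo + EPSILON * hi_lo - hi_hi mod P. Result may be non-canonical.
fn reduce_fast(x: u128) -> u64 {
    let (lo, hi) = (x as u64, (x >> 64) as u64);
    let (hi_hi, hi_lo) = (hi >> 32, hi & EPSILON);
    let (mut t0, borrow) = lo.overflowing_sub(hi_hi);
    if borrow {
        t0 -= EPSILON; // undo the 2^64 wrap; cannot underflow here
    }
    let t1 = hi_lo * EPSILON; // at most (2^32 - 1)^2, fits in u64
    let (sum, carry) = t0.overflowing_add(t1);
    if carry { sum + EPSILON } else { sum }
}

/// Small deterministic PRNG so the check needs no external crates.
fn xorshift64(state: &mut u64) -> u64 {
    let mut x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    *state = x;
    x
}

/// Returns the first input on which the two implementations disagree.
fn differential_check(cases: usize, seed: u64) -> Result<(), u128> {
    let mut rng = seed | 1; // xorshift state must be nonzero
    let mut inputs = vec![0u128, 1, P as u128 - 1, P as u128, u64::MAX as u128, u128::MAX];
    for _ in 0..cases {
        let hi = xorshift64(&mut rng) as u128;
        let lo = xorshift64(&mut rng) as u128;
        inputs.push((hi << 64) | lo);
    }
    for x in inputs {
        if reduce_fast(x) % P != reduce_reference(x) {
            return Err(x);
        }
    }
    Ok(())
}
```

Returning the failing input instead of panicking makes it easy to turn a mismatch into a fixed regression case.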

Cost framing for each contribution is in the shipped-impact section of its write-up.