The two instructions hiding in every Goldilocks Poseidon2 add

How I dropped a redundant subs/csel pair from Plonky3's NEON Goldilocks addition kernel after staring at a perf annotate dump.

6 min read
  • #plonky3
  • #goldilocks
  • #performance
  • #poseidon2
  • #aarch64
  • #neon
  • #benchmarking

I was profiling Plonky3’s Goldilocks Poseidon2 on a Raspberry Pi 5, trying to figure out where the wall-clock seconds in prove_goldilocks_poseidon2 were actually going. The headline answer was unsurprising, the prove path is FFT and Merkle work, but the external permute kernels (external_initial_permute_dual_w8, external_terminal_permute_dual_w8) sat in a fat enough position that I scrolled down into them with perf annotate, mostly out of curiosity.

What I saw there bothered me.

The pattern that wouldn’t go away

The output of perf annotate on the round kernel was a wall of inline NEON assembly. Most of it I expected: loads, muls, the Goldilocks reduction sequence, more loads. But interleaved through all of it, every few instructions, was the same two-op pair:

subs   xN, xM, x_p
csel   xM, xM, xN, hs

A compare against PP (Goldilocks’s prime, P=264232+1P = 2^{64} - 2^{32} + 1), then a conditional select. The canonicalize step:

b{bP,bPb,b<Pb \mapsto \begin{cases} b - P, & b \geq P \\ b, & b < P \end{cases}

Once or twice would have been fine because every Goldilocks add has to make sure its inputs are in range before reducing. But this pair was showing up everywhere. I counted sixteen of them in just the round-constant addition block. Each external round was paying twos of ALU ops, multiplied across every state element, on every iteration.

The story I’d been telling myself was that’s just the cost of being safe against arbitrary inputs. add_asm is a public-ish kernel; it can’t assume b < P because callers might be feeding it values straight out of a multiply that hasn’t fully reduced. The two-instruction canonicalize-b is the price you pay for not corrupting downstream state.

The friction with that story was sitting right there in the line above each subs/csel: the operands were literally being loaded from a round-constant table I had written.

The thing I should have seen sooner

Round constants are static. They’re a const GOLDILOCKS_POSEIDON2_RC_*: [u64; N] array, baked at compile time, never touched by user input. The values are all < P. Of course they are. I’d hand-checked them when I wrote the table, and they came out of a paper anyway. Asking the CPU, sixteen times per round, “is this constant less than P?”, when the answer was already typed into the source, felt like a runtime tax on something I knew at compile time.

That was the first crack. The wider question fell out from there: how many other call sites of add_asm have a b that’s provably canonical, even if the language can’t tell?

I went through the inline-asm files for the NEON Poseidon2 and classified every add_asm call by where its second operand came from. Three buckets came out:

  1. b is the output of add_asm or double_asm. Both already reduce. There’s no path inside the asm sequence that leaves a non-canonical residue. Canonicalization is wasted.
  2. b is a round constant from a const table. Provably canonical from the source, if we can convince ourselves of the precondition.
  3. b is an accumulator initialised to zero. Zero is trivially < P.

Plus the case I couldn’t skip:

  1. b is a public input to the kernel. Some of apply_mat4_asm’s nine internal sums add x[i] directly to a running sum. Those x[i]’s come from arbitrary callers; the canonicalize is load-bearing. There’s an existing proptest, test_apply_mat4_asm_danger, that exercises non-canonical inputs to exactly that contract. Touching those would break it.

So the right shape wasn’t to change add_asm. It was to introduce a sibling (the same add_asm, minus the canonicalize-b pair) and use it only where the precondition holds.

What I considered and rejected

Before settling on a sibling function, two options seemed cleaner to me on the surface:

  • A const-generic flag on add_asm itself: add_asm::<{ B_IS_CANONICAL: true }>(a, b). Elegant on paper, ugly in practice. The inline-asm body would need to conditionally emit different instruction sequences via cfg! or if const shenanigans, and then the call-site noise (add_asm::<true> vs add_asm::<false>) doesn’t actually help readers understand why the precondition holds.

  • A runtime branch inside add_asm that skips the canonicalize when b < P. This is hilariously self-defeating because a branch costs more than the two instructions you’re trying to save, and the branch predictor has no information to predict it.

A sibling function with a name (add_canonical_asm) that encodes the precondition in its identity won out. You can grep for it. You can pub(super) it inside aarch64_neon so no caller outside the module can violate the precondition. And you can put a debug_assert!(b < P) in there so any future change that violates the precondition fails loudly under cargo test rather than silently corrupting a proof.

Proving the precondition

For the round-constant case I needed an actual proof, not just a comment. The fix was a small test:

#[test]
fn test_goldilocks_poseidon2_rc_tables_canonical() {
    for &c in GOLDILOCKS_POSEIDON2_RC_EXTERNAL.iter() {
        assert!(c.value < P);
    }
    for &c in GOLDILOCKS_POSEIDON2_RC_INTERNAL.iter() {
        assert!(c.value < P);
    }
}

It iterates every round-constant array and asserts canonicality. Anyone editing the tables triggers it on the next cargo test so the precondition is mechanically enforced, not vibes-enforced.

For the other two classes, the proof is the inline-asm code itself. Each add_canonical_asm call site gets an annotation in source spelling out why b is canonical (e.g., // b: output of add_asm above), which is cheap, locally checkable, and stays accurate as long as the asm body doesn’t drift.

The diff and the numbers

The actual change ends up being unremarkable: I introduced add_canonical_asm, swapped it in at the three classes of sites, and added the round-constant assertion test. Pi 5, Cortex-A76, governor=performance, 5×5 reps, median:

BenchmarkBeforeAfterΔ
prove_goldilocks_poseidon2 (e2e)124.18 s120.54 s−2.93 %
external_terminal_permute_dual_w83 4343 231−5.91 %
external_initial_permute_dual_w83 6173 392−6.21 %

The kernel-level deltas dilute into the e2e number because most of prove_goldilocks_poseidon2’s wall-time is FFT and Merkle work this doesn’t touch.

The pattern that I found worth keeping

The instruction trim itself was tiny and field-specific. It only mattered because Goldilocks had a fast enough reduction that two ALU ops per add showed up on the Cortex-A76 cycle counter. On a heavier field the multiply dominates and you wouldn’t notice.

What I take from this:

If you can prove an invariant somewhere cheaper than runtime, you can drop the runtime check.

The cheaper place might be:

  • The type system: a Canonical<F> wrapper that nothing can construct except via reduction.
  • A const-table assertion at test time, like the round-constant check above.
  • Local code-shape annotations at the call site, when the producer is two lines up and visibly canonicalising.
  • A property test that exercises the contract, like test_apply_mat4_asm_danger, so a future regression fails loudly.

The trick is being honest about where the proof lives. When I say a call site is “safe”, I should have been able to point at the exact thing that makes it safe. The diff was twenty lines; the principle is reusable.

Full diff and benchmark transcript: Plonky3#1623.

Continued in: The SIMD path that was 0.77× scalar. What surfaced when I turned the same perf-counter survey on the AVX-2 Poseidon1 path next.

Shipped impact

−2.93% end-to-end Goldilocks Poseidon2 prove time on aarch64 NEON

Measured on Pi 5 Cortex-A76 · 5 seeds × 5 reps, median · governor=performance

Where it applies

  • Any aarch64 NEON server running Goldilocks Poseidon2 prover workloads (AWS Graviton, Ampere Altra, Apple Silicon Mac CI/build). The optimisation drops a fixed 2-instruction subs/csel pair per call. Directionally microarchitecture-independent, magnitude depends on issue width and integer-port pressure.
  • Kernel-level wins are larger: −5.91% on external_terminal_permute_dual_w8, −6.21% on external_initial_permute_dual_w8. The e2e number dilutes because most prove wall-time is FFT and Merkle work this PR does not touch.
  • x86 servers (Intel, AMD): no change. This PR is aarch64-only.

Cost translation

aarch64 prover fleet sizemonthly saving at 2.93% e2e
$1k/mo ~$29/mo
$10k/mo ~$293/mo
$50k/mo ~$1.5k/mo

The −2.93% e2e is directly measured (no Merkle-fraction extrapolation). Cost scenarios assume linear scaling and aarch64-only workload mix. Apple Silicon / Graviton / Altra direction same, magnitude unmeasured.