Writing
Notes on zero-knowledge cryptography, embedded Rust, and the gap between them.
- 7 min read −27.2% Zen 5 AVX-2 Poseidon2 (W=8)
The trait-method boundary that cost Plonky3 Poseidon2 27%
Calling permute_mut(a); permute_mut(b) on independent states leaves 27% on the table on Zen 5. LLVM can't see across the function-call boundary. Hand-rolling both permutes in one scope reclaims it.
- #plonky3
- #goldilocks
- #performance
- #poseidon2
- #batching
- #ilp
- #x86
- #aarch64
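A minimal sketch of the shape of that change, under loud assumptions: `Permute`, `ToyPerm`, `toy_round`, and `permute_pair_mut` are hypothetical names invented for illustration, and the round function is a toy mixer, not Poseidon2 and not Plonky3's code. The only things taken from the summary are the width-8 state and the idea of advancing two independent states inside one scope, so the compiler sees both dependency chains at once.

```rust
/// Hypothetical stand-in for a permutation trait (not Plonky3's definition).
trait Permute {
    fn permute_mut(&self, state: &mut [u64; 8]);
}

/// Toy "permutation" with a handful of round constants (illustrative only).
struct ToyPerm {
    round_constants: [u64; 4],
}

/// One toy round: add a constant, cube each lane, then mix with the lane sum.
/// Wrapping arithmetic stands in for real field arithmetic.
#[inline(always)]
fn toy_round(s: &mut [u64; 8], rc: u64) {
    for x in s.iter_mut() {
        let t = x.wrapping_add(rc);
        *x = t.wrapping_mul(t).wrapping_mul(t);
    }
    let sum = s.iter().fold(0u64, |acc, &x| acc.wrapping_add(x));
    for x in s.iter_mut() {
        *x = x.wrapping_add(sum);
    }
}

impl Permute for ToyPerm {
    /// Baseline: one state per call. Two back-to-back calls run one
    /// dependency chain to completion before the other starts.
    fn permute_mut(&self, state: &mut [u64; 8]) {
        for &rc in &self.round_constants {
            toy_round(state, rc);
        }
    }
}

impl ToyPerm {
    /// Batched variant: both independent states advance round by round in
    /// the same loop body, which gives the optimizer the chance to
    /// interleave their instructions.
    fn permute_pair_mut(&self, a: &mut [u64; 8], b: &mut [u64; 8]) {
        for &rc in &self.round_constants {
            toy_round(a, rc);
            toy_round(b, rc);
        }
    }
}
```

Both paths compute the same result for each state; only the scheduling freedom differs. Whether the batched form is actually faster depends on the target, the real round function, and the compiler version, so it needs measuring rather than assuming.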
- 8 min read
The AVX-512 kernel that was 1.79× slower than the compiler's
I spent six hours writing a hand-tuned AVX-512 Goldilocks Poseidon2 kernel. It came back 1.79× slower than the path it was trying to replace. Then I almost wrote a second one.
- #plonky3
- #goldilocks
- #performance
- #poseidon2
- #x86
- #avx512
- #methodology
- #post-mortem
- 8 min read −16.4% Merkle wall time
The Poseidon2 regression my microbenchmark told me wasn't there
A 3.5% per-permute win on Pi 5 hid a 16% Merkle wall-time loss in Plonky3's Goldilocks Poseidon2. Same kernels, same hardware. How phase decomposition reconciled the two.
- #plonky3
- #goldilocks
- #performance
- #poseidon2
- #aarch64
- #neon
- #benchmarking
- 6 min read 1.20× audit-triggered speedup
The SIMD path that was 0.77× scalar
How a perf-counter sanity check on Plonky3's AVX-2 packed Goldilocks Poseidon1 surfaced a missing Karatsuba back-port, along with the assumption I'd been making.
- #plonky3
- #goldilocks
- #performance
- #poseidon1
- #x86
- #avx
- #audit
- #benchmarking
- 6 min read −2.93% e2e prove time
The two instructions hiding in every Goldilocks Poseidon2 add
How I dropped a redundant subs/csel pair from Plonky3's NEON Goldilocks addition kernel after staring at a perf annotate dump.
- #plonky3
- #goldilocks
- #performance
- #poseidon2
- #aarch64
- #neon
- #benchmarking
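For context on where a compare-and-select can show up in a Goldilocks add, here is a scalar reference sketch. It is an assumption-laden illustration, not the NEON kernel from the post and not Plonky3's code: `add_canonical` is a hypothetical helper, and the only facts relied on are the Goldilocks prime p = 2^64 − 2^32 + 1 and the identity 2^64 ≡ 2^32 − 1 (mod p).

```rust
/// Goldilocks prime: 2^64 - 2^32 + 1.
const P: u64 = 0xFFFF_FFFF_0000_0001;
/// 2^64 mod P, which equals 2^32 - 1.
const EPSILON: u64 = 0xFFFF_FFFF;

/// Adds two canonical field elements (both inputs must be < P) and
/// returns the canonical sum. Hypothetical helper for illustration.
fn add_canonical(a: u64, b: u64) -> u64 {
    let (sum, carry) = a.overflowing_add(b);
    if carry {
        // The true sum is `sum + 2^64`; fold the lost 2^64 back in as
        // EPSILON. For a, b < P this cannot overflow and lands below P.
        sum + EPSILON
    } else if sum >= P {
        // No wrap, but the sum left the canonical range: one conditional
        // subtract, the kind of step that lowers to a compare plus select.
        sum - P
    } else {
        sum
    }
}
```

The two branches are mutually exclusive for canonical inputs, which is the sort of invariant that can make one reduction step provably unnecessary in a vectorized kernel.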