<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"><channel><title>Niek Kamer | Writing</title><description>Notes on zero-knowledge cryptography, embedded Rust, and the gap between them.</description><link>https://niek-kamer.github.io/</link><language>en</language><item><title>The trait-method boundary that cost Plonky3 Poseidon2 27%</title><link>https://niek-kamer.github.io/writing/goldilocks-poseidon2-cross-permute-batching/</link><guid isPermaLink="true">https://niek-kamer.github.io/writing/goldilocks-poseidon2-cross-permute-batching/</guid><description>permute_mut(a); permute_mut(b) on independent states leaves 27% on the table on Zen 5. LLVM can&apos;t see across the function-call boundary. Hand-rolling both permutes in one scope reclaims it.</description><pubDate>Sun, 17 May 2026 00:00:00 GMT</pubDate><category>plonky3</category><category>goldilocks</category><category>performance</category><category>poseidon2</category><category>batching</category><category>ilp</category><category>x86</category><category>aarch64</category></item><item><title>The AVX-512 kernel that was 1.79× slower than the compiler&apos;s</title><link>https://niek-kamer.github.io/writing/goldilocks-poseidon2-avx512-falsifications/</link><guid isPermaLink="true">https://niek-kamer.github.io/writing/goldilocks-poseidon2-avx512-falsifications/</guid><description>I spent six hours writing a hand-tuned AVX-512 Goldilocks Poseidon2 kernel. It came back 1.79× slower than the path it was trying to replace. 
Then I almost wrote a second one.</description><pubDate>Sat, 16 May 2026 00:00:00 GMT</pubDate><category>plonky3</category><category>goldilocks</category><category>performance</category><category>poseidon2</category><category>x86</category><category>avx512</category><category>methodology</category><category>post-mortem</category></item><item><title>The Poseidon2 regression my microbenchmark told me wasn&apos;t there</title><link>https://niek-kamer.github.io/writing/goldilocks-poseidon2-merkle-vs-microbench/</link><guid isPermaLink="true">https://niek-kamer.github.io/writing/goldilocks-poseidon2-merkle-vs-microbench/</guid><description>A 3.5% per-permute win on Pi 5 hid a 16% Merkle wall-time loss in Plonky3&apos;s Goldilocks Poseidon2. Same kernels, same hardware. How phase decomposition reconciled the two.</description><pubDate>Fri, 15 May 2026 00:00:00 GMT</pubDate><category>plonky3</category><category>goldilocks</category><category>performance</category><category>poseidon2</category><category>aarch64</category><category>neon</category><category>benchmarking</category></item><item><title>The SIMD path that was 0.77× scalar</title><link>https://niek-kamer.github.io/writing/goldilocks-poseidon1-avx-mds-fallback/</link><guid isPermaLink="true">https://niek-kamer.github.io/writing/goldilocks-poseidon1-avx-mds-fallback/</guid><description>How a perf-counter sanity check on Plonky3&apos;s AVX2 packed Goldilocks Poseidon1 surfaced a missing Karatsuba back-port, and what assumption I&apos;d been making.</description><pubDate>Wed, 13 May 2026 00:00:00 GMT</pubDate><category>plonky3</category><category>goldilocks</category><category>performance</category><category>poseidon1</category><category>x86</category><category>avx</category><category>audit</category><category>benchmarking</category></item><item><title>The two instructions hiding in every Goldilocks Poseidon2 add</title><link>https://niek-kamer.github.io/writing/goldilocks-poseidon2-neon-canonicalize/</link><guid 
isPermaLink="true">https://niek-kamer.github.io/writing/goldilocks-poseidon2-neon-canonicalize/</guid><description>How I dropped a redundant subs/csel pair from Plonky3&apos;s NEON Goldilocks addition kernel after staring at a perf annotate dump.</description><pubDate>Wed, 13 May 2026 00:00:00 GMT</pubDate><category>plonky3</category><category>goldilocks</category><category>performance</category><category>poseidon2</category><category>aarch64</category><category>neon</category><category>benchmarking</category></item></channel></rss>