Lesson 1 — Performance engineering for Reth
Question
Reth's perf engineering is the production discipline. Profile + optimise + measure. The patterns transfer to any Rust system.
Principle (minimum model)
- Profile first. cargo flamegraph reveals hot paths. Don't guess; measure.
- Hot paths in Reth. revm interpreter (~70 % time), MDBX I/O (~15 %), serde (~5 %), other (~10 %). Optimisation effort follows.
- Optimisation tools. rayon (parallelism), inline (avoid function call), branchless (avoid mispredict), const fn (compile-time eval).
- Common patterns. Pre-allocate Vec capacity; use SmallVec<T, N> for fixed-size collections; cache hot computations; batch I/O.
- Profile-guided. Reth's CI runs benchmarks; regressions blocked at PR-time. Production discipline.
- Cross-cutting metrics. Sync time, peak memory, per-stage throughput. Monitored in production.
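Why effort follows the profile can be made concrete with Amdahl's law. The sketch below takes the approximate time shares from the list above as given; the function name `overall_speedup` and the 2x local speedup are illustrative assumptions, not Reth measurements.

```rust
/// Amdahl's law: overall speedup when a fraction `p` of total runtime
/// is made `s` times faster and everything else is unchanged.
fn overall_speedup(p: f64, s: f64) -> f64 {
    1.0 / ((1.0 - p) + p / s)
}

fn main() {
    // Approximate time shares quoted in the list above.
    let shares = [("revm interpreter", 0.70), ("MDBX I/O", 0.15), ("serde", 0.05)];
    for (name, p) in shares {
        // Hypothetical: the component is made exactly 2x faster.
        println!("{name}: 2x local speedup -> {:.2}x overall", overall_speedup(p, 2.0));
    }
}
```

Under these assumptions, doubling the speed of a 70 % component gives roughly 1.54x overall, while doubling a 5 % component gives about 1.03x. That gap is the argument for profiling before choosing what to optimise.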
Worked example + steps
Performance engineering for Reth
A Reth fork ships. Block import was 12ms in your benches; in production it's 80ms. Where did the 68ms go? You don't know — because nobody profiled. This is the failure mode this lesson exists to prevent: invisible slowdowns, the kind that compound silently until a validator falls 200 blocks behind.
If you're going to ship a Reth fork or write hot-path code in Revm, profiling and benchmarking are non-negotiable. Premature optimization is bad; invisible slowdowns are worse.
1. Profile first, optimize second
The discipline: never optimize anything you haven't measured. Two tools cover the two questions you'll ever ask:
| Tool | Purpose |
|---|---|
| flamegraph | "Where is time being spent overall?" |
| Criterion | "Did this specific change make function X faster?" |
Flamegraph in 30 seconds
cargo install flamegraph
cargo flamegraph --bin reth -- node --chain mainnet
# Open flamegraph.svg in a browser
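Width is what matters: a frame's width is its share of the samples, and the wide frames near the top of each stack are your hot paths. Don't optimize anything that isn't visible there. Reth's default release profile strips symbols, so profile a build that keeps them (see the maxperf-symbols profile in section 4) to get readable function names.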
Criterion microbenchmarks
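// Cargo.toml
// [dev-dependencies]
// criterion = "0.5"
// alloy-primitives = "1"   (assumption: any crate exporting keccak256 works)
//
// [[bench]]
// name = "my_bench"
// harness = false          (required so Criterion can supply main)
// benches/my_bench.rs
use alloy_primitives::keccak256;
use criterion::{criterion_group, criterion_main, Criterion};
use std::hint::black_box;
fn bench_my_thing(c: &mut Criterion) {
    c.bench_function("hash 1KB", |b| {
        // Built once per benchmark, outside the timed closure.
        let data = vec![0u8; 1024];
        // black_box keeps the compiler from optimising the input away.
        b.iter(|| keccak256(black_box(&data)))
    });
}
criterion_group!(benches, bench_my_thing);
criterion_main!(benches);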
cargo bench produces statistical comparisons. Always commit your benchmark results when claiming a perf improvement.
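To see why a statistical comparison beats a single timing, here is a minimal std-only sketch of repeated sampling. It is not how Criterion is implemented (Criterion adds warm-up, outlier detection, and resampling); the helper names `sample` and `median` and the byte-sum workload are assumptions for illustration.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Time `f` `samples` times and return the per-run durations, sorted ascending.
fn sample<F: FnMut()>(samples: usize, mut f: F) -> Vec<Duration> {
    let mut runs = Vec::with_capacity(samples);
    for _ in 0..samples {
        let start = Instant::now();
        f();
        runs.push(start.elapsed());
    }
    runs.sort();
    runs
}

/// Middle element of a sorted, non-empty slice (upper middle for even lengths).
fn median(sorted: &[Duration]) -> Duration {
    sorted[sorted.len() / 2]
}

fn main() {
    let data = vec![1u8; 1024];
    let runs = sample(101, || {
        // Stand-in workload: a byte sum instead of a real hash.
        let sum: u64 = black_box(&data).iter().map(|&b| b as u64).sum();
        black_box(sum);
    });
    println!(
        "min {:?}  median {:?}  max {:?}",
        runs[0],
        median(&runs),
        runs[runs.len() - 1]
    );
}
```

The spread between min and max on an idle machine is usually large enough to swamp a small improvement, which is why one-off timings are not evidence.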
Whole-node benchmarks: snapshot the disk
The anti-fluency prompt above lands the real problem: microbenchmarks don't capture system effects. The fix is to benchmark a whole-node workload, but that needs the same on-disk state every run, and a 500 GB Reth database takes hours to rebuild between runs. Paradigm uses tempoxyz/schelk for exactly this. It is a block-device snapshot/rollback tool that restores the scratch volume by copying only the blocks the benchmark wrote (tracked via dm-era), so rollback takes seconds instead of hours. The workload still runs against plain ext4 on real NVMe, with no overlay or CoW filesystem in the read path.
The pedagogical point: when you read perf claims about Reth ("we shaved 15% off staged sync"), assume the authors are using something like schelk between runs. A benchmark without rollback discipline is unrepeatable; a benchmark with the wrong rollback (LVM thin overlays, btrfs snapshots) measures the rollback machinery as much as the workload.
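The rollback idea described above (track which blocks the workload wrote, then copy only those back) can be modelled in a few lines. This is a toy in-memory sketch, not schelk's implementation: the `Scratch` type, the 4096-byte `BLOCK` size, and the `BTreeSet` standing in for the tracking the text attributes to dm-era are all assumptions.

```rust
use std::collections::BTreeSet;

const BLOCK: usize = 4096; // hypothetical block size in bytes

/// Toy stand-in for a scratch volume plus its pristine golden image.
struct Scratch {
    golden: Vec<u8>,        // pristine state, never written by the workload
    live: Vec<u8>,          // what the benchmark reads and writes
    dirty: BTreeSet<usize>, // indices of blocks written since the last rollback
}

impl Scratch {
    fn new(golden: Vec<u8>) -> Self {
        assert!(golden.len() % BLOCK == 0, "size must be a whole number of blocks");
        Self { live: golden.clone(), golden, dirty: BTreeSet::new() }
    }

    /// Write `data` at byte `offset`, recording every block it touches.
    fn write(&mut self, offset: usize, data: &[u8]) {
        if data.is_empty() {
            return;
        }
        self.live[offset..offset + data.len()].copy_from_slice(data);
        let first = offset / BLOCK;
        let last = (offset + data.len() - 1) / BLOCK;
        self.dirty.extend(first..=last);
    }

    /// Restore only the dirty blocks; returns how many blocks were copied.
    fn rollback(&mut self) -> usize {
        let copied = self.dirty.len();
        for &b in &self.dirty {
            let range = b * BLOCK..(b + 1) * BLOCK;
            self.live[range.clone()].copy_from_slice(&self.golden[range]);
        }
        self.dirty.clear();
        copied
    }
}

fn main() {
    let mut vol = Scratch::new(vec![0u8; 1024 * BLOCK]); // 4 MiB toy volume
    vol.write(10 * BLOCK + 100, &[0xAB; 5000]); // spans blocks 10 and 11
    let copied = vol.rollback();
    println!("restored {copied} of {} blocks", vol.golden.len() / BLOCK);
    assert_eq!(vol.live, vol.golden);
}
```

The cost of rollback scales with what the benchmark wrote, not with the size of the volume, which is the property that makes repeated whole-node runs practical.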
🔍 Find in repo. Open tempoxyz/schelk and read docs/SKILL.md. Three things will surprise you about how it does rollback. Name them before continuing, then verify against the repo.
2. Cache lines, not lines of code
On a modern CPU, reading from RAM is ~100x slower than doing arithmetic on a register. So "make the code shorter" is the wrong knob; "make the memory layout friendlier" is the right one. The unit the CPU actually loads is a cache line, 64 bytes on typical x86-64 hardware (not a byte, not a struct field, but a fixed-size chunk).
Implications
- Struct of Arrays > Array of Structs for hot loops
- Pad hot fields to a cache line to avoid false sharing
- Sort data for predictable access patterns
// Bad: each Row is 200 bytes, so consecutive ids land on different cache lines
struct Row {
    id: u64,
    big_blob: [u8; 192],
}
// Better: keep the hot fields together and move cold data to its own struct
struct Hot { id: u64, version: u32 }
struct Cold { big_blob: [u8; 192] }
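The effect of the split can be checked with plain arithmetic on sizes and strides. The sketch below assumes 64-byte cache lines, an array that starts on a line boundary, and `#[repr(C)]` so field offsets are fixed; `lines_touched` and `PaddedCounter` are hypothetical names introduced for illustration.

```rust
use std::mem::{align_of, size_of};
use std::sync::atomic::AtomicU64;

const CACHE_LINE: usize = 64; // assumption: 64-byte lines

#[repr(C)]
#[allow(dead_code)]
struct Row { id: u64, big_blob: [u8; 192] } // hot and cold data interleaved

#[repr(C)]
#[allow(dead_code)]
struct Hot { id: u64, version: u32 }

/// Aligned to a full cache line so two counters never share one (false sharing).
#[repr(align(64))]
#[allow(dead_code)]
struct PaddedCounter(AtomicU64);

/// Distinct cache lines touched when reading the first `field` bytes (>= 1) of
/// each of `n` elements laid out `stride` bytes apart.
fn lines_touched(stride: usize, field: usize, n: usize) -> usize {
    let mut count = 0;
    let mut next_unseen = 0; // first line index not yet counted
    for i in 0..n {
        let first = (i * stride) / CACHE_LINE;
        let last = (i * stride + field - 1) / CACHE_LINE;
        let from = first.max(next_unseen);
        if last >= from {
            count += last - from + 1;
            next_unseen = last + 1;
        }
    }
    count
}

fn main() {
    println!("Row: {} B, Hot: {} B", size_of::<Row>(), size_of::<Hot>());
    println!("lines for 1000 ids in Row: {}", lines_touched(size_of::<Row>(), 8, 1000));
    println!("lines for 1000 ids in Hot: {}", lines_touched(size_of::<Hot>(), 8, 1000));
    println!(
        "PaddedCounter: {} B, align {}",
        size_of::<PaddedCounter>(),
        align_of::<PaddedCounter>()
    );
}
```

Under these assumptions, scanning 1000 ids touches 1000 lines in the interleaved layout and 250 in the hot-only layout, a 4x reduction in memory traffic for the same loop.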
3. Allocator choice
Every Vec::push and Box::new that needs new memory ends up in the global allocator. Which allocator you use changes the latency distribution: the effect shows up in the tails more than in average throughput. Reth picks jemalloc (originally written for FreeBSD, later developed largely at Facebook, and available in Rust through the tikv-jemallocator crate) over the system default (glibc malloc on Linux) because jemalloc keeps p99 stable under heavy fragmentation.
# Cargo.toml
[dependencies]
tikv-jemallocator = "0.5"
// main.rs
#[global_allocator]
static GLOBAL: tikv_jemallocator::Jemalloc = tikv_jemallocator::Jemalloc;
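This small change is often credited with shaving 10-30% off tail latency in I/O-heavy services. Treat that range as a rule of thumb and benchmark your own workload before relying on it.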
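The `#[global_allocator]` hook is easier to reason about once you can see the traffic going through it. The sketch below does not use jemalloc: it wraps the system allocator in a hypothetical `Counting` type and counts requests, which also shows how pre-sizing a Vec cuts allocator calls. The helper `allocs_during` assumes single-threaded use.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::hint::black_box;
use std::sync::atomic::{AtomicUsize, Ordering};

/// Forwards to the system allocator and counts every allocation request.
struct Counting;

static ALLOCS: AtomicUsize = AtomicUsize::new(0);

unsafe impl GlobalAlloc for Counting {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        ALLOCS.fetch_add(1, Ordering::Relaxed);
        unsafe { System.alloc(layout) }
    }
    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) }
    }
    // `realloc` keeps its default body, which calls `alloc` above,
    // so every Vec growth is counted as well.
}

#[global_allocator]
static GLOBAL: Counting = Counting;

/// Run `f` and report how many allocation requests happened meanwhile.
fn allocs_during<R>(f: impl FnOnce() -> R) -> (R, usize) {
    let before = ALLOCS.load(Ordering::Relaxed);
    let out = f();
    (out, ALLOCS.load(Ordering::Relaxed) - before)
}

fn main() {
    let (_grown, grow) = allocs_during(|| {
        let mut v = Vec::new();
        for i in 0..1024u64 {
            v.push(i);
        }
        black_box(v) // keep the allocations observable to the optimiser
    });
    let (_sized, pre) = allocs_during(|| {
        let mut v = Vec::with_capacity(1024);
        for i in 0..1024u64 {
            v.push(i);
        }
        black_box(v)
    });
    println!("growing Vec: {grow} allocator calls, pre-sized Vec: {pre}");
}
```

Swapping in jemalloc uses the same hook with a different type behind it, so the calling code does not change; only the latency profile of each request does.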
4. Reth's actual production build profiles
From the paradigmxyz/reth Cargo.toml:
[profile.release]
opt-level = 3
lto = "thin"
debug = "none"
strip = "symbols"
panic = "unwind"
codegen-units = 16
[profile.maxperf]
inherits = "release"
lto = "fat"
codegen-units = 1
[profile.maxperf-symbols]
inherits = "maxperf"
debug = "full"
strip = "none"
This is what the Paradigm team actually ships. Three profiles, three trade-offs:
release — daily builds
thin LTO and 16 codegen-units balance compile speed against runtime perf. Good enough for development and most production deployments.
maxperf — validators and benchmarks
fat LTO + 1 codegen-unit. Compile times go up significantly (full cross-module inlining), but the resulting binary is genuinely faster — this is what you build for a validator that needs every cycle.
maxperf-symbols — profiling production
Same optimization as maxperf, but keeps full debug symbols. Use it when you need a flamegraph that shows actual function names instead of mangled offsets in production-grade code. This is the profile you build when something is slow in production and you need to find out why.
How to invoke
cargo build --profile maxperf --bin reth
# Or with native CPU instructions (e.g., AVX2):
RUSTFLAGS="-C target-cpu=native" cargo build --profile maxperf --bin reth
Combine with the jemalloc and asm-keccak features you saw earlier.
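What `-C target-cpu=native` changes is which instructions the compiler may assume are available everywhere in the binary. A minimal sketch of the difference between that compile-time assumption and a runtime check, using AVX2 as the example; the function names are illustrative and nothing here is taken from Reth's code.

```rust
/// True only if the compiler was allowed to assume AVX2 for the whole binary,
/// for example via `-C target-cpu=native` on an AVX2 machine.
fn compiled_with_avx2() -> bool {
    cfg!(target_feature = "avx2")
}

/// Runtime check of the CPU this process is actually running on.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
fn cpu_has_avx2() -> bool {
    std::arch::is_x86_feature_detected!("avx2")
}

/// AVX2 is an x86 extension, so other architectures never report it.
#[cfg(not(any(target_arch = "x86", target_arch = "x86_64")))]
fn cpu_has_avx2() -> bool {
    false
}

fn main() {
    println!("compiled with AVX2: {}", compiled_with_avx2());
    println!("CPU supports AVX2:  {}", cpu_has_avx2());
    if compiled_with_avx2() && !cpu_has_avx2() {
        eprintln!("binary assumes AVX2 but this CPU lacks it; expect illegal-instruction crashes");
    }
}
```

The practical consequence: a binary built with `target-cpu=native` is tied to CPUs at least as capable as the build machine, so build it on (or for) the hardware that will run it.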
5. Three rules
- Measure before changing anything. "It feels faster" is not data.
- Optimize the path the profiler shows you. Anything else is busy work.
- Re-measure after. Compilers can defeat your hand-optimization.
Final check: revisit your "junior engineer wants to swap HashMap for BTreeMap" prediction from the top. Did you cite measurement, profiling, and re-verification? If you cited "well, BTreeMap is sometimes slower" — that's also wrong reasoning, just on the other side. The point isn't which container; the point is that the question is unanswerable without data.
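Turning the HashMap versus BTreeMap question into data takes only a few lines. The sketch below is std-only and deliberately makes no claim about which container wins: the map size, the strided key pattern, and the helper names are assumptions, and the answer depends on your key distribution, map size, and access pattern.

```rust
use std::collections::{BTreeMap, HashMap};
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Sum the values found for `keys` in a HashMap and time the lookups.
fn time_hash(map: &HashMap<u64, u64>, keys: &[u64]) -> (u64, Duration) {
    let start = Instant::now();
    let sum: u64 = keys.iter().filter_map(|k| map.get(black_box(k))).sum();
    (sum, start.elapsed())
}

/// Same workload against a BTreeMap.
fn time_btree(map: &BTreeMap<u64, u64>, keys: &[u64]) -> (u64, Duration) {
    let start = Instant::now();
    let sum: u64 = keys.iter().filter_map(|k| map.get(black_box(k))).sum();
    (sum, start.elapsed())
}

fn main() {
    let n = 100_000u64;
    let hash: HashMap<u64, u64> = (0..n).map(|k| (k, k * 2)).collect();
    let btree: BTreeMap<u64, u64> = (0..n).map(|k| (k, k * 2)).collect();
    // Hypothetical access pattern: a strided walk over existing keys.
    let keys: Vec<u64> = (0..n).map(|i| (i * 7919) % n).collect();

    let (a, t_hash) = time_hash(&hash, &keys);
    let (b, t_btree) = time_btree(&btree, &keys);
    assert_eq!(a, b, "both containers must return the same answers");
    println!("HashMap: {t_hash:?}  BTreeMap: {t_btree:?}");
}
```

A single run like this is only a first look. For a claim you intend to ship, move the two functions into a Criterion benchmark and confirm with a flamegraph that the map is on the hot path at all.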
You're now equipped to start opening Reth's perf-critical files (crates/storage/db, crates/blockchain-tree) with intent rather than just curiosity.
Summary (3 lines)
- Reth perf = profile (cargo flamegraph) + optimise hot paths (revm 70 % / MDBX 15 %) + measure (CI benchmarks).
- Tools: rayon, inline, branchless, const fn. Patterns: pre-alloc, SmallVec, cache, batch I/O.
- Production discipline; PR-time regression-blocking. Patterns transfer to any Rust system.