Lesson 1 — Performance engineering for Reth

Question

Reth's perf engineering is the production discipline. Profile + optimise + measure. The patterns transfer to any Rust system.

Principle (minimum model)

Profile first. cargo flamegraph reveals hot paths. Don't guess; measure.
Hot paths in Reth. revm interpreter (~70 % time), MDBX I/O (~15 %), serde (~5 %), other (~10 %). Optimisation effort follows.
Optimisation tools. rayon (data parallelism), #[inline] (avoid call overhead), branchless code (avoid branch mispredicts), const fn (compile-time evaluation).
Common patterns. Pre-allocate Vec capacity; use SmallVec<[T; N]> for collections that usually hold at most N elements (they stay inline and spill to the heap only beyond N); cache hot computations; batch I/O.
Regression gating. Reth's CI runs benchmarks; regressions are blocked at PR-time. Production discipline.
Cross-cutting metrics. Sync time, peak memory, per-stage throughput. Monitored in production.
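The tools and patterns listed above can be sketched with the standard library alone. This is an illustrative sketch, not code from Reth: the function names (words_for_bytes, squares, count_above) are hypothetical, and whether the compiler actually emits branch-free machine code for count_above depends on the target and optimization level, so it still has to be measured.

```rust
/// const fn: callable at compile time when used in a const context.
const fn words_for_bytes(len: usize) -> usize {
    (len + 31) / 32
}

// Computed by the compiler, so there is no run-time cost.
const WORDS_IN_1KB: usize = words_for_bytes(1024);

/// Pre-allocation: one allocation up front instead of repeated regrowth.
fn squares(n: usize) -> Vec<u64> {
    let mut out = Vec::with_capacity(n);
    for i in 0..n as u64 {
        out.push(i * i);
    }
    out
}

/// Branch-free form: the comparison becomes 0 or 1 and is summed,
/// so the loop body contains no `if` for the predictor to miss.
#[inline]
fn count_above(values: &[u64], threshold: u64) -> usize {
    values.iter().map(|&v| (v > threshold) as usize).sum()
}

fn main() {
    let v = squares(10);
    println!("capacity covers length: {}", v.capacity() >= v.len());
    println!("values above 20: {}", count_above(&v, 20));
    println!("32-byte words in 1 KB: {}", WORDS_IN_1KB);
}
```

None of these is a win by default; each is a candidate change to benchmark once the profiler has pointed at the code in question.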

Worked example + steps

Performance engineering for Reth

A Reth fork ships. Block import was 12ms in your benches; in production it's 80ms. Where did the 68ms go? You don't know — because nobody profiled. This is the failure mode this lesson exists to prevent: invisible slowdowns, the kind that compound silently until a validator falls 200 blocks behind.

If you're going to ship a Reth fork or write hot-path code in Revm, profiling and benchmarking are non-negotiable. Premature optimization is bad; invisible slowdowns are worse.

1. Profile first, optimize second

The discipline: never optimize anything you haven't measured. Two tools cover the two questions you'll ever ask:

Tool	Purpose
flamegraph	"Where is time being spent overall?"
Criterion	"Did this specific change make function X faster?"

Flamegraph in 30 seconds

cargo install flamegraph
cargo flamegraph --bin reth -- node --chain mainnet
# Open flamegraph.svg in a browser

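Bar width is the share of samples in which a function was on the stack. Wide plateaus at the top of the flamegraph are the functions actually burning CPU: your hot paths. Don't optimize anything that isn't visible there. If frames show up as raw addresses, the binary was built without symbols; section 4 covers a profile that keeps them.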

Criterion microbenchmarks

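// Cargo.toml
// [dev-dependencies]
// criterion = "0.5"
// (plus a crate that provides keccak256; alloy-primitives is assumed here)
//
// [[bench]]
// name = "my_bench"
// harness = false   // required: lets criterion_main! supply the entry point

// benches/my_bench.rs
use std::hint::black_box;

use alloy_primitives::keccak256; // assumption: the original does not say where keccak256 comes from
use criterion::{criterion_group, criterion_main, Criterion};

fn bench_my_thing(c: &mut Criterion) {
    c.bench_function("hash 1KB", |b| {
        // Build the input once, outside the timed closure.
        let data = vec![0u8; 1024];
        // black_box stops the compiler from hoisting or eliding the hash.
        b.iter(|| keccak256(black_box(&data)))
    });
}

criterion_group!(benches, bench_my_thing);
criterion_main!(benches);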

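cargo bench reports each benchmark with a confidence interval and compares it against the previously saved run, flagging statistically significant changes. Always commit your benchmark results when claiming a perf improvement.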
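To see why a benchmark harness takes many samples instead of timing one call, here is a standard-library sketch of the idea: warm up, collect samples, report the median alongside the extremes. It is a simplification for illustration, not how Criterion is implemented, and checksum is a hypothetical stand-in for the function under test.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Runs `f` a few times to warm up, then times `samples` calls.
/// Returns the durations sorted ascending.
fn sample<F: FnMut()>(mut f: F, samples: usize) -> Vec<Duration> {
    for _ in 0..10 {
        f(); // warm-up: fill caches, let the CPU clock settle
    }
    let mut times: Vec<Duration> = (0..samples)
        .map(|_| {
            let start = Instant::now();
            f();
            start.elapsed()
        })
        .collect();
    times.sort();
    times
}

/// Middle element of an already sorted slice.
fn median(sorted: &[Duration]) -> Duration {
    sorted[sorted.len() / 2]
}

/// Hypothetical workload standing in for the code being measured.
fn checksum(data: &[u8]) -> u64 {
    data.iter()
        .fold(0u64, |acc, &b| acc.wrapping_mul(31).wrapping_add(b as u64))
}

fn main() {
    let data = vec![7u8; 1024];
    let times = sample(
        || {
            black_box(checksum(black_box(&data)));
        },
        101,
    );
    println!(
        "median {:?}, min {:?}, max {:?}",
        median(&times),
        times[0],
        times[times.len() - 1]
    );
}
```

The gap between min and max on an idle machine is usually large enough to swallow a small optimization, which is the case for comparing distributions and not single numbers.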

Whole-node benchmarks: snapshot the disk

Microbenchmarks don't capture system effects. The fix is to benchmark a whole-node workload, but that needs the same on-disk state every run, and a 500 GB Reth database takes hours to rebuild between runs. Paradigm uses tempoxyz/schelk for exactly this: a block-device snapshot/rollback tool that restores the scratch volume by copying back only the blocks the benchmark wrote (tracked via dm-era). Rollback takes seconds instead of hours, and the workload still runs against plain ext4 on real NVMe with no overlay or CoW filesystem in the read path.

The pedagogical point: when you read perf claims about Reth ("we shaved 15% off staged sync"), assume the authors are using something like schelk between runs. A benchmark without rollback discipline is unrepeatable; a benchmark with the wrong rollback (LVM thin overlays, btrfs snapshots) measures the rollback machinery as much as the workload.

🔍 Find in repo. Open tempoxyz/schelk and read docs/SKILL.md. Three things will surprise you about how it does rollback. Name them before continuing — then verify against the repo.

2. Cache lines, not lines of code

On a modern CPU, reading from RAM is ~100x slower than doing arithmetic on a register. So "make the code shorter" is the wrong knob; "make the memory layout friendlier" is the right one. The unit the CPU actually loads is a cache line: not a byte, not a struct field, but a fixed chunk, 64 bytes on most x86-64 and ARM cores.

Implications

Struct of Arrays > Array of Structs for hot loops
Pad hot fields to a cache line to avoid false sharing
Sort data for predictable access patterns

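// Bad: each Row is ~200 bytes, so a loop that reads only `id` and `version`
// pulls in a fresh cache line per element and uses a few bytes of it
struct Row {
    id: u64,
    version: u32,
    big_blob: [u8; 192],
}

// Better: separate hot and cold fields so the hot data packs densely
struct Hot { id: u64, version: u32 }
struct Cold { big_blob: [u8; 192] }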
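The first two implications can be made concrete with a small sketch. The type names (RowAos, TableSoa, PaddedCounter) are hypothetical, and the 64-byte alignment assumes a 64-byte cache line, which does not hold on every CPU.

```rust
use std::mem::{align_of, size_of};
use std::sync::atomic::AtomicU64;

// Array of Structs: every element drags its cold payload along.
#[allow(dead_code)]
struct RowAos {
    id: u64,
    big_blob: [u8; 192],
}

// Struct of Arrays: ids sit back to back, eight per 64-byte line.
struct TableSoa {
    ids: Vec<u64>,
    blobs: Vec<[u8; 192]>,
}

impl TableSoa {
    /// Hot loop: touches only the densely packed ids.
    fn sum_ids(&self) -> u64 {
        self.ids.iter().sum()
    }
}

// Aligning to 64 bytes gives each counter its own cache line, so two
// threads updating neighbouring counters do not invalidate each other.
#[allow(dead_code)]
#[repr(align(64))]
struct PaddedCounter(AtomicU64);

fn main() {
    let table = TableSoa {
        ids: (0..1000).collect(),
        blobs: vec![[0u8; 192]; 1000],
    };
    println!("RowAos is {} bytes per element", size_of::<RowAos>());
    println!("sum of ids = {}", table.sum_ids());
    println!("cold blobs stored separately: {}", table.blobs.len());
    println!(
        "PaddedCounter: size {}, align {}",
        size_of::<PaddedCounter>(),
        align_of::<PaddedCounter>()
    );
}
```

Padding trades memory for isolation: a PaddedCounter occupies a full line to hold eight bytes, so it is worth it only for fields that several threads write concurrently.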

3. Allocator choice

Every heap allocation (a Vec that grows on push, a Box::new) eventually calls into the global allocator. Which allocator you use changes the latency distribution, often the tails more than the average throughput. Reth picks jemalloc (originally written for FreeBSD and long developed at Facebook, available in Rust through the tikv-jemallocator crate) over the system default (glibc malloc on Linux) because jemalloc tends to keep p99 stable under heavy fragmentation.

# Cargo.toml
[dependencies]
tikv-jemallocator = "0.5"

// main.rs
#[global_allocator]
static GLOBAL: tikv_jemallocator::Jemalloc = tikv_jemallocator::Jemalloc;

This small change can shave 10-30% off tail latency in I/O-heavy services, but the gain is workload-dependent, so benchmark before and after.
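To see what #[global_allocator] actually hooks, the sketch below wraps the system allocator in a counter. Counting and allocations_during are hypothetical names for illustration; this is not how jemalloc or Reth instrument allocations, only a way to observe that every heap request passes through the one registered allocator.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::hint::black_box;
use std::sync::atomic::{AtomicUsize, Ordering};

static ALLOCATIONS: AtomicUsize = AtomicUsize::new(0);

/// Forwards to the system allocator and counts each allocation request.
struct Counting;

unsafe impl GlobalAlloc for Counting {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        ALLOCATIONS.fetch_add(1, Ordering::Relaxed);
        unsafe { System.alloc(layout) }
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) }
    }
}

// The same hook the jemalloc snippet uses, pointed at the wrapper.
#[global_allocator]
static GLOBAL: Counting = Counting;

/// Number of allocation requests made while `f` runs.
fn allocations_during<F: FnOnce()>(f: F) -> usize {
    let before = ALLOCATIONS.load(Ordering::Relaxed);
    f();
    ALLOCATIONS.load(Ordering::Relaxed) - before
}

fn main() {
    let grown = allocations_during(|| {
        let mut v = Vec::new();
        for i in 0..1000u32 {
            v.push(i); // regrows several times along the way
        }
        black_box(&v);
    });
    let reserved = allocations_during(|| {
        let mut v = Vec::with_capacity(1000);
        for i in 0..1000u32 {
            v.push(i); // capacity already covers every push
        }
        black_box(&v);
    });
    println!("grown: {grown} allocations, pre-allocated: {reserved}");
}
```

The counter is process-wide, so other threads add noise; treat the numbers as a rough signal, not an exact count.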

4. Reth's actual production build profiles

From the paradigmxyz/reth Cargo.toml:

[profile.release]
opt-level = 3
lto = "thin"
debug = "none"
strip = "symbols"
panic = "unwind"
codegen-units = 16

[profile.maxperf]
inherits = "release"
lto = "fat"
codegen-units = 1

[profile.maxperf-symbols]
inherits = "maxperf"
debug = "full"
strip = "none"

This is what the Paradigm team actually ships. Three profiles, three trade-offs:

`release` — daily builds

thin LTO and 16 codegen-units balance compile speed against runtime perf. Good enough for development and most production deployments.

`maxperf` — validators and benchmarks

fat LTO + 1 codegen-unit. Compile times go up significantly (the whole program, across all crates, is optimized as one unit), but the resulting binary is genuinely faster. This is what you build for a validator that needs every cycle.

`maxperf-symbols` — profiling production

Same optimization as maxperf, but keeps full debug symbols. Use it when you need a flamegraph that shows actual function names instead of raw addresses in production-grade code. This is the profile you build when something is slow in production and you need to find out why.

How to invoke

cargo build --profile maxperf --bin reth
# Or with native CPU instructions (e.g., AVX2); the binary may not run on CPUs that lack them:
RUSTFLAGS="-C target-cpu=native" cargo build --profile maxperf --bin reth

Combine with the jemalloc and asm-keccak features you saw earlier by adding --features jemalloc,asm-keccak to either command.

5. Three rules

Measure before changing anything. "It feels faster" is not data.
Optimize the path the profiler shows you. Anything else is busy work.
Re-measure after. Compilers can defeat your hand-optimization.

Final check: revisit your "junior engineer wants to swap HashMap for BTreeMap" prediction from the top. Did you cite measurement, profiling, and re-verification? If you cited "well, BTreeMap is sometimes slower" — that's also wrong reasoning, just on the other side. The point isn't which container; the point is that the question is unanswerable without data.
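A minimal way to turn the HashMap versus BTreeMap question into data is to run both on the same keys and time them. The sketch below uses hypothetical names and a single timed pass per container, which is far too noisy for a real decision; it only shows the shape of the experiment, and a proper comparison would go through Criterion with your real key distribution.

```rust
use std::collections::{BTreeMap, HashMap};
use std::hint::black_box;
use std::time::Instant;

/// Sums the values found for `keys` in a HashMap.
fn lookup_sum_hash(map: &HashMap<u64, u64>, keys: &[u64]) -> u64 {
    keys.iter().filter_map(|k| map.get(k)).sum()
}

/// Same workload against a BTreeMap.
fn lookup_sum_btree(map: &BTreeMap<u64, u64>, keys: &[u64]) -> u64 {
    keys.iter().filter_map(|k| map.get(k)).sum()
}

fn main() {
    let n = 100_000u64;
    let hash: HashMap<u64, u64> = (0..n).map(|k| (k, k * 2)).collect();
    let btree: BTreeMap<u64, u64> = (0..n).map(|k| (k, k * 2)).collect();
    // Scattered key order; purely sequential keys would flatter BTreeMap.
    let keys: Vec<u64> = (0..n).map(|i| i.wrapping_mul(7919) % n).collect();

    let start = Instant::now();
    let a = black_box(lookup_sum_hash(&hash, black_box(&keys)));
    let hash_time = start.elapsed();

    let start = Instant::now();
    let b = black_box(lookup_sum_btree(&btree, black_box(&keys)));
    let btree_time = start.elapsed();

    // Both containers must agree before the timings mean anything.
    assert_eq!(a, b);
    println!("HashMap: {hash_time:?}, BTreeMap: {btree_time:?}");
}
```

Change the key pattern, the map size, or the build profile and the ranking can flip, which is why the answer has to come from measurement on the workload you actually run.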

You're now equipped to start opening Reth's perf-critical files (crates/storage/db, crates/blockchain-tree) with intent rather than just curiosity.

Summary (3 lines)

Reth perf = profile (cargo flamegraph) + optimise hot paths (revm 70 % / MDBX 15 %) + measure (CI benchmarks).
Tools: rayon, inline, branchless, const fn. Patterns: pre-alloc, SmallVec, cache, batch I/O.
Production discipline; PR-time regression-blocking. Patterns transfer to any Rust system.