Lesson 11 — Running a Reth fork in production

Question

What does it take to run a custom Reth fork in production? Four disciplines: chainspec, upgrade path, monitoring, and emergency response.

Principle (minimum model)

Chainspec. Custom genesis + custom forks. Deterministic; reproducible.
Upgrade path. Add a new fork at a future block. Coordinate with users; staged rollout.
Monitoring. Prometheus + Grafana. Metrics: sync time, peer count, mempool size, gas usage.
Alerts. Page on critical failures (sync stalled > 1 hour, peer count < 10, OOM).
Emergency response. Documented procedures for: stuck chain, fork resolution, key compromise. Practiced.
Coordination. Telegram / Discord channels for validators. Status page for users.
Validator discipline. Documented procedures + on-call rotation + post-mortem culture.
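Production examples. OP Mainnet + Hyperliquid + Tempo + Berachain. Operator models differ (OP Mainnet, for instance, depends on a sequencer rather than a validator set), so treat node counts as network-specific; all of them need 24/7 ops.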
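One way to make the "deterministic; reproducible" chainspec requirement checkable is to fingerprint the chainspec file, so operators can compare digests before launch or before an upgrade. A minimal Python sketch, assuming the chainspec is a JSON document; the field names and values are hypothetical, and the digest is an operator-side convenience, not the genesis block hash that Reth computes.

```python
import hashlib
import json


def chainspec_fingerprint(chainspec: dict) -> str:
    """Return a SHA-256 hex digest of a chainspec, independent of key order."""
    # Canonical form: keys sorted at every nesting level, no extra whitespace.
    canonical = json.dumps(chainspec, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical, heavily trimmed chainspecs; same content, different key order.
spec_a = {"config": {"chainId": 4242, "myForkBlock": 1000000}, "gasLimit": "0x1c9c380"}
spec_b = {"gasLimit": "0x1c9c380", "config": {"myForkBlock": 1000000, "chainId": 4242}}

# Operators with the same logical chainspec should report the same digest.
same = chainspec_fingerprint(spec_a) == chainspec_fingerprint(spec_b)
print("chainspecs match:", same)
```

If two validators report different digests, stop and reconcile the files before either of them starts producing blocks.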

Worked example + steps

Running a Reth fork in production

It's 3 a.m. Your validator stopped producing blocks 40 minutes ago. The dashboard shows: file-descriptor exhaustion, MDBX page-cache pressure, peer count at 2. None of these would have shown up in unit tests. None of them would have shown up the day you shipped. They show up at month 3, all at once. This lesson is the ops checklist that prevents that 3 a.m. page — build flags, systemd limits, diff testing against vanilla, the deployment topology that lets your fork survive contact with reality.

1. Build & release pipeline

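# Optimized release build.
# Caveat: target-cpu=native tunes for the build host's CPU, so the binary is
# neither portable nor reproducible across machines. Build on the same CPU
# model as your validators, or pin an explicit target-cpu such as x86-64-v3.
RUSTFLAGS="-C target-cpu=native -C codegen-units=1" \
  cargo build --release --features jemalloc,asm-keccak

strip target/release/reth   # or use objcopy to keep debug symbols in a separate file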

Flag	Why
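`-C target-cpu=native`	compiles for the build host's CPU (AVX2/AVX-512 if present); the binary can die with an illegal-instruction fault on validators that lack those features
`-C codegen-units=1`	better optimization at the cost of build time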
`features = [jemalloc]`	tail-latency stability under load
`features = [asm-keccak]`	hand-tuned assembly for keccak — measurable on hot path
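Because `-C target-cpu=native` bakes in whatever instruction sets the build host supports, it can help to verify each validator's CPU before rolling out a binary. The following is a rough Python sketch for x86 Linux, assuming the feature names appear on the `flags` line of `/proc/cpuinfo`; the required set shown is an assumption and should be replaced with what your build host actually enabled.

```python
def missing_cpu_features(cpuinfo_text: str, required: set) -> set:
    """Return the required CPU flags that the host does not advertise."""
    for line in cpuinfo_text.splitlines():
        # x86 Linux lists feature flags on a line starting with "flags".
        if line.startswith("flags"):
            _, _, flags = line.partition(":")
            return set(required) - set(flags.split())
    # No flags line found: treat every required feature as missing.
    return set(required)


def read_cpuinfo(path: str = "/proc/cpuinfo") -> str:
    """Read cpuinfo, returning an empty string where the file is unavailable."""
    try:
        with open(path) as f:
            return f.read()
    except OSError:
        return ""


# Assumed requirement list; adjust to match the build host.
REQUIRED = {"avx2", "bmi2"}
print("missing features:", sorted(missing_cpu_features(read_cpuinfo(), REQUIRED)))
```

An empty list means the host advertises everything in `REQUIRED`; anything else is a reason not to deploy that binary to that machine.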

2. Systemd unit (or equivalent)

[Service]
ExecStart=/usr/local/bin/reth node --chain custom --datadir /var/lib/reth
Restart=on-failure
LimitNOFILE=1048576
LimitNPROC=infinity
TasksMax=infinity

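The file-descriptor limit matters: Reth keeps many descriptors open at once (P2P and RPC sockets, plus database and static-file handles), and the common default soft limit of 1024 is easy to exhaust.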
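To confirm that the limit from the unit file actually reached the running process, you can compare the open-descriptor count with the soft limit. This is a small Linux-only Python sketch, assuming the standard `/proc/<pid>/limits` and `/proc/<pid>/fd` layout; the 80% warning threshold is an arbitrary illustration.

```python
import os
from typing import Optional


def parse_nofile_soft_limit(limits_text: str) -> Optional[int]:
    """Extract the soft 'Max open files' limit from /proc/<pid>/limits text.

    Returns None if the limit is 'unlimited' or the line is absent.
    """
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            soft = line.split()[3]  # fields: Max open files <soft> <hard> files
            return None if soft == "unlimited" else int(soft)
    return None


def fd_usage(open_fds: int, soft_limit: int) -> float:
    """Fraction of the soft file-descriptor limit currently in use."""
    if soft_limit <= 0:
        raise ValueError("soft_limit must be positive")
    return open_fds / soft_limit


def check_pid(pid: int, warn_at: float = 0.8) -> str:
    """Summarize descriptor usage for a process (reads /proc, Linux only)."""
    with open(f"/proc/{pid}/limits") as f:
        soft = parse_nofile_soft_limit(f.read())
    used = len(os.listdir(f"/proc/{pid}/fd"))
    if soft is None:
        return f"pid {pid}: {used} fds, no finite soft limit"
    ratio = fd_usage(used, soft)
    state = "WARN" if ratio >= warn_at else "ok"
    return f"pid {pid}: {used}/{soft} fds ({ratio:.1%}) {state}"
```

On a Linux host, `print(check_pid(os.getpid()))` exercises it against the current process; in practice you would pass the PID of the reth service.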

3. Storage discipline

Separate volumes for DB and logs. Never let logs fill the DB partition.
NVMe SSDs only. Spinning rust will not keep up with execution.
Snapshot regularly. Prefer filesystem-level snapshots taken while the node is stopped or writes are paused; check `reth db --help` on your build before scripting around any `reth db` subcommand.
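Plan for growth. An Ethereum mainnet Reth full node already needs more than a terabyte; an app-chain starts smaller, but its state and history grow with every block, so size volumes from measured growth rather than today's usage.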
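"Plan for growth" becomes concrete once you turn disk samples into a days-until-full estimate. A minimal Python sketch, assuming you already record `(unix_time, used_bytes)` samples for the DB volume; the sample numbers are made up for illustration.

```python
import shutil

GIB = 1024 ** 3


def growth_per_day(samples: list) -> float:
    """Average growth in bytes/day from (unix_time, used_bytes) samples, oldest first."""
    (t0, b0), (t1, b1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("samples must span a positive time range")
    return (b1 - b0) / ((t1 - t0) / 86_400)


def days_until_full(free_bytes: int, bytes_per_day: float) -> float:
    """Days of headroom left at the given growth rate (inf if not growing)."""
    if bytes_per_day <= 0:
        return float("inf")
    return free_bytes / bytes_per_day


# Hypothetical week of samples: the DB grew from 300 GiB to 370 GiB.
samples = [(0, 300 * GIB), (7 * 86_400, 370 * GIB)]
rate = growth_per_day(samples)
print(f"growth: {rate / GIB:.1f} GiB/day")
print(f"headroom with 500 GiB free: {days_until_full(500 * GIB, rate):.0f} days")

# Live free space for the current directory's volume; point this at the DB mount.
print("free here:", shutil.disk_usage(".").free // GIB, "GiB")
```

Alerting on days of headroom rather than percent free gives you a lead time that does not depend on the size of the volume.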

4. Monitoring

What to alert on:

Metric	Alert if
Sync lag (head vs network)	> N blocks for > M minutes
Peer count	< 5
MDBX free pages	< 5%
Process RSS	trending up monotonically
Block import time	p99 > target
ExEx height behind tip	depends on ExEx

Reth exposes Prometheus metrics once you start it with `--metrics <addr:port>`; scrape that endpoint and wire it to Grafana, alerting on rates of change, not just absolute values.
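Two of the alert shapes in the table, "above a threshold for a sustained period" and "trending up monotonically", are easy to get subtly wrong, so here is one possible Python sketch of the logic. It works on plain timestamped samples rather than any particular Prometheus client; the `Sample` type and function names are illustrative, and in production you would more likely express the same rules in your alerting system.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    t: float      # unix seconds
    value: float  # e.g. sync lag in blocks, or RSS in bytes


def breached_for(samples: list, threshold: float, min_seconds: float) -> bool:
    """True if the latest run of samples stayed above threshold for min_seconds.

    Samples are expected oldest first. Any sample at or below the threshold
    resets the run, so a brief recovery clears the alert.
    """
    start = None
    for s in samples:
        if s.value > threshold:
            if start is None:
                start = s.t
        else:
            start = None
    return start is not None and samples[-1].t - start >= min_seconds


def rising_monotonically(samples: list, min_points: int = 5) -> bool:
    """True if each of the last min_points samples is strictly above the previous."""
    vals = [s.value for s in samples[-min_points:]]
    return len(vals) >= min_points and all(b > a for a, b in zip(vals, vals[1:]))


# Hypothetical sync-lag samples, one per minute.
lag = [Sample(t=60.0 * i, value=v) for i, v in enumerate([2, 12, 15, 20, 30])]
print("sync lag alert:", breached_for(lag, threshold=10, min_seconds=180))
print("lag rising:", rising_monotonically(lag))
```

The duration check is what keeps a single slow block from paging anyone; the trend check is the kind of rate-of-change signal that catches a leak long before an absolute threshold does.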

5. Diff testing

If your fork modifies execution, run continuous diff testing against vanilla Reth on the same blocks:

# Pseudo-code for a diff harness
for block in mainnet[recent_1000]:
    s1 = reth_vanilla.execute(block)
    s2 = reth_fork.execute(block)
    if s1.stateRoot != s2.stateRoot:
        alert("divergence at block", block, s1, s2)

Any unintended divergence, even in a single storage slot, changes the state root and is therefore a consensus bug. On an App-chain, a consensus bug means a chain halt.
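The pseudo-code above can be approximated without touching node internals by comparing the `stateRoot` that each node reports over JSON-RPC. This is a hedged Python sketch, assuming a vanilla node and a fork node that have both executed the same blocks with the fork's rule changes inactive on that range; the URLs in the usage note are placeholders, and only the standard `eth_getBlockByNumber` method is used.

```python
import json
import urllib.request
from typing import Callable, Optional

Fetcher = Callable[[int], str]


def rpc_state_root(url: str) -> Fetcher:
    """Build a fetcher that reads a block's stateRoot over JSON-RPC."""
    def fetch(number: int) -> str:
        payload = json.dumps({
            "jsonrpc": "2.0",
            "id": 1,
            "method": "eth_getBlockByNumber",
            "params": [hex(number), False],  # False: transaction hashes only
        }).encode()
        req = urllib.request.Request(
            url, data=payload, headers={"Content-Type": "application/json"}
        )
        with urllib.request.urlopen(req, timeout=10) as resp:
            # Raises TypeError if the node does not have the block (result is null).
            return json.load(resp)["result"]["stateRoot"]
    return fetch


def first_divergence(vanilla: Fetcher, fork: Fetcher,
                     start: int, end: int) -> Optional[tuple]:
    """Return (block, vanilla_root, fork_root) at the first mismatch, else None."""
    for n in range(start, end + 1):
        a, b = vanilla(n), fork(n)
        if a.lower() != b.lower():
            return n, a, b
    return None
```

Usage would look like `first_divergence(rpc_state_root("http://vanilla:8545"), rpc_state_root("http://fork:8545"), start, end)`, run on a schedule with any non-`None` result treated as a page. Keeping the fetchers injectable lets you test the comparison logic with plain dictionaries.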

6. Deployment topology for an App-chain

Minimum:

≥ 4 validators in 3 datacenters
2 sentries in front of each validator
A separate archive node (not a validator) for analytics queries
A separate RPC fleet with rate-limiting and a CDN

Don't run the validator and the public RPC on the same machine. One DDoS and your chain stalls.
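The minimum topology above can be encoded as a check that runs against your inventory in CI, so nobody quietly ends up with a single-datacenter validator set. This is an illustrative Python sketch; the `Validator` record and its fields are hypothetical and stand in for whatever inventory format you actually keep.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Validator:
    name: str
    datacenter: str
    sentries: int
    serves_public_rpc: bool = False


def topology_problems(validators: list) -> list:
    """Return human-readable violations of the minimum topology rules."""
    problems = []
    if len(validators) < 4:
        problems.append(f"only {len(validators)} validators, need at least 4")
    datacenters = {v.datacenter for v in validators}
    if len(datacenters) < 3:
        problems.append(f"only {len(datacenters)} datacenters, need at least 3")
    for v in validators:
        if v.sentries < 2:
            problems.append(f"{v.name}: {v.sentries} sentries, need 2")
        if v.serves_public_rpc:
            problems.append(f"{v.name}: validator also serves public RPC")
    return problems


# Hypothetical inventory with two deliberate mistakes.
fleet = [
    Validator("val-1", "dc-a", sentries=2),
    Validator("val-2", "dc-b", sentries=2),
    Validator("val-3", "dc-c", sentries=1),
    Validator("val-4", "dc-a", sentries=2, serves_public_rpc=True),
]
for problem in topology_problems(fleet):
    print(problem)
```

An empty result means the inventory meets the minimum; it says nothing about whether the sentries are themselves healthy, which still belongs in monitoring.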

7. Upgrade procedure

The hardest part of running a fork is upgrading it without halting the chain.

Announce a target block height for activation
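Ship the new binary to validators well ahead of time, with the new rules compiled in but inactive until the activation height
At the activation block, the consensus rule changes, guarded by the height check
Validators that haven't upgraded keep following the old rules and fall off the chain at that block, which is why height-gating plus early announcement matters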

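This is the same pattern Ethereum hard forks follow (post-Merge forks activate at a timestamp rather than a block number, but the gating idea is identical); an App-chain is no different, just smaller scale.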
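The height gate itself is a very small piece of logic, which is part of why the pattern is robust. A minimal Python sketch of the idea follows; the `ForkSchedule` type and the "v1"/"v2" rule names are hypothetical, and in a real Reth fork the equivalent check lives in the chainspec and hardfork handling, not in a helper like this.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ForkSchedule:
    # None means the new rules are compiled in but not yet scheduled.
    activation_height: Optional[int]

    def is_active(self, block_number: int) -> bool:
        """New rules apply from the activation block onward, inclusive."""
        return (
            self.activation_height is not None
            and block_number >= self.activation_height
        )


def rule_set_for(block_number: int, schedule: ForkSchedule) -> str:
    """Pick the consensus rule set purely from the block height."""
    return "v2" if schedule.is_active(block_number) else "v1"


# Hypothetical schedule: the fork was announced for block 1,000,000.
schedule = ForkSchedule(activation_height=1_000_000)
for height in (999_999, 1_000_000, 1_000_001):
    print(height, rule_set_for(height, schedule))
```

Because the decision depends only on the block number, every upgraded validator switches at the same block regardless of when it was restarted, and replaying old blocks still applies the old rules.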

8. Reading list

Reth Book "Run a node" + "Custom chain" sections
Validator ops post-mortems from any major chain incident: they're gold for ops intuition

You now have a complete picture: develop, profile, extend, deploy, monitor. Welcome to the small club.

Final check: in one sentence, why is "diff testing against vanilla Reth" the highest-value test you can write for a fork? What class of bug does it catch that no unit test ever will? If your answer doesn't mention "consensus" or "the only output that matters is stateRoot," re-read Section 5.

Expert continuation

Running a Reth fork in production sits at the intersection of three Expert lessons in this course:

Systems-code auditing — reviewing your fork diff with auditor discipline before every upgrade
Chaos engineering for Rust EVM nodes — controlled failure injection so your fork survives the same failure modes upstream Reth survives
Open-source contributor workflow — upstreaming generic fixes back to Paradigm/Reth so your fork's diff stays small

Summary (3 lines)

Production Reth fork = chainspec + upgrade path + monitoring + emergency response.
Monitoring: Prometheus + Grafana + alerts. Emergency procedures documented + practiced.
Coordination via Telegram/Discord + status page. Production examples: OP / Hyperliquid / Tempo / Berachain.