Lesson 11 — Running a Reth fork in production
Question
What does it take to run a custom Reth fork in production? Four disciplines: chainspec, upgrade path, monitoring, and emergency response.
Principle (minimum model)
- Chainspec. Custom genesis + custom forks. Deterministic; reproducible.
- Upgrade path. Add a new fork at a future block. Coordinate with users; staged rollout.
- Monitoring. Prometheus + Grafana. Metrics: sync time, peer count, mempool size, gas usage.
- Alerts. Page on critical failures (sync stalled > 1 hour, peer count < 10, OOM).
- Emergency response. Documented procedures for: stuck chain, fork resolution, key compromise. Practiced.
- Coordination. Telegram / Discord channels for validators. Status page for users.
- Validator discipline. Documented procedures + on-call rotation + post-mortem culture.
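- Production examples. OP Mainnet + Hyperliquid + Tempo + Berachain. Operator-set sizes differ by chain (OP Mainnet, for example, runs a centralized sequencer rather than a validator set), but all of them need 24/7 ops.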
Worked example + steps
Running a Reth fork in production
It's 3 a.m. Your validator stopped producing blocks 40 minutes ago. The dashboard shows: file-descriptor exhaustion, MDBX page-cache pressure, peer count at 2. None of these would have shown up in unit tests. None of them would have shown up the day you shipped. They show up at month 3, all at once. This lesson is the ops checklist that prevents that 3 a.m. page — build flags, systemd limits, diff testing against vanilla, the deployment topology that lets your fork survive contact with reality.
1. Build & release pipeline
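# Optimized release build. Note: target-cpu=native ties the binary to the build host's CPU,
# so build on the same CPU class you deploy to (or pin an explicit target-cpu for reproducibility).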
RUSTFLAGS="-C target-cpu=native -C codegen-units=1" \
cargo build --release --features jemalloc,asm-keccak
strip target/release/reth # or use objcopy for separated debug
| Flag | Why |
|---|---|
| `-C target-cpu=native` | use AVX2/AVX512 if your validators have it |
| `codegen-units=1` | better optimization at the cost of build time |
| `features = [jemalloc]` | tail-latency stability under load |
| `features = [asm-keccak]` | hand-tuned assembly for keccak; measurable on the hot path |
2. Systemd unit (or equivalent)
[Service]
ExecStart=/usr/local/bin/reth node --chain custom --datadir /var/lib/reth
Restart=on-failure
LimitNOFILE=1048576
LimitNPROC=infinity
TasksMax=infinity
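The file-descriptor limit matters: Reth keeps the MDBX database, many static-file segments, and many P2P and RPC sockets open at the same time, and the default limit of 1024 on many distributions runs out quickly.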
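The limit is only useful if you also watch how close the process gets to it. Below is a minimal sketch of a headroom check, assuming a Linux host where `/proc/<pid>/limits` and `/proc/<pid>/fd` are readable. The function names and the 80% threshold are illustrative choices, not part of Reth.

```python
import os
from typing import Optional


def parse_nofile_soft_limit(limits_text: str) -> Optional[int]:
    """Extract the soft 'Max open files' limit from /proc/<pid>/limits text.

    Returns None when the limit is 'unlimited'.
    """
    prefix = "Max open files"
    for line in limits_text.splitlines():
        if line.startswith(prefix):
            soft = line[len(prefix):].split()[0]
            return None if soft == "unlimited" else int(soft)
    raise ValueError("no 'Max open files' line found")


def count_open_fds(pid: int) -> int:
    """Count open descriptors by listing /proc/<pid>/fd (Linux only)."""
    return len(os.listdir(f"/proc/{pid}/fd"))


def fd_alert(open_fds: int, soft_limit: Optional[int], threshold: float = 0.8) -> bool:
    """True when descriptor usage has reached `threshold` of the soft limit."""
    if soft_limit is None:  # unlimited: nothing to exhaust
        return False
    if soft_limit <= 0:
        raise ValueError("soft_limit must be positive")
    return open_fds / soft_limit >= threshold
```

In practice you would read `/proc/<pid>/limits` for the running node's PID, pass the text to `parse_nofile_soft_limit`, and feed the result together with `count_open_fds(pid)` into `fd_alert` from a cron job or exporter.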
3. Storage discipline
- Separate volumes for DB and logs. Never let logs fill the DB partition.
- NVMe SSDs only. Spinning rust will not keep up with execution.
- Snapshot regularly, using `reth db checkpoint` if your Reth version provides it (confirm with `reth db --help`) or filesystem-level snapshots if you can pause writes.
- Plan for growth. Reth's full state is hundreds of GB and growing.
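"Plan for growth" becomes actionable once it is a number. The sketch below estimates how many days remain before the DB volume hits a reserve threshold. The function name, the reserve fraction, and the linear-growth assumption are all illustrative.

```python
def days_until_full(
    total_bytes: int,
    used_bytes: int,
    growth_bytes_per_day: float,
    reserve_fraction: float = 0.25,
) -> float:
    """Days until usage reaches (1 - reserve_fraction) of the volume.

    Assumes linear growth, which is a simplification: real chain data
    grows with traffic. Returns infinity if the volume is not growing.
    """
    if not 0.0 <= reserve_fraction < 1.0:
        raise ValueError("reserve_fraction must be in [0, 1)")
    if growth_bytes_per_day <= 0:
        return float("inf")
    usable = total_bytes * (1.0 - reserve_fraction)
    remaining = max(usable - used_bytes, 0.0)
    return remaining / growth_bytes_per_day
```

The inputs can come from `shutil.disk_usage("/var/lib/reth")` (its `total` and `used` fields) plus a growth rate measured from two readings a day apart. Alerting when the result drops below your hardware lead time is one reasonable policy.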
4. Monitoring
What to alert on:
| Metric | Alert if |
|---|---|
| Sync lag (head vs network) | > N blocks for > M minutes |
| Peer count | < 5 |
| MDBX free pages | < 5% |
| Process RSS | trending up monotonically |
| Block import time | p99 > target |
| ExEx height behind tip | depends on ExEx |
Reth exposes Prometheus metrics out of the box; wire them to Grafana with alerting on rates of change, not just absolute values.
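Two of the conditions in the table ("for > M minutes" and "trending up") are about time, not a single reading. The sketch below shows both ideas on plain `(timestamp, value)` samples. It is independent of Prometheus, and the function names are hypothetical. In a real setup the same logic would normally live in alerting rules rather than in application code.

```python
from typing import Sequence, Tuple

Sample = Tuple[float, float]  # (unix_seconds, value), sorted by time


def sustained_above(samples: Sequence[Sample], threshold: float, min_seconds: float) -> bool:
    """True if the latest run of samples above `threshold` spans >= min_seconds."""
    breach_start = None
    for ts, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = ts  # a new breach run begins here
        else:
            breach_start = None  # any recovery resets the run
    if breach_start is None:
        return False
    return samples[-1][0] - breach_start >= min_seconds


def slope_per_second(samples: Sequence[Sample]) -> float:
    """Least-squares slope of value over time; > 0 means an upward trend."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    num = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    if den == 0:
        raise ValueError("all samples share one timestamp")
    return num / den
```

`sustained_above` fits the sync-lag row (lag above N blocks for M minutes). `slope_per_second` over a long window is one way to turn "RSS trending up" into a threshold you can page on.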
5. Diff testing
If your fork modifies execution, run continuous diff testing against vanilla Reth on the same blocks:
# Pseudo-code for a diff harness
for block in mainnet[recent_1000]:
    s1 = reth_vanilla.execute(block)
    s2 = reth_fork.execute(block)
    if s1.stateRoot != s2.stateRoot:
        alert("divergence at block", block, s1, s2)
Any unintended divergence, even in a single storage slot, is a consensus bug. For an App-chain, a consensus bug means a chain halt.
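The harness idea can be sketched in runnable form as follows. The two executors are passed in as plain callables, so nothing here assumes a particular Reth API. `ExecutionResult`, `Divergence`, and `first_divergence` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class ExecutionResult:
    block_number: int
    state_root: str  # hex string, e.g. "0xabc..."


@dataclass(frozen=True)
class Divergence:
    block_number: int
    vanilla_root: str
    fork_root: str


# An executor maps a block number to the post-state root it computed.
Executor = Callable[[int], ExecutionResult]


def first_divergence(
    blocks: Iterable[int], vanilla: Executor, fork: Executor
) -> Optional[Divergence]:
    """Return the first block where the two state roots differ, else None.

    Stops at the first mismatch: every later block builds on the diverged
    state, so later mismatches carry no extra information.
    """
    for number in blocks:
        a = vanilla(number)
        b = fork(number)
        if a.state_root.lower() != b.state_root.lower():
            return Divergence(number, a.state_root, b.state_root)
    return None
```

One way to supply the executors, assuming both nodes expose standard JSON-RPC, is to call `eth_getBlockByNumber` on each and read the `stateRoot` field of the returned block. The comparison only makes sense for blocks where your fork is supposed to behave identically to upstream.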
6. Deployment topology for an App-chain
Minimum:
- ≥ 4 validators in 3 datacenters
- 2 sentries in front of each validator
- A separate archive node (not a validator) for analytics queries
- A separate RPC fleet with rate-limiting and a CDN
Don't run the validator and the public RPC on the same machine. One DDoS and your chain stalls.
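The minimum topology above can be written down as a checkable inventory. The sketch below validates a hypothetical list of validators against those rules. The `Validator` type, its field names, and the default thresholds are assumptions that mirror the bullets, not an existing tool.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Validator:
    name: str
    datacenter: str
    sentries: int
    serves_public_rpc: bool = False


def topology_problems(
    validators: Sequence[Validator],
    min_validators: int = 4,
    min_datacenters: int = 3,
    min_sentries: int = 2,
) -> List[str]:
    """Return human-readable violations of the minimum topology (empty = OK)."""
    problems: List[str] = []
    if len(validators) < min_validators:
        problems.append(f"{len(validators)} validators, need >= {min_validators}")
    datacenters = {v.datacenter for v in validators}
    if len(datacenters) < min_datacenters:
        problems.append(f"{len(datacenters)} datacenters, need >= {min_datacenters}")
    for v in validators:
        if v.sentries < min_sentries:
            problems.append(f"{v.name}: {v.sentries} sentries, need >= {min_sentries}")
        if v.serves_public_rpc:
            problems.append(f"{v.name}: validator must not serve public RPC")
    return problems
```

Running a check like this in CI against your infrastructure inventory catches drift, such as a sentry that was decommissioned and never replaced.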
7. Upgrade procedure
The hardest part of running a fork is upgrading it without halting the chain.
- Announce a target block height for activation
- Ship the new binary to validators well ahead of that height, with the new rule present but NOT yet enabled
- At the activation block, the consensus rule changes, guarded by the height check
- Validators that haven't upgraded keep following the old rules and fall out of consensus, which is why height-gating plus early announcement matters
This is the same mechanism Ethereum hard forks use (since the Merge, Ethereum gates on timestamp rather than block height); an App-chain is no different, just smaller in scale.
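Height-gating can be sketched in a few lines. The example below is not Reth's chainspec API. `ForkSchedule` and `gas_limit_for` are hypothetical names, and the gas-limit values are placeholders standing in for whatever rule your fork changes.

```python
from dataclasses import dataclass
from typing import Optional

OLD_GAS_LIMIT = 30_000_000  # placeholder value for the pre-fork rule
NEW_GAS_LIMIT = 60_000_000  # placeholder value for the post-fork rule


@dataclass(frozen=True)
class ForkSchedule:
    # None means the fork ships in the binary but is not scheduled yet.
    activation_height: Optional[int] = None

    def is_active(self, block_number: int) -> bool:
        """The new rule applies from the activation block onward."""
        return (
            self.activation_height is not None
            and block_number >= self.activation_height
        )


def gas_limit_for(block_number: int, schedule: ForkSchedule) -> int:
    """Pick the consensus rule from the block height alone.

    Every upgraded validator evaluates the same pure function of the
    height, so they all switch rules on exactly the same block.
    """
    return NEW_GAS_LIMIT if schedule.is_active(block_number) else OLD_GAS_LIMIT
```

The key property is that the rule depends only on the block being processed, never on wall-clock time or on when the operator restarted the node. A binary deployed with `activation_height=None` behaves exactly like the old one, which is what makes shipping ahead of activation safe.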
8. Reading list
- Reth Book "Run a node" + "Custom chain" sections
- Validator ops post-mortems from any major chain incident; they're gold for ops intuition
You now have a complete picture: develop, profile, extend, deploy, monitor. Welcome to the small club.
Final check: in one sentence, why is "diff testing against vanilla Reth" the highest-value test you can write for a fork? What class of bug does it catch that no unit test ever will? If your answer doesn't mention "consensus" or "the only output that matters is stateRoot," re-read Section 5.
Expert continuation
Running a Reth fork in production sits at the intersection of three Expert lessons in this course:
- Systems-code auditing — reviewing your fork diff with auditor discipline before every upgrade
- Chaos engineering for Rust EVM nodes — controlled failure injection so your fork survives the same failure modes upstream Reth survives
- Open-source contributor workflow — upstreaming generic fixes back to upstream Reth (paradigmxyz/reth) so your fork's diff stays small
Summary (3 lines)
- Production Reth fork = chainspec + upgrade path + monitoring + emergency response.
- Monitoring: Prometheus + Grafana + alerts. Emergency procedures documented + practiced.
- Coordination via Telegram/Discord + status page. Production examples: OP / Hyperliquid / Tempo / Berachain.