FABRKNT
Reth Expert — Production Engineering
Lesson 11 of 25 · CONTENT · 18 min · 40 XP

Treat this page as a workbench, not a blog post. The goal is to extract a reusable mental model from the source and carry it into the rest of the Fabrknt stack.

Lesson 11 — Running a Reth fork in production

Question

What does it take to run a custom Reth fork in production? Four disciplines: chainspec, upgrade path, monitoring, and emergency response.

Principle (minimum model)

  • Chainspec. Custom genesis + custom forks. Deterministic; reproducible.
  • Upgrade path. Add a new fork at a future block. Coordinate with users; staged rollout.
  • Monitoring. Prometheus + Grafana. Metrics: sync time, peer count, mempool size, gas usage.
  • Alerts. Page on critical failures (sync stalled > 1 hour, peer count < 10, OOM).
  • Emergency response. Documented procedures for: stuck chain, fork resolution, key compromise. Practiced.
  • Coordination. Telegram / Discord channels for validators. Status page for users.
  • Validator discipline. Documented procedures + on-call rotation + post-mortem culture.
  • Production examples. OP Mainnet + Hyperliquid + Tempo + Berachain. Validator (or sequencer) counts differ widely between them; all of them need 24/7 ops.
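
The "deterministic; reproducible" requirement on the chainspec can be checked mechanically. The following is a minimal sketch, assuming a geth-style genesis JSON; the chain ID, fork values, and allocation are made-up illustration values, and `chainspec_fingerprint` is a hypothetical helper, not a Reth API. It hashes a canonical encoding so every operator can compare one short string. Note that this fingerprint is not the genesis block hash.

```python
import hashlib
import json

def chainspec_fingerprint(genesis: dict) -> str:
    """Hash a canonical JSON encoding so key order and whitespace do not matter."""
    canonical = json.dumps(genesis, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Hypothetical genesis, for illustration only.
genesis = {
    "config": {
        "chainId": 424242,
        "londonBlock": 0,
        "shanghaiTime": 0,
        "cancunTime": 0,
    },
    "gasLimit": "0x1c9c380",
    "alloc": {
        "0x0000000000000000000000000000000000000001": {"balance": "0x1"},
    },
}

# Same content with a different key order, as another operator might have saved it.
reordered = {
    "alloc": genesis["alloc"],
    "gasLimit": genesis["gasLimit"],
    "config": dict(reversed(list(genesis["config"].items()))),
}

same = chainspec_fingerprint(genesis) == chainspec_fingerprint(reordered)
print("fingerprints match:", same)
```

Publishing the fingerprint next to each release gives validators a quick way to confirm they all start from the same file before the node ever boots.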

Worked example + steps

Running a Reth fork in production

It's 3 a.m. Your validator stopped producing blocks 40 minutes ago. The dashboard shows: file-descriptor exhaustion, MDBX page-cache pressure, peer count at 2. None of these would have shown up in unit tests. None of them would have shown up the day you shipped. They show up at month 3, all at once. This lesson is the ops checklist that prevents that 3 a.m. page — build flags, systemd limits, diff testing against vanilla, the deployment topology that lets your fork survive contact with reality.

1. Build & release pipeline

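# Optimized release build
# Note: target-cpu=native ties the binary to the build host's CPU features,
# so build on hardware that matches your validators.
RUSTFLAGS="-C target-cpu=native -C codegen-units=1" \
  cargo build --release --features jemalloc,asm-keccak

strip target/release/reth   # or use objcopy to keep debug info in a separate file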
| Flag | Why |
| --- | --- |
| -C target-cpu=native | use AVX2/AVX-512 if the build host has it; the binary then requires those CPU features wherever it runs |
| -C codegen-units=1 | better optimization at the cost of build time |
| --features jemalloc | tail-latency stability under load |
| --features asm-keccak | hand-tuned assembly for keccak; measurable on the hot path |
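
Because a `target-cpu=native` binary may use any instruction the build host supports, it is worth checking the fleet before shipping. A minimal sketch, assuming Linux hosts where CPU features are listed on the `flags` line of `/proc/cpuinfo`; both helper names are hypothetical, and the comparison is conservative because it treats every listed flag as required.

```python
def missing_cpu_features(build_host_flags: str, target_host_flags: str) -> set:
    """Features the build host has that the target host lacks."""
    return set(build_host_flags.split()) - set(target_host_flags.split())

def read_cpu_flags(path: str = "/proc/cpuinfo") -> str:
    """Linux only: return the feature flags of the first CPU listed."""
    with open(path) as f:
        for line in f:
            if line.startswith("flags"):
                return line.split(":", 1)[1].strip()
    return ""

# Made-up flag strings standing in for two machines.
build_host = "sse4_2 avx avx2 avx512f"
validator = "sse4_2 avx avx2"

gap = missing_cpu_features(build_host, validator)
if gap:
    print("do not ship a native build here; validator lacks:", sorted(gap))
```

If the gap is non-empty, either build on the oldest hardware in the fleet or pick an explicit `target-cpu` that every validator supports.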

2. Systemd unit (or equivalent)

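[Service]
# --chain accepts a built-in chain name or a path to a chainspec JSON file; "custom" is a placeholder
ExecStart=/usr/local/bin/reth node --chain custom --datadir /var/lib/reth
Restart=on-failure
# Raise the descriptor and task limits well above the defaults
LimitNOFILE=1048576
LimitNPROC=infinity
TasksMax=infinity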

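The file-descriptor limit matters: Reth keeps many files open (database and static files) plus a socket for every P2P peer and RPC client, so the common default soft limit of 1024 is easy to exhaust.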
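
To see how close a process is to its descriptor limit, the arithmetic is simple. A minimal sketch, assuming Linux (it reads `/proc/<pid>/fd`); `fd_headroom` and `count_open_fds` are hypothetical helpers, and the numbers in the demo are made up.

```python
import os
import resource

def fd_headroom(open_fds: int, soft_limit: int) -> float:
    """Fraction of the descriptor limit still unused, between 0.0 and 1.0."""
    if soft_limit <= 0:
        raise ValueError("soft_limit must be positive")
    return max(0.0, 1.0 - open_fds / soft_limit)

def count_open_fds(pid="self") -> int:
    """Linux only: count entries under /proc/<pid>/fd (needs permission for other PIDs)."""
    return len(os.listdir(f"/proc/{pid}/fd"))

def own_soft_limit() -> int:
    """Soft RLIMIT_NOFILE of the current process."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft

# Made-up reading: 900k descriptors open against the limit from the unit file.
headroom = fd_headroom(open_fds=900_000, soft_limit=1_048_576)
if headroom < 0.2:
    print(f"warning: only {headroom:.0%} of the descriptor limit left")
```

Pointing `count_open_fds` at the node's PID from a cron job or exporter turns file-descriptor exhaustion into a trend you can alert on instead of a 3 a.m. surprise.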

3. Storage discipline

  • Separate volumes for DB and logs. Never let logs fill the DB partition.
  • NVMe SSDs only. Spinning rust will not keep up with execution.
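  • Snapshot regularly. Prefer filesystem-level snapshots taken while the node is stopped or writes are paused; if you rely on a CLI command such as reth db checkpoint, first confirm with reth db --help that your version ships it.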
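  • Plan for growth. State only grows: on Ethereum mainnet a Reth full node already needs over 1 TB of disk and an archive node over 2 TB, so size App-chain volumes with headroom and track the growth rate.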
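
"Plan for growth" becomes actionable once it is a number. A minimal sketch of a linear forecast, assuming you record used bytes on the DB volume once a day; the helper names and sample values are illustrative, and real growth is rarely perfectly linear.

```python
import shutil

def growth_per_day(samples):
    """Average daily growth from (day_number, used_bytes) samples, oldest first."""
    (d0, u0), (d1, u1) = samples[0], samples[-1]
    if d1 <= d0:
        raise ValueError("need samples spanning at least one day")
    return (u1 - u0) / (d1 - d0)

def days_until_full(free_bytes, growth_bytes_per_day):
    """Days of runway left at the current growth rate (infinite if not growing)."""
    if growth_bytes_per_day <= 0:
        return float("inf")
    return free_bytes / growth_bytes_per_day

def free_bytes_on(path):
    """Free space on the volume holding `path`, via the standard library."""
    return shutil.disk_usage(path).free

GB = 10**9
samples = [(0, 900 * GB), (30, 960 * GB)]  # made-up: 60 GB of growth in 30 days
runway = days_until_full(400 * GB, growth_per_day(samples))
print(f"about {runway:.0f} days until the DB volume is full")
```

In practice you would feed `free_bytes_on("/var/lib/reth")` into `days_until_full` and alert when the runway drops below your procurement lead time.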

4. Monitoring

What to alert on:

| Metric | Alert if |
| --- | --- |
| Sync lag (head vs network) | > N blocks for > N minutes |
| Peer count | < 5 |
| MDBX free pages | < 5% |
| Process RSS | trending up monotonically |
| Block import time | p99 > target |
| ExEx height behind tip | depends on ExEx |

Reth exposes Prometheus metrics once you start the node with the --metrics flag (for example --metrics 127.0.0.1:9001); wire them to Grafana and alert on rates of change, not just absolute values.
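
Two of the conditions in the table, "above a threshold for N minutes" and "trending up", are about time rather than a single reading. A minimal sketch of both checks over `(timestamp_seconds, value)` samples; the function names, thresholds, and sample data are illustrative assumptions. In a real deployment the alerting system would normally evaluate these for you.

```python
def rate_per_second(samples):
    """Average rate of change between the first and last (timestamp, value) sample."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("samples must span a positive time interval")
    return (v1 - v0) / (t1 - t0)

def sustained_breach(samples, threshold, min_duration_s):
    """True if the value has stayed above threshold for the latest min_duration_s seconds."""
    if not samples:
        return False
    last_t = samples[-1][0]
    breach_start = None
    # Walk backwards from the newest sample until the breach is interrupted.
    for t, value in reversed(samples):
        if value <= threshold:
            break
        breach_start = t
    return breach_start is not None and last_t - breach_start >= min_duration_s

# Made-up sync-lag samples (blocks behind), one per minute.
sync_lag = [(0, 1), (60, 12), (120, 15), (180, 20)]
# Made-up RSS samples in bytes.
rss = [(0, 1000), (100, 1500)]

print("page on sync lag:", sustained_breach(sync_lag, threshold=10, min_duration_s=120))
print("RSS growth in bytes/s:", rate_per_second(rss))
```

The "for N minutes" condition avoids paging on a single slow block, while the rate check catches a slow leak long before RSS hits an absolute limit.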

5. Diff testing

If your fork modifies execution, run continuous diff testing against vanilla Reth on the same blocks:

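# Sketch of a diff harness. execute_vanilla / execute_fork are placeholders
# for however you drive each binary; they are not Reth APIs.
def diff_test(blocks, execute_vanilla, execute_fork, alert):
    for block in blocks:
        s1 = execute_vanilla(block)
        s2 = execute_fork(block)
        # The state root commits to the entire post-execution state.
        if s1["stateRoot"] != s2["stateRoot"]:
            alert("divergence at block", block, s1, s2)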

Any unintended divergence, even a single storage slot, is a consensus bug, and on an App-chain a consensus bug can halt the chain.
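
Once a divergence is detected, the next question is where it started. A minimal sketch of a binary search for the first diverging height, assuming that state roots stay different at every height after the first mismatch (the usual case, since each root builds on the previous state). `root_a` and `root_b` are hypothetical callables returning the state root each node reports for a height.

```python
def first_divergence(lo, hi, root_a, root_b):
    """Smallest height in [lo, hi] where the two roots differ, or None if hi matches."""
    if root_a(hi) == root_b(hi):
        return None
    # Invariant: the roots differ at `hi`, and the first mismatch lies in [lo, hi].
    while lo < hi:
        mid = (lo + hi) // 2
        if root_a(mid) != root_b(mid):
            hi = mid
        else:
            lo = mid + 1
    return lo

# Fake nodes for illustration: the fork starts to disagree at height 17.
def vanilla_root(height):
    return f"root-{height}"

def fork_root(height):
    return f"root-{height}" if height < 17 else f"forked-{height}"

print("first diverging block:", first_divergence(0, 100, vanilla_root, fork_root))
```

With 1,000 recent blocks this needs about ten root comparisons instead of a linear scan, which matters when each lookup is a remote call.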

6. Deployment topology for an App-chain

Minimum:

  • ≥ 4 validators in 3 datacenters
  • 2 sentries in front of each validator
  • A separate archive node (not a validator) for analytics queries
  • A separate RPC fleet with rate-limiting and a CDN

Don't run the validator and the public RPC on the same machine. One DDoS and your chain stalls.
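
How validators are spread across datacenters decides whether the chain stays live when one site goes down. The lesson does not name a consensus protocol, so the following sketch assumes a BFT-style protocol that needs strictly more than 2/3 of equally weighted validators to make progress; the function names are hypothetical.

```python
def quorum_size(n_validators: int) -> int:
    """Smallest vote count strictly greater than 2/3 of n (equal voting weight)."""
    return (2 * n_validators) // 3 + 1

def survives_datacenter_loss(validators_per_dc) -> bool:
    """True if a quorum remains after losing the largest datacenter."""
    total = sum(validators_per_dc)
    remaining = total - max(validators_per_dc)
    return remaining >= quorum_size(total)

layouts = {
    "4 validators, 3 datacenters": [2, 1, 1],
    "4 validators, 4 datacenters": [1, 1, 1, 1],
    "7 validators, 4 datacenters": [2, 2, 2, 1],
}
for name, layout in layouts.items():
    print(name, "->", "stays live" if survives_datacenter_loss(layout) else "halts")
```

Under that assumption, four validators in three datacenters keep the chain live through the loss of any single validator, but not through the loss of the datacenter that hosts two of them; treat the list above as a floor and check your own protocol's quorum rule.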

7. Upgrade procedure

The hardest part of running a fork is upgrading it without halting the chain.

  1. Announce a target block height for activation
  2. Ship the new binary to validators ahead of time, with the new rule present but NOT yet active
  3. At the activation block, the consensus rule changes — guarded by the height check
  4. Validators that haven't upgraded fall off — that's why height-gating + announcement matters

This is the same pattern Ethereum hard forks follow (gated by block number before the Merge, by timestamp since Shanghai); an App-chain is no different, just smaller in scale.
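
The height gate in the steps above can be sketched in a few lines. This is an illustration only: `ForkSchedule`, `max_gas_limit`, and the gas-limit values are hypothetical and do not come from Reth's chainspec types.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ForkSchedule:
    """Activation height announced to validators in advance."""
    activation_height: int

    def is_active(self, block_height: int) -> bool:
        # The rule switches at the activation block itself, not one block later.
        return block_height >= self.activation_height

def max_gas_limit(schedule: ForkSchedule, block_height: int) -> int:
    """Hypothetical consensus rule that changes at the fork."""
    return 60_000_000 if schedule.is_active(block_height) else 30_000_000

schedule = ForkSchedule(activation_height=1_000_000)
for height in (999_999, 1_000_000):
    print(height, max_gas_limit(schedule, height))
```

Because the decision depends only on the block height, every upgraded validator switches rules at the same block regardless of when its binary was installed, which is what makes shipping the binary early safe.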

8. Reading list

You now have a complete picture: develop, profile, extend, deploy, monitor. Welcome to the small club.

Final check: in one sentence, why is "diff testing against vanilla Reth" the highest-value test you can write for a fork? What class of bug does it catch that no unit test ever will? If your answer doesn't mention "consensus" or "the only output that matters is stateRoot," re-read Section 5.

Expert continuation

Running a Reth fork in production sits at the intersection of three Expert lessons in this course:

Summary (3 lines)

  • Production Reth fork = chainspec + upgrade path + monitoring + emergency response.
  • Monitoring: Prometheus + Grafana + alerts. Emergency procedures documented + practiced.
  • Coordination via Telegram/Discord + status page. Production examples: OP / Hyperliquid / Tempo / Berachain.