Lesson 2 — MDBX & storage internals
Question
Reth uses MDBX (libmdbx) for state storage. B+tree on disk; mmap-backed; COW transactions. Why this choice + production tuning.
Principle (minimum model)
- MDBX = libmdbx, a B+tree key-value store forked from LMDB (whose API is loosely modeled on Berkeley DB). Mmap-backed, so the OS manages paging.
- COW (copy-on-write) transactions. A writer modifies copies of pages and publishes them on commit; readers keep a consistent older snapshot. No locks for readers.
- Why MDBX over RocksDB / LMDB? Simpler ops profile; better Rust bindings (reth-mdbx); deterministic perf.
- Disk format. Pages (typically 4 KB) organised in a B+tree. Keys are stored in sorted order, so data that is read together should be keyed to sort together.
- Tuning. map_size (max mmap), sync_period (durability vs perf trade), max readers (concurrency).
- Production gotchas. Mmap exhaustion on 32-bit; transaction expiry on long-running reads; RAM left for the OS page cache to hold high-traffic keys (MDBX keeps no separate cache of its own).
- Reth conventions. Table per data type (accounts / storage / receipts); per-table compaction.
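The snapshot behaviour in the COW bullet can be sketched with nothing but the standard library. This is a toy model under loud assumptions: `CowStore` is a hypothetical name, it copies the whole map on every commit (MDBX copies only the touched pages), and it uses a brief `RwLock` to swap the version pointer, which MDBX readers do not need. It shows only the semantics: a reader that took a snapshot before a commit keeps seeing the old data.

```rust
use std::collections::BTreeMap;
use std::sync::{Arc, RwLock};

/// Toy copy-on-write store: every commit publishes a new immutable version.
/// Models snapshot semantics only, not MDBX's page-level copy-on-write.
pub struct CowStore {
    current: RwLock<Arc<BTreeMap<u64, u64>>>,
}

impl CowStore {
    pub fn new() -> Self {
        Self { current: RwLock::new(Arc::new(BTreeMap::new())) }
    }

    /// A reader takes a handle to the current version and can keep it
    /// for as long as it likes; later commits never mutate it.
    pub fn snapshot(&self) -> Arc<BTreeMap<u64, u64>> {
        let guard = self.current.read().unwrap();
        Arc::clone(&*guard)
    }

    /// A writer copies the current version, mutates the copy,
    /// then publishes the copy as the new current version.
    pub fn commit(&self, key: u64, value: u64) {
        let mut guard = self.current.write().unwrap();
        let mut next = (**guard).clone();
        next.insert(key, value);
        *guard = Arc::new(next);
    }
}
```

Taking a snapshot, committing a new value for the same key, and then reading through the old snapshot returns the old value; a fresh snapshot returns the new one. The old version is freed only when the last reader drops its handle, which is the same reason long-lived readers keep old pages alive in MDBX.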
Worked example + steps
MDBX & storage internals
Every account balance, every storage slot, every receipt — Reth keeps all of it in one key-value store: MDBX. Not Postgres, not RocksDB, not a custom format. MDBX is a memory-mapped B+tree (a balanced tree where each node holds multiple keys to fit a disk page) descended from LMDB. The whole 500GB database is exposed to your Rust code as if it were a giant in-memory slice — the OS handles the disk-vs-RAM dance through mmap.
Understanding MDBX is what separates "I can use Reth" from "I can extend Reth."
1. Why MDBX (not LevelDB / RocksDB)?
| Feature | RocksDB | MDBX |
|---|---|---|
| Architecture | LSM tree | B+tree, mmap'd |
| Read latency | Variable (compactions) | Predictable |
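| Write amplification | High (compaction rewrites data repeatedly) | Lower (no compaction; COW rewrites only the touched page path) |
| Crash safety | WAL replay on restart | ACID via MVCC; commits are atomic |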
| Read concurrency | Locks | Lock-free reads |
Reth picks MDBX because Ethereum is read-heavy and latency-sensitive. LSM trees (the log-structured-merge design that RocksDB and LevelDB use: fast writes, periodic background rewrites) handle writes well, but compactions, the background jobs that rewrite whole levels, compete for I/O and can stall foreground reads and writes. Those stalls hurt sync speed and validator latency.
2. Reth's actual Database trait
From crates/storage/db-api/src/database.rs:
```rust
pub trait Database: Send + Sync + Debug {
    type TX: DbTx + Send + Sync + Debug + 'static;
    type TXMut: DbTxMut + DbTx + TableImporter + Send + Sync + Debug + 'static;

    #[track_caller]
    fn tx(&self) -> Result<Self::TX, DatabaseError>;

    #[track_caller]
    fn tx_mut(&self) -> Result<Self::TXMut, DatabaseError>;

    fn path(&self) -> PathBuf;
    fn oldest_reader_txnid(&self) -> Option<u64>;
    fn last_txnid(&self) -> Option<u64>;
}
```
Read this carefully:
- Two associated transaction types: TX (read-only) and TXMut (read-write). Different methods on each. The split makes calling put on a read transaction a compile-time error.
- oldest_reader_txnid: exposes the oldest still-active read transaction. Operators use this to detect long-running readers that block GC.
- #[track_caller]: lets the method see the caller's source location, so a failure to open a tx is attributed to the calling line, not to the trait method. Real production debugging discipline.
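To make the #[track_caller] point concrete, here is a minimal sketch using only the standard library. `open_tx` and `OpenError` are hypothetical stand-ins, not Reth's types; the only real API used is `std::panic::Location::caller()`, which returns the call site when the enclosing function carries the attribute.

```rust
use std::panic::Location;

/// Hypothetical error type that records where the failing call was made.
#[derive(Debug)]
pub struct OpenError {
    pub file: &'static str,
    pub line: u32,
}

/// Hypothetical stand-in for opening a transaction.
/// Because of `#[track_caller]`, `Location::caller()` reports the line
/// where `open_tx` was called, not a line inside this function.
#[track_caller]
pub fn open_tx(fail: bool) -> Result<(), OpenError> {
    if fail {
        let loc = Location::caller();
        return Err(OpenError { file: loc.file(), line: loc.line() });
    }
    Ok(())
}
```

Without the attribute, the recorded location would always point into the body of `open_tx`, which tells you nothing about which of many call sites triggered the failure.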
3. DbTx and DbTxMut — the actual operations
From crates/storage/db-api/src/transaction.rs:
```rust
// DbTx (read-only)
fn get<T: Table>(&self, key: T::Key) -> Result<Option<T::Value>, DatabaseError>;
fn get_by_encoded_key<T: Table>(
    &self,
    key: &<T::Key as Encode>::Encoded,
) -> Result<Option<T::Value>, DatabaseError>;
fn commit(self) -> Result<(), DatabaseError>;
fn abort(self);
fn cursor_read<T: Table>(&self) -> Result<Self::Cursor<T>, DatabaseError>;
fn cursor_dup_read<T: DupSort>(&self) -> Result<Self::DupCursor<T>, DatabaseError>;
fn entries<T: Table>(&self) -> Result<usize, DatabaseError>;
fn disable_long_read_transaction_safety(&mut self);

// DbTxMut (read-write)
fn put<T: Table>(&self, key: T::Key, value: T::Value) -> Result<(), DatabaseError>;
fn append<T: Table>(&self, key: T::Key, value: T::Value) -> Result<(), DatabaseError>;
fn delete<T: Table>(&self, key: T::Key, value: Option<T::Value>) -> Result<bool, DatabaseError>;
fn clear<T: Table>(&self) -> Result<(), DatabaseError>;
fn cursor_write<T: Table>(&self) -> Result<Self::CursorMut<T>, DatabaseError>;
fn cursor_dup_write<T: DupSort>(&self) -> Result<Self::DupCursorMut<T>, DatabaseError>;
```
Four things matter most:
<T: Table> — table is a type, not a string
Each table is a Rust type that implements the Table trait. The compiler enforces "key/value types must match this table's schema." A typo in a table name is a compile error.
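A stripped-down sketch of the "table is a type" idea, assuming a much simpler trait than Reth's. Everything here (`Table` with its encode/decode functions, `BlockHashes`, `ToyTx`) is hypothetical and exists only to show how a generic parameter ties a key type and a value type to a table name.

```rust
use std::collections::BTreeMap;

/// Hypothetical, simplified typed-table trait (not Reth's actual `Table`).
pub trait Table {
    const NAME: &'static str;
    type Key;
    type Value;
    fn encode_key(key: &Self::Key) -> Vec<u8>;
    fn encode_value(value: &Self::Value) -> Vec<u8>;
    fn decode_value(bytes: &[u8]) -> Self::Value;
}

/// Illustrative table: block number -> 32-byte hash.
pub struct BlockHashes;

impl Table for BlockHashes {
    const NAME: &'static str = "BlockHashes";
    type Key = u64;
    type Value = [u8; 32];

    // Big-endian encoding keeps byte order equal to numeric order.
    fn encode_key(key: &u64) -> Vec<u8> {
        key.to_be_bytes().to_vec()
    }
    fn encode_value(value: &[u8; 32]) -> Vec<u8> {
        value.to_vec()
    }
    fn decode_value(bytes: &[u8]) -> [u8; 32] {
        let mut out = [0u8; 32];
        out.copy_from_slice(bytes); // panics if the stored value is not 32 bytes
        out
    }
}

/// Toy transaction: untyped bytes underneath, typed access on top.
#[derive(Default)]
pub struct ToyTx {
    rows: BTreeMap<(&'static str, Vec<u8>), Vec<u8>>,
}

impl ToyTx {
    pub fn put<T: Table>(&mut self, key: T::Key, value: T::Value) {
        self.rows.insert((T::NAME, T::encode_key(&key)), T::encode_value(&value));
    }

    pub fn get<T: Table>(&self, key: T::Key) -> Option<T::Value> {
        self.rows
            .get(&(T::NAME, T::encode_key(&key)))
            .map(|bytes| T::decode_value(bytes))
    }
}
```

With this shape, `tx.put::<BlockHashes>(7, [0xAB; 32])` compiles, while passing a `String` as the value or misspelling the table type does not. The storage underneath is still plain bytes; the types live only at the API boundary.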
append vs put
put works for any key. append is only valid when the key is greater than the current max — but it's faster because it skips a B+tree search. When you're processing blocks sequentially, you use append; when reorging, you fall back to put.
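The difference between the two write paths can be modeled with a sorted `Vec`. This is a sketch, not MDBX's implementation: `SortedTable` is a hypothetical type, and a real B+tree append lands on the rightmost leaf rather than the end of an array. The contract is the same, though: `append` rejects any key that is not greater than the current maximum, and in exchange it skips the search.

```rust
/// Toy sorted table backed by a Vec kept in ascending key order.
#[derive(Default)]
pub struct SortedTable {
    rows: Vec<(u64, u64)>,
}

impl SortedTable {
    /// General insert: binary-search for the slot, then overwrite or insert.
    pub fn put(&mut self, key: u64, value: u64) {
        match self.rows.binary_search_by_key(&key, |&(k, _)| k) {
            Ok(i) => self.rows[i].1 = value,
            Err(i) => self.rows.insert(i, (key, value)),
        }
    }

    /// Fast path: valid only if `key` is greater than the current maximum.
    /// No search is needed, the entry goes straight to the end.
    pub fn append(&mut self, key: u64, value: u64) -> Result<(), String> {
        if let Some(&(last, _)) = self.rows.last() {
            if key <= last {
                return Err(format!(
                    "key {} is not greater than current max {}",
                    key, last
                ));
            }
        }
        self.rows.push((key, value));
        Ok(())
    }

    /// Keys in stored order, for inspection.
    pub fn keys(&self) -> Vec<u64> {
        self.rows.iter().map(|&(k, _)| k).collect()
    }
}
```

Appending keys 1 and 2 succeeds; appending 2 again fails, and the caller has to fall back to `put`, which is the situation a reorg creates when it rewrites entries below the current tip.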
Cursors
For range scans, you use a cursor instead of repeated get calls. A cursor positions itself in the B+tree once and walks neighboring entries — orders of magnitude faster than independent gets, because adjacent keys likely share the same page.
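As a rough illustration, the two access styles look like this over a standard-library `BTreeMap`, which stands in for an MDBX table here. The function names are hypothetical, and the sketch assumes dense keys (every integer in the range is present) so that both functions return the same rows.

```rust
use std::collections::BTreeMap;

/// Cursor style: seek to `start` once, then walk forward in key order.
pub fn scan_with_cursor(
    table: &BTreeMap<u64, u64>,
    start: u64,
    count: usize,
) -> Vec<(u64, u64)> {
    table
        .range(start..)
        .take(count)
        .map(|(k, v)| (*k, *v))
        .collect()
}

/// Point-lookup style: one independent tree search per key.
/// Matches the cursor scan only when keys are dense.
pub fn scan_with_gets(
    table: &BTreeMap<u64, u64>,
    start: u64,
    count: usize,
) -> Vec<(u64, u64)> {
    (start..)
        .take(count)
        .filter_map(|k| table.get(&k).map(|v| (k, *v)))
        .collect()
}
```

Both return the same data on a dense table, but the first descends the tree once and then steps between neighbours, while the second descends from the root for every key. On a disk-backed tree, each of those descents can touch pages the cursor walk never needs to revisit.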
disable_long_read_transaction_safety
A real-life ergonomic detail. A long-lived read transaction pins its snapshot, so the pages that snapshot references cannot be reclaimed and the DB file grows. Reth normally aborts read transactions that have been open too long. Call this when you really need a long snapshot (and accept the cost).
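Why one old reader makes the file grow can be shown with a small model. This is a sketch under stated assumptions: `FreeList` and its fields are hypothetical, and real MDBX bookkeeping is more involved. The rule it encodes is the MVCC one: a page version superseded at transaction t can be reused only once no active reader has a snapshot older than t.

```rust
/// Toy model of page reclamation under MVCC (names are hypothetical).
/// Each entry is the txn id at which some page version was superseded.
pub struct FreeList {
    retired_at: Vec<u64>,
}

impl FreeList {
    pub fn new(retired_at: Vec<u64>) -> Self {
        Self { retired_at }
    }

    /// Number of retired pages that may be reused right now.
    /// A reader whose snapshot is older than a page's retirement still
    /// sees that page, so the page must stay. With no readers, all go.
    pub fn reclaimable(&self, oldest_reader_txnid: Option<u64>) -> usize {
        match oldest_reader_txnid {
            None => self.retired_at.len(),
            Some(oldest) => self
                .retired_at
                .iter()
                .filter(|&&retired| retired <= oldest)
                .count(),
        }
    }
}
```

With pages retired at transactions 5, 10, 15 and 20, a reader parked at transaction 3 pins all four, a reader at 12 pins two, and no reader at all frees everything. A writer that cannot reuse pinned pages has to extend the file instead.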
4. Why this matters for hot paths
Because reads are mmap'd:
- A "warm" header lookup is a pointer dereference, not a syscall
- The OS page cache becomes your read cache for free
- Locality matters: keep related data on the same page
Reth's tables are designed so that Execution-stage reads (account → storage → code) hit pages that are already warm.
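The locality point comes down to simple arithmetic: a read at byte offset `off` lands on page `off / page_size`, and each distinct cold page is a potential page fault. The sketch below counts distinct pages for a set of read offsets. `PAGE_SIZE` is an assumed 4 KB, and the function ignores reads that straddle a page boundary.

```rust
use std::collections::BTreeSet;

/// Assumed page size for this sketch; the real value depends on configuration.
pub const PAGE_SIZE: u64 = 4096;

/// Count the distinct pages touched by reads starting at the given offsets.
pub fn pages_touched(offsets: &[u64]) -> usize {
    offsets
        .iter()
        .map(|off| off / PAGE_SIZE)
        .collect::<BTreeSet<u64>>()
        .len()
}
```

Sixty-four reads spaced 64 bytes apart all fall on one page, so at most one fault brings in everything. Sixty-four reads spaced ten pages apart touch sixty-four different pages. Same number of lookups, very different cost when the pages are cold.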
5. Pitfalls
- Long read transactions block writers' garbage collection. Don't keep a read tx open for hours; the DB grows.
- Page size and key ordering matter. B+tree fanout depends on key size; a 200-byte key is a different beast than a 32-byte one.
- mmap means OS pressure. A 500GB DB on a 16GB machine will thrash unless your access pattern is local.
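The key-size pitfall can be put into numbers with a back-of-the-envelope calculation. The per-entry overhead of 16 bytes and the one-billion-entry table are assumptions chosen for illustration; MDBX's actual page layout differs, so treat the results as orders of magnitude only.

```rust
/// Rough number of entries that fit on one page. `overhead` is an assumed
/// per-entry cost (child pointer, offsets), not MDBX's real layout.
pub fn fanout(page_size: u64, key_size: u64, overhead: u64) -> u64 {
    page_size / (key_size + overhead)
}

/// Smallest number of levels h such that per_page^h >= entries.
pub fn height(entries: u64, per_page: u64) -> u32 {
    assert!(per_page >= 2, "fanout must be at least 2");
    let mut levels = 0;
    let mut capacity: u64 = 1;
    while capacity < entries {
        capacity = capacity.saturating_mul(per_page);
        levels += 1;
    }
    levels
}
```

Under these assumptions a 4096-byte page holds 85 entries with 32-byte keys but only 18 with 200-byte keys. For one billion entries that is a tree of 5 levels versus 8, so every cold lookup with the larger keys may touch three more pages.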
6. The comparator: MegaETH's SALT
MDBX is the right default for a vanilla Reth node. But "right default" is not the same as "right for every chain." MegaETH replaced MDBX entirely with SALT (Small Authentication Large Trie) to push throughput beyond what a disk-backed B+tree allows.
The design contrast is worth holding in your head when you read either:
| Aspect | MDBX (Reth default) | SALT (MegaETH) |
|---|---|---|
| Form | Memory-mapped B+tree | Two-tier: 4-level complete 256-ary trie + SHI hash-table buckets |
| Storage model | All data on disk, OS pages it in via mmap | Authentication layer lives fully in memory (~1 GB per 3 B items); data sits in buckets |
| State-root update | Walks the MPT, touches many random disk pages | Bucket-local updates; eliminates random disk I/O during root recomputation |
| Trie shape | None — Reth maintains the MPT separately on top of MDBX | Trie is the storage; commitments are intrinsic |
| Insertion-order invariance | N/A (KV agnostic) | SHI (Strongly History-Independent) — canonical commitment regardless of insertion order |
| Strengths | Mature, crash-safe, ACID, deep tool ecosystem | Memory-efficient authentication at billion-scale, no random disk I/O on state roots |
| Trade-offs | Random I/O during state root updates becomes the bottleneck at high TPS | New (~2026 design), narrower deployment, sensitive to memory pressure |
The pedagogical point is not "SALT is better." It's that MDBX's design assumptions become visible only when you see what someone else chose differently and why. If you've only ever read one storage layer, you can't tell which decisions are essential vs. accidental.
Read megaeth-labs/salt alongside Reth's MDBX wrapper. The questions that surface when you do — "where does Reth pay for crash safety we don't need at high TPS?" "what does SALT give up to fit authentication in memory?" — are the design questions you'll face when extending Reth's storage layer for your own chain.
Drill
Open crates/storage/db-api/src/tables in the repo:
- Find the Headers table. Note its key (BlockNumber) and value (Header).
- Find a DupSort table. These are tables where one key has multiple values. Why does DupSort exist? What kind of data needs it?
- Trace one Execution-stage read through: which tables does it consult, in what order?
You'll come out the other side knowing where every byte of Ethereum state lives in Reth.
Final check: in one sentence, why does mmap let you treat a 500GB DB like a Rust slice? Where does the OS fit in? If you can't explain the page-fault → page-load mechanism, the "pointer dereference, not syscall" claim is words to you, not understanding.
Summary (3 lines)
- MDBX = libmdbx (a B+tree store forked from LMDB). Mmap-backed; COW transactions; no reader locks.
- Tuning: map_size + sync_period + max readers. Gotchas: mmap on 32-bit + long-running reads.
- Reth: table per data type + per-table compaction. Deterministic perf.