Lesson 2 — MDBX & storage internals

Question

Reth uses MDBX (libmdbx) for state storage. B+tree on disk; mmap-backed; COW transactions. Why this choice + production tuning.

Principle (minimum model)

MDBX = libmdbx, a fork of LMDB: a B+tree key-value store with a Berkeley DB-style API. Mmap-backed, so the OS manages paging.
COW (copy-on-write) transactions. The single writer builds new versions of the pages it changes; readers keep a consistent older snapshot. No locks for readers.
Why MDBX over RocksDB / LMDB? Simpler ops profile; maintained Rust bindings (reth-libmdbx); predictable read latency.
Disk format. Pages (typically 4 KB) organised in a B+tree. Keys are stored in sorted order, so entries with adjacent keys tend to share a page.
Tuning. map_size (max mmap), sync_period (durability vs perf trade), max readers (concurrency).
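Production gotchas. Mmap exhaustion on 32-bit; transaction expiry on long-running reads; no application-level cache, so hot keys stay fast only while their pages remain in the OS page cache.
Reth conventions. Table per data type (accounts / storage / receipts). There is no LSM-style compaction; pages freed by old versions are recycled through MDBX's free list.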
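
The COW snapshot behaviour in the model above can be sketched in std-only Rust. This is a toy under loud assumptions: CowStore, snapshot, and commit are hypothetical names, the whole map is copied on each commit where MDBX copies only the touched pages, and the brief mutex exists only to swap a pointer.

```rust
use std::collections::BTreeMap;
use std::sync::{Arc, Mutex};

/// Toy copy-on-write store. Hypothetical illustration, not MDBX's page-level mechanism.
struct CowStore {
    // Latest committed version. Readers clone the Arc, never the map itself.
    current: Mutex<Arc<BTreeMap<u64, String>>>,
}

impl CowStore {
    fn new() -> Self {
        Self { current: Mutex::new(Arc::new(BTreeMap::new())) }
    }

    /// Begin a read: pin the current version. Later commits cannot change it.
    fn snapshot(&self) -> Arc<BTreeMap<u64, String>> {
        Arc::clone(&self.current.lock().unwrap())
    }

    /// Commit a write: copy, modify the copy, then publish the new version.
    /// Holding the lock for the whole commit serialises writers (single writer).
    fn commit(&self, key: u64, value: &str) {
        let mut guard = self.current.lock().unwrap();
        let mut next = (**guard).clone();
        next.insert(key, value.to_string());
        *guard = Arc::new(next);
    }
}
```

A snapshot taken before a commit keeps returning the old values for as long as it is held, which is also why a long-lived snapshot keeps old versions alive.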

Worked example + steps

MDBX & storage internals

Every account balance, every storage slot, every receipt — Reth keeps all of it in one key-value store: MDBX. Not Postgres, not RocksDB, not a custom format. MDBX is a memory-mapped B+tree (a balanced tree where each node holds multiple keys to fit a disk page) descended from LMDB. The whole 500GB database is exposed to your Rust code as if it were a giant in-memory slice — the OS handles the disk-vs-RAM dance through mmap.

Understanding MDBX is what separates "I can use Reth" from "I can extend Reth."

1. Why MDBX (not LevelDB / RocksDB)?

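Feature	RocksDB	MDBX
Architecture	LSM tree	B+tree, mmap'd
Read latency	Variable (compactions)	Predictable
Write amplification	From compaction (data rewritten across levels)	From COW (each commit rewrites the touched pages and their path to the root)
Crash safety	WAL; durability depends on sync settings	ACID via COW + MVCC
Read concurrency	Snapshot reads that compete with compaction I/O	Lock-free reads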

Reth picks MDBX because Ethereum is read-heavy and latency-sensitive. LSM trees (the log-structured-merge design RocksDB and LevelDB use: fast writes, periodic background rewrites) do well at writes, but compactions compete for I/O and can stall writes while levels are rewritten. Those latency spikes hurt sync speed and validator latency.

2. Reth's actual `Database` trait

From crates/storage/db-api/src/database.rs:

pub trait Database: Send + Sync + Debug {
    type TX: DbTx + Send + Sync + Debug + 'static;
    type TXMut: DbTxMut + DbTx + TableImporter + Send + Sync + Debug + 'static;

    #[track_caller]
    fn tx(&self) -> Result<Self::TX, DatabaseError>;

    #[track_caller]
    fn tx_mut(&self) -> Result<Self::TXMut, DatabaseError>;

    fn path(&self) -> PathBuf;

    fn oldest_reader_txnid(&self) -> Option<u64>;

    fn last_txnid(&self) -> Option<u64>;
}

Read this carefully:

Two associated transaction types: TX (read-only) and TXMut (read-write), with different methods on each. Because put exists only on DbTxMut, calling it on a read transaction is a compile error rather than a runtime failure.
oldest_reader_txnid: exposes the oldest still-active read transaction. Operators use this to detect long-running readers that block GC.
#[track_caller]: tx and tx_mut return a Result, so a failed open is an error value, not a panic. The attribute makes std::panic::Location::caller(), and any panic raised inside the implementation, report the caller's line number instead of the trait method's. Real production debugging discipline.
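
A minimal sketch of what #[track_caller] changes, using only the standard library. OpenedTx, open_tx, and open_tx_untracked are hypothetical names for illustration; nothing here is Reth's code.

```rust
use std::panic::Location;

/// Hypothetical handle that remembers where it was opened.
#[derive(Debug)]
struct OpenedTx {
    opened_at: &'static Location<'static>,
}

/// With #[track_caller], Location::caller() is the line that called open_tx.
#[track_caller]
fn open_tx() -> OpenedTx {
    OpenedTx { opened_at: Location::caller() }
}

/// Without the attribute, the reported location is the line inside this body.
fn open_tx_untracked() -> OpenedTx {
    OpenedTx { opened_at: Location::caller() }
}
```

Two calls to open_tx from different places record different locations, while open_tx_untracked always records the same line inside its own body. The first behaviour is the one you want in a log line about a stuck transaction.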

3. `DbTx` and `DbTxMut` — the actual operations

From crates/storage/db-api/src/transaction.rs:

// DbTx (read-only)
fn get<T: Table>(&self, key: T::Key) -> Result<Option<T::Value>, DatabaseError>;
fn get_by_encoded_key<T: Table>(
    &self,
    key: &<T::Key as Encode>::Encoded,
) -> Result<Option<T::Value>, DatabaseError>;
fn commit(self) -> Result<(), DatabaseError>;
fn abort(self);
fn cursor_read<T: Table>(&self) -> Result<Self::Cursor<T>, DatabaseError>;
fn cursor_dup_read<T: DupSort>(&self) -> Result<Self::DupCursor<T>, DatabaseError>;
fn entries<T: Table>(&self) -> Result<usize, DatabaseError>;
fn disable_long_read_transaction_safety(&mut self);

// DbTxMut (read-write)
fn put<T: Table>(&self, key: T::Key, value: T::Value) -> Result<(), DatabaseError>;
fn append<T: Table>(&self, key: T::Key, value: T::Value) -> Result<(), DatabaseError>;
fn delete<T: Table>(&self, key: T::Key, value: Option<T::Value>) -> Result<bool, DatabaseError>;
fn clear<T: Table>(&self) -> Result<(), DatabaseError>;
fn cursor_write<T: Table>(&self) -> Result<Self::CursorMut<T>, DatabaseError>;
fn cursor_dup_write<T: DupSort>(&self) -> Result<Self::DupCursorMut<T>, DatabaseError>;

Four things matter most:

`<T: Table>` — table is a type, not a string

Each table is a Rust type that implements the Table trait. The compiler enforces "key/value types must match this table's schema." A typo in a table name is a compile error.
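
The idea of a table as a type can be sketched without Reth at all. The Table trait below is a simplified stand-in (the real trait also covers key and value encoding), and HeaderHashes, AccountNonces, and MemTable are hypothetical names invented for this example.

```rust
use std::collections::BTreeMap;

/// Simplified stand-in for a typed table definition.
trait Table {
    const NAME: &'static str;
    type Key: Ord;
    type Value: Clone;
}

/// Hypothetical table: block number -> 32-byte hash.
struct HeaderHashes;
impl Table for HeaderHashes {
    const NAME: &'static str = "HeaderHashes";
    type Key = u64;
    type Value = [u8; 32];
}

/// Hypothetical table: 20-byte address -> nonce.
struct AccountNonces;
impl Table for AccountNonces {
    const NAME: &'static str = "AccountNonces";
    type Key = [u8; 20];
    type Value = u64;
}

/// In-memory rows for exactly one table type.
struct MemTable<T: Table> {
    rows: BTreeMap<T::Key, T::Value>,
}

impl<T: Table> MemTable<T> {
    fn new() -> Self {
        Self { rows: BTreeMap::new() }
    }

    fn put(&mut self, key: T::Key, value: T::Value) {
        self.rows.insert(key, value);
    }

    fn get(&self, key: T::Key) -> Option<T::Value> {
        self.rows.get(&key).cloned()
    }
}
```

Passing a [u8; 20] key to a MemTable::<HeaderHashes> does not compile, because the key type is fixed by the table type. That is the same guarantee the generic `<T: Table>` parameter gives on get and put.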

`append` vs `put`

put works for any key. append is only valid when the key is greater than the current max — but it's faster because it skips a B+tree search. When you're processing blocks sequentially, you use append; when reorging, you fall back to put.
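
The difference can be sketched with a sorted Vec standing in for the B+tree's key order. SortedStore is a hypothetical toy, and its error type is a plain String rather than anything from Reth or MDBX.

```rust
/// Toy sorted store. A Vec kept in key order stands in for the B+tree.
#[derive(Default)]
struct SortedStore {
    rows: Vec<(u64, String)>,
}

impl SortedStore {
    /// put: find the key's position with a binary search, then overwrite or insert.
    fn put(&mut self, key: u64, value: &str) {
        match self.rows.binary_search_by_key(&key, |(k, _)| *k) {
            Ok(i) => self.rows[i].1 = value.to_string(),
            Err(i) => self.rows.insert(i, (key, value.to_string())),
        }
    }

    /// append: no search at all, but only legal when the key sorts after the current max.
    fn append(&mut self, key: u64, value: &str) -> Result<(), String> {
        if let Some((max, _)) = self.rows.last() {
            if key <= *max {
                return Err(format!("key {key} is not greater than current max {max}"));
            }
        }
        self.rows.push((key, value.to_string()));
        Ok(())
    }
}
```

Sequential block processing always satisfies the append precondition. A reorg rewrites keys below the current max, so it has to go through put.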

Cursors

For range scans, you use a cursor instead of repeated get calls. A cursor positions itself in the B+tree once and then steps through neighboring entries. Each independent get repeats the root-to-leaf search, while adjacent keys usually sit on the same leaf page, so a cursor walk is typically far cheaper per entry.
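
A rough analogue using std's BTreeMap, which is an in-memory B-tree and not MDBX. walk_range and get_each are hypothetical helper names; the first seeks once and steps forward, the second performs a full lookup per key.

```rust
use std::collections::BTreeMap;

/// Cursor-style scan: one seek to `start`, then in-order steps.
fn walk_range(table: &BTreeMap<u64, String>, start: u64, limit: usize) -> Vec<(u64, String)> {
    table
        .range(start..)
        .take(limit)
        .map(|(k, v)| (*k, v.clone()))
        .collect()
}

/// Point-lookup scan: one independent search per candidate key.
/// Only equivalent to walk_range when the keys are dense integers.
fn get_each(table: &BTreeMap<u64, String>, start: u64, limit: usize) -> Vec<(u64, String)> {
    (start..)
        .take(limit)
        .filter_map(|k| table.get(&k).map(|v| (k, v.clone())))
        .collect()
}
```

Both functions return the same rows for dense keys. The cursor-style version does one tree descent in total and also works when keys are sparse.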

`disable_long_read_transaction_safety`

A practical safety valve. A long-lived read tx pins the pages of its snapshot, so MDBX cannot recycle them and the DB file grows. Reth normally aborts read txs that have been open too long. Call this when you really need a long snapshot (and accept the cost).
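
A minimal sketch of both halves of that trade-off, under stated assumptions: ReadTxGuard and reclaimable_pages are hypothetical, the age limit is whatever the caller passes in, and the reclamation rule is a simplified model of how an old snapshot pins retired pages.

```rust
use std::time::Duration;

/// Hypothetical read-transaction guard, not Reth's implementation.
struct ReadTxGuard {
    max_age: Duration,
    safety_enabled: bool,
}

impl ReadTxGuard {
    fn new(max_age: Duration) -> Self {
        Self { max_age, safety_enabled: true }
    }

    /// Opt out of the age check, mirroring the intent of the real method name.
    fn disable_long_read_transaction_safety(&mut self) {
        self.safety_enabled = false;
    }

    /// Should a transaction that has been open for `age` be aborted?
    fn should_abort(&self, age: Duration) -> bool {
        self.safety_enabled && age > self.max_age
    }
}

/// Simplified rule: pages retired by write txn `n` can be reused only once
/// no reader is still on a snapshot older than `n`.
/// `retired` holds (retiring txn id, page count) pairs.
fn reclaimable_pages(retired: &[(u64, usize)], oldest_reader: Option<u64>) -> usize {
    retired
        .iter()
        .filter(|(txn, _)| oldest_reader.map_or(true, |r| *txn <= r))
        .map(|(_, pages)| *pages)
        .sum()
}
```

With no readers, every retired page is reusable. One reader parked on an old snapshot drops the reusable count toward zero, and new writes then have to extend the file instead.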

4. Why this matters for hot paths

Because reads are mmap'd:

A "warm" header lookup is a pointer dereference, not a syscall
The OS page cache becomes your read cache for free
Locality matters: keep related data on the same page

Reth's tables are designed so that Execution-stage reads (account → storage → code) hit pages that are already warm.
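
The warm-versus-cold distinction can be modelled in a few lines. This is a simulation only: MappedFile and touch are hypothetical names, the 4 KB page size is an assumption, and eviction under memory pressure is not modelled.

```rust
use std::collections::HashSet;

/// Assumed page size for the simulation.
const PAGE_SIZE: u64 = 4096;

/// Toy model of a memory-mapped file: tracks which pages are resident.
#[derive(Default)]
struct MappedFile {
    resident: HashSet<u64>,
    page_faults: u64,
}

impl MappedFile {
    /// Simulate dereferencing a byte at `offset`.
    /// A page that is not resident costs one fault (the kernel would load it);
    /// a resident page costs nothing beyond the memory access itself.
    fn touch(&mut self, offset: u64) {
        let page = offset / PAGE_SIZE;
        if self.resident.insert(page) {
            self.page_faults += 1;
        }
    }
}
```

Three reads inside the same page cost one fault in total, which is the reason to keep related data on the same page.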

5. Pitfalls

Long read transactions block page reclamation: the writer cannot reuse pages that an open snapshot still references. Don't keep a read tx open for hours; the DB grows.
Page size and key ordering matter. B+tree fanout depends on key size; a 200-byte key is a different beast than a 32-byte one.
mmap means OS pressure. A 500GB DB on a 16GB machine will thrash unless your access pattern is local.
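
The key-size point can be put into rough numbers. The helpers below are a back-of-the-envelope model, not MDBX's real page layout: the 8-byte per-entry overhead is an assumed figure, and fanout and levels_needed are hypothetical names.

```rust
/// Rough fanout, assuming each entry costs key_size + per_entry_overhead bytes.
fn fanout(page_size: u64, key_size: u64, per_entry_overhead: u64) -> u64 {
    page_size / (key_size + per_entry_overhead)
}

/// Smallest number of levels whose capacity (fanout^levels) covers `entries`.
fn levels_needed(fanout: u64, entries: u64) -> u32 {
    assert!(fanout >= 2, "need at least two children per page");
    let mut levels = 0;
    let mut capacity: u64 = 1;
    while capacity < entries {
        capacity = capacity.saturating_mul(fanout);
        levels += 1;
    }
    levels
}
```

Under these assumptions, 4 KB pages hold about 102 entries with 32-byte keys and about 19 with 200-byte keys. For one billion entries that is 5 levels versus 8, so every cold lookup with the larger key touches three more pages.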

6. The comparator: MegaETH's SALT

MDBX is the right default for a vanilla Reth node. But "right default" is not the same as "right for every chain." MegaETH replaced MDBX entirely with SALT (Small Authentication Large Trie) to push throughput beyond what a disk-backed B+tree allows.

The design contrast is worth holding in your head when you read either:

Aspect	MDBX (Reth default)	SALT (MegaETH)
Form	Memory-mapped B+tree	Two-tier: 4-level complete 256-ary trie + SHI hash-table buckets
Storage model	All data on disk, OS pages it in via mmap	Authentication layer lives fully in memory (~1 GB per 3 B items); data sits in buckets
State-root update	Walks the MPT, touches many random disk pages	Bucket-local updates; eliminates random disk I/O during root recomputation
Trie shape	None — Reth maintains the MPT separately on top of MDBX	Trie is the storage; commitments are intrinsic
Insertion-order invariance	N/A (KV agnostic)	SHI (Strongly History-Independent) — canonical commitment regardless of insertion order
Strengths	Mature, crash-safe, ACID, deep tool ecosystem	Memory-efficient authentication at billion-scale, no random disk I/O on state roots
Trade-offs	Random I/O during state root updates becomes the bottleneck at high TPS	New (~2026 design), narrower deployment, sensitive to memory pressure

The pedagogical point is not "SALT is better." It's that MDBX's design assumptions become visible only when you see what someone else chose differently and why. If you've only ever read one storage layer, you can't tell which decisions are essential vs. accidental.

Read megaeth-labs/salt alongside Reth's MDBX wrapper. The questions that surface when you do — "where does Reth pay for crash safety we don't need at high TPS?" "what does SALT give up to fit authentication in memory?" — are the design questions you'll face when extending Reth's storage layer for your own chain.

Drill

Open crates/storage/db-api/src/tables in the repo:

Find the Headers table — note its key (BlockNumber) and value (Header)
Find a DupSort table — these are tables where one key has multiple values. Why does DupSort exist? What kind of data needs it?
Trace one Execution-stage read through: which tables does it consult, in what order?

You'll come out the other side knowing where every byte of Ethereum state lives in Reth.

Final check: in one sentence, why does mmap let you treat a 500GB DB like a Rust slice? Where does the OS fit in? If you can't explain the page-fault → page-load mechanism, the "pointer dereference, not syscall" claim is words to you, not understanding.

Summary (3 lines)

MDBX = libmdbx (a fork of LMDB; B+tree with a Berkeley DB-style API). Mmap-backed; COW transactions; no reader locks.
Tuning: map_size + sync_period + max readers. Gotchas: mmap on 32-bit + long-running reads.
Reth: table per data type; freed pages recycled via the free list, no LSM-style compaction. Predictable read latency.