Chapter 12: Storage & Indexes | The Blockchain Chief Bitcoin Book

The Data Directory

Everything Bitcoin Core persists lives in the data directory (default: ~/.bitcoin/ on Linux, ~/Library/Application Support/Bitcoin/ on macOS). Here's what's inside:

blocks/ Raw block data (blk*.dat) and undo data (rev*.dat)

blocks/index/ LevelDB: maps block hashes → file positions (the "block index")

chainstate/ LevelDB: the UTXO set (all unspent outputs)

indexes/txindex/ Optional: maps txid → block file position

indexes/blockfilter/ Optional: BIP 157/158 compact block filters

indexes/coinstats/ Optional: cumulative UTXO set hash (MuHash)

mempool.dat Serialized mempool (persisted across restarts)

peers.dat Known peer addresses (AddrMan serialization)

wallets/ Wallet databases (SQLite for descriptor wallets)

Block Storage: FlatFile

Raw block data is stored in a sequence of flat files named blk00000.dat, blk00001.dat, etc. These files are managed by the FlatFileSeq class.

How It Works

Pre-allocated chunks: each file is pre-allocated in 16 MiB chunks to reduce filesystem fragmentation.
Append-only: new blocks are appended to the current file. When the next block would push a file past its maximum size (128 MiB by default), a new file begins.
Position tracking: each block's position is recorded as a FlatFilePos (file number + byte offset within that file).
Network magic prefix: each block on disk is preceded by the 4-byte network magic (0xf9beb4d9 for mainnet) and a 4-byte size field, so blocks can be located even if the index is lost.

FlatFilePos fields:

int nFile File number (blk00042.dat → nFile = 42)

unsigned int nPos Byte offset within the file
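To make the on-disk record layout concrete, here is a minimal Python sketch that walks a buffer shaped like a blk file and recovers each block from its magic prefix and size field. The function names are hypothetical, the size field is assumed to be little-endian, and real files may contain padding that this sketch simply skips byte by byte.

```python
import struct

MAINNET_MAGIC = bytes.fromhex("f9beb4d9")  # mainnet network magic quoted in the text


def blk_filename(n_file: int) -> str:
    """Map a file number to its file name, e.g. 42 -> 'blk00042.dat'."""
    return f"blk{n_file:05d}.dat"


def scan_records(data: bytes, magic: bytes = MAINNET_MAGIC):
    """Yield (payload_offset, payload) for every record found in `data`."""
    pos = 0
    while pos + 8 <= len(data):
        if data[pos:pos + 4] != magic:
            pos += 1  # resynchronise over padding or garbage
            continue
        (size,) = struct.unpack_from("<I", data, pos + 4)  # assumed little-endian
        start = pos + 8
        if start + size > len(data):
            break  # truncated record at the end of the buffer
        yield start, data[start:start + size]
        pos = start + size
```

This is the property the list describes: even without the block index, a scan like this can rediscover where every block starts.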

BlockManager

The BlockManager class (in node/blockstorage.h) sits between validation and the raw files. It handles:

Reading blocks: ReadBlock() (ReadBlockFromDisk() in older releases) takes a FlatFilePos and returns the deserialized CBlock.
Writing blocks: WriteBlock() (SaveBlockToDisk() in older releases) serializes a block to the current blk file and returns the position.
File management: tracks which files have space, creates new files when needed.
Block Tree DB: maintains a LevelDB database (BlockTreeDB) that maps block hashes to their FlatFilePos locations, plus metadata like height, version, and total work.

Per-file statistics are kept in a CBlockFileInfo record:

uint32_t nBlocks Number of blocks stored in this file

uint32_t nSize Bytes of block data used

uint32_t nUndoSize Bytes of undo data in corresponding rev*.dat

uint32_t nHeightFirst / nHeightLast Height range of blocks in this file
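The file-management duty can be sketched as a small allocator that decides where the next block goes, when to roll over to a new file, and how much to pre-allocate. This is an illustration only: FileInfo and find_next_pos are hypothetical names, and the 128 MiB cap is assumed to be the default maximum file size.

```python
from dataclasses import dataclass

CHUNK_SIZE = 16 * 1024 * 1024      # pre-allocation granularity from the text
MAX_FILE_SIZE = 128 * 1024 * 1024  # assumed default cap per blk file


@dataclass
class FileInfo:
    n_blocks: int = 0  # number of blocks stored in the file
    n_size: int = 0    # bytes of block data used


def find_next_pos(files: list, block_size: int):
    """Return (n_file, n_pos, bytes_to_preallocate) for a new block."""
    if not files or files[-1].n_size + block_size > MAX_FILE_SIZE:
        files.append(FileInfo())  # current file is full: start a new one
    info = files[-1]
    n_file, n_pos = len(files) - 1, info.n_size
    # Number of 16 MiB chunks needed before and after the write (ceiling division).
    old_chunks = -(-n_pos // CHUNK_SIZE)
    new_chunks = -(-(n_pos + block_size) // CHUNK_SIZE)
    info.n_size += block_size
    info.n_blocks += 1
    return n_file, n_pos, (new_chunks - old_chunks) * CHUNK_SIZE
```

The first write into an empty file triggers one 16 MiB pre-allocation; later writes only allocate again when they cross a chunk boundary.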

The Block Index

The block index is an in-memory tree of CBlockIndex objects (one per known block header). It's loaded from the blocks/index/ LevelDB at startup.

What CBlockIndex Stores

Each CBlockIndex records everything known about a block header without needing the full block data:

Block hash and height
Previous block pointer: pprev links to the parent, forming a tree
Cumulative chain work: nChainWork (total PoW from genesis to this block)
Validation status: which stages of validation this block has passed (nStatus)
Disk position: nFile + nDataPos pointing into the blk files

💡 Headers vs. Blocks

The block index contains entries for every header the node has ever seen, including orphaned or invalid branches. A header can exist in the index before its full block data has been downloaded or validated. The nStatus flags track which stages each block has completed.

Best Chain Selection

The "best chain" (active chain) is the chain of blocks with the most cumulative proof of work from genesis. ChainstateManager maintains a pointer to the tip of this chain. When a new block arrives with more work, the node switches to the new chain (potentially triggering a reorganization).
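As a rough illustration of "most cumulative work", the sketch below builds a tiny block tree and picks the tip with the highest chain work. BlockIndex here is a simplified stand-in for CBlockIndex, the per-block work formula floor(2^256 / (target + 1)) is stated as an assumption, and validity is reduced to a single flag.

```python
from dataclasses import dataclass
from typing import Optional


def block_work(target: int) -> int:
    """Expected hashes needed to meet `target` (assumed formula)."""
    return (1 << 256) // (target + 1)


@dataclass
class BlockIndex:
    name: str
    target: int
    prev: Optional["BlockIndex"] = None
    valid: bool = True
    chain_work: int = 0
    height: int = 0

    def __post_init__(self):
        # Cumulative work and height are derived from the parent entry.
        self.chain_work = (self.prev.chain_work if self.prev else 0) + block_work(self.target)
        self.height = self.prev.height + 1 if self.prev else 0


def best_tip(candidates):
    """Pick the valid candidate tip with the most cumulative work."""
    return max((b for b in candidates if b.valid), key=lambda b: b.chain_work)


def fork_point(a: BlockIndex, b: BlockIndex) -> BlockIndex:
    """Walk both branches back until they meet (assumes a shared genesis)."""
    while a is not b:
        if a.height >= b.height:
            a = a.prev
        else:
            b = b.prev
    return a
```

Note that the winning chain is not necessarily the longest one: a shorter branch mined against a lower target can carry more work. A reorganization disconnects blocks back to fork_point and connects the new branch.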

The UTXO Database

The UTXO set (Unspent Transaction Output set) tracks every transaction output that has been created and not yet spent. It is the most performance-critical data structure in the node, because every transaction validation reads from it.

CCoinsView Hierarchy

Bitcoin Core uses a layered cache architecture for UTXO access:

CCoinsViewDB: LevelDB on disk (chainstate/). The authoritative UTXO set.

↑ cache miss

CCoinsViewCache (base): in-memory cache. Batches writes to reduce disk I/O.

↑ cache miss

CCoinsViewCache (per-block): temporary cache for validating a single block. Discarded on failure.
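The layering can be sketched with two small classes: a dict-backed stand-in for the on-disk view and a cache that falls through to its parent on a miss. The class and method names are illustrative rather than Bitcoin Core's API, and the sketch omits the dirty/fresh flags a real cache would track.

```python
class CoinsDB:
    """Stand-in for the on-disk view; a dict plays the role of LevelDB."""

    def __init__(self):
        self.store = {}

    def get(self, outpoint):
        return self.store.get(outpoint)

    def batch_write(self, changes):
        for outpoint, coin in changes.items():
            if coin is None:
                self.store.pop(outpoint, None)  # spent: erase from disk
            else:
                self.store[outpoint] = coin


class CoinsCache:
    """In-memory layer on top of another view (the DB or another cache)."""

    def __init__(self, parent):
        self.parent = parent
        self.entries = {}  # outpoint -> coin, or None meaning "spent"

    def get(self, outpoint):
        if outpoint not in self.entries:  # cache miss: ask the layer below
            self.entries[outpoint] = self.parent.get(outpoint)
        return self.entries[outpoint]

    def add(self, outpoint, coin):
        self.entries[outpoint] = coin

    def spend(self, outpoint):
        self.entries[outpoint] = None

    def batch_write(self, changes):
        self.entries.update(changes)

    def flush(self):
        """Push all entries down one layer and start empty again."""
        self.parent.batch_write(self.entries)
        self.entries = {}
```

A per-block cache that is never flushed leaves the layers below untouched, which is what makes discarding a failed block cheap.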

Coin Representation

CTxOut out The output: amount + scriptPubKey

uint32_t nHeight Block height where this output was created

bool fCoinBase Whether from a coinbase tx (100-block maturity rule)

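UTXOs are keyed by COutPoint (txid + output index). Keys and values use a compact serialization, and the stored values are obfuscated by XOR with a random per-database key. The obfuscation is not a security feature: it keeps attacker-chosen byte patterns embedded in transactions from appearing verbatim on disk, where antivirus software could flag or quarantine the database files.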
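The XOR step itself is simple enough to show in a few lines. This is a sketch of the general technique with a repeating key, not a copy of Bitcoin Core's code; the 8-byte key length is an assumption.

```python
import os


def xor_obfuscate(data: bytes, key: bytes) -> bytes:
    """XOR `data` with `key`, repeating the key as needed."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))


key = os.urandom(8)                   # random per-database key (assumed 8 bytes)
value = b"serialized coin bytes"
stored = xor_obfuscate(value, key)    # what would be written to disk
restored = xor_obfuscate(stored, key) # XOR is its own inverse
```

Because applying the same key twice restores the input, the same function serves for both writing and reading.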

Undo Data

For each block file (blk?????.dat), there's a corresponding undo file (rev?????.dat). The undo data stores the information needed to reverse a block: the inputs that were consumed when the block was connected.

Why Undo Data Exists

When a chain reorganization happens, the node needs to "disconnect" blocks and restore the UTXO set to its previous state. The undo data for each block contains the Coin objects for every input spent in that block, so they can be put back.

CBlockUndo fields:

vector<CTxUndo> vtxundo One CTxUndo per transaction (except coinbase)

Each CTxUndo contains a vector of Coin objects: one for each input of the transaction, representing the UTXO that was consumed. To disconnect a block, Bitcoin Core removes the outputs the block created and restores these coins to the UTXO set.
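A toy model makes the connect/disconnect symmetry easier to see. In this sketch a block is a list of dicts, the UTXO set is a plain dict keyed by (txid, index), and the undo data is a list of spent-coin lists, one per non-coinbase transaction. All of these shapes are simplifications chosen for illustration.

```python
def connect_block(utxos: dict, block: list) -> list:
    """Apply `block` to `utxos` in place and return its undo data."""
    block_undo = []
    for i, tx in enumerate(block):
        if i > 0:  # the coinbase (index 0) spends no existing outputs
            block_undo.append([utxos.pop(outpoint) for outpoint in tx["inputs"]])
        for n, out in enumerate(tx["outputs"]):
            utxos[(tx["txid"], n)] = out
    return block_undo


def disconnect_block(utxos: dict, block: list, block_undo: list) -> None:
    """Reverse connect_block, walking the transactions back to front."""
    for i in reversed(range(len(block))):
        tx = block[i]
        for n in range(len(tx["outputs"])):
            del utxos[(tx["txid"], n)]  # remove outputs the block created
        if i > 0:
            for outpoint, coin in zip(tx["inputs"], block_undo[i - 1]):
                utxos[outpoint] = coin  # restore the coins it spent
```

Walking the transactions in reverse matters when one transaction in the block spends an output created earlier in the same block.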

Pruning

A fully synced Bitcoin node stores upwards of 600 GB of block data, and the figure keeps growing. Pruning allows nodes to delete old block and undo files while keeping the UTXO set, which is all that's needed for validation going forward.

How Pruning Works

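Automatic pruning: -prune=N (with N of at least 550) deletes the oldest files as needed to keep block and undo data under a target of N MiB.
Manual pruning: -prune=1 disables automatic deletion and leaves it to the pruneblockchain RPC.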
File-level deletion: entire blk/rev file pairs are deleted, not individual blocks within a file.
Minimum kept: the last 288 blocks (~2 days) are always kept to handle potential reorganizations.
What's preserved: the block index (headers) and the UTXO set remain intact; only raw block and undo data is deleted. Note that -txindex cannot be combined with pruning, because its entries point into the block files.
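The rules above can be sketched as a function that picks which blk/rev file pairs to delete. The dict layout and the function name are hypothetical; the sketch only captures the three ideas from the list: whole-file deletion, oldest first, and never touching files that contain any of the last 288 blocks.

```python
MIN_BLOCKS_TO_KEEP = 288  # roughly two days of blocks, as stated in the text


def files_to_prune(files: list, tip_height: int, target_bytes: int) -> list:
    """Return the file numbers to delete, oldest first.

    Each entry in `files` is a dict with n_file, size (blk + rev bytes)
    and height_last (highest block height stored in that file).
    """
    usage = sum(f["size"] for f in files)
    last_prunable = tip_height - MIN_BLOCKS_TO_KEEP
    pruned = []
    for f in sorted(files, key=lambda f: f["n_file"]):
        if usage <= target_bytes:
            break  # already under the target
        if f["height_last"] > last_prunable:
            continue  # holds recent blocks needed for reorgs
        pruned.append(f["n_file"])
        usage -= f["size"]
    return pruned
```

If the protected recent files alone exceed the target, the function stops short of it rather than deleting them.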

⚠️ Pruning Trade-offs

A pruned node cannot serve historical blocks to peers, rescan deleted blocks for newly imported wallet keys, or rebuild indexes from block data it has deleted. It can still fully validate new blocks and transactions.

Optional Indexes

Bitcoin Core provides an indexing framework (BaseIndex) for building secondary indexes on top of the blockchain. These are optional and configurable via CLI flags.

BaseIndex Framework

All indexes inherit from BaseIndex, which provides:

Sequential processing: blocks are indexed in order, following the active chain.
Reorg handling: when the active chain changes, the index automatically rolls back and re-indexes.
Background threading: indexing runs on its own thread, so it doesn't block the main validation pipeline.
Sync tracking: stores a CBlockLocator recording how far the index has progressed.
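A toy index shows how sequential processing and reorg handling fit together. ToyIndex is a hypothetical class: a plain list of block hashes stands in for the stored locator, and the "indexed data" is a placeholder string.

```python
class ToyIndex:
    """Follows the active chain, rewinding to the fork point on a reorg."""

    def __init__(self):
        self.best = []     # hashes indexed so far, genesis first
        self.entries = {}  # block hash -> indexed data

    def sync(self, active_chain: list) -> None:
        # Find how much of what we indexed is still on the active chain.
        common = 0
        while (common < len(self.best) and common < len(active_chain)
               and self.best[common] == active_chain[common]):
            common += 1
        # Roll back entries for blocks that are no longer active.
        for block_hash in reversed(self.best[common:]):
            del self.entries[block_hash]
        self.best = self.best[:common]
        # Index the new blocks in order.
        for block_hash in active_chain[common:]:
            self.entries[block_hash] = f"data for {block_hash}"
            self.best.append(block_hash)
```

Calling sync() again after the chain tip changes is enough to bring the index back in line, whether the change was a simple extension or a reorganization.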

TxIndex (Transaction Index)

Enabled with -txindex. Maps every transaction ID to its position on disk (CDiskTxPos: the block's file position plus the transaction's offset within the block). Without it, the getrawtransaction RPC can only find transactions that are in the mempool, unless the caller supplies the hash of the block that contains the transaction.
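The idea of a position-based transaction index can be sketched as follows. DiskTxPos mirrors the three pieces of information described above, but the storage format is a toy: each transaction is written with a one-byte length prefix, and there is no block header, neither of which matches the real block serialization.

```python
from typing import NamedTuple


class DiskTxPos(NamedTuple):
    n_file: int       # which block file
    n_pos: int        # where the block starts in that file
    n_tx_offset: int  # where the transaction starts, relative to the block


def append_block(files: dict, txindex: dict, n_file: int, txs: list) -> None:
    """Append a toy block (list of (txid, raw_bytes)) and index each tx."""
    blob = files.setdefault(n_file, bytearray())
    block_pos = len(blob)
    for txid, raw in txs:
        txindex[txid] = DiskTxPos(n_file, block_pos, len(blob) - block_pos)
        blob += bytes([len(raw)]) + raw  # toy 1-byte length prefix


def read_tx(files: dict, pos: DiskTxPos) -> bytes:
    """Seek straight to the transaction without scanning the block."""
    data = files[pos.n_file]
    start = pos.n_pos + pos.n_tx_offset
    length = data[start]
    return bytes(data[start + 1:start + 1 + length])
```

The lookup cost is one index read plus one seek, independent of how many blocks the chain contains.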

CoinStatsIndex

Enabled with -coinstatsindex. Tracks cumulative UTXO statistics at each block height using MuHash (a rolling hash). Powers the gettxoutsetinfo RPC to return the total UTXO count, total amount, and a hash of the entire UTXO set, without scanning the UTXO database each time.
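To see why a rolling hash suits this index, consider a toy multiplicative set hash: adding an element multiplies the running value by the element's hash, removing it divides, and the order of operations does not matter. The sketch uses a small prime modulus and SHA-256 for brevity. Both are assumptions made for illustration; the real MuHash uses a 3072-bit modulus and a different element hash.

```python
import hashlib

P = 2**127 - 1  # toy prime modulus (assumption, far smaller than the real one)


def element_hash(data: bytes) -> int:
    """Map an element to a non-zero residue modulo P."""
    h = int.from_bytes(hashlib.sha256(data).digest(), "big") % P
    return h or 1


class ToyMuHash:
    def __init__(self):
        self.numerator = 1    # product of inserted elements
        self.denominator = 1  # product of removed elements

    def insert(self, data: bytes) -> None:
        self.numerator = self.numerator * element_hash(data) % P

    def remove(self, data: bytes) -> None:
        self.denominator = self.denominator * element_hash(data) % P

    def digest(self) -> int:
        # Divide by multiplying with the modular inverse (Python 3.8+).
        return self.numerator * pow(self.denominator, -1, P) % P
```

With this structure, each new block only needs to insert its created outputs and remove its spent ones, so the set hash at every height can be derived from the previous one.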

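Compact Block Filters (BIP 157/158)

Enabled with -blockfilterindex, which builds the filters; adding -peerblockfilters serves them to peers over the P2P network. This is the modern mechanism for light client support, replacing the deprecated BIP 37 bloom filters.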

How It Works

Filter construction (BIP 158): for each block, a compact probabilistic filter is built using a Golomb-Coded Set (GCS). The filter encodes all scriptPubKeys spent or created in the block.
Filter serving (BIP 157): light clients download these small filters (a few KB per block instead of the full ~1-2 MB block), check if any of their addresses match, and only download the full block if there's a match.
Filter chain: filters are chained via header hashes (getcfheaders) so clients can verify they have the correct filter for each block.
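The filter construction step can be sketched in a simplified form: hash each item into a range, sort, and Golomb-Rice code the differences. The parameters P = 19 and M = 784931 are those of the BIP 158 basic filter, but this sketch swaps in SHA-256 where the BIP specifies SipHash and keeps the bits in a Python list instead of a packed byte string, so its output is not compatible with real filters.

```python
import hashlib

P = 19      # bits in each Golomb-Rice remainder
M = 784931  # inverse of the target false-positive rate


def hash_to_range(item: bytes, key: bytes, f: int) -> int:
    """Map an item uniformly into [0, f). SHA-256 is a stand-in for SipHash."""
    h = int.from_bytes(hashlib.sha256(key + item).digest()[:8], "big")
    return (h * f) >> 64


def build_filter(items, key: bytes):
    """Return (n, bits): the item count and the coded deltas as a 0/1 list."""
    items = set(items)
    n = len(items)
    values = sorted(hash_to_range(i, key, n * M) for i in items)
    bits, last = [], 0
    for v in values:
        delta, last = v - last, v
        q, r = delta >> P, delta & ((1 << P) - 1)
        bits += [1] * q + [0]                               # unary quotient
        bits += [(r >> i) & 1 for i in reversed(range(P))]  # P-bit remainder
    return n, bits


def decode(n: int, bits: list) -> list:
    """Recover the sorted hashed values from the coded deltas."""
    values, pos, last = [], 0, 0
    for _ in range(n):
        q = 0
        while bits[pos] == 1:
            q, pos = q + 1, pos + 1
        pos += 1  # skip the terminating 0 of the unary part
        r = 0
        for bit in bits[pos:pos + P]:
            r = (r << 1) | bit
        pos += P
        last += (q << P) | r
        values.append(last)
    return values


def match(n: int, bits: list, key: bytes, item: bytes) -> bool:
    """Test membership locally; false positives occur at a rate of about 1/M."""
    return hash_to_range(item, key, n * M) in decode(n, bits)
```

A light client would run match() against each filter it downloads, using its own scripts as the items, and fetch the full block only on a hit.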

Why Not Bloom Filters?

BIP 37 bloom filters had the client send its filter to the server, which revealed information about the client's addresses (privacy leak). BIP 157/158 reverses this: the server builds one filter per block and serves it to all clients identically. The client tests the filter locally, revealing nothing to the server about which addresses it's interested in.

BlockFilter fields:

uint8_t m_filter_type Filter type (Basic = 0, currently the only type)

uint256 m_block_hash Hash of the block this filter covers

GCSFilter m_filter The Golomb-Coded Set: probabilistic set membership test

P2P Messages for Block Filters

getcfilters: request filters for a range of blocks
cfilter: response containing one block filter
getcfheaders: request filter header chain (for verification)
cfheaders: response containing filter headers
getcfcheckpt: request evenly-spaced checkpoints for efficient sync
cfcheckpt: response containing filter headers at 1,000-block intervals
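The filter headers exchanged by these messages form a hash chain, which is what lets a client check a filter received from one peer against headers obtained from another. The sketch below assumes the BIP 157 construction: the filter hash is the double-SHA256 of the serialized filter, each header is the double-SHA256 of that hash concatenated with the previous header, and the header before the genesis block is 32 zero bytes. The function names are illustrative.

```python
import hashlib


def dsha256(data: bytes) -> bytes:
    """Double SHA-256, as used throughout Bitcoin."""
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()


def filter_header(filter_bytes: bytes, prev_header: bytes) -> bytes:
    """Commit to this filter and, through prev_header, to all earlier ones."""
    return dsha256(dsha256(filter_bytes) + prev_header)


def header_chain(filters: list, genesis_prev: bytes = bytes(32)) -> list:
    """Compute the filter header for every filter, in block order."""
    headers, prev = [], genesis_prev
    for f in filters:
        prev = filter_header(f, prev)
        headers.append(prev)
    return headers


def verify_filter(filter_bytes: bytes, prev_header: bytes, expected: bytes) -> bool:
    """Check a downloaded filter against headers the client already trusts."""
    return filter_header(filter_bytes, prev_header) == expected
```

Because every header depends on all earlier filters, comparing a handful of checkpoint headers across peers is enough to detect a peer that serves a different filter anywhere in the range.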
