The Blockchain Chief Bitcoin Book / Part III: Validation & Consensus
Chapter 12

Storage & Indexes

How Bitcoin Core stores blocks on disk, manages the UTXO database, builds optional indexes, and supports pruning: the persistent data layer beneath validation.

The Data Directory

Everything Bitcoin Core persists lives in the data directory (default: ~/.bitcoin/ on Linux, ~/Library/Application Support/Bitcoin/ on macOS). Here's what's inside:

blocks/               Raw block data (blk*.dat) and undo data (rev*.dat)
blocks/index/         LevelDB: maps block hashes → file positions (the "block index")
chainstate/           LevelDB: the UTXO set (all unspent outputs)
indexes/txindex/      Optional: maps txid → block file position
indexes/blockfilter/  Optional: BIP 157/158 compact block filters
indexes/coinstats/    Optional: cumulative UTXO set hash (MuHash)
mempool.dat           Serialized mempool (persisted across restarts)
peers.dat             Known peer addresses (AddrMan serialization)
wallets/              Wallet databases (SQLite for descriptor wallets)

Block Storage: FlatFile

Raw block data is stored in a sequence of flat files named blk00000.dat, blk00001.dat, and so on. The FlatFileSeq class manages this sequence.

How It Works

int           nFile  File number (blk00042.dat → nFile = 42)
unsigned int  nPos   Byte offset within the file
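To illustrate how such a position maps to bytes on disk, the sketch below models the two fields in Python and derives the file name from the file number. It is a simplified model, not Bitcoin Core's code: the class and the helpers flat_file_path and read_at are hypothetical names, and the five-digit zero padding is taken from the blk00042.dat example above.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class FlatFilePos:
    """Illustrative model of a position in a flat file sequence."""
    n_file: int  # file number, e.g. 42 for blk00042.dat
    n_pos: int   # byte offset within that file


def flat_file_path(directory: Path, prefix: str, pos: FlatFilePos) -> Path:
    """Build the path of the file that holds `pos` (prefix is 'blk' or 'rev')."""
    return directory / f"{prefix}{pos.n_file:05d}.dat"


def read_at(directory: Path, prefix: str, pos: FlatFilePos, length: int) -> bytes:
    """Read `length` bytes starting at the byte offset stored in `pos`."""
    with open(flat_file_path(directory, prefix, pos), "rb") as f:
        f.seek(pos.n_pos)
        return f.read(length)
```

Because a position is just a file number plus an offset, the block index can point at any block with two small integers instead of storing the block itself.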

BlockManager

The BlockManager class (in node/blockstorage.h) sits between validation and the raw files. It handles:

uint32_t  nBlocks                     Number of blocks stored in this file
uint32_t  nSize                       Bytes of block data used
uint32_t  nUndoSize                   Bytes of undo data in corresponding rev*.dat
uint32_t  nHeightFirst / nHeightLast  Height range of blocks in this file

The Block Index

The block index is an in-memory tree of CBlockIndex objects (one per known block header). It's loaded from the blocks/index/ LevelDB at startup.

What CBlockIndex Stores

Each CBlockIndex records everything known about a block header without needing the full block data:

💡 Headers vs. Blocks

The block index contains entries for every header the node has ever seen, including orphaned or invalid branches. A header can exist in the index before its full block data has been downloaded or validated. The nStatus flags track which stages each block has completed.

Best Chain Selection

The "best chain" (active chain) is the valid chain of blocks with the most cumulative proof of work from genesis, which is not necessarily the chain with the most blocks. ChainstateManager maintains a pointer to the tip of this chain. When a competing branch accumulates more work than the current tip, the node switches to it (potentially triggering a reorganization).
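A minimal sketch of "most cumulative work" selection follows. It assumes each block contributes roughly 2^256 / (target + 1) expected hashes of work and that each entry caches the running total of its ancestors. BlockIndexEntry, block_proof, and best_tip are illustrative names, and the sketch ignores the validity checks a real node applies before a block can become the tip.

```python
from dataclasses import dataclass, field
from typing import Iterable, Optional


def block_proof(target: int) -> int:
    """Expected number of hashes needed to find a block at this target."""
    return (1 << 256) // (target + 1)


@dataclass
class BlockIndexEntry:
    name: str
    prev: Optional["BlockIndexEntry"]
    target: int
    chain_work: int = field(init=False)

    def __post_init__(self) -> None:
        # Cumulative work = parent's cumulative work + this block's work.
        parent_work = self.prev.chain_work if self.prev else 0
        self.chain_work = parent_work + block_proof(self.target)


def best_tip(candidates: Iterable[BlockIndexEntry]) -> BlockIndexEntry:
    """Pick the candidate tip with the most cumulative work."""
    return max(candidates, key=lambda entry: entry.chain_work)


EASY = 1 << 224  # higher target = easier block
HARD = 1 << 220  # 16 times harder than EASY

genesis = BlockIndexEntry("genesis", None, EASY)
a1 = BlockIndexEntry("a1", genesis, EASY)
a2 = BlockIndexEntry("a2", a1, EASY)
a3 = BlockIndexEntry("a3", a2, EASY)      # longer branch, three easy blocks
b1 = BlockIndexEntry("b1", genesis, HARD)  # shorter branch, one hard block
```

Here best_tip([a3, b1]) selects b1: the single hard block carries more work than three easy ones, so the shorter branch wins.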

The UTXO Database

The UTXO set (Unspent Transaction Output set) tracks every transaction output that has been created and not yet spent. It is the most performance-critical data structure in the node: every transaction validation reads from it.

CCoinsView Hierarchy

Bitcoin Core uses a layered cache architecture for UTXO access:

CCoinsViewDB                 LevelDB on disk (chainstate/). The authoritative UTXO set.
  ↑ cache miss
CCoinsViewCache (base)       In-memory cache. Batches writes to reduce disk I/O.
  ↑ cache miss
CCoinsViewCache (per-block)  Temporary cache for validating a single block. Discarded on failure.
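The layering can be sketched as a chain of views, each falling through to its parent on a miss and pushing its pending changes down only when flushed. This is a toy model under simplifying assumptions: a dict stands in for LevelDB, coins are plain values, and CoinsDB and CoinsCache are hypothetical names rather than the real CCoinsView classes.

```python
from typing import Dict, Hashable, Optional

Coin = int  # stand-in for the real coin record


class CoinsDB:
    """Bottom layer: a dict standing in for the on-disk database."""

    def __init__(self) -> None:
        self.store: Dict[Hashable, Coin] = {}

    def get_coin(self, outpoint: Hashable) -> Optional[Coin]:
        return self.store.get(outpoint)

    def batch_write(self, changes: Dict[Hashable, Optional[Coin]]) -> None:
        for outpoint, coin in changes.items():
            if coin is None:
                self.store.pop(outpoint, None)  # spent: delete
            else:
                self.store[outpoint] = coin


class CoinsCache:
    """In-memory layer on top of another view (a CoinsDB or a CoinsCache)."""

    def __init__(self, base) -> None:
        self.base = base
        self.cache: Dict[Hashable, Optional[Coin]] = {}  # read-through entries
        self.dirty: Dict[Hashable, Optional[Coin]] = {}  # pending changes

    def get_coin(self, outpoint: Hashable) -> Optional[Coin]:
        if outpoint in self.dirty:
            return self.dirty[outpoint]
        if outpoint not in self.cache:  # cache miss: ask the layer below
            self.cache[outpoint] = self.base.get_coin(outpoint)
        return self.cache[outpoint]

    def add_coin(self, outpoint: Hashable, coin: Coin) -> None:
        self.dirty[outpoint] = coin

    def spend_coin(self, outpoint: Hashable) -> bool:
        if self.get_coin(outpoint) is None:
            return False  # missing or already spent
        self.dirty[outpoint] = None
        return True

    def batch_write(self, changes: Dict[Hashable, Optional[Coin]]) -> None:
        self.dirty.update(changes)  # changes flushed from a child view

    def flush(self) -> None:
        self.cache.update(self.dirty)
        self.base.batch_write(self.dirty)
        self.dirty = {}
```

A per-block view is simply a CoinsCache built on the base cache: if block validation fails, the object is dropped and nothing below it has changed; if it succeeds, one flush call hands all changes to the layer below.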

Coin Representation

CTxOut    out        The output: amount + scriptPubKey
uint32_t  nHeight    Block height where this output was created
bool      fCoinBase  Whether from a coinbase tx (100-block maturity rule)

UTXOs are keyed by COutPoint (txid + output index). Each LevelDB entry uses a compact serialization, and the stored values are obfuscated (XORed with a random per-database key) so that byte patterns in UTXO data do not trigger false positives in antivirus software scanning the data directory.
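The XOR step itself is simple enough to sketch. The function below repeats the key across the data, and applying it twice restores the original bytes. The function name and the 8-byte key in the usage note are assumptions for illustration; how Bitcoin Core generates and stores its key is not shown here.

```python
from itertools import cycle


def xor_obfuscate(data: bytes, key: bytes) -> bytes:
    """XOR every byte of `data` with the key, repeating the key as needed."""
    if not key:
        return data  # an empty key leaves the data unchanged
    return bytes(b ^ k for b, k in zip(data, cycle(key)))
```

For example, with key = bytes.fromhex("0102030405060708"), calling xor_obfuscate on a serialized value before writing and again after reading yields the original value, since XOR is its own inverse.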

Undo Data

For each block file (blk?????.dat), there's a corresponding undo file (rev?????.dat). The undo data stores the information needed to reverse a block: the inputs that were consumed when the block was connected.

Why Undo Data Exists

When a chain reorganization happens, the node needs to "disconnect" blocks and restore the UTXO set to its previous state. The undo data for each block contains the Coin objects for every input spent in that block, so they can be put back.

vector<CTxUndo> vtxundo One CTxUndo per transaction (except coinbase)

Each CTxUndo contains a vector of Coin objects: one for each input of the transaction, representing the UTXO that was consumed. To disconnect a block, Bitcoin Core replays these coins back into the UTXO set.
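The connect/disconnect round trip can be sketched with a dict as the UTXO set. This is a simplified model: Tx, connect_block, and disconnect_block are illustrative names, coins are plain integers, and no validation is performed. Transactions are undone in reverse order so that outputs created and spent within the same block unwind correctly.

```python
from typing import Dict, List, NamedTuple, Tuple

OutPoint = Tuple[str, int]  # (txid, output index)


class Tx(NamedTuple):
    txid: str
    inputs: List[OutPoint]  # empty for the coinbase
    outputs: List[int]      # output amounts stand in for full coins


def connect_block(utxos: Dict[OutPoint, int], block: List[Tx]) -> List[List[int]]:
    """Apply a block to the UTXO set and return its undo data."""
    undo: List[List[int]] = []
    for i, tx in enumerate(block):
        if i > 0:  # the coinbase (index 0) spends nothing
            undo.append([utxos.pop(outpoint) for outpoint in tx.inputs])
        for n, amount in enumerate(tx.outputs):
            utxos[(tx.txid, n)] = amount
    return undo


def disconnect_block(utxos: Dict[OutPoint, int], block: List[Tx],
                     undo: List[List[int]]) -> None:
    """Reverse connect_block using the saved undo data."""
    for i in reversed(range(len(block))):
        tx = block[i]
        for n in range(len(tx.outputs)):
            del utxos[(tx.txid, n)]  # remove outputs the block created
        if i > 0:
            for outpoint, coin in zip(tx.inputs, undo[i - 1]):
                utxos[outpoint] = coin  # restore the coins the block spent
```

The undo list has one entry per non-coinbase transaction, mirroring the vtxundo layout described above.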

Pruning

A fully synced, unpruned Bitcoin node stores more than 600 GB of block data, and the total keeps growing. Pruning lets a node delete old block and undo files while keeping the UTXO set, which is all it needs to validate new blocks going forward.

How Pruning Works
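As a rough sketch of the idea, the function below picks whole block files (with their undo files) for deletion, oldest first, until disk usage falls under a target. It assumes that files containing any of the most recent 288 blocks are never deleted, so that recent reorganizations can still be handled. BlockFileInfo and files_to_prune are illustrative names, and the real selection logic in Bitcoin Core has additional conditions not modeled here.

```python
from dataclasses import dataclass
from typing import List

MIN_BLOCKS_TO_KEEP = 288  # assumed safety window of recent blocks


@dataclass
class BlockFileInfo:
    n_file: int         # file number (blk/rev pair)
    n_size: int         # bytes of block data
    n_undo_size: int    # bytes of undo data
    n_height_last: int  # highest block height stored in the file


def files_to_prune(files: List[BlockFileInfo], tip_height: int,
                   target_bytes: int) -> List[int]:
    """Return the file numbers to delete, oldest first."""
    usage = sum(f.n_size + f.n_undo_size for f in files)
    last_prunable_height = tip_height - MIN_BLOCKS_TO_KEEP
    pruned: List[int] = []
    for f in sorted(files, key=lambda info: info.n_file):
        if usage <= target_bytes:
            break  # already under the target
        if f.n_height_last > last_prunable_height:
            continue  # file holds blocks that must be kept
        usage -= f.n_size + f.n_undo_size
        pruned.append(f.n_file)
    return pruned
```

Deleting whole files rather than individual blocks keeps the bookkeeping simple: the per-file statistics are enough to decide what can go.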

⚠️ Pruning Trade-offs

A pruned node cannot serve peers the historical blocks it has deleted, rescan the pruned part of the blockchain for newly imported wallet keys, or rebuild indexes from block data it no longer has. It still fully validates new blocks and transactions.

Optional Indexes

Bitcoin Core provides an indexing framework (BaseIndex) for building secondary indexes on top of the blockchain. These are optional and configurable via CLI flags.

BaseIndex Framework

All indexes inherit from BaseIndex, which provides:

TxIndex (Transaction Index)

Enabled with -txindex. Maps every transaction ID to its position on disk (CDiskTxPos: block file + offset within the block). Without this index, the getrawtransaction RPC can only return transactions that are in the mempool, or confirmed transactions when the caller also supplies the hash of the containing block.
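A sketch of how such an index could be filled while processing one block: each transaction's offset is the running sum of the serialized sizes of the transactions before it. The layout is an assumption for illustration (offsets counted from the end of the block header, starting after the transaction-count prefix), and DiskTxPos, compact_size_len, and index_block are hypothetical names.

```python
from typing import Dict, List, NamedTuple, Tuple


class DiskTxPos(NamedTuple):
    n_file: int       # block file number
    n_pos: int        # byte offset of the block within that file
    n_tx_offset: int  # byte offset of the transaction within the block


def compact_size_len(n: int) -> int:
    """Length in bytes of the variable-length integer encoding of `n`."""
    if n < 253:
        return 1
    if n <= 0xFFFF:
        return 3
    if n <= 0xFFFFFFFF:
        return 5
    return 9


def index_block(index: Dict[str, DiskTxPos], n_file: int, n_pos: int,
                txs: List[Tuple[str, int]]) -> None:
    """Record a position for each (txid, serialized_size) pair in a block."""
    offset = compact_size_len(len(txs))  # skip the transaction count
    for txid, size in txs:
        index[txid] = DiskTxPos(n_file, n_pos, offset)
        offset += size
```

With the index in place, a lookup by txid is a single key-value read followed by one seek into the block file, with no need to know which block contains the transaction.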

CoinStatsIndex

Enabled with -coinstatsindex. Tracks cumulative UTXO statistics at each block height using MuHash (a rolling hash that supports adding and removing elements incrementally). It lets the gettxoutsetinfo RPC return the total UTXO count, total amount, and a hash of the entire UTXO set without scanning the UTXO database each time.
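The rolling property can be illustrated with a toy multiplicative set hash: each element maps to a large number, insertion multiplies it into a numerator, removal multiplies it into a denominator, and the result does not depend on the order of operations. This is a sketch of the idea only. The modulus 2^3072 - 1103717 matches the one MuHash3072 is built on, but the element mapping below uses SHAKE-256 as a stand-in, so digests will not match Bitcoin Core's.

```python
import hashlib

MODULUS = 2**3072 - 1103717  # prime modulus


def to_num(data: bytes) -> int:
    """Map an element to a number below the modulus (stand-in mapping)."""
    return int.from_bytes(hashlib.shake_256(data).digest(384), "little") % MODULUS


class ToyMuHash:
    def __init__(self) -> None:
        self.numerator = 1
        self.denominator = 1

    def insert(self, data: bytes) -> None:
        self.numerator = self.numerator * to_num(data) % MODULUS

    def remove(self, data: bytes) -> None:
        self.denominator = self.denominator * to_num(data) % MODULUS

    def digest(self) -> bytes:
        # Divide numerator by denominator via the modular inverse.
        value = self.numerator * pow(self.denominator, -1, MODULUS) % MODULUS
        return hashlib.sha256(value.to_bytes(384, "little")).digest()
```

Because each block only adds the outputs it creates and removes the ones it spends, the per-height hash can be updated with work proportional to the block, not to the whole UTXO set.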

Compact Block Filters (BIP 157/158)

Building the filters is enabled with -blockfilterindex; serving them to peers additionally requires -peerblockfilters. This is the modern mechanism for light client support, replacing the deprecated BIP 37 bloom filters.

How It Works

  1. Filter construction (BIP 158): for each block, a compact probabilistic filter is built using a Golomb-Coded Set (GCS). The filter encodes all scriptPubKeys spent or created in the block.
  2. Filter serving (BIP 157): light clients download these small filters (a few KB per block instead of the full ~1-2 MB block), test locally whether any of their scripts match, and download the full block only if there is a match.
  3. Filter chain: filters are chained via header hashes (getcfheaders) so clients can verify they have the correct filter for each block.
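The chaining in step 3 can be sketched as follows, using the BIP 157 rule that a filter header is the double-SHA256 of the filter's hash concatenated with the previous filter header, starting from 32 zero bytes for the first block. The function names are illustrative, and the filters are treated as opaque byte strings.

```python
import hashlib
from typing import List


def double_sha256(data: bytes) -> bytes:
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()


def filter_header(filter_bytes: bytes, prev_header: bytes) -> bytes:
    """Commit to this block's filter and to every filter before it."""
    return double_sha256(double_sha256(filter_bytes) + prev_header)


def filter_header_chain(filters: List[bytes]) -> List[bytes]:
    """Compute the header for each filter, in block order."""
    headers: List[bytes] = []
    prev = bytes(32)  # previous header for the first block is all zeros
    for filter_bytes in filters:
        prev = filter_header(filter_bytes, prev)
        headers.append(prev)
    return headers
```

Since each header depends on all earlier filters, a client that obtains the same header from several peers can be reasonably confident they agree on every filter up to that block, and a differing header pinpoints where peers diverge.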

Why Not Bloom Filters?

BIP 37 bloom filters had the client send its filter to the server, which revealed information about the client's addresses (privacy leak). BIP 157/158 reverses this: the server builds one filter per block and serves it to all clients identically. The client tests the filter locally, revealing nothing to the server about which addresses it's interested in.

uint8_t    m_filter_type  Filter type (Basic = 0, currently the only type)
uint256    m_block_hash   Hash of the block this filter covers
GCSFilter  m_filter       The Golomb-Coded Set: probabilistic set membership test
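To give a feel for what a Golomb-Coded Set holds, the sketch below hashes each item into the range [0, N*M), sorts the values, and Golomb-Rice encodes the differences between neighbors: the quotient in unary, the remainder in P bits. P = 19 and M = 784931 are the BIP 158 parameters for basic filters. Everything else is simplified for readability: a SHA-256 based hash stands in for the keyed SipHash that BIP 158 specifies, and the output is a string of '0' and '1' characters rather than packed bytes.

```python
import hashlib
from typing import Iterable, List, Tuple

P = 19      # bits used for each remainder
M = 784931  # inverse false-positive rate


def hash_to_range(item: bytes, f: int) -> int:
    """Map an item uniformly into [0, f) (stand-in for keyed SipHash)."""
    h = int.from_bytes(hashlib.sha256(item).digest()[:8], "big")
    return (h * f) >> 64


def build_filter(items: Iterable[bytes]) -> Tuple[int, str]:
    """Return (item count, Golomb-Rice encoded deltas as a bit string)."""
    unique = set(items)
    n = len(unique)
    values = sorted(hash_to_range(item, n * M) for item in unique)
    bits: List[str] = []
    last = 0
    for value in values:
        delta = value - last
        quotient, remainder = delta >> P, delta & ((1 << P) - 1)
        bits.append("1" * quotient + "0" + format(remainder, f"0{P}b"))
        last = value
    return n, "".join(bits)


def decode_filter(n: int, bits: str) -> List[int]:
    """Recover the sorted hashed values from the bit string."""
    values: List[int] = []
    pos = 0
    last = 0
    for _ in range(n):
        quotient = 0
        while bits[pos] == "1":  # unary-coded quotient
            quotient += 1
            pos += 1
        pos += 1                 # skip the terminating '0'
        remainder = int(bits[pos:pos + P], 2)
        pos += P
        last += (quotient << P) | remainder
        values.append(last)
    return values


def matches(n: int, bits: str, item: bytes) -> bool:
    """Probabilistic membership test: no false negatives, rare false positives."""
    return hash_to_range(item, n * M) in decode_filter(n, bits)
```

Because the sorted values are spread roughly evenly over the range, most differences are close to M, so each item costs only slightly more than P bits, which is what keeps the filters small.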

P2P Messages for Block Filters