🔎 Under the Hood: Flushing an LSM Tree

1. The in-memory structure

  • The MemTable in an LSM tree can indeed be any ordered dictionary (balanced BST, skiplist, even a B-tree).

  • What matters is that it supports fast inserts and can be iterated in sorted order.

  • Most implementations use a skiplist (LevelDB, RocksDB) or a red-black/AVL tree (Cassandra's Java memtables), because they give:

    • O(log n) inserts and lookups
    • O(n) iteration in sorted order

So yes: conceptually it can be a BST, but in practice you often see a skiplist because of simpler concurrency and better cache friendliness.
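
As a concrete illustration, here is a minimal, single-threaded Python sketch of a skiplist-backed memtable. It is not any particular engine's implementation (the class names, MAX_LEVEL, and the approx_bytes counter are illustrative): put/get are O(log n) expected, and items() walks level 0 in key order, which is exactly the property a flush relies on.

```python
import random


class _Node:
    __slots__ = ("key", "value", "forward")

    def __init__(self, key, value, level):
        self.key = key
        self.value = value
        self.forward = [None] * level   # forward[i] = next node at level i


class SkipListMemTable:
    """Toy ordered-dictionary memtable: O(log n) expected put/get,
    O(n) iteration in key order (which is what a flush walks)."""

    MAX_LEVEL = 16
    P = 0.5

    def __init__(self):
        self.head = _Node(None, None, self.MAX_LEVEL)   # sentinel head node
        self.level = 1                                  # highest level currently in use
        self.approx_bytes = 0                           # rough size, used to decide when to flush

    def _random_level(self):
        lvl = 1
        while lvl < self.MAX_LEVEL and random.random() < self.P:
            lvl += 1
        return lvl

    def put(self, key, value):
        update = [self.head] * self.MAX_LEVEL
        node = self.head
        # Walk top-down, remembering the rightmost node < key at each level.
        for i in range(self.level - 1, -1, -1):
            while node.forward[i] is not None and node.forward[i].key < key:
                node = node.forward[i]
            update[i] = node
        nxt = node.forward[0]
        if nxt is not None and nxt.key == key:
            nxt.value = value                           # overwrite in place: latest version wins
            return
        lvl = self._random_level()
        self.level = max(self.level, lvl)
        new = _Node(key, value, lvl)
        for i in range(lvl):
            new.forward[i] = update[i].forward[i]
            update[i].forward[i] = new
        self.approx_bytes += len(str(key)) + len(str(value))

    def get(self, key):
        node = self.head
        for i in range(self.level - 1, -1, -1):
            while node.forward[i] is not None and node.forward[i].key < key:
                node = node.forward[i]
        node = node.forward[0]
        return node.value if node is not None and node.key == key else None

    def items(self):
        """Sorted iteration = walk the bottom level; no sort step needed."""
        node = self.head.forward[0]
        while node is not None:
            yield node.key, node.value
            node = node.forward[0]
```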


2. Flushing to disk

  • When the MemTable fills up, it's frozen and written to disk as an SSTable (Sorted String Table).
  • No sorting step is needed at flush time, because the MemTable is already kept sorted by the BST/skiplist property.
  • The flush is just an in-order traversal of the MemTable → a sequential O(n) walk → write out to disk.

This is a big advantage: you don't pay an O(n log n) sort cost on every flush, only a linear scan.
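
To make the "flush is just a sorted walk" point concrete, here is a toy sketch that writes the memtable above to a text file in key order and records a sparse fence-key index. The file format and the index_every parameter are illustrative; real SSTables add blocks, Bloom filters, checksums, and a footer.

```python
def flush_memtable(memtable, path, index_every=16):
    """Write the memtable to disk as a toy 'SSTable': one key-sorted data file
    plus a sparse (fence key, byte offset) index. The whole flush is a single
    in-order O(n) walk feeding sequential writes; no sort step is needed."""
    sparse_index = []
    with open(path, "w", encoding="utf-8") as f:
        for i, (key, value) in enumerate(memtable.items()):   # already sorted
            if i % index_every == 0:
                sparse_index.append((key, f.tell()))           # one fence entry per "block"
            f.write(f"{key}\t{value}\n")
    return sparse_index
```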


3. Why it works

  • The magic of the LSM design is that sorted runs accumulate cheaply (each flush is sorted by construction).
  • Later, compaction is where merging happens; that's where you pay O(n log k)-style costs (merging k runs).
  • But compaction is done in the background and amortized across many writes.

Summary:

  • MemTable can be any BST-like structure (or skiplist).
  • Flushing is just an inorder traversal (O(n)), because the data is already sorted.
  • The real heavy lifting happens in compaction, not in the flush.

Here's the full lifecycle of a key in an LSM-tree system, step by step, with notes on where the costs show up.


🔄 Lifecycle of a Key in an LSM Tree

1. Insert

  • Action: A client writes a new key-value pair.

    • Append to the WAL (sequential write → fsync for durability).
    • Insert into the MemTable (balanced BST/skiplist → O(log n)).
  • Cost profile: ✅ Very cheap: a sequential log write plus an in-memory insert (a write-path sketch follows this list).

    • This is why LSMs excel at high write throughput.
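
A minimal sketch of that write path, assuming the SkipListMemTable from the first sketch. The WriteAheadLog class, the flush_threshold value, and the approx_bytes counter are illustrative names, not a specific engine's API.

```python
import os


class WriteAheadLog:
    """Append-only log; a write is acknowledged only after it is durable."""

    def __init__(self, path):
        self.f = open(path, "ab")

    def append(self, key, value):
        self.f.write(f"{key}\t{value}\n".encode("utf-8"))
        self.f.flush()
        os.fsync(self.f.fileno())        # sequential write + fsync = the durability point


def put(wal, memtable, key, value, flush_threshold=4 * 1024 * 1024):
    wal.append(key, value)               # 1. append to the WAL (cheap, sequential)
    memtable.put(key, value)             # 2. O(log n) insert into the in-memory structure
    if memtable.approx_bytes >= flush_threshold:
        # 3. In a real engine the memtable would now be frozen and handed to a
        #    background flush (step 2 of the lifecycle); omitted in this sketch.
        pass
```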

2. Flush

  • Action: When MemTable reaches size threshold:

    • It is frozen (becomes an immutable memtable).
    • Traversed in order (O(n)) to convert from tree to list.
    • Written out sequentially to disk as an SSTable.
  • Cost profile: ✅ Still cheap: just a linear scan + sequential disk write. ❌ But creates more files on disk.


3. Compaction

  • Action: Periodic background process merges SSTables:

    • Resolves duplicate keys (latest version wins).
    • Removes deleted entries (tombstones).
    • Produces larger, fewer sorted SSTables.
  • Cost profile: ⚠️ Expensive: data is rewritten multiple times (see the merge sketch after this list).

    • This is the main source of write amplification.
    • Compaction policies (tiered, leveled, hybrid) balance space, read cost, and write cost.
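
Below is a hedged sketch of the merge at the heart of compaction: a heapq-based k-way merge over runs ordered newest-first, keeping the newest version of each key and dropping tombstoned keys. The TOMBSTONE sentinel, the in-memory lists standing in for on-disk runs, and the assumption that no older snapshot still needs dropped entries are all simplifications.

```python
import heapq

TOMBSTONE = object()       # stand-in delete marker; real engines store an explicit flag


def _tag(run, age):
    # Decorate entries as (key, age, value) so equal keys from different runs
    # sort newest-first (age 0 = newest run).
    for key, value in run:
        yield key, age, value


def compact(runs):
    """Merge k sorted runs (ordered newest first) into one sorted, deduplicated run.

    For duplicate keys the newest version wins; keys whose newest version is a
    tombstone are dropped (assuming no older snapshot still needs them).
    Total cost: O(n log k) for n entries across k runs."""
    streams = [_tag(run, age) for age, run in enumerate(runs)]
    out, last_key = [], object()                     # sentinel matching no real key
    for key, _age, value in heapq.merge(*streams):   # the k-way merge
        if key == last_key:
            continue                                 # older duplicate, already decided
        last_key = key
        if value is not TOMBSTONE:
            out.append((key, value))                 # tombstoned keys are simply not re-emitted
    return out


# Example: "c" was deleted in the newest run, so it disappears from the output.
newest = [("a", "1"), ("c", TOMBSTONE)]
older = [("a", "0"), ("b", "9"), ("c", "7")]
assert compact([newest, older]) == [("a", "1"), ("b", "9")]
```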

4. Read

  • Action: To serve a lookup:

    • Check the MemTable.
    • Check any immutable MemTables waiting for flush.
    • Filter SSTables down to candidates: first, use per-file key-range metadata (smallest_key, largest_key) to exclude files whose ranges don't cover the target key. Only overlapping files become candidates.
    • Iterate candidates newest → oldest (e.g., L0 first, then L1 … Lmax) and stop as soon as you find a value or a tombstone (delete marker); see the sketch after this list.
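
Putting that read order together, here is a minimal point-lookup sketch. It assumes every structure exposes a get(key) that returns a value, TOMBSTONE, or MISSING, and that each SSTable object carries smallest_key/largest_key metadata; these names are illustrative, not a specific engine's API.

```python
MISSING = object()      # the structure has no entry for this key at all
TOMBSTONE = object()    # the structure records that this key was deleted


def read_key(key, memtable, immutables, sstables):
    """Point lookup, newest to oldest. `immutables` are frozen memtables awaiting
    flush (newest first); `sstables` are per-file handles, newest first."""
    # 1-2. The active memtable, then any frozen memtables awaiting flush.
    for src in [memtable, *immutables]:
        value = src.get(key)
        if value is not MISSING:
            return None if value is TOMBSTONE else value

    # 3-4. On-disk SSTables, newest -> oldest (L0 first, then deeper levels).
    for sst in sstables:
        if not (sst.smallest_key <= key <= sst.largest_key):
            continue                      # key-range metadata rules this file out
        value = sst.get(key)              # Bloom filter + index + block search inside the file
        if value is not MISSING:
            return None if value is TOMBSTONE else value

    return None                           # key was never written (or its delete was compacted away)
```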

📚 Read path through SSTables

  1. Bloom filter check

    • Each SSTable has a Bloom filter (usually at the block level).
    • If the filter says "key definitely not here," you skip that SSTable altogether.
    • This keeps you from even opening the file or doing a binary search in most cases.
    • Cost: O(1) per SSTable, with a tiny chance of false positive.
  2. Per-file index lookup (if the Bloom filter says "maybe here")

    • SSTables are written in blocks (e.g., 4–16 KB).
    • Each file has an index mapping "smallest key in block → block offset."
    • To find the right block: binary search the index = O(log F), where F = number of blocks in the file.
      • Engines keep a sparse in-memory index of fence keys (one per block), so you binary-search that small array first and jump straight to the target block: same O(log F) asymptotics, but fewer comparisons and usually just one I/O. Some engines add partitioned indexes and prefix-compressed restart points to cut intra-block comparisons further.
  3. Block search

    • Once you read the block from disk into memory, you search inside it.
    • Usually a small binary search over dozens to hundreds of entries = O(log B), where B = entries per block (a combined sketch of steps 1-3 follows below).

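
Here is a combined sketch of steps 1-3 inside one SSTable: an illustrative Bloom filter, a sparse fence-key index searched with bisect, and a binary search inside the chosen block. The on-disk layout is simplified to in-memory lists, and num_bits/num_hashes are arbitrary illustrative defaults.

```python
import bisect
from hashlib import blake2b


class TinyBloom:
    """Small Bloom filter: k hash probes into a bit array; false positives
    are possible, false negatives are not."""

    def __init__(self, num_bits=8192, num_hashes=5):
        self.num_bits, self.num_hashes = num_bits, num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, key):
        for i in range(self.num_hashes):
            h = blake2b(f"{i}:{key}".encode(), digest_size=8).digest()
            yield int.from_bytes(h, "big") % self.num_bits

    def add(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))


def sstable_get(key, bloom, fence_keys, blocks):
    """Lookup inside one SSTable: fence_keys[i] is the smallest key of blocks[i],
    and each block is a small sorted list of (key, value) pairs; in a real file
    the block would be read from disk at step 3."""
    # 1. O(1) Bloom filter check: skip the file entirely on a definite miss.
    if not bloom.might_contain(key):
        return None

    # 2. O(log F) binary search over the sparse fence-key index to pick a block.
    idx = bisect.bisect_right(fence_keys, key) - 1
    if idx < 0:
        return None                       # key sorts before the first block
    block = blocks[idx]                   # this is where the single block I/O happens

    # 3. O(log B) binary search inside the block.
    keys = [k for k, _ in block]
    pos = bisect.bisect_left(keys, key)
    if pos < len(block) and block[pos][0] == key:
        return block[pos][1]
    return None                           # Bloom filter false positive, or key absent
```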
🚦 Summary

  • Inserts: fast, append-friendly.
  • Flush: linear, cheap.
  • Compaction: heavy background cost, amortized.
  • Reads: efficient for ranges, more expensive for point lookups unless well-compacted.

==============