🔎 Under the Hood: Flushing an LSM Tree

1. The in-memory structure

  • The MemTable in an LSM tree can indeed be any ordered dictionary (balanced BST, skiplist, even a B-tree).

  • What matters is that it supports fast inserts and can be iterated in sorted order.

  • Most implementations use a skiplist (LevelDB, RocksDB, and Cassandra's ConcurrentSkipListMap-backed Java memtables) or a balanced tree such as a red–black or AVL tree, because these give:

    • O(log n) inserts and lookups
    • O(n) iteration in sorted order

So yes: conceptually it can be a BST, but in practice you often see a skiplist because of simpler concurrency and cache friendliness.
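
To make this concrete, here is a minimal sketch in Java (illustrative names, not any engine's actual API), using the JDK's ConcurrentSkipListMap as the ordered dictionary:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;

// Minimal MemTable sketch: the skiplist keeps keys sorted on every insert,
// so a later flush is just an in-order walk with no sort step.
public class MemTable {
    private final ConcurrentSkipListMap<String, String> map = new ConcurrentSkipListMap<>();

    public void put(String key, String value) { map.put(key, value); }  // O(log n)
    public String get(String key) { return map.get(key); }              // O(log n)

    // Sorted iteration falls out of the structure: O(n), no sorting needed.
    public Iterable<Map.Entry<String, String>> sortedEntries() {
        return map.entrySet();
    }
}
```

A skiplist also tolerates concurrent readers and writers without the rebalancing a tree needs, which is the concurrency advantage mentioned above.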


2. Flushing to disk

  • When the MemTable fills up, it's frozen and written to disk as an SSTable (Sorted String Table).
  • No sorting step is needed at flush time, because the MemTable is already kept sorted by the BST/skiplist property.
  • The flush is just an in-order traversal of the MemTable → sequential O(n) walk → write out to disk.

This is a big advantage: you don't pay an O(n log n) sort cost on every flush, only a linear scan.
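
Continuing the sketch, a flush is just a loop over the memtable's entry set; the record format here (writeUTF key/value pairs) is a deliberately naive stand-in for a real SSTable layout:

```java
import java.io.BufferedOutputStream;
import java.io.DataOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;

// Flush sketch: walk the already-sorted memtable once and append each entry
// to disk in order. Real SSTables also write blocks, a fence-key index, and
// a Bloom filter; this emits only the sorted data records.
public class Flusher {
    public static void flush(ConcurrentSkipListMap<String, String> memtable,
                             String path) throws IOException {
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(path)))) {
            for (Map.Entry<String, String> e : memtable.entrySet()) { // in-order, O(n)
                out.writeUTF(e.getKey());   // length-prefixed key
                out.writeUTF(e.getValue()); // length-prefixed value
            }
        } // purely sequential writes: no seeks, no O(n log n) sort
    }
}
```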


3. Why it works

  • The magic of the LSM design is that sorted runs accumulate cheaply (each flush is sorted by construction).
  • Later, compaction is the place where merging happens; that's where you pay O(n log k)-style costs (merging k runs).
  • But compaction is done in the background and amortized across many writes.
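
Here is a sketch of that merge core in the same illustrative Java. Runs are passed newest-first, and a null value stands in for a tombstone (real engines tag entries with sequence numbers instead, e.g. entries built with AbstractMap.SimpleEntry to allow nulls):

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// K-way merge sketch for compaction: one cursor per sorted run, ordered by
// (key, run age) in a min-heap. For duplicate keys the newest run wins;
// tombstones (null values here) are dropped from the output.
public class Compactor {
    record Cursor(Iterator<Map.Entry<String, String>> it,
                  Map.Entry<String, String> head, int age) {} // age 0 = newest

    public static List<Map.Entry<String, String>> merge(
            List<Iterator<Map.Entry<String, String>>> runs) { // newest first
        PriorityQueue<Cursor> heap = new PriorityQueue<>((a, b) -> {
            int c = a.head().getKey().compareTo(b.head().getKey());
            return c != 0 ? c : Integer.compare(a.age(), b.age());
        });
        for (int i = 0; i < runs.size(); i++)
            if (runs.get(i).hasNext())
                heap.add(new Cursor(runs.get(i), runs.get(i).next(), i));

        List<Map.Entry<String, String>> out = new ArrayList<>();
        String lastKey = null;
        while (!heap.isEmpty()) {                     // O(n log k) overall
            Cursor c = heap.poll();
            Map.Entry<String, String> e = c.head();
            if (!e.getKey().equals(lastKey)) {        // first (= newest) version wins
                lastKey = e.getKey();
                if (e.getValue() != null) out.add(e); // null = tombstone: drop it
            }
            if (c.it().hasNext())
                heap.add(new Cursor(c.it(), c.it().next(), c.age()));
        }
        return out;
    }
}
```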

Summary:

  • MemTable can be any BST-like structure (or skiplist).
  • Flushing is just an in-order traversal (O(n)), because the data is already sorted.
  • The real heavy lifting happens in compaction, not in the flush.

Here's the full lifecycle of a key in an LSM tree, step by step, with notes on where the costs show up.


🔄 Lifecycle of a Key in an LSM Tree

1. Insert

  • Action: A client writes a new key–value pair.

    • Append to the WAL (sequential write → fsync for durability).
    • Insert into the MemTable (balanced BST/skiplist → O(log n)).
  • Cost profile: ✅ Very cheap: sequential log write + memory insertion.

    • This is why LSMs excel at high write throughput.
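
A sketch of that write path, assuming a naive tab-separated record format (real WALs use length-prefixed, checksummed records):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

// WAL sketch: every write is appended to the log and fsynced before the
// in-memory insert is acknowledged, so a crash can't lose acknowledged data.
public class WriteAheadLog {
    private final FileChannel channel;

    public WriteAheadLog(String path) throws IOException {
        channel = FileChannel.open(Paths.get(path),
            StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    public void append(String key, String value) throws IOException {
        byte[] record = (key + "\t" + value + "\n").getBytes(StandardCharsets.UTF_8);
        channel.write(ByteBuffer.wrap(record)); // sequential append
        channel.force(false);                   // fsync: durable before we ack
    }
}
```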

2. Flush

  • Action: When the MemTable reaches its size threshold:

    • It is frozen (becomes an immutable memtable).
    • Traversed in order (O(n)) to convert from tree to list.
    • Written out sequentially to disk as an SSTable.
  • Cost profile: ✅ Still cheap: just a linear scan + sequential disk write. ❌ But it creates more files on disk.
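
A sketch of the freeze step, with an entry-count threshold standing in for the byte-size accounting real engines do:

```java
import java.util.concurrent.ConcurrentLinkedDeque;
import java.util.concurrent.ConcurrentSkipListMap;

// Rotation sketch: when the active memtable reaches its threshold it is
// frozen (moved to the immutable queue) and replaced; a background thread
// would drain the queue, flushing each frozen memtable to an SSTable.
public class MemTableManager {
    private static final int FLUSH_THRESHOLD = 1_000_000; // entry count; real engines track bytes

    private ConcurrentSkipListMap<String, String> active = new ConcurrentSkipListMap<>();
    private final ConcurrentLinkedDeque<ConcurrentSkipListMap<String, String>> immutables =
            new ConcurrentLinkedDeque<>();
    private int count = 0;

    public synchronized void put(String key, String value) {
        active.put(key, value);
        if (++count >= FLUSH_THRESHOLD) {
            immutables.addFirst(active);            // freeze: reads still see it, writes stop
            active = new ConcurrentSkipListMap<>(); // new writes land in a fresh memtable
            count = 0;
            // background flusher: Flusher.flush(immutables.peekLast(), ...) then remove it
        }
    }
}
```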


3. Compaction

  • Action: Periodic background process merges SSTables:

    • Resolves duplicate keys (latest version wins).
    • Removes deleted entries (tombstones).
    • Produces larger, fewer sorted SSTables.
  • Cost profile: ⚠️ Expensive: data is rewritten multiple times.

    • This is the main source of write amplification.
    • Compaction policies (tiered, leveled, hybrid) balance space, read cost, and write cost.
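
To put a rough, illustrative number on that (an estimate, not a measurement): under leveled compaction with fanout T and L levels, a byte is written once to the WAL, once at flush, and then roughly T times for each level it migrates through, so write amplification is on the order of WA ≈ 2 + T·(L−1). With T = 10 and L = 4 that works out to roughly 32×, consistent with the 10–30× figures commonly quoted for leveled compaction; tiered compaction lowers this at the cost of more overlapping runs to check on reads.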

4. Read

  • Action: To serve a lookup:

    • Check the MemTable.
    • Check any immutable MemTables waiting for flush.
    • Filter SSTables down to candidates: first, use per-file key-range metadata (smallest_key, largest_key) to exclude files whose ranges don't cover the target key. Only overlapping files become candidates.
    • Iterate candidates newest → oldest (e.g., L0 first, then L1 … Lmax) and stop as soon as you find a value or a tombstone (delete marker).
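
A sketch of that lookup order, with a sentinel string standing in for a real tombstone marker and the SSTable interface stubbed out:

```java
import java.util.List;
import java.util.concurrent.ConcurrentSkipListMap;

// Lookup sketch: consult sources newest → oldest and stop at the first hit.
// A tombstone is a definitive answer too: the key was deleted, and older
// SSTables must not be consulted.
public class LsmReader {
    static final String TOMBSTONE = "\u0000tombstone"; // illustrative delete marker

    ConcurrentSkipListMap<String, String> active;            // current memtable (newest)
    List<ConcurrentSkipListMap<String, String>> immutables;  // frozen memtables, newest first
    List<SsTable> candidates;                                // range-filtered SSTables, newest first

    interface SsTable {
        boolean mightContain(String key); // Bloom filter check
        String get(String key);           // null = not in this file
    }

    public String get(String key) {
        String v = active.get(key);
        if (v != null) return resolve(v);
        for (ConcurrentSkipListMap<String, String> m : immutables) {
            v = m.get(key);
            if (v != null) return resolve(v);
        }
        for (SsTable t : candidates) {            // newest → oldest
            if (!t.mightContain(key)) continue;   // "definitely not here": skip the file
            v = t.get(key);
            if (v != null) return resolve(v);     // value or tombstone: stop searching
        }
        return null;                              // key not present anywhere
    }

    private static String resolve(String v) {
        return TOMBSTONE.equals(v) ? null : v;    // deleted keys read as "not found"
    }
}
```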

📚 Read path through SSTables

  1. Bloom filter check

    • Each SSTable has a Bloom filter (usually at the block level).
    • If the filter says “key definitely not here,” you skip that SSTable altogether.
    • This keeps you from even opening the file or doing a binary search in most cases.
    • Cost: O(1) per SSTable, with a tiny chance of false positive.
  2. Per-file index lookup (if the Bloom filter says “maybe here”)

    • SSTables are written in blocks (e.g. 4–16 KB).
    • Each file has an index mapping “smallest key in block → block offset.”
    • To find the right block: binary search the index = O(log F), where F = number of blocks in the file.
      • Engines keep a sparse in-memory index of fence keys (one per block), so you binary-search that small array first and jump straight to the target block. The asymptotics are the same O(log F), but with fewer comparisons and usually just one I/O. Some engines add partitioned indexes and prefix-compressed restart points to cut intra-block comparisons further.
  3. Block search

    • Once you read the block from disk into memory, you search inside it.
    • Usually a small binary search over dozens to hundreds of entries = O(log B), where B = entries per block (a sketch of all three steps follows this list).
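
Putting the three steps together, a sketch of one SSTable's point-lookup path; the TreeMap fence index and the double-hashed Bloom filter are simplifications of what engines like RocksDB actually do:

```java
import java.util.BitSet;
import java.util.Map;
import java.util.TreeMap;

// Sketch of a single SSTable's point lookup: Bloom filter first, then a
// sparse fence-key index to pick the one candidate block. The fence index is
// a TreeMap for brevity (real engines use a flat sorted array); the block
// read and in-block binary search (O(log B)) are left as a comment.
public class SsTableReader {
    // Step 1: Bloom filter: O(1) per probe, small false-positive rate.
    static class BloomFilter {
        private final BitSet bits;
        private final int numBits, numHashes;

        BloomFilter(int numBits, int numHashes) {
            this.bits = new BitSet(numBits);
            this.numBits = numBits;
            this.numHashes = numHashes;
        }
        void add(String key) {
            for (int i = 0; i < numHashes; i++) bits.set(slot(key, i));
        }
        boolean mightContain(String key) {
            for (int i = 0; i < numHashes; i++)
                if (!bits.get(slot(key, i))) return false; // definitely absent
            return true;                                   // maybe present
        }
        private int slot(String key, int i) {              // simple double hashing
            return Math.floorMod(key.hashCode() + i * 0x9E3779B9, numBits);
        }
    }

    final BloomFilter bloom = new BloomFilter(1 << 20, 7);
    // Step 2: sparse index mapping each block's smallest key to its offset.
    private final TreeMap<String, Long> fenceIndex = new TreeMap<>();

    // Returns the offset of the only block that can contain the key, or null.
    Long findBlockOffset(String key) {
        if (!bloom.mightContain(key)) return null;             // step 1: skip this file
        Map.Entry<String, Long> e = fenceIndex.floorEntry(key); // O(log F)
        return e == null ? null : e.getValue();
        // Step 3 would read the block at this offset and binary-search its
        // few dozen to few hundred entries in memory: O(log B).
    }
}
```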

🚦 Summary

  • Inserts: fast, append-friendly.
  • Flush: linear, cheap.
  • Compaction: heavy background cost, amortized.
  • Reads: efficient for ranges, more expensive for point lookups unless well-compacted.
