Under the Hood: Flushing an LSM Tree
1. The in-memory structure

- The MemTable in an LSM tree can indeed be any ordered dictionary (balanced BST, skiplist, even a B-tree).
- What matters is that it supports fast inserts and can be iterated in sorted order.
- Most implementations use a skiplist (LevelDB, RocksDB) or a red-black/AVL tree (Cassandra's Java memtables), because they give:
  - O(log n) inserts and lookups
  - O(n) iteration in sorted order

So yes, conceptually it can be a BST, but in practice you often see a skiplist because of simpler concurrency and cache friendliness.
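To make the ordered-dictionary idea concrete, here is a minimal sketch of a BST-backed memtable in Python. It is a toy (unbalanced, single-threaded), so O(log n) inserts hold only on average; real engines use skiplists or balanced trees for worst-case and concurrency guarantees.

```python
# Minimal (unbalanced) BST memtable sketch: O(log n) average insert,
# and an in-order walk yields keys in sorted order for free.
class Node:
    def __init__(self, key, value):
        self.key, self.value = key, value
        self.left = self.right = None

class MemTable:
    def __init__(self):
        self.root = None

    def put(self, key, value):
        # Standard BST insert; a duplicate key overwrites (latest write wins).
        if self.root is None:
            self.root = Node(key, value)
            return
        cur = self.root
        while True:
            if key == cur.key:
                cur.value = value
                return
            branch = "left" if key < cur.key else "right"
            nxt = getattr(cur, branch)
            if nxt is None:
                setattr(cur, branch, Node(key, value))
                return
            cur = nxt

    def items(self):
        # Iterative in-order traversal: emits entries in sorted key order.
        stack, cur = [], self.root
        while stack or cur:
            while cur:
                stack.append(cur)
                cur = cur.left
            cur = stack.pop()
            yield cur.key, cur.value
            cur = cur.right

mt = MemTable()
for k, v in [("b", 2), ("a", 1), ("c", 3)]:
    mt.put(k, v)
print(list(mt.items()))  # [('a', 1), ('b', 2), ('c', 3)]
```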
2. Flushing to disk

- When the MemTable fills up, it's frozen and written to disk as an SSTable (Sorted String Table).
- No sorting step is needed at flush time, because the MemTable is already kept sorted by the BST/skiplist property.
- The flush is just an in-order traversal of the MemTable: a sequential O(n) walk written out to disk.

This is a big advantage: you don't pay an O(n log n) sort cost on every flush, only a linear scan.
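A sketch of that flush path, reusing the toy memtable idea above and assuming a made-up one-entry-per-line file format rather than any real SSTable layout:

```python
# Toy flush: the memtable's sorted iterator is streamed straight to disk.
# No sort step; just a linear walk plus sequential writes.
def flush(memtable_items, path):
    with open(path, "w", encoding="utf-8") as f:
        for key, value in memtable_items:   # already in key order
            f.write(f"{key}\t{value}\n")    # one entry per line (toy format)

flush([("a", "1"), ("b", "2"), ("c", "3")], "sstable-000001.txt")
```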
3. Why it works

- The magic of the LSM design is that sorted runs accumulate cheaply (each flush is sorted by construction).
- Later, compaction is where merging happens: that's where you pay O(n log k)-style costs (merging k runs).
- But compaction is done in the background and amortized across many writes.
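To see the O(n log k) shape concretely: Python's heapq.merge performs exactly this kind of k-way merge over already-sorted runs, keeping a heap of the k current run heads.

```python
import heapq

# k-way merge of sorted runs: a heap of k heads means each of the n total
# entries costs O(log k) to emit, for O(n log k) overall.
run1 = [("a", 1), ("d", 4)]
run2 = [("b", 2), ("e", 5)]
run3 = [("c", 3), ("f", 6)]
merged = list(heapq.merge(run1, run2, run3))
print(merged)  # sorted by key across all three runs
```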
Summary:

- MemTable can be any BST-like structure (or skiplist).
- Flushing is just an in-order traversal (O(n)), because the data is already sorted.
- The real heavy lifting happens in compaction, not in the flush.
Now let's walk through the full lifecycle of a key in an LSM tree system, step by step, with notes on where the costs show up.
Lifecycle of a Key in an LSM Tree

1. Insert
- Action: A client writes a new key-value pair.
  - Append to the WAL (sequential write + fsync for durability).
  - Insert into the MemTable (balanced BST/skiplist, O(log n)).
- Cost profile: ✅ Very cheap: sequential log write + memory insertion.
  - This is why LSMs excel at high write throughput.
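A minimal sketch of that write path, assuming a plain append-only file as the WAL and a dict standing in for the sorted memtable:

```python
import os

# Toy write path: append to a WAL, fsync for durability, then update the
# in-memory table. The dict stands in for the sorted structure from
# section 1, where the insert would be O(log n).
class TinyLSM:
    def __init__(self, wal_path="wal.log"):
        self.memtable = {}
        self.wal = open(wal_path, "a", encoding="utf-8")

    def put(self, key, value):
        self.wal.write(f"{key}\t{value}\n")  # sequential append
        self.wal.flush()
        os.fsync(self.wal.fileno())          # durable before acking the client
        self.memtable[key] = value           # in-memory insert

db = TinyLSM()
db.put("user:1", "alice")
```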
2. Flush

- Action: When the MemTable reaches its size threshold:
  - It is frozen (becomes an immutable memtable).
  - Traversed in order (O(n)) to convert from tree to list.
  - Written out sequentially to disk as an SSTable.
- Cost profile: ✅ Still cheap: just a linear scan + sequential disk write. ⚠️ But it creates more files on disk.
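A sketch of the freeze-and-swap step, assuming a dict-based memtable and an arbitrary size threshold; in a real engine a background thread flushes the frozen tables so new writes are never blocked:

```python
import threading

# Freeze-and-swap: when the active memtable crosses the threshold, it
# becomes immutable and a fresh one takes over, so the flush can run in
# the background without stalling inserts.
class MemtableManager:
    def __init__(self, threshold=4):
        self.active = {}
        self.immutables = []          # frozen tables awaiting flush
        self.threshold = threshold
        self.lock = threading.Lock()

    def put(self, key, value):
        with self.lock:
            self.active[key] = value
            if len(self.active) >= self.threshold:
                self.immutables.append(self.active)  # freeze
                self.active = {}                     # swap in a new table

mgr = MemtableManager()
for i in range(5):
    mgr.put(f"k{i}", i)
print(len(mgr.immutables), len(mgr.active))  # 1 frozen table, 1 live entry
```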
3. Compaction

- Action: A periodic background process merges SSTables:
  - Resolves duplicate keys (latest version wins).
  - Removes deleted entries (tombstones).
  - Produces larger, fewer sorted SSTables.
- Cost profile: ⚠️ Expensive: data is rewritten multiple times.
  - This is the main source of write amplification.
  - Compaction policies (tiered, leveled, hybrid) balance space, read cost, and write cost.
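A toy compaction pass showing both rules (latest version wins, tombstones dropped), assuming runs are passed newest-first and using None as a stand-in tombstone:

```python
import heapq

TOMBSTONE = None  # toy delete marker

# Merge runs ordered newest-first, keep only the newest version of each
# key, and drop tombstones from the output.
def compact(runs):
    # Tag each entry with its run's age so ties sort newer-first.
    tagged = [[(k, age, v) for k, v in run] for age, run in enumerate(runs)]
    out, last_key = [], object()
    for key, _age, value in heapq.merge(*tagged):
        if key == last_key:
            continue                  # older version of an already-emitted key
        last_key = key
        if value is not TOMBSTONE:    # deletes vanish from the merged output
            out.append((key, value))
    return out

newer = [("a", "A2"), ("b", TOMBSTONE)]
older = [("a", "A1"), ("b", "B1"), ("c", "C1")]
print(compact([newer, older]))  # [('a', 'A2'), ('c', 'C1')]
```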
4. Read

- Action: To serve a lookup:
  - Check the MemTable.
  - Check any immutable MemTables waiting for flush.
  - Filter SSTables down to candidates: first, use per-file key-range metadata (smallest_key, largest_key) to exclude files whose ranges don't cover the target key. Only overlapping files become candidates.
  - Iterate candidates newest → oldest (e.g., L0 first, then L1 ... Lmax) and stop as soon as you find a value or a tombstone (delete marker).
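Putting the lookup order together, here is a sketch with plain dicts standing in for each source, ordered newest to oldest:

```python
# Toy point lookup: newer data shadows older data, and a tombstone
# terminates the search just like a real value would.
TOMBSTONE = None
MISSING = object()  # sentinel: "this source has no entry for the key"

def get(key, memtable, immutables, sstables):
    # Sources ordered newest -> oldest: active memtable, frozen memtables,
    # then candidate SSTables (L0 first, then deeper levels).
    for source in [memtable, *immutables, *sstables]:
        value = source.get(key, MISSING)
        if value is not MISSING:
            return None if value is TOMBSTONE else value  # stop at first hit
    return None  # key was never written

memtable = {"a": "new"}
immutables = [{"b": TOMBSTONE}]
sstables = [{"a": "old", "b": "deleted-later", "c": "survives"}]
print(get("a", memtable, immutables, sstables))  # 'new'  (shadows the SSTable)
print(get("b", memtable, immutables, sstables))  # None   (tombstone wins)
print(get("c", memtable, immutables, sstables))  # 'survives'
```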
Read path through SSTables
- Bloom filter check
  - Each SSTable has a Bloom filter (usually at the block level).
  - If the filter says "key definitely not here," you skip that SSTable altogether.
  - This keeps you from even opening the file or doing a binary search in most cases.
  - Cost: O(1) per SSTable, with a tiny chance of a false positive.
- Per-file index lookup (if the Bloom filter says "maybe here")
  - SSTables are written in blocks (e.g., 4-16 KB).
  - Each file has an index mapping "smallest key in block → block offset."
  - To find the right block: binary search the index, O(log F), where F = number of blocks in the file.
  - Engines keep a sparse in-memory index of fence keys (one per block), so you binary-search that tiny array first and jump straight to the target block: same O(log F) asymptotics, but fewer comparisons and usually just one I/O. Some add partitioned indexes and prefix-compressed restart points to cut intra-block comparisons further.
- Block search
  - Once you read the block from disk into memory, you search inside it.
  - Usually a small binary search over dozens to hundreds of entries: O(log B), where B = entries per block.
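A toy SSTable reader combining all three steps above (Bloom filter gate, fence-key search to pick a block, binary search inside the block); every name and parameter here is illustrative, not any engine's real API, and in-memory lists stand in for on-disk blocks:

```python
import bisect
import hashlib

class SSTableReader:
    def __init__(self, blocks, num_bits=1024, num_hashes=3):
        self.blocks = blocks                          # sorted (key, value) lists
        self.fence_keys = [b[0][0] for b in blocks]   # smallest key per block
        self.num_bits, self.num_hashes = num_bits, num_hashes
        self.bits = bytearray(num_bits // 8)
        for block in blocks:                          # build the Bloom filter
            for key, _ in block:
                for pos in self._positions(key):
                    self.bits[pos // 8] |= 1 << (pos % 8)

    def _positions(self, key):
        # k probe positions derived from a salted hash of the key.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def _might_contain(self, key):
        # "False" is exact; "True" carries a small false-positive chance.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))

    def get(self, key):
        if not self._might_contain(key):
            return None                               # skip the file entirely
        # O(log F) fence-key search: last block whose smallest key <= key.
        i = bisect.bisect_right(self.fence_keys, key) - 1
        if i < 0:
            return None                               # sorts before every block
        block = self.blocks[i]                        # real engines do one I/O here
        # O(log B) binary search inside the block.
        keys = [k for k, _ in block]
        j = bisect.bisect_left(keys, key)
        if j < len(keys) and keys[j] == key:
            return block[j][1]
        return None

sst = SSTableReader([
    [("apple", 1), ("grape", 2)],   # block 0
    [("mango", 3), ("plum", 4)],    # block 1
])
print(sst.get("mango"))  # 3
print(sst.get("kiwi"))   # None (Bloom filter almost certainly rejects it)
```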
Summary
- Inserts: fast, append-friendly.
- Flush: linear, cheap.
- Compaction: heavy background cost, amortized.
- Reads: efficient for ranges, more expensive for point lookups unless well-compacted.