🔎 Under the Hood: Flushing an LSM Tree

1. The in-memory structure

  • The MemTable in an LSM tree can indeed be any ordered dictionary (balanced BST, skiplist, even a B-tree).

  • What matters is that it supports fast inserts and can be iterated in sorted order.

  • Most implementations use a skiplist (LevelDB, RocksDB) or a red-black/AVL tree (Cassandra's Java memtables), because they give:

    • O(log n) inserts and lookups
    • O(n) iteration in sorted order

So yes: conceptually it can be a BST, but in practice you often see a skiplist because of simpler concurrency and better cache friendliness.
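
As a concrete illustration, here is a minimal, single-threaded Python sketch of a skiplist-backed memtable. It is not any particular engine's implementation (the class names, MAX_LEVEL, and the approx_bytes counter are illustrative): put/get are O(log n) expected, and items() walks level 0 in key order, which is exactly the property a flush relies on.

```python
import random


class _Node:
    __slots__ = ("key", "value", "forward")

    def __init__(self, key, value, level):
        self.key = key
        self.value = value
        self.forward = [None] * level   # forward[i] = next node at level i


class SkipListMemTable:
    """Toy ordered-dictionary memtable: O(log n) expected put/get,
    O(n) iteration in key order (which is what a flush walks)."""

    MAX_LEVEL = 16
    P = 0.5

    def __init__(self):
        self.head = _Node(None, None, self.MAX_LEVEL)   # sentinel head node
        self.level = 1                                  # highest level currently in use
        self.approx_bytes = 0                           # rough size, used to decide when to flush

    def _random_level(self):
        lvl = 1
        while lvl < self.MAX_LEVEL and random.random() < self.P:
            lvl += 1
        return lvl

    def put(self, key, value):
        update = [self.head] * self.MAX_LEVEL
        node = self.head
        # Walk top-down, remembering the rightmost node < key at each level.
        for i in range(self.level - 1, -1, -1):
            while node.forward[i] is not None and node.forward[i].key < key:
                node = node.forward[i]
            update[i] = node
        nxt = node.forward[0]
        if nxt is not None and nxt.key == key:
            nxt.value = value                           # overwrite in place: latest version wins
            return
        lvl = self._random_level()
        self.level = max(self.level, lvl)
        new = _Node(key, value, lvl)
        for i in range(lvl):
            new.forward[i] = update[i].forward[i]
            update[i].forward[i] = new
        self.approx_bytes += len(str(key)) + len(str(value))

    def get(self, key):
        node = self.head
        for i in range(self.level - 1, -1, -1):
            while node.forward[i] is not None and node.forward[i].key < key:
                node = node.forward[i]
        node = node.forward[0]
        return node.value if node is not None and node.key == key else None

    def items(self):
        """Sorted iteration = walk the bottom level; no sort step needed."""
        node = self.head.forward[0]
        while node is not None:
            yield node.key, node.value
            node = node.forward[0]
```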


2. Flushing to disk

  • When the MemTable fills up, it's frozen and written to disk as an SSTable (Sorted String Table).
  • No sorting step is needed at flush time, because the MemTable is already kept sorted by the BST/skiplist property.
  • The flush is just an in-order traversal of the MemTable → a sequential O(n) walk → write out to disk.

This is a big advantage: you don't pay an O(n log n) sort cost on every flush, only a linear scan.
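
To make the "flush is just a sorted walk" point concrete, here is a toy sketch that writes the memtable above to a text file in key order and records a sparse fence-key index. The file format and the index_every parameter are illustrative; real SSTables add blocks, Bloom filters, checksums, and a footer.

```python
def flush_memtable(memtable, path, index_every=16):
    """Write the memtable to disk as a toy 'SSTable': one key-sorted data file
    plus a sparse (fence key, byte offset) index. The whole flush is a single
    in-order O(n) walk feeding sequential writes; no sort step is needed."""
    sparse_index = []
    with open(path, "w", encoding="utf-8") as f:
        for i, (key, value) in enumerate(memtable.items()):   # already sorted
            if i % index_every == 0:
                sparse_index.append((key, f.tell()))           # one fence entry per "block"
            f.write(f"{key}\t{value}\n")
    return sparse_index
```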


3. Why it works

  • The magic of the LSM design is that sorted runs accumulate cheaply (each flush is sorted by construction).
  • Later, compaction is where merging happens; that's where you pay O(n log k)-style costs (merging k runs).
  • But compaction is done in the background and amortized across many writes.

Summary:

  • MemTable can be any BST-like structure (or skiplist).
  • Flushing is just an inorder traversal (O(n)), because the data is already sorted.
  • The real heavy lifting happens in compaction, not in the flush.

Here's the full lifecycle of a key in an LSM-tree system, step by step, with notes on where the costs show up.


🔄 Lifecycle of a Key in an LSM Tree

1. Insert

  • Action: A client writes a new key-value pair.

    • Append to the WAL (sequential write → fsync for durability).
    • Insert into the MemTable (balanced BST/skiplist → O(log n)).
  • Cost profile: ✅ Very cheap: a sequential log write plus an in-memory insert (a write-path sketch follows this list).

    • This is why LSMs excel at high write throughput.
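
A minimal sketch of that write path, assuming the SkipListMemTable from the first sketch. The WriteAheadLog class, the flush_threshold value, and the approx_bytes counter are illustrative names, not a specific engine's API.

```python
import os


class WriteAheadLog:
    """Append-only log; a write is acknowledged only after it is durable."""

    def __init__(self, path):
        self.f = open(path, "ab")

    def append(self, key, value):
        self.f.write(f"{key}\t{value}\n".encode("utf-8"))
        self.f.flush()
        os.fsync(self.f.fileno())        # sequential write + fsync = the durability point


def put(wal, memtable, key, value, flush_threshold=4 * 1024 * 1024):
    wal.append(key, value)               # 1. append to the WAL (cheap, sequential)
    memtable.put(key, value)             # 2. O(log n) insert into the in-memory structure
    if memtable.approx_bytes >= flush_threshold:
        # 3. In a real engine the memtable would now be frozen and handed to a
        #    background flush (step 2 of the lifecycle); omitted in this sketch.
        pass
```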

2. Flush

  • Action: When MemTable reaches size threshold:

    • It is frozen (becomes an immutable memtable).
    • Traversed in order (O(n)) to convert from tree to list.
    • Written out sequentially to disk as an SSTable.
  • Cost profile: ✅ Still cheap: just a linear scan + sequential disk write. ❌ But creates more files on disk.


3. Compaction

  • Action: Periodic background process merges SSTables:

    • Resolves duplicate keys (latest version wins).
    • Removes deleted entries (tombstones).
    • Produces larger, fewer sorted SSTables.
  • Cost profile: ⚠️ Expensive: data is rewritten multiple times (see the merge sketch after this list).

    • This is the main source of write amplification.
    • Compaction policies (tiered, leveled, hybrid) balance space, read cost, and write cost.
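
Below is a hedged sketch of the merge at the heart of compaction: a heapq-based k-way merge over runs ordered newest-first, keeping the newest version of each key and dropping tombstoned keys. The TOMBSTONE sentinel, the in-memory lists standing in for on-disk runs, and the assumption that no older snapshot still needs dropped entries are all simplifications.

```python
import heapq

TOMBSTONE = object()       # stand-in delete marker; real engines store an explicit flag


def _tag(run, age):
    # Decorate entries as (key, age, value) so equal keys from different runs
    # sort newest-first (age 0 = newest run).
    for key, value in run:
        yield key, age, value


def compact(runs):
    """Merge k sorted runs (ordered newest first) into one sorted, deduplicated run.

    For duplicate keys the newest version wins; keys whose newest version is a
    tombstone are dropped (assuming no older snapshot still needs them).
    Total cost: O(n log k) for n entries across k runs."""
    streams = [_tag(run, age) for age, run in enumerate(runs)]
    out, last_key = [], object()                     # sentinel matching no real key
    for key, _age, value in heapq.merge(*streams):   # the k-way merge
        if key == last_key:
            continue                                 # older duplicate, already decided
        last_key = key
        if value is not TOMBSTONE:
            out.append((key, value))                 # tombstoned keys are simply not re-emitted
    return out


# Example: "c" was deleted in the newest run, so it disappears from the output.
newest = [("a", "1"), ("c", TOMBSTONE)]
older = [("a", "0"), ("b", "9"), ("c", "7")]
assert compact([newest, older]) == [("a", "1"), ("b", "9")]
```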

4. Read

  • Action: To serve a lookup:

    • Check the MemTable.
    • Check any immutable MemTables waiting for flush.
    • Filter SSTables down to candidates: first, use per-file key-range metadata (smallest_key, largest_key) to exclude files whose ranges don't cover the target key. Only overlapping files become candidates.
    • Iterate candidates newest → oldest (e.g., L0 first, then L1 … Lmax) and stop as soon as you find a value or a tombstone (delete marker); see the sketch after this list.
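
Putting that read order together, here is a minimal point-lookup sketch. It assumes every structure exposes a get(key) that returns a value, TOMBSTONE, or MISSING, and that each SSTable object carries smallest_key/largest_key metadata; these names are illustrative, not a specific engine's API.

```python
MISSING = object()      # the structure has no entry for this key at all
TOMBSTONE = object()    # the structure records that this key was deleted


def read_key(key, memtable, immutables, sstables):
    """Point lookup, newest to oldest. `immutables` are frozen memtables awaiting
    flush (newest first); `sstables` are per-file handles, newest first."""
    # 1-2. The active memtable, then any frozen memtables awaiting flush.
    for src in [memtable, *immutables]:
        value = src.get(key)
        if value is not MISSING:
            return None if value is TOMBSTONE else value

    # 3-4. On-disk SSTables, newest -> oldest (L0 first, then deeper levels).
    for sst in sstables:
        if not (sst.smallest_key <= key <= sst.largest_key):
            continue                      # key-range metadata rules this file out
        value = sst.get(key)              # Bloom filter + index + block search inside the file
        if value is not MISSING:
            return None if value is TOMBSTONE else value

    return None                           # key was never written (or its delete was compacted away)
```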

📚 Read path through SSTables

  1. Bloom filter check

    • Each SSTable has a Bloom filter (usually at the block level).
    • If the filter says "key definitely not here," you skip that SSTable altogether.
    • This keeps you from even opening the file or doing a binary search in most cases.
    • Cost: O(1) per SSTable, with a tiny chance of false positive.
  2. Per-file index lookup (if the Bloom filter says "maybe here")

    • SSTables are written in blocks (e.g., 4–16 KB).
    • Each file has an index mapping "smallest key in block → block offset."
    • To find the right block: binary search the index = O(log F), where F = number of blocks in the file.
      • Engines keep a sparse in-memory index of fence keys (one per block), so you binary-search that small array first and jump straight to the target block: same O(log F) asymptotics, but fewer comparisons and usually just one I/O. Some engines add partitioned indexes and prefix-compressed restart points to cut intra-block comparisons further.
  3. Block search

    • Once you read the block from disk into memory, you search inside it.
    • Usually a small binary search over dozens to hundreds of entries = O(log B), where B = entries per block (a combined sketch of steps 1-3 follows below).

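
Here is a combined sketch of steps 1-3 inside one SSTable: an illustrative Bloom filter, a sparse fence-key index searched with bisect, and a binary search inside the chosen block. The on-disk layout is simplified to in-memory lists, and num_bits/num_hashes are arbitrary illustrative defaults.

```python
import bisect
from hashlib import blake2b


class TinyBloom:
    """Small Bloom filter: k hash probes into a bit array; false positives
    are possible, false negatives are not."""

    def __init__(self, num_bits=8192, num_hashes=5):
        self.num_bits, self.num_hashes = num_bits, num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, key):
        for i in range(self.num_hashes):
            h = blake2b(f"{i}:{key}".encode(), digest_size=8).digest()
            yield int.from_bytes(h, "big") % self.num_bits

    def add(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))


def sstable_get(key, bloom, fence_keys, blocks):
    """Lookup inside one SSTable: fence_keys[i] is the smallest key of blocks[i],
    and each block is a small sorted list of (key, value) pairs; in a real file
    the block would be read from disk at step 3."""
    # 1. O(1) Bloom filter check: skip the file entirely on a definite miss.
    if not bloom.might_contain(key):
        return None

    # 2. O(log F) binary search over the sparse fence-key index to pick a block.
    idx = bisect.bisect_right(fence_keys, key) - 1
    if idx < 0:
        return None                       # key sorts before the first block
    block = blocks[idx]                   # this is where the single block I/O happens

    # 3. O(log B) binary search inside the block.
    keys = [k for k, _ in block]
    pos = bisect.bisect_left(keys, key)
    if pos < len(block) and block[pos][0] == key:
        return block[pos][1]
    return None                           # Bloom filter false positive, or key absent
```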
🚦 Summary

  • Inserts: fast, append-friendly.
  • Flush: linear, cheap.
  • Compaction: heavy background cost, amortized.
  • Reads: efficient for ranges, more expensive for point lookups unless well-compacted.

==============