Distributed Systems Interview Gotchas Cheat Sheet (Updated)
1. Leader-based vs Leaderless Replication
Common Trap: "Only one node can accept writes." → true only in leader-based systems.
Fix:
- Leader-based (Raft, Paxos, Postgres streaming replication)
  - One leader handles writes.
  - Followers replicate from the leader.
  - Failover triggers a leader election.
- Leaderless (Dynamo, Cassandra, Riak)
  - Any node can accept writes.
  - The receiving node acts as a temporary coordinator.
  - Conflicts are resolved via vector clocks, LWW, or CRDTs.
  - No cluster-wide election on node failure.
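To make the contrast concrete, here is a minimal Python sketch of the two write paths (the log format, replica stores, and W are all illustrative, not any real system's API):

```python
# Leader-based: exactly one node orders and accepts writes;
# followers copy the leader's log.
def leader_write(leader_log: list, followers: list, entry) -> None:
    leader_log.append(entry)
    for follower_log in followers:
        follower_log.append(entry)  # replication flows from the single leader

# Leaderless: whichever replica receives the request coordinates it;
# the write succeeds once W replicas acknowledge.
def leaderless_write(replicas: list, key, value, w: int) -> None:
    acks = 0
    for store in replicas:
        store[key] = value
        acks += 1
        if acks >= w:
            return  # quorum reached; remaining replicas catch up later
    raise RuntimeError("fewer than W replicas acknowledged the write")

log, f1, f2 = [], [], []
leader_write(log, [f1, f2], "SET x=1")        # one ordered log, copied twice
leaderless_write([{}, {}, {}], "x", 1, w=2)   # any node could have coordinated
```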
2. CAP Theorem
Common Trap: "You can only pick 2 out of 3 always."
- Actually: During a partition, you must choose either Consistency or Availability (Partition Tolerance is non-negotiable in distributed systems).
Fix (interview phrasing):
*“In a network partition, you must choose:
- CP: Consistency + Partition Tolerance (block or fail requests)
- AP: Availability + Partition Tolerance (allow temporary inconsistencies)”*
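A toy sketch of that choice on the read path (replica stores, the quorum size, and the CP/AP "mode" flag are all invented for illustration):

```python
# During a partition, a read either fails (CP) or answers from whatever
# replicas are reachable, possibly returning stale data (AP).
def read(key, reachable, quorum, mode):
    if len(reachable) >= quorum:
        # Healthy case: a quorum read returns the latest version seen.
        return max(store[key] for store in reachable)
    if mode == "CP":
        # Consistency: refuse rather than risk a stale answer.
        raise TimeoutError("partitioned: cannot reach a quorum")
    # Availability: answer anyway from what is reachable.
    return max(store[key] for store in reachable)

replicas = [{"k": 2}, {"k": 2}, {"k": 1}]            # versions of key "k"
print(read("k", replicas, quorum=2, mode="CP"))      # 2: quorum reachable
print(read("k", replicas[2:], quorum=2, mode="AP"))  # 1: stale but available
# read("k", replicas[2:], quorum=2, mode="CP") would raise instead of answering.
```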
3. ACID vs CAP
Common Trap: "ACID and CAP are similar because they both have A and C." → leads to confusion about scope.
Fix:
- ACID = think database transactions:
  - What does a DB engine guarantee for one transaction?
  - Atomicity (all or nothing)
  - Consistency (valid state after the transaction)
  - Isolation (transactions don’t step on each other)
  - Durability (persists after a crash)
  - A raw DB instance (like Postgres) doesn’t handle availability; that’s outside ACID’s scope.
- CAP = think distributed system behavior when network communication is unreliable:
  - It answers “who can do what, and when?” when nodes might not see each other.
  - CAP can apply even within one physical system (multiple processes, sandboxing, or IPC boundaries).
- Mental Anchor:
  - ACID = properties of a transaction inside one node.
  - CAP = properties of operations across nodes (or processes) under unreliable communication.
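A quick way to see the "one transaction, one node" scope of ACID is atomicity in a local database. A minimal sketch with Python's built-in sqlite3 (the accounts table and amounts are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()

try:
    with conn:  # opens a transaction; commits on success, rolls back on exception
        conn.execute("UPDATE accounts SET balance = balance - 100 WHERE name = 'alice'")
        raise RuntimeError("crash between the two legs of the transfer")
        conn.execute("UPDATE accounts SET balance = balance + 100 WHERE name = 'bob'")
except RuntimeError:
    pass

# Atomicity: the partial debit was rolled back, so balances are unchanged.
print(dict(conn.execute("SELECT name, balance FROM accounts")))
# {'alice': 100, 'bob': 0}
```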
4. Quorum Math (N, W, R)
Common Trap: "W + R > N always means strong consistency."
- Actually: Only in the absence of partitions (and assuming strict, non-sloppy quorums). During a partition, you still pick between stale reads (AP) or blocking (CP).
Fix (phrasing):
“When W + R > N, at least one replica in any read has the latest write, but CAP still applies during partitions.”
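A brute-force check of the overlap claim (pure Python, illustrative): when W + R > N, every possible write set shares at least one replica with every possible read set.

```python
from itertools import combinations

def quorums_overlap(n: int, w: int, r: int) -> bool:
    """Check that every W-subset of N replicas intersects every R-subset."""
    replicas = range(n)
    return all(set(ws) & set(rs)
               for ws in combinations(replicas, w)
               for rs in combinations(replicas, r))

print(quorums_overlap(n=3, w=2, r=2))  # True: 2 + 2 > 3
print(quorums_overlap(n=3, w=1, r=1))  # False: 1 + 1 <= 3, a read can miss the write
```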
5. Vector Clocks vs LWW
Common Trap:
- Assuming Last Write Wins solves conflicts perfectly.
- Reality: LWW silently drops all but one of the concurrent writes.
Fix:
- Vector Clocks detect concurrency explicitly.
- CRDTs merge changes meaningfully without coordination.
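A minimal sketch of the comparison vector clocks enable (each clock is a dict mapping node id to a counter; node names are illustrative):

```python
# Two versions are concurrent if neither clock dominates the other.
def compare(vc_a: dict, vc_b: dict) -> str:
    nodes = vc_a.keys() | vc_b.keys()
    a_ge = all(vc_a.get(n, 0) >= vc_b.get(n, 0) for n in nodes)
    b_ge = all(vc_b.get(n, 0) >= vc_a.get(n, 0) for n in nodes)
    if a_ge and b_ge:
        return "equal"
    if a_ge:
        return "a happened after b"
    if b_ge:
        return "b happened after a"
    return "concurrent"  # a true conflict; LWW would silently discard one side

print(compare({"n1": 2, "n2": 1}, {"n1": 1, "n2": 1}))  # a happened after b
print(compare({"n1": 2, "n2": 0}, {"n1": 1, "n2": 3}))  # concurrent
```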
6. Consensus vs Replication
Common Trap:
- Treating consensus (Raft, Paxos) as the same as replication (leaderless, multi-master).
Fix:
- Consensus: Agreement on one sequence of operations (used for leader election, consistent logs).
- Replication: Copying state across nodes (may be eventually consistent or strongly consistent depending on design).
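The difference shows up in the commit rule. A heavily simplified, Raft-flavored sketch (the match_index map and cluster size are invented): consensus only commits an entry once a majority has stored it, which is what produces one agreed sequence.

```python
def committed_index(match_index: dict, cluster_size: int) -> int:
    """Highest log index replicated on a majority of nodes."""
    majority = cluster_size // 2 + 1
    indexes = sorted(match_index.values(), reverse=True)
    return indexes[majority - 1]  # the majority-th highest replicated index

# 5-node cluster: each follower has replicated the log up to these indexes.
print(committed_index({"n1": 7, "n2": 7, "n3": 5, "n4": 4, "n5": 2}, 5))  # 5
```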
7. Client-Side vs Server-Side Read Repair
Common Trap:
- Assuming read repair always happens on the server.
Fix:
- Dynamo-style stores can do client-side repair (the reading client/coordinator writes back the newest version); Cassandra does repair server-side on the coordinator node.
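Wherever it runs, the mechanics are the same: read several replicas, take the newest version, and write it back to any stale ones. A minimal sketch (the Replica class and versioned (version, data) values are invented):

```python
def read_with_repair(replicas: list, key: str):
    responses = [(r, r.get(key)) for r in replicas]  # value = (version, data)
    newest = max(v for _, v in responses)            # highest version wins
    for replica, value in responses:
        if value < newest:
            replica.put(key, newest)                 # repair the stale replica
    return newest

class Replica(dict):
    def get(self, key): return super().get(key, (0, None))
    def put(self, key, value): self[key] = value

a, b = Replica(), Replica()
a.put("k", (2, "new")); b.put("k", (1, "old"))
print(read_with_repair([a, b], "k"))  # (2, 'new')
print(b["k"])                         # (2, 'new'): b was repaired
```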
8. Sharding vs Partitioning
Common Trap:
- Thinking they’re synonyms.
Fix:
- Sharding = splitting data across machines by key for horizontal scaling.
- Partitioning = the broader term for splitting data; depending on context it can mean logical partitions inside one database (e.g., Postgres table partitions) or separation across fault domains (e.g., availability zones).
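A minimal key-to-shard routing sketch (the shard count and key format are illustrative). Note Python's built-in hash() is salted per process, so a stable hash keeps routing consistent across runs:

```python
import hashlib

def shard_for(key: str, num_shards: int) -> int:
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

print(shard_for("user:42", 4))  # always the same shard for the same key
print(shard_for("user:43", 4))
# Caveat: naive modulo resharding moves most keys when num_shards changes;
# real systems often use consistent hashing or fixed partition counts instead.
```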
9. Clocks & Time
Common Trap:
- Relying on wall-clock timestamps for ordering.
- NTP drift breaks causality.
Fix:
- Use logical clocks (Lamport, vector clocks) when ordering matters.
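A minimal Lamport clock sketch: logical counters that guarantee L(A) < L(B) whenever A happened-before B, with no wall clock involved (the two-node setup is illustrative):

```python
class LamportClock:
    def __init__(self):
        self.time = 0

    def tick(self) -> int:            # local event
        self.time += 1
        return self.time

    def send(self) -> int:            # stamp an outgoing message
        return self.tick()

    def receive(self, msg_time: int) -> int:  # merge on message receipt
        self.time = max(self.time, msg_time) + 1
        return self.time

a, b = LamportClock(), LamportClock()
t = a.send()         # a: 1
print(b.receive(t))  # b: 2, so the receive is ordered after the send
print(b.tick())      # b: 3
```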