
Distributed Systems Interview Gotchas Cheat Sheet (Updated)


1. Leader-based vs Leaderless Replication

Common Trap: "Only one node can accept writes." This is true only in leader-based systems.

Fix:

  • Leader-based (Raft, Paxos, Postgres streaming)

    • One leader handles writes.
    • Followers replicate from leader.
    • Failover triggers leader election.
  • Leaderless (Dynamo, Cassandra, Riak)

    • Any node can accept writes.
    • Receiving node acts as temporary coordinator.
    • Conflicts resolved via vector clocks, LWW, or CRDTs.
    • No cluster-wide election on node failure.
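The contrast above can be sketched in a few lines. This is a toy illustration, not any real system's API; the class names and the version parameter are invented for the example:

```python
class LeaderBasedNode:
    """Leader-based model: only the leader accepts writes."""
    def __init__(self, is_leader):
        self.is_leader = is_leader
        self.data = {}

    def write(self, key, value):
        if not self.is_leader:
            # Followers redirect or reject; writes go through the leader.
            raise RuntimeError("not the leader: redirect to leader")
        self.data[key] = value

class LeaderlessNode:
    """Leaderless model: any node accepts writes and coordinates them."""
    def __init__(self):
        self.data = {}

    def write(self, key, value, version):
        # Keep versioned values (e.g., vector-clocked) so concurrent
        # writes can be detected and resolved later, not overwritten.
        self.data.setdefault(key, []).append((version, value))

follower = LeaderBasedNode(is_leader=False)
try:
    follower.write("x", 1)
except RuntimeError as e:
    print(e)                      # follower rejects the write

peer = LeaderlessNode()
peer.write("x", 1, version=1)     # any node accepts the write
```

The key behavioral difference: a leader-based follower refuses the write outright, while a leaderless node accepts it and defers conflict handling to versioning.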

2. CAP Theorem

Common Trap: "You can only pick 2 out of 3 always."

  • Actually: During a partition, you must choose either Consistency or Availability (Partition Tolerance is non-negotiable in distributed systems).

Fix (interview phrasing):

"In a network partition, you must choose:

  • CP: Consistency + Partition Tolerance (block or fail requests)
  • AP: Availability + Partition Tolerance (allow temporary inconsistencies)"
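The CP/AP choice is literally a branch in the read path of each replica. A minimal sketch, with an invented Replica class, of what each choice looks like when the node cannot confirm it has the latest value:

```python
class Replica:
    def __init__(self, value, partitioned=False):
        self.value = value            # local copy, possibly stale
        self.partitioned = partitioned

    def read_cp(self):
        # CP: refuse to answer rather than risk returning stale data.
        if self.partitioned:
            raise TimeoutError("unavailable: cannot confirm latest value")
        return self.value

    def read_ap(self):
        # AP: always answer, accepting the value may be stale.
        return self.value

r = Replica(value=42, partitioned=True)
print(r.read_ap())                    # 42, possibly stale
```

During the partition, `read_cp` sacrifices availability (the request fails) while `read_ap` sacrifices consistency (the caller may see old data).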

3. ACID vs CAP

Common Trap: "ACID and CAP are similar because they both have A and C." → leads to confusion about scope.

Fix:

  • ACID = think database transactions:

    • What does a DB engine guarantee for one transaction?
    • Atomicity (all or nothing)
    • Consistency (valid state after transaction)
    • Isolation (transactions don’t step on each other)
    • Durability (persists after crash)
    • A raw DB instance (like Postgres) doesn’t handle availability—that’s outside ACID’s scope.
  • CAP = think distributed system behavior when network communication is unreliable:

    • It answers “who can do what, and when?” when nodes might not see each other.
    • CAP can apply even within one physical system (multiple processes, sandboxing, or IPC boundaries).
  • Mental Anchor:

    ACID = properties of a transaction inside one node
    CAP = properties of operations across nodes (or processes) under unreliable communication


4. Quorum Math (N, W, R)

Common Trap: "W + R > N always means strong consistency."

  • Actually: Only in the absence of partitions. During a partition, you still choose between stale reads (AP) or blocking (CP).

Fix (phrasing):

“When W + R > N, at least one replica in any read has the latest write, but CAP still applies during partitions.”
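The overlap claim can be checked by brute force: if W + R > N, every possible write quorum must share at least one node with every possible read quorum. This is an illustrative check, not how real systems verify quorums:

```python
from itertools import combinations

def quorums_overlap(n, w, r):
    """True if every size-r read quorum intersects every size-w write quorum."""
    nodes = range(n)
    return all(set(ws) & set(rs)
               for ws in combinations(nodes, w)
               for rs in combinations(nodes, r))

print(quorums_overlap(3, 2, 2))   # True:  W + R = 4 > N = 3
print(quorums_overlap(3, 1, 1))   # False: disjoint quorums possible
```

With N=3, W=2, R=2 any two quorums must share a node (pigeonhole), so every read touches at least one replica that saw the latest write. With W=R=1, a read can land entirely on a replica the write never reached.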


5. Vector Clocks vs LWW

Common Trap:

  • Assuming Last Write Wins solves conflicts perfectly.
  • Reality: LWW drops one concurrent write silently.

Fix:

  • Vector Clocks detect concurrency explicitly.
  • CRDTs merge changes meaningfully without coordination.
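A small sketch of the detection step, assuming the common dict-of-counters representation of a vector clock (node id → event count). The point is the fourth outcome, "concurrent", which LWW never surfaces:

```python
def compare(a, b):
    """Return 'before', 'after', 'equal', or 'concurrent' for vector clocks a, b."""
    keys = set(a) | set(b)
    a_le_b = all(a.get(k, 0) <= b.get(k, 0) for k in keys)
    b_le_a = all(b.get(k, 0) <= a.get(k, 0) for k in keys)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"       # a happened-before b: safe to keep b
    if b_le_a:
        return "after"        # b happened-before a: safe to keep a
    return "concurrent"       # true conflict: must merge, not silently drop

print(compare({"n1": 1}, {"n1": 2}))                    # before
print(compare({"n1": 2, "n2": 0}, {"n1": 1, "n2": 1}))  # concurrent
```

LWW would resolve the second case by timestamp and silently discard one of the writes; the vector-clock comparison makes the conflict explicit so a merge (or a CRDT) can handle it.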

6. Consensus vs Replication

Common Trap:

  • Treating consensus (Raft, Paxos) as the same as replication (leaderless, multi-master).

Fix:

  • Consensus: Agreement on one sequence of operations (used for leader election, consistent logs).
  • Replication: Copying state across nodes (may be eventually consistent or strongly consistent depending on design).

7. Client-Side vs Server-Side Read Repair

Common Trap:

  • Assuming conflict resolution and read repair always happen server-side.
  • Dynamo hands divergent versions back to the client, which reconciles them and writes the result back; Cassandra resolves conflicts server-side (last-write-wins timestamps) and repairs stale replicas itself during reads.
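A toy sketch of the server-side variant, with invented data structures (each replica is a dict mapping key → (version, value)): read from several replicas, pick the newest version, and write it back to any stale replica before returning.

```python
def read_with_repair(replicas, key):
    """Read a key from all replicas, repairing any that hold a stale version."""
    # Missing keys are treated as version 0 with no value.
    results = [(r, r.get(key, (0, None))) for r in replicas]
    latest = max(v for _, v in results)      # tuples compare by version first
    for r, v in results:
        if v < latest:
            r[key] = latest                  # repair the stale replica
    return latest[1]

r1 = {"x": (2, "new")}
r2 = {"x": (1, "old")}
print(read_with_repair([r1, r2], "x"))       # new
print(r2["x"])                               # (2, 'new'): repaired in passing
```

In the client-side variant, the coordinator would instead return both versions and let the application merge them and issue the repairing write.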

8. Sharding vs Partitioning

Common Trap:

  • Thinking they’re interchangeable in every context.
  • Partitioning = the general term for splitting a dataset into pieces (by key range, hash, or even by column).
  • Sharding = horizontal partitioning where the pieces live on separate machines, typically to scale storage and write throughput.
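The routing step in hash-based sharding fits in a few lines. A minimal sketch, using a stable hash so every node maps a given key to the same shard (the key format is invented for the example):

```python
import hashlib

def shard_for(key, num_shards):
    """Deterministically map a key to one of num_shards shards."""
    # A stable hash (not Python's randomized hash()) so all nodes agree.
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % num_shards

print(shard_for("user:42", 4))    # same shard on every call, on every node
```

Note that plain modulo hashing reshuffles most keys when `num_shards` changes, which is why production systems usually prefer consistent hashing or fixed-size partition maps.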

9. Clocks & Time

Common Trap:

  • Relying on wall-clock timestamps for ordering events.
  • Reality: clock skew and NTP adjustments can reorder timestamps, breaking causality.

Fix:

  • Use logical clocks (Lamport, vector clocks) when ordering matters.
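The simplest logical clock is Lamport's: a counter incremented on local events and max-merged on message receipt, which yields an ordering consistent with causality without trusting wall clocks. A minimal sketch (the class name is invented):

```python
class LamportClock:
    def __init__(self):
        self.time = 0

    def tick(self):
        """Advance on a local event."""
        self.time += 1
        return self.time

    def send(self):
        """Stamp an outgoing message with the current logical time."""
        return self.tick()

    def receive(self, msg_time):
        """Merge on an incoming message: jump past the sender's clock."""
        self.time = max(self.time, msg_time) + 1
        return self.time

a, b = LamportClock(), LamportClock()
t = a.send()          # a's clock is now 1
print(b.receive(t))   # 2: the receive is ordered after the send
```

Lamport clocks give a total order consistent with causality but cannot tell concurrency apart from ordering; that is what the per-node counters of vector clocks add.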