Hash + TTL: The 80/20 Dedup Pattern

You don’t always need global sequence numbers, vector clocks, or heavyweight comparisons to stop duplicates.
Most of the time, hash + TTL gets you 80% of the value with 5% of the effort.

The idea:
Compute a compact identity (a hash) for the thing you care about (payload, idempotency key, canonicalized request, etc.).
Store that hash in a fast cache with a TTL. If you ever see the same hash within the TTL, drop it.

When to Use It

  • Webhook receivers suppressing retries from providers.
  • Log/metric ingestion avoiding double counts from replays or at‑least‑once pipelines.
  • Distributed job queues preventing duplicate enqueues or executions.
  • Email/SMS notification systems suppressing accidental re-sends.
  • Idempotent HTTP endpoints (e.g., POST /payments with an Idempotency-Key).

How It Works

  1. Build a stable canonical string for the thing, for example:

     METHOD
     PATH
     SHA256(body)
     tenant_id
     logical_key

  2. Compute a content hash of that string (e.g., SHA-256).
  3. SET key=<hash> value=1 EX <ttl> NX (Redis)
     • If the set succeeds → first time seeing it; proceed and record.
     • If it fails → the hash is already present; drop as a duplicate.
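As a concrete sketch of step 1, here's one way to build that canonical string in Python (the function name and fields are illustrative, mirroring the template above):

import hashlib

def canonical_request(method: str, path: str, body: bytes,
                      tenant_id: str, logical_key: str) -> str:
    # newline-joined fields, matching the template above; hashing the
    # body keeps the string compact even for large payloads
    return "\n".join([
        method.upper(),
        path,
        hashlib.sha256(body).hexdigest(),
        tenant_id,
        logical_key,
    ])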

That’s it.

Choosing What to Hash

  • Canonicalization matters: normalize header case, trim whitespace, sort JSON keys, and drop or normalize any timestamps you intend to ignore.
  • Include partitioning context if dedup must be scoped per user, tenant, or stream.
  • Exclude fields you expect to vary across retries (e.g., retry counters, upstream delivery IDs) if they don't change semantics; see the sketch below.
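A minimal JSON canonicalization sketch along those lines (the ignored field names are hypothetical):

import json

def canonical_json(payload: dict, ignore=("retry_count", "delivery_id")) -> str:
    # drop top-level fields that vary across retries but don't change semantics
    cleaned = {k: v for k, v in payload.items() if k not in ignore}
    # sorted keys + fixed separators give a byte-stable representation
    return json.dumps(cleaned, sort_keys=True, separators=(",", ":"))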

TTL Tuning

  • Set the TTL to the worst‑case replay window you care about.
  • Webhooks: 5–15 minutes is common.
  • Ingestion: align with batch/commit/retry windows.
  • Longer TTL → fewer duplicates slip through, but more keys held in memory (rough estimate below).
  • If you can’t bound it, this pattern alone may not be enough (consider sequence numbers or durable idempotency keys).
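Back-of-envelope for the memory trade-off: at 1M events per minute with a 10-minute TTL you hold roughly 10M live keys; assuming ~100 bytes per key including Redis overhead, that's on the order of 1 GB.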

Storage Options

  • Redis: SET ... NX EX is the usual choice.
  • Redis Cluster: works fine; key hashing spreads load.
  • Memcached: add(key, ttl) gives you the same set-if-absent predicate.
  • RDBMS: unique index on (hash, bucket) where bucket = floor(now/TTL); rotate/clean up old buckets (sketched below).
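For the RDBMS option, a minimal sketch using SQLite (schema and names are illustrative). One caveat: a duplicate arriving just across a bucket boundary lands in a fresh bucket, so the lookup below also checks the previous bucket.

import hashlib
import sqlite3
import time

conn = sqlite3.connect("dedup.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS seen (hash TEXT, bucket INTEGER, UNIQUE(hash, bucket))"
)

def first_sighting(key: str, ttl: int = 600) -> bool:
    h = hashlib.sha256(key.encode()).hexdigest()
    bucket = int(time.time()) // ttl
    # check the current and previous bucket so boundary-straddling duplicates are caught
    if conn.execute(
        "SELECT 1 FROM seen WHERE hash = ? AND bucket >= ?", (h, bucket - 1)
    ).fetchone():
        return False
    # INSERT OR IGNORE handles the race where a concurrent writer got here first
    cur = conn.execute("INSERT OR IGNORE INTO seen VALUES (?, ?)", (h, bucket))
    conn.commit()
    return cur.rowcount == 1

# cleanup, run periodically: DELETE FROM seen WHERE bucket < current_bucket - 1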

Collisions & Hashing

  • Use SHA‑256 (or xxhash64 if you accept a tiny collision risk for speed).
  • If truncating, keep >= 64 bits unless you can tolerate rare false positives (dropping legit events occasionally).
  • If collisions are truly unacceptable, store a second, independent fingerprint alongside the hash (e.g., a short digest of the raw payload) and require both to match.
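For reference, truncating a SHA-256 digest to 64 bits looks like this (a sketch; size the truncation to your volume and your tolerance for the occasional false positive):

import hashlib

def short_hash(key: str, bits: int = 64) -> str:
    # keep the first bits/8 bytes of the digest; at 64 bits, collision odds
    # stay tiny even with tens of millions of keys live at once
    return hashlib.sha256(key.encode()).digest()[: bits // 8].hex()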

Examples

Redis (idempotent webhooks)

# pseudo-steps
H = sha256(canonical_string)
# Only proceed if this returns OK (key was newly created)
SET webhook:seen:{H} 1 EX 600 NX

Python (requests dedup)

import hashlib
import redis

r = redis.Redis()

def seen_recently(key: str, ttl: int = 600) -> bool:
    h = hashlib.sha256(key.encode()).hexdigest()
    # SET NX returns None when the key already exists, so: True = duplicate
    return not r.set(f"seen:{h}", "1", ex=ttl, nx=True)

# usage inside a handler; key must be canonicalized upstream
def handle(canonical_key: str):
    if seen_recently(canonical_key):
        return 200  # duplicate: acknowledge and drop silently
    process()

Go (jobs)

// assumes r is a go-redis client and ctx is a context.Context
h := sha256.Sum256([]byte(canonicalKey))
k := "seen:" + hex.EncodeToString(h[:])
// SetNX creates the key only if it does not already exist
ok, _ := r.SetNX(ctx, k, 1, 10*time.Minute).Result()
if !ok {
    // duplicate: another worker already claimed this job
    return
}
// do work

Pattern Variants

  • Time-bucketed unique index: UNIQUE(hash, bucket) with bucket = floor(ts / TTL).
  • Bloom filter with TTL: lower memory, probabilistic false positives, good for massive streams.
  • Two-phase: cache‑gate first, then write a durable idempotency record once the operation succeeds (see the sketch under Gotchas).

Compare to Other Approaches

  • HMAC + timestamp + nonce: proves authenticity + freshness (prevents replay), but you still need a nonce store; hash+TTL is about dedup regardless of auth. Use both when relevant.
  • Sequence numbers / monotonic ids: stronger guarantees, but requires per‑producer coordination and durable tracking.
  • Full object comparison: expensive; hash+TTL avoids payload reads/DB hits in the hot path.

Gotchas

  • No clock synchronization is required (the cache maintains the TTL), but canonicalization must be consistent across producers, or you'll miss duplicates.

  • Idempotency vs. de‑dup: this prevents reprocessing the same logical input within a window; it doesn’t guarantee that your side effects are idempotent. Make writes idempotent too.

  • Partial success: if you set the key before doing work and then fail, you may suppress a legitimate retry. Options:

    • Set the key after success (but then you may do work twice).
    • Set seen:{hash} with a short TTL, and on success, extend the TTL or write a durable idempotency record (sketched below).
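A sketch of that second option with redis-py (the 30-second pending TTL and key names are illustrative):

import hashlib
import redis

r = redis.Redis()

def process_once(canonical_key: str, work) -> None:
    h = hashlib.sha256(canonical_key.encode()).hexdigest()
    k = f"seen:{h}"
    # short "pending" TTL: if we crash mid-work, a retry gets through in 30s
    if not r.set(k, "pending", ex=30, nx=True):
        return  # duplicate within the window; drop
    work()
    # success: extend to the full dedup window
    # (or write a durable idempotency record instead)
    r.set(k, "done", ex=600)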

TL;DR

Compute a stable hash for what “counts as the same,” stick it in a fast store with a TTL, and drop repeats. It’s boring. It works. It scales.