Hash + TTL: The 80/20 Dedup Pattern
You don’t always need global sequence numbers, vector clocks, or heavyweight comparisons to stop duplicates.
Most of the time, hash + TTL gets you 80% of the value with 5% of the effort.
The idea:
- Compute a compact identity (a hash) for the thing you care about (payload, idempotency key, canonicalized request, etc.).
- Store that hash in a fast cache with a TTL. If you see the same hash again within the TTL, drop it.
When to Use It
- Webhook receivers suppressing retries from providers.
- Log/metric ingestion avoiding double counts from replays or at‑least‑once pipelines.
- Distributed job queues preventing duplicate enqueues or executions.
- Email/SMS notification systems suppressing accidental re-sends.
- Idempotent HTTP endpoints (e.g., POST /payments with an Idempotency-Key).
How It Works
- Build a stable canonical string for the thing, for example:

```
METHOD
PATH
SHA256(body)
tenant_id
logical_key
```

- Compute a content hash (e.g., SHA-256).
- Do an atomic set-if-absent with a TTL. In Redis:

```
SET key=<hash> value=1 EX <ttl> NX
```

- If it succeeds → first time seeing it; proceed and record.
- If not → the hash is already present; drop it as a duplicate.
That’s it.
Choosing What to Hash
- Canonicalization matters: normalize header case, trim whitespace, sort JSON keys, normalize timestamps you intend to ignore (see the sketch after this list).
- Include partitioning context if duplicates must be unique per user/tenant/stream.
- Exclude fields you expect to vary (e.g., retry counters, delivery ids from upstream) if they don’t change semantics.
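Here's a minimal canonicalization sketch in Python. The field layout follows the example in How It Works; the JSON normalization (sorted keys, compact separators) is one reasonable choice, not a prescription:

```python
import hashlib
import json

def canonical_key(method: str, path: str, body: bytes,
                  tenant_id: str, logical_key: str) -> str:
    # Normalize the JSON body so key order and whitespace don't change
    # the hash; non-JSON bodies are hashed as-is.
    try:
        body_norm = json.dumps(json.loads(body), sort_keys=True,
                               separators=(",", ":")).encode()
    except ValueError:
        body_norm = body
    return "\n".join([
        method.upper(),                          # normalize case
        path.strip(),                            # trim whitespace
        hashlib.sha256(body_norm).hexdigest(),
        tenant_id,                               # partitioning context
        logical_key,                             # retry counters etc. excluded
    ])
```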
TTL Tuning
- Set the TTL to the worst‑case replay window you care about:
  - Webhooks: 5–15 minutes is common.
  - Ingestion: align with batch/commit/retry windows.
- Longer TTL → fewer dups but more memory.
- If you can’t bound it, this pattern alone may not be enough (consider sequence numbers or durable idempotency keys).
Storage Options
- Redis: `SET ... NX EX` is the usual.
- Redis Cluster: works fine; key hashing spreads load.
- Memcached: `add(key, ttl)` gives you the same predicate.
- RDBMS: unique index on `(hash, bucket)` where `bucket = floor(now / TTL)`. Rotate/clean up old buckets (see the sketch below).
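If Redis isn't in the picture, the RDBMS variant is easy to sketch. A minimal version using sqlite3 (the table name and 600-second TTL are illustrative); note that an item arriving just before a bucket boundary gets a shorter effective window, so check the previous bucket too if the window must be strict:

```python
import hashlib
import sqlite3
import time

TTL = 600  # seconds

conn = sqlite3.connect("dedup.db")
conn.execute("""CREATE TABLE IF NOT EXISTS seen (
    hash   TEXT    NOT NULL,
    bucket INTEGER NOT NULL,
    UNIQUE (hash, bucket)
)""")

def seen_recently(canonical: str) -> bool:
    h = hashlib.sha256(canonical.encode()).hexdigest()
    bucket = int(time.time() // TTL)  # bucket = floor(now / TTL)
    try:
        with conn:  # commit on success, roll back on error
            conn.execute("INSERT INTO seen (hash, bucket) VALUES (?, ?)",
                         (h, bucket))
        return False  # insert won the unique index: first sighting
    except sqlite3.IntegrityError:
        return True   # (hash, bucket) already present: duplicate

def cleanup() -> None:
    # rotate: drop everything older than the previous bucket
    cutoff = int(time.time() // TTL) - 1
    with conn:
        conn.execute("DELETE FROM seen WHERE bucket < ?", (cutoff,))
```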
Collisions & Hashing
- Use SHA‑256 (or xxhash64 if you accept a tiny collision risk for speed).
- If truncating, keep >= 64 bits unless you can tolerate rare false positives (dropping legit events occasionally).
- If collisions are unacceptable, store `(hash, short_fingerprint)` or `(hash, small payload digest)` (see the sketch below).
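One way to sketch the fingerprint idea (the function and key names here are illustrative): key the cache on a 64-bit truncation, store a few extra bytes of the same digest as the value, and compare on a hit so a truncation collision doesn't silently drop a distinct event:

```python
import hashlib

import redis

r = redis.Redis()

def seen_recently_64bit(canonical: str, ttl: int = 600) -> bool:
    d = hashlib.sha256(canonical.encode()).digest()
    key = "seen:" + d[:8].hex()  # 64-bit truncated cache key
    fp = d[8:16]                 # short fingerprint stored as the value
    if r.set(key, fp, ex=ttl, nx=True):
        return False  # first sighting
    # Key exists: a different fingerprint means a truncation collision,
    # so process the (distinct) event rather than dropping it.
    return r.get(key) == fp
```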
Examples
Redis (idempotent webhooks)
```
# pseudo-steps
H = sha256(canonical_string)

# Only proceed if this returns OK (created new)
SET webhook:seen:{H} 1 EX 600 NX
```
Python (requests dedup)
```python
import hashlib

import redis

r = redis.Redis()

def seen_recently(key: str, ttl: int = 600) -> bool:
    """Returns True if duplicate, False if first-time."""
    h = hashlib.sha256(key.encode()).hexdigest()
    return not r.set(f"seen:{h}", "1", ex=ttl, nx=True)

# usage inside a request handler; key must be canonicalized upstream
def handle(canonical_key):
    if seen_recently(canonical_key):
        return 200  # drop silently
    process()
```
Go (jobs)
```go
// assumes r is a *redis.Client (go-redis) and ctx a context.Context
h := sha256.Sum256([]byte(canonicalKey))
k := "seen:" + hex.EncodeToString(h[:])

// SetNX is atomic: only the first worker wins the key for this window
ok, _ := r.SetNX(ctx, k, 1, 10*time.Minute).Result()
if !ok {
    // duplicate
    return
}
// do work
```
Pattern Variants
- Time-bucketed unique index: `UNIQUE(hash, bucket)` with `bucket = floor(ts / TTL)`.
- Bloom filter with TTL: lower memory, probabilistic false positives, good for massive streams (see the sketch below).
- Two-phase: cache‑gate first, then write a durable idempotency record if the operation succeeds.
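Bloom filters don't support per-item TTLs directly; a common workaround, sketched here as a single-process, non-thread-safe toy, is to keep two generations and rotate every TTL, so an item is remembered for between TTL and 2×TTL:

```python
import hashlib
import time

class RotatingBloom:
    def __init__(self, bits: int = 1 << 20, hashes: int = 4, ttl: float = 600.0):
        self.bits, self.hashes, self.ttl = bits, hashes, ttl
        self.cur = bytearray(bits // 8)   # current generation
        self.prev = bytearray(bits // 8)  # previous generation
        self.rotated_at = time.monotonic()

    def _positions(self, key: str):
        # derive k bit positions from slices of a single SHA-256 digest
        d = hashlib.sha256(key.encode()).digest()
        for i in range(self.hashes):
            yield int.from_bytes(d[4 * i:4 * i + 4], "big") % self.bits

    def _maybe_rotate(self):
        if time.monotonic() - self.rotated_at >= self.ttl:
            self.prev, self.cur = self.cur, bytearray(len(self.cur))
            self.rotated_at = time.monotonic()

    def seen(self, key: str) -> bool:
        # True if key was (probably) seen within ~TTL; records it either way
        self._maybe_rotate()
        hit = True
        for p in self._positions(key):
            byte, bit = divmod(p, 8)
            if not ((self.cur[byte] | self.prev[byte]) >> bit) & 1:
                hit = False
            self.cur[byte] |= 1 << bit
        return hit
```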
Compare to Other Approaches
- HMAC + timestamp + nonce: proves authenticity + freshness (prevents replay), but you still need a nonce store; hash+TTL is about dedup regardless of auth. Use both when relevant.
- Sequence numbers / monotonic ids: stronger guarantees, but requires per‑producer coordination and durable tracking.
- Full object comparison: expensive; hash+TTL avoids payload reads/DB hits in the hot path.
Gotchas
- Clock sync isn't required (the TTL is maintained by the cache), but ensure consistent canonicalization or you'll miss duplicates.
- Idempotency vs. de‑dup: this prevents reprocessing the same logical input within a window; it doesn't guarantee that your side effects are idempotent. Make writes idempotent too.
- Partial success: if you set the key before doing work and then fail, you may suppress a legitimate retry. Options:
  - Set the key after success (but then you may do work twice).
  - Set `seen:{hash}` with a short TTL, and on success, extend the TTL or write a durable idempotency record (see the sketch below).
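A minimal sketch of that second option in Python (the 30-second claim and 600-second window are illustrative numbers):

```python
import hashlib

import redis

r = redis.Redis()

def process_once(canonical: str, work) -> None:
    h = hashlib.sha256(canonical.encode()).hexdigest()
    key = f"seen:{h}"
    # Phase 1: short claim, so a crash mid-work frees the key quickly
    if not r.set(key, "1", ex=30, nx=True):
        return  # duplicate, or another worker is mid-flight
    try:
        work()
    except Exception:
        r.delete(key)  # failed: let a legitimate retry through
        raise
    # Phase 2: success, so extend to the full replay window
    r.expire(key, 600)
```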
TL;DR
Compute a stable hash for what “counts as the same,” stick it in a fast store with a TTL, and drop repeats. It’s boring. It works. It scales.