Hash + TTL: The 80/20 Dedup Pattern

You don’t always need global sequence numbers, vector clocks, or heavyweight comparisons to stop duplicates.
Most of the time, hash + TTL gets you 80% of the value with 5% of the effort.

The idea:
Compute a compact identity (a hash) for the thing you care about (payload, idempotency key, canonicalized request, etc.).
Store that hash in a fast cache with a TTL. If you ever see the same hash within the TTL, drop it.

When to Use It

  • Webhook receivers suppressing retries from providers.
  • Log/metric ingestion avoiding double counts from replays or at‑least‑once pipelines.
  • Distributed job queues preventing duplicate enqueues or executions.
  • Email/SMS notification systems suppressing accidental re-sends.
  • Idempotent HTTP endpoints (e.g., POST /payments with an Idempotency-Key).

How It Works

  1. Build a stable canonical string for the thing, for example:

     METHOD
     PATH
     SHA256(body)
     tenant_id
     logical_key

  2. Compute a content hash of that string (e.g., SHA-256).
  3. SET key=<hash> value=1 EX <ttl> NX (Redis)
     • If the set succeeds → first time seeing it; proceed and record.
     • If it fails → the hash is already present; drop as a duplicate.
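As a concrete sketch of step 1, here's one way to build that canonical string in Python (the function name and fields are illustrative, mirroring the template above):

import hashlib

def canonical_request(method: str, path: str, body: bytes,
                      tenant_id: str, logical_key: str) -> str:
    # newline-joined fields, matching the template above; hashing the
    # body keeps the string compact even for large payloads
    return "\n".join([
        method.upper(),
        path,
        hashlib.sha256(body).hexdigest(),
        tenant_id,
        logical_key,
    ])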

That’s it.

Choosing What to Hash

  • Canonicalization matters: normalize header case, trim whitespace, sort JSON keys, and drop or normalize any timestamps you intend to ignore.
  • Include partitioning context if dedup must be scoped per user, tenant, or stream.
  • Exclude fields you expect to vary across retries (e.g., retry counters, upstream delivery IDs) if they don't change semantics; see the sketch below.
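A minimal JSON canonicalization sketch along those lines (the ignored field names are hypothetical):

import json

def canonical_json(payload: dict, ignore=("retry_count", "delivery_id")) -> str:
    # drop top-level fields that vary across retries but don't change semantics
    cleaned = {k: v for k, v in payload.items() if k not in ignore}
    # sorted keys + fixed separators give a byte-stable representation
    return json.dumps(cleaned, sort_keys=True, separators=(",", ":"))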

TTL Tuning

  • Set the TTL to the worst‑case replay window you care about.
  • Webhooks: 5–15 minutes is common.
  • Ingestion: align with batch/commit/retry windows.
  • Longer TTL → fewer duplicates slip through, but more keys held in memory (rough estimate below).
  • If you can’t bound it, this pattern alone may not be enough (consider sequence numbers or durable idempotency keys).
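Back-of-envelope for the memory trade-off: at 1M events per minute with a 10-minute TTL you hold roughly 10M live keys; assuming ~100 bytes per key including Redis overhead, that's on the order of 1 GB.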

Storage Options

  • Redis: SET ... NX EX is the usual choice.
  • Redis Cluster: works fine; key hashing spreads load.
  • Memcached: add(key, ttl) gives you the same set-if-absent predicate.
  • RDBMS: unique index on (hash, bucket) where bucket = floor(now/TTL); rotate/clean up old buckets (sketched below).
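For the RDBMS option, a minimal sketch using SQLite (schema and names are illustrative). One caveat: a duplicate arriving just across a bucket boundary lands in a fresh bucket, so the lookup below also checks the previous bucket.

import hashlib
import sqlite3
import time

conn = sqlite3.connect("dedup.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS seen (hash TEXT, bucket INTEGER, UNIQUE(hash, bucket))"
)

def first_sighting(key: str, ttl: int = 600) -> bool:
    h = hashlib.sha256(key.encode()).hexdigest()
    bucket = int(time.time()) // ttl
    # check the current and previous bucket so boundary-straddling duplicates are caught
    if conn.execute(
        "SELECT 1 FROM seen WHERE hash = ? AND bucket >= ?", (h, bucket - 1)
    ).fetchone():
        return False
    # INSERT OR IGNORE handles the race where a concurrent writer got here first
    cur = conn.execute("INSERT OR IGNORE INTO seen VALUES (?, ?)", (h, bucket))
    conn.commit()
    return cur.rowcount == 1

# cleanup, run periodically: DELETE FROM seen WHERE bucket < current_bucket - 1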

Collisions & Hashing

  • Use SHA‑256 (or xxhash64 if you accept a tiny collision risk for speed).
  • If truncating, keep >= 64 bits unless you can tolerate rare false positives (dropping legit events occasionally).
  • If collisions are truly unacceptable, store a second, independent fingerprint alongside the hash (e.g., a short digest of the raw payload) and require both to match.
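For reference, truncating a SHA-256 digest to 64 bits looks like this (a sketch; size the truncation to your volume and your tolerance for the occasional false positive):

import hashlib

def short_hash(key: str, bits: int = 64) -> str:
    # keep the first bits/8 bytes of the digest; at 64 bits, collision odds
    # stay tiny even with tens of millions of keys live at once
    return hashlib.sha256(key.encode()).digest()[: bits // 8].hex()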

Examples

Redis (idempotent webhooks)

# pseudo-steps
H = sha256(canonical_string)
# Only proceed if this returns OK (key was newly created)
SET webhook:seen:{H} 1 EX 600 NX

Python (requests dedup)

import hashlib
import redis

r = redis.Redis()

def seen_recently(key: str, ttl: int = 600) -> bool:
    h = hashlib.sha256(key.encode()).hexdigest()
    # SET NX returns None when the key already exists, so: True = duplicate
    return not r.set(f"seen:{h}", "1", ex=ttl, nx=True)

# usage inside a handler; key must be canonicalized upstream
def handle(canonical_key: str):
    if seen_recently(canonical_key):
        return 200  # duplicate: acknowledge and drop silently
    process()

Go (jobs)

// assumes r is a go-redis client and ctx is a context.Context
h := sha256.Sum256([]byte(canonicalKey))
k := "seen:" + hex.EncodeToString(h[:])
// SetNX creates the key only if it does not already exist
ok, _ := r.SetNX(ctx, k, 1, 10*time.Minute).Result()
if !ok {
    // duplicate: another worker already claimed this job
    return
}
// do work

Pattern Variants

  • Time-bucketed unique index: UNIQUE(hash, bucket) with bucket = floor(ts / TTL).
  • Bloom filter with TTL: lower memory, probabilistic false positives, good for massive streams.
  • Two-phase: cache‑gate first, then write a durable idempotency record once the operation succeeds (see the sketch under Gotchas).

Compare to Other Approaches

  • HMAC + timestamp + nonce: proves authenticity + freshness (prevents replay), but you still need a nonce store; hash+TTL is about dedup regardless of auth. Use both when relevant.
  • Sequence numbers / monotonic ids: stronger guarantees, but requires per‑producer coordination and durable tracking.
  • Full object comparison: expensive; hash+TTL avoids payload reads/DB hits in the hot path.

Gotchas

  • No clock synchronization is required (the cache maintains the TTL), but canonicalization must be consistent across producers, or you'll miss duplicates.

  • Idempotency vs. de‑dup: this prevents reprocessing the same logical input within a window; it doesn’t guarantee that your side effects are idempotent. Make writes idempotent too.

  • Partial success: if you set the key before doing work and then fail, you may suppress a legitimate retry. Options:

    • Set the key after success (but then you may do work twice).
    • Set seen:{hash} with a short TTL, and on success, extend the TTL or write a durable idempotency record (sketched below).
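A sketch of that second option with redis-py (the 30-second pending TTL and key names are illustrative):

import hashlib
import redis

r = redis.Redis()

def process_once(canonical_key: str, work) -> None:
    h = hashlib.sha256(canonical_key.encode()).hexdigest()
    k = f"seen:{h}"
    # short "pending" TTL: if we crash mid-work, a retry gets through in 30s
    if not r.set(k, "pending", ex=30, nx=True):
        return  # duplicate within the window; drop
    work()
    # success: extend to the full dedup window
    # (or write a durable idempotency record instead)
    r.set(k, "done", ex=600)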

TL;DR

Compute a stable hash for what “counts as the same,” stick it in a fast store with a TTL, and drop repeats. It’s boring. It works. It scales.