Caching, Idempotency & Retries at Scale

The three patterns that keep high-traffic systems correct under load — and the subtle bugs each one introduces.

Article 4 of 7 · 12 min · Advanced


Key Takeaway

Caching, retries, and idempotency are not independent reliability tools — they compound each other's bugs. Stale cache data makes retry-induced duplicate charges possible. Immediate retries turn a struggling service into a dead one. The fix is not any single pattern in isolation; it's understanding the chain and getting all three right simultaneously.

Why These Three Travel Together

Under sustained load, the failure chain is predictable. Latency creeps up, so you cache aggressively to reduce pressure on the database. Services start timing out intermittently, so clients retry. Some of those retries hit an operation the server already completed — so you need idempotency keys to prevent duplication. Each step feels reasonable. The problem is what happens when one of them breaks.

A stale cache that passes a balance check creates the window. A retry without an idempotency key exploits it. An immediate retry with no backoff then collapses the downstream service that would have caught it. You don't have three separate bugs. You have one compound failure with three contributing causes, and fixing any single layer without the others leaves the system vulnerable.

The right mental model: these three patterns form a triangle. They exist because distributed systems have partial failures, and each pattern closes a different gap. Remove one and the other two stop working correctly.

Caching — Not the Basics, the Bugs

You know what a cache is. What you may not have thought through carefully enough is the exact moment your system has two sources of truth: the cache and the database. That moment is unavoidable. The question is how long it lasts and what breaks during the window.

The staleness budget problem. Not all data has the same tolerance for staleness. A user's display name — 60-second cache is fine. A user's account balance before a payment authorization — stale is a production incident. A user's permission record after an admin revokes access — stale means a terminated employee still has access for however long your TTL runs. The failure mode differs per data type, but the root cause is the same: you set a single TTL policy without thinking about what breaks during the window.

The fix is explicit staleness budgets, not intuition. Classify your data: What is the maximum acceptable staleness for this record type? User preferences: minutes. Session state: seconds. Post-write reads of financial records: zero — skip the cache or bust it synchronously on write. This is a product and architecture decision, not a cache configuration decision.
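A minimal sketch of staleness budgets expressed in code. It assumes an in-process dict as the cache and a caller-supplied loader for the database. The record types and numbers in `STALENESS_BUDGET_S` are hypothetical placeholders. The point is that the budget is looked up per record type, and a zero budget means the read never touches the cache.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Optional

# Hypothetical per-type staleness budgets in seconds (illustrative values only).
STALENESS_BUDGET_S: Dict[str, float] = {
    "user_preferences": 300.0,   # minutes of staleness are acceptable
    "session_state": 5.0,        # seconds
    "balance_for_payment": 0.0,  # zero: always read the source of truth
}


@dataclass
class CacheEntry:
    value: object
    written_at: float  # clock reading when the value was cached


def read_with_budget(
    record_type: str,
    key: str,
    cache: Dict[str, CacheEntry],
    load_from_db: Callable[[str], object],
    now: Optional[float] = None,
) -> object:
    """Serve from cache only while the entry is younger than its type's budget."""
    budget = STALENESS_BUDGET_S[record_type]
    now = time.monotonic() if now is None else now
    entry = cache.get(key)
    if budget > 0 and entry is not None and now - entry.written_at <= budget:
        return entry.value
    value = load_from_db(key)  # miss, budget exceeded, or zero-budget type
    if budget > 0:
        cache[key] = CacheEntry(value, now)
    return value
```

A preference read ten seconds after a change may still return the old value, which is within its budget. A pre-payment balance read always goes to the loader and is never cached.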

Cache stampedes. You set a TTL on a hot key. One thousand concurrent users are reading that key. The TTL expires. All one thousand requests get a cache miss simultaneously, and all one thousand hit the database at the same moment. The database buckles under exactly the load the cache was added to absorb.

Three practical mitigations:

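Jittered TTL: instead of giving every key the same TTL, add random jitter (e.g., TTL = base ± 20%) so that keys populated at the same moment, such as during a cache warm-up, do not all expire at the same moment. Jitter prevents synchronized expiry across many keys; it does nothing for a single hot key, which the next two mitigations address.
Probabilistic Early Expiration (PER): before the key expires, each request recomputes it early with a small probability that rises as expiry approaches, so one request usually refreshes the value before anyone sees a miss. The math is simple; the implementation takes an afternoon.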
Cache leader lock: on a miss, one request acquires a distributed lock, recomputes, writes the result back, and releases. Other requests wait and read the freshly computed value. The trade-off is latency for the waiting requests during the lock window — acceptable if your recompute is fast, not acceptable if it takes 2 seconds.
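The three mitigations can be sketched in Python under a few assumptions. All function and class names here are hypothetical. `should_recompute_early` follows the XFetch-style rule (recompute when `now - cost * beta * log(u) >= expiry` for uniform `u`). `SingleFlightCache` is an in-process stand-in for the leader lock: a real deployment would use a lock held in the shared cache with its own expiry, and would also handle TTLs, which this sketch omits.

```python
import math
import random
import threading
from typing import Callable, Dict


def jittered_ttl(base_s: float, spread: float = 0.2, rng=random) -> float:
    """TTL = base +/- spread (default 20%), so keys written together expire apart."""
    return base_s * (1.0 + rng.uniform(-spread, spread))


def should_recompute_early(now: float, expiry: float, recompute_cost_s: float,
                           beta: float = 1.0, rng=random) -> bool:
    """Probabilistic early expiration in the style of the XFetch rule.

    -log(u) is exponentially distributed, so an early refresh becomes more likely
    as `now` approaches `expiry` and as recomputation gets more expensive.
    """
    u = 1.0 - rng.random()  # in (0, 1], avoids log(0)
    return now - recompute_cost_s * beta * math.log(u) >= expiry


class SingleFlightCache:
    """In-process stand-in for the cache leader lock (no TTL handling)."""

    def __init__(self) -> None:
        self._data: Dict[str, object] = {}
        self._locks = {}  # key -> threading.Lock
        self._guard = threading.Lock()

    def get(self, key: str, recompute: Callable[[], object]) -> object:
        if key in self._data:  # fast path: cache hit
            return self._data[key]
        with self._guard:  # create exactly one lock object per key
            lock = self._locks.setdefault(key, threading.Lock())
        with lock:  # the first caller recomputes; the rest wait here
            if key not in self._data:  # re-check: the leader may have filled it
                self._data[key] = recompute()
            return self._data[key]
```

Passing a seeded `random.Random` as `rng` makes the jitter and the early-expiry decision reproducible in tests.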

Cache invalidation: the consistency window. On a write, you have two choices: write-through (update the cache synchronously with the DB write) or cache-bust (delete the cache entry and let it recompute on next read). Write-through sounds clean but is brittle when the write succeeds and the cache update fails. Cache-bust is safer — a miss is expensive but correct. The permission revocation scenario is the clearest case: when access is revoked, you must bust the cache immediately, not wait for TTL expiry. This is a security boundary, not a performance tuning decision.
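A minimal sketch of cache-bust for the permission case, assuming plain dicts stand in for the database and the cache. `PermissionStore`, `has_access`, and `set_access` are hypothetical names. The write commits to the source of truth first and then deletes the cache entry before returning, so the next read is a miss that recomputes from the database.

```python
from typing import Dict, Tuple

Key = Tuple[str, str]  # (user, resource)


class PermissionStore:
    """Toy permission check: a dict 'database' fronted by a dict 'cache' (both stand-ins)."""

    def __init__(self) -> None:
        self.db: Dict[Key, bool] = {}
        self.cache: Dict[Key, bool] = {}

    def has_access(self, user: str, resource: str) -> bool:
        key = (user, resource)
        if key not in self.cache:  # miss: recompute from the source of truth
            self.cache[key] = self.db.get(key, False)
        return self.cache[key]

    def set_access(self, user: str, resource: str, allowed: bool) -> None:
        key = (user, resource)
        self.db[key] = allowed     # 1. commit to the source of truth
        self.cache.pop(key, None)  # 2. bust synchronously, before returning to the caller
```

The sketch ignores one real-world race: a reader that loaded the old value just before the write can still put it back after the delete. Keeping permission TTLs short bounds how long such an entry can survive.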

Idempotency Keys: The Exactly-Once Illusion

You cannot achieve exactly-once delivery in a distributed system. The network will partition. The client will time out. The server will process the request and then fail before it can confirm. The client doesn't know whether to retry.

What you can achieve is at-least-once delivery with idempotent consumers. This is not a consolation prize. It is the correct model.

The mechanics for a payment API: the client generates a unique key (UUID v4, not a timestamp) before the first attempt and sends it in a request header. The server checks a deduplication table: has this key been seen? If yes, it returns the stored result without reprocessing. If no, it processes the operation, persists the result alongside the key, and returns. Every subsequent retry with the same key returns the stored result immediately. The check and the claim must be one atomic step (for example, an insert against a unique constraint); otherwise two concurrent retries can both see "not seen" and both process the payment.
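A minimal server-side sketch using Python's built-in `sqlite3` module with an in-memory database. The table names, columns, and `handle_payment` are hypothetical. The `PRIMARY KEY` on the idempotency key turns "has this key been seen?" and the claim into a single `INSERT OR IGNORE`. The claim, the charge, and the stored response commit in one transaction, so a failed attempt leaves no key behind and can safely be retried.

```python
import json
import sqlite3
import uuid


def new_idempotency_key() -> str:
    """Client side: a random UUID v4, generated once before the first attempt."""
    return str(uuid.uuid4())


def make_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    # Hypothetical schema: the PRIMARY KEY is what makes the claim atomic.
    conn.execute("CREATE TABLE idempotency_keys (key TEXT PRIMARY KEY, response TEXT)")
    conn.execute("CREATE TABLE charges (id INTEGER PRIMARY KEY, key TEXT, amount INTEGER)")
    return conn


def handle_payment(conn: sqlite3.Connection, idempotency_key: str, amount: int) -> dict:
    with conn:  # one transaction: claim the key, charge, store the response
        claimed = conn.execute(
            "INSERT OR IGNORE INTO idempotency_keys (key) VALUES (?)",
            (idempotency_key,),
        ).rowcount == 1
        if not claimed:
            # Seen before: replay the stored response instead of charging again.
            # The claim and the response commit together, so a visible key has a response.
            (stored,) = conn.execute(
                "SELECT response FROM idempotency_keys WHERE key = ?", (idempotency_key,)
            ).fetchone()
            return json.loads(stored)
        cur = conn.execute(
            "INSERT INTO charges (key, amount) VALUES (?, ?)", (idempotency_key, amount)
        )
        response = {"status": "charged", "charge_id": cur.lastrowid, "amount": amount}
        conn.execute(
            "UPDATE idempotency_keys SET response = ? WHERE key = ?",
            (json.dumps(response), idempotency_key),
        )
        return response
```

Calling `handle_payment` twice with the same key returns identical responses and leaves exactly one row in `charges`.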

Two failure modes that get ignored:

The dedup window expiry. How long do you store idempotency keys? 24 hours is common. What happens when a client retries 25 hours later with the same key? You've lost the dedup record. The server processes it as a new request. This is not a bug — this is the documented contract. But you have to document it and you have to communicate the retry window to the client. If your clients are mobile apps with aggressive retry policies, make sure your dedup window is longer than their longest retry cycle.
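A back-of-the-envelope check, sketched in Python, for whether a dedup window outlives a client's retry cycle. It assumes a hypothetical client policy of capped exponential backoff where every attempt may run to its full timeout. `worst_case_retry_span` is an illustrative helper. Clients that queue requests offline need that queueing time added on top.

```python
from datetime import timedelta


def worst_case_retry_span(
    attempts: int,
    timeout: timedelta,
    base: timedelta,
    cap: timedelta,
    max_jitter: timedelta,
) -> timedelta:
    """Upper bound on the time from the first attempt to the end of the last one."""
    span = attempts * timeout  # every attempt may run until it times out
    for retry in range(attempts - 1):  # one backoff delay precedes each retry
        span += min(cap, base * 2 ** retry) + max_jitter
    return span
```

With 30-second timeouts, a 1-second base, a 1-hour cap, and up to 1 second of jitter, 20 attempts span at most 29,914 seconds (about 8.3 hours) and fit inside a 24-hour window. 40 attempts span about 28.5 hours and do not.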

Regenerated idempotency keys on the client. The idempotency key only works if the client sends the same key on every retry. This sounds obvious. In practice, clients that generate the key inside the retry loop will generate a new key on each attempt, breaking the dedup entirely. The key must be generated once, before the first attempt, and persisted across retries. This is a client-side discipline problem, and it surfaces in mobile SDKs far more often than in backend-to-backend calls.
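A minimal client-side sketch of the difference. It assumes a hypothetical `send(key) -> bool` transport that returns True on a confirmed response. Backoff is omitted, and so is durable storage of the key (needed if the app can be killed between attempts), to keep the contrast visible.

```python
import uuid
from typing import Callable


def pay_with_retries(send: Callable[[str], bool], max_attempts: int = 3) -> bool:
    key = str(uuid.uuid4())  # generated once, before the first attempt
    for _ in range(max_attempts):
        if send(key):  # every attempt carries the same key
            return True
    return False


def pay_with_retries_broken(send: Callable[[str], bool], max_attempts: int = 3) -> bool:
    for _ in range(max_attempts):
        key = str(uuid.uuid4())  # BUG: a fresh key per attempt defeats server-side dedup
        if send(key):
            return True
    return False
```

If every attempt fails, the correct version sends one key three times, while the broken version sends three unrelated keys that the server must treat as three separate payments.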

Retries — The Bugs

Retrying a failed request is not a reliability improvement by default. It's only a reliability improvement when the retry strategy accounts for why the request failed.

Naive immediate retry. Request fails. Client retries instantly. For a transient network hiccup, this works fine. For an overloaded downstream service, this is exactly the wrong behavior. The downstream service is slow because it has too many requests. Immediate retries add more requests. The service gets slower. More clients time out. More retries. This is a thundering herd, and naive retry logic is the amplifier.

Exponential backoff. The canonical fix. First retry after 100ms, second after 200ms, third after 400ms, up to a cap. The service gets breathing room between waves of retries. But backoff alone is not enough.

Why jitter matters. Without jitter, all clients who started their requests at the same time will retry at the same time — just with a delay. The thundering herd doesn't disappear; it shifts. With jitter (delay = base * 2^attempt + random(0, base)), retry attempts spread across time. The downstream service sees a steady stream instead of a spike. This is a small implementation detail with a significant operational impact.
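The jitter formula above, sketched in Python with delays in seconds. The 100 ms base matches the schedule in the text. The 10-second cap and the function names are assumptions. `full_jitter_delay` is a widely used variant, included for comparison, that picks uniformly between zero and the capped exponential delay. It spreads retries more widely at higher attempt counts, at the cost of sometimes retrying sooner.

```python
import random


def backoff_delay(attempt: int, base: float = 0.1, cap: float = 10.0, rng=random) -> float:
    """delay = base * 2**attempt + random(0, base), capped; attempt 0 is the first retry."""
    return min(cap, base * 2 ** attempt + rng.uniform(0, base))


def full_jitter_delay(attempt: int, base: float = 0.1, cap: float = 10.0, rng=random) -> float:
    """Alternative 'full jitter': uniform over the whole capped exponential interval."""
    return rng.uniform(0, min(cap, base * 2 ** attempt))
```

With the additive formula, every client's first retry lands somewhere between 100 and 200 ms instead of exactly at 100 ms. By attempt 5 (3.2 s of base delay) the jitter is still only 100 ms wide, which is why some teams prefer full jitter.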

Retry storms and circuit breakers. Even with backoff and jitter, retries make a struggling service worse. The circuit breaker is the escape valve.

The state machine has three states:

Closed: requests flow normally. The circuit monitors the failure rate over a rolling window.
Open: the failure threshold was crossed. All requests fail fast without hitting the downstream service, which gets space to recover.
Half-open: after a recovery window, a small number of probe requests are allowed through. If they succeed, the circuit closes. If a probe fails, the circuit returns to open and the recovery window restarts.

The parameters that matter: failure threshold (what percentage of requests in what time window triggers the open state?), recovery window (how long before the circuit tries again?), and probe count (how many successful probes before closing?). These are not defaults you can copy from a library README. They have to be tuned for your specific service's recovery characteristics.
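A compact, single-threaded sketch of the three-state machine with the three parameters exposed. The default values, the count-based rolling window (the last `window` calls rather than a time window), and all names are assumptions for illustration. A production breaker also needs thread safety and a cap on concurrent probes. The injectable `clock` makes the recovery window testable.

```python
import time
from collections import deque
from enum import Enum
from typing import Callable, TypeVar

T = TypeVar("T")


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitOpenError(Exception):
    """Raised instead of calling the downstream service while the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: float = 0.5, window: int = 20, min_calls: int = 10,
                 recovery_s: float = 30.0, probes_to_close: int = 3,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold  # failure ratio that opens the circuit
        self.outcomes = deque(maxlen=window)        # last `window` calls, True = failure
        self.min_calls = min_calls                  # don't judge on too few samples
        self.recovery_s = recovery_s                # how long to stay open
        self.probes_to_close = probes_to_close      # successful probes needed to close
        self.clock = clock
        self.state = State.CLOSED
        self.opened_at = 0.0
        self.probe_successes = 0

    def call(self, fn: Callable[[], T]) -> T:
        if self.state is State.OPEN:
            if self.clock() - self.opened_at < self.recovery_s:
                raise CircuitOpenError("circuit open: failing fast")
            self.state, self.probe_successes = State.HALF_OPEN, 0  # let probes through
        try:
            result = fn()
        except Exception:
            self._record_failure()
            raise
        self._record_success()
        return result

    def _record_failure(self) -> None:
        if self.state is State.HALF_OPEN:
            self._open()  # a failed probe reopens and restarts the recovery window
            return
        self.outcomes.append(True)
        ratio = sum(self.outcomes) / len(self.outcomes)
        if len(self.outcomes) >= self.min_calls and ratio >= self.failure_threshold:
            self._open()

    def _record_success(self) -> None:
        if self.state is State.HALF_OPEN:
            self.probe_successes += 1
            if self.probe_successes >= self.probes_to_close:
                self.state = State.CLOSED
                self.outcomes.clear()  # start the next closed period with a clean window
            return
        self.outcomes.append(False)

    def _open(self) -> None:
        self.state = State.OPEN
        self.opened_at = self.clock()
```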

The Compound Failure, Walked Through

Here is the scenario. A user taps "Pay Now" in a mobile app. The payment service is slow — 95th percentile response time has climbed to 8 seconds due to database write contention.

The mobile client has a 5-second timeout. The request times out. The client retries immediately, with no backoff and no idempotency key, and makes three attempts in total before giving up.

The payment service received all three requests and processed all three. The cache holds a stale balance record: the TTL was set to 30 seconds, and the cached balance has not been refreshed since before the first payment was deducted. The balance check passes on all three attempts. The user is charged three times.

The fix at each layer:

Cache layer: the balance check before a payment must not read from a stale cache. Either bust the balance cache on every successful payment write, or bypass the cache entirely for pre-payment balance reads. This is where your staleness budget analysis matters — account balance before authorization is a zero-tolerance staleness case.

Idempotency layer: the client must generate a single idempotency key before the first attempt and include it on all retries. The server must deduplicate on that key. The second and third requests would have returned the result of the first, with no additional charges.

Retry layer: the client must not retry immediately. Exponential backoff with jitter gives the payment service space. Better: the client should treat a timeout on a payment request as potentially-succeeded, not definitely-failed, and surface an ambiguous state to the user rather than retrying blindly. "Your payment is processing — do not retry" is a UX pattern, not just a technical one.

Fix all three and the scenario becomes: user taps Pay, request times out, client retries with the same idempotency key, server deduplicates, user is charged once, stale balance data is never used in the authorization path.
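The fixed flow described above can be simulated end to end with in-memory stand-ins. This is a sketch under stated assumptions. `PaymentService`, `PaymentTimeout`, and `pay_now` are hypothetical. The service's `balance` attribute stands in for the database, so there is no cache in the authorization path. `slow_responses` simulates responses that arrive after the client's timeout even though the work was done. The client reuses one key, backs off with jitter, and reports an ambiguous `"pending"` state instead of assuming failure when every attempt times out.

```python
import random
import uuid
from typing import Callable, Dict


class PaymentTimeout(Exception):
    """Client-side timeout: the server may or may not have completed the payment."""


class PaymentService:
    """Hypothetical in-memory payment service with server-side deduplication."""

    def __init__(self, balance: int, slow_responses: int = 0) -> None:
        self.balance = balance                # source of truth (stands in for the database)
        self.results: Dict[str, str] = {}     # idempotency key -> stored outcome
        self.slow_responses = slow_responses  # responses that arrive after the client gave up
        self.charges = 0

    def pay(self, key: str, amount: int) -> str:
        if key in self.results:  # retry of a known key: replay, never recharge
            outcome = self.results[key]
        else:
            if self.balance >= amount:  # read from the source of truth, not a cache
                self.balance -= amount
                self.charges += 1
                outcome = "charged"
            else:
                outcome = "declined"
            self.results[key] = outcome
        if self.slow_responses > 0:  # work is done, but the response is too late
            self.slow_responses -= 1
            raise PaymentTimeout()
        return outcome


def pay_now(service: PaymentService, amount: int, max_attempts: int = 3,
            sleep: Callable[[float], None] = lambda s: None, rng=random) -> str:
    key = str(uuid.uuid4())  # one key, generated before the first attempt
    for attempt in range(max_attempts):
        try:
            return service.pay(key, amount)
        except PaymentTimeout:
            if attempt < max_attempts - 1:
                sleep(0.1 * 2 ** attempt + rng.uniform(0, 0.1))  # backoff with jitter
    return "pending"  # ambiguous: show "processing, do not retry" rather than failing
```

Pass `time.sleep` as `sleep` to actually wait; the no-op default keeps the simulation instant. With two late responses, the third attempt returns `"charged"` and the balance drops exactly once.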

For a deeper look at the specific retry configurations and cache invalidation strategies that hold up at high concurrency, the blog post on caching, idempotency, and retries at scale covers the implementation patterns with more detail on the dedup table schema and backoff parameter selection.
