software architecture

Scaling to Millions of Users: A Real-World Architecture Teardown

An anonymized teardown of a consumer platform I scaled to several million users. The architecture that carried ~30K req/s at peak, the four walls we hit on the way up — database connections, a cache stampede that caused a 19-minute outage, payment double-charges, and a credential-stuffing attack that looked like organic growth — and the trade-offs behind each fix. Topology, layered caching, the data tier, WAF and rate-limiting stack, and four real ADRs. No vendor named; the engineering is exactly as it happened.

Ruchit Suthar

Ruchit Suthar

15+ years scaling teams from startup to enterprise. 1,000+ technical interviews, 25+ engineers led. Real patterns, zero theory.

24 min read
Scaling to Millions of Users: A Real-World Architecture Teardown
Key Takeaway

I spent four years scaling a consumer platform from a few thousand users to several million. This is the anonymized teardown — the architecture that carried the traffic, the four walls we hit on the way up (database connections, cache stampedes, payment double-charges, and a credential-stuffing attack that looked like organic growth), and the trade-offs behind each decision. You'll get the topology, the caching layers, the data-tier strategy, the security and rate-limiting stack, the WAF rules that actually mattered, and four ADRs written the way we wrote them at the time. No vendor is named, no logo is on it. The engineering is exactly as it happened.

Scaling to Millions of Users: A Real-World Architecture Teardown


It was a Saturday evening in the middle of a marketing campaign. We'd planned for a 3x traffic bump. We got 11x.

For about ninety seconds the platform held. Then the dashboards went red in a very specific order — first the p99 latency on the catalog endpoint, then the database connection pool saturation, then a cascade of 500s rolling across every service that shared that database. We were fully down for nineteen minutes during the single highest-intent traffic window of the quarter. I have the incident doc. I reread it before writing this.

I'm not going to name the company or the product. I can't, and frankly the name doesn't teach you anything. What teaches you is the shape of the system, the exact points where it bent, and the reasoning we used to decide what to do about it. So this is a composite-but-real teardown: a consumer-facing platform with a transactional core, drawn from a system I actually ran. Every number here is either real or rounded from real. Every failure happened.

If you've read my pieces on caching, idempotency, and retries or architecture decision records, this is the long-form case study those patterns were extracted from.

The system, in one paragraph

Call it the Platform. A B2C application — web and mobile — where users browse a large, frequently-updated catalog (read-heavy), maintain an account, and make payments (write-heavy, money-touching, zero tolerance for error). Traffic is spiky: most days hovered around a steady baseline, but marketing campaigns, sales events, and the occasional press hit would drive 5–15x surges within minutes. At peak we served roughly 4.2 million monthly actives, around 30,000 requests per second at the front door during a campaign, against a catalog of a few hundred thousand items and a payments flow that processed real money. The read:write ratio sat near 95:5 — which, as you'll see, dictated almost every decision we made.

That last sentence is the whole article. Know your read:write ratio before you design anything. A 95:5 system and a 55:45 system want completely different architectures, and most scaling advice on the internet is implicitly written for one without telling you which.

The scaling timeline: where it actually broke

Scaling isn't a smooth curve. It's a series of cliffs. You run fine, then you hit a wall that has nothing to do with the last wall, you fix it, and you run fine again until the next one. Here's the order we hit them — and the order is typical, which is the useful part.

Notice what's not on this list: "we ran out of CPU." We almost never did. At every stage the wall was a shared, contended, stateful resource — connections, a hot cache key, a money mutation, an auth endpoint — not raw compute. Compute you buy. Contention you have to design away. This is the single most important lesson in the whole teardown: at scale, your bottleneck is almost always state and contention, not horsepower.

Let me take the walls one at a time, because each one taught a different part of the architecture.

The target architecture (where we ended up)

Before the walls, here's the topology we converged on, so you can see where each fix lives.

Three services, not thirty. We were deliberate about that — more on the monolith-vs-microservices trade-off below. The point of the diagram is that every layer exists to protect the layer behind it: the CDN protects the WAF, the WAF protects the gateway, the gateway protects the services, the cache protects the database, the queue protects everything from slow synchronous work. Defense in depth, but for load instead of attackers. (Though, as Wall 4 shows, the same layers do double duty.)

Wall 1: the database connection pool (~100K users)

This is the wall nobody warns you about because it's so unglamorous. We hadn't run out of database CPU. We hadn't run out of storage. We ran out of connections.

Here's the mechanism. Every app instance keeps a pool of open database connections. We autoscaled app instances on CPU. Under load, we scaled to 40 instances, each holding a pool of 20 connections — 800 connections demanded against a database configured for 500. Postgres (and most relational databases) allocate a non-trivial chunk of memory per connection, so you can't just raise the cap to 5,000; you'll OOM the database instead. The autoscaler, trying to help, was actively making it worse: more instances meant more connections meant more contention meant higher latency meant the autoscaler added more instances.

The fix was a connection pooler (a PgBouncer-style proxy) sitting between the app fleet and the database in transaction-pooling mode. Now thousands of app-side "connections" multiplexed onto a couple hundred real database connections. The app fleet could scale freely; the database saw a stable, bounded connection count.

The trade-off: transaction-pooling mode breaks anything that relies on session state — prepared statements, session-level SET, some advisory locks. We had to audit the codebase for those patterns and remove them. That's real work, and it's the kind of constraint that should be a decision, written down, not a surprise someone discovers six months later. Which is why it became our first real ADR (below).

What I'd tell my earlier self: put the pooler in on day one. It costs almost nothing at low scale and saves you a Saturday-night outage at high scale. The cheapest scaling fixes are the ones you install before you need them.

Wall 2: the cache stampede (~500K users)

This is the wall that caused the nineteen-minute outage I opened with.

The catalog was cached in Redis with a TTL. Most items had their own keys, but the homepage — the single most-requested object in the system — was one hot key with a 60-second TTL. During the campaign, that key expired. At that instant, several thousand concurrent requests all missed simultaneously and all stampeded the database to recompute the exact same homepage payload. The database, sized for a trickle of misses, fell over. Then every other service sharing that database fell over with it.

I've written the full pattern up in caching, idempotency, and retries, so here I'll just give the three defenses we shipped, in priority order:

  1. Single-flight (request coalescing): on a miss, only the first request recomputes; the rest wait for that one result. One database hit instead of five thousand. This alone would have prevented the outage.
  2. Stale-while-revalidate: serve the slightly-stale cached value while one background task refreshes it. Nobody waits on a cold recompute, and the database sees exactly one refresh per interval.
  3. Jittered TTLs: never let a whole class of keys expire on the same second. Spread expiries across a window so misses arrive as a trickle, not a thundering herd.

But the deeper fix was architectural: a layered cache, so a single expiry could never again reach all the way to the database in one hop.

The numbers afterward: catalog reads hit the database for well under 1% of requests. The other 99% were served from one of the three cache layers. In a 95:5 read-heavy system, your caching strategy is your scaling strategy — the database almost stops being a read bottleneck at all, which frees it to do the one thing only it can do: be the source of truth for writes.

The trade-off: every cache layer is a copy of the truth, and copies go stale. We accepted bounded staleness (seconds) on the catalog because nobody is harmed by a 5-second-old price display — but the actual price was always re-validated at the payment step against the primary. You cache the read path aggressively and verify on the write path absolutely. Staleness tolerance is a per-field decision, never a global one.

Wall 3: the payment double-charge (~2M users)

Money is different. A stale catalog price annoys someone. A double-charge makes them never trust you again, and may be a regulatory problem.

The failure: a user taps "Pay." The request is slow (mobile network). They tap again. Or the client times out and retries automatically. Two requests arrive, both succeed, the card is charged twice. At a few thousand payments a day this is a rare support ticket. At a few hundred thousand payments a day it's a daily fire and a trust crisis.

The fix is idempotency keys, done strictly:

  1. The client generates a unique key per logical payment attempt and sends it as a header.
  2. The server reserves the key atomically before doing any work — a unique constraint in the database, so two concurrent requests with the same key can't both proceed. This is the part people get wrong: they check-then-act, which is a race under exactly the double-tap conditions that cause the problem.
  3. If the key is new, do the charge once, store the full response against the key.
  4. If the key is seen, return the stored original response — not an error, not "already processed," the actual same result the first attempt produced.

The trade-off: correctness over latency. The atomic reservation adds a write and a synchronous check to the hottest, most latency-sensitive path in the product. We paid that cost without hesitation, because on the money path correctness is non-negotiable and "fast and occasionally double-charges" is not a system, it's a liability.

We also moved everything that didn't have to happen synchronously off the payment path entirely — confirmation emails, ledger updates, partner webhooks, analytics — onto a message queue with at-least-once delivery (which is exactly why the downstream consumers also had to be idempotent). The synchronous payment path did the minimum: reserve, authorize, capture, record, respond. Everything else became an async event. This is the event-driven backbone, and it cut payment-path p99 by more than half while making the whole thing more resilient — a slow email provider could no longer slow down a checkout.

Wall 4: the attack that looked like growth (~4M users)

One morning, sign-in traffic was up 60% overnight. For about an hour, the growth team was thrilled. Then we looked at the shape of it: the requests were hitting the login endpoint specifically, from a wide spread of IPs, each trying a handful of email/password pairs before moving on. It was a credential-stuffing attack — someone replaying a breached username/password list against our login to find reused credentials. The "growth" was an attacker, and the load was real enough to threaten the auth database.

This is where the security layers stopped being theoretical. Here's the full stack we ended up running, front to back, because at millions of users you are a target whether you feel like one or not:

  • WAF (web application firewall): the first programmable layer. Beyond the OWASP baseline (SQLi, XSS, path traversal), the rules that actually earned their keep were the boring ones — blocking known-bad bot signatures, geo/ASN rules for traffic from datacenter ranges that had no business hitting a consumer login, and request-shape rules (a login POST with a missing User-Agent or an impossible header order). The WAF caught the bulk of the credential-stuffing volume because the attacker's client didn't look like a real browser.
  • Rate limiting, at two layers. A coarse limit at the edge/gateway per-IP to blunt volumetric floods, and a fine limit keyed on the thing that actually mattered: failed logins per account and per IP, with exponential backoff. Rate limiting by IP alone is nearly useless against a botnet spread across thousands of IPs — the key has to be the resource you're protecting (the account), not just the source. After N failures we required a CAPTCHA, then a step-up challenge.
  • Bot management / DDoS protection in front of the WAF for the volumetric layer — the stuff that's about packets-per-second, not request semantics.
  • Idempotency and replay protection on state-changing endpoints (the same machinery from Wall 3 doubles as a defense against replayed requests).
  • Secrets hygiene: no credentials in code or config files, short-lived rotated tokens, least-privilege service identities. An attacker who gets one service should not get the keys to all of them.

The trade-off in security is always friction vs. safety. Every challenge you add — CAPTCHA, step-up auth, stricter rate limits — costs real users a little friction and costs you a few false positives (a legitimate user on a shared corporate IP who gets rate-limited). We tuned for "invisible to the 99.9%, painful for the attacker": progressive challenges that only escalated when the behavior looked wrong, not blanket friction on everyone. The right amount of security is the amount your real users never notice and your attacker can't get past — and finding it is continuous tuning, not a one-time config. I go deeper on the mindset in the security mindset for product engineers.

The data tier: the part that can't just scale out

App servers are easy: they're stateless, you add more. The database is hard, because it holds the truth and the truth can't be in two places disagreeing.

Our progression, which is the standard one and works:

  1. Vertical first. Buy a bigger database box. Unsexy, but it buys you a year and a relational database on good hardware goes further than Twitter threads suggest. Don't shard before you have to.
  2. Read replicas for the 95% that's reads. The catalog and most account reads went to replicas; only writes and read-your-own-writes-critical paths hit the primary. The catch every team hits: replication lag. A user updates their profile, the read goes to a replica that hasn't caught up, and they see their old data and file a bug. The fix is read-your-writes routing — pin a user to the primary for a short window after they write. A decision, not an accident.
  3. Connection pooling (Wall 1) so the fleet can grow without drowning the primary.
  4. Sharding / partitioning — last, and only where forced. We sharded exactly one thing: the highest-volume write table, partitioned by a key that kept each tenant's data co-located so we almost never needed a cross-shard query. Sharding is the most expensive scaling move there is; it complicates every query, every migration, and every transaction. Postpone it as long as honestly possible, and when you do it, pick the shard key by studying your actual query patterns, not by guessing. Get the shard key wrong and every read becomes a scatter-gather across all shards — you've added all the cost and kept none of the benefit. I cover the cheaper failure modes in database design mistakes that haunt you at scale.

The meta-trade-off: every step up this ladder trades simplicity for capacity. A single primary is the easiest system in the world to reason about. Read replicas add staleness. Sharding adds operational and query complexity to everything. So you climb the ladder one rung at a time, only when the current rung is genuinely exhausted — never preemptively, because the complexity you add today is paid back with interest every day after.

The trade-offs, on one page

The whole teardown, as a decision table — because the value of experience is mostly knowing what you're giving up:

DecisionWhat we gainedWhat we gave up
Three services, not microservicesSimple to reason about, deploy, debug; shared transactions where they matteredIndependent scaling/deploy of each domain; some coupling
Aggressive layered cachingDatabase freed from 99% of reads; survived spikesBounded staleness; cache invalidation complexity
Connection poolerApp fleet scales independently of DBLost session-level DB features; audit cost
Idempotency on writesZero double-charges; safe retriesExtra write + latency on the hot money path
Async via queueFast checkout; resilient to slow downstreamsEventual consistency; consumers must be idempotent
Read replicasRead scale at low costReplication lag; read-your-writes routing needed
Sharding (one table, late)Write scale past a single boxQuery/migration/transaction complexity forever
Layered security + rate limitingSurvived credential stuffing invisiblyTuning burden; rare false positives on real users

There is no row in this table without a cost. Anyone who sells you a scaling decision as pure upside is selling you the part they haven't hit yet.

Four ADRs, the way we actually wrote them

We kept lightweight architecture decision records — context, decision, status, consequences. The point of an ADR isn't ceremony; it's that eighteen months later, when someone asks "why on earth is it built like this," the answer is a file, not a folk legend. Here are four real ones, lightly anonymized.

ADR-014: Introduce a transaction-mode connection pooler

  • Status: Accepted
  • Context: App autoscaling drove total DB connections past the primary's safe limit during campaigns, causing connection-exhaustion outages. Raising the DB connection cap risks OOM due to per-connection memory.
  • Decision: Place a transaction-mode connection pooler between the app fleet and the primary. App pools target the pooler; the pooler maintains a bounded real-connection count to the DB.
  • Consequences: App fleet scales independently of DB connections. We forbid session-level state (session SET, server-side prepared statements via the affected driver path, session advisory locks). Codebase audited and migrated. New constraint added to the service template and PR checklist.

ADR-021: Cache the read path, verify on the write path

  • Status: Accepted
  • Context: 95:5 read:write ratio. Catalog reads dominate load; a cache stampede on a hot key caused a 19-minute outage.
  • Decision: Layered cache (edge → Redis with single-flight + stale-while-revalidate → in-process) on all catalog/account reads. Money-relevant values (price, availability) are always re-validated against the primary at the point of the transaction, never trusted from cache.
  • Consequences: Database serves <1% of catalog reads. Bounded staleness (seconds) accepted on display. Stampede protection mandatory for any new hot key. Staleness tolerance documented per field.

ADR-027: Strict idempotency on all state-changing endpoints

  • Status: Accepted
  • Context: Client retries and double-taps caused duplicate payments at scale. At-least-once queue delivery means downstream consumers also see duplicates.
  • Decision: All mutating endpoints require an idempotency key, reserved atomically via a unique constraint before work begins; the full response is stored and replayed on repeat. All queue consumers are idempotent on the event ID.
  • Consequences: Duplicate-charge incidents → zero. Added one write + a synchronous check to the payment path (accepted). Key lifetime scoped to 24h to bound storage. Idempotency is now a definition-of-done item for any write endpoint.

ADR-033: Rate-limit on the protected resource, not just the source IP

  • Status: Accepted
  • Context: Credential-stuffing attack distributed across thousands of IPs evaded per-IP limits and threatened the auth database.
  • Decision: Two-layer rate limiting. Coarse per-IP limit at the edge for volumetric defense; fine-grained limit keyed on failed logins per account with exponential backoff and progressive challenge (CAPTCHA → step-up). WAF rules block datacenter ASNs and non-browser request shapes on auth endpoints.
  • Consequences: Attack absorbed without user-visible impact. Rare false positives on shared-IP legitimate users, mitigated by behavior-based escalation. Ongoing rule tuning owned by the platform team.

What I'd do differently

Honesty is the point of a teardown, so:

  • Install the connection pooler and stampede protection before you need them. Both are nearly free at low scale and prevent your worst outage at high scale. We learned both in production. You don't have to.
  • Decide your staleness tolerance per field, early, and write it down. Most of our cache bugs were really "nobody decided how fresh this needs to be" bugs.
  • Treat security load-shaping as a first-class scaling concern, not a separate workstream. The credential-stuffing attack was a capacity event as much as a security event. WAF, rate limiting, and bot management belong in your scaling architecture, not bolted on after an incident.
  • We over-invested in dashboards and under-invested in alerting on the right leading indicators. We watched p99 latency (a lagging symptom) when we should have alerted on connection-pool saturation and cache hit-rate (leading causes). Watch the cause, not just the symptom — there's more on this in observability beyond dashboards.

What to do Monday morning

You don't need millions of users to apply any of this. Pick the ones that match where you are:

  1. Write down your read:write ratio. If you don't know it, instrument it this week. It decides everything downstream.
  2. Find your single hottest cache key and confirm it has stampede protection. If a hot key can expire and reach your database with no coalescing, you have a loaded gun pointed at it.
  3. Check whether your write endpoints are idempotent. Take the most important one — payments, orders, sign-ups — and ask "what happens if this runs twice?" If the answer is bad, you have a Wall 3 in your future.
  4. Look at how your app connects to your database. Count instances × pool size and compare it to your database's connection limit. If the first number can exceed the second under autoscaling, you have a Wall 1.
  5. Rate-limit your login endpoint on failed-attempts-per-account, not just per-IP. This is a few hours of work and it's the difference between absorbing a credential-stuffing attack and being taken down by one.

Key takeaways

  • At scale, the bottleneck is contention and state, not compute. Connections, hot keys, money mutations, auth endpoints — not CPU.
  • Your read:write ratio dictates your architecture. A 95:5 system is a caching problem; design it as one.
  • Scaling is a series of cliffs, not a curve. Fix the wall in front of you; don't pre-build for the one three stages away.
  • Every scaling decision is a trade-off. Caching buys speed with staleness; sharding buys write-scale with permanent complexity. Know the cost before you take the win.
  • Security is a capacity concern. WAF, layered rate limiting, and bot management are part of how you survive traffic, not a separate problem.
  • Write the ADR. The reasoning is the asset. Code shows what; only the decision record shows why.

Your next step

Take one production system you own and write a single ADR for the most load-bearing decision in it — the one a new hire would question first. If you can't fill in the "consequences" section honestly, you don't yet understand the trade-off you made, and that's exactly the gap that becomes a Saturday-night outage. Start there.

Frequently asked questions

What is the most common bottleneck when scaling to millions of users?

Contention over shared, stateful resources — not raw compute. In practice the walls show up as database connection exhaustion, cache stampedes on hot keys, unsafe write operations under retries, and abuse of auth endpoints. Compute you can buy by adding stateless instances; contention you have to design away with pooling, caching, idempotency, and rate limiting. If your scaling plan is mostly "add more servers," you're planning for the wrong bottleneck.

How important is the read-to-write ratio in architecture decisions?

It's the single most decisive number. A read-heavy system (say 95:5) is fundamentally a caching and read-replica problem — the database almost stops being a read bottleneck once you cache aggressively. A write-heavy system can't cache its way out and pushes you toward partitioning, queues, and write-optimized stores much sooner. Most generic scaling advice silently assumes one ratio or the other, so measure yours before applying anyone's playbook.

What is a cache stampede and how do you prevent it?

A cache stampede happens when a popular cache key expires and many concurrent requests all miss at once, simultaneously hammering the database to recompute the same value — often causing a worse outage than having no cache at all. Prevent it with single-flight (request coalescing) so only one request recomputes, stale-while-revalidate so requests are served the slightly-stale value during a background refresh, and jittered TTLs so a whole class of keys never expires on the same second.

How do you prevent double charges in a payment system at scale?

Use idempotency keys, strictly. The client sends a unique key per logical payment; the server reserves that key atomically (via a database unique constraint) before doing any work, so two concurrent requests can't both proceed. If the key is new, charge once and store the full response against it; if it's already seen, return the stored response rather than charging again. The critical detail is reserving the key atomically before the work — a check-then-act approach races under exactly the double-tap conditions that cause duplicate charges.

Should rate limiting be done per IP address?

Per-IP rate limiting is necessary for blunting volumetric floods but is nearly useless on its own against distributed attacks like credential stuffing, where requests come from thousands of IPs. Effective rate limiting keys on the resource being protected — for a login endpoint, that's failed attempts per account — combined with exponential backoff and progressive challenges (CAPTCHA, step-up auth). Run both layers: coarse per-IP at the edge, fine-grained per-resource at the gateway.

When should you shard your database?

As late as honestly possible. Exhaust vertical scaling, then read replicas, then connection pooling first — these carry most systems much further than people expect. Shard only the specific tables forced by write volume that a single primary genuinely can't handle, and choose the shard key by studying your real query patterns so related data stays co-located and you avoid cross-shard queries. Sharding adds permanent complexity to every query, migration, and transaction, so it should be the last rung on the ladder, never a preemptive one.

How do you handle traffic spikes that are far larger than forecast?

Assume your forecast is wrong and design the layers to fail gracefully. Layered caching keeps spikes off the database; a connection pooler keeps an autoscaling fleet from drowning the primary; a message queue absorbs non-critical work so the synchronous path stays fast; and rate limiting plus load shedding protect the core when demand still exceeds capacity. The goal isn't to serve infinite load perfectly — it's to degrade in a controlled way (shed the non-essential, protect the money path) instead of cascading into a full outage.

#software-architecture#system-design#scalability#high-traffic#caching#rate-limiting#waf#architecture-decision-records#database-scaling#case-study#2026
Ruchit Suthar

Ruchit Suthar

15+ years scaling teams from startup to enterprise. 1,000+ technical interviews, 25+ engineers led. Real patterns, zero theory.

Enjoyed this article?

Get more like it — straight to your inbox.

Get new articles in your inbox — no spam, unsubscribe anytime.

Continue Reading

Software Architecture Patterns: A Reference Catalog with Diagrams, Failure Modes, and Code
software architecture

Software Architecture Patterns: A Reference Catalog with Diagrams, Failure Modes, and Code

A practical reference catalog of the eight architectures worth knowing — layered, modular monolith, hexagonal, event-driven, CQRS + event sourcing, microservices, serverless, and the strangler fig. Each with a diagram, the forces that make it the right call, the failure mode that makes it the wrong one, and a link to runnable reference code. Plus a decision flowchart so you pick on fit, not hype.

·18 min read
LLM Architecture in Production: RAG, Vector Databases, and the 7-Point System-Design Checklist
software architecture

LLM Architecture in Production: RAG, Vector Databases, and the 7-Point System-Design Checklist

Adding an LLM to your product is a distributed-systems problem with a non-deterministic dependency, not a single API call. When RAG actually helps (and when a prompt will do), how to think about vector databases and chunking without cargo-culting, the retrieval pipeline that separates demos from products, and the seven-point production checklist — evals, guardrails, cost ceilings, latency budgets, fallbacks, observability, and a human-in-the-loop boundary — to put in place before a real user touches it.

·15 min read
Caching, Idempotency, and Retries: The Three Things That Break at Scale
software architecture

Caching, Idempotency, and Retries: The Three Things That Break at Scale

Three patterns separate systems that survive scale from systems that get paged at 3am. Cache invalidation and the stampede problem, idempotency keys done right, and retries with exponential backoff, jitter, and circuit breakers — plus how the three fit together into one reliability story. Get them right and most of your 3am pages quietly disappear.

·12 min read