Patterns That Scale and Anti-Patterns That Hurt

The architectural decisions that determine whether your system thrives or fails.

Article 4 of 8 · 14 min · Intermediate

Key Takeaway

Patterns are tools, not trophies. I've watched teams adopt event-driven architecture, CQRS, and microservices as proof of sophistication — and end up with systems that were harder to deploy, debug, and own than the monolith they escaped. This article covers six patterns that genuinely scale, five anti-patterns that quietly destroy systems, and — more importantly — the honest conditions under which each pattern earns its complexity cost.

I once joined a team that had spent eighteen months building what they proudly called a "microservices architecture."

In practice, every deployment required coordinating changes across at least four services. A database schema change in one service would cascade failures into six others. Engineers dreaded Fridays. The team had built a distributed monolith and convinced themselves it was progress.

That experience taught me something I now consider fundamental: knowing patterns is not enough. You need to know when they apply, when they don't, and what happens when you get it wrong.

This article covers both sides.

See these patterns run. Each pattern below links to a runnable reference with diagrams and tests, across two companion repos: this pathway's patterns/ (chatty-vs-BFF, bulkhead) and the system-architecture-catalog (event-driven, CQRS + event sourcing, saga, strangler fig).

Patterns That Scale

Event-Driven Architecture

When to use it: When multiple parts of your system need to react to something that happened, but the thing that happened shouldn't care about who's listening.

Why it works: Event-driven architecture decouples producers from consumers. The order service publishes an "OrderPlaced" event. It doesn't know or care that the inventory service, the notification service, and the analytics pipeline are all subscribed. Each consumer processes the event on its own schedule, at its own pace, with its own failure handling.

Real example: At a fintech company I advised, the payment processing system originally called the fraud detection service, the ledger service, and the notification service synchronously — one after another. When the fraud service had a slow day, checkout latency spiked for everyone. We moved to event-driven: the payment service published a "PaymentInitiated" event, and each downstream service consumed it independently. Checkout p99 latency dropped by 60%. When the fraud service went down for maintenance, payments still processed. Fraud checks ran when the service came back, with no data lost.

The key insight: Events represent immutable facts about what happened. That property alone makes systems easier to reason about, replay, and debug.

The honest cost: Event-driven systems introduce complexity around ordering guarantees, idempotency, and eventual consistency. If you need strong consistency and synchronous feedback, this pattern works against you. Don't reach for it just because it sounds modern.
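
A minimal sketch of the idempotency concern, assuming an in-memory stand-in for an at-least-once broker. The InMemoryBus class, the idempotent wrapper, and the event field names are hypothetical and are not taken from the catalog's event-driven/ reference. It shows a producer publishing without knowing its subscribers, a consumer that ignores a redelivered event, and a failing consumer that lands events in a dead-letter list without affecting anyone else.

```typescript
// Hypothetical event shape; field names are illustrative only.
interface PaymentInitiated {
  eventId: string; // unique per event, used as the idempotency key
  paymentId: string;
  amountCents: number;
}

type Handler = (event: PaymentInitiated) => void;

// In-memory stand-in for an at-least-once broker (an assumption for this sketch).
class InMemoryBus {
  private readonly handlers: Handler[] = [];
  readonly deadLetters: PaymentInitiated[] = [];

  subscribe(handler: Handler): void {
    this.handlers.push(handler);
  }

  // The producer publishes without knowing who is subscribed.
  publish(event: PaymentInitiated): void {
    for (const handler of this.handlers) {
      try {
        handler(event);
      } catch {
        // A failing consumer does not affect the producer or other consumers.
        this.deadLetters.push(event);
      }
    }
  }
}

// Wraps a handler so that a redelivered event is processed only once.
function idempotent(handler: Handler): Handler {
  const seen = new Set<string>();
  return (event) => {
    if (seen.has(event.eventId)) return; // duplicate delivery, skip
    handler(event);
    seen.add(event.eventId); // mark as processed only after success
  };
}

const bus = new InMemoryBus();
const ledger: string[] = [];
bus.subscribe(idempotent((e) => { ledger.push(e.paymentId); }));
bus.subscribe(() => { throw new Error("fraud service unavailable"); });

const paymentEvent: PaymentInitiated = { eventId: "evt-1", paymentId: "pay-42", amountCents: 1999 };
bus.publish(paymentEvent);
bus.publish(paymentEvent); // simulate a redelivery from the broker
```

The ledger records the payment once despite two deliveries, and the failing consumer's events wait in deadLetters for later replay. A real consumer would keep the seen-keys set in durable storage rather than in memory.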

Runnable reference: event-driven/ in the catalog — a producer/consumer with idempotency keys and a dead-letter queue over an at-least-once broker.

CQRS (Command Query Responsibility Segregation)

When to use it: When your read and write patterns are fundamentally different — different data shapes, different performance requirements, different scaling needs.

Why it works: Most applications read far more than they write, but we force reads and writes through the same data model, the same database, the same code paths. CQRS says: split them. Your write model is optimised for consistency and validation. Your read model is optimised for fast, denormalised queries.
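
The split can be sketched in a few lines, assuming an in-memory write model and read model. All class and field names here are hypothetical, and the projection is applied synchronously for brevity, whereas a real system would usually apply it asynchronously from an event stream.

```typescript
// Hypothetical event emitted by the write side.
interface ProductUpdated {
  type: "ProductUpdated";
  productId: string;
  name: string;
  priceCents: number;
}

// Write side: validates commands and records events. Optimised for correctness.
class ProductWriteModel {
  private readonly events: ProductUpdated[] = [];

  updateProduct(productId: string, name: string, priceCents: number): ProductUpdated {
    if (name.trim() === "") throw new Error("name is required");
    if (!Number.isInteger(priceCents) || priceCents < 0) throw new Error("invalid price");
    const event: ProductUpdated = { type: "ProductUpdated", productId, name, priceCents };
    this.events.push(event);
    return event;
  }
}

// Read side: a denormalised view shaped for queries. Optimised for fast reads.
interface ProductView {
  productId: string;
  name: string;
  displayPrice: string;
}

class ProductReadModel {
  private readonly byId = new Map<string, ProductView>();

  // Projection: applies a write-side event to the read model.
  apply(event: ProductUpdated): void {
    this.byId.set(event.productId, {
      productId: event.productId,
      name: event.name,
      displayPrice: `$${(event.priceCents / 100).toFixed(2)}`,
    });
  }

  search(term: string): ProductView[] {
    const needle = term.toLowerCase();
    return Array.from(this.byId.values()).filter((p) => p.name.toLowerCase().includes(needle));
  }
}

const writes = new ProductWriteModel();
const reads = new ProductReadModel();
reads.apply(writes.updateProduct("sku-1", "Trail Shoe", 8999));
reads.apply(writes.updateProduct("sku-1", "Trail Shoe v2", 9499));
```

Validation lives only on the write side, and the read side stores exactly the shape the UI needs, so neither model constrains the other.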

Real example: An e-commerce platform I worked on had a product catalogue. Writes were rare: merchants updated products a few times a day. Reads were constant: millions of customers browsing, searching, filtering. The single PostgreSQL instance handling both was buckling under read load. We split the write side (PostgreSQL with strict validation) from the read side (Elasticsearch with denormalised product documents). Writes published events that updated the read model asynchronously. The write side stayed consistent. The read side could scale horizontally on its own as read traffic grew, at the cost of being briefly stale after each write.

The honest cost: CQRS adds the complexity of two separate models and the infrastructure to keep them in sync. If your read and write patterns are similar, you're paying a significant cost for no benefit. Don't use this until the pain of a single model is concrete and measurable.

Runnable reference: cqrs-event-sourcing/ in the catalog — separate write and read models with an event store projecting into queryable read models.

Circuit Breaker

When to use it: When your service depends on another service that might be slow or unavailable, and you'd rather fail fast than drag your own performance down.

Why it works: A circuit breaker monitors calls to a dependency. When failures exceed a threshold (say, 50% of requests over 30 seconds), the circuit "opens" — subsequent calls fail immediately without attempting the request. After a cooldown period, the circuit "half-opens," allowing a test call through. If it succeeds, the circuit closes. Normal operation resumes.
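
The closed, open, half-open cycle described above can be sketched as a small state machine. This is an illustration under stated assumptions, not the circuit-breaker.ts from the reference build: the class name, the option names, the minCalls guard, and the injectable clock are all hypothetical, and calls are synchronous to keep the sketch short.

```typescript
type BreakerState = "closed" | "open" | "half-open";

interface BreakerOptions {
  failureThreshold: number; // fraction of failed calls that trips the breaker, e.g. 0.5
  windowMs: number;         // how far back to count results, e.g. 30_000
  cooldownMs: number;       // how long to stay open before allowing a trial call
  minCalls: number;         // assumption: do not trip on a tiny sample
  now?: () => number;       // injectable clock, useful in tests
}

class CircuitBreaker {
  private state: BreakerState = "closed";
  private openedAt = 0;
  private results: { at: number; ok: boolean }[] = [];
  private readonly opts: BreakerOptions;
  private readonly now: () => number;

  constructor(opts: BreakerOptions) {
    this.opts = opts;
    this.now = opts.now ?? (() => Date.now());
  }

  getState(): BreakerState {
    // An open circuit becomes half-open once the cooldown has elapsed.
    if (this.state === "open" && this.now() - this.openedAt >= this.opts.cooldownMs) {
      this.state = "half-open";
    }
    return this.state;
  }

  call<T>(request: () => T, fallback: () => T): T {
    if (this.getState() === "open") return fallback(); // fail fast, no request attempted
    try {
      const value = request();
      this.record(true);
      return value;
    } catch {
      this.record(false);
      return fallback();
    }
  }

  private record(ok: boolean): void {
    const t = this.now();
    if (this.state === "half-open") {
      // The trial call decides: close on success, reopen on failure.
      this.state = ok ? "closed" : "open";
      this.openedAt = t;
      this.results = [];
      return;
    }
    // Keep only results inside the sliding window, then add the new one.
    this.results = this.results.filter((r) => t - r.at < this.opts.windowMs);
    this.results.push({ at: t, ok });
    const failures = this.results.filter((r) => !r.ok).length;
    const failureRate = failures / this.results.length;
    if (this.results.length >= this.opts.minCalls && failureRate >= this.opts.failureThreshold) {
      this.state = "open";
      this.openedAt = t;
    }
  }
}

let fakeNow = 0; // test clock in milliseconds
const breaker = new CircuitBreaker({
  failureThreshold: 0.5, windowMs: 30_000, cooldownMs: 5_000, minCalls: 4, now: () => fakeNow,
});
const failingCall = (): string => { throw new Error("upstream timeout"); };
for (let i = 0; i < 4; i++) breaker.call(failingCall, () => "cached");
```

After four consecutive failures the breaker is open, so the next call returns the fallback without touching the dependency. Advancing the clock past the cooldown lets one trial call through.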

Real example: A travel booking platform I consulted for called a third-party hotel availability API. When that API degraded, which happened weekly, the availability service would stack up thousands of threads waiting for timeouts. This cascaded into the entire booking flow. With circuit breakers in place at a 50% failure threshold, the circuit opened within seconds of the API degrading. The availability service returned cached results instead of blocking. The rest of the platform stayed healthy.

The principle: Fail fast, recover gracefully. A slow cascading failure is almost always worse than a fast isolated one.

Runnable reference: circuit-breaker.ts in the notification-system build — closed → open → half-open transitions, exercised by the channel-worker tests.

Bulkhead Isolation

When to use it: When a failure in one part of your system should not consume resources needed by unrelated parts.

Why it works: Named after the bulkheads in ship hulls that prevent a breach in one compartment from flooding the entire vessel. In software, this means isolating resources — thread pools, connection pools, memory allocations — so that one misbehaving component can't starve everything else.
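
One way to picture resource isolation is a fixed number of permits per workload. The sketch below is a hypothetical Bulkhead class, not the bulkhead.ts from the companion repo, and the pool sizes are made-up numbers. It rejects work immediately when a pool is full. Queueing instead of rejecting is an equally valid design.

```typescript
// A bulkhead modelled as a fixed number of permits for one workload.
class Bulkhead {
  readonly label: string;
  private readonly maxConcurrent: number;
  private inUse = 0;

  constructor(label: string, maxConcurrent: number) {
    this.label = label;
    this.maxConcurrent = maxConcurrent;
  }

  // Takes a permit if one is free; returns false when the pool is saturated.
  tryAcquire(): boolean {
    if (this.inUse >= this.maxConcurrent) return false;
    this.inUse += 1;
    return true;
  }

  release(): void {
    if (this.inUse > 0) this.inUse -= 1;
  }

  // Runs a task inside the bulkhead, rejecting immediately when it is full.
  async run<T>(task: () => Promise<T>): Promise<T> {
    if (!this.tryAcquire()) throw new Error(`${this.label} bulkhead is full`);
    try {
      return await task();
    } finally {
      this.release(); // always return the permit, even if the task fails
    }
  }
}

// Separate pools per workload; the sizes here are illustrative assumptions.
const pools = {
  api: new Bulkhead("api", 50),
  exports: new Bulkhead("exports", 2),
  webhooks: new Bulkhead("webhooks", 10),
};
```

Because each workload draws from its own pool, exhausting the exports permits has no effect on the permits available to api or webhooks.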

Real example: A SaaS platform I helped scale had a single HTTP thread pool serving all API endpoints. When a data export endpoint ran heavy queries for enterprise customers, it consumed all available threads. The login endpoint, dashboard, and webhook delivery all became unresponsive. We created separate thread pools: one for real-time API calls, one for data exports, one for webhook delivery. When exports saturated their pool, login kept working. The blast radius of any single failure shrank dramatically.

Think of it this way: You wouldn't wire your entire house on a single circuit breaker. Why wire your entire service on a single resource pool?

Runnable reference: bulkhead.ts — a concurrency-limited pool; bulkhead.test.ts proves that saturating the export pool leaves login fully available.

Saga Pattern

When to use it: When you need to coordinate a business transaction that spans multiple services, and a traditional database transaction isn't possible.

Why it works: Instead of a single ACID transaction, a saga breaks the work into a sequence of local transactions, each with a compensating action that undoes its effect if a later step fails.
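
The mechanics can be sketched with a small runner, assuming each step carries its own compensating action. The SagaStep shape and runSaga function are hypothetical names for this illustration, and the steps are synchronous for brevity. In practice each step would be an asynchronous call to another service, with retries and persisted saga state.

```typescript
interface SagaStep {
  label: string;
  action: () => void;     // the local transaction
  compensate: () => void; // undoes the action if a later step fails
}

interface SagaResult {
  ok: boolean;
  failedStep?: string;
}

// Runs steps in order; on failure, compensates the completed steps in reverse order.
function runSaga(steps: SagaStep[]): SagaResult {
  const completed: SagaStep[] = [];
  for (const step of steps) {
    try {
      step.action();
      completed.push(step);
    } catch {
      for (const done of completed.reverse()) {
        done.compensate(); // in a real system this must be idempotent and retried
      }
      return { ok: false, failedStep: step.label };
    }
  }
  return { ok: true };
}

const sagaLog: string[] = [];
const sagaResult = runSaga([
  {
    label: "reserve-inventory",
    action: () => { sagaLog.push("reserved"); },
    compensate: () => { sagaLog.push("released"); },
  },
  {
    label: "charge-payment",
    action: () => { sagaLog.push("charged"); },
    compensate: () => { sagaLog.push("refunded"); },
  },
  {
    label: "schedule-shipping",
    action: () => { throw new Error("carrier unavailable"); },
    compensate: () => { sagaLog.push("shipping-cancelled"); },
  },
]);
```

When shipping fails, the runner refunds the payment and then releases the inventory, so sagaLog reads reserved, charged, refunded, released. The failed step itself is not compensated because it never completed.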

Real example: An order fulfilment flow: reserve inventory → charge payment → schedule shipping. If payment fails after inventory is reserved, you need to release the inventory. If shipping fails after payment, you need to refund and release inventory. At one company I worked with, we moved from a monolithic transaction that locked rows across three databases to a saga-based flow. The system became more resilient — a payment gateway outage no longer blocked inventory operations. Each service could be deployed independently.

The hard part: Compensating actions aren't always simple or instantaneous. You can't "unsend" an email. Design your sagas with idempotency and eventual consistency in mind before you start building them.

Runnable reference: microservices/ in the catalog — an order saga that compensates (releases inventory, refunds) when a later step fails.

Strangler Fig (Migration Pattern)

When to use it: When you need to migrate from a legacy system to a new one without a risky big-bang cutover.

Why it works: Named after the strangler fig tree that grows around a host tree and eventually replaces it. You place a facade in front of the legacy system. New features go into the new system. Existing features migrate incrementally. Traffic routes gradually from old to new. At some point, the legacy system handles nothing and can be decommissioned.
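
The facade's routing decision can be sketched as a per-capability rollout table. The StranglerFacade class, the capability names, and the bucket parameter are assumptions for illustration. A real gateway would derive the bucket from something stable, such as a hashed customer ID, so each caller consistently hits the same backend.

```typescript
type Backend = "legacy" | "new";

// Facade routing table: capability -> share of traffic (0 to 1) sent to the new system.
class StranglerFacade {
  private readonly rollout = new Map<string, number>();

  setRollout(capability: string, shareToNew: number): void {
    if (shareToNew < 0 || shareToNew > 1) throw new Error("share must be between 0 and 1");
    this.rollout.set(capability, shareToNew);
  }

  // `bucket` is a stable value in [0, 1) for the caller (assumed to be precomputed).
  route(capability: string, bucket: number): Backend {
    const share = this.rollout.get(capability) ?? 0; // unmigrated capabilities stay on legacy
    return bucket < share ? "new" : "legacy";
  }

  // The legacy system can be retired once every capability is fully on the new system.
  legacyCanBeDecommissioned(capabilities: string[]): boolean {
    return capabilities.every((c) => this.rollout.get(c) === 1);
  }
}

const facade = new StranglerFacade();
facade.setRollout("account-lookup", 1);          // fully migrated
facade.setRollout("transaction-history", 0.25);  // a quarter of traffic on the new system
```

Rolling back a capability is a single setRollout call, which is the working fallback that makes the incremental approach safer than a cutover.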

Real example: A financial services company I worked with had a 15-year-old monolithic core banking system. A full rewrite was estimated at three years. Instead, we placed an API gateway in front of it. New loan products were built separately in the new system. Over 18 months, capabilities migrated one at a time: account lookups, then transaction history, then statement generation. The legacy system was fully decommissioned in two years with zero downtime and no big-bang risk.

Why this beats a rewrite: Rewrites fail because they try to replicate years of accumulated business logic in one shot. The strangler fig lets you validate each piece incrementally, with a working fallback at every step.

Runnable reference: strangler-fig/ in the catalog — a facade that routes a growing share of traffic from legacy to new, one capability at a time.

Anti-Patterns That Hurt

The Distributed Monolith

What it looks like: You have microservices, but they must be deployed together, share a database, or require synchronous calls in a specific order to function.

Why it hurts: You've taken on all the operational complexity of distributed systems — network latency, partial failures, data consistency challenges — without any of the benefits. You can't deploy independently, can't scale independently, and debugging requires tracing calls across multiple services.

How it happens: Teams split a monolith by code module rather than by business domain. The "user service" calls the "order service" which calls the "inventory service" synchronously, and they all read from the same MySQL database. Service boundaries were drawn, but nothing was actually decoupled.

The fix: If your services can't be deployed and operated independently, they shouldn't be separate services. Merge them back, or invest in true decoupling — separate data stores, async communication, explicit contracts. Half measures make this worse.

The God Service

What it looks like: One service that does everything important. Authentication, core business logic, data transformation, notification dispatch, report generation. Every other service depends on it.

Why it hurts: Single point of failure. Impossible to scale specific capabilities independently. Deployments are terrifying because any change could affect any functionality. The team that owns it becomes an organisational bottleneck.

How it happens: It starts as the "core" service — the first one built, the one that handles the main flow. Over time, every new feature goes into it because "it already has the data" or "it's simpler to add it here." Two years later, it's 200,000 lines of code and nobody wants to touch it.

The fix: Identify bounded contexts within the god service. Extract them one at a time using the strangler fig pattern. Start with the capability that has the clearest boundary and the most independent data model.

Shared Database Coupling

What it looks like: Multiple services read from and write to the same database tables. The order service and the billing service both query the orders table directly.

Why it hurts: Database schema becomes a shared contract nobody explicitly owns. A column rename in one service breaks another. Performance tuning for one access pattern degrades another. You can never migrate one service to a different data store without coordinating with every consumer simultaneously.

How it happens: It feels pragmatic. Why duplicate data when you can just share the table? The answer surfaces six months later when the billing team adds an index that tanks the order service's write performance, and nobody can deploy without a coordinated maintenance window.

The fix: Each service owns its own data store. When Service B needs data from Service A, it calls an API or consumes events. It never reads a database it doesn't own. This is the bounded context mental model made operational.

Chatty Microservices

What it looks like: Rendering a single page requires 15 service-to-service calls. Loading a user dashboard means calling the user service, the preferences service, the notifications service, the activity service, and the permissions service — all synchronously, some dependent on previous responses.

Why it hurts: Latency accumulates. If each call takes 50ms, 15 sequential calls take 750ms minimum, and that assumes zero failures and no retries. Parallelising independent calls helps, but calls that depend on earlier responses still have to wait their turn. Network reliability degrades with each additional hop. A single slow service degrades the entire page.

How it happens: Services are split too granularly, often along data entity boundaries rather than use-case boundaries. A "pure" microservice per entity sounds clean but performs terribly under real user flows.

The fix: Consider the BFF (Backend for Frontend) pattern — a service that aggregates data from multiple downstream services into a single response shaped for the UI. Or reconsider your service boundaries. If two services are always called together, they may belong as one service.
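
A back-of-the-envelope latency model makes the trade-off concrete. Calls made one after another add up, independent calls fanned out in parallel cost as much as the slowest one, and dependent calls run in stages. The function names below are hypothetical, and the model deliberately ignores network jitter, retries, and the aggregator's own overhead.

```typescript
// Latency of calls made one after another: the durations add up.
function sequentialLatencyMs(callsMs: number[]): number {
  return callsMs.reduce((total, ms) => total + ms, 0);
}

// Latency of independent calls fanned out in parallel: the slowest call dominates.
function parallelLatencyMs(callsMs: number[]): number {
  return callsMs.length === 0 ? 0 : Math.max(...callsMs);
}

// Calls with dependencies run in stages: parallel within a stage, sequential across stages.
function stagedLatencyMs(stages: number[][]): number {
  return sequentialLatencyMs(stages.map((stage) => parallelLatencyMs(stage)));
}

const fifteenCalls: number[] = new Array<number>(15).fill(50);

const chattyMs = sequentialLatencyMs(fifteenCalls); // 15 x 50ms one after another
const bffMs = parallelLatencyMs(fifteenCalls);      // all 15 fanned out at once
const twoStageMs = stagedLatencyMs([[50, 50, 50], [50, 50]]); // second stage waits on the first
```

With these inputs, chattyMs is 750, bffMs is 50, and twoStageMs is 100. The staged case is the realistic ceiling for a BFF when some calls need data from earlier responses.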

Runnable reference: chatty-vs-bff.ts — the test proves 15 sequential 50ms calls cost 750ms while the parallel BFF pays only 50ms.

Premature Optimisation

What it looks like: Adding caching layers, message queues, read replicas, and event sourcing to a system that serves 100 requests per minute. Building for "future scale" that may never materialise.

Why it hurts: Every piece of infrastructure you add is a piece you have to operate, monitor, debug, and pay for. A cache introduces invalidation complexity. A message queue introduces questions about ordering and delivery guarantees. Read replicas introduce replication lag. Each adds surface area for bugs and operational overhead that slows development.

How it happens: Engineers love building for scale. It feels responsible. But building for hypothetical scale at the cost of real velocity is a bad trade. I've seen teams spend months building an event-sourced architecture for a system that never exceeded 50 concurrent users.

The fix: Start simple. Measure. Optimise the actual bottleneck. A well-tuned monolith on a single database can handle more traffic than most startups will ever see. Add complexity when the load data justifies it, not when the architecture diagram looks impressive.

Key Takeaways

Patterns are tools, not goals. Applying a pattern without understanding its trade-offs is how you create new problems while solving old ones.
Event-driven architecture decouples producers from consumers but requires careful thinking about ordering, idempotency, and eventual consistency.
CQRS earns its complexity cost only when read and write patterns diverge significantly — don't use it when they're similar.
Circuit breakers and bulkheads are non-negotiable for systems that call external dependencies. Fail fast, isolate failures, protect the whole.
Sagas replace distributed transactions with compensating actions — but compensating actions must be designed from day one, not discovered during an incident.
Strangler fig is the only migration pattern I've seen work reliably at scale. Avoid big-bang rewrites.
The distributed monolith is the most costly anti-pattern in organisations that adopted microservices without understanding why. If your services can't deploy independently, you haven't decoupled anything.
Start simple, measure, then optimise. Premature optimisation costs more than the performance problems it prevents.
