Event-Driven Architecture Without the Hype: When Queues Help and When They Hurt

Events are a powerful tool and a terrible default. The three legitimate reasons to go event-driven, the anti-pattern that wrecks most implementations (events for request/response), the non-negotiables (idempotent consumers, dead-letter queues, versioned schemas), and choreography vs orchestration — with a decision rule so you reach for events when they earn their keep, not because a talk said microservices need a bus.

May 28, 2026 · 13 min read

Key Takeaway

Events are a powerful tool and a terrible default. They buy you loose coupling, buffering, and an audit trail — and they charge you in observability, debugging, and an entire class of bugs (duplicate delivery, ordering, eventual consistency) that synchronous calls never had. This is the practitioner's guide: the three legitimate reasons to go event-driven, the one anti-pattern that wrecks most implementations (using events for request/response), the non-negotiables (idempotent consumers, dead-letter queues, explicit schemas), and choreography vs orchestration. Plus a decision rule so you reach for events when they earn their keep — not because a conference talk said microservices need a message bus.

May 2024. A team I was advising had a bug they'd been chasing for three weeks: occasionally, a customer got charged twice. Not often — maybe one in two thousand orders. Enough to be a real problem, rare enough to be unreproducible on demand.

The architecture was event-driven and, on paper, beautiful. OrderPlaced went onto a queue, a payment consumer picked it up, charged the card, emitted PaymentCompleted. Clean. Decoupled. Scalable.

The bug was a missing idempotency check. Their message broker, like most production brokers in their usual durable configuration, guaranteed at-least-once delivery, not exactly-once. Under certain network conditions, the payment consumer received the same OrderPlaced event twice and charged the card twice. Nobody had written a duplicate check because the synchronous version of this code, which had existed a year earlier, never needed one: one HTTP request, one charge.

They hadn't chosen event-driven architecture to solve a problem. They'd adopted it because it was the "scalable" thing to do, and inherited a whole category of failure modes they didn't know they'd signed up for. The double-charge was the tax, arriving late.

Events are genuinely powerful. They're also one of the most over-applied patterns in our field, reached for as a default when a function call would do. This is how to tell the difference — and how to do events right when you actually need them.

What event-driven actually means

Components communicate by producing and consuming events (facts about something that happened) through a broker — Kafka, RabbitMQ, SQS/SNS, NATS — instead of calling each other directly. The producer doesn't know who consumes; the consumer doesn't know who produced.

The defining property is inversion of dependency: in a synchronous world, the order service has to know about and call inventory, email, analytics, and fraud. In the event world, it announces one fact and walks away. New consumers can be added — a loyalty-points service next quarter — without touching the order service at all.

That property is the whole value proposition. Everything good about event-driven flows from it, and everything hard about it is the price of it.
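A minimal in-memory sketch of that inversion. The `EventBus` class and the handler lambdas are hypothetical stand-ins for a real broker and real services, and unlike a real broker this toy is synchronous and keeps nothing on disk. The point is only that `place_order` never names its consumers.

```python
from collections import defaultdict
from typing import Callable

Event = dict
Handler = Callable[[Event], None]


class EventBus:
    """Toy in-process stand-in for a broker (hypothetical, illustration only)."""

    def __init__(self) -> None:
        self._handlers: dict[str, list[Handler]] = defaultdict(list)

    def subscribe(self, event_type: str, handler: Handler) -> None:
        self._handlers[event_type].append(handler)

    def publish(self, event_type: str, event: Event) -> None:
        # The producer never sees this list; it only announces the fact.
        for handler in self._handlers[event_type]:
            handler(event)


def place_order(bus: EventBus, order_id: str) -> None:
    # The order service knows about the bus, not about its consumers.
    bus.publish("OrderPlaced", {"order_id": order_id})


bus = EventBus()
reactions: list[str] = []
bus.subscribe("OrderPlaced", lambda e: reactions.append(f"reserve stock: {e['order_id']}"))
bus.subscribe("OrderPlaced", lambda e: reactions.append(f"send email: {e['order_id']}"))

# Next quarter: a new consumer arrives and place_order is not edited.
bus.subscribe("OrderPlaced", lambda e: reactions.append(f"loyalty points: {e['order_id']}"))

place_order(bus, "o-1")
```

After the call, `reactions` holds one entry per subscriber, in subscription order.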

The three legitimate reasons to go event-driven

Reach for events when one of these is genuinely true. If none is, you probably want a function call or a synchronous API.

1. Genuinely independent reactions to the same fact. "An order was placed" legitimately triggers several independent things: reserve stock, send a confirmation email, update analytics, run a fraud check. They don't depend on each other, none needs to block the user, and you want to add more over time. This is the textbook fit — fan-out to independent consumers.

2. Temporal decoupling / load buffering. The producer and consumer run at different rates or different times. A spike of 10,000 orders in a flash sale shouldn't topple the payment processor; the queue absorbs the spike and the consumer drains it at a sustainable pace. The broker becomes a shock absorber.

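3. An audit trail / replayability is valuable. An append-only event log is a record of everything that happened, in order (in most brokers, order is guaranteed per partition or stream rather than globally). You can replay it to rebuild state, onboard a new consumer that processes history, or debug "what actually happened to this order." Replay assumes a broker that retains events after consumption, such as Kafka; a plain work queue that deletes messages on acknowledgement gives you buffering but not history. When the log of facts is itself an asset, events shine (this is the doorway to event sourcing, a heavier commitment).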
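Reasons 2 and 3 can be made concrete with a small simulation. This is a sketch under stated assumptions: the drain rate of 500 per tick, the `OrderCancelled` event type, and the sample log are invented for illustration, and a `deque` and a list stand in for a real queue and a real log.

```python
from collections import deque


def drain(burst_size: int, per_tick_capacity: int) -> list[int]:
    """Reason 2: return the queue depth after each tick until the backlog is gone."""
    if per_tick_capacity <= 0:
        raise ValueError("per_tick_capacity must be positive")
    backlog = deque(range(burst_size))  # the whole spike lands at once
    depths: list[int] = []
    while backlog:
        # The consumer takes only what it can sustainably process per tick.
        for _ in range(min(per_tick_capacity, len(backlog))):
            backlog.popleft()
        depths.append(len(backlog))
    return depths


TRANSITIONS = {
    "OrderPlaced": "placed",
    "PaymentCompleted": "paid",
    "OrderCancelled": "cancelled",  # hypothetical event type, not from the article
}


def rebuild_order_status(log: list[dict]) -> dict[str, str]:
    """Reason 3: fold the log from the start to derive each order's current status."""
    status: dict[str, str] = {}
    for event in log:
        status[event["order_id"]] = TRANSITIONS[event["type"]]
    return status


def history_of(log: list[dict], order_id: str) -> list[str]:
    """Answer 'what actually happened to this order?' straight from the log."""
    return [e["type"] for e in log if e["order_id"] == order_id]


depths = drain(burst_size=10_000, per_tick_capacity=500)

event_log = [
    {"type": "OrderPlaced", "order_id": "o-1"},
    {"type": "OrderPlaced", "order_id": "o-2"},
    {"type": "OrderCancelled", "order_id": "o-2"},
    {"type": "PaymentCompleted", "order_id": "o-1"},
]
current = rebuild_order_status(event_log)
```

The buffer never drops an order, but the last one waits 20 ticks: the queue trades latency for stability, which is only acceptable when nobody is blocked on the result.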

Notice what's not on this list: "we have microservices" and "it's more scalable." Microservices can communicate synchronously and often should. Scalability is a property you measure and design for, not a synonym for "put a queue in front of it."

The anti-pattern that wrecks most implementations

Here's the single biggest mistake, and it shares a root cause with the double-charge above (adopting events without asking what the flow actually needs): using events for request/response flows that need an answer now.

If Service A sends a message and then waits for a reply to continue, you haven't built an event-driven system — you've built a slow, fragile RPC with extra moving parts: correlation IDs, reply queues, timeouts, and no clean error path. You took a problem that an HTTP call solves in 20ms with a clear 500-on-failure, and turned it into a distributed saga with mystery latency.

The rule: if the caller needs the result to proceed, make a synchronous call. Use events for "this happened, react if you care," not for "do this and tell me the answer." The tell that you've crossed the line: a consumer that publishes a reply the original producer is blocking on.
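To see what the anti-pattern costs, here is the same question asked both ways. This is an illustrative sketch: `price_lookup`, `ask_via_queue`, and the in-process `queue.Queue` objects are hypothetical stand-ins for a downstream service and a broker's request and reply queues.

```python
import queue
import threading
import uuid


def price_lookup(sku: str) -> int:
    """The actual work: a plain function (an HTTP call in a real system)."""
    return {"widget": 20}[sku]


# Option A: direct call. One line, and failures surface as exceptions.
direct_result = price_lookup("widget")

# Option B: the same question routed through queues (the anti-pattern).
requests: queue.Queue = queue.Queue()
replies: queue.Queue = queue.Queue()


def worker() -> None:
    """Handle exactly one request, then exit (enough for this demo)."""
    msg = requests.get()
    # Every reply must carry the correlation ID so the caller can match it.
    replies.put(
        {"correlation_id": msg["correlation_id"], "price": price_lookup(msg["sku"])}
    )


def ask_via_queue(sku: str, timeout_s: float = 2.0) -> int:
    correlation_id = str(uuid.uuid4())
    requests.put({"correlation_id": correlation_id, "sku": sku})
    # The caller blocks anyway, so nothing was decoupled.
    reply = replies.get(timeout=timeout_s)  # raises queue.Empty if no reply arrives
    if reply["correlation_id"] != correlation_id:
        raise RuntimeError("reply belongs to a different request")
    return reply["price"]


threading.Thread(target=worker, daemon=True).start()
queued_result = ask_via_queue("widget")
```

Both paths return the same price. The queued path needed a correlation ID, a reply queue, and a timeout to get there, and if the worker crashes the caller learns nothing except that time ran out.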

The non-negotiables

Once you've decided events are the right call, three things are not optional. Every one of them maps to a class of production incident I've seen.

1. Every consumer must be idempotent

Most brokers deliver at-least-once by default. Networks hiccup, consumers crash after processing but before acknowledging, redeliveries happen. Your consumer will see the same event more than once. Processing it twice must be safe.

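The standard fix is an idempotency key: every event carries a unique ID; the consumer records processed IDs and skips duplicates. Critically, record the ID in the same transaction as the side effect; otherwise a crash in the gap between the two breaks the guarantee. When the side effect is an external call that cannot join your database transaction, such as a card charge, pass the event ID to the provider as its idempotency key (where the provider supports one) so a retried call is deduplicated on their side. For the double-charge team, this one pattern closed the bug permanently.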
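A minimal sketch of the idempotency-key pattern, using SQLite from the Python standard library. The table names, the event shape, and `handle_order_placed` are assumptions made for illustration, and the "charge" is modelled as a local row so that it can share a transaction with the processed-ID record.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE processed_events (event_id TEXT PRIMARY KEY)")
conn.execute("CREATE TABLE charges (order_id TEXT, amount_cents INTEGER)")
conn.commit()


def handle_order_placed(conn: sqlite3.Connection, event: dict) -> bool:
    """Apply the event once. Returns True if applied, False if it was a duplicate."""
    try:
        # `with conn` commits on success and rolls back if an exception escapes.
        with conn:
            # The PRIMARY KEY makes a second insert of the same ID fail.
            conn.execute(
                "INSERT INTO processed_events (event_id) VALUES (?)",
                (event["event_id"],),
            )
            # The side effect is written in the SAME transaction as the ID,
            # so a crash cannot leave one recorded without the other.
            conn.execute(
                "INSERT INTO charges (order_id, amount_cents) VALUES (?, ?)",
                (event["order_id"], event["amount_cents"]),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # already processed: safe to acknowledge and move on


event = {"event_id": "evt-123", "order_id": "o-1", "amount_cents": 4000}
first = handle_order_placed(conn, event)
second = handle_order_placed(conn, event)  # simulated redelivery
charge_count = conn.execute("SELECT COUNT(*) FROM charges").fetchone()[0]
```

The redelivered event is rejected by the primary-key constraint, so exactly one charge row exists after both calls.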

2. A dead-letter queue (DLQ)

Some events will fail to process no matter how many times you retry — malformed payload, a bug, a downstream that's permanently rejecting. Without a DLQ, these either get dropped silently (data loss) or retry forever (a poison message that clogs the queue and burns money).

A DLQ is where events go after N failed attempts: somewhere visible, with alerting, so a human can inspect and decide. The DLQ is your safety net and your debugging surface. A queue without a DLQ is an outage waiting to happen.
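The retry-then-park behaviour can be sketched in a few lines. Real brokers usually provide this through configuration rather than application code (for example, a redrive policy in SQS or a dead-letter exchange in RabbitMQ); the `consume` function, the limit of three attempts, and the sample messages below are assumptions for illustration, and alerting is left out.

```python
from collections import deque
from typing import Callable

MAX_ATTEMPTS = 3  # the "N" in "after N failed attempts"; the value is an assumption


def consume(
    messages: list[dict],
    handler: Callable[[dict], None],
    max_attempts: int = MAX_ATTEMPTS,
) -> tuple[list[dict], list[dict]]:
    """Process messages; return (processed, dead_letters)."""
    work = deque({"body": m, "attempts": 0} for m in messages)
    processed: list[dict] = []
    dead_letters: list[dict] = []
    while work:
        item = work.popleft()
        item["attempts"] += 1
        try:
            handler(item["body"])
            processed.append(item["body"])
        except Exception as exc:
            if item["attempts"] >= max_attempts:
                # Park it with context so a human can inspect and decide.
                dead_letters.append(
                    {"body": item["body"], "attempts": item["attempts"], "error": repr(exc)}
                )
            else:
                work.append(item)  # retry later instead of blocking the queue head
    return processed, dead_letters


def handle(msg: dict) -> None:
    if "order_id" not in msg:
        raise ValueError("malformed payload: missing order_id")


processed, dead_letters = consume(
    [{"order_id": "o-1"}, {"oops": True}, {"order_id": "o-2"}], handle
)
```

The malformed message is retried three times and then parked, while the two healthy messages on either side of it are processed normally.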

3. Explicit, versioned schemas

Events are a contract between teams who, by design, don't talk to each other directly. The moment a producer renames or removes a field, changes its type, or adds one that consumers must have to parse the event, every consumer can break: silently, asynchronously, discovered hours later. Treat event schemas like API contracts: a schema registry, backward-compatible evolution (add optional fields, never repurpose existing ones), and explicit versions. "It's just JSON on a queue" is how you get a 2am incident from a field rename.
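On the consumer side, backward-compatible evolution looks roughly like the tolerant reader below. The `schema_version` field, the `major.minor` convention, and the `coupon_code` and `gift_wrap` fields are assumptions for illustration; a schema registry would enforce the same rules at publish time rather than at read time.

```python
from dataclasses import dataclass
from typing import Optional

SUPPORTED_MAJOR_VERSIONS = {1}


@dataclass
class OrderPlaced:
    order_id: str
    amount_cents: int
    coupon_code: Optional[str] = None  # added later as OPTIONAL, so old events still parse


def parse_order_placed(payload: dict) -> OrderPlaced:
    """Require what we need, default what is optional, ignore unknown fields."""
    # Assumption: events without a version field are treated as version 1.
    version = str(payload.get("schema_version", "1"))
    major = int(version.split(".")[0])
    if major not in SUPPORTED_MAJOR_VERSIONS:
        # Fail loudly (and let the DLQ catch it) rather than guess at a breaking change.
        raise ValueError(f"unsupported OrderPlaced schema version: {version}")
    return OrderPlaced(
        order_id=payload["order_id"],
        amount_cents=payload["amount_cents"],
        coupon_code=payload.get("coupon_code"),
    )


old_event = {"schema_version": "1.0", "order_id": "o-1", "amount_cents": 4000}
new_event = {
    "schema_version": "1.1",
    "order_id": "o-2",
    "amount_cents": 1500,
    "coupon_code": "SPRING",
    "gift_wrap": True,  # unknown to this consumer, so it is ignored
}

parsed_old = parse_order_placed(old_event)
parsed_new = parse_order_placed(new_event)
```

A minor version bump adds optional fields that old consumers can ignore; a major version bump signals a breaking change, and this consumer refuses it instead of misreading it.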

Choreography vs orchestration

When a business process spans multiple services, you have two ways to coordinate it, and choosing wrong is a common source of pain.

Choreography — each service listens for events and reacts, no central brain. Maximally decoupled, but the overall process exists nowhere; to understand "how does an order get fulfilled," you read seven services and infer. Great for simple, stable flows; painful when the process is complex or changes often.

Orchestration — a coordinator (a saga orchestrator) explicitly directs the steps and handles failures/compensation. The process lives in one readable place at the cost of a central component. Better for complex, multi-step business workflows where you need to reason about — and recover from — partial failure.

My rule of thumb: simple fan-out → choreography. A multi-step transaction that needs compensation when step 3 of 5 fails (refund the payment if shipping can't fulfill) → orchestration with an explicit saga. Don't choreograph a complex distributed transaction; you'll never be able to reason about its failure modes.
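The orchestration side of that rule can be sketched as a tiny saga runner. This is a simplified illustration, not a production orchestrator: `run_saga`, the step names, and the no-op actions are hypothetical, and a real implementation would persist its progress and handle compensations that themselves fail.

```python
from typing import Callable

# (name, action, compensation)
Step = tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: list[Step], log: list[str]) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed: list[Step] = []
    for step in steps:
        name, action, _ = step
        try:
            action()
            log.append(f"done: {name}")
            completed.append(step)
        except Exception as exc:
            log.append(f"failed: {name} ({exc})")
            # Undo the most recent successful step first.
            for done_name, _, compensate in reversed(completed):
                compensate()
                log.append(f"compensated: {done_name}")
            return False
    return True


def ship_order() -> None:
    raise RuntimeError("shipping cannot fulfill")


def noop() -> None:
    pass


log: list[str] = []
ok = run_saga(
    [
        ("reserve stock", noop, noop),   # compensation would release the stock
        ("charge payment", noop, noop),  # compensation would refund the payment
        ("ship order", ship_order, noop),
    ],
    log,
)
```

The whole process, including what happens when shipping fails, is readable in one place. That is what choreography cannot give you.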

The decision: should this be an event?

The bias, as always, is toward simplicity. Every event you introduce is a thing that can be duplicated, lost, reordered, or schema-drifted. Introduce it when the decoupling, buffering, or audit value is real — and when you do, pay the full price (idempotency, DLQ, schemas), not the demo price.
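The rules stated so far can be collected into one small function. This is my encoding of the article's prose, not a formal rule set: the `Flow` fields and the wording of the returned advice are assumptions, and the inputs still require human judgment about whether each reason is genuinely true.

```python
from dataclasses import dataclass


@dataclass
class Flow:
    caller_needs_result_to_proceed: bool
    independent_reactions: bool = False       # reason 1: fan-out to independent consumers
    needs_load_buffering: bool = False        # reason 2: temporal decoupling
    replayable_log_is_valuable: bool = False  # reason 3: audit trail / replay


def should_be_event(flow: Flow) -> tuple[bool, str]:
    """Return (decision, explanation) for a candidate flow."""
    # Request/response always wins: a caller that must wait needs a direct call.
    if flow.caller_needs_result_to_proceed:
        return False, "caller needs the answer now: make a synchronous call"
    reasons = [
        name
        for name, holds in [
            ("independent reactions", flow.independent_reactions),
            ("load buffering", flow.needs_load_buffering),
            ("replayable audit trail", flow.replayable_log_is_valuable),
        ]
        if holds
    ]
    if not reasons:
        return False, "no legitimate reason applies: prefer a function call or synchronous API"
    return True, (
        "use an event (" + ", ".join(reasons) + "); "
        "budget for idempotency, a DLQ, and versioned schemas"
    )


order_fan_out = should_be_event(
    Flow(caller_needs_result_to_proceed=False, independent_reactions=True)
)
price_check = should_be_event(
    Flow(caller_needs_result_to_proceed=True, needs_load_buffering=True)
)
```

Note the ordering of the checks: a flow that needs its answer now stays synchronous even when one of the three reasons also applies.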

What to do Monday morning

Find one event flow in your system that's secretly request/response. Look for a consumer that publishes a reply the producer waits on, or a "queue" wrapped in a synchronous-feeling API with timeouts. That's a candidate to simplify back to a direct call.
Audit every consumer for idempotency. For each, ask: "if this exact event arrives twice, what happens?" If the answer isn't "nothing bad," you have a latent double-charge waiting. Add idempotency keys.
Confirm every queue has a DLQ with alerting. If a poison message arrives at 3am, does it page someone or silently clog the pipe? You want the former.
Check your schema discipline. Can a producer add a required field and break consumers without anyone noticing until production? If yes, introduce a schema registry and backward-compatible-only evolution.

The runnable reference for an idempotent consumer with a DLQ lives in the /event-driven folder of the architecture catalog repo.

Key takeaways

Events are a powerful tool and a terrible default. They buy loose coupling, buffering, and an audit trail; they charge you in observability and a whole class of distributed-systems bugs. Reach for them deliberately.
Three legitimate reasons: genuinely independent reactions to a fact, temporal decoupling / load buffering, and valuable replayable audit trails. "We have microservices" and "it's scalable" are not on the list.
Never use events for request/response. If the caller needs the answer to proceed, make a synchronous call. Events are "this happened, react if you care," not "do this and reply."
The non-negotiables are non-negotiable: idempotent consumers (at-least-once delivery is real), a dead-letter queue with alerting, and explicit versioned schemas. Each maps directly to a production incident.
Choreography for simple fan-out, orchestration for complex transactions. Don't choreograph a multi-step distributed transaction you need to reason about and compensate — use an explicit saga.

Your next step

Pick your most important event consumer and write down what happens if it receives the same event twice. If you can't answer with confidence, you've found the same gap that double-charged real customers in the story above. Close it today — it's a few lines of idempotency-key logic, and it's the difference between a system that's resilient to the network and one that's quietly corrupting data one-in-two-thousand times.

Events let your services stop knowing about each other. That's the gift and the bill. Take it when the gift is worth the bill.

Frequently asked questions

When should I use event-driven architecture instead of direct API calls?

Use events when reactions to a fact are genuinely independent (an order placed → reserve stock, send email, update analytics), when you need to decouple producer and consumer in time to buffer load spikes, or when a replayable audit log of events is itself valuable. Use direct synchronous calls when the caller needs the result to proceed. Having microservices or wanting "scalability" are not, by themselves, reasons to introduce a message broker — events add real cost in debugging and failure modes that you should only pay when the decoupling earns it.

Why do messages get delivered more than once, and how do I handle it?

Most production message brokers, as typically configured, guarantee at-least-once delivery: network failures, consumer crashes between processing and acknowledging, and rebalancing all cause redelivery. You handle it by making every consumer idempotent: each event carries a unique ID, and the consumer records processed IDs and skips duplicates, ideally recording the ID in the same transaction as the side effect so a crash in between can't break the guarantee. Without this, duplicate delivery causes bugs like double-charging a customer.

What is a dead-letter queue and do I need one?

A dead-letter queue (DLQ) is where events go after failing to process a configured number of times — for example, a malformed payload or a permanently failing downstream. You need one: without it, failed events are either dropped silently (data loss) or retried forever (a poison message that clogs the queue and runs up cost). A DLQ should be monitored and alerted so a human can inspect failures and decide what to do. It is both your safety net and your primary debugging surface for async flows.

What's the difference between choreography and orchestration?

In choreography, each service independently listens for events and reacts, with no central coordinator — maximally decoupled, but the overall business process isn't written down anywhere and must be inferred across services. In orchestration, a central coordinator (often a saga orchestrator) explicitly directs the steps and handles failure and compensation. Use choreography for simple fan-out flows and orchestration for complex multi-step transactions that need coordinated failure handling, such as refunding a payment when shipping later fails.

Is event-driven architecture the same as microservices?

No. Microservices is a way of decomposing a system into independently deployable services; event-driven architecture is a communication style where components interact via events through a broker. Microservices can communicate synchronously (and frequently should), and a monolith can use events internally. Conflating the two leads teams to add a message bus reflexively when adopting microservices, inheriting duplicate-delivery, ordering, and eventual-consistency problems they didn't need to take on.

#software-architecture #event-driven #message-queue #kafka #idempotency #distributed-systems #saga #system-design #dead-letter-queue

Ruchit Suthar

15+ years scaling teams from startup to enterprise. 1,000+ technical interviews, 25+ engineers led. Real patterns, zero theory.
