Article 5 of 8
Real-World Case Study: Designing a Notification System
Walk through a production-grade notification system design end to end.
A notification system is deceptively complex. Four channels — push, email, SMS, in-app — each with different latency profiles, cost structures, failure modes, and regulatory requirements. User preferences as hard constraints from day one, not bolt-on features. Millions of events per hour flowing through queues that must be idempotent, rate-limited, and observable. This end-to-end walkthrough shows how every mental model from this pathway converges into a single production system.
A few years ago I inherited a notification system that was, charitably, a crime scene.
Every service in the platform had its own email-sending logic. Push notifications fired inline during API requests, adding 400ms to checkout flows. SMS ran on a cron job every five minutes, which meant users sometimes got their two-factor codes after they'd already given up and requested a new one.
The business wanted in-app notifications. Leadership wanted delivery analytics. The mobile team wanted rich push with actions. I was staring at notification logic scattered across fourteen microservices, thinking: we need to start over.
That redesign taught me more about system design than any textbook. Not because notifications are exotic — because they force you to confront every fundamental trade-off at once: sync vs async, bounded contexts, rate limiting, failure handling, observability. There's nowhere to hide.
This article walks through the thinking behind a production-grade notification system. Not a toy example — a real system that handles millions of notifications per hour across four channels.
Define the Problem Before Designing the Solution
Before drawing a single box, be precise about what you're building.
A notification system delivers the right message, to the right user, through the right channel, at the right time. Those four "rights" seem obvious until you enumerate the channels:
- Push notifications — iOS and Android, with platform-specific delivery (APNs, FCM), token expiry, and no delivery guarantee
- Email — transactional and marketing, with deliverability reputation, SPF/DKIM/DMARC concerns, and per-provider rate limits
- SMS — expensive, regulated, per-message cost, country-specific compliance, but the only channel that reaches users without internet
- In-app — stored notifications visible when the user opens the product, requiring persistence, read-state tracking, and notification grouping
Each channel has different latency expectations, cost profiles, failure modes, and regulatory requirements. The "right channel" decision is not trivial.
Non-functional requirements that drive the architecture:
- Handle burst traffic (a flash sale triggers millions of notifications simultaneously)
- Recover gracefully from downstream channel failures
- Deduplicate aggressively (nobody should receive the same notification twice)
- Give the operations team full visibility into what's happening at every stage
User Preferences: The Constraint You Can't Add Later
This is the part most engineers skip when whiteboarding, and it generates more customer support tickets than any technical failure.
Every user has preferences: which channels they've opted into, quiet hours, frequency caps, and category-level overrides. A user might want push for order updates but only email for marketing. Another might have disabled SMS entirely except for security alerts.
I model preferences as a layered system:
- Global defaults — platform-level settings for each notification type
- Category overrides — user opt-in/opt-out per category (Marketing, Order Updates, Security)
- Channel overrides — user opt-in/opt-out per channel
- Regulatory constraints — legal requirements that override everything (you must send security alerts; you must not send unsolicited marketing)
The preference service is its own bounded context — it owns its data store, exposes a simple API, and is cached aggressively because it's called on every single notification routing decision.
The read/write ratio here is extreme: preferences are read millions of times per day and written thousands of times. A Redis cache with a short TTL in front of a PostgreSQL store handles this naturally.
The architectural principle: preferences are not a feature. They're a hard constraint that the routing layer evaluates on every notification. If you bolt preferences on after the fact, you'll end up with a tangle of if-statements in every delivery path.
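As a rough illustration, the layered evaluation might be sketched as follows. The names here (`UserPrefs`, `allowed_channels`, the category strings, and the default channel sets) are assumptions made up for this example, not the original system's schema. Regulatory rules are checked first so that no user setting can override them.

```python
from dataclasses import dataclass, field

# Hypothetical platform defaults: channels enabled per category.
GLOBAL_DEFAULTS = {
    "security": {"push", "email", "sms", "in-app"},
    "order_updates": {"push", "email", "in-app"},
    "marketing": {"email"},
}

# Hypothetical regulatory rules.
MANDATORY_CATEGORIES = {"security"}          # must always be sent
CONSENT_REQUIRED_CATEGORIES = {"marketing"}  # never sent without explicit opt-in


@dataclass
class UserPrefs:
    category_opt_out: set = field(default_factory=set)
    category_opt_in: set = field(default_factory=set)
    channel_opt_out: set = field(default_factory=set)


def allowed_channels(category: str, prefs: UserPrefs) -> set:
    """Return the channels a notification in `category` may use for this user."""
    # Global defaults are the starting point.
    defaults = set(GLOBAL_DEFAULTS.get(category, set()))
    # Regulatory constraints override every user-level setting.
    if category in MANDATORY_CATEGORIES:
        return defaults
    if category in CONSENT_REQUIRED_CATEGORIES and category not in prefs.category_opt_in:
        return set()
    # Category override: the user opted out of the whole category.
    if category in prefs.category_opt_out:
        return set()
    # Channel override: remove channels the user has disabled.
    return defaults - prefs.channel_opt_out
```

With this shape, a user who has disabled SMS and push still receives security alerts on every default channel, while order updates fall back to email and in-app only.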
High-Level Architecture: Three Stages
The system has three main stages, connected by message queues:
Stage 1: Event Ingestion — Producing services publish domain events. The order service publishes "order_shipped". The security service publishes "suspicious_login". These are facts about what happened — not instructions to send a notification. The notification system decides what to do with them.
Stage 2: Routing and Orchestration — The brain of the system. Receives events, resolves the recipient, evaluates preferences, selects channels, renders templates, and enqueues delivery jobs. This is where all business logic lives.
Stage 3: Channel Delivery — Separate workers for each channel pull from their respective queues and handle the mechanics of delivery. Each worker knows its channel deeply: retry policies, rate limits, provider APIs, token management.
The critical design decision: Event ingestion and delivery are fully decoupled through message queues.
The service that publishes "order_shipped" doesn't know or care whether that results in a push notification, an email, both, or neither. This is the sync vs async mental model in action. The publisher fires and forgets. The notification system handles the rest asynchronously.
This decoupling is what makes the system resilient. If the email provider goes down, the order service doesn't slow down. Notifications queue up and deliver when the provider recovers.
Event-Driven Design: The Message Queue Backbone
The backbone is a message broker — Kafka for high-throughput systems, SQS or RabbitMQ for lower-volume ones. The principles apply regardless.
Flow:
- A producing service publishes a notification event to a shared topic (notification.requests)
- The router service consumes from this topic, enriches the event with user data and preferences, and produces delivery jobs to channel-specific topics (delivery.email, delivery.push, delivery.sms, delivery.in-app)
- Channel workers consume from their respective topics and handle delivery
Why separate topics per channel? Because each channel has wildly different throughput characteristics and failure rates. Email might process 10,000 messages per second. SMS is rate-limited by the provider to 200 per second. If they shared a queue, a backlogged SMS pipeline would block email delivery. Separation gives each channel independent scaling and failure isolation.
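The fan-out from one shared topic to per-channel topics can be sketched with in-memory queues. This is only a stand-in for a real broker: `topics`, `publish`, `route_one`, and the `channels_for` callable are hypothetical names for this example, and the message fields are assumed.

```python
from collections import defaultdict, deque

# In-memory stand-in for broker topics; a real system would use Kafka, SQS, or RabbitMQ.
topics = defaultdict(deque)


def publish(topic: str, message: dict) -> None:
    topics[topic].append(message)


def route_one(channels_for) -> int:
    """Consume one request and fan it out to per-channel delivery topics.

    `channels_for` is a hypothetical callable (event -> iterable of channel names)
    standing in for preference evaluation and channel selection.
    Returns the number of delivery jobs produced.
    """
    event = topics["notification.requests"].popleft()
    produced = 0
    for channel in channels_for(event):
        publish(f"delivery.{channel}", {
            "event_id": event["event_id"],
            "user_id": event["user_id"],
            "type": event["type"],
            "channel": channel,
        })
        produced += 1
    return produced
```

Because each channel has its own queue, a backlog in `delivery.sms` never sits in front of jobs waiting in `delivery.email`.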
Idempotency is non-negotiable here. Message brokers generally provide at-least-once delivery, so a message can arrive more than once: a consumer crashes after processing but before acknowledging, and the message is redelivered. Every delivery worker must handle this. I use event IDs combined with a Redis deduplication cache (TTL: 24 hours), so a redelivered event inside that window is recognised and skipped rather than sent again.
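A minimal sketch of that deduplication check, using an in-memory dictionary in place of Redis so it runs anywhere. `DedupCache` and `deliver_once` are hypothetical names; with Redis the same check would typically be a single `SET key value NX EX ttl` call.

```python
import time


class DedupCache:
    """In-memory stand-in for a Redis set-if-absent-with-expiry check."""

    def __init__(self, ttl_seconds: float = 24 * 3600, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._expires_at = {}

    def first_time(self, key: str) -> bool:
        """Return True and record the key if it has not been seen within the TTL."""
        now = self._clock()
        expiry = self._expires_at.get(key)
        if expiry is not None and expiry > now:
            return False
        self._expires_at[key] = now + self._ttl
        return True


def deliver_once(cache: DedupCache, event_id: str, channel: str, send) -> bool:
    """Call `send()` only for the first delivery of (event_id, channel)."""
    if not cache.first_time(f"{event_id}:{channel}"):
        return False
    send()
    return True
```

One design choice to be aware of: this sketch records the key before sending, so a crash between the two steps drops the notification instead of duplicating it. Recording after a successful send flips that trade-off.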
Template Engine and Versioned Personalization
Hardcoding notification copy in delivery workers is a trap. It works for the first five notification types and becomes unmaintainable at fifty.
A template registry — a service or structured data store — holds templates for each notification type and channel:
- Push: title + body + optional deep link (roughly 50 characters for the title and 120 for the body)
- Email: subject + HTML body + plain text fallback
- SMS: plain text of at most 160 characters (one GSM-7 segment)
- In-app: title + body + action URL + icon reference
Templates support variable interpolation: "Hi {{user.firstName}}, your order {{order.id}} has shipped!" The routing layer resolves these from the event payload and user profile before enqueueing delivery jobs.
One pattern I consider essential: template versioning. When marketing updates a template, you don't want in-flight notifications to silently change. Version your templates and pin each routing decision to a specific template version at enqueue time. This eliminates a whole class of bugs that are nearly impossible to debug after the fact.
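Version pinning and variable interpolation can be sketched like this. The registry layout, the `latest_version` and `render` helpers, and the second template's wording are assumptions for illustration. The idea is that the router calls `latest_version` once at enqueue time and stores the result on the delivery job.

```python
import re

# Hypothetical registry keyed by (notification type, channel, version).
TEMPLATES = {
    ("order_shipped", "sms", 1): "Hi {{user.firstName}}, your order {{order.id}} has shipped!",
    ("order_shipped", "sms", 2): "{{user.firstName}}, order {{order.id}} is on its way.",
}

_PLACEHOLDER = re.compile(r"\{\{\s*([\w.]+)\s*\}\}")


def latest_version(notification_type: str, channel: str) -> int:
    """Pick the newest version; the router would store this on the delivery job."""
    return max(v for (t, c, v) in TEMPLATES if (t, c) == (notification_type, channel))


def render(notification_type: str, channel: str, version: int, context: dict) -> str:
    """Render a specific, pinned template version against nested context data."""
    template = TEMPLATES[(notification_type, channel, version)]

    def lookup(match: re.Match) -> str:
        value = context
        for part in match.group(1).split("."):
            value = value[part]  # raises KeyError on a missing variable
        return str(value)

    return _PLACEHOLDER.sub(lookup, template)
```

A job enqueued while version 1 was current keeps rendering version 1, even if marketing publishes version 2 before the worker picks it up.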
Beyond personalisation of names and numbers: consider timezone-aware delivery (don't send marketing at 3am in the user's timezone), language preferences, and channel-specific formatting constraints (an email can have rich HTML, an SMS cannot).
Channel-Specific Delivery: What You Learn the Hard Way
Each channel has its own failure landscape.
Push notifications are fast and free, but unreliable. APNs and FCM don't guarantee delivery. Device tokens expire when users reinstall or switch devices. You need a token registry that stays current, and you must handle token invalidation responses — when a push provider tells you a token is invalid, immediately remove it. Push is best for time-sensitive, low-stakes notifications where missing one occasionally isn't catastrophic.
Email is reliable but slow. Deliverability is a discipline: SPF, DKIM, DMARC, sending reputation, bounce handling, complaint feedback loops. I strongly recommend using a reputable ESP (SendGrid, SES, Postmark) rather than running your own mail infrastructure. The operational overhead of protecting deliverability is enormous and does not improve your product. Email is the workhorse channel: receipts, order confirmations, weekly digests.
SMS is expensive and regulated. TCPA compliance in the US, 10DLC registration, mandatory opt-out handling. SMS costs real money per message, so your rate limiting and deduplication must be bulletproof. Use SMS only for high-importance, time-critical notifications: two-factor codes, critical security alerts, delivery notifications when the user is actively expecting a parcel.
In-app is the most reliable channel (you control the entire stack), but requires the user to open your product. The interesting complexity here is read-state tracking and notification grouping — nobody wants to see 47 separate "someone liked your post" notifications.
The trade-off matrix, stated plainly:
- Push: fast, cheap, unreliable
- Email: reliable, slow, cheap
- SMS: reliable, fast, expensive
- In-app: reliable, cheap, requires product open
Good routing logic uses these trade-offs dynamically, selecting the appropriate channel based on message urgency and user context.
Rate Limiting and Deduplication
Nothing destroys user trust faster than notification spam. You need multiple layers:
Per-user rate limits — cap total notifications per user per hour and per day. This catches upstream bugs. If a broken order service publishes 1,000 "order_updated" events, send one or two notifications, not a thousand.
Per-channel rate limits — respect provider limits and protect your sending reputation. Email volume too high from a single IP damages deliverability.
Category-level frequency caps — at most one marketing email per day, regardless of how many campaigns are running. Security alerts should never be rate-limited.
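A per-user cap can be sketched as a sliding-window counter. `PerUserRateLimiter` is a hypothetical name, the limits in the usage are illustrative, and a production version would keep its counters in a shared store rather than process memory. Security notifications bypass the cap, as described above.

```python
import time
from collections import defaultdict, deque


class PerUserRateLimiter:
    """Sliding-window cap on notifications per user."""

    def __init__(self, max_per_window: int, window_seconds: float,
                 clock=time.monotonic, exempt_categories=("security",)):
        self._max = max_per_window
        self._window = window_seconds
        self._clock = clock
        self._exempt = set(exempt_categories)
        self._sent_at = defaultdict(deque)  # user_id -> timestamps of allowed sends

    def allow(self, user_id: str, category: str) -> bool:
        if category in self._exempt:
            return True
        now = self._clock()
        history = self._sent_at[user_id]
        # Drop timestamps that have aged out of the window.
        while history and history[0] <= now - self._window:
            history.popleft()
        if len(history) >= self._max:
            return False
        history.append(now)
        return True
```

With a limit of two per hour, a burst of 1,000 "order_updated" events for one user results in two allowed notifications and 998 suppressed ones.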
Two-level deduplication:
- Event-level: If the same event arrives twice (upstream retry), detect it via event ID and drop the duplicate.
- Content-level: If two different events would produce the same notification to the same user within a short window, collapse them ("You have 3 new messages" instead of three separate notifications).
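Content-level collapsing might look like the sketch below. It assumes the caller has already buffered notifications over a short window, and the dictionary keys (`user_id`, `type`, `body`) and the naive pluralisation are assumptions for this example only.

```python
from collections import defaultdict


def collapse(pending: list) -> list:
    """Merge pending notifications that share (user_id, type) into one summary."""
    groups = defaultdict(list)
    for item in pending:
        groups[(item["user_id"], item["type"])].append(item)

    collapsed = []
    for (user_id, kind), items in groups.items():
        if len(items) == 1:
            # Nothing to merge: pass the notification through unchanged.
            collapsed.append(items[0])
        else:
            label = kind.replace("_", " ")
            collapsed.append({
                "user_id": user_id,
                "type": kind,
                "body": f"You have {len(items)} new {label}s",
            })
    return collapsed
```

Three buffered "message" notifications for the same user become one summary, while a single notification for another user passes through untouched.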
Apply rate limiting checks in the routing layer, before delivery jobs are enqueued. Dropping a message early is far cheaper than processing it through the complete pipeline only to discard it at delivery.
Failure Handling: Where Production Reality Meets Architecture
Things fail constantly in notification systems. Email providers have outages. Push token servers are intermittently slow. SMS providers reject messages for regulatory or rate-limit reasons.
Retry strategies must be channel-aware. A failed push notification gets 2 retries with short delays — if it doesn't succeed in the first minute, the moment has passed. A failed transactional email should retry aggressively over hours, because the user expects it. A failed SMS for a two-factor code should retry immediately but give up within 60 seconds — the user will request a new code.
Exponential backoff with jitter is the default retry policy. Without jitter, when a provider recovers from an outage, all queued retries hit simultaneously and potentially trigger another outage. Jitter spreads retries across a window.
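Channel-aware policies and jittered backoff can be combined in one small helper. The two-retry limit for push and the 60-second cut-offs come from the text above. Every other number, and the names `RetryPolicy`, `POLICIES`, and `next_delay`, are illustrative assumptions. The jitter variant shown is "full jitter": a uniform draw between zero and the capped exponential delay.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class RetryPolicy:
    max_retries: int   # retries allowed after the first failed attempt
    base_delay: float  # seconds, doubled on each retry
    max_delay: float   # cap on a single delay, seconds
    deadline: float    # give up once this much time has elapsed, seconds


# Illustrative values only.
POLICIES = {
    "push": RetryPolicy(max_retries=2, base_delay=5, max_delay=30, deadline=60),
    "sms_2fa": RetryPolicy(max_retries=5, base_delay=1, max_delay=10, deadline=60),
    "email_transactional": RetryPolicy(max_retries=12, base_delay=30,
                                       max_delay=3600, deadline=6 * 3600),
}


def next_delay(policy: RetryPolicy, retry_number: int, elapsed: float, rng=random):
    """Return seconds to wait before retry `retry_number` (1-based), or None to give up."""
    if retry_number > policy.max_retries or elapsed >= policy.deadline:
        return None
    ceiling = min(policy.max_delay, policy.base_delay * 2 ** (retry_number - 1))
    # Full jitter: spread retries uniformly so recovering providers are not stampeded.
    return rng.uniform(0, ceiling)
```

A `None` result is the signal to stop retrying and hand the message to whatever handles exhausted jobs.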
Dead letter queues are non-negotiable. When a message exhausts its retries, it goes to a DLQ rather than being silently dropped. DLQs let you investigate and replay failures manually, and they signal systematic problems — 50,000 messages in your email DLQ is an alert, not a silent data loss.
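The hand-off to a DLQ can be sketched as below, with an in-memory deque standing in for a real dead letter topic. `process_with_dlq` and `dead_letters` are hypothetical names, and delays between attempts are left out to keep the example short.

```python
from collections import deque

dead_letters = deque()  # stand-in for a real DLQ topic or queue


def process_with_dlq(job: dict, send, max_attempts: int = 3) -> bool:
    """Try `send(job)` up to `max_attempts` times; dead-letter the job on exhaustion."""
    last_error = None
    for _ in range(max_attempts):
        try:
            send(job)
            return True
        except Exception as exc:  # a real worker would catch provider-specific errors
            last_error = exc
    # Keep the job and the failure reason so it can be inspected and replayed.
    dead_letters.append({
        "job": job,
        "attempts": max_attempts,
        "error": repr(last_error),
    })
    return False
```

Storing the original job alongside the last error is what makes manual replay and root-cause analysis possible later.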
Circuit breakers protect your workers from wasting resources on providers that are clearly down. If the push provider has failed 90% of requests in the last 60 seconds, stop trying. Check again in 30 seconds. This prevents cascading failure where a slow provider ties up all worker threads.
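A simplified circuit breaker using the thresholds mentioned above (90% failures over 60 seconds, re-check after 30 seconds) might look like this. `CircuitBreaker` and the `min_requests` guard are assumptions for this sketch. It also skips a proper half-open state, simply letting traffic through again after the cooldown.

```python
import time
from collections import deque


class CircuitBreaker:
    """Open when the failure rate over a rolling window crosses a threshold."""

    def __init__(self, failure_threshold=0.9, window_seconds=60.0,
                 cooldown_seconds=30.0, min_requests=10, clock=time.monotonic):
        self._threshold = failure_threshold
        self._window = window_seconds
        self._cooldown = cooldown_seconds
        self._min_requests = min_requests  # avoid tripping on a tiny sample
        self._clock = clock
        self._results = deque()            # (timestamp, succeeded)
        self._opened_at = None

    def allow_request(self) -> bool:
        if self._opened_at is None:
            return True
        if self._clock() - self._opened_at >= self._cooldown:
            # Cooldown elapsed: close and let traffic probe the provider again.
            self._opened_at = None
            self._results.clear()
            return True
        return False

    def record(self, succeeded: bool) -> None:
        now = self._clock()
        self._results.append((now, succeeded))
        # Keep only results inside the rolling window.
        while self._results and self._results[0][0] <= now - self._window:
            self._results.popleft()
        failures = sum(1 for _, ok in self._results if not ok)
        total = len(self._results)
        if total >= self._min_requests and failures / total >= self._threshold:
            self._opened_at = now
```

Workers would call `allow_request()` before each provider call and `record()` after it, so a failing provider stops consuming worker threads within one window.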
Observability: The Question Nobody Plans For
Ask this question in every design review: "How will we know if this is working correctly in production?"
For a notification system, observability means tracking the full lifecycle of every notification:
Key metrics:
- Ingestion rate — sudden spikes may indicate an upstream bug
- Routing decisions — how many notifications suppressed by preferences? By rate limits? By deduplication?
- Delivery latency per channel — seconds for push/SMS, minutes for email, seconds for in-app
- Delivery success rate per channel and per provider
- DLQ depth — should be near zero in steady state
- Consumer lag — are workers keeping up with incoming volume?
Structured logging on every notification with a correlation ID lets you trace a single notification from ingestion through routing to delivery. When a customer says "I never got my receipt email," you can pull up the full trace in seconds and identify exactly where it diverged.
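A minimal sketch of correlation-ID logging with the standard library. The helper names (`new_correlation_id`, `log_stage`), the logger name, and the field names are assumptions. The point is that every stage emits one machine-parseable line carrying the same ID.

```python
import json
import logging
import uuid

logger = logging.getLogger("notifications")


def new_correlation_id() -> str:
    """Generate an ID at ingestion; it travels with the notification from then on."""
    return uuid.uuid4().hex


def log_stage(correlation_id: str, stage: str, **fields) -> str:
    """Emit one JSON log line for a pipeline stage and return it."""
    line = json.dumps(
        {"correlation_id": correlation_id, "stage": stage, **fields},
        sort_keys=True,
    )
    logger.info(line)
    return line
```

Calling `log_stage(cid, "ingested", ...)`, then `log_stage(cid, "routed", ...)`, then `log_stage(cid, "delivered", ...)` gives a trace that can be reassembled by filtering the logs on a single `correlation_id`.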
Without this observability, you discover problems when customers complain — which is always too late.
Tying the Mental Models Together
Look back at the mental models from the earlier articles in this pathway and notice how every one showed up:
State vs Behaviour — user preferences are state, owned by the preference service. Notification routing is behaviour, stateless and independently scalable.
Sync vs Async — the entire system is asynchronous by design. No producing service waits for a notification to be delivered.
Read/Write Split — the preference service is read-heavy and cached accordingly. The in-app notification store is write-heavy and needs write-optimised data modelling.
Bounded Context — the notification system owns its own data models. It receives events and manages its own state. It never reads tables from the order service or user service.
Failure as First-Class Citizen — retries, DLQs, circuit breakers, and deduplication are not afterthoughts. They're core to the architecture because we assumed failure from the start.
This is what system design looks like in practice. Not abstract principles memorised for an interview — concrete decisions made under real constraints, guided by frameworks that have earned their place through production experience.
Key Takeaways
- Decouple ingestion from delivery with message queues. This is the most important architectural decision in the system. It makes everything else — resilience, scalability, independent deployment — possible.
- Treat user preferences as a hard constraint, not a feature. Build them into the routing layer from day one, or you'll be untangling spaghetti later.
- Each channel is a different world. Push, email, SMS, and in-app have fundamentally different profiles. Don't abstract them into one generic "send" interface.
- Rate limiting and deduplication protect user trust. Notification spam is one of the fastest ways to lose users. Multiple protection layers are not over-engineering.
- Retry strategies must be channel-aware and time-sensitive. A two-factor SMS and a marketing email have completely different urgency profiles. One-size-fits-all retry logic is lazy and ineffective.
- Dead letter queues are non-negotiable. Never let messages fail silently. DLQs give you recovery options and operational visibility.
- Observability is what separates production systems from prototypes. If you cannot trace a single notification through the entire pipeline, you do not have a production-ready system.