
I Over-Engineered a SaaS for Millions. It Got 3 Users.

I built a SaaS with multi-tenancy, event-driven architecture, and elaborate domain abstractions — for millions of users that never arrived. The product now serves two or three internal people in the same building. This is the architecture post-mortem, and the operating patterns that would have changed the outcome.

Ruchit Suthar


Software architect at a Fortune 500 R&D/AI group. I've over-engineered things I should have kept simple, and shipped things at scale that had no business working. Real patterns, real trade-offs.

Key Takeaway

Over-engineering is a confession that you didn't understand the problem, and "build for scale" is premature optimization with a respectable job title.


The diagram looked right. I still remember sitting with it — tenant isolation boundaries drawn cleanly, a domain model that could extend in any direction, an event-driven backbone that would handle whatever traffic came. It had the shape of something serious. Something that could absorb growth without flinching.

I was designing for the inflection point — the moment customer number two would appear and I would not have to retrofit multi-tenancy under live load. That's engineering discipline, I told myself. You can't bolt on isolation boundaries after the fact. The code was clean. The abstractions were coherent. The infrastructure was ready for a load I had every reason to expect would arrive.

It never arrived. The product never shipped to market. Today it quietly serves two, maybe three internal users in the same building where I wrote it. All that infrastructure, running for the three of us.

That gap between the architecture I built and the constraint I actually had — that gap is the whole lesson.


What I Actually Built (And Why It Made Complete Sense at the Time)

Multi-tenancy was the obvious first decision. Any SaaS architect will tell you the same thing: tenant isolation is not something you retrofit cleanly. You design for it upfront or you pay for it later at the worst possible moment. The reasoning is sound. The pattern is correct. The error wasn't that I applied the pattern — it was that I applied it to a product with zero customers.

The domain model followed the same logic. I was building in a domain with real complexity: different user types, different permission models, different data shapes depending on the tenant. A flat model wouldn't survive the first real customer's requirements. So I abstracted. I built an entity hierarchy that could extend cleanly in any direction. Domain-Driven Design tells you to model the complexity of the domain, not to simplify it away. The domain was genuinely complex, so the model reflected that complexity faithfully.

The event-driven design came next. Decoupled services let you scale individual components without dragging the whole system along. I knew from my time on a legacy travel product running 1,500 requests per minute that coupling was the thing that killed you at scale: one slow downstream integration dragged everything down with it. So I designed out the coupling early. Clean event contracts. Isolated consumers.

Every single decision was technically defensible. That's what makes this particular failure hard to explain. This was not sloppy engineering. The architecture would have handled significant scale competently. The problem was that it was the wrong architecture for zero scale, not bad architecture for the imagined scale.

This is the competence trap. The better you are at applying patterns, the more fluently you apply them to problems that don't require them yet. The pattern library becomes instinct. Multi-tenant, event-driven, bounded contexts — these trigger automatically when the domain is complex enough. The question of whether the domain is complex at this stage, for this many users, right now — that question gets skipped because the answer seems obvious. Of course it will be complex. It's a SaaS product.


When Did It Become Clear This Was Wrong?

There was a specific moment. Not a gradual dawning — a moment.

I was reviewing the deployment pipeline configuration — setting up environment segregation for the tenant isolation layer — and I stopped. Not because something broke, but because I was solving a problem with extraordinary care that did not exist. I was building a tenant isolation pipeline for a product that had not yet signed a single user. The precision of the engineering effort and the absence of any actual user to protect were suddenly, visibly, absurd.

The emotion wasn't shame. It was a specific, flat recognition: I had been pointing effort at a problem statement that I had not validated. Months of engineering work — the multi-tenancy model, the domain abstractions, the event-driven architecture — all of it was a response to a problem I had imagined rather than observed.

What bothered me more than the moment itself was the delay. The recognition came months in. The signals were there earlier — the product timeline kept slipping, the architecture kept getting more elaborate, and the user validation work kept getting deferred because "we need to get the architecture right first." That sequencing, architecture before users, is the tell. I knew it was a tell. I had known for a while before I said it.

The cost was real. Months of engineering time that could have gone into validating whether anyone wanted the product. A window. The opportunity cost of everything I was not building or learning during that period. The product didn't fail because the architecture was bad. It failed to launch because the architecture absorbed the energy that should have gone to users.


Over-Engineering Is a Confession You Didn't Understand the Problem

Every abstraction you add for imagined load is a bet on a problem statement you haven't validated.

The architecture I built was a confident, well-executed bet that users would arrive, that multi-tenant load would be real, that the domain complexity would justify its own model. None of those bets had evidence behind them. The bet wasn't wrong the way a bug is wrong — it was wrong the way a business assumption is wrong. And architects don't usually frame their technical decisions as business assumptions, which is part of why this failure mode persists.

The engineer's trained reflex is toward generality. "Do it right" means building the thing that survives the next 10x, the next requirement change, the next scaling event. The engineer who doesn't think about this is the one who gets paged at 3am because a critical code path was coupled in a way that made the simple fix impossible. Generality is a genuine virtue. The problem is that it gets applied without a trigger condition. You build for the next 10x not because you have evidence the 10x is coming — but because you can, and because the alternative feels lazy.

Nobody calls this premature optimization when an architect does it. When a junior developer adds complexity for a case that isn't in scope, it gets caught in review. When an architect designs an event-driven multi-tenant system for a product with no users, it's called robust design. The vocabulary protects the decision from scrutiny. "Scalable," "extensible," "future-proof" — these are the words that make premature optimization sound like engineering discipline.

This is not an argument for sloppy code. The code can be clean and still be appropriately scoped. The antidote to over-engineering is not cowboy code with no structure — it is discipline about what problem you are actually solving today. Simplicity, real simplicity, is harder to achieve than complexity. Adding a tenant isolation layer is straightforward. Designing a system that is genuinely simple for its current constraint and genuinely replaceable when the constraint changes — that takes more judgment, not less.

The contrast makes the point plainly:

The right column would have taken a fraction of the time. The deferred nodes carry labels — specific conditions that would justify building them. That's the discipline. Not "never build it," but "build it when the trigger arrives, not before."


What Should the Right Architecture Have Looked Like?

Constraint-first. Before the first box is drawn, write down the actual constraint you have evidence for.

Not the constraint you expect. Not the constraint the business case projects. The constraint you have confirmed, right now, in the form of real users, measured load, or signed commitments. Everything else is a projection, and projections compound their error through every downstream architectural decision.

For this product, at the time I started building, the constraint was: one internal team, no external customers, no signed contracts, no traffic data. That's the constraint. The architecture that serves that constraint is dramatically simpler than what I built.

The "10x not 1000x" rule is the governing principle. If you have evidence pointing toward 10x your current load, build for it — that's not premature, that's foresight based on signal. If you are building for 1000x your current load on the basis of a business case and an ambition, you are designing for a fiction. The 1000x might arrive. But you will pay for it in engineering time before you have a single data point confirming it.

Throwaway-readiness is not a failure state — it is a feature. A system designed to be replaced cleanly when the constraint changes is more valuable than a system designed to never need replacement. The difference is whether you treat the initial architecture as the final architecture or as the cheapest way to generate evidence about what the final architecture should be.

Several specific elements could have been deferred entirely. Multi-tenancy: build a clean module boundary, document the tenant model, defer the isolation infrastructure until the second customer is signed. The event-driven backbone: model the domain as services with clean interfaces, don't commit to event infrastructure until you have measured the coupling cost of synchronous calls. The elaborate domain hierarchy: model what you know, extend when the domain tells you to. The trigger for each is a real condition — not a timeline, not a quarter, not "when we scale."
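As a rough illustration of the "clean module boundary" option, here is a minimal Python sketch. Every name in it (`TenantId`, `RecordStore`, `SingleTenantStore`, `INTERNAL_TEAM`) is hypothetical and not taken from the original product. The tenant is already part of the interface, while the isolation infrastructure stays unbuilt until the trigger arrives.

```python
from dataclasses import dataclass
from typing import Optional, Protocol


@dataclass(frozen=True)
class TenantId:
    """Hypothetical value type: the tenant model is documented, not yet enforced."""
    value: str


# Assumption for this sketch: the only tenant that exists today.
INTERNAL_TEAM = TenantId("internal-team")


class RecordStore(Protocol):
    """The module boundary. Callers already pass a tenant, so a later
    multi-tenant implementation can replace the one below behind this interface."""

    def save(self, tenant: TenantId, key: str, payload: dict) -> None: ...

    def load(self, tenant: TenantId, key: str) -> Optional[dict]: ...


class SingleTenantStore:
    """Simplest store that serves the evidenced constraint: one internal team."""

    def __init__(self) -> None:
        self._rows: dict[str, dict] = {}

    def save(self, tenant: TenantId, key: str, payload: dict) -> None:
        self._reject_unknown_tenant(tenant)
        self._rows[key] = dict(payload)  # copy so callers cannot mutate stored data

    def load(self, tenant: TenantId, key: str) -> Optional[dict]:
        self._reject_unknown_tenant(tenant)
        row = self._rows.get(key)
        return dict(row) if row is not None else None

    @staticmethod
    def _reject_unknown_tenant(tenant: TenantId) -> None:
        # Fail loudly instead of silently mixing data: isolation is deferred,
        # and the trigger for building it is "customer 2 signs".
        if tenant != INTERNAL_TEAM:
            raise NotImplementedError(
                "Tenant isolation is deferred until a second customer signs"
            )
```

The design choice is that the deferral is visible in the code: a second tenant produces an explicit error that names the trigger, so the retrofit starts from a known seam instead of a surprise.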


Three Operating Patterns That Would Have Changed the Outcome

Constrain Before You Design

Write down the actual constraint before drawing the first box. Not the projected constraint, but the one you have confirmed evidence for today. Users: how many, and in what usage pattern. Load: what you have measured, or what you have a signed commitment for. Growth rate: what signal you have for it.

This is not a one-time exercise. Do it at the start of every significant design decision. The question is not "what will we need when we scale?" The question is "what is the actual constraint right now, and what is the simplest architecture that serves it?"

When you write it down, the imagined constraints become visible. They are no longer implicit in the diagram — they are explicit assumptions that can be questioned. The multi-tenancy decision was based on an implicit assumption that a second customer would arrive. Writing it down surfaces the fact that you have no evidence for that assumption.
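One lightweight way to do the writing-down is a constraint log that forces each entry to carry its evidence. The sketch below is only an illustration: the `Constraint` type, the `split_constraints` helper, and the evidence strings are assumptions made for this example, and the entries paraphrase the situation described in this post.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Constraint:
    statement: str
    # A measurement, a signed contract, or real users. None marks a projection.
    evidence: Optional[str] = None

    @property
    def is_evidenced(self) -> bool:
        return bool(self.evidence and self.evidence.strip())


def split_constraints(
    constraints: List[Constraint],
) -> Tuple[List[Constraint], List[Constraint]]:
    """Return (evidenced, assumed) so the assumptions are explicit and reviewable."""
    evidenced = [c for c in constraints if c.is_evidenced]
    assumed = [c for c in constraints if not c.is_evidenced]
    return evidenced, assumed


# Illustrative entries only; the evidence notes are placeholders.
constraints = [
    Constraint("One internal team uses the product", evidence="current users"),
    Constraint("No measured external traffic", evidence="no traffic data exists"),
    Constraint("A second customer will arrive"),
    Constraint("Components will need to scale independently"),
]

evidenced, assumed = split_constraints(constraints)
for c in assumed:
    print(f"ASSUMPTION (no evidence): {c.statement}")
```

Anything printed by the loop is a design input that rests on projection, which is exactly the list a design review should question first.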

Name the Scale Trigger

For every element being built for a future load, write the specific condition that would justify building it. Not "when we grow." A named trigger: "when customer 2 signs," "when synchronous call latency exceeds 200 ms under measured load," "when the domain model has confirmed divergence between tenant types."

The trigger does two things. It makes the deferred decision responsible — you are not ignoring scale, you are building it when you have the evidence for it. And it makes the review honest. If you cannot name the trigger, you are building for a projection, not a constraint. If the trigger is "someday," it is not a trigger.
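The trigger list can live next to the code as plain data. The following is a hedged sketch, not a prescribed tool: `DeferredElement`, `VAGUE_TRIGGERS`, and the "multi-region deployment" entry are all hypothetical, and a string check like this can only catch obviously vague wording. It cannot prove that a trigger is observable.

```python
from dataclasses import dataclass

# Phrases that describe a hope rather than an observable condition.
# The list is an assumption of this sketch; extend it to match your team's habits.
VAGUE_TRIGGERS = ("someday", "eventually", "when we grow", "when we scale")


@dataclass(frozen=True)
class DeferredElement:
    name: str
    trigger: str

    def has_named_trigger(self) -> bool:
        text = self.trigger.strip().lower()
        return bool(text) and not any(phrase in text for phrase in VAGUE_TRIGGERS)


deferred = [
    DeferredElement("tenant isolation infrastructure", "customer 2 signs"),
    DeferredElement(
        "event infrastructure",
        "synchronous call latency exceeds 200 ms under measured load",
    ),
    DeferredElement(
        "extended domain hierarchy",
        "confirmed divergence between tenant types",
    ),
    # Hypothetical entry showing what a non-trigger looks like.
    DeferredElement("multi-region deployment", "when we scale"),
]

unjustified = [e.name for e in deferred if not e.has_named_trigger()]
print(unjustified)  # ['multi-region deployment']
```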

Price the Replacement, Not the Rewrite

The fear that drives over-engineering is the fear of the future rewrite. If I don't build multi-tenancy now, I'll have to retrofit it later under live load — and that's expensive, disruptive, and dangerous. The fear is legitimate. Retrofitting isolation boundaries into a live system is genuinely hard.

But price it. Estimate what it would actually cost to add multi-tenancy when the second customer signs — a clean migration, a clear module boundary, a well-documented domain model. That cost is real. Compare it to the cost of building multi-tenancy now, before you know whether a second customer is coming. The comparison is usually not what the over-engineering fear suggests. The replacement cost is often lower than the pre-emptive build cost, especially when the system was designed with replacement in mind.

The rewrite is what you get when you don't design for replaceability. A throwaway-ready system does not require a rewrite — it requires a promotion of the deferred elements you already documented.
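To make "price it" concrete, here is a small worked comparison. Two things in it are assumptions of this sketch and not claims from the post: the figures are invented, and weighting the retrofit cost by the chance that the trigger ever fires is one possible way to frame the comparison.

```python
def expected_deferral_cost(retrofit_weeks: float, trigger_probability: float) -> float:
    """Retrofit cost weighted by the chance the trigger ever arrives."""
    if not 0.0 <= trigger_probability <= 1.0:
        raise ValueError("trigger_probability must be between 0 and 1")
    return retrofit_weeks * trigger_probability


def cheaper_to_defer(
    build_now_weeks: float, retrofit_weeks: float, trigger_probability: float
) -> bool:
    """True when deferring costs less, in expectation, than building up front."""
    return expected_deferral_cost(retrofit_weeks, trigger_probability) < build_now_weeks


# Invented figures, purely for illustration.
build_now_weeks = 8.0   # multi-tenancy built before any customer exists
retrofit_weeks = 12.0   # added later behind an existing module boundary
p_second_customer = 0.25

print(expected_deferral_cost(retrofit_weeks, p_second_customer))             # 3.0
print(cheaper_to_defer(build_now_weeks, retrofit_weeks, p_second_customer))  # True
```

In this example deferral wins even though the retrofit is priced higher than the up-front build, because the up-front cost is paid with certainty and the retrofit only if the customer shows up. With a signed second customer (probability 1.0) the same numbers point the other way.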


What Monday Morning Looks Like if You Take This Seriously

Pull up the current architecture diagram. Draw a line between every element that is a response to an actual, evidenced constraint and every element that is a response to an imagined future state. Most diagrams have both. The question is whether you know which is which.

For every element on the imagined side, write the trigger. The specific condition that would justify its existence. If you cannot write the trigger, you have an element without an evidence base — and you should know that.

In your next design review, change the question. The question is not "will this scale?" The question is "what is this scaled for, and what evidence do we have for that constraint?" The first question is architectural — engineers know how to answer it. The second question is empirical — it requires evidence, not expertise.

For systems that are already over-engineered: start by separating what the system actually does from what it was built to handle. Map the elements that are load-bearing today against the elements that were built for load that has not arrived. The unused abstraction layers and the tenant isolation for tenants that don't exist are candidates for simplification. You are not rewriting the system. You are identifying what can be safely removed because the trigger that would have justified it has not arrived and may never arrive.

The post-mortem question worth asking is not "why did we build something too complex?" It is "what assumption were we protecting by building it?" That assumption is where the next over-engineering decision is waiting.


Frequently Asked Questions

What is over-engineering in software architecture?

Over-engineering is building complexity — abstractions, scaling layers, infrastructure — for constraints that don't exist yet and may never arrive. It is not the same as building a well-structured, maintainable system. The distinction is whether the design is a response to actual constraints or to imagined ones. A multi-tenant architecture for a product with zero customers is over-engineering. The same architecture at customer three may be entirely correct.

How do you know when you're over-engineering a system?

The clearest signal is that design decisions are being made to accommodate load or scale you have no evidence for. If the justification for an architectural element begins with "when we eventually..." or "we don't want to have to retrofit..." without a named trigger condition, it is probably over-engineering. A second signal: the cost of the design exceeds the cost of the problem you've confirmed exists.

Is "build for scale" always wrong?

No. Building for scale is correct when you have evidence for the scale. The error is treating "build for scale" as a general virtue rather than a specific response to a real constraint. Designing for the next 10x you have signal toward is sound. Designing for an imagined 1000x you have no evidence for is premature optimization with a respectable job title. The discipline is to name the trigger that would justify each scaling decision, neither deferring it indefinitely nor building it on speculation.

What should I build instead of a scalable architecture when starting out?

Build the simplest system that lets you learn whether your assumptions about the problem are correct. That system should be clean and replaceable — not sloppy, but not over-abstracted. The goal at the early stage is to generate evidence: what load actually arrives, what the real user behavior looks like, what the constraint actually is. Once you have evidence, you have the basis for architectural decisions. Before that, you are designing for a fiction.

How do you recover from a system that was already over-engineered?

Start by separating what the system actually does from what it was built to handle. Most over-engineered systems have a simpler core that works correctly. Map the complexity that is load-bearing versus complexity that was built for load that never arrived. The unused abstractions and scaling infrastructure are candidates for removal or simplification. The recovery path is not a rewrite — it is identifying which elements can be safely removed or simplified now, and which should stay because the load they were designed for has since arrived.
