
Data Architecture That Scales: From Single Postgres to Warehouse and Lakehouse

Your dashboard query just crashed production. Learn when to separate OLTP from OLAP, how to introduce a data warehouse, and whether you actually need event streams and a lakehouse—with a clear maturity checklist.

Ruchit Suthar
November 17, 2025 · 10 min read

TL;DR

Single Postgres handles OLTP (transactions) but breaks with OLAP (analytics). Evolve through stages: read replicas for separation, warehouse (Snowflake/BigQuery) for complex queries, event streaming for real-time needs, lakehouse for ML. Don't jump straight to lakehouse—match architecture to current pain, not future dreams. Marketing dashboards crashing production means you need a warehouse.


When SELECT * Starts Taking 30 Seconds

It's 3am. Your Slack is exploding.

The marketing team ran a dashboard query to calculate last quarter's revenue by region. The query took 45 seconds. Then timed out. Then crashed the database.

Now checkout is down. Customers can't buy your product. Your on-call engineer is frantically restarting Postgres while the CEO asks, "Why does a report take down production?"

Welcome to the moment every startup hits:

Your application database was designed for fast writes and lookups. "Save this order." "Fetch user #12345." Simple, fast, transactional.

But now people are running:

  • 50-table joins to calculate LTV
  • Full table scans to generate quarterly reports
  • Ad-hoc analytics queries during peak traffic

Your data architecture is still at version 1. Your business is at version 5.

The single Postgres database that got you to product-market fit is now your biggest bottleneck. Marketing can't get their reports. Data science can't train models. Engineers are scared to touch the database schema.

You need to evolve. But to what? A warehouse? A data lake? Event streaming?

Let me show you how data architecture evolves as you scale, and more importantly, when each stage makes sense.

OLTP vs OLAP in Plain Language

Before we talk about solutions, let's clarify the fundamental problem.

OLTP: Online Transaction Processing

What it does: Fast reads and writes of individual records.

Optimized for:

  • "Create this order"
  • "Update user email"
  • "Fetch product #42"
  • "Delete this session"

Examples: Your application database (Postgres, MySQL, MongoDB)

Pattern:

  • Lots of small, quick queries
  • Many concurrent users
  • Strong consistency (order total must be exact)
  • Row-by-row operations

Think: Shopping cart, user authentication, inventory updates.
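In SQL terms, a typical OLTP workload is a stream of single-row statements, each hitting an index (the table and column names below are just illustrative):

-- Point lookups and single-row writes: milliseconds each, thousands per second
SELECT id, email, name FROM users WHERE id = 12345;

INSERT INTO orders (user_id, total, status, created_at)
VALUES (12345, 49.99, 'pending', NOW());

UPDATE users SET email = 'new@example.com' WHERE id = 12345;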

OLAP: Online Analytical Processing

What it does: Complex queries across large datasets.

Optimized for:

  • "Total revenue by region for Q3"
  • "Customer cohort retention analysis"
  • "Which products are frequently bought together?"
  • "Predict churn risk for all users"

Examples: Data warehouses (BigQuery, Snowflake, Redshift)

Pattern:

  • Few, expensive queries
  • Scanning millions of rows
  • Slightly stale data is fine (yesterday's numbers are good enough)
  • Column-oriented storage

Think: Business intelligence dashboards, financial reports, ML training data.
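An analytical question against the same schema looks completely different: one query that scans and aggregates a large slice of the data (again, names are illustrative):

-- Total revenue by region for Q3: touches most of the orders table
SELECT u.region, SUM(o.total) AS revenue
FROM orders o
JOIN users u ON u.id = o.user_id
WHERE o.created_at >= '2025-07-01' AND o.created_at < '2025-10-01'
GROUP BY u.region
ORDER BY revenue DESC;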

The Core Problem

OLTP databases are terrible at OLAP workloads.

Running analytical queries on your production database is like:

  • Using a sports car to move furniture
  • Using a blender to chop firewood
  • Using a microwave to bake bread

It'll work... once. Then something breaks.

The solution: Separate your operational workloads from your analytical workloads.

Let's see how.

Stage 1 – Single Database with Light Read Scaling

The Setup

┌─────────────────┐
│   Application   │
│                 │
└────────┬────────┘
         │
         ▼
  ┌──────────────┐
  │   Postgres   │◄────── (Analysts run queries here)
  │              │
  └──────────────┘

Who's here:

  • Startups with <10 engineers
  • Single-region, monolithic architecture
  • <100k users, <1TB data

What works:

  • Application writes/reads are fast
  • Light reporting with indexed queries
  • Simple architecture, easy to reason about

What breaks:

  • Someone runs an unindexed SELECT * and locks tables
  • Dashboard queries compete with customer transactions
  • Backups take hours and slow down writes
  • Database CPU spikes during business hours when dashboards refresh
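When this starts happening, Postgres can usually tell you which queries are to blame. A quick sketch using the pg_stat_statements extension (assuming it's installed; the column names below are for Postgres 13+, older versions use mean_time):

-- Slowest queries by average execution time
SELECT query, calls, mean_exec_time, rows
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;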

The Band-Aid: Read Replicas

                  ┌──────────────┐
                  │ Primary DB   │◄────── App writes here
                  │              │
                  └──────┬───────┘
                         │
              ┌──────────┴──────────┐
              ▼                     ▼
       ┌──────────────┐      ┌──────────────┐
       │ Read Replica │      │ Read Replica │◄─ Analytics queries here
       │              │      │              │
       └──────────────┘      └──────────────┘

Benefits:

  • Offloads read traffic from primary
  • Analytics queries don't block writes
  • Buys you 6–12 months

Limitations:

  • Replicas lag behind primary (eventual consistency)
  • Complex analytical queries still slow
  • Doesn't solve schema design mismatch (OLTP vs OLAP)
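Replica lag is at least easy to measure. On a Postgres replica, a built-in function shows roughly how far behind the primary you are:

-- Run on the replica: approximate replication delay
SELECT now() - pg_last_xact_replay_timestamp() AS replication_lag;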

When Stage 1 Breaks Down

You're ready to move on when:

  • Dashboards regularly time out
  • Read replicas are maxed out at 90%+ CPU
  • Data team writes increasingly complex queries that take minutes
  • Schema changes are blocked by "analytics team needs this column"
  • Analysts are asking for two years of historical data you don't retain

The breaking point: Your business wants to ask complex questions your OLTP database can't answer efficiently.

Stage 2 – Introducing a Data Warehouse

The Setup

┌─────────────┐
│     App     │
└──────┬──────┘
       │
       ▼
┌──────────────┐       ┌─────────────┐       ┌────────────────┐
│  Postgres    │──────►│  ETL Jobs   │──────►│ Data Warehouse │
│  (OLTP)      │       │  (nightly)  │       │   (OLAP)       │
└──────────────┘       └─────────────┘       └────────┬───────┘
                                                       │
                                                       ▼
                                              ┌────────────────┐
                                              │   BI Tools     │
                                              │ (Looker, etc)  │
                                              └────────────────┘

Key components:

1. Data Warehouse: Snowflake, BigQuery, Redshift, or ClickHouse

  • Columnar storage (fast for aggregations)
  • Designed for large scans
  • Scales to petabytes
  • Costs money, but the spend is predictable

2. ETL/ELT Pipeline: Fivetran, Airbyte, dbt, custom scripts

  • Extracts data from production DB
  • Transforms it for analytics
  • Loads into warehouse
  • Runs on schedule (hourly, daily)

3. BI Tools: Looker, Tableau, Metabase

  • Connect to warehouse, not production
  • Engineers define metrics once
  • Business users self-serve reports

When to Make the Move

Strong signals:

  • 5+ people regularly querying production database
  • Reports taking >30 seconds
  • Business asking for historical trend analysis
  • A data team has been hired, or analysts spend 30%+ of their time on data wrangling

Rough thresholds:

  • 10–50 engineers
  • 100k–1M users
  • 500GB+ database
  • Multiple dashboards refreshed daily

What You Gain

Separation of concerns:

  • Production database handles transactions
  • Warehouse handles analytics
  • No more 3am incidents from dashboards

Performance:

  • Reports that took 2 minutes now take 2 seconds
  • Can scan billions of rows without sweating

Historical data:

  • Keep years of data without bloating prod DB
  • Time-series analysis becomes feasible

Self-service analytics:

  • Business teams build their own dashboards
  • Engineers freed from "can you run this query for me?"

ETL vs ELT: What's the Difference?

ETL (Extract, Transform, Load):

Production DB → Transform in scripts → Load to Warehouse
  • Transform data before loading (filter, aggregate, clean)
  • Used when warehouse is expensive or transformation is complex
  • Traditional approach

ELT (Extract, Load, Transform):

Production DB → Load raw to Warehouse → Transform in Warehouse
  • Dump everything into warehouse
  • Transform using SQL (dbt is popular for this)
  • Modern approach, enabled by cheap warehouse storage

Which to use?

  • ELT for most startups (simpler, more flexible)
  • ETL if you have complex, stateful transformations or legacy systems
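In the ELT world, those transformations are mostly plain SQL. A minimal dbt-style sketch, assuming raw orders and users tables have already been loaded into the warehouse untouched (the file path and column names are illustrative):

-- models/orders_wide.sql: a denormalized, analytics-friendly orders table
SELECT
    o.id         AS order_id,
    o.created_at AS ordered_at,
    o.total,
    o.status,
    u.id         AS user_id,
    u.region,
    u.signup_date
FROM raw.orders o
JOIN raw.users  u ON u.id = o.user_id

Wide, denormalized models like this are also the fix for Pitfall 3 below.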

Common Pitfalls

Pitfall 1: Over-engineering ETL

  • Don't build custom ETL from scratch
  • Use Fivetran, Airbyte, or managed connectors
  • Focus on business logic, not data plumbing

Pitfall 2: No data contracts

  • Backend team changes schema → dashboard breaks
  • Define schema ownership and versioning upfront

Pitfall 3: Treating warehouse like OLTP

  • Don't replicate every microservice table 1:1
  • Design for analytical queries (denormalized, wide tables)

Pitfall 4: Ignoring costs

  • Warehouse scans can get expensive fast
  • Monitor query costs
  • Set up alerts for runaway queries
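On BigQuery, for example, one way to see where the money goes is the jobs metadata views (the region qualifier and columns below assume BigQuery; other warehouses have similar query-history views):

-- Most expensive queries over the last 7 days, by bytes billed
SELECT user_email, query, total_bytes_billed / POW(10, 12) AS tb_billed
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
  AND job_type = 'QUERY'
ORDER BY total_bytes_billed DESC
LIMIT 20;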

Stage 2 Works Until...

You hit one of these limits:

  • You have 5+ source systems (not just one Postgres database)
  • You need near real-time analytics (<1 hour latency)
  • You're building ML models that need fresh data constantly
  • You have 100TB+ data and need something cheaper than a warehouse

Then you might need Stage 3.

(But most companies never do. Stage 2 can scale to $100M+ ARR.)

Stage 3 – Event Streams and Lake/Lakehouse

The Setup

┌──────────┐  ┌──────────┐  ┌──────────┐
│  App 1   │  │  App 2   │  │  App 3   │
└────┬─────┘  └────┬─────┘  └────┬─────┘
     │             │             │
     └─────────┬───┴─────────────┘
               ▼
        ┌──────────────┐
        │ Event Stream │ (Kafka, Kinesis, Pub/Sub)
        └──────┬───────┘
               │
       ┌───────┴───────┐
       ▼               ▼
┌──────────────┐  ┌──────────────┐
│ Data Lake/   │  │  Warehouse   │
│ Lakehouse    │  │              │
│ (S3/Delta)   │  └──────────────┘
└──────┬───────┘
       │
       ▼
┌──────────────┐
│ ML Models,   │
│ Notebooks    │
└──────────────┘

What's Different?

Event Streaming (Kafka, Kinesis):

  • Captures every event as it happens (order created, user clicked, payment processed)
  • Multiple consumers can read the same stream
  • Enables near real-time analytics

Data Lake:

  • Store raw data in cheap object storage (S3, GCS, Azure Blob)
  • Schema-on-read (define structure when querying, not when storing)
  • Great for unstructured data (logs, images, JSON blobs)

Lakehouse (Databricks, Delta Lake, Iceberg):

  • Combines lake storage with warehouse query performance
  • ACID transactions on object storage
  • Unified platform for batch and streaming
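What this buys you in practice is the ability to query events minutes after they happen. A rough sketch, assuming a connector lands stream events into an events table in the lakehouse or warehouse (table names, column names, and the FILTER syntax depend on your engine):

-- Orders and payment failures per minute, over the last hour
SELECT
    DATE_TRUNC('minute', event_time) AS minute,
    COUNT(*) FILTER (WHERE event_type = 'order_created')  AS orders,
    COUNT(*) FILTER (WHERE event_type = 'payment_failed') AS failed_payments
FROM events
WHERE event_time > NOW() - INTERVAL '1 hour'
GROUP BY 1
ORDER BY 1;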

When You Actually Need This

Strong signals:

  • 10+ microservices producing data
  • Real-time dashboards (< 5 minute latency)
  • ML models in production needing live features
  • Data science team running large-scale experimentation
  • 100TB+ data with tight cost constraints

Rough thresholds:

  • 100+ engineers
  • 10M+ users
  • Multiple product lines
  • Mature data science function

What You Gain

Real-time insights:

  • Dashboard updates as events happen
  • Fraud detection in milliseconds
  • Live personalization

Multiple consumers:

  • Analytics, ML, monitoring all read same events
  • No need to duplicate ETL pipelines

Cost efficiency:

  • Store petabytes cheaply in object storage
  • Only pay for compute when querying

Flexibility:

  • Keep raw, unprocessed data forever
  • Reprocess historical data as business logic changes

The Complexity Tax

You're now managing:

  • Stream processing (Kafka, Flink, Spark Streaming)
  • Schema registries and versioning
  • Data quality monitoring
  • Backfilling and reprocessing pipelines
  • Multiple storage systems (lake + warehouse + databases)

Team requirements:

  • Dedicated data platform engineers
  • Strong data governance
  • On-call for data pipelines

Reality check: Most companies never need this.

If you're doing $100M+ revenue and have 50+ engineers, maybe. Otherwise, stick with Stage 2.

Data Ownership and Contracts

No matter which stage you're at, you need to answer: Who owns the data?

The Spaghetti Problem

Without clear ownership:

  • Backend team changes orders.status from enum to string
  • Analytics dashboard breaks
  • Data team creates orders_clean table with different schema
  • ML team creates orders_features table with different logic
  • Now you have 3 versions of truth

Result: Nobody trusts the data.

Data Contracts: The Solution

Define upfront:

1. Schema ownership

orders table → owned by Payments team
- Any schema change requires approval
- Breaking changes need 30-day deprecation notice
- Must notify #data-platform channel

2. Semantic versioning for schemas

orders_v1 → orders_v2 → orders_v3
- Old versions supported for 6 months
- Downstream consumers can migrate gradually

3. Data quality SLAs

orders table guarantees:
- Updated within 5 minutes of transaction
- No nulls in order_id, user_id, total
- Total always positive
- created_at timestamp within the last 7 days

4. Golden datasets

analytics.orders_clean → owned by Data team
- Source: production orders table
- Transformations documented in dbt
- Updated nightly at 2am UTC
- THIS is the table for dashboards
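Contracts only matter if something checks them. A hedged sketch of the SLAs from point 3 as plain SQL assertions (in practice you'd wire these into dbt tests or a tool like Great Expectations, and alert when any query returns rows):

-- Required fields: should return zero rows
SELECT * FROM orders
WHERE order_id IS NULL OR user_id IS NULL OR total IS NULL;

-- Totals: should return zero rows
SELECT * FROM orders WHERE total <= 0;

-- Freshness: returns a row only if the newest order is more than 5 minutes old
SELECT MAX(created_at) AS last_order
FROM orders
HAVING MAX(created_at) < NOW() - INTERVAL '5 minutes';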

Practical Implementation

Small teams (10–30 engineers):

  • Designate one engineer per domain as "data custodian"
  • Document key tables in README or wiki
  • Announce schema changes in Slack

Medium teams (30–100 engineers):

  • Use dbt for transformation lineage
  • Schema change approval in pull requests
  • Data quality tests in CI/CD

Large teams (100+ engineers):

  • Formal data governance committee
  • Tools like Monte Carlo, Great Expectations for quality monitoring
  • Automated data catalogs (Atlan, Alation)

Pragmatic Governance

Versioning Schemas

Bad:

-- Breaking change, no warning
ALTER TABLE users DROP COLUMN legacy_id;

Good:

-- Deprecate first
ALTER TABLE users ADD COLUMN deprecated_legacy_id INT;
-- Copy data
UPDATE users SET deprecated_legacy_id = legacy_id;
-- Wait 90 days, monitor usage
-- Then drop old column
ALTER TABLE users DROP COLUMN legacy_id;

Backfilling Data Safely

The scenario: You add a new feature that needs historical data.

Bad approach:

# Backfill 10M rows in one transaction
for user in all_users:
    calculate_and_save_ltv(user)  # Locks tables, kills DB

Good approach:

# Batch backfill with rate limiting
from time import sleep

batch_size = 1000
# in_batches() is ORM-style pseudocode: any chunked iteration works here
for batch in all_users.in_batches(batch_size):
    calculate_and_save_ltv(batch)  # one short transaction per batch
    sleep(1)                       # give the primary time to breathe

# Or: backfill in the warehouse, then copy the results back

Deleting and Anonymizing PII

GDPR, CCPA, and data localization require:

  • Right to deletion (user requests data removal)
  • Right to anonymization (keep analytics, remove PII)
  • Data retention policies (don't keep data forever)

Implementation:

1. Hard delete (rare):

-- Delete child rows first, or the foreign key from orders blocks the delete
DELETE FROM orders WHERE user_id = 12345;
DELETE FROM users WHERE id = 12345;
-- Cascades through every related table; easy to miss one, impossible to undo

2. Soft delete (better):

UPDATE users SET deleted_at = NOW() WHERE id = 12345;
-- Queries filter: WHERE deleted_at IS NULL

3. Anonymization (best for analytics):

UPDATE users
SET
    email = 'deleted_user_' || id || '@anonymous.local',  -- per-user value, in case email has a UNIQUE constraint
    name  = 'Deleted User',
    phone = NULL
WHERE id = 12345;

-- Keep user_id for analytics, remove PII

4. Warehouse cleanup:

-- Nightly job in dbt
DELETE FROM warehouse.users 
WHERE deleted_at < NOW() - INTERVAL '30 days';

Data Retention Policies

Define upfront:

  • Transactional data: 7 years (for compliance)
  • Analytical data: 2 years (after that, aggregate)
  • Logs: 30 days
  • PII: deleted on request, anonymized after 90 days

Implementation:

  • Automated jobs to archive/delete old data
  • Separate "cold storage" for compliance data
  • Document retention rules in data catalog
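A minimal sketch of the automation behind those rules, assuming an events log table and the orders_wide model from earlier (names are illustrative; schedule it with whatever you already run, cron, Airflow, or dbt):

-- Logs: keep 30 days
DELETE FROM events WHERE created_at < NOW() - INTERVAL '30 days';

-- Analytical data: aggregate, then expire row-level detail older than 2 years
INSERT INTO daily_revenue (day, region, revenue)
SELECT DATE_TRUNC('day', ordered_at), region, SUM(total)
FROM orders_wide
WHERE ordered_at < NOW() - INTERVAL '2 years'
GROUP BY 1, 2;

DELETE FROM orders_wide WHERE ordered_at < NOW() - INTERVAL '2 years';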

Evolve Just-in-Time, Not Years Too Early

The biggest mistake I see: over-engineering before you have the problem.

Don't:

  • Build a data lake when you have 1GB of data
  • Set up Kafka when batch ETL every hour is fine
  • Hire a data platform team at 10 engineers

Do:

  • Feel the pain first
  • Solve with the simplest tool that works
  • Upgrade when current system can't handle load

The Maturity Checklist

Use this to identify where you are and what's next.

✅ Stage 1: Single Database

You're here if:

  • <10 engineers
  • One main database (Postgres, MySQL)
  • Dashboards query production DB directly
  • Reports occasionally slow things down

Next step:

  • Add read replicas for analytics queries
  • Set up automated backups
  • Document key tables and schemas

✅ Stage 2: Data Warehouse

You're here if:

  • 10–50 engineers
  • 5+ people regularly need data
  • Read replicas maxed out
  • Business asking for historical trends

Next step:

  • Choose a warehouse (BigQuery, Snowflake, Redshift)
  • Set up ETL (Fivetran, Airbyte, or custom)
  • Migrate 3–5 most critical dashboards
  • Define data contracts for key tables

✅ Stage 3: Event Streams & Lake

You're here if:

  • 50+ engineers
  • 5+ microservices producing data
  • Need real-time analytics (< 5 min latency)
  • Data science team building production ML models
  • 10TB+ data

Next step:

  • Evaluate Kafka/Kinesis vs change data capture (Debezium, or a managed CDC service)
  • Pilot lakehouse (Databricks, Delta Lake)
  • Hire dedicated data platform engineers
  • Establish data governance processes

The Just-in-Time Rule

Only move to the next stage when:

  1. Current system is measurably slowing you down
  2. You've tried simpler optimizations first
  3. Next stage solves a real, current problem (not hypothetical future problem)

Signs you're moving too early:

  • "We might need this in 2 years"
  • "This is what [BigTechCo] uses"
  • "It'll make us look more sophisticated"

Signs you're moving too late:

  • Database crashes weekly from analytics queries
  • Data team spends 50%+ time fighting infrastructure
  • Business decisions delayed because reports won't finish

Data Architecture is a Journey, Not a Destination

You don't need Kafka and a data lake to run a successful startup.

You need:

  • Operational database for fast transactions
  • Analytical system when reporting hits scale limits
  • Clear ownership so data doesn't become spaghetti
  • Governance so you don't violate privacy laws

Start simple:

  • Single Postgres with read replicas → works to $10M ARR
  • Add a warehouse → works to $100M ARR
  • Event streams + lakehouse → needed at true scale

Every architecture decision is a trade-off:

  • Simpler → easier to operate, faster to build
  • More complex → handles more scale, costs more to maintain

Evolve when you feel the pain. Not before.

Your business is better served by shipping features with a "good enough" data stack than by building a perfect data platform no one uses yet.

Start where you are. Move to the next stage when it hurts. Repeat.

That's how you build data architecture that actually scales.

Topics

data-architecture, data-warehouse, data-lake, oltp-vs-olap, etl, kafka, scaling-data

About Ruchit Suthar

Technical Leader with 15+ years of experience scaling teams and systems