Refactoring Legacy Systems Without Downtime: Applying the Strangler Fig Pattern in Real Teams
Big rewrites fail. Learn how to modernize legacy systems incrementally using the strangler fig pattern: map dependencies, prioritize modules, use feature flags for gradual rollout, and migrate data safely—all while shipping features.

TL;DR
Big-bang rewrites take two or more years and often fail. The strangler fig pattern replaces a legacy system piece by piece: identify the most painful modules, rebuild them one at a time, route traffic to the new code via feature flags, verify, delete the old code, repeat. One team replaced 60% of its system in 18 months this way while shipping features continuously. Never stop the business for a rewrite.
The Big Rewrite That Never Shipped
Let me tell you about two companies with the same problem.
Company A inherited a monolithic PHP application from 2012. It was slow, hard to test, and every deploy was a roll of the dice. The new CTO convinced the board to invest in a complete rewrite. "We'll rebuild it properly in microservices with Node.js. Give us 18 months."
Two years later, the rewrite was at 70% completion. But the old system had gained 40% more features because the business couldn't stop for two years. The new system was perpetually behind, never quite ready to replace the old one. Eventually, the board lost patience. The rewrite was scrapped. The CTO left. The team was demoralized.
Company B had the same problem. Different approach.
They didn't rewrite anything. They identified the most painful module—the reporting system—and rebuilt just that one piece. Took six weeks. Put it behind a feature flag. Tested it with 5% of traffic, then 20%, then 100%. Killed the old code. Celebrated. Moved to the next painful module.
Eighteen months later, 60% of the old system was gone, replaced piece by piece. The system was faster, more reliable, and easier to work on. And they never stopped shipping features.
This is the difference between big-bang rewrites and strangler fig refactoring.
Let me show you how Company B did it.
First: Understand the Legacy System Like an Archaeologist
Before you touch a single line of code, you need to map the territory. Legacy systems are archaeological sites. You can't safely excavate if you don't know where the load-bearing walls are.
What to Map
1. Critical user flows
Identify the 3–5 most important user journeys:
- User signs up → verifies email → completes onboarding
- User searches → views product → adds to cart → checks out → payment
- Admin creates report → schedules delivery → views analytics
For each flow, trace:
- Which endpoints or pages are involved
- Which database tables are touched
- Which external services are called
- Where authentication/authorization happens
2. Key modules and services
Draw boxes for major conceptual areas:
- Authentication & Authorization
- User Management
- Payment Processing
- Reporting & Analytics
- Email & Notifications
- Background Jobs
For each, note:
- Roughly how many lines of code
- How often it changes
- How many bugs/incidents it causes
- How many developers understand it
3. Datastores and external integrations
List every database, cache, queue, and third-party service:
- PostgreSQL (user data, orders)
- Redis (sessions, cache)
- Stripe (payments)
- SendGrid (emails)
- S3 (file storage)
Note dependencies: "Payment flow requires Stripe, user service, email service, and order database."
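If you want this map to live somewhere the team can grep, diff, and update, one option is to record it as plain data. A minimal sketch, with illustrative structure and names rather than anything prescriptive:

    # dependency_map.py -- a lightweight, greppable record of what touches what.
    # Entries are illustrative; fill in your own flows, tables, and services.
    DEPENDENCY_MAP = {
        "checkout": {
            "endpoints": ["/cart", "/checkout", "/payment"],
            "tables": ["users", "orders", "payments"],
            "external": ["Stripe", "SendGrid"],
            "auth": "session cookie, checked before payment",
        },
        "reporting": {
            "endpoints": ["/reports", "/reports/schedule"],
            "tables": ["orders", "events"],
            "external": ["S3"],
            "auth": "admin role required",
        },
    }

Even a rough file like this beats tribal knowledge: it gives code review a place to catch "wait, that flow also touches the events table."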
Why This Matters
You can't refactor safely if you don't know the blast radius.
If you change the authentication module, what breaks? If you migrate the user table, what queries fail? If you turn off an old API endpoint, which clients scream?
Spend a week mapping. It will save you months of emergency rollbacks.
Tools That Help
- Architecture diagrams (even rough sketches)
- Dependency graphs (tools like Madge, dependency-cruiser)
- Database query logs (what actually gets used)
- API call monitoring (which endpoints serve real traffic)
- Git history analysis (which files change together)
The goal: build a mental model of what this system actually does, not what you wish it did.
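One of those checks is easy to script yourself. Here is a rough sketch of co-change analysis that counts which files appear in the same commits, using plain git log output (the marker string and the top-20 cutoff are arbitrary choices of mine, not any particular tool):

    # co_change.py -- count which files tend to change together in git history.
    import subprocess
    from collections import Counter
    from itertools import combinations

    def co_change_counts(repo_path="."):
        # "@@@<hash>" marks the start of each commit; --name-only lists its files.
        log = subprocess.run(
            ["git", "log", "--name-only", "--pretty=format:@@@%H"],
            cwd=repo_path, capture_output=True, text=True, check=True,
        ).stdout
        pairs = Counter()
        files = []
        for line in log.splitlines():
            if line.startswith("@@@"):
                # New commit: record all file pairs from the previous one
                pairs.update(combinations(sorted(set(files)), 2))
                files = []
            elif line.strip():
                files.append(line.strip())
        pairs.update(combinations(sorted(set(files)), 2))  # last commit
        return pairs

    if __name__ == "__main__":
        for (a, b), count in co_change_counts().most_common(20):
            print(f"{count:4d}  {a}  <->  {b}")

Files that always change together but live in "different" modules are a strong hint about where the real boundaries are.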
The Strangler Fig Pattern Explained in Plain English
The strangler fig is a tree that grows around another tree. It starts as a seed in a branch, sends roots down to the ground, grows around the host tree, and eventually replaces it entirely. The original tree slowly disappears.
Software refactoring works the same way.
The Pattern in Three Steps
Step 1: Identify a module or capability to replace
Pick something bounded and valuable. Not "the entire backend"—pick "the reporting system" or "the authentication flow."
Step 2: Build the new version alongside the old one
Don't touch the old code yet. Build the new implementation as a separate module, service, or component.
Step 3: Route traffic from old to new gradually
Start sending a small percentage of requests to the new implementation. Monitor. Increase gradually. When you're at 100%, delete the old code.
Visualizing the Pattern
Before (Legacy Monolith):
┌─────────────────────────────────┐
│         Legacy Monolith         │
│                                 │
│  ┌─────────┐    ┌──────────┐    │
│  │  Auth   │    │ Reporting│    │
│  └─────────┘    └──────────┘    │
│  ┌─────────┐    ┌──────────┐    │
│  │ Payment │    │  Emails  │    │
│  └─────────┘    └──────────┘    │
│                                 │
└─────────────────────────────────┘
During Strangling (Reporting Extracted):
        ┌──────────────────┐
        │  New Reporting   │
        │     Service      │
        └──────────────────┘
                 ▲
                 │  (feature flag routes
                 │   some traffic here)
                 │
┌─────────────────────────────────┐
│         Legacy Monolith         │
│                                 │
│  ┌─────────┐    ┌──────────┐    │
│  │  Auth   │    │ Reporting│◄───┼── (old code still exists,
│  └─────────┘    └──────────┘    │    but getting less traffic)
│  ┌─────────┐    ┌──────────┐    │
│  │ Payment │    │  Emails  │    │
│  └─────────┘    └──────────┘    │
└─────────────────────────────────┘
After (Reporting Strangled):
        ┌──────────────────┐
        │  New Reporting   │
        │     Service      │
        └──────────────────┘
                 ▲
                 │  (100% of traffic)
                 │
┌─────────────────────────────────┐
│         Legacy Monolith         │
│                                 │
│  ┌─────────┐                    │
│  │  Auth   │                    │  ◄── Old reporting code DELETED
│  └─────────┘                    │
│  ┌─────────┐    ┌──────────┐    │
│  │ Payment │    │  Emails  │    │
│  └─────────┘    └──────────┘    │
└─────────────────────────────────┘
Repeat this process for each module until the legacy system is gone or small enough to rewrite safely.
Choosing What to Strangle First (Prioritization)
Not all modules are equally good candidates for strangling. Here's how to prioritize.
The Strangler Priority Matrix
Score each module on four dimensions (1 = low, 5 = high):
1. Business Criticality
How painful if this breaks?
- Score 5: Payments, authentication, checkout
- Score 3: Reporting, admin tools
- Score 1: Rarely-used internal features
2. Change Frequency
How often do you need to modify this?
- Score 5: Modified multiple times per month
- Score 3: Modified a few times per year
- Score 1: Hasn't been touched in 2+ years
3. Pain Level
How much developer suffering does this cause?
- Score 5: Constant bugs, slow, hard to test, no one understands it
- Score 3: Somewhat messy but manageable
- Score 1: Works fine, clean enough
4. Coupling
How entangled is this with the rest of the system?
- Score 1: Loosely coupled, clear boundaries (GOOD for strangling)
- Score 3: Some shared dependencies
- Score 5: Deeply tangled with everything (BAD for strangling)
Calculate Strangler Score
Strangler Score = (Change Frequency × Pain Level) / (Business Criticality × Coupling)
Why this formula?
- High change frequency + high pain = worth fixing
- High criticality = risky, be careful
- High coupling = hard to extract, save for later
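If you want to compute this rather than eyeball it, the formula is a one-liner. A tiny sketch, using the Reporting ratings from the example table that follows:

    def strangler_score(criticality, change_freq, pain, coupling):
        # Ratings are 1-5, as in the matrix above
        return (change_freq * pain) / (criticality * coupling)

    # Reporting from the example below: (5 * 5) / (3 * 2) ≈ 4.2
    print(round(strangler_score(criticality=3, change_freq=5, pain=5, coupling=2), 1))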
Example Scoring
| Module | Criticality | Change Freq | Pain | Coupling | Score |
|---|---|---|---|---|---|
| Reporting | 3 | 5 | 5 | 2 | 4.2 ⭐ |
| Authentication | 5 | 3 | 4 | 4 | 0.6 |
| Payment | 5 | 4 | 5 | 3 | 1.3 |
| Email System | 2 | 4 | 4 | 1 | 8.0 ⭐⭐ |
| Admin Panel | 2 | 2 | 3 | 2 | 1.5 |
What this tells us:
- Email system (8.0) – High pain, frequently changed, low criticality, loosely coupled. Perfect first target.
- Reporting (4.2) – High pain and change frequency, reasonable criticality. Good second target.
- Payment (1.3) – High criticality and coupling. Leave this for later when you have strangling experience.
- Authentication (0.6) – Critical and coupled. Last thing to touch.
Start with the highest score. Build confidence. Move to harder targets.
Techniques for Safe Refactoring in Production
Now let's talk about specific patterns for strangling safely.
Pattern 1: Branch by Abstraction
Create a stable interface, then swap implementations behind it.
Before:
# Old code scattered everywhere
def process_order(order):
    # Direct calls to legacy payment processor
    result = legacy_stripe_call(order.amount, order.card)
    if result.success:
        send_legacy_email(order.user, "Payment successful")
Step 1: Introduce abstraction
class PaymentProcessor:
    def charge(self, amount, payment_method):
        # Route to old implementation for now
        return legacy_stripe_call(amount, payment_method)

# Refactor all call sites to use the abstraction
def process_order(order):
    processor = PaymentProcessor()
    result = processor.charge(order.amount, order.card)
    if result.success:
        send_legacy_email(order.user, "Payment successful")
Step 2: Add new implementation behind same interface
class PaymentProcessor:
    def charge(self, amount, payment_method):
        if feature_flag('new_payment_flow'):
            return new_stripe_service.charge(amount, payment_method)
        else:
            return legacy_stripe_call(amount, payment_method)
Step 3: Flip the flag, monitor, delete old code
This pattern lets you change behavior without changing call sites.
Pattern 2: Feature Flags & Gradual Rollout
Route small percentages of traffic to new code.
Implementation:
def generate_report(user_id, report_type):
    rollout_percentage = 10  # Start with 10% of users
    if user_id % 100 < rollout_percentage:
        # New implementation
        return new_reporting_service.generate(user_id, report_type)
    else:
        # Old implementation
        return legacy_report_generator(user_id, report_type)
Rollout schedule:
- Week 1: 5% traffic → monitor for errors, latency, correctness
- Week 2: 20% traffic → compare metrics to baseline
- Week 3: 50% traffic → watch for any edge cases
- Week 4: 100% traffic → delete old code
What to monitor:
- Error rates (old vs new)
- Latency (p50, p95, p99)
- Business metrics (conversion, revenue)
- User complaints
If any metric regresses, roll back to 0% immediately.
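One caveat about the snippet above: with the percentage hard-coded, rolling back means a deploy. In practice you would read the percentage from a runtime flag store so you can drop it to 0% instantly. A minimal sketch, where ROLLOUT_FLAGS stands in for whatever flag service or config store you actually use:

    import hashlib

    # Stand-in for a real flag store (LaunchDarkly, a config service, a DB row).
    # The point is that the value can change at runtime, without a deploy.
    ROLLOUT_FLAGS = {"new_reporting_rollout": 5}  # percentage, 0-100

    def in_rollout(user_id, flag_name):
        percentage = ROLLOUT_FLAGS.get(flag_name, 0)
        # Hash the user id so each user lands in a stable bucket
        bucket = int(hashlib.sha256(str(user_id).encode()).hexdigest(), 16) % 100
        return bucket < percentage

    def generate_report(user_id, report_type):
        if in_rollout(user_id, "new_reporting_rollout"):
            return new_reporting_service.generate(user_id, report_type)
        return legacy_report_generator(user_id, report_type)

Hashing the user id (instead of user_id % 100) keeps buckets stable even when ids are strings or aren't uniformly distributed.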
Pattern 3: Parallel Runs (Shadow Mode)
Run both old and new implementations, compare outputs, but only return old results to users.
Implementation:
def calculate_user_score(user_id):
    # Old implementation (what users see)
    old_score = legacy_scoring_algorithm(user_id)
    # New implementation (shadow mode)
    try:
        new_score = new_scoring_service.calculate(user_id)
        # Log discrepancies for analysis
        if abs(old_score - new_score) > threshold:
            log_mismatch(user_id, old_score, new_score)
    except Exception as e:
        log_error("New scoring failed", e)
    # Always return old result (safe)
    return old_score
Benefits:
- Zero risk to users (they always get old behavior)
- Find bugs and edge cases in new implementation
- Build confidence before switching
When to use:
- High-stakes calculations (pricing, recommendations, risk scores)
- Complex business logic with lots of edge cases
- When you're not confident the new implementation is correct
Data Migration Without Big Bangs
Strangling code is one thing. Strangling data is harder.
The Dual Write Pattern
When migrating from old to new datastore:
Phase 1: Dual write, old read
def update_user(user_id, data):
    # Write to old DB (source of truth)
    old_db.users.update(user_id, data)
    # Also write to new DB (keeping it in sync)
    try:
        new_db.users.update(user_id, data)
    except Exception as e:
        log_error("New DB write failed", e)
        # Don't fail the request; old DB is the source of truth

def get_user(user_id):
    # Read from old DB only
    return old_db.users.get(user_id)
Phase 2: Dual write, new read with fallback
def get_user(user_id):
    # Try reading from new DB first
    user = new_db.users.get(user_id)
    if user is None:
        # Fall back to old DB if not found
        user = old_db.users.get(user_id)
        # Backfill missing data into new DB
        if user:
            new_db.users.create(user)
    return user
Phase 3: Dual write, new read (no fallback)
Once new DB is fully backfilled and verified:
def get_user(user_id):
    # Read from new DB only
    return new_db.users.get(user_id)
Phase 4: Single write to new DB
After monitoring shows old DB isn't needed:
def update_user(user_id, data):
    # Write to new DB only
    new_db.users.update(user_id, data)
Delete old DB schema.
Backfilling Data Gradually
Don't try to migrate all data in one go. Backfill gradually:
Option 1: Background job
# Run hourly; migrates 1000 records per batch
def backfill_users():
    old_users = old_db.users.where(migrated=False).limit(1000)
    for user in old_users:
        new_db.users.create(transform(user))
        old_db.users.update(user.id, migrated=True)
Option 2: Lazy migration
Migrate on read (as shown in Phase 2 above). Eventually all active records get migrated. Inactive records can be archived separately.
Monitoring Data Consistency
During dual-write phases, monitor for drift:
import random

def audit_data_consistency():
    sample_users = random.sample(all_user_ids, 100)
    for user_id in sample_users:
        old_user = old_db.users.get(user_id)
        new_user = new_db.users.get(user_id)
        if not users_match(old_user, new_user):
            alert("Data mismatch detected", user_id)
Run this daily. Fix mismatches before proceeding to next phase.
Communication: Managing Stakeholders and Expectations
The hardest part of strangler fig refactoring isn't technical—it's managing expectations.
The Trap: "We're Refactoring for 12 Months, Please Wait"
This kills projects. Stakeholders hear "no new features for a year" and pull the plug.
Better Approach: Incremental Value
Frame each strangling effort as delivering value, not just cleaning up:
Bad:
"We're refactoring the codebase to improve maintainability."
Good:
"We're rebuilding the reporting system. This will:
- Cut report generation time from 2 minutes to 10 seconds
- Reduce report-related incidents from 3/month to near-zero
- Let us add custom dashboards (feature request from 5 enterprise customers)"
Connect refactoring to business outcomes. Faster, more reliable, enables new features.
Status Reporting That Works
Monthly update template:
What we shipped:
- Migrated 40% of reporting to new service
- Report generation time: 2min → 45sec (55% faster)
- Report incidents: 3 → 1 this month
What's next:
- Complete reporting migration (3 weeks)
- Begin strangling email system (high pain, low risk)
Business impact:
- Unblocked 2 enterprise deals waiting for custom reporting
- Reduced on-call load from report failures
Feature work unchanged:
- Shipped all planned Q4 features on schedule
Notice: metrics before/after, business impact, and feature work continues.
Setting Realistic Timelines
Rule of thumb: strangling takes 2–3x longer than you think for the first module, then gets faster.
First module: 6–8 weeks (learning, tooling, process)
Second module: 3–4 weeks (using established patterns)
Later modules: 1–2 weeks (well-oiled machine)
Plan for learning curve. Don't promise the moon.
Building a Refactor Playbook for Your Org
Turn one-off strangling into repeatable process.
Document Your Patterns
After each successful strangling, write down:
1. What we strangled: "Email notification system"
2. Why we chose it: "High change frequency, high pain, low coupling"
3. Approach used: "Branch by abstraction → feature flag → 5%/20%/50%/100% rollout"
4. Techniques applied:
- Created EmailService interface
- Implemented with SendGrid instead of legacy SMTP
- Feature flag: new_email_service
- Monitored delivery rate, latency, bounce rate
5. Timeline:
- Week 1: Built new service, added interface
- Week 2: Deployed at 5%, monitored
- Week 3: Ramped to 50%
- Week 4: Ramped to 100%, deleted old code
6. Gotchas and lessons:
- HTML template rendering had subtle differences, needed QA review
- Bounce handling required mapping between old and new error codes
- Feature flag cleanup took longer than expected
7. Metrics before/after:
- Email send latency: 800ms → 200ms
- Delivery rate: 94% → 98.5%
- Email-related incidents: 2/month → 0/month
Create Standard Checklists
Pre-Strangling Checklist:
- Map all call sites and dependencies
- Define success metrics (latency, error rate, etc.)
- Create rollback plan
- Set up monitoring and alerts
- Write runbook for feature flag operations
- Brief on-call team
During Rollout Checklist:
- Deploy new code behind feature flag (0%)
- Test manually with flag enabled
- Ramp to 5%, monitor for 2–3 days
- Ramp to 20%, monitor for 2–3 days
- Ramp to 50%, monitor for 2–3 days
- Ramp to 100%, monitor for 1 week
- Delete old code
- Remove feature flag
Post-Strangling Checklist:
- Document what we learned
- Update architecture diagrams
- Share results with team and stakeholders
- Celebrate the win 🎉
Build Institutional Knowledge
The goal: make strangling boring. Not a heroic one-off, but a standard way we evolve the system.
Respect the Old System, Design the New One
Let me close with perspective.
That legacy system everyone complains about? It got the company here. It served customers, generated revenue, and survived real-world chaos. It has battle scars for good reasons.
Yes, it's messy. Yes, it needs to evolve. But it's not garbage—it's a successful system that needs renovation, not demolition.
Why Strangler Fig Wins
Big rewrites fail because:
- They bet the company on 18–24 months of no visible progress
- They try to replicate years of edge case handling from scratch
- They assume requirements won't change (they always do)
- They create two systems to maintain instead of one
Strangler fig succeeds because:
- Each strangling delivers incremental value in weeks, not years
- You learn what the old system actually does by studying it closely
- Feature work continues in parallel
- Risk is bounded—each module is small enough to roll back
- You can stop anytime and still be better off than when you started
The Patient Approach
Strangling is patient work. It's:
- Mapping before coding
- Building abstraction layers
- Gradual rollouts with monitoring
- Celebrating small wins
- Documenting lessons learned
It's not:
- Heroic all-nighters rewriting everything
- "Move fast and break things"
- Ignoring the business while you clean up code
Think in quarters, not years:
- Q1: Strangle email and reporting
- Q2: Strangle background jobs
- Q3: Strangle API layer
- Q4: Strangle data access
Two years later, you have a modern system and you never stopped shipping.
Your Strangler Fig Checklist
Starting a legacy refactor? Use this:
Preparation:
- Map critical user flows and dependencies
- Identify modules by change frequency, pain, criticality, coupling
- Calculate strangler scores and pick first target
- Get stakeholder buy-in by connecting refactor to business outcomes
Execution:
- Use branch by abstraction or similar pattern
- Deploy behind feature flag at 0%
- Gradual rollout: 5% → 20% → 50% → 100%
- Monitor metrics: errors, latency, business KPIs
- Have rollback plan ready
Data migration (if needed):
- Dual write to both old and new datastores
- Backfill gradually or lazily
- Monitor data consistency
- Switch reads to new datastore with fallback
- Delete old datastore only after extended monitoring
Communication:
- Frame as delivering value, not "just refactoring"
- Report metrics before/after
- Show feature work continues in parallel
- Celebrate incremental wins
Knowledge building:
- Document patterns that worked
- Record gotchas and lessons
- Create checklists for next strangling
- Turn hero effort into repeatable process
Legacy systems are archaeological sites. Treat them with respect. Map them carefully. Excavate incrementally.
And remember: the best refactor is the one that ships.
Slow and steady strangles the legacy.
