The AI Refactoring Playbook: When to Trust the Bot, When to Override

TL;DR
After reviewing 200+ AI refactorings, 60% improved code, 30% were neutral, and 10% would cause production issues. Use a 4-quadrant framework: Green Light (trust 90%—extract methods, renames), Yellow Light (verify closely—database, async), Red Light (override often—security, distributed systems). Includes validation checklist and quality tracking metrics.
I've reviewed 200+ AI-suggested refactorings over the past year. Here's what I learned: 60% improved the code, 30% were neutral shuffling, and 10% would have caused production issues if merged blindly.
The problem isn't AI capability—it's knowing when to trust it. Most engineers fall into two camps: blind trust (risky) or blanket skepticism (a wasted opportunity). Both are wrong.
What you need is a framework. This is mine.
The Trust Decision Framework
Not all refactorings carry equal risk. Extract a method? Low risk. Refactor distributed transactions? High risk. You need a mental model to categorize quickly.
The 4-Quadrant Matrix
Think about every AI refactoring along two dimensions:
Complexity: How much does this change?
- Low: Single function, obvious transformation
- High: Multiple files, changes system behavior, affects multiple callers
Risk: What happens if it breaks?
- Low: Caught immediately by tests, easy to revert, limited blast radius
- High: Could cause data corruption, silent failures, affects many users
This creates four quadrants:
Green Quadrant (Low Complexity + Low Risk)
- Trust AI 90%+ of the time
- Quick review, merge confidently
- Examples: Extract method, rename variables, add type annotations
Yellow Quadrant (Mixed)
- Verify closely before merging
- Check edge cases AI might miss
- Examples: Database migrations, async conversions, error handling changes
Red Quadrant (High Complexity + High Risk)
- Default to manual review
- Use AI for ideas, not implementation
- Examples: Cross-service transactions, security-critical code, performance hotspots
The "Blast Radius" Question
Quick way to assess: "If this breaks, how many users/services/systems are affected?"
- 1 service, caught by tests: Green
- Multiple services, might not be caught immediately: Yellow
- Production data or money involved: Red
The "Rollback Cost" Question
"Can I easily revert if something goes wrong?"
- Revert one commit, redeploy one service: Green
- Coordinate rollback across services: Yellow
- Data migration that's hard to reverse: Red
Use these questions in every AI refactoring review. 5 seconds of categorization prevents hours of debugging.
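The two questions fold directly into the matrix. As a quick illustration (the names `Level`, `Quadrant`, and `triage` are mine, not a standard API), the categorization could be sketched as:

```typescript
// Illustrative sketch of the 4-quadrant triage described above.
type Level = "low" | "high";
type Quadrant = "green" | "yellow" | "red";

function triage(complexity: Level, risk: Level): Quadrant {
  if (complexity === "low" && risk === "low") return "green"; // trust 90%+
  if (complexity === "high" && risk === "high") return "red"; // manual review
  return "yellow";                                            // mixed: verify closely
}

console.log(triage("low", "low"));   // "green"
console.log(triage("high", "low")); // "yellow"
console.log(triage("high", "high")); // "red"
```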
Green Light Scenarios: Trust AI 90%+
These refactorings are low-risk, high-value. AI handles them well. Your job: quick sanity check, then merge.
Scenario 1: Extract Method Refactoring
Why AI Excels
AI is exceptional at spotting code duplication. It scans your entire file in seconds, identifies repeated patterns, and suggests extractions. Humans do this too, but slower and less consistently.
Real Example
Before:
class UserService {
  async registerUser(data: RegisterDto) {
    // Validation
    if (!data.email || !data.email.includes('@')) {
      throw new Error('Invalid email');
    }
    if (!data.password || data.password.length < 8) {
      throw new Error('Password must be at least 8 characters');
    }
    if (!data.username || data.username.length < 3) {
      throw new Error('Username must be at least 3 characters');
    }
    return await this.userRepo.create(data);
  }

  async updateUser(id: string, data: UpdateDto) {
    // Same validation duplicated
    if (!data.email || !data.email.includes('@')) {
      throw new Error('Invalid email');
    }
    if (data.password && data.password.length < 8) {
      throw new Error('Password must be at least 8 characters');
    }
    if (!data.username || data.username.length < 3) {
      throw new Error('Username must be at least 3 characters');
    }
    return await this.userRepo.update(id, data);
  }
}
AI Suggestion:
class UserService {
  async registerUser(data: RegisterDto) {
    this.validateUserData(data);
    return await this.userRepo.create(data);
  }

  async updateUser(id: string, data: UpdateDto) {
    this.validateUserData(data);
    return await this.userRepo.update(id, data);
  }

  private validateUserData(data: Partial<RegisterDto>) {
    if (data.email && !data.email.includes('@')) {
      throw new Error('Invalid email');
    }
    if (data.password && data.password.length < 8) {
      throw new Error('Password must be at least 8 characters');
    }
    if (data.username && data.username.length < 3) {
      throw new Error('Username must be at least 3 characters');
    }
  }
}
What to Check:
✅ Method name makes sense (validateUserData is clear)
✅ Parameters are logical (accepts Partial<RegisterDto> for flexibility)
✅ All callers work correctly (both registerUser and updateUser)
✅ Edge cases handled (made validation checks conditional for updates)
Red Flags:
🚩 Method name is vague (processData, handleStuff)
🚩 Extracted method does too many unrelated things
🚩 Logic changed subtly during extraction
Time Investment: 2 minutes review → Merge
Scenario 2: Rename Variables for Clarity
Why AI Excels
AI understands context from surrounding code. It sees a variable called data but knows from usage it represents customerOrders. Humans know this too, but naming is tedious. AI makes it instant.
Real Example
Before:
function processOrders(data: any[]) {
  const result = [];
  for (const item of data) {
    const x = item.price * item.quantity;
    const y = x * 0.1;
    const z = x + y;
    result.push({
      id: item.id,
      total: z
    });
  }
  return result;
}
AI Suggestion:
function processOrders(orders: Order[]) {
  const processedOrders = [];
  for (const order of orders) {
    const subtotal = order.price * order.quantity;
    const tax = subtotal * 0.1;
    const total = subtotal + tax;
    processedOrders.push({
      id: order.id,
      total: total
    });
  }
  return processedOrders;
}
What to Check:
✅ Names accurately describe the data
✅ Names follow team conventions (camelCase, descriptive)
✅ No breaking changes (public API parameters unchanged or deprecated properly)
Red Flags:
🚩 AI renames public API parameters (breaking change for consumers)
🚩 Names don't match team conventions (team uses order_id, AI suggests orderId)
🚩 Names are too verbose (customerOrderTotalIncludingTaxAndShipping vs. total)
Time Investment: 1 minute review → Merge
Scenario 3: Type Annotation Additions
Why AI Excels
AI infers types from usage. It sees user.email and knows user has an email property. It sees your function returns a number and adds number return type. Perfect for gradually typing legacy JavaScript or improving TypeScript strictness.
Real Example
Before (JavaScript/Loose TypeScript):
function calculateDiscount(user, cart) {
  const items = cart.items;
  let total = 0;
  for (const item of items) {
    total += item.price * item.quantity;
  }
  if (user.isPremium) {
    return total * 0.2;
  }
  return total * 0.1;
}
AI Suggestion:
interface User {
  isPremium: boolean;
}
interface CartItem {
  price: number;
  quantity: number;
}
interface Cart {
  items: CartItem[];
}
function calculateDiscount(user: User, cart: Cart): number {
  const items: CartItem[] = cart.items;
  let total: number = 0;
  for (const item of items) {
    total += item.price * item.quantity;
  }
  if (user.isPremium) {
    return total * 0.2;
  }
  return total * 0.1;
}
What to Check:
✅ Types are accurate (not just any everywhere)
✅ Types aren't overly complex (simple is better)
✅ Interfaces reuse existing types when possible
Red Flags:
🚩 AI adds any everywhere (defeats the purpose)
🚩 Complex union types that hurt readability: string | number | boolean | null | undefined | CustomType
🚩 Doesn't reuse existing type definitions (creates duplicate User interface when one exists)
Time Investment: 2 minutes review → Merge
Scenario 4: Simple Pattern Migrations
Why AI Excels
Mechanical transformations: callbacks → async/await, var → const/let, deprecated API → new API. AI follows the pattern exactly. Less error-prone than humans doing repetitive changes.
Real Example: Callbacks to Async/Await
Before:
function loadUserData(userId: string, callback: (err: Error | null, user?: User) => void) {
  db.query('SELECT * FROM users WHERE id = ?', [userId], (err, results) => {
    if (err) {
      callback(err);
      return;
    }
    const user = results[0];
    db.query('SELECT * FROM orders WHERE user_id = ?', [userId], (err, orders) => {
      if (err) {
        callback(err);
        return;
      }
      user.orders = orders;
      callback(null, user);
    });
  });
}
AI Suggestion:
async function loadUserData(userId: string): Promise<User> {
  const results = await db.query('SELECT * FROM users WHERE id = ?', [userId]);
  const user = results[0];
  const orders = await db.query('SELECT * FROM orders WHERE user_id = ?', [userId]);
  user.orders = orders;
  return user;
}
What to Check:
✅ Error handling still works (async functions reject on errors)
✅ All callbacks converted (no half-async code)
✅ Awaits are in the right places
✅ Edge cases handled (empty results, null checks)
Red Flags:
🚩 Missing awaits (function marked async but doesn't await promises)
🚩 Error handling removed (callback had error handling, async version doesn't)
🚩 Changed behavior subtly (callback was non-blocking, now it blocks)
Time Investment: 3 minutes review → Merge
Key Principle for Green Light Refactorings
Trust but verify. AI is 90%+ accurate on these patterns. Your job: catch the 10% where AI missed something. 2-3 minute review is all you need.
If you find yourself spending 15+ minutes reviewing a "simple" refactoring, it's not actually simple. Reclassify it to Yellow or Red.
Yellow Light Scenarios: Verify Closely
These refactorings look good at first glance but have hidden complexity. AI generates plausible code, but you must check deeply.
Scenario 1: Database Schema Changes
Why Caution Needed
Databases have state. Changing schema affects existing data. AI generates the "after" state but often misses migration complexity, data loss risks, and backward compatibility.
Real Example: Adding a Column
Before:
CREATE TABLE orders (
  id INT PRIMARY KEY,
  user_id INT NOT NULL,
  total DECIMAL(10,2) NOT NULL,
  created_at TIMESTAMP NOT NULL
);
AI Suggestion:
-- Add status column
ALTER TABLE orders ADD COLUMN status VARCHAR(50) NOT NULL DEFAULT 'pending';
-- Update application code
class Order {
  id: number;
  userId: number;
  total: number;
  status: string; // NEW
  createdAt: Date;
}
Looks Good, But Check These:
1. Default Value Appropriateness
-- AI suggests: DEFAULT 'pending'
-- But: What about existing orders? Are they all actually pending?
-- Reality: Existing orders are probably 'completed' or 'shipped'
-- Better migration:
ALTER TABLE orders ADD COLUMN status VARCHAR(50);
UPDATE orders SET status = 'completed' WHERE created_at < NOW() - INTERVAL '7 days';
UPDATE orders SET status = 'shipped' WHERE created_at >= NOW() - INTERVAL '7 days';
ALTER TABLE orders ALTER COLUMN status SET NOT NULL;
ALTER TABLE orders ALTER COLUMN status SET DEFAULT 'pending';
2. Backward Compatibility
// AI updates the model immediately
// Problem: Old application instances still running
// Better: Make nullable first, deploy, then make NOT NULL later
ALTER TABLE orders ADD COLUMN status VARCHAR(50) NULL;
// Application handles both cases
class Order {
  status: string | null; // Handles old orders without status
}
// Later (after all instances deployed):
UPDATE orders SET status = 'completed' WHERE status IS NULL;
ALTER TABLE orders ALTER COLUMN status SET NOT NULL;
3. Index Needs
-- AI adds column but forgets index
-- If you query by status frequently:
SELECT * FROM orders WHERE status = 'pending';
-- You need an index:
CREATE INDEX idx_orders_status ON orders(status);
4. Migration Rollback
-- AI provides forward migration
-- What if you need to rollback?
-- Forward:
ALTER TABLE orders ADD COLUMN status VARCHAR(50) NOT NULL DEFAULT 'pending';
-- Rollback:
ALTER TABLE orders DROP COLUMN status;
-- But: If old code still references status column, rollback breaks!
-- Better: Feature flag + gradual rollout
Validation Checklist:
- Default value makes sense for existing data
- Migration script handles all existing data correctly
- No data loss (test on production-sized dataset)
- Backward compatible (old code can run during deployment)
- Indexes added for new query patterns
- Rollback plan exists and tested
- Migration tested on production snapshot
Real War Story
Team added NOT NULL column with default value. Migration took 3 hours on 50M row table, locking the table. Production down for 3 hours. Better approach: Add nullable column, backfill asynchronously, make NOT NULL later.
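The asynchronous backfill from that war story can be sketched as follows. This is a hypothetical example: the `db.execute` client is illustrative, and the `LIMIT`-based batched UPDATE is MySQL-style (Postgres would batch with an id-range or ctid subquery instead).

```typescript
// Hypothetical batched backfill: many small UPDATEs instead of one
// long-running statement that locks the table for hours.
interface Db {
  execute(sql: string): Promise<{ rowCount: number }>;
}

async function backfillStatus(db: Db, batchSize = 10_000): Promise<number> {
  let updated = 0;
  for (;;) {
    const res = await db.execute(
      `UPDATE orders SET status = 'completed' WHERE status IS NULL LIMIT ${batchSize}`
    );
    if (res.rowCount === 0) break; // nothing left to backfill
    updated += res.rowCount;
  }
  return updated;
}
```

Each batch commits independently, so locks stay short and the backfill can be paused or resumed; only once it reports zero remaining rows do you flip the column to NOT NULL.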
Scenario 2: Async/Sync Boundary Modifications
Why Caution Needed
Converting sync to async (or vice versa) changes control flow. Race conditions appear. Callers break. Transaction boundaries shift. AI sees the syntax change but misses the concurrency implications.
Real Example: Sync to Async Conversion
Before:
class OrderService {
  createOrder(userId: string, items: OrderItem[]): Order {
    const user = this.userRepo.findById(userId);
    const inventory = this.inventoryRepo.checkAvailability(items);
    if (!inventory.allAvailable) {
      throw new Error('Items not available');
    }
    const order = this.orderRepo.create({ userId, items });
    this.inventoryRepo.reserve(items);
    return order;
  }
}
AI Suggestion:
class OrderService {
  async createOrder(userId: string, items: OrderItem[]): Promise<Order> {
    const user = await this.userRepo.findById(userId);
    const inventory = await this.inventoryRepo.checkAvailability(items);
    if (!inventory.allAvailable) {
      throw new Error('Items not available');
    }
    const order = await this.orderRepo.create({ userId, items });
    await this.inventoryRepo.reserve(items);
    return order;
  }
}
Looks Good, But Check These:
1. All Callers Updated
// Old caller (now BROKEN):
const order = orderService.createOrder(userId, items);
console.log(order.id); // Works in sync version
// AI might update this:
const order = await orderService.createOrder(userId, items);
console.log(order.id); // Fixed
// But AI might MISS this:
const order = orderService.createOrder(userId, items); // Returns Promise!
sendEmail(order.id); // UNDEFINED! Promise hasn't resolved
Check ALL callers in your entire codebase. One missed caller = production bug.
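A runtime demonstration of the miss (all names hypothetical): without `await`, the variable holds a pending Promise, and any property read on it comes back undefined.

```typescript
// Hypothetical async service call; the point is what a caller sees
// when `await` is forgotten.
async function createOrder(userId: string): Promise<{ id: string }> {
  return { id: `order-${userId}` };
}

async function demo() {
  const missed = createOrder("u1") as unknown as { id: string }; // forgot await
  const correct = await createOrder("u1");
  console.log(missed.id);  // undefined (it's still a pending Promise)
  console.log(correct.id); // "order-u1"
}
demo();
```

With accurate types, the TypeScript compiler flags the `missed.id` access, and lint rules such as `@typescript-eslint/no-floating-promises` catch fire-and-forget calls that the type checker lets through.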
2. Transaction Boundaries
// Original (sync): Everything in one transaction
createOrder(userId, items) {
  // These execute in single database transaction
  const order = this.orderRepo.create({ userId, items });
  this.inventoryRepo.reserve(items);
  return order;
}
// AI version (async): Transaction might close between awaits!
async createOrder(userId, items) {
  const order = await this.orderRepo.create({ userId, items });
  // Transaction might commit here!
  await this.inventoryRepo.reserve(items);
  // Inventory reserved outside transaction - race condition!
}
// Fixed: Explicit transaction
async createOrder(userId, items) {
  return await this.db.transaction(async (trx) => {
    const order = await this.orderRepo.create({ userId, items }, trx);
    await this.inventoryRepo.reserve(items, trx);
    return order;
  });
}
3. Error Handling Changes
// Sync version:
try {
  const order = createOrder(userId, items);
  console.log('Order created:', order.id);
} catch (error) {
  console.error('Failed:', error);
}
// AI conversion might work, but check:
try {
  const order = await createOrder(userId, items);
  console.log('Order created:', order.id);
} catch (error) {
  console.error('Failed:', error);
  // Are unhandled promise rejections caught elsewhere?
}
// Watch for: async functions that don't await properly
async function handler() {
  createOrder(userId, items); // Forgot await! Errors won't be caught
}
4. Performance Implications
// Sequential (slow):
const user = await this.userRepo.findById(userId); // Wait 10ms
const inventory = await this.inventoryRepo.check(items); // Wait 10ms
// Total: 20ms
// Parallel (faster):
const [user, inventory] = await Promise.all([
  this.userRepo.findById(userId),
  this.inventoryRepo.check(items)
]);
// Total: 10ms
// AI might convert correctly but miss optimization opportunity
Validation Checklist:
- All callers updated (search entire codebase)
- No missing awaits
- Transaction boundaries preserved
- Error handling still works
- No unhandled promise rejections
- Consider parallelization opportunities
- Test concurrent scenarios (race conditions)
Scenario 3: Error Handling Rewrites
Why Caution Needed
Error handling is business logic. How you handle errors affects user experience, debugging, and system reliability. AI consolidates try-catch blocks for "cleaner" code but might change behavior subtly.
Real Example
Before:
async function processPayment(orderId: string, amount: number) {
  let payment;
  try {
    payment = await stripeService.charge(amount);
  } catch (error) {
    logger.error('Stripe charge failed', { orderId, amount, error });
    await notifyAdmin('Payment processing failed', { orderId });
    throw new PaymentError('Payment failed. Please try again.');
  }
  try {
    await orderRepo.updatePaymentStatus(orderId, payment.id);
  } catch (error) {
    logger.error('Failed to update order', { orderId, paymentId: payment.id, error });
    // Payment succeeded but order update failed - need manual intervention
    await notifyAdmin('CRITICAL: Payment succeeded but order update failed', {
      orderId,
      paymentId: payment.id
    });
    throw new SystemError('Payment processed but order update failed');
  }
  return payment;
}
AI Suggestion (Consolidated):
async function processPayment(orderId: string, amount: number) {
  try {
    const payment = await stripeService.charge(amount);
    await orderRepo.updatePaymentStatus(orderId, payment.id);
    return payment;
  } catch (error) {
    logger.error('Payment processing failed', { orderId, amount, error });
    await notifyAdmin('Payment processing failed', { orderId });
    throw new PaymentError('Payment failed. Please try again.');
  }
}
What AI Lost:
1. Different Error Handling Per Stage
Original: Stripe failure vs. order update failure are handled differently.
- Stripe fails: "Payment failed, try again" (user can retry)
- Order update fails: "CRITICAL alert" (payment succeeded, need manual fix)
AI version: Both errors treated the same. If order update fails after payment succeeds, user gets "Payment failed, try again"—but payment actually succeeded! They'll be double-charged if they retry.
2. Context in Logging
Original: Logs include paymentId when order update fails (needed for manual reconciliation)
AI version: Lost this context
3. Error Message Accuracy
Original: Different error messages based on where failure occurred
AI version: Generic message doesn't reflect reality
Fixed AI Version:
async function processPayment(orderId: string, amount: number) {
  let payment;
  // Stage 1: Charge payment
  try {
    payment = await stripeService.charge(amount);
  } catch (error) {
    logger.error('Stripe charge failed', { orderId, amount, error });
    await notifyAdmin('Payment processing failed', { orderId });
    throw new PaymentError('Payment failed. Please try again.');
  }
  // Stage 2: Update order (critical - payment already succeeded)
  try {
    await orderRepo.updatePaymentStatus(orderId, payment.id);
  } catch (error) {
    logger.error('Order update failed after successful payment', {
      orderId,
      paymentId: payment.id,
      error
    });
    await notifyAdmin('CRITICAL: Payment succeeded but order update failed', {
      orderId,
      paymentId: payment.id,
      action: 'Manual reconciliation required'
    });
    throw new SystemError(
      'Payment processed successfully but order update failed. ' +
      'Please contact support. Payment ID: ' + payment.id
    );
  }
  return payment;
}
Validation Checklist:
- Error messages still accurate for each failure scenario
- Logging includes context needed for debugging
- Alerts appropriate for severity (critical vs. warning)
- User-facing errors reflect reality
- No silent failures introduced
- Retry logic preserved where needed
- Test each error scenario explicitly
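The last checklist item can be automated. A minimal sketch (the injected stubs and their names are mine, not the article's actual services): the two-stage shape from above with its dependencies passed in, so a test can force each stage to fail and assert that a different error type surfaces.

```typescript
class PaymentError extends Error {}
class SystemError extends Error {}

// Two-stage payment flow with injectable dependencies for testing.
async function processPayment(
  charge: (amount: number) => Promise<{ id: string }>,
  updateOrder: (paymentId: string) => Promise<void>,
  amount: number
): Promise<{ id: string }> {
  let payment;
  try {
    payment = await charge(amount);
  } catch {
    throw new PaymentError('Payment failed. Please try again.');
  }
  try {
    await updateOrder(payment.id);
  } catch {
    throw new SystemError('Payment processed but order update failed');
  }
  return payment;
}

// Stage 1 failure must be retryable; stage 2 failure must page a human.
async function run() {
  const failingCharge = async () => { throw new Error('stripe down'); };
  const okCharge = async () => ({ id: 'pay_1' });
  const failingUpdate = async () => { throw new Error('db down'); };

  await processPayment(failingCharge, failingUpdate, 100)
    .catch(e => console.log(e instanceof PaymentError)); // true
  await processPayment(okCharge, failingUpdate, 100)
    .catch(e => console.log(e instanceof SystemError));  // true
}
run();
```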
Real War Story
AI refactored error handling in authentication service. Consolidated catch blocks. Removed distinction between "invalid password" (user can retry) and "account locked" (user must contact support). Result: Locked users kept trying to log in, hitting rate limits, making situation worse.
Red Light Scenarios: Override Often
These are where AI suggestions are dangerous. Even if the code looks right, the complexity and risk are too high. Use AI for ideas, but humans must design the solution.
Scenario 1: Cross-Service Transaction Logic
Why Humans Must Lead
Distributed transactions require understanding business invariants, eventual consistency, compensation logic, and failure modes. AI suggests technically correct code that violates business rules or introduces race conditions.
Real Example
Original (Monolith):
async function createOrder(userId: string, items: OrderItem[]) {
  return await db.transaction(async (trx) => {
    // Everything in one ACID transaction
    const order = await trx('orders').insert({ userId, total: 0 });
    let total = 0;
    for (const item of items) {
      await trx('order_items').insert({ orderId: order.id, ...item });
      await trx('inventory').decrement('quantity', item.quantity).where({ id: item.productId });
      total += item.price * item.quantity;
    }
    await trx('orders').update({ total }).where({ id: order.id });
    await trx('payments').insert({ orderId: order.id, amount: total, status: 'pending' });
    return order;
  });
}
AI Suggestion (Microservices):
async function createOrder(userId: string, items: OrderItem[]) {
  // AI tries to make it "distributed transaction"
  const order = await orderService.createOrder({ userId, items });
  try {
    await inventoryService.reserveItems(items);
    await paymentService.initiatePayment(order.id, order.total);
    return order;
  } catch (error) {
    // Rollback?
    await orderService.deleteOrder(order.id);
    throw error;
  }
}
What AI Got Wrong:
1. No Distributed Transaction Guarantees
If payment fails after inventory reserved:
- Order deleted ✓
- But inventory still reserved ✗
If deleteOrder fails:
- Order exists
- Inventory reserved
- Payment might partially succeed
This isn't a transaction—it's a collection of API calls that can fail independently.
2. Doesn't Handle Partial Failures
What if inventoryService is down but order already created? User gets error, but order exists in database. They retry, creating duplicate order.
3. No Compensation Logic
Real-world approach: Saga pattern with compensation
Correct Human-Designed Solution:
// Saga coordinator
async function createOrderSaga(userId: string, items: OrderItem[]) {
  const sagaId = uuid();
  let order, reservationId, paymentId;
  try {
    // Step 1: Create order (pending state)
    order = await orderService.createOrder({
      userId,
      items,
      status: 'pending',
      sagaId
    });
    // Step 2: Reserve inventory
    reservationId = await inventoryService.reserve({
      items,
      orderId: order.id,
      sagaId
    });
    // Step 3: Initiate payment
    paymentId = await paymentService.charge({
      amount: order.total,
      orderId: order.id,
      sagaId
    });
    // Step 4: Confirm order
    await orderService.confirmOrder(order.id);
    return order;
  } catch (error) {
    // Compensation: Undo in reverse order
    logger.error('Order saga failed, compensating', {
      sagaId,
      orderId: order?.id,
      step: error.step
    });
    if (paymentId) {
      await paymentService.refund(paymentId, { reason: 'saga-compensation' });
    }
    if (reservationId) {
      await inventoryService.releaseReservation(reservationId);
    }
    if (order) {
      await orderService.cancelOrder(order.id, { reason: 'saga-failed' });
    }
    throw new OrderCreationError('Order could not be completed', {
      sagaId,
      originalError: error
    });
  }
}
// Services support saga operations
class InventoryService {
  async reserve(params: ReserveParams): Promise<string> {
    const reservation = await this.db.reservations.create({
      ...params,
      expiresAt: Date.now() + 15 * 60 * 1000, // 15 min expiry
      status: 'reserved'
    });
    // Reserve inventory
    await this.db.inventory.decrement('available', params.quantity);
    return reservation.id;
  }
  async releaseReservation(reservationId: string): Promise<void> {
    const reservation = await this.db.reservations.findById(reservationId);
    // Return inventory to available pool
    await this.db.inventory.increment('available', reservation.quantity);
    await this.db.reservations.update(reservationId, { status: 'released' });
  }
}
Why Human Design Is Critical:
- Business rules: What's the order of operations? What can be undone?
- Timeouts: How long should inventory be reserved?
- Idempotency: What if user retries during saga?
- Monitoring: How do we track saga state?
- Failure modes: Partial compensation failures?
AI can generate code structure, but humans must design the saga flow based on business requirements.
Validation Checklist:
- Business invariants documented and preserved
- Saga steps in correct order (can later steps be compensated?)
- Compensation logic for each step
- Idempotency (safe to retry)
- Timeout handling (reservation expiry)
- Saga state tracking (for debugging)
- Test: What happens if any step fails?
- Test: What happens if compensation fails?
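One way to make the "safe to retry" item concrete: key each saga by a client-supplied idempotency key and return the recorded result on retry. A minimal in-memory sketch (the names are mine; a real system would persist this in the saga-state store, not a Map):

```typescript
// Hypothetical idempotency guard; `completed` stands in for a durable store.
const completed = new Map<string, { orderId: string }>();

async function runOrderSagaOnce(
  idempotencyKey: string,
  saga: () => Promise<{ orderId: string }>
): Promise<{ orderId: string }> {
  const prior = completed.get(idempotencyKey);
  if (prior) return prior; // retry: hand back the first result, don't re-run the saga

  const result = await saga();
  completed.set(idempotencyKey, result);
  return result;
}

async function demo() {
  let runs = 0;
  const saga = async () => ({ orderId: `order-${++runs}` });
  const first = await runOrderSagaOnce("key-1", saga);
  const retry = await runOrderSagaOnce("key-1", saga);
  console.log(first.orderId, retry.orderId, runs); // order-1 order-1 1
}
demo();
```

Note this only covers retries after completion; a retry that arrives while the saga is still in flight additionally needs a 'pending' marker or lock so two copies of the saga can't run concurrently.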
Scenario 2: Security-Critical Code Paths
Why Humans Must Lead
Security requires threat modeling. AI doesn't think like an attacker. It makes code "cleaner" but removes defense-in-depth checks, introduces timing vulnerabilities, or changes authentication subtly.
Real Example
Before:
async function verifyPasswordAndLogin(username: string, password: string) {
  const user = await userRepo.findByUsername(username);
  if (!user) {
    // Intentional delay to prevent timing attacks
    await bcrypt.hash(password, 12);
    throw new AuthError('Invalid credentials');
  }
  const isValid = await bcrypt.compare(password, user.passwordHash);
  if (!isValid) {
    // Log failed attempt
    await auditLog.create({
      event: 'login_failed',
      username,
      ip: request.ip,
      timestamp: Date.now()
    });
    // Increment failed attempt counter
    await rateLimiter.increment(username);
    throw new AuthError('Invalid credentials');
  }
  // Check if account is locked
  if (user.lockedUntil && user.lockedUntil > Date.now()) {
    throw new AuthError('Account locked. Try again later.');
  }
  // Reset failed attempts on successful login
  await rateLimiter.reset(username);
  // Generate session
  const token = jwt.sign({ userId: user.id }, process.env.JWT_SECRET, {
    expiresIn: '1h'
  });
  return { token, user };
}
AI Suggestion ("Cleaner"):
async function verifyPasswordAndLogin(username: string, password: string) {
  const user = await userRepo.findByUsername(username);
  if (!user || !(await bcrypt.compare(password, user.passwordHash))) {
    throw new AuthError('Invalid credentials');
  }
  const token = jwt.sign({ userId: user.id }, process.env.JWT_SECRET, {
    expiresIn: '1h'
  });
  return { token, user };
}
What AI Removed (Each is a security issue):
1. Timing Attack Prevention
Original: If user doesn't exist, still runs bcrypt (expensive operation). Attacker can't determine if username exists based on response time.
AI version: If user doesn't exist, returns immediately. Attacker can enumerate valid usernames by timing the response.
2. Audit Logging
Original: Logs all failed login attempts for security monitoring
AI version: No logging. Can't detect brute force attacks or compromised accounts.
3. Rate Limiting
Original: Increments failed attempt counter, locks account after threshold
AI version: No rate limiting. Attacker can try unlimited passwords.
4. Account Lockout Check
Original: Checks if account is locked before issuing token
AI version: No check. Locked accounts can still log in.
5. Failed Attempt Reset
Original: Resets counter on successful login (prevents permanent lockout from old failed attempts)
AI version: No reset logic.
Correct Refactoring (If Any):
class AuthService {
  async verifyPasswordAndLogin(username: string, password: string) {
    // Always log authentication attempts
    const attemptId = await auditLog.create({
      event: 'login_attempt',
      username,
      ip: request.ip,
      timestamp: Date.now()
    });
    try {
      const user = await userRepo.findByUsername(username);
      // Timing attack prevention: Always perform expensive operation
      if (!user) {
        await bcrypt.hash(password, 12); // Same cost as compare
        throw new AuthError('Invalid credentials');
      }
      // Check lockout before password verification (prevent brute force)
      if (user.lockedUntil && user.lockedUntil > Date.now()) {
        await auditLog.update(attemptId, { result: 'account_locked' });
        throw new AuthError('Account locked. Contact support.');
      }
      // Verify password
      const isValid = await bcrypt.compare(password, user.passwordHash);
      if (!isValid) {
        await this.handleFailedLogin(user, attemptId);
        throw new AuthError('Invalid credentials');
      }
      // Successful login
      await this.handleSuccessfulLogin(user, attemptId);
      return {
        token: this.generateToken(user),
        user: this.sanitizeUser(user)
      };
    } catch (error) {
      // Don't leak information in error messages
      if (error instanceof AuthError) {
        throw error;
      }
      // Log unexpected errors but return generic message
      logger.error('Login error', { username, error });
      throw new AuthError('Authentication failed');
    }
  }

  private async handleFailedLogin(user: User, attemptId: string) {
    await auditLog.update(attemptId, { result: 'failed', userId: user.id });
    const failedAttempts = await rateLimiter.increment(user.username);
    // Lock account after threshold
    if (failedAttempts >= 5) {
      const lockDuration = Math.min(60 * 60 * 1000 * (2 ** (failedAttempts - 5)), 24 * 60 * 60 * 1000);
      await userRepo.update(user.id, {
        lockedUntil: Date.now() + lockDuration
      });
      await notifyUser(user.email, 'account_locked');
    }
  }

  private async handleSuccessfulLogin(user: User, attemptId: string) {
    await auditLog.update(attemptId, { result: 'success', userId: user.id });
    await rateLimiter.reset(user.username);
    // Update last login
    await userRepo.update(user.id, {
      lastLoginAt: Date.now(),
      lastLoginIp: request.ip
    });
  }
}
Why Human Review Is Critical:
- Threat modeling: What attacks are possible?
- Defense in depth: Multiple layers of security
- Behavioral security: Rate limiting, anomaly detection
- Audit requirements: Compliance needs
- Error handling: Don't leak information
AI optimizes for code simplicity. Security requires deliberate complexity.
Validation Checklist:
- Timing attacks prevented
- Rate limiting in place
- Failed attempts logged
- Account lockout implemented
- No information leakage in errors
- Token expiry appropriate
- Defense in depth maintained
- Penetration test after changes
Scenario 3: Performance-Sensitive Hotspots
Why Humans Must Lead
Performance requires profiling data. AI makes code "cleaner" or "more functional" without considering that this path processes 10M records per hour. Algorithmic complexity matters at scale.
Real Example
Before (Optimized but less readable):
function processOrders(orders: Order[]): ProcessedOrder[] {
  // Pre-allocate array for performance
  const processed = new Array(orders.length);
  // Process in place, avoid allocations
  for (let i = 0; i < orders.length; i++) {
    const order = orders[i];
    processed[i] = {
      id: order.id,
      total: order.items.reduce((sum, item) => sum + (item.price * item.quantity), 0),
      itemCount: order.items.length,
      userId: order.userId
    };
  }
  return processed;
}
AI Suggestion ("More Functional"):
function processOrders(orders: Order[]): ProcessedOrder[] {
  return orders.map(order => ({
    id: order.id,
    total: order.items
      .map(item => ({ total: item.price * item.quantity }))
      .reduce((sum, item) => sum + item.total, 0),
    itemCount: order.items.length,
    userId: order.userId
  }));
}
What Changed:
Functionally equivalent, but:
Original:
- Pre-allocates result array
- Single pass through data
- Minimal allocations
- ~100ms for 100K orders
AI Version:
- Map creates new array
- Items mapped to intermediate objects
- Then reduced
- Multiple allocations per order
- ~280ms for 100K orders (2.8x slower)
At Scale:
- 1M orders/hour
- Original: ~1 second of CPU time per hour
- AI version: ~2.8 seconds of CPU time per hour
- That's 1.8 extra CPU-seconds per hour. Negligible on one instance, but at 1,000x the volume (or across 1,000 instances) it becomes 30 minutes of extra CPU time per hour
When AI Optimization Is Acceptable:
If this runs once per user request and processes 10 orders, both versions finish in microseconds. The difference is negligible. Functional style is fine.
If this runs in a batch job processing millions of records: The difference matters. Keep performance-optimized version.
The Problem: AI doesn't know which scenario you're in.
Validation Checklist:
- Profile before/after with production-like data volumes
- Check algorithmic complexity (O(n) → O(n²) is red flag)
- Measure memory allocations
- Consider this at scale (10x, 100x current volume)
- Benchmark hot paths
- Review profiling data (where is time spent?)
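To verify claims like the 2.8x slowdown above rather than trust them, a throwaway micro-benchmark is enough. Here's a sketch that pits the two versions against each other at a production-like volume. Both functions are renamed copies of the snippets earlier in this section, and the synthetic order shape is an assumption:

```typescript
type Item = { price: number; quantity: number };
type Order = { id: string; userId: string; items: Item[] };

// Renamed copy of the performance-optimized version
function processOrdersOptimized(orders: Order[]) {
  const processed = new Array(orders.length);
  for (let i = 0; i < orders.length; i++) {
    const order = orders[i];
    let total = 0;
    for (const item of order.items) total += item.price * item.quantity;
    processed[i] = { id: order.id, total, itemCount: order.items.length, userId: order.userId };
  }
  return processed;
}

// Renamed copy of the AI's "more functional" version
function processOrdersFunctional(orders: Order[]) {
  return orders.map(order => ({
    id: order.id,
    total: order.items
      .map(item => ({ total: item.price * item.quantity }))
      .reduce((sum, item) => sum + item.total, 0),
    itemCount: order.items.length,
    userId: order.userId,
  }));
}

// 100K synthetic orders, roughly the volume quoted in the text
const orders: Order[] = Array.from({ length: 100_000 }, (_, i) => ({
  id: `o${i}`,
  userId: `u${i % 1000}`,
  items: [{ price: 10, quantity: 2 }, { price: 5, quantity: 1 }],
}));

for (const [name, fn] of [
  ["optimized", processOrdersOptimized],
  ["functional", processOrdersFunctional],
] as const) {
  const start = performance.now();
  const result = fn(orders);
  console.log(`${name}: ${(performance.now() - start).toFixed(1)}ms for ${result.length} orders`);
}
```

Exact numbers will differ by machine and runtime. The point is that a ten-minute script like this settles the argument with data instead of style preferences.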
When to Use AI:
Generate algorithmic alternatives, then benchmark:
Prompt: "This function processes 1M orders per hour and takes 2 seconds.
Suggest 3 optimization approaches with expected complexity improvements."
AI might suggest:
1. Parallel processing (split into batches)
2. Streaming instead of array mapping
3. Caching repeated calculations
Then YOU benchmark each and choose.
AI is the idea generator; humans measure and decide.
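As an illustration, option 2 (streaming) might be sketched as a generator. This is a hypothetical variant, and whether it actually wins is exactly what your benchmark decides:

```typescript
type Item = { price: number; quantity: number };
type Order = { id: string; userId: string; items: Item[] };

// Yield processed orders one at a time instead of materializing the
// whole result array, so memory stays flat regardless of batch size.
function* processOrdersStream(orders: Iterable<Order>) {
  for (const order of orders) {
    let total = 0;
    for (const item of order.items) total += item.price * item.quantity;
    yield { id: order.id, total, itemCount: order.items.length, userId: order.userId };
  }
}

// Consumers pull results lazily, e.g. writing each one to a sink
const sample: Order[] = [
  { id: "o1", userId: "u1", items: [{ price: 10, quantity: 2 }] },
];
for (const processed of processOrdersStream(sample)) {
  console.log(processed.total); // 20
}
```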
The Validation Checklist
Before merging any AI refactoring, ask these questions:
1. Functional Correctness
- Does it do exactly what the original code did?
- Are edge cases handled the same way?
- Does it pass all existing tests?
2. Behavior Preservation
- Are error messages the same (or better)?
- Is logging/monitoring preserved?
- Are side effects identical?
- Is timing/ordering preserved?
3. API Compatibility
- Are function signatures unchanged (or deprecated properly)?
- Do all callers still work?
- Is backward compatibility maintained?
4. Tests
- Do existing tests pass?
- Are new tests needed?
- Are tests updated if behavior changed intentionally?
5. Performance
- Is performance the same or better?
- For hot paths: benchmark before/after
- Check algorithmic complexity didn't regress
6. Security
- No security checks removed?
- Auth/authz logic unchanged?
- No new vulnerabilities introduced?
7. Rollback Plan
- Can I revert this easily?
- What's the blast radius if it breaks?
- Is there data I can't easily roll back (schema changes)?
8. Team Review
- For yellow/red scenarios: Get second opinion
- Explain AI changes in PR description
- Highlight non-obvious changes
9. Monitoring
- Add monitoring if refactoring is risky
- Plan to watch error rates after deploy
- Can I detect issues quickly if they occur?
10. Documentation
- Are comments still accurate?
- Does documentation need updating?
- Are architectural decisions explained?
The "Sleep On It" Rule
For any refactoring in yellow/red zones: Wait 24 hours before merging.
Fresh eyes catch issues. Come back tomorrow, review again. If you still feel confident, merge.
Measuring Refactoring Quality Over Time
Track these metrics to improve your AI refactoring process:
Metric 1: Defect Rate
Track: Bugs caused by AI refactorings vs. manual refactorings
Target: AI should be equal or better than manual (not worse)
How to Measure:
- Tag commits: "AI-assisted refactoring" or "Manual refactoring"
- Track production bugs back to commits
- Calculate bugs per 100 refactoring commits
Example:
- Manual refactoring: 3 bugs per 100 commits
- AI refactoring: 2 bugs per 100 commits
- Conclusion: AI refactoring is actually safer (with your review process)
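The calculation itself is trivial once commits are tagged. Here's a sketch with an assumed record shape; adapt the `tag` and `causedBug` fields to however your team annotates commits and links production bugs back to them:

```typescript
// Hypothetical commit record: how bugs are traced back to commits
// (e.g. via a "Caused-by:" footer in fix commits) is up to your process.
type Commit = { sha: string; tag: "ai" | "manual"; causedBug: boolean };

// Bugs per 100 refactoring commits for the given tag
function defectRate(commits: Commit[], tag: "ai" | "manual"): number {
  const tagged = commits.filter(c => c.tag === tag);
  if (tagged.length === 0) return 0;
  const bugs = tagged.filter(c => c.causedBug).length;
  return (bugs / tagged.length) * 100;
}
```

Run it monthly over the trailing few hundred commits; a single month rarely has enough refactorings for the rate to be meaningful.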
Metric 2: Code Quality Metrics
Track: Complexity, maintainability, duplication before/after
Tools: SonarQube, CodeClimate, ESLint complexity rules
Example:
- Before: Cyclomatic complexity 18
- After AI refactoring: Cyclomatic complexity 6
- Conclusion: Refactoring improved quality
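If you use ESLint, complexity can be enforced rather than just tracked, via the built-in `complexity` rule. A minimal flat-config sketch; the threshold of 10 is an arbitrary example, not a recommendation:

```javascript
// eslint.config.js — fail the lint run when any function's cyclomatic
// complexity exceeds the configured maximum.
export default [
  {
    rules: {
      complexity: ["error", { max: 10 }],
    },
  },
];
```

With a gate like this, an AI refactoring that genuinely reduces complexity shows up as lint debt disappearing, and one that merely shuffles code does not.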
Metric 3: Team Velocity
Track: Story points per sprint, time to complete features
Goal: Refactoring should save time, not create churn
Red Flag: If velocity decreases, you might be:
- Over-refactoring
- Introducing bugs that cause rework
- Spending too long reviewing AI suggestions
Metric 4: Time Saved
Track: Time spent on refactoring tasks before/after AI
Example:
- Manual refactoring: 4 hours
- AI-assisted refactoring: 1 hour (including review)
- Time saved: 3 hours (75%)
Metric 5: False Start Rate
Track: How often do you start accepting an AI refactoring, then realize it's wrong and revert?
Target: <5%
If higher: Your categorization (green/yellow/red) needs adjustment, or prompts need refinement.
Conclusion
AI refactoring is powerful, but not autopilot.
The framework:
- Categorize: Green (trust), Yellow (verify), Red (override)
- Validate: Use the checklist
- Measure: Track quality over time
The teams doing this well:
- Trust AI for mechanical refactorings (extract method, rename)
- Verify closely for complex changes (async, databases, errors)
- Design manually for high-risk areas (security, distributed systems, performance)
The teams struggling:
- Trust everything (leads to production incidents)
- Trust nothing (wastes the AI opportunity)
- Don't measure (can't improve)
Your action plan:
- This week: Review 5 AI refactorings using the framework
- This month: Track defect rate (AI vs. manual refactorings)
- This quarter: Refine your green/yellow/red categories based on data
The goal isn't to avoid AI refactorings—it's to use them safely and effectively. When you know when to trust and when to override, AI becomes a force multiplier.
Practical Takeaways
Print This Framework:
- Green: Trust 90%, 2-min review
- Yellow: Verify closely, check edge cases
- Red: Human-designed, AI for ideas only
Your Review Checklist:
- Functional correctness preserved
- All callers still work
- Tests pass
- No performance regression
- Security unchanged
- Can rollback easily
Start Here:
- Take next AI refactoring suggestion
- Categorize it (green/yellow/red)
- Use appropriate review level
- Track outcome (did it work?)
- Adjust categories based on learnings
The 60/30/10 rule (60% good, 30% neutral, 10% harmful) only holds if you review intelligently. With this framework, you can push that to 90/10/0.
That's the goal: 90% of AI refactorings make your code better, 10% are neutral, 0% cause production issues.
You get there by knowing when to trust, when to verify, and when to override.
