Software Craftsmanship

Observability Beyond Dashboards: Practical Debugging Workflows for Distributed Systems

The dashboard says everything's fine, but Slack is on fire. Observability isn't pretty charts; it's answering "why is THIS user broken?" Learn the logs/metrics/traces workflow, correlation IDs, structured logging, a 15-minute debugging playbook, and common anti-patterns.

Ruchit Suthar
November 18, 2025 · 11 min read

TL;DR

Pretty dashboards don't answer "Why is this specific user seeing errors?" Use metrics (something is wrong), traces (where it failed), and logs (why it failed) together. Add correlation IDs to link requests across services. Structure logs for searchability. Build debugging workflows that go from user complaint to root cause in minutes, not hours.


When the Dashboard Says 'Everything Is Fine' (But Slack Is on Fire)

It's 2am. Your phone buzzes. The on-call alert: "User reports: checkout is broken."

You open your monitoring dashboard. Everything looks green:

  • API success rate: 99.8% ✅
  • Latency p99: 180ms ✅
  • Error rate: 0.2% ✅

But Slack is exploding. Twenty users can't complete checkout. Support is panicking. Your CEO is awake and asking questions.

The dashboard says everything is fine. Your users say it's broken.

This is the observability gap. Pretty charts showing aggregate health don't help you answer: "Why is this specific user seeing errors?" or "Which service in the request chain is failing?"

Observability isn't about dashboards. It's about answering questions. When something breaks, you need to go from "users are complaining" to "here's the exact problem and how to fix it" in minutes, not hours.

Let's talk about how to actually debug distributed systems in production.

Three Pillars, One Workflow: Logs, Metrics, Traces

You've heard this: logs, metrics, and traces are the three pillars of observability. But what does that actually mean when you're debugging?

Metrics: "Something Is Wrong Somewhere"

What they are: Time-series data showing trends. Request counts, latency percentiles, error rates, CPU usage.

What they tell you: That something is wrong and when it started. High-level health signals.

What they don't tell you: Why it's wrong or which specific requests are affected.

Example: "Order API p99 latency jumped from 200ms to 2 seconds at 2:15am."

You know there's a problem. You don't know which orders, which users, or what caused it.

Traces: "Requests Flowing Through Services"

What they are: Records of a single request's journey through multiple services. Shows which services were called, in what order, how long each took.

What they tell you: The path a request took, where time was spent, which service failed.

What they don't tell you: Why that service failed or what data caused the issue.

Example: Trace shows Order API called Payment Service, which took 8 seconds and returned 500 error.

Now you know where the problem is (Payment Service), but not why it failed.

Logs: "What Actually Happened"

What they are: Textual records of specific events. Errors, debug messages, business logic outcomes.

What they tell you: The details—error messages, stack traces, request parameters, decision points.

What they don't tell you: High-level patterns or trends (too granular for that).

Example: Log entry from Payment Service: PaymentError: Card declined - insufficient funds. card_id=card_abc123, user_id=user_456

Now you know why: the user's card was declined for insufficient funds.

How They Work Together

Debugging workflow:

  1. Metrics tell you: "Something is wrong" (latency spike, error rate increase).
  2. Traces tell you: "Here's where in the request flow it's failing" (which service, which endpoint).
  3. Logs tell you: "Here's exactly what went wrong" (error message, stack trace, context).

You need all three. Metrics without traces = you know there's a problem but not where. Traces without logs = you know where but not why. Logs without context = you're searching for a needle in a haystack.

Designing for Debuggability, Not Just Monitoring

Most teams bolt observability onto their systems after they're built. Better approach: design for debuggability from the start.

1. Correlation IDs Across Services

Every request gets a unique ID that flows through all services.

Example:

// API Gateway generates a correlation ID (v4 from the 'uuid' package)
const { v4: uuid } = require('uuid');
const correlationId = uuid();
req.headers['x-correlation-id'] = correlationId;

// Every service logs it
logger.info('Processing order', { 
  correlationId, 
  userId, 
  orderId 
});

// Every service passes it to downstream services
await paymentService.charge({
  amount,
  headers: { 'x-correlation-id': correlationId }
});

Benefit: You can trace one user's request across 10 services by searching for their correlation ID.

Without this, you're trying to piece together which log line in Payment Service corresponds to which log line in Order Service. Impossible at scale.
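
If the entry point is an Express-based gateway, a small middleware can guarantee the ID exists before any handler runs. A minimal sketch, assuming Express and the uuid package (the header name matches the example above; the middleware itself is illustrative):

const express = require('express');
const { v4: uuid } = require('uuid');

const app = express();

// Reuse an ID set by an upstream proxy, otherwise mint a new one
app.use((req, res, next) => {
  const correlationId = req.headers['x-correlation-id'] || uuid();
  req.headers['x-correlation-id'] = correlationId;

  // Echo it back so users and support can quote it in bug reports
  res.setHeader('x-correlation-id', correlationId);
  next();
});

Every route handler and downstream HTTP client then reads req.headers['x-correlation-id'] instead of generating its own.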

2. Structured Logging (Fields, Not String Soup)

Bad logging (unstructured strings):

logger.info(`User ${userId} created order ${orderId} for $${amount}`);

Try searching logs for "all orders over $500" or "all errors for user_123". You can't. It's a string. You have to parse it.

Good logging (structured fields):

logger.info('Order created', {
  correlationId,
  userId,
  orderId,
  amount,
  status: 'pending'
});

Now you can query: orderId="order_456" or amount > 500 or userId="user_123" AND status="failed".

Use structured logging everywhere. JSON logs with consistent field names.
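
A minimal sketch of what that can look like with the winston library: JSON output plus a few base fields attached to every line (the service name and example fields are illustrative):

const winston = require('winston');

// Every log line becomes a JSON object with these base fields attached
const logger = winston.createLogger({
  level: 'info',
  format: winston.format.combine(
    winston.format.timestamp(),
    winston.format.json()
  ),
  defaultMeta: { service: 'order-api', env: process.env.NODE_ENV },
  transports: [new winston.transports.Console()]
});

// Usage: a message plus structured fields, never string interpolation
logger.info('Order created', {
  correlationId: 'abc-123',
  userId: 'user_456',
  orderId: 'order_789',
  amount: 120.5,
  status: 'pending'
});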

3. Key Business and Technical Metrics Per Service

Every service should expose metrics for:

Business metrics:

  • Orders created
  • Payments processed
  • Emails sent
  • Users signed up

Technical metrics:

  • Request rate
  • Error rate
  • Latency (p50, p95, p99)
  • Database query time
  • External API call time

Example (using Prometheus-style metrics):

// Business metric
metrics.increment('orders.created', { status: 'pending' });

// Technical metric
const start = Date.now();
await database.query(sql);
metrics.histogram('database.query.duration', Date.now() - start);

These metrics feed your dashboards. But more importantly, they help you ask: "Did payment volume spike?" or "Did database query time increase?"
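
A minimal sketch of both metric types using prom-client, the Prometheus client for Node.js (metric names, labels, and buckets are illustrative):

const client = require('prom-client');

// Business metric: how many orders were created, by status
const ordersCreated = new client.Counter({
  name: 'orders_created_total',
  help: 'Number of orders created',
  labelNames: ['status']
});

// Technical metric: database query duration in seconds
const dbQueryDuration = new client.Histogram({
  name: 'db_query_duration_seconds',
  help: 'Database query duration in seconds',
  buckets: [0.01, 0.05, 0.1, 0.5, 1, 2, 5]
});

async function createOrder(order, database) {
  ordersCreated.inc({ status: 'pending' });

  const end = dbQueryDuration.startTimer();
  await database.query('INSERT INTO orders ...');
  end(); // records the elapsed seconds into the histogram
}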

4. Observability as Part of the System

Don't think: "We'll add logging later." Think: "Logging, tracing, and metrics are part of the feature."

Code review checklist:

  • Does this code log key events (with correlation ID and structured fields)?
  • Does it emit metrics (business outcomes and performance)?
  • Does it propagate trace context to downstream services?

If the answer is no, the feature isn't done.

A Typical Debugging Workflow for Production Incidents

Let's walk through a real scenario step by step.

Scenario: Latency Spike in Order API

2:15am: Alert fires: "Order API p99 latency > 2 seconds."

Step 1: Check High-Level Dashboards

Open your SLO dashboard. Look at:

  • Request rate: Has traffic spiked? (Normal: 500 req/min. Now: 480 req/min. Not a spike.)
  • Error rate: Are requests failing? (Normal: 0.1%. Now: 0.5%. Slight increase.)
  • Latency: p50, p95, p99. (p50: 150ms → 200ms. p99: 200ms → 2.5 seconds. Big spike at p99.)

Observation: p99 latency spiked. Small increase in errors. No traffic spike.

Hypothesis: Some requests are slow, not all. Likely a specific code path or dependency issue.

Step 2: Drill Into Affected Endpoints

Which endpoints are slow?

Check per-endpoint latency breakdown:

  • GET /orders: p99 180ms (normal)
  • POST /orders: p99 2.8 seconds (abnormal, usually 300ms)
  • GET /orders/{id}: p99 150ms (normal)

Observation: Only POST /orders is slow.

Step 3: Follow Traces Through Dependent Services

Pull up traces for slow POST /orders requests.

Example trace (spans with durations):

POST /orders (2.8s total)
  ├─ Validate inventory (50ms)
  ├─ Create order record (80ms)
  ├─ Call Payment Service (2.5s) ← SLOW
  │   ├─ Charge card (2.4s) ← SLOW
  │   └─ Save payment record (100ms)
  └─ Enqueue fulfillment job (20ms)

Observation: Payment Service charge card call is taking 2.4 seconds (normally 200ms).

Hypothesis: Payment Service or downstream payment gateway is slow.
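
A minimal sketch of how the Order API might wrap the Payment Service call in a span so it shows up in a trace like the one above, assuming the OpenTelemetry JS API with an SDK and exporter already configured (the span and attribute names are illustrative):

const { trace, SpanStatusCode } = require('@opentelemetry/api');

const tracer = trace.getTracer('order-api');

async function chargeCard(paymentService, payload) {
  // Creates a "payment-service.charge" span as a child of the active request span
  return tracer.startActiveSpan('payment-service.charge', async (span) => {
    try {
      const result = await paymentService.charge(payload);
      span.setAttribute('payment.amount', payload.amount);
      return result;
    } catch (err) {
      span.recordException(err);
      span.setStatus({ code: SpanStatusCode.ERROR, message: err.message });
      throw err;
    } finally {
      span.end();
    }
  });
}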

Step 4: Use Logs for Specific Errors or Anomalies

Check Payment Service logs, filtered by correlation IDs from slow traces.

Log entries:

{
  "level": "error",
  "message": "Payment gateway timeout",
  "correlationId": "abc-123",
  "userId": "user_456",
  "error": "Timeout after 5000ms waiting for response from stripe.com",
  "timestamp": "2025-11-15T02:16:23Z"
}

Observation: Payment gateway (Stripe) is timing out. Not all requests—just some.

Step 5: Check Stripe Status Page

Visit status.stripe.com.

Status: "Degraded performance - API response times elevated."

Root cause found: Stripe is having issues. Our Payment Service is timing out waiting for Stripe responses.

Step 6: Mitigation

Immediate:

  • Increase timeout or add retries? No; that would make things worse.
  • Enable a circuit breaker to fail fast instead of waiting 5 seconds? If one is configured, yes (see the sketch after this list).
  • Communicate to users: "Payment processing is slower than usual, please be patient."
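
A minimal, hand-rolled sketch of the fail-fast idea behind a circuit breaker; in production you would more likely reach for a library such as opossum, and the thresholds here are illustrative:

// Very small circuit breaker: after N consecutive failures, reject calls
// immediately for a cooldown period instead of waiting on a slow dependency.
function createCircuitBreaker(fn, { failureThreshold = 5, cooldownMs = 30000 } = {}) {
  let consecutiveFailures = 0;
  let openedAt = 0;

  return async (...args) => {
    const open = consecutiveFailures >= failureThreshold &&
      Date.now() - openedAt < cooldownMs;
    if (open) {
      // Fail fast: the dependency is marked unhealthy until the cooldown passes
      throw new Error('Circuit open: payment gateway unhealthy, failing fast');
    }

    try {
      const result = await fn(...args);
      consecutiveFailures = 0; // success closes the circuit
      return result;
    } catch (err) {
      consecutiveFailures += 1;
      if (consecutiveFailures >= failureThreshold) {
        openedAt = Date.now(); // trip the circuit; allow a trial call after cooldown
      }
      throw err;
    }
  };
}

// Usage (hypothetical gateway client):
// const chargeSafely = createCircuitBreaker((params) => paymentGateway.charge(params));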

Follow-up:

  • Once Stripe recovers, check if any payments are stuck and need retry.
  • Review alerting: should we alert on external API latency separately?

Total Time: 15 Minutes

From alert to root cause: 15 minutes. Because:

  1. Metrics showed which endpoint was slow.
  2. Traces showed which service in the chain was slow.
  3. Logs showed the specific error (timeout from Stripe).

Without observability: "Users report checkout is slow" → hours of guessing, restarting services randomly, hoping it fixes itself.

Common Observability Anti-Patterns

Anti-Pattern 1: Too Many Dashboards, No One Knows Which to Check

You have 40 dashboards. No one knows which shows the problem.

Fix: One SLO dashboard per team/service showing the metrics that matter:

  • Request rate
  • Error rate
  • Latency (p50, p95, p99)
  • Key business metrics

Link this from Slack, runbooks, and on-call docs. This is your starting point.

Other dashboards are for deep dives, not incident response.

Anti-Pattern 2: Unstructured Logs with No Context

logger.error('Payment failed');

What payment? Which user? Why did it fail?

Fix: Always log context:

logger.error('Payment failed', {
  correlationId,
  userId,
  orderId,
  paymentMethod: 'card',
  errorCode: 'CARD_DECLINED',
  errorMessage: error.message
});

Anti-Pattern 3: Metrics with Unclear Names or Units

metrics.gauge('queue', queueSize);

Which queue? Is this size in items? Bytes? Is high bad or good?

Fix: Descriptive names and units:

metrics.gauge('fulfillment.queue.size.items', queueSize);

Now it's clear: fulfillment queue, size measured in items.

Anti-Pattern 4: No Correlation Between Metrics, Logs, Traces

You see a latency spike in metrics. You open logs. You have 10 million log lines. Which ones correspond to the spike?

Fix: Link everything with time ranges and correlation IDs.

Modern tools (Datadog, New Relic, Grafana + Loki + Tempo) let you:

  • Click a spike in a metric graph → see traces from that time.
  • Click a trace → see logs with that correlation ID.

If your tools don't support this, use consistent timestamps and correlation IDs to manually filter.

Observability for Distributed Architectures

Distributed systems (microservices, serverless) make debugging harder. The request flow is no longer linear.

Challenges

  • Network flakiness: Requests sometimes fail due to transient network issues.
  • Retries: Did this request succeed on first try or after 3 retries?
  • Timeouts: Service A times out waiting for Service B, but Service B actually succeeded.
  • Cascading failures: Service A is slow, so Service B queues up requests, so Service C times out.

How Good Observability Helps

Tracing shows cross-service impact:

If Service A calls Service B, which calls Service C, traces show the full path. If C is slow, you see it in the trace. If B retries 3 times, you see that too.

Correlation IDs track retries:

Log each retry attempt with the same correlation ID but different attempt number:

logger.warn('Retrying payment', {
  correlationId,
  attempt: 2,
  error: 'Timeout'
});

Now you can see: "This request succeeded on attempt 3."
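
A minimal sketch of a retry wrapper that emits exactly these logs, keeping the same correlation ID across attempts (the attempt limit and backoff are illustrative):

// Retry an async operation, logging every attempt with the shared correlation ID
async function withRetries(operation, { correlationId, logger, maxAttempts = 3, delayMs = 500 }) {
  for (let attempt = 1; attempt <= maxAttempts; attempt += 1) {
    try {
      const result = await operation();
      if (attempt > 1) {
        logger.info('Operation succeeded after retry', { correlationId, attempt });
      }
      return result;
    } catch (err) {
      if (attempt === maxAttempts) {
        logger.error('Operation failed on final attempt', { correlationId, attempt, error: err.message });
        throw err;
      }
      logger.warn('Retrying operation', { correlationId, attempt, error: err.message });
      await new Promise((resolve) => setTimeout(resolve, delayMs * attempt)); // linear backoff
    }
  }
}

// Usage (hypothetical):
// await withRetries(() => paymentService.charge(payload), { correlationId, logger });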

Metrics show cascading failures:

If Service A's latency spikes and Service B calls A, check whether B's latency spiked too. If it did, you're looking at a cascading failure.

Debugging Distributed Systems: Follow the Trace

When a request touches 5 services, don't debug 5 services independently. Follow the trace.

Example: User reports "checkout failed."

  1. Search logs for user's correlation ID or order ID.
  2. Find the trace for that request.
  3. Walk through the trace to see which service returned an error.
  4. Check that service's logs for details.

Making Observability a Team Habit

Observability isn't a tool you buy—it's a habit you build.

1. Include Observability Checks in PR Reviews

When reviewing code, ask:

  • Are key events logged (with correlation ID and context)?
  • Are metrics emitted for success/failure?
  • Is trace context propagated to downstream calls? (See the sketch after this checklist.)

If no, request changes.
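
For that third check, a minimal sketch of injecting trace context into an outgoing HTTP call with the OpenTelemetry API (the endpoint and payload are illustrative; this assumes Node 18+ for the built-in fetch and an SDK with a W3C propagator configured):

const { context, propagation } = require('@opentelemetry/api');

async function callPaymentService(payload) {
  // Inject the current trace context (traceparent header) into the outgoing request
  const headers = { 'content-type': 'application/json' };
  propagation.inject(context.active(), headers);

  return fetch('https://payments.internal/charge', { // hypothetical endpoint
    method: 'POST',
    headers,
    body: JSON.stringify(payload)
  });
}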

2. Add Observability Improvements to Postmortem Actions

After every incident:

  • "Could we have debugged this faster with better logs/metrics/traces?"
  • "What would have helped us catch this earlier?"

Add those improvements as action items.

Example: After the Stripe timeout incident:

  • Action: Add metric for external API latency (Stripe, Twilio, etc.).
  • Action: Add circuit breaker to fail fast on repeated timeouts.

3. Run "Debug Drills" to Practice Using Tools

Before real incidents, practice:

  • "Simulate a slow database query. Use traces to find which query."
  • "Inject an error in Payment Service. Use logs to find the root cause."

This trains your team to use observability tools under pressure.

Closing: From Dashboards to Insight

Dashboards are not observability. Dashboards show you what's happening. Observability helps you understand why.

Good observability shortens time-to-understanding:

  • From "something is wrong" to "here's the root cause" in minutes, not hours.
  • From "users report errors" to "here's the exact failing request and why."

This requires:

  • Structured logs with correlation IDs and context.
  • Metrics for high-level health and trends.
  • Traces to follow requests through services.
  • Design for debuggability: observability built in, not bolted on.

Observability Upgrade Checklist for One Service

Pick one critical service and upgrade its observability:

  • Correlation IDs: Generate at entry point, log in every service, pass to downstream services
  • Structured logging: All logs use JSON with consistent field names (correlationId, userId, orderId, etc.)
  • Key metrics exposed: Request rate, error rate, latency, business metrics (orders created, payments processed)
  • Tracing enabled: Trace context propagated, spans captured for key operations (DB queries, external API calls)
  • One SLO dashboard: Shows the 4-6 metrics that matter most, linked from runbook
  • Logs/metrics/traces linked: Can click metric spike → see traces → see logs with correlation ID
  • Runbook updated: Debugging workflow documented using these observability tools

This takes 1-2 days. It saves weeks of debugging time over the next year.


Observability is not optional. In distributed systems, you can't SSH into a server and tail a log file anymore. You need structured visibility into what's happening across dozens of services handling thousands of requests per second.

Build observability into your system from day one. When production breaks at 2am, you'll thank yourself.

Topics

observability, monitoring, distributed-systems, debugging, logging, tracing, metrics

About Ruchit Suthar

Technical Leader with 15+ years of experience scaling teams and systems