6 Root Cause Analysis Examples: Methods, Steps, and Real-World Scenarios

There’s something satisfying about a postmortem that lands cleanly: a trail of telemetry from the 500 error to the offending commit, and a short list of fixes that keep the same outage from coming back. Each clean writeup feeds back into harder CI/CD guardrails, and the on-call rotation gets quieter month over month.

This guide covers five root cause analysis (RCA) methods, six real-world incident scenarios that show those methods working on production failures, and a step-by-step RCA process you can run against your own telemetry.

What Is Root Cause Analysis?

Root cause analysis is the structured investigation you run after service is restored to find the underlying trigger behind an incident, not the surface symptom. Troubleshooting and incident response happen live to stop the bleeding. RCA happens over the next few hours or days and produces a postmortem with trackable action items. Observability platforms like Coralogix shorten this work by joining logs, metrics, and traces in one query language so you can trace a symptom back to its trigger without context-switching between dashboards. One data point shapes which methods pay off most: across thousands of postmortems, binary pushes and configuration pushes drive 37 and 31 percent of change-caused incidents, respectively.

Common Root Cause Analysis Methods

Pick the wrong RCA method and you’ll either drown a simple bug report in process or miss the second contributing factor that takes you down again next week. Each method below has a profile where it shines and another where it wastes hours. The table at the end maps each one to the incident shape it fits best.

The 5 Whys

The 5 Whys is a linear questioning technique where you ask “why” repeatedly until you reach a cause you can fix in code, config, or process. Each answer feeds the next question, and you stop when the chain points at a concrete change. Here’s a worked example for a Kubernetes payment service:

  1. Why did the API return 500 errors? The payment service was unreachable.
  2. Why was the payment service unreachable? All pods were in CrashLoopBackOff.
  3. Why were pods crashing? The service couldn’t connect to the database.
  4. Why couldn’t it connect? The database connection string changed in a config update.
  5. Why was the config changed incorrectly? The deployment pipeline didn’t validate environment variables.

The 5 Whys works when an incident has a clean linear progression with one dominant cause. The structure can’t represent two or three contributing factors firing at once, so when you have converging failure modes, switch to a Fishbone diagram or fault tree analysis.
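The fifth "why" in the worked example points at a pipeline that didn't validate environment variables. A minimal sketch of such a pre-deploy check follows. The variable name `DATABASE_URL`, the accepted URL schemes, and the function `validate_env` are assumptions made for illustration, not details from the incident:

```python
from urllib.parse import urlparse

# Hypothetical list of variables the payment service needs before it can start.
REQUIRED_VARS = ("DATABASE_URL",)


def validate_env(env):
    """Return a list of problems found in `env`; an empty list means it passes."""
    problems = []
    for name in REQUIRED_VARS:
        if not env.get(name):
            problems.append(f"{name} is missing or empty")

    db_url = env.get("DATABASE_URL", "")
    if db_url:
        parsed = urlparse(db_url)
        # Assumption: the service talks to Postgres, so only these schemes pass.
        if parsed.scheme not in ("postgres", "postgresql"):
            problems.append(f"DATABASE_URL has unexpected scheme {parsed.scheme!r}")
        if not parsed.hostname:
            problems.append("DATABASE_URL has no host")
    return problems


# Example: a connection string that lost its host during a config update.
for problem in validate_env({"DATABASE_URL": "postgresql:///payments"}):
    print("config check failed:", problem)
```

A pipeline step could call `validate_env(os.environ)` and refuse to roll out when the list is non-empty, which turns the last answer in the chain into a guardrail.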

Fishbone (Ishikawa) Diagram

A Fishbone diagram maps potential causes into categories branching off a central spine that points at the problem statement. Where the 5 Whys traces one chain, the Fishbone explores cause categories in parallel: people, process, technology, and environment. It outperforms the 5 Whys when an incident touches multiple teams or systems and you need to rule out four or five plausible triggers before narrowing the investigation.

Fault Tree Analysis

Fault tree analysis (FTA) starts with an undesired event and works backward to map every combination of lower-level failures that could cause it, using logical AND gates (all conditions must be present) and OR gates (any single condition is sufficient). The structure forces you to reason about how several conditions combine to produce one visible outage, which is the shape of most distributed-system failures. FTA applies both reactively after incidents and proactively before system launches.
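To make the gate logic concrete, here is a small sketch of a fault tree evaluated in code. The class names `Event` and `Gate` and the example tree are hypothetical, chosen only to illustrate how AND and OR gates combine:

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Event:
    """Leaf node: a basic condition that is either present or absent."""
    name: str
    occurred: bool = False

    def evaluate(self) -> bool:
        return self.occurred


@dataclass
class Gate:
    """Internal node: "AND" needs every child, "OR" needs at least one."""
    name: str
    kind: str
    children: List[Union["Gate", Event]] = field(default_factory=list)

    def evaluate(self) -> bool:
        results = [child.evaluate() for child in self.children]
        if self.kind == "AND":
            return all(results)
        if self.kind == "OR":
            return any(results)
        raise ValueError(f"unknown gate kind: {self.kind!r}")


# Hypothetical tree: a bad config reaches production only if it was generated
# AND nothing stopped it (validation was either missing OR bypassed).
top_event = Gate("bad config reaches production", "AND", [
    Event("unsafe config generated", occurred=True),
    Gate("validation did not stop it", "OR", [
        Event("no pre-deploy validation", occurred=True),
        Event("validation bypassed manually", occurred=False),
    ]),
])
print(top_event.name, "->", top_event.evaluate())
```

Flipping individual leaves to `False` and re-evaluating shows which single fix breaks the path to the top event. That is the same question an investigator asks when reading the tree by hand.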

Pareto Analysis

Pareto analysis applies the 80/20 principle to incident data to find which root cause categories drive the largest share of failures. The finding that binary and configuration pushes drive 37 and 31 percent of change-caused incidents, respectively, is Pareto analysis in practice, and that distribution drove industry investment in canary releases and configuration validation. The method needs a backlog of tagged, categorized postmortems to work, and it offers nothing for one incident in isolation.
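As a rough sketch of the arithmetic, the function below counts root cause tags across a postmortem backlog and returns the top categories until their cumulative share reaches a threshold. The function name `pareto`, the tag names, and the counts are invented for illustration and are not real incident data:

```python
from collections import Counter


def pareto(categories, threshold=0.8):
    """Return (category, count, cumulative_share) rows, largest first,
    stopping once the cumulative share reaches `threshold`."""
    counts = Counter(categories)
    total = sum(counts.values())
    rows = []
    cumulative = 0
    for category, count in counts.most_common():
        cumulative += count
        rows.append((category, count, cumulative / total))
        if cumulative / total >= threshold:
            break
    return rows


# Invented backlog of 20 tagged postmortems.
tags = (["binary push"] * 7 + ["config push"] * 6 + ["capacity"] * 3
        + ["hardware"] * 2 + ["dependency"] * 2)
for category, count, share in pareto(tags):
    print(f"{category:12} {count:2}  cumulative {share:.0%}")
```

In this sample, three of the five categories account for 80 percent of incidents, so preventive work would start with those three.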

Failure Mode and Effects Analysis (FMEA)

Failure Mode and Effects Analysis (FMEA) is a proactive, bottom-up method that walks every component, asks what could fail at each point, and scores each failure mode on severity, likelihood, and detectability to produce a Risk Priority Number (RPN). FMEA is the outlier among these five methods because it prevents failures before they happen instead of investigating them afterward. Teams apply FMEA during design or pre-launch reviews to identify potential failure modes and prioritize mitigations for the items carrying the highest RPN scores.
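The RPN is conventionally the product of the three scores. The sketch below assumes a 1 to 10 scale for each score, which is a common convention but not something this guide specifies. The components and ratings are made up for illustration:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    component: str
    description: str
    severity: int       # 1 = negligible impact, 10 = catastrophic
    likelihood: int     # 1 = rare, 10 = almost certain
    detectability: int  # 1 = caught immediately, 10 = effectively invisible

    def __post_init__(self):
        for field_name in ("severity", "likelihood", "detectability"):
            value = getattr(self, field_name)
            if not 1 <= value <= 10:
                raise ValueError(f"{field_name} must be between 1 and 10, got {value}")

    @property
    def rpn(self) -> int:
        """Risk Priority Number: severity x likelihood x detectability."""
        return self.severity * self.likelihood * self.detectability


def prioritize(modes):
    """Sort failure modes so the highest RPN comes first."""
    return sorted(modes, key=lambda mode: mode.rpn, reverse=True)


# Hypothetical pre-launch review entries.
review = [
    FailureMode("auth service", "memory leak under load", 9, 4, 6),
    FailureMode("deploy pipeline", "bad env var ships unvalidated", 8, 5, 7),
    FailureMode("cache tier", "node loss during rollout", 7, 3, 3),
]
for mode in prioritize(review):
    print(f"{mode.rpn:4}  {mode.component}: {mode.description}")
```

A high detectability score means the failure is hard to notice. That is why a moderately severe but silent failure mode can outrank a louder one.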

The table below maps each method to the incident profile where it earns its keep:

Method | Best For | Not Suited For | Reactive or Proactive
5 Whys | Linear causal chains; single-team incidents; config/code push failures | Multi-factor incidents; complex distributed systems | Reactive
Fishbone | Multi-team incidents; structured brainstorming; broad coverage | Simple incidents with obvious single cause | Reactive
FTA | Cascading failures; distributed system failures; safety-critical analysis | Routine single-cause incidents | Both
Pareto | Portfolio prioritization; recurring incident patterns; backlog triage | Single incident investigation | Both
FMEA | Pre-launch reviews; architecture design; high-risk migrations | Post-incident investigation of specific events | Proactive

6 Real-World Root Cause Analysis Examples

Postmortems from production incidents show how RCA methods hold up against messy, real-world failures. The six below cover application, database, deployment, microservice, network, and security incidents. Each entry walks through what happened, what cascaded, and which method fits the failure shape.

1. Application Outage from a Memory Leak

A performance tweak shipped to an authorization microservice in November 2023 carried a memory leak that surfaced under production traffic. Pods crashed in a loop, and because the service failed closed by default, authorization failures cascaded across several features.

Rollback dragged because parts of the deployment pipeline depended on the same authorization service, a circular dependency that needed manual unwinding. The RCA landed on two causes: a leak that pre-production load testing didn’t expose, and the circular dependency that widened the blast radius. Anomaly detection alerts that baseline pod restart counts and per-pod memory can flag this kind of slow leak as soon as production traffic exposes it.
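One simple way to reason about a slow leak is to fit a trend line to per-pod memory samples and project it forward. The sketch below does that with a plain least-squares slope. It is a stand-in heuristic, not the baseline anomaly detection any particular platform ships. The function names, the sample values, and the memory limits are assumptions:

```python
def memory_slope(samples):
    """Least-squares slope (memory units per second) from
    (timestamp_seconds, memory) pairs."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    numerator = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    denominator = sum((t - mean_t) ** 2 for t, _ in samples)
    if denominator == 0:
        raise ValueError("timestamps must not all be identical")
    return numerator / denominator


def looks_like_leak(samples, limit, horizon_seconds):
    """True when the fitted trend, projected from the latest sample,
    would reach `limit` within `horizon_seconds`."""
    slope = memory_slope(samples)
    if slope <= 0:
        return False
    latest_memory = samples[-1][1]
    return latest_memory + slope * horizon_seconds >= limit


# Hypothetical pod: one sample per minute, growing 2 MiB per minute from 200 MiB.
samples = [(i * 60, 200 + i * 2) for i in range(10)]
print(looks_like_leak(samples, limit=300, horizon_seconds=3600))
```

Real traffic is noisier than this sample, so a production check would likely need a longer window and some tolerance for garbage-collection sawtooth patterns.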

2. Database Performance Degradation After a Schema Change

Following a Postgres 17 upgrade, a team turned PGAudit back on after staging runs came back clean. A routine table-creation migration then tripped a pathological lock chain: PGAudit stalled while holding a critical lock, which blocked queries waiting on pg_proc and dragged the wider database down with it.

A Fishbone diagram makes the contributing factors readable: an environment gap between staging and production, an audit extension with hidden lock behavior under load, and a migration that assumed routine schema changes were safe to ship without a canary. Joining slow-query logs, lock-wait metrics, and active session traces in one query shortens the loop between “the database got slow” and “this lock is the reason.”

3. Failed Deployment from a Misconfigured Environment Variable

Between 16:24 and 19:30 Coordinated Universal Time (UTC) on March 5, 2026, GitHub workflow runs degraded badly, with 95 percent of runs failing to start within five minutes. A Redis infrastructure update pushed a bad config into the load balancer, which routed internal traffic to the wrong host.

The 5 Whys traces a clean chain here. Workflows failed because internal calls didn’t reach Redis, the load balancer pointed at the wrong host, and the Redis update shipped an incorrect value with no pre-rollout guardrail to catch it. The corrective work splits between config validation in the deploy pipeline and an alert that fires when load-balancer health diverges from a recent change window.

4. Cascading Microservice Failure During Peak Load

On February 22, 2022, a phased Consul rollout hit 25 percent of Slack’s fleet during peak traffic and tipped the system over. As cache nodes dropped, a single channel missing from cache forced queries to fan out across every shard of the keyspace at once, and read load grew superlinearly with the cache miss rate.

Most queries timed out, which kept the cache from refilling, locking the system into a self-reinforcing loop until engineers throttled client boot requests and later moved to a cellular architecture.

A cascade like this should produce one paged incident, not dozens of independent alerts firing across services. Coralogix’s Flow Alerts chain alerts across logs, metrics, traces, and security data in a defined sequence, so the cache-miss alert, the database saturation alert, and the query-timeout alert collapse into one signal with the cascade context attached.
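The idea of collapsing an ordered chain of alerts into one signal can be sketched in a few lines. This is an illustration of the concept only and makes no claim about how Flow Alerts is implemented. The function `match_sequence`, the alert names, and the timings are hypothetical:

```python
def match_sequence(alerts, sequence, window_seconds):
    """Return the alerts that fire in the order given by `sequence`, all within
    `window_seconds` of the first one, or None if no such chain exists.

    `alerts` is a list of (timestamp_seconds, alert_name) tuples.
    """
    if not sequence:
        raise ValueError("sequence must contain at least one alert name")
    ordered = sorted(alerts)
    for start_index, (start_time, name) in enumerate(ordered):
        if name != sequence[0]:
            continue
        matched = [(start_time, name)]
        if len(matched) == len(sequence):
            return matched
        for timestamp, alert_name in ordered[start_index + 1:]:
            if timestamp - start_time > window_seconds:
                break  # later alerts fall outside the window for this start
            if alert_name == sequence[len(matched)]:
                matched.append((timestamp, alert_name))
                if len(matched) == len(sequence):
                    return matched
    return None


# Hypothetical alert stream during a cache cascade.
alerts = [
    (0, "cache_miss_rate_high"),
    (40, "disk_usage_warning"),
    (90, "db_saturation"),
    (150, "query_timeouts"),
]
cascade = ["cache_miss_rate_high", "db_saturation", "query_timeouts"]
print(match_sequence(alerts, cascade, window_seconds=300))
```

When the chain matches, a pager integration could send one incident carrying all three alerts as context rather than three separate pages.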

5. Latency Spikes from a Network Routing Misconfiguration

On January 22, 2026, a routing policy error at a Miami data center caused Border Gateway Protocol (BGP) prefixes to leak outside their intended scope. The leak ran for 25 minutes, with congestion and raised packet loss bleeding into customer traffic.

Fault tree analysis pushes the top event (“BGP prefixes leaked”) down into two branches: a policy generator that produced an unsafe configuration, and the missing pre-deploy validation that would have rejected it. The postmortem flagged the incident as “unfortunately very similar to the outage we experienced in 2020,” a recurring failure mode worth its own fault tree.

For an SRE team facing the same pattern, Olly (Coralogix’s autonomous observability agent) cross-references the live telemetry deviation against the recent change set, returning a candidate commit and blast radius the investigator can verify.

6. Security Incident from a Leaked Credential

CircleCI’s January 2023 security incident started when a malicious actor compromised an engineer’s laptop and lifted a valid session cookie off it. The attacker replayed that authenticated session to impersonate the engineer and pull data from several internal systems, including secrets and tokens.

This incident rhymes with the 2021 Codecov supply-chain breach where attackers compromised CI/CD tooling that legitimately held credentials during normal operation. Both incidents point any RCA at three dimensions: session lifetime, device trust, and the blast radius of a single engineer’s access across internal systems.

How to Perform a Root Cause Analysis

A structured RCA process moves from symptom observation to causal analysis to preventive action. The six steps below apply whether your team uses the 5 Whys, a Fishbone diagram, or fault tree analysis:

  1. Define the problem clearly: A precise problem statement names the affected service, user-facing impact, and time window. “The checkout API returned p99 latency above three seconds for 47 minutes, hitting 12 percent of transactions” focuses an investigation in ways that “the API was slow” never will.
  2. Collect and correlate observability data: Metrics show you latency spiked, logs show which service threw the error, and traces show which upstream call caused the cascade. DataPrime in Coralogix joins logs, metrics, traces, and business data in one query, so the trace ID stays with you across signal types.
  3. Reconstruct the incident timeline: A factual chronology from telemetry answers four questions: when the first anomalous signal appeared, when alerts triggered, when the team engaged, and when the problem was mitigated. Deployment events, configuration changes, and scaling events overlaid onto that chronology often reveal the trigger.
  4. Identify causal factors: Match the method to the incident profile, with the 5 Whys for a linear chain and a Fishbone diagram for multi-team failures with several converging modes. Olly, Coralogix’s autonomous observability agent, formulates DataPrime queries against the incident timeline and returns reasoning chains an engineer can validate during the investigation.
  5. Validate the actual root cause: The test is whether fixing the identified cause would have prevented the incident, with evidence to back the answer. If the cause names a person rather than a system or process, the analysis should continue.
  6. Implement corrective and preventive actions: Each corrective action becomes a tracked work item with an owner and a deadline, captured in a postmortem and pulled into sprint planning. Items that never reach the backlog rarely get done.
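Step 3, overlaying change events onto the incident chronology, can be sketched as a simple lookback filter. The function name `changes_before`, the 30-minute lookback, and the sample events are assumptions for illustration:

```python
from datetime import datetime, timedelta


def changes_before(anomaly_start, changes, lookback):
    """Return (timestamp, description) changes that landed within `lookback`
    before `anomaly_start`, most recent first."""
    earliest = anomaly_start - lookback
    candidates = [change for change in changes
                  if earliest <= change[0] <= anomaly_start]
    return sorted(candidates, key=lambda change: change[0], reverse=True)


# Hypothetical change log and first anomalous signal.
changes = [
    (datetime(2024, 1, 10, 9, 0), "scale-up: checkout pods 6 -> 8"),
    (datetime(2024, 1, 10, 14, 5), "config push: payment-service env vars"),
    (datetime(2024, 1, 10, 14, 22), "deploy: checkout-api build 481"),
    (datetime(2024, 1, 10, 15, 0), "deploy: search-api build 77"),
]
first_anomaly = datetime(2024, 1, 10, 14, 30)

for timestamp, description in changes_before(first_anomaly, changes, timedelta(minutes=30)):
    print(timestamp.isoformat(), description)
```

The output is a shortlist of suspects, not a verdict. Step 5 still applies: a change that landed shortly before the anomaly is a candidate only until evidence shows that reverting or fixing it would have prevented the incident.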

Run these steps in order, and the investigation produces fixes the team can ship.

Four Principles That Separate Learning Teams from Repeating Teams

These principles distinguish teams that learn from incidents from teams that repeat them:

  • Focus on systems, not individuals: A blameless postmortem assumes everyone involved had good intentions. Identifying who triggered an incident should lead you to the system conditions that allowed it.
  • Ground findings in telemetry, not assumptions: An observability gap that hides silent failures is itself a root cause. Corrective actions should ask whether the same investigation could have run faster with signals joined in one query instead of siloed dashboards.
  • Document outcomes in a structured postmortem: Postmortems need to be reviewed, shared, and tracked to closure. Documentation pays off only when it drives visible follow-through.
  • Close the loop with preventive action: If the same class of incident recurs despite completed postmortems, the root cause, the corrective actions, or both fall short.

Applying all four consistently does more for reliability than any single postmortem template.

How Coralogix Accelerates Root Cause Analysis

Coralogix is a cross-stack observability platform built on Streama©, which analyzes telemetry as it lands, before indexing or storage. The platform keeps data in your own cloud object storage with remote, index-free querying, so RCA work can scan weeks of telemetry without rehydration delays or extra cost. One investigation can span logs, metrics, traces, and business data in the same workspace.

Olly, Coralogix’s autonomous observability agent, reads logs, metrics, alerts, and traces to surface patterns, flag anomalies, and trace a problem to its origin, as the Simpplr case study walks through on a live incident. Flow Alerts chain conditions across logs, metrics, traces, and security signals in a defined sequence, so a cascading failure produces one alert with root cause context instead of a wall of pages. DataPrime joins those signals in a single query, and Coralogix Investigations gives responders a shared workspace for rebuilding the incident timeline.

Turning Every Incident into a Stronger System

The incidents in this guide point in the same direction: a detailed postmortem with specific corrective actions turns one bad day into engineering input for the next quarter of work. Examples range from a migration to cellular architecture after a cache cascade to tighter pre-deploy validation after recurring BGP route leaks. Your chosen RCA method comes second to following the evidence in your telemetry, confirming the named cause would have stopped the incident, and tracking each action item to done.

Run Olly on your own production data with a free Coralogix trial. You’ll see how an autonomous agent traces a live incident from symptom to source without the manual log-jumping your on-call rotation does today.

Frequently Asked Questions About Root Cause Analysis

How is root cause analysis different from troubleshooting or incident response?

Troubleshooting diagnoses an active issue in minutes, and incident response restores service in minutes to hours. RCA runs hours to days after the fact and produces a postmortem with corrective actions aimed at stopping recurrence. Coralogix Investigations gives responders one workspace that carries context from the live page through the postmortem.

When should you use a fishbone diagram instead of the 5 Whys?

Use the 5 Whys when one team owns the path from trigger to failure and the chain is mostly linear. Reach for a Fishbone when several services, teams, or failure modes converge, then drop into the 5 Whys on the branch that looks heaviest. Olly, Coralogix’s autonomous observability agent, can pre-cluster the signals across services so the branches on your diagram start with evidence, not guesswork.

How do you know you’ve found the actual root cause?

Ask whether the fix would have prevented this incident and whether it prevents the same class of incident going forward, not only this one instance. If the answer names a person rather than a system or process, keep going. In Coralogix, DataPrime lets you replay the same query across weeks of stored telemetry to confirm the pattern holds beyond the single event.

Can root cause analysis be automated?

AI-assisted RCA is producing measurable results in production today. The gap is between assistants that summarize what humans already wrote and agents that investigate telemetry, code changes, and prior incidents to produce explainable hypotheses. Coralogix Olly sits in the second camp, working from raw signals rather than human-authored notes.
