The Single Point of Failure is a Ghost in the Machine

The myth of the root cause in complex systems.

Slumping back in the ergonomic chair that has long since lost its lumbar support, I watch the cursor blink inside the ‘Root Cause’ field of the post-mortem document. It’s 3:11 AM. My eyes feel like they’ve been rubbed with sandpaper, and I’ve already typed my password wrong 11 times in the last hour because my muscle memory has been replaced by a twitching cocktail of adrenaline and caffeine. The Jira ticket is mocking me. It demands a singular answer. It wants a name, a commit hash, or a specific configuration value that turned our production environment into a digital graveyard for exactly 41 minutes. My boss wants to know the ‘one thing’ we need to fix to make sure this never happens again. He wants a narrative arc with a clear villain and a satisfying resolution.

But reality doesn’t write stories. It writes tragedies of coincidence.

The truth is that there is no root cause. It’s a lie we tell ourselves so we can sleep at night, a comforting fiction designed to satisfy the human brain’s desperate need for linear cause-and-effect. In complex systems, failure is never a straight line; it is a convergence. It is a messy, tangled knot of ‘almosts’ and ‘not quites’ that suddenly pull tight all at once. If I write ‘Bug in the caching service’ in that box, I am technically telling the truth, but I am also committing a massive fraud against the engineering process. The bug had been there for 21 months. It wasn’t the bug that killed us; it was the bug meeting a misconfigured retry policy, meeting a sudden 1-second spike in database query latency, meeting a network partition that only lasted 1 millisecond.
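The retry-policy half of that convergence is easy to quantify. Here is a minimal sketch (the function name and the numbers are illustrative, not from the actual incident) of why immediate retries multiply load on a dependency exactly when it is least able to absorb it:

```python
def expected_attempts(failure_rate: float, max_retries: int) -> float:
    """Expected requests sent per logical call when every failure is
    retried immediately, up to max_retries times.

    Attempt k+1 only happens if the first k attempts all failed, so the
    expectation is the geometric sum 1 + p + p^2 + ... + p^max_retries.
    """
    return sum(failure_rate ** k for k in range(max_retries + 1))

# A healthy dependency (1% failures) barely notices five retries:
print(round(expected_attempts(0.01, 5), 3))  # 1.01

# A dependency that is already 90% failing gets hit almost 5x harder,
# precisely when it can least afford it:
print(round(expected_attempts(0.90, 5), 3))  # 4.686
```

With exponential backoff and jitter the multiplier shrinks, which is why the retry budget belongs in the post-mortem right next to the bug itself.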

1. The Convergence of Coincidence

I’ve spent a lot of time thinking about these intersections since I started talking to Parker H. Parker isn’t a software engineer. Parker H. is a dollhouse architect, a professional who builds intricate, hyper-realistic 1:12 scale mansions for people who have more money than sense. You’d think building a miniature house would be simpler than building a distributed system, but Parker talks about ‘structural cascading failures’ with the same weary intensity I use for Kubernetes clusters.

Parker once explained to me that a miniature roof doesn’t just collapse because a beam is thin. It collapses because the glue was applied in a room with 51% humidity, the wood grain was slightly off-axis, and a visitor 1 foot away sneezed at a specific frequency. None of those things is the root cause. The cause is the relationship between them.

The Scapegoat Mechanism

When we hunt for a single point of failure, we are looking for a scapegoat. We want something we can delete, replace, or fire. If we admit that the failure was the result of five different systems behaving exactly as they were designed to behave, just in an unforeseen combination, we have to admit that we are not in control. That’s a terrifying prospect for management.

It’s much easier to say ‘Dave broke it’ or ‘The Redis instance was too small.’ It allows us to apply a local fix to a systemic problem and call it progress. We patch the one thing, we check the box, and we go back to our lives, leaving the remaining 4 variables to simmer in the background until they find a new partner to dance with.

I remember an outage 11 months ago where a single character in a YAML file caused a global routing loop. Everyone pointed at the engineer who made the commit. They called it ‘Human Error.’ But that engineer didn’t work in a vacuum. The CI/CD pipeline didn’t catch it. The peer reviewer was distracted by a 1:1 meeting. The monitoring system didn’t alert for 21 minutes because the thresholds were set based on last year’s traffic. Was the ‘root cause’ the YAML character? Or was it the lack of automated validation? Or the culture of rushed reviews? Or the outdated monitoring? If you fix the YAML and don’t fix the culture, you haven’t solved anything. You’ve just waited for the next person to make a different typo.
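Of those fixes, the automated validation gate is the cheapest to build. A hedged sketch (the schema and key names here are hypothetical, not our actual config; in practice the dict would come from parsing the file with a YAML library such as PyYAML): a pre-merge check that refuses a routing config whose weights don’t add up, which is exactly the class of one-character typo that slips past a distracted reviewer.

```python
def validate_routes(config: dict) -> list[str]:
    """Return a list of human-readable problems; empty means the config passes.

    Run in CI on the parsed YAML, a one-character typo fails the build
    instead of becoming a global routing loop.
    """
    errors = []
    routes = config.get("routes")
    if not isinstance(routes, dict) or not routes:
        return ["config must contain a non-empty 'routes' mapping"]
    for name, route in routes.items():
        weight = route.get("weight")
        if not isinstance(weight, int) or not 0 <= weight <= 100:
            errors.append(
                f"route '{name}': weight must be an int in [0, 100], got {weight!r}"
            )
    if not errors and sum(r["weight"] for r in routes.values()) != 100:
        errors.append("route weights must sum to exactly 100")
    return errors
```

The check is trivial; the systemic fix is that it runs on every commit, regardless of how rushed the reviewer is.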

2. Shifting Focus: Reasons vs. Mechanisms

The fiction of the ‘Five Whys’ is that they lead to a single floor, when they actually lead to a basement full of interconnected shadows.

This is why I find the traditional post-mortem process so draining. It forces us to simplify the world until it’s unrecognizable. We talk about this a lot on Ship It Weekly, where the focus is often on the ‘how’ rather than the ‘why.’ When you ask ‘why,’ you’re looking for a reason. When you ask ‘how,’ you’re looking for a mechanism. Mechanisms are useful. Reasons are just stories we use to blame people.

[Figure: Why vs. How. Asking ‘why’ yields a Reason (blame, fiction); asking ‘how’ yields a Mechanism (resilience, fact).]

If we want to build resilient systems, we have to embrace the mess. We have to look at the 101 different factors that contributed to the event and accept that we might not be able to ‘fix’ all of them. Sometimes, the fix is just making the system better at failing.
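‘Better at failing’ is concrete, not poetic. One hedged sketch of the posture (a toy circuit breaker, not any particular library): stop hammering a dependency once it has failed a few times in a row, serve a degraded answer, and probe again only after a cooldown.

```python
import time

class CircuitBreaker:
    """Fail fast once a dependency has failed `threshold` times in a row."""

    def __init__(self, threshold: int = 3, cooldown: float = 30.0,
                 clock=time.monotonic):
        self.threshold = threshold
        self.cooldown = cooldown
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.cooldown:
                return fallback()      # fail fast: don't hammer a sick dependency
            self.opened_at = None      # cooldown over: probe the dependency again
            self.failures = 0
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0
        return result

# After `threshold` consecutive failures, callers get the fallback
# immediately instead of queuing behind a dependency that is already down.
```

The point isn’t this particular class; it’s a system that assumes its parts will fail in combination and degrades instead of converging on disaster.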

The Holistic View

I think about Parker H.’s dollhouses again. When a miniature staircase sags, Parker doesn’t just put a stronger nail in. Parker looks at the foundation, the climate control in the display room, and the way the weight of the tiny porcelain dolls is distributed. It’s a holistic approach to fragility. In software, we rarely have that luxury. We are under pressure to ‘get it back up’ and ‘close the incident.’ So we lie. We write down our one root cause and we move on.

3. The Arrogance of Design vs. The Alien Organism

[Figure: the lie vs. the reality. The lie is a single symptom, ‘memory leak’, actionable for a VP; the reality is systemic: culture, thresholds, and non-deterministic factors.]
There’s a certain arrogance in the way we design systems. We assume that because we wrote the code, we understand how it will behave. But code in production is an alien organism, shaped by dozens of interacting factors we never wrote down. To point at one of those factors and call it the ‘root’ is like pointing at the final snowflake that triggers an avalanche and blaming it for the destruction of the village. The snow was already there. The slope was already steep. The wind had been blowing for 11 hours. The snowflake was just the participant that happened to arrive last.

4. Focusing on the Safety Margin

If we want to actually learn from our mistakes, we need to stop asking for the root cause. We should be asking for the ‘Safety Margin.’ How close were we to the edge before we fell over? What were the 21 things that *almost* went wrong last week that we didn’t notice because they didn’t happen all at once?

[Figure: last week’s near misses. High latency (yesterday), a cache miss (last week), a review delay (today), cognitive load (always).]

We spend all our time studying failure, but we rarely study the ‘near misses.’ We don’t look at the days where the system stayed up despite having the same bugs and the same misconfigurations. Those days are the real mystery. Why didn’t it break yesterday? What was the invisible hand that kept the 5 factors from aligning?
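Studying the near misses is mechanical once you name the quantity. A minimal sketch (the limit and sample data are invented for illustration): measure headroom against a limit every day, and flag the days that stayed up but were quietly close to the edge.

```python
def safety_margin(observed: float, limit: float) -> float:
    """Fraction of headroom left before `observed` crosses `limit`.
    1.0 means idle; 0.0 means at (or past) the edge."""
    return max(0.0, (limit - observed) / limit)

def near_misses(samples, limit, margin_floor=0.2):
    """Days the system stayed under the limit, but with less headroom
    than margin_floor -- the outages that didn't happen, yet."""
    flagged = []
    for day, observed in samples:
        margin = safety_margin(observed, limit)
        if 0.0 < margin < margin_floor:
            flagged.append((day, round(margin, 2)))
    return flagged

# Daily connection counts against a limit of 1000 (numbers invented):
samples = [("mon", 420), ("tue", 930), ("wed", 700)]
print(near_misses(samples, limit=1000))  # [('tue', 0.07)]
```

Tuesday never paged anyone, which is exactly why it is the day worth studying.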

We are just lucky, most of the time.

The Final Password Attempt

I find myself staring at the password prompt again. I’ve reached the lockout threshold. 1 attempt left. My brain is trying to find the ‘root cause’ of why I can’t remember my own password. Is it the sleep deprivation? The blue light from the monitor? The fact that I changed it 31 days ago and never fully committed the new one to memory? It’s all of them. It’s a systemic failure of my own cognitive hardware. If I get it wrong again, the system will lock me out, and I’ll have to call IT. They’ll ask me ‘what happened,’ and I’ll probably just say ‘I forgot it.’ Another lie. Another simple story to cover up a complex reality.


Maybe the real problem is that we’ve built a world that is too complex for our narrative-driven brains to handle. We are monkeys trying to manage starships, and we’re still looking for the ‘bad berry’ that made the tribe sick. We crave the simplicity of the root cause because it gives us the illusion of safety. If there’s one cause, there’s one fix. If there’s one fix, we are safe.

But we aren’t safe. We are just lucky. Most of the time, the five things don’t happen at once. Most of the time, the humidity is low when the visitor sneezes near Parker H.’s dollhouse. We mistake that luck for skill, and we mistake our simple post-mortems for understanding.

The Honest Record

I’m going to type ‘Systemic interaction between caching, database latency, and retry logic’ into the box. I know my boss will hate it. I know he’ll ask me to ‘boil it down.’ But tonight, I’m too tired to lie. I’ll give him the 11-page technical breakdown. I’ll give him the 41-item list of contributing factors. I’ll tell him that if he wants to fix it, he has to stop looking for a single root and start looking at the whole forest. He probably won’t listen. He’ll probably just find a way to turn my report into a single bullet point anyway. But at least the record will be honest, even if the summary isn’t.

The cursor is still blinking. 3:51 AM. I take a deep breath, type my password one more time, slowly and deliberately, and it works. I’m in. Now I just have to explain to the world that there is no ‘one thing’ to fix, knowing full well they only have the attention span for one thing. It’s a lonely place to be, standing at the intersection of five different disasters, trying to convince people that the intersection itself is the problem. But someone has to do it. Otherwise, we’re just building bigger dollhouses and waiting for the next sneeze to bring them down.

These are systems running on luck, not skill. To achieve true resilience, the focus must shift from isolating the point of failure to understanding the proximity to systemic fragility.