
The failure that does not look like failure




Failure, in the conventional sense, has a signature. Something stops working. An alarm sounds. An incident is filed. A postmortem is written. Accountability is located, a fix is proposed, the system is restored to a state that resembles its prior functioning. The signature is what makes the failure addressable. The fact that the failure has a shape is what allows it to be picked up and dealt with.

What is being described here is a failure mode that produces none of these things. Nothing stops working. No alarm sounds. No incident is filed. The systems continue to produce outputs. The metrics continue to hold within their expected bands. Customers continue to be served. Dashboards continue to report green. The failure occurs and leaves no signature, because the structures that would normally record it are tuned to a different kind of event.

This is not a failure that is hidden. It is a failure that is not legible to the apparatus built to register failures. The two conditions are different, and the difference matters.


The first shape the silence takes is the alarm that did not sound. Alarms are built to fire on specific kinds of events: crashes, latency spikes, error rates exceeding a threshold, outputs falling outside a defined envelope. These are the failures the system's builders anticipated when they built it. The failure mode in question produces none of them. The outputs are within spec. The latency is normal. The error rate is acceptable. The system is, by every measure the alarms are equipped to take, functioning.

The alarms remain green, not because the failure was small enough to miss, but because the failure does not register on the dimensions the alarms watch. This is not a calibration problem. Making the existing alarms more sensitive would produce more false positives without detecting the failure, because the failure is not a more extreme version of the events the alarms are tuned to. It is a different category of event. The alarms do not detect it for the same reason a smoke detector does not detect a flood. The instruments are accurate. They are also looking at the wrong thing.
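The point about calibration can be made concrete with a small sketch. Everything in it is an assumption made for illustration: the `Sample` type, the `fits_situation` field (standing in for the dimension nobody measures), and the threshold values. It is not a description of any real monitoring system.

```python
from dataclasses import dataclass

@dataclass
class Sample:
    latency_ms: float
    error_rate: float
    fits_situation: bool  # hypothetical: the dimension no alarm reads

def alarm_fires(sample: Sample, max_latency_ms: float, max_error_rate: float) -> bool:
    # The alarm only ever looks at the dimensions it was built to watch.
    return sample.latency_ms > max_latency_ms or sample.error_rate > max_error_rate

# Two samples, identical on every watched dimension.
healthy = Sample(latency_ms=120.0, error_rate=0.010, fits_situation=True)
failing = Sample(latency_ms=120.0, error_rate=0.010, fits_situation=False)

# Progressively tighter (hypothetical) thresholds: (max latency, max error rate).
thresholds = [(500.0, 0.05), (200.0, 0.02), (100.0, 0.005)]

# For each setting, record (fires on healthy, fires on failing).
results = [
    (alarm_fires(healthy, lat, err), alarm_fires(failing, lat, err))
    for lat, err in thresholds
]
```

At every setting the alarm gives the two samples the same verdict. Tightened far enough, it fires on both, which is a false positive on the healthy sample and still not a detection of the failing one.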

The silence here is not the silence of failure that has not yet been reported. It is the silence of failure that is not being measured.


The second shape is the metric that did not move. Conventional failures show up in the numbers, eventually. The number drops, or it spikes, or the chart bends in a direction that gets attention. The metric is the form in which failure becomes legible to the people who do not see the system directly. It is also the form in which failure becomes prioritizable, because the metric is what produces the case for allocating attention to the failure rather than to something else.

The failure mode in question does not move the metrics, because the metrics were not designed to capture the dimension along which the failure runs. The dashboards measure outputs: volume, throughput, accuracy against benchmark, satisfaction scores, completion rates. The failure is not in the outputs. It is in the relationship between the outputs and the situations the outputs were supposed to address. That relationship is interpretive. The metrics are quantitative. The two do not map. The dashboards remain unchanged, not because nothing is changing, but because what is changing is not what the dashboards are watching, and the question of whether the dashboards are watching the right thing is not, in most organizations, a question anyone is structured to ask.

The silence here is the silence of a measurement system that is correctly measuring the wrong dimension.


The third shape is the incident that was never filed. Incident reports are produced when something has gone wrong in a way that has a moment, a location, and a perpetrator. The report names the moment ("at 14:32 on Tuesday"), locates the cause ("the upstream service returned malformed data"), assigns responsibility ("Team X owns this surface"), and recommends a remediation. The genre of the incident report depends on the existence of an event. Without an event, the form has nothing to attach to.

The failure mode in question has no event. It is distributed across hundreds or thousands of interactions, each of which was defensible in isolation. No single output is the failure. The failure is the aggregate, and the aggregate does not happen at a moment. It happens across time, in increments small enough to fall below the threshold at which any single instance would have warranted a report. There is no perpetrator to name, because there was no perpetrating decision. There is no remediation to recommend, because the remediation would require redesigning the architecture that produced the aggregate, which is not what incident reports are written to recommend.
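The arithmetic of the aggregate can be sketched in a few lines. The numbers are invented for illustration, and the assumption that deviation can be counted in uniform units is a simplification the essay itself does not make.

```python
# Hypothetical: the smallest per-interaction deviation that would warrant a report.
REPORT_THRESHOLD = 10

# Hypothetical: a thousand interactions, each off by one unit.
drifts = [1] * 1000

# Each interaction is judged in isolation, the way a report would judge it.
reports_filed = sum(1 for d in drifts if d >= REPORT_THRESHOLD)

# The aggregate is what nobody is asked to judge.
aggregate_drift = sum(drifts)
```

Under these assumptions no report is ever filed, while the accumulated deviation is a hundred times the size that would have triggered one had it arrived in a single interaction.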

The incident is not delayed. It is structurally impossible. The genre of the incident report does not fit the shape of what occurred. The people who would have written reports do not write them, not because they are negligent, but because the form does not apply. The silence here is the silence of a knowledge-recording system whose categories the situation does not occupy.


The signature is the absence of signature. That is the condition worth being able to see. The failure mode is not invisible because it is small. It is invisible because the instruments that would render it visible are tuned to events, and the failure is not an event. It is a condition.

Conditions do not get filed against. They are inhabited. People who inhabit a condition without being able to name it do not act on it, because the actions available to them are calibrated to events, and the condition is not an event. They notice that something is off. They notice it for a long time. The noticing does not produce action, because there is no action shaped to the thing they are noticing.

The reason this failure mode is worth watching is precisely that it does not announce itself. The failure modes that announce themselves get addressed, because the announcing is what triggers the address. The failure modes that do not announce themselves become permanent, not because anyone decided they should be permanent, but because the mechanisms by which problems become priorities run on signals the failure does not emit. The address is downstream of the alarm. No alarm, no address.

Nothing has gone wrong, in the conventional sense. Not everything is right, either. The gap between those two facts is the failure mode. It has no name, because naming runs through the same apparatus that detection does, and the apparatus does not catch it. The lack of a name is not incidental. It is part of what makes the failure durable. The condition continues because there is no available form in which it could end.

