There’s a great article from 2012 titled Blameless PostMortems and a Just Culture. It was written by John Allspaw, an SVP of Technical Operations at Etsy, and it tackles how to handle errors and incidents after they have happened. I recommend that everybody in IT read it because it has the power to change your team culture for the better.
The premise of the article is that failure is going to happen in complex systems. Humans will also make mistakes, and you can’t just replace the mistake-making humans with the latest model of non-mistake-making humans, which is what the traditional “Bad Apple Theory” assumes you can do.
Instead of asking who’s to blame for causing an outage, the focus should be on learning from the mistake and preventing it in the future. To accomplish this, Allspaw uses blameless postmortems. This establishes a Just Culture, one that balances the safety of the systems with accountability. If an engineer fears punishment or retribution for trying to do their job, they are less likely to provide all of the details needed to learn what truly happened.
With a blameless postmortem, the management team is trying to establish a timeline of events, the actions taken, the results of those actions, what the engineer expected those actions to do, and any assumptions made in reaching these decisions. Getting the full picture of the incident helps reveal whether there was a fault in the logic, bad information, or a system that reacted differently than documented.
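To make that list concrete, here is a minimal sketch in Python of how such a record could be laid out. Every class and field name below is my own invention for illustration; this is not taken from Allspaw’s article and it is not how Etsy stores postmortems.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class TimelineEntry:
    """One action taken during the incident (hypothetical structure)."""
    at: datetime          # when the action was taken
    action: str           # what was done
    expected: str         # what the engineer expected to happen
    observed: str         # what actually happened
    assumptions: list[str] = field(default_factory=list)

    def surprised(self) -> bool:
        # True when the result did not match the expectation.
        return self.expected != self.observed


@dataclass
class Postmortem:
    """A collection of timeline entries for one incident (hypothetical)."""
    title: str
    entries: list[TimelineEntry] = field(default_factory=list)

    def timeline(self) -> list[TimelineEntry]:
        # Entries in chronological order, regardless of insertion order.
        return sorted(self.entries, key=lambda e: e.at)

    def surprises(self) -> list[TimelineEntry]:
        # Entries where expectation and result diverged, in time order.
        return [e for e in self.timeline() if e.surprised()]
```

Note that the sketch has no field for who took the action. That is a deliberate choice on my part to keep the record about what happened and why it seemed reasonable at the time. The entries where expectation and result diverge are the ones worth the most discussion, since they point at bad information, faulty logic, or a system that did not behave as documented.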
Firing the person who made a mistake and learned from it does the organization a disservice, because their replacement is an engineer who has neither made the mistake nor learned from the event. Only by understanding the individual, technical, or organizational reasoning behind the decisions that led to a problem can the organization expect to fix the true cause of these outages.
A culture of “name-and-blame” along with “cover your ass” does not build teamwork and leaves management unable to manage. The person who has learned from an outage is unlikely to repeat it and should be trumpeting the solution, sharing a better method, or correcting others’ logic to prevent that same situation from arising.
One option is to assume the single cause is incompetence and scream at engineers to make them “pay attention!” or “be more careful!”
Another option is to take a hard look at how the accident actually happened, treat the engineers involved with respect, and learn from the event.
Along with the Just Culture, blameless postmortems, and this article, Etsy also published Morgue on GitHub, a tool for holding incident postmortem details.