Blameless Postmortems: How to Actually Do Them

Bias Fundamental attribution error Confirmation bias Definition Countermeasure Discuss ‘what’ questions instead of ‘who’. What people do reflects their character Focus on the system, the infrastructure, rather than their circumstances. and the situation - not the people involved. Favoring information that reinforces existing positions. Appoint someone to play devil’s advocate to take contrarian viewpoints during investigations. Hindsight bias Explain events in terms of foresight Seeing the incident as inevitable despite instead. Start your timeline analysis at a there having been little or no objective point before the incident, and work your basis for predicting it because we know way forward instead of backward from the outcome. resolution. Negativity bias Reframe incidents as learning Things of a more negative nature have a opportunities, and remember to describe greater effect on one’s mental state than what was handled well in incident neutral or even positive things. response. @mattstratton

Owner responsibilities • Scheduling the postmortem meeting on the shared calendar and inviting the relevant people (this should be scheduled within 3 business days for a Sev-1 and 5 business days for a Sev-2). • Investigating the incident, pulling in whomever you need from other teams to assist in the investigation. • Ensuring the page is updated with all of the necessary content. Use your organization’s template for what should be included. • Creating follow-up tickets. (You are only responsible for creating the tickets, not following them up to resolution). • Reviewing the postmortem content with appropriate parties before the meeting. Running through the topics at the postmortem meeting (the Incident Commander will “run” the meeting and keep the discussion on track, but you will likely be doing most of the talking). • Communicating the results of the postmortem internally. @mattstratton

Key Takeaways • The postmortem process drives focus, instills a culture of learning, and identifies opportunities for improvement that otherwise would be lost. • The goal of the postmortem is to understand what systemic factors led to the incident and identify actions that can that can improve the resiliency of the affected system • An essential outcome of the postmortem meeting is buy-in for the action plan • Ask “what” and “how” questions rather than “who” or “why” • There is no single root cause of major failure in complex systems, but a combination of contributing factors that together lead to failure • An individual’s action should never be considered a root cause. • The impulse to blame and punish has the unintended effect of disincentivizing the knowledge sharing required to learn from incidents @mattstratton