Avengers Assemble - The Thanos Incident

A presentation at DevOpsDays Amsterdam 2022 in June 2022 in Amsterdam, Netherlands by Matt Stratton

Slide 1

AVENGERS ASSEMBLE
THE THANOS INCIDENT

Slide 2

@mattstratton

Slide 3

SPOILER WARNINGS

Slide 4

@mattstratton

Slide 5

SO, WHAT HAPPENED?

Slide 6

SO, WHAT HAPPENED?
▸ We will start by creating a post-mortem on the incident of “The Snap”
▸ Our approach will be to address this in a blameless fashion
▸ We want to understand what happened, as well as the process the Avengers used
▸ For purposes of this discussion, “The Avengers” will include characters not generally considered Avengers; please bear with me

Slide 7

SO, WHAT HAPPENED?

Slide 8

CONSTRUCT OUR TIMELINE
▸ Stick to the facts
▸ Include key decisions and actions taken by responders
▸ Avoid evaluating what should or shouldn’t have been done
▸ Start the timeline at a point before the incident began
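
The timeline rules above can be sketched as a small data structure. This is only an illustration, not anything shown in the talk: the `TimelineEntry` type, the `build_timeline` helper, and the timestamps are all hypothetical. The sketch records facts only (when, who, what), sorts them chronologically, and checks that the record begins before the incident itself.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class TimelineEntry:
    at: datetime  # when it happened
    actor: str    # who or what acted
    event: str    # what happened, stated as fact with no evaluation


def build_timeline(entries, incident_start):
    """Return entries in chronological order.

    Raises ValueError if the earliest entry is not before the incident,
    since the timeline should start at a point before the incident began.
    """
    ordered = sorted(entries, key=lambda e: e.at)
    if not ordered or ordered[0].at >= incident_start:
        raise ValueError("timeline should start before the incident began")
    return ordered


# Made-up timestamps, for illustration only.
snap_time = datetime(2018, 1, 1, 14, 0)
entries = [
    TimelineEntry(snap_time, "Thanos", "Snaps his fingers"),
    TimelineEntry(datetime(2018, 1, 1, 9, 0), "Thanos",
                  "Obtains the Power Stone and the Space Stone"),
]

timeline = build_timeline(entries, incident_start=snap_time)
for entry in timeline:
    print(entry.at.isoformat(), entry.actor, entry.event)
```

Keeping `event` as a plain statement of fact leaves evaluation for the later analysis, not the timeline.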

Slide 9

HIGH LEVEL TIMELINE
▸ Thanos obtains the Power Stone and the Space Stone
▸ Thor and the Guardians of the Galaxy decide to split up - Thor to head to Nidavellir with Rocket and Groot, the rest to head to Knowhere
▸ Thanos retrieves the Reality Stone from The Collector on Knowhere
▸ Dr. Strange uses the Time Stone to view millions of possible futures
▸ Thanos sacrifices Gamora on Vormir to obtain the Soul Stone
▸ Several team members attempt to recover the Infinity Gauntlet from Thanos on Titan

Slide 10

HIGH LEVEL TIMELINE
▸ Dr. Strange decides to exchange the Time Stone for Tony Stark’s life
▸ Shuri works to remove the Mind Stone from Vision
▸ Various team members attempt to defend Vision in Wakanda
▸ Thanos obtains the Mind Stone from Vision
▸ Thor attacks Thanos but is unable to defeat him
▸ Thanos snaps his fingers, wiping out half of all life

Slide 11

Groot Cause Analysis
Credit to @djpiebob and @this_hits_home

Slide 12

SYSTEMS ARE COMPLEX

Slide 13

THERE IS NO SINGLE ROOT CAUSE OF MAJOR FAILURE IN COMPLEX SYSTEMS, BUT A COMBINATION OF CONTRIBUTING FACTORS THAT TOGETHER LEAD TO FAILURE

Slide 14

Blameless

Slide 15

WHY DOES BLAMELESSNESS MATTER?
▸ This impulse to blame and punish has the unintended effect of disincentivizing the knowledge sharing required to prevent future failure
▸ The goal of the postmortem is to understand what systemic factors led to the incident and identify actions that can help improve the resilience of the system
▸ Stay focused on how a mistake was made instead of who made it

Slide 16

WHY IS BLAMELESSNESS HARD?
▸ When processing information, the human mind unconsciously takes shortcuts
▸ We are hard-wired from millions of years of evolutionary neurobiology to tend to blame
▸ The human mind optimizes for timeliness over accuracy, which is reinforced by cognitive biases

Slide 17

@mattstratton

Slide 18

COGNITIVE BIASES
▸ Fundamental attribution error
▸ Confirmation bias
▸ Hindsight bias
▸ Negativity bias

Slide 19

FUNDAMENTAL ATTRIBUTION ERROR

Slide 20

CONFIRMATION BIAS

Slide 21

HINDSIGHT BIAS

Slide 22

NEGATIVITY BIAS

Slide 23

HOW TO AVOID BLAME

Slide 24

ASK “WHAT” AND “HOW” QUESTIONS RATHER THAN “WHO” OR “WHY”

Slide 25

CONSIDER MULTIPLE AND DIVERSE PERSPECTIVES

Slide 26

ASK YOURSELF WHY A REASONABLE, RATIONAL, AND DECENT PERSON MAY HAVE TAKEN A PARTICULAR ACTION

Slide 27

ABSTRACT TO A NONSPECIFIC RESPONDER

Slide 28

CONTRAST WHAT YOU DID NOT INTEND WITH WHAT YOU DO INTEND

Slide 29

ALL PRACTITIONER ACTIONS ARE ACTUALLY GAMBLES, THAT IS, ACTS THAT TAKE PLACE IN THE FACE OF UNCERTAIN OUTCOMES.
Dr. Richard Cook

Slide 30

YOU NEVER KNOW. YOU HOPE FOR THE BEST, THEN MAKE DO WITH WHAT YOU’VE GOT
Nick Fury

Slide 31

What can we learn?

Slide 32

@mattstratton

Slide 33

Have An Incident Commander

Slide 34

DELEGATE AND COORDINATE

Slide 35

DECISION MAKER

Slide 36

SINGLE SOURCE OF TRUTH

Slide 37

SHOULD NOT BE A RESPONDER

Slide 38

Who should be the Incident Commander for the Avengers?

Slide 39

Rotations Matter

Slide 40

Escalate and Bring in Help

Slide 41

Hero Culture

Slide 42

Teamwork Makes the Dream Work

Slide 43

SHARE ON-CALL
▸ Carol Danvers is the only one who carries a pager!
▸ The more folks on-call, the less the load for everyone
▸ Having a consistent mechanism for bringing in experts for incident response is key
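
The load-sharing point above can be made concrete with a little arithmetic. This is a minimal sketch assuming a simple weekly round-robin rotation; the `rotation` helper and the four-person roster are hypothetical and only illustrate how per-person load drops as more people share the pager.

```python
from collections import Counter
from itertools import cycle, islice


def rotation(responders, weeks):
    """Assign one responder per week, round-robin, for the given number of weeks."""
    return list(islice(cycle(responders), weeks))


# One person carrying the pager all year.
solo = Counter(rotation(["Carol Danvers"], 52))

# The same year shared across a hypothetical four-person roster.
shared = Counter(
    rotation(["Carol Danvers", "Thor", "Tony Stark", "Shuri"], 52)
)

print("solo weeks on-call:", solo["Carol Danvers"])
print("shared weeks on-call:", shared["Carol Danvers"])
```

With four people in the rotation, each person covers a quarter of the weeks that the solo responder did.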

Slide 44

@mattstratton

Slide 45

SO WHAT HAVE WE LEARNED?

Slide 46

@mattstratton

Slide 47

And perhaps most of all…

Slide 48

@mattstratton

Slide 49

https://speaking.mattstratton.com

Slide 50

ACKNOWLEDGEMENTS
▸ Jeremy Meiss @IAmJerdog
▸ Karissa Peth @karissapeth
▸ Nell Shamrell-Harrington @nellshamrell
▸ Sarai Rosenberg @saraislet
▸ Ryan Kitchens @this_hits_home