How Do You Infect Your Organization With Humane Ops?

A presentation at 2019 NanoConf in May 2019 in Redmond, WA, USA by Matt Stratton

Slide 1

HOW TO INFECT YOUR ORGANIZATION WITH HUMANE OPS Matty Stratton DevOps Advocate, PagerDuty @mattstratton

Slide 2

@mattstratton

Slide 3

@mattstratton

Slide 4

@mattstratton

Slide 5

@mattstratton

Slide 6

@mattstratton

Slide 7

@mattstratton

Slide 8

🔥📟 @mattstratton

Slide 9

@mattstratton

Slide 10

@mattstratton

Slide 11

THE DATA 50,000 RESPONDERS RECEIVING A TOTAL OF 760 MILLION NOTIFICATIONS @mattstratton

Slide 12

THE DATA 50,000 RESPONDERS RECEIVING A TOTAL OF 760 MILLION NOTIFICATIONS ▸ 60 million notifications during dinner hours @mattstratton

Slide 13

THE DATA 50,000 RESPONDERS RECEIVING A TOTAL OF 760 MILLION NOTIFICATIONS ▸ 60 million notifications during dinner hours ▸ 82 million notifications during evening hours @mattstratton

Slide 14

THE DATA 50,000 RESPONDERS RECEIVING A TOTAL OF 760 MILLION NOTIFICATIONS ▸ 60 million notifications during dinner hours ▸ 82 million notifications during evening hours ▸ 250 million notifications during sleeping hours @mattstratton

Slide 15

THE DATA 50,000 RESPONDERS RECEIVING A TOTAL OF 760 MILLION NOTIFICATIONS ▸ 60 million notifications during dinner hours ▸ 82 million notifications during evening hours ▸ 250 million notifications during sleeping hours ▸ 122 million notifications on weekends @mattstratton

Slide 16

THE DATA 50,000 RESPONDERS RECEIVING A TOTAL OF 760 MILLION NOTIFICATIONS ▸ 60 million notifications during dinner hours ▸ 82 million notifications during evening hours ▸ 250 million notifications during sleeping hours ▸ 122 million notifications on weekends ▸ A total of 750,000 nights with sleep-interrupting notifications @mattstratton

Slide 17

THE DATA 50,000 RESPONDERS RECEIVING A TOTAL OF 760 MILLION NOTIFICATIONS ▸ 60 million notifications during dinner hours ▸ 82 million notifications during evening hours ▸ 250 million notifications during sleeping hours ▸ 122 million notifications on weekends ▸ A total of 750,000 nights with sleep-interrupting notifications ▸ A total of 330,000 weekend days with interrupt notifications @mattstratton

Slide 18

LET’S HAVE SOME DATA THE MOST MEANINGFUL METRICS ON ATTRITION ARE @mattstratton

Slide 19

LET’S HAVE SOME DATA THE MOST MEANINGFUL METRICS ON ATTRITION ARE ▸ Number of days where a responder’s work and life are interrupted @mattstratton

Slide 20

LET’S HAVE SOME DATA THE MOST MEANINGFUL METRICS ON ATTRITION ARE ▸ Number of days where a responder’s work and life are interrupted ▸ Number of days when a responder is woken overnight @mattstratton

Slide 21

LET’S HAVE SOME DATA THE MOST MEANINGFUL METRICS ON ATTRITION ARE ▸ Number of days where a responder’s work and life are interrupted ▸ Number of days when a responder is woken overnight ▸ Number of weekend days interrupted by notifications. @mattstratton
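
One way these three metrics might be counted from a responder's notification timestamps is sketched below. The sleeping-hours window (22:00 to 06:00), the rule that any notification makes a day "interrupted", and the name `attrition_metrics` are all assumptions for illustration; the talk does not define them.

```python
from datetime import datetime, timedelta

# Assumed sleeping-hours window; the slides do not specify one.
SLEEP_START_HOUR = 22
SLEEP_END_HOUR = 6


def attrition_metrics(notifications):
    """Count distinct interrupted days for one responder.

    `notifications` is an iterable of datetimes in the responder's local time.
    """
    interrupted_days = set()
    interrupted_nights = set()
    interrupted_weekend_days = set()
    for ts in notifications:
        day = ts.date()
        interrupted_days.add(day)
        if ts.hour >= SLEEP_START_HOUR:
            interrupted_nights.add(day)
        elif ts.hour < SLEEP_END_HOUR:
            # An early-morning page belongs to the night that began the evening before.
            interrupted_nights.add(day - timedelta(days=1))
        if ts.weekday() >= 5:  # Saturday is 5, Sunday is 6
            interrupted_weekend_days.add(day)
    return {
        "days_interrupted": len(interrupted_days),
        "nights_woken": len(interrupted_nights),
        "weekend_days_interrupted": len(interrupted_weekend_days),
    }
```

Counting distinct days rather than raw notifications follows the slide's framing: ten pages in one night still cost the responder one night of sleep.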

Slide 22

@mattstratton

Slide 23

EXAMPLES OF MEMES ARE TUNES, IDEAS, CATCH-PHRASES, CLOTHES FASHIONS, WAYS OF MAKING POTS OR OF BUILDING ARCHES. JUST AS GENES PROPAGATE THEMSELVES IN THE GENE POOL BY LEAPING FROM BODY TO BODY, SO MEMES PROPAGATE THEMSELVES IN THE MEME POOL BY LEAPING FROM BRAIN TO BRAIN VIA IMITATION. - Richard Dawkins @mattstratton

Slide 24

SNOW CRASH @mattstratton

Slide 25

SNOW CRASH ▸ In the book, “Snow Crash” itself is a neurolinguistic virus. @mattstratton

Slide 26

SNOW CRASH ▸ In the book, “Snow Crash” itself is a neurolinguistic virus. ▸ The bad guys figure out how to unlock it, and it spreads from hacker to hacker like a meme @mattstratton

Slide 27

SNOW CRASH ▸ In the book, “Snow Crash” itself is a neurolinguistic virus. ▸ The bad guys figure out how to unlock it, and it spreads from hacker to hacker like a meme ▸ Plus, lots of swordplay @mattstratton

Slide 28

SNOW CRASH ▸ In the book, “Snow Crash” itself is a neurolinguistic virus. ▸ The bad guys figure out how to unlock it, and it spreads from hacker to hacker like a meme ▸ Plus, lots of swordplay “IDEOLOGY IS A VIRUS.” - NEAL STEPHENSON @mattstratton

Slide 29

WHAT IF YOU ARE THE SUPREME LEADER? @mattstratton

Slide 30

WHAT IF YOU ARE THE SUPREME LEADER? ▸ “Command and control” doesn’t work @mattstratton

Slide 31

WHAT IF YOU ARE THE SUPREME LEADER? ▸ “Command and control” doesn’t work ▸ Use measurement for good, not for evil @mattstratton

Slide 32

WHAT IF YOU ARE THE SUPREME LEADER? ▸ “Command and control” doesn’t work ▸ Use measurement for good, not for evil ▸ Avoid “executive swoop” @mattstratton

Slide 33

WHAT IF YOU ARE THE SUPREME LEADER? ▸ “Command and control” doesn’t work ▸ Use measurement for good, not for evil ▸ Avoid “executive swoop” @mattstratton

Slide 34

MIDDLE MANAGEMENT TIPS @mattstratton

Slide 35

MIDDLE MANAGEMENT TIPS ▸ Encourage safe post-incident review spaces @mattstratton

Slide 36

MIDDLE MANAGEMENT TIPS ▸ Encourage safe post-incident review spaces ▸ Drive for a culture of learning @mattstratton

Slide 37

MIDDLE MANAGEMENT TIPS ▸ Encourage safe post-incident review spaces ▸ Drive for a culture of learning ▸ Take care of your people @mattstratton

Slide 38

REVIEW. REVIEW. REVIEW A CULTURE OF LEARNING @mattstratton

Slide 39

REVIEW. REVIEW. REVIEW A CULTURE OF LEARNING ▸ In a generative, performance-oriented organization, “failure leads to inquiry.” @mattstratton

Slide 40

REVIEW. REVIEW. REVIEW A CULTURE OF LEARNING ▸ In a generative, performance-oriented organization, “failure leads to inquiry.” ▸ Don’t take my word for it. Ask Ron Westrum. @mattstratton

Slide 41

REVIEW. REVIEW. REVIEW A CULTURE OF LEARNING ▸ In a generative, performance-oriented organization, “failure leads to inquiry.” ▸ Don’t take my word for it. Ask Ron Westrum. ▸ You can also ask Dr. Nicole Forsgren - @nicolefv @mattstratton

Slide 42

REVIEW. REVIEW. REVIEW A CULTURE OF LEARNING ▸ In a generative, performance-oriented organization, “failure leads to inquiry.” ▸ Don’t take my word for it. Ask Ron Westrum. ▸ You can also ask Dr. Nicole Forsgren - @nicolefv http://bit.ly/2KpzKKW @mattstratton

Slide 43

USE THE FORCE, EVEN IF YOU AREN’T A JEDI @mattstratton

Slide 44

REVIEW ALL THE THINGS @mattstratton

Slide 45

HAND-OFF TIME THE ON-CALL REVIEW @mattstratton

Slide 46

HAND-OFF TIME THE ON-CALL REVIEW ▸ Primary purpose is to understand on-call load and pain @mattstratton

Slide 47

HAND-OFF TIME THE ON-CALL REVIEW ▸ Primary purpose is to understand on-call load and pain ▸ Approximately a week’s worth of on-call history is common @mattstratton

Slide 48

HAND-OFF TIME THE ON-CALL REVIEW ▸ Primary purpose is to understand on-call load and pain ▸ Approximately a week’s worth of on-call history is common ▸ Take about 30 minutes, give or take @mattstratton

Slide 49

ON-CALL REVIEW, CONTINUED @mattstratton

Slide 50

ON-CALL REVIEW, CONTINUED ▸ Typically instituted by a team manager @mattstratton

Slide 51

ON-CALL REVIEW, CONTINUED ▸ Typically instituted by a team manager ▸ Usually run by on-call responders @mattstratton

Slide 52

ON-CALL REVIEW, CONTINUED ▸ Typically instituted by a team manager ▸ Usually run by on-call responders ▸ Minimum attendees are the team manager, outgoing on-call, and incoming on-call @mattstratton

Slide 53

ON-CALL REVIEW, CONTINUED ▸ Typically instituted by a team manager ▸ Usually run by on-call responders ▸ Minimum attendees are the team manager, outgoing on-call, and incoming on-call ▸ BETTER PRACTICE - include the entire team! @mattstratton

Slide 54

REVIEW. REVIEW. REVIEW NORMALIZATION OF DEVIANCE @mattstratton

Slide 55

REVIEW. REVIEW. REVIEW NORMALIZATION OF DEVIANCE ▸ The gradual process through which unacceptable practice or standards become acceptable. As the deviant behavior is repeated without catastrophic results, it becomes the social norm for the organization. @mattstratton

Slide 56

REVIEW. REVIEW. REVIEW NORMALIZATION OF DEVIANCE ▸ The gradual process through which unacceptable practice or standards become acceptable. As the deviant behavior is repeated without catastrophic results, it becomes the social norm for the organization. ▸ This happened to NASA. Twice. @mattstratton

Slide 57

REVIEW. REVIEW. REVIEW NORMALIZATION OF DEVIANCE ▸ The gradual process through which unacceptable practice or standards become acceptable. As the deviant behavior is repeated without catastrophic results, it becomes the social norm for the organization. ▸ This happened to NASA. Twice. ▸ In our case, we start to accept alerts or degradations as acceptable. @mattstratton

Slide 58

REVIEW. REVIEW. REVIEW NORMALIZATION OF DEVIANCE ▸ The gradual process through which unacceptable practice or standards become acceptable. As the deviant behavior is repeated without catastrophic results, it becomes the social norm for the organization. ▸ This happened to NASA. Twice. ▸ In our case, we start to accept alerts or degradations as acceptable. http://bit.ly/2Ihj1wV @mattstratton

Slide 59

QUESTION METRICS @mattstratton

Slide 60

QUESTION METRICS WHY ARE WE USING THESE NUMBERS? @mattstratton

Slide 61

QUESTION METRICS WHY ARE WE USING THESE NUMBERS? ▸ What data drives your incident process? @mattstratton

Slide 62

QUESTION METRICS WHY ARE WE USING THESE NUMBERS? ▸ What data drives your incident process? ▸ Are your metrics tied to business outcomes? @mattstratton

Slide 63

QUESTION METRICS WHY ARE WE USING THESE NUMBERS? ▸ What data drives your incident process? ▸ Are your metrics tied to business outcomes? ▸ Correlation doesn’t always equal causation @mattstratton

Slide 64

SIMPLE. ALWAYS. @mattstratton

Slide 65

KEEP IT SIMPLE @mattstratton

Slide 66

KEEP IT SIMPLE THE MORE RESILIENTLY THE SYSTEM IS DESIGNED, THE MORE LIKELY IT IS TO CAUSE A NEGATIVE BUSINESS IMPACT @mattstratton

Slide 67

KEEP IT SIMPLE THE MORE RESILIENTLY THE SYSTEM IS DESIGNED, THE MORE LIKELY IT IS TO CAUSE A NEGATIVE BUSINESS IMPACT Stratton’s Law of Catastrophic Predestination @mattstratton

Slide 68

COMMUNICATE. TALK TO PEOPLE @mattstratton

Slide 69

COMMUNICATE. TALK TO PEOPLE ▸ Who are your customers? What are their expectations? @mattstratton

Slide 70

COMMUNICATE. TALK TO PEOPLE ▸ Who are your customers? What are their expectations? ▸ Whose customer are you? Can you help them out? @mattstratton

Slide 71

COMMUNICATE. TALK TO PEOPLE ▸ Who are your customers? What are their expectations? ▸ Whose customer are you? Can you help them out? ▸ What are the perceptions of your team? @mattstratton

Slide 72

HUMANS, PEOPLE ARE @mattstratton

Slide 73

HUMANS, PEOPLE ARE ▸ Consider contextual on-call @mattstratton

Slide 74

HUMANS, PEOPLE ARE ▸ Consider contextual on-call ▸ The Golden Rule @mattstratton

Slide 75

HUMANS, PEOPLE ARE ▸ Consider contextual on-call ▸ The Golden Rule ▸ Bake cookies @mattstratton

Slide 76

HUMANS, PEOPLE ARE ▸ Consider contextual on-call ▸ The Golden Rule ▸ Bake cookies @mattstratton

Slide 77

LEARN TO TAKE COMMAND INCIDENT COMMAND @mattstratton

Slide 78

MAKE IT NICE ON THE BRIDGE DURING A CALL @mattstratton

Slide 79

MAKE IT NICE ON THE BRIDGE DURING A CALL ▸ Have clearly defined roles @mattstratton

Slide 80

MAKE IT NICE ON THE BRIDGE DURING A CALL ▸ Have clearly defined roles ▸ Avoid bystander effect @mattstratton

Slide 81

MAKE IT NICE ON THE BRIDGE DURING A CALL ▸ Have clearly defined roles ▸ Avoid bystander effect ▸ Rally fast, disband faster @mattstratton

Slide 82

MAKE IT NICE ON THE BRIDGE DURING A CALL ▸ Have clearly defined roles ▸ Avoid bystander effect ▸ Rally fast, disband faster ▸ Don’t litigate severity @mattstratton

Slide 83

MAKE IT NICE ON THE BRIDGE DURING A CALL ▸ Have clearly defined roles ▸ Avoid bystander effect ▸ Rally fast, disband faster ▸ Don’t litigate severity ▸ Have a clear mechanism for making decisions @mattstratton

Slide 84

SHARING IS CARING SHARE ALL TESTS @mattstratton

Slide 85

SHARE ALL TESTS TESTS ARE FOR SWE AND SRE BOTH @mattstratton

Slide 86

SHARE ALL TESTS TESTS ARE FOR SWE AND SRE BOTH ▸ All functional tests used in preproduction should have a corresponding monitor in production @mattstratton

Slide 87

SHARE ALL TESTS TESTS ARE FOR SWE AND SRE BOTH ▸ All functional tests used in preproduction should have a corresponding monitor in production ▸ All monitoring functionality in production should have corresponding tests in the build/release process @mattstratton

Slide 88

SHARE ALL TESTS TESTS ARE FOR SWE AND SRE BOTH ▸ All functional tests used in preproduction should have a corresponding monitor in production ▸ All monitoring functionality in production should have corresponding tests in the build/release process ▸ Monitoring is testing with a time dimension. There should be full parity between preproduction and production. @mattstratton
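
The parity idea on this slide can be sketched by writing a check once and calling it from both a preproduction test and a production monitor. Everything named here (`check_checkout_total`, `monitor`, the sample totals) is hypothetical and only illustrates the pattern; a real monitor would probe a live service and report to an alerting system.

```python
import time


def check_checkout_total(compute_total):
    """Functional check: True when the (hypothetical) total is computed correctly."""
    return compute_total([10, 20, 5]) == 35


def test_checkout_total():
    # Preproduction: run the check once in the build/release process.
    assert check_checkout_total(sum)


def monitor(check, target, runs, interval_seconds=0.0, sleep=time.sleep):
    """Production: run the same check repeatedly and record a timestamp with each result."""
    results = []
    for _ in range(runs):
        results.append((time.time(), check(target)))
        sleep(interval_seconds)
    return results
```

Because the test and the monitor share one check, a change to expected behavior has to be made in a single place, which is one way to keep preproduction and production from drifting apart.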

Slide 89

EVERY SPRINT DO ONE NICE THING @mattstratton

Slide 90

HELP YOUR RESPONDERS IN EACH AND EVERY SPRINT @mattstratton

Slide 91

HELP YOUR RESPONDERS IN EACH AND EVERY SPRINT ▸ In each sprint/work unit, add value to your responders @mattstratton

Slide 92

HELP YOUR RESPONDERS IN EACH AND EVERY SPRINT ▸ In each sprint/work unit, add value to your responders ▸ Even if it’s not on a card @mattstratton

Slide 93

HELP YOUR RESPONDERS IN EACH AND EVERY SPRINT ▸ In each sprint/work unit, add value to your responders ▸ Even if it’s not on a card ▸ You rebel, you. @mattstratton

Slide 94

ADDING VALUE SOME EXAMPLES @mattstratton

Slide 95

ADDING VALUE SOME EXAMPLES ▸ Provide better context in logging (stacktraces alone don’t count) @mattstratton

Slide 96

ADDING VALUE SOME EXAMPLES ▸ Provide better context in logging (stacktraces alone don’t count) ▸ Remove some technical debt. Yes, you have some. @mattstratton

Slide 97

ADDING VALUE SOME EXAMPLES ▸ Provide better context in logging (stacktraces alone don’t count) ▸ Remove some technical debt. Yes, you have some. ▸ Add some (useful) tests @mattstratton

Slide 98

ADDING VALUE SOME EXAMPLES ▸ Provide better context in logging (stacktraces alone don’t count) ▸ Remove some technical debt. Yes, you have some. ▸ Add some (useful) tests ▸ Remove something unused @mattstratton
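
As a rough illustration of the first bullet, the sketch below logs a failure with the identifiers a responder would need, in addition to the stack trace. The logger name `payments`, the `charge` function, and its parameters are hypothetical; only the standard-library `logging` calls are real.

```python
import logging

logger = logging.getLogger("payments")  # hypothetical service name


def charge(order_id, amount_cents, gateway):
    """Call a (hypothetical) payment gateway, logging context on failure."""
    try:
        return gateway(order_id, amount_cents)
    except Exception:
        # logger.exception attaches the stack trace; the message carries the
        # context that the trace alone does not.
        logger.exception(
            "charge failed: order_id=%s amount_cents=%d gateway=%s",
            order_id,
            amount_cents,
            getattr(gateway, "__name__", repr(gateway)),
        )
        raise
```

The exception is re-raised so that callers still see the failure; the log line only adds what was being attempted when it happened.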

Slide 99

ADDING VALUE @mattstratton

Slide 100

ADDING VALUE ▸ If you use feature flags, add a description field to the configuration @mattstratton

Slide 101

ADDING VALUE ▸ If you use feature flags, add a description field to the configuration ▸ If you use runbooks, ensure they are up to date every time you cut a release. If you don’t do this, abandon the runbook altogether (an incorrect runbook is considered harmful) @mattstratton

Slide 102

ADDING VALUE ▸ If you use feature flags, add a description field to the configuration ▸ If you use runbooks, ensure they are up to date every time you cut a release. If you don’t do this, abandon the runbook altogether (an incorrect runbook is considered harmful) ▸ SIMPLIFY, MAN! @mattstratton
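
The feature-flag suggestion might look something like the sketch below, where a flag cannot be defined without a description. The `FeatureFlag` class, the `FLAGS` registry, and the example flag are hypothetical and not tied to any particular feature-flag product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureFlag:
    name: str
    enabled: bool
    description: str  # what the flag controls and what turning it off affects

    def __post_init__(self):
        # Reject flags that would leave a responder guessing mid-incident.
        if not self.description.strip():
            raise ValueError(f"feature flag {self.name!r} needs a description")


FLAGS = {
    "new_checkout": FeatureFlag(
        name="new_checkout",
        enabled=True,
        description="Routes checkout through the new service; disable to fall back to the old flow.",
    ),
}
```

Making the description mandatory at definition time means the context is written by the person who knows it, not reconstructed by whoever is on call.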

Slide 103

@MATTSTRATTON LINKEDIN.COM/IN/MATTSTRATTON MATTSTRATTON.COM ARRESTEDDEVOPS.COM SHARE YOUR ON-CALL STORIES WITH ME LATER @mattstratton

Slide 104

SPEAKING.MATTSTRATTON.COM @mattstratton

Slide 105

FURTHER READING AND REFERENCES ▸ Improving Your Employee Retention With Real-Time Ops Data - http://bit.ly/2rGTnq4 ▸ Page It Forward! - http://bit.ly/2In8Lzc ▸ The study of information flow: A personal journey - http://bit.ly/2KpzKKW ▸ The Normalization of Deviance (If It Can Happen to NASA, It Can Happen to You) - http://bit.ly/2Ihj1wV @mattstratton

Slide 106

▸ Snow Crash by Neal Stephenson - http://bit.ly/2Iiuc8L ▸ The Cybersecurity Canon: Snow Crash - http://bit.ly/2InDYGI ▸ Disasters! Arrested DevOps Episode 37 - https://arresteddevops.com/37 ▸ PagerDuty Incident Response - https://response.pagerduty.com ▸ Operational Reviews - https://reviews.pagerduty.com @mattstratton