HOW TO INFECT YOUR ORGANIZATION WITH HUMANE OPS
Matty Stratton, DevOps Advocate, PagerDuty
@mattstratton
Slide 2
@mattstratton
Slide 3
@mattstratton
Slide 4
@mattstratton
Slide 5
@mattstratton
Slide 6
@mattstratton
Slide 7
@mattstratton
Slide 8
🔥📟 @mattstratton
Slide 9
@mattstratton
Slide 10
@mattstratton
Slide 11
THE DATA
50,000 RESPONDERS RECEIVING A TOTAL OF 760 MILLION NOTIFICATIONS
@mattstratton
Slide 12
THE DATA
50,000 RESPONDERS RECEIVING A TOTAL OF 760 MILLION NOTIFICATIONS ▸ 60 million notifications during dinner hours
@mattstratton
Slide 13
THE DATA
50,000 RESPONDERS RECEIVING A TOTAL OF 760 MILLION NOTIFICATIONS ▸ 60 million notifications during dinner hours ▸ 82 million notifications during evening hours
@mattstratton
Slide 14
THE DATA
50,000 RESPONDERS RECEIVING A TOTAL OF 760 MILLION NOTIFICATIONS ▸ 60 million notifications during dinner hours ▸ 82 million notifications during evening hours ▸ 250 million notifications during sleeping hours
@mattstratton
Slide 15
THE DATA
50,000 RESPONDERS RECEIVING A TOTAL OF 760 MILLION NOTIFICATIONS ▸ 60 million notifications during dinner hours ▸ 82 million notifications during evening hours ▸ 250 million notifications during sleeping hours ▸ 122 million notifications on weekends
@mattstratton
Slide 16
THE DATA
50,000 RESPONDERS RECEIVING A TOTAL OF 760 MILLION NOTIFICATIONS ▸ 60 million notifications during dinner hours ▸ 82 million notifications during evening hours ▸ 250 million notifications during sleeping hours ▸ 122 million notifications on weekends ▸ A total of 750,000 nights with sleep-interrupting notifications
@mattstratton
Slide 17
THE DATA
50,000 RESPONDERS RECEIVING A TOTAL OF 760 MILLION NOTIFICATIONS ▸ 60 million notifications during dinner hours ▸ 82 million notifications during evening hours ▸ 250 million notifications during sleeping hours ▸ 122 million notifications on weekends ▸ A total of 750,000 nights with sleep-interrupting notifications ▸ A total of 330,000 weekend days with interrupt notifications
@mattstratton
Slide 18
LET’S HAVE SOME DATA
THE MOST MEANINGFUL METRICS ON ATTRITION ARE
@mattstratton
Slide 19
LET’S HAVE SOME DATA
THE MOST MEANINGFUL METRICS ON ATTRITION ARE ▸ Number of days where a responder’s work and life are interrupted
@mattstratton
Slide 20
LET’S HAVE SOME DATA
THE MOST MEANINGFUL METRICS ON ATTRITION ARE ▸ Number of days where a responder’s work and life are interrupted ▸ Number of days when a responder is woken overnight
@mattstratton
Slide 21
LET’S HAVE SOME DATA
THE MOST MEANINGFUL METRICS ON ATTRITION ARE ▸ Number of days where a responder’s work and life are interrupted ▸ Number of days when a responder is woken overnight ▸ Number of weekend days interrupted by notifications.
@mattstratton
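The three attrition metrics above can be derived from raw notification timestamps. What follows is a minimal sketch, not PagerDuty's method: the 22:00 to 08:00 "sleeping hours" window, the Saturday/Sunday weekend, and every name in the code (`interruption_days`, `night_of`, and so on) are assumptions made for illustration. Timestamps are assumed to already be in each responder's local time.

```python
from datetime import datetime, time, timedelta

# Assumption: "sleeping hours" run 22:00-08:00 local time; the slides give no exact window.
SLEEP_START = time(22, 0)
SLEEP_END = time(8, 0)


def is_sleeping_hour(ts: datetime) -> bool:
    """True if the timestamp falls inside the assumed overnight window."""
    t = ts.time()
    return t >= SLEEP_START or t < SLEEP_END


def night_of(ts: datetime):
    """Attribute early-morning pages to the night that began the previous evening."""
    return (ts - timedelta(days=1)).date() if ts.time() < SLEEP_END else ts.date()


def interruption_days(notifications):
    """Count distinct interrupted days, nights, and weekend days per responder.

    `notifications` is an iterable of (responder, local_timestamp) pairs.
    """
    days, nights, weekend_days = {}, {}, {}
    for responder, ts in notifications:
        days.setdefault(responder, set()).add(ts.date())
        if is_sleeping_hour(ts):
            nights.setdefault(responder, set()).add(night_of(ts))
        if ts.weekday() >= 5:  # Saturday is 5, Sunday is 6
            weekend_days.setdefault(responder, set()).add(ts.date())
    return {
        r: {
            "days_interrupted": len(days[r]),
            "nights_woken": len(nights.get(r, ())),
            "weekend_days_interrupted": len(weekend_days.get(r, ())),
        }
        for r in days
    }
```

Counting distinct days rather than raw notifications is the point of the sketch: ten pages in one night and one page in one night both cost the responder that night.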
Slide 22
@mattstratton
Slide 23
EXAMPLES OF MEMES ARE TUNES, IDEAS, CATCH-PHRASES, CLOTHES FASHIONS, WAYS OF MAKING POTS OR OF BUILDING ARCHES. JUST AS GENES PROPAGATE THEMSELVES IN THE GENE POOL BY LEAPING FROM BODY TO BODY, SO MEMES PROPAGATE THEMSELVES IN THE MEME POOL BY LEAPING FROM BRAIN TO BRAIN VIA IMITATION. @mattstratton
Richard Dawkins
@mattstratton
Slide 24
SNOW CRASH
@mattstratton
Slide 25
SNOW CRASH ▸ In the book, “Snow Crash” itself is a neurolinguistic virus.
@mattstratton
Slide 26
SNOW CRASH ▸ In the book, “Snow Crash” itself is a neurolinguistic virus. ▸ The bad guys figure out how to unlock it, and it spreads from hacker to hacker like a meme
@mattstratton
Slide 27
SNOW CRASH ▸ In the book, “Snow Crash” itself is a neurolinguistic virus. ▸ The bad guys figure out how to unlock it, and it spreads from hacker to hacker like a meme ▸ Plus, lots of swordplay
@mattstratton
Slide 28
SNOW CRASH ▸ In the book, “Snow Crash” itself is a neurolinguistic virus. ▸ The bad guys figure out how to unlock it, and it spreads from hacker to hacker like a meme ▸ Plus, lots of swordplay
“IDEOLOGY IS A VIRUS.” - NEAL STEPHENSON @mattstratton
Slide 29
WHAT IF YOU ARE THE SUPREME LEADER?
@mattstratton
Slide 30
WHAT IF YOU ARE THE SUPREME LEADER? ▸ “Command and control” doesn’t work
@mattstratton
Slide 31
WHAT IF YOU ARE THE SUPREME LEADER? ▸ “Command and control” doesn’t work ▸ Use measurement for good, not for evil
@mattstratton
Slide 32
WHAT IF YOU ARE THE SUPREME LEADER? ▸ “Command and control” doesn’t work ▸ Use measurement for good, not for evil ▸ Avoid “executive swoop”
@mattstratton
Slide 33
WHAT IF YOU ARE THE SUPREME LEADER? ▸ “Command and control” doesn’t work ▸ Use measurement for good, not for evil ▸ Avoid “executive swoop”
@mattstratton
MIDDLE MANAGEMENT TIPS ▸ Encourage safe post-incident review spaces ▸ Drive for a culture of learning
@mattstratton
Slide 37
MIDDLE MANAGEMENT TIPS ▸ Encourage safe post-incident review spaces ▸ Drive for a culture of learning ▸ Take care of your people
@mattstratton
Slide 38
REVIEW. REVIEW. REVIEW
A CULTURE OF LEARNING
@mattstratton
Slide 39
REVIEW. REVIEW. REVIEW
A CULTURE OF LEARNING ▸ In a generative, performance-oriented organization, “failure leads to inquiry.”
@mattstratton
Slide 40
REVIEW. REVIEW. REVIEW
A CULTURE OF LEARNING ▸ In a generative, performance-oriented organization, “failure leads to inquiry.” ▸ Don’t take my word for it. Ask Ron Westrum.
@mattstratton
Slide 41
REVIEW. REVIEW. REVIEW
A CULTURE OF LEARNING ▸ In a generative, performance-oriented organization, “failure leads to inquiry.” ▸ Don’t take my word for it. Ask Ron Westrum. ▸ You can also ask Dr. Nicole Forsgren - @nicolefv
@mattstratton
Slide 42
REVIEW. REVIEW. REVIEW
A CULTURE OF LEARNING ▸ In a generative, performance-oriented organization, “failure leads to inquiry.” ▸ Don’t take my word for it. Ask Ron Westrum. ▸ You can also ask Dr. Nicole Forsgren - @nicolefv
http://bit.ly/2KpzKKW @mattstratton
Slide 43
USE THE FORCE, EVEN IF YOU AREN’T A JEDI
@mattstratton
Slide 44
REVIEW ALL THE THINGS @mattstratton
Slide 45
HAND-OFF TIME
THE ON-CALL REVIEW
@mattstratton
Slide 46
HAND-OFF TIME
THE ON-CALL REVIEW ▸ Primary purpose is to understand on-call load and pain
@mattstratton
Slide 47
HAND-OFF TIME
THE ON-CALL REVIEW ▸ Primary purpose is to understand on-call load and pain ▸ Approximately a week’s worth of on-call history is common
@mattstratton
Slide 48
HAND-OFF TIME
THE ON-CALL REVIEW ▸ Primary purpose is to understand on-call load and pain ▸ Approximately a week’s worth of on-call history is common ▸ Takes about 30 minutes, give or take
@mattstratton
Slide 49
ON-CALL REVIEW, CONTINUED
@mattstratton
Slide 50
ON-CALL REVIEW, CONTINUED
▸ Typically instituted by a team manager
@mattstratton
Slide 51
ON-CALL REVIEW, CONTINUED
▸ Typically instituted by a team manager
▸ Usually run by on-call responders
@mattstratton
Slide 52
ON-CALL REVIEW, CONTINUED
▸ Typically instituted by a team manager
▸ Usually run by on-call responders
▸ Minimum attendees are the team manager, outgoing on-call, and incoming on-call
@mattstratton
Slide 53
ON-CALL REVIEW, CONTINUED
▸ Typically instituted by a team manager
▸ Usually run by on-call responders
▸ Minimum attendees are the team manager, outgoing on-call, and incoming on-call
▸ BETTER PRACTICE - include the entire team!
@mattstratton
Slide 54
REVIEW. REVIEW. REVIEW
NORMALIZATION OF DEVIANCE
@mattstratton
Slide 55
REVIEW. REVIEW. REVIEW
NORMALIZATION OF DEVIANCE ▸ The gradual process through which unacceptable practice or standards become acceptable. As the deviant behavior is repeated without catastrophic results, it becomes the social norm for the organization.
@mattstratton
Slide 56
REVIEW. REVIEW. REVIEW
NORMALIZATION OF DEVIANCE ▸ The gradual process through which unacceptable practice or standards become acceptable. As the deviant behavior is repeated without catastrophic results, it becomes the social norm for the organization. ▸ This happened to NASA. Twice.
@mattstratton
Slide 57
REVIEW. REVIEW. REVIEW
NORMALIZATION OF DEVIANCE ▸ The gradual process through which unacceptable practice or standards become acceptable. As the deviant behavior is repeated without catastrophic results, it becomes the social norm for the organization. ▸ This happened to NASA. Twice. ▸ In our case, we start to accept alerts or degradations as acceptable.
@mattstratton
Slide 58
REVIEW. REVIEW. REVIEW
NORMALIZATION OF DEVIANCE ▸ The gradual process through which unacceptable practice or standards become acceptable. As the deviant behavior is repeated without catastrophic results, it becomes the social norm for the organization. ▸ This happened to NASA. Twice. ▸ In our case, we start to accept alerts or degradations as acceptable.
http://bit.ly/2Ihj1wV @mattstratton
Slide 59
QUESTION METRICS @mattstratton
Slide 60
QUESTION METRICS
WHY ARE WE USING THESE NUMBERS?
@mattstratton
Slide 61
QUESTION METRICS
WHY ARE WE USING THESE NUMBERS? ▸ What is the data that drives your incident process?
@mattstratton
Slide 62
QUESTION METRICS
WHY ARE WE USING THESE NUMBERS? ▸ What is the data that drives your incident process? ▸ Are your metrics tied to business outcomes?
@mattstratton
Slide 63
QUESTION METRICS
WHY ARE WE USING THESE NUMBERS? ▸ What is the data that drives your incident process? ▸ Are your metrics tied to business outcomes? ▸ Correlation doesn’t always equal causation
@mattstratton
Slide 64
SIMPLE. ALWAYS. @mattstratton
Slide 65
KEEP IT SIMPLE
@mattstratton
Slide 66
KEEP IT SIMPLE
THE MORE RESILIENTLY THE SYSTEM IS DESIGNED, THE MORE LIKELY IT IS TO CAUSE A NEGATIVE BUSINESS IMPACT
@mattstratton
Slide 67
KEEP IT SIMPLE
THE MORE RESILIENTLY THE SYSTEM IS DESIGNED, THE MORE LIKELY IT IS TO CAUSE A NEGATIVE BUSINESS IMPACT Stratton’s Law of Catastrophic Predestination @mattstratton
Slide 68
COMMUNICATE.
TALK TO PEOPLE
@mattstratton
Slide 69
COMMUNICATE.
TALK TO PEOPLE ▸ Who are your customers? What are their expectations?
@mattstratton
Slide 70
COMMUNICATE.
TALK TO PEOPLE ▸ Who are your customers? What are their expectations? ▸ Whose customer are you? Can you help them out?
@mattstratton
Slide 71
COMMUNICATE.
TALK TO PEOPLE ▸ Who are your customers? What are their expectations? ▸ Whose customer are you? Can you help them out? ▸ What are the perceptions of your team?
@mattstratton
Slide 72
HUMANS, PEOPLE ARE
@mattstratton
Slide 73
HUMANS, PEOPLE ARE ▸ Consider contextual on-call
@mattstratton
Slide 74
HUMANS, PEOPLE ARE ▸ Consider contextual on-call ▸ The Golden Rule
@mattstratton
Slide 75
HUMANS, PEOPLE ARE ▸ Consider contextual on-call ▸ The Golden Rule ▸ Bake cookies
@mattstratton
Slide 76
HUMANS, PEOPLE ARE ▸ Consider contextual on-call ▸ The Golden Rule ▸ Bake cookies
@mattstratton
Slide 77
LEARN TO TAKE COMMAND
INCIDENT COMMAND @mattstratton
Slide 78
MAKE IT NICE ON THE BRIDGE
DURING A CALL
@mattstratton
Slide 79
MAKE IT NICE ON THE BRIDGE
DURING A CALL ▸ Have clearly defined roles
@mattstratton
Slide 80
MAKE IT NICE ON THE BRIDGE
DURING A CALL ▸ Have clearly defined roles ▸ Avoid bystander effect
@mattstratton
Slide 81
MAKE IT NICE ON THE BRIDGE
DURING A CALL ▸ Have clearly defined roles ▸ Avoid bystander effect ▸ Rally fast, disband faster
@mattstratton
Slide 82
MAKE IT NICE ON THE BRIDGE
DURING A CALL ▸ Have clearly defined roles ▸ Avoid bystander effect ▸ Rally fast, disband faster ▸ Don’t litigate severity
@mattstratton
Slide 83
MAKE IT NICE ON THE BRIDGE
DURING A CALL ▸ Have clearly defined roles ▸ Avoid bystander effect ▸ Rally fast, disband faster ▸ Don’t litigate severity ▸ Have a clear mechanism for making decisions
@mattstratton
Slide 84
SHARING IS CARING
SHARE ALL TESTS @mattstratton
Slide 85
SHARE ALL TESTS
TESTS ARE FOR SWE AND SRE BOTH
@mattstratton
Slide 86
SHARE ALL TESTS
TESTS ARE FOR SWE AND SRE BOTH ▸ All functional tests used in preproduction should have a corresponding monitor in production
@mattstratton
Slide 87
SHARE ALL TESTS
TESTS ARE FOR SWE AND SRE BOTH ▸ All functional tests used in preproduction should have a corresponding monitor in production ▸ All monitoring functionality in production should have corresponding tests in the build/release process
@mattstratton
Slide 88
SHARE ALL TESTS
TESTS ARE FOR SWE AND SRE BOTH ▸ All functional tests used in preproduction should have a corresponding monitor in production ▸ All monitoring functionality in production should have corresponding tests in the build/release process ▸ Monitoring is testing with a time dimension. There should be full parity between preproduction and production.
@mattstratton
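One way to get the parity described above is to write each functional check once and run it in both places. This is a rough sketch under assumptions: the `CHECKS` registry, the `fetch` callable, and the login-page check are hypothetical and not taken from the talk or from any particular testing or monitoring tool.

```python
from typing import Callable, Dict

Fetch = Callable[[str], str]  # takes a path, returns the response body


def check_login_page(fetch: Fetch) -> bool:
    """One functional check, written once and shared by tests and monitors."""
    return "Sign in" in fetch("/login")


# Hypothetical registry of shared checks; the names are illustrative only.
CHECKS: Dict[str, Callable[[Fetch], bool]] = {
    "login_page_renders": check_login_page,
}


def run_as_tests(fetch: Fetch) -> None:
    """Preproduction: fail the build as soon as any shared check fails."""
    for name, check in CHECKS.items():
        assert check(fetch), f"functional test failed: {name}"


def run_as_monitors(fetch: Fetch) -> Dict[str, bool]:
    """Production: run the same checks on a schedule and report status per check."""
    return {name: check(fetch) for name, check in CHECKS.items()}
```

In the build, `fetch` would point at a preproduction environment; in production, a scheduler would call `run_as_monitors` repeatedly, which is where the time dimension comes in.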
Slide 89
EVERY SPRINT
DO ONE NICE THING @mattstratton
Slide 90
HELP YOUR RESPONDERS IN EACH AND EVERY SPRINT
@mattstratton
Slide 91
HELP YOUR RESPONDERS IN EACH AND EVERY SPRINT ▸ In each sprint/work unit, add value to your responders
@mattstratton
Slide 92
HELP YOUR RESPONDERS IN EACH AND EVERY SPRINT ▸ In each sprint/work unit, add value to your responders ▸ Even if it’s not on a card
@mattstratton
Slide 93
HELP YOUR RESPONDERS IN EACH AND EVERY SPRINT ▸ In each sprint/work unit, add value to your responders ▸ Even if it’s not on a card ▸ You rebel, you.
@mattstratton
Slide 94
ADDING VALUE
SOME EXAMPLES
@mattstratton
Slide 95
ADDING VALUE
SOME EXAMPLES ▸ Provide better context in logging (stacktraces alone don’t count)
@mattstratton
Slide 96
ADDING VALUE
SOME EXAMPLES ▸ Provide better context in logging (stacktraces alone don’t count) ▸ Remove some technical debt. Yes, you have some.
@mattstratton
Slide 97
ADDING VALUE
SOME EXAMPLES ▸ Provide better context in logging (stacktraces alone don’t count) ▸ Remove some technical debt. Yes, you have some. ▸ Add some (useful) tests
@mattstratton
Slide 98
ADDING VALUE
SOME EXAMPLES ▸ Provide better context in logging (stacktraces alone don’t count) ▸ Remove some technical debt. Yes, you have some. ▸ Add some (useful) tests ▸ Remove something unused
@mattstratton
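As an illustration of "better context in logging", here is a small sketch using Python's standard `logging` module. The `format_context` helper, the `charge` function, and the field names are hypothetical; the idea is only that the log line should say what was being attempted and what a responder might check next, alongside the stack trace.

```python
import logging

logger = logging.getLogger("checkout")


def format_context(message: str, **context) -> str:
    """Append key=value context to a log message in a stable, greppable order."""
    fields = " ".join(f"{key}={context[key]!r}" for key in sorted(context))
    return f"{message} | {fields}" if fields else message


def charge(order_id: str, amount_cents: int, gateway) -> bool:
    """Call the payment gateway; on failure, log the trace plus responder context."""
    try:
        gateway(order_id, amount_cents)
        return True
    except Exception:
        # logger.exception records the stack trace; the context explains the situation.
        logger.exception(format_context(
            "payment charge failed",
            order_id=order_id,
            amount_cents=amount_cents,
            next_step="check gateway status before retrying",
        ))
        return False
```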
Slide 99
ADDING VALUE
@mattstratton
Slide 100
ADDING VALUE
▸ If you use feature flags, add a description field to the configuration
@mattstratton
Slide 101
ADDING VALUE
▸ If you use feature flags, add a description field to the configuration
▸ If you use runbooks, ensure they are up to date every time you cut a release. If you don’t do this, abandon the runbook altogether (an incorrect runbook is considered harmful)
@mattstratton
Slide 102
ADDING VALUE
▸ If you use feature flags, add a description field to the configuration
▸ If you use runbooks, ensure they are up to date every time you cut a release. If you don’t do this, abandon the runbook altogether (an incorrect runbook is considered harmful)
▸ SIMPLIFY, MAN!
@mattstratton
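The feature-flag suggestion is easy to enforce with a small check in the build. The sketch below assumes flags live in a plain dictionary keyed by flag name, with an optional `description` string per flag; that layout and the function name are assumptions, not the format of any specific feature-flag product.

```python
def flags_missing_description(flags: dict) -> list:
    """Return the sorted names of flags whose description is absent or blank."""
    return sorted(
        name
        for name, config in flags.items()
        if not str(config.get("description") or "").strip()
    )
```

A release step could fail when the returned list is non-empty, so a responder who meets an unfamiliar flag during an incident always has a sentence explaining what it does.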
Slide 103
@MATTSTRATTON LINKEDIN.COM/IN/MATTSTRATTON MATTSTRATTON.COM ARRESTEDDEVOPS.COM
SHARE YOUR ON-CALL STORIES WITH ME LATER @mattstratton
Slide 104
SPEAKING.MATTSTRATTON.COM @mattstratton
Slide 105
FURTHER READING AND REFERENCES
▸ Improving Your Employee Retention With Real-Time Ops Data - http://bit.ly/2rGTnq4
▸ Page It Forward! - http://bit.ly/2In8Lzc
▸ The study of information flow: A personal journey - http://bit.ly/2KpzKKW
▸ The Normalization of Deviance (If It Can Happen to NASA, It Can Happen to You) - http://bit.ly/2Ihj1wV
@mattstratton