@mattstratton Matty Stratton DevOps Evangelist, PagerDuty HOW TO INFECT YOUR ORGANIZATION WITH HUMANE OPS
A presentation at PagerDuty Tour Amsterdam in October 2018 in Amsterdam, Netherlands by Matt Stratton
@mattstratton THE DATA 50,000 RESPONDERS RECEIVING A TOTAL OF 760 MILLION NOTIFICATIONS ▸ 60 million notifications during dinner hours ▸ 82 million notifications during evening hours ▸ 250 million notifications during sleeping hours ▸ 122 million notifications on weekends ▸ A total of 750,000 nights with sleep-interrupting notifications ▸ A total of 330,000 weekend days with interrupt notifications PagerDuty commissioned a study of more than 10,000 companies across over 100 different segments.
@mattstratton LET’S HAVE SOME DATA THE MOST MEANINGFUL METRICS ON ATTRITION ARE ▸ Number of days where a responder’s work and life are interrupted ▸ Number of days when a responder is woken overnight ▸ Number of weekend days interrupted by notifications.
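If you want to track these three numbers for your own team, here is a minimal sketch of how they could be counted from raw notification timestamps. Everything in it is an assumption for illustration: the function name, the input shape (responder plus local timestamp), and especially the 22:00 to 06:00 definition of "overnight", which the slides do not specify.

```python
from collections import defaultdict
from datetime import datetime

# Assumption: "overnight" means 22:00-06:00 local time. The slides do not define it.
SLEEP_START_HOUR = 22
SLEEP_END_HOUR = 6


def interruption_days(notifications):
    """Count distinct interrupted calendar days per responder.

    `notifications` is an iterable of (responder, datetime) pairs, with the
    datetime expressed in the responder's local time.
    """
    days = defaultdict(lambda: {"any": set(), "overnight": set(), "weekend": set()})
    for responder, ts in notifications:
        day = ts.date()
        days[responder]["any"].add(day)
        if ts.hour >= SLEEP_START_HOUR or ts.hour < SLEEP_END_HOUR:
            days[responder]["overnight"].add(day)
        if ts.weekday() >= 5:  # Saturday is 5, Sunday is 6
            days[responder]["weekend"].add(day)
    return {r: {k: len(v) for k, v in d.items()} for r, d in days.items()}


sample = [
    ("alex", datetime(2018, 10, 1, 14, 0)),   # Monday afternoon
    ("alex", datetime(2018, 10, 2, 3, 30)),   # Tuesday, overnight
    ("alex", datetime(2018, 10, 6, 11, 0)),   # Saturday
    ("sam", datetime(2018, 10, 7, 23, 15)),   # Sunday, overnight
]
result = interruption_days(sample)
```

Note that this counts by calendar date, so a page at 23:00 and another at 02:00 the same night would count as two days. Whether that is the right call depends on how you define a "night".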
@mattstratton EXAMPLES OF MEMES ARE TUNES, IDEAS, CATCH-PHRASES, CLOTHES FASHIONS, WAYS OF MAKING POTS OR OF BUILDING ARCHES. JUST AS GENES PROPAGATE THEMSELVES IN THE GENE POOL BY LEAPING FROM BODY TO BODY, SO MEMES PROPAGATE THEMSELVES IN THE MEME POOL BY LEAPING FROM BRAIN TO BRAIN VIA IMITATION. Richard Dawkins @mattstratton
@mattstratton SNOW CRASH ▸ In the book, “Snow Crash” itself is a neural-linguistic virus. ▸ The bad guys figure out how to unlock it, and it spreads from hacker to hacker like a meme ▸ Plus, lots of swordplay Remember, memes are another way of evolving across generations. This happens in the world of Snow Crash, but it can happen in your organization as well.
@mattstratton SNOW CRASH ▸ In the book, “Snow Crash” itself is a neural-linguistic virus. ▸ The bad guys figure out how to unlock it, and it spreads from hacker to hacker like a meme ▸ Plus, lots of swordplay “IDEOLOGY IS A VIRUS.”
@mattstratton WHAT IF YOU ARE THE SUPREME LEADER? ▸ “Command and control” doesn’t work ▸ Use measurement for good, not for evil ▸ Avoid “executive swoop”
@mattstratton MIDDLE MANAGEMENT TIPS ▸ Encourage safe post-incident review spaces ▸ Drive for a culture of learning ▸ You hired smart people - use them
@mattstratton REVIEW. REVIEW. REVIEW A CULTURE OF LEARNING ▸ In a generative, performance-oriented organization, “failure leads to inquiry.” ▸ Don’t take my word for it. Ask Ron Westrum. ▸ You can also ask Dr. Nicole Forsgren - @nicolefv http://bit.ly/2KpzKKW If we don’t treat every outage or alert as something to learn from or something to improve, we run the risk of the Normalization of Deviance effect. In this case, we start to accept alerts or degradations as acceptable. Our standards suffer. We let things slip through the cracks.
@mattstratton USE THE FORCE, EVEN IF YOU AREN’T A JEDI
@mattstratton REVIEW ALL THE THINGS @mattstratton Andy Fleener, Platform Operations Manager, Sportsengine - “We review every alert from the last 24 hours/weekend every day. No broken windows.”
@mattstratton REVIEW. REVIEW. REVIEW NORMALIZATION OF DEVIANCE ▸ The gradual process through which unacceptable practice or standards become acceptable. As the deviant behavior is repeated without catastrophic results, it becomes the social norm for the organization. ▸ This happened to NASA. Twice. ▸ In our case, we start to accept alerts or degradations as acceptable. http://bit.ly/2Ihj1wV If we don’t treat every outage or alert as something to learn from or something to improve, we run the risk of the Normalization of Deviance effect. In this case, we start to accept alerts or degradations as acceptable. Our standards suffer. We let things slip through the cracks.
@mattstratton QUESTION METRICS WHY ARE WE USING THESE NUMBERS? ▸ What is the data that drive your incident process? ▸ Are your metrics tied to business outcomes? ▸ Correlation doesn’t always equal causation Let’s make sure that we are setting the proper expectations. We don’t want to just expect five 9’s of reliability because “well, five is better than four.” Why do you need five? Have you tied your metrics to a business outcome?
Likewise, your speed metrics shouldn’t be “faster than last month.” And beware of inaccurate extrapolation. You might have data suggesting that if your page load time increases by a second, conversion drops by 50 percent. But that doesn’t mean that if you reduce load time by a second, conversion will increase by 50 percent. Correlation doesn’t always equal causation, and the same numbers don’t move the dials in both directions.
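To make the "why do you need five?" question concrete, it can help to turn nines into a downtime budget. The sketch below is plain arithmetic, assuming a 365-day year and treating availability as the fraction of time the service is up; the function name is made up for illustration.

```python
def allowed_downtime_minutes(nines, period_days=365):
    """Return the downtime budget, in minutes, for an availability of N nines."""
    unavailability = 10 ** (-nines)  # e.g. 5 nines -> 0.00001
    return period_days * 24 * 60 * unavailability


for n in (3, 4, 5):
    print(f"{n} nines: {allowed_downtime_minutes(n):.1f} minutes per year")
```

Five nines leaves a budget of roughly five minutes a year, about a tenth of what four nines allows. Whether closing that gap is worth the engineering and on-call cost is exactly the business-outcome question.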
@mattstratton SIMPLE. ALWAYS. @mattstratton Don’t over-design systems. Resume-driven development is almost always a recipe for on-call disasters.
@mattstratton THE MORE RESILIENTLY THE SYSTEM IS DESIGNED, THE MORE LIKELY IT IS TO CAUSE A NEGATIVE BUSINESS IMPACT Stratton’s Law of Catastrophic Predestination KEEP IT SIMPLE At the heart of every complex resilient system is the hubris that someone believed they could predict everything that could go wrong. Fate, and the internet, laughs.
@mattstratton COMMUNICATE. TALK TO PEOPLE ▸ Who are your customers? What are their expectations? ▸ Whose customer are you? Can you help them out? ▸ What are the perceptions of your team? Ask how the on-call responder is feeling during stand-ups. Give them the opportunity to mention they might be burning out.
@mattstratton HUMANS, PEOPLE ARE ▸ Consider contextual on-call ▸ The Golden Rule ▸ Bake cookies
@mattstratton INCIDENT COMMAND LEARN TO TAKE COMMAND Volunteer to help as an incident commander. (What’s that? Maybe we should have them!)
@mattstratton MAKE IT NICE ON THE BRIDGE DURING A CALL ▸ Have clearly defined roles ▸ Avoid bystander effect ▸ Rally fast, disband faster ▸ Don’t litigate severity ▸ Have a clear mechanism for making decisions You want to get all the right people on the call as soon as you need to…but you also want to get them OFF of the call as soon as possible.
@mattstratton SHARE ALL TESTS SHARING IS CARING
@mattstratton SHARE ALL TESTS TESTS ARE FOR SWE AND SRE BOTH ▸ All functional tests used in preproduction should have a corresponding monitor in production ▸ All monitoring functionality in production should have corresponding tests in the build/release process ▸ Monitoring is testing with a time dimension. There should be full parity between preproduction and production.
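One way to get that parity is to write each check once and call it from both places. This is only a sketch under assumptions: `check_health`, `test_health`, and `monitor` are hypothetical names, and the status source is a stand-in callable rather than a real service.

```python
import time


def check_health(fetch_status):
    """Functional check shared by the test suite and the production monitor.

    `fetch_status` is any zero-argument callable returning a status string.
    """
    return fetch_status() == "ok"


def test_health():
    """Preproduction: run the check once against a stub."""
    assert check_health(lambda: "ok")


def monitor(fetch_status, runs, interval_seconds=0.0, sleep=time.sleep):
    """Production: run the same check repeatedly, which adds the time dimension."""
    results = []
    for _ in range(runs):
        results.append(check_health(fetch_status))
        sleep(interval_seconds)
    return results
```

Because the test and the monitor share one function, a change to what "healthy" means lands in both places at once.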
@mattstratton DO ONE NICE THING EVERY SPRINT
@mattstratton HELP YOUR RESPONDERS IN EACH AND EVERY SPRINT ▸ In each sprint/work unit, add value to your responders ▸ Even if it’s not on a card ▸ You rebel, you. Even if it’s not on a card
@mattstratton ADDING VALUE SOME EXAMPLES ▸ Provide better context in logging (stacktraces alone don’t count) ▸ Remove some technical debt. Yes, you have some. ▸ Add some (useful) tests ▸ Remove something unused These might seem obvious, but if they’re so obvious, I assume you’ve done them already?
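As an illustration of the logging point, here is a small sketch using Python's standard `logging` module. The service name, the `charge` function, and the field names are all hypothetical. The idea is only that the log line carries the identifiers a responder needs, in addition to the stack trace.

```python
import logging

logging.basicConfig(format="%(levelname)s %(message)s", level=logging.INFO)
logger = logging.getLogger("checkout")  # hypothetical service name


def charge(order_id, amount_cents, gateway):
    """Charge an order and log enough context for a responder to act on."""
    try:
        return gateway(order_id, amount_cents)
    except Exception:
        # logger.exception records the stack trace; the message adds the
        # context a responder needs when paged.
        logger.exception(
            "charge failed order_id=%s amount_cents=%d gateway=%s",
            order_id, amount_cents, getattr(gateway, "__name__", "unknown"),
        )
        raise
```

The exception is re-raised so callers still see the failure; the log line just makes sure the responder knows which order and which gateway were involved.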
@mattstratton ADDING VALUE ▸ If you use feature flags, add a description field to the configuration ▸ If you use runbooks, ensure they are up to date every time you cut a release. If you don’t do this, abandon the runbook altogether (an incorrect runbook is considered harmful) ▸ SIMPLIFY, MAN!
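For the feature-flag bullet, a description field is easier to keep honest if the configuration refuses to load without one. A minimal sketch, assuming flags are defined in code; the `FeatureFlag` class and the example flag are invented for illustration and do not correspond to any particular feature-flag product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureFlag:
    name: str
    enabled: bool
    description: str  # what the flag does and what changes if it is flipped

    def __post_init__(self):
        # Reject blank descriptions so every flag explains itself to a responder.
        if not self.description.strip():
            raise ValueError(f"feature flag {self.name!r} needs a description")


FLAGS = [
    FeatureFlag("new_checkout", False, "Routes checkout through the v2 payment service."),
]
```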
@mattstratton SHARE YOUR ON-CALL STORIES WITH ME LATER @MATTSTRATTON LINKEDIN.COM/IN/MATTSTRATTON
MATTSTRATTON.COM
ARRESTEDDEVOPS.COM
@mattstratton NOTI.ST/MATTSTRATTON
@mattstratton FURTHER READING AND REFERENCES ▸ Improving Your Employee Retention With Real-Time Ops Data - http://bit.ly/2rGTnq4
▸ Page It Forward! - http://bit.ly/2In8Lzc
▸ The study of information flow: A personal journey - http://bit.ly/2KpzKKW
▸ The Normalization of Deviance (If It Can Happen to NASA, It Can Happen to You) - http://bit.ly/2Ihj1wV
▸ Snow Crash by Neal Stephenson - http://bit.ly/2Iiuc8L
http://bit.ly/2InDYGI
▸ Disasters! Arrested DevOps Episode 37 - https://arresteddevops.com/37
▸ PagerDuty Incident Response - http://response.pagerduty.com