How Do You Infect Your Organization With Humane Ops?

A presentation by Matt Stratton at DevOpsDays Salt Lake City 2018, held in May 2018 in Salt Lake City, UT, USA

Slide 1

HOW TO INFECT YOUR ORGANIZATION WITH HUMANE OPS
Matty Stratton, DevOps Evangelist, PagerDuty
@mattstratton

Slide 2

Slide 3

Slide 4

PagerDuty commissioned a study spanning more than 10,000 companies in over 100 different segments.

Slide 5

Slide 6

Slide 7

Remember, memes are another way of evolving across generations. This happens in the world of Snow Crash, but it can happen in your organization as well.

Slide 8

Slide 9

Slide 10

If we don’t treat every outage or alert as something to learn from or improve, we run the risk of the Normalization of Deviance effect: we start to treat alerts or degradations as acceptable. Our standards suffer. We let things slip through the cracks.

Slide 11

Slide 12

Andy Fleener, Platform Operations Manager, Sportsengine - “We review every alert from the last 24 hours/weekend every day. No broken windows.”
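A daily review like the one described in the quote needs a rule for which alerts are in scope each day. The sketch below is one possible reading, not the team's actual tooling: it assumes the review happens on weekdays, looks back 24 hours on most days, and looks back 72 hours on Mondays to cover the weekend. The `Alert` type and both function names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Alert:
    """Hypothetical alert record; real alerting tools expose richer data."""
    name: str
    fired_at: datetime


def review_window_start(now: datetime) -> datetime:
    """Return the start of the review window ending at `now`.

    Assumption: reviews run on weekdays only, so Monday's review
    (weekday 0) looks back 72 hours to include Saturday and Sunday.
    """
    hours = 72 if now.weekday() == 0 else 24
    return now - timedelta(hours=hours)


def alerts_to_review(alerts: list[Alert], now: datetime) -> list[Alert]:
    """Return every alert that fired inside the current review window."""
    start = review_window_start(now)
    return [a for a in alerts if start <= a.fired_at <= now]


if __name__ == "__main__":
    history = [
        Alert("disk-full", datetime(2018, 5, 10, 14, 0)),    # Thursday
        Alert("high-latency", datetime(2018, 5, 12, 3, 0)),  # Saturday
    ]
    monday_review = datetime(2018, 5, 14, 9, 0)
    for alert in alerts_to_review(history, monday_review):
        print(alert.name, alert.fired_at.isoformat())
```

The point of returning every alert in the window, rather than only the ones that paged someone, is that nothing gets silently skipped.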

Slide 13

Slide 14

Let’s make sure that we are setting the proper expectations. We don’t want to just expect five 9’s of reliability because “well, five is better than four.” Why do you need five? Have you tied your metrics to a business outcome?

Likewise, your speed metrics shouldn’t be “faster than last month.” And beware of inaccurate extrapolation. You might have data suggesting that if your page load time increases by a second, conversion drops by 50 percent. But that doesn’t mean that if you reduce load time by a second, conversion will increase by 50 percent.

Correlation doesn’t always equal causation, and the same numbers don’t move the dials in both directions.
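To make the "why do you need five?" question concrete, it helps to see how little downtime each extra nine allows. The sketch below is plain arithmetic under one assumption: availability is measured over a 365-day year, with N nines meaning an allowed unavailability of 10^-N. The function name is hypothetical.

```python
def allowed_downtime_minutes(nines: int, period_days: float = 365.0) -> float:
    """Return the downtime budget, in minutes, for an availability of N nines.

    Three nines means 99.9% availability, so the allowed
    unavailability is 10 ** -3 of the measurement period.
    """
    unavailability = 10 ** (-nines)
    minutes_in_period = period_days * 24 * 60
    return minutes_in_period * unavailability


if __name__ == "__main__":
    for n in (3, 4, 5):
        budget = allowed_downtime_minutes(n)
        print(f"{n} nines: about {budget:.2f} minutes of downtime per year")
```

Going from four nines to five shrinks the yearly budget from roughly 53 minutes to roughly 5, which is why the target should be justified by a business outcome and not chosen because a bigger number sounds better.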

Slide 15

Slide 16

Don’t over-design systems. Resume-driven development is almost always a recipe for on-call disasters.

Slide 17

At the heart of every complex resilient system is the hubris that someone believed they could predict everything that could go wrong. Fate, and the internet, laugh.

Slide 18

Ask how the on-call engineer is feeling during stand-ups. Give them the opportunity to mention that they might be burning out.

Slide 19

Slide 20

Volunteer to help as an incident commander. (What’s that? Maybe we should have them!)

Slide 21

You want to get all the right people on the call as soon as you need them, but you also want to get them OFF the call as soon as possible.

Slide 22

Slide 23

Slide 24

Slide 25

Slide 26

These might seem obvious, but if they’re so obvious, I assume you’ve done them already?

Slide 27

Slide 28

Slide 29

Slide 30
