THE DATA

PagerDuty commissioned a study of more than 10,000 companies across over 100 different segments.

Remember, memes are another way of evolving across generations. This happens in the world of Snow Crash, but it can happen in your organization as well.

If we don’t treat every outage or alert as something to learn from or something to improve, we run the risk of the Normalization of Deviance effect.

In this case, we start to accept alerts or degradations as acceptable. Our standards suffer. We let things slip through the cracks.

Andy Fleener, Platform Operations Manager, Sportsengine - “We review every alert from the last 24 hours/weekend every day. No broken windows.”

Let’s make sure that we are setting the proper expectations. We don’t want to just expect five 9’s of reliability because “well, five is better than four.” Why do you need five? Have you tied your metrics to a business outcome?
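To make the "why do you need five?" question concrete, here is a minimal sketch of the arithmetic behind each extra nine. The function name `downtime_budget_minutes` and the 365-day year are assumptions for illustration, not figures from the study.

```python
def downtime_budget_minutes(nines: int, period_days: float = 365.0) -> float:
    """Allowed downtime, in minutes, for an availability target of `nines` nines.

    Hypothetical helper: 3 nines means 99.9% availability, 5 nines means 99.999%.
    """
    if nines < 1:
        raise ValueError("nines must be at least 1")
    unavailability = 10.0 ** (-nines)  # e.g. 3 nines -> 0.001
    return period_days * 24 * 60 * unavailability


for n in (3, 4, 5):
    minutes = downtime_budget_minutes(n)
    print(f"{n} nines: {minutes:.1f} minutes of downtime per year")
```

Going from four nines to five shrinks the yearly budget from roughly 53 minutes to roughly 5. If you can't point to the business outcome that needs those 47 minutes back, the extra nine is probably not worth what it costs your on-call team.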

Likewise, your speed metrics shouldn’t be “faster than last month.” And beware of inaccurate extrapolation. You might have data suggesting that if your page load time increases by a second, conversion drops by 50 percent. But that doesn’t mean that if you reduce load time by a second, conversion will increase by 50 percent. Correlation doesn’t always equal causation, and the same numbers don’t move the dials in both directions.

Don’t over-design systems. Resume-driven development is almost always a recipe for on-call disasters.

At the heart of every complex resilient system is the hubris of someone who believed they could predict everything that could go wrong. Fate, and the internet, laugh.

Ask how the on-call engineer is feeling during stand-ups. Give them the opportunity to mention that they might be burning out.

Volunteer to help as an incident commander. (What’s that? Maybe we should have them!)


You want to get all the right people on the call as soon as you need to…but you also want to get them OFF of the call as soon as possible.

Even if it’s not on a card

These might seem obvious, but if they’re so obvious, I assume you’ve done them already?