The talk
The Four Agreements of Incident Response
Delivered 3 times · 2018–2019

Major outages, incident calls, war rooms, whatever you want to label them, can be stressful and frustrating experiences. In this talk, I will use the lessons of the book “The Four Agreements” by don Miguel Ruiz to illustrate an easy-to-remember modality for effective and humane incident response.
Don Miguel Ruiz’s book, The Four Agreements, presents a code of personal conduct based on ancient Toltec wisdom to help remove self-limiting structures and beliefs.
Each of the Four Agreements can help us understand a more mature, effective, and humane approach to incident response in our organizations. In this talk, I will address how the Agreements can be expressed as a modality for incident response. The Agreements make it easier to understand modern approaches to resolving incidents as effectively as possible, and they can even help reduce burnout.
Incident Response · The Human Side · #PagerDuty
Transcript · 7,661 words · ~38 min read
Lightly edited for readability from the video’s captions.
Okay, so welcome to my talk. This is the Four Agreements of Incident Response. My name is Matty Stratton. I am a DevOps advocate and thought validator at PagerDuty. I don't like the term thought leader so I decided I'm a thought validator — the things you're thinking, I'm here to tell you yes, you're correct, that sounds great.
So a few things about me. The first thing is, as I said, I work for a company called PagerDuty. I'm going to tell you a little bit about PagerDuty in a minute. I'm not here to sell you PagerDuty. This is not a talk about PagerDuty. But there's a little bit about how we do things at PagerDuty so I wanted to give a little bit of background. I've been at PagerDuty for about a year now. Before that I worked at a company called Chef for almost five years, and I especially did a lot of work at Chef with Azure, so I do have some Azure background. It's not completely out of nowhere. I have a podcast that's been going on for about five years called Arrested DevOps. We've got some stickers over there. We talk about DevOps-related items. One of my fellow podcasters, Trevor Hess, is heavily involved in the Azure space. Another one of our hosts is Bridget Kromhout, who is on the Azure advocates team working with Jesse Frazelle and Ashley McNamara and Brian Kenyon. I'm one of the founders of DevOpsDays Chicago, which I started about five years ago. I'm one of the global organizers of the DevOpsDays conference. And back when I had a car (I just recently moved to San Francisco so I don't have a car anymore), my license plate said DevOps. I do still have the license plate, I just don't have a car to put it on.
So I'm just gonna tell you a little bit about — this is like the fancy marketing slide — but tell you a little bit about what PagerDuty does. How many people are familiar with PagerDuty? Okay, then I don't need to tell you anything about what we do, fantastic. So since you're familiar with PagerDuty, how many people have ever been in a situation when you're sitting on a phone call with a bunch of other humans, you're trying to solve a problem and everything's going pear-shaped and it's just a really miserable experience? How many people have had that experience? How many people think that's awesome? So incident management can be super stressful but there are ways to make them less stressful and that's one of the things I'm gonna be talking about today.
So the first thing is: what exactly is an incident? We talked a lot about incident management. There was in Mike's talk this morning that little 5x5 matrix thing and incident response was right there in the middle — that's super key. I also wanted to mention — I don't normally talk about Azure as much as I'm about to — but we did just release a lot of PagerDuty integrations into Azure, so when you're getting your events and your alerts and all those things from Azure you can run them right into PagerDuty. You can also run them into other tools. I'm not here to sell you a tool. All I'm here to tell you is don't build your own. That's the only thing. I don't care — use VictorOps, use OpsGenie, use xMatters, use PagerDuty — just don't try to build your own thing. You all are better at building other things, things that make your company money.
So what exactly is an incident? Before we can respond to an incident we have to know what it is. It sounds silly, but if you don't really know what an incident is for your organization, you don't know whether to respond to it or whether you should follow your process. So it's critical that you have clear criteria for what an incident means to your organization. And there's no one magic thing that says this is an incident, because everybody in this room, unless you're sitting with some co-workers, is in a different type of business. There are different business outcomes for what your company does.
So I'm gonna give you an example of what we consider an incident at PagerDuty just so you can kind of see what one might look like. A couple things that matter: they need to be relatively simple because everybody has to understand them, and we don't want to spend a lot of time arguing about whether or not we're having an incident. And the other thing that's really critical is they need to be tied to a business outcome. How many people have worked on metal servers? Okay. How many people have done IaaS in their career at least? How many people know what a CPU is? Yes, okay, good enough. High CPU utilization is not an incident on a server, but being slow to respond, not processing orders — if you're saying normally at this time we're selling X amount of shoes and we're selling a lot less shoes than we normally do, that's the kind of thing that might generate an incident.
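As a rough illustration of tying an incident trigger to a business outcome, here is a minimal sketch in Python. The function name, the 15-minute window, and the 50% threshold are all assumptions made up for this example; they are not PagerDuty's criteria. Note that CPU utilization is deliberately not an input.

```python
def looks_like_incident(orders_last_15_min: int,
                        usual_orders: int,
                        min_ratio: float = 0.5) -> bool:
    """Flag a possible incident when sales fall well below the usual level.

    usual_orders is a hypothetical baseline: how many orders we normally
    see in the same 15-minute window. min_ratio is an assumed threshold.
    """
    if usual_orders <= 0:
        # No baseline to compare against, so this check cannot decide.
        return False
    return orders_last_15_min / usual_orders < min_ratio


# Normally we sell 100 pairs of shoes in this window; right now we sold 40.
print(looks_like_incident(40, 100))   # sales are far below normal
print(looks_like_incident(95, 100))   # sales are roughly normal
```

The point of the sketch is only that the signal is something the business cares about (orders), and that the rule is simple enough for everybody to understand without arguing about it.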
So this is PagerDuty's definition of an incident: it's an unplanned disruption or degradation of our service — and again our service is to let you know and wake you up in the middle of the night when stuff's on fire — that is actively affecting customers and a customer's ability to use the product. So this is kind of a broad definition, but it's business-related. We're saying it's something that's affecting people's ability to use the product. You will have a different definition, but this gives you an idea that can get you started. You want it to be simple — no more than a sentence — and it needs to be easily understood by pretty much everybody inside your organization.
This is also a pretty broad definition because a typo in code could fit this definition. Lots of different things could. So this is where you sometimes need more granularity. We might talk about different severity levels of incidents. We also sometimes talk about whether an incident is only related to one part of our service or whether it requires coordination. I'm not going to get into those details here, but we can certainly talk more, and one of the things I'd like to do is have conversations, either during the break or maybe during the networking, about your on-call experiences and how you do incidents where you're at.
Now keep in mind, you have different severities. For us, we talk about sev 1, sev 2, sev 3. I don't care if you call them severities or priorities. It could be a little fire emoji or a little smiley emoji or sad emoji. People might actually do that, I don't know. But what matters is that they're clearly understood within the organization.
Now I'm gonna back up a couple things. We did a study across over 10,000 companies across a hundred different market verticals because we wanted to understand what on-call pain was like. So this is some of the details we saw. In this study we had 50,000 incident responders who received over 760 million notifications. So this was a fairly large sample size of data. And here are some of the things that we found out. Sixty million of those notifications took place during dinner hours while people were trying to eat their food with their family. 82 million happened during the evening in general. A quarter of a billion notifications happened during sleeping hours. 122 million happened on weekends. And that meant we had almost three-quarters of a million nights that were interrupted and three hundred and thirty thousand weekend days that were interrupted. So this is a lot. This is what on-call looks like for a lot of people.
So we're trying to make it better, but the reason why I'm talking about it here is to understand why it matters to make it better. And here's some of the things that we found out. We also wanted to understand why people left their roles. We did a survey across these responders and we picked out the people who had left their position in the last 18 months and we asked them a bunch of questions to find out why. Because oftentimes when you think about why somebody leaves a job — a lot of times they leave because they want more money, they want to do more interesting work. How many people have heard the statement that you don't quit your company or you don't quit your job, you quit your manager? That's something that happens quite a bit. That's been my career at least. Fortunately I have a pretty good manager right now — and you might be watching the live stream, although it's I think 1:00 in the morning in San Francisco so I think I'm safe.
But you know what we found out? These were the most meaningful metrics about attrition: the number of days when responders' nights were interrupted, the number of days when they were woken overnight, and the number of weekend days interrupted by notifications. These were the things that were most influential on whether or not somebody who was a responder to incidents decided to find another job. And to put things in perspective — I didn't do the currency conversion — but in the United States it costs about three hundred thousand dollars to replace a software engineer or an IT pro. So regardless of the fact that we should want to have people's lives be better, it's also part of the bottom line.
So I like this comment here. Why does this matter? As Charity says, yes, on-call sucks, can destroy your life. We know. It's a fact of life for anybody who's developing high-quality software. So how can we make it not suck?
If I recall correctly from earlier, most of the folks in this room identify as IT pros, more as ops folks as opposed to software engineers. Who identifies more as a software engineer? Okay. How many of you who are software engineers are on call? Okay, that's a whole other conversation, that's a whole other talk. The thing about making on-call suck less for everybody is having everybody be on call. Because here's the thing that sucks: if you're not officially on call, guess what happens? You're always on call. The best thing about being on call, if it's done right, is knowing when you're not. That's what being part of a rotation gives you, so that's a thing to bear in mind.
So I'm gonna use — one of my little ways that I give talks is I find self-help books, I read the description of them a little bit, and I say well that sort of sounds a little bit like some stuff I want to talk about. I have a talk called the Five Love Languages of DevOps, for example. So this is the Four Agreements of Incident Response. And this is a book — I actually did read this book, by the way, so this one I didn't just read the back cover. But the book is called The Four Agreements by Don Miguel Ruiz and it's a really good book. If you're gonna be on a relatively short plane ride, I recommend picking it up. Not from a tech perspective but it's a good way to live your life.
The ideas — these four agreements — are meant to be taken on as a way to live your life, to become a better human, to have a more enjoyable life, to have a more satisfying life. And when I read this book and I started learning about it, I said you know what, there's some metaphors in here for good practices of incident response and incident management. And so my hope is that by using this metaphor we'll make some of these things a little more memorable and they might stick. Spoiler: this is the first time I'm giving this talk so it's being piloted to all of you, but I'm giving it a bunch of other places after this so we'll see how it goes.
But these are what the four agreements are: be impeccable with your word, don't take anything personally, don't make assumptions, and always do your best.
These things actually sound really straightforward, but they're super super hard. A couple things to bear in mind: being impeccable with your word doesn't mean just don't tell lies. It also has to do with words having magic, words having meaning. If we've ever heard the analogies about gossip — once it's out there you can't get it back. So we talk a little bit about why it's important to think about what we're saying and why we're saying it. Don't take anything personally — this is the agreement that I have absolutely the hardest time in the world with, I'm still struggling with it. We always hear about making assumptions because we all have different metaphors and different lenses by which we view the world. And always do your best makes me think about the things I talk to my kids about. I have twin boys who are almost nine and a daughter who's seven, and I always tell them to always do their best. I don't always do my best but I'm doing my best for you right now, I guarantee that.
So what I'm hoping is that this will help us understand these modern approaches to incident response, make things as effective as possible, and help reduce burnout.
So if we think first about being impeccable with your word — what are some of the items that, when I think about good practice of incident response, have to do with being impeccable with your word? To me these are things about being clear, being direct, and being concise.
One thing that's really critical is we have a lot of automation. Automation is a big deal, especially in DevOps, especially in the world of IT we live in today. And it's not uncommon, in fact it's a good practice, to have your incident response process get kicked off by automation. You can have events that occur in Azure kick off an incident inside PagerDuty. But we don't want it to only be that way. We believe strongly at PagerDuty that anybody can kick off our incident response process at any time. For example, we have it set up in a Slack chatbot so if you say "IC page" it goes ahead and kicks off and gets our incident commanders on board and does all these things. We believe strongly that even if someone on the cleaning staff is walking past a display that shows something's looking pear-shaped on a monitor, they should be able to pull that cord and kick off the incident response process.
And here's why: because let's say you start your incident response process and it turns out that you weren't actually having an incident. What's the worst thing that happened out of all of that? You got a chance to practice doing incident response. Now some of you might have been thinking well, the worst thing that's gonna happen is it went up on the CIO's dashboard that said we had a sev 1 incident and that's how we're measured about whether or not we do a good job. That's a bad bad metric. How many people in here are setting the metrics for their entire IT department at the C-level executive level? Unfortunately we can't always make those changes, but those are the things I talk about when I talk to CIOs and CTOs about thinking about the way we measure. Because the trick is humans will work to the metrics you tell us you're gonna measure us to. But the point is it's important that anybody should be able to kick off incident response at any time.
We also talk about not litigating severity during a call. What does that mean? That means let's not argue about it. We talked about having different severities: a sev 1, a sev 2, a sev 3. We used to do this at PagerDuty a lot: we would get on a call and say is this really a sev 1 or is this a sev 2 or is this a sev 3? And you know what happens? By the time we're done arguing about it, it's become a high severity incident. We haven't actually accomplished anything. We've just sat there and argued about it. So we always assume the highest severity because again, what's the worst thing that happened? We practiced our sev 1 process. If you're getting measured on having X amount of severity incidents and this is going to impact your bonus or your pay or your review, that's a different kind of a problem. But what will happen almost every time if you start litigating severity during a call is that you're gonna spend all your time arguing. I call that mean time to innocence, and it's not a good metric to have.
Also you want to make sure you're notifying your stakeholders. And this is again being impeccable the right way — sending the information. The more that you let stakeholders know about what's going on, number one the more they'll stay out of your way, and they can also help move that communication chain along. Because stakeholders have stakeholders. If we're involved in restoring service, we don't want to be sitting there trying to think about everybody who has to know everything. So we need to have some common way that we can provide status on what's going on — whether that's a status page or a Slack room, which is what we do for example at PagerDuty. We have one main Slack room where we're doing all the work and communicating amongst ourselves during an incident. We also have the conference bridge. But then there's a separate Slack channel which is more for the updates, so if the executives want to know, the stakeholders know they have a place to go to see those status updates.
A lot of times you want to have someone — if you have the ability and the staffing to do this — have someone in a role that's a customer liaison or an internal liaison, whose job during the incident is to provide that status. Because the people who are solving the problem should be working on solving the problem, not updating stakeholders. But as an organization you want to make sure your stakeholders are updated.
So these are, to recap, the things that are about being impeccable with your word. The practices: anyone can trigger incident response, don't litigate severity, notify your stakeholders.
So now the next one is don't take anything personally. Again, as you may recall, this was the one that I have a hard time with. I don't have as hard of a time during incident management and Incident Response — I have more of a hard time with it in my personal life. But feedback is welcome on my talk, I won't take it personally. Probably.
Going into that, this is why it becomes really hard to not take things personally during an incident. At some point when things are going squirrelly we have to make a mental shift between peacetime and wartime. Which is we are going from normal operations to now we are literally defending our business. We are taking some kind of an outage, we have some kind of an incident. And for a lot of organizations these incidents can cost hundreds of millions of euros an hour or more. So we have to move quickly. And for this to happen, it's again a mental shift.
A couple things happen when we go from one to the other. One is our tolerance for risk actually goes up — we are more likely to take risks during wartime. And if you don't like the military metaphors we could say your normal mode and emergency mode — I'd prefer this myself. But during emergency mode our tolerance for risk goes up because what we're trying to do is restore service and then we can get back to normal operations. Our need for consensus goes down — we don't have to make sure everybody feels super good about it and we're all getting nice warm hugs and everything like that. We communicate much more directly, and that's where the don't take it personally comes in.
Does this mean that during an incident it's okay for people to be jerks to each other? No — we still treat each other as good humans. But we tend to be more direct. And when I talk a little bit about what the role of an incident commander is, those are things that we would probably not tolerate during product development or during just normal day-to-day operations.
Now we base a lot of the way we do incident management on some structured stuff. We didn't really create it, but these have come from first responders — whether they're firefighters or emergency responders. For example, the National Incident Management System is one that is used in the United States. Where this came from was in the 1970s there was a string of really devastating wildfires in Southern California — which is actually kind of happening now — but the problem was there were hundreds of thousands of firefighters from all over the country who came to try to help fight these fires, and they had a real hard time with it. The problem wasn't that they weren't good firefighters, but every fire department had a different way of working. They were all good at fighting fires but they weren't good at working together. So that's where the National Incident Management System came from. And in different countries there are different ones that are similar. The UK system is kind of cool because they have a role called the Gold Commander, which kind of sounds like a James Bond villain.
So when we kind of follow these processes, one of the things that's true about almost all of them — whether it's called the Gold Commander if you want to be evil or cool or awesome — is what we often call the incident commander. The incident commander is a very important role that happens during coordinated incident response. The incident commander is the person who is in charge during the incident. They're the big boss, whether or not that's their normal job — in fact it normally isn't. But what's important is an incident commander is not someone who is trying to solve problems. They're not a resolver. They're there to coordinate and to delegate.
The only thing you need to get started from the roles perspective is you need an incident commander and you need your subject matter experts. And your subject matter experts are the people who just know how to do the work — they're the experts in the services and the systems that are dealing with the incident.
Now the trick about the incident commander that can be really hard — and this is where it starts to become hard on that don't take it personally point — is no matter what their day-to-day role, the incident commander is in charge and is the one who is actually making all the decisions. I cannot stress enough: they are not resolving the incident. In the United States, in the National Incident Management System, the way firefighters use it is the incident commander wears a white helmet and they have a saying: if you see someone with a white helmet holding a wrench, take the wrench away from them and hit them over the head with it, because they're not supposed to be trying to fix things.
Now don't hit your colleagues over the head with wrenches because we don't usually wear helmets. But sometimes a person who is the incident commander who was on call might actually be the right person to solve the incident. And then what do you do when that happens? Change the incident commander. Just swap them. Because as my hero Ron Swanson would say, never half-ass two jobs, full-ass one job.
Now remember the incident commander is in charge. They're the highest authority on the call, higher even than the CEO. This is tricky. My advice to you is make sure that your management knows about this before it happens or things will not go well for you. You do not want to surprise senior management with this idea during an incident. This happened at PagerDuty. Our CEO had just joined the company, maybe three weeks into her tenure, and got on a call and was basically kicked off the call by a junior engineer. I have nothing but respect for that engineer for being able to do that. And Jen was pretty fuming and pissed off at first, and then there was follow-up and an explanation of this is how we do incident management at PagerDuty and more importantly this is why.
The way you have to make sure people understand this is there's a reason we work this way. And it can be tricky especially for executives. Things like what I just talked about — when our CEO jumped on the call — is a type of behavior that we sometimes call the executive swoop, because an executive sort of swoops in and they want to take charge. So sometimes you get things like: don't listen to the incident commander, just start doing what I say. Have you tried doing this? Why isn't someone doing this? Why aren't you doing that?
And there's one magical phrase. If you are the incident commander and an executive or someone of that nature gets on the call and does that, the only phrase you need to memorize is "are you taking command?" You just simply say that. And one of two things is gonna happen. Either they're gonna say yes, and then you say fantastic, I am off the hook, it's all yours. Or more often than not they say nothing — they just sort of back away and they move along. Because that helps them understand that it's real. A lot of incident management is more about psychology than it is about technical prowess.
Another one you get is "I want this resolved in 10 minutes." How many people have been on a call like that where someone gets in and says this better be fixed in an hour, this better be fixed in 15 minutes? This is so demoralizing to hear because what it's implying is you're not working as hard as you could, so thank God we've got this CIO who came in here to motivate us and get us to go faster. No, that is not effective. The response for this one is simply to say "we're in the middle of resolving an incident, please save your comments for the end." Because, and this is the don't take it personally part, people who do this are not doing it to be jerks. They are concerned. This is their business. They are worried about what's happening. They think they're helping. So we need to do the things that keep things focused. Being strong in our statements helps.
This is why, by the way, people who are very good incident commanders are usually people who have a role of being a product owner or a project manager. Engineers are not always great incident commanders because we don't always have the non-technical skills that make us great at it, and we also usually want to try to fix the problem. But product owners — and Google does this, which is where I learned this from reading their book, and found out we do the same thing at PagerDuty — it makes a lot of sense. If you have product owners doing it, they have those non-technical skills about delegating and communicating that make them great incident commanders. They have an understanding of the system but not enough that they're gonna try to troubleshoot it. And the selfish part is now the product owner has visibility into your outages and your incidents because they sat on the call too — so all those little technical stories where you're like we want to fix this stuff, they were there. This works exceptionally well.
One of my colleagues who just joined our evangelism team, who had been a product person at PagerDuty for years, was our first non-engineer incident commander at PagerDuty. And now we have 22 incident commanders and none of them are engineers, because we want the engineers to be subject matter experts and not incident commanders.
So when we think about that not taking anything personally — to review: we're gonna make that switch in our mindset, the incident commander is the highest authority, the incident commander is not a resolver, and remembering ways to deal with the executive swoop.
So the third agreement is don't make assumptions. How many people have heard the thing "never make assumptions because they make an ass out of you and me"? But I like this cartoon better, which is: let's assume I'm right and you're wrong. So we want to avoid making assumptions. So I propose that this background is blue. Who agrees with me? Is it blue? Okay, that could take forever. Getting distributed consensus takes a long time. Let's try this another way. I propose this background is blue. Is there any strong objection? Cool, background's blue, let's move on. I had one objection but I'm optimizing for the 99%.
So that phrase is really important. Notice I said "is there any strong objection" and the word strong is key — that's making sure that I'm saying you better be able to back it up. I'm not saying I'm gonna challenge you. You may have a strong objection and it's valid, you should be able to voice it. But we're not looking for the most optimized situation. The reason we want to do this is this avoids the hindsight effect. This avoids someone coming up later and saying "well I told you that that wasn't gonna work, if you'd only asked me I could have told you that wouldn't work, I could have told you we would have had that problem." So this does give the opportunity to raise objections but we're getting consensus by asking for negatives. I remember when I was in the Boy Scouts and we would go backpacking in the mountains, you never would say "is everybody ready?" You say "is anybody not ready?" You're looking for those objections rather than the agreement, because the objections imply the agreement.
So the other thing too is: avoid jargon. Avoid very complicated technical terms and acronyms. The problem is when we put in jargon — we say things like "let's get all the ICs on the RC and get some BLTs for all the SMEs" — there's cognitive overhead. I have to now sit there and say what do all those terms mean? And if there are new people to the organization who don't know all of our terms, that can make them feel excluded. We don't want to have to over-explain things but we want to be clear rather than concise. This isn't about writing great code. At this point we're saying let's be clear rather than concise.
So we need to assign some tasks to our subject matter experts. As an incident commander, we need to resolve our incident. We've collected some ideas from the subject matter experts, we've got some things we want to try. We need to now assign these tasks. How are we gonna go about doing that in a way where we're not making assumptions? It looks a little something like this: the incident commander would say "Rachel, I'd like you to investigate the increased latency. I'm gonna come back to you in five minutes. Understood?" And Rachel will say "understood."
A couple things happen in here that are key. As the incident commander, I'm clear about what I'm asking, I'm giving it a time box, and I'm making sure that it's been acknowledged. You always want to avoid the deadly phrase which is "can someone..." A lot of times when I give my incident management talk I start at the beginning and I say "hey, can someone keep track of time for me?" And then about half an hour in I'll say so who is keeping track of time and almost nobody will have done it. That's called the bystander effect. So you always want to make sure that you assign tasks to an individual. Now they can be assigned to a role — you could say "I need the on-call DBA to go check the shipping logs" — but you're not saying "I need someone to go check those logs." And you're giving it a time box.
Now let's say Rachel comes back to me after five minutes — or I come back to her after five minutes — and she just says "I need more time." I'm not necessarily gonna say okay, I'll check back with you in another five minutes. I will say "how much more time do you need?" But it's still time-boxed. The key part of this is that the "understood" is so critical, because otherwise I could come back in five minutes and she'd be like "oh I wasn't sure you meant that I was supposed to do that" or "I wasn't sure about what you meant." This gives her the opportunity to be clear about what I'm asking.
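The assignment pattern described above (a named owner, a time box, and an explicit "understood") can be sketched as a small data structure. This is only an illustration: the `Task` class, the `assign` and `acknowledge` helpers, and the list of vague owners are hypothetical names invented for this example, not part of any PagerDuty tool.

```python
from dataclasses import dataclass

# Words that signal the "can someone..." anti-pattern (assumed list).
VAGUE_OWNERS = {"someone", "somebody", "anyone", "anybody"}


@dataclass
class Task:
    owner: str              # a named person or a specific role
    description: str
    timebox_minutes: int    # when the incident commander checks back
    acknowledged: bool = False


def assign(owner: str, description: str, timebox_minutes: int) -> Task:
    """Create a task only if it has a specific owner and a time box."""
    if owner.strip().lower() in VAGUE_OWNERS:
        raise ValueError("assign the task to a named person or role")
    if timebox_minutes <= 0:
        raise ValueError("every task needs a time box")
    return Task(owner, description, timebox_minutes)


def acknowledge(task: Task) -> Task:
    """Record that the owner said 'understood'."""
    task.acknowledged = True
    return task


task = assign("Rachel", "investigate the increased latency", 5)
acknowledge(task)
print(task)
```

If Rachel needs more time, the same idea applies: ask how much, then create or update the task with a new, explicit time box rather than leaving it open-ended.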
So when we think about not making assumptions, the pieces of that are: remember, consensus is hard — distributed consensus is hard. Think about things like Kafka and all sorts of other distributed systems — they're challenging. That's even worse when we're humans. Clear is better than concise. You want to assign tasks to one specific person. And you want to time-box those tasks.
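The assignment pattern described here (one named owner, a time box, an explicit "understood") can be sketched as a small data structure. This is only an illustration: the class name `TaskAssignment`, its fields, and the list of rejected placeholder names are assumptions made for the example and are not taken from PagerDuty's process or tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical placeholders that signal the "can someone..." anti-pattern.
VAGUE_OWNERS = {"", "someone", "somebody", "anyone", "anybody"}


@dataclass
class TaskAssignment:
    """One task, one owner (a person or a role), one time box."""
    assignee: str
    task: str
    assigned_at: datetime
    time_box: timedelta
    acknowledged: bool = False

    def __post_init__(self) -> None:
        # Refuse assignments that nobody in particular owns.
        if self.assignee.strip().lower() in VAGUE_OWNERS:
            raise ValueError("assign the task to a specific person or role")
        if self.time_box <= timedelta(0):
            raise ValueError("every task needs a time box")

    def acknowledge(self) -> None:
        """Record the assignee's 'understood'."""
        self.acknowledged = True

    @property
    def check_in_at(self) -> datetime:
        """When the incident commander comes back to the assignee."""
        return self.assigned_at + self.time_box

    def extend(self, now: datetime, more_time: timedelta) -> None:
        """The answer to 'how much more time do you need?' is a new time box."""
        if more_time <= timedelta(0):
            raise ValueError("an extension is still time-boxed")
        self.assigned_at = now
        self.time_box = more_time


start = datetime(2019, 1, 1, 2, 0)
task = TaskAssignment("Rachel", "investigate the increased latency",
                      start, timedelta(minutes=5))
task.acknowledge()
```

Rejecting "someone" at construction time mirrors the bystander-effect point: a task without a specific owner is treated as an error rather than as a request that might get picked up.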
So finally the last agreement is to always do your best.
This is a controversial statement I'm about to make and it's gonna make some of you very uncomfortable: the wrong decision is better than no decision at all. Rules are different during incident response. And the reason why the wrong decision is better than no decision is because the wrong decision provides you with information. No decision provides you with no information. If we make the wrong decision we can see that something happened, we can come back, see that that didn't work, and move on to our next thing. Otherwise we run into something called analysis paralysis, where we sit there and we try to think of every single thing that could happen, and in the meantime our company is bleeding money.
Now that doesn't mean that crazy decisions are better. It doesn't mean just go do whatever we want. But if we try to get that distributed consensus — if we said we want to make sure that we've got the exact right thing before we move — we will never move.
I call this rally fast, disband faster. What that means is: get all the right people that you need when you need them, but don't leave them on the call. Don't keep them around any longer than you need to. How many people have sat on an incident call with a hundred other people? Yeah. And of that time how much were you actually doing anything? Most of the time almost zero.
The problem is it's very common to say "well let's get everybody who might have some insight into what's going on, let's get them on a bridge and we're gonna sit there for hours." You could have a hundred, two hundred people that all cost a hundred euros an hour and you're just burning cash. And not only are you burning cash, it's probably happening in the middle of the night — these people are having quality of life issues and they're certainly not going to be effective tomorrow.
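To put rough numbers on "burning cash", here is the arithmetic as a tiny sketch. The headcounts and the hundred-euro hourly rate come from the talk; the three-hour duration is an assumption picked only for illustration, since the talk just says "hours".

```python
def bridge_cost_eur(people: int, hourly_rate_eur: float, hours: float) -> float:
    """Cost of keeping everyone on the incident bridge for the given time."""
    return people * hourly_rate_eur * hours


ASSUMED_HOURS = 3  # hypothetical duration, not a figure from the talk

low = bridge_cost_eur(100, 100, ASSUMED_HOURS)   # 100 people on the call
high = bridge_cost_eur(200, 100, ASSUMED_HOURS)  # 200 people on the call
print(f"{low:,.0f} to {high:,.0f} euros")
```

Under those assumptions a single call costs between 30,000 and 60,000 euros in salary time alone, before counting the lost productivity the next day.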
So what is okay, and this is where the disband faster comes in, is I might sit there and say "you know what, I think this is a network problem. I don't think we need the app folks anymore. Go ahead, go back, keep your phone on, we will call you if we need you." Because as a responder I would much rather be told "I can leave and you'll call me if you need me" than sit around and wait. It's stressful for the person sitting around doing nothing because they're frustrated. It's really stressful for the people that are actually trying to solve the problem because they know that there are tons of people sitting there getting irritated that they're not moving any faster.
So you want to rally fast and disband faster. Do responders get tired? What about incident commanders? Incidents can go on pretty long. So you want to be able to swap people in and out. The good news is this is relatively easy to do provided that you don't have a complete single point of failure in the organization, which is a whole other conversation. But at least for incident commanders and whoever your subject matter experts are, you'll know when you're getting to the point that people are starting to get tired. Usually we find that somewhere between four and six hours is when you really should be bringing in somebody fresh. What you do is have whoever is the backup join the call about fifteen minutes before you want to swap, so they can be shadowing. Have the backup and the current incident commander talk in a private chat room so the backup gets brought up to speed. And then all you have to do is say: "I am passing command to Rick." And Rick says "hi, I'm Rick, I am now the incident commander." And now everybody knows what's going on.
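The timing of that handover can be sketched in a few lines. The fifteen-minute shadowing window and the four-hour mark come from the talk; the function names `plan_handover` and `handover_script` are made up for this example, and the default shift length is simply the low end of the four-to-six-hour range.

```python
from datetime import datetime, timedelta

SHADOW_LEAD = timedelta(minutes=15)  # backup joins this long before the swap
DEFAULT_SHIFT = timedelta(hours=4)   # assumed default: low end of four to six hours


def plan_handover(command_started: datetime,
                  shift: timedelta = DEFAULT_SHIFT) -> tuple[datetime, datetime]:
    """Return (when the backup joins to shadow, when command is passed)."""
    swap_at = command_started + shift
    return swap_at - SHADOW_LEAD, swap_at


def handover_script(outgoing: str, incoming: str) -> list[str]:
    """The two explicit statements that make the change of command unambiguous."""
    return [
        f'{outgoing}: "I am passing command to {incoming}."',
        f'{incoming}: "Hi, I\'m {incoming}, I am now the incident commander."',
    ]


join_at, swap_at = plan_handover(datetime(2019, 1, 1, 1, 0))
lines = handover_script("Matty", "Rick")
```

Saying both statements out loud matters for the same reason as the "understood" earlier: nobody on the call has to assume who is in command.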
So we can talk a little bit about post-mortems or after incident reviews or learning reviews or whatever you want to call them. I know a lot of people don't like to call them post-mortems because it's not like your system actually died. But it's sort of like the term DevOps engineer: I've made my peace with it. Whatever the thing is that makes you do them makes me happy.
So there are a couple things to keep in mind, and this again is about doing your best and about doing improvement, because that's what after incident reviews are for — they're for improving.
They should be what we would call blameless. Now blameless is a tricky word because sometimes people think blameless means no accountability. That's absolutely not true. You screw something up, you're still accountable. You're accountable for making things better. But a blameful post-mortem is one that says "hey Matty messed up" — who cares, how does that help the system become more resilient? So okay, I went in and I deleted something. But why was I able to do that in the first place? What's wrong with our system that's allowing these fallible humans to do that? Because we're gonna make mistakes. Human error is never a root cause. That's a whole other conversation — there are multiple causes.
You want your post-mortems to be blameless, because that's what lets people be safe. Because here's the thing: people will work to the incentives you give them. So if people are afraid that if they make a mistake they are gonna be punished, does that make them make fewer mistakes? To be quite honest, no. It makes them become subject matter experts in hiding their mistakes. And when that happens you are well and truly screwed because you don't know what's happening in production anymore.
And the other thing about after incident reviews is they need to avoid being what I would call a write-only post-mortem, which is when people will say "our process says after we have an incident I write up the post-mortem and it goes in our wiki and then we're done." The whole point of them is so that we can learn from them. It doesn't necessarily mean that we're just getting action items out of them. But how are we sharing with the business, with our colleagues, with other teams about the things that we learned from this process?
And then also make sure that you're continually reviewing your incident response process because those things change constantly as well. PagerDuty, just for context, we're about 500 people right now. But there was a time when if there was any kind of an outage or an incident of any kind, pages went to everybody in the whole company. That worked fine when there were like 40 people. It would super suck right now. So as you're continually changing, whether it's once a quarter or once a year, continually be reviewing your incident response process and always look for continuous improvement. This is a key concept when we look at modern technology, modern IT operations. Whether we call it DevOps or whether we call it just doing work, we are continually looking at how we can improve the way that we're doing things.
And finally: don't panic. Here's the thing — being on call and dealing with an incident is very panic-inducing. Have you ever been sitting on the metro or on the train and somebody's cell phone is set to the exact same tone as your pager alert for on-call, and you get that jumping-heart feeling? So when you hear that tone that means you're having an outage — we have a response. It's okay to panic on the inside but don't panic on the outside, because you need to keep all the responders and everybody else you're working with calm. Panic is contagious.
When we were at PagerDuty developing a lot of these processes, some of the folks who came over trained with firefighters — not fighting fires, but just going through incident learning processes with them. And they told the story: they said often there's a fire, they go to respond to this fire, and of course the homeowner is panicking and freaking out because why wouldn't you, your house is burning down. And the firefighters — they said what we say to them is "hey, this may be your first fire, but it is not ours."
This is not our first fire. So keep that in mind. Incidents are things that happen. We want to keep calm. You're gonna have stakeholders that are panicking. Keep calm. Don't knock over your Coke Zero. Calm people stay alive.
So to recap the things about always doing your best: it's better to make the wrong decision than to make no decision. You want to rally fast and disband faster. Handovers are encouraged. Make sure your post-mortems are useful — you're not doing them just to fill out a form. And always review your process. And don't panic.
So what questions can I answer for you?
[Audience question about what happens when the wrong decision makes things worse.]
Well, then make the next decision. It can happen, because you're never gonna know that you're making the right decision until after you've made the decision. Depending on how badly you made it, that's part of your post-mortem. So let's say your decision was "well let's just drop all the tables in the database." How did that happen? So part of that is, again, it's a little glib to sort of be like "oh well just make any decision, it's fine." Remember, it's not "make any decision." The point is to not be so concerned about making sure that you've thought through every single possible thing that could happen, because we need to get service restored. But to answer this specifically: when the wrong decision makes things worse, that's what your post-mortem is for, to figure out what we learned so we can do this better next time.
Okay, so if you would like — I have a QR code. You can also go to mattstratton.com/speaking — this presentation is up there along with links to back things up. One of the things that I have a link to: we at PagerDuty have open-sourced our incident response process, the way that we do incident response. That's at response.pagerduty.com. But I have links to that, I have links to the survey and the study that we did around on-call, I have links to the Don Miguel Ruiz book as well. And you can also see all the talks and decks and things I've done over there.
And you can find me on Twitter, I'm @mattstratton. My podcast is at arresteddevops.com. I would love to hear your on-call stories either during breaks or during the networking event, along with any questions or follow-ups you have. But thank you for your time and I really enjoyed being here.