Hey everybody. Welcome to "The Proactive Approach: Data-Driven Observability and Incident Response." A clever title. I guess I get to go first, yes.

So this is the part where we tell you who we are, so you can say this is why I should totally listen to these guys, besides the fact that someone gave them a microphone or two. Anyway, my name is Matty Stratton. I'm a DevOps advocate for a company called PagerDuty. How many people here have heard of PagerDuty? How many people love getting woken up in the middle of the night by PagerDuty? It's sometimes fun, like the ringtones can be cute. He's into it, I'm into it. Anyway, just a little bit of background about myself for fun. I also run devopsdays Chicago, which is going to be in August. If you're into that, if you want to maybe give a talk, the CFP is open for the rest of the week, so we want to hear from you. I podcast on Arrested DevOps. I'm living in San Francisco for the next two weeks, and then I'm moving back to Chicago, hence Portillo's, and Ryne Sandberg was my hero growing up. A little bit of local Chicago flavor. How many people here are from Chicago? Okay, good.

So anyway, my name is PJ Haggerty. I hail from a little town just to the right of Chicago called Buffalo, New York. I am a developer advocate. I'm currently working at a company called Humio. How many people have heard of Humio? It's going to be less than PagerDuty, we know. We do logging, monitoring, observability — that's why I'm here. Beyond that, I'm a dad, I get to travel around the world which is the best part of this job, I am also a musician. I tell you this because I think a lot of times we come to these conferences and we think to ourselves, oh I'm a developer or I'm in sys ops or I'm in sec ops, and that's how I represent myself, but there's really so much more to you and you need to always remember that. Also, you might be catching on, we have a little bit of a theme, so bear with us.

First things first, there are going to be a lot of animated GIFs. Like a lot, a lot. Like I can't even tell you how many. We've already had one — that's not even a significant percentage. I tell you this because some people aren't into that, some people have issues and they don't want to see blinking things. I think this warning sign is much more annoying than anything we're about to show you, but nonetheless you have been warned: animated GIFs.

Our Twitters are there, but they're kind of hard to read because PJ made them too little. You should especially tweet PJ a lot because he's avoiding Twitter because he hasn't watched Game of Thrones. So no questions about Game of Thrones up to like last night's episode — fine. Anything regarding last night's episode, we will throw down. This is Chicago, people fight in this town.

We decided that we'd start off because not everyone comes from the DevOps background, so we'll tell you a little bit about DevOps. It's kind of an interesting idea. It kind of was born out of the agile framework, the agile concept, the philosophy. The idea is this constant continuous movement between the development and the ops sides. The idea here is that if you keep it moving, you keep it fresh, you can keep it stable and build much more resilient and much better frameworks, much better applications, and have much better infrastructure. What we're going to talk about deals with all of those things, but we need to define a few things first.

So the next big term, also in the title: observability. Rudolf Kalman coined the term, and I'm going to read his definition straight off the text: observability is the measure of how well internal states of a system can be inferred from knowledge of its external outputs. Clearly this definition came out of a kind of engineering, some sort of complex system, that is not what we are developing and not what we call engineering. This is about machines and being able to look at a car, hear a noise, and say it's the carburetor. That doesn't fit exactly what we do with DevOps, so we're going to re-examine it as we go through this talk. We're going to redefine observability for what we do and for our needs.
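For reference, here is a sketch of what that definition looks like in its original control-theory setting, assuming a linear time-invariant system with an n-dimensional state. The symbols below (A, B, C, D, x, u, y, n) are standard textbook notation and were not part of the talk.

```latex
% State x(t) in R^n, input u(t), measured output y(t)
\dot{x}(t) = A\,x(t) + B\,u(t), \qquad y(t) = C\,x(t) + D\,u(t)

% Observability matrix and the rank test
\mathcal{O} =
\begin{bmatrix} C \\ CA \\ CA^{2} \\ \vdots \\ CA^{n-1} \end{bmatrix},
\qquad
\text{the system is observable} \iff \operatorname{rank}(\mathcal{O}) = n
```

In words: if the rank test passes, watching the outputs for long enough is sufficient to reconstruct the full internal state. That is the property the talk goes on to loosen for software systems.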

Another piece of the puzzle that we need to define is the concept of an SRE. Does everyone know what an SRE is? SRE is a site reliability engineer. I've also heard there are a few people now using this term as "site resilience engineer." I don't know why, but that's great. This is a person or team of people responsible for taking all the observed behaviors that we're getting from our feedback loop, which we'll talk about, and taking action. That could be anything from solving problems proactively (looking at things like chaos engineering, buffing the infrastructure to make sure it can handle the load) to dealing with something reactively, like fighting a fire, dealing with an outage, or finding out that something was released to production on a Friday, the engineers left at three because it was beer o'clock and they decided the deploy was good and seemed stable, and then everything went down. Now you're woken in the middle of the night and PagerDuty lets you know it's 3am.

What else do we need to define? I'm sure everybody kind of understands the idea of the feedback loop, right? So this is the idea that information goes out, information comes back in. It's one of the most important concepts in DevOps — the ability to keep moving forward based on the feedback that we've taken in and to change the process itself based on what we've been informed about. This is kind of the way that we understand what we need to do and it's a major part of the entire proactive programming concept that I'll touch on a little bit, and we'll look into more like the tools and stuff like that later. But it's good to know the term "feedback loop" and kind of internalize that before moving forward.

So the feedback loop itself — we know that it's key to what we do because it informs what we're doing and it changes our behaviors. This is great, especially in the way that we mostly program and build systems, which is reactively. I think we can all agree that reaction is kind of the way that things function on a day to day basis. No one says, I know exactly how this is going to work, I know exactly what's going to happen moving forward. Even testing, even QA, things like that are part of reactive programming because you've done something, you've pushed it out, and maybe it comes back to you before it makes production, but you're still programming reactively based on that feedback loop.

People use a lot of different tools for this. There are QA tools like Selenium and things. Some people use actual human testers. Some people have user testing where they have a certain group that gets different features. If you're into the whole feature flag idea, you came to the wrong talk.

Another piece of this puzzle though is the concept of chaos engineering that I mentioned. If you've never heard of that, here's the quick primer on chaos engineering: the concept here is testing systems live and in production. How many people have heard of chaos engineering before? How many people feel that they actively use chaos engineering in their organization? How many people feel like there's no way in hell they could ever convince anyone at a level higher than like the guy sitting next to them that they would be able to do chaos engineering? Right. Exactly.

So the idea here is to build sustainability and reliability through active chaos, and what this really provides is a way to get a glimpse of what's happening when the worst occurs. It's great that you might have a plan — and Matty's going to get into this a lot more — you have a plan of what to do when things go wrong, but what about practicing for when it actually happens? What about making that happen live in production and seeing what happens? This came out of Netflix. A lot of people know that one of the biggest things they did when they were working on chaos engineering was they actually took down 50% of their production servers live to see how many people would notice. And how many people do you think noticed? Zero. And they were amazed. They were like, so we could actually have this go down and people will just be like, oh I can't get my movie, let me reload the screen. And they got it back up in time that people never noticed because instances were spun up. It wasn't a problem. So knowing how things get handled is not only about the application but it's also about the infrastructure, your team of engineers understanding where the flaws are and where the strengths might be, building the application to deal with something like that.

The next key feature of the feedback loop is proactive programming, which is part of observability. This is the direct feedback that you get from your application infrastructure and all the other systems that allow you to actually deploy a thing into the world and make sure it's functioning. There was a time when folks would write code with no tests, no QA, no staging server, no sandbox. That time is in the past. And this is coming from a person who used to be what they called a cowboy coder, which meant I would just literally rip out code and be like, there we go, I made it, let's put it in production. And people would be like, what if it's broken? And I'd be like, yeah, screw it, who cares? That was the way it was. And I'm not talking about the faraway past. This was in the 90s. But this is also like two weeks ago. There are still people who are resistant to the concepts of QA and testing and having a feedback loop. They're just like, I just write it. Luckily we have things today like Travis CI and GoCD, tools that keep us from doing the most harm before it ever gets to production. This is like pre-chaos engineering: making sure that we don't screw it up before we can screw it up.

So let's look a little bit further into observability. Monitoring methods and tools far and wide are getting more sophisticated. How many people have used New Relic? New Relic is like an APM — it's a good application monitoring system. Very high-level, lets you know kind of what's going on more or less with your application. Doesn't give you a lot of insight into your distributed system or your infrastructure at all really. And our infrastructure is getting more and more complex as time goes on. This is why you end up needing things like PagerDuty, because things aren't going to be smooth running. You don't have that cool server in the closet that's hooked up to like four nodes with wires hanging from the ceiling in a really sweaty hot room. That's not the way the world works anymore. We're working in the cloud, or we have our own server farm, or things are gigantic and it's crazy. Actually being able to monitor and see those things has really changed the way that we're able to deal with them.

And then we start looking at things in real time. I will say this: I work for Humio — we're a pretty awesome company, we monitor things, we have less than one second latency which is absolutely fantastic. Real-time though would be instant, and that's as near to impossible as you're going to get. Lots of other systems, you're not seeing feedback for five to ten minutes — that's no good. How do you react when everything's already gone wrong and the phone's already ringing and your emails are filling up because your system is down? You want to get that time to as low as possible. You want to be able to view information, investigate it, figure out the behaviors, and fix it as quickly as possible.

Most tools, you know, they focus on digesting the information and kicking it back. That's not really what we're looking for. We want something as near to instantaneous as possible. We can't get the full picture without making all of the aspects of what we're trying to observe readily available. So it's kind of like the whole idea of a JIT compiler — just in time. We want to see the things that we need to see at the time that we need to see them. We don't need to see everything all the time. If that was the case you wouldn't need a monitoring tool, you could just look at the logs.

Up to this point we've had to deal with things like slow query speeds, latency, ingest-to-search delays that make data unavailable, and data being dropped. Luckily, things have gotten a lot easier in the world of observability, because the DevOps concept brought with it the ideas of site reliability and stability in the application. That carries over to the applications that monitor your infrastructure and your applications: they need to be just as reliable so you know that you can trust them.

This ultimately results in a shift away from outdated observability. When you're looking at just parts of the system — let's say you're looking at CPU and that's kind of where your obsession lies, you want to make sure that CPU doesn't get overloaded and your instances don't have to auto scale — that's great, but if the issue is your CPU is going up because you have a FLOP issue but you're not looking at FLOP, then you have a problem because you're not seeing what you need to see. That's what observability is all about — seeing what you need to see.

So data-driven observability means that you leverage all of your log data and use real-time streaming capability for querying in dashboards, so you can power live systems that have visibility for engineers, visibility for the ops team, and visibility for the ops-minded DevOps folks such as yourselves. Now because I gave you the definition, you're all DevOps experts. We can get a stamp, we can put it on your badge. Update your LinkedIn, I will endorse you.
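As a rough sketch of what "real-time streaming capability for querying" can mean in practice, here is a minimal sliding-window counter in Python. It is not how Humio or any other product works internally. The names `LogEvent` and `SlidingErrorCounter`, the `level == "ERROR"` convention, and the assumption that events arrive in timestamp order are all invented for illustration.

```python
from collections import Counter, deque
from dataclasses import dataclass


@dataclass(frozen=True)
class LogEvent:
    """Hypothetical structured log event (field names are assumptions)."""
    timestamp: float  # seconds since some epoch
    service: str
    level: str


class SlidingErrorCounter:
    """Counts ERROR events per service over the last `window_seconds`.

    Assumes events are ingested in non-decreasing timestamp order.
    """

    def __init__(self, window_seconds: float) -> None:
        self.window_seconds = window_seconds
        self._errors = deque()    # ERROR events still inside the window
        self._counts = Counter()  # service name -> errors in the window

    def ingest(self, event: LogEvent) -> None:
        # Treat the newest event's timestamp as "now" and expire old errors.
        self._evict(now=event.timestamp)
        if event.level == "ERROR":
            self._errors.append(event)
            self._counts[event.service] += 1

    def _evict(self, now: float) -> None:
        cutoff = now - self.window_seconds
        while self._errors and self._errors[0].timestamp <= cutoff:
            expired = self._errors.popleft()
            self._counts[expired.service] -= 1
            if self._counts[expired.service] == 0:
                del self._counts[expired.service]

    def snapshot(self) -> dict:
        """Current per-service error counts, e.g. to feed a live dashboard."""
        return dict(self._counts)
```

The point of the design is that the answer is updated as each event arrives, so a dashboard reading `snapshot()` never waits on a batch job to digest the logs first.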

Making systems better is the real goal of observability. The more we know, the more we are able to improve and adjust what we're doing. It's commonly said that software development is never finished. Does anybody think they've ever written anything that was done? Anyway, the idea of being done works the same way with observability — you're never going to stop looking at your system, you're never going to stop trying to see what's going on because as soon as you turn your back everything is going to go wrong. So we're trying to eliminate some of those elements of surprise and make our application environments easier to manage and easier to use.

Remember not to get too caught up in the tooling. Find a simple thing — Humio is new and easy to use by the way. Use the thing that allows you to see the most that you need to see out of your system and that helps you to become a proactive programmer. Live system observability is all about being data-driven, and that's really what we should be looking at when we're trying to worry about observability and worry about our infrastructure, our application, what we're building, what we're doing. It improves the overall health and resiliency of your systems and, to be perfectly honest, it improves the overall mental health and reliability of your teams. If they're able to see what they need to see when they need to see it, then they're not stressed out about getting called at 3am because everything's breaking and everything's on fire.

Systems observability is at the top of the heap for every modern company and organization developing an application. Whether it's a web application, a mobile application, or drivers, you still need to monitor these things. So you need to give your developers the tools they need to be successful, and observability falls into that. Find out what works for you, use the tools that work, and go forward.

The interesting thing about observability is that it comes into almost every single part of the DevOps cycle. Maybe not when you're strictly writing code before it even gets pushed, but once it's built, once it's released and deployed, once it's in operation, once it's being monitored. And once you're developing your new plan, you use what you learned from that last deploy to say, okay, well maybe having 8,000 extra lines of JavaScript to make snowfall prettily on the screen is not such a great idea. I heard that story last night. A friend of mine was talking about how he had built a site internally at Microsoft that, at Christmastime, started to snow with a nice JavaScript thing. The problem was every single piece of snow was two pixels. This was 10 years ago. Every single piece of snow was two pixels and it would fill up at the bottom of the screen slowly over time, crashing the site every five minutes or so. It was an internal site and lots of people at Microsoft sales used it. He crashed it and then went on vacation. I'm pretty sure they were using PagerDuty at the time or something very similar because he got called back in pretty quickly.

But it's really about the several different parts of the DevOps cycle coming together. You want to be able to see what's going on at all times, and that's the key. Another part of this observability concept is having a proactive programming approach. To be clear, some people have heard "proactive programming" and thought it means having the ability to see in the future exactly what your code is going to do once it's out in the world, and that is not even close to the definition. The idea is using your tools to eliminate as many of those problems as you can before they ever become problems. Problems can range from a script that doesn't run properly and starts to build up all kinds of CPU and infrastructure problems because it wasn't properly metered, to the person who didn't run their tests and somehow got around the CI/CD pipeline and released to production. Problems can be human, and observability works there too.

The idea behind proactive programming is pushing the concept of isolating variables that might be easy to mitigate, and we'll take a look at some of these things as we move on. Again, this is not about being the all-knowing mastermind awesome person. Being proactive just means taking into account that there's considerable work to be done first to establish a base for handling user issues or code issues. The initial code doesn't solve problems — rather it sets up the organization to be able to do that in the future. So this doesn't mean that you're giving up on flexibility or that you have to have some sort of rigid structure. The way you do things is still iterative — you're still going to use the iterative nature that you use in programming, or should be if you're using modern programming techniques. But what it does mean is you'll have a nice toolkit to plug the holes and patch the leaks before they happen.

Kind of think of it as — instead of using spackle and WD-40 and flex tape and flex seal — you actually have some things that are going to make sure the gutters aren't leaking before you put them on the house. It's kind of like building a house. You want to start with a strong foundation. When you build your code, you want it to be not brittle. You want to make sure it's secure. You want to make sure the site or application is going to be reliable before you start to build on that foundation. If you're going to build on popsicle sticks and duct tape because you heard about this cool new open-source language that everyone should be using, you're going to have problems. Because if it's not mature and it's not ready and your team isn't able to build something with a good foundation, it's going to fall over. You wouldn't build a house and leave out the roof. You wouldn't build a house and leave out the foundation and the structural beams. You're not going to do the same thing with your code. That's what proactive programming is really about — making sure that anything that goes on top of the stack is going to be stabilized by the things beneath it.

You want to solve matters before they become an issue. You generally spend more time on optimizations — for example, improved security or caching everything for people to use the site offline if necessary. It's about developing more stable situations. But you could anticipate the wrong future and end up spending a lot of time on things that aren't important. So you don't want to get mired in making sure the structure is so stable that it can't be broken. You don't want to try to make the most secure system in the world — you want to make your best effort, but don't get mired in yak shaving and trying to build things that aren't going to be stable along the way, because now you can't actually work outside of this box.

When it comes to proactive engineering, the focus is on being as safe and secure as you possibly can be, and you do this before putting things out in the world. So this means having a tight feedback loop. The key drawback to proactive developing is it will take more time in the initial stages, and this is where the hard sell comes with the people who are ahead of you, the director and the C-suite folks. You're going to say, listen, I want to make things as stable as possible and it's going to take a little more time, and they're going to be like, no, get this out. It takes time and understanding — it's a cultural change to bring in proactive programming. Understand that you'll need to take that time.

I have a little anecdote about that. I once worked at an academic software company — that sounds about as sexy as it was; everything was beige. But it was really fun, we had a lot of fun solving problems. We decided to port it from Visual FoxPro — how many people will remember Visual FoxPro? I am so sorry, I apologize immediately. We were porting from Visual FoxPro, which was generating weird dynamic ASP pages, to Ruby on Rails. The project had originally been built over a course of three years, and they said, cool, you've got three months to convert it to Rails. And we're like, cool, it's not going to work. And they're like, just do it as fast as you can. And we're like, well, we have to write the tests first. And they're like, oh no, we don't have time for writing tests. I was like, but this is the way you do Rails — this is like built into the system, they want tests. They're like, no, you don't have time. So of course we released January 1 and the big educational foundation that we were doing this for found that the whole thing was broken and students were being dropped from the database randomly. We manually backdated everything and put people back into the database. And they were like, what happened, how did this happen? Like — we didn't write tests. If we had been using some proactive programming techniques, the tests would have been written, the software would have been stable, we wouldn't have had that problem. Three months may not have been enough time.

So all of it together — the feedback loop, observability, proactive programming — they're meant to make your team more resilient. So while your title might not be site reliability engineer, you are part of the site reliability engineering focus, the site reliability engineering ideal. The honest truth: nothing will make you 100% bulletproof. That's not the way the world works, and I'm sorry that I had to be the one to break that to you. So what happens when it all goes sideways?

This is the point where I will hand it over to Matty, who will tell you exactly how that works.

So one thing about observability that always comes to mind, especially when PJ was talking about not having the predictability: it reminds me of when I was working for an ecommerce company here in Chicago. You know, if you kind of look on LinkedIn you'll figure it out. We would have an outage and my boss would come to me and say, how come we didn't have monitoring in place for that thing? And I'd be like, because we didn't know that was a thing that could happen until now. This is where it comes in: monitoring and dashboards always reflect what you already know. Observability helps us answer questions that we didn't know we had.

So when we think about when things go wrong, we think about Incident Response. We talk a lot about incidents. If we are in a kind of ITIL/ITSM world there's a formal definition, and you can read a bunch of books, and yeah, I've taken the training and it's a bunch of crap, but whatever you want to call it. The thing is, when we talk about Incident Response it's just: something is happening that isn't what's expected, something is causing our business to not be able to do the things that make our business happy, and we need to be able to restore service. It's really all about interruption of service.

But here's the thing: Incident Response depends on what matters to your business. So this happens to be the definition of an incident that we use at PagerDuty. It's not the one you have to use even if you use our product. It's the one that we use because we are a product and as it turns out people kind of care if we're available. So we define an incident as any unplanned disruption or degradation of service that is actively affecting customers' ability to use the product. Should you use this definition? I don't know — if it works for you. But here's what matters: you have to have a definition of what an incident is. Because if you don't know if an incident is happening, you don't know if you should respond to it.

So what matters about the definition of an incident is it needs to be relatively simple so it's easy for everybody in the organization to remember, and it needs to be widely publicized so there's no argument, no rules-lawyering about whether this was really an incident or not. Now there's more granularity that might come into play. A typo would fit into this definition for us, and so does the whole site being down. So we get into more granularity around severities and things like that. But it's really, really important to have a clear definition of an incident because that's what is going to trigger your incident response process.
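To make "a simple, widely publicized definition" concrete, here is a minimal Python sketch that encodes the PagerDuty definition quoted above as a single predicate, with severity as a separate and later question. Everything beyond that definition is an assumption for illustration: the `ServiceEvent` fields, the three-level `Severity` enum, and especially the 50% and 5% thresholds, which are hypothetical and not PagerDuty's actual severity rules.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3


@dataclass(frozen=True)
class ServiceEvent:
    """Hypothetical description of something that just happened."""
    planned: bool
    degrades_service: bool                  # disruption or degradation
    affecting_customers_now: bool           # actively affecting product use
    fraction_of_customers_affected: float   # assumed input in [0.0, 1.0]


def is_incident(event: ServiceEvent) -> bool:
    """Any unplanned disruption or degradation actively affecting customers."""
    return (
        not event.planned
        and event.degrades_service
        and event.affecting_customers_now
    )


def severity(event: ServiceEvent) -> Severity:
    """Granularity on top of the definition. Thresholds are illustrative only."""
    if not is_incident(event):
        raise ValueError("severity only applies to incidents")
    if event.fraction_of_customers_affected >= 0.5:
        return Severity.SEV1
    if event.fraction_of_customers_affected >= 0.05:
        return Severity.SEV2
    return Severity.SEV3
```

Note that `is_incident` never looks at severity. A typo and a full outage both answer "yes" to the first question, which is what triggers the response process, and only then does the second question come up.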

Now in the incident response process, one of the most important things is we have to change our mode of thinking. We have to make a mentality shift. We're creating a distinction between normal operations and there's an incident in progress. We're switching decision-making from peacetime to wartime — from normal day-to-day operations to actively defending our business. The reason this matters is there are things that might be considered completely unacceptable during normal operations, such as deploying code without running any tests. That's not acceptable during normal operations. But it might be acceptable during a major incident when we're trying to restore service quickly. So the way that you operate, your role hierarchy, and the level of risk you're willing to take all change as we make this shift. Remember: fire is not an emergency to the fire department. You expect rapid response from a group of professionals that are skilled in the art of solving whatever issue you're having, because we're switching our mentality.

Now, not everybody jams to the military metaphor, so that's cool. You could maybe say "normal" and "emergency" if you don't like "peacetime" and "wartime." Or if this is even too complex, you could even just say "okay" and "not okay." I don't really care what you call it — you could do a happy smiley emoji and a fire emoji if that's what works for your organization. But what's happening is you're making this mental shift. This is probably one of the most important things about Incident Response, because there are a lot of things that change. And one of the most important things to remember is: during an incident, any energy or activity that is not directed toward restoring service is a waste.

Someone's getting paged, all right. So along those lines, one thing that we feel is really, really important — by the way, most of the quote-unquote rules I'm going to tell you about Incident Response are things we learned the hard way at PagerDuty. So we're not perfect. But this one is: don't litigate severity. What does this mean? This means that when you get on an incident response call you don't spend time arguing about, is this really a sev one? I think it's a sev two, I'm not quite sure. The reason is, by the time you're done arguing about whether or not it's a sev one or sev two, guess what — it's a sev one. This does not help us restore service.

Your severity levels might have to do with your communication process, but oftentimes what you will run into is sort of artificial parameters around them, like: well, if it's a sev one we attempt to restore service within five minutes, a sev two we have half an hour. No. No matter what, you're trying to restore service as quickly as possible. Severity might have to do with how you communicate or which customers are impacted.

One of the things I find when I give talks like this and I talk about this slide about "don't litigate severity" — I will talk to people afterwards and they'll say, oh yeah that happens to us a lot. We have this happen in calls all the time. And I will turn around and say, I have a question for you: how does your senior management measure effectiveness of your team? "Oh, well, how many sev ones do we have this month?" So what happens is when you are measuring people according to sev ones, people don't worry about the metrics of mean time to resolution or mean time to acknowledgment. They start to focus on a metric I call "mean time to innocence" — so you spend most of your time proving that it's not really a sev one, or "don't yell at me." This doesn't help anybody. So that's a lesson for management.
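For anyone who hasn't computed the two metrics mentioned here, this is a minimal Python sketch of mean time to acknowledgment and mean time to resolution. The `IncidentRecord` shape and its field names are assumptions for illustration, not an export format from PagerDuty or any other tool, and both metrics are measured from the moment the incident was triggered.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass(frozen=True)
class IncidentRecord:
    """Hypothetical incident timeline (field names are assumptions)."""
    triggered_at: datetime
    acknowledged_at: datetime
    resolved_at: datetime


def mean_time_to_acknowledge(incidents: list) -> timedelta:
    """Average delay between an incident triggering and a human acknowledging it."""
    seconds = [
        (i.acknowledged_at - i.triggered_at).total_seconds() for i in incidents
    ]
    return timedelta(seconds=mean(seconds))


def mean_time_to_resolve(incidents: list) -> timedelta:
    """Average delay between an incident triggering and service being restored."""
    seconds = [
        (i.resolved_at - i.triggered_at).total_seconds() for i in incidents
    ]
    return timedelta(seconds=mean(seconds))
```

Neither function takes a severity argument, which fits the point above: these numbers reward restoring service quickly, while a count of sev ones rewards arguing about labels.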

Another thing that's really important: the way we follow our incident response process at PagerDuty is heavily influenced by something called the Incident Command System, which is used by first responders. There's kind of a whole history around it. Back in the 70s there was a spate of wildfires in Southern California, and they brought in all these fire departments from all over the state to work together. They had a really hard time, because while they were all very skilled in the art of putting out fires, every department had a different way of working. So the Incident Command System was created as a standard way for first responders and firefighters to work together, and now FEMA uses it along with a bunch of other organizations.

One of the things that's important is that we don't necessarily use every single thing that's part of the Incident Command System, but one thing that's really key is there's a role called the incident commander. The incident commander sits at the top of the incident call, and what's really important about the incident commander is they delegate and they make decisions, but they are not a resolver. They are not someone who's logging into servers and restarting services and looking at deploys. They are not someone who's actually resolving the incident. They are there to help coordinate and make decisions and communicate amongst the subject matter experts — the people who are actually empowered and skilled at working at restoring service.

Does the incident commander have to be a technical person? They actually do not. And one thing that's an interesting thing we found at PagerDuty is that some of our best incident commanders are product owners. Part of this is because the problem with having an engineer or an engineering lead or an SRE in the incident command rotation is: what happens if the incident commander — we say they're not a resolver — what if they are the person that's best equipped to actually solve the problem? Do they try to solve the problem? No, because as the great Ron Swanson would say, never half-ass two jobs, whole-ass one job. So what you have to do is then you have to pass off the hat. We find that product owners actually are great incident commanders because number one, they're not going to be tempted to try to solve the problem. Number two, they tend to have skills that are really good for being in incident command — they're good at delegating, they're good at decision making. And an additional bonus that happens is when your product owners are actually on call, they suddenly care more about reliability and a lot of those technical stories suddenly manage to make their way up through the backlog. But that's just kind of a nice extra bonus.

So again, the incident commander is not a resolver. This is also a tricky one: the incident commander is the highest ranking person on the incident call, no matter what their normal role — no matter if they're just normally an individual contributor, an SRE, a QA person, a product owner. It doesn't matter what their normal role is in the org chart. On the incident bridge, they are the highest-ranking person, even higher than the CEO. This is tricky, but it's very, very important for effective incident command, because the incident commander has to not be second-guessing themselves. Do not surprise your executives with this — socialize this before you start bringing it into practice, because it will not go well for you.

A bit of history: this happened at PagerDuty. When our CEO was new, maybe about a month or so into her tenure, we had a major incident, and she got on the call and started barking orders. The incident commander, who was an individual contributor engineer, said, "Jen, you're being disruptive, I'm going to have to ask you to leave the call." And she did. And she got really upset. And then our VP of engineering and everybody kind of explained afterwards, "Jen, this is why we do things the way that we do." And now she completely understands and helps to espouse it. But it's usually a good idea to socialize this beforehand.

This leads into another thing that can happen during an incident call, which is something we call the executive swoop. The executive swoop is when you might get senior management — somebody gets on the call and says something like, for example, "I want this incident solved within the next 10 minutes." How many people have ever had that happen to them? Not only is this incredibly demotivating — because it's implying that everybody was not already working as hard as they could until the CEO got in there and yelled at them — it doesn't actually help, because everyone's trying to do their best.

This can be really hard for the incident commander when it happens: it's a CEO, it's a CTO, this is my boss's boss's boss. So here's the pro tip. If you are the incident commander and the CEO or someone gets on and starts barking orders, there's a very easy way to handle it. You say, "PJ, are you taking command?" 99% of the time they will say nothing, or say no. And if they do say yes, fantastic, great: you're now the incident commander and I'm done. But usually they will not want to take command, and it also gives that executive a chance to save face, because they want to do the right thing. This is their business.

Another form of executive swoop that we probably all have heard is they can come in and say, "I need a list of all the affected customers right now." How many times have we heard that one? So what you say as the incident commander is simply, "We can either do that, or we can restore service. Our focus is on restoring service." Notice I didn't say, "Which do you want me to do?" I'm telling you, this is what we're doing.

There's a bunch of tricks, and at the end of this talk I'll point you to where you can see our whole incident response process and training, where we go into a whole bunch of other ways of dealing with executive swoop. But this goes back to the mentality shift I talked about: we are much more direct in how we communicate. We spend less effort making sure everybody feels good about everything, because we need to restore service. And sometimes that means putting executives in their place.

Now, one of the ways you can help with executive swoop is by making sure you have a good mechanism for notifying stakeholders, something outside of the incident response call itself. Because how many times do we run into this: every five minutes you've got some other executive jumping on the call saying, "Somebody bring me up to speed," and now we all have to stop what we're doing and bring them up to speed. So we need to have some kind of mechanism. At PagerDuty we have two different Slack channels we run during an incident. We have one which is the war room for the people doing the work, and then we have one that's a notification channel. So our incident commander, in conjunction with our communication liaison, notifies stakeholders that way, so they have a way to know what's going on without actually having to get on the bridge and be disruptive. Having a mechanism to notify your stakeholders is an absolutely key piece of that.

So how do we actually go and resolve the issue? How do we do this? We talked about how we have an incident commander — they don't resolve things, they have their subject matter experts. The way we do it is with a mechanism we call "sizing up." Because again, remember the incident commander is the one who's going to be making these decisions but they don't have all the information. So you go to your subject matter experts — the people who were brought in on the call, your SREs, maybe the application folks who support that particular system, your DBAs, whoever it might be — and you say first of all, what do we know? What are the symptoms? Get this information. Then you ask: what can we do? The important thing is the question I'm asking there is not "what should we do" — it's "what can we do." Tell me my options. And then I'm going to follow up: what are the risks? So now I as the incident commander have all the information I need to be able to assign an action.
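The size-up questions above can be sketched as a small data structure. This is only an illustrative Python sketch, not PagerDuty tooling: the names `Option`, `SizeUp`, and `lowest_risk_option` are hypothetical, the 1-to-5 risk score is an assumption made for the example, and picking the lowest score is just a stand-in for the incident commander's judgment.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Option:
    """One thing the responders *can* do, with the risk the experts named."""
    action: str
    risk: str
    risk_score: int  # hypothetical scale: 1 (low) to 5 (high)


@dataclass
class SizeUp:
    """Answers collected from the subject matter experts."""
    symptoms: List[str] = field(default_factory=list)   # "What do we know?"
    options: List[Option] = field(default_factory=list)  # "What can we do?" and "What are the risks?"

    def lowest_risk_option(self) -> Option:
        # Stand-in for the incident commander's decision; a real decision
        # weighs far more than a single number.
        if not self.options:
            raise ValueError("ask 'what can we do?' before assigning an action")
        return min(self.options, key=lambda option: option.risk_score)


size_up = SizeUp(
    symptoms=["checkout latency is up", "error rate rose after the last deploy"],
    options=[
        Option("roll back the last deploy", "loses the new feature", 2),
        Option("fail over the database", "possible replication lag", 4),
    ],
)
print(size_up.lowest_risk_option().action)
```

The point of the structure is the order of the questions: symptoms first, then options, then risks, and only then an action.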

And when we assign an action, one thing to watch out for, one of the most deadly phrases during incident response, is "can someone." I don't want to say, "Can someone go look at the firewall logs?" Instead I will say, "PJ, I want you to go look at the firewall logs and see if there was any intrusion in the last 30 minutes. I will check back with you in five minutes. Understood?" "Understood." So what happened there is it was assigned to a specific person, it was time-boxed, and I made sure that PJ understood what I was asking him to do. Now if I come back to PJ in five minutes and he doesn't have any more information, do I say I will check in with you in another five minutes? No. At this point I say, "How much more time do you need?" But again, the whole point is it's still time-boxed, and it's still assigned to a specific person.
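The three properties of a good assignment (a named owner, a time box, an explicit acknowledgement) can be written down as a tiny model. This is a hedged sketch in Python; the `Assignment` class, its field names, and the list of vague words are all hypothetical and only illustrate the rule described above.

```python
from dataclasses import dataclass

# Words that signal an unowned request ("can someone ...").
VAGUE_ASSIGNEES = {"someone", "somebody", "anyone", "anybody"}


@dataclass
class Assignment:
    assignee: str            # a specific, named person
    task: str
    check_in_minutes: int    # the time box
    acknowledged: bool = False

    def __post_init__(self) -> None:
        if self.assignee.strip().lower() in VAGUE_ASSIGNEES:
            raise ValueError("assign the task to a named person, not 'someone'")
        if self.check_in_minutes <= 0:
            raise ValueError("every assignment needs a time box")

    def acknowledge(self) -> None:
        # The assignee confirms: "Understood."
        self.acknowledged = True

    def renew(self, minutes_needed: int) -> None:
        # "How much more time do you need?" The assignee names the new time box.
        if minutes_needed <= 0:
            raise ValueError("the renewed time box must be positive")
        self.check_in_minutes = minutes_needed


task = Assignment("PJ", "check the firewall logs for intrusion in the last 30 minutes", 5)
task.acknowledge()
task.renew(10)
print(task)
```

Rejecting "someone" at construction time mirrors the spoken rule: an action that is not owned by a named person is not really assigned.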

And now we talk about actually making these decisions. Because again, the incident commander is the one who's making the decision, but they're not making it in a vacuum. We do have all of our subject matter experts there, and what we want to do is have a very effective and efficient mechanism for making these decisions and getting the information we need. For example: I propose that this background is blue — does everyone agree? Do you agree, do you agree, PJ do you agree, Joe do you agree? Okay, I don't have time for this. This takes a long time to reach consensus. Distributed consensus is really hard both when it comes to people and when it comes to systems like Raft. So let's try a different way. I propose this background is blue — are there any strong objections? Hearing none — the background is blue, let us proceed.

The key phrase there is "are there any strong objections" and you need to use that phrase exactly. Number one, we're optimizing for the 99%. Number two, I don't just say "are there any objections" — I say "are there any strong objections," because during incident response we're doing a mentality shift, we're not necessarily looking for the most optimal solution. You may have an objection but it's not a strong one — you're not so sure it's great but it's okay. That's a really important key about that. And what this also helps us avoid is hindsight bias, where someone later says, "Well, if you had just asked me I would have told you not to do that." This is giving everyone the opportunity to raise those objections.
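The difference between polling for agreement and asking for strong objections can be shown in a few lines. This is an illustrative Python sketch only; the `Objection` levels and the `proceed` function are hypothetical names, and real decisions on a bridge are spoken, not computed.

```python
from enum import Enum
from typing import Dict


class Objection(Enum):
    NONE = 0
    MILD = 1    # "not so sure it's great, but it's okay"
    STRONG = 2  # "do not do this"


def proceed(responses: Dict[str, Objection]) -> bool:
    """Proceed unless someone raised a strong objection.

    Silence (no entry for a responder) counts as no objection, so the
    incident commander does not have to poll every person on the call.
    """
    return all(level is not Objection.STRONG for level in responses.values())


print(proceed({}))                          # nobody spoke up
print(proceed({"PJ": Objection.MILD}))      # mild objection only
print(proceed({"Joe": Objection.STRONG}))   # strong objection raised
```

A mild objection does not block the action, which is the mentality shift described above: the goal is a workable decision quickly, not the most optimal one.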

This is another one I feel really, really strongly about. This is because I've worked for a lot of large financial and insurance institutions, as one does when one has spent 20 years working in operations in Chicago. How many people have done Incident Response at one point or another in their career — been on call, tried to troubleshoot problems, been on these terrible bridges? How many people have been on a bridge where there are like a hundred people on the bridge? Okay. So this is super stressful and this happens a lot in large organizations because they're like, let's get everybody we can think of who might possibly be able to help — let's get them all on the call. And what ends up happening is you've got a hundred people on the call, five of whom are doing any work, which means we have 95 people there doing nothing. So they're getting super stressed out and pissed off because they're like, I'm sitting here waiting for those five developers to go check their deploy and roll back their deploy. By the way, it's 2:00 in the morning and I'm super tired, I don't want to be here. But it's also incredibly stressful for those five people because they know there's 95 people sitting around waiting for them to do something.

In addition to adding stress, it's financially expensive — let's assume each of those people is costing you $100 an hour, that's a very expensive bridge to be on. And it's expensive in terms of time. If it's happening at 2:00 in the morning, all these people are not going to be very fresh the next day. We should also hopefully be humane to our coworkers — but if you don't care about that and you care about the money, the money super matters.
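To put a number on that, here is a small worked calculation in Python. The 100 people, the 5 doing the work, and the $100 an hour come from the example above; the one-hour duration and the function name `bridge_cost` are assumptions for illustration.

```python
def bridge_cost(people_on_call: int, people_working: int,
                hourly_rate: float, hours: float) -> tuple:
    """Return (total cost, cost of the people who are only waiting)."""
    total = people_on_call * hourly_rate * hours
    idle = (people_on_call - people_working) * hourly_rate * hours
    return total, idle


total, idle = bridge_cost(people_on_call=100, people_working=5,
                          hourly_rate=100.0, hours=1.0)
print(f"total ${total:,.0f}, idle ${idle:,.0f}")  # total $10,000, idle $9,500
```

Under those assumptions, 95 percent of the hourly cost of the bridge goes to people who are waiting.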

So we like to say: rally fast, disband faster. Get the people — it's easy to get people on the call — but give them the opportunity to drop. When you're sitting there, say, okay, right now we need to have database reliability engineers working on this, so anybody who's not DBA, feel free to drop off the call. We will page you if we need you. It is much, much better for your psychological safety to be able to drop off the bridge, have your phone on you, and know they will call you if they need you. This is a really, really huge thing to help make on-call much more humane.

And probably one of the best bits of advice: don't panic. Everything about being on call is designed to raise your adrenaline. Just hearing that PagerDuty alert makes you go "oh no." How many people are sitting there on the train and somebody's got their text tone set to the same thing as your incident response alert, and they get a text and you're like "oh no!" By the way, a little pro tip: I recommend rotating your alert sound for your on-call regularly because that will help you not have this physiological response that gets embedded in you.

But the point is, you can't keep yourself from feeling this way. We can't make ourselves not feel a certain way, but we can control our response. And the thing about this is that if we show panic, panic is contagious and will make the other responders we're working with start to freak out as well. Remember: fire is not an emergency to the fire department. When we were putting together our incident response process at PagerDuty, some of our folks actually trained with firefighters — they didn't go learn how to put out fires, but they talked about the process. One of the things some firefighters told them is: it's not uncommon to be called in to a house fire and the homeowner is — logically and rationally — freaking out, because their house is burning down. If your house is burning down, you're probably not cool as a cucumber. The problem is this panicking homeowner is actually getting in the way of the fire department — it's making it hard for them to do their job. So what they would say is: "Hey, this may be your first fire, but it's not our first fire."

One of the things that can really help with this is what PJ was talking about with chaos engineering. If we think about doing failure injection, doing game days: at PagerDuty we do something we call Failure Fridays. Every Friday we take a certain part of our system and we intentionally inject failure into it. It's our way of performing chaos engineering and seeing what the system does. And we run it like an actual incident. The upshot is we know what's wrong because we broke it, but we have an incident commander and we run our whole process. This does a couple of really great things for us. Number one, it's great training for incident commanders. Before you go on as an incident commander at PagerDuty you have to have run a Failure Friday, because it's a safe place to do that. The other thing is it gives us the opportunity to practice our incident response in a predictable, calm, well-understood way. So we create this physiological muscle memory, if you will. So when it happens at 2:00 in the morning, we're like, hey, you know what, this process doesn't raise my blood pressure because I'm used to doing this. I know how to get into the Slack team, I know how to get into PagerDuty, I know how our call works, I know how sizing up works. We do this all the time. Fire is not an emergency to the fire department.

Now we've actually restored service. Hoorah! Great. So now we can all just hop off the call and go back about our business? Unfortunately there are a couple of things we need to do before we're done, and one of the most important is the postmortem, or post-incident review, or after-action report, whatever the heck you want to call it. That thing you do afterwards has to be assigned, and it has to be assigned to a specific person. It doesn't mean that this person is the one who's responsible for writing it all, but this is saying: the postmortem is owned by PJ. So we know who's going to be setting up the meeting, who's going to be starting to put the template together, collecting all the information. It is usually not the incident commander, but the incident commander is responsible for making sure it gets assigned to somebody. It's really important to make sure that you do that before the end of the call.

So speaking of which — postmortems. How many people are familiar with, have heard about, the term "blameless postmortem"? If you haven't, Google it — John Allspaw wrote a great blog post for Etsy's Code as Craft back in 2012. The idea is that when we're looking at these we're looking for what happened in the system, not for who do we fire. So we talk about having a culture of blamelessness because people are going to make mistakes, and if people are afraid of being punished for making mistakes, this does not make them make fewer mistakes. It makes people become subject matter experts in hiding their mistakes. And now you are well and truly screwed because you have no idea what's actually happening in your environment.

So the thing is, when it comes to postmortems, blameless is really, really important, but it's just the beginning. They need to be useful. What that means is postmortems should not be a form you fill out like you're doing your taxes. It's great to have tools like PagerDuty or have something in Google Docs, have a template. That's great. The problem with stuff like that is, just like we talked about how observability helps you answer questions you didn't know to ask, a postmortem template is only going to have you answer the questions you've thought to ask. So postmortems should ask more questions than they answer. A template can help you with that for sure, but you also want to make sure you tell these stories and communicate them. I'm giving a talk tomorrow called "Releasing Organizational Trauma" and I talk about why this is important. That's a little plug for my talk tomorrow.

And the other thing that matters is that things need to be communicated outside of just the team that was affected. J. Paul Reed wrote his dissertation on postmortems, and one of the interesting things he discovered was: the larger the organization, the less likely teams were to share their postmortems outside of their team. The irony is the larger the organization, the more important it is to share that information outside of the team, because systems are interconnected. Nobody works in a silo. Everything is connected. Also, other teams will think of questions that you didn't necessarily think to ask, and those questions will help you find answers.

So as I said, we have open-sourced our incident response process at PagerDuty. If you'd like to learn more about it you can find it at response.pagerduty.com. There's a link to it, and it's on GitHub, so you can fork it and build your own process. Should you use our exact process? Absolutely not. You are not PagerDuty. You're a bank, you're Apartments.com, you're Humio, or whatever you are. But it's a place to start and a place to understand. We also do take pull requests, because we are continually learning, so we'd always like to hear more about that.

I know PJ and I are the people standing between you and happy hour, but we do have a couple of minutes if anybody has any questions about either the observability that PJ talked about or incident response practices — we're happy to answer your questions.