So my name is Matt Stratton. For a second I thought there was something wrong with a slide; the main overhead was like, what happened? But I'm a DevOps advocate at PagerDuty. How many people here have heard of PagerDuty? Great, awesome. I mentioned PagerDuty, so now I've fulfilled my obligation to my employer. How many people here are currently on call or in an on-call rotation? Okay, cool. If you are on call right now, you do not need to silence your devices. We have empathy for you. I am also a Marvel fanboy. I like Batman, and that's about as far into DC as I go, but I love Marvel. And you can find me on Twitter at Matt Stratton. So a couple things to think about: be advised this talk will contain spoilers for potentially almost anything in the Marvel Cinematic Universe, but especially Infinity War and Endgame. So if you haven't seen those movies and you don't want to be spoiled, I'm not gonna tell you to leave because I want you all to stay, but you have been warned. Also I feel like the statute of limitations has expired on those things, and you came to a talk called the Thanos Incident, so I think you know what you're getting yourself into, right? And yeah, references will abound. I try to make my talks as inclusive as possible, but everybody has opted in to hear a talk about the Avengers. So let's think about what happened. We're gonna sit here and talk about Incident Response, and we're gonna talk about a specific incident. The incident is what has been called the Snap, right? This is when Thanos snapped his fingers and wiped out half of life in the universe. I don't know how many P1s you've been on, but this one seems like a pretty big one, right? So we're gonna step through a couple things. First of all, we're going to create a post-mortem of the incident. We're gonna think about what occurred, and our approach will be to address this in a blameless fashion. 
So we're gonna talk about how we approach this in a blameless way, we're gonna see why the Avengers — spoiler alert upon spoiler alert — are terrible at that, and what we really want to understand is what happened as well as the process the Avengers used. A lot of times when we do a post-mortem or a post-incident review, we're gonna think about the specific technologies, the specific things in our system, but we also think about our process. In this case we're gonna focus a lot more on the process because the Avengers don't use Kubernetes. And for purposes of this discussion, the Avengers will refer not only to the team usually referred to as the Avengers but maybe other heroes within the Marvel Cinematic Universe who are generally not considered Avengers but maybe sometimes they are, sometimes they aren't. So bear with me — you don't need to explain how wrong I am when I call Peter Quill an Avenger. Alright, so granted, we're gonna talk about timelines, right? We're gonna construct a timeline. This is the first thing we do when we are creating and working on a post-mortem. And I actually want to make a quick statement here — again, just like I'm using the term Avengers to encompass maybe some folks who are not technically considered Avengers, there's a lot of different terms we use for post-incident reviews. We might call them a post-incident review, we might call them a post-mortem, we might call it an after-action report, we might call it a learning review. For sake of simplicity, I'm gonna continue to use the term post-mortem, but substitute in whatever your organization uses. So one of the first things we do when we want to do analysis on an incident is we need to construct a timeline. And granted, when we talk about timelines in this case it gets a little squirrely because there's time travel involved, but that's not what I mean by timelines. 
Generally speaking, in the post-mortems you do in your organization, you're not creating alternate timelines when you're having a problem with Terraform. Mostly, right? Maybe just branches — like feature branches are like timelines, some branching — but that metaphor falls apart pretty quickly. The thing to remember is when we create a timeline, we're not really ready to start analysis at this point, and we want to avoid hindsight bias. So we're gonna start our timeline at a point before the incident began and work our way forward instead of backward. When you're creating a timeline, you want to stick to the facts. We're again not starting analysis yet, we're constructing a "what happened." So it should be fact-based, it should be based upon data that we might have from maybe our monitoring systems, maybe from the chat logs that occurred during our incident call. We want to include key decisions and actions taken by responders — so not only things we did but what were decisions, what things happened. It doesn't mean every single thing that someone did, but what were the key points. And we want to avoid at this point evaluating what should or shouldn't have been done. We're trying to get the information so that we can then do our analysis. And again, as I said, we start our timeline at a point before the incident began to avoid hindsight bias. I'm gonna talk about biases and cognitive distortions in a little bit, but a lot of times the incident usually began before responders were aware of it. So what we want to do is start our timeline before we became aware of it as much as we understand, because that helps eliminate hindsight bias. So what's the high-level timeline of the Thanos incident? We're gonna kind of step through, and again this is the high-level timeline — going through the actual complete timeline would probably take the full 45 minutes, so if you wanted a recap of the entire Avengers MCU that's a different talk. 
So here's where we start: Thanos gets the Power Stone and the Space Stone in various ways, right? That's a thing that happens. Then you've got Thor and the Guardians of the Galaxy, and they decide to split up. So this is a decision that was made. Thor decides to go and get himself a replacement axe, and then the rest head to Knowhere because they're gonna try to intercept Thanos. A decision was made at this point. And then, however, a thing happens: Thanos gets the Reality Stone from the Collector on Knowhere. That is a thing that occurs in our timeline. Then an action is taken by a responder: Doctor Strange uses the Time Stone to view millions of possible futures. This would be really useful during a real incident, wouldn't it? Unfortunately we don't have this available to us. And then Thanos sacrifices Gamora on Vormir to obtain the Soul Stone. Then we have several team members attempt to recover the Infinity Gauntlet from Thanos on Titan. These are actions taken by responders; at this point they're fundamentally trying to restore service the best way they know how. Then a decision is made by a responder: Doctor Strange decides to give up the Time Stone in exchange for Tony Stark's life. Shuri is now taking action to try to remove the Mind Stone from Vision so that it can be destroyed. We have various team members, various responders, attempting to defend Vision while their other team member is working to restore service. So we're working together, but actions are taken. Thanos obtains the Mind Stone from Vision. Thor attacks Thanos but is unable to defeat him. And then finally what we recognize as the actual incident: Thanos snaps his fingers, wiping out half of all life. No big deal, right? Some additional items happened in here but we're not going to dig into all of it right now. So now that we have a timeline we can start to analyze our incident. 
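If you wanted to capture a timeline like that in a tool rather than on a slide, here is one rough way it might look. This is only a sketch in Python, not anything from the talk's slides: the `Entry` and `Kind` names, the `order` field (a stand-in for a real timestamp from your monitoring or chat logs), and the helper functions are all made up for illustration. The point it tries to show is the one above: entries are facts, each is tagged as an event, a decision, or an action, and the timeline reads forward from before the incident.

```python
from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    EVENT = "event"        # something that happened
    DECISION = "decision"  # a choice a responder made
    ACTION = "action"      # something a responder did


@dataclass(frozen=True)
class Entry:
    order: int  # hypothetical stand-in for a real timestamp
    kind: Kind
    what: str   # fact only, no judgment about what should have been done


def build_timeline(entries):
    """Sort entries oldest-first so the timeline is read forward, not backward."""
    return sorted(entries, key=lambda entry: entry.order)


def render(timeline):
    """Format each entry as one line of text."""
    return [f"{e.order:02d} [{e.kind.value}] {e.what}" for e in timeline]


# Entries as they might be collected, in no particular order.
raw = [
    Entry(3, Kind.EVENT, "Thanos gets the Reality Stone on Knowhere"),
    Entry(1, Kind.EVENT, "Thanos gets the Power Stone and the Space Stone"),
    Entry(2, Kind.DECISION, "Thor and the Guardians split up"),
    Entry(4, Kind.ACTION, "Doctor Strange views possible futures with the Time Stone"),
]

timeline = build_timeline(raw)
for line in render(timeline):
    print(line)
```

Notice there is no field for "who messed up" or "what should have happened." Analysis comes later; at this stage the structure only holds what happened and in what order.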
By the way, thanks to Tiago Bob and Ryan Kitchens from Netflix for both being influences on the idea of Groot Cause Analysis; that was amazing timing. But we're gonna dig into what we can learn from this. Now we've constructed our timeline. When we think about this for an incident that's occurring within our organization, we start by creating that timeline and an understanding of what happened, and then we can begin our analysis. Here's the thing, right? Systems are complex, there's a lot going on. And we just had a really funny slide that made a reference to the term root cause analysis. This is probably going to be the last time you ever hear me endorse the phrase root cause. So here's the thing: there is no single root cause of failure in complex systems. It's a combination of contributing factors. We always want to remind ourselves there is no root cause, and our goal in analyzing the incident is not to identify this smoking gun, this root cause, this thing that is why it happened. We want to understand the multiple factors that created an environment where this failure became possible. So if you learn nothing else from this talk, it's to maybe start using the term contributing factors. The reason that root cause can be dangerous to think about is because we will identify a root cause (we want to, our brains want us to do that, our bosses want us to do that) but then we stop. We're like, well, it must have been this. When I talk about root cause and contributing factors, I usually make the comment that the only root cause is the Big Bang. In the case of the MCU that is actually literally true, because that's when the Infinity Stones were created, and if there was no Big Bang there would be no Infinity Stones. So therefore there's our root cause. Is that helpful for us to know? Absolutely not. 
We want to think about what were all the contributing factors that occurred that got us to this point, that there was failure in our team's ability to provide incident response. So we're gonna take a little time to talk about being blameless. How many people have heard about blameless post-mortems before, the idea of blamelessness? Great. So we're gonna talk about why it's important, but even more than that we're gonna talk about why it's hard. And we're not just gonna sit down and say well we should do that but it's hard. No, we're actually going to talk about how to be able to counter these things. So when we think about this — why does it matter? So first of all, this impulse to blame really has this unintended effect where it creates a situation where we don't share knowledge. A couple pithy quotes to think about: number one, you can't fire your way to reliability. And I always like to say that if people are afraid of being blamed for making mistakes, that is not going to cause them to create fewer mistakes. It actually causes them to become subject matter experts in hiding their mistakes. People don't speak up in a culture of blame, and then you have even less insight into things that are going on in your system and in your organization. So the goal of the post-mortem is to understand what systemic failures led to the incident, and also let's identify actions that can help improve the resiliency of our systems. You'll notice I didn't say identify actions that can prevent this from occurring again. This is my little pro tip to you: start getting the word "prevent" out of your language when you're working with your teams, because you don't want your big bosses to hear the idea that you're trying to prevent things. Because guess what happens if it happens again? "I thought you said you were going to prevent it." We might not be able to prevent them from occurring, but we can mitigate the impact and we can create more resilient systems. 
Keep this in mind: today's systems are always in some state of failure or degradation. Dr. Richard Cook has said it's amazing that our systems work at all, right? So we're always going to have some type of degradation. We want to build adaptive capacity into our systems, and our systems include our technical systems as well as the people who operate and deploy them, so that when these systemic things happen, our systems and our people are able to adapt and minimize the impact of these failures. And we want to stay focused on how a mistake was made, not on who did it. It does not fundamentally help us to know who did it. Sidney Dekker wrote a great book about human error, and he talks about two different models of human error. There's the older model, which is the idea that there are bad actors (people who are bad at their job, who are causing problems because of fundamental character flaws), and then the newer way to think about it is we're not thinking about bad actors, we're thinking about what happened in the system where any reasonable person might have made the same mistake, and what are the systemic factors that allowed that to happen. So why is it hard? At this point DevOps is like 10 years old, John Allspaw wrote about blameless post-mortems like seven years ago, and we get it that this matters, so we should be able to just say let's all go be blameless now, right? We know that blaming sucks. Why is it hard? So the first thing is that when we're trying to compensate for failure, the human mind takes shortcuts. This is intuition, this is usually an advantage, this is our prefrontal cortex being able to make these intuitive leaps. But these shortcuts can be damaging. And this is another thing to think about: we are hardwired from millions of years of evolution to want to blame. And this is why J. Paul Reed says blamelessness is something you cannot achieve, and we should instead be blame-aware. 
We cannot change our neuropsychological evolution. These are the tendencies that we have as human beings, but when we're aware of them we're able to do something about it. And the human mind optimizes for timeliness over accuracy when processing failure, and this can lead to cognitive biases. So let's take a minute to talk about what some of those cognitive biases are. If we're aware of these biases we can identify when they occur and we can work to move past them. There are basically — no pun intended — fundamental attribution error, confirmation bias, hindsight bias, and negativity bias. There are a lot more cognitive distortions and more biases, but these are the four that tend to lead us in a path of blame. So first about fundamental attribution error: this is the tendency to believe that what people do reflects their character rather than their circumstances. This describes that old view of human error that Dekker talks about, which assigns responsibility to bad actors who are careless and incompetent. Ironically, we tend to explain our own actions by context, not our personality. And a great MCU example of this is Wanda, who takes it on herself after Civil War — that's fundamental attribution error applied to herself. So how do you combat fundamental attribution error? We focus instead in our analysis on situational causes rather than discrete actions that individuals took. This can be really really challenging when working with more senior people in your organization who are not aware of the damaging things that blame can cause. People at senior levels tend to want someone to fire, they want someone to blame, and that's because to get to a senior level you've probably been working in this industry for a long time, and for a long time we've subscribed to the old view of human error. So confirmation bias — this is pretty pervasive, and this is a tendency to favor information that reinforces our existing beliefs. 
When we're presented with ambiguous information, which our brains cannot handle, we tend to interpret it in a way that supports our existing assumptions. When we combine that with the old view of human error, it's dangerous for post-mortems, because if we want to find a bad apple we will find one. If our confirmation bias is that failure occurs because of human error, we will find someone and we will blame them for it. So Lindsay Holmwood would suggest that if you want to avoid confirmation bias, you can appoint someone in your analysis to play devil's advocate, to take contrarian viewpoints during investigation. But you have to be cautious to not introduce negativity or combativeness when you're playing devil's advocate. Another way is to invite someone from another team to ask any and all questions that come to their mind, because those of us that have been working in a system for a long time tend to find things that support what we believe based on our experience. This is also why junior engineers are fantastic people to include in your post-incident review, because they will ask questions that it would never occur to those of us who have been doing this for a long time to ask, while we continue to go down the same path. Hindsight bias comes up a lot. This is a memory distortion where we recall events to form a judgment. If we know the outcome, it's really easy to see the event as having been predictable (well, of course we should have seen this coming) despite the fact that there was little or no objective basis for predicting it. Again, most of our teams do not have the Time Stone. We actually cannot tell the future. An example is when a person who's analyzing the incident believes they knew it would happen: "If someone had only involved me I could have told them that this scaling issue would have come up. This should have been clear. Why was this not seen?" Hindsight bias tends to also come up with folks at a more senior level: "Why could this not have been seen? 
This seems so obvious." And a way to avoid this is just like we talked about earlier — start your analysis at the beginning before the incident and move forward. We tend to want to move backwards, to say it started here and now let's build our timeline backwards, and that will lead us down the path of hindsight bias. And then finally we talk about negativity bias. This is the notion that things of a more negative nature have a greater effect on our mental state than those of neutral or even positive nature. Research has shown that negative information disproportionately impacts a person's impression of others. Again, this relates to the bad apple theory — we tend to believe negative information we hear about people or that we witness versus positive, and people are more likely to attribute negative outcomes to the intentions of another person than neutral or positive. The reality is things go right more often than they go wrong, but we tend to focus on and magnify the importance of negative events. So if we focus on and exaggerate negative events and internalize events as negative events, that's demoralizing — it can lead to burnout. We want to reframe incidents as learning opportunities. Another way to think about this — and this might be a stretch — is incidents are a gift, right? Incidents are a way of our systems telling us something we didn't know before. You can also help counter negativity bias when you're doing your post-incident analysis by not just focusing on what went wrong — ask questions about what went right, maybe even where did we get lucky. Those are important things and they're important factors. So how do we avoid blame? How do we practically deal with this stuff? So there are a couple really key ones. Ask what and how questions instead of who or why. This is the slide where everybody says "but what about the Five Whys?" And instead I would point you to an article John Allspaw wrote called the Infinite Hows. 
I will give a link at the end of this presentation that's got all the resources and all the articles I'm talking about. But if we ask how questions instead of why, this gets people to describe at least some of the conditions that allowed an event to take place. Lindsay Holmwood also notes that how questions help clarify technical details and distance people from the actions that they took. When we ask why questions, this tends to create a need for defensiveness: I have to explain my behavior. This happens even outside of post-incident review. When I'm helping organizations change, oftentimes the first thing I hear is "well, let me explain to you why we're doing things the way we did." Why questions tend to put people on their heels and make them feel like they have to justify their behavior, because fundamentally that is what you're asking them to do. We want to consider multiple and diverse perspectives. I talked about this a little bit, bringing in people from another team when you're doing your post-incident analysis. It's not just about the engineers who worked on that. There may be your incident commander, maybe the scribe, maybe the comms person. There are a lot of different people who were involved in the incident that may have different perspectives and ask questions that might not occur to you. And here's the thing: you want to ask yourself why a reasonable, competent, and decent person would have taken those actions. Because what happens when we're analyzing failure is we fall into victim, villain, and helplessness stories that propel emotions and attempt to justify our worst behavior. So we can move past this by telling the rest of the story: consider your and others' roles in the problem, and ask yourself what could be the factors that would lead a reasonable, rational, and decent person to take that action that resulted in the incident. This helps turn attention to the systemic factors that led to the incident. 
And we want to abstract to an unspecific responder. This one is hard, but I'm really a big fan of this. When we're inquiring about a human action, we want to abstract to the unspecific — anyone could have made that same mistake. So I am a big believer of not using names in post-mortems, especially not in post-mortem documents. You may be able to have a conversation about it, but the problem with abstracting to an unspecific responder — where this gets actually really hard — is when you want to say something positive, because you need to keep with that consistency. So what this means is we can still talk about roles, but instead of talking about Ken I'm gonna say "the senior SRE." And even when you get to an organization that is good at blamelessness, you may feel like you can actually use names. This actually happened inside our organization: some folks started and said "hey, I noticed that someone's name was in this post-mortem document — I thought we don't do that." And another responder said, "Well, actually we're really good at blamelessness at PagerDuty, so we can get away with it." And that person was not necessarily wrong, but the reminder is: while the people there right now might be good at that, you are constantly bringing new people into the organization who have to learn by example. The other reason — and maybe this is purely selfish — while you might be able to use names in a conversation, documents are searchable. And the last thing you want is somebody searching post-incident reviews at the time of somebody's annual review who doesn't understand the concepts of blamelessness. And again, contrast what you did not intend with what you did. So even when you make your best effort to remain blameless, it's possible that someone may still become defensive during a post-mortem meeting if they feel that they're being blamed. When this happens, we want to work to restore mutual purpose and mutual respect so we can continue the conversation. 
It's not productive to have people feel defensive — it's okay to feel defensive by the way — and the way that we do this is reiterate that the goal is to understand systemic factors. People act out defensively when they feel that their character is being attacked. So we can contrast this by saying "I did not mean to imply that you're bad at your job," contrasted with what you did intend, which was "I wanted to inquire, I wanted to ask about what was the situation, what was happening, how did this happen." We're refocusing our question away from individual motivation because that implies blame. Also, by abstracting to the unspecific responder, this encourages other responders to be more active in contributing suggestions as to what could have also contributed. Because we also generally don't want to blame our co-workers — we like our co-workers. So it could be challenging. So here's the thing: all practitioner actions are gambles. We don't have the Time Stone, right? Actions take place in the face of uncertain outcomes. The person who said this is Dr. Richard Cook of Adaptive Capacity Labs. Another way to put that — to bring this back — is this comment: "You never know. You hope for the best and you make do with what you got." And that was said by Nick Fury. So you can either listen to Dr. Cook or you can listen to Fury — they're both fundamentally saying the same thing. We don't know what's going to happen. So remember that individual actions are always gambles based upon what we knew at the time. So what can we learn from the Avengers and how they responded to Thanos? One thing — and I talked about this a little bit — blame happens a lot on this team. They love to blame. Tony loves to blame Steve Rogers for almost everything, somehow, right? This is always a challenge at the beginning of Endgame — "you weren't there, you let me down" — it was about Cap's character, it wasn't about "why did this happen, how did this happen." It was about the individual. 
So we're gonna focus on the incident response process of the team. Again, this is not about prevention. And not to pick on Tony Stark, but here's another place he kind of messes it up: Tony is all about the idea of being able to prevent. He literally wants to build a suit of armor around the world to keep humanity from failing. We can try to address our systemic issues this way, thinking that if we just built enough technology we could have a hundred percent uptime, but you can't. And the thing is we can't prevent Thanos, but we can learn how to get better at responding to incidents. So we want to think about the idea of having an incident commander during an incident response process. This is really critical. When the Avengers work well together they generally have someone calling the shots. In the Battle of New York, Cap was kind of in charge and things generally worked well. And it does seem like Cap tends to be the most likely person; he takes that leadership role. But he's also a responder, he's someone who fights the villains, who fights the baddies. And that can be a problem because, not to mix up our pop culture references, as Ron Swanson would say, "Never half-ass two things. Whole-ass one thing." You can't be an incident commander and a responder at the same time. And when we talk a little bit about what the incident commander does, the first thing to bear in mind is the incident commander is a role, it's not necessarily someone's job. You may have people who have a title of incident manager or something like that, but we're not talking about where you fall in the org chart. We're actually talking about a role you take during an incident, and the role is called incident commander. The incident commander's main role during an incident is to delegate and coordinate. 
They are sitting there helping get the right people — the subject matter experts, the folks who know how to help resolve that issue and restore service — to enable them to do that, and also to help collect information. So they delegate and they coordinate. They're also a decision-maker, and this sometimes gets a little bit tricky. Nobody should take any action on an incident unless the incident commander has said so. And this gets really really tricky when we go to the next part about being an incident commander, which is they are the single source of truth on an incident call. And the incident commander outranks everybody else on the call, including the CEO. Now if you choose to take this on in your organization, do not surprise your executives with this during an incident call — that will not go terribly well. But the reason this is really essential to Incident Response is to understand that for the purposes of incident resolution the incident commander is the head honcho. And again, they should not be an incident responder because there's too much going on. They need to be able to look at the bigger picture, and also having the incident commander frees up our subject matter experts to be able to focus on their areas of domain knowledge and expertise. So who would be a great incident commander for the Avengers? You might think that it's Fury — Nick Fury would be a great incident commander. He actually super isn't, for a couple of reasons. One is Nick Fury has a position of authority within the organization. Sometimes that could be hard. A lot of times we want to make our incident commanders be managers, and then that creates this weird power dynamic during the call — "I don't know, is this Matt the manager or Matt the incident commander?" And my relationship with the other responders is different. 
One interesting thing we learned at PagerDuty: for a while we also felt like our incident commanders had to be very strong technically, and we have since found that some of our best incident commanders come from our product organization or our customer success organization, or project management roles — scrum masters, people like that are really good at being incident commanders. The reason I bring this up is: first of all, if you are the incident commander and you discover you are the best person to respond to something in the incident, you have to now delegate someone else to be the incident commander. Product owners are almost never the subject matter expert to log in to a server and restart something. They also tend to have really good delegation and communication skills. And finally, this is just a little bit of a selfish thing that comes up: when your product owners carry a pager, they suddenly care a little bit more about those resiliency stories, or trying to push through resilience work. At the end of the day, you're gonna have a hard enough time finding people willing to be incident commander, so you don't want to limit who can actually do that. So actually Maria Hill would be a fantastic incident commander. She's got great skills at communication, delegation, and decision-making, and doesn't necessarily have some of the challenges that Fury might have. So on-call rotations — one of the problems with the Avengers is they are always on call, like they always might be called. And here's the thing: if you're not part of a rotation, that means you're never off call. I have this conversation a lot with folks who are like, "oh, don't put me in the on-call rotation, I don't want to do that." Well, I got news for you: if you're not part of a rotation, you're always on call. The beautiful thing about being on call is you get to go off call. Being constantly on, constantly needing to be made available, is a massive amount of stress and burnout potential. 
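To put some rough numbers on that, here is a small sketch in Python of a plain weekly round-robin. It is only an illustration under simple assumptions: one person on call per week, no overrides or swaps, and made-up responder names. It is not how PagerDuty or any particular scheduling tool works.

```python
def on_call_for(rotation, week):
    """Return who is on call in a given week (0-based), cycling through the rotation."""
    return rotation[week % len(rotation)]


def weeks_on_call(rotation, person, total_weeks):
    """Count how many of the first total_weeks weeks a person carries the pager."""
    return sum(1 for week in range(total_weeks) if on_call_for(rotation, week) == person)


# Hypothetical responders; substitute your own team.
shared = ["responder_a", "responder_b", "responder_c", "responder_d"]
solo = ["responder_a"]

shared_load = weeks_on_call(shared, "responder_a", 12)
solo_load = weeks_on_call(solo, "responder_a", 12)
print(f"shared rotation: {shared_load} of 12 weeks on call")
print(f"rotation of one: {solo_load} of 12 weeks on call")
```

With four people sharing the rotation, each person is on call for a quarter of the weeks and off call for the rest. A "rotation" of one is the always-on-call case: every week is your week.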
And you want to escalate and ask for help. One of the things that we always kind of wonder when we watch the individual hero movies, like the Winter Soldier, is: where's everybody else? What was Tony Stark doing during the events of the Winter Soldier when a whole bunch of helicarriers were crashing? Where was the rest of the team? And for our teams that means, again going back to our systems being complex, we may think that this is localized within our particular domain, but we need to have ways to call in resources from other areas that can help us out. We don't need to try to do everything ourselves. And this goes back to this idea of hero culture. I always think back to a poster I saw years and years ago at a software company (I'm not gonna call them out), and it said "how does it feel to save the day every day? Not all heroes wear capes." And as a sysadmin in the late 90s I loved that poster. Today I look at it and I say I want to burn that thing. Because with hero culture, again, you get Tony: "I have a plan," he goes off on his own, he's gonna save the day. And that makes you feel good maybe in the moment, but this leads to burnout. It also creates brittleness in your organization or team. You have these single points of failure when you have folks that are always the ones saving the day. And this is because teamwork makes the dream work. Maybe you are the most senior person on your team, maybe you're the one who actually wrote that function, maybe you're the one who commits to Kubernetes so you must know it better than everyone else. You can't do this alone. Diverse perspectives and experiences all contribute. It's not just about the technical knowledge, it's about the questions you might ask. These things are all essential in incident response and incident analysis, so bring more folks into the conversation. 
So when we think about sharing on-call — and again this is sometimes a controversial statement — how many people here identify as dev or software engineers versus operations folks? So if I tell you I think you should go on call, you might have strong feelings about that. But here's a couple things to think about: in the Avengers, Carol is the only one who carries a pager. That's kind of a lot of stress. The more folks that are on call, the less the load for everyone. We're distributing the work, we're distributing the load. And you need to have a consistent mechanism for bringing in your experts during incident response. This is really key. We need to have that ability — if Captain America was the only one with a pager, how do I get a hold of all of the Avengers on my team when they're the right person? I don't know. And the problem is you might have issues that I don't know about — like I don't know if PagerDuty works in space. As my coworker said, yeah it probably does, but the latency is gonna kill you. So you need to have some consistent way of knowing how to get a hold of organizational teams. I'm not telling you you have to buy PagerDuty, but have some consistent mechanism. So what are the things that we have learned? One is this is all about trust. We can't work together without trust — without trusting our colleagues, without trusting our other teams, without trusting our leadership — to be able to feel that we are in a culture where we can speak up and we can contribute. Blame erodes trust. And then if nothing else, the most important thing you might learn is: never, ever put the Avengers on call for your software solution. They are quite literally the worst at all of this. So as I mentioned, if you go to speaking.mattstratton.com, these slides will be up there. I also have a whole bunch of supporting links and articles that you might find interesting. You can also find me on Twitter at Matt Stratton. 
So just some acknowledgments I'd like to give: Jeremy Mesa and Karissa Peth helped me come up with this idea at SCaLE earlier in the year. We had just seen Captain Marvel. We were trying to explain DevOps to a couple random people we met at a hotel, and we were trying to explain on-call, and this is what came up. Nelson Al Harrington helped me kind of think about things a little more. Suraiya from PagerDuty gave me the ability to tweet about this. And Ryan Kitchens from Netflix, in addition to coming up with Groot Cause Analysis, is also the one who told me that I have the best hair out of any developer advocate. So we gotta love that. So with that I think we have a few minutes if anybody has questions. If not, I'm happy to let everybody go and get some lunch early. Come find me during the hallway track. I'd love to hear about what you're doing for incident response and incident analysis in your organization. Thank you.