
The talk

The Psychology of Chaos Engineering

Delivered 3 times · 2019–2020


Chaos Engineering, failure injection, and similar practices have verified benefits to the resilience of systems and infrastructure. But can they provide similar resilience to teams and people? What are the effects and impacts on the humans involved in the systems? This talk will delve into both positive and negative outcomes to all the groups of people involved - including users, engineers, product, and business owners.

After seeing this talk, attendees will have a better understanding of the human factors involved in chaos engineering, good practices to care for the people and teams working with chaos, and be even more excited about this practice.

Observability & Resilience · The Human Side


Transcript · 4,430 words · ~22 min read

Lightly edited for readability from the video’s captions.

Being the last talk is always super dope, but for different reasons. The best thing about being the last talk is you can kind of tie things together. I can listen to all the other talks that happen throughout the day and then tell the story. The other thing is sometimes all the things you wanted to say have already been said, so talk's over, we're done, bye, let's go. No, seriously, I'm gonna try to be a little flexible about how we go, based on the ideas we've already talked about. Gonna take this opportunity... I've got some... oh, that's weird.

Sorry, on that screen — notice you can't really see the top. Well, all the really good information is at the very top of the slide so you're missing all the real thought leadership over there.

So yeah, this talk is called "The Psychology of Chaos Engineering." It's thinking along the lines of a lot of the human factors that come into play. My name is Matty Stratton. I'm a DevOps advocate for PagerDuty. I'm not a huge fan of resume slides but I really like this one, so you're just gonna have to sit through it. It's cool.

I think I work for PagerDuty. How many people here have heard of PagerDuty? You already heard about it in some earlier slides today too. So if you didn't raise your hand, you're lying. Cool. That's as much as we're gonna talk about that right now.

Besides this, at the bottom of every slide: I do a podcast called Arrested DevOps, one of the longest-running still-running DevOps podcasts. If you're a listener and you want a sticker, come see me later. I got lots of stickers. I also have cool PagerDuty stickers, because DevRel life is giving away stickers. That's my real job: sticker engineer.

I founded DevOpsDays Chicago, which is gonna be in September. So if you're into DevOps and you're gonna be in Chicago, you should come to it. And I help run DevOpsDays all over the world. This used to be my license plate, so one might say I'm invested in DevOps. But then, you know, DevOps is like 10 years old. It literally is, right? The first DevOpsDays was 10 years ago. I want to be cutting-edge and everything, so I actually had to get a new license plate. This is my current license plate; that's my car. I might have, you know, 200 dollars more than sense, because that's what a vanity license plate costs in Illinois.

But in reality, what I like to do at this part of the talk is kind of get a level set and get some agreement so that we're kind of speaking a similar language. And in the case of this being towards the end of the day, we've seen all these talks. Fortunately, the level setting that I wanted to do has already been done. Nobody really said anything today that concretely disagrees with anything I was gonna say, so thank God, because that would have been awkward.

But a couple of things I do like to kind of stress — and you may find — the nice thing about doing this talk at the end of the day, especially at this conference, is that the things that you already know are review. It's on purpose — we're tying it all back together.

So first of all, I don't always give this talk at chaos engineering conferences. Sometimes I have to tell people what I think chaos engineering is, and by "I" I mean somebody else's definition that I like. This is from Principles of Chaos. Really, there's a couple of things about that that I always like to kind of bring up. We talked about experimenting — I think we're all pretty common in thinking about that — and we're looking at building confidence to be able to withstand these turbulent conditions. You'll notice there's nothing in here about prediction. That's just kind of my little fun point.

This is an old definition but we can go back. This was from the Netflix tech blog talking about Chaos Monkey, almost ten years ago. There are three things I like to pick out of this definition that are generally interesting. Again, at some conferences I'll be giving this talk as a brand new idea; here we're reviewing or getting on the same page. So they're running experiments in the middle of the business day. This is similar to how at PagerDuty we run our Failure Fridays during the day — they run it at lunchtime Pacific time — because frankly there's no good time for PagerDuty to be down. If there is any good time for us to take any kind of an incident, it's during the day in San Francisco when most of the people are in that particular office. People in our Toronto office will tell me that no, the best time is during the day in Toronto, because Canadians are just as important as Valley people, and that's totally true.

A carefully monitored environment is a big key part of this, and then again having your engineers standing by. This is all stuff that hopefully in this room we consider as table stakes. If any of this comes as a big surprise to you right now, you've probably been sleeping all day — which I totally get.

But perceptions: here's the thing — we just talked about what we understand this to be, but there's different perceptions around what chaos engineering is, what it provides, and what's involved. Sometimes people say perception is reality. What you're trying to do is effect change in an organization, likely, so perception super matters. It's not about being right — that's what Twitter is for. When you're doing change inside your organization, it's about understanding the perceptions.

This has been alluded to before, but this drives me up a freakin' wall: any time I try to go somewhere and collect information about chaos engineering, several people think they are the most clever person in the world who'll be the first ones to ever make this joke. They're not. Also, you see the people who come up to the PagerDuty booth and go "huh, I hate you, you wake me up in the middle of the night" — never heard that one before.

This is generally my response to that: first of all, you're wrong — that's why it's on social media — but also it's not even clever. That's the other thing. It offends me — to sort of paraphrase Jerry Seinfeld — it offends me as a chaos engineering advocate and it offends me as a comedian, because it's not even funny. If you're gonna troll, at least be funny, okay?

And again, it's not about breaking things. Our intent is not to actually break something and try to push it to its limit and go "haha, I figured out how to bust your system." That's a different thing. And if you don't believe me, believe Sylvia, because if you know anything about Sylvia Botros from SendGrid, she is the expert at breaking things. And if Sylvia tells you it's not about breaking things, then it super isn't.

Look, I know you know this, right? So why am I bringing it up? One is it's kind of a nice way to round out the day. The other is we have to kind of continually remind ourselves of these points and these principles, because almost by virtue of you sitting in this room, you're at a certain level of understanding of this practice, this field, and the first principles surrounding it. The people you're working with in your organization may not be. It's very easy, as we become further advanced in our understanding of something, to kind of forget where we came from — or where people at a different mode of understanding might be. So again, I know you know this, I'm gonna say it anyway.

The good thing is this just sort of blends right into the end of the day, so this is almost just like bullshitting at the bar — we're kind of at a bar, so it's cool.

Alright. So I know you know this, but they're experiments. I think that's a really key thing, and it's a really key way to help with that understanding. We've talked about hypotheses. Our hypothesis should be that if we do this thing, if this condition exists, my hypothesis is it will still work. If my hypothesis is that if I shut down this node everything's gonna go to hell, I probably shouldn't run that experiment. I want — again — we're testing out assumptions and hypotheses. Now, if your hypothesis is everything will go terrible, then maybe you still want to run it, but you definitely should run that in a lower environment. That's a whole different talk — the myth of staging — so we're just not gonna talk about that right now.

So again, taking a scientific approach. I absolutely loved that convergence/divergence thing that Adrian talked about, and I'm going to steal it for the next iteration of this talk that I give when Adrian hasn't given that talk right before it, so that I will look like the clever one. I may actually attribute it to Adrian — we'll see how that goes.

So, we know this, right? Why does it matter? Because how we talk about things matters. Words matter, and sometimes they don't. Getting nerd-sniped on Twitter for someone using a word wrong just to be right is not helpful. Words matter when they affect how we think about something. I'm gonna take a couple of examples not directly related but to illustrate that point.

Before working at PagerDuty, I worked at Chef. I come from an infrastructure-as-code background. Automate all the things, that's amazing. But one of the components that make up Chef code is called a recipe, because of course it is. By the way, if you hate food puns, you definitely should go use Puppet or Ansible; don't come to Chef, because Chef is a t-shirt company that also sells software and we also make a lot of puns. Oftentimes customers and users trying to adopt this would talk about "Chef scripts," and that's one I would correct gently and with an explanation. Not because of "oh no, it's a recipe because chef and food blah blah blah," but because it's a different way of thinking about what that actual application of a concept is doing. A script is iterative, stepwise; a recipe is not. Maybe it's not a perfect analogy, but the point is that recipe is how we want to think, and script is not. So we're gonna use a different word.

There are other things that I choose not to be pedantic about. I work at PagerDuty; we talk about post-mortems a lot because: incidents. How many people are aware that there are some people that don't like calling them post-mortems? And for good reason. But it's not fundamentally changing how you think about it — by calling it a post-incident review, or a retrospective, or an after-action report, or whatever. You can have a very good reason for not wanting to call it a post-mortem, but it's not related to a change in behavior of how you do it.

How many of you follow me on Twitter? It's okay if the answer is no. But if you do, you can probably take a guess as to what the next term is that I'm going to say I'm picky about. And that's the term "root cause."

The reason: first of all, I'm not getting your points here. If you won't call it "root cause," that's great; you're a fine human being and I love you. But here's why I think saying "contributing factor" instead matters: it changes how we think. When we say "root cause," it makes us think about a singular cause, which in complex systems is not there. My whole point is not "stop using the term root cause." It's about the words we're gonna use when we're trying to effect change using chaos engineering within organizations.

The thing is, people get nervous. It's kind of how we live; it's kind of how we've survived as a species, because we're worried about risk. If we weren't worried about risk, we'd have all been eaten by antelopes — well, maybe not antelopes, I don't know, I'm not good at animals — but something bigger with teeth. Risk aversion is kind of baked into us, so we're gonna get nervous about things that seem to imply additional risk. "You're gonna do what in production?"

How many people — think about folks inside your organization whose entire focus and lens by which they view their relationship to your company's business is mitigating risk? There are a lot of people for whom that's their job, and that's great. But if you are going to come to them and say "I would like to do this thing," and they hear it as "I would like to create more risk," the conversation is now over. The irony is that we all know chaos engineering helps lower risk — it's good. They will love that once they understand it, but you want to be able to have that conversation. Adrian talked about that a little bit before, and again, this is why it matters.

When we're thinking about mitigating risk, I would like to say: use your monitoring like it's for real. We've already had conversations around this. Your chaos experiment, to be successful — and by "successful" not to prove your point but to not actually get you fired — you need to be looking at the impact to production. You need to be monitoring like it's for real, because it is.

Another way to think about this — and we've seen some examples — but at PagerDuty we run our Failure Fridays like a regular incident. We already start incident response as part of the failure experiment. There are two reasons for that: one is we're already tuned up for it. For us, it's also that we believe very strongly in always practicing incident response, and a failure experiment is a really great way to normalize the practice of incident response.

This is a little different from what Russell talked about — like a fire drill making you complacent. The difference here is just being used to the motions. It doesn't mean it's the same, because we do want to try to reduce some of the stress. For example, the way we train incident commanders at PagerDuty: if you want to be an incident commander and go on call as an incident commander, you have to have run a Failure Friday, because that's giving an opportunity to practice that under a relatively low-stress situation.

The reason I'm bringing this up: you might take this and say "oh well, then a failure game day is a great way to practice a stressful situation, a great way to practice stress." No, don't do that. Because the first thing is, your people don't need training and practice at being under stress — they get plenty of that already. What we want to do is the opposite. Because we have some insight into what's likely to happen, or at least the systems that are affected, and we know what we actually acted upon, it's a really great way to practice and go through the motions, which create a kind of physiological response.

So, as PagerDuty says, when something's broken it's your fault, and in this case it actually is. You could be a little blameful in chaos days, but it's good blame. That's cool.

Okay, so what about the people? We've talked a lot about tech. We've talked a lot about the systems, the technology, the providers, the Terraforming, the Kubernetes — that's all the fun stuff. It's also the easy stuff. The human piece has come in. When we think about the people that are involved, there are lots of people that come in: there's your employees, there's your delivery teams for the systems that we're running experiments on, there are the people that are engaging in the experiment, and — I don't know — you might have some customers or users that might be wanting to use these systems. Those are people; they're involved.

So I always kind of ask this question: if I say "how does it make you feel to know that someone like Netflix is practicing these principles?" We're actually pretty down with it. "That's cool, man, they're Netflix, they're doing DevOps and all the things." And that's cool. I know that I can, like, binge all the things. What about your bank? Again, everybody in this room buys into this, and this one still made you go "hmm." We totally know why it matters, and the thing is the blast radius is what matters.

I was pleased to see that Adrian also goes to Twitter like I do. I had a very scientific survey and asked "if you discover a service you consume uses chaos engineering in production, do you feel reassured or uneasy?" And most people said they were reassured. This is not scientific at all, by the way — there's a little selection bias. But graphs mean results, so I had to add some data.

A little more data, such as it is: I did a few surveys around words that people might use to describe how they feel. What words describe your personal feeling towards use of chaos engineering on your team? A lot of people were optimistic, but there was quite a fair amount of "uneasy" and "cynical" and everything. Then — and this is what tells you everything about engineers — when we flipped it and we actually said "what about if products you use" use chaos engineering, they were like "oh my god, that's fantastic, but not us, because we're terrible." I thought that was really interesting.

One of the things that happens is when there's an understanding of the effects, chaos engineering can actually have a really positive effect on your delivery teams, because we feel more comfortable making changes and we have greater trust in the system. But we have to understand it, because it has the opposite effect if we don't. When someone thinks it's about breaking things on purpose and all those things, it's actually going to make them feel uneasy. When there's a greater understanding, this really boils down to education being helpful.

We talk about people getting nervous, management can get very nervous, and we think about considering our words. This is usually the part when I say I don't have a great suggestion of what you should call it besides "failure injection" or "chaos engineering," because it might be different. Fortunately, a bunch of people have already given you a bunch of really good ideas. The thing that's important is accuracy is not the most important part. Like Russ talked about "system verification" — and the nerd-snipers in us went "well, technically it's not just a verification, because that's a formal process" — it doesn't matter. You're talking to your CFO. All you want to actually do is get in the door to talk about it, to have that conversation. I like what Adrian says — just call it "engineering" — but that doesn't work as well for when you're trying to explain a new practice. So I invite everybody to kind of think on this, and I'd love to talk about it. It will vary depending upon your organization.

Really, it comes down to the understanding of the philosophy. When you're trying to bring people along for a ride, you want to be somewhat like-minded. I really like this: Cordys says he doesn't really use chaos engineering fully, but had some interesting failures from interns, which I guess is some form of chaos engineering. He says, honestly, he feels excitement when confronted with a new error that hasn't occurred before. This is really a lot more about learning from incidents, but those things have to be coupled.

Learning from incidents is something we're all not very good at. By "all," I don't mean all of us are bad at it, but few of us are good at it, because we usually look at incidents as something to be avoided. We don't encourage them. We don't want to say, "boy, I sure wish that I had a ton and ton of incidents." Although, one potentially controversial way to say it: it's been said, "hey, if you want to get better at incident response, try having more incidents." And that immediately sounds funny, but actually the way it's phrased is: scope more things to be considered as an incident, and you will practice it more. Incidents are a gift. If that's a little hard to swallow, maybe you say incidents are an "unplanned investment." But if we're not focusing on being a learning culture, we're not gonna get a lot of value out of all the practices we've talked about today. It's about the learning.

That's what I thought was so great: that we had a talk about post-mortems as they apply to this. We need to run all the things we would do for an incident on our experiments as well. The only reason that makes sense is if your after-incident review, your PIR, your post-mortem, is focused on learning. If the whole reason you do a post-mortem is to write down the root cause, it doesn't make any sense to do it after a chaos experiment. You'll already know what the root cause was: it was "we turned off this thing." I think these practices are so well coupled, which is why in a lot of people's minds they all kind of run under resiliency engineering practices. Learning from incidents and these things are all loosely coupled. They're all connected because they all come back to us wanting to learn and have better understanding and be able to reason about our systems more broadly.

So kind of — I'm going a little quick; part of it is because we're getting to the end of the day and I always start running fast, and I'd love to have a little more hallway track. But just a couple of things I'd like to consider again.

Safety first. This isn't the Safety-II idea of safety, but, you know, safety first. We want to think about all sorts of things we've talked about today — minimizing blast radius, making sure your responders are ready. By the way, speaking of one thing I forgot: was I the only one waiting for the last slide of Ramirez's talk about where the git repo was? There's — I don't know — we don't have that yet. So I want it. That was exciting to me because of the thoughts around the whole experiment.

A couple of things I think are very key to keep in mind, and these may seem like table stakes. To use a terrible analogy: I used to do swing dancing, and we used to say about dancers: beginning dancers take intermediate classes, intermediate dancers take advanced classes, advanced dancers take beginner classes. So it's always helpful. As much as we're big experts in this room, a couple of these things: if it is common sense, great, it's a good review. If it's table stakes, I assume you're already doing all of these.

Knowing your conditions of when to shut down the experiment: that's knowing what your key business indicators are. You're not shutting down the experiment because SQL Server is using too much memory all of a sudden. You're shutting down the experiment because your average cart size has dropped below where it's supposed to be. Know what your key business metrics are. That's where you're just gonna call it.
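To make that stop condition concrete, here is a minimal sketch, assuming a single business metric with a known baseline. The names `AbortCondition` and `first_abort_index`, the baseline of 50.0, and the 10% threshold are all hypothetical illustrations, not anything shown in the talk or used at PagerDuty.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class AbortCondition:
    """Hypothetical guard: call off the experiment when a business metric drops too far."""
    metric_name: str
    baseline: float           # assumed "normal" value of the metric
    max_drop_fraction: float  # 0.10 means abort once the metric is more than 10% below baseline

    def should_abort(self, current: float) -> bool:
        # The floor is the lowest acceptable value for the metric.
        floor = self.baseline * (1 - self.max_drop_fraction)
        return current < floor


def first_abort_index(condition: AbortCondition, readings: Iterable[float]) -> Optional[int]:
    """Return the index of the first reading that trips the guard, or None if none does."""
    for i, value in enumerate(readings):
        if condition.should_abort(value):
            return i
    return None


# Made-up readings of a business metric sampled during an experiment.
guard = AbortCondition("average_cart_size", baseline=50.0, max_drop_fraction=0.10)
aborted_at = first_abort_index(guard, [51.0, 49.5, 46.0, 44.0, 43.0])
```

The guard watches a business indicator rather than a host-level signal such as memory use, which is the distinction drawn above. In a real experiment the readings would come from whatever monitoring is already in place.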

And again, we want to build resiliency, and resiliency comes from people having adaptive capacity. What we're not trying to do here is stress test people. Even if Adrian's gonna break the necks of your key SREs — you're not trying to add stress, you're trying to find challenges. So it can be very tempting to look at a chaos experiment as a way to test people's ability to troubleshoot, or to simulate the stress of on-call. We want to have transparency — never surprise anybody with your chaos experiment.

And at the end of all these wonderful numbers about reducing MTTR and availability numbers and everything, there are always people. Where I bring that to mind is: at some of the scale at which we work, a small number is not a small number. "Oh well, we only impacted one tenth of a percent of our users" — okay, well, that might have been like 10,000 people that just had a shitty hour because we weren't really watching those key metrics. So at the other side of all of your graphs and all of this, there are humans. We always want to keep the humans involved and think about them. It doesn't mean we don't take risks, but remember that at the end of the day it's really easy to look at a graph, but all those little points at the end are probably a person, most likely. So trying to remember the humans is what matters.
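The arithmetic behind "a small number is not a small number" can be sketched as follows. The helper names are hypothetical, and the 10-million-user figure is only what the numbers in the example above imply, not a real user count.

```python
def affected_users(total_users: int, impact_fraction: float) -> int:
    """People touched when a fraction of the user base is impacted."""
    return round(total_users * impact_fraction)


def implied_user_base(affected: int, impact_fraction: float) -> int:
    """User base implied by an affected count and an impact fraction."""
    return round(affected / impact_fraction)


one_tenth_of_a_percent = 0.001

# "One tenth of a percent" being about 10,000 people implies roughly 10 million users.
base = implied_user_base(10_000, one_tenth_of_a_percent)
people = affected_users(base, one_tenth_of_a_percent)
```

Reporting the affected count next to the percentage keeps the people in view when reading the graph.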

Not that these slides are terribly exciting, but if you want to check them out — that's my speaking website. The slides are there, some supporting links, some other articles I found interesting. There are a couple of articles that I link to in there that I didn't talk about that are about how we do Failure Fridays and stuff at PagerDuty — not saying you should do exactly what we do, but you're all interested in this domain, so it's more things to know.

And yeah, if you like Twitter, that's where you can find me. I'm Matt Stratton, and PagerDuty says follow me.