
The talk

The Lifecycle Of A Service

Delivered 2 times · 2020


Services are the backbone of our systems. They are the pieces that make up our businesses—whether they are literal microservices or functional components of a traditional application, we can’t do the computer thing without services.

When it comes to a service in your company or organization, who’s responsible for it? The cast of characters involved in the lifecycle of a service is more than just software engineers: it can include program managers, product owners, sustainability/SRE/ops, and business stakeholders, just to name a few.

Topics covered in this talk include:

  • Defining what a service means to you and your organization
  • Roles in service ownership
  • What are you observing about your service?
  • How you want a team to respond to a service
  • Managing the service in production
  • Tuning your service
  • Understanding business impact

Product & Service Management

Every delivery (2)

Resources

Transcript · 4,435 words · ~22 min read

Lightly edited for readability from the video’s captions.

Thanks Jay. I'm really pleased to be back here again at DevOps Days New York. I was here last year, it was an absolutely great experience and I'm glad to give you a, for the most part, pretty much brand new talk. This has only been shopped out at a meetup before, so buckle up, let's have a lot of fun. And again, my name is Matt, I'm a DevOps advocate and thought leader at PagerDuty. And I would tell you to go visit the booth but the booths are all shut down now so whatever, but I hope you went and had fun anyway.

Great. So I want you all to imagine a world where you understand what you're working on. You're clear on what your dependencies are, who relies on you, and what you're delivering. You have this clear vision of your impact on your business, your organization, and you know what you want to do to continue delivering value to those people that you care about. You innovate, you try new things, and you can solve problems effectively when they come up. You and your colleagues work together to bring value to your business without blame, and you can make changes without being afraid of unintended consequences. Sounds dope, right?

All right, let's see how we can get there. So here's the thing: we talked about this idea of service ownership, and this means that people take responsibility for what they deliver at every stage of a service's lifetime. And embracing service ownership is a way to get to that vision I talked about on the last slide.

But let's get started with a word here: what the hell is a service anyway? We could talk about this for an open space for 30 minutes about every different definition of a service. I could ask 10 engineers what a service is and I'll get 15 different definitions, right? And it can be a lot of different things. Maybe a service is a microservice — that's probably where your head went first because I mean it's DevOps Days so microservice all the things, right? But it could also refer to a slice of a monolith, it could be an internal tool, maybe we're thinking about a piece of functionality, maybe we're thinking about a component or a shared infrastructure, or even a feature. These are all things that could be a way we think about a service.

And here's what it boils down to: if it provides value to other people, that's a service. So the first thing is you need to understand what it means to you. There's no one right answer. The wrong answer is when you disagree with the rest of your organization about what a service is.

So I'll give you an example. This is how we think about a service at PagerDuty. Our definition is kind of specific to some form of infrastructure that might be composed of multiple distinct services that might be written as separate pieces of code. The thing is it's wholly owned by a team. I like to think about a service as a boundary of responsibility, and it's important to have this shared understanding of those boundaries and figure out who the stakeholders of that service are. If there are multiple teams contributing, maintaining, and supporting a given service, this shared understanding becomes even more important.

So you can start by considering who is responsible for this service that we're defining. A service should be wholly owned by the team that is on-call for it. Again, that's where I think about a service definition as a boundary of responsibility. If multiple teams share responsibility for a service, sometimes it's better to administratively, if you will, split that service up into separate ones. Some organizations call this service mitosis — making a rule that at a certain team size or a certain volume of code the service and team must be split up. And yes, lines of code is a crap metric, but you know, if you use that don't feel too bad about yourself, there's worse things you can do.

Services should be set up granularly enough to help identify where problems are coming from. So the thing is if two microservices always basically behave as one area and fixing a problem in one generally means fixing it somewhere else, then yeah, don't get pedantic and say "well technically these are two microservices so they should be defined separately." No, they're really one bit of business functionality as far as a boundary of responsibility goes.

So what about the monolith, our friend the monolith? Also, thanks to Erin for correcting this slide — I had Stonehenge up before and he reminded me that's not a monolith, that's something else. So pedantry is a thing, right?

So the thing is if you have a monolith, first of all don't feel bad, it's cool, but think about how you're going to address on-call responsibilities for it. Monoliths tend to be involved in a lot of incidents just because they're big and they span a lot of stuff. So it's okay, and sometimes they're actionable and sometimes they're not. So if one team owns the whole monolith they usually don't own any other services, unless somehow the on-call for the monolith is low. But if multiple teams share responsibility for this monolith, think about how you can carve up those areas of responsibility based on functionality, and then route the alerts related to that functionality to those teams who have the ownership, right? Each source of functionality can be represented as a different service in your documentation, enumerated in your runbooks and wikis along with your on-call ownership in something like PagerDuty.
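As a rough illustration of carving a monolith into areas of responsibility, here is a minimal Python sketch that routes an alert to the owning team based on the functional area it came from. The area names, team names, and the `route_alert` helper are all hypothetical and do not come from the talk or from any PagerDuty API; a real setup would live in your alerting tool's routing configuration.

```python
# Hypothetical mapping from a functional area of the monolith
# to the team that is on call for that area.
OWNERS = {
    "checkout": "payments-team",
    "search": "discovery-team",
    "login": "identity-team",
}

# Assumed fallback owner for alerts that match no known area.
DEFAULT_OWNER = "monolith-core-team"


def route_alert(alert: dict) -> str:
    """Return the team that should be paged for this alert."""
    area = alert.get("area", "")
    return OWNERS.get(area, DEFAULT_OWNER)


alerts = [
    {"area": "checkout", "summary": "card authorization errors rising"},
    {"area": "reports", "summary": "nightly export failed"},
]
for alert in alerts:
    print(route_alert(alert), "<-", alert["summary"])
```

Keeping an explicit fallback owner is one possible design choice: an alert with no clear owner still lands with somebody instead of being dropped.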

So the thing is service ownership really is a shared responsibility. It's not just the software engineers who sling the code who are accountable to the ownership of what that service does. And what I'm going to do now is I'm going to step through some of those different roles in this shared responsibility. And as I talk about a role that you might identify with, you might find yourself sitting there going "Matt, I already know that." You know what, that's cool. Listen to the other parts because that's what's really interesting. I want you to think about the roles of the other folks who are involved in service ownership, and hey, maybe you'll learn something about your own role too. That'd be cool.

So let's think about folks who identify as devs, as Christine said yesterday. Developing a service as a software engineer is more than just writing the code. Yes, of course it involves the code in some type of shared repo, hopefully got some docs, and whatever contract that service provides other services might need to understand. The thing is, all these pieces — as you're developing a service you may not know the answers to all this stuff as you're initially designing and creating it, but you want to have some kind of standard process that your organization has agreed upon. It's similar to what Jana talked about yesterday with the production readiness review, right? So it's the same thing: you want a common way you do this.

And the thing is you're not the only person who's going to interact with this service, unless it's like my code, my applications, that sit in a repo that nobody ever looks at and nobody ever runs, but hopefully you don't have that problem. So you want to make sure that that code is reviewed with other members of your team by whatever process is appropriate for you. And because you're not always going to be around to answer questions about this stuff that you've created, it is incumbent upon you to make it understandable enough so you're not the single point of contact. And this takes being deliberate and thinking about it, and again the person you're helping might be future you. So maybe you want to be selfish. It's cool to say the two hardest things in computer science are caching strategies and naming. There's also an off-by-one joke in here somewhere.

And that's not initially wrong, right? So it's totally common to use clever and fun and silly names — whether they're Greek mythological figures, inside jokes, pop culture references, species of Pokemon, you know — as placeholders for your service names. And when you have a small organization you think that everyone's always going to remember all these inside jokes and references. But your organization's going to grow, the team's going to grow, you're going to bring in new people, and eventually the memories of that inside joke are going to fade. And now you have to explain it over and over again to everybody. And to be honest, at that point if you have to explain the joke, it probably wasn't that funny in the first place. So, to kind of paraphrase Seinfeld, I'm also offended as a comedian by your bad service names.

So here's the thing: you want to be specific. And this is boring, I get it, it's not fun, but name the service based on what it actually does. And if you already have services with these less-than-specific names, don't panic — you don't have to fix everything right away, this takes some time. So think about fixing forward, starting to name the new things, and then embracing that pattern of letting the other stuff fall off. You want to default to longer names instead of these really short ones. Well, here's the thing: you go too long, you know what people do with long names — they turn them into acronyms. And now you have the same problem you had with your inside joke. And now you've got to explain that acronym and what it really means. We always think acronyms save us time but they actually just add cognitive overload and they just generally suck. All right, that's the technical term.

So these are names that are specific and yeah I understand they're not very clever, but they say what they do. You could look at this service and you could probably reason about what the hell it's supposed to actually do. These are names that are less amazing — like don't name it Pac-Man. Burgundy B is not clever so just get over it, right? I have a colleague who's worked at four different organizations with a service named Artemis. This is probably not the last place she's worked at that will have it — she will see it again. So it doesn't tell you anything about it, right?

We want to think about how we describe the service beyond just its name. So what is the intent of this service? This is where you can record its purpose, its raison d'etre — why does it exist, right? More importantly, how does it deliver value? That kind of matters. What does it contribute to? Is it part of a customer-facing feature? Explain how it impacts customers. The description could also mention the other components that it may interact with, but know that those change as the service owners make changes to those components.

Dependencies are kind of a thing. I'm not going to go too much into this, but when you think about your service it may present itself as an API of some sort, and thinking about versioning provides a lot of value for the people who are consuming it. I always go back to this thought where when we would make changes to a service — this was pre-microservice days back in good old SOAP — and my CTO was like, "We have to test every single time we touch one of these things, we have to test all of the functionality through everything all the way back to the data warehouse." And I said, "Does Google, when they update their Maps API, reach out to every single one of their customers and ask them to check it?" No — because it's versioned, right? So you can work to that. So thinking about semver or API versioning, those things matter.
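To make the versioning point concrete, here is a small Python sketch of the compatibility rule that semantic versioning implies: a consumer built against one release can keep using later releases as long as the major version has not changed. The function names are hypothetical, and the sketch deliberately ignores pre-release tags, build metadata, and the special rules for 0.x versions.

```python
def parse_version(text: str) -> tuple[int, int, int]:
    """Parse a plain 'MAJOR.MINOR.PATCH' string into a tuple of ints."""
    major, minor, patch = (int(part) for part in text.split("."))
    return major, minor, patch


def is_compatible(consumer_built_against: str, provider_now: str) -> bool:
    """True if the provider's current release should not break the consumer.

    Assumes semantic versioning: same major version, and the provider
    is at the same or a newer minor/patch level than the consumer expects.
    """
    built = parse_version(consumer_built_against)
    current = parse_version(provider_now)
    return current[0] == built[0] and current >= built


print(is_compatible("2.1.3", "2.4.0"))  # minor bump, same major
print(is_compatible("2.1.3", "3.0.0"))  # major bump signals breaking changes
```

With a rule like this, a provider can ship minor and patch releases without asking every consumer to retest, and reserve that coordination for major version changes.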

All right, so let's think about another role. So this is maybe your sustainability team — and this is kind of a broad strokes, you might call them SREs. You might not, because you might argue with me about whether or not SREs are a sustainability team. Call them what you want, maybe it's ops, whatever it is — these folks are kind of helping think about the care and feeding, if you will.

So runbooks are a thing, right? Things are going to go wrong. And over time as we learn about the different nuances of our service we want to keep a record of what we've tried that can help resolve these common issues. So in an ideal world we'll know all these things and there's all the known knowns. And if Allspaw was here he'd be yelling at me like crazy right now when I just said you could predict anything, because you know what, you can't. And this is the problem with runbooks: you need to be able to ensure that you are updating them regularly, because when you make changes to your service, the things you did yesterday may no longer apply. If you don't have the ability to keep your runbooks up-to-date, you should consider abandoning that runbook altogether, because an out-of-date runbook is considered harmful: it can cause more harm than good.

I work at PagerDuty, so I might care a little bit about alerting. So here's the deal: you want to only alert on things that are actionable. This is how you keep people from getting burned out. There's nothing more annoying than getting that PagerDuty alert at 3:30 in the morning about something you can't do anything about, right? That's not helpful. And it also leads us to normalization of deviance where we start ignoring alerts. So I'm going to dig a little bit more into how that alerting works in a couple slides.

And you might think that your sustainability group is there to help ensure resiliency of your service. Well, how many people were here for the ignites yesterday? How many people know that John disagrees with us and he's not wrong? The things we think about from a sustainability perspective are not about resiliency; we might think more about robustness and reliability. These are folks who think about high availability, disaster recovery, the things we can predict, that we know could happen. That's robustness and reliability, and your sustainability groups are really good at this. I used to say that good sysadmins are all cynics because their job is to think of everything that might possibly go wrong.

Think about program management. We just had a pretty awesome talk about this, so I'm not going to go too deep; I don't really have to go back and explain it all, because I had a great lead-in. The thing is there's an element of unpredictability in service ownership. And so when you think about the outcomes that come from our post-incident reviews or post-mortems, or proactive maintenance, program management can help by being mindful of the buffer that we need for that kind of additional work. And maybe you call this project management, it's okay, you don't have to feel bad about yourself, it's all right, I'm not judging. Not every organization, as we know, has program management, but these are some of the things they can think about. They're helpful for understanding what "done" means, for having emotional awareness of the stress on the team from other factors, and for thinking about that connective tissue between different teams and features and what it means to pull people away from other initiatives.

And then we think about product — these things are pretty closely aligned. The thing is product owners are thinking about translating the requirements of customers beyond what something looks like or is capable of. Customers will tell product owners about what they want a product to do, but they're rarely going to specifically ask for things around uptime, performance, or security in an interview or user forum. By that I mean they're not going to use those words, but they want it and they're going to ask for it. Because without uptime a customer can't get to that new wonderful feature that you've delivered. Without performance they're going to leave in frustration — our CEO likes to say that slow is the new down, right? Performance matters. And without security they sure as hell aren't going to trust that new feature.

And speaking of CEOs, senior leadership actually has a pretty important part to play in service ownership. This model works best when it's championed by your top leadership and is consistent across product and engineering. So leaders help set goals to balance business priorities, they have to make room in their roadmaps for investing in tech debt, and kind of driving a culture of cooperation and sharing.

So let's go a little bit into some specific things. This service that we're building and developing and iterating on — what are we observing about it, what are we noticing about it, what are we paying attention to? And usually at this point I like to kind of talk about observability versus monitoring. We talk about both of those things and that itself is a whole talk in itself. You kind of saw it yesterday, so I'm not going to go too deep, but here's a really great analogy from Liz about the difference between observability and monitoring. By the way, you need both of these things. But as Liz says, monitoring is when your bank tells you that you're overdrawn. Observability lets you know you're running out of money because you're spending too much money on chocolate and sweets and candy because you recorded data on what you've spent all your money on. This felt like it hit a little too close to home for me personally.

And along those lines I like to think about this idea of empathy-driven alerting. You want to focus on the customer experience — what are the key business metrics around the experience of your customers? And if this part of the talk isn't enough Honeycomb fanboying for you, here's the third reference I'm going to make, which is Charity Majors who loves to say "nines don't matter if your customers aren't happy."

And again, we could have an entire talk about SLOs and SLAs and SLIs — in fact there was one last year here at DevOps Days New York from Alex. But I'm going to make a quick reference because I want to make a point. So these are the different levels — if you're not familiar, an SLI is a service level indicator. These are specific things that we are measuring; they may be things like latency or throughput. They are not goals, but they define what the dial looks like that we're defining. And then we talk about a service level objective, and these are made up of SLIs, they're measured over time, and these are not contractually set — these are objectives. SLAs are the ones that are contractually set. These are the ones that when you break them you owe your customer money, right?
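The relationship between an SLI and an SLO can be sketched in a few lines of Python. This is only an illustration under assumptions: the SLI here is request availability (good requests divided by total requests), the `Window` type and the 99.5% and 99.9% targets are made up for the example, and none of it comes from the talk's slides.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Request counts for one measurement window (for example, one hour)."""
    good_requests: int
    total_requests: int


def availability_sli(windows: list[Window]) -> float:
    """The indicator: fraction of good requests across all windows."""
    total = sum(w.total_requests for w in windows)
    if total == 0:
        return 1.0  # assumption: no traffic counts as fully available
    return sum(w.good_requests for w in windows) / total


def meets_slo(windows: list[Window], target: float) -> bool:
    """The objective: the indicator, measured over time, against a target."""
    return availability_sli(windows) >= target


history = [Window(995, 1000), Window(1000, 1000)]
print(availability_sli(history))
print(meets_slo(history, target=0.995))
print(meets_slo(history, target=0.999))
```

The indicator is just a measurement; it only becomes an objective once a target and a time span are attached to it.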

So where do you think we want to alert? I like to say first of all you want to alert on your SLOs, not your SLAs, because you don't want to be alerted when you break an SLA — it's a little too late at that point. So you want to alert on and think about where to set your SLOs. And this is something that I accidentally came up with as a term with Dr. Jennifer Petoff, who was the editor of the SRE book, and we call it the happiness point. This is the inflection between customer happiness and customer sadness where they directly meet, and that is where you set your SLO. Because if you set it too tight you're going to be alerted when your customers are still happy and you've wasted a little bit of time. If you set it a little too loose, you've got sad customers by the time you're being alerted. So you want to find this point. And this is a technical term and it will be in a book by O'Reilly — maybe, well, probably not. But speaking of Alex from last year, he's got a book about SLOs coming from O'Reilly.
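One way to picture alerting on the SLO and not the SLA is to treat the SLO as a tighter internal threshold that trips first. The Python sketch below assumes an availability measurement and two invented targets; the numbers, labels, and the `alert_state` function are hypothetical, and finding the real "happiness point" for the SLO is a judgment about your customers that no code can make for you.

```python
# Hypothetical targets: the internal objective is stricter than
# the contractual agreement, so it is crossed first.
SLO_TARGET = 0.999
SLA_TARGET = 0.995


def alert_state(measured: float,
                slo: float = SLO_TARGET,
                sla: float = SLA_TARGET) -> str:
    """Classify a measured availability against the SLO and the SLA."""
    if measured >= slo:
        return "ok"            # customers are presumably still happy
    if measured >= sla:
        return "page"          # SLO missed, SLA intact: time to act
    return "sla_breached"      # too late, the contract is already broken


for value in (0.9995, 0.997, 0.99):
    print(value, alert_state(value))
```

The gap between the two thresholds is the room a team has to respond before a contractual commitment is broken.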

Again, you want to alert on the SLO. This is when you have PagerDuty wake you up when you're hitting an SLO. And think a little bit about how do you want a team to respond to issues with this service. You want to think about, as you're observing this information, how do we continue to tune it and look for patterns — are we seeing consistency? And this is where your post-incident reports become really key. Are we seeing similar patterns in consumption of the service and incidents related to it?

Always be happy to prune your alerts. If you alert on five things and you understand all five of those things equally, that's pretty fair. If you alert on a thousand things and you treat each of those with equal priority, it's unlikely that any of them are getting the priority that they truly need. So always — and this is why you do post-mortems and PIRs even on things that turned out to not be incidents — because they're an opportunity to tune your alerting. You want to suppress non-actionable alerts. Work with your alerting tool to think about ways that you can suppress these non-actionable alerts. PagerDuty has a feature that does this, I'm just saying. Think about the business impact: do you know how your company makes money? If you don't, go find out. How does this service tie to your customers, to your revenue, and to the things you need to deliver? Do you understand the business metrics connected to this service?
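As a sketch of how post-incident reviews could feed alert pruning, the Python below looks at a history of alert firings and flags alerts that fired repeatedly without anyone ever taking action. The history format, the `min_fires` threshold, and the alert names are assumptions made for illustration; this does not model PagerDuty's suppression feature or any other tool's API.

```python
from collections import Counter


def prune_candidates(history: list[tuple[str, bool]],
                     min_fires: int = 3) -> list[str]:
    """Return alert names that fired at least min_fires times with no action.

    Each history entry is (alert_name, action_was_taken).
    """
    fired = Counter(name for name, _ in history)
    acted = Counter(name for name, acted_on in history if acted_on)
    return sorted(
        name for name, count in fired.items()
        if count >= min_fires and acted[name] == 0
    )


history = [
    ("disk-usage-above-70-percent", False),
    ("disk-usage-above-70-percent", False),
    ("disk-usage-above-70-percent", False),
    ("checkout-error-rate-high", True),
    ("checkout-error-rate-high", False),
    ("checkout-error-rate-high", True),
]
print(prune_candidates(history))
```

An alert on this list is not automatically wrong, but it is a reasonable place to start the conversation about suppressing or retuning it.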

And to kind of wrap this up, we're going to review the stages of the lifecycle that a service might go through and where all those things we just talked about fit into that. So when we think about designing a new service, this is the fun part — this is that greenfield, white piece of paper, start from scratch. A couple of things to think about: you need to understand your customers, product is great for this, your product team is great for this. Making sure that your sustainability teams are involved as early as possible because they're thinking about things that you might not be thinking about in this great green and bright sky-blue world. You want to start thinking about what those SLOs and SLIs are as early as possible, because that helps you tie them to the business and not just the tech.

And now when we think about maintaining and iterating — now we're moving, this is where we spend most of our life, continuing, because services are never done. So we talked about versioning the API — this is really key and this is where it matters because you want to be able to move forward without breaking your customers, and you need to be able to communicate to them and have a method and a methodology by which you share these changes. Continually address tech debt — Dave gave us a great ignite about that this morning so that's awesome.

But the thing is, eventually all good things come to an end. And when we retire a service we talk first about deprecating that service and then sunsetting. Deprecating is usually when we say it's still around, we're not fixing it anymore, we're not iterating, we're not improving it, but we don't want to just pull the plug on it right away and leave our customers in the dust. So we might deprecate it first. But to go through this we need to be able to identify the consumers of this service, and this is where your customer success and your support teams can actually help you with this a lot — they might know folks that are paying money for this service you want to get rid of. So you want to figure out what the business impact of doing that is, how that's going to impact some of your customers, and you want to be able to communicate and off-board them. And the most important thing is give them an alternative, give them a migration path. Don't just say "okay we're turning this thing off in two weeks, hope you don't mind" — that doesn't go super well. I can think of some products that do that.
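The retirement steps described here (identify consumers, communicate, offer a migration path, then sunset) can be written down as a simple readiness check. This Python sketch is an illustration only: the `Retirement` class, its fields, and the example names are hypothetical, and a real process would also involve the business-impact discussion that code cannot capture.

```python
from dataclasses import dataclass, field


@dataclass
class Retirement:
    """Tracks a deprecated service on its way to being sunset."""
    service: str
    consumers: set[str]
    migration_path: str = ""
    notified: set[str] = field(default_factory=set)

    def notify(self, consumer: str) -> None:
        """Record that a known consumer has been told about the retirement."""
        if consumer in self.consumers:
            self.notified.add(consumer)

    def ready_to_sunset(self) -> bool:
        """Only sunset once there is an alternative and everyone was told."""
        return bool(self.migration_path) and self.notified == self.consumers


plan = Retirement("legacy-report-export", {"acme", "globex"})
plan.migration_path = "use report-export-v2"
plan.notify("acme")
print(plan.ready_to_sunset())  # one consumer has not been notified yet
plan.notify("globex")
print(plan.ready_to_sunset())
```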

The thing is, at the end of the day service ownership includes communication, compromise, and commitment. So this idea — it's not just about your technology, it's not just about writing cool alerts in PagerDuty and writing awesome post-mortems and things like that — it's all about this collaboration across all of these multiple roles, and this is where the stuff really matters.

Some acknowledgments I'd like to provide: Lilia Gutnik and Nick Doty really helped write a lot of this content. Hat tip to Charity and Liz. Images came from Pixabay. If you enjoyed this talk and want to hear some other things you may find interesting, I run a podcast called Arrested DevOps. I organize DevOpsDays Chicago — our CFP is open, submit some talks. You can find me on Twitter. The deck — I'll post the link in Slack as well but the slides are up on my speaking page. And yeah, that's my license plate.

Quick couple of PagerDuty plugs: one is we're doing an event here in New York in a couple weeks. This is a practitioner-oriented event, it's not slide decks and sales. We're doing a workshop on post-mortems and a bunch of other stuff — come check that out, it's free, should be a lot of fun. And hey, if you want to do the job that I do and be a DevOps advocate, go to PDuty.me/workwithpagey — we have an opening on the team, and we're also hiring for other stuff at pagerduty.com/careers. Thank you very much.