The Lifecycle Of A Service

A presentation at DevOpsDays New York City 2020 in March 2020 in New York, NY, USA by Matt Stratton

Slide 1

Slide 1

The Lifecycle of a Service Matt Stratton DevOps Advocate & Thought Validator, PagerDuty @mattstratton

Slide 2

Slide 2

Imagine a world where you understand what you’re working on, you’re clear on what your dependencies are, who relies on you and what you’re delivering. You have a clear vision of your impact on your business, your customers, and you know what you want to do to continue delivering value to the people you care about. You innovate, you try new things, and you can solve problems effectively when they come up. You and your colleagues work together to bring value to your business without blame, and you can make changes without being afraid of unintended consequences.

Slide 3

Slide 3

Service Ownership means people take responsibility for what they deliver, at every stage of a service’s lifetime. @mattstratton

Slide 4

Slide 4

What is a service? @mattstratton

Slide 5

Slide 5

A service can be a lot of things Microservice Slice of a monolith Internal tool Piece of functionality Component Shared infrastructure Feature @mattstratton For today, let’s call them all a kind of ‘service’. If it provides value to other people, it’s a service.

Slide 6

Slide 6

A service can be a lot of things If it provides value to other people, it’s a service @mattstratton For today, let’s call them all a kind of ‘service’. If it provides value to other people, it’s a service.

Slide 7

Slide 7

Define what a “service” means to you @mattstratton

Slide 8

Slide 8

A service is a discrete piece of functionality that provides value that is wholly owned by a team @mattstratton At PagerDuty, our definition is more specific to an infrastructure that is composed of distinct services that are written as separate pieces of code, but can be applied to parts of a monolithic codebase as well, or even external tools.

Slide 9

Slide 9

Shared understanding @mattstratton It’s helpful to come to a shared understanding of the boundaries of a given service and figure out who all the stakeholders are. If there are multiple teams contributing, maintaining, and supporting a given service, this gets even more important.

Slide 10

Slide 10

Who is responsible? @mattstratton Start by considering who is responsible for the service you are defining. A service should be wholly owned by the team on-call for it.

Slide 11

Slide 11

“Service mitosis” @mattstratton If multiple teams share responsibility for a service, it’s better to split that service up (if possible) into separate services. Some organizations call this “service mitosis”, making a rule that at a certain team size or the volume of code in a code base, the service and team must split up.

Slide 12

Slide 12

Service definitions help with problem resolution @mattstratton Services should be set up granularly enough to help identify the source of problems. If two microservices always behave as one, and fixing a problem on one means fixing it also on another, it might make sense to combine them.

Slide 13

Slide 13

What about a monolith? @mattstratton If you have a monolith, consider how you will address on-call responsibilities for it. Monoliths tend to be the cause of a lot of incidents, both actionable and nonactionable.If one team owns the monolith, they usually don’t own any other services unless the on-call load for the monolith is low. If multiple teams share responsibility for the monolith, consider gradually identifying different sources of functionality and routing those alerts to teams that have the right context. Each source of functionality can be represented as a different service for your documentation, enumerating runbooks and wiki’s, along with your oncall ownership in PagerDuty.

Slide 14

Slide 14

Roles in service ownership @mattstratton Service ownership is a shared responsibility – engineers aren’t the only people accountable to the ownership of the code they write.

Slide 15

Slide 15

Development Team @mattstratton Developing a service is more than writing code. Yes, of course, it involves code in a shared repo with documentation, generally a Readme that explains what the code does, and the contract that other services can understand if they need to interact with it. You may not have everything in place as you’re initially developing a service, but you should go through a standard process that your organization has agreed upon.

Slide 16

Slide 16

Your service should make sense to other people who will interact with it @mattstratton Remember, you are not the only person who will interact with this service. Make sure your code is reviewed by other members of your team, with whatever process. You will not always be around to answer questions about what your code is intended to do, so it is your responsibility to make it understandable so that you aren’t a single point of contact.

Slide 17

Slide 17

Naming @mattstratton They say that the two hardest things to do in computer science is caching strategies and naming – and they’re not wrong. It’s totally normal to use silly names (Greek mythical figures, inside jokes, pop culture references, space missions) as placeholders for service names. When you have a small organization, you think that everyone will always remember every inside joke – but you’ll grow your organization, you’ll welcome new members to your team and eventually the memories of that inside joke will fade, and that inside joke will be something that you end up having to explain to everyone. At that point, if you have to explain the joke, it’s not even funny enough to be worth it.

Slide 18

Slide 18

Be specific @mattstratton Consider naming your service something specific; naming it to explain what it does. If you already have some services with less-than-specific names, don’t panic. You don’t have to make changes to everything all at once. Fix forward, and start naming new services with more specific, clear names. We understand this is not a trivial task. Ask your colleagues what they would expect this thing to be named. Default to long-ish names, instead of shorter ones, but not too long. Names that are too long often get shortened into acronyms, and acronyms are just as hard to understand.

Slide 19

Slide 19

Names that are specific • “User authenticator” • “Payment processor” • “Shopping cart” • “Login” • “Report generator” • “Email tracking code” @mattstratton Save your creativity for NaNoWriMo (national novel writing month)

Slide 20

Slide 20

Less amazing names • PacMan (unless you’re actually building PAC-MAN, which I doubt) • Apollo • BurgunDB • Artemis @mattstratton I have worked at 4 different organizations that have named a service “Artemis”. I am sure I will see it again.

Slide 21

Slide 21

Descriptions @mattstratton

Slide 22

Slide 22

Descriptions • What is the intent of this service, component, this slice of functionality? • How does this thing deliver value? • What does it contribute to? • How will this impact customers? @mattstratton What is the intent of this service, component, this slice of functionality? This is where you can record the purpose of this service. How does this thing deliver value? What does it contribute to? If this is part of a customer-facing feature, explain how this will impact customers, how it rolls up to the larger business component. The description can also mention the other components that this service may interact with, but know that these may change as other service owners make changes to those components.

Slide 23

Slide 23

API • Versioning • Clear documentation / examples @mattstratton What do you owe to the people who consume your service?

Slide 24

Slide 24

Sustainability team @mattstratton This is the group that you might call SRE, ops, etc

Slide 25

Slide 25

Runbooks @mattstratton Things will go wrong. As you learn about the different nuances of your service, you should keep a record of what you’ve tried that can resolve common issues. In an ideal world, you have the ability to resolve these common issues at the source of the problem within the source. This is a contentious topic. Runbooks are only helpful if they are accurate. It needs to be part of your SLDC to update the runbook with every change in functionality.

Slide 26

Slide 26

Alerting @mattstratton You should work to only alert on items that are actionable, and that are related to business goals/functionality. We will dig more into how to set these thresholds and monitoring a little bit later.

Slide 27

Slide 27

Resiliency Robustness and reliability @mattstratton This does not mean “high availability”. The better terms to use are “robustness” and “reliability”

Slide 28

Slide 28

Program management @mattstratton There’s an element of unpredictability in service ownership. Outcomes from post-mortems, proactive maintenance. As a program manager, you have to be mindful of the necessity for a buffer. Note - in some organizations, this might be considered “project management”. And not all orgs will have this role.

Slide 29

Slide 29

Responsibilities of program management • Defining what ‘done’ is • Emotional awareness of stress of the rest of the team • Connective tissue work between different teams and features (help understand and mitigate dependencies) • Awareness of what it means to pull people away from other projects to solve a problem @mattstratton

Slide 30

Slide 30

Product owner @mattstratton Product owners are responsible for translating the requirements of the customers, beyond what something looks like or what is capable of. Customers will tell product owners about what they want a product to do, but they’ll rarely specifically ask for uptime, performance, or security in an interview or user forum.

Slide 31

Slide 31

Customers are always asking for uptime, performance, and security – they just don’t usually use those words @mattstratton Consider, without uptime, a customer can’t get to the new wonderful feature you’ve delivered. Without performance, they’ll leave in frustration, tired of waiting for the feature to load. Without security, they won’t trust that new feature.

Slide 32

Slide 32

Senior leadership @mattstratton Service ownership works best when it’s championed by top leadership, and consistent across the entire product and engineering organization

Slide 33

Slide 33

Responsibilities of senior leadership • Make room in the roadmap for investing in tech debt • Encourage a culture of cooperation and sharing • Set goals that balance business priorities with achievable engineering goals @mattstratton Leaders also help set goals that balance business priorities with achievable engineering goals. If leadership expects 100% uptime for your main product, it’ll be pretty challenging to create a reasonable plan for how the various components or services can possibly support that. That’s destined to fail. On the other hand, having no SLA or a “do your best” policy is not going to create a compelling product for your customers or your business. Leadership establishes a cooperative and blameless service ownership culture, defines top-level SLA’s, and makes room for investment in maintenance, documentation, and sustainability.

Slide 34

Slide 34

Going deeper @mattstratton

Slide 35

Slide 35

What are you observing about this service? @mattstratton

Slide 36

Slide 36

Observability vs monitoring @mattstratton This is a talk all in itself. But here are a few things to keep in mind (note: you want both of these things)

Slide 37

Slide 37

@mattstratton

Slide 38

Slide 38

@mattstratton

Slide 39

Slide 39

Empathy-driven alerting @mattstratton Focus on the customer experience. What are your key business metrics around the experience of your customers? “9’s don’t matter if your customers aren’t happy” - Charity Majors

Slide 40

Slide 40

A brief overview of SLA / SLO / SLI @mattstratton Again, we could spend a whole talk on this. But just to have a refresher.

Slide 41

Slide 41

Service Level Indicators (SLI) • Latency • Throughput • Availability @mattstratton There are lots more. These are some examples. But remember these are building blocks. Note that these are not “goals”, but rather “what is the measure?”

Slide 42

Slide 42

Service Level Objectives • Made up of SLIs • Measured over time • Not contractually set @mattstratton

Slide 43

Slide 43

Service Level Agreements • Composed of SLOs • Contractually/legally binding • Basically, this is where you owe your customer money @mattstratton

Slide 44

Slide 44

The “hadness” point @mattstratton This is where you want to set your SLO’s - the point directly between customer happiness and customer sadness. And this is where you want to alert.

Slide 45

Slide 45

Alert on SLOs @mattstratton

Slide 46

Slide 46

How does a team respond to this service? @mattstratton

Slide 47

Slide 47

Tuning your service @mattstratton

Slide 48

Slide 48

Investigate patterns @mattstratton

Slide 49

Slide 49

What alerts do you actually need? @mattstratton If you alert on 5 things, and you understand all 5 of those things. If you alert on 1000 things, and treat each those with equal priority, it’s unlikely that any of them are getting the attention they need.

Slide 50

Slide 50

Suppression of non-actionable alerts @mattstratton This is why you always, always do a postmortem, even if it wasn’t an actual incident. This is the part when you refactor your alerting. Work with your alerting tool for suppression of non-actionable alerts. I hear that PagerDuty has a neat feature about this.

Slide 51

Slide 51

Understand business impact @mattstratton Do you know how your company makes money? If not, go find out. I’ll wait. What does this service tie to with regard to customers/revenue? Can you understand the business metrics associated with your service?

Slide 52

Slide 52

Lifecycle steps @mattstratton We will wrap up by reviewing all the stages of the service lifecycle, where the previous things tie in.

Slide 53

Slide 53

Designing a new service @mattstratton

Slide 54

Slide 54

Design phase • Understand the customers (product is a key role here) • Load testing / staging • Ensure SRE / sustainability teams are involved early • Define SLI/SLO/SLA • Identify alerting requirements • Documentation (API, runbook, functional service registry if applicable) • Perform all security checks @mattstratton Keep in mind that staging is always a bit of a myth, but there are various ways to do this.

Slide 55

Slide 55

Maintaining and iterating @mattstratton

Slide 56

Slide 56

Maintenance and iteration • Version the service API • Communicate to consumers • Proactive maintenance • Address tech debt consistently • Testing and deploying/releasing the service (CI/CD, testing in prod, etc) @mattstratton

Slide 57

Slide 57

Retiring a service @mattstratton

Slide 58

Slide 58

Retiring and sunsetting • Identify consumers • Determine business impact of retiring • Communicate / offboard consumers @mattstratton

Slide 59

Slide 59

Service ownership includes communication, compromise, and commitment. @mattstratton

Slide 60

Slide 60

Acknowledgements Lilia Gutnik - @superlilia Julian Dunn - @julian_dunn Charity Majors - @mipsytipsy Liz Fong-Jones - @lizthegrey Images provided by @mattstratton

Slide 61

Slide 61

If you enjoyed this talk, here’s more about me arresteddevops.com devopsdayschi.org twitter.com/mattstratton speaking.mattstratton.com @mattstratton

Slide 62

Slide 62

PagerDuty Connect NYC March 26 meet.pagerduty.com/connectnyc20 @mattstratton

Slide 63

Slide 63

pduty.me/work-with-pagey (also pagerduty.com/careers) @mattstratton

Slide 64

Slide 64

@mattstratton