The Lifecycle of a Service Matt Stratton DevOps Advocate & Thought Validator, PagerDuty @mattstratton
A presentation at DevOpsDays New York City 2020 in March 2020 in New York, NY, USA by Matt Stratton
The Lifecycle of a Service Matt Stratton DevOps Advocate & Thought Validator, PagerDuty @mattstratton
Imagine a world where you understand what you’re working on, you’re clear on what your dependencies are, who relies on you and what you’re delivering. You have a clear vision of your impact on your business, your customers, and you know what you want to do to continue delivering value to the people you care about. You innovate, you try new things, and you can solve problems effectively when they come up. You and your colleagues work together to bring value to your business without blame, and you can make changes without being afraid of unintended consequences.
Service Ownership means people take responsibility for what they deliver, at every stage of a service’s lifetime. @mattstratton
What is a service? @mattstratton
A service can be a lot of things Microservice Slice of a monolith Internal tool Piece of functionality Component Shared infrastructure Feature @mattstratton For today, let’s call them all a kind of ‘service’. If it provides value to other people, it’s a service.
A service can be a lot of things If it provides value to other people, it’s a service @mattstratton For today, let’s call them all a kind of ‘service’. If it provides value to other people, it’s a service.
Define what a “service” means to you @mattstratton
A service is a discrete piece of functionality that provides value that is wholly owned by a team @mattstratton At PagerDuty, our definition is more specific to an infrastructure that is composed of distinct services that are written as separate pieces of code, but can be applied to parts of a monolithic codebase as well, or even external tools.
Shared understanding @mattstratton It’s helpful to come to a shared understanding of the boundaries of a given service and figure out who all the stakeholders are. If there are multiple teams contributing, maintaining, and supporting a given service, this gets even more important.
Who is responsible? @mattstratton Start by considering who is responsible for the service you are defining. A service should be wholly owned by the team on-call for it.
“Service mitosis” @mattstratton If multiple teams share responsibility for a service, it’s better to split that service up (if possible) into separate services. Some organizations call this “service mitosis”, making a rule that at a certain team size or the volume of code in a code base, the service and team must split up.
Service definitions help with problem resolution @mattstratton Services should be set up granularly enough to help identify the source of problems. If two microservices always behave as one, and fixing a problem on one means fixing it also on another, it might make sense to combine them.
What about a monolith? @mattstratton If you have a monolith, consider how you will address on-call responsibilities for it. Monoliths tend to be the cause of a lot of incidents, both actionable and nonactionable.If one team owns the monolith, they usually don’t own any other services unless the on-call load for the monolith is low. If multiple teams share responsibility for the monolith, consider gradually identifying different sources of functionality and routing those alerts to teams that have the right context. Each source of functionality can be represented as a different service for your documentation, enumerating runbooks and wiki’s, along with your oncall ownership in PagerDuty.
Roles in service ownership @mattstratton Service ownership is a shared responsibility – engineers aren’t the only people accountable to the ownership of the code they write.
Development Team @mattstratton Developing a service is more than writing code. Yes, of course, it involves code in a shared repo with documentation, generally a Readme that explains what the code does, and the contract that other services can understand if they need to interact with it. You may not have everything in place as you’re initially developing a service, but you should go through a standard process that your organization has agreed upon.
Your service should make sense to other people who will interact with it @mattstratton Remember, you are not the only person who will interact with this service. Make sure your code is reviewed by other members of your team, with whatever process. You will not always be around to answer questions about what your code is intended to do, so it is your responsibility to make it understandable so that you aren’t a single point of contact.
Naming @mattstratton They say that the two hardest things to do in computer science is caching strategies and naming – and they’re not wrong. It’s totally normal to use silly names (Greek mythical figures, inside jokes, pop culture references, space missions) as placeholders for service names. When you have a small organization, you think that everyone will always remember every inside joke – but you’ll grow your organization, you’ll welcome new members to your team and eventually the memories of that inside joke will fade, and that inside joke will be something that you end up having to explain to everyone. At that point, if you have to explain the joke, it’s not even funny enough to be worth it.
Be specific @mattstratton Consider naming your service something specific; naming it to explain what it does. If you already have some services with less-than-specific names, don’t panic. You don’t have to make changes to everything all at once. Fix forward, and start naming new services with more specific, clear names. We understand this is not a trivial task. Ask your colleagues what they would expect this thing to be named. Default to long-ish names, instead of shorter ones, but not too long. Names that are too long often get shortened into acronyms, and acronyms are just as hard to understand.
Names that are specific • “User authenticator” • “Payment processor” • “Shopping cart” • “Login” • “Report generator” • “Email tracking code” @mattstratton Save your creativity for NaNoWriMo (national novel writing month)
Less amazing names • PacMan (unless you’re actually building PAC-MAN, which I doubt) • Apollo • BurgunDB • Artemis @mattstratton I have worked at 4 different organizations that have named a service “Artemis”. I am sure I will see it again.
Descriptions @mattstratton
Descriptions • What is the intent of this service, component, this slice of functionality? • How does this thing deliver value? • What does it contribute to? • How will this impact customers? @mattstratton What is the intent of this service, component, this slice of functionality? This is where you can record the purpose of this service. How does this thing deliver value? What does it contribute to? If this is part of a customer-facing feature, explain how this will impact customers, how it rolls up to the larger business component. The description can also mention the other components that this service may interact with, but know that these may change as other service owners make changes to those components.
API • Versioning • Clear documentation / examples @mattstratton What do you owe to the people who consume your service?
Sustainability team @mattstratton This is the group that you might call SRE, ops, etc
Runbooks @mattstratton Things will go wrong. As you learn about the different nuances of your service, you should keep a record of what you’ve tried that can resolve common issues. In an ideal world, you have the ability to resolve these common issues at the source of the problem within the source. This is a contentious topic. Runbooks are only helpful if they are accurate. It needs to be part of your SLDC to update the runbook with every change in functionality.
Alerting @mattstratton You should work to only alert on items that are actionable, and that are related to business goals/functionality. We will dig more into how to set these thresholds and monitoring a little bit later.
Resiliency Robustness and reliability @mattstratton This does not mean “high availability”. The better terms to use are “robustness” and “reliability”
Program management @mattstratton There’s an element of unpredictability in service ownership. Outcomes from post-mortems, proactive maintenance. As a program manager, you have to be mindful of the necessity for a buffer. Note - in some organizations, this might be considered “project management”. And not all orgs will have this role.
Responsibilities of program management • Defining what ‘done’ is • Emotional awareness of stress of the rest of the team • Connective tissue work between different teams and features (help understand and mitigate dependencies) • Awareness of what it means to pull people away from other projects to solve a problem @mattstratton
Product owner @mattstratton Product owners are responsible for translating the requirements of the customers, beyond what something looks like or what is capable of. Customers will tell product owners about what they want a product to do, but they’ll rarely specifically ask for uptime, performance, or security in an interview or user forum.
Customers are always asking for uptime, performance, and security – they just don’t usually use those words @mattstratton Consider, without uptime, a customer can’t get to the new wonderful feature you’ve delivered. Without performance, they’ll leave in frustration, tired of waiting for the feature to load. Without security, they won’t trust that new feature.
Senior leadership @mattstratton Service ownership works best when it’s championed by top leadership, and consistent across the entire product and engineering organization
Responsibilities of senior leadership • Make room in the roadmap for investing in tech debt • Encourage a culture of cooperation and sharing • Set goals that balance business priorities with achievable engineering goals @mattstratton Leaders also help set goals that balance business priorities with achievable engineering goals. If leadership expects 100% uptime for your main product, it’ll be pretty challenging to create a reasonable plan for how the various components or services can possibly support that. That’s destined to fail. On the other hand, having no SLA or a “do your best” policy is not going to create a compelling product for your customers or your business. Leadership establishes a cooperative and blameless service ownership culture, defines top-level SLA’s, and makes room for investment in maintenance, documentation, and sustainability.
Going deeper @mattstratton
What are you observing about this service? @mattstratton
Observability vs monitoring @mattstratton This is a talk all in itself. But here are a few things to keep in mind (note: you want both of these things)
@mattstratton
@mattstratton
Empathy-driven alerting @mattstratton Focus on the customer experience. What are your key business metrics around the experience of your customers? “9’s don’t matter if your customers aren’t happy” - Charity Majors
A brief overview of SLA / SLO / SLI @mattstratton Again, we could spend a whole talk on this. But just to have a refresher.
Service Level Indicators (SLI) • Latency • Throughput • Availability @mattstratton There are lots more. These are some examples. But remember these are building blocks. Note that these are not “goals”, but rather “what is the measure?”
Service Level Objectives • Made up of SLIs • Measured over time • Not contractually set @mattstratton
Service Level Agreements • Composed of SLOs • Contractually/legally binding • Basically, this is where you owe your customer money @mattstratton
The “hadness” point @mattstratton This is where you want to set your SLO’s - the point directly between customer happiness and customer sadness. And this is where you want to alert.
Alert on SLOs @mattstratton
How does a team respond to this service? @mattstratton
Tuning your service @mattstratton
Investigate patterns @mattstratton
What alerts do you actually need? @mattstratton If you alert on 5 things, and you understand all 5 of those things. If you alert on 1000 things, and treat each those with equal priority, it’s unlikely that any of them are getting the attention they need.
Suppression of non-actionable alerts @mattstratton This is why you always, always do a postmortem, even if it wasn’t an actual incident. This is the part when you refactor your alerting. Work with your alerting tool for suppression of non-actionable alerts. I hear that PagerDuty has a neat feature about this.
Understand business impact @mattstratton Do you know how your company makes money? If not, go find out. I’ll wait. What does this service tie to with regard to customers/revenue? Can you understand the business metrics associated with your service?
Lifecycle steps @mattstratton We will wrap up by reviewing all the stages of the service lifecycle, where the previous things tie in.
Designing a new service @mattstratton
Design phase • Understand the customers (product is a key role here) • Load testing / staging • Ensure SRE / sustainability teams are involved early • Define SLI/SLO/SLA • Identify alerting requirements • Documentation (API, runbook, functional service registry if applicable) • Perform all security checks @mattstratton Keep in mind that staging is always a bit of a myth, but there are various ways to do this.
Maintaining and iterating @mattstratton
Maintenance and iteration • Version the service API • Communicate to consumers • Proactive maintenance • Address tech debt consistently • Testing and deploying/releasing the service (CI/CD, testing in prod, etc) @mattstratton
Retiring a service @mattstratton
Retiring and sunsetting • Identify consumers • Determine business impact of retiring • Communicate / offboard consumers @mattstratton
Service ownership includes communication, compromise, and commitment. @mattstratton
Acknowledgements Lilia Gutnik - @superlilia Julian Dunn - @julian_dunn Charity Majors - @mipsytipsy Liz Fong-Jones - @lizthegrey Images provided by @mattstratton
If you enjoyed this talk, here’s more about me arresteddevops.com devopsdayschi.org twitter.com/mattstratton speaking.mattstratton.com @mattstratton
PagerDuty Connect NYC March 26 meet.pagerduty.com/connectnyc20 @mattstratton
pduty.me/work-with-pagey (also pagerduty.com/careers) @mattstratton
@mattstratton