The Lifecycle of a Service

A presentation at DevOps Minneapolis Meetup in January 2020 in Minneapolis, MN, USA by Matt Stratton

Slide 1

The Lifecycle of a Service
Matt Stratton
DevOps Advocate & Thought Validator, PagerDuty
@mattstratton

Slide 2

Slide 3

Service Ownership means people take responsibility for what they deliver, at every stage of a service’s lifetime.

Slide 4

Communicate across your organization with partners and stakeholders

Slide 5

What is a service?

Slide 6

A service can be a lot of things
• Microservice
• Slice of a monolith
• Internal tool
• Piece of functionality
• Component
• Shared infrastructure
• Feature

Slide 7

A service can be a lot of things
If it provides value to other people, it’s a service

Slide 8

Define what a “service” means to you

Slide 9

A service is a discrete piece of functionality that provides value and is wholly owned by a team

Slide 10

Shared understanding

Slide 11

Who is responsible?

Slide 12

“Service mitosis”

Slide 13

Service definitions help with problem resolution

Slide 14

What about a monolith?

Slide 15

Roles in service ownership

Slide 16

Development Team

Slide 17

Your service should make sense to other people who will interact with it

Slide 18

Naming

Slide 19

Be specific

Slide 20

Names that are specific
• “User authenticator”
• “Payment processor”
• “Shopping cart”
• “Login”
• “Report generator”
• “Email tracking code”

Slide 21

Less amazing names
• PacMan (unless you’re actually building PAC-MAN, which I doubt)
• Apollo
• BurgunDB
• Artemis

Slide 22

Descriptions

Slide 23

• What is the intent of this service, component, this slice of functionality?
• How does this thing deliver value?
• What does it contribute to?
• How will this impact customers?

Slide 24

Dependencies
• Look for circular dependencies
• Is there a single point of failure?
• Who consumes this service?
• What does it depend on?
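
The dependency questions above can be checked mechanically once dependencies are written down. The following is a minimal sketch, assuming a hand-maintained map from each service to the services it depends on. The service names and the `DEPENDS_ON` structure are hypothetical and are not taken from the talk.

```python
from collections import defaultdict

# Hypothetical dependency map: service name -> services it depends on.
DEPENDS_ON = {
    "shopping-cart": ["payment-processor", "user-authenticator"],
    "payment-processor": ["user-authenticator", "report-generator"],
    "report-generator": ["shopping-cart"],  # closes a cycle
    "user-authenticator": [],
}


def find_cycle(graph):
    """Return one dependency cycle as a list of names, or None if there is none."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    state = defaultdict(int)
    path = []

    def visit(node):
        state[node] = GREY
        path.append(node)
        for dep in graph.get(node, []):
            if state[dep] == GREY:
                # dep is already on the current path, so we found a cycle.
                return path[path.index(dep):] + [dep]
            if state[dep] == WHITE:
                cycle = visit(dep)
                if cycle:
                    return cycle
        path.pop()
        state[node] = BLACK
        return None

    for name in graph:
        if state[name] == WHITE:
            cycle = visit(name)
            if cycle:
                return cycle
    return None


def consumers_of(graph, service):
    """Return the services that depend directly on `service`."""
    return sorted(name for name, deps in graph.items() if service in deps)
```

A service with many direct consumers, such as `user-authenticator` here, is a reasonable first place to ask the single-point-of-failure question.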

Slide 25

API
• Versioning
• Clear documentation / examples

Slide 26

Tiers of services

Slide 27

Tier 1 Services at PagerDuty
• 24/7 on-call
• Multiple levels of robustness
• Disaster recovery plan
• Clear and updated runbook

Slide 28

Tier 2 & 3 Services at PagerDuty
• Monday-Friday support expectation
• Supporting functionality, not critical path
• New services that are not generally available
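
One way to make tiers actionable is to record them next to the service's name, description, and owner. The sketch below is illustrative only: the `ServiceEntry` fields and the `gaps` check are assumptions, not PagerDuty's actual registry format. Only the coverage expectations are paraphrased from the tier slides.

```python
from dataclasses import dataclass, field

# Support expectations per tier, paraphrased from the tier slides.
ON_CALL_COVERAGE = {1: "24/7", 2: "Monday-Friday", 3: "Monday-Friday"}


@dataclass
class ServiceEntry:
    name: str
    description: str
    owning_team: str
    tier: int
    runbook_url: str = ""  # hypothetical field
    depends_on: list = field(default_factory=list)

    def on_call_coverage(self):
        """Return the support expectation for this service's tier."""
        return ON_CALL_COVERAGE[self.tier]

    def gaps(self):
        """List Tier 1 expectations this entry does not yet meet."""
        problems = []
        if self.tier == 1 and not self.runbook_url:
            problems.append("Tier 1 service has no runbook")
        return problems
```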

Slide 29

Sustainability team

Slide 30

Runbooks

Slide 31

Alerting

Slide 32

Robustness and reliability

Slide 33

Program management

Slide 34

Responsibilities of program management
• Defining what ‘done’ is
• Emotional awareness of stress of the rest of the team
• Connective tissue work between different teams and features (help understand and mitigate dependencies)
• Awareness of what it means to pull people away from other projects to solve a problem

Slide 35

Product owner

Slide 36

Customers are always asking for uptime, performance, and security – they just don’t usually use those words

Slide 37

Management

Slide 38

• Make room in the roadmap for investing in tech debt
• Encourage a culture of cooperation and sharing
• Set goals that balance business priorities with achievable engineering goals

Slide 39

Going deeper

Slide 40

What are you observing about this service?

Slide 41

Observability vs monitoring

Slide 42

“Monitoring tells you whether the system works. Observability lets you ask why it’s not working.”
Baron Schwartz, Founder and CTO, VividCortex

Slide 43

Empathy-driven alerting

Slide 44

A brief overview of SLA/SLO/SLI

Slide 45

Service Level Indicators (SLI)
• Latency
• Throughput
• Availability

Slide 46

Service Level Objectives (SLO)
• Made up of SLIs
• Measured over time
• Not contractually set

Slide 47

Service Level Agreements (SLA)
• Composed of SLOs
• Contractually/legally binding
• Basically, this is where you owe your customer money
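
The relationship between an SLI and an SLO can be shown with a little arithmetic. This is a rough sketch, assuming an availability SLI computed from request counts and an SLO expressed as a target fraction. The function names are hypothetical, and the "error budget" calculation is a common companion idea rather than something stated on these slides.

```python
def availability_sli(good_requests, total_requests):
    """Availability SLI: fraction of requests that succeeded."""
    if total_requests == 0:
        return 1.0  # assumption: no traffic counts as fully available
    return good_requests / total_requests


def slo_met(sli, target):
    """An SLO is a target for an SLI, measured over some period of time."""
    return sli >= target


def error_budget_remaining(sli, target):
    """Fraction of the error budget left (negative once the SLO is missed).

    Assumes target is below 1.0, so there is some budget to spend.
    """
    allowed = 1.0 - target
    used = 1.0 - sli
    return (allowed - used) / allowed
```

For example, 99,950 good requests out of 100,000 against a 99.9% target meets the SLO with roughly half of the error budget left. An SLA would then be a contractual commitment built on top of targets like this one.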

Slide 48

The “hadness” point

Slide 49

Alert on SLOs
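
Alerting on an SLO, rather than on every individual failure, might look like the sketch below. It assumes a rolling window of recent request outcomes and pages only when the windowed SLI falls below the target. The class name, the window size, and the "wait for a full window" rule are illustrative assumptions.

```python
from collections import deque


class SloAlerter:
    """Fire only when the SLI over a rolling window drops below the SLO target."""

    def __init__(self, target, window_size):
        self.target = target
        self.window = deque(maxlen=window_size)  # True means a good request

    def record(self, ok):
        """Record one request outcome; the oldest outcome falls out of the window."""
        self.window.append(ok)

    def should_alert(self):
        if len(self.window) < self.window.maxlen:
            return False  # assumption: wait for a full window before judging
        sli = sum(self.window) / len(self.window)
        return sli < self.target
```

A single failed request inside an otherwise healthy window does not page anyone. A sustained drop does.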

Slide 50

How does a team respond to this service?

Slide 51

Escalation policies

Slide 52

DevOps Model

Slide 53

First level

Slide 54

Second level

Slide 55

Third level

Slide 56

Escalating

Slide 57

Manual escalations
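
A three-level escalation policy with both automatic and manual escalation can be modelled in a few lines. This is a sketch under assumptions: the responder names and timeouts are hypothetical, since the slides name the levels but do not say who sits at each one, and it is not how any particular paging tool implements escalation.

```python
from dataclasses import dataclass


@dataclass
class Level:
    responder: str
    timeout_minutes: int  # how long to wait for an acknowledgement


# Hypothetical three-level policy; names and timeouts are illustrative.
POLICY = [
    Level("primary on-call", 5),
    Level("secondary on-call", 10),
    Level("engineering manager", 15),
]


def current_level(policy, minutes_unacknowledged, manual_escalations=0):
    """Return the index of the level that should be paged right now."""
    index = 0
    elapsed = minutes_unacknowledged
    # Automatic escalation: move on when a level's timeout expires unacknowledged.
    while index < len(policy) - 1 and elapsed >= policy[index].timeout_minutes:
        elapsed -= policy[index].timeout_minutes
        index += 1
    # Manual escalation: a responder deliberately pushes the incident up.
    return min(index + manual_escalations, len(policy) - 1)
```

With this policy, an incident left unacknowledged for 14 minutes sits with the second level, and a manual escalation from there moves it to the third.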

Slide 58

Other escalation models
• Central Ops
• Hybrid Ops

Slide 59

Tuning your service

Slide 60

Investigate patterns

Slide 61

What alerts do you actually need?

Slide 62

Suppression of non-actionable alerts
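
Suppression can be as simple as a routing rule that asks whether a human can act on the alert. The sketch below is hypothetical: the `Alert` fields and the three outcomes ("page", "ticket", "suppress") are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    service: str
    summary: str
    actionable: bool       # can a human do something about it right now?
    customer_impact: bool  # does it affect customers?


def route(alert):
    """Decide what to do with an alert: 'page', 'ticket', or 'suppress'."""
    if not alert.actionable:
        return "suppress"  # still worth logging for later pattern analysis
    if alert.customer_impact:
        return "page"
    return "ticket"
```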

Slide 63

Understand business impact

Slide 64

Lifecycle steps

Slide 65

Designing a new service

Slide 66

• Understand the customers (product is a key role here)
• Load testing / staging
• Ensure SRE / sustainability teams are involved early
• Define SLI/SLO/SLA
• Identify alerting requirements
• Documentation (API, runbook, functional service registry if applicable)
• Perform all security checks

Slide 67

Maintaining and iterating

Slide 68

• Version the service API
• Communicate to consumers
• Proactive maintenance
• Address tech debt consistently
• Testing and deploying/releasing the service (CI/CD, testing in prod, etc.)

Slide 69

Retiring a service

Slide 70

• Identify consumers
• Determine business impact of retiring
• Communicate / offboard consumers

Slide 71

Service ownership includes communication, compromise, and commitment.

Slide 72

Acknowledgements
• Lilia Gutnik - @superlilia
• Julian Dunn - @julian_dunn
• Charity Majors - @mipsytipsy
• Baron Schwartz - @xaprb
Images provided by @mattstratton

Slide 73

If you enjoyed this talk, here’s more about me
• arresteddevops.com
• devopsdayschi.org
• twitter.com/mattstratton
• speaking.mattstratton.com