The Lifecycle of a Service
Matt Stratton, DevOps Advocate & Thought Validator, PagerDuty
@mattstratton
Slide 2
Slide 3
Service Ownership means people take responsibility for what they deliver, at every stage of a service’s lifetime.
Slide 4
Communicate across your organization with partners and stakeholders
Slide 5
What is a service?
Slide 6
A service can be a lot of things
• Microservice
• Slice of a monolith
• Internal tool
• Piece of functionality
• Component
• Shared infrastructure
• Feature
Slide 7
A service can be a lot of things
If it provides value to other people, it’s a service
Slide 8
Define what a “service” means to you
Slide 9
A service is a discrete piece of functionality that provides value and is wholly owned by a team
Slide 10
Shared understanding
Slide 11
Who is responsible?
Slide 12
“Service mitosis”
Slide 13
Service definitions help with problem resolution
Slide 14
What about a monolith?
Slide 15
Roles in service ownership
Slide 16
Development Team
Slide 17
Your service should make sense to other people who will interact with it
Slide 18
Naming
Slide 19
Be specific
Slide 20
Names that are specific
• “User authenticator”
• “Payment processor”
• “Shopping cart”
• “Login”
• “Report generator”
• “Email tracking code”
Slide 21
Less amazing names
• PacMan (unless you’re actually building PAC-MAN, which I doubt)
• Apollo
• BurgunDB
• Artemis
Slide 22
Descriptions
Slide 23
• What is the intent of this service, this component, this slice of functionality?
• How does this thing deliver value?
• What does it contribute to?
• How will this impact customers?
Slide 24
Dependencies
• Look for circular dependencies
• Is there a single point of failure?
• Who consumes this service?
• What does it depend on?
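The dependency questions above can be checked mechanically once the dependencies are written down. The following is a minimal sketch, assuming a hypothetical map from each service to the services it depends on. The service names are borrowed from the naming slide, but the edges between them are invented for illustration. It uses a depth-first search to find one circular dependency and inverts the map to answer "who consumes this service?".

```python
from collections import defaultdict

# Hypothetical dependency map: service -> services it depends on.
# The edges are made up for this example.
DEPENDS_ON = {
    "Shopping cart": ["Payment processor", "User authenticator"],
    "Payment processor": ["User authenticator", "Report generator"],
    "Report generator": ["Shopping cart"],  # closes a cycle
    "User authenticator": [],
}

def find_cycle(graph):
    """Return one circular dependency as a list of names, or None."""
    unvisited, in_progress, done = 0, 1, 2
    state = defaultdict(int)  # every node starts as unvisited
    path = []

    def visit(node):
        state[node] = in_progress
        path.append(node)
        for dep in graph.get(node, []):
            if state[dep] == in_progress:
                # dep is already on the current path: cycle found
                return path[path.index(dep):] + [dep]
            if state[dep] == unvisited:
                found = visit(dep)
                if found:
                    return found
        path.pop()
        state[node] = done
        return None

    for node in graph:
        if state[node] == unvisited:
            found = visit(node)
            if found:
                return found
    return None

def consumers(graph):
    """Invert the map: service -> set of services that depend on it."""
    inverted = defaultdict(set)
    for service, deps in graph.items():
        for dep in deps:
            inverted[dep].add(service)
    return inverted
```

A service with many consumers and no alternative, such as "User authenticator" in this made-up map, is a candidate single point of failure worth a closer look.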
Slide 25
API
• Versioning
• Clear documentation / examples
Slide 26
Tiers of services
Slide 27
Tier 1 Services at PagerDuty
• 24/7 on-call
• Multiple levels of robustness
• Disaster recovery plan
• Clear and updated runbook
Slide 28
Tier 2 & 3 Services at PagerDuty
• Monday-Friday support expectation
• Supporting functionality, not critical path
• New services that are not generally available
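One way to make tiers actionable is to record what each tier requires and check services against it. This is only a sketch: the Tier 1 entries paraphrase the slide above, while the field names and the single Tier 2/3 entry are assumptions made for illustration, not PagerDuty's actual definitions.

```python
# Requirements per tier. Field names are hypothetical; Tier 1 entries
# paraphrase the slide, the Tier 2/3 entry is an assumption.
TIER_REQUIREMENTS = {
    1: {"on_call_24x7", "disaster_recovery_plan", "runbook"},
    2: {"weekday_support"},
    3: {"weekday_support"},
}

def missing_requirements(tier, in_place):
    """Return, sorted, the requirements a service of this tier still lacks."""
    return sorted(TIER_REQUIREMENTS[tier] - set(in_place))
```

For example, a Tier 1 service that only has a runbook would be reported as lacking a disaster recovery plan and 24/7 on-call.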
Slide 29
Sustainability team
Slide 30
Runbooks
Slide 31
Alerting
Slide 32
Robustness and reliability
Slide 33
Program management
Slide 34
Responsibilities of program management
• Defining what ‘done’ is
• Emotional awareness of the stress on the rest of the team
• Connective tissue work between different teams and features (helping to understand and mitigate dependencies)
• Awareness of what it means to pull people away from other projects to solve a problem
Slide 35
Product owner
Slide 36
Customers are always asking for uptime, performance, and security – they just don’t usually use those words
Slide 37
Management
Slide 38
• Make room in the roadmap for investing in tech debt
• Encourage a culture of cooperation and sharing
• Set goals that balance business priorities with achievable engineering goals
Slide 39
Going deeper
Slide 40
What are you observing about this service?
Slide 41
Observability vs monitoring
Slide 42
“Monitoring tells you whether the system works. Observability lets you ask why it’s not working.”
Baron Schwartz, Founder and CTO, VividCortex
Service Level Objectives
• Made up of SLIs
• Measured over time
• Not contractually set
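To make the relationship between an SLI and an SLO concrete, here is a minimal sketch, assuming a request-based availability SLI (good events divided by total events over some window). The function names, the event counts, and the 99.9% target are all illustrative assumptions, not figures from the talk.

```python
def availability_sli(good_events, total_events):
    """SLI: fraction of events in the window that were good."""
    if total_events == 0:
        return 1.0  # no traffic, nothing failed
    return good_events / total_events

def meets_slo(sli, slo_target):
    """True if the measured SLI is at or above the objective."""
    return sli >= slo_target

def error_budget_remaining(sli, slo_target):
    """Fraction of the error budget left (negative once overspent)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    allowed_failure = 1.0 - slo_target
    actual_failure = 1.0 - sli
    return 1.0 - actual_failure / allowed_failure
```

With an assumed target of 0.999 and 99,950 good requests out of 100,000, the SLI is 0.9995, the objective is met, and roughly half of the error budget for that window is still available.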
Slide 47
Service Level Agreements
• Composed of SLOs
• Contractually/legally binding
• Basically, this is where you owe your customer money
Slide 48
The “hadness” point
Slide 49
Alert on SLOs
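One common way to alert on an SLO is to page on how fast the error budget is being consumed (the "burn rate") rather than on individual errors. The sketch below is an assumption about how that could look, not the method shown in the talk. The function names are hypothetical, and the default threshold of 14.4 is a widely used starting point that corresponds to spending 2% of a 30-day budget within one hour.

```python
def burn_rate(bad_events, total_events, slo_target):
    """How many times faster than 'exactly on budget' errors are occurring."""
    if total_events == 0:
        return 0.0
    error_rate = bad_events / total_events
    return error_rate / (1.0 - slo_target)

def should_page(long_window, short_window, slo_target, threshold=14.4):
    """Page only if both windows burn budget faster than the threshold.

    Each window is a (bad_events, total_events) tuple. Requiring the short
    window as well keeps the alert from firing long after a problem ended.
    """
    return (
        burn_rate(*long_window, slo_target) >= threshold
        and burn_rate(*short_window, slo_target) >= threshold
    )
```

Checking two windows is a design choice: the long window shows the problem is significant, and the short window shows it is still happening.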
Slide 50
How does a team respond to this service?
Slide 51
Escalation policies
Slide 52
DevOps Model
Slide 53
First level
Slide 54
Second level
Slide 55
Third level
Slide 56
Escalating
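The three levels above can be pictured as an ordered list with a timeout on each step. The slides do not say who sits at each level or how long each one has to respond, so the responders and the 30-minute timeouts in this sketch are purely hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Level:
    name: str
    responder: str           # hypothetical: who gets notified at this level
    escalate_after_min: int  # hypothetical: minutes before moving on

# Assumed three-level policy; only the level names come from the slides.
POLICY = [
    Level("First level", "primary on-call", 30),
    Level("Second level", "secondary on-call", 30),
    Level("Third level", "engineering manager", 0),  # last level never escalates
]

def current_level(policy, minutes_unacknowledged):
    """Return the level that should be handling an unacknowledged incident."""
    elapsed = 0
    for level in policy[:-1]:
        elapsed += level.escalate_after_min
        if minutes_unacknowledged < elapsed:
            return level
    return policy[-1]
```

Under these assumed timeouts, an incident that has gone unacknowledged for 45 minutes would be with the second level.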
Slide 57
Manual escalations
Slide 58
Other escalation models
• Central Ops
• Hybrid Ops
Slide 59
Tuning your service
Slide 60
Investigate patterns
Slide 61
What alerts do you actually need?
Slide 62
Suppression of non-actionable alerts
Slide 63
Understand business impact
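The tuning steps above (decide which alerts you need, suppress the non-actionable ones, and weigh business impact) can be expressed as a small routing rule. This is a sketch under assumptions: the alert fields `actionable` and `customer_impact` and the three outcomes are hypothetical labels chosen for illustration.

```python
def route_alert(alert):
    """Decide what to do with an alert, given as a dict of hypothetical fields."""
    if not alert.get("actionable", False):
        return "suppress"  # nobody can act on it, so it should not notify anyone
    if alert.get("customer_impact", False):
        return "page"      # actionable and hurting customers: wake someone up
    return "ticket"        # actionable but can wait for working hours
```

Treating a missing field as "not actionable" is a deliberate choice here: an alert has to earn its way into someone's pager.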
Slide 64
Lifecycle steps
Slide 65
Designing a new service
Slide 66
• Understand the customers (product is a key role here)
• Load testing / staging
• Ensure SRE / sustainability teams are involved early
• Define SLI/SLO/SLA
• Identify alerting requirements
• Documentation (API, runbook, functional service registry if applicable)
• Perform all security checks
Slide 67
Maintaining and iterating
Slide 68
• Version the service API
• Communicate to consumers
• Proactive maintenance
• Address tech debt consistently
• Testing and deploying/releasing the service (CI/CD, testing in prod, etc.)
Slide 69
Retiring a service
Slide 70
• Identify consumers
• Determine business impact of retiring
• Communicate / offboard consumers
Slide 71
Service ownership includes communication, compromise, and commitment.