- don’t expect a tool to solve
- cultural change and need “believers” in senior role to advocate within company
- people need to absorb info within their own mindset
- it is a process that can span 6-9 months in orgs w/ 5000 engineers; nothing happens immediately
- Step 1: “I want to be reliable when I grow up” (you must believe you have problem first)
- Step 2: “Read the book!” and watch SRE v DevOps
- Step 3: “Panic!” (myth: fire team and retrain; not the case and can retrain team in house)
- Step 4: Start small, be patient, celebrate each step - spread the word
- developers incentivized with rapid change (ship more)
- operators incentivized w/ stability (limit change)
- research in cognitive psychology ( human perception cannot tell diff w/ 3 9s to 4 9s )
- “art of an SLO” - figure out how much error budget we can still keep customers happy with
- not about number of 9s (not badge of honor) “we need to be good enough and keep our customers happy”
- it may be more beneficial to invest engineering efforts in features vs. more reliability
- defining goals is fundamental collaboration between dev and ops together
- not everything needs to be measured (i.e. 3 clicks deep in left nav panel)
- Shared ownership - reduce org silos
- Error budgets - accept failure as normal (blame is not helpful)
- Reduce cost of failure - implement gradual changes
- Automate common case - leverage tooling and automation (if you have a decision tree, or processes not written down, automation helpful)
- Measure toil and reliability - measure everything* (alert only on CUJ, but measure all)
- need to be very transparent in what we measure
- latency difficult end-to-end (browser through to load balancer and backend)
- initial discussion “how many 9s can we achieve” (wrong approach)
- correct: what does success look like to customer and their expectation
- customer research, review report cases, measure actual experienced latency
- if lacking info: 1st goal should be current experience (current latency, availability)
- should be re-evaluated week 1, 4, 6 - first 6 weeks critical to nail down goals
- after 6 weeks, associate an alert to it (need it around long enough before too much noise)
- at least once/quarter re-evaluate; sit with all parties involved and decide whether to increase target
- SLOs have a life of their own; think about how to maintain over time (not set it and forget it)
- need to be meaningful to usage of customer; if customer use changes then need to change/evaluate
- avoid “collecting 9s as badge of honor”; starting with 3 9s is great (or even 2)
- our internal SLO exposed to public
- legal consequences; refund is strongest use case
- if you have an SLA, you should create stricter SLO and catch violations earlier (create error budget buffer)
- should not publish SLA to customer with unreasonable number of 9s
- should not publish SLA on something you historically fail on
- be very careful how you think about it to not erode trust of customers
- consider: global and local SLAs (CUJ for each: i.e. NA, EMEA, APAC)
- dev teams like receiving allowance for ice cream (spend all in 1 day, or spread it out)
- don’t spend budget, it’s taken away - if you don’t spend miss innovation opportunity (don’t horde it)
- gives opportunity to take risk without angering others
- historically if you don’t know budget, you become risk averse
- Common incentives for DEVs and SREs - find right balance between innovation and reliability
- DEV teams can manage the risk themselves - they decide how to spend their error budget
- Unrealistic reliability goals are unattractive - dampen velocity of innovation
- DEV team becomes self-policing - error budget is valuable resource for them
- Shared responsibility for system uptime - infra failures eat into dev’s error budget
Error budget agreement
What happens if we exhaust our budget? (agree up front, not when things on fire)

SRE Customer Journey
SLO Setup Guide