Implementing SLOs and Error Budgets in Practice
99.99% availability sounds great until you realize that’s 4 minutes and 19 seconds of downtime per month. Four minutes. That’s barely enough time to get paged, open your laptop, authenticate to the VPN, and find the right dashboard. You haven’t even started diagnosing anything yet.
I’ve watched teams commit to four-nines SLOs because someone in a leadership meeting said “we need to be best in class.” No capacity planning. No discussion about what it would cost. No understanding that the jump from 99.9% to 99.99% isn’t a 0.09% improvement — it’s a 10x reduction in your margin for error.
This article is about doing SLOs properly. Not the theory — I’ve covered the foundations in my SLO/SLI implementation guide and SRE fundamentals. This is about the messy, political, surprisingly human work of making error budgets actually function in an organization.
The SLO That Nearly Broke Us
I need to tell you about the time a VP set an SLO without talking to engineering.
We had a payment processing service. Solid system, well-architected, running on a microservices stack with decent observability. We were comfortably hitting 99.9% availability and had the monitoring to prove it. Life was good.
Then quarterly business review happened. A VP saw a competitor’s marketing page claiming “99.99% uptime” and decided we needed to match it. An email went out on Friday afternoon: “Effective immediately, our availability target for payment processing is 99.99%.”
No consultation. No error budget discussion. No analysis of what our actual failure modes looked like.
Within two weeks, we’d burned through the new error budget. Nothing about the system had changed — the same minor blips that had been invisible under a 99.9% target were now critical incidents under 99.99%. Engineers were getting paged for transient network hiccups that resolved themselves in seconds. Morale cratered. Three people updated their LinkedIn profiles that month.
The fix wasn’t technical. It was a series of uncomfortable conversations where we walked leadership through the math, showed them what 99.99% actually required in terms of infrastructure spend and on-call burden, and negotiated an SLO that reflected reality: 99.95% for the critical payment path, 99.9% for everything else.
That experience taught me something I now consider non-negotiable: SLOs are a contract between engineering and the business. You don’t get to set them unilaterally from either side.
Starting With SLIs, Not SLOs
Most teams get this backwards. They pick an availability number and then try to figure out how to measure it. You need to start with your Service Level Indicators — the actual measurements — and let the data inform your objectives.
For a typical web service, I start with three SLI categories:
Availability: The proportion of requests that succeed.
# Availability SLI - ratio of successful requests over total
sum(rate(http_requests_total{job="payment-api", code!~"5.."}[5m]))
/
sum(rate(http_requests_total{job="payment-api"}[5m]))
Latency: The proportion of requests faster than a threshold.
# Latency SLI - proportion of requests under 300ms
sum(rate(http_request_duration_seconds_bucket{job="payment-api", le="0.3"}[5m]))
/
sum(rate(http_request_duration_seconds_count{job="payment-api"}[5m]))
Correctness: This one’s service-specific. For payment processing, it was the proportion of transactions that produced the expected result. You can’t always measure this with standard HTTP metrics — sometimes you need custom instrumentation.
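As a sketch of what that custom instrumentation might classify, here is a minimal correctness SLI computed over transaction records. The `Transaction` fields and the "settled for the expected amount" rule are hypothetical stand-ins for whatever your payment domain actually requires:

```python
from dataclasses import dataclass

@dataclass
class Transaction:
    # Hypothetical fields for illustration; real payment records differ
    expected_amount: int   # what the user was quoted, in cents
    settled_amount: int    # what actually settled, in cents
    status: str            # e.g. "settled", "failed", "pending"

def is_correct(txn: Transaction) -> bool:
    """A transaction counts as 'good' only if it settled for the expected amount."""
    return txn.status == "settled" and txn.settled_amount == txn.expected_amount

def correctness_sli(transactions: list) -> float:
    """Proportion of transactions that produced the expected result."""
    if not transactions:
        return 1.0  # no traffic, nothing failed
    good = sum(1 for t in transactions if is_correct(t))
    return good / len(transactions)
```

The point is that "good" is defined by business outcome, not HTTP status: a 200 response that settled the wrong amount still counts against this SLI.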
Collect at least two weeks of data before setting any targets. A month is better. You want to see your system’s natural behavior across deploy cycles, traffic patterns, and the occasional weird Thursday where everything goes sideways for no apparent reason.
Setting Objectives That Actually Mean Something
Here’s my framework for choosing SLO targets, and it’s deliberately simple:
- Look at your historical performance over 30-90 days
- Identify your worst reasonable week (not the catastrophic outage, the normal-bad week)
- Set your SLO slightly below that worst reasonable performance, so a normal-bad week doesn’t blow the budget
- Validate with the business that this level of reliability meets user expectations
If your service has been running at 99.95% naturally, don’t set a 99.99% target. You’re just creating a world where you’re perpetually in error budget violation. Set it at 99.9% and give yourself room to actually ship features.
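That framework can be sketched as a small helper. This is a simplified sketch under assumptions: the candidate target ladder is my own illustrative list, and "catastrophic" weeks are simply the N worst ones you choose to exclude:

```python
def suggest_slo_target(weekly_availability, exclude_worst=1):
    """Suggest an SLO target from historical weekly availability ratios.

    Drops the most catastrophic week(s), takes the worst remaining
    ("worst reasonable") week, and returns the highest standard target
    that still sits below it, leaving room to ship features.
    """
    # Illustrative ladder of common targets; adjust to your organization
    candidates = [0.99, 0.995, 0.999, 0.9995, 0.9999]
    reasonable = sorted(weekly_availability)[exclude_worst:]
    worst_reasonable = reasonable[0]
    eligible = [t for t in candidates if t < worst_reasonable]
    return max(eligible) if eligible else min(candidates)
```

For a service whose worst reasonable week was 99.93%, this suggests 99.9% rather than 99.95%, matching the advice above.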
I encode SLOs in a YAML spec that lives in the service repository. This isn’t optional — if the SLO isn’t in version control, it doesn’t exist:
# slo.yaml - lives in the service repo root
service: payment-api
slos:
  - name: availability
    description: "Proportion of non-5xx responses"
    target: 0.999
    window: 30d
    sli:
      type: ratio
      good:
        metric: http_requests_total{job="payment-api", code!~"5.."}
      total:
        metric: http_requests_total{job="payment-api"}
  - name: latency-p99
    description: "99th percentile latency under 500ms"
    target: 0.99
    window: 30d
    sli:
      type: ratio
      good:
        metric: http_request_duration_seconds_bucket{job="payment-api", le="0.5"}
      total:
        metric: http_request_duration_seconds_count{job="payment-api"}
This spec drives everything downstream — Prometheus recording rules, Grafana dashboards, alerting configuration. One source of truth.
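As a sketch of how the spec can drive generated config, here is a minimal renderer that turns one parsed SLO entry into a Prometheus recording rule. It assumes the YAML has already been loaded into a dict (e.g. with PyYAML), and the `slo:<name>:ratio` naming convention is illustrative, not a standard:

```python
def recording_rule(slo: dict) -> dict:
    """Render one Prometheus recording rule from a parsed slo.yaml entry."""
    window = slo["window"]
    good = slo["sli"]["good"]["metric"]
    total = slo["sli"]["total"]["metric"]
    return {
        "record": f'slo:{slo["name"]}:ratio',
        "expr": (f"sum(increase({good}[{window}]))"
                 f" / sum(increase({total}[{window}]))"),
    }
```

Generating rules, dashboards, and alerts from the same spec is what keeps them from drifting apart.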
Error Budgets: The Part Everyone Gets Wrong
An error budget isn’t a target to hit. It’s permission to take risks.
If your SLO is 99.9% over 30 days, your error budget is 0.1% — roughly 43 minutes of total downtime, or the equivalent in failed requests. That budget is yours to spend. Deploy a risky migration? That costs error budget. Run a chaos engineering experiment? Error budget. Ship a feature fast without extensive testing? Error budget.
The math is straightforward:
Error budget = 1 - SLO target
Monthly budget (minutes) = 43,200 minutes × (1 - SLO target)
For 99.9%: 43,200 × 0.001 = 43.2 minutes
For 99.95%: 43,200 × 0.0005 = 21.6 minutes
For 99.99%: 43,200 × 0.0001 = 4.32 minutes
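The same arithmetic as a one-line helper, useful when you want budget numbers for arbitrary targets and windows:

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Downtime allowance, in minutes, implied by an SLO over a window.

    A 30-day window has 43,200 minutes; the budget is the fraction
    of those minutes the SLO permits you to fail.
    """
    return window_days * 24 * 60 * (1 - slo_target)
```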
I track error budget consumption with a Prometheus recording rule:
# Recording rule for remaining error budget (30-day window)
- record: slo:error_budget_remaining:ratio
  expr: |
    1 - (
      (1 - (
        sum(increase(http_requests_total{job="payment-api", code!~"5.."}[30d]))
        /
        sum(increase(http_requests_total{job="payment-api"}[30d]))
      ))
      /
      (1 - 0.999)
    )
When this value hits zero, you’ve exhausted your budget. What happens next is where most organizations fall apart.
Error Budget Policies: Where Reliability Meets Politics
You need a written error budget policy. I can’t stress this enough. Without one, error budget conversations devolve into arguments about priorities every single time.
Here’s a simplified version of what I’ve used:
# error-budget-policy.yaml
policy:
  budget_remaining_thresholds:
    - level: normal
      remaining: "> 50%"
      actions:
        - "Standard development velocity"
        - "Feature work proceeds as planned"
        - "Chaos experiments permitted"
    - level: caution
      remaining: "25% - 50%"
      actions:
        - "Reduce risky deployments"
        - "Prioritize reliability-related backlog items"
        - "Review recent incidents for patterns"
    - level: critical
      remaining: "< 25%"
      actions:
        - "Feature freeze for this service"
        - "All engineering effort on reliability"
        - "Postmortem required for any further budget consumption"
    - level: exhausted
      remaining: "0%"
      actions:
        - "Hard feature freeze"
        - "Rollback any recent changes not related to reliability"
        - "Executive review required to resume feature work"
  escalation:
    - "Engineering manager notified at caution"
    - "Director notified at critical"
    - "VP notified at exhausted"
The feature freeze at budget exhaustion is the part that gets pushback. Product managers hate it. But it’s the mechanism that makes the whole system work. Without consequences for burning through your error budget, the SLO is just a number on a dashboard that nobody looks at.
I’ve found the most effective way to get buy-in is framing it as a trade-off conversation, not a technical mandate. “We can ship Feature X this sprint, but it’ll cost us roughly 15 minutes of error budget based on similar past deployments. We have 20 minutes remaining. Do we want to spend it here?” That’s a business decision, not an engineering one.
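That trade-off framing can even be made mechanical. A rough sketch, with hypothetical names, using the kind of numbers from the conversation above (15 minutes estimated cost, 20 minutes remaining):

```python
def spend_decision(estimated_cost_min: float, remaining_min: float) -> str:
    """Frame a deployment as an explicit error-budget spend decision.

    estimated_cost_min: a rough cost estimate based on similar past
    deployments; this is always a judgment call, not a measurement.
    """
    if estimated_cost_min >= remaining_min:
        return "defer: would exhaust the remaining error budget"
    fraction = estimated_cost_min / remaining_min
    return f"spends {fraction:.0%} of remaining budget"
```

Putting the number in front of the product manager turns "should we ship?" into "is this feature worth three quarters of our remaining budget?"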
Burn Rate Alerts That Don’t Wake You Up at 3 AM
Traditional threshold alerts on error rates are terrible for SLO monitoring. A brief spike to 1% errors might look alarming but barely dents a monthly budget. Meanwhile, a sustained 0.2% error rate will quietly eat your entire budget over a week.
Burn rate alerting fixes this. The concept: alert based on how fast you’re consuming your error budget relative to the window.
A burn rate of 1 means you’ll exactly exhaust your budget at the end of the window. A burn rate of 14.4 means you’ll burn through your entire 30-day budget in roughly 2 days. That’s worth waking someone up for.
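The relationship between burn rate and time-to-exhaustion is simple enough to keep as a helper:

```python
def hours_to_exhaustion(burn_rate: float, budget_remaining: float = 1.0,
                        window_days: int = 30) -> float:
    """Hours until the error budget runs out at the current burn rate.

    A burn rate of 1 spends exactly one full budget per window, so a
    14.4x rate empties a 30-day budget in 720 / 14.4 = 50 hours (~2 days).
    """
    if burn_rate <= 0:
        return float("inf")
    return budget_remaining * window_days * 24 / burn_rate
```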
I use a multi-window, multi-burn-rate approach straight from the Google SRE workbook:
# Prometheus alerting rules for burn-rate alerts
groups:
  - name: slo-burn-rate
    rules:
      # Critical: 2% of budget consumed in 1 hour (burn rate 14.4x)
      - alert: PaymentAPIHighErrorBurnRate
        expr: |
          (
            sum(rate(http_requests_total{job="payment-api", code=~"5.."}[1h]))
            / sum(rate(http_requests_total{job="payment-api"}[1h]))
          ) > (14.4 * 0.001)
          and
          (
            sum(rate(http_requests_total{job="payment-api", code=~"5.."}[5m]))
            / sum(rate(http_requests_total{job="payment-api"}[5m]))
          ) > (14.4 * 0.001)
        labels:
          severity: critical
        annotations:
          summary: "Payment API burning error budget at 14.4x rate"
          description: "At this rate, the 30-day error budget will be exhausted in ~2 days."
      # Warning: 5% of budget consumed in 6 hours (burn rate 6x)
      - alert: PaymentAPIElevatedErrorBurnRate
        expr: |
          (
            sum(rate(http_requests_total{job="payment-api", code=~"5.."}[6h]))
            / sum(rate(http_requests_total{job="payment-api"}[6h]))
          ) > (6 * 0.001)
          and
          (
            sum(rate(http_requests_total{job="payment-api", code=~"5.."}[30m]))
            / sum(rate(http_requests_total{job="payment-api"}[30m]))
          ) > (6 * 0.001)
        labels:
          severity: warning
        annotations:
          summary: "Payment API burning error budget at 6x rate"
      # Ticket: 10% of budget consumed in 3 days (burn rate 1x)
      - alert: PaymentAPISlowErrorBudgetDrain
        expr: |
          (
            sum(rate(http_requests_total{job="payment-api", code=~"5.."}[3d]))
            / sum(rate(http_requests_total{job="payment-api"}[3d]))
          ) > (1 * 0.001)
        labels:
          severity: ticket
        annotations:
          summary: "Payment API slowly draining error budget"
The dual-window check (long window AND short window) prevents alerting on brief resolved spikes. The short window confirms the problem is still happening right now.
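The dual-window logic, stripped of PromQL, is just two threshold comparisons ANDed together. A sketch mirroring the critical rule above:

```python
def should_page(long_window_error_rate: float,
                short_window_error_rate: float,
                budget: float = 0.001,       # 1 - SLO target (99.9%)
                burn_rate: float = 14.4) -> bool:
    """Dual-window burn-rate check.

    The long window proves real budget damage; the short window
    confirms the problem is still happening right now.
    """
    threshold = burn_rate * budget
    return (long_window_error_rate > threshold
            and short_window_error_rate > threshold)
```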
This approach cut our false-positive pages by about 70%. Engineers actually started trusting the alerts again, which is half the battle with on-call.
Making It Visible
SLOs that live only in Prometheus are SLOs that get ignored. You need visibility at multiple levels.
For engineering, I build a Grafana dashboard per service showing current SLI values, error budget remaining (as a percentage and in absolute minutes), burn rate over the last hour/6 hours/24 hours, and a 30-day trend line. The budget remaining gauge uses traffic-light colors. Green above 50%, yellow 25-50%, red below 25%. Simple, but it works because people actually glance at it.
For leadership, I send a weekly automated report: a one-line summary per service showing SLO target, current performance, and budget remaining. No graphs, no PromQL — just “Payment API: 99.94% (target 99.9%), 62% budget remaining.” If someone wants to drill in, they can click through to the dashboard.
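Formatting that one-liner is trivial but worth standardizing so every service reads the same. A sketch (the percentage rendering is my own convention):

```python
def pct(x: float) -> str:
    # Render 0.999 as "99.9%" and 0.9994 as "99.94%" (trims trailing zeros)
    return f"{x * 100:g}%"

def summary_line(service: str, target: float,
                 performance: float, budget_remaining: float) -> str:
    """One-line leadership summary in the format described above."""
    return (f"{service}: {pct(performance)} (target {pct(target)}), "
            f"{pct(budget_remaining)} budget remaining")
```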
For product teams, error budget gets a slot in sprint planning. “Here’s how much budget we have, here’s what we plan to spend it on.” It becomes part of the prioritization conversation rather than an afterthought.
The Feedback Loop Nobody Talks About
SLOs aren’t set-and-forget. I review them quarterly, and here’s what I’m looking for:
If you never come close to violating your SLO, it’s too loose. Tighten it or acknowledge that you’re over-investing in reliability for this service. If you’re constantly in violation, either the target is unrealistic or you have genuine reliability problems that need dedicated investment. Both are useful signals.
The quarterly review is also where I check whether the SLIs still reflect user experience. I’ve seen services where the HTTP-level SLI looked healthy but users were actually having a terrible time because the errors were happening in a critical workflow that represented 2% of total traffic but 80% of revenue. Weighted SLIs or journey-based SLOs can help here, but that’s a topic for another post.
Connect your SLO data to your automated remediation pipelines. When burn rate crosses a threshold, don’t just alert — trigger automated responses. Drain a bad instance, scale up capacity, flip a feature flag. I’ve written about this pattern in the context of SRE for serverless architectures where auto-remediation becomes even more critical because the failure modes are different.
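A minimal sketch of the dispatch layer for that pattern, assuming alerts arrive as Alertmanager-style webhook payloads; the action names are hypothetical placeholders for your real runbooks:

```python
# Hypothetical mapping from alert name to remediation actions; in
# production this would run behind an Alertmanager webhook receiver
# and each action name would invoke real automation.
REMEDIATIONS = {
    "PaymentAPIHighErrorBurnRate": ["drain_unhealthy_instances", "page_oncall"],
    "PaymentAPIElevatedErrorBurnRate": ["scale_up_capacity", "open_ticket"],
}

def plan_remediation(alert: dict) -> list:
    """Pick remediation actions for an Alertmanager-style alert payload."""
    name = alert.get("labels", {}).get("alertname", "")
    return REMEDIATIONS.get(name, ["open_ticket"])  # safe default for unknowns
```

Keeping the mapping declarative means the error budget policy and the automation evolve together, in review, rather than in someone's head.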
Where Teams Get Stuck
After helping several teams adopt SLOs, the failure modes are predictable:
Too many SLOs. Start with one or two per service. Availability and latency. That’s it. You can add correctness, freshness, and throughput SLOs later once the practice is established. I’ve seen teams define fifteen SLOs for a single service and then ignore all of them because the cognitive load was unmanageable.
No error budget policy. Without agreed-upon consequences, the SLO is decorative. Get the policy written and signed off before you instrument anything.
Measuring at the wrong layer. Your SLI should measure as close to the user as possible. An internal health check endpoint returning 200 OK tells you nothing about whether users can actually complete a purchase. Measure at the load balancer or API gateway, not at the application’s self-assessment.
Treating SLOs as a ceiling instead of a target. Teams sometimes optimize to barely meet the SLO, which leaves zero room for unexpected incidents. Your normal operating performance should be comfortably above the SLO. The gap between actual performance and the SLO target is your error budget — that’s the space where you can move fast.
The hardest part of SLOs isn’t the math or the monitoring. It’s the organizational discipline to actually use them for decision-making. The PromQL is the easy part. Getting a product manager to accept a feature freeze because the error budget is exhausted — that’s the real work.
Start small. One service, one SLO, one error budget policy. Prove the value, then expand. And for the love of everything, don’t let a VP set your targets based on a competitor’s marketing page.