Reliability Engineering Explained: Building Systems That Don't Break
On July 2, 2019, Cloudflare experienced a global outage that took down their network for 27 minutes, affecting millions of websites that relied on their CDN and DDoS protection. The cause was not a cyberattack. It was not a hardware failure. It was a single regular expression in a web application firewall rule whose pattern triggered catastrophic backtracking---a runaway matching process that consumed 100% of CPU on every server worldwide once the rule was deployed.
From the first alert to full service restoration took 27 minutes. Cloudflare published a detailed post-mortem within 48 hours, including the exact regular expression that caused the outage, a timeline accurate to the minute, an explanation of why backtracking regular expressions are dangerous, and specific action items with owners and deadlines. The post-mortem is widely cited as an example of exemplary incident communication.
What Cloudflare demonstrated in that incident was not just good incident response. It was reliability engineering in action: the discipline of treating system failures as engineering problems with measurable targets, systematic responses, and learnable lessons rather than as emergencies to survive and forget.
The Origins of Site Reliability Engineering
Site Reliability Engineering (SRE) as a formal discipline was created at Google in 2003 when Ben Treynor Sloss was asked to lead a team responsible for making Google's services more reliable. His decision to staff the team with software engineers rather than traditional system administrators established the defining characteristic of SRE: applying software engineering methods to operations problems.
At the time, Google was scaling rapidly. The traditional approach---adding more operations staff proportionally as infrastructure grew---would not work. Managing 10,000 servers with 10 operators was feasible. Managing 1,000,000 servers with 1,000 operators was not. SRE's answer: automate everything that can be automated, define reliability mathematically, and create mechanisms that make the speed-stability tradeoff explicit and data-driven rather than political.
Google formalized their practices in the Site Reliability Engineering book (O'Reilly, 2016), making them available to the industry. The book's influence has been profound: SRE practices are now standard at large technology companies and increasingly adopted at organizations of all sizes.
What SRE Is and Isn't
SRE is frequently misunderstood as "an operations team that knows how to code" or "DevOps with a fancy name." The differences are substantive.
SRE is not traditional operations: Traditional operations teams primarily respond to events: servers go down, alerts fire, tickets come in. SRE teams spend a significant fraction of their time (Google targets at least 50%) doing engineering work: writing tools, automating toil, improving systems. The goal is that operational work decreases over time as automation replaces manual processes.
SRE is not pure DevOps: DevOps is a philosophy and cultural approach. SRE is a specific implementation of those principles with well-defined practices: SLOs, error budgets, chaos engineering, and production readiness reviews. Google describes SRE as "what you get when you treat operations as a software engineering problem."
The key SRE mindset: Reliability is not something you achieve once and maintain. It is a continuous engineering effort against a constantly changing system in a constantly changing environment. The question is never "are we reliable?" but "how reliable are we, is that reliable enough, and what is the next thing that will reduce reliability?"
The SLI/SLO/SLA Framework: Measuring What Matters
The SRE approach to reliability begins with precise measurement. Before you can improve reliability, you must define what reliability means for your specific service and measure whether you are achieving it.
Service Level Indicators (SLIs)
SLIs are metrics that quantify service behavior from the user's perspective. The emphasis on "user's perspective" is critical---internal metrics that do not correlate with user experience are not useful SLIs.
Availability SLI: What fraction of requests to this service succeeded (returned a non-error response within an acceptable time)?
Latency SLI: What fraction of requests were served within the target response time? Common formulation: "X% of requests completed in under Y milliseconds."
Error rate SLI: What fraction of requests produced errors? (Often expressed as the inverse: success rate)
Throughput SLI: Is the service handling the volume of requests it needs to handle?
Freshness SLI (for data services): Is the data that users see recent enough to be useful?
The selection of SLIs is an engineering judgment. A service's SLIs should capture the metrics that correlate with user satisfaction. A search engine cares about query response time; users directly experience slow searches. An analytics pipeline cares about data freshness; users experience stale data as incorrect results.
Example: Google's SLI for Google Search is heavily weighted toward latency (how fast do search results appear) and result quality (are the top results relevant). An internal metric like "cache hit rate" might be important for engineering analysis but is not a direct user-experience SLI.
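These definitions translate directly into measurement code. A minimal sketch of availability and latency SLIs computed from request records (the `Request` shape and the thresholds are illustrative, not taken from any particular monitoring system):

```python
from dataclasses import dataclass

@dataclass
class Request:
    status: int        # HTTP status code
    latency_ms: float  # observed response time

def availability_sli(requests):
    """Fraction of requests that returned a non-5xx response."""
    ok = sum(1 for r in requests if r.status < 500)
    return ok / len(requests)

def latency_sli(requests, threshold_ms):
    """Fraction of requests served within the latency target."""
    fast = sum(1 for r in requests if r.latency_ms <= threshold_ms)
    return fast / len(requests)

reqs = [Request(200, 120), Request(200, 95), Request(503, 40), Request(200, 310)]
print(availability_sli(reqs))   # 3 of 4 succeeded -> 0.75
print(latency_sli(reqs, 200))   # 3 of 4 under 200 ms -> 0.75
```

Note that both metrics are ratios over requests, which keeps them anchored to what users actually experienced rather than to machine-level internals.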
Service Level Objectives (SLOs)
SLOs are targets for SLIs. They define "reliable enough" for the specific service.
Common SLO formulations:
- "99.9% of requests return a success response, measured over a rolling 28-day window"
- "95% of user searches complete in under 200ms and 99% complete in under 1000ms"
- "API error rate stays below 0.1% over any 1-hour window"
Setting appropriate SLOs requires balancing user expectations against engineering cost. The relationship is nonlinear:
| Availability | Monthly Downtime | Annual Downtime | Engineering Complexity |
|---|---|---|---|
| 99% ("two nines") | 7.3 hours | 3.65 days | Low |
| 99.9% ("three nines") | 43 minutes | 8.76 hours | Moderate |
| 99.95% | 22 minutes | 4.38 hours | High |
| 99.99% ("four nines") | 4.3 minutes | 52 minutes | Very high |
| 99.999% ("five nines") | 26 seconds | 5.26 minutes | Extreme |
Each additional nine of availability is roughly 10x more expensive to achieve than the previous one. Going from 99% to 99.9% is achievable with basic redundancy. Going from 99.9% to 99.99% requires active failover, sophisticated load balancing, and significant operational investment. Going from 99.99% to 99.999% requires designing out entire classes of failure modes.
The right SLO is the lowest number that still provides an acceptable user experience for the service's purpose. Internal tools can tolerate 99% availability. A payment processing API might require 99.95%. A life-safety system might require 99.999%.
SLOs should not be 100%. An SLO of 100% is operationally equivalent to saying "we will never deploy anything, never change anything, and never run anything on hardware that can fail"---which is impossible. 100% SLOs create fear of change and slow innovation without providing users any benefit over a well-calibrated 99.9% SLO.
Service Level Agreements (SLAs)
SLAs are contractual commitments to customers, with defined consequences (credits, refunds, penalties) if the commitments are not met. SLAs are almost always less aggressive than internal SLOs to provide a buffer against measurement methodology differences and normal operational variance.
If the internal SLO is 99.9%, the external SLA might be 99.5%. If the service achieves its internal SLO, the SLA is safe. The buffer absorbs measurement differences and provides room to investigate issues before SLA violations trigger financial consequences.
Error Budgets: The Key Innovation
The error budget is SRE's most distinctive contribution to software operations thinking. It resolves the eternal conflict between engineering velocity (developers want to ship features fast) and operational stability (operators want systems to stay stable) by making the tradeoff explicit and data-driven.
The error budget is calculated directly from the SLO: if the SLO is 99.9% availability over a rolling 28-day window, the error budget is 0.1% of requests (roughly 40 minutes of allowed full downtime per window). The error budget is "spent" by outages, degraded performance incidents, and failed deployments.
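The arithmetic is simple enough to sketch directly. A hedged example (the window length and function names are illustrative):

```python
def error_budget_minutes(slo: float, window_days: int = 28) -> float:
    """Total allowed downtime (minutes) for an availability SLO over the window."""
    return (1.0 - slo) * window_days * 24 * 60

def budget_remaining(slo: float, downtime_minutes: float, window_days: int = 28) -> float:
    """Minutes of error budget left after observed downtime."""
    return error_budget_minutes(slo, window_days) - downtime_minutes

# A 99.9% SLO over 28 days allows roughly 40 minutes of downtime.
print(round(error_budget_minutes(0.999), 1))    # 40.3
print(round(budget_remaining(0.999, 12.0), 1))  # 28.3
```

In practice the budget is usually tracked as a fraction of failed requests rather than wall-clock downtime, but the minutes-based framing is the easiest to reason about in deployment decisions.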
How Error Budgets Change Behavior
When the error budget is healthy (significant budget remaining): The service is more reliable than the SLO requires. The team has demonstrated the system can absorb risk. This is the right time for:
- Higher deployment frequency (more changes per day)
- Risky architectural changes
- Experiments and exploratory features
- Infrastructure migrations
When the error budget is nearly depleted: The SLO is at risk of being breached. The team must shift focus to stability:
- Freeze non-essential deployments
- Allocate engineering time to reliability improvements
- Investigate and fix the root causes of recent incidents
- Reduce deployment batch sizes to minimize each change's risk
When the error budget is depleted: The SLO has been breached. The deployment freeze becomes mandatory until reliability is restored to SLO.
The elegance of this mechanism is that it converts a values conflict into a data-driven decision. "Should we ship this risky feature?" is no longer a debate between operations (who want stability) and development (who want features). It is a question with an objective answer: does the current error budget balance support taking this risk?
Example: A team at a major cloud provider was debating whether to deploy a significant architectural change that might cause brief service disruption. The error budget calculation showed they had 38 minutes of remaining budget in the current period and the deployment risk was estimated at 10 minutes. They deployed. When the actual deployment caused 12 minutes of degradation, they spent 12 minutes of budget but remained within SLO. The decision framework worked as intended.
Designing for Reliability
Reliability cannot be added to a system after the fact. It must be designed in from the beginning.
Redundancy: Eliminating Single Points of Failure
Any component that can fail without the system continuing to function is a single point of failure (SPOF). Eliminating SPOFs through redundancy is the foundational reliability technique.
Compute redundancy: Run multiple instances of every service. When one fails, others continue serving. Use load balancers to distribute traffic and health checks to route away from failed instances automatically.
Storage redundancy: Replicate data across multiple disks (RAID), multiple instances (database replicas), and multiple data centers (cross-region replication). The RPO (Recovery Point Objective) and RTO (Recovery Time Objective) drive how much replication is necessary.
Network redundancy: Multiple network paths between critical components. Multiple ISP connections. Multiple DNS providers. Anycast routing that automatically redirects traffic around network failures.
Geographic redundancy: Deploying in multiple availability zones (within a cloud region) protects against single-datacenter failures. Deploying in multiple regions protects against regional failures (AWS's us-east-1 region had significant outages in June 2012, October 2012, December 2012, and December 2021).
Example: Netflix deliberately operates across multiple AWS regions simultaneously. They do not treat multi-region as a disaster recovery configuration---they treat it as normal operation. Traffic is distributed globally, and the system continuously routes around any region showing elevated error rates. This architecture allowed Netflix to maintain service during the December 2021 AWS us-east-1 outage that took down dozens of other services.
Graceful Degradation
When some components fail, well-designed systems degrade gracefully rather than failing completely. The principle: partial service is almost always better than no service.
Implementation patterns:
- Feature flags for non-essential features: Disable the recommendation algorithm if the recommendation service is unhealthy; continue showing products
- Cached fallbacks: Serve cached responses when the live data source is unavailable
- Read-only mode: Allow reads when writes are unavailable
- Default values: Return sensible defaults when personalization is unavailable
Example: Amazon's e-commerce system is designed with hundreds of independent services. When the recommendation service is unavailable, product pages display without personalized recommendations rather than failing to load. When the review service is slow, pages display without reviews rather than waiting. Users experience degraded but functional service rather than complete outages.
Circuit Breakers
The circuit breaker pattern (popularized by Michael Nygard in Release It!) prevents cascading failures across service dependencies.
When service A calls service B:
- Closed (normal): Calls proceed. Failures are counted.
- Open (tripped): Too many failures occurred. Calls immediately return an error without contacting service B. The circuit "opens" to protect service A from waiting for a service that is not responding.
- Half-open (testing): After a timeout, a test request goes to service B. If it succeeds, the circuit closes. If it fails, it opens again.
Without circuit breakers, slow or failed downstream services cause upstream services to accumulate waiting threads, consume memory, and eventually fail themselves---a cascading failure that can take down an entire microservice architecture starting from a single unhealthy service.
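A minimal sketch of the three states described above (thresholds and naming are illustrative; production libraries handle many more edge cases, such as per-endpoint state and metrics):

```python
import time

class CircuitBreaker:
    """Closed -> open after max_failures consecutive failures;
    half-open after reset_timeout; one success closes it again."""

    def __init__(self, max_failures=3, reset_timeout=30.0):
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_timeout:
                # Open: fail fast without contacting the downstream service.
                raise RuntimeError("circuit open: failing fast")
            # Half-open: fall through and let one test request proceed.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.time()  # trip the circuit
            raise
        else:
            self.failures = 0
            self.opened_at = None  # success closes the circuit
            return result
```

The key property is the fail-fast path: while the circuit is open, callers spend no threads or sockets waiting on a service that is known to be unhealthy.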
Timeout and Retry Policies
Every network call should have an explicit timeout. An unbounded wait for a service that is not responding will eventually exhaust all available resources.
Retry policies must be designed carefully:
- Exponential backoff with jitter: Retry after 1 second, then 2 seconds, then 4 seconds, adding random jitter to prevent thundering herd (all clients retrying simultaneously)
- Retry only idempotent operations: Do not automatically retry operations that create state (write operations) unless you can verify the original attempt failed
- Retry budgets: Limit total retry attempts to prevent retry storms from overwhelming recovering services
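The rules above can be combined into one small wrapper. A sketch, assuming the wrapped call is idempotent (function and parameter names are illustrative):

```python
import random
import time

def call_with_retries(fn, max_attempts=4, base_delay=1.0, max_delay=30.0):
    """Retry an idempotent call with exponential backoff and full jitter.
    max_attempts acts as the retry budget."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # retry budget exhausted: surface the failure
            # Full jitter: sleep a random amount up to the exponential cap,
            # so clients do not all retry at the same instant.
            cap = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, cap))
```

The "full jitter" variant (sleep anywhere between zero and the cap) spreads retries more evenly than adding a small jitter term to a fixed delay, at the cost of occasionally retrying very quickly.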
Chaos Engineering: Controlled Failure as Practice
Chaos engineering is the practice of deliberately introducing failures into production systems to test their resilience. The philosophy: if failures are inevitable, you should discover your resilience gaps in controlled experiments rather than during real incidents at the worst possible moment.
Netflix created Chaos Monkey in 2011: a service that randomly terminates EC2 instances in their production environment during business hours. The logic was that engineers needed to design services that survived instance failure, and the best way to enforce this was to make instance failure a regular occurrence rather than a hypothetical.
Netflix subsequently developed the Simian Army: a collection of chaos tools including:
- Chaos Gorilla: Simulates entire AWS availability zone failure
- Chaos Kong: Simulates regional failure
- Latency Monkey: Introduces artificial delays in service communication
- Conformity Monkey: Shuts down instances that violate best practices
The Principles of Chaos Engineering (2016, from Netflix engineers) formalize the methodology:
- Start by defining "steady state" (normal, healthy behavior measured by SLIs)
- Hypothesize that steady state will continue in both the control group and the experimental group
- Introduce variables that reflect real-world events (server failure, network latency, disk full)
- Disprove the hypothesis by finding a difference in steady state between control and experimental groups
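At its core, the final step is a comparison between two SLI measurements. A toy sketch of that comparison (the tolerance and the sample numbers are invented for illustration):

```python
def steady_state_holds(control_sli: float, experiment_sli: float,
                       tolerance: float = 0.01) -> bool:
    """A chaos experiment 'passes' when the experimental group's SLI stays
    within tolerance of the control group's despite the injected fault."""
    return abs(control_sli - experiment_sli) <= tolerance

# Hypothetical run: availability measured for both groups while one
# instance in the experimental group is terminated.
control, experiment = 0.9991, 0.9987
print(steady_state_holds(control, experiment))  # True: redundancy absorbed the failure
```

Real chaos tooling uses statistical tests rather than a fixed tolerance, but the shape of the decision is the same: did the injected fault move the SLI, or not?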
When chaos experiments find a difference, they have discovered a reliability gap. That gap becomes an engineering priority before it causes a real incident.
Example: LinkedIn uses chaos engineering to test their infrastructure continuously. In 2018, they ran a chaos experiment simulating the failure of their primary Kafka cluster (used for all event streaming). The experiment discovered that several dependent services had insufficient error handling for Kafka unavailability. The team fixed those services over the following months. When Kafka experienced a real outage months later, those services degraded gracefully rather than failing completely.
Starting chaos engineering does not require Netflix-scale infrastructure. Begin with:
- Terminate a single non-critical instance during business hours with the full team watching
- Verify the system handles it correctly (redundancy routes around the failure)
- Gradually increase scope and severity of experiments
- Build confidence in the system's resilience through accumulated evidence
Toil: The Reliability Engineer's Nemesis
Toil is SRE's term for operational work that is:
- Manual: Requires human execution rather than automation
- Repetitive: The same work performed again and again
- Automatable: Could be handled by a script or automated system
- Tactical: Reactive responses to events rather than proactive improvements
Examples of toil: manually restarting a service that crashes every few days, manually rotating SSL certificates, manually responding to a specific alert type by running the same commands every time, manually creating infrastructure by clicking through a cloud console.
Toil is not "bad work." It may be entirely necessary. The problem is when toil consumes too much engineering capacity, crowding out the engineering work that would reduce toil---a self-reinforcing trap.
Google targets a toil cap of 50% of SRE engineer time. Above that level, SRE teams are chronically overloaded and unable to invest in the reliability improvements that reduce future toil. Measuring toil explicitly (tracking hours spent on manual, repetitive work) provides data to justify automation investments and resist being dragged into unsustainable operational work.
Toil Reduction Strategies
Runbook-to-automation pipeline: Document manual responses to alerts in runbooks. Then automate the runbooks. Every manual response is a temporary state until it is automated.
Alert quality improvement: Alerts that fire without requiring action are noise that creates toil. If an alert fires and the response is "wait and see," the alert needs a better threshold or should be removed.
Self-healing systems: Automated systems that detect and correct their own failures reduce toil directly. Kubernetes automatically restarts failed pods. Auto-scaling groups automatically replace unhealthy instances. Database replication automatically promotes a replica when the primary fails.
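A runbook entry like "if the service fails its health check, restart it" automates into a few lines. Everything specific here is hypothetical---the unit name, the health endpoint, and the systemd restart are stand-ins for whatever the real runbook prescribes:

```python
import subprocess
import urllib.request

SERVICE = "example-worker"                    # hypothetical systemd unit name
HEALTH_URL = "http://localhost:8080/healthz"  # hypothetical health endpoint

def is_healthy(url: str, timeout: float = 2.0) -> bool:
    """The 'detect' half of a self-healing loop: probe the health endpoint."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except Exception:
        return False

def heal() -> None:
    """The 'correct' half: the same restart an on-call engineer would run by hand."""
    if not is_healthy(HEALTH_URL):
        subprocess.run(["systemctl", "restart", SERVICE], check=True)
```

Run on a timer (cron, a systemd timer, or a Kubernetes liveness probe doing the equivalent natively), this turns a recurring page into a log line---exactly the runbook-to-automation pipeline described above.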
Incident Management
When incidents occur despite preventive measures, the quality of incident response determines the outcome.
During the Incident
Prioritize mitigation over diagnosis: The first goal is restoring service, not understanding what caused the failure. Diagnosis is valuable but secondary. If rolling back the last deployment restores service, do that immediately and investigate why afterward. Many incidents are prolonged by engineers trying to understand the problem before addressing it.
Example: During the October 2021 Facebook outage (roughly six hours, caused by a BGP routing misconfiguration), incident responders initially attempted to diagnose and fix the routing issue remotely while their access systems depended on the same failed infrastructure. Restoration was delayed by hours because engineers tried to fix the root cause before falling back to the slower but reliable approach: sending engineers physically to the data centers.
Designate an incident commander: One person coordinates the response, directs investigation, communicates with stakeholders, and makes decisions. Without this coordination role, responders work in parallel without communication, make conflicting changes, and fail to keep stakeholders informed.
Timebox investigations: If a diagnosis approach has not yielded results in 15-20 minutes, pivot to a different approach. Tunnel vision on a hypothesis while the outage continues is a common failure mode.
Status communication: Proactive, regular communication to stakeholders via a status page (e.g., Atlassian Statuspage) prevents the secondary crisis of stakeholders demanding updates from the incident responders who need to focus on resolution.
Post-Incident: The Blameless Post-Mortem
Post-mortems (also called incident reviews or retrospectives) convert incidents into organizational learning. The blameless principle---the investigation focuses on systems and processes, not individual culpability---is not about avoiding accountability. It is about recognizing that focusing on individual blame forecloses the systemic analysis that produces improvement.
A thorough post-mortem contains:
- Executive summary: What happened, duration, impact
- Timeline: Detailed chronology from first symptom to full resolution, accurate to the minute
- Root cause analysis: What condition made the incident possible
- Contributing factors: What made the impact worse or the response slower
- What went well: Aspects of the response that worked effectively
- Areas for improvement: What could have been better
- Action items: Specific, owned, dated tasks to prevent recurrence or improve response
The action items are the most important output. A post-mortem that produces only "be more careful" or "review the runbook" has failed. Effective action items are specific and bounded: "Add canary deployment to the configuration change pipeline (owner: Alex, target: Sprint 24)" or "Implement automated backtracking regex detection in the WAF deployment pipeline (owner: Security team, target: August 30)."
Example: Cloudflare published a list of concrete action items from their 2019 regex outage. Each had a specific owner and was closed within weeks. The categories: engineering process (staged rollout requirements), testing (performance regression testing for WAF rules), and tooling (static analysis for catastrophic backtracking). The post-mortem and its action items have been widely shared as a model.
Production Readiness Reviews
Production Readiness Reviews (PRRs) are structured evaluations conducted before new services launch in production, ensuring they meet reliability standards.
A typical PRR checklist covers:
- Architecture: Does the service have appropriate redundancy? Any single points of failure?
- Capacity: Has load testing been performed? Are auto-scaling policies configured?
- Monitoring: Are SLIs defined and instrumented? Are alerts configured?
- Documentation: Are runbooks written for common operational scenarios?
- Deployment: Can the service be deployed and rolled back safely?
- Data: Are backups in place? Is data recovery tested?
- Security: Has a security review been conducted? Are secrets properly managed?
- Dependencies: Are all service dependencies identified and their SLOs acceptable?
PRRs shift reliability investment to the beginning of a service's lifecycle rather than retrofitting it after production incidents reveal gaps. It is considerably cheaper to implement proper monitoring before launch than to discover the need for it at 3 AM during an incident.
The Reliability Engineering Culture
The technical practices of SRE only work within an appropriate organizational culture.
Psychological safety: Engineers who fear punishment for mistakes hide information, avoid risky changes, and write post-mortems that obscure rather than illuminate. Psychological safety---the belief that one can speak honestly about failures, mistakes, and concerns without personal repercussions---is the cultural foundation that makes blameless post-mortems and honest incident reporting possible.
Reliability as shared responsibility: Reliability is not the operations team's problem. It is every engineer's problem. Development teams that design unreliable systems and expect SREs to absorb the operational burden create unsustainable situations. Shared responsibility means development teams own reliability from design through production.
Reliability as a feature: The best reliability programs treat reliability as a user-facing feature with clear value, not as a tax on development velocity. A service that is fast to implement but unreliable creates user frustration that erodes the business value of the feature. Building reliability in costs less in the long run than retrofitting it after user complaints.
Reliability engineering also intersects with CI/CD: deployment automation, feature flags, and progressive deployment strategies are reliability tools as much as development velocity tools.
References
- Beyer, Betsy, Jones, Chris, Petoff, Jennifer, and Murphy, Niall Richard. Site Reliability Engineering: How Google Runs Production Systems. O'Reilly Media, 2016. https://sre.google/sre-book/table-of-contents/
- Beyer, Betsy, Murphy, Niall Richard, Rensin, David, Kawahara, Kent, and Thorne, Stephen. The Site Reliability Workbook. O'Reilly Media, 2018. https://sre.google/workbook/table-of-contents/
- Nygard, Michael T. Release It!: Design and Deploy Production-Ready Software, 2nd ed. Pragmatic Bookshelf, 2018.
- Rosenthal, Casey and Jones, Nora. Chaos Engineering: System Resiliency in Practice. O'Reilly Media, 2020.
- Netflix Technology Blog. "Chaos Engineering Upgraded." netflixtechblog.com, 2020. https://netflixtechblog.com/chaos-engineering-upgraded-878d341f15fa
- Cloudflare. "Details of the Cloudflare Outage on July 2, 2019." blog.cloudflare.com, 2019. https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/
- Principles of Chaos Engineering. principlesofchaos.org. http://principlesofchaos.org/
- Forsgren, Nicole, Humble, Jez, and Kim, Gene. Accelerate: The Science of Lean Software and DevOps. IT Revolution Press, 2018.
- Atlassian. "Incident Management Handbook." atlassian.com. https://www.atlassian.com/incident-management
- PagerDuty. "Incident Response Documentation." response.pagerduty.com. https://response.pagerduty.com/