SiteReliabilityEngineering

This is an ongoing list of notes on SRE I've learned over the past couple of years.

I pay careful attention to metrics and the math behind them 👨‍🔬

<h1>Site Reliability Engineering - Google Style</h1> <h3>Helpful Links</h3>

Reliability Mathematics

The 'S' Curve (Preface)

The 'S' Curve (Math)

How Complex Systems Fail (Chicago)

Google SRE Coding Questions

Path to SRE (manager)

MIT 6.033 Systems Engineering

Linux Command(s) (First-Principles)

USENIX Short Topics on SysAdmin (Interview Prep)

IIT Madras - Systems Engineering Course

Steps to Analyze a System

SLO (Service Level Objective) - A quantitative target for a service metric (a time, or a quantity of actions) that the service must meet; missing it puts you at risk of entering the SLA (repercussions). SLOs are internal thresholds set to alert you before an SLA violation, so they are quantitatively stricter than the SLA. A service can have multiple SLOs.

Example: HTTP latency SLO of 200ms. If requests take longer than 200ms, you risk entering the SLA (usually financial repercussions). An SRE needs to be able to anticipate (ideally) or remedy (more common) a failed SLO.

SLA (Service Level Agreement) - Essentially the consequences of a failed SLO. Usually comes in the form of direct or indirect monetary compensation.

Example: GCP breaks their HTTP SLO. GCP reimburses the company with $100 in cloud credits.

The Happiness Test - The minimum threshold to ensure that customers are happy.

<h3>Measuring Reliability</h3>

Example: Netflix - Playback latency (HTTP). Packet loss in the middle of a video.

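SLI (Service Level Indicator) - A metric you define to quantitatively measure your system's performance.

Example: Success Rate (Network Health) - (successful requests / total requests) * 100. The error rate is the complement: (failed requests / total requests) * 100.

Example: Success Rate (Network Health) - (successful requests / throughput) * 100, where throughput is the number of requests served in the measurement window.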
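To make the ratio concrete, here is a minimal sketch of the success rate and its complement, the error rate. The function names and the request counts are made up for illustration; they are not from any monitoring library.

```python
def success_rate(successful: int, total: int) -> float:
    """Percentage of requests that succeeded (hypothetical helper)."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= successful <= total:
        raise ValueError("successful must be between 0 and total")
    # Multiply before dividing so simple inputs give clean percentages.
    return successful * 100 / total


def error_rate(successful: int, total: int) -> float:
    """Percentage of requests that failed: the complement of success_rate."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= successful <= total:
        raise ValueError("successful must be between 0 and total")
    return (total - successful) * 100 / total


if __name__ == "__main__":
    # Illustrative numbers only: 9,990 good responses out of 10,000.
    print(success_rate(9_990, 10_000))
    print(error_rate(9_990, 10_000))
```

Note that a metric computed as success / total is a success rate; an error rate has to count the failures.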

Measuring Reliability (Edge Case) - Demand on an organization and/or system is not always linear. There are cases when you will need to give a customer far better service than the standard service you normally offer.

Example: Black Friday - It is expected that Company Y will have an N% increase in their website traffic (read: clients) and thus will require an X% increase in the “triangle of success”.

<h3>Triangle of Success - R.A.S.</h3>

Reliability - % of time that the system functions properly for the user.
Availability - % of time that the system is up and running.
Scalability - # of users that the system can serve reliably.

Availability is non-linearly related to customer happiness.
Availability is inversely proportional to the ability to push out new features.
Reliability can be increased by slowing the dev release cadence, increasing testing, and doing more manual analysis.

Never Want 100% - The marginal cost to make an already reliable system more reliable often exceeds the value of delivering this to the customers.

Marginal cost in this case is how much it would cost (engineer time, compute cost, etc.) to make a proposed change.

Value to customers in this case could be thought of as the probability that new customers use the service due to the proposed change and/or the probability that you will lose a customer without it.

<h3>How To Determine Reliability</h3>

Measure your SLO achieved and be above the target.

❓: What do the users need and how does the system currently perform?
Measure how SLI is performing against the target.

❓: Will increasing the service availability result in positive externalities or negative externalities to the business function?

Note: If you make your service more reliable than an individual's ISP, your customer is going to blame the ISP, not you.

<h3>Iteration Process</h3>

Review a new SLO after 3 months. Follow up review after 6-12 months.

<h3>Error Budgets</h3>

Requires executive buy-in.
Balance reliability with feature velocity.

Error Budget = 1 - SLO

Allowed Downtime (minutes) = (1 - SLO) * 28 (days) * 24 (hours/day) * 60 (minutes/hour)
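A quick sketch of the arithmetic, taking allowed downtime as the error budget (1 - SLO) multiplied by the length of the window. The function names and the 99.9% SLO are illustrative assumptions; the 28-day window matches the one used above.

```python
def error_budget(slo: float) -> float:
    """Error budget as a fraction, e.g. an SLO of 0.999 leaves about 0.001."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be a fraction in (0, 1]")
    return 1.0 - slo


def allowed_downtime_minutes(slo: float, window_days: int = 28) -> float:
    """Minutes of full downtime the budget allows over the window."""
    window_minutes = window_days * 24 * 60
    return error_budget(slo) * window_minutes


if __name__ == "__main__":
    # A 99.9% SLO over 28 days allows roughly 40.32 minutes of downtime.
    print(round(allowed_downtime_minutes(0.999), 2))
```

The budget shrinks by 10x with every extra "nine": 99.99% over the same window leaves only about 4 minutes.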

⚠️ The single largest source of outages is change to a system. New features = lower service availability.

Note: the relationship between new features and lowered service availability is non-linear.

Example: To improve reliability of a new feature incorporated into a system you could find that it will cost 10x the previous amount to ensure that the new system is reliable.

Advanced Error Budget Topics:

“Dynamic release cadence”

Loosen the restriction on releasing features when the error budget turns out to have been overly frugal.

“Rainy day fund”

Rollover error budget that covers unexpected events.

“Budget based alerts”

Send alert if recent errors are > X% of your monthly budget.

“Silver Bullets”

The error budget is already spent. SRE doesn’t want to support the new feature. SWE says the new feature is vital to the company and has N silver bullets. The SWE would need to have seniority and spend one of their silver bullets in this case. Silver bullets DO NOT ROLL OVER.

⚠️ Silver Bullets are treated as a failure and require a postmortem. ⚠️
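The “budget based alerts” idea from the list above can be sketched as a simple threshold check. This is an assumption-laden illustration: the function name, the request volume, and the 10% threshold are all hypothetical, and the budget is expressed as a count of allowed failed requests per month.

```python
def should_alert(recent_errors: int,
                 monthly_requests: int,
                 slo: float,
                 threshold_fraction: float) -> bool:
    """Return True if recent errors exceed X% of the monthly error budget.

    threshold_fraction is X expressed as a fraction (0.10 means 10%).
    """
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be a fraction in (0, 1]")
    # Monthly budget expressed as a number of requests allowed to fail.
    budget_errors = (1.0 - slo) * monthly_requests
    return recent_errors > threshold_fraction * budget_errors


if __name__ == "__main__":
    # Hypothetical service: 1,000,000 requests/month at a 99.9% SLO,
    # which leaves a budget of about 1,000 failed requests.
    print(should_alert(150, 1_000_000, 0.999, 0.10))  # 150 > ~100
    print(should_alert(50, 1_000_000, 0.999, 0.10))   # 50 < ~100
```

What counts as "recent" (the last hour, the last day) is a tuning choice left out of this sketch.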

<h3>Trade-Off Theory</h3>

How to make devs happy?

Integration testing, automated canary analysis (ACA), rollback.

How to reduce scale of failure amongst users?

Route traffic to a small percentage of users with a new image and study how the system responds to the changes. This is also a great way to discover and eliminate SPOF (single point of failure).
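One common way to route a small, stable percentage of users to a new image is to hash the user ID into a bucket. The sketch below is only an illustration of that idea, not how any particular load balancer does it; `in_canary` and the user IDs are hypothetical.

```python
import hashlib


def in_canary(user_id: str, canary_percent: float) -> bool:
    """Deterministically decide whether a user gets the new image.

    The same user_id always lands in the same bucket, so a user does not
    flip between the old and new image from one request to the next.
    """
    if not 0.0 <= canary_percent <= 100.0:
        raise ValueError("canary_percent must be between 0 and 100")
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash onto 10,000 buckets (0.01% steps).
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < canary_percent * 100


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(10_000)]
    canary_users = [u for u in users if in_canary(u, 5.0)]
    # Roughly 5% of users should land in the canary group.
    print(len(canary_users) / len(users))
```

Raising `canary_percent` only adds users to the canary group; users already in it stay in it, which keeps a phased rollout consistent.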

TTD - Time to detect an issue in a system.

TTR - Time to resolve the issue in the system.

TTF - Time elapsed between failures.

Error Impact (TBF) = (TTD + TTR) * impact (%) / TTF

How to improve reliability?

Reduce numerator OR increase denominator
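A small sketch of the error impact formula, showing that shrinking the numerator or growing the denominator both lower the result. The function name and the numbers are made up; all times just need to be in the same unit (minutes here).

```python
def error_impact(ttd: float, ttr: float, impact_fraction: float, ttf: float) -> float:
    """(TTD + TTR) * impact / TTF, with impact given as a fraction of users."""
    if ttf <= 0:
        raise ValueError("ttf must be positive")
    if not 0.0 <= impact_fraction <= 1.0:
        raise ValueError("impact_fraction must be between 0 and 1")
    return (ttd + ttr) * impact_fraction / ttf


if __name__ == "__main__":
    thirty_days = 30 * 24 * 60  # hypothetical TTF in minutes

    baseline = error_impact(ttd=10, ttr=50, impact_fraction=0.5, ttf=thirty_days)
    faster_detection = error_impact(ttd=2, ttr=50, impact_fraction=0.5, ttf=thirty_days)
    rarer_failures = error_impact(ttd=10, ttr=50, impact_fraction=0.5, ttf=2 * thirty_days)

    # Both improvements reduce the impact relative to the baseline.
    print(baseline, faster_detection, rarer_failures)
```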

How to improve TTD?

Implement systems to get alerts to the right person faster (reduce detection time).

How to improve TTR?

Implement systems to fix outages faster.

Examples: develop a playbook, improve data parsing and log analysis. Take a failed zone offline and redirect traffic to an available zone while the affected zone is getting repaired.

How to improve impact % ?

Implement a system to roll out new features to a very small set of users first (note: find users that fall within DAU and are not your “core” user base. Find users that you “can afford to lose” and test it on them). Give changes time to bake.

How to improve (increase) TTF?

Decrease the probability that a failure ever happens again.

Example: re-routing traffic from a failed region over to a region that is healthy.

<h3>Reliability Operations Best Practices</h3>

Periodically report the worst customers, worst region, uneven error budget distribution. Focus extra hard on those regions.
Standardize infrastructure.
Consult SWE on system design.
Rollback speed.
Phased rollouts.

<h3>Quantifying User Satisfaction</h3>

Think about reliability from the user's point of view.

How to measure the happiness?

We define an SLI and measure how it changes over time.
We want an SLI that has a linear relationship (predictable) with the happiness of the users.
Predictability is very important because you will be making engineering changes based on the data.
Relationship between latency and user happiness is an “S” curve (non-linear).

Example: Website is slow to load or respond to other embedded features. User leaves site. Record the load speed, then count the users that left the site in this window of time as a ratio of the users that didn’t. You will have a quantified metric of how unhappy the event made users.

<h3>Properties of SLI</h3>

Standard (computer) operational metrics: Load average, CPU util, memory usage, bandwidth.

CPU bound = slow service = unhappy user

SLI = good events / valid events

SLI is a measurement of user experience (quantitative)

Service's internal state metrics: thread pool fullness, request queue length, request queue outages

SLI Range: 0%-100%

Benefits: Consistent format

SLI aggregated over a long time period is needed to make a decision on the validity of the metric. Want high signal, low noise.
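As a sketch of SLI = good events / valid events aggregated over a time window, the class below keeps the most recent N buckets (for example, one per minute) and reports the ratio as a percentage in the 0%-100% range. `WindowedSLI` and its methods are hypothetical names, not part of any monitoring tool.

```python
from collections import deque
from typing import Deque, Optional, Tuple


class WindowedSLI:
    """Aggregate good/valid event counts over the last `window_size` buckets."""

    def __init__(self, window_size: int) -> None:
        if window_size <= 0:
            raise ValueError("window_size must be positive")
        # deque(maxlen=...) drops the oldest bucket automatically.
        self._buckets: Deque[Tuple[int, int]] = deque(maxlen=window_size)

    def record(self, good: int, valid: int) -> None:
        """Add one bucket of counts; good events are a subset of valid ones."""
        if not 0 <= good <= valid:
            raise ValueError("need 0 <= good <= valid")
        self._buckets.append((good, valid))

    def value(self) -> Optional[float]:
        """SLI as a percentage, or None when there are no valid events yet."""
        good = sum(g for g, _ in self._buckets)
        valid = sum(v for _, v in self._buckets)
        if valid == 0:
            return None
        return good * 100 / valid


if __name__ == "__main__":
    sli = WindowedSLI(window_size=3)
    for good, valid in [(90, 100), (100, 100), (100, 100)]:
        sli.record(good, valid)
    print(sli.value())    # the bad bucket still counts toward the window
    sli.record(100, 100)  # the oldest (bad) bucket falls out of the window
    print(sli.value())
```

A longer window smooths out short blips (less noise), at the cost of reacting more slowly to a real regression.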

<h3>Measuring SLIs (Order of User Proximity HI->LOW)</h3>

Processing server side request logs
1. Request backfill SLI logs. Get retroactive data to build a model prior to conception.
2. Convoluted logs (data stitching), processing jobs, etc. can be stitched together and exported as a refined “good event” counter.
3. Note: ingestion and processing will add significant latency between an event and its observation in the SLI.
4. With the aforementioned in mind this is a poor way to build an emergency metric.
5. Events that don’t hit the application server can’t be observed in the logs at all.
6. Hard to measure complex user journeys (stateless server).
7. Application level metrics are easy to capture and even easier to add.
8. Application level metrics are lower latency than logs processing.
Service requests.
1. Extract metrics off of the client load balancer.
2. Closest point to the user and most data will not go unmeasured.
3. Little to no engineering effort required since most cloud providers have this feature built in.
4. Load balancers are stateless which makes it impossible to track sessions.
5. Load
