Howtheysre
A curated collection of publicly available resources on how technology and tech-savvy organizations around the world practice Site Reliability Engineering (SRE).
How they SRE

Introduction
How They SRE is a curated knowledge repository of Site Reliability Engineering (SRE) best practices, tools, techniques, and culture adopted by leading technology or tech-savvy organizations.
Many organizations share the best practices, tools, and techniques that shape their engineering culture through public channels such as engineering blogs, conferences, and meetups. This repository collects and presents content from those sources.
Topics
- Site Reliability Engineering
- Hiring and Building SRE teams
- SRE Culture
- DevOps
- Monitoring & Observability
- Alerting
- Incident Response & Post-Mortem
- On-Call
- Testing in Production
- Chaos Engineering
- Automation
- Performance
- Platform Engineering
Organizations
<details> <summary>Achievers</summary>Blog Posts
- Enter the Abattoir - Building 'à la carte' gitops tooling
- Scaling Production Globally — The service mesh facelift (Part-1)
- Scaling Production Globally - Solving observability problems for developers (Part-2)
- Load Testing Kubernetes: Building a Framework (Part-1)
- Load Testing Kubernetes: Resolving bottlenecks and improving performance (Part-2)
</details> <details> <summary>Airbnb</summary>Blog Posts
- Automated Incident Management Through Slack
- Detecting Vulnerabilities With Vulnture
- Alerting Framework at Airbnb
- When The Cloud Gets Dark — How Amazon’s Outage Affected Airbnb
- Intelligent Automation Platform: Empowering Conversational AI and Beyond at Airbnb
- Production Secret Management at Airbnb
- Automating Data Protection at Scale, Part 1
- Automating Data Protection at Scale, Part 2
- Automating Data Protection at Scale, Part 3
- Dynamic Kubernetes Cluster Scaling at Airbnb
Blog Posts
</details> <details> <summary>Alibaba Cloud</summary>Blog Posts
- Why Are the Top Internet Companies Choosing SRE over Traditional O&M?
- Architecture and Practices of Bilibili's Real-time Platform
</details> <details> <summary>Asana</summary>Blog Posts
- How Asana uses Asana: Security incident response
- How Asana ships stable web application releases
- Analysis of recent downtime & what we’re doing to prevent future incidents
- Developer environment: Achieving reliability by making it fast to reset
- Three security tactics for every IT leader to consider this fall
</details> <details> <summary>ASOS</summary>Blog Posts
- Playing the blame-less game
- A day in the life of… Cat S (Head of Reliability Engineering)
- An AKS Performance Journey: Part 1 — Sizing Everything Up
- An AKS Performance Journey: Part 2 — Networking It Out
- Cyber Security @ ASOS.com
- Security Operations 24x7
- The skills we look for in Cyber Security Incident Response
</details> <details> <summary>Atlassian</summary>Blog Posts
- Best practices for change management in the age of DevOps
- Automated testing: 5 lessons from Atlassian’s Kubernetes team on testing infrastructure as code
- How to export Kubernetes events for observability and alerting
- Incident Postmortem Template
Blog Posts
</details> <details> <summary>Baidu</summary>Videos
- Anomaly Detection on Golden Signals
- NetRadar: Monitoring the Datacenter Network
- Let the Chaos Begin—SRE Chaos Engineering Meets Cybersecurity
</details> <details> <summary>Basecamp</summary>Blog Posts
- Inside a CODE RED: Network Edition
- Three Basecamp outages. One week. What happened?
- Basecamp 2 and Basecamp 3 search outage report
- Reducing Incident Escalations at Basecamp
Books
</details> <details> <summary>Bloomberg</summary>Videos
- Capacity Planning and Performance Enhancement with Page Reference Sampling
- Why SREs can't afford to NOT do Chaos Engineering
- Tracing Real-Time Distributed Systems
- The Bloomberg Story: Building SRE Teams in an "Immeasurable" Organisation
- Visibility into Loggers (and Other Low Level Services)—Seeing the Trees from the Forest
</details> <details> <summary>Booking.com</summary>Blog Posts
- How Reliability and Product Teams Collaborate at Booking.com
- Incidents, fixes, and the day after
- Troubleshooting: A journey into the unknown
Videos
- Sailing the Database Seas: Applying SRE Principles at Scale
- SLOs for Data-Intensive Services
- [Benefits of Taking the
