Site Reliability Engineering (SRE)

Learning Resources for DevOps, SRE, Cloud & Engineering Management

A curated list of resources for SRE

What is Site Reliability Engineering?

Site Reliability Engineering: How Google Runs Production Systems, written by a group of Google engineers, is widely considered the definitive book on site reliability engineering. Ben Treynor Sloss, a Google vice president of engineering, coined the term in the early 2000s. He described it this way: "It's what happens when you ask a software engineer to design an operations function."

Sysadmins have been writing code for a long time, but for many of those years a team of sysadmins still managed most machines by hand. Back then, "many machines" meant dozens or hundreds. Once you scale to thousands or hundreds of thousands of hosts, you can't keep throwing people at the problem. At that scale, the obvious solution is to use code to manage the hosts and the software that runs on them.

Until fairly recently, the operations team was also entirely separate from the developers, and the two jobs were thought to require different skill sets. The SRE role tries to bring both jobs together.

Contents

Culture

Education

Books

Hiring

Reliability

Monitoring & Observability & Alerting

On-Call

Post-Mortem

Capacity Planning

Service Level Agreement

Performance

Programming

Misc Articles

Real-time Messaging

Blogs

Conferences & Meetups

Twitter

SRE Tools

BINPIPE aims to simplify learning for those looking to gain a foothold in the industry.
Write to me at learn@binpipe.org if you are looking for tailor-made training sessions.
For self-study resources, look around in this repository, the Binpipe Blog, and the YouTube channel.

📒 Maintainer: Prasanjit Singh | www.binpipe.org


License