I’m on my way home from SRECon19 Americas, and it was incredible. Lots of really awesome folks had things to say that made me think. Thanks to all of you for speaking!
Plenty of people are going to do recaps and have already done live tweetstreams. Instead of taking notes on everything, this year I decided to try something different.
Below are just the parts of each talk I went to that blew my mind or made me think. I’m hoping that just reading each of these will be enough to get you thinking and open your mind.
I’m not trying to summarize each talk, and I definitely don’t want to give the impression that these were the only good parts of each talk! It all depended on what was new to me and my mood and my caffeine level and any number of other things.
In most cases, I’ve hurriedly written down something the speaker said. I’m sure I got the words wrong, so please consider the below to be a paraphrase rather than a direct quote.
Welcome and Opening Remarks
- code of conduct given top billing!
- live captioning by real humans!
What Breaks Our Systems: A Taxonomy of Black Swans
- explicitly plan for thundering herds
- know what you’ll do and test in advance
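One common way to plan for a thundering herd is to have clients retry with randomized exponential backoff, so that everyone who failed together doesn't come back together. The sketch below is my own illustration, not something shown in the talk; the function name `backoff_delays` and the `base` and `cap` defaults are assumptions picked for the example.

```python
import random


def backoff_delays(attempts, base=0.5, cap=30.0, rng=None):
    """Return one randomized delay (in seconds) per retry attempt.

    Each delay is drawn uniformly from [0, min(cap, base * 2**attempt)].
    The randomness ("jitter") spreads out clients that failed at the same
    moment, so their retries don't arrive in lockstep.
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(attempts):
        # The upper bound doubles each attempt until it hits the cap.
        ceiling = min(cap, base * (2 ** attempt))
        delays.append(rng.uniform(0, ceiling))
    return delays


if __name__ == "__main__":
    # Hypothetical usage: print the waits a client would use for 6 retries.
    for i, delay in enumerate(backoff_delays(6)):
        print(f"retry {i}: wait {delay:.2f}s")
```

Testing in advance could then be as simple as pointing a few thousand simulated clients at a staging copy of the service and checking that the retry traffic stays spread out.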
Complexity: The Crucial Ingredient in Your Kitchen
- definition of a complex system: no one human can understand how it all works
- all components can be 100% correct and a system can still be unreliable
- Challenger accident: the redundancy that was in place contributed to the failure by creating a false sense of security
- redundancy causes some incidents
- regular exposure to risk helps us do a better job during incidents because it allows us to improve our adaptive capacity
Fixing On-Call When Nobody Thinks It’s (Too) Broken
- when your on-call is this bad [thousands of pages a month] the problem is usually cultural, not technical
- reducing pages made people nervous that something was wrong
- “silence anxiety”
How Did Things Go Right? Learning More from Incidents
- overly focusing on preventing incidents will prevent fewer of them
- availability numbers are made up
- the error bar is greater than the next nine you’re after!
- incident = a combination of normal things that go wrong all the time, all happening at once
- trying to put in guardrails blocks people from doing the everyday work they do to keep things running
- reduces adaptive capacity
- let’s rename incidents to “surprises”
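To get a feel for the "error bar is greater than the next nine" point, it helps to see how little downtime separates one nine from the next. This is my own back-of-the-envelope arithmetic, not the speaker's; the helper name `downtime_budget_minutes` is made up for the example and it assumes a 365-day year.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_budget_minutes(nines):
    """Yearly downtime allowed by an availability with `nines` nines.

    For example, nines=3 means 99.9% availability, i.e. 0.1% downtime.
    """
    unavailability = 10 ** (-nines)
    return MINUTES_PER_YEAR * unavailability


if __name__ == "__main__":
    for n in (3, 4, 5):
        print(f"{n} nines: {downtime_budget_minutes(n):.1f} minutes/year")
```

Four nines allows roughly 52.6 minutes of downtime a year and five nines roughly 5.3. If you can't measure your downtime to within about 47 minutes a year, you can't tell the two apart.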
SRE & Product Management: How to Level up Your Team (and Career!) by Thinking like a Product Manager
- [ed note: Jen is my coworker at Fastly]
- SREs need to do product management — and already do
What I Wish I Knew before Going On-call
Chie Shu, Dorothy Jung, and Wenting Wang
- on-call choose-your-own-adventure game!
- a runbook should be a guide on how to jump-start your car — not how to build it
Zero to SRE
- when hiring junior SREs to train them up, most of the work for upper management comes before candidates ever enter the pipeline
- dedicate 20% of a junior SRE’s time to learning (e.g. read a chapter of a book, deploy a kube cluster)
- the team structure changes Kim suggests are actually good for SRE teams of all skill levels, not just those containing junior engineers
One on One SRE
- incident responders may share more when debriefed 1:1 vs. a group retrospective
- best 1:1 retrospective question: “what surprised you in this incident?”
- individuals can impact the reliability of the whole organization by building a network of 1:1 relationships
- automation must respect the roles of humans and their adaptive capacity
You Don’t Have to Love Your Job
- “love is patient, love is kind” (wedding vows)
- this does not sound like something that should apply to your job!
- Elon Musk has millions of shares in his company. He has a large incentive to trick workers into working 80 hour weeks.
- workers with a 60-hour work week are 25% less productive than those working a 40-hour week
- that’s 25% less in absolute terms, not per hour
- love inspires heroics, and heroics are an SRE nightmare
- crowing about how one should love their job is an expression of privilege
Mindfulness in SRE: Monitoring and Alerting for One’s Self
- physical and “ego” (conceptual) threats result in the same body stress response
- a retrospective is about understanding the full context behind the actions someone took during an incident
- mindfulness can help us do that immediately in the moment and modulate our actions before taking them
- like an instant retro
Resilience Engineering Mythbusting
- [ed note: Will is my coworker at Fastly]
- you can’t build resilience into software and systems
- they can’t think and so by definition can’t be resilient
- computers and systems aren’t resilient, people are
- complexity is requisite in our systems, so how can we cope with it?
- chaos engineering isn’t about finding bugs
- it’s about developing intuition about the system
- “best practices” don’t guarantee that the practice is safe
- “best practice” tends to be a post-hoc evaluation that invites counterfactuals
- error budgets don’t control risk because “risk” is post-hoc evaluation
- actions that look safe often contribute to incidents
- actions that look unsafe are often done to prevent incidents
Why Are Distributed Systems So Hard?
- Raft consensus algorithm — “raft” isn’t an acronym, it’s just a collection of logs