SRECon19 Americas interesting tidbits

March 30, 2019 | By lex | Filed in: SRE.

I’m on my way home from SRECon19 Americas, and it was incredible. Lots of really awesome folks had things to say that made me think. Thanks to all of you for speaking!

Plenty of people are going to do recaps and have already done live tweetstreams. Instead of taking notes on everything, this year I decided to try something different.

Below are just the parts of each talk I went to that blew my mind or made me think. I’m hoping that just reading each of these will be enough to get you thinking and open your mind.

I’m not trying to summarize each talk, and I definitely don’t want to give the impression that these were the only good parts of each talk! It all depended on what was new to me and my mood and my caffeine level and any number of other things.

In most cases, I’ve hurriedly written down something the speaker said. I’m sure I got the words wrong, so please consider the below to be a paraphrase rather than a direct quote.

Welcome and Opening Remarks

Liz Fong-Jones

code of conduct given top billing!
live captioning by real humans!

What Breaks Our Systems: A Taxonomy of Black Swans

Laura Nolan

explicitly plan for thundering herds
- know what you’ll do and test in advance
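
One common way to plan ahead for a thundering herd is to have clients retry with randomized ("jittered") exponential backoff, so they don't all come back at the same instant. This is my own minimal sketch, not something from the talk; the function name, parameters, and default values are all hypothetical.

```python
import random


def backoff_delays(attempts, base=0.5, cap=30.0, rng=None):
    """Return one randomized delay in seconds per retry attempt ("full jitter").

    All names and defaults here are illustrative assumptions.
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(attempts):
        # Exponential ceiling, capped so late retries do not wait forever.
        ceiling = min(cap, base * (2 ** attempt))
        # Pick uniformly in [0, ceiling] so clients spread out
        # instead of retrying in lockstep.
        delays.append(rng.uniform(0, ceiling))
    return delays


if __name__ == "__main__":
    for i, delay in enumerate(backoff_delays(5)):
        print(f"retry {i}: wait {delay:.2f}s")
```

The "test in advance" part still applies: a sketch like this only helps if you've actually watched a fleet of clients recover with it before the real incident.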

Complexity: The Crucial Ingredient in Your Kitchen

Casey Rosenthal

definition of a complex system: no one human can understand how it all works
all components can be 100% correct and a system can still be unreliable
Challenger accident: the fact that redundancy was there contributed to the failure because it created a false sense of security
- redundancy causes some incidents
regular exposure to risk helps us do a better job during incidents because it allows us to improve our adaptive capacity

Fixing On-Call When Nobody Thinks It’s (Too) Broken

Tony Lykke

when your on-call is this bad [thousands of pages a month] the problem is usually cultural, not technical
reducing pages made people nervous that something was wrong
- “silence anxiety”

How Did Things Go Right? Learning More from Incidents

Ryan Kitchens

overly focusing on preventing incidents will prevent fewer of them
availability numbers are made up
- the error bar is greater than the next nine you’re after!
incident = a combination of normal things that go wrong all the time, all happening at once
trying to put in guardrails blocks people from doing the everyday work they do to keep things running
- reduces adaptive capacity
let’s rename incidents to “surprises”
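
To make the "error bar vs. the next nine" point above concrete, here's a rough sketch of the arithmetic. The numbers (a 30-day window and the example error bars) are my own assumptions, not from the talk, and the function names are hypothetical.

```python
def downtime_minutes(availability_pct, period_days=30):
    """Minutes of allowed downtime in a period for a given availability %."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


def can_distinguish(current_pct, target_pct, error_bar_pct):
    """True only if the gap between two availability figures
    is larger than the measurement error bar (in percentage points)."""
    return abs(target_pct - current_pct) > error_bar_pct


if __name__ == "__main__":
    for pct in (99.9, 99.99):
        print(f"{pct}% -> {downtime_minutes(pct):.2f} min of downtime per 30 days")
    # Assumed error bar of 0.1 percentage points on the measurement.
    print(can_distinguish(99.9, 99.99, error_bar_pct=0.1))
```

Going from 99.9% to 99.99% shrinks the 30-day downtime allowance from roughly 43.2 minutes to roughly 4.32 minutes, a gap of 0.09 percentage points. If your measurement is only good to within 0.1 percentage points, you can't tell the two apart.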

SRE & Product Management: How to Level up Your Team (and Career!) by Thinking like a Product Manager

Jen Wohlner

[ed note: Jen is my coworker at Fastly]
SREs need to do product management — and already do

What I Wish I Knew before Going On-call

Chie Shu, Dorothy Jung, and Wenting Wang

on-call choose-your-own-adventure game!
a runbook should be a guide on how to jump-start your car — not how to build it

Zero to SRE

Kim Schlesinger

when hiring junior SREs to train them up, most of the work for upper management comes before candidates ever enter the pipeline
dedicate 20% of a junior SRE’s time to learning (e.g. read a chapter of a book, deploy a Kubernetes cluster)
the team structure changes Kim suggests are actually good for SRE teams of all skill levels, not just those containing junior engineers

One on One SRE

Amy Tobey

incident responders may share more when debriefed 1:1 than in a group retrospective
best 1:1 retrospective question: “what surprised you in this incident?”
individuals can impact the reliability of the whole organization by building a network of 1:1 relationships

Pragmatic Automation

Max Luebbe

automation must respect the roles of humans and their adaptive capacity

You Don’t Have to Love Your Job

Leslie Carr

“love is patient, love is kind” (wedding vows)
- this does not sound like something that should apply to your job!
Elon Musk has millions of shares in his company. He has a large incentive to trick workers into working 80 hour weeks.
workers with a 60-hour work week are 25% less productive than those working a 40-hour week
- that’s 25% less in absolute terms, not per hour
love inspires heroics, and heroics are an SRE nightmare
crowing about how one should love their job is an expression of privilege

Mindfulness in SRE: Monitoring and Alerting for One’s Self

Tommy Lutz

physical and “ego” (conceptual) threats result in the same body stress response
a retrospective is about understanding the full context behind the actions someone took during an incident
- mindfulness can help us do that immediately in the moment and modulate our actions before taking them
- like an instant retro

Resilience Engineering Mythbusting

Will Gallego

[ed note: Will is my coworker at Fastly]
you can’t build resilience into software and systems
- they can’t think and so by definition can’t be resilient
- computers and systems aren’t resilient, people are
complexity is requisite in our systems, so how can we cope with it?
chaos engineering isn’t about finding bugs
- it’s about developing intuition about the system
“best practices” don’t guarantee that the practice is safe
- “best practice” tends to be a post-hoc evaluation that invites counterfactuals
error budgets don’t control risk because “risk” is a post-hoc evaluation
- actions that look safe often contribute to incidents
- actions that look unsafe are often taken to prevent incidents

Why Are Distributed Systems So Hard?

Denise Yu

Raft consensus algorithm — “raft” isn’t an acronym, it’s just a collection of logs
