Tech Conference Diaries: Day 1 of SRECon 2017

Simi Awokoya
5 min read · Aug 31, 2017

--

I’ve been tweeting about my SREcon trip for a couple of weeks, so I thought I’d share my adventures here with all my internet friends :)

Day 1 was a great day of learning about key reliability engineering concepts and the importance of building reliable and scalable systems. As a software engineer who spends the majority of my time writing code to implement new features, this conference has given me a more holistic view of system reliability. It’s just day one and I’ve already learnt so many important lessons. Here are a few from today:

Lesson #1: Your approach to reliability engineering should be value based.

Narayan Desai shared how Google values have influenced the nature of their SRE teams.

  • Reliability is paramount (except when it isn’t): Engineers shouldn’t spend all their time firefighting (being reactive to incidents, a.k.a. toil). The majority of their time should be spent improving the service. Reliability is considered as important as adding new features to a service.
  • Precise promises: Service Level Objectives (SLOs) are taken very seriously, and a lot of investment is made to make them as accurate as possible. The payoff is immense, as expectations become aligned between dev teams (both the team building the service and the team consuming it).
  • Assuming best intentions: Google has a culture of blameless post-mortems. Blame is assigned to processes, not people. Post-mortems should address the work needed to implement system controls. Accountability is another conversation entirely.

Lesson #2: Changes to your system should be performed with a proper balance of urgency and diligence.

More incidents occur when developers implement new features for a system. Jason Hiltz-Laforge from Shopify asked some key questions reliability engineers should have the answers to:

Question 1: How fast can new features be implemented in a system?

Use a disaster matrix. See the images below.

The more green you have on your matrix, the faster you can grow your system. Where you have red, kick off projects to drive it to yellow or green.

You can also use availability testing, error budgets and outage reports as quantitative measures of how quickly new features can be implemented. Qualitative ways are simple too: talk to the teams consuming the service about how they feel about the SLOs.

Answer: New features should only be implemented as fast as the cost of failure lets you.
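Error budgets, mentioned above as one of the quantitative measures, can be made concrete with a little arithmetic. Here is a minimal sketch of the idea; the SLO and the numbers are illustrative, not from the talk:

```python
# Sketch: turning an availability SLO into an error budget.
# All numbers are illustrative.

SLO = 0.999                    # 99.9% availability target
PERIOD_MINUTES = 30 * 24 * 60  # a 30-day window = 43,200 minutes

# The error budget is the downtime the SLO allows per window.
budget_minutes = (1 - SLO) * PERIOD_MINUTES  # about 43.2 minutes

# Suppose incidents this window have already cost 20 minutes of downtime.
downtime_minutes = 20
remaining_minutes = budget_minutes - downtime_minutes

print(f"budget={budget_minutes:.1f} min, remaining={remaining_minutes:.1f} min")
```

When the remaining budget runs out, that is the quantitative signal to slow feature launches and spend the time on reliability instead.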

Question 2: When systems do blow up, how do you deal with the humans?

Look into patterns and repetitive issues. Perform (blameless) root cause analysis. Fix unique failures and ensure they never occur again. Explore technical debt: the clue here is looking at how much of the system’s code is rewritten over time.

Answer: Give humans control and safety. Reliability engineers would prefer not to be called about incidents when they have never had influence over how a system is designed.

Lesson #3: Consider a radical approach to reducing alert fatigue, a.k.a. suspend reliability engineering support until alert volumes are reduced. P.S. You will face resistance from dev teams initially.

Over-monitoring and alert fatigue are very prominent issues in the tech industry. A parallel can be drawn to the medical field as well: ECG monitor alarms are often perceived by doctors as false or clinically insignificant.

At Zynga, SRE was best-effort. Kishore Jalleda shared some tips on what we should NOT do:

  • more alerts raised = more reliability engineers added to the team (a waste of resources!)
  • filter noise by colour-coding alerts (you will soon run out of colours!)
  • create more tools to filter noise based on algorithms (a waste of investment that could instead be used to address root causes!)

The solution: saying “NO”. Denying dev teams reliability engineering support forces them to reduce alert volume and define alerts properly. Once volumes drop, reliability engineers are well placed to keep them low.

I found this interesting: a hierarchy of needs (from a reliability engineering team’s point of view):

Lesson #4: Data will never be useful unless you use it wisely.

After a system incident, questions about what caused it can be answered by collecting more data. What do we mean by data? Data is anything that gives you confidence about your system’s reliability; in this case: logs, metrics and testing.

Ingrid Epure shared a couple of thoughts on data paranoia. There is a danger of:

…having too much data and no structure. Imagine 18 million log lines being written to servers in 20 minutes. Solution: look into canonical logging tools.
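The canonical-logging idea is to emit one structured, searchable summary line per request instead of many scattered fragments. A hypothetical sketch of the pattern follows; the field names are made up for illustration, and this is not any specific tool:

```python
import json
import time

def handle_request(user_id: str, path: str) -> dict:
    """Handle a request while accumulating a single canonical log line."""
    canonical = {"user_id": user_id, "path": path}
    start = time.monotonic()

    # ... the actual work happens here, annotating the line as it goes ...
    canonical["cache_hit"] = False
    canonical["db_queries"] = 3
    canonical["status"] = 200

    canonical["duration_ms"] = round((time.monotonic() - start) * 1000, 2)
    # One structured line per request, instead of dozens of fragments:
    print(json.dumps(canonical))
    return canonical

handle_request("u123", "/checkout")
```

Because every request produces exactly one tagged line, you can answer questions like “how many slow checkouts hit the cache?” with a single query rather than stitching log fragments together.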

…acquiring data with no story. Engineers are unable to answer questions if data isn’t tagged or searchable. Solution: use metrics that help you answer questions, and present them in the right medium: gauges, counters, timers, histograms.
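To make those metric types concrete, here is a minimal sketch of the two simplest ones, counters and gauges (timers and histograms layer timing and distributions on top of the same idea). A real system would use a metrics library rather than hand-rolled classes like these:

```python
class Counter:
    """Monotonically increasing value, e.g. total requests served."""
    def __init__(self) -> None:
        self.value = 0

    def inc(self, amount: int = 1) -> None:
        self.value += amount

class Gauge:
    """Point-in-time value that can go up or down, e.g. queue depth."""
    def __init__(self) -> None:
        self.value = 0

    def set(self, value: float) -> None:
        self.value = value

requests_total = Counter()
queue_depth = Gauge()
requests_total.inc()
requests_total.inc(4)
queue_depth.set(7)
print(requests_total.value, queue_depth.value)  # prints: 5 7
```

The choice of type answers a question: a counter tells you a rate (“how many per minute?”), while a gauge tells you a state (“how deep is the queue right now?”).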

…unnecessary visualisations. How do you feel when you look at hundreds of dashboards? If graphs provoke zero sentiment, you have a problem. Solution: use load tests to derive your thresholds and SLOs.

My favourite slide from her presentation :)

Lesson #5: Make sure your users are happy by using practical quantitative and qualitative tools.

Perry Statham shared a different perspective on how engineers should approach system reliability: by keeping users in mind. He explained the concept of outside-in system reliability engineering, where the user’s perspective is prioritised.

Outside-in tools:

  1. Personas: models of system users/stakeholders, including their skills and goals
  2. Scenarios: a goal told as a user story/how a user interacts with a system

Ways to measure user happiness:

  1. Time: how quickly can users achieve their goal? E.g. what is the UI response time for a user signing into our platform?
  2. Scenario divergence: how far has the user diverged from the expected scenario? Has the user done extra interactions with the system and given up along the way?
  3. Collecting feedback: thumbs up/thumbs down buttons, 0–5 star ratings, “how was your experience?” tooltips
  4. Talking to people: speak to users, product managers, sales people, social media. Note the disadvantages though: humans are naturally biased, so you need to filter out what is useful.

Also bear in mind that unhappy users tend to provide more unsolicited comments. As they say, “the squeaky wheel gets the grease”. So if negative feedback is all you read, it will skew your perception of your system’s reliability.

Summary of day 1 done! Check back on my Medium page for updates on Day 2 and Day 3.

To learn more about SREcon click here.

If you enjoyed reading this, please give me some applause by clicking the applause button below. This will allow other people that didn’t get a chance to attend SREcon to find this post on Medium :)


Simi Awokoya

Software Engineer turned Cloud Solution Architect. Founder of Witty Careers (www.wittycareers.org). Forbes 30 Under 30 Technology 2019