Bootstrapping SRE, a Product Engineer’s Perspective

Brett Nelson
Published in Tech @ Earnin
4 min read · Nov 16, 2020

As your company has grown and its processes have matured, you’ve likely started thinking about building a site reliability engineering (SRE) team. A great resource to start learning about SRE is Google’s book, Site Reliability Engineering. Earnin followed this process in our pursuit of continuous improvement when we created our SRE team in 2019. We took a first principles approach, looking at the biggest threats to our reliability and stability. As the lead of the product engineering team that bootstrapped this initiative, I’ll offer my perspective on what we did to make it successful.

Earnin’s journey to SRE started with a team we called product operations — prod-ops (the naming doesn’t really matter, it’s the concept and charter that’s important).

Earnin had hired a director of SRE, but not a team. My team was tasked with filling the gap: giving the future SRE team the space to implement its charter, providing engineering muscle, and sharing our tribal knowledge.

The prod-ops team’s charter was to bootstrap the SRE function. We went into the trenches to find and rank issues that were impacting Earnin’s service stability and reliability. We prioritized things that were impacting our Community Members and draining precious Engineering time with issues stemming from a change in production. The immediate issues we identified for stability and reliability were:

  • Reliable deploys
  • Post mortems
  • Interviewing SREs

Reliable deploys

Challenge

We had started down the path towards microservices. Our new microservice architecture had deployments that provided health checks and rollbacks. Unfortunately, our legacy code (or, as a friend of mine at another company calls it, Revenue Code) hadn’t kept pace. Deploys of our monolith application had been left to their own devices. Deployment was far from a push-button task, and only a few, mostly unwilling, souls had access to run it.

Goal

The team’s immediate task was to modernize the deployment of the client-facing monolith app, ensuring that it was deployed at least once a day and in an observable manner.

Result

The team delivered a reliable deployment solution. The deployment process included canary deployments, zero-downtime deploys, and easy rollbacks. The deployment was also instrumented so that whoever deployed could gauge whether it had succeeded.
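This post doesn’t go into how our canary deployments were built, so the following is only a minimal sketch of the general idea: send a small share of traffic to the new version, compare its error rate with the current version, and then promote or roll back. Every name and threshold here (Metrics, canary_decision, max_ratio, min_requests, abs_floor) is a hypothetical chosen for illustration, not Earnin’s actual tooling.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    """Request counters for one version of the app (hypothetical shape)."""
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        # Treat "no traffic yet" as a zero error rate to avoid dividing by zero.
        return self.errors / self.requests if self.requests else 0.0


def canary_decision(
    baseline: Metrics,
    canary: Metrics,
    max_ratio: float = 2.0,    # assumed: canary may be at most 2x the baseline error rate
    min_requests: int = 100,   # assumed: minimum canary traffic before judging
    abs_floor: float = 0.01,   # assumed: always tolerate up to a 1% error rate
) -> str:
    """Return 'wait', 'promote', or 'rollback' for a canary deployment."""
    if canary.requests < min_requests:
        # Not enough canary traffic to make a meaningful comparison.
        return "wait"
    allowed = max(baseline.error_rate * max_ratio, abs_floor)
    return "promote" if canary.error_rate <= allowed else "rollback"


if __name__ == "__main__":
    baseline = Metrics(requests=10_000, errors=50)
    print(canary_decision(baseline, Metrics(requests=500, errors=3)))
    print(canary_decision(baseline, Metrics(requests=500, errors=40)))
```

A real pipeline would usually look at more than error rate (latency and health checks, for example), but even a simple gate like this gives the person deploying a clear signal on whether the deploy is succeeding.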

Post Mortems

Challenge

Teams had started learning from past incidents but this knowledge was siloed. We knew that Post Mortems provided a way to root cause issues and an opportunity to learn how to prevent repeating the same mistake. Reporting a problem and raising an incident needs to be something that is celebrated.

Goal

This brings up the second part of post mortems: a move to a blameless culture. A highly effective environment needs to embrace this. Each incident is a learning opportunity, and the learnings need to be captured, publicized, and used to bring all teams forward. Management and all teams must embrace this shift, and the dividends your company will see are tremendous. As an organization we have seen an increase in lower severity incidents, most of which are now detected by the responsible team. Catching these lower severity incidents early helps prevent high severity issues.

Result

As an organization we implemented a blameless post mortem culture. The SRE handbook has a good overview of how it works. All high severity incidents had post mortems. Teams weren’t assigning blame, but were using each incident as a learning opportunity, with action items to improve the stability of the system. As we continued down this path, the number of high severity incidents decreased. Monitoring increased, and ultimately so did ownership of the systems and of the monitoring itself. The number of incidents identified by the responsible team increased, while their impact and duration decreased.
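If you want to track the same trends, the numbers above can be computed from a simple incident log. The sketch below is one possible way to do it, not the tooling we used: the Incident fields, the summarize function, and the convention that severity 1 is the most severe are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    """One row of a hypothetical incident log."""
    severity: int        # assumed convention: 1 is the most severe
    owning_team: str     # team responsible for the affected system
    detected_by: str     # team (or source) that first raised the incident
    started: datetime
    resolved: datetime


def summarize(incidents, high_severity_max=2):
    """Summarize one reporting period of incidents.

    Returns the count of high severity incidents, the share detected by
    the owning team, and the mean time from start to resolution.
    """
    total = len(incidents)
    if total == 0:
        return {
            "high_severity": 0,
            "self_detected_share": 0.0,
            "mean_duration": timedelta(0),
        }
    high = sum(1 for i in incidents if i.severity <= high_severity_max)
    self_detected = sum(1 for i in incidents if i.detected_by == i.owning_team)
    total_duration = sum((i.resolved - i.started for i in incidents), timedelta(0))
    return {
        "high_severity": high,
        "self_detected_share": self_detected / total,
        "mean_duration": total_duration / total,
    }
```

Comparing these summaries from one quarter to the next shows whether high severity incidents are falling and whether more incidents are being caught by the teams that own the systems.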

Interviewing SREs

Challenge

Hire SREs, and fast!

Goal

We needed to stand up a new team of SREs. We had a director of SRE, but no one to act on the team’s charter. We needed to hire and then transfer the tribal knowledge of our systems to the new team.

Result

Leveraging the collective networks of our organization, we landed some outstanding SREs who were experienced in the field. Leverage your network!

Overall Results

As we shifted towards having our own SRE team, we also embraced some cultural changes within the organization, such as the blameless post mortem culture. This resulted in better quality code as well as higher levels of ownership by teams over their systems. Fewer incidents freed engineering time to focus on business objectives. Integrating an SRE team into our organization has improved the reliability of our systems. When we do have an incident (yes, you’ll still have incidents even with SRE), the SRE team helps ensure the response runs smoothly, coordinates the resources needed to mitigate the issue, and makes sure there is a plan to carry out the fix.

Our journey and the three areas above aren’t comprehensive. As product engineering, we tackled the immediate problems, giving the SRE team cover to learn, grow, and implement. When your company begins implementing an SRE culture, take a first principles approach. Your organization will need to determine its own top priorities to build an effective SRE team. Tackle the biggest problems for your systems’ reliability and stability.

Interested in being a part of a collaborative engineering culture? Come join us at Earnin.
