Why we skipped SRE and switched to Platform Engineering

Kristina Kondrashevich
5 min read · Aug 13, 2024


Imagine this: You’re an SRE, drowning in a sea of direct messages. Requests for new databases, fresh regions, even the occasional “How do I access this tool?”, “Why did this build fail?”, “Can I have prod access?” (Spoiler alert: NO! But read till the end.) Any production issues? SREs swoop in, fix them, then vanish. Developers? Clueless.

It worked, for a small team.

Here’s the deal: developers wrote applications, SREs ran the infrastructure. We monitored dashboards, created cloud resources and pipelines, and improved infrastructure resilience… a cryptic language only SREs spoke. It functioned… ish. But we grew. The number of developers exploded (20x!). SREs? A measly 3x. We saw the looming storm: overwhelmed SREs and bottlenecked development.

Each request involved requirement gathering, resource creation, testing, and bug fixing — a time-consuming process. Limited team capacity meant juggling tasks wasn’t an option.

Like many people who work in infrastructure, we asked ourselves whether this was our future.
But what would we love our future to look like? The SRE team could take a permanent vacation, and the development teams would keep the production house humming. Developers would be empowered not only to maintain production but also to launch new features, all without relying on overburdened SREs.

This story details our journey to achieving that future, but with a twist: the SRE team wouldn’t become obsolete. Instead, we’d focus on building open-source tools to empower developers further.

But first, let’s go back to where it all started.

What is SRE in our org?

We all know Google’s excellent book on SRE practices. The question was: how could we translate those practices into reality for our company? We needed to figure out what problems we were aiming to solve first.

Our journey can be described in three steps:

Step 1: Moving to SaaS

As a small team, we made a strategic decision: move to SaaS wherever we could.

You can read this article on how we started to share responsibility for monitoring and troubleshooting between the Platform team and the dev teams, supported by a new observability platform.

We were using Jenkins, configured by a consultancy company, with a single pipeline for over 40 services. Every change request from a dev team meant the Platform team had to modify the pipeline code. We chose a CI/CD platform and provided our dev teams with:

  • Shared Python, Java, and Lambda serverless packages for building their applications
  • Delivery of images to ECR, since we use AWS
  • A package for deploying all services to K8s
  • Self-hosted runners, to be more secure and save costs by running most builds in our own infrastructure

After providing these, developers learned how CI/CD works and started to customize and maintain their pipelines.

This shift to SaaS solutions empowered dev teams to take ownership of their pipelines and troubleshoot their infrastructure, freeing up the Platform team and accelerating development.

Step 2: Automating the Most Common Requests

With the free time we gained, we started automating things developers asked for the most:

  1. Infrastructure provisioning
  2. Service onboarding
  3. User onboarding/offboarding (granting/revoking access)

We used Backstage as a single platform for our automation and implemented these as plugins. However, we realized it wasn’t easy for developers to navigate between all the tools while also adapting to new plugins.

We decided to create an Internal Developer Platform (IDP) as a single entry point for all the things dev teams need. Leveraging the Scaffolder Backstage plugin, we wrote our own backend services and some UI, covering major SDLC steps.

We created templates for developers to create cloud resources, starting with AWS resources such as EKS, Kafka, and Redis, and later adding MongoDB, Vault, and more, eventually supporting more than 60 resources.
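To make the template idea concrete, here is a minimal sketch of how a self-service request could be checked against a template before anything is provisioned. It is not our actual IDP code: the `ResourceTemplate` class, the `CATALOG` entries, and every parameter name (`team`, `environment`, `node_type`, `partitions`) are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ResourceTemplate:
    """Hypothetical self-service template for one kind of cloud resource."""

    name: str
    required: frozenset
    defaults: dict = field(default_factory=dict)

    def render(self, params: dict) -> dict:
        """Validate developer-supplied parameters and merge in defaults."""
        supplied = set(params)
        missing = set(self.required) - supplied
        if missing:
            raise ValueError(f"{self.name}: missing parameters {sorted(missing)}")
        unknown = supplied - (set(self.required) | set(self.defaults))
        if unknown:
            raise ValueError(f"{self.name}: unknown parameters {sorted(unknown)}")
        # Defaults first, so explicit parameters override them.
        return {"template": self.name, **self.defaults, **params}


# Illustrative catalog; a real platform would hold one entry per supported resource.
CATALOG = {
    "redis": ResourceTemplate(
        name="redis",
        required=frozenset({"team", "environment"}),
        defaults={"node_type": "small"},
    ),
    "kafka": ResourceTemplate(
        name="kafka",
        required=frozenset({"team", "environment", "partitions"}),
    ),
}


def request_resource(kind: str, params: dict) -> dict:
    """Look up a template by kind and return the validated resource spec."""
    if kind not in CATALOG:
        raise KeyError(f"no template for resource kind {kind!r}")
    return CATALOG[kind].render(params)
```

Calling `request_resource("redis", {"team": "payments", "environment": "staging"})` returns a spec with the default `node_type` filled in, while a request that omits `environment` is rejected before it ever reaches the cloud provider. Validating up front is what lets developers self-serve without a Platform engineer reviewing each request.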

For service onboarding, we added repo creation, secret management, container registry setup, CI/CD pipeline creation, and more.
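The onboarding flow can be pictured as an ordered list of steps run one after another. The sketch below is an assumption about the shape of such a flow, not our implementation: the step functions are placeholders that only return a label, and names such as `onboard_service` and `ONBOARDING_STEPS` are hypothetical.

```python
from typing import Callable, List, Tuple


# Placeholder steps: each would call the relevant tool's API in a real platform.
def create_repo(service: str) -> str:
    return f"repo:{service}"


def create_secret_store(service: str) -> str:
    return f"secrets:{service}"


def create_registry(service: str) -> str:
    return f"registry:{service}"


def create_pipeline(service: str) -> str:
    return f"pipeline:{service}"


# Order matters: the pipeline needs the repo and registry to exist first.
ONBOARDING_STEPS: List[Tuple[str, Callable[[str], str]]] = [
    ("repository", create_repo),
    ("secret management", create_secret_store),
    ("container registry", create_registry),
    ("CI/CD pipeline", create_pipeline),
]


def onboard_service(service: str) -> List[str]:
    """Run every onboarding step in order and return what was created."""
    created: List[str] = []
    for label, step in ONBOARDING_STEPS:
        try:
            created.append(step(service))
        except Exception as exc:
            # Report how far we got so a partial onboarding can be cleaned up.
            raise RuntimeError(
                f"onboarding {service!r} failed at {label}; created so far: {created}"
            ) from exc
    return created
```

Keeping the steps in one declarative list means a new step (for example, dashboards) is added in a single place, and a failure reports exactly which resources already exist.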

By building an Internal Developer Platform, we created a self-service environment where developers can easily manage their infrastructure and services, significantly boosting efficiency and autonomy.

Step 3: Rebuilding IDP as an Open-Source Product

We realized that dev teams sometimes needed resources that our IDP wasn’t designed to handle. So we started to rebuild our IDP as an open-source product named InfraKitchen. Read more via the link.
The first small part of the IDP was recently released as a separate product, InfraWallet: https://github.com/electrolux-oss/infrawallet

As our organization grew, we faced challenges in educating and training developers due to our small team size. Instead of scaling up training, we provided dev teams with platforms that helped them adopt SRE practices independently.

This image outlines the core practices of Site Reliability Engineering (SRE) and their current status within our organization. The color coding in the image indicates the current status of each practice:

  • Green: Implemented and actively used
  • Blue: Ongoing work and development
  • Yellow: Planned for implementation in the next quarters

We believe that implementing SRE principles through a Platform Engineering approach is significantly more efficient, as it empowers developers to self-serve infrastructure needs, freeing up SRE teams to focus on strategic initiatives and tool development.

From SRE to Platform Engineering

This evolution marked our transition from traditional SRE to Platform Engineering. Electrolux is adopting this approach, handling exceptions where developers need to go beyond the platform. Platform Engineering has brought us benefits like increased productivity, better visibility into infrastructure usage, and faster time to market.

Our journey from overwhelmed SREs to a future where developers and SREs thrive together — powered by automation and collaboration — highlights the power of self-service models and the importance of empowering developers.

This is our story. How will you write yours?

P.S. We found a way to give developers full production access to their infrastructure via our IDP. Want to learn how we achieved it without compromising security or stability? Drop a comment below!
