The Boozt Way of SRE

Gastón Rial Saibene
Published in Boozt Tech
8 min read · Jul 11, 2024

In the world of Site Reliability Engineering (SRE), organizations constantly strive to balance innovation with stability. Our journey as SREs at Boozt, the largest e-commerce company in the Nordic region, offers a unique perspective on this balancing act. Boozt.com has developed a distinctive approach to SRE that is both pragmatic and effective, allowing it to meet business goals without compromising customer experience.

Introduction: The Context

Initially, we operated in a centralized fashion, handling everything related to infrastructure management, software operations, and incident response. This method, while effective at first, proved unsustainable as the company grew. Consequently, our team evolved its processes, adopting tools and practices focused on reliability (such as production readiness reviews and postmortems) while maintaining business continuity and supporting other teams. This article shares the lessons learned from building and evolving Boozt’s SRE team to be both effective and successful.

The evolution of the SRE processes at Boozt

The Foundation: Minimizing Toil

Boozt’s SRE team prioritizes minimizing repetitive, manual tasks to improve efficiency and reduce human error. Handling toil efficiently and focusing on impactful, meaningful reliability work keeps us motivated and engaged with the company’s vision, which helps prevent burnout and other hazards to our well-being. Our team is diverse, with backgrounds ranging from systems administration to network engineering, and that diversity has proved valuable for building the automation these goals require. Nevertheless, as the company grows and the footprint of our work increases, it is easy to feel overwhelmed and lose sight of how to scale our efforts while keeping the SRE headcount flat or growing it only sublinearly. Collaboration with other teams is therefore fundamental to keeping our scope from drifting away from ensuring the continuity of business-critical services.

To allow us to concentrate on our core responsibilities while still supporting operational tasks, Boozt implemented a new process called “Resource Management Requests” (RMRs). This system enqueues infrastructure operations and cloud resource allocations, preventing these tasks from overwhelming our schedule. By streamlining these processes, the SRE team can better manage its workload and focus on enhancing system reliability. Currently, the responsibility for working on RMRs extends beyond our team: more people within the engineering organization have agreed to take part on a weekly basis, helping to fulfill developers’ infrastructure provisioning and management needs.
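The queueing idea behind RMRs can be sketched in a few lines. This is only an illustration under stated assumptions: the class names, fields, and the week-based rotation rule below are hypothetical and do not describe Boozt’s actual tooling.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, List, Optional


@dataclass
class ResourceRequest:
    """One hypothetical RMR: who asked, and what infrastructure change they need."""
    requester: str
    summary: str
    assignee: Optional[str] = None


@dataclass
class RmrQueue:
    """FIFO queue of requests, worked by whoever is on rotation this week."""
    rotation: List[str]  # engineers who agreed to take RMR weeks (assumed names)
    pending: Deque[ResourceRequest] = field(default_factory=deque)

    def submit(self, request: ResourceRequest) -> None:
        # New requests wait in line instead of interrupting anyone.
        self.pending.append(request)

    def on_duty(self, iso_week: int) -> str:
        # Assumed rule: rotate through the volunteers, one per calendar week.
        return self.rotation[iso_week % len(self.rotation)]

    def take_next(self, iso_week: int) -> Optional[ResourceRequest]:
        # Hand the oldest request to this week's assignee, if one is waiting.
        if not self.pending:
            return None
        request = self.pending.popleft()
        request.assignee = self.on_duty(iso_week)
        return request
```

The point of the sketch is the shape of the process: requests are absorbed by a queue, and the person draining it changes weekly, so no single engineer’s schedule is dominated by provisioning work.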

The SRE Bible: Google’s Influence

Google’s SRE practices significantly influenced Boozt’s approach, providing a robust framework for designing reliable systems. A key principle from Google’s SRE book is the use of Service Level Objectives (SLOs) as benchmarks for system performance and reliability. However, while SLOs are essential, they must be balanced with the need for agility and innovation. Although our values were inspired by Google’s guidelines, we recognize that smaller companies cannot adhere to SLOs as rigidly as tech giants do.

For tech giants, key products that have established market dominance must remain as reliable as possible so that consumers continue to trust that the service is almost 100% available. For Boozt and other small to medium-sized companies, however, the priority is to roll out new features quickly to stay competitive, even if it means occasionally relaxing strict SLO benchmarks. This flexible approach enables Boozt to innovate rapidly and enhance our webshop customers’ experience while maintaining a reasonable level of reliability.
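To make the trade-off concrete, it helps to translate an availability SLO into the downtime it tolerates, often called the error budget. The sketch below is plain arithmetic; the function names are hypothetical and the SLO values in the usage note are illustrative, not Boozt’s real targets.

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of downtime an availability SLO tolerates over the window."""
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be in (0, 100]")
    window_minutes = window_days * 24 * 60
    return window_minutes * (100 - slo_percent) / 100


def budget_remaining(slo_percent: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Fraction of the budget still unspent; a negative value means the SLO was missed."""
    budget = error_budget_minutes(slo_percent, window_days)
    if budget == 0:
        raise ValueError("a 100% SLO leaves no error budget")
    return 1 - downtime_minutes / budget
```

For a 30-day window, a 99.9% target allows roughly 43 minutes of downtime, while 99.99% allows a little over 4 minutes. Relaxing a target by one “nine” therefore buys about ten times more room for risky releases, which is the kind of flexibility described above.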

The path to success: Knowledge sharing

To share our reliability engineering capabilities, we engage with the broader platform organization through a structured model. As shown in the picture, we ask developers to engage with us early when starting a greenfield project so that they can build reliable systems from the ground up and reduce future technical debt. Resource Management Requests (RMRs) should be submitted when unplanned assistance is needed to manage infrastructure. If an incident is ongoing, the on-duty engineer must be informed immediately; an automatic Slack message or a page notifies them. Last but not least, anyone with a question for SRE can reach us in our team’s Slack channel.

Knowledge sharing is also facilitated through the “Reliability Tech Chapter”, a bi-weekly meeting where developers interested in improving the company’s reliability posture can learn about and discuss tools and practices. This collaborative approach ensures that the principles of reliability engineering permeate the entire organization. Events such as this one encourage communication between our team and the other teams, which is of utmost importance to ensure that reliability goals aren’t pushed aside by work those teams might consider more relevant. It helps strike a balance and strengthens the commitment of the Boozt developer organization to guaranteeing the best possible customer experience, without compromising the ability to be agile and deliver the new features that the business deems necessary.

During the tech chapter meetings, the members with the longest tenure in the SRE team, such as staff engineers, present potential improvements to the critical user journeys (CUJs) of the systems behind services like the webshop. We also discuss recent outages and walk through the postmortems, with the aim of learning how things might have been handled better and agreeing on a resolution plan that involves every relevant stakeholder. These meetings are where we seek alignment with the leads of the different development teams.

Incident Response: Tailoring Practices to Fit

Our team’s experience highlights the importance of adapting SRE practices to the specific needs and constraints of the organization. While best practices provide a valuable starting point, they must be tailored to fit the unique context of each company. For Boozt, this means focusing on automation, proactive incident detection, and rapid response, without becoming bogged down by the pursuit of perfect SLOs. If we did the opposite, the company might halt innovation for the sake of reliability goals that yield only marginal returns compared with what new features could bring to both performance and the end-customer experience. A balanced approach to reliability must therefore be found by enabling conversations between our team and developers.

The Boozt Platform’s incident response process is designed to manage the complexities of both internal- and external-facing systems by providing a clear and evolving set of guidelines for handling reliability and security incidents. Reliability incidents are disruptions in system availability or performance due to technical issues, while security incidents involve unauthorized access or breaches affecting information integrity, confidentiality, or availability. Effective communication is vital, with incident discussions centralized in designated Slack channels and specific protocols for escalating unresponsive situations.

When an incident occurs, the initial response involves checking the status on the outage channel, acknowledging any alerts, and declaring the incident in the reporting platform, if that has not already been done. The triage process includes determining the impact, assessing the risks of corrective actions, and isolating the cause using telemetry tools. Communication should be active and blameless, with all steps documented in a comprehensive event log, especially for security incidents, which may involve regulatory communication.
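The event log mentioned above can be pictured as an append-only list of timestamped notes that is later replayed as a timeline. This is a minimal sketch under stated assumptions: the class and field names are hypothetical, and a real setup would more likely rely on the incident reporting platform and Slack history.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass(frozen=True)
class LogEntry:
    """One immutable note: when it happened, who wrote it, what was done."""
    at: datetime
    author: str
    note: str


@dataclass
class IncidentLog:
    title: str
    kind: str  # "reliability" or "security", as distinguished in the text
    entries: List[LogEntry] = field(default_factory=list)

    def record(self, author: str, note: str,
               at: Optional[datetime] = None) -> LogEntry:
        # Default to the current UTC time so entries from different people compare cleanly.
        entry = LogEntry(at or datetime.now(timezone.utc), author, note)
        self.entries.append(entry)
        return entry

    def timeline(self) -> List[str]:
        # Sort by timestamp, since notes are not always written in the order things happened.
        ordered = sorted(self.entries, key=lambda e: e.at)
        return [f"{e.at:%H:%M} {e.author}: {e.note}" for e in ordered]
```

Keeping entries immutable and sorting only when the timeline is read means late additions never rewrite history, which makes the log usable as the basis for the postmortem timeline.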

Corrective actions are executed based on team collaboration, with updates posted in Slack and escalation steps outlined if needed. These updates are visible to everyone in the organization, which makes the whole process transparent to every single business unit.

After stabilization, a postmortem is generated to analyze the incident thoroughly, ensuring timelines and root causes are clearly documented. High-priority action items are addressed within 24 hours to prevent recurrence, maintaining the platform’s reliability and security.
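The 24-hour rule for high-priority action items lends itself to a simple check. The sketch below assumes that the clock starts when the item is opened in the postmortem, which the text does not specify; the names are hypothetical as well.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List

# Assumed interpretation: 24 hours counted from when the action item is opened.
HIGH_PRIORITY_WINDOW = timedelta(hours=24)


@dataclass
class ActionItem:
    description: str
    high_priority: bool
    opened_at: datetime
    done: bool = False


def overdue(items: List[ActionItem], now: datetime) -> List[ActionItem]:
    """Return open high-priority items that are older than the 24-hour window."""
    return [
        item for item in items
        if item.high_priority
        and not item.done
        and now - item.opened_at > HIGH_PRIORITY_WINDOW
    ]
```

A check like this could feed a reminder in the team channel, so that follow-up work from a postmortem does not quietly slip once the incident itself is resolved.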

Looking Ahead: Continuous Improvement

Despite our heavy reliance on automation, we acknowledge the importance of the human element in SRE. A diverse skill set within the team is of utmost importance: it blends technical prowess with the practical problem-solving mindset that only the right people bring.

In an environment where being on duty and minimizing downtime are critical for business continuity, we cannot overstate that people are the cornerstone of our SRE efforts. Each minute our website is down can significantly impact our business, so having a dedicated, skilled team is paramount.

At Boozt, we place immense value on the dedication and mindset of our team members. Our team is not just a group of technical experts; they are problem-solvers, innovators, and collaborators who understand the unique pressures and demands of our business. The right mindset is crucial — we need individuals who can remain calm under pressure, think on their feet, and make swift, effective decisions during high-stakes situations.

Last but not least, collaboration is at the heart of our approach: by fostering a culture of open communication and teamwork, we ensure that everyone is aligned and working towards the common goal of maintaining reliability. This is the path we have taken and the one that we will continue to improve on.

Conclusion: The Boozt Way

The Boozt way of SRE is a testament to the ongoing evolution of the field. By continuously refining our practices and embracing flexibility, Boozt demonstrates that companies of all sizes in any business domain can achieve a high level of reliability without sacrificing innovation. Our team’s leadership and pragmatic approach ensure that Boozt.com remains resilient in the face of challenges, poised to adapt and thrive in a competitive market.

It is our belief that our experience in building the team and setting up standard practices and processes provides valuable lessons for other organizations navigating the complex landscape of reliability engineering. By prioritizing automation, breaking silos, maintaining flexibility, and tailoring practices to its specific context, Boozt.com successfully balances stability with rapid innovation. Our journey from systems administrators to SRE experts encapsulates the dynamic and evolving nature of this critical field, providing a roadmap for others seeking to enhance their own SRE practices.

About the author:

Gaston Rial Saibene, SRE Team Lead at Boozt

Boozt: Website, Career page, LinkedIn
