Highlights from DevOpsDays Detroit 2023

Criteo R&D
Published in Criteo Tech Blog
7 min read · Nov 24, 2023

Article by Clark Peters and Dan Vukelich

DevOpsDays Detroit took place in the heart of Detroit, a city where engineering plays a critical role in the community and its surrounding areas. This conference went beyond the mere discussion of tools and explored the profound cultural impact that DevOps brings to organizations. Our team had the opportunity to meet experts on diverse topics, share knowledge, and learn from other developers.

Takeaways from Criteos

Two Site Reliability Engineers took notes on the latest trends and best practices in software development. We’re excited to share their takeaways from DevOpsDays Detroit 2023!

Photo by Aaron Burden on Unsplash

Dan’s notes

DevOpsDays Detroit was an excellent opportunity to connect with Southeast Michigan’s DevOps and SRE communities. While business goals and product requirements differ between companies, ops- and infra-focused teams all need to solve the same sorts of problems to empower their dev teams to produce reliable software on time. While many speakers presented on systems, tooling, and policies that improve outcomes for deliverables, Red Hat engineer Michael Shen presented a case study in reducing a different sort of outcome: SRE team burnout.

Michael’s talk description

When Red Hat began offering Kubernetes clusters as a managed service, they set very rigid SLOs. From a product standpoint, this was to ensure that customers would be confident in Red Hat’s management of their cluster. From an SRE standpoint, tight SLOs with strict alerting would put their teams in a strong position to triage and resolve any incidents that came up. Indeed, as the product grew, these guarantees were well received by customers and managed well enough by teams. This wouldn’t last, and Michael’s team eventually realized that as their customer base had grown, so too had the number of interrupting SLO alerts. Team members had to deal with an average of 7 alerts per hour. This means that roughly every 8 to 9 minutes, a team member had to drop what they were doing to investigate a potential SLO breach, contributing to burnout and lost productivity.

Michael characterized this as a “boiling frog” situation: customer growth over several years had slowly “turned up the heat”. He laid out the contributing factors:

  • Many alerts weren’t actionable. While the circumstances around the alerts could be indicative of service degradation, the majority of alerts could be closed after looking at the logs and seeing that they were routine operations. As a result, the team was physically able to manage 7 alerts per hour, but the mental overhead built up and led to team burnout.
  • Attempts to modify SLOs to be more manageable were met with resistance. Business teams worried that customers would balk if SLOs were loosened, perceiving more permissive SLOs as synonymous with degradation of the product itself. Because the excessive alerting only affected the SRE team and nobody else, perception in the wider company was that the managed cluster service was doing just fine.

After recognizing the problem, the SRE team took a three-pronged approach to improve their situation:

  1. Recognizing and removing impossible SLOs: This change was entirely within the team’s power to fix. They evaluated their alerts and realized that in many cases, the metrics were incapable of discriminating between actual issues and simple misconfiguration.
  2. Aligning SLOs between business and SRE teams: This ties back to the “hard sell” of loosening SLOs. Michael admitted that it’s very much a work in progress, but his team is working with business teams to have everyone working off the same SLOs. The goal is to increase visibility of the very tight SLOs his team has to manage, thus dismantling the perception that everything is fine when it’s certainly not.
  3. Recognition that customer experience is tied to SRE team health: Michael made the point that a customer’s experience depends directly on the ability of an SRE team to respond to incidents. When a team is overloaded with work, it doesn’t matter how ironclad an SLO is; the team won’t be able to keep up with incidents. To this end, Michael’s team developed a “mental health” SLO for the team, aiming to bring the alert rate down to a more reasonable 3 per hour.
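
To put the two alert rates from the talk side by side, here is a minimal Python sketch of the arithmetic. The class and function names (AlertLoad, meets_team_health_slo) are hypothetical and chosen only for illustration; the talk summary does not describe how Michael’s team actually tracks its “mental health” SLO, so treat the check below as an assumption about one simple way to express it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertLoad:
    """Average number of interrupting alerts a team handles per hour."""

    alerts_per_hour: float

    @property
    def minutes_between_alerts(self) -> float:
        # Average gap between interruptions, in minutes.
        return 60.0 / self.alerts_per_hour


def meets_team_health_slo(observed_per_hour: float, target_per_hour: float = 3.0) -> bool:
    """Hypothetical check: is the observed alert rate at or below the target?

    The default target of 3 per hour is the figure quoted in the talk;
    the comparison itself is an illustrative assumption.
    """
    return observed_per_hour <= target_per_hour


if __name__ == "__main__":
    for rate in (7.0, 3.0):
        load = AlertLoad(rate)
        print(
            f"{rate:.0f} alerts/hour -> one interruption every "
            f"{load.minutes_between_alerts:.1f} minutes, "
            f"meets target: {meets_team_health_slo(rate)}"
        )
```

At 7 alerts per hour the average gap works out to about 8.6 minutes; at the target of 3 per hour it stretches to 20 minutes, which leaves noticeably more room for uninterrupted work.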

For me, the most interesting takeaway was Michael’s comparison between team health and customer experience. In retrospect, the relationship is clear, but as an SRE it’s not always easy to recognize when to ease up. As Michael pointed out, getting buy-in from business teams is important to managing changes to SLOs. I’ll be interested to see if he presents next year with a talk on how his team’s SLO burnout has improved.

Clark’s highlights

Upon arrival, the energy among the staff who greeted us was apparent, as was the uplifting feeling of being around the local DevOps community. That excitement rejuvenates the engineering spirit: you connect with others who work in a similar field but bring a fresh perspective on how to solve problems. Hearing the ideas and thoughts of others encourages thinking outside the box and expands the range of possible solutions in one’s normal area of expertise. Engaging in the community is a great way to learn and grow in different ways.

The first talk of the day wasn’t about the newest tech stack or the best way to implement software development processes. Instead, it was about how we coexist as humans and the culture we create in our workspace. “Celebrating Diversity and The Path Ahead” by Ell Marquez was a great way to begin a conference because it focused on inclusion and on how our actions in the workplace can make people feel safe and comfortable.

Ell’s talk description
Ell on stage — https://twitter.com/devopsdaysdet/status/1714644282659733939

The talk discussed a few ways to build a culture of inclusion that fosters progress:

  • Treat people as people.
  • Get to know people. See them for who they are.
  • Take time to teach someone instead of giving them the solution.
  • Become a mentor, give back, and take action.
  • Find ways to empower people to help themselves.

After hearing Ell share some of her story and talk about how we can all take ownership of improving the culture around us and better support diversity, I felt more comfortable and safe interacting with others at the conference. Later on in the conference, we split up into breakout rooms to discuss different topics, and the conversations went really well. I felt like discussing human issues was a great way to kick off a tech conference, and it helped to foster open communication where people shared and learned from each other.

As for tech-related talks, “Introducing DORA Core: Durable research insights for technical practitioners” by Dave Stanke was really interesting. The talk discussed how metrics can be used to predict the success of our software.

Dave’s talk description

The DORA researchers have spent years collecting data from thousands of companies and boiled it down into an equation of influence of sorts. Essentially, the metrics that measure our capabilities can predict our performance, and the metrics of our performance can predict the outcomes of our goals. The interactive diagram on their website (https://dora.dev/research/) is really interesting to me because it shows how we can monitor parts of our systems, expose bottlenecks, and find out the impact this has on our performance and goals.
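
To make the idea of performance metrics a little more concrete, here is a minimal Python sketch that computes two software delivery measures commonly associated with DORA’s research: deployment frequency and change failure rate. The notes above do not say which metrics the talk focused on, so the choice of these two is an assumption, and the Deployment record, the function names, and the sample data are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Sequence


@dataclass(frozen=True)
class Deployment:
    """Hypothetical record of one production deployment."""

    day: date
    caused_failure: bool  # True if the change led to a degradation or rollback


def deployment_frequency(deploys: Sequence[Deployment], period_days: int) -> float:
    """Average number of deployments per day over the observed period."""
    if period_days <= 0:
        raise ValueError("period_days must be positive")
    return len(deploys) / period_days


def change_failure_rate(deploys: Sequence[Deployment]) -> float:
    """Fraction of deployments that caused a failure (0.0 if there were none)."""
    if not deploys:
        return 0.0
    return sum(d.caused_failure for d in deploys) / len(deploys)


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    sample = [
        Deployment(date(2023, 10, 2), caused_failure=False),
        Deployment(date(2023, 10, 4), caused_failure=True),
        Deployment(date(2023, 10, 6), caused_failure=False),
        Deployment(date(2023, 10, 9), caused_failure=False),
    ]
    print("deployments per day:", deployment_frequency(sample, period_days=8))
    print("change failure rate:", change_failure_rate(sample))
```

Numbers like these only describe delivery performance; the DORA models go further and relate them to capabilities and outcomes, which a sketch of this size does not attempt.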

Every year, the DORA researchers publish a research report with their updated findings and learnings from the year. The equations and models they come up with are not only about technical capabilities. They also include culture as a key to success, as seen in their structural equation models (https://dora.dev/research/2023/structural-equation-models/). It turns out that happy engineers develop better software! I look forward to diving deeper into the DORA report and learning ways to be more efficient, productive, and happier.

In conclusion, DevOpsDays Detroit 2023 was an incredible experience for our team, and we’re excited to share the knowledge and insights we gained from the conference talks and networking opportunities with our colleagues and clients.

We appreciate the effort that went into organizing such a large-scale event. Kudos to you, DevOpsDays Detroit organizers! 👏 ❤️ We look forward to attending future editions and hope to continue supporting and contributing to the technology industry’s growth and success.
