Leveraging Observability for Enhanced IoT Development at Electrolux
At Electrolux, we’re shaping the future of how technology integrates into daily life by constantly innovating and developing intelligent appliances. Our products include ovens that recognize food and adjust cooking settings accordingly, and air purifiers that monitor indoor air quality for particles like dust and allergens, and remind users to replace filters. However, building and maintaining these complex IoT systems presents unique challenges, particularly in observability.
This article details our Platform Engineering team’s journey, exploring the challenges of managing a multi-cloud IoT environment and how we leveraged Datadog to empower developers and improve overall observability. You can also watch a video presentation of our approach from the Datadog Summit in London.
Challenges of an IoT Landscape
The complexity of IoT extends to the infrastructure. We run a multi-cloud environment across three regions, supporting over 60 countries and 20 languages. Previously, we had three main connectivity cloud platforms; AWS was the dominant provider, but we also used Azure, IBM Cloud, and Google Cloud. As a result, our development teams all used different monitoring tools. During an incident, we had to look into every observability tool in use, so identifying the root cause took a long time. New developers joining the organization could be overwhelmed by the variety of tools and cloud vendors, which had accumulated over time for budgetary reasons or out of personal preference.
A strategic decision to migrate to AWS as the primary cloud provider presented an opportunity to consolidate everything into one observability platform. We chose Datadog for its ability to address the needs of mobile applications, backend systems, and hardware teams.
We onboarded teams gradually. One day, a team lead reached out to me for Platform team support during an incident. We were pleased that the team had started monitoring production, but they still struggled to identify root causes or fully resolve issues.
We used a table to show teams’ understanding of Datadog features like tracing, metrics, and SLOs. Initially, many teams only used basic functions like checking logs.
We brainstormed ways to support these teams better. Despite having documentation and videos, we decided to run internal Chaos Engineering Game Days to promote our toolset and enhance developers’ monitoring knowledge.
Real-Life Incident Simulation: Chaos Game Day
Chaos engineering helps identify normal and abnormal system behavior during experiments, and observability is key for gathering this information.
Datadog connected us with other customers who had conducted Chaos Game Days. I was fortunate to have Long Zhang, a PhD in Chaos Engineering, on my team. Long, a Senior SRE with extensive experience in various chaos frameworks and tools, designed and executed the entire Chaos Game Day for our organization.
We informed developers that we would simulate real-life incidents impacting end users. We covered various troubleshooting topics, including metric analysis for performance and traffic, and trace analysis for latency. We used a Capture The Flag (CTF) approach, in which flags, such as strings or metric names, are inserted into the target system for players to find.
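To make the CTF idea concrete, here is a minimal Python sketch of how a flag could be derived, hidden in a log line, and checked when a player submits it. This is an illustration only, not the tooling we used on Game Day: the secret, the service name, the `debug_token` field, and all function names are hypothetical.

```python
import hashlib
import hmac
import json
import logging

# Hypothetical secret known only to the Game Day organizers.
GAME_SECRET = b"game-day-secret"


def make_flag(challenge_id: str) -> str:
    """Derive a deterministic flag for a challenge from the organizer secret."""
    digest = hmac.new(GAME_SECRET, challenge_id.encode(), hashlib.sha256).hexdigest()
    return f"FLAG-{digest[:12]}"


def inject_flag_log(logger: logging.Logger, challenge_id: str) -> str:
    """Hide the flag in a structured error log, as a failing service might emit it."""
    record = {
        "service": "hypothetical-device-gateway",  # assumed name, for illustration
        "error": "upstream timeout",
        "debug_token": make_flag(challenge_id),
    }
    line = json.dumps(record)
    logger.error(line)
    return line


def check_submission(challenge_id: str, submitted: str) -> bool:
    """Score a player's answer with a constant-time comparison."""
    expected = make_flag(challenge_id).encode()
    return hmac.compare_digest(expected, submitted.strip().encode())
```

In a setup like this, players would have to find the log line through their observability tool, copy the `debug_token` value, and submit it; `check_submission("latency-01", token)` returns `True` only for the matching challenge.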
On Game Day, we revoked developer access to the environment, forcing them to use only Datadog.
We designed a Datadog dashboard for our players. Every failure injection also exercised the incident management process: developers practiced declaring incidents in Datadog, opening communication channels, and providing proper logging, metrics, and traces.
One developer was frustrated because her team hadn’t prioritized logging or metrics for their mobile app. This led them to start the Real User Monitoring integration in Datadog. All participants began setting up their monitors and alerts.
Integrating Datadog with Our Internal Developer Platform (IDP)
We also went one step further and connected Datadog with our Internal Developer Platform (IDP).
What is an IDP?
An Internal Developer Platform (IDP) is a set of tools that empowers developers to build and deploy applications more efficiently by automating tasks and providing self-service options.
Previously, we faced numerous daily requests for user onboarding, infrastructure provisioning, CI/CD, and granting access to different tools and cloud resources.
We built the IDP to automate these tasks.
For example, when a backend team needs a new EKS cluster, they can select an EKS template in the IDP, which then provisions the resources and installs the necessary tools, such as autoscalers, Datadog agents, and our cost exporters. Developers don’t need to know how to send logs or metrics to Datadog; the IDP handles it automatically, and templates are versioned so that running infrastructure can be updated when improvements are made.
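The template-and-versioning idea can be sketched in a few lines of Python. This is a simplified model under our own assumptions, not the actual IDP code: the `Template` and `Resource` classes, the add-on names, and the `provision` and `upgrade` functions are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True)
class Template:
    """A versioned infrastructure template maintained by the platform team."""
    name: str
    version: int
    addons: Tuple[str, ...]


@dataclass
class Resource:
    """A provisioned piece of infrastructure, pinned to a template version."""
    owner: str
    template_name: str
    template_version: int
    installed: List[str] = field(default_factory=list)


def provision(template: Template, owner: str) -> Resource:
    """Create a resource with every add-on the template prescribes."""
    return Resource(owner, template.name, template.version, list(template.addons))


def upgrade(resource: Resource, latest: Template) -> List[str]:
    """Move a running resource to a newer template; return the add-ons it gained."""
    if latest.name != resource.template_name:
        raise ValueError("template mismatch")
    if latest.version <= resource.template_version:
        return []  # already up to date
    added = [a for a in latest.addons if a not in resource.installed]
    resource.installed = list(latest.addons)
    resource.template_version = latest.version
    return added
```

In this model, a cluster provisioned from version 1 of an `eks` template with an autoscaler and a Datadog agent would gain a cost exporter when upgraded to a version 2 that adds one, without the owning team changing anything themselves.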
Later, we began preparing our IDP code for open-source release as a product named InfraKitchen, in which the Platform Engineering team defines infrastructure templates that developers can use to provision their required cloud infrastructure independently.
Summary
Through these initiatives, we’ve empowered our developers with deeper knowledge and better tools, reducing the need for constant SRE intervention. This has resulted in fewer incidents and a more resilient system overall.
Datadog’s tracing capabilities have been a major advantage for IoT. By tracking data across cloud services, mobile apps, and physical devices, Datadog can pinpoint the exact origin of issues, saving developers time and effort in troubleshooting. This ensures our IoT devices run smoothly.
During Chaos Game Day, developers not only learned how to use Datadog but also gained a deeper understanding of their applications’ infrastructure. As a result, our system’s observability has significantly improved, allowing developers to troubleshoot 90% of issues without Platform team support.
These combined efforts, including knowledge sharing, improved tooling, and practical exercises such as Chaos Game Day, have significantly reduced incidents, contributing to a more resilient IoT system.