To fully appreciate what a “War Room” is, it’s important to understand the challenges faced by a website used by billions of users. Dailymotion has been around for more than 10 years, which in the Internet world is an eternity. While this testifies to the solidity of our backbone, it can also lead to a certain overconfidence in some parts of our ecosystem.
For those who may not be familiar with the concept, a “War Room” is a dedicated team focused 100% on solving a critical problem, sometimes working around the clock. At dailymotion, we’ve used this approach when we’ve encountered major downtime or security issues.
To better understand these events, you need to remember that performance and scalability cannot be reduced to page speed alone; they are a balancing act that takes many factors into account. Sometimes you have the opportunity to validate that things work at scale, but only real life provides the full-sized stress test. Because the evolution of a website is long, with many ups and downs, and because performance does not degrade linearly, we can hit a wall at great speed (which has a tendency to hurt). The immediate, knee-jerk reaction is to freeze all feature development so that a big part of the engineering organization can focus on improving the platform, monitoring, infrastructure and code as quickly as possible.
When it rains, it pours…
A couple of years ago, we faced a major breakdown in a minor component of our system, caused by an unusual burst of traffic. This, coupled with our technical debt, resulted in an unprecedented technical crisis: service outages, latency, errors and other collateral damage. There was only one way to resolve this crisis: a combination of historical knowledge, mutual understanding and in-depth technical expertise, concentrated in a team of dedicated engineers sitting in the same room and all focused on tackling the crisis together.
What resulted was a list of remedial actions and a lot of lessons learned. The biggest revelation was teamwork. This may seem obvious, but that week will shine in our memories and was key for many of us: we needed to change. It was one room full of motivated people focused on a single purpose, fixing a mission-critical crash. Obviously, you can’t work every day in hackathon/survival mode (wouldn’t that be fun?), but the positive aspects of that pressure can help maintain focus and encourage good sense in action. The experience has improved our Agile process: we are better at roadmap prioritization, we take potential scalability bottlenecks into account, and we use dedicated cross-team sprints.
A new era
Of course, we have faced other “war rooms” since, and our mindset has changed to embrace these situations as part of a new journey. These war room situations also initiated a full product and engineering overhaul that started in late 2016 and included migrating our main API, formerly a REST API, to an API-centric approach built on top of GraphQL. We also started a huge migration spanning infrastructure, back-end and front-end. We’ve changed almost everything over the past 18 months, and we did all this whilst handling other incidents, preparing for our future challenges, changing our engineering organization and hiring more than one hundred people. We scaled.
“War rooms” have helped our teams work better together. When everyone is sitting in the same space, you don’t have to wonder whether everyone is on the same page, because those in the room with you are there to work on only one thing. As a bonus, you spend less time revisiting already-discussed issues. The difficulties we faced during this period have resulted in a positive transformation for the company: we’re more efficient, and we have new and better processes as well as improved team rituals. Easy to set up, “War rooms” have improved our outage process in terms of communication, procedure, root cause analysis and reporting. Bringing people from various teams into the same room, day or night, to solve a particular critical issue, each with different points of view and proposals for reaching the required improvement or solution, is now a no-brainer. We kicked off a lot of our roadmap projects during these events, transforming a bad situation into a win for dailymotion.
Co-operation as a culture
As painful as it was, that significant event has been one of the biggest breakthroughs I have experienced as an engineering manager in the 10 years that I’ve been working at dailymotion. It kick-started the move away from our legacy framework and focused our efforts on building a brand-new architecture. Of course, tech changes and decisions need to be supported by the company as a whole and often require an internal reorganization. Our job as teammates is to work together, in particular when we fail, and to remember that collaboration built on trust creates a stronger culture (and way more successes to celebrate) than an arrogant, competitive and individualistic environment. At the end of the day, we maintain the belief that a constant effort to build a healthy culture based on communication and transparency will go a long way toward healing much of the verticality effect felt by many teams.