Evolution of IT Operations at ADEO: Automation, Challenges, and Progress

Jean-François Marquis
ADEO Tech Blog
5 min read · Sep 14, 2023

In the field of operations, we are constantly confronted with ever-evolving technologies that demand faster deployment and bring greater complexity and technological diversity. At the same time, we must keep legacy technologies and applications operational at a high level of service: they enabled our companies to thrive and often still support a significant portion of our business activities.

Efficiency, effectiveness, and speed are key elements to remain competitive and provide the business with the operational excellence it deserves.

What solutions can deliver these key elements in IT operations?

The 2010s provided some answers through the outsourcing of some or all operations, commonly referred to as TME (Tierce Maintenance d’Exploitation, or third-party operations maintenance). This competent, flexible, and cost-effective workforce, often located nearshore, allowed us to keep growing linearly (more servers = more workforce) while maintaining the same budgets. However, a decade later, we observe that this paradigm is losing momentum, even when working with market leaders:

  • Human errors occur, interpretations of instructions in documents are not always accurate, and occasional misguided initiatives are taken, especially during waves of incidents.
  • Operational documentation (DEX) is not regularly updated by product teams.
  • Incident analysis and continuous improvement, which are optimization levers of operations (as stipulated in contracts), are becoming less effective, leading to an increase in technical debt.
  • Response times can hardly drop below 10 minutes.
  • Staff turnover results in retraining periods and decreases in quality.
  • Lastly, the volume of incidents makes it increasingly difficult for operators to identify the truly critical incidents among the overwhelming workload and within very tight timeframes.

In summary, the operational model of the 2010s is no longer suited to our needs.

In response to these challenges, we decided a year ago to modernize our approach, guided by a maxim attributed to Stanley Kubrick:

“Don’t let a man do what a machine can do.”

The next step was to find a solution that would allow us to make progress in a ‘Test & Learn’ mode.

Automation had been the subject of numerous technology watch efforts and vendor demonstrations, but because of the complexity of implementation and the associated costs, no project had materialized. Once we started discussing environments with over 20,000 servers, the amounts quickly reached several million euros, without factoring in negotiation, contracting, multi-year commitments for reasonable pricing, and integration efforts. In other words, this was incompatible with our ‘Test & Learn’ approach.

The only solutions that seemed acceptable to us, respecting our principles and adapting to our ecosystem, were open-source solutions: AWX, Chef, Puppet, SaltStack, and others. We won’t claim that open source equals free, but this model offers numerous advantages in our context. In our case, implementing the AWX platform required a dedicated team responsible for its setup, maintenance, contribution to the open-source community, and 24/7 management. When evaluating products, we paid special attention to whether each one could later evolve into a commercial offering. We therefore chose to start with AWX, keeping the possibility of moving to a vendor solution if the need arises (for additional features not present in the open-source offering, for support, etc.).

Throughout this project, we gained valuable insights and adjusted its initial scope. We chose AWX for deployment management and incident remediation, while other products like Puppet found their place in our ecosystem to ensure compliance of assets. These two solutions complement each other seamlessly.
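To make the incident-remediation role more concrete, here is a minimal sketch of how a monitoring hook might ask AWX to run a remediation job through its REST API. The job template launch endpoint and the `limit` and `extra_vars` fields belong to the AWX API; the base URL, template id, token, host name, and variable names are placeholders invented for illustration, and the function name `build_launch_request` is hypothetical. The request is only built here, not sent.

```python
import json
import urllib.request


def build_launch_request(base_url, template_id, token, host, extra_vars):
    """Build (without sending) a request that launches an AWX job template."""
    url = f"{base_url.rstrip('/')}/api/v2/job_templates/{template_id}/launch/"
    payload = {
        # AWX only honors these two fields if the job template is
        # configured to prompt for them on launch.
        "limit": host,             # restrict the run to the affected host
        "extra_vars": extra_vars,  # variables passed to the playbook
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# All values below are placeholders, not a real environment.
request = build_launch_request(
    base_url="https://awx.example.internal",
    template_id=42,
    token="REPLACE_ME",
    host="app-server-01",
    extra_vars={"service_name": "nginx"},
)
# Passing `request` to urllib.request.urlopen() would submit the job.
```

Separating request construction from submission keeps the sketch testable without access to an AWX instance.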

Project Scope:

  • 175,000 incidents per year
  • 23,180 database instances
  • 15,864 servers
  • 45 different operating systems
  • 16 different database engines
  • 8 different application servers

Progress Made:

Over the past eight months, our team has worked tirelessly to implement robust automation solutions. The results are impressive:

  • Reduction in manual actions: We have reduced the number of incidents handled by TME by a factor of five, with the goal of reducing it by a factor of seven by the end of the year.
  • Error reduction: Automation has minimized human errors, improving the reliability and quality of our operations. Non-compliance with DEX has become almost non-existent, and when it occurs, it’s usually due to the DEX not being updated by its owner.
  • Time savings: Automated processes have significantly reduced execution times compared to previous manual methods. For example, a server restart that previously took an average of 20 hours is now accomplished in an average of 83 seconds.
  • Increased responsiveness: Diagnostics and operations that were previously performed by humans at Level 1 are now almost entirely automated, either correcting the issue or assigning it to the expert who can resolve it as quickly as possible. For example, a ticket that needs to be escalated to a higher level (Level 3 or Level 4) for analysis and resolution is escalated in less than a minute in 99.99% of cases.
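The Level 1 behavior described in the last point (fix the issue automatically, or route it to the right expert) can be sketched as a small triage function. This is an illustrative model only: the incident kinds, team names, and the `restart_service` stub are hypothetical and do not describe the actual ADEO implementation.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Incident:
    host: str
    kind: str  # hypothetical category label, e.g. "service_down"


@dataclass
class Outcome:
    action: str  # "remediated" or "escalated"
    detail: str  # name of the fix applied, or team the ticket was routed to


def restart_service(incident: Incident) -> bool:
    """Stub for an automated fix; a real one would launch an automation job."""
    return True  # pretend the fix succeeded


# Hypothetical routing tables: known automated fixes and expert teams.
REMEDIATIONS: Dict[str, Callable[[Incident], bool]] = {
    "service_down": restart_service,
}
ESCALATION_TEAMS: Dict[str, str] = {
    "db_corruption": "database-experts",
}
DEFAULT_TEAM = "operations-experts"


def triage(incident: Incident) -> Outcome:
    """Try the automated fix first; otherwise escalate to the matching team."""
    fix = REMEDIATIONS.get(incident.kind)
    if fix is not None and fix(incident):
        return Outcome("remediated", fix.__name__)
    team = ESCALATION_TEAMS.get(incident.kind, DEFAULT_TEAM)
    return Outcome("escalated", team)
```

For example, `triage(Incident("app-01", "service_down"))` applies the stubbed fix, while an incident of kind `"db_corruption"` is escalated to the hypothetical `database-experts` team. An incident whose automated fix fails also falls through to escalation.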

Challenges Encountered:

  • Challenges in identifying development priorities: We faced difficulties in determining the best approach to start development and demonstrate expected results. Should we begin with the most time-consuming tasks, the most frequent ones, or the easiest ones to build team confidence?
  • Obsolescence of operational documentation: We found that operational documentation was not regularly updated, hindering our ability to have the most up-to-date information on our systems and causing us to lose significant time in identifying those with the necessary knowledge.
  • Loss of knowledge/skills regarding legacy applications: Over time, we unfortunately lost valuable knowledge and skills related to our legacy applications. Relearning these applications to enable automation significantly slowed us down.
  • Resistance to change and fear of automation: Despite all the education efforts around the project, we encountered discomfort with the unknown and uncertainty about the consequences of automation.
  • Inability to automate obsolete systems on which AWX cannot run its jobs.
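The prioritization question raised in the first challenge above (most time-consuming, most frequent, or easiest first) can be explored by ranking the same candidate list under each strategy. The sketch below is only an illustration: every task name and figure is invented, and the "best payback" ratio is an extra heuristic added here, not something taken from the project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    yearly_count: int      # how often operators perform the task per year
    manual_minutes: float  # average hands-on time per occurrence
    effort_days: float     # estimated effort to automate the task


# Invented figures, for illustration only.
CANDIDATES = [
    Candidate("restart service", 12000, 10, 5),
    Candidate("purge full filesystem", 8000, 20, 8),
    Candidate("rebuild database replica", 150, 240, 30),
]

# Each strategy maps a candidate to a score; higher scores come first.
STRATEGIES = {
    "most frequent": lambda c: c.yearly_count,
    "most time-consuming": lambda c: c.manual_minutes,
    "easiest": lambda c: -c.effort_days,
    # Assumed heuristic: manual minutes saved per day of automation effort.
    "best payback": lambda c: c.yearly_count * c.manual_minutes / c.effort_days,
}


def rank(strategy: str) -> list:
    """Return candidate names ordered by the chosen strategy, best first."""
    score = STRATEGIES[strategy]
    return [c.name for c in sorted(CANDIDATES, key=score, reverse=True)]
```

Comparing `rank("most time-consuming")` with `rank("most frequent")` on real incident data shows quickly whether the strategies disagree, which is where the prioritization debate actually matters.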

Despite these challenges, we view the progress made so far as a crucial step toward fully optimized operations. However, we will not stop here. We believe that artificial intelligence can open new avenues for greater proactivity and anticipation.

Conclusion:

The automation of manual operations has already brought significant benefits in terms of efficiency, reliability, and responsiveness. These successes reinforce our commitment to investing in innovative technological solutions to address current and future challenges. Our team will continue to explore new opportunities for automation and continuous improvement, with the ultimate goal of delivering increasingly high-performance services to our internal clients.

Stay tuned for more exciting updates on our journey towards automated, optimized operations at ADEO, and ever-increasing operational excellence.
