Experience Review: Automation Project for Level 1 and Level 2 Operations

Jean-François Marquis
ADEO Tech Blog
4 min read · Dec 22, 2023


Now that our automation project for Level 1 and Level 2 operations has run its course, it is time to reflect on the challenges, successes, and insights gained along the way.

Context Recap:

  • 27,000 Operation Documents (DEX)
  • 16,000 incidents per month
  • 100% manual processing in nearshore
  • 30 minutes per incident acknowledgment
  • 10 incidents handled per minute

💡 Results at a Glance:

  • 98% of incidents automated by the robot
  • 99.99% of incidents resolved in less than 5 minutes
  • Processing capacity of 200 incidents per second
  • Implementation of remediation as code, with playbooks stored in source control (GitHub) and delivered by a CI/CD pipeline.

An impressive outcome. Let’s revisit the key elements of this project:

Initially, choosing the methodology posed a challenge: should we run a classic project, with an upfront workload assessment and delivery industrialized in a service center, or an agile one with an iterative and incremental approach?

The answer was a mix of both, a test & learn method based on incident analysis — a term that may sound marketing-driven but empowers teams to dare, make mistakes, and, most importantly, innovate.

Thanks to our incident database, we identified the most frequent and straightforward incidents. These became our starting point.

🛠️ First Playbooks:

  • Operation documents weren’t kept up to date.
  • Outsourcing to a third party (TME) had resulted in a loss of expertise in those operation documents.
  • Playbooks couldn’t be independent and had to rely on a framework providing common tool management primitives (ServiceNow, Centreon, interactions with OS, middleware, databases…) to avoid generating code that was unmaintainable and unable to evolve.
  • We needed to address operational issues like a developer, implementing software engineering principles and elevating the team’s skills. We found that what’s natural for a developer (branch management, Conventional Commits, semantic versioning) is not at all natural for an ops engineer!
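To make the developer conventions in the last point concrete, here is a minimal sketch of how Conventional Commits messages can drive a semantic version bump. The function names (`bump_kind`, `next_version`) and the sample commit messages are hypothetical illustrations, not the project's actual tooling.

```python
import re

# Conventional Commits header: type, optional "(scope)", optional "!", then ": description".
HEADER = re.compile(r"^(?P<type>[a-z]+)(\([^)]+\))?(?P<breaking>!)?: .+")


def bump_kind(message: str) -> str:
    """Return 'major', 'minor', 'patch', or 'none' for a single commit message."""
    lines = message.strip().splitlines()
    match = HEADER.match(lines[0]) if lines else None
    if match is None:
        return "none"  # not a Conventional Commits message
    # A "!" in the header or a "BREAKING CHANGE:" footer signals a breaking change.
    if match["breaking"] or any(line.startswith("BREAKING CHANGE:") for line in lines[1:]):
        return "major"
    if match["type"] == "feat":
        return "minor"
    if match["type"] == "fix":
        return "patch"
    return "none"  # docs, chore, refactor, ... do not bump the version


def next_version(current: str, messages: list[str]) -> str:
    """Apply the highest-priority bump found in `messages` to a MAJOR.MINOR.PATCH string."""
    major, minor, patch = (int(part) for part in current.split("."))
    kinds = {bump_kind(message) for message in messages}
    if "major" in kinds:
        return f"{major + 1}.0.0"
    if "minor" in kinds:
        return f"{major}.{minor + 1}.0"
    if "patch" in kinds:
        return f"{major}.{minor}.{patch + 1}"
    return current
```

For example, `next_version("1.4.2", ["fix(ssh): retry on timeout", "feat(servicenow): attach logs"])` returns `"1.5.0"`, because a new feature outranks a fix.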

Given these insights, we launched two streams in the project with mobility between them:

  1. Playbook Stream: Continued automation without the framework to prove project value and feasibility while evangelizing product teams.
  2. Framework Stream: Created a framework providing the primitives needed to build playbooks.
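As a rough illustration of what "primitives" can mean here, the sketch below shows a playbook that depends on small interfaces rather than on a specific tool. Every name in it (`Itsm`, `Host`, `FakeItsm`, `FakeHost`, `restart_playbook`) is hypothetical; the article does not describe the real framework's API.

```python
from dataclasses import dataclass, field
from typing import Protocol


class Itsm(Protocol):
    """Hypothetical ticketing primitive (a real one might wrap ServiceNow)."""

    def add_work_note(self, incident_id: str, note: str) -> None: ...
    def resolve(self, incident_id: str) -> None: ...


class Host(Protocol):
    """Hypothetical OS primitive for acting on a server."""

    def restart_service(self, service: str) -> bool: ...


@dataclass
class FakeItsm:
    """In-memory stand-in so the playbook can run without a real ITSM."""

    notes: list[tuple[str, str]] = field(default_factory=list)
    resolved: list[str] = field(default_factory=list)

    def add_work_note(self, incident_id: str, note: str) -> None:
        self.notes.append((incident_id, note))

    def resolve(self, incident_id: str) -> None:
        self.resolved.append(incident_id)


@dataclass
class FakeHost:
    """In-memory stand-in whose restart outcome is configurable."""

    restart_succeeds: bool = True

    def restart_service(self, service: str) -> bool:
        return self.restart_succeeds


def restart_playbook(incident_id: str, service: str, host: Host, itsm: Itsm) -> bool:
    """Restart a service, log the outcome on the ticket, resolve it on success."""
    ok = host.restart_service(service)
    itsm.add_work_note(incident_id, f"restart of {service}: {'ok' if ok else 'failed'}")
    if ok:
        itsm.resolve(incident_id)
    return ok
```

Because the playbook only sees the two interfaces, swapping the ticketing tool or the way a host is reached would not require touching the playbook itself, which is the maintainability argument made above.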

One might argue that starting with the framework and then creating playbooks would be logical. However, we didn’t do this for two reasons:

  • The framework was developed in agile mode and delivered as a v0 within a month, with an initial set of primitives.
  • Playbook development yielded initial results and victories crucial for the project’s momentum.

I recall a contentious manual task: restarting application servers (over 2,000 in the fleet). The number of alerts related to application server availability was substantial (around 50 per day), with a very long resolution time (an average of 20 hours from probe trigger to service restoration).

So, we decided to start by automating the most basic aspect: restarting application servers using a playbook that connects via SSH and triggers the service restart command. Operational gain: significant! Application servers were restarted in less than 90 seconds compared to the previous 20 hours.
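A first playbook of this kind can be very small. The sketch below is one possible shape, assuming the servers run systemd and accept key-based SSH for an account with sudo rights. The user name, host name, and the `systemctl` command are assumptions for illustration; the article does not say which restart command was used.

```python
import shlex
import subprocess


def build_restart_command(host: str, service: str, user: str = "automation") -> list[str]:
    """Build the ssh invocation that restarts `service` on `host`.

    The remote command (sudo systemctl restart ...) is an assumption.
    """
    remote = f"sudo systemctl restart {shlex.quote(service)}"
    return [
        "ssh",
        "-o", "BatchMode=yes",       # never prompt for a password; fail instead
        "-o", "ConnectTimeout=10",   # give up quickly on unreachable hosts
        f"{user}@{host}",
        remote,
    ]


def restart_service(host: str, service: str, timeout_s: float = 90.0) -> bool:
    """Run the restart over SSH; True means the remote command exited with 0."""
    try:
        result = subprocess.run(
            build_restart_command(host, service),
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0
```

Keeping command construction separate from execution makes the playbook testable without touching a real server, and quoting the service name with `shlex.quote` guards against malformed input reaching the remote shell.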

There was still much to be done: the automaton's logs were not yet written to incident tickets, and interactions with the ITSM were hardcoded. However, this victory demonstrated to both the project team and service users that the project was viable and would bring real, measurable gains.

From this success, the streams continued to progress in parallel:

  • The playbook stream continued to automate the simplest and most frequent tasks, feeding the framework stream backlog based on their needs (OS-specific management, factoring out shared code, recurring issues).
  • The framework stream continued to enrich the framework with primitives needed for playbooks and fed the playbook stream backlog with refactoring to integrate new primitives.

In retrospect, starting with playbooks and building the framework in parallel seems like the right choice. We quickly demonstrated the project’s value, evangelized product teams with tangible results, and built the framework based on the real needs of playbooks and users.

❓ Persisting Questions:

  • Could we have done it faster? Perhaps, but more resources would have led to more refactoring.
  • Could we have done it better? We made technical choices that allowed speed but weren’t necessarily the best at the time. However, framework refactoring addressed this.
  • Could we have done it cheaper? Probably, but at the expense of code quality and maintainability.
  • Could we have done it differently? Certainly, but we haven’t found a miracle solution.

📊 Closing Numbers:

  • 27,000 DEX processed
  • Incidents handled manually per month: 16,000 at the start, 500 in December
  • 98% of incidents automated
  • 350 playbooks
  • 1 FTE in the framework stream
  • 3 FTEs in the playbook stream

🌟 A Project Demonstrating the Symbiosis of Innovation and Efficiency. We are delighted to share our experience and inspire other teams toward the future of intelligent automation. #ADEO #DigitalTransformation #OpsAutomation #TechnologicalInnovation
