AIOps: DevOps in the Age of AI/ML
How did we take DevOps at Innovaccer to the next level with AI/ML? By applying AI/ML algorithms to the data generated by our infrastructure resources and logs, we improved scalability, fault tolerance, and SLA adherence, reduced cost, and cut the effort required to manage the infrastructure.
While working at Innovaccer, we ran into some shortcomings of the traditional DevOps approach: the Ops engineers had too much on their plate maintaining and managing the system, and there was little we could do to ease the pain until we implemented AIOps.
To understand the crux of the article, let me first explain what DevOps and AI mean, and then look at how AI can enhance the DevOps paradigm.
What is DevOps?
DevOps is a set of software development practices that combine software development (Dev) and information-technology operations (Ops) to shorten the systems development life cycle while delivering features, fixes, and updates frequently in close alignment with business objectives.
What is AI?
In computer science, artificial intelligence (AI), sometimes called machine intelligence, is intelligence demonstrated by machines in contrast to the natural intelligence displayed by humans. Colloquially, the term “artificial intelligence” is often used to describe machines (or computers) that mimic “cognitive” functions that humans associate with the human mind, such as “learning” and “problem solving”.
The common requirements for an operations engineer are:
- Cost Reduction
- Adherence to SLAs
- Optimum Resource Utilization
The operations engineer would have to fine-tune infrastructure resource parameters to achieve these goals. AIOps applies AI/ML techniques to these constraints: it automatically comes up with a plan that meets the criteria, continuously monitors the infrastructure resources to check whether the criteria will still be met, and otherwise comes up with a new plan to achieve them.
To understand this, let us take one of the use cases that we have solved at Innovaccer.
Over time, the requirement translated to running Apache Spark batch jobs at scale while meeting the SLAs specified by the data engineers/scientists. The business requirement was to reduce cost and also to make sure resources are optimally utilized (>70% and <85% load while running production jobs). There are other complex business rules, but I am leaving those out to keep this topic simple.
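These requirements can be expressed as a simple feedback check: keep cluster load inside the 70–85% band, and project whether the running job will finish inside its SLA. The sketch below is illustrative only, not our production code; all names and numbers are hypothetical placeholders.

```python
# Illustrative check of the two constraints described above: cluster load must
# stay in the 70-85% band, and the projected finish time must meet the SLA.
# All names and thresholds are hypothetical placeholders.

def plan_ok(load_pct: float, elapsed_min: float, pct_complete: float,
            sla_min: float) -> dict:
    """Return which constraints the current run satisfies."""
    # Projected total runtime, assuming roughly linear job progress.
    projected = elapsed_min / pct_complete if pct_complete > 0 else float("inf")
    return {
        "load_in_band": 70.0 <= load_pct <= 85.0,  # optimum utilization
        "sla_met": projected <= sla_min,           # hard constraint
    }

# Example: 78% load, 30 min elapsed, half done, 90-minute SLA.
status = plan_ok(78.0, 30.0, 0.5, 90.0)
print(status)
```

When either flag is false, the monitoring side has to re-plan, which is exactly the decision-making gap AIOps fills later in this article.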
Phase 1: Dev engineer and Ops engineer are working in separate teams.
The Dev team writes code and unit tests to ensure code quality. The Ops team creates the infrastructure as per the requirements, deploys the code, and then executes the jobs as needed. Manual processes/scripts are used for creating the infrastructure and deploying JARs and configurations. Spot instances can be used for task nodes on AWS EMR with auto-scaling. A team monitors the system to achieve the desired results.
- There is a separation of concern between Dev engineer and Ops engineer. Dev engineers can focus on the product feature and the Ops engineer can focus on the operational aspects of the jobs.
- The Ops engineer spends too much effort maintaining the system and achieving the SLAs.
- Conflicts exist between the Dev engineer and Ops engineer regarding the publishing of new jobs or updates of the old jobs. The Ops engineer wants to achieve system stability and is averse to these changes, and the Dev engineer wants to publish new features for the product.
- Time to market is high, as both teams have different goals they want to achieve.
- As the infrastructure is deployed through manual processes/scripts, the same machine is used for all activities. The infrastructure deteriorates because each team member follows different conventions and has a varying degree of knowledge about the process. It tends to become unmaintainable, causing job failures, and the team has to work hard to achieve the SLAs and other requirements.
- Sizing of the machines is still a challenge. To meet the desired SLAs and optimum resource utilization, a lot of trial and error is done. Magic numbers are used (add X machines and Y amount of memory), and these numbers change over time with data size, computation complexity, etc. There is no scientific process to achieve the desired result, so not all requirements are met every time, resulting in unhappy customers and business.
Phase 2: DevOps is implemented. Dev and Ops teams work closely to enable CI-CD and IaC.
The Dev and Ops engineers implement CI-CD. Infrastructure is created using IaC tools like Terraform/AWS Cloud Formation. Configuration Management tools like Ansible/Chef/Puppet are used for deployment of the configuration and JARs. Jobs are executed using a workflow management tool like Airflow or Azkaban. Integration testing of the whole process is done.
- Time to market is reduced considerably.
- Conflicts between the Dev and Ops team are resolved, as both are working closely for product success.
- Fault tolerance is still a pain point. Someone needs to monitor these systems and resolve every issue, and SLAs cannot be met if a fault is large and takes a long time to address.
- There are many blind spots in the system which are discovered during production.
- Sizing of the machine is still a challenge. The common requirements specified above are still not addressed.
Phase 3: SRE is implemented.
SRE is implemented: various issues seen while running in production are discussed, and mitigation plans are scheduled. Tools like Chaos Toolkit/Gremlin are used to introduce chaos engineering experiments into the product. These experiments give us insight into the blind spots of the system; the issues are then addressed, and alternative or better approaches are introduced to resolve them.
- Chaos Engineering helps us to discover the blind spots in the system.
- Fault issues are identified and addressed, thus reducing faults in the production environment.
- Even with SRE implemented, every new mitigation route has to be evaluated against our SLAs and cost targets. We wanted a decision-making entity that could understand these changes and judge whether a plan helps us achieve our common requirements.
- Sizing of the machines is still a challenge, and the common requirements specified above are still not addressed. We wanted a decision-making entity that would address these as well.
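For a concrete flavor of the Phase 3 experiments, Chaos Toolkit experiments are declared as JSON documents with a steady-state hypothesis and a method that injects the fault. The sketch below follows that general shape, but the service URL, script path, and experiment names are all hypothetical placeholders, not our actual experiments.

```python
import json

# Hypothetical Chaos Toolkit-style experiment: verify the job service stays
# healthy while one Spark task node is killed. The URL, script path, and
# names are placeholders for illustration only.
experiment = {
    "version": "1.0.0",
    "title": "Jobs keep meeting SLAs when a task node dies",
    "description": "Kill one Spark task node and check the job service stays up.",
    "steady-state-hypothesis": {
        "title": "Job service is healthy",
        "probes": [{
            "type": "probe",
            "name": "service-responds",
            "tolerance": 200,  # expected HTTP status
            "provider": {"type": "http", "url": "http://jobs.internal/health"},
        }],
    },
    "method": [{
        "type": "action",
        "name": "terminate-one-task-node",
        "provider": {"type": "process", "path": "scripts/kill_task_node.sh"},
    }],
}

print(json.dumps(experiment, indent=2))
```

Running such an experiment in a staging environment is how the blind spots mentioned above get surfaced before production does it for you.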
Phase 4: AIOps is implemented
While working on the SRE implementation, the two major pain points above remained. At Innovaccer, we understand the importance of data, and the infrastructure data we had logged during our implementation proved very useful for building our AI/ML solution.
We have used Boto3, the Python AWS SDK, for infrastructure creation, as it gives us programmatic access to create infrastructure.
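A minimal sketch of what programmatic EMR cluster creation with Boto3 looks like, assuming on-demand core nodes and spot task nodes as in Phase 1. The instance types, counts, bid price, and cluster name are hypothetical; `create_cluster` only works with real AWS credentials, so Boto3 is imported lazily inside it.

```python
# Sketch of programmatic cluster creation with Boto3. build_instance_groups is
# a pure helper; create_cluster needs real AWS credentials to run. Instance
# types, counts, and names below are hypothetical placeholders.

def build_instance_groups(core_count: int, task_count: int,
                          spot_bid: str) -> list:
    """Build an EMR InstanceGroups config: on-demand core, spot task nodes."""
    return [
        {"Name": "core", "InstanceRole": "CORE", "InstanceType": "m5.xlarge",
         "InstanceCount": core_count, "Market": "ON_DEMAND"},
        {"Name": "task", "InstanceRole": "TASK", "InstanceType": "m5.xlarge",
         "InstanceCount": task_count, "Market": "SPOT", "BidPrice": spot_bid},
    ]

def create_cluster(core_count: int, task_count: int, spot_bid: str) -> str:
    import boto3  # imported lazily: the call below requires AWS credentials
    emr = boto3.client("emr")
    resp = emr.run_job_flow(
        Name="batch-jobs",
        ReleaseLabel="emr-5.29.0",
        Instances={"InstanceGroups": build_instance_groups(
            core_count, task_count, spot_bid)},
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )
    return resp["JobFlowId"]
```

Because the cluster shape is just data here, an optimizer can generate candidate `(core_count, task_count, spot_bid)` tuples and hand them to the same code path.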
If you look at the common requirements, some of the conditions contradict each other: on one hand, cloud cost needs to be reduced, and on the other, SLAs must be met, i.e., runtime needs to be reduced. You have to make decisions about the infrastructure machines that depend on the current marketplace cost and availability of spot instances. This makes it a multi-objective optimization problem, so we used a Genetic Algorithm to meet these requirements. The Genetic Algorithm generates multiple candidate solutions and checks whether they meet our hard and soft constraints. In this problem, SLAs are hard constraints and cost is a soft constraint up to a limit.
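To make the idea concrete, here is a toy Genetic Algorithm that picks a cluster size minimizing cost while treating the SLA as a hard constraint (violations get a large fitness penalty). The runtime and cost models are deliberately simplified placeholders, not our production models.

```python
import random

# Toy Genetic Algorithm sketch: choose a node count that minimizes cost while
# treating the SLA as a hard constraint. Runtime/cost models and all numbers
# are simplified placeholders for illustration.

BASE_MIN, WORK, SLA_MIN, PRICE_PER_NODE_MIN = 10.0, 600.0, 60.0, 0.01

def runtime(nodes: int) -> float:
    return BASE_MIN + WORK / nodes  # fixed overhead + parallelizable work

def fitness(nodes: int) -> float:
    cost = nodes * runtime(nodes) * PRICE_PER_NODE_MIN
    penalty = 1e6 if runtime(nodes) > SLA_MIN else 0.0  # hard SLA constraint
    return cost + penalty  # lower is better

def evolve(generations: int = 60, pop_size: int = 20) -> int:
    pop = [random.randint(1, 50) for _ in range(pop_size)]
    best = min(pop, key=fitness)  # elitism: remember the best ever seen
    for _ in range(generations):
        # Tournament selection: keep the better of two random candidates.
        pop = [min(random.sample(pop, 2), key=fitness) for _ in range(pop_size)]
        # Mutation: nudge each node count up or down.
        pop = [max(1, n + random.choice((-2, -1, 0, 1, 2))) for n in pop]
        best = min([best] + pop, key=fitness)
    return best

random.seed(0)
best = evolve()
print(best, runtime(best))
```

A real formulation would evolve richer genomes (instance type, spot bid, memory per executor) and fold the soft cost cap into the penalty term, but the hard-versus-soft constraint structure stays the same.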
Other areas where AIOps can be applied:
- Manage the avalanche of alerts.
- Correlate data from various tools.
- Identify the problem in a multi-tier stack.
We are hiring across SDE, SRE, and ML Scientist/Researcher roles. Join Innovaccer to be a part of the next-generation platform team for healthcare.