Maira provides a low-code automation platform for Ops day-2 problems. It might sound counter-intuitive to use a DSL for a low-code platform, so let’s look at why this design choice makes sense. When we started working on the Maira platform, we had a set of requirements that came from our experience with many customer escalations. The challenge was to design a platform that makes a workflow easy to understand at a glance and, at the same time, lets developers write workflows that automatically respond to incidents and remediate the problem, ideally with minimal human intervention. We felt that the best option was to represent low-code workflows as true code.
View logic as blocks
Most of the time, urgent ops problems are handled by the front-line ops/support team, whose members may not completely understand the code of the underlying application. They may not even be full-time programmers. When presented with an automation script, they must be able to understand it and make minor changes if needed. The simplicity of MPL, our DSL, allows us to convert workflows into a format that can be displayed as code blocks. An automation workflow that is easy to understand leads to far fewer calls to core developers. It also allows front-line engineers to add a step themselves (e.g. a step that sends a Slack message).
Take the workflow in the following picture as an example. On seeing it, you can easily figure out what the workflow is trying to do: get logs from Kubernetes, run some checks to decide whether an action is needed, and if it is, confirm with a human approver and take the action if approved, or else just create a ticket.
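For readers who cannot see the picture, the same flow can be sketched in plain Python. This is not MPL syntax. The function names, the "count ERROR lines" check, and the threshold are all assumptions made for illustration, and every external step is passed in as a stub.

```python
from typing import Callable, List


def run_workflow(
    get_logs: Callable[[], List[str]],
    ask_approval: Callable[[str], bool],
    take_action: Callable[[], str],
    create_ticket: Callable[[str], str],
    error_threshold: int = 3,  # hypothetical rule: act on 3 or more ERROR lines
) -> str:
    """Fetch logs, decide whether action is needed, then act or open a ticket."""
    logs = get_logs()
    error_count = sum(1 for line in logs if "ERROR" in line)

    # Check: is any action needed at all?
    if error_count < error_threshold:
        return "no action needed"

    summary = f"{error_count} errors found in pod logs"

    # Confirm with a human approver before taking the action.
    if ask_approval(summary):
        return take_action()

    # Not approved: just create a ticket.
    return create_ticket(summary)


# Usage with stubbed steps (no real Kubernetes, chat, or ticketing calls).
sample_logs = ["INFO started", "ERROR timeout", "ERROR timeout", "ERROR oom"]
result = run_workflow(
    get_logs=lambda: sample_logs,
    ask_approval=lambda summary: False,  # the approver declines
    take_action=lambda: "pod restarted",
    create_ticket=lambda summary: f"ticket created: {summary}",
)
print(result)  # ticket created: 3 errors found in pod logs
```

Even in this small sketch the branching is the part a reader needs to see first, which is what the block view is meant to surface.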
Ops Automation Problems are Complex
The DevOps problems that need automation are generally very complex. They need flow control logic such as conditions and loops. They need transformation logic that takes the output of one command and makes it suitable as input for another command. A lot of this logic may be used in multiple places, and if you don’t want to repeat it everywhere, you need to define a function. Such problems are hard to specify as monolithic single-flow pipelines. Put all of this together and you can see that the job calls for a programming language.
Can we use an existing language such as Python or Bash? The problem with a general-purpose language is code bloat. We want workflows to be simple enough for a non-programmer to understand, and whoever writes the automation logic should not need to worry about handling tens of error cases.
To reduce the learning curve, we picked a Python-based syntax, which will be familiar to a large number of programmers.
To show the complexity of these automation problems, I will give one example: a workflow that we wrote to monitor and automatically delete unused resources in AWS. It reads the list of resources from AWS, examines them to find out which are unused, and sets an expiry on those. Later, as the expiry nears, it sends a notification to the owners, and finally it deletes the resource. If we try to do this with simple blocks alone, it takes much more effort to get right.
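A rough sketch of the decision logic in that cleanup workflow, written in plain Python rather than MPL, shows how quickly the conditions pile up. The `Resource` fields, the three time windows, and the action strings are assumptions for illustration. The sketch works on in-memory data and makes no AWS calls.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional


@dataclass
class Resource:
    resource_id: str
    owner: str
    last_used: date
    expiry: Optional[date] = None  # set once the resource is judged unused


# Assumed policy values, chosen only for this example.
UNUSED_AFTER = timedelta(days=30)   # idle this long counts as unused
GRACE_PERIOD = timedelta(days=14)   # time between marking and deletion
NOTIFY_WITHIN = timedelta(days=3)   # warn owners this close to expiry


def cleanup_pass(resources: List[Resource], today: date) -> List[str]:
    """Run one pass over the resources and return the actions taken."""
    actions = []
    for res in resources:
        if res.expiry is None:
            # Not yet marked: set an expiry if the resource looks unused.
            if today - res.last_used >= UNUSED_AFTER:
                res.expiry = today + GRACE_PERIOD
                actions.append(f"set-expiry {res.resource_id}")
        elif today >= res.expiry:
            actions.append(f"delete {res.resource_id}")
        elif res.expiry - today <= NOTIFY_WITHIN:
            actions.append(f"notify {res.owner} about {res.resource_id}")
    return actions


today = date(2024, 3, 1)
resources = [
    Resource("vol-1", "alice", date(2024, 1, 1)),                           # idle 60 days
    Resource("vol-2", "bob", date(2024, 2, 25)),                            # recently used
    Resource("vol-3", "carol", date(2024, 1, 1), expiry=date(2024, 3, 3)),  # expires soon
    Resource("vol-4", "dave", date(2024, 1, 1), expiry=date(2024, 2, 28)),  # already expired
]
actions = cleanup_pass(resources, today)
print(actions)
# ['set-expiry vol-1', 'notify carol about vol-3', 'delete vol-4']
```

A production version would also need to handle a marked resource that comes back into use, which is one more branch that is awkward to express with simple blocks alone.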
See differences across workflow versions
A troubleshooting workflow is constantly evolving. The underlying application code is changing all the time, the way customers use it keeps changing, and new issues are discovered every day. Each of these means that the workflow will change quite often. If the only way to see the workflow is as blocks, it becomes very hard to figure out what has changed across versions. Sometimes we may also need to roll back the workflow to an earlier version, and changes to workflows may need to be reviewed by peers. All this becomes a lot easier if engineers can see the diffs using their favorite code diff tool. These workflows can even be stored in a version control system such as Git (for example, in a GitHub repository).
The picture below shows changes being made to the workflow, and it is easy to understand what is being changed. Imagine doing something like this between two YAML files.
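To make the point concrete without the picture, here is a small sketch that uses Python's standard `difflib` module to diff two versions of a workflow stored as text. The workflow lines are made-up pseudo-workflow text, not real MPL, and the file names are placeholders.

```python
import difflib

# Two hypothetical versions of the same workflow, stored as plain text.
old_version = """\
logs = get_logs(pod)
if has_errors(logs):
    restart(pod)
""".splitlines()

new_version = """\
logs = get_logs(pod)
if has_errors(logs):
    notify_slack(channel)
    restart(pod)
""".splitlines()

# unified_diff yields the same format that most code review tools display.
diff = list(
    difflib.unified_diff(
        old_version,
        new_version,
        fromfile="workflow_v1",
        tofile="workflow_v2",
        lineterm="",
    )
)
print("\n".join(diff))
```

The added notification step shows up as a single `+` line, which is exactly what a reviewer wants to see.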
Integrated Troubleshooting and Automation Environment
A great workflow language is not very useful if writing workflows is a tedious task. Today, engineers are used to CLI tools for accessing all kinds of applications (e.g. AWS, Kubernetes) when troubleshooting. In MPL, each task is run as a CLI command.
When engineers use CLI in a manual troubleshooting session, Maira will capture those CLI commands and convert them into a workflow, possibly after some editing. These workflows can later be run automatically and enhanced further.
Other Language Features
Most of today’s APIs use JSON to pass data. MPL has built-in support for handling data expressed as JSON. You can easily filter a JSON array or object on any nested field, or transform one JSON object into another so that it can be used as input for another Maira Command.
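As an illustration of the kind of filtering and reshaping meant here, the following plain Python sketch (not MPL syntax) filters a JSON array on a nested field and then builds a new JSON structure from the result. The payload shape, field names, and the restart threshold are assumptions made up for the example.

```python
import json

# Hypothetical output of one command, as a JSON string.
raw = """
{"pods": [
  {"metadata": {"name": "web-1"}, "status": {"phase": "Running", "restarts": 0}},
  {"metadata": {"name": "web-2"}, "status": {"phase": "Failed", "restarts": 7}},
  {"metadata": {"name": "db-1"}, "status": {"phase": "Running", "restarts": 5}}
]}
"""
data = json.loads(raw)

# Filter the array on a nested field.
unhealthy = [pod for pod in data["pods"] if pod["status"]["restarts"] >= 5]

# Transform each match into the shape the next command is assumed to expect.
restart_requests = [
    {
        "pod": pod["metadata"]["name"],
        "reason": f'{pod["status"]["restarts"]} restarts',
    }
    for pod in unhealthy
]
print(json.dumps(restart_requests))
# [{"pod": "web-2", "reason": "7 restarts"}, {"pod": "db-1", "reason": "5 restarts"}]
```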
Also, MPL automatically takes care of most error handling and retries for problems caused by an unreliable network or platform. Each Maira Command is retried (based on a policy) if it hits a network issue or a temporarily unavailable service. This makes MPL-based workflows inherently durable.
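The retry policy itself is not described here, so the following is only a generic sketch of policy-based retries in plain Python, assuming exponential backoff and a fixed attempt limit. `TransientError`, `run_with_retries`, and the parameter names are hypothetical and do not describe how Maira implements it.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Stands in for a network failure or a temporarily unavailable service."""


def run_with_retries(
    command: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run command, retrying on TransientError with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return command()
        except TransientError:
            if attempt == max_attempts:
                raise  # policy exhausted: surface the failure
            sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("max_attempts must be at least 1")


# Usage: a command that fails twice and then succeeds.
calls = {"count": 0}


def flaky_command() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise TransientError("service temporarily down")
    return "ok"


delays: List[float] = []  # record the delays instead of actually sleeping
result = run_with_retries(flaky_command, max_attempts=5, base_delay=0.1, sleep=delays.append)
print(result, calls["count"], delays)  # ok 3 [0.1, 0.2]
```

When this handling lives in the platform, the workflow author writes only the command and none of the surrounding loop.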
To summarize, we have a dual requirement: make the workflow easy to understand by providing a block-based representation, and at the same time make it easy to compose, maintain, review, and roll back if necessary. It needs to be low friction for developers to adopt, so we have tried to keep it as close as possible to the CLI tools that the developer community already knows. MPL enables you to solve many complex problems using low-code workflows as true code.