Observations from migrating away from Control-M to Airflow

Arjun Anandkumar
Published in Apache Airflow
Jun 15, 2024

Premise

A couple of years ago, the company I was working at decided to move away from an expensive enterprise batch orchestration system, Control-M, to an open-source platform, Apache Airflow. For privacy reasons, I refer to the organisation as just “the company”.

This shift represented a big change, because:

a) Open-source software wasn't normally how the company picked its tooling, given enterprise support requirements and similar concerns.
b) It would bring about decentralization (more on this a bit later).

To keep terminology consistent, I'll refer to tasks as jobs throughout this article. The majority of jobs in Control-M are so-called 'OS' jobs, which simply means that Control-M deploys an agent on the on-premise Windows/Linux servers; the agent executes a script or command and reports its state back to the main Control-M server. A good 95% of these OS jobs run on Windows servers, something that was not immediately possible with Airflow at the time it was being considered as a replacement for Control-M.

I no longer work at the company, but I helped leave behind a solid foundation that the engineers taking over from me can build on to complete the migration and onboard new users who need an orchestrator. When I left, we had over 400 DAGs running in production with over 10,000 daily task executions. Airflow has slowly but surely become an important part of the data ecosystem at the company, and it will only grow from here.

Motivation behind this article

With this article, I want to share some aspects of our deployment that allowed us to run scalable and reliable Airflow instances, with some trickery on top for custom Role Based Access Control (RBAC). My hope is that it helps someone considering Airflow for a somewhat non-standard use case; at the very least, I haven't seen anyone else describe a similar attempt. This article is a mix of the migration journey and the setup. I'll break it down into the following components:

  • Why Airflow?
  • Expectations and challenges
  • Our deployment
  • Multi-tenancy
  • Spreading the word and Expansion
  • Learnings

Why Airflow?

As one can imagine, the name of the game here was cost reduction, because Control-M comes with a massive license fee. At the same time, there were several benefits to be gained, such as the fact that end-users no longer needed to wait for a specific individual or team to maintain their workflows; they could do it themselves if they wanted to, because everything was contained in code within git repositories. This isn't because it wasn't possible in Control-M, but allowing for that means…yep, you guessed it, more license costs.

It should be noted that Control-M is a no-code, drag-and-drop system that had been in use at the company for close to 10 years. It also has to be said that quite a few users were not familiar with Python (or any programming language, for that matter), so Airflow was a hard sell for some. This is where a hybrid model was offered, in which a team would be available to take in requests to change those workflows. On top of that, several data engineering teams stood to benefit from a centrally managed Airflow instance, since it meant they would not need to take on the administrative burden themselves. And since Azure didn't offer a managed Airflow service at the time the project started, going managed was out of the question.

Expectations and challenges

When I started at the company back in 2022, I was only told that they needed someone who knew Airflow, and that it was not yet in production use due to some security challenges that could not really be elaborated upon. Eager and interested, I took on the challenge, and the task turned out to be beyond what I could have imagined.

Some of the hard work had already been done. A skilled engineer had written the PSRPOperator, which would enable us to execute any Windows-based (i.e., the majority of) jobs.
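For illustration, a minimal sketch of what such a Windows job can look like with the PSRP operator is below; the connection id, schedule and script path are placeholders rather than the company's actual setup.

# A minimal sketch of a Windows 'OS job' using the Microsoft PSRP provider.
# Connection id, schedule and script path are illustrative placeholders.
from pendulum import datetime

from airflow import DAG
from airflow.providers.microsoft.psrp.operators.psrp import PsrpOperator

with DAG(
    dag_id="windows_os_job_example",
    start_date=datetime(2024, 1, 1, tz="UTC"),
    schedule="0 6 * * *",
    catchup=False,
) as dag:
    run_daily_load = PsrpOperator(
        task_id="run_daily_load",
        psrp_conn_id="winrm_app_server",           # WinRM/PSRP connection to the on-prem host
        powershell=r"& 'D:\jobs\daily_load.ps1'",  # script executed via PowerShell Remoting
    )

Great. Next came the security trouble.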

Enterprise security wasn’t too fond of the idea that something cloud-based would be making remote calls to on-premise servers. In a few months, after many rounds of testing, security finally approved a solution where Airflow would used a Group Managed Service Account (gMSA) for each of the machines it had to perform calls to. This meant that everything could be audited and all actions of the user would be logged. There is a little more to the security design around this, but because it is something ‘proprietary’, I have to keep that information out of this write-up.

With the security aspect finally ironed out, it was time to start migrating from Control-M to Airflow. It wasn't exactly straightforward, as we soon found that some of our potential end-users were not so excited about Airflow after all. It didn't matter to them whether it was Control-M or Airflow executing their workflows; they just wanted things to run on the given schedule. I suppose this was to be expected, but after some massaging, the point and benefits became apparent and we started making good, albeit slow, progress on the migrations. Slow because of the sensitive nature of many of these workloads and the many stakeholders involved. We could technically have just dumped everything into Airflow; however, many of these workflows lacked a 'test' or 'dev' version that we could use to iron out any potential issues.

Our Deployment

Our deployment was, in a way, quite simple: three separate Airflow instances, Sandbox, Test and Prod, deployed on Azure Kubernetes Service (AKS) using the official Helm chart.

For the webserver, we had some custom configuration, mainly to use Azure OAuth and to populate roles from the Azure app registration.
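A webserver_config.py for Azure OAuth with role mapping can look roughly like the sketch below; the tenant and client ids and the role names are placeholders, and the exact provider settings vary between Flask AppBuilder versions.

# Hedged sketch of a webserver_config.py using Azure OAuth via Flask AppBuilder.
import os

from flask_appbuilder.security.manager import AUTH_OAUTH

AUTH_TYPE = AUTH_OAUTH
AUTH_USER_REGISTRATION = True           # create Airflow users on first login
AUTH_USER_REGISTRATION_ROLE = "Viewer"  # fallback role if no mapping matches
AUTH_ROLES_SYNC_AT_LOGIN = True         # refresh roles from the token on every login

# Map roles defined on the Azure app registration to Airflow roles.
AUTH_ROLES_MAPPING = {
    "airflow_admin": ["Admin"],
    "airflow_user": ["User"],
}

TENANT_ID = os.environ["AZURE_TENANT_ID"]

OAUTH_PROVIDERS = [
    {
        "name": "azure",
        "token_key": "access_token",
        "icon": "fa-windows",
        "remote_app": {
            "client_id": os.environ["AZURE_CLIENT_ID"],
            "client_secret": os.environ["AZURE_CLIENT_SECRET"],
            "api_base_url": f"https://login.microsoftonline.com/{TENANT_ID}/oauth2",
            "request_token_url": None,
            "request_token_params": {"scope": "openid email profile"},
            "access_token_url": f"https://login.microsoftonline.com/{TENANT_ID}/oauth2/v2.0/token",
            "authorize_url": f"https://login.microsoftonline.com/{TENANT_ID}/oauth2/v2.0/authorize",
        },
    }
]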

For the executor, we started out with Celery. At some point, however, we decided that a little extra latency on task start-up was acceptable, so we got rid of two always-on components, the Celery worker and Redis, by switching to the KubernetesExecutor. This added roughly 8–10 seconds before each task starts executing, a minor detail we could live with in exchange for better resource utilization. Outside of some initial tweaking, administering the instances has been relatively simple.

For the metadata DB we chose Postgres, as recommended.
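Pulling those choices together, a trimmed values file for the official Helm chart might look like the sketch below; image names, tags and hostnames are illustrative rather than our actual values.

# Hedged sketch of a trimmed values.yaml for the official Airflow Helm chart.
executor: "KubernetesExecutor"    # the chart then skips the Celery workers and Redis

images:
  airflow:
    repository: myregistry.azurecr.io/airflow-custom   # customized base image (placeholder)
    tag: "2.7.1"

webserver:
  defaultUser:
    enabled: false                # logins come from Azure OAuth instead

postgresql:
  enabled: false                  # use an external Postgres rather than the bundled one
data:
  metadataConnection:
    user: airflow
    host: airflow-db.example.net  # placeholder
    port: 5432
    db: airflow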

We also added a custom RBAC layer: essentially a customized version of the default "User" role that only allowed access to DAGs carrying a specific access_control definition, plus a selection of menu items. This was implemented to allow for multi-tenancy without deploying separate instances per team (more on this in the multi-tenancy section below).

This is a simplified view of the architectural layout:

[architecture diagram]

We maintained a base Airflow image with our own customizations, scripts, and so on; it was built for each of our environments and stored in the respective container registries on Azure. This repository typically only saw activity when there was an Airflow version upgrade or when we had developed a new plugin.
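Such a customized image is typically just a thin layer on top of the official one; a hedged sketch, with file names and versions as placeholders:

# Hedged sketch of a customized Airflow base image.
FROM apache/airflow:2.7.1-python3.10

# Extra providers and internal libraries required by the DAGs.
COPY requirements.txt /requirements.txt
RUN pip install --no-cache-dir -r /requirements.txt

# Company plugins, custom timetables and callback helpers baked into the image.
COPY plugins/ /opt/airflow/plugins/
COPY scripts/ /opt/airflow/scripts/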

The deployment repository contained all our deployment logic and the CI/CD pipeline that ran whenever our users pushed something to main in their own repositories. In the interest of keeping things simple for our end users, they were not exposed to this repository at all (they could of course peek if they wanted to); they only needed to look at the pipeline run status to see whether it had succeeded and whether any of the integrity tests had failed.
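A common shape for such integrity tests is a DagBag import check, roughly like this (the dags/ path is a placeholder):

# Fails the pipeline if any DAG in the repository cannot be imported.
from airflow.models import DagBag


def test_dags_import_without_errors():
    dag_bag = DagBag(dag_folder="dags/", include_examples=False)
    assert not dag_bag.import_errors, f"DAG import errors: {dag_bag.import_errors}"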

Our users were used to receiving notifications by email, so that’s the alerting method we continued to offer with Airflow.
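Email alerting in Airflow needs little more than an SMTP configuration on the instance plus a couple of default_args on the DAG; a minimal sketch with placeholder addresses:

# Minimal sketch of email-on-failure alerting; addresses and schedule are placeholders.
# An [smtp] section must also be configured on the Airflow instance for mails to go out.
from pendulum import datetime

from airflow import DAG

default_args = {
    "email": ["data-platform-alerts@example.com"],
    "email_on_failure": True,
    "email_on_retry": False,
}

with DAG(
    dag_id="nightly_load",
    start_date=datetime(2024, 1, 1, tz="UTC"),
    schedule="0 2 * * *",
    catchup=False,
    default_args=default_args,
) as dag:
    ...  # tasks added here inherit the email settings from default_args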

Last but not least, observability was a key requirement, at least for us in the platform team. In the early days we encountered a lot of dropped database connections (mostly due to network bottlenecks) and we struggled to understand when they occurred. Thankfully, Splunk was rolled out across the organisation around the same time, and we were able to swiftly put together good observability dashboards as well as alerts on key metrics.

We also ran a customized version of the maintenance DAG to remove any metadata older than 30 days.
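On recent Airflow releases (2.3 and later), a similar 30-day retention policy can also be sketched with the built-in airflow db clean command; the schedule below is illustrative.

# Hedged sketch of a metadata cleanup DAG built on `airflow db clean`.
from pendulum import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="airflow_db_cleanup",
    start_date=datetime(2024, 1, 1, tz="UTC"),
    schedule="@weekly",
    catchup=False,
) as dag:
    clean_metadata = BashOperator(
        task_id="clean_metadata",
        bash_command=(
            "airflow db clean "
            "--clean-before-timestamp '{{ macros.ds_add(ds, -30) }}' --yes"
        ),
    )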

Multi-Tenancy

Decentralization was a key aspect of, and a major expectation from, the migration to Airflow. To avoid setting up a separate instance per team, we began thinking about a design that would require the least amount of involvement from us, the platform team. After some trial and error, we came up with a model that was agreeable to us and, most importantly, to the users. We provided three instances: Sandbox, Test and Production. Teams were only required to add an access_control block to their DAGs, referencing a role name created for their team. For example:

ACCESS_CONTROL = {
    "Admin-MyTeam": {
        "can_read",
        "can_edit",
    },
}

And in the DAG definition:

access_control=ACCESS_CONTROL
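Put together, a team's DAG would pick the role up along these lines; the dag id and schedule are placeholders.

# Minimal sketch of a team DAG carrying the access_control block.
from pendulum import datetime

from airflow import DAG

ACCESS_CONTROL = {
    "Admin-MyTeam": {"can_read", "can_edit"},
}

with DAG(
    dag_id="myteam_daily_report",
    start_date=datetime(2024, 1, 1, tz="UTC"),
    schedule="@daily",
    catchup=False,
    access_control=ACCESS_CONTROL,
) as dag:
    ...  # the team's tasks go here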

These roles were mapped to Azure Active Directory groups containing the relevant users. Whenever our CI/CD pipeline ran, our scripts would pick up any new roles and populate the right permissions into them. This ensured that when a user with a specific role logged into Airflow, they would see nothing but their own DAGs. This isn't true multi-tenancy by any means, but it served our objective of isolating users to their own workloads.
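As a hedged sketch of what such a sync step can look like, leaning on Airflow's internal security manager (an assumption about the mechanism, not necessarily the exact scripts we ran):

# Creates any team role referenced in an access_control block and applies
# the per-DAG permissions to it. Paths are placeholders.
from airflow.models import DagBag
from airflow.www.app import cached_app

dag_bag = DagBag(dag_folder="dags/", include_examples=False)
security_manager = cached_app().appbuilder.sm

for dag in dag_bag.dags.values():
    if not dag.access_control:
        continue
    for role_name in dag.access_control:
        security_manager.add_role(role_name)  # no-op if the role already exists
    security_manager.sync_perm_for_dag(dag.dag_id, dag.access_control)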

Every team had their own DAG repository, and we simply set up triggers on our main pipeline to listen for changes on the main branch of each of the associated repositories.

We used Terraform to manage the role assignments at the AD group level, and these always ran as part of every deployment. This meant that the only thing an end user had to do was apply for an AD group assignment, and a team member who owned the group would approve the grant.
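A hedged sketch of what that Terraform can look like with the azuread provider; group names, object ids and the app role id are placeholders.

# Group whose membership end users apply for; the owner approves requests.
variable "team_owner_object_id" {}
variable "admin_myteam_app_role_id" {}
variable "airflow_service_principal_object_id" {}

resource "azuread_group" "airflow_admin_myteam" {
  display_name     = "airflow-prod-admin-myteam"
  security_enabled = true
  owners           = [var.team_owner_object_id]
}

# Tie the group to the app role on the Airflow app registration so its members
# receive the "Admin-MyTeam" role claim when they sign in through Azure OAuth.
resource "azuread_app_role_assignment" "airflow_admin_myteam" {
  app_role_id         = var.admin_myteam_app_role_id
  principal_object_id = azuread_group.airflow_admin_myteam.object_id
  resource_object_id  = var.airflow_service_principal_object_id
}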

Spreading the word and Expansion

Constantly tweaking the platform is never the way to work. Doing so would have meant not being able to focus on the actual task at hand, which was the migration and guiding our users. After the initial wave of troubles we got wiser, adjusted the configuration so that we caught most, if not all, problems, and at some point Airflow was simply humming along as we kept adding new workloads to it.

We baked a lot of reusable things into the platform, such as custom timetables (to account for Danish holidays, etc.), generally useful plugins and custom callback functions, and documented them so our users could simply drop them into their DAGs.
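As an illustration of the timetable idea (not our exact implementation), a Danish business-day timetable can be sketched along the lines below. It uses the third-party holidays package, which is an assumption on my part, and has to be registered through a plugin before DAGs can reference it.

# Hedged sketch: one run per Danish business day, skipping weekends and public holidays.
from datetime import timedelta

import holidays
from pendulum import UTC, Date, DateTime, Time

from airflow.plugins_manager import AirflowPlugin
from airflow.timetables.base import DagRunInfo, DataInterval, TimeRestriction, Timetable

DANISH_HOLIDAYS = holidays.Denmark()


def _is_business_day(day) -> bool:
    """Weekdays that are not Danish public holidays."""
    return day.weekday() < 5 and day not in DANISH_HOLIDAYS


class DanishBusinessDayTimetable(Timetable):
    """One run per Danish business day, each covering the preceding 24 hours."""

    def infer_manual_data_interval(self, *, run_after: DateTime) -> DataInterval:
        start = DateTime.combine((run_after - timedelta(days=1)).date(), Time.min).replace(tzinfo=UTC)
        return DataInterval(start=start, end=start + timedelta(days=1))

    def next_dagrun_info(self, *, last_automated_data_interval, restriction: TimeRestriction):
        if last_automated_data_interval is not None:
            next_start = last_automated_data_interval.end
        else:
            next_start = restriction.earliest
            if next_start is None:
                return None  # no start_date set, nothing to schedule
            if not restriction.catchup:
                next_start = max(next_start, DateTime.combine(Date.today(), Time.min).replace(tzinfo=UTC))
            next_start = DateTime.combine(next_start.date(), Time.min).replace(tzinfo=UTC)
        # Roll forward until we land on a weekday that is not a Danish holiday.
        while not _is_business_day(next_start.date()):
            next_start = next_start + timedelta(days=1)
        if restriction.latest is not None and next_start > restriction.latest:
            return None
        return DagRunInfo.interval(start=next_start, end=next_start + timedelta(days=1))


class CompanyTimetablesPlugin(AirflowPlugin):
    name = "company_timetables"
    timetables = [DanishBusinessDayTimetable]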

Then we did a lot of hands-on training, presented at several company-wide show-and-tells, and found quite a few users who wanted to use Airflow to orchestrate their workloads.

As the number of users grew, so did the types of jobs and their complexity: executing Kubernetes jobs across various clusters in the organisation, running Oracle stored procedures, SSIS packages and SAP jobs, among others. None of this would have been possible without strong PoCs and some long knowledge-sharing workshops.

Learnings

This could be several sub-sections on its own; however, I feel it is best to keep things generic here. At the outset, every migration is challenging, no matter the size. There are several hidden complications, and some may only appear at a crucial juncture in the migration process. But as complex as they are, these things aren't impossible to solve or work around.

As with many things, start by understanding the needs of the users. While there were technical issues to deal with, they were more a matter of making engineering choices than actual deal breakers. The majority of our problems came in the form of offering users support and guidance, and accounting for several different scenarios. In general, by talking to and listening to our users, we were able to find a solution that fit a large majority of them, if not all.

Next, consider the size of the team. We were a modest team of 1 to 1.5, and at most 2, engineers carrying out this massive migration. Realistically, this was a hidden bottleneck, because we didn't really have the engineering capacity to make some major architectural or design decisions, and many things were invented in the moment. At many points it felt like a one-man army. It was extremely rewarding to cater to so many people, until a certain point when it simply became too much.

Third, and perhaps most important, a technical Product Owner with a vested interest in managing stakeholders and keeping development on schedule is an absolute necessity. Having an orchestrator (pun intended) for the engineering team and the stakeholders makes things a lot smoother and allows for a better glimpse of the future.

It might be worthwhile to try to offload the migration responsibility to the various end-teams. This takes a lot of upfront work, but it pays off, as it empowers users to work on their own, and who knows, they may even end up collaborating with the core team to improve the platform.

Last but not least, follow the latest developments in Airflow and, if possible, contribute to the project. The community is perhaps Airflow's biggest strength, so it makes perfect sense to contribute and add to that strength.

Conclusion

Ultimately, not much technical information has been shared here, and that's on purpose. Plenty of technical information exists in the relevant sources, but it is not often that a migration post-mortem (or mid-mortem, in this case) gets written up. I feel we went through a very unique journey that is worth sharing, and I hope it inspires others to give Airflow a try instead of an enterprise solution.

Besides, it’s also my very first public write-up, and I hope it’s all over the place.

Many things have not been mentioned here, and the work of so many amazing people hasn't been highlighted either. Feel free to share your experiences if you have attempted something similar, or to ask questions. I'm always open to chat 😃
