Airflow: Leading Orchestrator

And why you should know about it

Victor Caceres Chian
Machine Learning Reply DACH
5 min read · Sep 15, 2022


Automation and Orchestration

Increasing efficiency, saving money, reducing the chance of errors… these are the ideas associated with the concept of digitalization, no matter the industry.

Making our tasks easier has long been a focus of technology, and the first method that comes to mind, one that embodies all these criteria, is automation.

When we think about automation, we usually picture entire company processes running on their own. However, this is not completely right. A more accurate understanding of automation is setting a single task to run on its own. Starting a service, processing some data, or sending an email at a predetermined time are examples of automated tasks. On an individual scale, this can improve daily work tremendously. At a company scale, however, it is not enough.

What we are really after is orchestration! This is the process of automating many tasks to work together. The concept sounds more elegant and much more complex… And it is! It requires not only knowing and understanding how tasks interact with each other in terms of sequences and dependencies, but also being able to track their progress across many environments.

When automation is built into a company, there are notable improvements such as:

  • Saving Costs: Resources are used more efficiently because there is less downtime.
  • Standardizing Workflows: Existing processes become easier to administer and understand, and similar ones are easier to incorporate.
  • Improving the Work Experience: Employees avoid boring, manual, repetitive tasks.
  • Increasing Productivity: There is a lower chance of errors as entire processes run uninterrupted.

Because all of this is a complex challenge, many tools exist to assist us with orchestration. A very popular one is Airflow.

Airflow: Leading Orchestrator

Airflow was created at Airbnb in 2014 as an open-source tool. It is an orchestration tool for creating, scheduling, and monitoring workflows.

Airflow utilizes Directed Acyclic Graphs (DAGs) to define workflows. A scheduler dispatches tasks to workers, which execute them following the sequence and dependencies expressed in the DAG. Task monitoring, progress, and troubleshooting are surfaced in an interactive, easy-to-understand user interface.
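
To make this concrete, here is a minimal sketch of a DAG with three dependent tasks, written against the Airflow 2.x API. The DAG name, schedule, and commands are illustrative assumptions, not taken from any real project.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def process_data():
    # Placeholder for real data-processing logic.
    print("processing data")


with DAG(
    dag_id="example_pipeline",       # hypothetical name
    start_date=datetime(2022, 9, 1),
    schedule_interval="@daily",      # the scheduler triggers one run per day
    catchup=False,
) as dag:
    extract = BashOperator(
        task_id="extract",
        bash_command="echo 'extracting data'",
    )
    transform = PythonOperator(
        task_id="transform",
        python_callable=process_data,
    )
    notify = BashOperator(
        task_id="notify",
        bash_command="echo 'pipeline finished'",
    )

    # The >> operator declares the dependencies that form the acyclic graph:
    # extract runs first, then transform, then notify.
    extract >> transform >> notify
```

The scheduler reads this file, workers pick up each task once its upstream dependencies have succeeded, and the UI then shows the run's progress task by task.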

It follows four main principles:

  • Dynamic: Pipelines are written in Python and are not restricted to a particular type of task.
  • Extensible: Operators, executors, and the library itself can be extended at the user's will (a minimal custom operator is sketched after this list).
  • Elegant: Airflow pipelines are simple and explicit.
  • Scalable: A modular architecture allows an arbitrary number of workers, so Airflow scales to the user's needs.
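
As an illustration of the "Extensible" principle, here is a minimal custom operator built by subclassing BaseOperator. The class name and behavior are invented for this example.

```python
from airflow.models.baseoperator import BaseOperator


class GreetOperator(BaseOperator):
    """Hypothetical operator that logs a greeting when the task runs."""

    def __init__(self, name: str, **kwargs):
        super().__init__(**kwargs)
        self.name = name

    def execute(self, context):
        # execute() is the method an Airflow worker calls to run the task.
        self.log.info("Hello, %s!", self.name)
        return self.name  # the return value is pushed to XCom by default
```

Once the class is importable from a DAG file, it can be used like any built-in operator, e.g. GreetOperator(task_id="greet", name="Airflow").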

But why utilize Airflow?

  • Market Share: Airflow has the highest market share in the Workflow Automation category at 34%; its nearest competitors hold only 13% and 12%.
  • Out of the Box: According to a 2022 Airflow survey, more than 80% of Airflow users run vanilla Airflow, meaning the out-of-the-box product is very useful and can be used without modifications.
  • Customer Satisfaction: According to the same survey, more than 90% of Airflow users are likely to recommend Airflow.
  • Open Source: Apache Airflow is free to use, and over 500 community members work on the product, continuously improving the tool!

Use Cases of Airflow

Here are some success cases from Airflow's blog in two different contexts: Machine Learning and agile deployment.

Adobe Experience Platform

The Adobe Experience Platform is a system that transforms data into customer profiles updated in real time. It utilizes Machine Learning to derive personalized insights about each customer. To achieve this, the Big Data platform requires many data pipelines that connect its backend services and merge them into complex workflows. These workflows need to be deployed, monitored, and run at specific times. An orchestration service was built on top of Airflow to let users build and handle complex workflows for the data.

Adobe Experience Platform now utilizes Airflow's components to focus on business use cases, as Airflow handles all scheduling, dependencies, and error handling. It can now easily scale to thousands of workflows depending on the use case.
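
One common way to reach that kind of scale is to generate DAGs dynamically from configuration, since pipelines are plain Python. The sketch below shows the generic pattern, not Adobe's actual implementation; the use-case list is a made-up assumption.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Hypothetical workloads; in practice this list could come from a config store.
USE_CASES = ["profiles", "segments", "insights"]

for use_case in USE_CASES:
    dag = DAG(
        dag_id=f"process_{use_case}",
        start_date=datetime(2022, 9, 1),
        schedule_interval="@hourly",
        catchup=False,
    )
    BashOperator(
        task_id="run",
        bash_command=f"echo 'processing {use_case}'",
        dag=dag,
    )
    # Exposing each DAG as a module-level variable lets the scheduler discover it,
    # yielding one independent DAG per use case from a single file.
    globals()[f"process_{use_case}"] = dag
```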

Sift

Sift constantly trains and re-trains machine learning models to identify suspicious behavior and protect its customers' transactions, platforms, and accounts. To achieve this, they implemented workflows that train models across many MapReduce and Spark tasks with dependencies among them.

Sift utilizes Airflow to monitor, re-run, and track the pipelines' successes and failures. It allows them to manage their entire machine learning pipelines and to create new ones, ranging from data backups to ETL pipelines that prepare data for ML models.
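
A pipeline of that shape might look like the following sketch, which leans on Airflow's built-in retry and failure-notification settings. The task names, retry counts, and spark-submit command are illustrative assumptions, not Sift's actual code.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {
    "retries": 3,                         # failed tasks are re-run automatically
    "retry_delay": timedelta(minutes=5),
    "email_on_failure": True,             # requires SMTP to be configured
}

with DAG(
    dag_id="model_training",              # hypothetical pipeline name
    start_date=datetime(2022, 9, 1),
    schedule_interval="@daily",
    default_args=default_args,
    catchup=False,
) as dag:
    prepare = BashOperator(
        task_id="prepare_features",
        bash_command="echo 'running feature ETL'",
    )
    train = BashOperator(
        task_id="train_model",
        # A real setup might call spark-submit or a Spark provider operator here.
        bash_command="echo 'spark-submit train.py'",
    )
    backup = BashOperator(
        task_id="backup_data",
        bash_command="echo 'backing up training data'",
    )

    prepare >> train >> backup
```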

Experity

Experity focuses on building software for urgent care, improving the patient experience and establishing its customers' urgent care clinics in the healthcare community. To provide this service, Experity required deploying its application to multiple nodes in different ways, communicating tasks across Windows nodes, and coordinating their timing.

Experity utilizes Airflow for its flexibility, as it can perform any task on any node. They rely on Airflow to stay as agile as possible in their deployments. Its reliability and scalability have allowed Experity to decrease the time needed for its fleet of servers to become operational.

Conclusion

In this article we have highlighted the need for, and the contribution of, an orchestration tool inside a company, independent of its sector. We have also shown why Airflow is a good option in the market and presented some success cases.

At Machine Learning Reply, we guide and support all our customers in developing their IT capabilities towards their Machine Learning use cases, regardless of their current phase.

In the next articles we will take a deep dive into Airflow as a tool, see how to construct DAGs, and check out a deployment on AWS ECS.
