Running Airflow at Kabbage

Tony Qiu
Tony Qiu
Nov 6, 2019 · 6 min read

What is Airflow

Apache Airflow is an open source platform for programmatically authoring, scheduling, and monitoring workflows.

With Airflow, engineers build directed acyclic graphs (DAGs) to define workflows. One DAG contains a set of tasks, and each task is one instance of an operator. Airflow provides a lot of built-in operators: BashOperator to run bash scripts, PythonOperator to run any Python code, S3FileTransformOperator to move data in S3, etc. Once a DAG is defined, the Airflow scheduler executes those tasks on an array of workers while following the specified dependencies. Moreover, Airflow has a rich user interface that makes it easy to visualize pipelines running in production, monitor progress, and troubleshoot issues when needed.

Sound familiar? I mean any ETL tool nowadays usually provides such functionality. So why Airflow?

Why Airflow

At Kabbage, we process data from a wide range of data sources to support analytics, automate decisioning (our systems and processes responsible for evaluating our users’ credit worthiness), and run machine learning models. Historically, we have used SQL Server Integration Services (SSIS) to run our ETL jobs. It worked fairly well for a while but as our business grew, we were faced with several new challenges:

  • Scalability: Due to increased volume and complexity, SQL server agents began taking longer and longer to process our data. In order to speed up the process, our only choice was to scale up (vertically) using more powerful machines. Even worse, since all of the jobs shared the same computing and storage resources, some long-running jobs were a drain on other normally quick jobs.
  • Extensibility: Business growth required us to keep adding more and more data sources. From a technical point of view, we also began looking beyond SQL server to process and store data. For example, we are currently using S3 for storage and Spark for data processing. SSIS, including its third party vendors, provide plug-ins to handle those cases, but generally speaking, the functionality is limited and we don’t have the low level control we need in many cases.

Airflow addresses the above issues pretty well. The design of Airflow supports horizontal scaling natively. We can simply add more Airflow workers to increase processing power. Airflow also provides a solid suite of default modules for common tasks such as moving data to and from S3 and creating Spark jobs in EMR and Databricks, among others. Even when there is no built-in support, we can usually find Python libraries to build our customized airflow operators. For example, we used the simple-salesforce library to develop our own operator for Salesforce data movement.

Besides those advantages, the most unique feature of Airflow compared with traditional ETL tools like SSIS, Talend, and Pentaho is that Airflow is purely Python code, meaning it is the most developer friendly. It is much easier to do code reviews, write unit tests, set up a CI/CD pipeline for jobs, etc..

We made the decision to use Airflow last year, and since the first job was deployed, we have seen quick adoption of the Airflow platform internally. More than one hundred jobs/DAGs are running in our production cluster. Some of them were migrated over from SSIS, but many of them are brand new.

Infrastructure and Deployment

We built a customized Airflow docker image, and use that as the baseline of both development and deployment. Developers use docker-compose to run Airflow instances locally for testing. Development and production clusters use the exact same docker image, triggered by Bamboo to deploy new releases.

The cluster infrastructure is shown in the above figure. Since all of our cloud infrastructure is hosted in AWS, we leverage AWS RDS services for meta storage, and AWS Elasticache for our scheduling queue. Note that we use AWS Storage Gateway as shared storage to save things like DAGs, configurations, and intermediate data. Most people use AWS EFS for that purpose, but we’ve found that Storage Gateway works pretty well and is much cheaper. The Storage Gateway also plays a role of migrating existing Windows file share storage to S3 at Kabbage. Currently, all Airflow docker instances, including the web server, workers, and scheduler, are running on a DC/OS cluster. So far, the infrastructure fits our needs very well, and we haven’t experienced any outages of our Airflow cluster for more than one year.

Development

We organize the Airflow DAGs development code as follows. The dags folder contains both common library folders and all jobs definitions. Each job folder could have one or multiple DAGs. The conf and sql folders have the same structure. The conf folder is used for environment specific variables, and the sql folder is used for SQL templates (based on Jinja template engine). The consistent structure allows each DAG to read the corresponding configuration and sql statements more easily (convention over configuration).

We’ve developed a set of common Kabbage operators. One example is that we developed a generic file movement operator supporting local file system, sftp, ftps, Google Drive, and S3, with advanced pattern matching criteria built-in. Another example is that we built a customized operator to trigger Databricks notebooks, which makes CI much easier. (We will write a separate article in the future to discuss more about Databricks development.) Besides operators, we also provide several template Airflow DAGs to demonstrate the typical uses we need: moving data from and to sftp/S3, triggering a Databricks job in spark, etc.

In terms of testing, we enforce unit tests for common libraries. We also do smoke/validation tests to load our DAGs to make sure they are properly constructed. We also apply rules to smoke tests. For example, a notification email address must be included in our DAG configuration.

The goal of all of the above work is to simplify the Airflow job development process. Any engineer, even those without much experience with Python, can add new DAGs by themselves. We’ve observed a high adoption rate throughout our organization, including engineers with just a SQL or .NET background.

Migration

We have hundreds of SSIS jobs running in production today. Migrating all of them is not a simple task. The main challenge is the complexity of the job dependencies.

We adhere to the following rules to perform migrations:

  • Start with small and relatively independent jobs, so that the migrations are done incrementally.
  • If the SSIS job has upstream dependencies, we use customized Airflow sensors to bridge the dependencies between those jobs and newly migrated Airflow jobs.
  • For complicated jobs, we run the Airflow version and SSIS job version in parallel, and perform cross validation, disabling the SSIS job only after everything perfectly matches.

We have migrated a significant amount of jobs, and in general we’ve seen that those jobs are more stable and easier to troubleshoot than their SSIS predecessors.

Future Improvements

Many engineering teams want to put more and more jobs into our Airflow platform. Based on the DevOps philosophy “you build it, you run it”, we need better multi-tenancy support and role based access control, so that teams can manage their jobs more easily and independently, without risking impacting other teams’ jobs. There is a very recent blog post from Lyft talking about this topic in detail.

Terraform + Kubernetes has been adopted as a standard for infrastructure management at Kabbage. It helps us to automate environment creation and testing. We plan to migrate our current Airflow docker orchestration infrastructure from DC/OS to Kubernetes as well, using Terraform to automate the provisioning. It will not only consolidate the tools we use but also simplify the work to bring up and shutdown a new environment.

We look forward to sharing the outcomes of those improvements in the near future.

Conclusion

In this article, we shared the data processing challenges Kabbage faces, as well as the reason we chose Airflow as our primary workload orchestration platform. We also shared some details of Airflow development and deployment at Kabbage.

Airflow is a great open source platform for developing modern ETL applications. However, there is no silver bullet. Data engineers still need to apply well-known practices, idempotency for example, to develop robust data processing jobs, no matter what tools we use.

kabbage-engineering

Kabbage Engineering