The Evolution of Transformation Layer Architecture in 99 Group (DBT, Airflow and Kubernetes)

Andy Yan · 99.co · Jul 20, 2022 · 7 min read

This article requires some basic understanding of Airflow, DBT (data build tool) and Kubernetes. If you are deploying Airflow on a Kubernetes cluster and are looking for ways to integrate DBT into Airflow, this article might provide you with some inspiration.

In our last article, we briefly shared how we integrated DBT into Airflow, which is deployed on a Kubernetes cluster. Our transformation layer has since evolved into one that is more robust and scalable. This article documents the journey from our first design, based on an RPC server set-up, to the current one, based on KubernetesPodOperators.

Part 1 — The RPC Server Set-Up

The folks from Astronomer gave us some pretty good ideas on how we could go about setting up the integration between DBT and Airflow. Our initial design was heavily inspired by this article by them. As recommended in the article, we set up a CI/CD process that generates a new manifest.json file upon each new commit to the main branch. The Airflow instance then reads the manifest.json file, creating a DAG per model, which also takes care of running the upstream models. However, there are a few things we thought could be improved:

  • The DBT models might have to sit together with the Airflow instance for Airflow to access them and perform dbt run on them. Assuming that an Airflow repository already exists, the DBT-related files might have to sit in the same repository. This might make the repository bloated, and analytics engineers might find it difficult to maintain the DBT-related components.
  • It is technically possible to create a separate repository for storing the DBT models, and have the Airflow instance simply read from the directory where the DBT models are stored. In a Kubernetes set-up, however, this could mean additional complexity in the form of setting up a shared volume and copying all the DBT models over for every new update.
  • Furthermore, with the variety of workloads we are deploying on Airflow, more and more libraries are being installed in the same Airflow instance. We wanted to keep the libraries lean to avoid conflicts and messy dependency issues.

While seeking solutions to address these concerns, we discovered that DBT provides a plugin to run an RPC server right out of the box. According to DBT, “this server compiles and runs queries in the context of a dbt project. Additionally, the RPC server provides methods that enable you to list and terminate running processes.” Essentially, this means that once we have the server set up, instead of having the DBT models sit in the Airflow repository, we could just:

  • Submit a request to the RPC server (which already has the project’s context) to run a DBT command
  • Poll the status of the submitted job
  • When the job is done, parse the response and decide what to do based on the final status

So, for every task node generated from parsing manifest.json, instead of using a BashOperator to execute dbt run, we would use a PythonOperator that executes a function fulfilling these basic steps.
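As a rough illustration, here is a minimal sketch of how that per-node task creation could look, assuming the manifest.json file is accessible to the Airflow instance. The file path, DAG settings, task naming and the run_dbt_command helper are placeholders for illustration, not our exact production code:

```python
import json

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.utils.dates import days_ago

# Hypothetical helper that submits a dbt command to the RPC server and polls
# until the job finishes; a sketch of it is shown further below.
from dbt_rpc_utils import run_dbt_command

with open("/path/to/manifest.json") as f:  # illustrative path
    manifest = json.load(f)

with DAG(
    dag_id="dbt_transformations",
    schedule_interval="@daily",
    start_date=days_ago(1),
) as dag:
    dbt_tasks = {}
    for node_id, node in manifest["nodes"].items():
        if node["resource_type"] != "model":
            continue
        dbt_tasks[node_id] = PythonOperator(
            task_id=node_id.replace(".", "_"),
            python_callable=run_dbt_command,
            # The selection flag may be --models or --select depending on the dbt version.
            op_kwargs={"command": f"run --models {node['name']}"},
        )

    # Wire each model to run only after its upstream models.
    for node_id, node in manifest["nodes"].items():
        if node_id not in dbt_tasks:
            continue
        for upstream_id in node["depends_on"]["nodes"]:
            if upstream_id in dbt_tasks:
                dbt_tasks[upstream_id] >> dbt_tasks[node_id]
```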

With the DBT repository decoupled from the Airflow repository, the overall deployment now looks like this:

And for every DBT task, the PythonOperator runs the following function:

Creating DBT task with PythonOperator
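Since the original snippet is not embedded here, the following is a simplified sketch of such a function, based on the dbt RPC server's JSON-RPC interface (cli_args to submit a command, poll to check on it). The service URL, payload fields and state names are assumptions from the dbt RPC documentation and may differ from our production code:

```python
import time
import uuid

import requests

RPC_URL = "http://dbt-rpc-service:8580/jsonrpc"  # assumed service name and port


def run_dbt_command(command: str, poll_interval: int = 10) -> None:
    """Submit a dbt command to the RPC server and poll until it finishes."""
    # Step 1: submit the command (e.g. "run --models my_model") via cli_args.
    submit_payload = {
        "jsonrpc": "2.0",
        "method": "cli_args",
        "id": str(uuid.uuid4()),
        "params": {"cli": command},
    }
    response = requests.post(RPC_URL, json=submit_payload)
    response.raise_for_status()
    request_token = response.json()["result"]["request_token"]

    # Step 2: poll the server with the request token until a terminal state.
    while True:
        poll_payload = {
            "jsonrpc": "2.0",
            "method": "poll",
            "id": str(uuid.uuid4()),
            "params": {"request_token": request_token, "logs": False},
        }
        result = requests.post(RPC_URL, json=poll_payload).json()["result"]
        state = result["state"]
        if state == "success":
            return
        if state in ("error", "failed"):
            # Raising marks the Airflow task as failed.
            raise RuntimeError(f"dbt job failed: {result}")
        time.sleep(poll_interval)
```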

The following chart further breaks down what happens when Airflow runs a DBT task for a particular model:

  1. The Airflow scheduler spawns a pod to run a DBT task
  2. The pod makes a request to the RPC server, specifying the command that it wants to run
  3. The RPC server receives the request, then starts a job to run the command
  4. The pod receives a response with a job ID
  5. The pod uses the job ID to continuously make requests to the RPC server to check on the status of the job
  6. The job is still running
  7. The RPC server responds that the job is still running
  8. The pod makes another request to check on the status of the job
  9. Eventually, the job succeeds or fails
  10. The pod receives a response saying the job succeeded or failed, and the Airflow task is marked successful or failed accordingly

With this design, we managed to decouple our DBT repository from the Airflow repository. Analytics engineers can work on the DBT repository without concerning themselves with how these models are run in production. Data engineers can work with a leaner repository and avoid potential issues with conflicting libraries.

Part 2 — The KubernetesPodOperator Set-Up

The RPC server was able to serve its purpose for a while. However, as we added more models and ran them more frequently, the weaknesses of this set-up started to show. Since the RPC server was deployed with fixed resources, it was not able to scale itself up when more requests came in. To make matters worse, there was no queue system built into the RPC server to handle the incoming load of requests. Tasks took longer to complete. Tasks that took too long to respond were marked as failed, and Airflow triggered retries, which piled onto the existing jobs. The overall performance of the transformation system degraded significantly.

As we can see, the RPC server set-up is actually an anti-pattern relative to the rest of the system built on the Kubernetes cluster. When Airflow ran more and more tasks, more pods were spawned, and the cluster was able to scale up its resources to meet the demand from these pods. Conversely, the RPC server had to work with its fixed resources and try to run all of the models being requested. We tried to work around the problem by giving the RPC server more resources, but that was not scalable. The added resources often could not keep up with the increasing number of DBT models added over time. During idle periods, the RPC server simply hoarded those resources without running any jobs, which translated into higher server costs.

It is technically possible to scale up the number of RPC server instances during periods of high demand. However, since the pod that triggers a job on the RPC server has to poll the same RPC server instance with the given ‘request_token’, a mechanism to route requests back to the same instance would have to be implemented. Given this added complexity, our team decided that it was not worthwhile to pursue this design further.

At this juncture, the team had been experimenting with the KubernetesPodOperator for pipelines that cannot be readily built with the default operators. One advantage of the KubernetesPodOperator is that the libraries specific to a workload are contained within its image, and Airflow is simply the orchestrator that triggers the job, which in turn spawns a pod to run the workload on the cluster. We can specify the resources we want for that pod, along with any variables and configurations. It is an elegant abstraction that allows us to run any job we want.

So, we figured we could use the KubernetesPodOperator to run the DBT tasks instead. For each model node parsed from the manifest.json file, instead of a PythonOperator that makes requests to the RPC server and waits for a response, we create a KubernetesPodOperator that pulls the DBT repository image and specifies a custom command to run that specific model.

Here’s a code snippet that illustrates the task created for every node parsed from the manifest.json file:

Creating DBT Task with KubernetesPodOperator
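The original snippet is not reproduced here, but a minimal sketch of such a task, created inside the same manifest-parsing loop as before, could look like the following. The image name, namespace and resource values are placeholders, and parameter names such as container_resources vary across versions of the cncf.kubernetes provider:

```python
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator
from kubernetes.client import models as k8s

# Inside the loop over model nodes parsed from manifest.json:
dbt_tasks[node_id] = KubernetesPodOperator(
    task_id=node_id.replace(".", "_"),
    name=node_id.replace(".", "_").replace("_", "-"),
    namespace="airflow",                          # assumed namespace
    image="gcr.io/our-project/dbt-repo:latest",   # illustrative DBT repository image
    cmds=["dbt"],
    # Custom command that runs only this model; upstream models are handled
    # by the task dependencies wired from manifest.json.
    arguments=["run", "--models", node["name"]],
    container_resources=k8s.V1ResourceRequirements(
        requests={"cpu": "500m", "memory": "512Mi"},
        limits={"cpu": "1", "memory": "1Gi"},
    ),
    get_logs=True,
    is_delete_operator_pod=True,
)
```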
Using KubernetesPodOperator to run DBT tasks
  1. Scheduler spawns a pod to run the DBT task (simplified)
  2. The task pulls the latest DBT repository image
  3. The task then runs a custom command specific to a model

In this case, provided sufficient nodes are allocated to the cluster, when there is a high number of active DBT tasks the cluster will simply scale up the number of nodes to handle the incoming workload. Unlike the previous iteration, resources are now dynamically provisioned.

Conclusion

The current set-up meets all the requirements we intended for our transformation layer. It is highly scalable and stable. Teams can work on smaller codebases that are easier to maintain. We have since increased our workload multiple fold, and the system has handled it without a hitch.

At 99 Group, we are always looking out for improvements and suggestions. If you share similar objectives or have done a similar set-up, please reach out and share your experience!
