How We Automated Our Manual Payday Using Airflow And Spark

Karin Braunstein
Riskified Tech

--

In the Engineering world today, manual processes seem ridiculous, right?

We believe that everything should be automated to save time and, most importantly, to avoid human mistakes. Unfortunately, many companies face the issue of old processes that were (or are) still running manually.

In this post, I’d like to present one of our jobs and why we chose Airflow and Spark to automate it. We’ll also look at code examples for integrating Airflow.

The job we’ll discuss is Payday, which runs monthly at Riskified.

What is the Payday process?

At Riskified, the Payday process is part of our integration with merchants. To transfer money between Riskified and each merchant every month, we create invoices and charge files that are delivered to the merchant.

We offer merchants a suite of products, some of which run their own Payday processes.

My team develops a product called Deco, which runs an individual Payday that is only used for Deco’s clients. The process was run manually by the team up until a few months ago.

It’s also important to mention that since we’re talking about finances here, the process should be absolutely accurate. Moreover, Payday should be visible to many teams outside the Dev department, including the Finance and BI teams.

How the process ran manually

Each month, one of the team members manually triggered a job in GoCD for each of our merchants. For the purposes of this blog, think of GoCD simply as infrastructure for running jobs.

This job ran a query on our Data Warehouse to fetch the monthly charges of the given merchant and uploaded the result to an S3 bucket, which is visible to our merchants through the API.

So if we had ten merchants, this job would have to be triggered manually ten times, each time with a different merchant id. Exhausting, right?

And that wasn't the only manual part. We had no automation for creating invoices for the merchants, so we had to send the charges to the Finance team, who summed everything up and created an invoice in Riskified's format.

The following diagram shows the process stages, described for a single merchant:

This manual process had many problems: we had to run Payday by hand every month, and the process itself was very long and passed through many people. We realized that automating it had to be a high priority.

The new generation Payday

To eliminate the manual work in the Finance team, we added a new requirement: sending aggregated billing data to an internal billing service, which generates invoices for the merchants. This automates the Finance team's manual calculation of the month's charges.

So the flow we were aiming for was:

  • Get merchant charges; in our case, this is done by running a query in the Data Warehouse for the relevant month and merchant
  • Upload charges to a bucket in S3 (or any other tool that can store charge files)
  • Aggregate raw charges data into billing format
  • Send aggregated data and charges URL to the internal billing service
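
To make the last step a bit more concrete, here's a minimal sketch of what sending the aggregated data to the billing service could look like. The endpoint, payload fields, and helper name are illustrative assumptions, not our actual billing API.

```python
# Hypothetical sketch of the final step: the endpoint, payload fields, and
# helper name are assumptions, not Riskified's real billing API.
import requests


def send_to_billing_service(merchant_id: str, month: str, charges_url: str, totals: dict) -> None:
    payload = {
        "merchant_id": merchant_id,
        "billing_month": month,           # e.g. "2023-01"
        "charges_file_url": charges_url,  # S3 location of the raw charges file
        "aggregated_totals": totals,      # output of the aggregation step
    }
    response = requests.post(
        "https://billing.internal.example.com/invoices",  # hypothetical endpoint
        json=payload,
        timeout=30,
    )
    response.raise_for_status()
```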

Let’s deep dive into the brainstorming we had when choosing a solution for the automation.

Alternatives we considered

Our initial go-to solution was to keep the current GoCD infra and trigger it using a CronJob. A different CronJob could also trigger the automatic aggregation that calculates the charges (in Ruby/Scala).

For those who aren’t familiar with CronJob — it’s a resource in Kubernetes that creates jobs on a repeating schedule.

This solution seemed simple and wouldn't require much work: we'd use the existing GoCD infra and trigger it with CronJob, which we already used extensively across our services.

When we looked into the flow's steps in the diagram above, we realized that both the aggregation automation and the storage of the charges file depend on fetching the merchant charges, while the last step, sending the request to the billing service, depends on both of those steps.

Considering the above, we realized that the CronJob solution wouldn't fit here; CronJob doesn't support running a workflow in steps, and we couldn't ensure that one step completed before the next one ran. We also realized that we didn't need to stay coupled to the GoCD job, since it only runs a query and uploads the result to S3.

Spark and Airflow solution

Before diving into the selected solution, let’s talk a bit about what Spark and Airflow are.

Spark is an open-source, distributed processing system used for big data workloads. It utilizes in-memory caching and optimized query execution for fast queries on data of any size. In other words, Spark lets us run aggregations on our data quickly, even when we're dealing with large volumes of data.

In our case, Spark's ability to aggregate large amounts of data can automate the manual work that the Finance team has done until now.
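
To give a feel for that aggregation, here's a minimal PySpark sketch. The input path, column names, and output location are illustrative assumptions, not our actual billing schema.

```python
# A minimal PySpark sketch of a monthly billing aggregation. The paths and
# column names are illustrative assumptions, not the real Deco schema.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("deco-payday-aggregation").getOrCreate()

# Raw charge rows previously unloaded to S3 (hypothetical path and columns).
charges = spark.read.parquet("s3a://deco-charges/merchant_id=1234/2023-01/")

billing_details = (
    charges.groupBy("merchant_id", "charge_type")
    .agg(
        F.count("*").alias("charge_count"),
        F.sum("amount").alias("total_amount"),
    )
)

# Write the aggregated billing details back to S3 for the next step.
billing_details.write.mode("overwrite").parquet(
    "s3a://deco-billing-details/merchant_id=1234/2023-01/"
)
```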

Airflow is also open-source; it's used for programmatically authoring, scheduling, and monitoring workflows. It's a robust platform for orchestrating workflows or pipelines, and it ensures that each task in a pipeline gets executed in the correct order. The collection of all the tasks, organized to reflect their relationships and dependencies, is called a DAG (Directed Acyclic Graph).

Airflow also provides many valuable operators; an operator usually integrates with some other service, such as SQL, S3, Docker, k8s, and so on.
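
To make the DAG and operator concepts concrete, here's a minimal, generic sketch (not our actual Payday DAG); the DAG id, tasks, and schedule are illustrative.

```python
# A minimal, generic Airflow DAG sketch showing tasks, operators, and
# dependencies. The DAG id, tasks, and schedule are illustrative only.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="example_monthly_dag",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@monthly",
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extracting charges")
    load = BashOperator(task_id="load", bash_command="echo loading to S3")

    # ">>" declares a dependency: `load` runs only after `extract` succeeds.
    extract >> load
```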

Having gotten to know both Airflow and Spark from previous experience on the team, we realized that orchestrating the workflow with Airflow and aggregating the data with Spark could be a practical solution for our needs.

In Airflow, it's easy to define dependencies between tasks, and we could use simple operators to upload and download the data we'd need. Airflow seemed like precisely the solution we needed for orchestrating this workflow.

Spark would allow us to aggregate bulk data each month into billing details quickly and conveniently.

We had a few doubts when choosing Spark. The product had only about ten small-to-medium merchants at the time, so we weren't sure whether our workload really justified Spark in terms of “big data”.

Eventually, we chose Spark because the product was growing and we might have many more merchants in the next few years. In addition, we had supporting infrastructure built by the Data team for executing Spark jobs in Airflow, which we'll talk about in the next section.

Looking inside the solution: getting to know Airflow

Airflow has a simple UI, in which we can look at our processes (DAGs) and see their flow. Let’s take a look at ours:

As you can probably see, this DAG contains precisely the steps we talked about earlier when describing the new generation Payday.

Each step performs a job and is implemented with the operators mentioned earlier. In our case, we're using operators that are either tailor-made (and pretty easy to implement) or already available in the community provider packages.

In this DAG, we’re mainly using two types of operators:

  • SnowflakeToS3Operator, which unloads data from our Data Warehouse to an S3 bucket and is based on BaseETLOperator. Once we have this operator defined, it's easy to use (see the first sketch below)
  • A task group that unifies a few operators and creates a SparkApplication in our k8s cluster. Once we had the task group, defining it looked like the second sketch below
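
First, a minimal sketch of how an operator like that could be used inside the DAG. SnowflakeToS3Operator and BaseETLOperator are internal Riskified operators, so the parameter names below are illustrative assumptions rather than the operator's real signature:

```python
# Hypothetical usage of the internal SnowflakeToS3Operator (inside the DAG
# definition). Parameter names are assumptions, not the real signature.
create_merchant_charges_task = SnowflakeToS3Operator(
    task_id="create_merchant_charges",
    sql="sql/monthly_merchant_charges.sql",  # query for the month's charges
    s3_bucket="deco-merchant-charges",       # hypothetical destination bucket
    s3_key="{{ ds }}/merchant_{{ params.merchant_id }}.json",
    params={"merchant_id": "1234"},
)
```

And a sketch of the task group, built here around the community SparkKubernetesOperator and SparkKubernetesSensor. The factory name, namespace, manifest path, and application name are assumptions about how such infra could be wired, not the Data team's actual implementation:

```python
# A sketch of a task group that submits a SparkApplication to k8s and waits
# for it to finish. Names, namespace, and paths are illustrative assumptions.
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.spark_kubernetes import (
    SparkKubernetesOperator,
)
from airflow.providers.cncf.kubernetes.sensors.spark_kubernetes import (
    SparkKubernetesSensor,
)
from airflow.utils.task_group import TaskGroup


def deco_spark_on_k8s_task_group(
    group_id: str, namespace: str, application_file: str, application_name: str
) -> TaskGroup:
    """Bundle the submit and monitor steps of one Spark job into a task group."""
    with TaskGroup(group_id=group_id) as group:
        submit = SparkKubernetesOperator(
            task_id="submit_spark_application",
            namespace=namespace,
            application_file=application_file,  # SparkApplication manifest (YAML)
        )
        monitor = SparkKubernetesSensor(
            task_id="monitor_spark_application",
            namespace=namespace,
            application_name=application_name,  # metadata.name from the manifest
        )
        submit >> monitor
    return group


with DAG(
    dag_id="deco_payday",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@monthly",
    catchup=False,
):
    create_merchant_billing_details_task = deco_spark_on_k8s_task_group(
        group_id="create_merchant_billing_details",
        namespace="deco",                               # hypothetical namespace
        application_file="spark/billing_details.yaml",  # hypothetical manifest
        application_name="deco-billing-details",        # hypothetical app name
    )
```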

The deco_spark_on_k8s_task_group factory will execute the Spark job in the relevant namespace, using the correct snapshot.

Although we already had infra for creating those applications, you can use the community SparkKubernetesOperator to create a SparkApplication object in a k8s cluster.

So now we have our tasks defined, and we can determine the relationship between them easily:
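
As a sketch, those dependency declarations look roughly like this (the task variable names follow the DAG steps above):

```python
# The charges task and the billing-details task have no dependency on each
# other, so Airflow can run them in parallel; the invoice task waits for both.
create_merchant_charges_task >> create_invoice_task
create_merchant_billing_details_task >> create_invoice_task
```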

That’s it! Using those two lines, we defined that “create_merchant_charges_task” and “create_merchant_billing_details_task” can run at the same time, while the last task, “create_invoice_task,” depends on both stages being done.

We now have a seamless solution running every month for the Payday process! That’s awesome.

Wrapping up

This solution was implemented pretty smoothly, thanks to the fact that our team already had infrastructure integrating with Airflow. I still think Airflow can be a great solution even if such infra doesn't exist yet. You can check out this excellent blog post that explains more about getting started with Airflow.

Implementing this feature showed me, as a full-stack developer, that tools that seem intimidating can sometimes be easy to use and a great solution, even if we don't know them yet.

In order to choose the best solution, we need to keep ourselves open-minded and think outside the technologies we are familiar with.
