How We Enabled Dev and Data Science Independence With Clear API Boundaries Using Airflow and Databricks

Uri Brodsky
Riskified Tech
6 min read · Oct 18, 2022

Your dev team needs to use a data science algorithm to solve a real business problem, but how can you actually consume that algorithm? Usually, data scientists write in R, Python, or Scala (Spark), and their code does not expose a microservice you can consume through a clear API like any other service. So you will often need someone (a Dev/ML Platform team) to wrap the data science artifact and expose it for consumption. This, of course, requires time and effort from another team to deploy and maintain artifacts that are actually owned by the data science team.

In this post, I will show you how we enabled our data science team to expose their artifacts with a clear API, allowing them to take full ownership of the process from deployment to production.

Business context

As you probably know, these days machine learning models are making instantaneous decisions that impact our lives. One example of such a decision is in eCommerce, where online merchants need to decide whether an order is fraudulent or not. While these intelligent algorithms are trained on large data sets, the world and consumer behavior are constantly changing, so they require some configuration to maintain performance.

At Riskified, we do just that. We analyze and guarantee transactions in real time, using machine learning and continuously adapting these behavioral models. This requires our merchant health team of fraud experts to monitor our accounts continuously and change the configuration manually, using analytical tools, to give our merchants the best experience possible.

As we continued to grow, we wanted to make sure that merchant health was working on proactive tasks that require human insight rather than on solving an algorithmic optimization problem. Therefore, I had my team work with the data science team to solve this problem.

Defining clear boundaries

The data science team was responsible for what they do best: providing an algorithm that suggests the new optimization value. My team was responsible for developing the end-to-end solution, including the UI and backend, to orchestrate this complex flow.

We wanted to set clear boundaries and consume the data science team’s suggestions as an API. We wanted to enable them to write and deploy their code in R without needing us to deploy, maintain, and monitor it for them, just like any other team that owns its own microservice.

So, how could we enable data science to focus on writing the algorithm, while giving them a simple way to deploy their code to production and to interact with us easily?

For this solution, we checked two options:

  • Argo Workflow
  • Airflow + Databricks

What is Argo Workflow?

Argo Workflow is an open-source container-native workflow engine for orchestrating parallel jobs on Kubernetes. Argo Workflow is implemented as a Kubernetes CRD.

Simply put, you can wrap any code in a Docker image and execute it natively on Kubernetes as a pod, providing input and getting output.

So, we could build a Docker image from the data science team’s R code and run it as part of a directed acyclic graph (DAG) in Argo Workflow. Argo Workflow gives you complete control over how these steps are executed: sequentially, in parallel, or waiting for specific steps to finish. This way, we can wrap code written in any language, since the abstraction here is a Docker image.

The simplest DAG in Argo Workflow will look something like this:
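Here is a minimal sketch, with placeholder workflow, step, and image names rather than our actual pipeline:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: optimization-dag-            # hypothetical workflow name
spec:
  entrypoint: main
  templates:
    - name: main
      dag:
        tasks:
          - name: prepare-data
            template: run-r-step
          - name: run-optimization
            dependencies: [prepare-data]      # runs only after prepare-data finishes
            template: run-r-step
    - name: run-r-step
      container:
        image: example.registry/ds-optimizer:latest   # hypothetical image wrapping the R code
        command: [Rscript, /app/optimize.R]
```

Every step is just a container, which is exactly why someone still has to build, tag, and deploy those images.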

The cons of Argo Workflow

There are several reasons why we didn’t choose Argo Workflow in the end:

  • Massive and hard-to-maintain YAML — Here you see a simple YAML, but in real-life workflows the YAML becomes massive and hard to maintain, making it difficult to understand what is going on inside.
  • Error-prone and long feedback loop — Since it’s YAML, it’s prone to errors, and it takes a lot of time to get a working configuration because you need to run the actual workflow each time and wait for feedback. As you can tell, I’m not a fan of YAML, and this excellent talk explains my point very well.
  • Requires management by dev — Creating Docker images, tagging them, deploying them, monitoring them, and managing them in our Kubernetes namespace still requires the dev team, so there is no real separation from data science. Above all, we wanted something simple and straightforward, like a REST API (as with any interaction with a microservice), and Argo Workflow did not provide this.

So, where can the data science team write their code, deploy it, and expose it without needing to deal with infrastructure and all the complications that come with deployment? Databricks to the rescue!

What is Databricks?

Databricks provides a unified, open platform for all your data. It empowers data scientists, data engineers, and data analysts with a simple, collaborative environment to run interactive and scheduled data analysis workloads.

In simple language: do you want to run Spark jobs, Python, or R code without having to manage Spark clusters or create Docker images? How about just writing the code in a language common in data science and letting someone else handle the infrastructure?

That’s exactly what we were looking for!

With Databricks, the data science team can start writing their code in a notebook, where they can run and test each step during development. Once they are ready, we can trigger this notebook via the Databricks REST API.
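As a rough sketch (the workspace URL, job ID, and parameter names below are made up, and the token would really come from a secret store), the trigger from our service could look like a call to the Jobs API run-now endpoint:

```python
import requests

DATABRICKS_HOST = "https://example-workspace.cloud.databricks.com"  # placeholder workspace URL
DATABRICKS_TOKEN = "..."  # personal access token, ideally fetched from a secret manager


def trigger_optimization_job(job_id: int, merchant_id: str, result_path: str) -> int:
    """Trigger the data science notebook job via the Databricks Jobs API; returns the run id."""
    response = requests.post(
        f"{DATABRICKS_HOST}/api/2.1/jobs/run-now",
        headers={"Authorization": f"Bearer {DATABRICKS_TOKEN}"},
        json={
            "job_id": job_id,
            "notebook_params": {              # parameters the notebook reads as widgets
                "merchant_id": merchant_id,
                "result_path": result_path,   # where the notebook should write its output
            },
        },
        timeout=30,
    )
    response.raise_for_status()
    return response.json()["run_id"]
```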

Current state:

The user submits an optimization request from the UI. It goes to the optimization service (dev ownership), which then simply triggers the Databricks job (data science ownership) via the REST API.

Cool! So, we can trigger the data science side, but how do they trigger us back? How do we get the results? What about retrying in case of failure? Sending alerts for monitoring?

Airflow to the rescue!

What is Airflow?

Airflow enables you to manage your data pipelines by authoring workflows as DAGs of tasks. In simple words, it’s a workflow orchestrator.

In a way, Airflow is similar to Argo Workflow. You can define a DAG, which is basically the set of steps you want to execute, with complete control: sequentially or in parallel. But in contrast to Argo Workflow, you define the DAG not in YAML but in Python, which is much better.

So, we want to define a task that will trigger us back in case of success or failure.

Airflow has many operators built in. Operators are building blocks that you can use to run tasks on external resources, such as REST, CLI, Snowflake, and many more.

For example, we will use “SimpleHttpOperator” to make a REST API call in order to call our service back, like this:
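A minimal sketch of such a callback task (the connection ID, endpoint, and payload are assumptions, not our actual service), defined as part of the DAG shown further below:

```python
from airflow.providers.http.operators.http import SimpleHttpOperator

# Calls our optimization service back once the Databricks run has finished.
# "optimization_service" is a hypothetical Airflow HTTP connection pointing at our backend.
notify_success = SimpleHttpOperator(
    task_id="notify_success",
    http_conn_id="optimization_service",
    endpoint="api/optimizations/{{ dag_run.conf['optimization_id'] }}/complete",
    method="POST",
    headers={"Content-Type": "application/json"},
    data='{"status": "SUCCESS"}',
)
```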

You also have an operator to trigger a Databricks notebook, like this:
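A sketch using the DatabricksSubmitRunOperator from the Databricks provider (the cluster ID, notebook path, and parameters are placeholders):

```python
from airflow.providers.databricks.operators.databricks import DatabricksSubmitRunOperator

# Submits the data science team's notebook as a one-time Databricks run
# and (by default) waits for it to finish, pushing the run_id to XCom.
run_notebook = DatabricksSubmitRunOperator(
    task_id="run_optimization_notebook",
    databricks_conn_id="databricks_default",
    existing_cluster_id="1234-567890-abcdefgh",   # hypothetical cluster
    notebook_task={
        "notebook_path": "/Repos/data-science/optimization/suggest_value",  # hypothetical path
        "base_parameters": {
            "merchant_id": "{{ dag_run.conf['merchant_id'] }}",
            "result_path": "{{ dag_run.conf['result_path'] }}",  # S3 location for the output
        },
    },
)
```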

Operator to get task status:
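The submit operator above already waits for the run to complete, but if you want an explicit status-check step, one way to sketch it is a PythonOperator that polls the run state through the DatabricksHook (assuming the submit task pushed the run_id to XCom, which it does by default):

```python
from airflow.operators.python import PythonOperator
from airflow.providers.databricks.hooks.databricks import DatabricksHook


def check_run_status(ti, **_):
    """Poll Databricks for the state of the run started by the previous task."""
    run_id = ti.xcom_pull(task_ids="run_optimization_notebook", key="run_id")
    state = DatabricksHook(databricks_conn_id="databricks_default").get_run_state(run_id)
    if not state.is_successful:
        raise RuntimeError(f"Databricks run {run_id} did not succeed: {state.state_message}")


check_status = PythonOperator(
    task_id="check_run_status",
    python_callable=check_run_status,
)
```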

Final task definition:
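Putting it all together, one possible DAG definition could look like the sketch below; the DAG ID, connections, endpoints, and cluster ID are placeholders, and the explicit status-check task from above could be chained in between the notebook run and the callbacks:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.databricks.operators.databricks import DatabricksSubmitRunOperator
from airflow.providers.http.operators.http import SimpleHttpOperator
from airflow.utils.trigger_rule import TriggerRule

with DAG(
    dag_id="merchant_optimization",        # hypothetical DAG name
    start_date=datetime(2022, 1, 1),
    schedule_interval=None,                # triggered on demand, not on a schedule
    catchup=False,
) as dag:
    run_notebook = DatabricksSubmitRunOperator(
        task_id="run_optimization_notebook",
        databricks_conn_id="databricks_default",
        existing_cluster_id="1234-567890-abcdefgh",   # hypothetical cluster
        notebook_task={"notebook_path": "/Repos/data-science/optimization/suggest_value"},
    )

    notify_success = SimpleHttpOperator(
        task_id="notify_success",
        http_conn_id="optimization_service",          # hypothetical connection to our backend
        endpoint="api/optimizations/{{ dag_run.conf['optimization_id'] }}/complete",
        method="POST",
        data='{"status": "SUCCESS"}',
    )

    notify_failure = SimpleHttpOperator(
        task_id="notify_failure",
        http_conn_id="optimization_service",
        endpoint="api/optimizations/{{ dag_run.conf['optimization_id'] }}/complete",
        method="POST",
        data='{"status": "FAILURE"}',
        trigger_rule=TriggerRule.ONE_FAILED,          # fires only if an upstream task failed
    )

    run_notebook >> [notify_success, notify_failure]
```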

And finally, we have this design:

In the diagram, you can also see S3. If you need to pass something large between the services, S3 is the easiest way to do it. Send the S3 location where you want the response to be stored to the data science team as input. Then, when they call you back, you can go fetch and parse the response.
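For example, a minimal sketch of how our service might fetch and parse that response with boto3 once the callback arrives (bucket and key are placeholders):

```python
import json

import boto3


def read_optimization_result(bucket: str, key: str) -> dict:
    """Fetch and parse the result the notebook wrote to the agreed S3 location."""
    s3 = boto3.client("s3")
    obj = s3.get_object(Bucket=bucket, Key=key)
    return json.loads(obj["Body"].read())


# The same location is what we passed to the Databricks job as input.
result = read_optimization_result("optimization-results", "merchant-123/run-456.json")
```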

Wrapping up

Using Airflow and Databricks allowed us to define clear boundaries between dev and data science through a REST API, enabling both teams to work independently while providing an end-to-end solution to our problem.

Feel free to contact me with any questions!
