Data Engineering Made Simpler: How to Get Your ETLs to Production Faster

Nick Kobishev
Similarweb Engineering
8 min read · Jul 10, 2022

If you’ve landed on this post, I can safely assume you’re working with Big Data.

And now I’ll guess a few more things. And if I’m right, you’ll read this post top to bottom and perhaps throw in a few claps.

Do we have a deal?

I’ll take that as a “Yes!”

You’re either:
- Diving head first into any available data and researching the heck out of it, trying to find something new, interesting and valuable
Or
- You’re a data engineer working on ETLs. You often need to run those ETLs partially, trying things out in order to assist the person who does the aforementioned research ☝️

Either way, I have no doubt that you know the pain of rewriting your research/POC code from a Jupyter/Databricks notebook into something that resembles the ETLs you run on your production system.

In this post I’ll cover the common pains we continually faced during those rewriting processes, and the techniques we used to overcome them at Similarweb.

The (long) Life Story of an ETL

It’s a new sprint, you have a new task, and as usual you start writing your ETL in a notebook.

Why wouldn’t you? It’s convenient, requires no setup, and gives you quick results and stored state for every bit of code you run.

You get your imports in order, and set up your consts…
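Something like this, say. The paths, and the enrich_with_magic helper from the complex_algorithms package, are illustrative stand-ins rather than the real thing:

```python
from pyspark.sql import functions as F

from complex_algorithms import enrich_with_magic  # hypothetical helper, for illustration

# Hard-coded consts, the way notebooks tend to accumulate them (paths are made up)
INPUT_PATH = "s3://some-bucket/tmp/visits/2022-07-10/"
OUTPUT_PATH = "s3://some-bucket/tmp/magic-results/2022-07-10/"
```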

You read your inputs into Dataframes…
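Continuing the sketch (spark here is the SparkSession the notebook already provides):

```python
# Read the raw input into a DataFrame
visits_df = spark.read.parquet(INPUT_PATH)
```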

Before you proceed, you decide to display the inputs to make sure all is as expected…
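With the notebook’s built-in display helper:

```python
# Eyeball the input before doing anything else
display(visits_df)
```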

Wooh, all looks well.
You wiggle the data to the left, you wiggle the data to the right

Do some crazy complex operations on it…

Because you’re really careful, you display your perfect results and gaze lovingly at those rows 😍
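In the running sketch, the wiggling, the “crazy complex” part and the sanity check might look like this (the filter and the helper are still illustrative):

```python
# Wiggle the data left and right, then hand off to the complex helper
filtered_df = visits_df.filter(F.col("country") == "US")
result_df = enrich_with_magic(filtered_df)

# Gaze lovingly at those rows
display(result_df)
```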

At this point you write the results to your favorite temp path in some bucket somewhere. All that is left now is to rewrite this whole thing to look like a production ETL 😒
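In the sketch, that last bit is just a write to the made-up temp path from the consts:

```python
# Dump the results to a temp path for now
result_df.write.mode("overwrite").parquet(OUTPUT_PATH)
```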

Because of course, the notebook you used to research and develop it can’t be used for production purposes.

Rewriting for production

You obviously know what our production ETLs look like. This isn’t your first rodeo, and you have many ETLs under your belt.

But…

This morning the coffee machine decided to give out.
So you, being oh so under-caffeinated, decide to refresh your memory and take a look at an existing ETL.

It looks like this:
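In bare-bones form, at least; the function name and the ctx key names here are illustrative, since the real convention is internal:

```python
from typing import Any, Dict

from pyspark.sql import SparkSession


def run(spark: SparkSession, ctx: Dict[str, Any]) -> None:
    """Entry point for a production Spark job: a SparkSession plus a ctx dict of parameters."""
    input_path = ctx["input_path"]
    output_path = ctx["output_path"]

    df = spark.read.parquet(input_path)
    # ...transformations...
    df.write.mode("overwrite").parquet(output_path)
```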

It receives a SparkSession, and a Dict ctx to hold all the parameters we want to pass into this Spark Job.

The code in your notebook seems like a simple ETL. So you begin by grabbing all the code from the notebook, and pasting it into a function with the same signature you’ve just seen.

The result looks like this:
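Roughly this, anyway: the notebook sketch dropped into the production signature as-is, hard-coded paths, notebook-only display calls and all (names still illustrative):

```python
from typing import Any, Dict

from pyspark.sql import SparkSession, functions as F

from complex_algorithms import enrich_with_magic  # the notebook import, untouched

# Still the hard-coded paths from the notebook
INPUT_PATH = "s3://some-bucket/tmp/visits/2022-07-10/"
OUTPUT_PATH = "s3://some-bucket/tmp/magic-results/2022-07-10/"


def run(spark: SparkSession, ctx: Dict[str, Any]) -> None:
    visits_df = spark.read.parquet(INPUT_PATH)
    display(visits_df)  # notebook-only helper, not defined outside the notebook

    filtered_df = visits_df.filter(F.col("country") == "US")
    result_df = enrich_with_magic(filtered_df)
    display(result_df)  # same problem

    result_df.write.mode("overwrite").parquet(OUTPUT_PATH)
```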

There are clearly a few modifications you need to make:

  1. This ETL you’re working on is supposed to be a daily one, meaning the paths can’t be consts; you need to make them configurable and get them from ctx.
  2. The display commands are nice during the development phase, but they aren’t needed in production — all you need to do is remove them.

The changes you had to make couldn’t be simpler:
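Something along these lines (the ctx key names are still illustrative):

```python
from typing import Any, Dict

from pyspark.sql import SparkSession, functions as F

from complex_algorithms import enrich_with_magic  # spoiler: this import is about to bite


def run(spark: SparkSession, ctx: Dict[str, Any]) -> None:
    # Paths now come from ctx, so the job can run for any day
    visits_df = spark.read.parquet(ctx["input_path"])

    filtered_df = visits_df.filter(F.col("country") == "US")
    result_df = enrich_with_magic(filtered_df)

    result_df.write.mode("overwrite").parquet(ctx["output_path"])
```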

Now the paths are passed in through the ctx and the display commands are gone.
You push your code, merge your branch, deploy the ETL to the production system and run it.

and…

The lack of caffeine finally got you to make a mistake!

How could you forget that the complex_algorithms package is not on the PYTHONPATH in the production environment, but rather lives in the infamous common.shared.utils directory?

You fix the import and redeploy:
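In terms of the running sketch, that fix is a one-line change to the import (the exact production package layout shown here is illustrative):

```python
# Before (resolves fine in the notebook environment):
# from complex_algorithms import enrich_with_magic

# After (pointing at where the package actually lives in production):
from common.shared.utils.complex_algorithms import enrich_with_magic
```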

Finally! You did it, it’s working!

With another ETL in the books, a simple one, you reflect on the annoyances you tend to face when moving ETLs from dev to production (today’s case included):

  • You have to clear out a lot of code before you can run ETLs in production. Mainly debugging prints and temp variables you used to work with subsets of your data
  • Many consts need to be replaced with ctx usage
  • Commands always end up being pretty messy in the dev notebook. You keep rerunning different commands to work out a complex part of your algorithm, fix a data set or troubleshoot a transformation, ending up with a notebook that is far from having all its commands in chronological order
  • Dependencies can be located in different places in each system

There must be a better way, you think to yourself.
Well, you’re right. There is a better way.

Mirroring Your Production Environment Locally

If you approach your dev process differently, you can drastically reduce the number of changes you have to make. You can also solve the problems you will encounter when moving an ETL from the development phase to production.

One way you can speed up the dev to production transition process is by taking the place of the “scheduling” component in your production environment.

I have no doubt that you remember this diagram from your on-boarding at our company 😉. It shows how we run Spark jobs in our production system.

But just in case you don’t remember it, let’s go over it again.

  1. A custom Airflow operator, DatabricksSubmitOperator, sends all the parameters required to run a Spark job (job name, execution date, input and output paths, etc.) over to our internal library that wraps the Databricks Jobs API (see the sketch after this list).
  2. Using the received parameters, our internal library resolves the path of the egg/wheel file that contains our Spark job code.
  3. The path to the egg/wheel file, along with all other parameters, is submitted to the Databricks Jobs API for execution.
  4. The state of the job is monitored by our library and reported back to the operator.
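To make step 1 a bit more concrete, wiring the operator into a DAG might look something like this. The operator is our own custom one, so the module it’s imported from and its argument names below are illustrative guesses, not its real signature:

```python
from datetime import datetime

from airflow import DAG

# Our custom operator that hands the job parameters to the internal library.
# The import path and the argument names are illustrative.
from our_plugins.operators import DatabricksSubmitOperator

with DAG(
    dag_id="magic_job_daily",
    start_date=datetime(2022, 7, 1),
    schedule_interval="@daily",
) as dag:
    submit_magic_job = DatabricksSubmitOperator(
        task_id="submit_magic_job",
        job_name="my_jobs.magic_job",
        execution_date="{{ ds }}",
        input_path="s3://production-bucket/visits/{{ ds }}/",
        output_path="s3://production-bucket/magic/{{ ds }}/",
    )
```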

The scheduling component in this diagram is the DatabricksSubmitOperator. Using a simple script that you run locally on your machine, you can put yourself in the place of the operator.

This simple script, let’s call it db_submit.py, can do everything the operator does, and more!

It receives all the same parameters and passes them on to the internal library, just like the operator. But in addition, it can also take your local code, package it into an egg/wheel, and upload it to the S3 bucket containing all Spark job files.

No need to push the code to your remote repository, no need to create a merge request, no need to deploy it.

*security concerns were kept in mind, and access is limited to staging resources*

Running a Spark job that only exists on your machine is as simple as this:
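Here is a sketch of what the script’s interface might look like, together with an example invocation. Only the script name, the -l flag, and the dotted job path come from how we actually use it; the rest is an illustrative stand-in for our internal tooling:

```python
# db_submit.py: a rough sketch of the local submission script's interface.
#
# Example run (paths are made up):
#   python db_submit.py my_jobs.magic_job \
#       --input-path s3://staging-bucket/visits/2022-07-10/ \
#       --output-path s3://staging-bucket/magic/2022-07-10/ \
#       -l
import argparse


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(
        description="Submit a Spark job to Databricks without deploying it first"
    )
    parser.add_argument("job", help="Dotted path of the Spark job module, e.g. my_jobs.magic_job")
    parser.add_argument("--input-path", required=True, help="Passed to the job through ctx")
    parser.add_argument("--output-path", required=True, help="Passed to the job through ctx")
    parser.add_argument(
        "-l", "--local", action="store_true",
        help="Package the code on this machine into an egg/wheel, upload it, "
             "and run it instead of the deployed production version",
    )
    return parser.parse_args()


if __name__ == "__main__":
    args = parse_args()
    # From here the real script hands args.job, the ctx parameters, and the
    # (possibly freshly uploaded) egg/wheel path over to the internal library
    # that wraps the Databricks Jobs API.
    print(f"Submitting {args.job} (local={args.local}) "
          f"with input={args.input_path} output={args.output_path}")
```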

The dotted path to the Python module containing the Spark job is my_jobs.magic_job. After that, pass in the other parameters required to run the job, like the input and output paths.

Let’s not forget the -l (a lowercase L) flag; it’s there to let the script know that you want to run the local version of the job and not the one that is in production.

“If only it were this simple at my previous job” you think to yourself.

But especially during the development phase, you might find yourself wanting to troubleshoot a rebellious job that keeps misbehaving when running at scale.

Enter…

Databricks Connect

What if I told you that, using Databricks Connect, a feature provided by Databricks, your local machine can become a driver in a Spark cluster running on Databricks?

This gives you a powerful tool, a debugger. You can actually debug any code that runs on the driver. And if you’re tricky enough, you can even debug samples of code that usually run on executors, like UDFs.

Debugging the code that runs on your driver is as simple as debugging any other Python code you’ve ever debugged. Just place a breakpoint on your troublesome line and run with debugging.

But you’re working with Spark, and know that the heavy lifting is done on the executors. Unfortunately that code can’t really be debugged locally, unless we’re talking about UDFs and you’re being tricky 😉. (this also applies to custom functions used to map or filter RDDs)

You can take your Spark job…
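For the sake of illustration, say it leans on a UDF for its heavy lifting (the names and the logic here are made up):

```python
from typing import Any, Dict

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import DoubleType


def magic_score(visits: int) -> float:
    # The interesting (and occasionally misbehaving) logic lives here
    return visits * 0.42


# Register the function as a UDF so it can run on the executors
magic_score_udf = F.udf(magic_score, DoubleType())


def run(spark: SparkSession, ctx: Dict[str, Any]) -> None:
    df = spark.read.parquet(ctx["input_path"])
    result_df = df.withColumn("magic_score", magic_score_udf(F.col("visits")))
    result_df.write.mode("overwrite").parquet(ctx["output_path"])
```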

and make a slight modification…
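Continuing the same sketch, the modification is just a temporary detour before the real UDF call:

```python
def run(spark: SparkSession, ctx: Dict[str, Any]) -> None:
    df = spark.read.parquet(ctx["input_path"])

    # --- temporary debugging detour ---
    # 1. Take a small sample of the data and collect it to the driver
    sample_rows = df.limit(20).collect()
    # 2. Iterate over the sample in a purely Pythonic manner
    for row in sample_rows:
        # 3. Call the function you normally register as a UDF, directly
        magic_score(row["visits"])  # a breakpoint inside magic_score now works
    # --- end of detour ---

    result_df = df.withColumn("magic_score", magic_score_udf(F.col("visits")))
    result_df.write.mode("overwrite").parquet(ctx["output_path"])
```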

  1. Take a small sample of the data and collect it to the driver
  2. Iterate over the sample in a purely Pythonic manner
  3. Call the function you normally register as a UDF, directly

This way you’ll be able to place breakpoints inside the function you registered as a UDF, and troubleshoot any weirdness that might be going on inside!

What we gained

With all these tools in our toolbox, we get:

Faster POCs

No need to go through the usual bureaucracy to get your code to production to run it at scale

Faster custom runs

Want to run a job with different input paths? Easy! Just change the input paths you pass to db_submit.py

Faster feedback loop

When it’s faster and easier to get running, you’ll know if something goes wrong in no time!

Extra Credit

While the tools are very specific to how we do things at Similarweb, the techniques we used to give our dev environment a feel similar to our production environment are very much transferable.

Our next step is to look into dbx, a new open source offering by Databricks, and see if it can simplify our development process even further. And if it does, I’ll see you in the next post 😉
