How the Data Lakehouse brings OBI to the next level

Andrii Koval
Machine Learning Reply DACH
7 min read · Oct 24, 2022

Amazon MWAA and Databricks integration

With the amount of data that data engineers, analysts, and scientists work with nowadays, more and more companies are migrating to the cloud (AWS, Azure, GCP, etc.) and adopting data lakehouse solutions. A data lakehouse combines the best of data warehouses and data lakes into one simple platform to cover all your data, analytics, and AI use cases.

Currently, the most common and most attractive lakehouse solution for engineers and developers is Databricks. It allows you to prepare and process data, train ML models, and do prototyping using great collaboration features. The Databricks platform is built on top of Apache Spark, a fast and general engine for large-scale data processing, and it provides reliable, best-in-class performance.

Another task that developers face on a daily basis is data processing and transformation. In companies that deal with large amounts of data every day, these processes must be automated. This means that engineers have to create, manage, and maintain data processing workflows to deliver the data to analysts, end users, or any other stakeholders.

One of the best out-of-the-box solutions and the most common workflow engine is Apache Airflow. This platform allows users to programmatically author, schedule, and orchestrate different workflows and data pipelines. The main strength of Airflow is easy workflow deployment: all pipelines are written in Python, which allows fast prototyping and also extends the scope of possible use cases. It can be used to transfer data, build machine learning models, and manage infrastructure.

Introduction: Data Engineering at OBI

OBI is a German multinational home improvement products retail company. It operates 668 stores in Europe, of which 351 are in Germany. In addition to the physical stores, OBI built an app called heyOBI in order to be closer to its customers. Since this application is used by thousands of users, massive amounts of user data have to be processed on a daily basis. To accomplish this task, different data pipelines and workflows have to be developed to transform, prepare, and deliver data and to gather useful insights for improving the product. Another data engineering task at OBI is building a new Data Platform.

Since Databricks is the most suitable tool for the organization's data needs and Airflow is the engine of choice for workflow management, we came up with the idea of combining these two solutions.

In this tutorial, we will take a look at how to deploy Amazon Managed Workflows for Apache Airflow (MWAA) and integrate Databricks into the workflows. The architecture we will build is shown in the following diagram:

Architecture

Amazon MWAA Setup Using Terraform

There are two ways to set up an Amazon MWAA environment:

  • using the AWS console
  • using Terraform (Infrastructure as Code)

If you would like to set up your environment using the AWS console, use the following tutorial instead; today, we will focus on automated deployment using Terraform.

The prerequisite is a VPC network. We will assume that you have already created one; if not, you can do so by following the AWS tutorial or by using a Terraform module. Once the VPC is set up, we can start creating your first Airflow environment.

The first step is to create an S3 bucket to store the DAGs and the Airflow configuration, e.g. the requirements.txt file that installs all the Python packages needed for your development.
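A minimal sketch of what a storage.tf for this bucket could look like (the Terraform and AWS provider configuration it relies on is defined in the next step); the bucket name and local file paths are illustrative assumptions:

# Source bucket for MWAA; the name is illustrative.
resource "aws_s3_bucket" "mwaa" {
  bucket = "my-mwaa-bucket"
}

# MWAA requires versioning and blocked public access on its source bucket.
resource "aws_s3_bucket_versioning" "mwaa" {
  bucket = aws_s3_bucket.mwaa.id
  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_public_access_block" "mwaa" {
  bucket                  = aws_s3_bucket.mwaa.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

# requirements.txt with the extra Python packages for the Airflow workers.
resource "aws_s3_object" "requirements" {
  bucket = aws_s3_bucket.mwaa.id
  key    = "requirements.txt"
  source = "${path.module}/files/requirements.txt"
  etag   = filemd5("${path.module}/files/requirements.txt")
}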

To start setting up our Airflow environment with Terraform, we must first define some variables and the Terraform providers:
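A minimal sketch of the provider and variable definitions; the version constraints, region, and environment name are illustrative assumptions:

# providers.tf -- pin Terraform and the AWS provider (illustrative constraints).
terraform {
  required_version = ">= 1.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 4.0"
    }
  }
}

provider "aws" {
  region = var.region
}

# variables.tf -- inputs used by the resources below (names and defaults are assumptions).
variable "region" {
  description = "AWS region for the MWAA environment"
  type        = string
  default     = "eu-central-1"
}

variable "environment_name" {
  description = "Name of the MWAA environment"
  type        = string
  default     = "mwaa-databricks-demo"
}

variable "vpc_id" {
  description = "ID of the existing VPC"
  type        = string
}

variable "private_subnet_ids" {
  description = "Private subnet IDs (in two availability zones) used by MWAA"
  type        = list(string)
}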

The next thing to do is to install the Apache Airflow Databricks provider so that we can create a Databricks connection in the next section.

The following package has to be added to requirements.txt:

apache-airflow-providers-databricks==3.1.0

In the second step, we will create all the necessary IAM policies and roles. MWAA needs to be permitted to use the other AWS services and resources used by an environment, and you also need to be granted permission to access the Amazon MWAA environment and your Apache Airflow UI:
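A minimal sketch of the MWAA execution role, reusing the bucket and variables defined above; the inline policy is abbreviated and should be extended to match the official MWAA execution role policy for your account:

data "aws_caller_identity" "current" {}

# Role assumed by the MWAA service to access the bucket, logs, and queues.
resource "aws_iam_role" "mwaa_execution" {
  name = "${var.environment_name}-execution-role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect = "Allow"
      Action = "sts:AssumeRole"
      Principal = {
        Service = ["airflow.amazonaws.com", "airflow-env.amazonaws.com"]
      }
    }]
  })
}

resource "aws_iam_role_policy" "mwaa_execution" {
  name = "${var.environment_name}-execution-policy"
  role = aws_iam_role.mwaa_execution.id

  # Abbreviated policy -- extend it per the AWS documentation for MWAA execution roles.
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect   = "Allow"
        Action   = ["s3:GetObject*", "s3:GetBucket*", "s3:List*"]
        Resource = [aws_s3_bucket.mwaa.arn, "${aws_s3_bucket.mwaa.arn}/*"]
      },
      {
        Effect   = "Allow"
        Action   = ["logs:CreateLogStream", "logs:CreateLogGroup", "logs:PutLogEvents", "logs:GetLogEvents", "logs:DescribeLogGroups"]
        Resource = "arn:aws:logs:${var.region}:${data.aws_caller_identity.current.account_id}:log-group:airflow-${var.environment_name}-*"
      },
      {
        Effect   = "Allow"
        Action   = ["cloudwatch:PutMetricData"]
        Resource = "*"
      },
      {
        Effect   = "Allow"
        Action   = ["sqs:ChangeMessageVisibility", "sqs:DeleteMessage", "sqs:GetQueueAttributes", "sqs:GetQueueUrl", "sqs:ReceiveMessage", "sqs:SendMessage"]
        Resource = "arn:aws:sqs:${var.region}:*:airflow-celery-*"
      }
    ]
  })
}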

And the final step is to create the environment itself:
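A minimal sketch of the environment resource, reusing the bucket, role, and variables from the previous steps; the Airflow version, environment class, and security group setup are assumptions:

# Self-referencing security group required by MWAA (all traffic allowed within the group).
resource "aws_security_group" "mwaa" {
  name   = "${var.environment_name}-sg"
  vpc_id = var.vpc_id

  ingress {
    from_port = 0
    to_port   = 0
    protocol  = "-1"
    self      = true
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

resource "aws_mwaa_environment" "this" {
  name              = var.environment_name
  airflow_version   = "2.2.2"
  environment_class = "mw1.small"

  source_bucket_arn     = aws_s3_bucket.mwaa.arn
  dag_s3_path           = "dags/"
  requirements_s3_path  = aws_s3_object.requirements.key
  execution_role_arn    = aws_iam_role.mwaa_execution.arn
  webserver_access_mode = "PUBLIC_ONLY"

  network_configuration {
    security_group_ids = [aws_security_group.mwaa.id]
    subnet_ids         = var.private_subnet_ids
  }

  logging_configuration {
    dag_processing_logs {
      enabled   = true
      log_level = "INFO"
    }
    task_logs {
      enabled   = true
      log_level = "INFO"
    }
    webserver_logs {
      enabled   = true
      log_level = "INFO"
    }
  }
}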

Also, if you have a public DNS zone configured using Route53, you can create a static link to the Airflow UI using the following code sample:
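A minimal sketch, assuming an existing public hosted zone; the zone and record names are illustrative:

data "aws_route53_zone" "public" {
  name = "example.com."
}

# Static, human-friendly alias for the MWAA web server URL.
resource "aws_route53_record" "airflow" {
  zone_id = data.aws_route53_zone.public.zone_id
  name    = "airflow.example.com"
  type    = "CNAME"
  ttl     = 300
  records = [aws_mwaa_environment.this.webserver_url]
}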

Now you can create your first MWAA environment using the Terraform CLI:

terraform init
terraform plan
terraform apply

Environment creation takes about 20–30 minutes. After the environment is created, you will see the following output in the AWS console:

AWS MWAA Console

How to Create a Databricks Airflow Connection

The first step is to create a Databricks service principal using the following tutorial or using Terraform. A service principal is an identity intended for use with automated tools and automated job runs, as in our case.
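For the Terraform route, a minimal sketch using the Databricks Terraform provider could look as follows; the provider authentication and the display name are assumptions:

terraform {
  required_providers {
    databricks = {
      source = "databricks/databricks"
    }
  }
}

# Authenticate against the workspace with an admin personal access token (illustrative).
provider "databricks" {
  host  = "<DATABRICKS_WORKSPACE_URL>"
  token = "<ADMIN_PERSONAL_ACCESS_TOKEN>"
}

# Service principal that MWAA will use to run jobs; the display name is illustrative.
resource "databricks_service_principal" "airflow" {
  display_name = "mwaa-airflow"
}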

After the service principal is created, the next step is to create a Databricks access token for it:

  1. Get your personal access token:
  • Using the CLI: databricks tokens create --lifetime-seconds 129600 --comment "personal access token"
  • Using the workspace UI:
Databricks User Settings

2. Export the environment variables:

export DATABRICKS_HOST="<DATABRICKS_WORKSPACE_URL>"
export DATABRICKS_TOKEN="<YOUR_PERSONAL_ACCESS_TOKEN>"

3. Get the ID of the Databricks service principal:

curl -X GET ${DATABRICKS_HOST}/api/2.0/preview/scim/v2/ServicePrincipals \
  --header "Authorization: Bearer ${DATABRICKS_TOKEN}" \
  | jq .

4. Create the Databricks access token for the service principal:

curl -X POST ${DATABRICKS_HOST}/api/2.0/token-management/on-behalf-of/tokens \
  --header "Content-type: application/json" \
  --header "Authorization: Bearer ${DATABRICKS_TOKEN}" \
  --data @create-service-principal-token.json \
  | jq .
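The create-service-principal-token.json file referenced above holds the request payload; a minimal sketch of its content, assuming the standard fields of the token-management API:

{
  "application_id": "<SERVICE_PRINCIPAL_APPLICATION_ID>",
  "lifetime_seconds": 129600,
  "comment": "Token for the MWAA Databricks connection"
}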

The last step is to configure the connection in the Airflow UI:

  1. Go to the Airflow UI, click on the "Admin" menu at the top, and then choose "Connections" from the dropdown. A list of all your current connections will be displayed.
  2. By default, the databricks_conn_id parameter is set to "databricks_default". Click "edit" and type the URL of your Databricks workspace into the Host field. In the Password field, insert the Databricks token you generated in the previous step.
Airflow Connection Settings

How to Run Airflow DAGs in Databricks

There are two ways to run your DAGs in Databricks:

  • using a Databricks notebook
  • using a Python script

Airflow provides an operator named DatabricksSubmitRunOperator for a seamless Airflow Databricks integration. This operator calls the "Create and trigger a one-time run" (POST /jobs/runs/submit) API to submit the job specification and trigger a run. You can also use the DatabricksRunNowOperator, but it requires an existing Databricks job and uses the "Trigger a new job run" (POST /jobs/run-now) API to trigger a run. In this example, for simplicity, the DatabricksSubmitRunOperator is used.

To create a DAG, you need:

  • a cluster configuration (Databricks runtime version and cluster size)
  • a Python script or notebook specifying the job

All DAGs have to be uploaded to the S3 bucket that was configured as the central storage for MWAA, e.g. s3://<MWAA_STORAGE_BUCKET>/dags/.

Run using Python script
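A minimal DAG sketch for this variant; the cluster specification, script location, and DAG name are illustrative assumptions:

from datetime import datetime

from airflow import DAG
from airflow.providers.databricks.operators.databricks import DatabricksSubmitRunOperator

# Hypothetical job cluster; adjust the Spark version, node type, and size to your workspace.
new_cluster = {
    "spark_version": "10.4.x-scala2.12",
    "node_type_id": "m5d.large",
    "num_workers": 1,
}

with DAG(
    dag_id="databricks_python_script_example",
    start_date=datetime(2022, 10, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    # Submits a one-time run (POST /jobs/runs/submit) that executes a Python script.
    run_python_script = DatabricksSubmitRunOperator(
        task_id="run_python_script",
        databricks_conn_id="databricks_default",
        new_cluster=new_cluster,
        spark_python_task={
            # Hypothetical script location; the job cluster must be able to read it.
            "python_file": "s3://<MWAA_STORAGE_BUCKET>/scripts/process_data.py",
        },
    )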

Run using Databricks Notebook
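And a minimal sketch for the notebook variant; the notebook path and parameters are illustrative assumptions:

from datetime import datetime

from airflow import DAG
from airflow.providers.databricks.operators.databricks import DatabricksSubmitRunOperator

# Hypothetical job cluster; adjust the Spark version, node type, and size to your workspace.
new_cluster = {
    "spark_version": "10.4.x-scala2.12",
    "node_type_id": "m5d.large",
    "num_workers": 1,
}

with DAG(
    dag_id="databricks_notebook_example",
    start_date=datetime(2022, 10, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    # Submits a one-time run that executes a workspace notebook.
    run_notebook = DatabricksSubmitRunOperator(
        task_id="run_notebook",
        databricks_conn_id="databricks_default",
        new_cluster=new_cluster,
        notebook_task={
            # Hypothetical workspace path of the notebook to run.
            "notebook_path": "/Shared/example_notebook",
            "base_parameters": {"env": "dev"},
        },
    )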

After running the DAGs we have created, they will connect to your Databricks account and run a job based on the script you have stored in S3.

Finally, we will place our DAGs in the files/ folder and upload them to the S3 bucket using Terraform. To achieve that, extend the storage.tf file:
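A minimal sketch of such an extension, assuming the DAG files live next to the Terraform code in files/:

# Upload every DAG in the local files/ folder to the dags/ prefix of the MWAA bucket.
resource "aws_s3_object" "dags" {
  for_each = fileset("${path.module}/files", "*.py")

  bucket = aws_s3_bucket.mwaa.id
  key    = "dags/${each.value}"
  source = "${path.module}/files/${each.value}"
  etag   = filemd5("${path.module}/files/${each.value}")
}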

The last step is to trigger our DAGs in the MWAA UI:

Airflow UI

Once triggered, you can see the job cluster on the Databricks cluster UI page:

Databricks Job Clusters

Logging and Metrics

Amazon MWAA uses Amazon CloudWatch for all Airflow logs, which are helpful for troubleshooting DAG failures. MWAA also automatically sends various Airflow metrics to CloudWatch, so you can observe, track, and measure system performance and spot possible issues.

AWS CloudWatch

Conclusion

In this blog post, you learned how to deploy Amazon MWAA using Terraform and how to efficiently set up an Airflow Databricks integration. This solution allows you to use the optimized Spark engine offered by Databricks while orchestrating and automating data transformation and processing. The integration delivers a powerful solution that can be used across different departments such as Data Science, Data Engineering, and Analytics. Using collaboration tools like shared Databricks notebooks, developers can effortlessly work together on various business and technical tasks.

This collaboration between the OBI Data Engineering team and Machine Learning Reply is a great showcase of Data Platform development using state-of-the-art tools and technologies to deliver fast and reliable solutions for data processing, lakehousing, and other data-related tasks.
