ETL with Azure Functions, Python and Pandas | QueryClick

Sam Morris
QueryClick Tech Blog
9 min read · Dec 21, 2020

Data collection at QueryClick is growing. As we build out our platform and move closer to completing its next significant version, we are looking to the future and revisiting the way we do things.

Our goal of providing insight to our users in the shortest possible time frame is a driving force behind us looking at different ways we can manage our data.

Excluding our clickstream data, a lot of our data is “small” and arrives in a semi-structured format via APIs. Dealing with this data doesn’t require much compute, which means we have a couple of options we can investigate. One of those options is Azure Functions.

Event Driven ETL at QueryClick

Our data collection services are message-driven. That is about as far as we go in terms of event/message-driven ETL at the moment. An example of a collection can be found below.

  1. We have a micro-service that generates messages; this service is triggered on a cron job.
  2. We have a second service, whose role is to collect said data; it listens on the queue related to that collection.
  3. Once a message is received, the service goes away and does its task. Most of our data collections relate to a company, so the message will normally contain a unique ID for the client along with any other data needed (a sketch of such a message follows this list).
  4. If the collection is successful, the data is loaded into blob storage to be processed downstream.
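As a rough illustration, here is a minimal sketch of what one of those collection messages and its handler might look like. The field names (client_id, source, date_range) and the collect_for_client helper are hypothetical stand-ins, not our actual schema or code.

import json
import logging

def collect_for_client(client_id: str, source: str, date_range: str) -> None:
    # Stand-in for the real collection logic that calls the third-party API.
    logging.info("Collecting %s data for client %s over %s", source, client_id, date_range)

def handle_message(raw_message: str) -> None:
    # Parse a collection message from the queue and kick off the relevant API pull.
    message = json.loads(raw_message)
    collect_for_client(message["client_id"], message["source"], message["date_range"])

# Hypothetical message payload placed on the queue by the cron-triggered service.
handle_message(json.dumps({"client_id": "1234", "source": "example_api", "date_range": "last_7_days"}))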

Where do we source our data

We don’t do any scraping here, and excluding clickstream / behavioural data, our data comes from APIs. From an ETL point of view this is lovely, as the semi-structured format plays nicely with Python, and especially with Pandas, which we utilise heavily for our ETL.

“Basic ETL”

Depending on which part of our product offering you’re looking at, our ETL pipelines aren’t complex: lots of enrichment and getting a mixture of data sources to play nicely together.

It’s our simpler workflows where we utilise Azure Functions for ETL, mainly because of the size of the data we are dealing with (normally under 50MB) and the short execution time of the process.

In a typical workflow a couple of functions do their thing before we get to a point where we are happy with the data. Utilising Azure Functions allows us to get into the meaty bits of the work almost instantly, mainly thanks to their pre-made triggers and the auto-generated projects you get from VSCode; more on that to come.

Where do we land the data

We are planning our architecture around the Databricks Lakehouse model. We have opted for this as we are a small team and don’t have a lot of experience around data lakes. We are also already utilising the Delta format as part of our workflows so it felt like a natural progression. Databricks have done a great job at providing a wealth of documentation around the Lakehouse method which gives us solid guide rails for development. This also helps with team discussions as it allows us to reference official documentation.

Delta format for our small data

For our small data workflows, following the Lakehouse model can be difficult, as we can’t write to Delta format directly from Python. Yes, we can output to Parquet, but having a single format for our Gold area is something we are aiming for. To get around this we can utilise Azure Data Factory (ADF) to help glue the two worlds together.

ADF jobs can be triggered on an event basis too, and converting a CSV file to Delta format is simple enough.
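For context, the conversion itself is only a couple of lines once you are inside a Spark environment such as Databricks. The paths below are placeholders; this is just a sketch of what the CSV-to-Delta step amounts to, however it ends up being triggered.

from pyspark.sql import SparkSession

# Assumes a Spark environment with the Delta Lake libraries available (e.g. Databricks).
spark = SparkSession.builder.getOrCreate()

# Placeholder paths: read the CSV landed by the function and write it out as Delta.
df = spark.read.csv("/mnt/enriched/example.csv", header=True, inferSchema=True)
df.write.format("delta").mode("append").save("/mnt/gold/example_table")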

Out with the old and in with the new

This post focuses on the enablement that using Functions brings us, so I would be remiss if I didn’t explain our current setup a little bit.

The previous workflows were not difficult to set up, but the process is time-consuming and these steps are a barrier to entry. I don’t come from a software background, and having to learn everything outlined below took a bit of time, time which could have been spent elsewhere.

  1. Create a Docker image.
  2. Push it to ACR.
  3. Create the image in ACI.
  4. Set up the container to run on Logic Apps.

Azure Functions Intro

What are Azure Functions

Microsoft put it pretty elegantly, so I’m not going to try to reinvent the wheel here.

Azure Functions is a serverless solution that allows you to write less code, maintain less infrastructure, and save on costs. Instead of worrying about deploying and maintaining servers, the cloud infrastructure provides all the up-to-date servers needed to keep your applications running. https://docs.microsoft.com/en-us/azure/azure-functions/functions-overview

Why I love them

I write less code

The point about writing less code is very true: no more IO code. The emphasis really shifts towards writing business logic and transformations.

The auto-generated templates that come with starting a Functions project in VSCode handle your IO-based code for you.

We have a small number of Python users outside of our R&D team, and I honestly don’t believe it would be difficult to port over workflows written by them, productionise them into a Functions app and let them run.

Input Binding

Below is an example of the bindings that come with Functions. In this example they act as your access point to the event and the configuration of where the data will then be stored. This saved us from having to write our own CRUD modules.

{
  "scriptFile": "__init__.py",
  "bindings": [
    {
      "name": "inblob",
      "type": "blobTrigger",
      "direction": "in",
      "path": "landingzone/{name}.json",
      "connection": "connectionDetails"
    },
    {
      "name": "outblob",
      "type": "blob",
      "direction": "out",
      "path": "enriched/{name}.csv",
      "connection": "connectionDetails"
    }
  ]
}
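To make the link explicit: the name values in that JSON (inblob and outblob) must match the parameter names of the Python function in __init__.py, which is what gives you the IO for free. A minimal signature looks something like this; the body here is just a placeholder.

import azure.functions as func

# The parameter names must match the "name" fields in function.json.
def main(inblob: func.InputStream, outblob: func.Out[bytes]) -> None:
    # Placeholder body: read the triggering blob and write it straight back out.
    outblob.set(inblob.read())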

A couple of other things I’m enjoying about them:

  1. We can add them into our ADF pipelines.
  2. As we are looking to implement the Lakehouse model for our platform, the end zone for our data will be Delta format. Utilising Data Factory, we could potentially create a nice DAG-style workflow using Functions and the tasks that come with ADF.
  3. Don’t need to remember Docker commands. Bliss.
  4. They have decent templates for GitHub actions. We are starting to move more of our CI/CD workflows over to GitHub Actions and Azure provides a decent template for you to copy and paste into your project.
  5. You can store multiple functions in one Functions App. This works really nicely for our workflow, as one GitHub repo can contain the full lifecycle of the data, from the ingest of the data from the API to enrichment and on to its final destination.

What I dislike about them

So far there isn’t really that much I dislike about them, but adopting serverless / event-driven workflows does come with some new learning.

Dealing with changes

Requirements change, and when they do, we will need to decide where the changes will live. Do they become their own new function? Do we change the output of the current function to fit the new criteria? These are things we still need to work through.

Write defensive functions

As we will be using functions to load into our Delta Lake, we don’t necessarily need to worry about the database example outlined in the Azure post below, but it’s worth pointing out.
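As a rough sketch of what defensive means in practice for our kind of function, assume the incoming blob may not always be valid JSON. The hypothetical helper below logs and skips bad input rather than raising, so a poison blob doesn’t keep the function retrying.

import json
import logging
from typing import Optional

def parse_payload(raw: bytes) -> Optional[dict]:
    # Defensively parse an incoming blob, returning None on bad input.
    try:
        return json.loads(raw)
    except (json.JSONDecodeError, UnicodeDecodeError):
        logging.exception("Received a blob that is not valid JSON; skipping it.")
        return None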

We will still need to think about scaling

Sticking with the consumption plan isn’t going to last us forever; we will at some point need to revisit this. Again, Azure provides documentation to help you understand how this can be done.

Creating Your First Project

Azure has a great tutorial for getting started with Azure Functions and Python, which walks you through creating an HTTP-triggered function. I recommend you go through this first, as it covers downloading the Azure CLI and all of that.
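For reference, the body of an HTTP-triggered function from that tutorial ends up looking something like the sketch below; the greeting logic is just a placeholder.

import azure.functions as func

def main(req: func.HttpRequest) -> func.HttpResponse:
    # Read an optional "name" query string parameter and respond with a greeting.
    name = req.params.get("name", "world")
    return func.HttpResponse(f"Hello, {name}!", status_code=200)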

Setting up your Storage Accounts

This isn’t really needed for this walkthrough, as you can just use the storage account that is created for your Function App. If you fancy making a new one for the purpose of storing the test and transformed data, then fire away. If you’re new to Azure, the link below will walk you through the process.

Pulling ETL example

If you’re all ready to go then feel free to pull down the following repo. It’s mainly files generated by the VSCode extension; a couple of things to highlight are the function bindings and the __init__.py file. There is example data for you to use in the following folder.

Example Binding

Depending on whether you’re planning on running this locally using the debugger in VSCode, you may need to provide connection details to your function so it can access the storage account you’ve made.

{
  "scriptFile": "__init__.py",
  "bindings": [
    {
      "name": "inblob",
      "type": "blobTrigger",
      "direction": "in",
      "path": "landingzone/{name}.json",
      "connection": "yourconnectiondetails"
    },
    {
      "name": "outblob",
      "type": "blob",
      "direction": "out",
      "path": "enriched/{name}.csv",
      "connection": "yourconnectiondetails"
    }
  ]
}

Example code

This is covered in the GitHub repository, but in case you’re just flicking through, here’s an example of a simple function. The example function in the repository itself is a bit more complicated, but you get the gist.

As you can see below, the amount of code needed to load in the newly landed JSON file, flatten it into a columnar format and save it to CSV is tiny.

import json
import logging

import azure.functions as func
import pandas as pd


def main(inblob: func.InputStream, outblob: func.Out[bytes]) -> None:
    # Read in the newly landed JSON file from the blob trigger.
    json_data = json.loads(inblob.read())

    # Let's pretend the data we are working with is a list filled with dicts.
    data_to_load_into_storage = pd.DataFrame(json_data)

    logging.info("Saving data back to Azure.")

    # Serialise the DataFrame to UTF-8 encoded CSV and write it to the output blob.
    output = data_to_load_into_storage.to_csv(encoding="utf-8")
    outblob.set(output)
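One small extra worth knowing about: if the incoming JSON is nested rather than a flat list of dicts, pandas can still flatten it. A minimal sketch, assuming a nested payload and pandas 1.0+ for pd.json_normalize:

import pandas as pd

nested = [
    {"client": {"id": "1234", "name": "Example Co"}, "metrics": {"clicks": 42}},
    {"client": {"id": "5678", "name": "Another Co"}, "metrics": {"clicks": 7}},
]

# json_normalize flattens nested dicts into dotted column names,
# e.g. client.id, client.name, metrics.clicks.
flat = pd.json_normalize(nested)
print(flat.columns.tolist())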

Connection String / Environment Variables

You can manage your environment variables and connection strings within the portal; more information can be found in the article below.
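When the function runs, those app settings surface as plain environment variables (locally they come from the Values section of local.settings.json), so reading one from Python is trivial. The setting name below matches the connectionDetails value used in the earlier binding example.

import os

# App settings in Azure, or local.settings.json "Values" when running locally,
# are exposed to the function as environment variables.
storage_connection_string = os.environ["connectionDetails"]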

Summary

At the core of it, using Functions means we can spend more time where we can actually provide value: building pipelines that will be used to surface data via our platform, be ingested into our models, or be used in conjunction with other data to feed individual bits of analysis.

Yeah, we may get hit by different problems in the future, like when we have thousands of micro-services doing all kinds of tasks, or when we inevitably hit a workflow that lasts longer than the standard timeout of the consumption plan we currently use.

For now the positives outweigh the negatives for us in this space.

Next steps

  1. Looking into using our privately hosted libraries as part of a function job
  2. Work out when we need to move away from the standard blob trigger. If it’s anything like the standard limit that comes with event-driven ADF jobs, then we should be okay for a while.
  3. Working out where else we can use them. I’m tempted to build a REST endpoint using them; we may need to move off the consumption plan to do this so we can utilise having a warm instance.

Useful training

Azure have created a learning module for serverless that covers a lot of what I’ve covered today and so much more.

Best Practices:
