Serverless Python in Azure Data Factory

Eugene Niemand
Jan 8, 2020 · 5 min read

In the Advanced Analytics team here at ASOS, we have a few infrequently executed (monthly or weekly) Python processes. They were built by technical users, not engineers, and therefore used to run on a physical machine somewhere in a cupboard. For obvious reasons they had to be moved to a more stable and manageable infrastructure.

We had a requirement to run these Python scripts as part of an ADF (Azure Data Factory) pipeline and to react to the completion of each script. Currently there is no support for running Python natively inside ADF. Our main objective was to achieve this without owning any compute to execute our scripts.

Where to run our Python scripts?

  • Azure Functions: Functions time out after a maximum of 10 minutes (on the Consumption plan). Occasionally our scripts get stuck in a third-party queue and we often have to wait in excess of 12 hours before receiving a response. Also, Python is still in preview on Azure Functions.
  • Azure Batch: You own the compute, and it was overkill for our use case.
  • Azure WebJobs: You don't own the compute, but we struggled to get custom libraries working, and local development and testing can become difficult.
  • Azure Databricks: You don't own the compute because it spins up a cluster on demand, but it was overkill for our use case and creating a cluster was fairly slow.
  • Containers: An abstraction away from the infrastructure, so you can choose whether or not to own the compute. Docker is the de facto standard for containers and is fairly simple to use. You can also set up the environment with all the necessary modules, and the container can (depending on the size of your image) start up really quickly.

We decided to use Docker as it’s a simple solution and works in the same manner both locally and in the cloud. Integrating this into our Azure DevOps Pipelines was fairly effortless.

Where to run our Docker container?

  • Azure Kubernetes Service: You own the compute and it's overkill for our use case
  • Azure Container Instances: You don't own the compute and it's very easy to get started

We settled on ACI (Azure Container Instances) as it’s both simple and relatively inexpensive — you are only billed for the time the container runs. It also supports Managed and User Assigned Identities, which simplify permissions for services like Azure Key Vault and Azure Data Lake.
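
To give a flavour of what that identity support simplifies, here is a minimal sketch (not our production code) of a script inside the container using its Managed Identity to fetch a secret from Key Vault with the azure-identity and azure-keyvault-secrets packages; the vault URL and secret name are placeholders:

```python
# Minimal sketch: read a secret from Key Vault using the ACI's Managed Identity.
# The vault URL and secret name are placeholders, not values from our pipeline.
from azure.identity import ManagedIdentityCredential
from azure.keyvault.secrets import SecretClient

credential = ManagedIdentityCredential()  # uses the identity assigned to the container group
secret_client = SecretClient(
    vault_url="https://my-vault.vault.azure.net",  # placeholder vault URL
    credential=credential,
)

api_key = secret_client.get_secret("third-party-api-key").value  # placeholder secret name
```

No connection strings or keys need to be baked into the image or passed in from ADF.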

How do you run an ACI from ADF?

Create and start the ACI

We created an HTTP-triggered Azure Function that ADF calls with the following parameters:
  • Image Name and Version
  • Command to run on the container
  • Resource group where the ACI is created
  • Name of the ACI Group
  • Additional parameters as a JSON object that can then be parsed in the script

When the function is called, it creates and/or updates the ACI Group and then starts the container inside the group, executing the specified command. As part of the creation we also map environment variables so that the Python script can receive arguments from ADF.
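
As a rough illustration only (our actual function differs, and the names, region and sizes below are placeholders), creating the group and starting the container with recent versions of the azure-mgmt-containerinstance SDK looks something like this:

```python
# Illustrative sketch of creating/updating an ACI group and starting a container.
# All names, the region and the resource sizes are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.containerinstance import ContainerInstanceManagementClient
from azure.mgmt.containerinstance.models import (
    Container, ContainerGroup, EnvironmentVariable,
    ResourceRequests, ResourceRequirements,
)

# In the real function these values come from the HTTP request body posted by ADF
callback_uri = "https://example.com/adf-callback"
parameters_json = '{"run_date": "2020-01-01"}'

client = ContainerInstanceManagementClient(DefaultAzureCredential(), "<subscription-id>")

container = Container(
    name="python-job",
    image="myregistry.azurecr.io/python-job:1.0.0",
    command=["python", "main.py"],
    resources=ResourceRequirements(requests=ResourceRequests(cpu=1.0, memory_in_gb=1.5)),
    # Environment variables carry the ADF arguments and the callBackUri into the script
    environment_variables=[
        EnvironmentVariable(name="CALLBACK_URI", value=callback_uri),
        EnvironmentVariable(name="PARAMETERS", value=parameters_json),
    ],
)

group = ContainerGroup(
    location="westeurope",
    os_type="Linux",
    restart_policy="Never",  # run once and stop, so we only pay while the script runs
    containers=[container],
)

# Create or update the group; this starts the container and runs the command
client.container_groups.begin_create_or_update("my-resource-group", "python-job-group", group)
```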

Call the function and wait

ADF WebHook activity to the rescue! It allows you to make a POST request to a URI (our Azure Function) and wait for a response. There is a little magic behind the scenes: when ADF posts the request it generates a unique URI, named callBackUri, and adds it to the body of the POST request. ADF then waits for a POST request, with or without a body, to this callBackUri before continuing. You can set the activity to time out if you do not receive a callback within the specified timeframe. Lastly, you can read the body that was sent to the callback and react accordingly.

Example of the pipeline

Gotchas

  • When making a request to the callBackUri you have to use POST. If you want to send data back to ADF in the body, it has to be sent with a Content-Type of application/json or you will receive an HTTP 415 (Unsupported Media Type) error: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/415. See the sketch below.
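
For example, here is a sketch of making the callback from inside the Python script, assuming the function mapped the callBackUri onto an environment variable we've called CALLBACK_URI (your name may differ). Using requests.post(..., json=...) sets the application/json Content-Type for you:

```python
# Sketch of completing the ADF WebHook activity from inside the container.
# Assumes the callBackUri was mapped to the CALLBACK_URI environment variable.
import os
import requests

callback_uri = os.environ["CALLBACK_URI"]

# json= serialises the body and sets Content-Type: application/json,
# which avoids the HTTP 415 response described above.
requests.post(callback_uri, json={"status": "Succeeded"})
```

Anything sent in this body can be read from the WebHook activity's output in the pipeline and reacted to accordingly.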

I hope this provides a helpful summary of how to run Python from ADF in a completely serverless fashion, and of how to use the WebHook activity and its callBackUri.

Thanks to Gareth Waterhouse, Hailey Niemand, and Rosie Tredwell.

About the author

Eugene Niemand

Lead Data QA Engineer at ASOS.com - I have a passion for Test Driven Development, Agile Methodologies, Continuous Integration and Delivery using Microsoft Azure.
