In the Advanced Analytics team here at ASOS, we have a few infrequently executed (monthly or weekly) Python processes. They were built by technical users, not engineers, and therefore used to run on a physical machine somewhere in a cupboard. For obvious reasons they had to be moved to a more stable and manageable infrastructure.
We needed to run these Python scripts as part of an ADF (Azure Data Factory) pipeline and react to the completion of each script. Currently ADF has no native support for running Python. Our main objective was to achieve this without owning any compute to execute our scripts.
Where to run our Python scripts?
Below are the options we evaluated for a simple use case: using a third-party Python library to request a dataset from a vendor API and storing the retrieved data in Azure Data Lake.
- Azure Functions: Time out after a maximum of 10 minutes. Occasionally our scripts became stuck in a third-party queue, and in those cases we often had to wait in excess of 12 hours before receiving a response. Also, Python is still in preview on Azure Functions.
- Azure Batch: You own compute, and it was overkill for our use case.
- Azure WebJobs: You don't own compute, but we struggled to get custom libraries working, and local development and testing can become difficult.
- Azure Databricks: You don't own compute because it spins up a cluster on demand, but it was overkill for our use case and it was fairly slow to create a cluster.
- Containers: An abstraction away from the infrastructure, so you can choose whether to own the compute or not. Docker is the de facto standard for containers and is fairly simple to use. You can also set up the environment with all the necessary modules, and a container can (depending on the size of your container image) start up really quickly.
We decided to use Docker as it’s a simple solution and works in the same manner both locally and in the cloud. Integrating this into our Azure DevOps Pipelines was fairly effortless.
Where to run our Docker container?
- Virtual Machine: You own compute
- Azure Kubernetes Service: You own compute and it’s overkill for our use case
- Azure Container Instances: You don't own compute and it’s very easy to get started
We settled on ACI (Azure Container Instances) as it’s both simple and relatively inexpensive: you are only billed for the time the container runs. It also supports both system-assigned and user-assigned managed identities, which simplifies permissions for things like Azure Key Vault and Azure Data Lake.
How do you run an ACI from ADF?
We wanted a generic solution to (a) create an ACI from any Docker image from within a pipeline and (b) pass arguments into the Python scripts. Once the container is running, we need to wait for it to terminate and then continue pipeline execution based on the outcome.
Create and start the ACI
We created a PowerShell Azure Function to call the Azure Container Instances REST API with the following information:
- Image Name and Version
- Command to run on the container
- Resource group where the ACI is created
- Name of the ACI Group
- Additional parameters as a JSON object that can then be parsed in the script
When the API is called, it will create and/or update the ACI Group and then start the container inside the group, executing the command specified. As part of the creation we also map environment variables so that the Python script can receive arguments from ADF.
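To make the shape of that request more concrete, here is a minimal Python sketch. It is only an illustration: our function was written in PowerShell and this is not that code. The field names follow the container group resource schema as we understand it, so check them against the current API reference. The ADF_PARAMETERS variable name, the default container name, the CPU and memory sizes and the "Never" restart policy are all assumptions made for the example, and the API version is left for the caller to supply.

```python
import json


def container_group_url(subscription_id, resource_group, group_name, api_version):
    """URL to PUT the container group to (creates it, or updates it if it exists)."""
    return (
        "https://management.azure.com"
        f"/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}"
        "/providers/Microsoft.ContainerInstance"
        f"/containerGroups/{group_name}"
        f"?api-version={api_version}"
    )


def build_container_group_body(image, command, location, parameters,
                               container_name="python-job",
                               cpu=1.0, memory_gb=1.5):
    """Build the JSON body describing a single-container group.

    `parameters` is the JSON object received from ADF. It is serialised
    into one environment variable (hypothetical name: ADF_PARAMETERS)
    so the script inside the container can parse it.
    """
    return {
        "location": location,
        "properties": {
            "osType": "Linux",
            # Assumption: the script should run once and not be restarted.
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": container_name,
                    "properties": {
                        "image": image,        # image name and version
                        "command": command,    # e.g. ["python", "main.py"]
                        "environmentVariables": [
                            {"name": "ADF_PARAMETERS",
                             "value": json.dumps(parameters)},
                        ],
                        "resources": {
                            "requests": {"cpu": cpu, "memoryInGB": memory_gb},
                        },
                    },
                }
            ],
        },
    }


def read_adf_parameters(environ):
    """Inside the container: decode the parameters that came from ADF."""
    return json.loads(environ.get("ADF_PARAMETERS", "{}"))
```

In the Python script itself, calling read_adf_parameters(os.environ) would then return the arguments as a dictionary. Packing everything into a single JSON-valued variable keeps the function generic, because it does not need to know which arguments a particular script expects.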
Call the function and wait
We needed a way to call the Azure Function and wait for a response from the process in the container to notify us of completion and success status. Initially we thought we would need to poll something to determine if the process was complete but we didn't like that idea.
ADF WebHook activity to the rescue! This activity sends a POST request to a URI (in our case the Azure Function) and then waits for a response. There is a little magic behind the scenes: when ADF posts the request, it generates a unique URI, named callBackUri, and adds it to the body of the POST request. ADF then waits for a POST request, with or without a body, to this callBackUri before it continues. You can set the activity to time out if no callback is received within a specified timeframe. Lastly, you can read the body that was sent to the callback and react accordingly.
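On the receiving side, the endpoint has to pull the callBackUri out of the incoming body and keep it until the work is done. The following Python sketch shows one way that could look. It is an illustration rather than our PowerShell function, and split_webhook_body is a hypothetical helper name.

```python
import json


def split_webhook_body(raw_body):
    """Separate the callBackUri added by ADF from the caller's own fields.

    Returns a (callback_uri, parameters) tuple and raises ValueError if
    the body is not a JSON object or carries no callBackUri.
    """
    payload = json.loads(raw_body)
    if not isinstance(payload, dict):
        raise ValueError("expected the request body to be a JSON object")
    callback_uri = payload.pop("callBackUri", None)
    if not callback_uri:
        raise ValueError("callBackUri is missing from the request body")
    # Whatever remains is what the pipeline author put in the Body.
    return callback_uri, payload
```

The callBackUri then needs to travel with the job, for example as one more environment variable on the container, so that whatever finishes the work can post back to it.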
- When using the WebHook activity, the Body has to be passed as a JSON object, whereas a Web activity accepts either a string or a JSON object. This can cause a lot of confusion. If you provide a string when using a WebHook activity, your endpoint is never called, and the activity neither errors nor times out. This is a bug that we encountered; with the help of two Microsoft engineers we have determined the cause and a solution while they work on allowing a string Body to be passed.
- When making a request to the callBackUri you have to use POST. If you want to send data to ADF in the Body, it has to be sent with a Content-Type of application/json or you will receive an HTTP 415 (Unsupported Media Type) error: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/415
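As a small illustration of the second point, this is roughly how a Python script could report back to ADF using only the standard library. The succeeded and message fields are made-up names for this example; ADF simply hands whatever JSON body it receives back to the pipeline.

```python
import json
import urllib.request


def build_callback_request(callback_uri, succeeded, message=""):
    """Prepare the POST that tells ADF the container has finished."""
    body = json.dumps({"succeeded": succeeded, "message": message})
    return urllib.request.Request(
        callback_uri,
        data=body.encode("utf-8"),
        method="POST",  # the callBackUri only accepts POST
        # Without this header ADF answers with a 415 error.
        headers={"Content-Type": "application/json"},
    )


def notify_adf(callback_uri, succeeded, message="", timeout=30):
    """Send the callback and return the HTTP status code of the response."""
    request = build_callback_request(callback_uri, succeeded, message)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Calling notify_adf at the very end of the script, inside a try/finally block that also covers the failure path, means the pipeline hears back even when the script raises an exception instead of waiting for the timeout.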
I hope this provides a helpful summary on both how to run Python from ADF in a completely serverless fashion and how to use the WebHook activity and its callBackUri.
About the author
Eugene is a Senior Data Engineer at ASOS with a passion for Test Driven Development, Agile Methodologies, Continuous Integration and Delivery using Microsoft Azure. In his spare time he tinkers with Home Automation and all sorts of gadgets.