Make The Most Of Your Azure Data Factory Pipelines

Data Engineering Best Practices for ADF

--

by Shekhar Parnerkar

Azure Data Factory (ADF) is one of the most powerful tools for building cloud data pipelines today. As with everything else, you need a well-thought-out approach to get the most from it. While approaches vary from project to project, some patterns remain consistent. This blog post provides some insights into best practices and potential pitfalls to watch out for when planning to use ADF.

Data Engineering Best Practices

Best practices are generally common across platforms. I will cover some of the practices we implemented in our recent project using ADF:

  • Secure your secrets with Azure Key Vault: It is super easy in an Azure environment. There is absolutely no reason why your usernames, passwords, connection strings, secret keys, and tokens should be lying around in source or config files (see the first sketch after this list).
  • Stick to Azure DevOps guidelines for ADF pipelines: This will make deploying your pipelines to different environments (like QA and PROD) very easy. It literally boils down to the press of a button.
  • Ensure your database objects are also released through your DevOps pipeline: Database objects are generally not part of the CI/CD pipeline, which creates a dichotomy between code and database and causes many problems when moving code from one environment to another. Versioning and deployment of database objects do not require any fancy tools, just a bit of discipline (a minimal migration-runner sketch follows this list). It is working for us with ADF and Snowflake!
  • Build smaller reusable pipeline components: Move away from large, monolithic end-to-end ETL pipelines. ADF has an easy, visual way to orchestrate multiple jobs, which is an opportunity to create smaller, reusable pipeline components. This can be achieved by using metadata to drive the pipeline parameters rather than run-time variables or values hard-coded in the pipeline (see the metadata-driven sketch below). If you have a five-stage ELT process, you can break it down by stage. Typically, stage 1 will change with the source and stage 5 with the data model, but stages 2, 3, and 4 generally remain the same, making them prime candidates for reusable code.
  • Two-way traceability: With data governance and compliance coming into the picture, full traceability between the acquisition and consumption of data must be maintained. We copied the pipeline’s Job Run Identifier and other metadata into the raw data and landing tables, and tagged the source files in Azure Blob Storage with the same (see the tagging sketch below).
  • Naming convention: Adhering to a consistent naming convention really helps with event logging, job monitoring, and CI/CD.
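
As a minimal sketch of the Key Vault point above, the snippet below pulls a connection string at runtime with the Azure SDK for Python; the vault URL and secret name are hypothetical placeholders. (Inside ADF itself, linked services can reference Key Vault secrets directly, which is usually the cleaner option.)

```python
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

# Authenticate with whatever identity is available
# (managed identity, environment variables, Azure CLI login, ...).
credential = DefaultAzureCredential()

# Hypothetical vault URL and secret name -- replace with your own.
client = SecretClient(
    vault_url="https://my-data-vault.vault.azure.net",
    credential=credential,
)

# The connection string never lives in source or config files.
sql_connection_string = client.get_secret("sql-connection-string").value
```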
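
Versioning database objects really is mostly discipline: keep numbered SQL scripts in source control and have the release pipeline apply them in order. The sketch below assumes a hypothetical scripts/ folder and a SCHEMA_HISTORY tracking table, and uses the Snowflake Python connector only because that is our target; the same pattern works for any database.

```python
import pathlib

import snowflake.connector


def apply_migrations(conn, scripts_dir: str = "scripts") -> None:
    """Apply versioned scripts (e.g. V001__create_stage.sql) in order,
    skipping versions already recorded in SCHEMA_HISTORY."""
    cur = conn.cursor()
    cur.execute(
        "CREATE TABLE IF NOT EXISTS SCHEMA_HISTORY "
        "(version STRING, applied_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP)"
    )
    cur.execute("SELECT version FROM SCHEMA_HISTORY")
    applied = {row[0] for row in cur.fetchall()}

    for script in sorted(pathlib.Path(scripts_dir).glob("V*.sql")):
        version = script.stem.split("__")[0]  # e.g. "V001"
        if version in applied:
            continue
        cur.execute(script.read_text())  # one statement per script
        cur.execute(
            "INSERT INTO SCHEMA_HISTORY (version) VALUES (%s)", (version,)
        )
        print(f"Applied {script.name}")


# conn = snowflake.connector.connect(account="...", user="...", password="...")
# apply_migrations(conn)
```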
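
To illustrate the metadata-driven approach, the sketch below feeds per-source parameters from a metadata record into one generic ADF pipeline via the azure-mgmt-datafactory SDK. The subscription, resource group, factory, pipeline, and metadata values are all hypothetical.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

# One generic, reusable pipeline; its behaviour is driven entirely by parameters.
adf = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# In practice this record comes from a metadata table, not hard-coded values.
source_metadata = {
    "source_system": "sales_db",
    "source_table": "orders",
    "landing_container": "raw/sales",
    "load_type": "incremental",
}

run = adf.pipelines.create_run(
    resource_group_name="rg-data-platform",   # hypothetical names
    factory_name="adf-data-platform",
    pipeline_name="pl_generic_ingest",
    parameters=source_metadata,
)
print(f"Started pipeline run {run.run_id}")
```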
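
And for two-way traceability, the snippet below stamps a source file in Blob Storage with the pipeline's run ID so that the same identifier appears on the file, in the raw data, and in the landing tables. The container, blob, and run ID values are placeholders.

```python
from azure.storage.blob import BlobServiceClient

# The connection string would itself come from Key Vault, as discussed above.
service = BlobServiceClient.from_connection_string("<storage-connection-string>")

blob = service.get_blob_client(
    container="landing", blob="sales/orders_2020_06_01.csv"
)

# Stamp the source file with the ADF Job Run Identifier and load metadata.
blob.set_blob_metadata(metadata={
    "adf_run_id": "0a1b2c3d-hypothetical-run-id",
    "pipeline_name": "pl_generic_ingest",
    "load_date": "2020-06-01",
})
```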

Other Considerations

ADF's data flows run on managed Spark clusters behind the scenes, and the integration is robust. You are unlikely to face any major issues unless your volumes or velocity are exceptional. However, ADF does have some limitations that must be accounted for in your design:

  • Connectors: ADF has a wide range of connectors available, but not everything you’ll need is covered out of the box. Most Microsoft products are very well supported, but anything else deserves a closer look. It is best to do a POC for all such sources to ensure you have the functionality and scalability you need for the project. We also found some limitations when using Excel sources and had to write our own connectors.
  • Function App (Serverless Functions): Where you do not have a connector (or need functionality not supported by ADF), you will need to write your own code in Python or Scala and call it through an Azure Function. Azure Functions called from ADF have roughly a four-minute execution window, after which they are killed. So, if you try to move a large volume of data in a single call, it will fail. That means you need to split your data or process into smaller batches and call the function repeatedly, and this heavy lifting needs to be done in your ADF code (a minimal batching sketch follows this list). A Function App also introduces another moving part in the pipeline that needs to be managed through the SDLC and integrated with your DevOps pipeline. The execution window is expected to be increased to 8 or 10 minutes in the near future, which will alleviate this situation.
  • Event Logging and Monitoring: The job monitoring functionality and UI provided by ADF are still evolving. They may work for most small to medium projects; for others, it may be worthwhile to build your own monitoring that is aligned with your pipeline architecture (a monitoring sketch follows below).
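
As a rough sketch of the batching pattern above, the HTTP-triggered function below processes only the slice described by the offset and batch_size it receives, so each call finishes well inside the execution window; an ADF ForEach (or Until) activity supplies the batch boundaries and calls it once per batch. The process_slice helper is a hypothetical stand-in for your actual copy or transform logic.

```python
import json

import azure.functions as func


def process_slice(offset: int, batch_size: int) -> int:
    """Hypothetical stand-in: move rows [offset, offset + batch_size)
    and return the number of rows processed."""
    # ... your copy/transform logic goes here ...
    return batch_size


def main(req: func.HttpRequest) -> func.HttpResponse:
    body = req.get_json()
    offset = int(body.get("offset", 0))
    batch_size = int(body.get("batch_size", 10_000))

    # Do a bounded amount of work so each invocation stays well under the
    # timeout; ADF loops over the batches and calls this function repeatedly.
    rows = process_slice(offset, batch_size)

    return func.HttpResponse(
        json.dumps({"offset": offset, "rows_processed": rows}),
        mimetype="application/json",
    )
```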
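
If you do build your own monitoring, the ADF SDK makes it straightforward to pull run history into your own logging tables or dashboards. A minimal sketch, assuming the same hypothetical factory as in the earlier examples:

```python
from datetime import datetime, timedelta

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import RunFilterParameters

adf = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Pull the last 24 hours of pipeline runs for custom logging/alerting.
filters = RunFilterParameters(
    last_updated_after=datetime.utcnow() - timedelta(days=1),
    last_updated_before=datetime.utcnow(),
)
runs = adf.pipeline_runs.query_by_factory(
    "rg-data-platform", "adf-data-platform", filters
)

for run in runs.value:
    print(run.pipeline_name, run.run_id, run.status, run.duration_in_ms)
```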

Closing Thoughts

ADF has proven to be a robust framework for most of our data pipelining needs. It is easy to use and equally easy to deploy, which has increased developer productivity.

Hopefully, what I’ve shared from my experience gives you some insight into best practices and potential pitfalls to watch out for when planning to use ADF.

Be sure to check out Hashmap’s Azure focus page at hashmapinc.com, and reach out if you’d like additional assistance in this area. Hashmap offers a range of enablement workshops and consulting service packages and would be glad to work through your specifics in this area.

Hashmap’s Data Integration Workshop is an interactive, two-hour experience for you and your team where we will provide a high-value, vendor-neutral sounding board to help you accelerate your data integration decision-making and selection process. Based on our experience, we’ll talk through best-fit options for both on-premises and cloud-based data sources and approaches to address a wide range of requirements. Sign up today for our complimentary workshop.
