Get the latest file from Azure Data Lake in Databricks

Dian Germishuizen
2 min read · May 22, 2022


There are many ways to orchestrate a data flow in the cloud. One option is to have an independent process pull data from source systems and land each new batch in an Azure Data Lake as a single file. The downstream processing layer can then be designed in several ways. The most self-contained approach is to have the processing layer fetch the latest file from the Data Lake itself. That way it does not depend on an upstream tool or service handing it the file path, which improves fault tolerance.

In Databricks, there is no built-in function to get the latest file from a Data Lake directory. Third-party libraries can provide this capability, but it is advisable to stick to standardized libraries and code wherever possible.

Below are two functions that work together to list a directory in an Azure Data Lake and return the full file path of the most recently modified file. That file can then be processed as normal in Databricks.
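A minimal sketch of such a pair of functions, assuming the Databricks `dbutils.fs.ls()` utility, whose entries expose `path`, `name`, `size`, and (on newer Databricks Runtime versions) `modificationTime` in epoch milliseconds. The function names here are illustrative, not from an official API:

```python
from collections import namedtuple

# Mirrors the FileInfo objects returned by dbutils.fs.ls():
# each has .path, .name, .size, and .modificationTime (epoch ms).
FileInfo = namedtuple("FileInfo", ["path", "name", "size", "modificationTime"])


def get_latest_file_info(file_infos):
    """Return the listing entry with the greatest modification time.

    Directory entries (paths ending in '/') are skipped so that
    only actual files are considered.
    """
    files = [f for f in file_infos if not f.path.endswith("/")]
    if not files:
        raise ValueError("No files found in the given directory listing")
    return max(files, key=lambda f: f.modificationTime)


def get_latest_file_path(directory_path, dbutils):
    """List a Data Lake directory and return the newest file's full path.

    `dbutils` is passed in explicitly so the function is easy to test
    outside a notebook; inside Databricks you would pass the global
    `dbutils` object.
    """
    return get_latest_file_info(dbutils.fs.ls(directory_path)).path
```

In a notebook you would call something like `get_latest_file_path("abfss://container@account.dfs.core.windows.net/raw/", dbutils)` (path shown for illustration) and hand the result to `spark.read`.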

Tags

#data, #databricks, #datalake, #etl, #pyspark, #python

Thank you for reading my ramblings, if you want to you can buy me a coffee here: Support Dian Germishuizen on Ko-fi! ❤️

My Socials

My Blog: diangermishuizen.com

Linked In: Dian Germishuizen | LinkedIn

Twitter: Dian Germishuizen (@D_Germishuizen) / Twitter

Credly: Dian Germishuizen — Badges — Credly


I have been working in the Technology Industry as a Data Engineer since 2016. I have a passion for learning new things and sharing that knowledge with others.