Get the latest file from Azure Data Lake in Databricks

Dian Germishuizen
2 min read · May 22, 2022


There are many ways to orchestrate a data flow in the cloud. One option is to have an independent process pull data from source systems and land each new batch in an Azure Data Lake as a single file. The downstream processing layer can then be designed in several ways. The most self-contained approach is to have the processing layer fetch the latest file from the Data Lake itself. That way it does not depend on an upstream tool or service handing it the file path, which improves fault tolerance.

In Databricks, there is no built-in function to get the latest file from a Data Lake directory. Third-party libraries can provide this capability, but it is advisable to stick to standardized libraries and code wherever possible.

Below are two functions that work together to list a directory in an Azure Data Lake and return the full file path of the most recently modified file. That file can then be processed as normal in Databricks.
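A minimal sketch of such a pair of functions, assuming the Databricks `dbutils.fs.ls()` utility, whose entries expose `path`, `name`, `size`, and (on newer Databricks Runtime versions) `modificationTime` in epoch milliseconds. The function names here are illustrative, not from an official API:

```python
from collections import namedtuple

# Mirrors the FileInfo objects returned by dbutils.fs.ls():
# each has .path, .name, .size, and .modificationTime (epoch ms).
FileInfo = namedtuple("FileInfo", ["path", "name", "size", "modificationTime"])


def get_latest_file_info(file_infos):
    """Return the listing entry with the greatest modification time.

    Directory entries (paths ending in '/') are skipped so that
    only actual files are considered.
    """
    files = [f for f in file_infos if not f.path.endswith("/")]
    if not files:
        raise ValueError("No files found in the given directory listing")
    return max(files, key=lambda f: f.modificationTime)


def get_latest_file_path(directory_path, dbutils):
    """List a Data Lake directory and return the newest file's full path.

    `dbutils` is passed in explicitly so the function is easy to test
    outside a notebook; inside Databricks you would pass the global
    `dbutils` object.
    """
    return get_latest_file_info(dbutils.fs.ls(directory_path)).path
```

In a notebook you would call something like `get_latest_file_path("abfss://container@account.dfs.core.windows.net/raw/", dbutils)` (path shown for illustration) and hand the result to `spark.read`.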

Tags

#data, #databricks, #datalake, #etl, #pyspark, #python

Thank you for reading my ramblings, if you want to you can buy me a coffee here: Support Dian Germishuizen on Ko-fi! ❤️

My Socials

My Blog: diangermishuizen.com

Linked In: Dian Germishuizen | LinkedIn

Twitter: Dian Germishuizen (@D_Germishuizen) / Twitter

Credly: Dian Germishuizen — Badges — Credly


I have been working in the Technology Industry as a Data Engineer since 2016. I have a passion for learning new things and sharing that knowledge with others.