I think it’s reasonable to assume, that not all our customers will view our Data Lake and it’s technologies through the same lens that we use in our team of Data Engineers.
Whilst we have chosen to lay down our data in S3 using Delta Lake and Apache Spark, we recognise that other teams, such as Data Scientists, have their own priorities to deliver value to their customers and to choose their tooling according to best satisfying these priorities and their team’s skillset.
The ecosystem of tools for Data Scientists is diverse, with this community of highly skilled data professionals using a plethora of libraries and software in addition to the ones used by ourselves.
Recently, I have been exploring one of these tools, Dask.
It can read a large number of file formats like CSV, JSON, ORC and Parquet, but as at the time of writing this article, it does not have a dedicated file format reader to ingest Delta Lake files.
This however, is not really as much of an issue as might appear at first glance.
Delta Lake is based on the Parquet file format, extending it with a transaction log, which specifies which Parquet sub-file is active and which one has been logically deactivated (tombstoned) and no longer applicable to the latest version of the data contained within the Delta Lake file.
To facilitate its integration with other technologies like Presto / AWS Athena, Hive and Amazon Redshift, a manifest can be generated for the Delta Lake file that provides a comma separated list of only the active Parquet sub-files for the latest version of the data in the Delta Lake file.
This manifest file can be read via the pandas read_csv() function, converted to a python list, which can then be used in conjunction with the Dask Dataframe function read_parquet() to read in only the active records from the Delta Lake file.
So for example for a Delta Lake file on local disk:
import pandas as pd
from dask import dataframe as ddformatted_file_list = manifest_df = pd.read_csv('delta_file.delta/_symlink_format_manifest/manifest', header=None, names=['file_name'])file_list = manifest_df['file_name'].tolist()for file in file_list:
formatted_file_list.append(file.replace('file:', 'file://'))print(formatted_file_list)df = dd.read_parquet(formatted_file_list, engine='pyarrow')output = df.compute()print(output)
If you are already a user of PySpark and need to use Pandas in a parallel mode then I would guide you towards Koalas, but otherwise if you use Dask perhaps the above code may help you ingest Delta Lake files.