Extending Dask to Read Delta Lake

George Pongracz
Oct 26 · 2 min read

I think it’s reasonable to assume that not all our customers will view our Data Lake and its technologies through the same lens that we use in our team of Data Engineers.

Whilst we have chosen to lay down our data in S3 using Delta Lake and Apache Spark, we recognise that other teams, such as Data Scientists, have their own priorities for delivering value to their customers, and choose their tooling according to those priorities and their team’s skillset.

The ecosystem of tools for Data Scientists is diverse, with this community of highly skilled data professionals using a plethora of libraries and software in addition to the ones we use ourselves.

Recently, I have been exploring one of these tools, Dask.

Dask is a Python-based library that enables code written for single-threaded libraries like NumPy, Pandas and Scikit-Learn to be executed in parallel.
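As a minimal, hypothetical illustration of that idea, the snippet below runs a familiar Pandas-style groupby across a set of CSV files in parallel (the file pattern and column names are made up for the example):

import dask.dataframe as dd

# Hypothetical input files and column names, purely for illustration
df = dd.read_csv('events-*.csv')
result = df.groupby('user_id')['amount'].sum().compute()
print(result)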

It can read a large number of file formats like CSV, JSON, ORC and Parquet, but as at the time of writing this article, it does not have a dedicated file format reader to ingest Delta Lake files.

This, however, is not as much of an issue as it might appear at first glance.

Delta Lake is based on the Parquet file format, extending it with a transaction log that records which Parquet data files are active and which have been logically removed (tombstoned) and therefore no longer belong to the latest version of the data in the Delta Lake table.
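As a rough sketch of how that works (assuming a small table whose log has not yet been compacted into checkpoint files), each commit in the _delta_log directory is a JSON file of add and remove actions, and replaying them yields the set of currently active Parquet files:

import glob
import json

# Replay the transaction log to work out which Parquet files are active.
# Assumes a table with JSON commits only (no checkpoint files).
active_files = set()
for commit in sorted(glob.glob('delta_file.delta/_delta_log/*.json')):
    with open(commit) as log:
        for line in log:
            action = json.loads(line)
            if 'add' in action:
                active_files.add(action['add']['path'])
            elif 'remove' in action:
                active_files.discard(action['remove']['path'])

print(active_files)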

To facilitate its integration with other technologies like Presto / AWS Athena, Hive and Amazon Redshift, a manifest can be generated for a Delta Lake table that lists, one path per line, only the active Parquet files for the latest version of the data.
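If you are already writing the table with Apache Spark, generating that manifest is a one-liner; the sketch below assumes the delta-spark package is installed and that spark is an existing SparkSession:

from delta.tables import DeltaTable

# 'spark' is assumed to be an active SparkSession
# Writes _symlink_format_manifest/manifest alongside the table's data files
delta_table = DeltaTable.forPath(spark, 'delta_file.delta')
delta_table.generate("symlink_format_manifest")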

This manifest file can be read via the pandas read_csv() function and converted to a Python list, which can then be used with the Dask DataFrame function read_parquet() to read in only the active records from the Delta Lake table.

So, for example, for a Delta Lake table on local disk:

import pandas as pd
from dask import dataframe as dd

# Read the manifest, which lists one active Parquet file path per line
manifest_df = pd.read_csv('delta_file.delta/_symlink_format_manifest/manifest',
                          header=None, names=['file_name'])
file_list = manifest_df['file_name'].tolist()

# The manifest records local paths as 'file:/...'; Dask expects 'file://...'
formatted_file_list = []
for file in file_list:
    formatted_file_list.append(file.replace('file:', 'file://'))
print(formatted_file_list)

# Read only the active Parquet files into a Dask DataFrame
df = dd.read_parquet(formatted_file_list, engine='pyarrow')
output = df.compute()
print(output)
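The same approach should translate to a table held in S3. The sketch below uses a hypothetical bucket and key, assumes s3fs is installed so that both Pandas and Dask can read from S3, and assumes the manifest entries carry an s3a:// prefix from Spark that needs swapping for s3://:

import pandas as pd
from dask import dataframe as dd

# Hypothetical S3 location of the table's manifest
manifest_path = 's3://my-bucket/delta_file.delta/_symlink_format_manifest/manifest'
manifest_df = pd.read_csv(manifest_path, header=None, names=['file_name'])

# Manifests written via Hadoop's S3A filesystem may list paths as s3a://...;
# Dask and pyarrow expect s3://
file_list = [f.replace('s3a://', 's3://') for f in manifest_df['file_name'].tolist()]

df = dd.read_parquet(file_list, engine='pyarrow')
print(df.compute())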

If you are already a user of PySpark and need to use Pandas in a parallel mode, I would guide you towards Koalas; otherwise, if you use Dask, perhaps the above code may help you ingest Delta Lake files.
