Amany Abdelhalim · Published in The Startup · 7 min read · Sep 28, 2020


Do Delta and Parquet Files Refresh Automatically When Appending New Data to Them?

Delta Lake is an open-source storage layer that sits between Apache Spark and the underlying storage (HDFS, S3, etc.) and brings reliability to data lakes. Delta Lake stores data in Apache Parquet format, which lets it take advantage of Parquet's capabilities, while adding solutions for missing capabilities (e.g. ACID transactions, concurrency control, schema enforcement/evolution, unified batch and streaming data processing, the MERGE command) as well as powerful features (e.g. time travel, audit history). Delta Lake stores changes to a delta table as ordered, atomic units called commits or transactions. Every transaction (commit) is recorded in the delta log as a JSON file, and in that JSON file you will find a lot of information describing the committed transaction.

In this post, I would like to answer the question stated in the post title: Do Delta and Parquet Files Refresh Automatically When Appending New Data to Them?

I will give examples and take you, step by step, through the process I followed to answer this question. I used a Databricks notebook (Community Edition) to explore the behaviour, and the answer is not a simple yes or no; it depends on how the data is appended.

I started by reading a small JSON dataset that consists of 100,000 records, available at “/databricks-datasets/structured-streaming/events/”. The dataset has two columns, “action” and “date”. I added one extra column, “id”, to the dataset, which numbers the records from 1 to 100,000.

The number of records in the events data frame is 100,000, as shown below.
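The setup looked roughly like the sketch below; the exact way the id column was generated is not shown here, so the numbering is just one way to do it:

from pyspark.sql.functions import lit, row_number
from pyspark.sql.window import Window

# Read the sample events dataset (columns: action and date).
events = spark.read.json("/databricks-datasets/structured-streaming/events/")

# Add an "id" column numbering the records from 1 to 100,000
# (one possible way; the original notebook may have done this differently).
events = events.withColumn("id", row_number().over(Window.orderBy(lit(1))))

events.count()   # 100000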

I wrote the data once as a parquet file and once as a delta file.
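Something along these lines, with placeholder output paths:

# Illustrative output paths; any DBFS location works.
parquet_path = "/tmp/events_parquet"
delta_path = "/tmp/events_delta"

events.write.format("parquet").mode("overwrite").save(parquet_path)
events.write.format("delta").mode("overwrite").save(delta_path)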

I read the parquet file into an events_parquet data frame and the delta file into an events_delta data frame.
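Reading both files back, using the same placeholder paths as above:

events_parquet = spark.read.format("parquet").load(parquet_path)
events_delta = spark.read.format("delta").load(delta_path)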

I created a delta table, delta_table1, using the same delta file that I used to load data into the events_delta data frame. I then created a parquet table, parquet_table1, using the data from delta_table1.
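One way to create the two tables; the exact statements used in the original notebook are not shown, so treat these as an approximation:

spark.sql(f"CREATE TABLE delta_table1 USING DELTA LOCATION '{delta_path}'")
spark.sql("CREATE TABLE parquet_table1 USING PARQUET AS SELECT * FROM delta_table1")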

Since only one operation, a write, has been performed on the table delta_table1, querying its history shows only one version, version 0.
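The history can be checked with the DESCRIBE HISTORY command, for example:

display(spark.sql("DESCRIBE HISTORY delta_table1"))
# Only version 0 (the initial write) is listed at this point.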

The delta log will show only one JSON file as well.
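The transaction log lives in the _delta_log folder next to the data files, so it can be listed directly (delta_path is the placeholder path used above):

display(dbutils.fs.ls(delta_path + "/_delta_log/"))
# At this point there is a single commit file, 00000000000000000000.json.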

If you explore the JSON file to see what is stored there, you will find the commit information, including the timestamp of the transaction, the id and name of the user who performed it, the operation committed (e.g. overwrite, append), numOutputRows, etc. You will also find metadata such as the file format, the schema, and much more.
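Since each commit file is plain JSON, it can be read and inspected like any other JSON file:

commit_0 = spark.read.json(delta_path + "/_delta_log/00000000000000000000.json")
display(commit_0.select("commitInfo", "metaData"))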

You need to remember that you can’t check the history of a parquet table because that is only supported for delta tables.

Now let's append extra data to the parquet and delta tables and see what happens with regard to refreshing automatically.

First, I will append data to the files, making sure that it has the same schema as the data already stored in them, and take you through what happens.

I will append the data in the events data frame once more to the parquet and delta files.

Appending to the parquet file:

So here the data frame already had 100,000 records and I am appending the same 100,000 records again. When I count the records, it still shows 100,000 rather than 200,000.
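A sketch of that step, using the same data frames and placeholder paths as above:

# Append the same 100,000 records to the parquet file.
events.write.format("parquet").mode("append").save(parquet_path)

events_parquet.count()   # still 100000; the newly appended files are not picked up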

That means parquet was not able to reflect the changes and update automatically; we need to load the data into the data frame again to see the new count. We even need to select the data into parquet_table1 again to see the new records.
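Refreshing the data frame by reading the parquet file again:

events_parquet = spark.read.format("parquet").load(parquet_path)
events_parquet.count()   # 200000 after re-reading

# The parquet_table1 table likewise does not see new records
# until the data is selected into it again.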

Appending to the delta file:

You can see below that it is a different story with delta: the newly appended data is automatically reflected when we ask for the count of records in the events_delta data frame. We did not need to load the data into the data frame again.
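Appending to the delta file with the same kind of write:

events.write.format("delta").mode("append").save(delta_path)

events_delta.count()   # 200000, without re-reading the file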

The delta_table1 table is refreshed automatically as well, without the need to select the data into it again.
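For example, a count over the table returns the new total:

spark.sql("SELECT COUNT(*) FROM delta_table1").show()   # 200000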

Now, if we query the history of the delta table, we will see that a new version, version 1, has been added.

The delta log also has a new JSON file describing the newly committed transaction.

Second, I will append data to the parquet and delta files with a different schema than the data already saved in them.

I created a data frame called new_events that has the same 100,000 records from the events data frame, with an extra column, label, set to 0. So the schema of new_events now consists of four columns (action, date, id and label).
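A sketch of how new_events can be built from events:

from pyspark.sql.functions import lit

# Same 100,000 records, plus an extra "label" column set to 0.
new_events = events.withColumn("label", lit(0))
new_events.printSchema()   # action, date, id, label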

Appending to the parquet file:

Appending data to the parquet file while changing the schema by adding an extra column leads to inconsistent results. When loading the data from the parquet file into the data frame again to refresh it, we see that the number of records has increased to 300,000, but the new label column sometimes appears in the result and sometimes does not, as shown below: in one read the label column is not there, in another it is.
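A sketch of that experiment; the note on schema inference is the likely explanation for the inconsistency:

new_events.write.format("parquet").mode("append").save(parquet_path)

# Re-read the parquet file to refresh the data frame.
events_parquet = spark.read.format("parquet").load(parquet_path)
events_parquet.count()        # 300000
events_parquet.printSchema()  # "label" may or may not appear

# By default Spark infers the parquet schema from a subset of the files rather
# than merging all of them, which is why the label column shows up in some
# reads and not in others.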

Appending to the delta file:

Delta by default enforces the schema of the data already in the file; as you can see below, an attempt to append data with a different schema is not allowed.
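With the default settings, the write below fails:

# Delta enforces the existing schema, so this append is rejected
# (it raises an AnalysisException describing the schema mismatch).
new_events.write.format("delta").mode("append").save(delta_path)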

So you need to remember to enable schema evolution if you want to append the data that has the extra label column.
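Schema evolution can be enabled for this particular write with the mergeSchema option:

new_events.write.format("delta") \
    .mode("append") \
    .option("mergeSchema", "true") \
    .save(delta_path)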

Although the count is updated automatically, just as it was when appending data with the same schema, the new schema does not show up automatically. So, as you can see below, the label column is not displayed.

When the schema evolves, Delta requires that we read the data again for the data frame to reflect the changes.
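In code, that looks roughly like this:

events_delta.count()         # 300000; the count refreshes automatically

# The existing data frame keeps its old schema, so re-read the file to see "label".
events_delta = spark.read.format("delta").load(delta_path)
events_delta.printSchema()   # now includes the label column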

To summarize, parquet cannot refresh automatically in any of these cases; the data has to be read again before we can see the changes. Delta, on the other hand, refreshes the record count automatically, but when the schema evolves it cannot show the extra columns unless the data is read into the data frame again. I hope you found the post interesting and that I was able to answer the question in the title.

