Amany Abdelhalim
6 min read · Oct 24, 2020

Appending/Overwriting with a Different Schema: Delta Lake vs. Parquet

Delta Lake has unique characteristics, and one of them is schema enforcement. Attempting to add data to a Delta file that has a different schema (different column names, different data types, etc.) will cause Delta to deny the transaction and raise an exception ("A schema mismatch detected when writing to the delta table"), showing you both the current schema and the schema of the data that you are trying to append. It is then up to you to fix the schema before appending the data, or, if you want the new columns to be added to the existing schema, you can choose schema evolution by approving the new schema. You can find an example of that in one of my previous posts by clicking here.

I will use the same dataset that I used here, which is a small JSON dataset consisting of 100,000 records, available at “/databricks-datasets/structured-streaming/events/”. The dataset has two columns, “action” and “date”. I added one extra column, “id”, which is a unique number that starts at 1 and increments by one. Since the dataset contains 100,000 records, the column has numbers from 1 to 100,000.
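One way to build that id column in PySpark (a window-based row_number is just one possible approach; the path is the dataset location mentioned above):

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Read the 100,000-record events dataset that ships with Databricks.
events = spark.read.json("/databricks-datasets/structured-streaming/events/")

# Add a sequential "id" column running from 1 to 100,000.
# row_number over a global window is fine for a dataset this small.
w = Window.orderBy(F.monotonically_increasing_id())
events = events.withColumn("id", F.row_number().over(w))
```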

I displayed the number of records in the data frame, the first 10 records, and the schema as shown below.

Displaying the number of records
Showing the first 10 rows
Showing the schema
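Those checks look roughly like this:

```python
print(events.count())   # 100000
events.show(10)         # first 10 rows
events.printSchema()    # action, date, id
```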

I wrote the data as a Delta file and then read the Delta data into a data frame, events_delta.
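Something along these lines, with delta_path standing in as a placeholder for whatever location you use:

```python
# Placeholder location for the Delta output.
delta_path = "/tmp/delta/events"

events.write.format("delta").save(delta_path)
events_delta = spark.read.format("delta").load(delta_path)
```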

The number of records in the events_delta data frame shows 100,000 records and the schema shows that there are 3 columns.

I dropped the id column from the events_delta data frame as shown below.

Now, by checking the schema, there are only 2 columns instead of 3.
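Both steps together look roughly like this:

```python
# Drop the id column and confirm that only action and date remain.
events_delta = events_delta.drop("id")
events_delta.printSchema()
```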

I will start by appending the data that has 2 columns (action, date) to the Delta file that has a schema of 3 columns (action, date, id). Will this action cause a problem? Will Delta deny the transaction? Will it consider this an attempt to add data with a different schema and produce the error “A schema mismatch detected when writing to the delta table”, or is this okay to do and not considered a schema mismatch?

By looking below, you will find that everything went okay and no exceptions were raised.
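The append itself is simply (reusing the placeholder delta_path from above):

```python
# Append the 2-column data frame to the 3-column Delta table.
events_delta.write.format("delta").mode("append").save(delta_path)

# Re-read and count the records.
spark.read.format("delta").load(delta_path).count()   # 200000
```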

The append process worked fine and we succeeded in adding an extra 100,000 records, so the Delta file now has 200,000 records.

So this append transaction was totally acceptable and there wasn’t any schema mismatch. Let’s see why it is not considered a schema mismatch. I will show you the schema of the Delta table; notice an important property of the id column below, which is (nullable = true).

This means that when we dropped the id column and appended the data with (action, date) only, Delta considered that we were still adding values for the id column, except that it treats all of those values as null since the id column was dropped. So it added 100,000 nulls in the id column and did not consider this a schema mismatch. Let’s now check how many of the 200,000 id values are null. We should expect the result to be 100,000.
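A quick way to check, for example:

```python
from pyspark.sql import functions as F

# Count the rows whose id is null after the append.
(spark.read.format("delta").load(delta_path)
      .filter(F.col("id").isNull())
      .count())   # expect 100000
```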

What I discussed above was an attempt to append to the Delta table after dropping one of the columns. Let’s see what will happen if I repeat the same steps while changing the mode to overwrite instead of append.

Actually, you will see below that the Delta schema didn’t change and the number of columns stayed as is. The file is overwritten with the 100,000 records from the events_delta data frame and nulls were added in the id column.
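For example:

```python
# Overwrite the Delta table with the 2-column data frame.
events_delta.write.format("delta").mode("overwrite").save(delta_path)

# The schema still shows the 3 columns; id is simply all null now.
spark.read.format("delta").load(delta_path).printSchema()
```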

If you would like the schema to change from having 3 columns to just the 2 columns (action and date), you have to add an option for that: option(“overwriteSchema”, “true”).

So, in order to show you that, first, I deleted the existing delta file as shown below.
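For example, using the Databricks file-system utilities on the placeholder path:

```python
# Recursively remove the existing Delta directory.
dbutils.fs.rm(delta_path, True)
```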

Then, I repeated all the steps above from the beginning up to the part of dropping the id column. After that I did the following.
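The overwrite with the extra option looks roughly like this:

```python
# Overwrite both the data and the schema of the Delta table.
(events_delta.write.format("delta")
    .mode("overwrite")
    .option("overwriteSchema", "true")
    .save(delta_path))

spark.read.format("delta").load(delta_path).printSchema()
```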

As you see above, after adding the option (“overwriteSchema”, “true”) to overwrite the schema, the schema now has only 2 columns, action and date (id is no longer there).

Consider now that we have a Parquet file that already has the same 100,000 records from the JSON file mentioned above, with a schema that has the same 3 columns (action, date, id). I loaded the data into a data frame, events_parquet.
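Something along these lines, with parquet_path standing in as a placeholder for the Parquet location:

```python
# Placeholder location for the Parquet copy of the same data.
parquet_path = "/tmp/parquet/events"

events.write.format("parquet").save(parquet_path)
events_parquet = spark.read.parquet(parquet_path)
```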

Let’s now see how Parquet reacts to appending the data after dropping the id column.
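Roughly, dropping the column and appending looks like this:

```python
# Drop id and append the 2-column data frame to the Parquet directory.
events_parquet = events_parquet.drop("id")
events_parquet.write.format("parquet").mode("append").save(parquet_path)

# Re-read the directory to see what is left.
spark.read.parquet(parquet_path).count()
spark.read.parquet(parquet_path).printSchema()
```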

After appending the new records, the data replaced the data that used to be in the Parquet file. So we lost the 100,000 records that existed in the file before, and the file now holds only the appended 100,000 records with 2 columns instead of the 3 columns that existed before.

New schema of the Parquet file

Changing the mode to overwrite does the same thing that append did, except that we need to refresh, by reading the data again, to see the results: 100,000 records with the 2 columns (action, date) that replaced the old schema and deleted the existing data.
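For example:

```python
# Overwrite the Parquet directory with the 2-column data frame.
events_parquet.write.format("parquet").mode("overwrite").save(parquet_path)

# Reading the data again shows the new records and the 2-column schema.
spark.read.parquet(parquet_path).count()
spark.read.parquet(parquet_path).printSchema()
```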

I hope you found the post interesting and that I was able to demonstrate the subject in a good and clear way.
