How to handle schema changes?

Amit Kumar
3 min read · Jul 31, 2020


Schema changes in source systems are very common. We don’t know which columns business users are adding or removing before they send us files for processing. We typically create the destination table up front to load the data from the source system, and only when a load fails with an error like “Column XX doesn’t exist in table” do we learn that the source schema has changed. Then we start altering the destination table’s schema manually.

That’s fine if it happens rarely. But what if it happens regularly, for most of your files or tables? Will you keep changing the table structure every time your manager asks?

Obviously not. Then it’s time to look for a solution to this modern data challenge.

In this article, I will solve this problem with a modern data engineering tool: Databricks.

Let’s see this in action.

I am going to load the data for EMP1, which has 3 columns and 4 rows (3 cols × 4 rows), into a table called “DynamicTable”.

Now, I get another file from the source which has 4 columns and 5 rows (4 cols × 5 rows). If you compare this file with the previous one, it has one extra column, “age”. That means the schema has changed at the source.

In the final result, you can see that I haven’t changed the schema of my table; the new data is handled and loaded dynamically. Whether you then want to remove duplicates or keep only distinct values is a business requirement; you can apply various transformations on the DataFrame.

Happy Learning !!!


Amit Kumar

Data Engineer | Azure SQL | Azure Data Factory | Azure Data Lake | Azure Analysis Service | Azure Databricks | PySpark | Azure ML | Power BI | Snowflake