How to Merge Two DataFrames in PySpark || Databricks

Mudassar
2 min read · Aug 29, 2022

Today we are going to learn how to merge two DataFrames in PySpark. First of all, we have to create a DataFrame. We will create one with 2 rows and 4 columns: student name, department, city and marks.

When you run the above command, you will see the student name, department, city and marks columns in the output.
Now we need a second DataFrame to merge with, so copy the above code, rename df_1 to df_2, and change the values in the second row:

After running this code, you can see the values of the second DataFrame.

Now we will apply a union to both DataFrames to get the merged rows. To do that, write the following command:

After running this command, you will see all the rows from both DataFrames in one result:

Here you can see that the name Jimmy appears twice. This means the merged data contains a duplicate row, because union keeps every row from both DataFrames. So, we have to remove the duplicate by using the following command:

Now the result contains only distinct rows, and Jimmy appears once.

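The union result takes its column names from the DataFrame you call union on. In our case that is df_1, because we put it first. Keep in mind that union matches columns by position, not by name, so both DataFrames must have the same number of columns in the same order.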
