How to merge a group of CSV files with different columns into a DataFrame

In this story, I will go over the steps to merge a group of CSV files into a single DataFrame and export the result to a CSV file.

Linxing Yao
2 min read · Aug 27, 2020

Challenge

You have a group of CSV files with different columns. Your assignment is to merge them into one giant CSV file without any duplicate rows. How could you manage that?

Using glob to find all files matching the pattern

import glob
import pandas as pd
path = #working directory#
all_files = glob.glob(path + '/**/*.csv', recursive=True)

For data preprocessing, Python is hard to beat among programming languages. Its standard library includes the glob module, which allows convenient pattern-based access to files of a designated format. In the code above, glob.glob fetches all CSV files recursively under the given directory: the '**' pattern matches any depth of nested subdirectories when recursive=True is passed. After this call, all_files is a list of the matching file paths.
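To see the recursive pattern in action, here is a minimal, self-contained sketch: it builds a throwaway directory tree with two CSV files (the file names and columns are invented for illustration) and confirms that a single glob call finds both, including the one in a subdirectory.

```python
import glob
import os
import tempfile

# Create a temporary directory with one CSV at the top level
# and one inside a subdirectory, then find both with one pattern.
with tempfile.TemporaryDirectory() as path:
    os.makedirs(os.path.join(path, 'sub'))
    for name in ('a.csv', os.path.join('sub', 'b.csv')):
        with open(os.path.join(path, name), 'w') as f:
            f.write('ticketNo,value\n1,10\n')
    # '**' matches zero or more nested directories when recursive=True,
    # so files directly under path are matched as well
    found = glob.glob(os.path.join(path, '**', '*.csv'), recursive=True)
    print(len(found))  # 2
```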

Storing each dataframe into a List

all_df = []
for f in all_files:
    df = pd.read_csv(f, sep=',')
    all_df.append(df)

In the code above, I first create an empty list; then, for each file path in all_files, I read the file into a new DataFrame and append it to the list.
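The same loop can be demonstrated without touching the disk by reading CSV text from in-memory buffers; the column names below are made up for illustration, but the read-and-append pattern is identical.

```python
import io

import pandas as pd

# Two small CSV payloads standing in for files on disk
csv_texts = [
    'ticketNo,price\n1,9.5\n2,7.0\n',
    'ticketNo,seat\n1,A1\n3,B2\n',
]

all_df = []
for text in csv_texts:
    # io.StringIO makes a string readable like a file handle
    df = pd.read_csv(io.StringIO(text), sep=',')
    all_df.append(df)

print(len(all_df))      # 2
print(list(all_df[0]))  # ['ticketNo', 'price']
```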

Merge every single dataframe into a giant dataframe

df = all_df[0]
for df_temp in all_df[1:]:
    # use an outer join to make sure every single ticketNo in the data is included
    df = pd.merge(df, df_temp, on=#primary key#, how='outer')

I initialize df with the first DataFrame, then merge the remaining DataFrames into it one at a time on the primary key (usually the first column, holding unique and non-null values), specifying how='outer' so that rows whose keys have no match are still kept, with nulls filling the missing columns.
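The sequential merge above can also be written with functools.reduce. The sketch below uses 'ticketNo' as a stand-in for the primary key and two tiny invented DataFrames, and shows how the outer join keeps unmatched keys with nulls in the missing columns.

```python
from functools import reduce

import pandas as pd

# Two frames sharing only the key column; data is illustrative
frames = [
    pd.DataFrame({'ticketNo': [1, 2], 'price': [9.5, 7.0]}),
    pd.DataFrame({'ticketNo': [1, 3], 'seat': ['A1', 'B2']}),
]

# reduce applies the pairwise outer merge left to right,
# exactly like the for-loop version
merged = reduce(
    lambda a, b: pd.merge(a, b, on='ticketNo', how='outer'),
    frames,
)
print(merged)
# ticketNo 2 has no seat and ticketNo 3 has no price,
# but both rows survive thanks to how='outer'
```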

Validate the data consistency

df = df.drop_duplicates()
df = df.dropna(how='all')

I validate the data consistency by dropping duplicate rows and rows that are entirely null. Note that both methods return a new DataFrame rather than modifying df in place, so the result must be assigned back.
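A small sketch of both cleanup calls on invented data: one exact duplicate row and one all-null row, each removed by the corresponding method.

```python
import numpy as np
import pandas as pd

# Row 1 duplicates row 0; row 3 is entirely null
df = pd.DataFrame({
    'ticketNo': [1, 1, 2, np.nan],
    'price': [9.5, 9.5, 7.0, np.nan],
})

df = df.drop_duplicates()   # removes the duplicated second row
df = df.dropna(how='all')   # removes the fully-null last row
print(len(df))  # 2
```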

Convert the dataframe into a CSV file

df.to_csv('data_processing.csv', sep=',', index=False)

One last step is to write the DataFrame to a comma-delimited file; passing index=False keeps the pandas row index from being written as an extra unnamed column. Done!
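As a quick sanity check of the export, the sketch below writes a tiny invented DataFrame to an in-memory buffer instead of a file path (to_csv accepts either) and inspects the header line.

```python
import io

import pandas as pd

df = pd.DataFrame({'ticketNo': [1, 2], 'price': [9.5, 7.0]})

# Write to a buffer; index=False keeps the row index out of the output
buf = io.StringIO()
df.to_csv(buf, sep=',', index=False)

print(buf.getvalue().splitlines()[0])  # ticketNo,price
```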
