How to combine multiple CSV files using Python for your analysis

Stella Joshua
6 min read · Apr 25, 2022


Oftentimes, as a data analyst, you may find yourself overloaded with multiple CSV files that need to be combined before you can even start analysing the data. However, it's not always the case that all the files are extracted from the same data source, have the same columns, or follow the same structure.

In this tutorial, you will learn how to combine multiple CSVs with either similar or varying column structures, and how to use the append(), concat(), merge() and combine_first() functions to do so.

Before we do that, let’s see how to import a single CSV file into a DataFrame using the pandas package.

1. Importing a file into a pandas DataFrame:

To import a single file into a dataframe, you can simply use the pd.read_csv() function.

#import pandas module
import pandas as pd
#read the csv file into a dataframe (raw string so the backslashes in the Windows path are not treated as escapes)
df = pd.read_csv(r"D:\Blog\Merge_Files\csv_files\csv_sample1.csv")
df

When you have multiple files to work with, the simplest approach is to place all the files in a single directory and then read each of them with the pd.read_csv() function.

2. Setting up the working directory:

One method is to store the path of the directory in a variable and then list all the files in that directory.

#import the modules
import os
import pandas as pd
#store the directory path (raw string so the backslashes are kept as-is)
file_path = r"D:\Blog\Merge_Files\csv_files"
#list all the files in the directory
file_list = os.listdir(file_path)
file_list

Alternatively, if you want to read files from the same directory as your .ipynb file, you can use the code below.

#import the modules
import os
import pandas as pd
#get the absolute path of the current working directory
cwd = os.path.abspath('')
#list all the files in that directory
file_list = os.listdir(cwd)
file_list
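
Note that os.listdir() returns everything in the directory, including the notebook itself and any non-CSV files. If you want to keep only the CSVs from that listing, a simple filter works; a small sketch (csv_only is just an illustrative name):

#keep only the .csv files from the listing
csv_only = [f for f in file_list if f.lower().endswith('.csv')]
csv_only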

The third method is to use glob.glob() to list only the CSV files in the working directory.

#import modules
import os
import glob
import pandas as pd
#list only the csv files in the current working directory
csv_files = glob.glob('*.csv')
csv_files
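
If the CSV files live in a different directory (for example the file_path set earlier) rather than the current working directory, you can build the glob pattern with os.path.join; a minimal sketch:

#list the csv files inside a specific directory instead of the current one
csv_files = glob.glob(os.path.join(file_path, '*.csv'))
csv_files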

Now, to read multiple CSV files with a similar table structure, you can use the pandas.DataFrame.append() or pd.concat() functions.

Let’s look at the 3 sample CSV files we’ll be working with.

csv_sample1.csv

csv_sample2.csv

csv_sample3.csv

All three files have the same column headers, except that csv_sample2.csv has an additional column named “Birthdate”. Also, note that there are 2 entries that are common between csv_sample1.csv and csv_sample2.csv. The entry for “Tom R. Powell” has different “Joined Date” values in the two files. Note how these entries get combined in each of the methods used below.

3. Combining multiple files with a similar table structure using pandas.DataFrame.append()

Use the code below to read and combine all the CSV files from the directory set earlier.

#start with an empty dataframe and append all files to it
df_append = pd.DataFrame()
for file in csv_files:
    df_temp = pd.read_csv(file)
    df_append = df_append.append(df_temp, ignore_index=True)
df_append

The output after using the append() function is as below.

Here, you can see that all the data rows from the files have been appended one below the other. However, NaN values have been inserted in the “Birthdate” column as these values are not present in csv_sample1.csv and csv_sample3.csv files.
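A note on versions: DataFrame.append() was deprecated in pandas 1.4 and removed in pandas 2.0, so the loop above only works on older releases. On a current pandas, you can get the same row-wise stacking by collecting the frames in a list and concatenating once; a minimal sketch:

#read each csv into a dataframe and stack them row-wise in a single concat call
frames = [pd.read_csv(file) for file in csv_files]
df_append = pd.concat(frames, ignore_index=True)
df_append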

4. Combining multiple files with a similar table structure using pandas.concat()

Another way to combine the files is using pandas.concat(), as shown below.

df_concat = pd.concat([pd.read_csv(f) for f in csv_files], ignore_index=True)
df_concat

Now, if you want to join data rows of the files based on related columns then you may use pandas.DataFrame.merge() function.

5. Using pandas.DataFrame.merge() to join the data rows

First, read the files into separate dataframes as below.

#import pandas module
import pandas as pd
#read the csv file into a dataframe
df_sample1 = pd.read_csv("csv_sample1.csv")
df_sample1
df_sample2 = pd.read_csv("csv_sample2.csv")
df_sample2

Now, while using merge() between these dataframes, you need to specify the related columns on which you want to join the rows.

#columns on which to join the rows (avoid naming the variable "list", which shadows the built-in)
merge_cols = ["Email", "First Name", "Last Name", "Joined Date"]
df_master = df_sample1.merge(df_sample2,
                             on=merge_cols,
                             how='outer')
df_master

The function joined rows only where all the values of the specified columns matched.

Here, we have used the outer join method to merge the files. To learn more about the types of merge that can be performed, you may refer to this link: pandas.merge()
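
For instance, if you only want to keep the rows that appear in both files, you can swap the join type to 'inner'; a quick sketch using the same dataframes and key columns:

#keep only the rows where all key columns match in both files
df_inner = df_sample1.merge(df_sample2,
                            on=merge_cols,
                            how='inner')
df_inner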

In the above example, we passed a list of column names on which we wanted to join the rows. Instead, if we join the rows only on the “Email” column then we would get an output as below.

df_master = df_sample1.merge(df_sample2,
                             on="Email",
                             how='outer')
df_master
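
Because only “Email” is used as the key here, the columns that exist in both files but are not part of the key come back twice, with pandas adding the default suffixes ‘_x’ (from the left dataframe) and ‘_y’ (from the right one). You can list the duplicated columns with a small check on df_master:

#list the columns that were duplicated with the default merge suffixes
df_master.filter(regex='_x$|_y$').columns.tolist()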

Now, if you want to build a dataframe that keeps the values from one file, say csv_sample1.csv, and fills in values from another file, say csv_sample2.csv, wherever they are null, use combine_first().

6. Updating null values in columns from other columns using pandas.combine_first()

First, remove the ‘_x’ suffix from the column headers.

#remove the suffix '_x' from the column headers (anchored so only a trailing '_x' is removed)
df_master.columns = df_master.columns.str.replace(r'_x$', '', regex=True)
df_master

Next, collect the column names on which you want to apply combine_first(). An easy way is to fetch the columns ending in ‘_y’ and then drop the ‘_y’ suffix from them, as below.

#columns that came from the right dataframe, e.g. 'Joined Date_y'
y_cols = [col for col in df_master.columns if col.endswith('_y')]
#drop the trailing '_y' to get the base column names (strip('_y') would also eat trailing 'y' characters)
base_cols = [col[:-2] for col in y_cols]
base_cols

Applying combine_first().

#fill nulls in each base column with the values from its '_y' counterpart
for n in base_cols:
    df_master[n] = df_master[n].combine_first(df_master[n + "_y"])
df_master

Drop the extra columns.

#drop the leftover '_y' columns
df_master = df_master.drop(columns=df_master.filter(regex='_y$').columns)
df_master

Finally, to export the combined dataframe, you may use pandas.DataFrame.to_csv().

7. Exporting the file to your directory.

#write the combined dataframe back to a csv file (raw string for the Windows path)
df_master.to_csv(r'D:\Blog\Merge_Files\csv_files\Combined_files.csv')
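
By default, to_csv() also writes the DataFrame index as an unnamed first column; if you don’t want that in the output file, pass index=False:

#export without the DataFrame index column
df_master.to_csv(r'D:\Blog\Merge_Files\csv_files\Combined_files.csv', index=False)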
