How to move data from one Excel file to another using Python

Todd Q Brannon
Published in The Startup · 8 min read · Jan 22, 2020

Using the Pandas library in Python, we can pull data from a source Excel file, insert it into a new Excel file, and then name and save that file. This is useful when you need to drill down to specific data and/or reformat the data for a report.

As a long-time VBA developer, discovering Python’s many libraries and tools that work with Excel files and data has been rather life-changing. The following example use case is loosely based on an actual project where I was able to replace pages of VBA code with under 20 lines of Python code.

I won’t get into downloading and installing Python or the Pandas library in this post. You can easily find better documentation on both elsewhere. This example uses Python 3.

The Use Case

We have an Excel workbook that contains 2019 reported sales for 28 stores across the US. The file contains six worksheets, each corresponding to a different product ID. Every sheet uses the same template: a table with a row for each of the 28 stores and a column header for each month of the calendar year.

The goal is to extract sales data for each store from each worksheet in the source file and place the data into three columns in a new Excel file: Product ID, Store ID, and Sales (for a given month we will specify).

Step 1 — Import the Pandas library

I will open Visual Studio Code and create a new file. I will save that file as ‘getSalesData.py’ and type ‘import pandas as pd’ at the top.

This solution relies on two extremely useful features of the Pandas library: the DataFrame object for data manipulation, and the ExcelFile class for parsing tabular Excel sheets into DataFrame objects. Pandas also offers tools to read and write data between in-memory data structures and several formats besides Microsoft Excel, such as CSV, text files, and even SQL databases.

Step 2 — Set variables

The next step is to create three variables. The ‘spreadsheet_file’ variable is set to the source workbook. The ‘worksheets’ variable holds the sheet names in that source file, and ‘appended_data’ starts as an empty array.
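A minimal sketch of this setup might look like the following. The sheet names (‘PROD-A’, ‘PROD-B’) and the sample-workbook creation at the top are placeholders I’ve added so the snippet runs on its own; in the real project, Monthly_Store_Sales.xlsx already exists with its six product tabs.

```python
import pandas as pd

# Create a tiny stand-in workbook so this snippet is self-contained.
# (In the actual project, Monthly_Store_Sales.xlsx already exists.)
sample = pd.DataFrame({"Store ID": ["S-001", "S-002"], "August": [3500, 2800]})
with pd.ExcelWriter("Monthly_Store_Sales.xlsx") as writer:
    sample.to_excel(writer, sheet_name="PROD-A", index=False)
    sample.to_excel(writer, sheet_name="PROD-B", index=False)

# The three variables from Step 2
spreadsheet_file = pd.ExcelFile("Monthly_Store_Sales.xlsx")  # source workbook
worksheets = spreadsheet_file.sheet_names                    # sheet names in the file
appended_data = []                                           # will hold one dataframe per sheet

print(worksheets)
```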

Step 3 — Loop through all the worksheets in the data source Excel file

Using a ‘for’ loop, I will get the data I want from each sheet in the source data file. This approach is completely scalable, meaning that no matter how many sheets there are in the file (of the same template), or how often you add, remove, or change sheet names, this script will always check every sheet.

I use ‘sheet_name’ as the variable for each sheet in the workbook. After I save my changes to the file, I test it by printing the variable ‘sheet_name’ to the terminal. In the integrated terminal in VS Code (at the directory path where getSalesData.py is saved), I type ‘python getSalesData.py’ and press Enter. Note that each tab name in our Monthly_Store_Sales.xlsx file prints to the terminal.
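Sketched out, the loop might look like this. Again, a small generated workbook stands in for the real source file, and the three tab names are hypothetical:

```python
import pandas as pd

# Stand-in workbook with three tabs (hypothetical product IDs)
sample = pd.DataFrame({"Store ID": ["S-001"], "August": [3500]})
with pd.ExcelWriter("Monthly_Store_Sales.xlsx") as writer:
    for tab in ["PROD-A", "PROD-B", "PROD-C"]:
        sample.to_excel(writer, sheet_name=tab, index=False)

spreadsheet_file = pd.ExcelFile("Monthly_Store_Sales.xlsx")

# Visit every sheet, no matter how many there are or what they are named
seen = []
for sheet_name in spreadsheet_file.sheet_names:
    print(sheet_name)       # each tab name prints to the terminal
    seen.append(sheet_name)
```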

Step 3.1 — Set the month column to pull sales data from

Now that I am set with looping through each sheet in the workbook, I will get the exact data I want. This starts with getting all the data from each sheet and passing it to a dataframe. I will use ‘df’ as our variable name for the dataframe.

At this point, I will also set a month variable to the month (table header) I want to grab the data for. I will pull the data for August.

Step 3.2 — Get initial dataframes from my source Excel file

Now I will read the Excel data from the source sheets into a Pandas dataframe using the pandas.read_excel method. There are three arguments I need to pass to the method: the source file name, the sheet name, and the row of the column headers. I will use a value of 2 for the header row since the header row in the sheet is Excel row 3 and the row values recognized by Pandas are zero-based (meaning row 1 in Excel corresponds to header=0).
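Here is a sketch of that call. To keep it self-contained, the snippet first writes a sheet whose headers land on Excel row 3 (startrow=2 mimics the article’s template); the file and sheet names are stand-ins.

```python
import pandas as pd

# Write a sheet whose column headers sit on Excel row 3
# (rows 1-2 left blank), mimicking the source template.
sample = pd.DataFrame({"Store ID": ["S-001", "S-002"], "August": [3500, 2800]})
sample.to_excel("Monthly_Store_Sales.xlsx", sheet_name="PROD-A",
                index=False, startrow=2)

# header=2 tells pandas that zero-based row 2 (Excel row 3)
# holds the column headers
df = pd.read_excel("Monthly_Store_Sales.xlsx", sheet_name="PROD-A", header=2)

print(df.head(5))  # first five rows, for inspection
```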

Step 3.3 — Print my initial dataframes to the terminal for inspection

If I want to see what the dataframe looks like in its current state, I can print it to my terminal with ‘print(df)’. I will actually type ‘print(df.head(5))’ in order to limit the results to only the first 5 rows of each sheet. Note that Pandas adds an index column on the far left by default.

Note that the dataframes contain every column in the sheet, and the cells that are not part of the store sales tables show ‘NaN’ as values. Next, I’ll clean that up, keeping only the specific data I want and removing everything irrelevant.

Step 3.4 — Limit the dataframes to only the columns for ‘Store ID’ and ‘August’

The next thing I will do is add a line of code that will limit the data in my dataframe to the columns for ‘Store ID’ and ‘August’. Note the results of ‘print(df)’ in the integrated terminal below the script.
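In code, the column subset is a one-liner. The dataframe below is a made-up stand-in for one sheet’s data:

```python
import pandas as pd

# Stand-in for one sheet's full dataframe, including a column we don't need
df = pd.DataFrame({
    "Store ID": ["S-001", "S-002", "S-003"],
    "July": [2900, 4100, 3300],
    "August": [3500, 2800, 5100],
})

month = "August"

# Keep only the 'Store ID' column and the chosen month's column
df = df[["Store ID", month]]

print(df)
```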

Step 3.5 — Filter the rows of the dataframes for only values in the ‘August’ column greater than $3000

The next step is to limit the rows to only those with a sales value greater than $3000. I can use the DataFrame.where method to do this.

Step 3.6 — Remove rows with ‘NaN’ values from the dataframes

Notice in the terminal, we now have a bunch of rows with ‘NaN’. We can eliminate these with the DataFrame.dropna() method. Once I do that, save the file, and run it again, I can see a much cleaner dataframe in the terminal.
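The two steps together might look like this sketch (sample data made up): DataFrame.where blanks out the rows that fail the condition, and dropna() then removes them.

```python
import pandas as pd

df = pd.DataFrame({
    "Store ID": ["S-001", "S-002", "S-003"],
    "August": [3500, 2800, 5100],
})

# Rows failing the condition become all-NaN ...
df = df.where(df["August"] > 3000)

# ... and dropna() removes any row containing NaN
df = df.dropna()

print(df)
```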

Step 3.7 — Create a ‘Product ID’ column and add it to the dataframes

The next two lines of code will create a ‘Product ID’ column in the dataframes based on each worksheet tab name in the source workbook file. Then I will set the final column order of the dataframes to include my new ‘Product ID’ column as the first one. When I run the script again, I see the 3 columns of my dataframes printed to the terminal.
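A sketch of those two lines, with sheet_name standing in for the loop variable and made-up sample data:

```python
import pandas as pd

sheet_name = "PROD-A"  # supplied by the enclosing 'for' loop in the real script

df = pd.DataFrame({
    "Store ID": ["S-001", "S-003"],
    "August": [3500, 5100],
})

# Create the 'Product ID' column from the worksheet tab name,
# then set the final column order
df["Product ID"] = sheet_name
df = df[["Product ID", "Store ID", "August"]]

print(df)
```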

Step 3.8 — Add each dataframe to an array

The next step is to add each dataframe as an element to the ‘appended_data’ array. Since every iteration through our ‘for’ loop creates a new dataframe, I can use the append(df) method to append each successive dataframe from each sheet in the source workbook to the existing array, ‘appended_data’.

Step 4 — Join all the dataframes in the appended_data array into a single dataframe

Now that the ‘for’ loop has pulled the data from all the worksheets into their own dataframes that have been added to the ‘appended_data’ array, I need to combine them into a single dataframe. I will use the pandas.concat() function to accomplish that.

I will save the file and run it again, printing ‘appended_data’ to the terminal. The results confirm that I now have a single dataframe, indicated by the absence of a repeated header row for each ‘Product ID’ section.
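Steps 3.8 and 4 together might be sketched like this, with two small made-up dataframes standing in for the per-sheet results:

```python
import pandas as pd

appended_data = []

# Stand-ins for the dataframes produced on each pass through the loop
for product in ["PROD-A", "PROD-B"]:
    df = pd.DataFrame({
        "Product ID": [product, product],
        "Store ID": ["S-001", "S-003"],
        "August": [3500, 5100],
    })
    appended_data.append(df)  # Step 3.8: collect each sheet's dataframe

# Step 4: join them into one dataframe; ignore_index renumbers rows 0..n-1
combined = pd.concat(appended_data, ignore_index=True)

print(combined)
```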

Step 5 — Export the dataframe to a new Excel file

The only thing left to do now is export the existing dataframe to a new Excel file. Using the DataFrame.to_excel() method, I will pass the file path and filename as an argument. As a second argument, I pass ‘index=False’ so the index column doesn’t get exported to the new spreadsheet file.
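The export might look like the following. The output filename is my own placeholder, and the read-back at the end is only there to confirm the round trip:

```python
import pandas as pd

combined = pd.DataFrame({
    "Product ID": ["PROD-A", "PROD-A"],
    "Store ID": ["S-001", "S-003"],
    "August": [3500, 5100],
})

# index=False keeps the dataframe's index out of the output file
combined.to_excel("august_sales_over_3000.xlsx", index=False)

# Read it back to confirm the export worked
check = pd.read_excel("august_sales_over_3000.xlsx")
print(check)
```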

Step 6 — Run the finished script and open the new file

I will save the new changes, run the script, and check the output file path for the newly created file.

When I open the file, I will see the three columns: ‘Product ID’, ‘Store ID’, and ‘August’ with only rows containing values greater than 3,000 in the ‘August’ column.

In less than 20 lines of code, you can automate a process that doesn’t even require opening the Excel application, much less the source data file. Additionally, it’s scalable to however many worksheets are in the source file and the script takes all of about 2–3 seconds to run.

You could even develop a desktop UI using the Tkinter library to make the script even more dynamic by passing parameters like the column selection and file paths and names from user input. But that’s another article for another day.

Get the full script code on GitHub: https://github.com/toddbrannon/get_excel_data/blob/master/getExcelData.py
