Python and Openpyxl: Gather Thousands of Excel Workbooks Into a Single File
A Portuguese translation of this article is available here.
Microsoft Excel has established itself as one of the most popular programs in the world, widely used both in the workplace and for personal purposes. Perhaps at your company or at home you still need to manipulate files with the .xlsx extension, and for some reason you want your data to remain saved in that format.
Excel ships with VBA as a native resource, and that programming language can handle a large number of process-automation demands in xlsx files. However, Microsoft stopped developing the language several years ago, even though it remains available in the desktop versions of Excel for now.
Anyway, you don’t need to know VBA to become a spreadsheet wizard! With Python and the openpyxl library, you can write code that will save you a lot of time and bring concrete results to your work. And all that comes with the advantage of using one of the most modern programming languages in the world, with one of the highest growth rates in the market and a huge, active developer community.
USE SCENARIO AND EXAMPLE DATA
This article deals with a sadly widespread data-manipulation scenario in Excel: a single database spread across different spreadsheets or several Excel files. If there are only a few spreadsheets or workbooks and no user is familiar with programming, a copy-and-paste approach may still be feasible to bring all the data together in one place. Even so, this procedure is by no means recommended, since it carries a considerable risk of human error. Furthermore, it represents a colossal waste of time.
What if you are working with information from the same database, but the data is spread over hundreds, maybe thousands, of different spreadsheets and files? In that case, the manual approach becomes unfeasible. Let the openpyxl library help us complete such a task quickly when working with the modern .xlsx extension. For the much older .xls files, other Python libraries (such as xlrd) must be used.
In this GitHub repository, you will find two directories (workbooks_by_days and workbooks_by_months) with xlsx files we can use to show the power of openpyxl in gathering data from thousands of Excel files.
I created this data randomly, so it does not correspond to any real data that should be analyzed or understood. All files have the same structure: in the workbooks_by_days directory, you will find one Excel workbook for each day between January 1st, 2020, and March 28th, 2021. The file names carry the date information in year-month-day order. There is only one spreadsheet per file here, simulating a weather report with data collected every hour. In the workbooks_by_months folder, the data is grouped by month, so each Excel workbook has dozens of spreadsheets, one for each day of that month.
So, download these Excel files to your machine if you want to follow the explanation of the code below and, at the same time, test it locally.
We will start by importing the openpyxl resources we will need in our code, as well as the os module:
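The original code block was not preserved here, but a minimal version of those imports might look like this (only load_workbook and Workbook are needed from openpyxl for what follows):

```python
# Import the os module and the openpyxl resources used throughout the article.
import os
from openpyxl import Workbook, load_workbook
```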
Now, we will store the path of the directory where the Excel files we want to gather are located and then use a for loop over all the items in this directory. If the code finds a filename string ending in .xlsx that is not the name of a temporary file (which usually has the ~$ prefix), we will call the load_workbook function and create an instance of the openpyxl Workbook object. This Workbook object will be appended to a list called wbs, which will keep, in memory, the data of all the files to be consolidated.
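The loop described above can be sketched as follows. The directory name and the tiny sample files created at the top are illustrative stand-ins for the repository's real data, so the snippet runs on its own:

```python
import os
from openpyxl import Workbook, load_workbook

# Illustrative sample directory with two small files, standing in for the
# repository's workbooks_by_days folder so this sketch runs standalone.
workbooks_path = "workbooks_by_days_sample"
os.makedirs(workbooks_path, exist_ok=True)
for name in ("2020-01-01.xlsx", "2020-01-02.xlsx"):
    sample = Workbook()
    sample.active.append(["hour", "temperature"])
    sample.save(os.path.join(workbooks_path, name))

# Load every .xlsx file, skipping Excel's temporary files ("~$" prefix),
# and keep the resulting Workbook objects in the wbs list.
wbs = []
for filename in os.listdir(workbooks_path):
    if filename.endswith(".xlsx") and not filename.startswith("~$"):
        wbs.append(load_workbook(os.path.join(workbooks_path, filename)))
```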
When the for loop is over, we will create a new Excel workbook with Python code. This workbook will be filled with the data from the workbooks saved in the wbs list. After creating this consolidation workbook (kept in the final_wb variable for now), we will access its first spreadsheet and save this object, of the Worksheet type, in the final_ws variable. We will also open the first workbook in the wbs list, saving its workbook and worksheet data in variables of their own.
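A sketch of that step is below. The article's original variable names for the first workbook and worksheet were not preserved, so first_wb and first_ws are assumed names, and a small in-memory workbook stands in for the loaded wbs list:

```python
from openpyxl import Workbook

# A single in-memory workbook stands in for the wbs list built earlier.
placeholder = Workbook()
placeholder.active.append(["hour", "temperature"])
wbs = [placeholder]

# The new consolidation workbook and its first worksheet.
final_wb = Workbook()
final_ws = final_wb.active

# Open the first workbook in wbs; first_wb and first_ws are assumed names.
first_wb = wbs[0]
first_ws = first_wb.worksheets[0]
```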
Each Worksheet object from openpyxl has two attributes that will be useful for us: max_column and max_row. They hold the maximum number of columns and rows in that spreadsheet. These attribute values will be used extensively in our code, as they define the upper limits when we iterate through the rows and columns of a given spreadsheet.
The code below has a for loop that executes only one line of code at each iteration. It was created to copy the column headers and transfer them to final_ws.
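That header-copying loop might look like the sketch below, where a small in-memory worksheet stands in for the first loaded spreadsheet:

```python
from openpyxl import Workbook

# Stand-in for the first worksheet of the first loaded workbook.
source_wb = Workbook()
first_ws = source_wb.active
first_ws.append(["hour", "temperature", "humidity"])

final_wb = Workbook()
final_ws = final_wb.active

# One line of work per iteration: copy each header cell into final_ws.
for col in range(1, first_ws.max_column + 1):
    final_ws.cell(row=1, column=col).value = first_ws.cell(row=1, column=col).value
```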
Now comes the code’s central part. Since there are four nested for loops and we could get lost with all the different indentation levels, we will first present the code and then make some remarks about it.
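A sketch of those four nested loops is shown below. The tiny in-memory workbooks are illustrative stand-ins for the files loaded from disk, and the header row is assumed to be already in place in final_ws:

```python
from openpyxl import Workbook

def make_sample(data_rows):
    # Build a tiny stand-in workbook with a header row plus some data rows.
    wb = Workbook()
    ws = wb.active
    ws.append(["hour", "temperature"])
    for data_row in data_rows:
        ws.append(data_row)
    return wb

# Stand-ins for the workbooks loaded from disk earlier in the article.
wbs = [make_sample([[0, 20.1], [1, 19.8]]), make_sample([[0, 22.3]])]

final_wb = Workbook()
final_ws = final_wb.active
final_ws.append(["hour", "temperature"])   # header assumed already copied

# current_row controls the row of final_ws that receives the copied data.
current_row = 2
for wb in wbs:                                       # 1) every workbook in wbs
    for ws in wb.worksheets:                         # 2) every worksheet in wb
        for row in range(2, ws.max_row + 1):         # 3) every data row (skip header)
            for col in range(1, ws.max_column + 1):  # 4) every column
                final_ws.cell(row=current_row, column=col).value = \
                    ws.cell(row=row, column=col).value
            current_row += 1
```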
In the first for loop, we iterate over all the Workbook objects saved in the wbs list. For each wb (a common abbreviation for workbook), we iterate through all of its Worksheet objects (wb.worksheets returns a list of them). Finally, we iterate through all the worksheet rows and columns, retrieving each cell’s information and inserting it in the correct place in final_ws. The current_row variable controls the row in the final_ws worksheet that will receive the information currently being copied.
Our code’s last step is to save the final_wb workbook, creating a .xlsx file on our machine with the compiled data.
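That final step is a single call to the workbook's save method; the stand-in workbook and the file name below are illustrative:

```python
import os
from openpyxl import Workbook

# A stand-in consolidated workbook; in the article, final_wb already
# holds all the gathered data at this point.
final_wb = Workbook()
final_wb.active.append(["hour", "temperature"])

# Write the consolidated data to disk as a .xlsx file (name is illustrative).
final_wb.save("consolidated_data.xlsx")
```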
CONSOLIDATING OUR CODE INTO A FUNCTION
We can encapsulate the code presented so far in a single function, which takes as arguments the directory with the xlsx files to be compiled (workbooks_path) and the name of the new Excel file with the final gathered data (final_filename). Such a function is shown below. Note the inclusion of five conditional structures at the beginning of the function body: they check the type and format of the function arguments.
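The function might be organized as in the sketch below. The five argument checks are plausible reconstructions rather than the exact ones from the original code, and the function name gather_workbooks is assumed:

```python
import os
from openpyxl import Workbook, load_workbook

def gather_workbooks(workbooks_path, final_filename):
    """Gather every .xlsx file in workbooks_path into final_filename.

    Sketch of the article's function; names and checks are illustrative."""
    # Five conditional checks on the type and format of the arguments
    # (plausible reconstructions of the checks described in the text).
    if not isinstance(workbooks_path, str):
        raise TypeError("workbooks_path must be a string")
    if not isinstance(final_filename, str):
        raise TypeError("final_filename must be a string")
    if not os.path.isdir(workbooks_path):
        raise ValueError("workbooks_path must be an existing directory")
    if not final_filename.endswith(".xlsx"):
        raise ValueError("final_filename must end with .xlsx")
    if os.path.sep in final_filename:
        raise ValueError("final_filename must be a plain file name")

    # Load every non-temporary .xlsx file in the directory.
    wbs = [load_workbook(os.path.join(workbooks_path, f))
           for f in os.listdir(workbooks_path)
           if f.endswith(".xlsx") and not f.startswith("~$")]

    final_wb = Workbook()
    final_ws = final_wb.active
    first_ws = wbs[0].worksheets[0]

    # Copy the header row from the first worksheet.
    for col in range(1, first_ws.max_column + 1):
        final_ws.cell(row=1, column=col).value = first_ws.cell(row=1, column=col).value

    # Copy every data row from every worksheet of every workbook.
    current_row = 2
    for wb in wbs:
        for ws in wb.worksheets:
            for row in range(2, ws.max_row + 1):
                for col in range(1, ws.max_column + 1):
                    final_ws.cell(row=current_row, column=col).value = \
                        ws.cell(row=row, column=col).value
                current_row += 1

    final_wb.save(final_filename)

# Tiny demo directory so the sketch runs standalone.
os.makedirs("demo_dir", exist_ok=True)
src = Workbook()
src.active.append(["hour", "temperature"])
src.active.append([0, 21.0])
src.save(os.path.join("demo_dir", "2020-01-01.xlsx"))
gather_workbooks("demo_dir", "gathered.xlsx")
```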
I believe the presented function is general enough in scope and might help other people who need to solve problems similar to the one shown here.
Thank you so much for honoring my text with your reading.
P.S.: You will find more info about my work on LinkedIn, Medium, and Github: