Mundane Excel Tasks Made Easy With Pandas & Python Part I.

Umar Hussain
Ricoh Digital Services
Jan 8, 2020 · 5 min read

If I asked you what the most popular database and analytics tool in the world is, would you be surprised if I said Excel? From dashboards to Gantt charts, project management plans, or plain old data crunching, Excel does, and has done, it all. It is the bread and butter of many companies, both large and small.

However, like any tool, it has its limits and, quite often, its frustrations. Like that time you opened a report and for the life of you could not remember where it started or ended. You start to curse the person who created such a monstrosity, only to remember that the person who created it was you, six months ago.

I’ve been in this situation many a time and eventually started to wonder if there was a better way to do things. I have been burned by undocumented Excel reports one too many times.

My searches led me to Python, an all-purpose programming language, and most notably the popular data library Pandas (short for Panel Data). What drew me initially was the ability to make tasks instantly repeatable whilst maintaining reproducibility. The moment it all started to click together was when my manager at the time told me I had just saved him two days of manual work by using Python.

This may seem scary as it’s in code, but if a nested IF statement is your jam, then this shouldn’t be so alien. The things to note are the comments and the readability of the methods. My partner, who has never used Excel or coded before in her life, had a general understanding of what was happening from the syntax alone.

I’ve spent the last few months reading and sometimes answering questions on Stack Overflow under the Pandas tag, and I noticed a trend in the questions concerning Excel, so I’ve decided to create a series of posts addressing what I believe to be the most recurring themes.

Mundane Task 1 — Merging Multiple Excel Workbooks.

As any bright-eyed, fresh analyst, I’m sure you, like me, have been asked to compare some form of data from this year against last year. You fire up Excel, ready your SUMIFS, and ask your more experienced team member where you can grab the aggregated form of the data; surely it should be in SAP or some database? Nope. He points you to a directory on some server where the files are kept, each about 8 MB and containing around 6,000 rows, give or take. Your job now is to open each file and merge all the data into a single Excel workbook. Let’s do this with Python and Pandas. (If you don’t have Python or Pandas installed, read this excellent guide.)

First, we import our modules: pandas is the library we use for working with data of all types, whilst pathlib is a library for working with file systems (note that you’ll need Python 3.4+ to use it).
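In code, that first step might look like this (a minimal sketch, since the original snippet isn’t reproduced here):

# pandas for working with tabular data, pathlib for working with the file system
import pandas as pd
from pathlib import Path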

We set our directory to the folder holding our Excel files and create an empty list called excel_files — imagine this to be an empty folder where we store the locations of the Excel files. (Python doesn’t know where or what the Excel files are at this point.)

We then loop over our folder, returning any and all Excel files. Glob is a general term for techniques that match file paths against a specified pattern, according to rules borrowed from the Unix shell. What this means is that "*.xlsx" will match every Excel file in our folder.

The .append method adds each file path to our excel_files list above, much like a simple copy & paste.
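Putting those last three steps together, a sketch along these lines should work; the folder path below is a placeholder you would swap for your own directory:

# point pathlib at the folder holding the workbooks (placeholder path)
folder = Path("C:/reports/2019")

# empty list that will hold the location of each Excel file
excel_files = []

# "*.xlsx" matches every Excel workbook in the folder
for file in folder.glob("*.xlsx"):
    excel_files.append(file)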

The result will be a list of file paths — printing excel_files shows the full path to each workbook.

As our original folder had 52 files, we want to make sure we got them all. len returns the number of items in our list, which is 52. Great.
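The check itself is a one-liner (the 52 here simply reflects the folder described above):

# sanity check: the folder held 52 workbooks, so the list should too
print(len(excel_files))  # 52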

Now, here is the magic: what would be a long, boring, error-prone manual task can be done in mere seconds (depending on the size of each Excel file).

excel_dataframes = [] is our empty list, which will store each Excel file once it has been read into a pandas dataframe.

A pandas dataframe is a two-dimensional object, much like a single Excel sheet with columns and rows (see the small example below).
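For illustration only, here is a tiny made-up dataframe; the column names and values are invented:

# a two-column, two-row dataframe, just to show the sheet-like shape
example = pd.DataFrame({"region": ["North", "South"], "sales": [1200, 950]})
print(example)
#   region  sales
# 0  North   1200
# 1  South    950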

We then start our for loop with for file in excel_files: and read each file into a pandas dataframe with pd.read_excel(). We pass the argument file, which is the individual item from our for loop, and a further argument sheet_name="sheet1", which specifies the sheet we want to grab. (If your sheet names differ from file to file, I will deal with that in a follow-up post.)

You can also pass numbers into the sheet_name argument: if your data is always on the first sheet but the tab names vary, you can set sheet_name=0.
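A sketch of that loop; the name df_single for the intermediate dataframe is my own choice, not something the post prescribes:

excel_dataframes = []  # will hold one dataframe per workbook

for file in excel_files:
    # sheet_name="sheet1" grabs the tab with that exact name;
    # use sheet_name=0 to grab the first tab whatever it is called
    df_single = pd.read_excel(file, sheet_name="sheet1")
    excel_dataframes.append(df_single)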

Finally, we concatenate our dataframes into a single dataframe called df — df is a standard naming convention for dataframes, but it’s always better to be explicit with your variables; remember you will come back and revisit them one day.

A note of caution when using concat: the function aligns data along common columns, which means the column names must match across files; otherwise you’ll end up with a mismatched dataframe.
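The concatenation itself is one line; ignore_index=True is an optional extra I’ve added so the merged rows are renumbered from zero:

# stack every workbook's rows into a single dataframe
df = pd.concat(excel_dataframes, ignore_index=True)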

Our final method writes our merged data out to a single Excel file called master_2019.
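Something like the line below does the job; the exact filename and the index=False flag (which stops pandas writing the row numbers as an extra column) are my assumptions:

# write the merged data out to one workbook
df.to_excel("master_2019.xlsx", sheet_name="sheet1", index=False)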

This is a lot to take in, so well done for following along so far. The beauty of this script is that you only need to change three things to reuse it elsewhere (gathered into one place in the sketch after this list):

  • The location of the source Excel files
  • The sheet name
  • The location where you want to save your file
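If it helps, those three settings can sit together at the top of the script, so reusing it only ever means editing these lines (the values shown are placeholders):

# the only lines you should need to edit when reusing the script
source_folder = Path("C:/reports/2019")   # where the workbooks live
sheet = "sheet1"                          # the tab to read from each file
output_file = "master_2019.xlsx"          # where to save the merged result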

The full script is available on this repo, with some slight modifications to handle errors, such as Excel files that don’t have the sheet name you want to use.
