CODEX

A Quick and (hopefully) Clean Way to Reorganize Excel Data with Python

A Hands-On Guide to Filtering and Separating Excel Data into Different Worksheets in a Single Workbook with Python

Peter Kane

Published in

CodeX

9 min readFeb 23, 2021

Using Python codes to filter Excel data into multiple worksheets in one Excel workbook — Photo by author

An Urgent Request!

Recently I got an urgent request from a friend to help her work on an Excel file. She needed to separate the data into different sheets and save everything in a single excel workbook. Each sheet's data would come from the original file, filtered according to each unique element in a certain column. Thus each sheet name would come from each unique item in that column.

My friend needed to separate the data into different worksheets according to unique items in a column called “**Customer Class Description**.” All must be saved in a single workbook.

My friend’s request sounds simple enough. However, I’m not an Excel expert. After making it clear to her that she shouldn’t expect much from me, I proceeded to search for solutions on Google (and wondered she might have already done the same.)

Highly-customized problems require highly-flexible soutions.
— author

I came across a solution with some lines of VBA code. I promptly tested it and, superficially, found it to be satisfying. I worked on the given Excel file with that VBA code and gave the modified file to her the same morning she requested.

In the evening of the same day, I decided to be a more precise person than myself in the morning and inspected the resulting .xltm file (a macro-enabled Excel file) again. I found that there were some empty sheets without names and data. I immediately contacted my friend and, with all her niceness and thankfulness, she admitted that she found the problem, too, but was reluctant to ask me for another help.

As a Python practitioner and a data science apprentice, I took it as my responsibility to thoroughly satisfy her need by correcting those errors, hoping that her boss hadn’t already scolded her for giving him a defective Excel workbook.

No VBA this time, I determined to solve the problem with Python. And I’m going to show you how to do it.

Notice I: For demonstration purposes in this article, I modified the original data to protect the customers' privacy whose names were in the original Excel file. However, the modified file’s data structure and cleanliness were still pristine.

Notice II: This article is based on Python 3.9.1, Jupyter Notebook 6.1.6, and Microsoft 365’s Excel, run on a 64-bit Windows 10 machine.

The Hands-On Part

To follow along, you can download the mock-up dataset used in this project here: https://github.com/pkx8326/python-excel-splitsheets

Data Exploration

Pandas is a prerequisite library if we are to work with any tabular data in Python. After copying the data in the same folder we are running our Python session from, we can import the pandas library and use it to import our excel data in the following code:

import pandas as pddf = pd.read_excel("mockupData.xlsx")

I explicitly named the data “mockupData.xlsx” and made sure that I cut and masked all the ties that will link to my friend’s company’s real-world customers. I assigned a data frame named “df” to contain data from mockupData.xlsx.

Let’s explore the data with the following code:

df.info()<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9039 entries, 0 to 9038
Data columns (total 6 columns):
 #   Column                      Non-Null Count  Dtype 
---  ------                      --------------  ----- 
 0   Province                    9039 non-null   object
 1   Customer Name               9039 non-null   object
 2   Customer Class Description  9039 non-null   object
 3   Item Description            9039 non-null   object
 4   SumOfQty Sold               9039 non-null   int64 
 5   Sales AMT                   9039 non-null   object
dtypes: int64(1), object(5)
memory usage: 423.8+ KB

The tabula data contains 9039 rows and 6 columns. The column that my friend needed to use as a filter to separate the data into individual worksheets is called “Customer Class Description.”

Let’s explore the data a bit further:

df.head(20)

The result of **df.head(20)** — Photo by author

What about the records’ ending?

df.tail(20)

The result of **df.tail(20)** — Photo by author

We can now see that the target column contains mixed string categories. We will use those categories as unique worksheet names in the Excel file we are about to create. We will also filter data according to each category in each worksheet.

Let’s have a look at the list of all the unique categories in the target column:

df[“Customer Class Description”].unique()

The result of **df[“Customer Class Description”].unique()** — Photo by author

There are “many” unique items in that column. How many?

df["Customer Class Description"].nunique()>>>107

In total, there are 107 unique items in the column, and we also seem to have some special characters such as “&” or “/” in some of the items.

Not only those special characters, after scanning the list (array,) one may find that there are items with the same name, but different character cases, such as “Automotive OEM” and “AUTOMOTIVE OEM,” and some of the items contain quite long strings.

Since we will name our worksheets according to these items, any of the above observations' traits may cause a problem when we save the new Excel file.

Excel Worksheet Naming Rules:

A worksheet must have a name (the name tab can’t be blank.)
Each worksheet in the same workbook must have a unique name.
The name can’t be longer than 31 characters.
The “forbidden” characters are / \ ? * : [ ]

The above list is also the rules for our Python code to manage and extract unique elements from the target column and use them as names for the worksheets in our new Excel file.

Data Preparation

The only column we should focus on is the one named “Customer Class Description.” We will use this column as a filter to separate data into worksheets, name them after each unique item in that column—all in a single new workbook. We will not create a copy of the dataframe; we will directly modify the items in our column of interest.

Replacing the “forbidden” characters with “_”: According to the worksheet naming rules, it appears that “&” is OK, but the other “forbidden” characters still need to be replaced. We will replace them with “_.”

Because there are many unique items in our focused column, too many for the comfort of thorough manual picking, we will tell Python to replace any of the forbidden characters with “_” if Python finds any of them in any item in that column with the following code:

for i in df.index:
    if ":" or "\\" or "?" or "*" or "[" or "]" in df["Customer Class Description"].iloc[i]:
        df["Customer Class Description"].iloc[i] = df["Customer Class Description"].iloc[i].replace(":", "_").replace("\\", "_").replace("/", "_").replace("?", "_").replace("*", "_").replace("[", "_").replace("]", "_")

With “for i in df.index:,” we tell Python to iterate through every item in the column with index i. Python knows this with the .iloc[i] segment.

We also take advantage of Python’s ability to chain methods together for our convenience. Here, we chain the .replace() string method to replace all of the forbidden characters with “_” in every column item.

Truncating items with strings longer than 31 characters: Now, another point to consider is that a worksheet’s name can’t be longer than 31 characters. We can fix this quickly with the following code:

for i in df.index:
    if len(df["Customer Class Description"].iloc[i]) > 31:
        df["Customer Class Description"].iloc[i] = df["Customer Class Description"].iloc[i][:31]

The [:30] segment at the end of the code above explicitly tells Python to choose the strings from the first character to the 30th character the current item and discard the rest. In total, we will have 31 characters in each item (from the 0th to the 30th character.)

Eliminating Repeating Item Names With Different Character Cases: This is the trickiest part but with a very simple solution.

From the entire list (array) of the column’s unique items, we can initially see that at least there are “Automotive OEM” and “AUTOMOTIVE OEM.” So far, we don’t know if there is anything else such as “automotive oem” (all with lower case characters) or extreme cases such as “AuToMoTiVe OeM” (with alternating cases) or “automotiveoem” (without any space.) And not just this item; the other items have equal rights to have their own variations.

Assuming that we can safely forget about the “extreme cases,” we can “normalize” all the items, in this case, by either making all of them to be in upper or lower case. I chose to make them all in the upper case with the following code:

for i in df.index:
    df["Customer Class Description"].iloc[i] = df["Customer Class Description"].iloc[i].upper()

Now, there is only one variation for each item. In the case of automotive oem, we have only “AUTOMOTIVE OEM.”

After making all the items in the column upper case, we should now have 106 unique items, assuming that only “Automotive OEM” is the only repeating item. Let’s check this assumption with the following code:

df["Customer Class Description"].nunique()>>>106

We are on the money! Let’s have a quick look at our modified list (array) of items in the column:

df["Customer Class Description"].unique()

The new result of **df[“Customer Class Description”].unique()** — Photo by author

It seems that all items in the column are in the upper case, and there are “_” here and there, where the “forbidden” characters used to be.

Saving a New Excel File

Now it is safe to use this column as the filter for our new Excel workbook. To proceed, we will have to filter the data according to each unique item in the filter column and save them as individual worksheets in a new excel workbook. We can accomplish this task with the following code:

writer = pd.ExcelWriter("Sorted_by_Sheets.xlsx")for i in df["Customer Class Description"].unique()
    df[df["Customer Class Description"].str.contains(i)].to_excel(writer, sheet_name = i, index = False)writer.save()

The resulting file named “Sorted_by_Sheets.xlsx” will feature the following data structure and sheets (more sheets are hidden from view):

The final result, an **Excel** workbook with data separate in each worksheet according to each unique item in the “**Customer Class Description**” column — Photo by author

Tidying It All Up

We can combine all the steps above into a single function. After making sure that the original data file is in the same folder that you’re running Python from, you can use the following function:

def sort_by_sheet_excel(inputFileName, byColumn, outputFileName):
    import pandas as pd
    
    readFileName = inputFileName
    df = pd.read_excel(readFileName)
    
    #Eliminate unwatned characters:
    for i in df.index:
        if ":" or "\\" or "?" or "*" or "[" or "]" in df[byColumn].iloc[i]:
            df[byColumn].iloc[i] = df[byColumn].iloc[i].replace(":", "_").replace("\\", "_").replace("/", "_").replace("?", "_").replace("*", "_").replace("[", "_").replace("]", "_")
            
    
    
    #Eliminate records that has more than 31 characters:
    for i in df.index:
        if len(df[byColumn].iloc[i]) > 31:
            df[byColumn].iloc[i] = df[byColumn].iloc[i][:30]
            
    
    #Eliminate same record names but with different cases:
    for i in df.index:
        df[byColumn].iloc[i] = df[byColumn].iloc[i].upper()#Write to excel file:
    writer = pd.ExcelWriter(outputFileName)
    
    for i in df[byColumn].unique():
        df[df[byColumn].str.contains(i)].to_excel(writer, sheet_name = i, index = False)writer.save()
    
    newFile = pd.ExcelFile(outputFileName)
    print("An excel file named {} is created with {} unique sheets.".format(outputFileName, len(newFile.sheet_names)))

With sort_by_sheet_excel, you give the original filename, the name of the column of interest, and the intended output file name, and you are all set!

sort_by_sheet_excel("mockupData.xlsx", "Customer Class Description", "sorted_by_sheets_1.xlsx")>>>An excel file named sorted_by_sheets_1.xlsx is created with 106 unique sheets.

After inspecting the resulting .xlsx file, I found no “empty sheets” like the result I got using a VBA code I found by googling. Three cheers for Python!

Conclusion

I’ve shown you that we can separate filtered data into sheets in a single Excel workbook with Python coding. In my case, the resulting .xlsx file is not error-ridden like the file I obtained with VBA coding. However, this article is not to say that Python is better than VBA. Highly-customized problems require highly-flexible solutions. VBA or Python, coding is the right way to go. It all comes down to what you know you can do better with.

Since we have been talking about Excel here, have you ever wondered what are the differences between Excel and SQL? Check out this article to learn more.

Until next time, happy coding!