Excel Data Manipulation using VBA and Python- color rows

Pratish Mashankar
6 min readJan 30, 2024

--

Since its inception, MS Excel and its many clones have been a powerful data management and manipulation tool. Imagine leveraging automation and execution with simple code to an already mammoth interface. Ideation meets innovation when Visual Basic for Applications (VBA) and Python programming armor MS Excel with untapped potential which is often overlooked by analysts.

As a part-time project management analyst at George Mason University, I was asked to label around 180 rows into two classes by referencing an online database. Little did I know that each row had multiple entries, each with a different label, and the task if done manually could eat into my entire weekend. In this blog, I detail how I tackled the problem reducing the workload by at least 70%. The problem statement is detailed below.

Photo by Campaign Creators on Unsplash

Due to security concerns, I cannot share the original data files or their details, instead, I have created a dummy data to classify a Professor as tenured.

Problem statement

Given an Excel sheet of 183 rows with three columns:

  1. Email Column: Semi-colon separated list of professors on each row
  2. Department Column: The Department which the professors belong to
  3. Tenured: Whether a professor is tenured or not. Given blank.

Originally, 30% of the rows have been colored yellow to mark all those professors as Tenured. The remaining 70% need to be classified. Here is a sample data:

4 example records. Email column consisting of one or more emails.

Tackling the problem

Step 1: VBA for colored cells

The first step is pretty simple, all the yellow records need to be classified as Tenured. We will be using VBA as Excel does not have a function to directly detect the cell color. We follow the below steps:

  1. Press ‘ALT’ + ‘F11’ to open the Visual Basic for Applications (VBA) editor.
  2. In the editor, click on ‘Insert’ in the menu, and then choose Module. This will open a new module window.
  3. Copy and paste the following VBA code into the module window:
Function IsCellColoredYellow(rng As Range) As Boolean
IsCellColoredYellow = (rng.Interior.Color = RGB(255, 255, 0))
End Function

You can change the hex code for whichever color the cell belongs to. We can use the IsCellColoredYellow as a function on our cell. Click on the C2 cell in the Excel and paste the below Excel function code

=IF(IsCellColoredYellow(A1), "Tenure", "")

Drag down for all records. We thus check if the cell color of the A1 cell is yellow (RGB(255, 255, 0)), if true we set C1 as Tenure and otherwise leave it as blank.

The output looks like this:

Step 2: Python for rearranging the data

Our next step will be to split the semi-colon-separated emails into multiple rows. I am using a Jupyter notebook to run the Python codes which will help me visualize my data.

The below code reads the Excel file into a dataframe using the Python Pandas library and prints the first five rows.

import pandas as pd

# Read your Excel file into a DataFrame
df = pd.read_excel('tenured_data.xlsx')
df.head()

The output looks like this

We run the below code to perform the split.

# Method 1: For loop
# Create an empty DataFrame to store the new rows
new_rows = pd.DataFrame(columns=df.columns)

# Iterate through each row of the original DataFrame
for index, row in df.iterrows():
# Check if 'Email' is a string
if isinstance(row['Email'], str):
# Split the emails in the 'Email' column by semicolon
emails = row['Email'].split(';')

# Create a new row for each email
for email in emails:
new_row = row.copy()
new_row['Email'] = email.strip()
new_rows = new_rows.append(new_row, ignore_index=True)
else:
# If 'Email' is not a string, add the row as is
new_rows = new_rows.append(row, ignore_index=True)

There is an easier way to do this using Python lambda functions, the code for which is given below:

# Method 2: Lamda and Explode
# Split the 'Email' column into multiple columns
new_rows['Email'] = new_rows['Email'].apply(lambda x: x.split(';'))

# Explode the 'Email' column to create rows for each email
new_rows = new_rows.explode('Email').reset_index(drop=True)

We run the below code to delete any rows with empty emails. We save the dataframe into a new Excel file

# Drop rows with empty strings in the 'Email' column
new_rows = new_rows[new_rows['Email'].astype(str).str.strip() != '']

# Reset the index of the resulting DataFrame
new_rows = new_rows.reset_index(drop=True)

# Save the new data
new_rows.to_excel("python_edited.xlsx", index=False)

We have thus converted this data:

To this:

Step 3: Data Validation for Tenure column

The final steps will involve populating the Tenure column for all the other emails. I first set a data validation for the Tenure column to allow only two values — Tenure and Not Tenure. Follow these steps to achieve the drop-down option for the Tenure column:

  1. Select the column where you want to apply this restriction.
  2. Go to the “Data” tab on the Excel ribbon.
  3. Click on “Data Validation” in the “Data Tools” group.
  4. In the Data Validation dialog box, go to the “Settings” tab.
  5. Choose “List” from the “Allow” dropdown.
  6. In the “Source” field, enter the options separated by a comma: ‘Tenure, Not Tenure’.
  7. Click “OK” to apply the data validation.

Now, the selected column will only allow entries that match “Tenure” or “Not Tenure” or be left blank. You will see a dropdown list with these options when you click on a cell in that column.

Step 4: Create a username column

Moreover, I had access to the university’s people finder which could give me the details about a professor based on their username (sam in sam@uni.edu). To retrieve the usernames, I wrote the following Excel formula in a new username column and dragged it to all my columns

=LEFT(A1, FIND("@", A1) - 1)

Here’s what this formula does:

  1. FIND(“@”, A1) finds the position of the "@" symbol in the email address in cell A1.
  2. LEFT(A1, FIND(“@”, A1) — 1) extracts the left part of the email address up to the position of the "@" symbol, excluding the "@" symbol itself.

The result looked like this:

I can now use the Usernames to query another database and obtain information about the Professor being Tenure or not. In my original problem, I split around 180 rows into 440 rows of which nearly half were pre-labelled because of this approach. Neither did I have to manually separate the emails nor search for all the emails thus reducing my load by nearly 70%.

We have reached the end of another exciting blog! This should give you a rough idea of how we can use VBA, Python, and Excel formulae to tackle business problems.

You can connect with me on Linkedin to talk about Python and Machine Learning concepts. Check out my GitHub for projects.

--

--

Pratish Mashankar

Tech enthusiast, educates with fervor. Master's in Computer Science. Innovates data solutions. Passion for teaching, writing.