Scraping Excel Data with Python

Tanya Shapiro
5 min read · Dec 9, 2021


Wranglers in The Data Wild Wild West

In a perfect world, data is clean, structured, and all in one place. That’s every analyst’s dream! But, like most things in life, data is messy and unpredictable. A good data analyst knows how to interpret data. A great data analyst knows how to tame and wrangle data too.

Wrangling data is just as tough as wrangling Kristen Wiig!

Wrangling Data

As an analyst who’s lived in the wild wild west of corralling data, I wanted to share one of the Python data wrangling methods I’ve used in the past. While the use case scenario might be very niche, I hope the basics can offer some insight to others facing similar challenges.

If you prefer to get right to the code, I have a full step-by-step tutorial available on my GitHub.

Excel, Excel Everywhere!

One of the most challenging data wrangling tasks I encountered was dealing with manual data stored in disparate Excel files. Hundreds of Excel files sat in one directory, and each file contained the same set of information for a single customer. The mission at hand: coalesce the information from the hundreds of Excel files and create a data frame with customer data.

My sentiments on Excel are well summarized by the image below -

The task felt daunting at first. To make it more manageable, I broke the problem into smaller pieces:

  1. Understand the format of the Excel files with the data
  2. Find a way to read the data in Excel and extract the relevant data points
  3. Create a process to iterate through all Excel files (rinse & repeat)

Step 1. Understanding The Data Format

To understand the data format, I took a look at one of the Excel files I would need to scrape. For the purpose of this tutorial, I will use a mock file with dummy data and a similar structure to illustrate.

Example file (dummy data only):

Each file contained information laid out in two columns: the first column contained an attribute name, e.g. First Name, Sex, and the neighboring column contained the output value, e.g. John, Male.

Upon review of 5–10 other files, I noticed that the attribute name was not always in the same row index, e.g. First Name was sometimes in Row 4, and in other files in Row 5. I would need to account for this in my code (index is unknown!).
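
To make the layout concrete, here is a rough sketch of the two columns represented as plain Python lists. The values and the blank leading cells are made up for illustration; the point is that the row position of each attribute varies from file to file:

#hypothetical contents of the attribute column and its neighboring value column
#note the blank leading cells -- the row index of 'First Name' is not fixed
attribute_column = ['', '', '', 'First Name', 'Last Name', 'Sex', 'City', 'State']
value_column = ['', '', '', 'Grace', 'Hopper', 'Female', 'New York', 'NY']

#the attribute's position tells us where to look in the value column
row_index = attribute_column.index('First Name')
print(value_column[row_index])  #Grace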

Step 2. Reading & Extracting Data

Importing Packages

For this task, I used three packages: os to navigate the directory of files, xlrd to do most of the heavy lifting of opening, reading, and extracting data from Excel, and pandas to produce the final data frame at the end.

#os to traverse directories, pandas for the final data frame, xlrd to read Excel
#note: xlrd 2.0+ only reads .xls files, so .xlsx files require an older xlrd (e.g. 1.2.0)
import os
import pandas as pd
import xlrd

Getting Started

Before diving into loops and iterating through files, I started by testing the task on a single Excel file. Better to start small!

#Select the directory where the Excel files are located
os.chdir('./Files')
#open Excel workbook with xlrd
wb = xlrd.open_workbook('Form - Hopper, Grace.xlsx')
print(wb)

Xlrd returns an Excel workbook object. To tap into the data, we first need to specify the worksheet within the workbook, and then the method for pulling values (by row or by column).

After going through the xlrd documentation I noticed there are two ways to specify worksheets within a workbook: by index or by sheet name. The latter is ideal when workbooks contain several sheets in an unpredictable order, but the sheet we need always has the same name.

#To get sheet by index (starts with 0)
ws = wb.sheet_by_index(0)
#To get sheet by name
ws = wb.sheet_by_name('Sheet1')
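
If you're not sure which approach fits your files, xlrd can also tell you what sheets a workbook contains. A quick check, using the wb object opened earlier:

#list the sheet names in the workbook
print(wb.sheet_names())
#count of sheets
print(wb.nsheets)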

Extracting The Data

The next step is to extract data from the worksheet. There are two methods: we can either extract data by row index with row_values or by column index with col_values.

Since the data in this use case is stored by column, we will use col_values.

#get list of values in row
ws.row_values(2)
#get list of values in column
ws.col_values(0)

col_values returns a list of values. We can then use basic Python to index and slice data within the list. Since we need to repeat this process for each attribute, it is easier to wrap the logic in a function; we'll call it get_value(). We will set the parameters to include the worksheet object, the column where the attribute names are stored, and the name of the attribute we want.

def get_value(worksheet, attribute_column, attribute_name):
    attributes = worksheet.col_values(attribute_column)
    if attribute_name in attributes:
        attribute_index = attributes.index(attribute_name)
        #assume value is in the adjacent column where attribute is stored
        values = worksheet.col_values(attribute_column+1)
        value = values[attribute_index]
        return value
    else:
        return None

get_value(ws,0,'First Name')
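
For the mock file opened above, a few extra calls show the behavior. The return values noted in the comments assume the dummy data sketched in Step 1, and an attribute that isn't present in the sheet simply comes back as None:

#pull a couple of attributes from the same worksheet
get_value(ws, 0, 'Last Name')       #'Hopper'
get_value(ws, 0, 'State')           #'NY'
#an attribute missing from the sheet returns None
get_value(ws, 0, 'Favorite Color')  #None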

Step 3. Rinse & Repeat

The previous steps are all in preparation for the grand finale. We started off small by scraping data in one file, now we need to repeat the process for all files. To iterate through the files, we’ll wrap our code in a for loop.

The os.walk() function will help us traverse all directories and files. Before we get into reading and extracting data with xlrd, we will also initialize a dictionary to store the scraped data. The dictionary's keys will be our target attributes (the attribute names we want to scrape), each mapped to an empty list (where we will append the corresponding output values).

The empty dictionary will look something like this:

{'First Name': [], 'Last Name': [], 'Sex': []}

At the end, our dictionary should look something like this:

{'First Name': ['John', 'Grace'], 'Last Name': ['Smith', 'Hopper'], 'Sex': ['Male', 'Female']}

The full for loop is below:

attributes = ['First Name', 'Last Name', 'Sex', 'City', 'State']
#initialize dictionary, create an empty list per attribute with a dict comprehension
data = {attribute: [] for attribute in attributes}
#add a key:value for File, will use this as unique identifier/index
data.update({"File": []})

for root, dirs, files in os.walk('.'):
    for file in files:
        #skip anything that isn't an Excel file (e.g. hidden or temporary files)
        if not file.endswith('.xlsx'):
            continue
        wb = xlrd.open_workbook(os.path.join(root, file))
        ws = wb.sheet_by_index(0)
        data['File'].append(file)
        for attribute in attributes:
            data[attribute].append(get_value(ws, 0, attribute))

data

Once we have the dictionary with our list of values per file, we can use pandas to convert it to a data frame and save it back to Excel.

df = pd.DataFrame.from_dict(data)
df.to_excel("Scraped_Data.xlsx", sheet_name="Sheet1")

Example output in Jupyter Notebook
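
Since we added the File key as a unique identifier, you can optionally promote it to the data frame's index before saving. A small optional tweak, not part of the output shown above:

#use the source file name as the index instead of the default integer index
df = df.set_index('File')
df.to_excel("Scraped_Data.xlsx", sheet_name="Sheet1")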

Voila, we’ve successfully wrangled our data into a structured table!

Closing Thoughts

Data wrangling is not the sexiest part of analytics (we don’t thank our data engineers enough), but oftentimes it’s a necessary part of the data science process. And if it feels like a herculean task, break it down into smaller, bite-size pieces. For me personally, it’s easier to tackle it piecemeal and put it in logical terms before I even think about the code. When in doubt, start with the “what” then back into the “how.”
