Taming a Herd of Spreadsheets

Apascale
Nerd For Tech
Published in
3 min readApr 19, 2021

An introduction to productionalized ETL (Extract, Transform, Load) processes using the OS library.

Tired of copying and pasting data from multiple different spreadsheets into one Excel doc? Wish there was a way to automate this process in bulk? Fear not! A simple combination of pandas DataFrames and the OS library can save HOURS of your time, not to mention reduce human error as a result of manual data entry.

The Python OS Module

Python’s OS library allows developers to interact with their operating system in many of the same ways as using a command line. It provides many useful functions to create and remove folders, retrieve their contents, and navigate a system’s file directory. For the purposes of this example, we’ll be using the following commands:

import pandas as pd
import numpy as np
import os
# Generate a list of the files in a folder
os.listdir('FOLDER_NAME')
# Read in an Excel sheet to a pandas dataframe
pd.read_excel('EXCEL_FILE', sheet_name = 'SHEET_NAME')
# Merge pandas dataframes
df1.merge(df2, how='JOIN', on='COLUMN_NAME')
# Create a new folder
os.mkdir('FOLDER_NAME')
# Write dataframe to csv
merged_df.to_csv('FILE_NAME.csv')

Before we get started, it’s important to note that these commands may occasionally throw an OSError or FileNotFoundError, so you might want to read up on your Try/Except statements to be on the safe side!

Sample Retail Dataset

For this walkthrough, I’ll be using a few spreadsheets from a sample retail dataset on Kaggle. First, read in all the files in the current directory to a list.

files = os.listdir('.')

Then, import each excel file to its own dataframe. If no sheetname is specified, pandas will import the first sheet. You can reference sheets by index (starting with 0), or by the name of the sheet (‘Sheet 1’).

stores = pd.read_excel('stores.xlsx', sheet_name=0)
sales = pd.read_excel('sales.xlsx', sheet_name=0)

Next, we’re going to merge the dataframes on the ‘Store’ column. Similar to SQL, you have to specify which type of join (left, right, outer, inner) and the column name on which to join.

df_merge = sales.merge(stores, how='left', on='Store')

Now we’re going to parse our merged data by store, and save a new spreadsheet for each individual store. First, let’s define a function to return one dataframe per store number passed as a parameter.

def pull_store(df, store_no):
return df[df['Store']==store_no]

Let’s also create an “Output” folder to save all these new files.

os.mkdir('output')

Now, we just need to loop through the Store numbers, saving a new .csv file to the output folder for each grouping.

for store in df_merge['Store'].unique():
store_df = pull_store(df_merge, store)
store_df.to_csv('output/store_' + str(store) + '.csv')

That’s all! This loop should have generated 45 new .csv files in the “Output” folder you just created. Nice work!

These scripts can get really complicated, especially when you have multiple sheets per Excel file, or need to map file names to standard naming conventions.

Here’s an example of a full ETL script for grocery store sales and their corresponding PLU codes across 10 cities:

--

--