How to extract information from your excel sheet using Python

Oluwabukunmi Ige
Analytics Vidhya
Published in
5 min readAug 31, 2019
Image by Andrian Valeanu from Pixabay

One of the fascinating things about programming is that with a few lines of code you can get your computer to carry out a task that would have been otherwise mundane and annoying for you to do on your own. One of those mundane tasks is extracting information from a large excel sheet. The Python programming language is very robust, and one of the areas where it shines is helping us automate tedious and repetitive Excel tasks.

In this blog post, we will be embarking on a step by step process to extract some valuable information from an excel sheet. The excel sheet we will be using is the sheet that contains all the fruit sales of a supermarket for a month. Each row contains individual records of fruits purchased by customers. There are three columns in the excel sheet. Column A gives the cost per pound of the purchased fruit, Column B gives the Pounds sold, and Column C gives us the total cost of the purchase. The excel sheet has 23758 rows and four columns. You can download the excel sheet here.

Our goal is to find out and document the total pounds sold total sales and the entire purchase instances for each fruit in that month. You can imagine the frustration of having to go through 23758 rows to extract information about each fruit, well we are in luck as Python would help us to complete this task in no time. The steps below will give an in-depth and practical explanation of how you can use Python to complete this task.

Before we get on to this task, I want to assume that you have a basic knowledge of writing code in Python and that you have the Python installed on your computer.

Install the openpyxl Module
The python module we will be working with is the OpenPyXL Module. The OpenPyXL Module is a library that allows you to use Python to read and write excel files or files with the .xlsx/xlsm/xltx/xltm extension. If you don’t have it installed on your IDE, you can install it using

pip install openpyxl

To test if you successfully installed it, import it using

Import openpyxl

So far no error is returned you have installed the OpenPyXL Module and are ready to work on some excel documents.

Read in and open the Excel document with OpenPyXL
The next port of call is to read in the excel sheet into our Python environment. Make sure the Excel you will be working with is in your current working directory(CWD). You can access your CWD using:

import os 
os.getcwd()
os.chdir() #this changes our CWD, if the excel sheet is not in CWD

What if the excel sheet is not in your CWD? You can copy the file and paste it into your CWD, so you access it from there. Once we are sure we have our Excel document in our CWD, we can now read it in.
After reading in the Excel document, we can now access it to obtain various information about the excel sheet.

import pandas as pd 
file = ‘produceSales.xlsx’
data = pd.ExcelFile(file)
print(data.sheet_names) #this returns the all the sheets in the excel file
[‘Sheet1’]

Next, we parse the sheet we will be working with into a data frame, this will enable us to know if our excel sheet was correctly read in.

df = data.parse(‘Sheet1’)
df.info
df.head(10)
This image shows the first ten rows of our sheet.

Read in the spreadsheet data
The next step is to read in data from the spreadsheet [Sheet1].

ps = openpyxl.load_workbook(‘produceSales.xlsx’)sheet = ps[‘Sheet1’]sheet.max_row 
#returns the total number of rows in the sheet
23758

Next, we use a For loop to iterate over all the rows in the sheet.

for row in range(2, sheet.max_row + 1):# each row in the spreadsheet represents information for a particular purchase.produce = sheet[‘B’ + str(row)].valuecost_per_pound = sheet[‘C’ + str(row)].valuepounds_sold = sheet[‘D’ + str(row)].valuetotal_sales = sheet[‘E’ + str(row)].value# the first column is B followed by C and so on.# Each value in a cell is represented by a column letter and a row number. So #the first element in the sheet is B1, next column C1 and so on. This enables #to iterate over the entire cells.

Create an empty dictionary that contains all the information on each fruit. We then use the set.default() method to fill the first set of elements into the dictionary. set.default() first argument checks if the key exists in the dictionary, if it doesn’t it replaces it with the second argument. That way, we can start populating our dictionary with the second argument of the set.default function.

TotalInfo.setdefault(produce,{‘Total_cost_per_pound’: 0,
‘Total_pounds_sold’: 0,
‘Total_sales’: 0,
‘Total_Purchase_Instances’: 0})
# so with this set default method, we have set all metrics we want to collect to zero. When we are iterating, we start from Zero and add new iterated to the dictionary. The key of the dictionary is the fruit which is mapped to their various metrics.

Finally, we populate the dictionary. For each new produce seen in a new row, we increase the metric by its corresponding value in the new row.

# Each row represents a fruit, so increment by the new corresponding values. 

TotalInfo[produce][‘Total_cost_per_pound’] += float(cost_per_pound)
TotalInfo[produce][‘Total_pounds_sold’] += int(pounds_sold)

TotalInfo[produce][‘Total_sales’] += int(total_sales)
# Each row represents a fruit, so increment by one.

TotalInfo[produce][‘Total_Purchase_Instances’] += 1

After running this code block, we would have populated the TotalInfo dictionary with all the various metrics for each fruit for the month. The populated dictionary looks like this:

‘Apples’: {‘Total_Purchase_Instances’: 627,
‘Total_cost_per_pound’: 1178.7600000000068,
‘Total_pounds_sold’: 12119,
‘Total_sales’: 22999},

Write the Results to a File
After populating the TotalInfo dictionary. We can write this populated dictionary to any file of our choice be it a .csv, .txt, .py et al. We will be using the pprint.pformat module to pretty print our dictionary’s values and we use python’s write mode to write the dictionary’s values to the file. The code snippet below gives an illustration:

resultFile = open(‘Total_info.txt’, ‘w’)
resultFile.write(pprint.pformat(TotalInfo))
resultFile.close()
print(‘Done.’)

The Total_info.txt file will be found in your CWD.
You can always change the file format by changing the .txt extension to whatever file format you want.
The code snippet below shows how you can change to a .csv file format.

 Open (‘Total_info.csv’, ‘w’)

Conclusion
In this blog post, we demonstrated how we could use Python to extract information from an excel sheet. Knowing how to obtain information from an excel sheet is always a welcome addition to your toolbox as it saves you a lot of time from carrying out repetitive tasks. Feel free to modify the code in the article to suit your needs; you can access the notebook that contains the end to end code that was used in this blog post here.

Happy Pythoning.

--

--