
Data Analysis Project: Using Python to Clean and Analyze Data I

Welcome back everyone 🌺, and to those who are new to my page, um… hello? *quirks an eyebrow*

In this project, we’ll use Python to import an Excel file containing Scent Haven’s sales data for February 2024. We’ll clean, analyze, and visualize the data. Let’s dive in!

Step 1: Importing our data

I will be using Python in Google Colaboratory. Yes, I know most people prefer Jupyter Notebook, but I prefer Google Colab because it’s easy to use and lets me work with others in real time. It also makes it simple to save and share my work on Google Drive. To open Google Colab:

  • Open Google Drive
  • Click “New”
  • Where you see Google Docs and Sheets, click Google Colab. If it’s not there, select “More”, and if you still don’t see it, select “Connect more apps”, which takes you to the Google Workspace Marketplace. Enter Google Colab in the search bar and install it.

To import files in Google Colab, they should be saved in your Google Drive and mounted into Google Colab. This video explains how to do it far better than I could in writing, and a minimal sketch of the mount step is also shown below. After that, the next code block will import our libraries and display our data frame.
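
For quick reference, mounting boils down to two lines using the standard google.colab helper; the /content/drive path it creates is the same one the read_excel call below relies on.

from google.colab import drive
drive.mount('/content/drive')  # Colab only: makes files saved in Drive visible under /content/drive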

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.read_excel(r"/content/drive/MyDrive/dataset/ScentHaven.xlsx")
df

Note: I chose these libraries because they were convenient for this project, but you are more than welcome to substitute Matplotlib with Seaborn, Plotly, Bokeh, or any other visualization library of your choice.

You’ll notice that our data frame is truncated and we can’t see the full dataset. To show all the data, we’ll run the code block below, which will display the entire 124 rows.

pd.set_option('display.max_rows', 124)  # allow all 124 rows to be displayed
df
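
If you’d rather not hard-code the row count, passing None tells pandas to display every row regardless of how large the data frame is; this is a standard pandas option value.

pd.set_option('display.max_rows', None)  # show all rows, however many there are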

Step 2: Cleaning the data frame

The next step is to clean our data frame: we can see there are null values, misspellings, and other errors, and these could affect our analysis. Let’s start with the null values. The code below displays the number of null values in each column. You can also chain another .sum() to see the total number of null values in the whole data frame, as shown right after.

df.isnull().sum()
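
Chaining a second .sum() collapses the per-column counts into that single grand total.

df.isnull().sum().sum()  # total number of missing cells across the whole data frame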

Now that we know how many we have, let’s get rid of them. Passing how='any' removes every row with at least one missing value, while how='all' removes only rows where every single value is missing. Because how='any' already drops everything how='all' would, we only need the first call here; the gentler alternative is left as a comment.

df.dropna(how='any', inplace=True)    # drop rows containing at least one null value
# df.dropna(how='all', inplace=True)  # alternative: drop only rows that are entirely empty
df
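
Dropping rows is the simplest fix, but if the nulls sit in a column where a sensible default exists, filling them is an alternative that keeps the rest of the row. A minimal sketch, using a hypothetical Quantity column (not necessarily in this dataset):

df["Quantity"] = df["Quantity"].fillna(0)  # hypothetical column: replace missing quantities with 0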

As you may have noticed, some values contain stray characters like (/…). To remove them, we use the strip method, as shown below. Called with no arguments, strip also removes any surrounding whitespace, which is what the first line does. Keep in mind that strip only removes characters from the start and end of each string, and the result has to be assigned back to the column for the change to take effect.

df["Product"].str.strip()
df["Product"]= df["Product"].str.strip("/…, ")
df
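
Because strip only touches the ends of a string, any of these characters sitting in the middle of a product name would survive. If that happens, a regex replace on the same column is one way to handle it; a small sketch:

df["Product"] = df["Product"].str.replace(r"[/…]", "", regex=True)  # remove / and … anywhere in the value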

Now that our data frame has been cleaned, some rows have been removed. I reset the index so that it reflects the new arrangement of the data and we keep consistent row numbers for our analysis.

df = df.reset_index(drop=True)  # drop=True discards the old index instead of keeping it as a column
df

I’ll wrap it up for now. In the next part, we’ll begin our analysis by answering questions, visualizing the data, and stating our insights.

You can find the code and dataset on my GitHub page. Follow me there for updates on my upcoming projects, or stay tuned here as I’ll be sharing everything I work on. Feel free to suggest alternatives to my code so I can improve.

Thank you for reading! I welcome your questions, suggestions, and insights. Feel free to share your thoughts in the comment section — I look forward to engaging with you!