Superstore Data Analysis Project — Beginner Level

Suresh Choudhary
7 min read · Feb 18, 2024


Superstore Data Analysis project delves into sales patterns, product performance, and customer behavior, providing insights that inform strategic decisions for the continued success and refinement of the Superstore.

Source code — Github

Table of Contents:

· Introduction
· Project Objectives
· Requirements
· 1. Data Exploration and Cleaning
  - Data Collection
  - Data Exploration
  - Data Cleaning
· 2. Exploratory Data Analysis
  - a. Count of sub-category
  - b. Best Performing Category
  - c. Which customer segment is the most profitable?
  - d. Which is the preferred Ship Mode?
  - e. Customer Regional Analysis
· Conclusion
· Summary
· References

Introduction

The Superstore Data Analysis Project focuses on extracting insights from a retail superstore dataset. The dataset contains information about sales, customers, products, and orders from a fictional retail superstore.

By analyzing this data, we aim to uncover trends, patterns, and insights that can be used to improve business operations and decision-making.

Project Objectives

Key aspects of this project include:

  1. Data Exploration: We will begin by exploring the dataset to understand its structure and contents.
  2. Data Cleaning: We will clean and preprocess the data to ensure its quality and suitability for analysis.
  3. Data Analysis: We will perform in-depth analysis on various aspects of the data, such as sales trends, customer behavior, and product performance.
  4. Data Visualization: We will create visualizations to illustrate our findings and make it easier to interpret the data.
  5. Insights and Recommendations: Based on our analysis, we will derive insights and offer recommendations for improving business strategies.

Requirements

  • Python 3.7 or higher
  • Jupyter Notebook / Google Colab
  • Pandas
  • Matplotlib
  • Seaborn
  • NumPy

Fun Fact: Did you know that the Superstore dataset contains more rows than a flock of migrating birds? Brace yourself for a data journey that’s more exciting than a birdwatching adventure!

1. Data Exploration and Cleaning

We first need to collect and import the dataset. Here’s the source link for the dataset that I have used for this project — superstore.csv

Once you have the dataset, the next step is to import it into our Python environment using the Pandas library. Additionally, include other libraries such as NumPy, Matplotlib, and Seaborn.

In this project, I have used Jupyter Notebook; however, you can also choose to use Google Colab or Visual Studio Code. Make sure to download the dataset from the provided source link before proceeding with the import and analysis.

Data Collection

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset
file_path = "superstore.csv"
superstore_data = pd.read_csv(file_path)

# Display the first row of the dataset to get a glimpse
print(superstore_data.head(1))

Data Exploration

Let’s start by understanding the structure and contents of the dataset.

The info() function provides an overview of the dataset, including column names and data types, while describe() gives statistical information about numerical columns.

# Explore the structure of the dataset
print(superstore_data.info())

# Display basic statistics of numerical columns
print(superstore_data.describe())

Data Cleaning

Now, let’s clean and preprocess the data to ensure its quality and suitability for analysis. For brevity, the snippets from here on refer to the DataFrame as df:

df = superstore_data  # shorter alias for the loaded DataFrame

# Drop duplicate rows, keeping the first occurrence
df.drop_duplicates(keep='first', ignore_index=True, inplace=True)

# Verify: should return only False once duplicates are gone
df.duplicated(keep=False).unique()

# Count of remaining duplicated rows (expect 0)
df.duplicated().sum()

This snippet first drops duplicate rows from the DataFrame df, keeping only the first occurrence of each duplicate (keep='first'), reindexing the result (ignore_index=True), and modifying the DataFrame in place (inplace=True). The two duplicated() calls then verify the cleanup: unique() should return only False, and duplicated().sum() should return 0.
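Beyond duplicates, it is worth checking for missing values before moving on. Here is a minimal sketch on a toy DataFrame (the column names and values are illustrative, not necessarily those in superstore.csv):

```python
import pandas as pd
import numpy as np

# Toy data standing in for the real dataset
df = pd.DataFrame({
    "OrderID": [1, 2, 3, 4],
    "Sales": [120.5, np.nan, 75.0, 220.0],
    "Region": ["West", "East", None, "South"],
})

# Count missing values per column
print(df.isnull().sum())

# One common approach: fill numeric gaps with the median,
# and drop rows that are missing a categorical key
df["Sales"] = df["Sales"].fillna(df["Sales"].median())
df = df.dropna(subset=["Region"]).reset_index(drop=True)
print(df)
```

Whether to fill or drop depends on the column and how much data is missing; the median fill above is just one reasonable default for numeric columns.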

We can further perform other data cleaning operations like converting datatype of each column into appropriate format. For example,

>>> df['OrderDate'].dtype
dtype('O')

Here the original datatype of the OrderDate column is ‘object’. Hence, we need to convert it into ‘datetime’ format.

>>> df['OrderDate'] = pd.to_datetime(df['OrderDate'])
>>> df['OrderDate'].dtype
dtype('<M8[ns]')
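Once OrderDate is a datetime, time-based features come almost for free via the .dt accessor, which is handy for the sales-trend analysis mentioned in the objectives. A small sketch with made-up dates and figures:

```python
import pandas as pd

# Toy data; real dates come from the OrderDate column of the dataset
df = pd.DataFrame({
    "OrderDate": ["2023-01-15", "2023-03-02", "2024-02-18"],
    "Sales": [100.0, 250.0, 175.0],
})
df["OrderDate"] = pd.to_datetime(df["OrderDate"])

# Derive year and month columns for trend analysis
df["Year"] = df["OrderDate"].dt.year
df["Month"] = df["OrderDate"].dt.month

yearly_sales = df.groupby("Year")["Sales"].sum()
print(yearly_sales)
```

The same pattern works for monthly or quarterly aggregation by grouping on the derived columns.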

2. Exploratory Data Analysis

Great progress! Now the data is ready for further analysis. Exploratory Data Analysis is where the magic happens. It’s a kind of data adventure, where each chart and plot is a clue.

In this section, we systematically examine various aspects, from product sub-categories to financial dynamics such as sales and profit.

a. Count of sub-category

Let us find the distribution of products across different sub-categories. The countplot() function is utilized to generate a bar chart, where each unique sub-category is represented on the x-axis, and the corresponding count is depicted on the y-axis.

sns.set(rc={'figure.figsize': (12, 5)})
sns.countplot(x='SubCategory', data=df, palette='Blues')
plt.xticks(rotation=60)
plt.title('Count of SubCategory')
plt.show()

b. Best Performing Category

To identify the best-performing category, we can use countplot() to visualize the distribution of items across the various categories in the dataset.

Each unique category is represented on the x-axis, and the corresponding y-axis indicates the count of items within each category. The plot provides a snapshot of how products are distributed among different categories.

sns.countplot(x='Category', data=df)
plt.title('Count of Category')
plt.show()
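Counting rows tells us which category has the most orders, but “best performing” can also mean highest revenue. A hedged sketch of ranking categories by total sales instead (the numbers below are invented; the real figures come from the dataset’s Sales column):

```python
import pandas as pd

# Toy data with the three Superstore categories
df = pd.DataFrame({
    "Category": ["Furniture", "Technology", "Furniture",
                 "Office Supplies", "Technology"],
    "Sales": [200.0, 500.0, 150.0, 80.0, 300.0],
})

# Rank categories by total sales rather than order count
category_sales = df.groupby("Category")["Sales"].sum().sort_values(ascending=False)
print(category_sales)
print("Top category:", category_sales.idxmax())
```

Swapping "Sales" for "Profit" in the groupby gives the most profitable category instead of the highest-grossing one.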

c. Which customer segment is the most profitable?

Now let’s calculate the average profit for each customer segment using the Pandas library. We employ the groupby() function to group the DataFrame df by the Segment column and then calculate the mean profit for each group.

df2 = df.groupby(['Segment'])[['Profit']].mean()
df2

We can now visually represent the average profitability of different customer segments. The x-axis displays customer segments (Consumer, Corporate, Home Office), while the y-axis showcases the average profit.

The resulting plot provides a quick and insightful comparison of profitability among various customer segments.

sns.set_theme(style="whitegrid")
sns.barplot(data = df2, x = df2.index, y = df2.Profit, palette = "Blues")
plt.title("Customer Segment Profitability")
plt.show()
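The bar chart makes the comparison visual; to answer the section’s question programmatically, idxmax() on the grouped means returns the top segment directly. A sketch with invented profit figures:

```python
import pandas as pd

# Toy data; real profits come from the dataset's Profit column
df = pd.DataFrame({
    "Segment": ["Consumer", "Corporate", "Home Office", "Consumer", "Corporate"],
    "Profit": [20.0, 50.0, 35.0, 10.0, 40.0],
})

# Mean profit per segment, then the segment with the highest mean
segment_profit = df.groupby("Segment")["Profit"].mean()
best_segment = segment_profit.idxmax()
print("Most profitable segment:", best_segment)
```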

d. Which is the preferred Ship Mode?

Now we investigate the performance of various shipping modes utilized by the Superstore. By focusing on discounted sales and profit, we aim to discern the effectiveness of different shipping methods in contributing to the overall financial dynamics.

# Taking a subset of the data (to keep things simple)
df_stackb = df[['ShipMode', 'DiscountedSales', 'Profit']]

# Grouping by ship mode and aggregating sales and profit
df_stackb = df_stackb.groupby(['ShipMode']).sum().reset_index()

# Visualizing the stacked bar chart
plt.figure(figsize=[10, 5])
plt.bar(x=df_stackb['ShipMode'], height=df_stackb['DiscountedSales'], color='skyblue')
plt.bar(x=df_stackb['ShipMode'], height=df_stackb['Profit'], bottom=df_stackb['DiscountedSales'], color='green')
plt.title("Sales & Profit Across Ship Modes", size=20, pad=20)
plt.legend(['Sales', 'Profit'])
plt.show()

The resulting data is visualized using a Stacked Bar Chart, where each bar represents a different ship mode. The blue section of each bar corresponds to discounted sales, and the green section represents the profit.
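“Preferred” can also simply mean most frequently chosen, which value_counts() answers directly. A sketch on toy data (the ship-mode labels match the usual Superstore values, but the counts are invented):

```python
import pandas as pd

# Toy order data; real values come from the ShipMode column
df = pd.DataFrame({
    "ShipMode": ["Standard Class", "Second Class", "Standard Class",
                 "First Class", "Standard Class", "Same Day"],
})

# Frequency of each ship mode, most common first
mode_counts = df["ShipMode"].value_counts()
print(mode_counts)
print("Preferred ship mode:", mode_counts.idxmax())
```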

e. Customer Regional Analysis

This segment evaluates the profitability generated across different regions — Central, East, South, and West.

region_analysis = pd.DataFrame(df.groupby(['Region'])['Profit'].sum().reset_index())
region_analysis

We can visualize this on a pie chart using pie() function which represents the distribution of profits across different regions. Each slice of the pie corresponds to a specific region, with the size of the slice determined by the total profit generated in that region.

plt.pie(
    region_analysis.Profit,
    labels=region_analysis.Region,
    startangle=90,
    autopct="%1.0f%%",
    explode=[0, 0, 0, 0.1],  # offset the last slice for emphasis
)
plt.title("Most Profitable by Region")
plt.show()

Conclusion

To wrap things up, our Superstore Data Analysis journey has given us a closer look at how the business is doing. We started by digging into the data, making sure it’s clean and ready for analysis. Then, we explored different aspects like what people buy, how much they spend, and what regions contribute the most.

We found out which categories and products are the stars, learned about the different types of customers and how they impact profits, and even looked into how people prefer their items to be shipped. And that’s a wrap for our data adventure at the Superstore!

Summary

1. Data Exploration and Cleaning:
— Explored dataset structure and contents.
— Cleaned and preprocessed data for analysis.

2. Exploratory Data Analysis (EDA):
— Investigated sub-categories, sales, and profit patterns.
— Identified best-performing categories and products.
— Explored customer segment profitability and shipping preferences.
— Analyzed profitability across geographical regions.

References

Source code — Github
