Power of Python Pandas

Rohan Raj
Published in Analytics Vidhya · 6 min read · Jul 16, 2021

Advanced-Level analysis of US Store Sales data

Dataset

The data used for this project can be accessed from here. The dataset contains four years of retail data from a global superstore.

The attributes used in the data can easily be understood from their respective column headers.

Before starting the analysis, some cleaning has to be performed on the data to remove null entries and to convert the attributes into proper data types.

Data Cleaning
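The cleaning code itself is not reproduced here, so below is a minimal sketch of the steps described above. It uses a tiny inline DataFrame in place of the real CSV; all column names and sample values are assumptions.

```python
import pandas as pd

# Tiny inline sample standing in for the superstore CSV
# (column names and values are assumptions)
data = pd.DataFrame({
    'Order Date': ['2017-01-03', '2017-02-10', None],
    'Ship Date': ['2017-01-07', '2017-02-12', '2017-03-01'],
    'Sales': [261.96, 731.94, 14.62],
})

# Remove rows with null entries
data = data.dropna()

# Convert date strings into a proper datetime dtype
data['Order Date'] = pd.to_datetime(data['Order Date'])
data['Ship Date'] = pd.to_datetime(data['Ship Date'])

print(len(data))                  # rows remaining after dropping nulls
print(data['Order Date'].dtype)   # datetime64[ns]
```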


Let’s begin the analysis!

1. Top 5 States with Highest Sales
2. Top 3 Sub-categories for each Category in terms of Sales value

# Top 5 states by total sales
data.groupby('State').agg({'Sales': 'sum'}).sort_values('Sales', ascending=False).head(5)

# Top 3 sub-categories within each category
cat_sub = data.groupby(['Category', 'Sub-Category']).agg({'Sales': 'sum'})
cat_sub.sort_values('Sales', ascending=False).groupby('Category').head(3).sort_index()
Top-3 sub-categories for each Category

These are the top-3 sub-categories from each category of products that generated the maximum revenue for the superstore.

3. Month and City with the highest number of “Same-Day” deliveries

Output: Here, 538 − 514 = 24 orders marked “Same Day” were actually shipped one day after the order date. Hence, to calculate the month-wise same-day deliveries, we use only the rows with a 0-day shipment gap, which are the true same-day deliveries.

The highest number of same-day deliveries occurred in the month of September and in New York City.
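A minimal sketch of this computation is shown below. The sample rows are assumptions; the real analysis runs over the full dataset.

```python
import pandas as pd

# Tiny sample standing in for the superstore data (rows are assumptions)
data = pd.DataFrame({
    'Order Date': pd.to_datetime(['2017-09-05', '2017-09-10',
                                  '2017-01-15', '2017-09-20']),
    'Ship Date':  pd.to_datetime(['2017-09-05', '2017-09-10',
                                  '2017-01-15', '2017-09-21']),
    'City': ['New York City', 'New York City', 'Chicago', 'Dallas'],
})

# True same-day deliveries: shipped 0 days after the order date
data['ship_days'] = (data['Ship Date'] - data['Order Date']).dt.days
same_day = data[data['ship_days'] == 0]

# Month and city with the highest number of same-day deliveries
top_month = same_day['Order Date'].dt.month_name().value_counts().idxmax()
top_city = same_day['City'].value_counts().idxmax()
print(top_month, top_city)  # September New York City
```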

4. Which products are most often sold together?

Here, the idea is to group all the products sharing the same Order ID into a cluster, then count how many times each cluster of products has been ordered together to find the popular ones.

To understand the counting of these unique pairs in the whole dataset, this link from Stack Overflow is very helpful.
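Following that idea, a minimal sketch might look like this (the order IDs and product names are illustrative):

```python
from collections import Counter
import pandas as pd

# Minimal example: a repeated product pair across two orders
# (order IDs and product names are illustrative)
data = pd.DataFrame({
    'Order ID': ['A1', 'A1', 'A2', 'A2', 'A3'],
    'Product Name': ['Newell 344', 'Elite 5" Scissors',
                     'Newell 344', 'Elite 5" Scissors', 'Newell 344'],
})

# Cluster the products sharing an Order ID into one tuple per order
grouped = data.groupby('Order ID')['Product Name'].apply(
    lambda s: tuple(sorted(s)))

# Count how often each multi-product cluster was ordered together
pair_counts = Counter(g for g in grouped if len(g) > 1)
print(pair_counts.most_common(1))
# [(('Elite 5" Scissors', 'Newell 344'), 2)]
```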

Output: This can be very useful for marketing: since these groups of products are the most popular among customers, they can be promoted and sold together as bundles.

[(('Newell 344',
'Memorex Micro Travel Drive 16 GB',
'Wirebound Four 2-3/4 x 5 Forms per Page 400 Sets per Book',
'Elite 5" Scissors',
'Staple-based wall hangings'),
2),
(('Newell 344',
'Memorex Micro Travel Drive 16 GB',
'Wirebound Four 2-3/4 x 5 Forms per Page 400 Sets per Book',
'Elite 5" Scissors',
'Eldon Image Series Desk Accessories Burgundy'),
2),
(('Newell 344',
'Memorex Micro Travel Drive 16 GB',
'Wirebound Four 2-3/4 x 5 Forms per Page 400 Sets per Book',
'Elite 5" Scissors',
'DAX Two-Tone Silver Metal Document Frame'),
2)]

5. Percent contribution of each member product to its order’s total sales

Here, we can group by the Order ID and then apply a transform sum. This is similar to performing an aggregate sum with window functions in SQL.

data['Total Order Sales'] = data.groupby('Order ID')['Sales'].transform('sum')

# Calculate the percent contribution of each member product
dataset = data[['Order ID', 'Product Name', 'Sales', 'Total Order Sales']].copy()
dataset['Percent of Total'] = (dataset['Sales'] / dataset['Total Order Sales']) * 100

Output: This makes it easy to identify, for each order, the product that contributed the highest percentage of that order’s total sales.

Per cent Contribution of member products

6. Timely vs Delayed orders Analysis

Orders taking 5 or more days to ship are treated as ‘Delayed’ orders and the rest as ‘Timely’ orders.

data['is_delayed'] = np.where(data['days_for_shipment'] >= 5, 'True', 'False')

plt.figure(figsize=(10, 5))
data['is_delayed'].value_counts().plot(kind='bar', title='Timely vs Delayed', rot=0)

Creating a pivot table to analyse the data with respect to the number of orders.
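A sketch of such a pivot table, assuming the is_delayed flag from the previous step (the sample rows are assumptions):

```python
import pandas as pd

# Sample orders with a precomputed delay flag (rows are assumptions)
data = pd.DataFrame({
    'Order ID': ['O1', 'O2', 'O3', 'O4'],
    'Region': ['Central', 'Central', 'East', 'East'],
    'Ship Mode': ['Standard Class', 'First Class',
                  'Standard Class', 'Same Day'],
    'is_delayed': ['True', 'False', 'False', 'False'],
})

# Count of orders per region, split by the delay flag;
# swap index='Region' for index='Ship Mode' for the mode-wise view
pivot = pd.pivot_table(data, index='Region', columns='is_delayed',
                       values='Order ID', aggfunc='count', fill_value=0)
print(pivot)
```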

Output: Shows the proportion of orders delivered late, region-wise and shipping-mode-wise.

Region-wise Distribution

The highest percentage of ‘Delayed’ orders came from the ‘Central’ region, while the ‘East’ region had the lowest percentage of delayed orders.

Shipping Mode-wise Distribution

For the ‘First Class’ and ‘Same Day’ shipping modes, none of the orders were delayed. This can be one reason why these shipping modes are usually costlier than the others: people often agree to pay more for faster service.

7. Customer Level analysis

First, the intention is to take only the first purchase and the last purchase of each customer.

first_purchase = data.sort_values(by='Order Date').groupby('Customer ID').first()
last_purchase = data.sort_values(by='Order Date').groupby('Customer ID').last()

# Merge the tables and fetch only the important attributes
customer = pd.merge(first_purchase, last_purchase, on='Customer ID')[['State_x', 'Order Date_x', 'Order Date_y']]
customer.rename({'State_x': 'State', 'Order Date_x': 'First_Purchase', 'Order Date_y': 'Last_Purchase'}, axis=1, inplace=True)

Next, we create new attributes such as Retention_days (between the first and last purchase), Count_of_orders, Freq_of_orders (in days), and Avg_Revenue, which are very useful for identifying the customers most important to the business.
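One way to compute these derived attributes is sketched below; the sample order history and values are assumptions, and the attribute names follow the article.

```python
import pandas as pd

# Per-customer order history (a stand-in for the real data; values are
# assumptions)
data = pd.DataFrame({
    'Customer ID': ['C1', 'C1', 'C1', 'C2'],
    'Order ID': ['O1', 'O2', 'O3', 'O4'],
    'Order Date': pd.to_datetime(['2015-01-01', '2015-07-01',
                                  '2016-01-01', '2016-05-05']),
    'Sales': [100.0, 200.0, 300.0, 50.0],
})

# Summarise each customer's purchase history
customer = data.groupby('Customer ID').agg(
    First_Purchase=('Order Date', 'min'),
    Last_Purchase=('Order Date', 'max'),
    Count_of_orders=('Order ID', 'nunique'),
    Total_Revenue=('Sales', 'sum'),
)

# Derive the attributes described above
customer['Retention_days'] = (customer['Last_Purchase'] - customer['First_Purchase']).dt.days
customer['Freq_of_orders'] = customer['Retention_days'] / customer['Count_of_orders']
customer['Avg_Revenue'] = customer['Total_Revenue'] / customer['Count_of_orders']

print(customer[['Retention_days', 'Count_of_orders', 'Freq_of_orders', 'Avg_Revenue']])
```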

Output: If the store owner wants to reward the 3 most loyal customers with a discount coupon, this method can find them.

These are the 3 most loyal customers of the store according to the given data: they have been purchasing constantly at the store for the last 3 years (retention), with an average order frequency of roughly once a month, while also generating decent per-order revenue for the store.

Loyal Customers

Output:

State-Wise Distribution of Loyal Customers

If the owner wants to expand the Super-store chain further in some states, how can he find those important states?

Considering loyal customers to be those who stay with the store for at least 1 year (365 days), shop at least every 3 months (90 days), and generate at least $100 revenue per order, the owner could decide to expand the store chain in these 5 states, with ‘California’ being the safest bet.
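Those thresholds translate into a simple boolean filter over the customer table; the sketch below uses an illustrative customer summary with made-up values.

```python
import pandas as pd

# Customer summary with the derived attributes (values are illustrative)
customer = pd.DataFrame({
    'State': ['California', 'California', 'Texas', 'Ohio'],
    'Retention_days': [400, 700, 500, 100],
    'Freq_of_orders': [60, 85, 95, 30],
    'Avg_Revenue': [150, 120, 200, 500],
})

# Loyalty thresholds from the article: at least 1 year of retention,
# an order at least every 90 days, and at least $100 revenue per order
loyal = customer[(customer['Retention_days'] >= 365) &
                 (customer['Freq_of_orders'] <= 90) &
                 (customer['Avg_Revenue'] >= 100)]

# State-wise distribution of loyal customers
print(loyal['State'].value_counts())
```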

NOTE: The parameters taken in the above analysis to determine customer loyalty were purely based on some intuition. These can be changed depending on how the store owner wants to grade their customers as being ‘Loyal’.

These were only some of the areas from the whole project.

For a detailed understanding of the project with respective output and visualizations, look at the GitHub repository.

If you have feedback, suggestions, or questions about this post, feel free to comment on the GitHub repository or reach out to me on LinkedIn.
