Sales Analysis of an Electronic Store in the United States

Smit Shah
Jan 7, 2022


Beginner Data Science Project



Data Science can be a pretty intimidating field, especially for freshers and beginners who are looking to enter this abyss of greatness. Just having the theoretical knowledge is not enough. One must apply that knowledge to projects in order to thoroughly understand the ups and downs that eventually arise while working on an actual real-world scenario.

This project is a starter for anyone who wants to gain insight into the data science world and get started with building projects.

This project contains a basic analysis of the dataset from an online electronic store in the US. It aims to answer some basic questions a store manager/owner might ask, giving much better insight into the store and how to increase its productivity.

Let’s get started…

The dataset for this project can be found at the GitHub link posted at the end of the article, along with the full source code for reference. Any doubts are welcome :)

Importing the Libraries

import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
import calendar
from itertools import combinations
from collections import Counter

Reading the Data into a Pandas Dataframe

all_data = pd.read_csv('all_data.csv')
all_data.head()
Dataset

The data cleaning process is not shown here. You can refer to the GitHub link to dive deep into the data cleaning & preparation part.
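For reference, the cleaning for a file like this usually amounts to dropping empty rows, removing repeated header rows, and converting the numeric columns. A minimal sketch on toy data (these are assumed steps; see the GitHub repo for the actual cleaning code):

```python
import pandas as pd

# Toy rows imitating the raw file: a valid row, a blank row,
# a repeated header row, and another valid row (assumed structure).
raw = pd.DataFrame({
    'Order Date': ['04/19/19 08:46', None, 'Order Date', '12/01/19 14:30'],
    'Quantity Ordered': ['1', None, 'Quantity Ordered', '2'],
    'Price Each': ['11.95', None, 'Price Each', '99.99'],
})

cleaned = raw.dropna(how='all')                           # drop fully empty rows
cleaned = cleaned[cleaned['Order Date'] != 'Order Date']  # drop repeated header rows
cleaned['Quantity Ordered'] = pd.to_numeric(cleaned['Quantity Ordered'])
cleaned['Price Each'] = pd.to_numeric(cleaned['Price Each'])
print(len(cleaned))  # 2 valid rows survive
```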

Adding Additional Columns for better understanding the Data

all_data['Month'] = all_data['Order Date'].str[0:2]  # 'Order Date' is in MM/DD/YY HH:MM format
all_data['Month'] = all_data['Month'].astype('int32')
# 'Quantity Ordered' and 'Price Each' must already be numeric (handled during cleaning)
all_data['Sales'] = all_data['Quantity Ordered'] * all_data['Price Each']
all_data.head()
Dataset after addition of 2 columns

In the code above, we extracted the month from the order date in order to answer the month-related questions. We also added a column called “Sales”, which is the total value of that particular order.

def get_city(address):
    return address.split(',')[1]

def get_state(address):
    return address.split(',')[2].split(' ')[1]

all_data['City'] = all_data['Purchase Address'].apply(lambda x: get_city(x) + ' (' + get_state(x) + ')')
Dataset with City Column added

Lastly, we extract the city and state from each order’s address using custom functions that split the address string. We make use of pandas’ .apply method, which applies a function to every value in the column and returns the processed values.
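To see what the splits actually produce, here is the logic applied to one made-up address, assuming the usual “street, city, state zip” format:

```python
# Hypothetical address in the "street, city, state zip" format
address = '917 1st St, Dallas, TX 75001'

city = address.split(',')[1]                 # ' Dallas' (note the leading space)
state = address.split(',')[2].split(' ')[1]  # ' TX 75001' -> ['', 'TX', '75001'] -> 'TX'

print(city + ' (' + state + ')')  # " Dallas (TX)"
```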

Now, it’s time to answer some questions…

1. What was the best month for Sales? How much was earned that month?

results = all_data.groupby('Month').sum(numeric_only=True)  # numeric_only avoids summing text columns
months = []
for month in range(1, 13):
    months.append(calendar.month_abbr[month])

plt.title('Sales by Month')
plt.bar(months, results['Sales'])
plt.xticks(months)
plt.ylabel('Sales in USD ($)')
plt.xlabel('Month')
plt.show()
Sales By Month

Answer: The best month was December, with sales of nearly $4.6 million.

2. Which City had the highest number of sales?


results_city = all_data.groupby('City').sum(numeric_only=True)
cities = [city for city, df in all_data.groupby('City')]
plt.title('Sales by Cities')
plt.bar(cities, results_city['Sales'])
plt.xticks(cities, rotation='vertical', size=8)
plt.ylabel('Sales in USD ($)')
plt.xlabel('City Names')
plt.show()
Sales By Cities

Answer: San Francisco had the highest sales of all the cities.

3. What time should we display advertisements to maximize likelihood of customer’s buying the products?

all_data['Order Date'] = pd.to_datetime(all_data['Order Date'])
all_data['Hour'] = all_data['Order Date'].dt.hour
all_data['Minute'] = all_data['Order Date'].dt.minute
all_data.head()
Dataset after adding hours and minutes
hours = [hour for hour, df in all_data.groupby('Hour')]

# count orders per hour; counting a single column avoids plotting one line per column
plt.plot(hours, all_data.groupby('Hour')['Order ID'].count())
plt.xticks(hours)
plt.xlabel('Time (24 hr format)')
plt.ylabel('Number of Orders')
plt.grid()
plt.show()
Order timings

Answer: We should display advertisements between 11 am and 12 pm and between 6 pm and 8 pm.
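The peak can also be confirmed numerically instead of reading it off the chart. A sketch on toy data (on the real data this would run against `all_data`):

```python
import pandas as pd

# Toy order hours; .size() counts rows per hour and
# idxmax() returns the hour with the most orders.
orders = pd.DataFrame({'Hour': [11, 11, 12, 19, 19, 19, 3]})
per_hour = orders.groupby('Hour').size()
print(per_hour.idxmax())  # 19
```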

4. What products are most often sold together?

df = all_data[all_data['Order ID'].duplicated(keep=False)].copy()  # .copy() avoids a SettingWithCopyWarning

df['Grouped'] = df.groupby('Order ID')['Product'].transform(lambda x: ','.join(x))

df = df[['Order ID', 'Grouped']].drop_duplicates()
count = Counter()
for row in df['Grouped']:
    row_list = row.split(',')
    count.update(Counter(combinations(row_list, 2)))

for key, value in count.most_common(10):
    print(key, value)

Note: Counter makes it easy to tally how many times each product pair occurs and to quickly retrieve the most common pairs.
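Here is the pair-counting idea on a toy list of orders, where each inner list stands for the products in one order (the product names are just examples):

```python
from collections import Counter
from itertools import combinations

# Hypothetical orders: each inner list is one order's products.
orders = [
    ['iPhone', 'Lightning Charging Cable'],
    ['iPhone', 'Lightning Charging Cable', 'Wired Headphones'],
    ['Google Phone', 'USB-C Charging Cable'],
]

count = Counter()
for products in orders:
    # combinations(products, 2) yields every 2-item pair in the order
    count.update(Counter(combinations(products, 2)))

print(count.most_common(1))
# [(('iPhone', 'Lightning Charging Cable'), 2)]
```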

Top products sold together

Answer: The 2 most common products sold together are iPhone and Lightning Charging Cable

5. What product is sold the most? Why do you think it is sold the most?

product_group = all_data.groupby('Product')
quantity_ordered = product_group.sum(numeric_only=True)['Quantity Ordered']

products = [product for product, df in product_group]
prices = all_data.groupby('Product').mean(numeric_only=True)['Price Each']

fig, ax1 = plt.subplots()

ax2 = ax1.twinx()
ax1.bar(products, quantity_ordered, color = 'g')
ax2.plot(products, prices, 'b-')

ax1.set_xlabel('Product Name')
ax1.set_ylabel('Quantity Ordered', color='g')
ax2.set_ylabel('Price ($)', color='b')
ax1.set_xticks(range(len(products)))  # set tick positions before labelling them
ax1.set_xticklabels(products, rotation='vertical', size=8)

plt.show()
Products with Pricing

Answer: AAA batteries are the most sold product, followed by AA batteries, with USB-C and Lightning charging cables close behind. Since these items are cheaper than most other items, their low price is a likely reason they sell the most.
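One way to probe the “cheaper items sell more” intuition is to correlate price with quantity ordered. A sketch on made-up per-product figures (the real analysis would use the `prices` and `quantity_ordered` series computed above):

```python
import pandas as pd

# Made-up per-product figures: price rises, quantity falls.
data = pd.DataFrame({
    'price':    [2.99, 3.84, 11.95, 999.99],
    'quantity': [31000, 27600, 21000, 6800],
})

# A negative correlation supports the "cheaper products sell more" idea.
print(data['price'].corr(data['quantity']) < 0)  # True
```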

As you can see, we have tried to answer the most common questions a potential electronic store could have in order to increase its revenue and sales. Thus, we have successfully applied our analytical thinking to tackling a real-world use case.

If you liked the project, you can head over to my GitHub profile to find more interesting projects that will be posted in the coming future. The code repository is here: Link
