Product Sales Analysis Using Python
OVERVIEW
In this post, I use Python Pandas & Python Matplotlib to analyze and answer business questions about 12 months worth of sales data. The data contains hundreds of thousands of electronics store purchases broken down by month, product type, cost, purchase address, etc. The dataset can be downloaded here. In this analysis, I’m using jupyter notebook.
PROBLEMS
- What was the best month for sales? How much was earned that month?
- What city sold the most product?
- What time should we display advertisements to maximize likelihood of customer’s buying products?
- What Products are most often sold together?
- What product sold the most? Why do you think it did?
SOLUTION
- What was the best month for sales? How much was earned that month?
First of all, we need to see what kind of data we are trying to analyze. we can simply do this
We use panda to read the csv file and create a dataframe from it. The file’s directory can be put anywhere. I personally put it in D. You can copy the directory and paste it in the syntax. At Figure 1, we can see that we have 6 columns. Now the first task is to merge all 12 months worth of sales data (12 csv files) into a single csv file. To do that, we need to import new library called os.
We need os library to read all csv files’ title and call it using for loop. As you can see from Figure 2, we successfully read all the csv files’ title and we’re ready to merge it. To do that, we can simply do
— — — — — — — — — — — DETAIL CODE — — — — — — — — — — —
import pandas as pd
import os
df = pd.read_csv(“D:\Self\Online Course\Solve real world Data science task\Pandas-Data-Science-Tasks-master\Pandas-Data-Science-Tasks-master\SalesAnalysis\Sales_Data\Sales_April_2019.csv”)
files = [file for file in os.listdir(“D:\Self\Online Course\Solve real world Data science task\Pandas-Data-Science-Tasks-master\Pandas-Data-Science-Tasks-master\SalesAnalysis\Sales_Data”)]
all_months_data = pd.DataFrame() #Creating empty dataframe called ‘all_month_data’
for file in files:
df = pd.read_csv(“D:\Self\Online Course\Solve real world Data science task\Pandas-Data-Science-Tasks-master\Pandas-Data-Science-Tasks-master\SalesAnalysis\Sales_Data/”+file)
all_months_data = pd.concat([all_months_data, df]) #Merging to the previous empty dataframe
#Checking the result
all_months_data.to_csv(“all_data.csv”, index=False) #single csv file contain 12 months data.
— — — — — — — — — — — — — — — — — — — — — — — — — — — — —
It will take a little longer time because of heavy computation. But once it’s done, you can open the same directory folder and check the folder called “Output”. You will see a new csv file contains all 12 months data.
After we create this new csv file, you can delete the previous code (if you want) and we will use this file to answer all of the problems.
Now we only use this code to read all of 12 months data.
Now we’re ready to answer the problem number 1. To remind you, the question is: What was the best month for sales? How much was earned that month?
To answer this problem, obviously we need an additional column called “Month”. If you look carefully at Figure 5, you will see the first 2 characters in “Order Date” values represent months. So the next task we will do is to add “Month” Column.
Now, we get an issue here. There are NaN values in our data. You could spot on of NaN value at Figure 1 or Figure 5 in index 1. Now we need to clean up the data by dropping rows of NaN. Let’s spot more NaN value here. You don’t have to do this, I am just curious.
We use .isna().any(axis=1) to spot rows containing the NaN values (axis = 0 to spot column containing NaN values). Now, we’re gonna remove it from our dataframe using .dropna() method.
.dropna() method is successful, but we get a new issue here. There are values “Or” in our data. Let’s find it first.
We can see clearly from Figure 9 that the issue is the rows contain the same words as the title rows. So clearly ‘Or’ is coming from ’Order Date’. We need to drop this “Or” rows just simply change the equal sign (“==”) to not equal sign (“!=”).
We can see clearly from Figure 10 that we successfully created “Month” column and make its data type to integer.
Now, are we ready to answer the question? Not yet, we need obviously one more column called “Sales” Column. How can we get that? We get “Sales” by multiplying “Quantity Ordered” and “Price Each” values. Let’s create it.
Now we encounter a new issue. The values of the column “Quantity Ordered” and “Price Each” are strings. So the next task is to convert these columns to the correct type (“Quantity Ordered” is integer and “Price Each” is float). We’re gonna use pd.to_numeric() method to convert them to numeric.
Now the “Sales” Column is successfully created, we can answer the first question. What was the best month for sales? How much was earned that month? We can easily answer it by using groupby(‘Month’).sum() method.
Look carefully at Figure 13. We can clearly see that month 12 (December) is the highest sales in 2019 with approximately $4,810,000. But we need to visualize it to make our bussiness partner easier to understand. So we’re gonna import matplotlib and visualizing our results with bar chart.
It’s good. But we need to make it looks neater. So we’re just gonna add a little code.
Now, not only we can get the highest sales, but we can also get the lowest sales just by looking it for a few seconds. As a data scientist, we have to figure out why a certain month is better than others. Maybe the company spend more money on April so the product sales are increasing. Maybe the best product sales are on December because it’s holiday and Christmas. Those are just my hyphothesis, right now we don’t have enough data to prove that hypothesis. But we can take these as a consideration if you want to decide something that relates to product sales.
2. What city sold the most product?
To answer this question, obviously we need to create a new column called “City” column. How do we get that? As usual, we’re gonna check the top 5 data in our dataframe to figure out where can we get our “City” column using .head() method.
As you can see at Figure 16, the “Purchase Address” Column contain the city. We can’t get it directly, we need to extract the data. We can use one of most useful function in pandas, .apply() method.
To make it neater, we can use this
apply and lambda usually used to create new column based on other column (example .apply(lambda x: x*2. It means every input x in other column will be changed to x*2 in a new column). In this case we create “City” column based on “Purchase Address” column and we split the data into 3 part. The first one is before the first comma (index = 0), the second one is between the commas (index = 1), and the third on is after the last comma (index = 2). As we need to extract the city data, we use [1] to state it to index 1.
As you can see at Figure 17, we successfully created a “City” column. So are we ready to answer the second question? Not yet. We get an issue here. It’s not error, it’s the value of the “City” Column. This is just a rare case when there are 2 cities are named exactly the same. Example someone in New England and someone in West Coast would think Portland in different way. Someone in New England thinks Portland as Portland Maine and someone in West Coast thinks Portland as Portland Oregon. So in our dataset we actually had the overlapping cities between these two. So, we should also grab the state.
The function get_state() basically works as explained before. But in this function, we seperate the data again become three parts. The first one is before the whitespace (index = 0), the second one is between the whitespaces (index = 1), and the third one is after the last whitespace (index = 2). So that’s why we use .split(‘ ‘)[1] in the second split.
Now we’re ready to answer the second question, what city sold the most product? As we did before, we’re gonna group it by the city and summing all the values based on the group.
It’s too messy, but if you look carefully you can see that San Fransisco is the highest sold product of all cities with approximately $8,200,000. We clearly need to visualize it because it’s so hard to conclude anything just based on that numbers and also it will make our bussiness partner easier to understand.
Now we successfully plot it. But there is a big issue here. If you notice that the values (Figure 20) and the plot (Figure 21) are not synchronized. The highest sales should be San Fransisco. What’s wrong with our code?
There’s an issue between .unique() method and plt.bar(). Their cities order are different. we’re gonna syncronized the order by simply fixing the variable ‘cities’.
Now, we fix the issue and successfully plot it. As a data scientist, we need to figure out why San Fransisco is the highest sale compare to other cities. Maybe Sillicon Valley need more electronic products. Maybe the advertisement is better in San Fransisco. We can use this data to improve the sales of bussiness.
3. What Time Should We Display Advertisements to Maximize Likelihood of Customer’s Buying Product?
As usual, to remind what our data look like, we use the .head() method to show the top 5 of our updated dataframe.
If we’re gonna use our data to answer this question, we need to aggregate the period in 24 hours distribution. Look carefully at Figure 23. In “Order Date” column, there are times data. We could extract it like we did before. But to make it more consistent, we need to convert the “Order Date” Column into date time object. We’re gonna use pd.to_datetime() method.
It will take a little bit longer because of the heavy calculation. Now we can create a new column called “Hour” contain the extraction of “Order_Date_DTO” data. We only need the hours data, so we can extract them by doing this.
Now we can answer the third question, what time should we display advertisements to maximize likelihood of customer’s buying product? To answer this, we’re gonna group it by the hours and counting all of the orders.
If we want to answer the third question, we only need the “Quantity Ordered” column. Now let’s visualize it. We want it to be the line chart because this spesific data (hours) are more logical to show using line chart than bar chart because the data has to be continue.
As you can see from Figure 27, there are approximately 2 peaks at the data. They are 12 (12 PM) and 19 (7 PM). It makes sense since most people shopping during the day. From this data, we can suggest to our bussiness partner to advertise their product right before 12 PM and/or 7 PM. It could be 11.30 AM and/or 6.30 PM.
Remember, this chart is the total orders of all cities. Maybe you could make a spesific chart for a spesific city and planning the advertisement better for that city.
4. What Products Are Most Often Sold Together?
We’re gonna take a look to our top 5 data as usual.
Look carefully at Figure 28. We can see that “Order ID” indicate the transaction. So by grouping the product by the Order ID, we are able to know what products are often sold together. We’re gonna use .duplicated() method to find a duplicate values of “Order ID”.
Now we want to create a column called “Product Bundle” that contain example Google Phone and Wired Headphone (transaction 17650) at the same line. We’re gonna use the .transform() method to join values from two rows into a single row.
It’s good, but we have an issue here. We have the same order at least twice because we did merge them in every situation in groupby without dropping the duplicate values. Now let’s drop the rows with duplicate values.
Now, we need to count the pair of products. We need new libraries because they have all we need to count all the combination of products bundle. We’re gonna use itertools and collections libraries.
It’s too messy, let’s just showing the top 10 data
Now we can clearly see that the most often products sold together are iPhone and Lightning Charging Cable with 1005 transactions. We could count the 3 product bundles by just changing the count.update index into 3.
We can see the most often sold products (3 products) together are Google Phone, USB-C Charging Cable, and Wired Headphones with 87 transactions. It’s not really significant compare to the 2-Product Bundle. So we’re gonna ignore the 3-Product bundle.
What would we do with this data? Well, we could offer a smart deal to the customer that buy iPhone, you could recommend the charging cable with discount. That’s one of the possibility and you can bundle the remaining products if you need to.
5. What product sold the most? Why do you think it did?
As usual, let’s see our data looks like again.
We need to sum up the “Quantity Ordered” based on grouping the Product. So let’s do it.
To make it easier to understand, let’s visualize it.
Now we can see what product sold the most, it’s AAA Batteries(4 pack). We can also see that AA Batteries (4 pack), Lightning Charging Cable, USB-C Charging Cable, and Wired Headphones are sold more than other products. Why are they sold the most? The first impression is that they are cheaper than other products. As a data scientist, let’s prove this hypothesis. We could do it by overlaying the graph by their actual price and see if they have direct correlation.
Now we will interprate our results. Our hypothesis is true if the high sold products have low price. From the graph we can see it is the case for AAA Batteries and all products except the Macbook Pro Laptop and ThinkPad Laptop. They have decent orders eventhough they are expensive. We can say that there are many people in the world need laptops. So the laptops are the exception because the laptops have high demand.
CONCLUSION
- What was the best month for sales? How much was earned that month?
The best month for sales is December. The company earned approximately $4,810,000.
2. What city sold the most product?
San Fransisco is the city with the highest sales.
3. What time should we display advertisements to maximize likelihood of customer’s buying products?
We can suggest to advertise the products right before 12 PM and/or 7 PM. It could be 11.30 AM and/or 6.30 PM.
4. What Products are most often sold together?
The most often products sold together are iPhone and Lightning Charging Cable with 1005 transactions.
5. What product sold the most? Why do you think it did?
AAA Batteries(4 pack) is the most sold product. Because it’s cheaper than other products and has high demand.