E-commerce Sales Data Analysis and forecasting future sales using Excel

Manali Deb
10 min readMar 14, 2023

--

Image Source:https://imatrix.com/blog/advantages-of-e-commerce/

About Data:

The data set was downloaded from Kaggle. It is real-life sales data from a Women’s clothing E-commerce shop.

Link to the data:

https://www.kaggle.com/datasets/shilongzhuang/-women-clothing-ecommerce-sales-data

The dataset contains sales data from June to September 2022. It includes the following columns:

Order_id: Unique Id which identifies a single order placed.

Order_date: Date and time of the order

SKU: Product number used by the retailer

Color: The color of the product ordered

Size: The size of the product ordered

Unit_price: Price per product

Quantity: Number of Quantities Purchased

Revenue: unit_price*quantity

After analyzing the data we need to answer some questions:

Business Questions

  1. What are the best and worst-selling SKU items? by color? size?
  2. What is the average order value?
  3. What are the peak days or time periods with the highest sales? Do sales follow a trend or a seasonality?
  4. Advanced: Predict sales in the next months.

Note: Order_id is not unique. Multiple products ordered with the same order id but different SKU/sizes/colors are represented as different rows in the dataset. The quantity column always has a value of 1.

Data Cleaning:

Here we will perform some data cleaning to prepare the data for further analysis.

  1. Removing Duplicates: I selected the entire data and then clicked Remove Duplicates in the Data Tab. There was no duplicate value in this case.

2. Checking for BLANK cells: Select the entire dataset and click Find and Select and finally select Go To Special.

Next, select Blanks and click Ok.

All blank cells should be highlighted like in the example below.

In this dataset, there were blank cells only in size. There were 37 blank cells. That’s quite a lot of data and I do not want to lose that much data. Since the data in column Size is categorical, it is best to replace the blank cells with Mode. I calculated the Mode by plotting a bar chart of the Size column. As evident from the below bar chart, the mode is XL.

To replace all blanks with XL, select the entire size column, go to Find and Select, and click replace. Replace all Blanks with XL.

After selecting replace all, we see that 37 blank cells have been replaced.

3. Separate Date and time: To analyze the data separately with date and time, I decided to separate the date and time. We can do this easily by using the Text to Column feature in Excel.

Select the date column and select Data->Text to columns->Select Space (since the date and time are separated by space in the column)

Next, I selected the Date, changed the destination since I did not want to replace the data, and finally clicked Finish.

Columns C, D, and E are the data that have been extracted.

I only want to keep the date format in column C and get rid of the trailing zeros. Hence, I selected column C, right-click ->Format cells, and choose the date format.

Now, I want to combine time in column D with the AM/PM in column E into a new column Order_Time. Here I used the following formula:

I do not need the previous Order Time column hence I deleted column D.

I used a simple IF formula to create another column Morning/Evening to represent AM by Morning and PM by Evening.

4. Spell Check: Next I wanted to spell check my data to maintain uniformity. I used Spelling from the Review pane after selecting the color column.

In this case, we need a space between Skintone. When we click Change All, all Skintone will be replaced by Skin tone

Similarly, I replaced Spelling errors from the size column

Finally, I like to align my data so that it becomes uniform and easier to read. In this case, I used center alignment.

Exploratory Data Analysis

  1. What are the best and worst-selling SKU items? by color? size?

I selected the entire dataset to answer this question and created a Pivot table in a new sheet.

For finding the best and worst selling SKU items, I used SKU in rows and Sum of Quantity as values.

To create a chart, I selected the entire Pivot table and selected the Bar chart from the Insert tab.

Best and Worst Selling SKU items

The best-selling SKU is 799 and the worst-selling SKU is 29

Now we need to find the Best and Worst Selling SKU items by size and color. I decided to use Slicer for this so that we can see the best and worst-selling items for each color and size.

Go to Pivot Table Analyze, select Insert Slicer, and then select color and size.

Since the best and worst-performing SKUs changed by color and size, I made an interactive chart where we can see the performance of SKUs based on size and color.

Which SKU is the best and worst sold by filtered by size and color?

2. What is the average order value?

Since order Ids are repeated. First I created a pivot table that calculates the Sum of revenue per order Id.

From the Pivot table, we can get the Total number of unique orders and Total Revenue from all Orders.

To find the revenue per order, we can divide Total Revenue by the Total unique number of orders.

The average order value is $ 536.70

3. What are the peak days or time periods with the highest sales? Do sales follow a trend or a seasonality?

To find out if sales are higher/lower on any particular day, I extracted the day from Order_date using the TEXT function.

Since column D has a formula, we will not get it as a field in the pivot table. Hence I copied and pasted(Paste Special) the values of column D into another blank column. I named the column Order_day.

To find days with the highest sales, I created a pivot table with Order_day in rows and the sum of Revenue as values. After that, I created a column chart from the Insert tab.

The chart looks like the following

Revenue Earned by Day of the week

Revenue is higher on weekends than on weekdays. Monday has the highest revenue. After analyzing the seasonality in data, I noticed that approximately 52% of the revenue on Monday is earned from days where there is a spike in revenue as seen in the fig, Change in revenue by Order date

Similarly, to find the time of the day with the highest sales, I used the Morning/Evening field and the Sum of revenue and created a pie chart.

Time of the day with higher sales

Evening time is a clear winner for higher revenue. From the below graph, peak times for higher sales, we see that the two times with the highest sales were 2:52 pm and 7:15 pm.

Peak times for higher sales

Seasonality in data:

I created a line graph of Order date and Sum of revenue to examine if we have seasonality in the data.

Change in revenue by Order date

As evident from the above fig, there are spikes in revenue at the beginning of the month. After analyzing the peak sale dates I found that on 09/04/2022 60% of the orders were placed by order id 34. On 06/15/2022 50% of the orders were placed by order id 18.07/07/2022 and 09/09/2022 had unique order ids. Hence, the spike in sales seems to be from a mix of repeat and unique order ids.

To get a clearer picture I right-clicked on the order date of the previous pivot table and grouped dates in groups of 15 days.

I plotted the following line graph which highlights that there is a drop in revenue towards the end of the month each month with an exception in the month of August.

Change in revenue beginning of the month vs end of the month

4. Advanced: Predict sales in the next months

We will use Forecast Sheet from the Data tab to predict monthly sales from October to December.

I created a pivot table of month and sum of revenue. I pasted the value from the pivot table in separate columns in order to make further calculations.

Excel’s forecast function does not recognize non-numeric input for months. Hence I inserted a column month number and used the month number instead of the month name.

Next, select the month number and total revenue and click on Forecast sheet from the Data pane.

When the forecast sheet worksheet is expanded by clicking on Options we get various choices.

Forecast End: I changed the forecast end period to 12 from 10 since we need to predict till December.

I kept other options as default.

Confidence Interval: Selecting a 95% confidence interval means that 95% of the future values will fall in the range of upper and lower confidence boundaries.

Interpolation: Interpolation is used to fill in the missing points based on the weighted average of neighboring points where values for certain months are missing.

Aggregate duplicates using: If there are multiple values for some months, then an average value is calculated by default.

Now when we click on create, Excel will create forecasted data in a separate sheet.

Forecasted revenue from October to December

Excel calculates forecasts based on FORECAST.ETS function. The advantage of using this function is that it automatically detects seasonality in the data and leads to better prediction.

Below is the final chart after I made some changes in chart style and font.

Actual and forecasted revenue

Recommendation

Let’s revisit the business questions and provide answers.

  1. What are the best and worst-selling SKU items? by color? size?

The best-selling SKU is 799 and the worst-selling SKU is 29. It is advisable to keep a larger stock of SKU 799 and minimize storage of SKU 29. A similar decision should be made by analyzing each SKU item by size and color from the interactive chart. Eg. BOBO should be stocked more for blue and 799 should be stocked most for Medium size. The worst selling item for blue color is 628 and the worst selling item for Medium size is 79

2. What is the average order value?

The average order value is $536.70. Adding personalized product recommendations and providing discount coupons are great ideas to increase the average order value.

3. What are the peak days or time periods with the highest sales? Do sales follow a trend or a seasonality?

Evening time has better sales than morning. It might be a good idea to advertise the E-commerce shop during morning hours to attract prospective customers. From the bi-weekly chart, it is evident that revenue is higher at the beginning of the month and gradually decreases towards the end of the month with the exception of the month of August. Further research is necessary to determine if inventory is declining at the end of the month due to which customers could not place orders.

Thank you for reading this article. I appreciate your clap and comment!

--

--

Manali Deb

Data Analyst skilled in Excel, SQL, Tableau | GitHub @Manali1608 | Tableau @manali.deb