Superstore intro to data analysis with Power BI
In this article we will follow the data analysis of an online store dataset.
At the menu, connection to the dataset, data cleaning with Power Queries, Exploratory data analysis, then a bit of dashboard design.
TLDR : We use Power BI to do a quick analysis of a sample dataset.
Summary
Dataset and use case presentation
Data analysis :
1. Import and Clean data
2. Discover the dataset
3. First insights and observations
6. Ask the good questions, find the right answers
7. Create an user friendly dashboard page to our marketing department
Dataset and use case presentation
This dataset contains sales data from an online store who sales furniture, office supplies and tech stuff.
It includes : order id, date of order / shipping, info about the customer, info about the product, total sales, profit …
For this use case, let’s pretend the Marketing department came to see us with this dataset.
They’d like our insights regarding the existing data and they came with these questions :
- Can we identify the best product / sub category / category, regarding quantity, sales or profit ?
- What items are generally sold together ?
- What other insights and observations can you provide us ?
- Can you help us monitoring these KPI over time ?
Let’s answer their questions using Power BI !
1. Import Data
First, we need to import and understand the dataset :
In power BI, once you’re connected to your source, (in that case, a traditional CSV file, but it could be anything : SQL Server, API, Spreadsheet…) we arrive on this view who gives us info regarding the data quality.
Best practice : we always turn on the : Column distribution, Column Profile and Column quality. It will help us checking the distribution of each column, % of errors, duplicate, empty values …
Here we can see that some rows are missing, that column contains numerical vs categorical values, the number of distinct values by column, min / max of each column …
2. Data cleaning
Let’s order this :
- First, change the column Data Type when needed : Data type will be very important to create calculation, comparison …
- As a good example, the date values are currently text (string), sometimes in universal or US format . We need to standardize the format, then change the type to date type.
- Replace every “.” by “,” in numerical columns
- To delete error rows, we have multiple choices here : we can replace the missing values by the mean of the column, by the most common value (the mode), or simply delete the incomplete rows. We don’t have enough values to select the mode / mean option, let’s delete the incomplete rows.
Next, the current dataset doesn’t show the unit values (price, margin, % margin) … We need to add those columns :
- Product price
- Product margin
- % margin
Finally, we will add dates column once the original date is defined as date type : Order Day, order month, order year … (this will be helpful when we analyze the best sales days / months / years).
Naming convention is fundamental in data analysis. Think of your future you, or your teammate who will use your dashboard in 2 months and won’t understand the difference between : sales, total sales, value, test_measure_sale_total…
A solid naming convention is mandatory !Here we will prefer :
Current columns : Total sales, total margin, total profit
Adding the unit column : price sales, price margin, price profit
3. Exploratory Data Analysis
Now the data are cleaned, we can dive into the next layer of details, first by using simple data visualization :
We can see on the images below :
- Some cities are performing better, in term of sales and profit
- The 3 product categories are relatively close in term of sales but not in term of quantity sold and margin
- Sales and profit are growing each year
- We are addressing 3 customer segments, with different sales pattern
- Our sales are best performing between the 9th and the 13th of each month. first days of the month are really low. The best month to sale are in 11/ 9 / 12 …
Some cities are loosing money, we need to dive deeper to understand why. It’s doesn’t seem to be correlated to the total sales of the shop :
Some sub cat are loosing money as well. We will zoom in to identify which product are loosing money
With this first intro to EDA, we can complete the marketing questions by adding our own questions :
Marketing questions :
1. Can we identify the best product / sub category / category, regarding quantity, sales and profit ?
2. What items are generally sold together ?
3. Can you help us monitoring these KPI over time ?
Our first observation leads us to ask :
4. What products are sold by the best shops ?
5. What shops are loosing money, what items do they sell the most / the least ?
6. What are the best days to sale product ? We can next pass this info to our marketing department to select the perfect day to : launch new product / launch ad campaign …
7. Do we need to delete of the shelves the product who lose money ? or do we need them to bring clients ? What product people buy with these product ? Do the low profit items help us selling the high profit items ?
8. Does the 3 customer segments have the same patterns and habit ?
9. What values can we use to predict sales in the future ?
We won’t cover in this article the full set of questions, but you understand the idea here. Data analysis is way more efficient when you are curious, when you understand the client needs, when you perform extra research with your dataset.
4. Question 1 : What are the best product in term of sales, profit, quantity ?
Let’s create a product page. We need to keep in mind that some of our visualizations will remains in the marketing Dashboard.
Quantity sold :
Let’s create a simple bar chart to visualize our quantity by category / sub category / product. We use the “drill down” feature of Power BI, to avoid creating too much charts :
The company sells 3 type of product, let’s add colors to identify category easily :
In real life, that’s the moment when Dani from marketing asks you : ”hum but what’s wrong with your colors ?” He’d be right ! Best practice, stick to your company style guide.
Check this article (in French) from Manon Lay about color harmony :
Get back to our report, how to add custom colors ?
Add a column “category_number”, who applies a number regarding the category label. Furniture = 1, Office = 2 …
In power query it should looks like :
= Table.AddColumn(#”Renamed Columns”, “Category_number”,
each if [Category] = “Furniture” then 1
else if [Category] = “Technology” then 2
else if [Category] = “Office Supplies” then 3 else null)
On your chart, you can then create a conditional formatting like this :
Margins :
Let’s use the same color set
We can dive deeper and check the products margin. The company is selling hundreds of product, it won’t be easy to visualize all of these on a bar chart.
Sales :
Same job here :
How do we compare these 3 values ?
Can we identify the best product regarding quantity, sales or profit ?
Don’t get lost in your analysis, we’re here to answer questions !
As we can see on the next image, in this case we have different product / price / quantity and margin, we can’t just compare these values together.
Marketing concept : Product mix effect
To make it short, the product mix allows you to understand what product contributes the most to the company’s margin. It will gives you an absolute answer to the question.
To find it, you multiply contribution to Sales in % by Margin in %.
We then add 3 columns to our dataset : contribution to Total sales, margin, margin x contribution to total sales :
Here is the ordered list of our best products !
5. Question 2: Which products are the most sold together ?
Why do we need this info ?
- To understand which products help us selling the other products
- To develop cross sell
- To decide whether or not we keep low margin products
Spoiler alert : Unfortunately, in this sample dataset we never sold twice the same couple of items, (it was worth it to try, we wouldn’t have found out otherwise), but the idea remains the same.
Here is one way to get the set of product aggregated and ordered :
Duplicate your data frame in order to get only the column you need : Order ID et product name.
- Sort by product name (we always want “product A — product B”, otherwise we would have some “product B — product A” mixed )
- Keep only the duplicated values on the “Order ID” column (we want products sold together, so we exclude the unique product orders)
- Group by Order ID :
4. Split the newly merged table into 1 column, using this line of custom query :
= Table.AggregateTableColumn(#”Grouped Rows”, “Count”, {{“Product Name”, each Text.Combine(List.Transform(_, (x) => Text.From(x)), “ — , — “), “Desired Result”}})
5. Lastly, we can plot the data on a bar chart to visualize the repartition :
In the real world we would get something like :
6. Last Question : How can we create an user friendly dashboard to our marketing friends
Despite our effort to keep the Dashboard clean, we need to do some housekeeping to design a robust and user friendly dashboard
Before :
After :
Finally we need to display here :
- A filter panel : here we can filter by category / sub category, City, Date.
- A top row with cards
- Our 4 bar charts
Why we don’t use pie chart to display the sales repartition ?
We can remove unnecessary visual / table of our dashboard, users will be able to dive deeper using the UI if needed :
Conclusion
In this quick analysis, we performed standard and easy data manipulation to understand and analyze a dataset.
In the real world, real time data analysis (and especially real time dashboard design) often requires multiple iteration between the client (the team who needed the analysis) and you (the data analyst). Final users rarely know exactly what they want, their needs are rarely mature enough. Multiple workshops and iterations are essential.
If you’re still reading, you probably notice that we performed a superficial analysis here… And you’re right ! In real life, we would need to answer every additional question we found during the exploratory analysis, and probably more topics to analyze once all the said questions are answered.
Finally, Power BI is a great tool to dive into dataset, and it provides robust and easy-to-use dashboard for end users.
Credit :
Photo by Artem Gavrysh on Unsplash
Dataset on Kaggle
Story telling with data by Cole Nussbaumer Knaflic