Superstore intro to data analysis with Power BI

Christophe Oliveres
9 min readApr 27, 2022

--

In this article we will follow the data analysis of an online store dataset.

At the menu, connection to the dataset, data cleaning with Power Queries, Exploratory data analysis, then a bit of dashboard design.

TLDR : We use Power BI to do a quick analysis of a sample dataset.

Summary

Dataset and use case presentation
Data analysis :
1. Import and Clean data
2. Discover the dataset
3. First insights and observations
6. Ask the good questions, find the right answers
7. Create an user friendly dashboard page to our marketing department

Dataset and use case presentation

This dataset contains sales data from an online store who sales furniture, office supplies and tech stuff.

It includes : order id, date of order / shipping, info about the customer, info about the product, total sales, profit …

For this use case, let’s pretend the Marketing department came to see us with this dataset.

They’d like our insights regarding the existing data and they came with these questions :

- Can we identify the best product / sub category / category, regarding quantity, sales or profit ?
- What items are generally sold together ?
- What other insights and observations can you provide us ?
- Can you help us monitoring these KPI over time ?

Let’s answer their questions using Power BI !

1. Import Data

First, we need to import and understand the dataset :

Power BI queries view
The Power Queries menu, embedded in Power BI for desktop. The tool help you understand your dataset.

In power BI, once you’re connected to your source, (in that case, a traditional CSV file, but it could be anything : SQL Server, API, Spreadsheet…) we arrive on this view who gives us info regarding the data quality.

Best practice : we always turn on the : Column distribution, Column Profile and Column quality. It will help us checking the distribution of each column, % of errors, duplicate, empty values …

Here we can see that some rows are missing, that column contains numerical vs categorical values, the number of distinct values by column, min / max of each column …

2. Data cleaning

Let’s order this :

  • First, change the column Data Type when needed : Data type will be very important to create calculation, comparison …
  • As a good example, the date values are currently text (string), sometimes in universal or US format . We need to standardize the format, then change the type to date type.
  • Replace every “.” by “,” in numerical columns
  • To delete error rows, we have multiple choices here : we can replace the missing values by the mean of the column, by the most common value (the mode), or simply delete the incomplete rows. We don’t have enough values to select the mode / mean option, let’s delete the incomplete rows.

Next, the current dataset doesn’t show the unit values (price, margin, % margin) … We need to add those columns :

  • Product price
  • Product margin
  • % margin

Finally, we will add dates column once the original date is defined as date type : Order Day, order month, order year … (this will be helpful when we analyze the best sales days / months / years).

Naming convention is fundamental in data analysis. Think of your future you, or your teammate who will use your dashboard in 2 months and won’t understand the difference between : sales, total sales, value, test_measure_sale_total…
A solid naming convention is mandatory !

Here we will prefer :
Current columns : Total sales, total margin, total profit
Adding the unit column : price sales, price margin, price profit

3. Exploratory Data Analysis

Now the data are cleaned, we can dive into the next layer of details, first by using simple data visualization :

We can see on the images below :

  • Some cities are performing better, in term of sales and profit
  • The 3 product categories are relatively close in term of sales but not in term of quantity sold and margin
  • Sales and profit are growing each year
  • We are addressing 3 customer segments, with different sales pattern
  • Our sales are best performing between the 9th and the 13th of each month. first days of the month are really low. The best month to sale are in 11/ 9 / 12 …
First dataviz to understand the dataset

Some cities are loosing money, we need to dive deeper to understand why. It’s doesn’t seem to be correlated to the total sales of the shop :

When the dark blue is on the left side, the store is actually losing money.

Some sub cat are loosing money as well. We will zoom in to identify which product are loosing money

Like earlier, dark blue on the left side means the product subcategory is losing money

With this first intro to EDA, we can complete the marketing questions by adding our own questions :

Marketing questions :
1. Can we identify the best product / sub category / category, regarding quantity, sales and profit ?
2. What items are generally sold together ?
3. Can you help us monitoring these KPI over time ?

Our first observation leads us to ask :
4. What products are sold by the best shops ?
5. What shops are loosing money, what items do they sell the most / the least ?
6. What are the best days to sale product ? We can next pass this info to our marketing department to select the perfect day to : launch new product / launch ad campaign …
7. Do we need to delete of the shelves the product who lose money ? or do we need them to bring clients ? What product people buy with these product ? Do the low profit items help us selling the high profit items ?
8. Does the 3 customer segments have the same patterns and habit ?
9. What values can we use to predict sales in the future ?

We won’t cover in this article the full set of questions, but you understand the idea here. Data analysis is way more efficient when you are curious, when you understand the client needs, when you perform extra research with your dataset.

4. Question 1 : What are the best product in term of sales, profit, quantity ?

Let’s create a product page. We need to keep in mind that some of our visualizations will remains in the marketing Dashboard.

Quantity sold :

Let’s create a simple bar chart to visualize our quantity by category / sub category / product. We use the “drill down” feature of Power BI, to avoid creating too much charts :

We can drill down the data in order to visualize the next layer.

The company sells 3 type of product, let’s add colors to identify category easily :

Sometimes you can’t trust your instinct regarding the color palette

In real life, that’s the moment when Dani from marketing asks you : ”hum but what’s wrong with your colors ?” He’d be right ! Best practice, stick to your company style guide.

Check this article (in French) from Manon Lay about color harmony :

Get back to our report, how to add custom colors ?

Add a column “category_number”, who applies a number regarding the category label. Furniture = 1, Office = 2 …
In power query it should looks like :

= Table.AddColumn(#”Renamed Columns”, “Category_number”,
each if [Category] = “Furniture” then 1
else if [Category] = “Technology” then 2
else if [Category] = “Office Supplies” then 3 else null)

On your chart, you can then create a conditional formatting like this :

There are probably other way to do it, but it’ll do the job here

Margins :

Let’s use the same color set

Some products have negative margin, keep that info in mind

We can dive deeper and check the products margin. The company is selling hundreds of product, it won’t be easy to visualize all of these on a bar chart.

Keeping the same color palette help us being consistent in our visualizations

Sales :

Same job here :

Products sales and subcategories sales

How do we compare these 3 values ?

Can we identify the best product regarding quantity, sales or profit ?
Don’t get lost in your analysis, we’re here to answer questions !

As we can see on the next image, in this case we have different product / price / quantity and margin, we can’t just compare these values together.

How do you know what’s good or bad here ?

Marketing concept : Product mix effect

To make it short, the product mix allows you to understand what product contributes the most to the company’s margin. It will gives you an absolute answer to the question.
To find it, you multiply contribution to Sales in % by Margin in %.

Product / % Sales / % margin / Margin Contribution

We then add 3 columns to our dataset : contribution to Total sales, margin, margin x contribution to total sales :

Numbers seem very low, keep in mind the total margin is below 5%

Here is the ordered list of our best products !

5. Question 2: Which products are the most sold together ?

Why do we need this info ?

  • To understand which products help us selling the other products
  • To develop cross sell
  • To decide whether or not we keep low margin products

Spoiler alert : Unfortunately, in this sample dataset we never sold twice the same couple of items, (it was worth it to try, we wouldn’t have found out otherwise), but the idea remains the same.

Here is one way to get the set of product aggregated and ordered :

Duplicate your data frame in order to get only the column you need : Order ID et product name.

  1. Sort by product name (we always want “product A — product B”, otherwise we would have some “product B — product A” mixed )
  2. Keep only the duplicated values on the “Order ID” column (we want products sold together, so we exclude the unique product orders)
  3. Group by Order ID :
Power BI merge the data in a table, inside the table. Tableception

4. Split the newly merged table into 1 column, using this line of custom query :

= Table.AggregateTableColumn(#”Grouped Rows”, “Count”, {{“Product Name”, each Text.Combine(List.Transform(_, (x) => Text.From(x)), “ — , — “), “Desired Result”}})
Here is the new column. It contains every order’s products.

5. Lastly, we can plot the data on a bar chart to visualize the repartition :

After all of our research, it happens this dataset is only composed of unique sets of products. Never mind, we have learnt a new approach anyway

In the real world we would get something like :

In an other article we will only use Python to analyze a dataset, with a better result on this topic.

6. Last Question : How can we create an user friendly dashboard to our marketing friends

Despite our effort to keep the Dashboard clean, we need to do some housekeeping to design a robust and user friendly dashboard

Before :

We need to try things out before deciding to keep them on the dashboard

After :

Nothing too sophisticated here, less is more.

Finally we need to display here :

  • A filter panel : here we can filter by category / sub category, City, Date.
  • A top row with cards
  • Our 4 bar charts

Why we don’t use pie chart to display the sales repartition ?

Because Pie chart are evil

We can remove unnecessary visual / table of our dashboard, users will be able to dive deeper using the UI if needed :

Here the small bar chart can be expanded, we can also display the correspondent table.

Conclusion

In this quick analysis, we performed standard and easy data manipulation to understand and analyze a dataset.

In the real world, real time data analysis (and especially real time dashboard design) often requires multiple iteration between the client (the team who needed the analysis) and you (the data analyst). Final users rarely know exactly what they want, their needs are rarely mature enough. Multiple workshops and iterations are essential.

If you’re still reading, you probably notice that we performed a superficial analysis here… And you’re right ! In real life, we would need to answer every additional question we found during the exploratory analysis, and probably more topics to analyze once all the said questions are answered.

Finally, Power BI is a great tool to dive into dataset, and it provides robust and easy-to-use dashboard for end users.

Credit :
Photo by Artem Gavrysh on Unsplash
Dataset on Kaggle
Story telling with data by Cole Nussbaumer Knaflic

--

--