Solving Real-World Business Questions with Python Pandas

Li-Ting Liao
Published in Dev Diaries
Jul 3, 2020 · 7 min read

People are always saying “the best way to boost our skills in dealing with data is to get our hands messy and play with it.”

I always find it’s kind of scary, but when we’re learning new things, aren’t we always feeling like we’re jumping off a cliff and learning to fly on the way down?

https://unsplash.com/photos/l4MwmH8QIxk

Luckily, nowadays there are tons of amazing resources online to check out. This post is inspired by this YouTube video. All we need is to grab a few datasets, define some questions that can be answered with them, and start our never-ending journey with Stack Overflow.

Yup, I’m so ready to jump on the wagon and spend day and night on the good old internet.

This article will cover:

  • Taking a look at the datasets I’ve found
  • Defining some questions I will answer through the datasets
  • Cleaning the data
  • Plotting charts and graphs to visualize the results

Skills that will be used:

  • Import sets of CSV data files and concatenate them
  • Sort through data according to date and reset the index
  • Drop NaN values from DataFrame
  • Remove rows based on a condition
  • Change the data type of values in columns for our use
  • Add columns by splitting cell values from other columns, etc.
  • Use .apply, .transform, .map functions to manipulate cells
  • Use .groupby function to perform aggregate analysis
  • Plot charts using pandas DataFrame method

The environment I used:

  • Anaconda
  • Python 3

The data are here: a set of mock monthly sales datasets covering the full year of 2019, perhaps from a shopping website in the US.

So, let’s get started.

Take a look at one of the datasets, say, the April CSV file:
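A quick peek could look like this (the file name Sales_April_2019.csv and the Sales_Data folder are assumptions about how the monthly files are organized):

```python
import pandas as pd

# Assumed name and location of the April file
april = pd.read_csv("Sales_Data/Sales_April_2019.csv")
april.head()
```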

There are several columns:

Order ID: an ID for each order
Product: product name
Quantity Ordered
Price Each: unit price in USD
Order Date: month, day, year, and time
Purchase Address: street, city, state

So we can try to explore several questions here:

1. What was the best month for sales? How much was earned that month?
2. Which city had the highest number of sales?
3. What time could be the best time to display our advertising to customers?
4. What product combinations were most often sold?
5. Which products were sold the most? Is it related to their average price?

Then I import all the datasets and concatenate them. The combined dataset has 186,850 rows:
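The original post shows this step as a screenshot; a minimal sketch could look like the following, assuming the twelve monthly CSVs sit in a Sales_Data folder:

```python
import glob
import pandas as pd

# Assumed layout: one CSV per month under Sales_Data/, e.g. Sales_April_2019.csv
files = glob.glob("Sales_Data/Sales_*_2019.csv")

# Read every monthly file and stack them into one DataFrame
all_data = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)
print(len(all_data))  # 186850 rows in total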

I can see the order of ‘Order Date’ seems quite random. We may want it to be sorted, so we can use .sort_values() :
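Something along these lines, sorting by the raw date string (which is in MM/DD/YY format, so string order matches chronological order within the year) and resetting the index:

```python
# Sort chronologically and renumber the rows
all_data = all_data.sort_values(by='Order Date').reset_index(drop=True)
all_data.head()
```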

Now Jan data comes up nicely:

https://unsplash.com/photos/-uHVRvDr7pg

Let’s start from Question 1:

What was the best month for sales? How much was earned that month?

To answer this, we may need to augment the data with additional columns, e.g. extract a ‘Month’ column from ‘Order Date’ and add a ‘Sales’ column.

Before creating the new columns, there are several NaN rows to clean up. First, find them:
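A sketch of that check:

```python
# Rows that contain any NaN value
nan_df = all_data[all_data.isna().any(axis=1)]
nan_df.head()
```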

Since the rows that contain NaN are empty across all columns, I can drop them with the code below:
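Roughly:

```python
# Drop rows that are NaN in every column
all_data = all_data.dropna(how='all')
```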

Then add the ‘Month’ column. I can get the first couple of characters of ‘Order Date’ as ‘Month’, but when converting ‘Month’ from string to integer, there’s an error: there seem to be cells containing the text ‘Or’:

So we need to figure out where those ‘Or’ values come from, clean them up, and add the new ‘Month’ column to the DataFrame:
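A sketch of the fix (the stray ‘Or’ values most likely come from header rows that got repeated when the monthly files were concatenated):

```python
# Drop repeated header rows whose 'Order Date' cell literally says 'Order Date'
all_data = all_data[all_data['Order Date'].str[0:2] != 'Or']

# 'Order Date' looks like '04/19/19 08:46', so the first two characters are the month
all_data['Month'] = all_data['Order Date'].str[0:2].astype('int32')
```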

Then we’re good:

Next, add the ‘Sales’ column. Again, when I try to create it by multiplying ‘Quantity Ordered’ and ‘Price Each’, there’s an error:

So we need to convert the value types again and add the new ‘Sales’ column to the DataFrame:
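For instance:

```python
# Quantity and price were read as strings; convert them to numbers first
all_data['Quantity Ordered'] = pd.to_numeric(all_data['Quantity Ordered'])
all_data['Price Each'] = pd.to_numeric(all_data['Price Each'])

# Sales = quantity * unit price
all_data['Sales'] = all_data['Quantity Ordered'] * all_data['Price Each']
```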

Now it’s good:

Other than the ‘Month’ and ‘Sales’ columns, I also need a ‘City’ column. We need to watch out here: several city names are duplicated across different states, like Portland (Maine) and Portland (Oregon), so we append the state name after the city to avoid confusion:
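One way to do it with .apply, assuming ‘Purchase Address’ always looks like '917 1st St, Dallas, TX 75001':

```python
# Pull the city (2nd field) and state code (start of the 3rd field) out of the address
def get_city(address):
    return address.split(',')[1].strip()

def get_state(address):
    return address.split(',')[2].strip().split(' ')[0]

all_data['City'] = all_data['Purchase Address'].apply(
    lambda x: f"{get_city(x)} ({get_state(x)})")
```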

I can now plot a chart of sales by month to answer my first question. Here I used matplotlib to chart it out:
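A sketch of that chart:

```python
import matplotlib.pyplot as plt

# Total sales per month
monthly_sales = all_data.groupby('Month')['Sales'].sum()

months = range(1, 13)
plt.bar(months, monthly_sales)
plt.xticks(months)
plt.xlabel('Month')
plt.ylabel('Sales (USD)')
plt.show()
```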

We can see that the best month for sales is December with 4.6 million (USD) and the worst month is January with 1.8 million (USD). It makes sense: in the US, Christmas is always a huge shopping season. As a business, we could launch a dedicated advertising campaign for that month!

Answer spoiled! :D https://unsplash.com/photos/gZXx8lKAb7Y

Question 2:

Which city had the highest number of sales?

I can use the code below to chart the numbers and answer the question. Here I used the DataFrame plot method to chart directly:
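Roughly like this, plotting straight from the grouped result with the .plot method pandas provides:

```python
import matplotlib.pyplot as plt

# Total sales per city
city_sales = all_data.groupby('City')['Sales'].sum()
city_sales.plot(kind='bar', rot=45)
plt.ylabel('Sales (USD)')
plt.show()
```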

With the chart, we see that San Francisco has the most sales. This is interesting, and we can ask more questions around it, like what makes sales so high in San Francisco. Maybe it’s the products: if the items bought are mostly electronics, we might guess it’s because Silicon Valley is near San Francisco and people there buy a lot of electronics. We can also ask why we’re not doing well in certain cities. There’s a lot of further analysis we can play around with!

https://unsplash.com/photos/15YTRXKuJ14

Question 3:

What time could be the best time to display our advertising to customers?

I can first convert the ‘Order Date’ values into datetime format and then add new ‘Hour’ and ‘Minute’ columns accordingly:
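For example (the date format string is an assumption based on values like '04/19/19 08:46'):

```python
# Parse the raw strings into real datetimes, then extract hour and minute
all_data['Order Date'] = pd.to_datetime(all_data['Order Date'], format='%m/%d/%y %H:%M')
all_data['Hour'] = all_data['Order Date'].dt.hour
all_data['Minute'] = all_data['Order Date'].dt.minute
```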

Then use the code below to chart time of day against purchases:
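Counting the number of orders placed in each hour is one way to do it:

```python
import matplotlib.pyplot as plt

# Number of orders per hour of the day
counts = all_data.groupby('Hour').size()

plt.plot(counts.index, counts.values)
plt.xticks(range(24))
plt.xlabel('Hour of day')
plt.ylabel('Number of orders')
plt.grid()
plt.show()
```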

We can see 11 am to 12 pm and around 7 pm as the two major purchase peaks. These could be our prime times to run more advertising during the day!

Answer spoiled again! :D https://unsplash.com/photos/f-gxmsZlj9c

Question 4:

What product combinations were most often sold?

Some products seem to be purchased under the same Order ID and delivered to the same address. I can find those orders with the code below:
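A sketch:

```python
# Keep only rows whose Order ID appears more than once,
# i.e. orders that contain several products
df = all_data[all_data['Order ID'].duplicated(keep=False)].copy()
df.head()
```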

We can see that order ID 144804 seems to be someone who ordered Wired Headphones and an iPhone, shipped to the same address.

Next, I can group those items from the same ID into a new column, ‘Grouped’:
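Something like this with .transform:

```python
# Join all products belonging to the same order into one comma-separated string
df['Grouped'] = df.groupby('Order ID')['Product'].transform(lambda p: ','.join(p))
```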

Then there are two rows with the same Order ID and the same ‘Grouped’ items. The next step is to keep only the columns I need for my question, ‘Order ID’ and ‘Grouped’, and drop the duplicated rows. I also sort the items within each row, because later on I will use a function called combinations, which returns tuples, and different orderings of the same items would otherwise be counted as separate pairs. I would like to avoid that:
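Roughly:

```python
# One row per order, with the items inside each combination sorted so that
# 'A,B' and 'B,A' later count as the same pair
df = df[['Order ID', 'Grouped']].drop_duplicates()
df['Grouped'] = df['Grouped'].apply(lambda s: ','.join(sorted(s.split(','))))
```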

Then I can use the combinations function from itertools to count unique product pairs, and chart them out:
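A sketch of the counting (using a Counter from collections is my choice here, not necessarily what the original code did):

```python
from collections import Counter
from itertools import combinations

# Count every pair of products that appears together in the same order
count = Counter()
for row in df['Grouped']:
    items = row.split(',')
    count.update(combinations(items, 2))

# Print the five most common pairs (these could just as easily go into a bar chart)
for pair, freq in count.most_common(5):
    print(pair, freq)
```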

We can see the top 5 combinations are:

‘Lightning Charging Cable’, ‘iPhone’
‘Google Phone’, ‘USB-C Charging Cable’
‘Wired Headphones’, ‘iPhone’
‘Google Phone’, ‘Wired Headphones’
‘Apple Airpods Headphones’, ‘iPhone’

This suggests we could promote bundle deals built around these combinations. That could be appealing to customers!

https://unsplash.com/photos/IrRbSND5EUc

Question 5:

Which products were sold the most? Is it related to their average price?

I first created a bar chart of ‘Product’ against ‘Quantity Ordered’ and a separate line graph of ‘Product’ against ‘Price Each’, but I think it’s better to overlay them to see the relationship between quantity and average unit price:
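A sketch of the overlaid chart, using a secondary y axis:

```python
import matplotlib.pyplot as plt

# Quantity ordered per product (bars) and average unit price (line)
product_group = all_data.groupby('Product')
quantity = product_group['Quantity Ordered'].sum()
prices = product_group['Price Each'].mean()

fig, ax1 = plt.subplots(figsize=(12, 5))
ax2 = ax1.twinx()  # second y axis sharing the same x axis

ax1.bar(quantity.index, quantity.values, color='g')
ax2.plot(prices.index, prices.values, 'b-')

ax1.set_xlabel('Product')
ax1.set_ylabel('Quantity Ordered', color='g')
ax2.set_ylabel('Average Price (USD)', color='b')
ax1.tick_params(axis='x', labelrotation=90, labelsize=8)
plt.show()
```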

We can see that when the average price (Price Each) is low, the quantity ordered is high, as with AA and AAA batteries. Similarly, when the average price is high, the quantity ordered is low, as with the LG dryer and washing machine. However, there’s an exception: laptops have a high average price but quite strong demand, like the MacBook and ThinkPad laptops.

If this happens in an FMCG company with several products in its line, they may want to look for products with good potential: those with a decent average price and relatively strong demand despite the higher unit price!

That’s it. Those are the five questions I tried to answer from the mock sales dataset.

Thanks for reading!

Li-Ting Liao
Dev Diaries

Software developer by day, amateur writer by night. Passionate about both code and creativity, and always seeking new ways to learn and grow.