Solving A Real World Data Science Tasks With Python

Cem ÖZÇELİK
8 min readJan 16, 2022

--

Hello everyone on the last day of the week. In our previous article, the widely used metric to measure the performance of the model established in machine learning algorithms. In this article, we will perform data manipulation using pandas library on a sample data set belonging to a technology store, and we will find answers to some business problems by making visualization with matplotlib library. We know that doing the work we have done in order to perform our work more efficiently, to reduce our costs and to increase our sales, based on data will increase our profits.

Taking our data set, it is certain that knowing customer consumption preferences such as knowing which cities our customers shopping from our technology store are from, knowing which products are sold more or knowing which product combinations are frequently preferred by customers will certainly contribute to the profitability of our business.

Let’s take a look at the following business problems that we briefly summarized in the previous passage :

  • What was the best month for sale ? / How much was earned that month ?
  • Which city had the highest number of sales?
  • What time should we display advertisements to maximize likelihood of customer’s buying product ?
  • What Products are most often sold together ?
  • What Product sold th most?

Here are our questions. Without further ado, let’s import our main libraries that will benefit us in our work.

Import the Main Libraries

In the first stage, these two libraries will do the job for us to import our dataset and simply to perform manipulations on our dataset.

Now let’s import our dataset. I will be presenting the data set I have used at the end of our study. However, when we first download our data set, we see that we have 12 different data sets with separate records for each month. In the first place, we have to combine these 12 different datasets into a single dataframe.

The Datasets Seperated By Month

Let’s combine these separately kept datasets.

Mergered Data Set -1-

We combined our datasets, which are in separate forms, and exported them to an external .csv file.

After re-importing our mergered dataset into our study, we will perform some of data cleaning processes that may cause problems in our study and may cause problems while answering our questions.

First, let’s clear the NaN values from our dataset.

Drop to the NaN Values From Dataset

In the next step, let’s add the Month column first to make the Order Date column more functional.

Create A New Column Named ‘Month’

I would like to briefly explain the process we have done here. First, we kept the first 2 characters of the values in our Order date column as a separate column. We did this because the date information stored in our Order Date column was in MM/DD/YY format. Afterwards, we made a numerical value conversion for our new column that we created to carry a numeric value.

Then, the textual element in the Order Date column that does not contain date information.

Clean Up Data From ‘Order Date’ Values

We can say that we have made the necessary arrangements to answer our first question. So let’s go to our answer by remembering our first question again.

Question 1 : What was the best month for sale ? / How much was earned that month ?

Answer For #Question 1 -1-

To make some explanations on the image, here we have use groupby our sales figures by month and we sorted the sales ascending form and listed.Now let’s visualize this information that we have kept in a list.We use the matplotlib.pyplot library for visualization.

Visualization of the answer to the #Question 1
  • We see that the graph is on a regular rise starting from the bottom in January until April. Then we see a recurring dream.
  • If we make a comment based on the inference made from the data, we can say that the advertising expenditures we made before January showed
    themselves and carried us until April.
  • The reason for the jump in sales after September is that when we consider the public holidays in the United States, it is possible to say that the Christmas Holiday caused this situation. As it is known, the business has taken its share from this situation as everyone buys gifts for each other at Christmas.

Now that we have answered the first question, we can move on to the next one.

Question 2 : Which city had the highest number of sales?

In order to answer the Question 2, we need a new column. We will obtain state and city information by performing some operations on the Purchase Address Column, which keeps the information from which city the orders are from.

Create City Column

After creating the city column, let’s group the sales values according to the cities using groupby as in the Question 1.

Groupby the Cities by Sales Values

Let’s visualize this list form seen and make it more understandable.

Visualization of the answer to the #Question 2
  • All we seemed San Francisco so much higher than the other places . This is an electronics store maybe it’s because Silicon Valley needs more electronics so you know that would be one reason . Another reason maybe the tech to advertise is like better than San Francisco or more people have more money on San Francisco.

We can create a lot of hypotheses about this visual but I guess that’s enough for now. We answered to Question 2. Ok, at the now we can answered Question 3.

Question 3: What time should we display advertisements to maximize likelihood of customer’s buying product ?

For the answered to Question 3 we have to look at Order Date column and we’re going to basically need to figure out a way that we can kind of aggregate
all of the order dates into there like distribution over a 24-hour period. I will convert the Order Date column to converted into a Date-Time Object.

In the original form of our dataset, the order date column looks like this:

The original Order Date Column

However, I convert the Order Date column to a datetime object.

Converted Order Date Column

As you can see, we performed the conversion on the Order Date Column. Afterwards, we create the hour and minute columns to find out at which times the orders are more frequent.

New Looks the Sales Dataset

Now we have coverted new dataset. The hour and minute columns are our new columns. And the now, we aggregate to sales by Hour or Minute Cols.

Visualization of the answer to the #Question 3

As we can see on the plot, the answer’s is “Which hour probably is the best?” question, we have peaks at around 11.00 a.m or 07.00 p.m. Cause of this plot show us 2 peaks time at 11.00 a.m and 07.00 p.m. If we put an ad we should put the ad those 2 peaks times. Because these time periods are the time periods where the highest sales figures are observed.

After answering our third question, we can move on to our next question.

Question 4: What Products are most often sold together ?

Answering this question is really challenging but not impossible. Firstly, we have a look the dataset again…

Solution for Question 4 -1-

If you look to dataset, you can see Order ID column. The Order ID column is keep the unique values each of orders. If you have 2 orders and this order id is the same then we can say “This order has been sold together”.

Solution Question 4 -2-

As you can see in the syntax in the image, we first duplicate the sales in our dataset in order to group the product sales according to their combinations. In order to be able to group, we need to create a new column that holds the information of the products sold in combination.

Creating the Grouped Column

When we look at the table, the values of the Order ID and Grouped column are the same, but the Product value is different. The reason for this is that since each product in the group is added to the cart as an order, it is returned as a repeating order in each one. For this, we clear the values that we have just duplicated from our data set.

We need to count the pairs of what occurs together most frequently and we’re gonna use a couple new libraries for counting unique pairs of numbers into a
dataframe.

Answer is “What occurs together most frequently?” question.

Okay, we have answered Question 4 . We have counts of occurs together most frequently. So what would we do with this ?

  • You try to be smart with promotions that you’re offering if you’re selling an IPhone maybe like try to smart offer a smart deal so you get more people.
  • People already probably want to buy an extra lightning charging cable but maybe you do a smart deal where you can kind of lure an even bigger
    audience to buy that lightning charger cable and you can kind of repeat that these other common pairs. Or you may be try to use that to your advantage as a business user so this data would help a business kind of make decisions like that.

After answering our fourth question, we can move on to our last question.

Question 5: What Product sold th most?

Okay, we’re on last question. This one just be little bit definitely simpler than Question 4. It’s gonna be what products sold the most and why do you think it’s sold the most.

Top selling products

Let’s show this list with a graph:

Visualization of the answer to the #Question 5

We performed our manipulations on our dataset with the pandas library. We create some new columns, we made our visualizations in order to find answers to our questions with the new data sets we converted.

I hope you had an enjoyable reading session, see you in our next work.

--

--