HANDLING AN END-TO-END DATA SCIENCE PROJECT

Kübra Sak
8 min read · Oct 9, 2022


Hello,

Today I will talk about the basic principles a data analyst/data scientist follows when taking on a piece of work. Along the way, I will give examples from the work we did in the VBO Bootcamp/Miuul final project.

Our headings will be as follows.

1- Understanding our purpose and the problem we are trying to solve

2- Getting to know the data closely

3- Data Preprocessing

4- Feature Engineering

5- Standardization

6- Estimation

7- Clustering

8- Visualization

1- Understanding our purpose and the problem we are trying to solve

The first questions we should ask ourselves when starting a piece of work, whether on our own or at the institution we work for, are: “What is the problem we are trying to solve?”, “What is our goal?”, “What will this work contribute?” These questions are invaluable whether our job is data analytics, data science, application development or project management.

These questions are very important both for choosing the right technique and for making the right inferences and developing our competencies by taking ownership of the work. Choosing the right technique matters because the method changes according to the purpose and goal of the work. We experienced this in this project: using the same data, we solved two different problems/targets with two different techniques.

The dataset we use:

The outputs we aim to deliver: estimating the budget we will spend to acquire customers, and getting to know our customers better by segmenting them so that we can prepare more suitable campaigns for them. We used a regression method for the budget estimation and a clustering method for the customer segmentation.

Alright, what do we gain in terms of soft skills? When we know the purpose/benefit of our work, we both take more ownership of it and see the opportunities that come out of it more clearly, so our productivity increases.

2- Getting to Know the Data Closely

To answer the question “What does the data tell me?”, we need to get to know our data closely. There are certain questions we should ask ourselves.

1- How are the data here related to each other? (This is important for the inferences we will make later.)

2- At what level of detail (grain) is the data kept?

3- Are there any NULL values in the data?

4- Is there an anomaly in the data?

5- Is the data historical?

6- What do the columns in the dataset represent?

Data exploration is even more important if, like us, you are using a Kaggle dataset, in order to answer the questions above.

1- First of all, you should read all the explanations provided by the platform you are pulling the data from; then you can check whether there are outliers and null records in your data with the operations sketched after this list.

2- It is important to identify the categorical variables, the numeric variables, the categorical variables that look numeric, and the variables with high cardinality in our data.

3- After identifying the numerical variables, they should be examined using quartiles (the IQR rule) to see whether there are any outliers.

4- The distribution of the categorical variables in the data should be examined to determine which values are dominant and which are rare.

5- The effect of the variables on each other should be examined with a correlation analysis. New KPIs can be derived from highly correlated variables, and one variable from each highly correlated pair should be dropped. When choosing, it is better for the estimation to keep the one that is more strongly correlated with the dependent variable, so that the remaining variables explain a larger share of it.

6- By making various groupings, you can work out what each row uniquely represents and spot relationships between the columns.

(The dataset we used included store-based, product-based and customer demographic information.)
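A minimal sketch of these checks in pandas could look like the following; the file name dataset.csv and the nunique/IQR thresholds are assumptions, not the project’s actual values.

```python
import pandas as pd

df = pd.read_csv("dataset.csv")  # hypothetical path to the downloaded Kaggle file

# Basic structure and missing values
print(df.shape)
print(df.isnull().sum())
print(df.describe().T)

# Split columns by type; the nunique thresholds are a common heuristic, not fixed rules
cat_cols = [c for c in df.columns if df[c].dtype == "O"]
num_cols = [c for c in df.columns if df[c].dtype != "O"]
num_but_cat = [c for c in num_cols if df[c].nunique() < 10]       # numeric-looking categoricals
high_cardinality = [c for c in cat_cols if df[c].nunique() > 20]  # e.g. promotion names

# Outlier check with the quartile (IQR) rule
def outlier_bounds(series, q1=0.25, q3=0.75):
    quart1, quart3 = series.quantile([q1, q3])
    iqr = quart3 - quart1
    return quart1 - 1.5 * iqr, quart3 + 1.5 * iqr

for col in num_cols:
    low, up = outlier_bounds(df[col])
    print(col, "has outliers:", bool(((df[col] < low) | (df[col] > up)).any()))

# Distribution of categorical variables: dominant vs. rare classes
for col in cat_cols:
    print(df[col].value_counts(normalize=True).head())
```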

For this, I am sharing an example of the grouping work we did below.
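(The original screenshot is not reproduced here; below is a minimal sketch of the kind of grouping check described, continuing with the df loaded above. The exact column names, such as store_type, store_sales, unit_sales and SRP, are assumptions.)

```python
# Aggregate sales-related columns by store type to see how the measures relate
summary = (
    df.groupby("store_type")[["store_sales", "unit_sales", "SRP"]]
      .agg(["mean", "sum"])
)
print(summary)

# Check the relationship the grouping suggested: STORE_SALES = UNIT_SALES * SRP
diff = (df["store_sales"] - df["unit_sales"] * df["SRP"]).abs()
print("share of rows where the identity holds:", (diff < 1e-6).mean())
```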

As a result of the groupings, it turned out that STORE_SALES = UNIT_SALES * SRP. Since we are not in the retail sector, these concepts were unfamiliar to us, so we Googled them and then compared what we found with the result of the grouping; you should do the same 😊

3- Data Preprocessing

Based on our conclusions from the data discovery, there were no outliers or null records in our data. There was a duplicated column, which we removed.

The following variable pairs were highly correlated with each other, which was the relationship we expected (a sketch of how such pairs can be flagged follows the list).

→ Grocery_sqft x Meat_sqft: High Negative Correlation

→ Store_sales x Store_cost: High Positive Correlation

→ Store_sales x SRP: High Positive Correlation

→ Gross_weight x Net_weight: High Positive Correlation

→ Salad_bar x Prepared_food x Coffee_bar x Video_store x Florist: Medium Positive Correlation
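A sketch of how such pairs can be flagged, continuing with df and num_cols from the exploration sketch above; the 0.9 threshold is an arbitrary choice.

```python
import numpy as np

corr = df[num_cols].corr()

# Keep only the upper triangle so each pair is listed once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
high_pairs = (
    upper.stack()
         .loc[lambda s: s.abs() > 0.9]
         .sort_values(key=abs, ascending=False)
)
print(high_pairs)

# From each highly correlated pair we keep the variable that is more strongly
# correlated with the dependent variable and drop the other one.
```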

4- Feature Engineering

I think it is very important to define our purpose and our data before moving on to this step. Then it is very valuable to ask questions such as “How can I create KPIs?” and “How can I generate added value from the data I have?”. If you are working in an organization, it is important to ask the business unit what situations they face and to understand them well in order to identify the problem.

Our first goal in this project is to estimate the budget a grocery chain spends on customer acquisition. So why am I estimating it? Because I will adjust my budget accordingly in the future, I want to know which factors matter most for customer acquisition, and I want to reduce the cost I spend.

For this, we produced 41 new variables through feature engineering: feature extraction, feature interactions and one-hot encoding.

First of all, we converted categorical variables into numerical ones so that they could be fed into the algorithms.

Since there is an ordinal relationship in variables such as education status and membership, we converted them to numerical form as follows.
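(The original code block is not shown here; a minimal sketch of the ordinal mapping, where the exact label spellings and column names are illustrative assumptions.)

```python
# Order-preserving numeric codes for ordinal variables.
# The label spellings below are illustrative, not necessarily the dataset's exact values.
education_order = {
    "Partial High School": 0,
    "High School Degree": 1,
    "Partial College": 2,
    "Bachelors Degree": 3,
    "Graduate Degree": 4,
}
membership_order = {"Normal": 0, "Bronze": 1, "Silver": 2, "Golden": 3}

df["education_encoded"] = df["education"].map(education_order)
df["membership_encoded"] = df["member_card"].map(membership_order)
```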

For columns that hold more than one value in a single cell, such as the media column, we created new columns by splitting them as follows; this let us see which media channels are used more and which have more impact on the cost variable.
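A sketch of the idea with pandas, assuming the column is called media_type, the channels are comma-separated in one cell, and the target column is called cost.

```python
# One binary column per channel, e.g. "Daily Paper, Radio" -> media_Daily Paper, media_Radio
media_flags = df["media_type"].str.get_dummies(sep=", ").add_prefix("media_")
df = pd.concat([df, media_flags], axis=1)

# Which channels appear most often?
print(media_flags.sum().sort_values(ascending=False))

# Rough view of each channel's relationship with the cost target
print(df[list(media_flags.columns) + ["cost"]].corr()["cost"].sort_values())
```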

We produced new features by capturing important keywords in categorical but high-cardinality columns such as the promotion column. For example, promotions with slogans containing “Day” or “Weekend” may affect the user differently, since they run for a limited number of days. Likewise, the words “Save”/“Saving” may attract users more because they signal spending less.
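A sketch of those keyword flags, assuming the column is called promotion_name; the keywords follow the examples above.

```python
# Binary flags for keywords in the high-cardinality promotion column.
# Naive substring matches for illustration; "Sav" catches both "Save" and "Saving".
keywords = {"promo_day": "Day", "promo_weekend": "Weekend", "promo_saving": "Sav"}
for new_col, pattern in keywords.items():
    df[new_col] = df["promotion_name"].str.contains(pattern, case=False, na=False).astype(int)
```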

The columns we chose to one-hot encode were those whose values have no hierarchical order over each other and that do not take too many distinct values, for example Country, food_family, occupation, etc.
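A sketch with pandas get_dummies; the exact column names may differ in the dataset.

```python
# One-hot encode the low-cardinality, unordered categorical columns named above;
# drop_first avoids creating a fully redundant dummy column.
onehot_cols = ["country", "food_family", "occupation"]
df = pd.get_dummies(df, columns=onehot_cols, drop_first=True)
```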

5- Standardization

This step is usually done to prevent one variable from dominating the others because of its scale, and to shorten the training time.

We preferred the StandardScaler method because there were no outliers in our data.

If there had been outliers in our data, we would have preferred the RobustScaler method.
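A minimal sketch of the scaling step with scikit-learn, assuming cost is the target column that we leave unscaled.

```python
from sklearn.preprocessing import RobustScaler, StandardScaler

numeric_features = df.select_dtypes(include="number").columns.drop("cost")

# No outliers in our case, so StandardScaler (zero mean, unit variance) is enough
scaler = StandardScaler()
df[numeric_features] = scaler.fit_transform(df[numeric_features])

# With outliers we would swap in RobustScaler, which uses the median and the IQR:
# scaler = RobustScaler()
```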

6- Estimation

While making our predictions in this project, we fed the data into 7 different machine learning methods and obtained an estimation success score for each. We then continued with hyperparameter optimization on the one with the lowest error. Before the hyperparameter optimization, we examined the importance of our variables and eliminated the insignificant and highly correlated ones, aiming both to break the correlation relationships and to shorten the training time.
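The post does not list the seven algorithms, so the set below is only an illustrative sketch of the compare-then-tune workflow with scikit-learn (plus XGBoost and LightGBM); treating cost as the target and the parameter grid are assumptions.

```python
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor

X = df.drop("cost", axis=1)  # "cost" as the target is an assumption
y = df["cost"]

# Compare several regressors by cross-validated RMSE
models = [
    ("LR", LinearRegression()),
    ("KNN", KNeighborsRegressor()),
    ("CART", DecisionTreeRegressor(random_state=42)),
    ("RF", RandomForestRegressor(random_state=42)),
    ("GBM", GradientBoostingRegressor(random_state=42)),
    ("XGB", XGBRegressor(random_state=42)),
    ("LGBM", LGBMRegressor(random_state=42)),
]
for name, model in models:
    rmse = -cross_val_score(model, X, y, cv=5,
                            scoring="neg_root_mean_squared_error").mean()
    print(f"{name}: RMSE={rmse:.4f}")

# Tune the best-scoring model (LightGBM here is an example choice, not necessarily ours)
param_grid = {"n_estimators": [300, 500],
              "learning_rate": [0.01, 0.1],
              "num_leaves": [31, 63]}
best = GridSearchCV(LGBMRegressor(random_state=42), param_grid,
                    cv=5, scoring="neg_root_mean_squared_error").fit(X, y)
print(best.best_params_)
```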

Feature Importance
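(The original feature-importance chart is not reproduced here; a sketch of how such a chart can be produced, continuing from the tuned model in the sketch above.)

```python
import matplotlib.pyplot as plt

# Importances of the tuned tree-based model, largest first
final_model = best.best_estimator_
importances = (
    pd.Series(final_model.feature_importances_, index=X.columns)
      .sort_values(ascending=False)
)
print(importances.head(20))

importances.head(20).plot(kind="barh", figsize=(8, 6), title="Feature Importance")
plt.gca().invert_yaxis()  # largest importance at the top
plt.tight_layout()
plt.show()
```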

7- Clustering

In our project we also wanted to work on customer acquisition and customer retention, so we segmented our customers. For this, we used two KPIs: the cost we spend to retain a customer and the margin that customer brings us. I am sharing the picture with you.
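(The segmentation picture itself is not reproduced here; below is a minimal sketch of the clustering idea with K-Means, where df_customer and the KPI column names acquisition_cost and margin are assumptions.)

```python
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Two customer-level KPIs: the cost spent to acquire/retain the customer
# and the margin the customer brings us
seg = df_customer[["acquisition_cost", "margin"]]
seg_scaled = StandardScaler().fit_transform(seg)

# Five clusters, matching the segments discussed below
kmeans = KMeans(n_clusters=5, random_state=42, n_init=10)
df_customer["segment"] = kmeans.fit_predict(seg_scaled)

print(df_customer.groupby("segment")[["acquisition_cost", "margin"]].mean())
```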

8- Visualization

Today data is very valuable, but if we do not know how to use it, it remains like unprocessed ore; being able to describe the data is as valuable as processing it. The best way to describe data is to visualize it.

In this project, we built a dashboard using MicroStrategy and put it to actual use.

Store Type Based Store Sales & Store Cost: In this chart, we wanted to show the sales and cost values in the stores by store type, and we used a bar plot for this.

City of Store Sales: Here we used a map to show the city-based distribution of shops.

Total Customer: Here we wanted to show the distribution of customers across countries, and we chose a pie chart because there are only 3 country values in our data.

Acquisition Cost of Promotion: We wanted to determine the effect of promotions on the cost of customer acquisition.

Gender & Marital Status Based Customers: We also showed our customers’ demographic information with a pie chart.

Number of Customers by Brand: We used the WORD-CLOUD method for our customers’ brand preferences in shopping.

Channel Member & AVG Yearly Revenue Distribution: With a branch graph, we showed through how many different channels the promotions we gave reached our customers, which membership tier we reached the most, and which income group within that membership generated the most revenue.

Number of Customers by Member Card: We showed with a pie chart which membership card our customers hold the most.

Average Revenue of Store Based Customers: We showed which customer profile came to which store the most with a grid.

Customer Segmentation: We used the scatter plot for customer segmentation.

As a result of this segmentation, we had 5 separate clusters. You can get to know these clusters closely and come up with action plans according to your company strategy. The action plans we have made are as follows 😊

→High cost, high margin: I spend a lot of cost to attract my customers, but it pays me high. I can determine which channel these customers receive the most communication from and reduce the cost there.

→ High cost, low margin: I spend a lot to attract these customers, but the return is low for me. The products and brands they prefer may not be in my store; this can be checked with a survey study.

→ Low cost, low margin: I spend very little to attract these customers, but they may be an audience that prefers me only for certain products with a low return for me. I may not be their primary market of choice in the sector. Our product catalog, price range and reliability can be explored more deeply with a survey. By running campaigns on the products these customers prefer, we can also draw inferences from their trends.

→ Low cost, high margin: These are the customers I can reach the fastest and earn the most from; I can launch special campaigns for them.

→ Medium cost, low margin: I spend money to attract these customers, but the return is low for me. The products and brands they prefer may not be in my store; this can be checked with a survey study.

Media Type Based Cost and Pred Cost: With a line chart, we showed the actual cost spent per media type and the cost value we predicted.

You can find the codes and reports of our work below.
