Bike Buyers Analysis

Tuba Ozdemir
BilgeAdam Teknoloji
6 min read · Feb 3, 2022

Welcome to the Bike Buyers analysis, where we predict who purchased a bike and who did not. With the data set we have, we will try to predict whether a person will buy a bike based on their attributes. We will analyze the data first, i.e. acquire, examine, and query it. Then we will visualize the data and determine what needs cleaning, which is the most important phase of any data project. After the data understanding phase comes data preparation, where we decide how to use the data set: for example, correcting, removing, or replacing values. Finally, in the modeling phase, we will use supervised classification algorithms, i.e. we already know the actual outputs in our data, and we will look at the accuracy scores the models reach.

We will write everything in Python 3.10 and use Jupyter Notebook as the interface.

This is the raw data set:

Each row represents a customer who either purchased a bike or did not. There is also some information about each customer:

Marital Status: Married or Single

Gender: Female or Male

Income: Yearly income

Children: Number of children

Education: Degree the customer graduated with

Occupation: Job

Home Owner: Whether the customer owns a home (Yes/No)

Cars: Number of cars the customer has

Commute Distance: Distance between home and work

Region: Region where the customer lives

Age: Age of the customer

Purchased Bike: Whether the customer bought a bike (Yes/No)

In this project, we are going to use the ‘bike_buyers.csv’ file. After importing the data, we’ll apply some transformations such as cleaning, replacing values, and adding a calculated column. Let’s examine the data first:
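As a rough sketch of this first look at the data, assuming pandas is installed: the three inline rows below are made up for illustration and only stand in for ‘bike_buyers.csv’, and the category values in them are hypothetical. The column names follow the list above.

```python
import io

import pandas as pd

# Hypothetical sample standing in for 'bike_buyers.csv'; the values are made up.
sample_csv = io.StringIO(
    "Marital Status,Gender,Income,Children,Education,Occupation,"
    "Home Owner,Cars,Commute Distance,Region,Age,Purchased Bike\n"
    "Married,Female,40000,1,Bachelors,Skilled Manual,Yes,0,0-1 Miles,Europe,42,No\n"
    "Single,,80000,5,Partial College,Professional,No,2,2-5 Miles,Pacific,,Yes\n"
    "Married,Male,30000,3,Partial College,Clerical,,1,0-1 Miles,Europe,43,No\n"
)

# With the real file this would be: df = pd.read_csv("bike_buyers.csv")
df = pd.read_csv(sample_csv)

print(df.shape)   # number of rows and columns
print(df.dtypes)  # type of each column

# Count missing values per column and show only the columns that have any.
missing = df.isna().sum()
print(missing[missing > 0])
```

Checking the missing-value counts first tells us which columns need a replacement rule in the cleaning step.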

These are the steps that clean and transform our data set:

· Replace missing (NA) Home Owner values with ‘Yes’, the most frequent value, since this is a qualitative variable.

· Replace missing Marital Status values with ‘Married’, the most frequent value, as with Home Owner.

· Replace missing Gender values with ‘Male’, also the most frequent value.

· Replace missing Cars values with 1, the mode of Cars, since this is a count (discrete quantitative) variable.

· Replace missing Age values with 43, the median of Age, since this is a numerical quantitative variable.
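The steps above can be sketched in pandas as follows. This is a minimal illustration on a small made-up frame, not the exact notebook code; on the real file the mode and median would be computed from all rows.

```python
import numpy as np
import pandas as pd

# Hypothetical rows with gaps; the real data comes from 'bike_buyers.csv'.
df = pd.DataFrame({
    "Home Owner": ["Yes", "Yes", np.nan, "No"],
    "Marital Status": ["Married", np.nan, "Married", "Single"],
    "Gender": ["Male", "Female", "Male", np.nan],
    "Cars": [1, np.nan, 1, 2],
    "Age": [35, 43, np.nan, 60],
})

# Qualitative and count columns: fill gaps with the most frequent value (mode).
for col in ["Home Owner", "Marital Status", "Gender", "Cars"]:
    df[col] = df[col].fillna(df[col].mode()[0])

# Numerical column: fill gaps with the median.
df["Age"] = df["Age"].fillna(df["Age"].median())

print(df)
```

Computing the fill values from the data, instead of hard-coding ‘Yes’, 1, or 43, keeps the step valid if the file changes.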

We can also test whether being retired affects buying a bike. So let’s calculate a Retired column as follows:

If Age >= 60 and Region is Europe, then Retired (1)

If Age >= 60 and Region is North America, then Retired (1)

If Age >= 50 and Region is Pacific, then Retired (1)

Else Not Retired (0)
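The rules above can be expressed in pandas roughly like this. The five rows are made up for illustration; only the Age and Region columns and the thresholds come from the rules.

```python
import pandas as pd

# Hypothetical customers covering each rule.
df = pd.DataFrame({
    "Age": [62, 55, 52, 45, 61],
    "Region": ["Europe", "North America", "Pacific", "Pacific", "North America"],
})

# Retired if 60+ in Europe or North America, or 50+ in Pacific.
retired = (
    ((df["Age"] >= 60) & df["Region"].isin(["Europe", "North America"]))
    | ((df["Age"] >= 50) & (df["Region"] == "Pacific"))
)

# Store the flag as 1 (Retired) or 0 (Not Retired).
df["Retired"] = retired.astype(int)

print(df)
```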

After these cleaning and column creation steps, our data looks like this:

We are now ready for some visualizations:

We can see that customers with the Professional occupation purchased more than the others. The same holds for the Pacific region.

In the marital status graph, we can observe that single customers purchased more than married ones.

On the other hand, it would not make sense to use the Retired column, since there is little difference between the two groups.

Also, at shorter commute distances, more people purchased a bike.

In the children graph, we cannot split customers into clear groups of fewer and more children.

We can say that people who have fewer cars purchased more.

In the Education graph, we can see that people with Bachelors and Graduate Degree purchased more. We can group them as HighDegree.

We are going to encode the categorical columns, i.e. convert them into numeric values.
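One common way to do this encoding is sketched below, assuming pandas. The article does not show which encoder was used, so treat `map` and `pd.get_dummies` here as one possible choice, applied to a made-up three-row frame.

```python
import pandas as pd

# Hypothetical rows; column names follow the data set description.
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Region": ["Europe", "Pacific", "North America"],
    "Purchased Bike": ["Yes", "No", "Yes"],
})

# Binary target: map Yes/No to 1/0.
df["Purchased Bike"] = df["Purchased Bike"].map({"Yes": 1, "No": 0})

# Other categorical columns: one 0/1 indicator column per category.
encoded = pd.get_dummies(df, columns=["Gender", "Region"], dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

Indicator columns avoid implying an order between categories such as regions, which a single numeric code would do.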

This is the final version of our data. Everything is Boolean:

And these are the correlations between the columns:

As we can see from the correlation matrix, the columns are only weakly related, so our prediction models may not reach a high accuracy score. On the other hand, we cleaned the data and filled the null values in a meaningful way, which is one of the most important parts of data processing.
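A correlation matrix like the one discussed here can be computed with `DataFrame.corr()`. The sketch below uses a tiny made-up frame, so its numbers say nothing about the real data set; it only shows how to read off each column's correlation with the target.

```python
import pandas as pd

# Hypothetical encoded data; values are made up for illustration.
encoded = pd.DataFrame({
    "Cars": [0, 1, 2, 0, 3, 1],
    "Retired": [0, 0, 1, 0, 1, 0],
    "Purchased Bike": [1, 1, 0, 1, 0, 0],
})

# Pearson correlation between every pair of columns.
corr = encoded.corr()

# Correlation of each feature with the target, weakest to strongest.
target_corr = corr["Purchased Bike"].drop("Purchased Bike").sort_values()

print(corr.round(2))
print(target_corr.round(2))
```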

First, we try Logistic Regression, which has these characteristics:

· It is a probabilistic classifier.

· It is suited to cases where the output has few distinct values.

· In some applications, the threshold value must be adjusted to get correct results.

We split the data into a test set (30%) and a training set (the remaining 70%).
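A minimal sketch of this split and fit with scikit-learn follows. The data is synthetic (`make_classification`) and only stands in for the encoded bike buyers columns, so the score it prints is not the article's result. The `random_state` values, `max_iter`, and the 0.4 threshold are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded features (X) and Purchased Bike (y).
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

# Hold out 30% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

default_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, default_pred)
print(f"Accuracy: {accuracy:.4f}")

# The model outputs probabilities, so the 0.5 default threshold can be moved.
proba = model.predict_proba(X_test)[:, 1]
custom_pred = (proba >= 0.4).astype(int)  # 0.4 is an arbitrary example value
print("Positives at 0.5:", default_pred.sum(), "at 0.4:", custom_pred.sum())
```

Lowering the threshold can only add positive predictions, which trades precision for recall.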

Picture A
Picture B

Picture B: In the first code block, the accuracy score is 63.29%.

In the second block, the confusion matrix shows that 154 customers purchased a bike and were predicted correctly (true positives),

and 146 customers did not purchase and were predicted correctly (true negatives).

In the third block, we can see some metrics, defined as follows:

Accuracy: the percentage of output values predicted correctly.

Precision: the accuracy of the positive predictions, True Positive / (True Positive + False Positive).

Recall: True Positive / (True Positive + False Negative).
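To make these definitions concrete, here is a small worked example in plain Python. The helper name `classification_metrics` and the four counts are hypothetical and are not taken from the confusion matrix in the picture.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, and recall from confusion matrix counts."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,  # share of all predictions that are correct
        "precision": tp / (tp + fp),    # share of predicted positives that are real
        "recall": tp / (tp + fn),       # share of real positives that were found
    }


# Made-up counts: 40 true positives, 30 true negatives,
# 10 false positives, 20 false negatives.
metrics = classification_metrics(tp=40, tn=30, fp=10, fn=20)

for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

With these counts, accuracy is 70 / 100, precision is 40 / 50, and recall is 40 / 60.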

ROC curve for the Logistic Regression: the score is not good.

Keep trying…

SVM hyperparameter tuning:
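A typical way to tune an SVM is a grid search with cross-validation, sketched below with scikit-learn. The parameter grid, the fold count, and the synthetic data are all assumptions for illustration; the grid used in the original notebook is not shown in the text.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the encoded bike buyers data.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Hypothetical grid: regularization strength C and RBF kernel width gamma.
param_grid = {
    "C": [0.1, 1, 10],
    "gamma": ["scale", 0.1],
    "kernel": ["rbf"],
}

# Try every combination with 5-fold cross-validation, scored by accuracy.
search = GridSearchCV(SVC(), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated accuracy: {search.best_score_:.3f}")
```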

Gaussian Naive Bayes (it is also 60%)

Decision Tree (it looks like more than 60% 😊 (63%))

Random Forest (it is also 60%)

As we can see, all the models give us scores around 60%. Our dataset has 1000 rows, which might not be enough for good predictions. Also, the relationships between the columns are weak. For this dataset, the best score we could reach was 63%.

Additionally,

Comparison of the models: (*)
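A comparison in the spirit of the referenced article can be sketched as a loop that cross-validates each model on the same folds. The data here is synthetic and the model settings are defaults chosen for illustration, so the printed scores are not the article's results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the encoded bike buyers data.
X, y = make_classification(n_samples=300, n_features=8, random_state=7)

# The five model families tried in this article, with illustrative settings.
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVM": SVC(),
    "Gaussian Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=7),
    "Random Forest": RandomForestClassifier(random_state=7),
}

# Use the same shuffled, stratified folds for every model.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=7)

results = {
    name: cross_val_score(model, X, y, cv=cv, scoring="accuracy").mean()
    for name, model in models.items()
}

# Print the models from best to worst mean accuracy.
for name, score in sorted(results.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

Scoring every model on identical folds makes the accuracy numbers directly comparable.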

We may need to do additional analysis and make changes to the dataset to improve the models. If you have any advice, I would be very happy to hear from you. Thank you for your time 😊

References:

(*) https://machinelearningmastery.com/compare-machine-learning-algorithms-python-scikit-learn/
