SUV Purchase Prediction Using Logistic Regression

Published in Analytics Vidhya · 4 min read · Aug 29, 2020

In this blog post, I will go through the process of creating a machine learning model for the SUV cars dataset. The dataset provides each person's age, gender and estimated salary. There is one more column in the dataset, Purchased, which is our target variable. The machine learning model we are going to apply is logistic regression. I downloaded the dataset from Kaggle; you can download it from here.

Let’s get started

First, we will import the libraries we are going to use for this model.

After importing the libraries, we will load the dataset.
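The import and load steps might look like the minimal sketch below. The original code is not reproduced here, so this is an assumption about its shape: the inline rows are made-up stand-ins for the Kaggle CSV, and the file name `suv_data.csv` in the comment is hypothetical.

```python
import io

import pandas as pd

# A few illustrative rows standing in for the Kaggle CSV (values are made up).
SAMPLE_CSV = """User ID,Gender,Age,EstimatedSalary,Purchased
1,Male,19,19000,0
2,Female,35,20000,0
3,Female,47,25000,1
4,Male,27,57000,0
"""

# With the real file you would pass its path instead, for example:
#   df = pd.read_csv("suv_data.csv")   # hypothetical file name
df = pd.read_csv(io.StringIO(SAMPLE_CSV))

# Show the first rows to confirm the columns were parsed as expected.
print(df.head())
```

The other libraries (scikit-learn, seaborn) can be imported at the step where they are first needed.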

Let’s Explore our dataset:

The dataset comprises 5 columns:

1. User ID

2. Gender

3. Age

4. EstimatedSalary

5. Purchased

Here our target variable is Purchased, which takes one of two values: 0 or 1. Here 0 means that the person has not purchased the car and 1 means that the person has purchased the car.

Data Exploration

As we can see from the output above, the dataset comprises 400 rows and 5 columns. The columns User ID, Age, EstimatedSalary and Purchased are of integer type, and Gender is of object type.
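A summary like the one described above can be obtained with `shape` and `info()`. The sketch below uses a tiny made-up DataFrame with the same column names, so the row count it prints is 4 rather than 400.

```python
import pandas as pd

# Made-up stand-in for the real dataset (same column names, 4 rows only).
df = pd.DataFrame({
    "User ID": [1, 2, 3, 4],
    "Gender": ["Male", "Female", "Female", "Male"],
    "Age": [19, 35, 47, 27],
    "EstimatedSalary": [19000, 20000, 25000, 57000],
    "Purchased": [0, 0, 1, 0],
})

# (number of rows, number of columns)
print(df.shape)

# Column names, non-null counts and dtypes in one report.
df.info()
```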

Let’s Further Explore our dataset:

Does it contain any null values?

How to check?

No worries. We will use the isnull() function, which tells us whether our dataset contains any null values.
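One common way to use `isnull()` is to chain it with `sum()` to get a per-column count of missing values. This is a sketch on a made-up stand-in DataFrame, not the original notebook's code.

```python
import pandas as pd

# Made-up stand-in for the real dataset.
df = pd.DataFrame({
    "User ID": [1, 2, 3, 4],
    "Gender": ["Male", "Female", "Female", "Male"],
    "Age": [19, 35, 47, 27],
    "EstimatedSalary": [19000, 20000, 25000, 57000],
    "Purchased": [0, 0, 1, 0],
})

# isnull() marks each cell True/False; sum() counts the True cells per column.
null_counts = df.isnull().sum()
print(null_counts)

# A single yes/no answer for the whole DataFrame.
has_nulls = bool(df.isnull().values.any())
print("Contains nulls:", has_nulls)
```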

Very well, our dataset does not contain any null values. Now we can proceed further.

Split DataFrame into X and y

Here we use only two columns as features (X) for training our model, Age and EstimatedSalary, and y holds the dependent variable, Purchased.
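Selecting the features and the target could look like this. The DataFrame is again a made-up stand-in; only the column names come from the dataset description.

```python
import pandas as pd

# Made-up stand-in for the real dataset.
df = pd.DataFrame({
    "User ID": [1, 2, 3, 4],
    "Gender": ["Male", "Female", "Female", "Male"],
    "Age": [19, 35, 47, 27],
    "EstimatedSalary": [19000, 20000, 25000, 57000],
    "Purchased": [0, 0, 1, 0],
})

# Features: the two numeric columns used for training.
X = df[["Age", "EstimatedSalary"]].values

# Target: 1 if the person purchased the car, else 0.
y = df["Purchased"].values

print(X.shape, y.shape)
```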

We will split our data into 75% for training the model and 25% for testing.
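A 75/25 split is typically done with `train_test_split` from scikit-learn. In this sketch the 400 rows are randomly generated placeholders, and `random_state=0` is an assumption made only so the split is reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Randomly generated placeholder data with 400 rows (not the real dataset).
rng = np.random.RandomState(0)
X = np.column_stack([
    rng.randint(18, 61, size=400),          # stand-in for Age
    rng.randint(15000, 150001, size=400),   # stand-in for EstimatedSalary
])
y = rng.randint(0, 2, size=400)             # stand-in for Purchased

# test_size=0.25 keeps 25% of the rows for testing and 75% for training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

print(X_train.shape, X_test.shape)  # (300, 2) (100, 2)
```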

If you look closely at our dataset, you can see that the EstimatedSalary column has values that are orders of magnitude larger than those in Age. The features need to be on a comparable scale so that our model can perform better. For that purpose we will use the StandardScaler class from sklearn.preprocessing.

This standardizes each feature to zero mean and unit variance, so Age and EstimatedSalary end up on a comparable scale. Note that the scaled values are not confined to the range 0 to 1; that is what MinMaxScaler does. Scaling not only helps the model perform better but also helps the solver converge faster.
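The effect of `StandardScaler` can be checked on a few made-up rows. The sketch fits the scaler on the training data only and reuses the learned mean and standard deviation for the test data, which is the usual way to avoid leaking test information.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up [Age, EstimatedSalary] rows, for illustration only.
X_train = np.array(
    [[19, 19000], [35, 20000], [47, 25000], [27, 57000]], dtype=float
)
X_test = np.array([[30, 40000]], dtype=float)

sc = StandardScaler()
X_train_scaled = sc.fit_transform(X_train)  # learn mean/std, then scale
X_test_scaled = sc.transform(X_test)        # reuse the training mean/std

# Each training column now has mean ~0 and standard deviation ~1.
print(X_train_scaled.mean(axis=0))
print(X_train_scaled.std(axis=0))

# Standardized values can be negative or greater than 1.
print(X_train_scaled.min(), X_train_scaled.max())
```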

Now, Let’s Apply the Algorithm

Now that we have successfully fitted our model, let’s make some predictions on our test dataset.
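Fitting and predicting can be sketched end to end as follows. The data and the rule that labels it are invented so the example runs on its own; the variable name `classifier` and `random_state=0` are assumptions too.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Placeholder data: 400 rows with a made-up rule for who purchases.
rng = np.random.RandomState(0)
age = rng.randint(18, 61, size=400)
salary = rng.randint(15000, 150001, size=400)
y = ((age > 40) | (salary > 100000)).astype(int)
X = np.column_stack([age, salary]).astype(float)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Scale the features using statistics from the training set only.
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# Fit logistic regression on the training data.
classifier = LogisticRegression(random_state=0)
classifier.fit(X_train, y_train)

# Predict a 0/1 label for every row in the test set.
y_pred = classifier.predict(X_test)
print(y_pred[:10])
```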

The output above shows the values our model has predicted. Now we will compare them with the original values and check the difference between them. Let’s find out how accurately our model has predicted.

Some of the values in y_pred are wrong. I have marked them with red circles in y_test. You can compare and check for yourself. y_test contains the original values of the Purchased column of the dataset, while y_pred contains the values our model has predicted.
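Instead of comparing the two arrays by eye, the mismatches can be located programmatically. The arrays below are short made-up examples, not the model's actual output.

```python
import numpy as np

# Made-up true labels and predictions, for illustration only.
y_test = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_pred = np.array([0, 1, 1, 0, 0, 1, 0, 1])

# Positions where the prediction differs from the true label.
mismatch_idx = np.flatnonzero(y_test != y_pred)
print(mismatch_idx)  # [1 3]

# Show the true and predicted value at each mismatch.
for i in mismatch_idx:
    print(f"row {i}: actual={y_test[i]}, predicted={y_pred[i]}")
```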

Accuracy Score of Model:

Our model has an accuracy score of 89.0%, which is quite good.
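Accuracy is the fraction of predictions that match the true labels. The sketch below computes it with `accuracy_score` and by hand on made-up arrays, so the number it prints is not the 89.0% reported for the real model.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up true labels and predictions, for illustration only.
y_test = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_pred = np.array([0, 1, 1, 0, 0, 1, 0, 1])

# Library call: correct predictions divided by total predictions.
acc = accuracy_score(y_test, y_pred)

# The same quantity computed by hand.
manual_acc = np.mean(y_test == y_pred)

print(acc)         # 0.75 (6 of 8 correct)
print(manual_acc)  # 0.75
```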

Confusion Matrix

Let’s Plot the confusion matrix
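A confusion matrix can be computed with scikit-learn and drawn as a seaborn heatmap. This is a sketch on made-up labels; the helper name `plot_confusion_matrix`, the colour map and the axis labels are illustrative choices, not taken from the original notebook.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up true labels and predictions, for illustration only.
y_test = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_pred = np.array([0, 1, 1, 0, 0, 1, 0, 1])

# Rows are actual classes, columns are predicted classes:
# [[true negatives, false positives],
#  [false negatives, true positives]]
cm = confusion_matrix(y_test, y_pred)
print(cm)  # [[3 1]
           #  [1 3]]


def plot_confusion_matrix(cm):
    """Draw the matrix as an annotated heatmap (hypothetical helper)."""
    # Imported here so computing the matrix does not need plotting libraries.
    import matplotlib.pyplot as plt
    import seaborn as sns

    fig, ax = plt.subplots()
    sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", ax=ax)
    ax.set_xlabel("Predicted")
    ax.set_ylabel("Actual")
    return ax
```

Calling `plot_confusion_matrix(cm)` followed by `plt.show()` from `matplotlib.pyplot` displays the figure.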

Summary:

We started with data exploration, where we checked whether our dataset contains any null values. Afterwards we split our dataset and applied our machine learning model, logistic regression. We fitted our model, predicted values, compared them with the original values, and found that our classifier predicted some of them wrong. Then we looked at the accuracy score of our model, which is 89.0%, and lastly we looked at its confusion matrix and plotted it using the seaborn library.

The code can be found in my GitHub repo here, and you can also find my Kaggle notebook for this dataset here.