Day 60 of 100DaysofML
Comparison of cross validation approaches. In the last few blogs, I have written quite a bit about cross validation, so I thought of working through a live example on a real dataset. I picked the Pima Indians Diabetes dataset since its columns are easy to understand.
The link to the diabetes dataset is given here:
I created my notebook on Kaggle itself. The link to my notebook is given below, but in this post I shall walk through the syntax for creating the models and give a rough comparison of the accuracies produced by the different cross validation methods. Let us get right into the implementation.
Let us start with importing all the required libraries.
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn
import warnings
warnings.filterwarnings('ignore')
I recommend importing the warnings module because, during the fitting phase, sklearn throws a number of warnings (mostly convergence warnings from logistic regression), and suppressing them keeps the notebook output clean. Let us finish the imports by bringing in all the sklearn modules we need.
# Import necessary modules
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from math import sqrt
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.model_selection import LeaveOneOut
We shall now load the dataset into our notebook. Since I am using Kaggle, the path below is Kaggle-specific; if you are working on your local PC, just make sure you point pandas at the directory containing your CSV file.
data=pd.read_csv('../input/pima-indians-diabetes-database/diabetes.csv')
Once the dataset is loaded, we can check its shape and summary statistics using the following commands.
print('The shape of the data is {}'.format(data.shape))
data.describe().transpose()
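Since this is a binary classification problem, it is also worth a quick look at the class balance. This check is my addition, not part of the original notebook:
# Check how many non-diabetic (0) and diabetic (1) cases we have
print(data['Outcome'].value_counts())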
The next important step is to separate the features and the target variable. If we look closely at the columns, we can see that this is a classification problem: the “Outcome” column tells us whether the person was diagnosed with diabetes, so that becomes our target, while the remaining columns, which hold the glucose level, BMI, age and other measurements, become the features of the model.
features = data.drop('Outcome', axis=1).values
target = data['Outcome'].values
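As a quick sanity check (again my addition), we can confirm that the features form a 2D array and the target a 1D array before fitting anything:
# features should be (n_samples, n_features), target should be (n_samples,)
print('Features shape: {}, Target shape: {}'.format(features.shape, target.shape))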
The first approach is a plain train/test split. We hold out 30% of the data as a test set, fit a logistic regression model on the remaining 70%, and then compute the accuracy by comparing the model’s predictions on the test set against the actual values of the target column.
# Evaluate using a train and a test set
X_train, X_test, Y_train, Y_test = model_selection.train_test_split(features, target, test_size=0.30, random_state=100)
model = LogisticRegression()
model.fit(X_train, Y_train)
result = model.score(X_test, Y_test)
print("Accuracy: %.2f%%" % (result*100.0))
Here, we split the data into training and test sets using train_test_split from sklearn directly.
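One variation worth mentioning: since the two classes are somewhat imbalanced, you can pass stratify=target so that both splits keep the same proportion of diabetic and non-diabetic cases. This is just a sketch of an alternative, not what the original notebook does:
# Stratified variant of the same split (optional)
X_train, X_test, Y_train, Y_test = model_selection.train_test_split(features, target, test_size=0.30, random_state=100, stratify=target)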
We shall now check the accuracy of the model when we use the k-fold approach, where the data is split into 10 folds and each fold takes a turn as the test set.
# shuffle=True is required when passing random_state in recent sklearn versions
kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=100)
model_kfold = LogisticRegression()
results_kfold = model_selection.cross_val_score(model_kfold, features, target, cv=kfold)
print("Accuracy: %.2f%%" % (results_kfold.mean()*100.0))
The next approach is LOOCV (leave-one-out cross validation), which I described in the last blog; here is how to implement it using sklearn.
loocv = model_selection.LeaveOneOut()
model_loocv = LogisticRegression()
results_loocv = model_selection.cross_val_score(model_loocv, features, target, cv=loocv)
print("Accuracy: %.2f%%" % (results_loocv.mean()*100.0))
I had described these three methods in my earlier blogs, so I thought of comparing them side by side and showing how the reported accuracy changes with the cross validation method.
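To make the comparison concrete, here is a small printout (my addition) that collects the three scores computed above:
# Side-by-side comparison of the three evaluation strategies
print('Train/test split accuracy: %.2f%%' % (result*100.0))
print('10-fold CV accuracy: %.2f%%' % (results_kfold.mean()*100.0))
print('LOOCV accuracy: %.2f%%' % (results_loocv.mean()*100.0))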
That’s it for today. Thanks for reading. Keep Learning.
Cheers.