Diabetes prediction system with KNN algorithm

Abdalla A. Mahgoub, MSc / CISI
Published in CodeX
7 min read · Mar 20, 2021

I would like to approach this mini project as a problem to solve, and I will solve it with a K-nearest neighbour (KNN) predictive model, which makes it a classification challenge. I’ve chosen a healthcare dataset: a diabetes database from the National Institute of Diabetes and Digestive and Kidney Diseases. I will use it to test whether a person is diabetic or not. The dataset was obtained from Kaggle, a website that offers a wide variety of datasets based on real-world data. So, without further ado, let’s start deciphering the dataset and create a predictive model using KNN with cross-validation.

Description of the dataset

I obtained the dataset from Kaggle; it was originally provided by the National Institute of Diabetes and Digestive and Kidney Diseases. The dataset consists of predictor variables and an Outcome column, which indicates whether a person is diabetic or not. Each row is one patient record from the study. For this coursework I will use the data with a KNN algorithm to test some of the patient records and see whether they fall into the diabetic or non-diabetic category. The dataset contains 768 records in total, which we will clean and prepare for use in our KNN predictive model.
Before we start working on the predictive model, we need to know a bit about what the KNN algorithm is.

KNN, which stands for K-Nearest Neighbors, is a supervised machine learning algorithm based on similarity. It is basically a classification algorithm that predicts the class of a target variable from a defined number of nearest neighbors. It calculates the distance from the instance you want to classify to every instance in the training dataset, and then assigns the majority class among the k nearest instances.
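The procedure described above can be sketched in a few lines of Python. This is only a minimal illustration, not the implementation used later in this article (which relies on scikit-learn); the function name `knn_predict` and the toy data are made up for the example.

```python
import numpy as np
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    train_X = np.asarray(train_X, dtype=float)
    query = np.asarray(query, dtype=float)
    # Euclidean distance from the query to every training instance
    distances = np.sqrt(((train_X - query) ** 2).sum(axis=1))
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Majority class among those k neighbours
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy data (made up for illustration): two features, two classes
toy_X = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]]
toy_y = [0, 0, 0, 1, 1, 1]

print(knn_predict(toy_X, toy_y, [1.1, 1.0], k=3))  # query near the class-0 cluster
print(knn_predict(toy_X, toy_y, [5.1, 5.0], k=3))  # query near the class-1 cluster
```

With k = 3, each query simply takes the label shared by most of its three closest toy points.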

Distance between data points in the KNN algorithm

For this project the library uses the Euclidean distance by default to measure the distance between two data points (vectors) in the dataset.
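Written out, the standard Euclidean distance between two points p and q that each have n features is shown here. The symbols p, q and n are notation chosen for this note, not names taken from the code.

```latex
d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}
```

For example, for p = (1, 2) and q = (4, 6) the distance is \sqrt{3^2 + 4^2} = 5.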

In [1]:

import os
from IPython.display import Image
print("**Euclidean Distance Formula**")
Image(filename="../input/euclidean-distance/euclidean distance.JPG", width= 500, height=200)
**Euclidean Distance Formula**

Out[1]:

[Image: Euclidean distance formula]

Reading and exploring the dataset

Below we start by opening the dataset with the pandas function read_csv(), which reads the file and turns it into structured tabular data (a DataFrame) for us to work with.

In [2]:

# First let's start with calling all the dependencies for this project 
import numpy as np
import pandas as pd
import math
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
%matplotlib inline
location = '../input/diabetes/diabetes.csv'
f = pd.read_csv(location)
data = pd.DataFrame(f)
data.head()

Manipulating and Cleaning our dataset

In this section, we will clean the dataset of zeros and missing values such as NaN, replacing them with the mean of the respective column (truncated to an integer in the code below). I’ve decided to clean only the columns [‘Glucose’, ‘BloodPressure’, ‘SkinThickness’, ‘Insulin’, ‘BMI’, ‘Pedigree’], because they are the most important features, with a visible impact on whether a patient is diabetic or not.

In [3]:

# Clean the dataset of zeros and missing values.
# Zeros or missing values are replaced by the mean of that particular column,
# which keeps the data values readable and consistent.
cols_clean = ['Glucose','BloodPressure','SkinThickness','Insulin','BMI','Pedigree']
# Loop over the columns and handle both zeros and NaN values
for i in cols_clean:
    # np.nan is used because the np.NaN alias was removed in NumPy 2.0
    data[i] = data[i].replace(0, np.nan)
    # column mean of the remaining values, truncated to an integer
    cols_mean = int(data[i].mean(skipna=True))
    data[i] = data[i].replace(np.nan, cols_mean)
# data1 is another name for the same DataFrame, not a copy
data1 = data
data1.head().style.highlight_max(color="lightblue").highlight_min(color="red")
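To see what this replacement does, here is a small sketch on a made-up toy frame, following the same steps as the cell above (zeros become NaN, then NaN is filled with the integer-truncated column mean). The frame `toy` and its values are invented for illustration and are not taken from the diabetes dataset.

```python
import numpy as np
import pandas as pd

# Toy frame (values made up for illustration); 0 stands for a missing reading
toy = pd.DataFrame({"Glucose": [148, 0, 183, 0], "BMI": [33.6, 26.6, 0.0, 28.1]})

for col in ["Glucose", "BMI"]:
    # Treat zeros as missing so they do not drag the mean down
    toy[col] = toy[col].replace(0, np.nan)
    # Mean of the remaining values, truncated to an integer as in the cell above
    col_mean = int(toy[col].mean(skipna=True))
    toy[col] = toy[col].fillna(col_mean)

print(toy)
```

The two missing Glucose readings become 165 (the mean of 148 and 183 is 165.5, truncated), and the missing BMI becomes 29.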

In [4]:

# Let's take a quick statistical view of the data provided
print(data1.describe())

Plotting the dataset

The updated diabetes dataset is ready for some basic plotting to see what our data looks like. Plotting at this stage will also help me decide which columns to use for the K-nearest neighbour (KNN) experiment. For plotting I’ve used the pairplot() function from the Seaborn library, which draws a grid of pairwise plots for the selected columns of the dataset.

In [5]:

graph = ['Glucose','Insulin','BMI','Age','Outcome']
sns.set()
print(sns.pairplot(data1[graph],hue='Outcome', diag_kind='kde'))
<seaborn.axisgrid.PairGrid object at 0x7ff431934610>

It’s obvious we are dealing with a rich multidimensional dataset, with many data points for each of the variables shown. To make our life easier and keep things simple, we will select only a few variables to test our model.

In [6]:

# For simplicity, and to analyse the most relevant data, we will select three features of the dataset:
# Glucose, Insulin and BMI
q_cols = ['Glucose','Insulin','BMI','Outcome']
# defining variables and features for the dataset for splitting
df = data1[q_cols]
print(df.head(2))
   Glucose  Insulin   BMI  Outcome
0    148.0    155.0  33.6        1
1     85.0    155.0  26.6        0

Splitting the dataset into training and testing dataset

A particularly important part of preparing data for machine learning algorithms is splitting the dataset into training and testing datasets.

Datasets are split mainly so that the model can be tested: the model is fitted on the training set, and the testing set shows how accurately it predicts data it has not seen, which indicates how it would perform in the real world. We then compare the predictions with the known outcomes and compute the accuracy rate of the machine learning algorithm. Ideally, the higher the accuracy rate, the better the model is at predicting the sample data presented to it.
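The accuracy rate mentioned above is simply the share of test samples that were predicted correctly. A small sketch with made-up labels (the lists `y_true` and `y_pred` are hypothetical, not results from this project) shows the arithmetic next to scikit-learn's `accuracy_score`:

```python
from sklearn import metrics

# Hypothetical labels, made up for illustration (1 = diabetic, 0 = non-diabetic)
y_true = [1, 0, 0, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 1, 0, 0, 0, 0]

# Accuracy = correct predictions / total predictions
correct = sum(t == p for t, p in zip(y_true, y_pred))
manual_accuracy = correct / len(y_true)

# scikit-learn computes the same quantity
sklearn_accuracy = metrics.accuracy_score(y_true, y_pred)

print(manual_accuracy, sklearn_accuracy)
```

Here 6 of the 8 predictions match the true labels, so both values are 0.75.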

In [7]:

# let's split the data into training and testing datasets
split = 0.75 # 75% train and 25% test dataset
total_len = len(df)
split_df = int(total_len*split)
train, test = df.iloc[:split_df,0:4],df.iloc[split_df:,0:4]
train_x = train[['Glucose','Insulin','BMI']]
train_y = train['Outcome']
test_x = test[['Glucose','Insulin','BMI']]
test_y = test['Outcome']

We need to run a quick check to see whether the data were split correctly.

In [8]:

a = len(train_x) 
b = len(test_x)
print(' Training data =',a,'\n','Testing data =',b,'\n','Total data length = ',a+b)
Training data = 576
Testing data = 192
Total data length = 768

The KNN algorithm deals with the similarity between the test samples and the training data. The value of K sets how many of the closest training points are consulted for each test sample, and closeness is determined by a distance measure between the test data and the training dataset. The distance measure chosen in this exercise is the Euclidean distance. I did not implement these operations myself; I used a built-in library, scikit-learn, to run them on the model.

KNN function

I wrote a function that applies the KNN algorithm to the split data. It fits and evaluates a KNN classifier once for each value of K in a range and collects the accuracy scores, which are then shown as a line plot.

In [9]:

# Let's test it using the KNN classifier, looping over as many n_neighbors values as possible
def knn(x_train, y_train, x_test, y_test, n):
    n_range = range(1, n)  # k = 1 .. n-1
    results = []
    for k in n_range:
        model = KNeighborsClassifier(n_neighbors=k)
        model.fit(x_train, y_train)
        # Predict the response for the test dataset
        predict_y = model.predict(x_test)
        accuracy = metrics.accuracy_score(y_test, predict_y)
        #matrix = confusion_matrix(y_test,predict_y)
        #seaborn_matrix = sns.heatmap(matrix, annot = True, cmap="Blues",cbar=True)
        results.append(accuracy)
    return results

For this exercise, I will test and plot the model with K values from 1 to 499 (range(1, n) with n = 500) and see which K values perform best overall.

In [10]:

n= 500
output = knn(train_x,train_y,test_x,test_y,n)
n_range = range(1, n)
plt.plot(n_range, output)

Out[10]:

[<matplotlib.lines.Line2D at 0x7ff42c0bf450>]

Having experimented with different values of K from 1 to 499, I can conclude from the figure that the best K for this model lies between 100 and 200, with an accuracy of 77%.

The ideal K value for this dataset should be around 120, give or take.
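The imports at the top include `cross_val_score` and `KFold`, which the cells in this article do not go on to use. As a rough sketch only, this is how K could be chosen with k-fold cross-validation instead of a single train/test split. It assumes synthetic stand-in data from `make_classification` (not the diabetes dataset), and the fold count, random seed and range of K values are arbitrary choices made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (NOT the diabetes dataset): 768 rows, 3 features
X, y = make_classification(n_samples=768, n_features=3, n_informative=3,
                           n_redundant=0, random_state=0)

# 5 shuffled folds; the fold count and seed are arbitrary choices
folds = KFold(n_splits=5, shuffle=True, random_state=0)

k_values = list(range(1, 52, 5))
mean_scores = []
for k in k_values:
    model = KNeighborsClassifier(n_neighbors=k)
    # One accuracy score per fold, averaged into a single number for this k
    scores = cross_val_score(model, X, y, cv=folds, scoring="accuracy")
    mean_scores.append(scores.mean())

# K with the highest mean cross-validated accuracy
best_k = k_values[int(np.argmax(mean_scores))]
print("best k:", best_k, "mean accuracy:", round(max(mean_scores), 3))
```

To try this on the real data, `X` and `y` would be replaced by the feature columns and the Outcome column of `df`; the scores will differ from the synthetic ones.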
