Using KNN Machine Learning Model to Predict Diabetes Patients (With Code)

Jyotiraj Nath
4 min read · Mar 25, 2024


Figure 1: Artistic Representation of Machine Learning (Courtesy: Freepik)

Before diving into the KNN model, let us first see why we need the KNN in the first place:

WHY KNN?

To understand this, let us take an example:

Fig-2: Is it a cat or dog (Courtesy: Pinterest)

Suppose you are out walking with your kid and come across a cat. It’s the first time your child has seen a cat, so you first tell them about its different characteristics (such as making meow sounds or having smaller ears). The next time your child meets a similar animal, they compare its features with the data they have for cats and for other animals, such as dogs, and make a prediction. You confirm it is indeed a cat.

This is exactly what classification with the KNN classifier looks like: a new example is labeled by comparing it with the examples the model has already seen.

So, WHAT IS KNN?
KNN, short for K-Nearest Neighbors, is one of the most straightforward supervised machine learning algorithms, and it is mostly used for classification: it classifies a data point based on how its neighbors are classified.
- KNN stores all available cases and classifies new cases based on a similarity measure
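
To make this concrete, here is a minimal scikit-learn sketch of the cat-vs-dog decision above (the feature values for ear length and weight are made up for illustration):

from sklearn.neighbors import KNeighborsClassifier

#Each known animal is described by two features: [ear length in cm, weight in kg]
X_known = [[4, 4], [5, 3], [6, 5], [10, 25], [12, 30], [11, 28]]
y_known = ['cat', 'cat', 'cat', 'dog', 'dog', 'dog']

#"Training" just stores these cases; the comparison happens at prediction time
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_known, y_known)
print(model.predict([[5, 4]]))  #-> ['cat'], because its 3 nearest neighbors are all cats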

Now, let’s see how we choose this k value.

CHOOSING K VALUE:
The KNN algorithm is based on feature similarity: choosing the correct value of k, a process called parameter tuning, is essential for better accuracy.

Now, there lies a trade-off: if we choose a smaller value of k than required, the prediction becomes sensitive to noise and outliers, and if the value is too large, the neighborhood pulls in points from other classes and the computation becomes more expensive.

So, to select a value of K, we can:

  1. Take sqrt(n), where n is the total number of data points, or
  2. Pick an odd value of K to avoid ties between two classes of data, as in the snippet below.
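
As a quick sketch, here is how that recipe looks in Python (using n = 768, the size of the dataset we work with below):

import math

n = 768                  #total number of data points in our dataset
k = int(math.sqrt(n))    #sqrt rule of thumb gives 27
if k % 2 == 0:           #step down to an odd value to avoid ties
    k = k - 1
print(k)                 #prints 27

Later in this article k = 11 is used instead, which is consistent with applying the same recipe to the test-set size: sqrt(154) ≈ 12.4, stepped down to the odd value 11.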

WHEN DO WE USE KNN?
We use KNN when we know that the data is

  1. Labeled,
  2. Noise-free,
  3. And relatively small (KNN stores every training case, so prediction becomes slow on large datasets).

HOW DOES THE KNN MODEL WORK?

  1. First, find the nearest neighbors: we calculate the Euclidean distance, where the distance between two points in the plane with coordinates (x, y) and (a, b) is d = sqrt((x − a)² + (y − b)²).

  2. Based on the distances calculated, the new data point is assigned the class held by the majority of its k nearest neighbors, as sketched below.
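
To see what happens under the hood, here is a minimal from-scratch sketch of these two steps (the points, labels, and query point are made up for illustration):

import math

def euclidean_distance(p, q):
    #d = sqrt((x − a)² + (y − b)²), generalized to any number of features
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def knn_predict(points, labels, new_point, k):
    #Step 1: compute the distance from the new point to every stored point
    distances = sorted(zip((euclidean_distance(p, new_point) for p in points), labels))
    #Step 2: take a majority vote among the k nearest neighbors
    k_labels = [label for _, label in distances[:k]]
    return max(set(k_labels), key=k_labels.count)

print(knn_predict([(1, 1), (2, 1), (8, 9)], ['A', 'A', 'B'], (1, 2), k=3))  #-> 'A'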

Let’s make use of KNN for an example: predicting diabetes.

The objective will be as follows:

  • Predict whether a person will be diagnosed with Diabetes; for the dataset, we will use the 768 people’s information provided in the video’s description. (Link)

The code will be as follows:

#Doing the imports

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
dataset = pd.read_csv(r'C:\Users\jrnat\Downloads\diabetes.csv')
print(len(dataset))
print(dataset.head())
768
   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
0            6      148             72             35        0  33.6
1            1       85             66             29        0  26.6
2            8      183             64              0        0  23.3
3            1       89             66             23       94  28.1
4            0      137             40             35      168  43.1

   DiabetesPedigreeFunction  Age  Outcome
0                     0.627   50        1
1                     0.351   31        0
2                     0.672   32        1
3                     0.167   21        0
4                     2.288   33        1
#Now, values of zero in columns like "Insulin" or "BloodPressure" are not physically possible, so they cannot be accepted and would skew the outcome
#We can replace such values with the mean of the respective column
#Replacing zeros
zero_not_accepted = ['Glucose', 'BloodPressure', 'SkinThickness', 'BMI', 'Insulin']
for column in zero_not_accepted:
    dataset[column] = dataset[column].replace(0, np.NaN)
    mean = int(dataset[column].mean(skipna=True))
    dataset[column] = dataset[column].replace(np.NaN, mean)
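#Optional sanity check (not in the original walkthrough): after the replacement, no zeros should remain in these columns
print((dataset[zero_not_accepted] == 0).sum())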
#Before proceeding further we need to split the dataset into train and test:
#Split dataset
X = dataset.iloc[:,0:8]
y = dataset.iloc[:,8]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, test_size=0.2)
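#X holds the eight feature columns, y the Outcome column
#random_state=0 makes the split reproducible; test_size=0.2 holds out 20% of the 768 rows (154 samples) for testing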
#Now doing the feature scaling (KNN is distance-based, so all features must be on a comparable scale)
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
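#Note: fit_transform learns the mean and standard deviation from the training data only,
#and transform reuses those statistics on the test set, so no information from the test
#data leaks into training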
#Now let us define the model using KNN Classifier and fit the train data in the model
#Define the model: init K-NN
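#n_neighbors=11 is an odd value chosen by the rule of thumb above: sqrt(len(y_test)) = sqrt(154) ≈ 12.4, stepped down to the nearest odd number
#metric='euclidean' with p=2 is the standard Euclidean distance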
classifier = KNeighborsClassifier(n_neighbors=11, p=2, metric='euclidean')

#Fitting the model
classifier.fit(X_train, y_train)

KNeighborsClassifier(metric='euclidean', n_neighbors=11)
#Now let us evaluate the model
#But first, predict the results on the test data set
y_pred = classifier.predict(X_test)
y_pred
array([1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0,
0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1,
1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1,
1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1,
0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
dtype=int64)
#Evaluate model
cm = confusion_matrix(y_test,y_pred)
print(cm)
print(f1_score(y_test,y_pred))
print(accuracy_score(y_test,y_pred))
[[94 13]
 [15 32]]
0.6956521739130435
0.8181818181818182
#The confusion matrix shows 94 true negatives, 13 false positives, 15 false negatives, and 32 true positives
#The accuracy of the model, which is about 82% (with an F1 score of about 0.70), tells us that it is a pretty fair fit

#Day8 #Quantum30 #QCI

QuantumComputingIndia
