Creating a Classifier from the UCI Early-stage diabetes risk prediction dataset

Jarrett Evans
Published in Analytics Vidhya · Aug 31, 2020

[Header image: https://images.app.goo.gl/nu25PM9WQP6JhAmD8]

To begin we must first go and download the dataset from the UCI dataset repository. The link for the dataset can be found below.

https://archive.ics.uci.edu/ml/datasets/Early+stage+diabetes+risk+prediction+dataset

After downloading the dataset (as long as it is not too big), I like to open it in a spreadsheet to get a sense of what I am working with.

As you can see, we have 17 variables in total, and every field except ‘Age’ appears to hold binary values. From here we’ll open the dataset in a notebook environment to explore it further. For this project I used Google Colab, which is based on the Jupyter notebook environment and requires no configuration before use.

There are a few ways to pull your own data into Google Colab. For this project, I ran the following command, which lets you browse your local computer for a file to upload.

from google.colab import files
uploaded = files.upload()

From there we’ll load in some necessary libraries.

import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
import sys
np.set_printoptions(threshold=sys.maxsize)

The next step is to read in the data to a DataFrame and to explore the variables to see if we will need to do any data imputation.

Dia_df = pd.read_csv('diabetes_data_upload.csv')
Dia_df.head()
Dia_df.info()

We see that there are no NULL values so we will not need to do any imputation on the dataset. However, we will still need to do some reformatting to get the data into a format that a machine learning algorithm can handle.
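To confirm the absence of missing values in code rather than by eye, a quick check like the following (a minimal sketch) counts the NULLs in each column; every count should be 0 for this dataset.

Dia_df.isnull().sum()  # number of missing values per column; all zeros here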

Before we do that, though, we will split our data into a training set and a test set. This way we can get an idea of how our model will generalize to new data it has not seen before. Then we separate the features from the labels. Since ‘class’ is the column we are trying to predict, we create a Series containing just that column, called dia_labels.

from sklearn.model_selection import train_test_split
train_set, test_set = train_test_split(Dia_df, test_size = 0.2, random_state = 42)
dia_train = train_set.drop("class", axis=1) # drop labels for training set
dia_labels = train_set["class"].copy()

As mentioned previously we will need to reformat the data into a structure that can be fed into a machine learning algorithm. To accomplish this we will transform the dia_train DataFrame into a NumPy array. We will need to convert the words into numbers in order to do this. My approach was to convert the ‘Yes’ records to 1’s and the ‘No’ records to 0’s. I could have manually done this, but I wanted to build a Pipeline out of my model so it could be more dynamic in future use. I started with the following Class and Pipeline.

from sklearn.preprocessing import LabelEncoder

class MultiColumnLabelEncoder:
    def __init__(self, columns=None):
        self.columns = columns  # array of column names to encode

    def fit(self, X, y=None):
        return self  # nothing to learn here

    def transform(self, X):
        '''
        Transforms columns of X specified in self.columns using
        LabelEncoder(). If no columns specified, transforms all
        columns in X.
        '''
        output = X.copy()
        if self.columns is not None:
            for col in self.columns:
                output[col] = LabelEncoder().fit_transform(output[col])
        else:
            for colname, col in output.items():  # .iteritems() was removed in pandas 2.0
                output[colname] = LabelEncoder().fit_transform(col)
        return output

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)

encoding_pipeline = Pipeline([
    ('encoding', MultiColumnLabelEncoder(columns=['Polyuria', 'Polydipsia', 'sudden weight loss', 'weakness', 'Polyphagia', 'Genital thrush', 'visual blurring', 'Itching', 'Irritability', 'delayed healing', 'partial paresis', 'muscle stiffness', 'Alopecia', 'Obesity']))
])

This code will allow us to have most of our column values converted once we execute fit_transform(), but we still need to make some changes to the ‘Age’ and ‘Gender’ columns. You may be wondering why we need to change ‘Age’ since it is already filled with numbers. The reason is the scale of those numbers compared to everything else: since the other columns only take the values 1 and 0, the model would place much more significance on the ‘Age’ column. To handle this we can use Sklearn’s built-in StandardScaler transformer from sklearn.preprocessing.
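To make the effect of StandardScaler concrete, here is a small standalone sketch (the ages below are made up, not taken from the dataset): each value is replaced with (x - mean) / std, so the column ends up with mean 0 and unit variance.

from sklearn.preprocessing import StandardScaler
import numpy as np

ages = np.array([[25.0], [40.0], [58.0], [63.0]])  # hypothetical ages, one column
scaler = StandardScaler()
print(scaler.fit_transform(ages))  # each age becomes (x - mean) / std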

Finally, we have the ‘Gender’ column, which is categorical. We will naively assume that ‘Male’ and ‘Female’ carry the same weight when it comes to determining which ‘class’ a patient falls into. We can use another built-in transformer called OneHotEncoder(), also found in sklearn.preprocessing. OneHotEncoder creates a matrix with a male column and a female column: if someone is male there will be a 1 in the male column and a 0 in the female column, and vice versa for a female.

Since there are only two categories in the ‘Gender’ column, we could have used ordinal encoding instead, making the ‘Male’ instances 1’s and the ‘Female’ instances 0’s within a single column. This would essentially use a single value to perform the math behind the scenes. But since there could theoretically be an ‘Other’ option for the ‘Gender’ column in the future, I decided to use one-hot encoding.

OneHotEncoding helps to eliminate unfair bias if there is not a natural order between categorical values in a column.
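Here is a small standalone sketch of OneHotEncoder in action (the rows are made up for illustration): fit_transform returns one 0/1 column per category, in the order shown by categories_.

from sklearn.preprocessing import OneHotEncoder
import pandas as pd

demo = pd.DataFrame({'Gender': ['Male', 'Female', 'Male']})
encoder = OneHotEncoder()
print(encoder.fit_transform(demo).toarray())  # one 0/1 column per category
print(encoder.categories_)                    # [array(['Female', 'Male'], dtype=object)]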

Now we will wrap all of these techniques into one pipeline to handle all of our columns.

enc_attribs = ['Polyuria', 'Polydipsia', 'sudden weight loss', 'weakness', 'Polyphagia', 'Genital thrush', 'visual blurring', 'Itching', 'Irritability', 'delayed healing', 'partial paresis', 'muscle stiffness', 'Alopecia', 'Obesity']
num_attribs = ['Age']
cat_attribs = ['Gender']

full_pipeline = ColumnTransformer([
    ("num", StandardScaler(), num_attribs),
    ("cat", OneHotEncoder(), cat_attribs),
    ("encoding", encoding_pipeline, enc_attribs)
], remainder='passthrough')

dia_prepared = full_pipeline.fit_transform(dia_train)

By displaying dia_prepared we can verify that everything was transformed as planned.

dia_prepared
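A quick sanity check is the shape of the transformed array. Assuming the standard 520-row version of this dataset and the 80/20 split above, we would expect 416 rows and 17 columns (1 scaled ‘Age’ column, 2 one-hot ‘Gender’ columns, and 14 label-encoded symptom columns).

dia_prepared.shape  # expected (416, 17) under the assumptions above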

At last, we have our data in a spot to use a machine learning algorithm. Since we are building a classifier we will want to use a classifying algorithm such as a Decision Tree, Logistic Regression, or a Support Vector Machine Classifier. It’s fairly easy to experiment with these different algorithms built into Sklearn to find one that performs well.
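One way to compare candidates before committing is cross-validation on the prepared training data. The sketch below is illustrative (the candidate models and the 5-fold choice are my own assumptions, not from the original experiment).

from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

for model in (DecisionTreeClassifier(), LogisticRegression(max_iter=1000), SVC()):
    scores = cross_val_score(model, dia_prepared, dia_labels, cv=5, scoring='accuracy')
    print(type(model).__name__, scores.mean())  # mean accuracy across the 5 folds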

I ultimately went with the Support Vector Machine Classifier using the code below. You can experiment by importing different algorithms instead of SVC from Sklearn.

from sklearn.svm import SVC

clf = SVC()
clf.fit(dia_prepared, dia_labels)

With our model fit, it is time to evaluate how it performs on our training data.

from sklearn.metrics import accuracy_score

diabetes_predictions = clf.predict(dia_prepared)
accuracy_score(dia_labels, diabetes_predictions)

from sklearn.metrics import confusion_matrix
cm = confusion_matrix(dia_labels, diabetes_predictions)
print(cm)
fig = plt.figure()
ax = fig.add_subplot(111)
cax = ax.matshow(cm)
plt.title('Confusion Matrix')
fig.colorbar(cax)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()

We received a training accuracy of 97.6%, with 4 false negatives and 6 false positives.

With our metrics in a good spot, we can move on to testing the test_set data to get a feel for how well our model will generalize.

We’ll first drop the label from the test set similar to how we dropped it from the training set.

X_test = test_set.drop("class", axis=1) # drop labels for testing set
y_test = test_set["class"].copy()

Since the test set has all the same variable names as the training set, we can run it straight through our fitted ‘full_pipeline’ to get it into the correct format for our trained model. Note that we call transform() rather than fit_transform() here: the scaler and encoders must use the statistics learned from the training data rather than being refit on the test data.

X_test_prepared = full_pipeline.transform(X_test)
X_test_prepared

We can now make predictions by calling predict in our ‘clf’ model and passing in ‘X_test_prepared’.

test_predictions = clf.predict(X_test_prepared)
accuracy_score(y_test, test_predictions)
cm = confusion_matrix(y_test, test_predictions)
print(cm)
fig = plt.figure()
ax = fig.add_subplot(111)
cax = ax.matshow(cm)
plt.title('Confusion Matrix')
fig.colorbar(cax)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()

On the test set, our model achieves an accuracy of 96.2%, with 1 false negative and 3 false positives.
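Since a false negative (a missed diabetes risk) is arguably more costly than a false positive in this setting, it can be worth looking beyond accuracy. A minimal sketch using the test predictions from above reports per-class precision and recall.

from sklearn.metrics import classification_report

print(classification_report(y_test, test_predictions))  # precision, recall, and F1 per class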

Now that we have trained and tested our model we can wrap the preparation and the predictor into one pipeline and test it out.

sample_data = dia_train.iloc[5:10]

full_pipeline_with_predictor = Pipeline([
    ("preparation", full_pipeline),
    ("class", SVC())
])
full_pipeline_with_predictor.fit(dia_train, dia_labels)
full_pipeline_with_predictor.predict(sample_data)

The above code takes a small sample of our training data, prepares it for the algorithm, and then makes a prediction using the parameters learned during training.

Lastly, we can save our trained model to a pickle file. This allows us to reuse the model in future code without retraining it. While our model did not take very long to train, some models can take hours or even days, so a model would not hold much utility for predictions if you had to retrain it every time. This can be done using ‘dump’ from the joblib library.

import joblib

diabetes_model = full_pipeline_with_predictor
joblib.dump(diabetes_model, "diabetes_model.pkl")

You can then bring the model into your new code and make predictions by using the following code.

diabetes_model_loaded = joblib.load("diabetes_model.pkl")
diabetes_model_loaded.predict(sample_data)

This was an example of how we can use machine learning to make predictions on data we do not know a lot about. Fortunately for us, the dataset we used was quite clean and did not require a lot of data preprocessing. Most real-world projects would have more time devoted to that step of the project. For the full notebook of this code, check out the link below.

https://github.com/jjevans25/UCI-Diabetes-Data/blob/master/Diabetes_Model.ipynb

— — — — — — — — — — — — — — — — — — — — — — — — — —

I would like to note that I am not a medical professional and would not recommend using this model to make any health-related decisions.
