Using Machine Learning to predict if someone has diabetes

Published in

Learn stuff with Ed

5 min read · Feb 24, 2018

Just over 30 lines of code and 15 minutes of your time are all you need to predict whether someone has diabetes or not. Machine Learning has come a long way!

Keras is a high-level neural networks API, written in Python and capable of running on top of TensorFlow, CNTK, or Theano. Since Keras defaults to TensorFlow, that’s what we are going to use as a backend.

By the end of this, you will have built a neural network as complex as the image below that can predict whether someone has diabetes or not based on 8 variables.

Alright! Look at the time now, because in fifteen minutes you should have your first model trained and able to make predictions!

Project dependencies are tensorflow, keras and numpy, so make sure you have those installed before we start coding.

We will use a dataset that contains information about patients, some with diabetes and some without. Each record contains the variables mentioned above, in this order:

Number of times pregnant
Plasma glucose concentration at 2 hours in an oral glucose tolerance test
Diastolic blood pressure (mm Hg)
Triceps skinfold thickness (mm)
2-Hour serum insulin (mu U/ml)
Body mass index (weight in kg/(height in m)²)
Diabetes pedigree function
Age (years)

I retrieved this data from a public machine learning dataset repository; you can download it here. The dataset is a CSV with 9 columns: 8 with the variables listed above and a ninth containing a boolean that tells us whether the person has diabetes or not.

There’s a lot going on above. How are we doing with the time?

We should have some time left to talk about what’s going on above. From lines 1 to 3 we are just importing a few dependencies, no biggie.

On line 6 we are using numpy to load our dataset.

Then on lines 9 and 10 we are splitting the input from the output, so that the 8 variables describing each person are kept separate from the column that says whether that person has diabetes (or not).
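As a minimal sketch of the loading and splitting steps just described: this assumes the CSV has no header row, and it uses two rows quoted later in this article in place of the full file. The in-memory `sample_csv` and the names `X` and `y` are illustrative, not necessarily what the original script uses.

```python
import io

import numpy as np

# Stand-in for the downloaded CSV: two rows quoted later in this article,
# with the ninth column set to the outcome the article reports for each.
sample_csv = io.StringIO(
    "1,108,88,19,0,27.1,0.400,24,0\n"
    "7,150,78,29,126,35.2,0.692,54,1\n"
)

# With the real file you would pass its path instead of sample_csv.
dataset = np.loadtxt(sample_csv, delimiter=",")

X = dataset[:, 0:8]  # the 8 input variables
y = dataset[:, 8]    # the ninth column: 1 = diabetes, 0 = no diabetes

print(X.shape, y)
```

Slicing by column keeps every row aligned, so `X[i]` and `y[i]` always describe the same person.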

On line 13, we are instantiating the sequence of layers of our neural network. On line 14 we are doing two things at once: defining the input layer, ready to receive 8 inputs, and creating the first hidden layer, which has 12 nodes and uses the “relu” activation function.

We then define another hidden layer on line 15, containing 8 nodes, that feeds the output layer defined on line 16.

That takes us to line 19, where we compile our model, making it ready for training (or fitting) on line 22.

We finish up on line 24, where we save the model so it can be used for predictions later.
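Here is a minimal sketch of what the steps above could look like in Keras. The layer sizes and the first layer's “relu” come from the walkthrough. Everything else is an assumption: the second hidden layer's activation, the sigmoid output, the loss, the optimizer, the epoch count, the batch size, and the helper names `build_model` and `train_and_save`. Treat it as a starting point rather than the exact script.

```python
import numpy as np
from tensorflow import keras


def build_model():
    """Stack the layers: 8 inputs -> 12 nodes -> 8 nodes -> 1 output."""
    return keras.Sequential([
        keras.Input(shape=(8,)),                      # 8 input variables
        keras.layers.Dense(12, activation="relu"),    # first hidden layer
        keras.layers.Dense(8, activation="relu"),     # second hidden layer (activation assumed)
        keras.layers.Dense(1, activation="sigmoid"),  # output layer (activation assumed)
    ])


def train_and_save(X, y, path="diabetes.h5", epochs=150, batch_size=10):
    """Compile, fit and save the model; all hyperparameters are assumptions."""
    model = build_model()
    model.compile(loss="binary_crossentropy", optimizer="adam",
                  metrics=["accuracy"])
    model.fit(X, y, epochs=epochs, batch_size=batch_size, verbose=0)
    model.save(path)  # writes the trained model to disk for later use
    return model
```

You would call `train_and_save(X, y)` with the 8 input columns as `X` and the ninth column as `y`.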

In short, what we’ve done above is feed a system the same 8 characteristics for several people and then tell it whether each of these people had diabetes or not. From that we generated a model that captures the patterns in that information. A system that can read that model will be able to use what was learned from the historical data to predict if a person has diabetes or not.

It is that easy. I linked a few resources above in case you are just starting and aren’t familiar with the jargon. I also left out a few details that felt like they would only overwhelm you at this stage.

Just run the above and a trained model named “diabetes.h5” should be created in the same folder as your application. Quick, run it! It takes a bit to train the model!

Once that’s done, let’s look at the following code:

If you run it as is, assuming you have the “diabetes.h5” file placed in the same folder, you should get a “1.0”. That is a prediction that the person described on line 7 does have diabetes. You can play around with the numbers and see how the system behaves.
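A value like “1.0” usually comes from rounding the network's raw score. Assuming the output is a score between 0 and 1 (a common choice for yes/no predictions, not something confirmed here), turning it into a label could look like the sketch below. The `to_label` helper, the 0.5 cut-off and the sample scores are all made up for illustration.

```python
import numpy as np


def to_label(scores, threshold=0.5):
    """Map scores in [0, 1] to 1.0 (diabetes) or 0.0 (no diabetes)."""
    scores = np.asarray(scores, dtype=float)
    return (scores >= threshold).astype(float)


# Made-up scores, not output from the trained model.
print(to_label([[0.91], [0.08]]))
```

Moving the cut-off up or down trades missed cases against false alarms.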

Since the CSV contains the actual information as to whether the person has diabetes or not, you could pick, for instance, the person described on line 601 (1,108,88,19,0,27.1,0.400,24), who we know doesn’t have diabetes, and use those values on line 7 to check whether the system predicts the result correctly.

And if you are using the same model as me, it will correctly predict that the person does not have diabetes. Then you can try the person from line 604 (7,150,78,29,126,35.2,0.692,54), who does have diabetes.

BOOOOM!

False negative! Not as accurate huh?!

Well, even doctors get it wrong…
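If the term is new to you, a yes/no prediction can end in four ways, depending on what is true and what was predicted. A small sketch, where the `outcome` helper is hypothetical and 1 stands for “has diabetes”:

```python
def outcome(actual, predicted):
    """Name the result of one yes/no prediction (1 = diabetes, 0 = not)."""
    if actual == 1 and predicted == 1:
        return "true positive"
    if actual == 0 and predicted == 0:
        return "true negative"
    if actual == 0 and predicted == 1:
        return "false positive"
    return "false negative"  # actual == 1, predicted == 0


# The person from line 604 has diabetes (1) but was predicted not to (0).
print(outcome(actual=1, predicted=0))  # false negative
```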

Shall we try another one? How about the person from line 747 (1,147,94,41,0,49.3,0.358,27)?

AAAAND it works!

So… What time is it? Leave a comment below if you made it on time!

But now you are doubting the accuracy of this thing, right? Well, we only fed it data from 600 people. As a rule, the more data you put in, the more accurate a model becomes, but how accurate is it right now, exactly? You can use the code below to find out:

In this case, the accuracy will be 83.20%. Not perfect, but also not bad, huh?
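Accuracy here simply means the share of people for whom the prediction matches the known answer. A minimal sketch of that calculation, using a hypothetical `accuracy` helper and made-up labels rather than the real dataset:

```python
def accuracy(actual, predicted):
    """Fraction of predictions that match the known labels."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)


# Made-up labels for five people; one prediction (the third) is wrong.
actual = [0, 1, 1, 0, 1]
predicted = [0, 1, 0, 0, 1]
print(f"{accuracy(actual, predicted):.2%}")  # 80.00%
```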

What is next?

You can try adding (or removing) hidden layers, playing with the number of nodes, or modifying the activation function, and seeing how that affects the accuracy of your model.
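One way to get a feel for these changes is to count how many numbers the network has to learn. The sketch below assumes fully connected layers where every node has one weight per input plus one bias. The `count_parameters` helper is hypothetical, and the second architecture is just an example of a variation.

```python
def count_parameters(layer_sizes):
    """Count weights and biases in a fully connected network.

    layer_sizes lists the number of inputs followed by the number of
    nodes in each layer, e.g. [8, 12, 8, 1].
    """
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out + n_out  # weights plus one bias per node
    return total


print(count_parameters([8, 12, 8, 1]))      # 221: 8 inputs, 12 and 8 hidden nodes, 1 output
print(count_parameters([8, 24, 12, 8, 1]))  # 629: a wider, deeper variant
```

More parameters give the network more room to fit the training data, but with only a few hundred people that does not automatically mean better predictions on new ones.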

Written by Edward Leoni