Symptoms to Disease Prediction Model

5 min read · Jun 24, 2020

Are you also searching for a proper medical dataset to predict disease based on symptoms?

I wanted to build a health care system that takes symptoms as input and predicts the disease. I searched the internet for a large, suitable dataset to train my model, but unfortunately I could not find a perfect one. So I used a relatively small one that I found on Kaggle here, and then I found a cleaned version of it here. Using both, I decided to build a symptoms-to-disease prediction system and then integrate it with Flask to make a web app.

In this story, I am only building and training the model. If you want me to post about how to integrate it with Flask (a Python framework for web apps), give it a clap 👏

Getting Started:

To train the model, I will use logistic regression in PyTorch. PyTorch is a deep learning library developed by Facebook, and it has a lot of features built in. I am using a simple logistic regression model to make predictions because the data here is not very complex.

Importing Utilities

First of all, we need to import all the utilities that will be used in the future. Read the comments, they will help you understand the purpose of using these libraries.
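As a rough sketch, the imports for a notebook like this one would look something like the following. The exact set is my assumption based on the libraries mentioned in this post, not a copy of the original cell.

```python
import numpy as np                 # numerical arrays
import pandas as pd                # reading CSV files into data frames
import matplotlib.pyplot as plt    # plotting losses and accuracies

import torch                       # tensors and automatic differentiation
import torch.nn as nn              # building blocks such as nn.Module and nn.Linear
import torch.nn.functional as F    # functional losses and activations

# Dataset helpers: wrapping tensors, batching, joining, and splitting datasets
from torch.utils.data import TensorDataset, DataLoader, ConcatDataset, random_split
```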

Handling CSV Files

Now I am defining the links to my training and testing CSV files.

Now we will read the CSV files into data frames. (Data frames are pandas objects.) Keep reading the comments alongside the code to understand each line.

Train CSV to train_df

Test CSV to test_df

Now we get the number of diseases (classes) that we are going to classify into. We need this because the logistic regression model outputs one score per disease, which is then converted into a probability.

Total Classes

Now we are getting the names of columns for inputs and outputs.
Reminder: Keep reading the comments to know about each line of code.

Column Names

The code below builds a dictionary that maps numeric values to categories. For more information, see pandas cat.categories and Python's enumerate function.
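The three steps above (counting classes, picking the column names, and building the mapping) can be sketched on a toy data frame. The column names `itching`, `cough`, and `prognosis` and the disease names are assumptions for illustration; the real CSV has many more symptom columns.

```python
import pandas as pd

# Toy stand-in for the training CSV (column names are assumptions)
train_df = pd.DataFrame({
    "itching":   [1, 0, 0, 1],
    "cough":     [0, 1, 1, 0],
    "prognosis": ["Allergy", "Common Cold", "Common Cold", "Allergy"],
})

# Assume the last column is the target and all others are symptom inputs
input_cols = list(train_df.columns[:-1])
output_col = train_df.columns[-1]

# Convert the target column to a pandas categorical type
train_df[output_col] = train_df[output_col].astype("category")

# Number of distinct diseases the model has to choose between
num_classes = len(train_df[output_col].cat.categories)

# Map each numeric code to its disease name: {0: 'Allergy', 1: 'Common Cold'}
dictionary = dict(enumerate(train_df[output_col].cat.categories))
```

Later, this dictionary lets us turn a predicted class index back into a readable disease name.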

Now we have to convert the data frame to NumPy arrays and then convert those to tensors, because PyTorch works with tensors.
For this, we define a function that takes a data frame and returns the input and output features.

Read the Comments

The above function returns NumPy arrays, so we convert them into tensors using the PyTorch function torch.from_numpy(), which takes a NumPy array and converts it into a tensor.
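Here is a minimal sketch of such a conversion function followed by torch.from_numpy(). The function name `dataframe_to_arrays`, the toy data, and the column names are assumptions for illustration.

```python
import numpy as np
import pandas as pd
import torch

def dataframe_to_arrays(df, input_cols, output_col):
    """Return (inputs, targets) as NumPy arrays."""
    df1 = df.copy(deep=True)  # leave the original data frame untouched
    # Replace disease names with integer category codes (0, 1, 2, ...)
    df1[output_col] = df1[output_col].astype("category").cat.codes
    inputs = df1[input_cols].to_numpy(dtype=np.float32)   # float features
    targets = df1[output_col].to_numpy(dtype=np.int64)    # integer class labels
    return inputs, targets

# Toy data frame standing in for train_df
df = pd.DataFrame({
    "itching":   [1, 0, 1],
    "cough":     [0, 1, 0],
    "prognosis": ["Allergy", "Common Cold", "Allergy"],
})

inputs_np, targets_np = dataframe_to_arrays(df, ["itching", "cough"], "prognosis")

# torch.from_numpy keeps the dtype: float32 inputs, int64 (long) targets
inputs = torch.from_numpy(inputs_np)
targets = torch.from_numpy(targets_np)
```

One thing to watch for: the category codes are computed per data frame, so the train and test files need the same set of disease names for the codes to line up.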

Repeating the same process with the test data frame:

Datasets

The test CSV is very small and contains only one example of each disease, but the train CSV file is large, so we will split it into three parts for training, validation, and testing. We will then join both test datasets into one test dataset.

Now we will set the sizes for training, validating, and testing data.

In the above cell, I have set the manual seed value. We set this value so that whenever we split the data into training, validation, and test sets, we get the same samples each time. That way we can fairly compare our models and hyperparameters (learning rate, number of epochs).

Now we will get the test dataset from the test CSV file.

Now we will concatenate both test datasets to make a fairly large dataset for testing, using ConcatDataset from PyTorch, which joins two datasets into one.
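A minimal sketch of a seeded three-way split with random_split. The seed value, the sizes, and the random stand-in data are arbitrary assumptions for illustration.

```python
import torch
from torch.utils.data import TensorDataset, random_split

torch.manual_seed(10)  # arbitrary seed; any fixed value makes the split repeatable

# Random stand-in data: 100 examples, 5 symptom features, 3 diseases
inputs = torch.rand(100, 5)
targets = torch.randint(0, 3, (100,))
dataset = TensorDataset(inputs, targets)

# Sizes of the validation and test parts; the rest is used for training
val_size = 20
test_size = 10
train_size = len(dataset) - val_size - test_size

train_ds, val_ds, test_ds = random_split(dataset, [train_size, val_size, test_size])
```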
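For example, joining the test split with the dataset built from the test CSV might look like this. The dataset names and the random stand-in data are assumptions.

```python
import torch
from torch.utils.data import TensorDataset, ConcatDataset

# Stand-in for the test part split off from the training CSV
split_test_ds = TensorDataset(torch.rand(10, 5), torch.randint(0, 3, (10,)))

# Stand-in for the small test CSV: one example per disease
csv_test_ds = TensorDataset(torch.rand(3, 5), torch.tensor([0, 1, 2]))

# ConcatDataset chains the datasets: indices 0-9 come from the first,
# indices 10-12 from the second
test_ds = ConcatDataset([split_test_ds, csv_test_ds])
```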

DataLoaders

Now we will make data loaders to pass data into the model in the form of batches.

Batch size depends on the size and complexity of the data. Since the data here is simple, we can use a large batch size. In image processing, large batches are often not possible because of memory limits. If you have a lot of GPU memory, you can go for a larger batch size 😉. Larger batches usually make each epoch faster, although they do not always give better accuracy.
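A minimal sketch of a data loader on stand-in data. The batch size of 32 is an arbitrary choice for illustration.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# Stand-in training dataset: 100 examples, 5 features, 3 classes
train_ds = TensorDataset(torch.rand(100, 5), torch.randint(0, 3, (100,)))

batch_size = 32  # arbitrary value for this sketch

# shuffle=True reorders the training examples every epoch
train_loader = DataLoader(train_ds, batch_size=batch_size, shuffle=True)

# Each iteration yields one batch of inputs and targets
xb, yb = next(iter(train_loader))
```

With 100 examples and a batch size of 32, the loader yields 4 batches per epoch, and the last one holds the remaining 4 examples.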

Defining Utility Functions

Now we will define the functions to train, validate, and fit the model.
Accuracy Function:
We apply softmax to convert the outputs to probabilities that sum to 1, then take the class with the highest probability and compare it with the original target. Each match counts as 1; torch.sum adds up the matches, and the sum is divided by the total number of examples to give the accuracy.
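The accuracy calculation described above can be sketched as follows. The function name and the sample numbers are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def accuracy(outputs, targets):
    """Fraction of rows whose most probable class equals the target."""
    probs = F.softmax(outputs, dim=1)       # each row now sums to 1
    _, preds = torch.max(probs, dim=1)      # index of the largest probability
    return torch.sum(preds == targets).item() / len(targets)

# Three examples, three classes (raw model outputs)
outputs = torch.tensor([[2.0, 0.5, 0.1],
                        [0.2, 0.1, 3.0],
                        [0.3, 1.5, 0.2]])
targets = torch.tensor([0, 2, 0])

acc = accuracy(outputs, targets)  # first two rows match, the third does not
```

Softmax does not change which class is largest, so taking the maximum of the raw outputs would give the same accuracy.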

Optimizer and Loss Function

The loss function calculates the loss; here we are using cross-entropy loss (cross_entropy).

Remember: cross-entropy loss in PyTorch takes a flattened (1-D) array of targets with datatype long.

The optimizer changes the weights and biases according to the calculated loss; here we are using SGD (Stochastic Gradient Descent).
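A minimal sketch of one training step with cross-entropy loss and SGD. The plain nn.Linear layer and the random data are stand-ins for the real model and dataset.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Linear(5, 3)                                   # stand-in: 5 features, 3 classes
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

inputs = torch.rand(8, 5)
targets = torch.randint(0, 3, (8, 1))                     # column of labels, shape (8, 1)

# cross_entropy wants 1-D class indices of dtype long, so flatten first
targets = targets.view(-1).long()

loss = F.cross_entropy(model(inputs), targets)            # applies log-softmax internally
loss.backward()                                           # compute gradients
optimizer.step()                                          # update weights and biases
optimizer.zero_grad()                                     # clear gradients for the next step
```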

Fit Function:
This will print the epoch status every 20th epoch.
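A rough sketch of what the evaluate and fit functions might look like. The history keys match the ones used in the plotting code later in this post; everything else (function signatures, the `print_every` parameter, the toy data) is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import TensorDataset, DataLoader

def evaluate(model, loader):
    """Return (mean loss, accuracy) of the model over a data loader."""
    model.eval()
    total_loss, correct, count = 0.0, 0, 0
    with torch.no_grad():
        for xb, yb in loader:
            out = model(xb)
            total_loss += F.cross_entropy(out, yb, reduction="sum").item()
            correct += (out.argmax(dim=1) == yb).sum().item()
            count += len(yb)
    return total_loss / count, correct / count

def fit(epochs, lr, model, train_loader, val_loader, print_every=20):
    """Train with SGD and return one history record per epoch."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    history = []
    for epoch in range(epochs):
        model.train()
        for xb, yb in train_loader:
            loss = F.cross_entropy(model(xb), yb)
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
        train_loss, _ = evaluate(model, train_loader)
        val_loss, val_acc = evaluate(model, val_loader)
        history.append({"training_loss": train_loss,
                        "validation_loss": val_loss,
                        "validation_acc": val_acc})
        if (epoch + 1) % print_every == 0:   # report every 20th epoch by default
            print(f"Epoch {epoch + 1}: train_loss={train_loss:.4f}, "
                  f"val_loss={val_loss:.4f}, val_acc={val_acc:.4f}")
    return history

# Toy run: 4 features, 2 classes decided by the first feature
torch.manual_seed(0)
x = torch.rand(60, 4)
y = (x[:, 0] > 0.5).long()
train_loader = DataLoader(TensorDataset(x[:40], y[:40]), batch_size=16, shuffle=True)
val_loader = DataLoader(TensorDataset(x[40:], y[40:]), batch_size=16)
history = fit(40, 0.1, nn.Linear(4, 2), train_loader, val_loader)
```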

Model

Now we will extend PyTorch's nn.Module class to make our own model class.
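A minimal sketch of a logistic regression model as an nn.Module subclass. The class name `DiseaseModel` and the sizes are assumptions; the real input size is the number of symptom columns.

```python
import torch
import torch.nn as nn

class DiseaseModel(nn.Module):
    """Logistic regression: a single linear layer from symptoms to disease scores."""

    def __init__(self, input_size, num_classes):
        super().__init__()
        self.linear = nn.Linear(input_size, num_classes)

    def forward(self, xb):
        # Returns raw scores (logits); cross_entropy applies softmax itself
        return self.linear(xb)

model = DiseaseModel(input_size=5, num_classes=3)
logits = model(torch.rand(4, 5))   # batch of 4 examples -> 4 rows of 3 scores
```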

Read Comments

Predict_Single Function Explanation
Sigmoid vs Softmax

Sigmoid converts each number to its own probability between 0 and 1.
Softmax converts all the numbers to probabilities that sum to 1.
Sigmoid is usually used for multi-label classification.
Softmax is used for single-label classification.
You might be wondering why I am using sigmoid here. The answer is that I also want my system to tell people the chance of each disease. With softmax, the system gives relative probabilities across diseases (the top one might be, say, 0.6), whereas sigmoid scores each disease independently on a scale from 0 to 1. So my system can report every disease whose chance is greater than 80%, and if none of them is greater than 80%, it reports the one with the maximum.
Read all the comments in the above cell. Each line is explained there.
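To make the difference concrete, here is a small sketch of the thresholding idea described above. The function signature, the disease names, and the sample scores are assumptions; this version takes the model's raw output scores for one patient directly.

```python
import torch

def predict_single(logits, class_names, threshold=0.8):
    """Return (disease, score) pairs above the threshold, or the single best one."""
    probs = torch.sigmoid(logits)   # each score squashed independently into (0, 1)
    above = [(class_names[i], p.item())
             for i, p in enumerate(probs) if p.item() > threshold]
    if above:
        return sorted(above, key=lambda pair: pair[1], reverse=True)
    best = torch.argmax(probs).item()   # nothing passed the threshold: give the maximum
    return [(class_names[best], probs[best].item())]

class_names = {0: "Allergy", 1: "Common Cold", 2: "Migraine"}
logits = torch.tensor([2.0, 1.0, -1.0])

sigmoid_probs = torch.sigmoid(logits)          # about [0.88, 0.73, 0.27], sum is above 1
softmax_probs = torch.softmax(logits, dim=0)   # about [0.71, 0.26, 0.04], sum is 1

result = predict_single(logits, class_names)   # only "Allergy" is above 0.8
```

Since the model is trained with cross-entropy loss, these sigmoid values are best read as rough confidence scores rather than calibrated probabilities.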

Initializing a model

Initial evaluation (the first value is the loss, the second is the accuracy)

Training the model

learning rate = 0.1
epochs = 100

Testing

Predicting Single Value to check

Save the model

Method 1
This will save the whole model to a file
torch.save(model,'model.pth')
Method 2
This will save the model state dict (weights and biases) to a file
torch.save(model.state_dict(), 'model_st_dict.pth')
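For completeness, a minimal sketch of loading the weights back with Method 2. The nn.Linear layer stands in for the real model class, and the temporary directory is only there to keep the example self-contained.

```python
import os
import tempfile

import torch
import torch.nn as nn

model = nn.Linear(5, 3)   # stand-in for the trained model

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model_st_dict.pth")

    # Save only the weights and biases
    torch.save(model.state_dict(), path)

    # To load, first create a model with the same architecture,
    # then copy the saved weights into it
    restored = nn.Linear(5, 3)
    restored.load_state_dict(torch.load(path))

restored.eval()   # switch to evaluation mode before predicting
```

Method 2 is usually the safer choice: a file saved with Method 1 can only be loaded where the model class definition is importable, and recent PyTorch versions need weights_only=False in torch.load to unpickle a whole model.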

Graphs

Using matplotlib to plot the losses and accuracies

Loss Graph:

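import matplotlib.pyplot as plt

# history is the list of per-epoch records returned by the fit function
val_losses = [his['validation_loss'] for his in history]
train_losses = [his['training_loss'] for his in history]
val_acc = [his['validation_acc'] for his in history]

plt.figure(figsize=(6,4))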
plt.plot(val_losses, '-r', label="val_loss")
plt.plot(train_losses , '-b', label = "train_loss")
plt.xlabel("Epochs")
plt.ylabel("Losses")
plt.legend()
plt.show()

Accuracy Graph
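The accuracy plot can be sketched the same way as the loss plot. The function name `plot_accuracy` and the sample history values are assumptions; in the notebook, history comes from the fit function.

```python
import matplotlib.pyplot as plt

def plot_accuracy(history):
    """Plot validation accuracy per epoch and return the figure."""
    val_acc = [his["validation_acc"] for his in history]
    fig, ax = plt.subplots(figsize=(6, 4))
    ax.plot(val_acc, "-g", label="val_acc")
    ax.set_xlabel("Epochs")
    ax.set_ylabel("Accuracy")
    ax.legend()
    return fig

# Made-up history records, just to show the call
history = [{"validation_acc": a} for a in (0.4, 0.7, 0.9)]
fig = plot_accuracy(history)
```

Call plt.show() afterwards to display the figure outside a notebook.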

Conclusion:

We trained a logistic regression model to predict diseases from symptoms.
If you want to ask anything, you can do that in the comment section below.
If you find anything wrong here, please point it out in the comments; it will be highly appreciated because I am just a beginner in machine learning. This course was my first step in this field.

References:

Check out the documentation for these libraries to learn more:

PyTorch Documentation
Pandas Documentation
Matplotlib Documentation
NumPy Documentation
Or you can search Stack Overflow for your questions.
Check out the full notebook here

THANK YOU ❤️