The Eyes of an Eye Doctor: Detect Blindness with Deep Learning

Akash Sambhangi · Published in Analytics Vidhya · Jan 29, 2020

Introduction

It is an undeniable fact that advances in Deep Learning can make our lives better in many ways. One of the areas where they are creating a huge impact is healthcare, by making it more affordable.
There are various applications of Deep Learning in healthcare, such as malaria, cancer and pneumonia detection, but have you ever wondered:
How does this work under the hood?
How does a computer analyze scanned images and process them?
How can a deep learning network think like a doctor's brain?
In this blog, all of the above questions will be answered by walking through a solution to a real-world problem.

What is the problem?

The problem chosen here is “Diabetic Retinopathy”, a condition that affects eyesight in diabetic patients. Uncontrolled blood sugar levels damage blood vessels, and when this happens in the light-sensitive tissue at the back of the eye (the retina), vision starts to deteriorate and can end in complete loss of sight. For more information on this, please refer to this link.

How Can Deep Learning Help Here?

The best way to fight any health issue is to prevent it, and to prevent diabetic retinopathy one has to detect it as early as possible.
How does one detect this condition?
Ans) The patient's retina is scanned and the scanned image is evaluated by a highly trained doctor to provide a diagnosis.
Now what happens when the patient cannot afford a doctor? Or when there are millions of patients to be diagnosed and only a few doctors available?
Imagine if we could replace the doctor's evaluation with a neural network, or perhaps just precede it with one so that the doctor only has to look at the severe cases or the cases where the network is unsure. Wouldn't that be a great solution?

Aravind Eye Hospital in India is trying to provide diabetic retinopathy diagnosis to millions of patients from rural areas by leveraging the idea presented above. To build a neural network for detecting blindness, they hosted a Kaggle competition (APTOS 2019 Blindness Detection), from which we will acquire the data and build our solution.

Basics of Deep Learning

Note: If you are familiar with concepts like neural networks and convolutional neural networks, you can skip this section.
A Neural Network

https://www.quora.com/What-is-the-differences-between-artificial-neural-network-computer-science-and-biological-neural-network#

The idea here is to mimic a neuron in a human brain and come up with an algorithm that can think like a human. This might seem complicated, but it is pretty simple to understand.
Let's see what happens in a biological neuron. Suppose you touch a hot pan: the receptors on your skin send electrical signals to your brain, where they are received by tentacle-like structures called dendrites. These signals from various sources are processed in the nucleus, after which they are sent via axons to other neurons or to the brain.
Scientists wanted to leverage this idea and build a simple algorithm that can take some numbers as input, process them in a neural-network-like structure and produce outputs. There are different types of neural networks; one type that works well for image data is the Convolutional Neural Network (CNN), which we will be using to solve the problem at hand.

The in-depth workings of neural networks and CNNs are vast and cannot be covered in a single blog post; I shall leave you with some useful resources if you are interested in learning about them.
For now, think of a CNN as a structure that takes in images in numerical form and performs some operations on these input numbers to arrive at an output.

Now there might be a few questions about the above intuition of a CNN, like:
1. How are images represented as numbers?
Ans. Everything we store in a computer is internally stored in the form of zeros and ones (binary). Let's take a look at how an image is represented.

https://mitchellkscscomputing.wordpress.com/2015/10/21/how-bitmap-images-are-represented-in-binary/

In the above illustration, observe how the image is represented on a bitmap: the dark squares are represented by 1 and the light cells by 0.
The resolution of the image also decides the size of the bitmap. For example, the image above has a resolution of 8x8 (8 rows and 8 columns); similarly, an image with a resolution of 128x128 will have a bitmap with 128 rows and 128 columns.

2. How can we convert image files in .jpg or .png format into numerical form?
Ans. Thanks to OpenCV, this task has a single-line answer in Python: just type numerical_rep = cv2.imread(path).
The function imread reads the image from the specified path and converts it into a numerical array, which is then stored in the variable 'numerical_rep'.
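To make this concrete, here is a minimal sketch, assuming a hypothetical image file 'retina.png' in the working directory:

import cv2

# Read an image from disk into a NumPy array (hypothetical file name).
path = "retina.png"
numerical_rep = cv2.imread(path)   # loaded in BGR channel order by default

print(type(numerical_rep))   # <class 'numpy.ndarray'>
print(numerical_rep.shape)   # (height, width, 3) for a color image
print(numerical_rep[0, 0])   # the three channel values of the top-left pixel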

3. How does the CNN structure predict outputs based on inputs?
Ans. There are two phases here. The first is the training phase, which can be related to teaching a human brain: for example, we register an animal like a dog or a cat in a child's mind by showing multiple examples of dogs and cats. Similarly, we give the numerical representation of an image of a dog or a cat as input; these numbers undergo some operations in the CNN structure and an output ('dog' or 'cat') is produced. This is also called forward propagation.

https://towardsdatascience.com/everything-you-need-to-know-about-neural-networks-and-backpropagation-machine-learning-made-easy-e5285bc2be3a

Let's just say the operations performed at the first attempt are random and the output we receive is 'Cat', while the true label is 'Dog'. Since our CNN predicted the output incorrectly, we find the difference between the true and predicted outputs and pass this error back into the CNN structure, which is called backward propagation. Through the process of repeated forward and backward propagation, the CNN learns the patterns present in images labelled 'Dog', and similarly the patterns present in images labelled 'Cat'. Examples of such patterns and features are color patterns, ear shape, eye color, nose length, etc.

Once the training phase is over, the CNN has learned the patterns from the training images; using this knowledge we can predict the output on unseen data, which is also called the test phase.
With this you should now have a very basic idea of how a CNN works and will be able to understand the rest of the blog.
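To make the two phases concrete, here is a minimal, hypothetical Keras sketch on random stand-in data (not real dog/cat images): fit() runs the repeated forward and backward propagation, and predict() is the test phase.

import numpy as np
from tensorflow import keras

# Stand-in data: 100 tiny 32x32 RGB "images" with labels 0 (cat) or 1 (dog).
x_train = np.random.rand(100, 32, 32, 3).astype("float32")
y_train = np.random.randint(0, 2, size=(100,))

# A very small CNN: convolution layers learn visual patterns,
# the final sigmoid gives the probability of the label being 'dog'.
model = keras.Sequential([
    keras.layers.Conv2D(8, 3, activation="relu", input_shape=(32, 32, 3)),
    keras.layers.MaxPooling2D(),
    keras.layers.Flatten(),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Training phase: each pass runs forward propagation, measures the error,
# and backward propagation nudges the weights to reduce it.
model.fit(x_train, y_train, epochs=2, verbose=0)

# Test phase: predict on unseen data using the learned patterns.
x_test = np.random.rand(5, 32, 32, 3).astype("float32")
print(model.predict(x_test))   # probabilities between 0 and 1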

APTOS 2019 Blindness Detection

Aim: To build a neural network that can take scanned retina images as input and give out a number between 0 and 4 based on the severity of blindness, where '0' means 'No DR' and '4' means 'Proliferative DR'.

Performance Metrics
These are different measures used to evaluate the performance of our model and determine how good it is at the given task.

  1. Weighted Kappa Score
  2. Confusion Matrix

The confusion matrix is well known, but the Kappa score isn't; it is very similar to accuracy and can be understood as an extension of the simple accuracy measure.

Kappa is a score that takes into account both the accuracy of the model with respect to the doctor's diagnosis and the agreement between the model and the doctor that would occur by chance. It is represented by 'κ' and is defined as

κ = (po - pe) / (1 - pe)
(source: https://en.wikipedia.org/wiki/Cohen%27s_kappa)

where ‘po’ is the relative observed agreement among raters (identical to accuracy), and ‘pe’ is the hypothetical probability of chance agreement, using the observed data to calculate the probabilities of each observer randomly seeing each category. If the raters are in complete agreement then kappa =1. If there is no agreement among the raters other than what would be expected by chance (as given by ‘pe’ ), kappa =0.

Weighted Kappa is a small variation of this: if two raters disagree, the score is determined by how far apart their ratings are. That means our score will be higher if (a) the real value is 4 but the model predicts 3, and lower if (b) the model instead predicts 0.
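As a quick illustration (with made-up ratings), the quadratic weighted kappa can be computed with scikit-learn's cohen_kappa_score:

from sklearn.metrics import cohen_kappa_score

# Made-up diagnoses from the doctor (ground truth) and from a model, on the 0-4 scale.
y_true = [4, 4, 2, 0, 1]
y_pred_close = [3, 4, 2, 0, 1]   # the one disagreement is only a single grade away
y_pred_far = [0, 4, 2, 0, 1]     # the one disagreement is four grades away

# 'quadratic' weights penalize large disagreements much more heavily than small ones.
print(cohen_kappa_score(y_true, y_pred_close, weights="quadratic"))  # closer to 1
print(cohen_kappa_score(y_true, y_pred_far, weights="quadratic"))    # noticeably lower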

Challenges
1. The number of images provided for training is small (3,662); deep learning usually requires large datasets to obtain good results.
2. The images come from multiple sources, are captured under various lighting conditions, and vary in quality.
3. There are no strict latency restrictions, but the model shouldn't take more than a few minutes to diagnose an image.

Exploratory Data Analysis and Feature Engineering

We are provided with two folders of images, one for train and one for test, and two corresponding CSV files. The train CSV file contains the image name in the first column, 'id_code', and the corresponding diagnosis given by the doctor in the second column, 'diagnosis'. The test CSV file has only the image name; its diagnosis column is not given. Let's load the CSV files and see a sample of the train CSV.

Sample of train csv file

Before we proceed any further, it is important for us to perform a stratified split of the given train set into train and cross-validation sets, so that we will have a set of unseen data to evaluate our models on.
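A minimal sketch of such a stratified split, assuming the competition's train CSV is saved locally as 'train.csv' (the 80/20 ratio here is illustrative):

import pandas as pd
from sklearn.model_selection import train_test_split

train_df = pd.read_csv("train.csv")   # columns: 'id_code', 'diagnosis'

# Stratify on 'diagnosis' so every severity class keeps the same proportion
# in both the train and cross-validation splits.
train_split, cv_split = train_test_split(
    train_df,
    test_size=0.2,
    stratify=train_df["diagnosis"],
    random_state=42,
)

print(train_split["diagnosis"].value_counts(normalize=True))
print(cv_split["diagnosis"].value_counts(normalize=True))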

Distribution of output variable ‘diagnosis’ in train and cross validate

Distribution of output variable in train set
Distribution of output variable in cross validation set

From the above distribution plots we can see that the percentage of points in each class is similar across the train and validation sets.

How do we know that a patient has diabetic retinopathy? There are at least five things to look for:

https://www.eyeops.com/

Let’s now look at examples of the images belonging to each category

Examples of images belonging to each category

The first row in the image above contains five examples of retina scans diagnosed as '0' ('No Diabetic Retinopathy') and the last row contains examples of retina scans diagnosed as '4' ('Proliferative DR'). If you observe more closely, the images in the first row are much cleaner and do not seem to have any abnormal shapes or bulges, whereas the images in the last row have visible spots and bulges such as aneurysms, cotton wool spots and others.

Image Pre-Processing
This is the most important part of any image-based task; a few common processing steps are color conversion, cropping, resizing, etc.

Function to read image and process it

The above function reads an image from the hard disk, converts it to RGB (by default it is read in BGR) and resizes it to 128x128 pixels. There is also an option to enable Ben's preprocessing technique: when enabled, a Gaussian filter (a low-pass filter that helps remove noise from the image) is applied, and the blurred result is merged with the original image using different weights, which helps enhance the features we care about.
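A sketch of what such a function might look like; the blur sigma and blending weights follow Ben Graham's widely shared recipe and are assumptions here, not necessarily the exact values used in the notebook:

import cv2

IMG_SIZE = 128   # target resolution used in this project

def load_and_preprocess(path, bens_preprocessing=True, sigma=10):
    # Read from disk (BGR), convert to RGB and resize.
    img = cv2.imread(path)
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    img = cv2.resize(img, (IMG_SIZE, IMG_SIZE))
    if bens_preprocessing:
        # Gaussian (low-pass) filter to estimate the smooth background...
        blurred = cv2.GaussianBlur(img, (0, 0), sigma)
        # ...then blend it with the original using different weights,
        # which exaggerates local contrast and highlights lesions.
        img = cv2.addWeighted(img, 4, blurred, -4, 128)
    return img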
Images belonging to each category after preprocessing without cropping dark extras.

Images after Ben’s Preprocessing

From the above image it is clear that the preprocessing has worked, since the features are more pronounced.
The 'crop_dark_extras' function is used to remove the additional dark space around the retina, since no useful information can be obtained from it; a sketch of such a function follows the example images below.

Code for Cropping images
Image Before cropping
image after cropping dark extras
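A minimal sketch of such a dark-border cropping function (the brightness threshold is an assumption):

import cv2
import numpy as np

def crop_dark_extras(img, threshold=7):
    # Keep only the rows and columns that contain pixels brighter than the threshold.
    gray = cv2.cvtColor(img, cv2.COLOR_RGB2GRAY)
    mask = gray > threshold
    if not mask.any():          # completely dark image: return it unchanged
        return img
    rows = np.where(mask.any(axis=1))[0]
    cols = np.where(mask.any(axis=0))[0]
    return img[rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1]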

After every image is processed, its numerical representation is appended to a NumPy array; the train, CV and test sets each have their own array. These arrays are also stored on disk in .npy format so that they can simply be loaded when required.

Transforming Output Variables
Originally the output variable is encoded in a one-hot-like manner (if the output class is 3 then the variable 'y' is encoded as [0,0,1,0,0]), so the problem as posed is a multi-class classification problem. That is fine, but since this is a healthcare problem and the cost of a false negative is very high, we shall re-frame it as an ordinal regression problem.
When converted to an ordinal regression problem, if the output class is 3 then the variable 'y' is encoded as [1,1,1,0,0], which also means that if a data point belongs to class 3 then it belongs to classes 1 and 2 as well (Reference).
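A small sketch of this encoding, following the convention used in the example above (class 3 becomes [1,1,1,0,0]):

import numpy as np

def to_ordinal(labels, num_classes=5):
    # Cumulative (multi-hot) encoding: the first 'label' positions are set to 1.
    encoded = np.zeros((len(labels), num_classes), dtype=np.float32)
    for i, label in enumerate(labels):
        encoded[i, :label] = 1.0
    return encoded

print(to_ordinal([0, 1, 3, 4]))
# [[0. 0. 0. 0. 0.]
#  [1. 0. 0. 0. 0.]
#  [1. 1. 1. 0. 0.]
#  [1. 1. 1. 1. 0.]]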

More Data!
Having more data is always beneficial, and after some searching I found that a similar competition was held on Kaggle in 2015, so we shall download and preprocess the 2015 dataset as well.

Modelling

As we all know, there is no single universal model that works for all data, so we will experiment with multiple model architectures and see what works best in our case (for a step-by-step explanation please refer to the IPython notebook on GitHub).

Before we experiment with any architecture, let's decide on the constants.
Calculation of Kappa score
Since the problem is posed as an ordinal regression problem, the final layer of the model is a sigmoid layer, which gives probabilities between 0 and 1 and treats each class as a separate binary classification problem. For example, if the output from the final layer is [0.8, 0.4, 0.7, 0.2, 0.1] and we set the probability threshold at 0.5, the thresholded output is [1, 0, 1, 0, 0]. The highest active class is then taken and all classes before it are considered 1, so the output becomes [1, 1, 1, 0, 0]; this is then compared with the original output variable to determine the Quadratic Weighted Kappa score.
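A sketch of that decoding step, consistent with the ordinal encoding described earlier; the predicted class can then be scored against the true labels with cohen_kappa_score(..., weights="quadratic"):

import numpy as np

def decode_ordinal(probabilities, threshold=0.5):
    # Threshold each sigmoid output, find the highest active position,
    # and treat it (and every position before it) as 1.
    binary = (np.asarray(probabilities) > threshold).astype(int)
    active = np.where(binary == 1)[0]
    return 0 if len(active) == 0 else active[-1] + 1

print(decode_ordinal([0.8, 0.4, 0.7, 0.2, 0.1]))   # -> 3, i.e. [1, 1, 1, 0, 0]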
Loss Function
Binary cross-entropy is calculated on each output class and summed to obtain the loss for each data point.
Optimizer
Adam
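In Keras these choices might look roughly like the following; the tiny convolutional stem is a placeholder rather than the author's architecture, and note that Keras averages the per-position binary cross-entropy rather than summing it (the two differ only by a constant factor):

from tensorflow import keras

inputs = keras.Input(shape=(128, 128, 3))
x = keras.layers.Conv2D(16, 3, activation="relu")(inputs)   # placeholder backbone
x = keras.layers.GlobalAveragePooling2D()(x)
# Five sigmoid outputs, one per ordinal position.
outputs = keras.layers.Dense(5, activation="sigmoid")(x)

model = keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy")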

Architecture Experiments
Model-1 (Baseline model with simple architecture)

This is built to see how well a basic solution can perform and to establish a reference kappa score.

Model-1 with 2019’s competition data

(For detailed logs and code please refer to the IPython notebook on GitHub.)
The model is relatively small, with only about 2.8M parameters.

Kappa score on Validation set

The model performs decently, achieving a kappa score of 0.70.
Many more such architectures were tried in a random-search-like fashion with the help of Hyperas (a hyperparameter-optimization wrapper around Keras and Hyperopt), and the best of these (Model-2 with 2019's data) achieved a kappa score of 0.80. That is good, but from the recall matrix we can see that class 3 dominates and most of the points are misclassified as class 3.

Recall Matrix for best own architecture
Tensorboard for Baseline model

Transfer Learning Techniques
There are many state-of-the-art architectures available that are pretrained on large image datasets; these can be leveraged to solve the problem at hand. Along with this, we shall include two more enhancements to our approach to get the best possible solution.

a) Perform data augmentation with ImageDataGenerator
Operations such as rotation and flipping are applied to the training images before they are fed to the model; this helps the model deal with unseen data better and is especially useful when the available dataset is small.

Visualizing images generated by ImageDataGenerator
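A sketch of such a generator; the specific rotation, flip and shift settings here are illustrative assumptions:

import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Stand-in batch of preprocessed 128x128 images and their ordinal labels.
x_train = np.random.rand(32, 128, 128, 3).astype("float32")
y_train = np.zeros((32, 5), dtype="float32")

# Random rotations, flips and small shifts are applied on the fly,
# so the model rarely sees exactly the same image twice.
datagen = ImageDataGenerator(
    rotation_range=20,
    horizontal_flip=True,
    vertical_flip=True,
    width_shift_range=0.1,
    height_shift_range=0.1,
)

augmented_batch, labels = next(datagen.flow(x_train, y_train, batch_size=8))
print(augmented_batch.shape)   # (8, 128, 128, 3)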

b) Utilize 2015’s competition data
All the models from now on are pretrained on 2015's competition data by following the pipeline below:
Preprocess 2015's data -> combine the train, CV and test sets into one whole training set (since we validate only on 2019's competition data) -> define a model with ImageNet weights as the starting weights -> train the model on 2015's data for 5 epochs and save the weights.
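Sketched in Keras, the two-stage idea might look roughly like this, using DenseNet121 (covered next) as the example backbone; the 5-unit sigmoid head, the average pooling and the weight file names are assumptions, and the fit calls are left commented out because they depend on the prepared arrays:

from tensorflow import keras

def build_model(backbone_fn, input_shape=(128, 128, 3)):
    # Pretrained backbone with ImageNet weights, topped by five sigmoid
    # outputs (one per ordinal position).
    backbone = backbone_fn(include_top=False, weights="imagenet",
                           input_shape=input_shape, pooling="avg")
    outputs = keras.layers.Dense(5, activation="sigmoid")(backbone.output)
    model = keras.Model(backbone.input, outputs)
    model.compile(optimizer="adam", loss="binary_crossentropy")
    return model

# Stage 1: start from ImageNet weights and train briefly on the 2015 data.
model = build_model(keras.applications.DenseNet121)
# model.fit(x_2015, y_2015, epochs=5)
# model.save_weights("densenet121_2015.h5")

# Stage 2: reload those weights as the starting point for the 2019 data.
# model.load_weights("densenet121_2015.h5")
# model.fit(x_train, y_train, validation_data=(x_cv, y_cv), epochs=5)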

Densenet 121
The DenseNet architecture is first pretrained on 2015's data, and the weights from this model are used as the starting point for the model trained on 2019's competition data.

Densenet121

From the above cell we can see that the number of parameters here is around 7M, which is significantly more than our baseline models.
This model performs very well, scoring a kappa of 91.34.
The recall matrix of this model is a huge improvement over our baseline model.

Recall Matrix for Densenet 121
Tensorboard Logs for Densenet121 model

Resnet 50
This is a larger architecture with more parameters than DenseNet121; the same process of pretraining on 2015's data is followed here.

Resnet 50
Tensorboard for Resnet50

The number of parameters here has increased to around 23.5M, compared to the DenseNet121 architecture with only 7M parameters.
However, the performance of this model is not significantly better than what DenseNet121 achieved: it reaches a kappa score of 91.65, and its recall matrix is very similar.

Recall Matrix for model Resnet-50 model

Efficientnets
This architecture is designed to perform on par with state-of-the-art architectures but with fewer parameters and hence less computational power. It is based on MobileNet (which uses concepts like depth-wise convolution to reduce the number of parameters), and the EfficientNet paper focuses on how to scale a CNN architecture effectively.
Based on the depth and width of the architecture, EfficientNets form a family of models ranging from EfficientNet-B0 (the smallest) to EfficientNet-B7 (the largest).

To compare the performance of EfficientNets with DenseNet121 and ResNet50, we shall choose EfficientNet-B4, whose parameter count is comparable (EfficientNet-B0 and B3 are also experimented with in the GitHub code).
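A sketch of wiring up EfficientNet-B4 with the same ordinal head; EfficientNetB4 ships with tf.keras in recent TensorFlow releases (2.3+), whereas at the time of the competition the standalone 'efficientnet' package was commonly used instead, but the wiring is the same:

from tensorflow import keras

backbone = keras.applications.EfficientNetB4(
    include_top=False, weights="imagenet",
    input_shape=(128, 128, 3), pooling="avg",
)
outputs = keras.layers.Dense(5, activation="sigmoid")(backbone.output)
model = keras.Model(backbone.input, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy")
model.summary()   # the parameter count sits between DenseNet121 and ResNet50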

Efficientnet B4

The number of parameters used by this architecture is more than DenseNet121 and less than ResNet50, but the performance achieved here is the best of the three: this model scores a kappa of 92.01, and the recall matrix has also improved.

Recall Matrix for EfficientNet-B4
Tensorboard for Efficientnet-B4
Plot for Validation Kappa score on Efficientnet Model
The Kappa scores show that there is no overfitting

XGBoost on the outputs of the various models was also tried, but the performance improvement was insignificant, so EfficientNet-B4 is chosen as the final model.
A pipeline that implements everything from preprocessing raw data to predicting the output has been created and is available at the GitHub link.

Models summary

A kappa score of 92 means that, beyond what would be expected by chance, our best model and the doctor agree with each other approximately 92% of the time, which is exceptional!


Links

LinkedIn- click here
Github- click here

References

  1. Appliedaicourse.com (This is the place where I gained all my knowledge of Machine Learning and Deep Learning)
  2. https://github.com/btgraham/SparseConvNet/blob/kaggle_Diabetic_Retinopathy_competition/competitionreport.pdf
  3. https://arxiv.org/abs/1905.11946
  4. https://arxiv.org/abs/0704.1028
  5. https://www.kaggle.com/xhlulu/aptos-2019-densenet-keras-starter
  6. https://www.kaggle.com/c/aptos2019-blindness-detection/discussion/108065

Future Work

  1. Pretrain the Best model on 2015’s data for more than 5 epochs to attain a more powerful starting model.
  2. Experiment with more image augmentation techniques.
  3. The model can be deployed on cloud so that it can be accessed across the world in rural medical camps.
