Multi-class Illness Classification on Healthcare Data

Deepak Senapati
Published in Zairza
4 min read · Aug 15, 2018

India has a population of 1.21 billion, of which 13.1 per cent is child population aged 0–6 years. In India, the under-5 mortality rate and the infant mortality rate are 49 and 42, respectively.

This was stated in a UNICEF report. The above excerpt represents the grim reality of child healthcare in India. The eight socio-economically backward states of Bihar, Chhattisgarh, Jharkhand, Madhya Pradesh, Odisha, Rajasthan, Uttaranchal and Uttar Pradesh, referred to as the Empowered Action Group (EAG) states, lag behind in the demographic transition and have the highest infant and under-5 mortality rates in the country, as represented by the following bar graph.

Illness-predictor is a machine learning model that analyses Clinical, Anthropometric and Biochemical (CAB) data to predict the illness type in children under the age of five, based on factors such as age, weight, sex and haemoglobin content, as well as nutritional indicators like months of breastfeeding and the month when the child was first fed water, vegetables, etc. Through this model, the project tries to identify high-risk zones that have a higher probability of common diseases like diarrhoea and dysentery among children under five.

Data

Data for this project was obtained from Kaggle.

The data had 1.8 million entries, but only about 121,000 of them pertained to children under 5 years, so the remaining entries were dropped. Missing values were imputed with the mean or median (for continuous features) and the mode (for categorical features). A few unwanted and rudimentary feature columns were also dropped. After cleaning, the final dataset had 120,969 rows × 16 columns (15 feature columns and 1 label column).
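A minimal cleaning sketch of these steps with pandas is shown below. The file name and column names ("age_months", "weight_kg", "haemoglobin", "illness_type", etc.) are assumptions for illustration, since the post does not list the actual CAB schema.

```python
# Sketch of the cleaning pipeline; file and column names are placeholders.
import pandas as pd

df = pd.read_csv("cab_data.csv")            # ~1.8 million rows in the raw dump

# Keep only children under 5 years (age column assumed to be in months)
df = df[df["age_months"] < 60]

# Impute continuous features with the median, categorical ones with the mode
continuous = ["age_months", "weight_kg", "haemoglobin"]
categorical = ["sex", "illness_type"]
for col in continuous:
    df[col] = df[col].fillna(df[col].median())
for col in categorical:
    df[col] = df[col].fillna(df[col].mode()[0])

# Drop columns not used as features or as the label (names are hypothetical)
df = df.drop(columns=["survey_id", "district_code"], errors="ignore")

print(df.shape)   # expected roughly (120969, 16) after cleaning
```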

Model Training and Selection

This dataset was then used to train several models, and the best among them in terms of accuracy and acceptable error was chosen for the final prediction. Five classifiers were initially considered for training and testing, and the best one was selected; a sketch of how they can be set up follows the list below.

The five models are:

  1. Random Forest Classifier
  2. Logistic Regression Classifier
  3. Decision Tree Classifier
  4. K-nearest Neighbour Classifier
  5. Bagging Classifier
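The following sketch instantiates the five candidate classifiers with scikit-learn. The hyper-parameters shown are defaults chosen for illustration, not the values used in the project.

```python
# Candidate classifiers; parameters are illustrative assumptions.
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

models = {
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "K-nearest Neighbours": KNeighborsClassifier(n_neighbors=5),
    "Bagging": BaggingClassifier(random_state=42),
}
```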

Two techniques, cross-validation on a hold-out test set and stratified k-fold cross-validation, were used to estimate the accuracy of the models.

Evaluation of Models using Cross-Validation

The dataset was divided into training and testing sets in a 70:30 ratio, giving a training set of 84,678 entries and a test set of 36,291 entries. The classifiers were first trained and their accuracy was then evaluated on the test set. The confusion matrix for each classifier was also plotted as a heat-map.
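A sketch of this hold-out evaluation is shown below, reusing the `df` and `models` objects from the earlier sketches. The label column name and the random seed are assumptions.

```python
# 70:30 hold-out evaluation with accuracy and confusion-matrix heat-maps.
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

X = df.drop(columns=["illness_type"])   # 15 feature columns
y = df["illness_type"]                  # label column (assumed name)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)   # ~84,678 training and ~36,291 test rows

for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(f"{name}: {accuracy_score(y_test, y_pred):.4f}")

    # Confusion matrix plotted as a heat-map
    sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, fmt="d", cmap="Blues")
    plt.title(f"Confusion matrix – {name}")
    plt.show()
```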

Accuracy of the Classifiers

On the basis of the accuracy of the different classifiers and their confusion matrices, the Random Forest, Decision Tree and K-nearest Neighbour classifiers were shortlisted for evaluation using stratified k-fold cross-validation.

Evaluation of Models using Stratified k-fold Cross-Validation

Stratification is the process of rearranging the data so as to ensure that each fold is a good representative of the whole. The original sample is randomly partitioned into k equal-sized subsamples. Of the k subsamples, a single subsample is retained as the validation data for testing the model, and the remaining k − 1 subsamples are used as training data. The cross-validation process is then repeated k times (the folds), with each of the k subsamples used exactly once as the validation data. The folds are selected so that each class is represented in approximately the same proportion in every fold.

Here, the samples were divided into 20 folds with shuffling. The classifiers were then trained on this data and their performance was recorded as follows.
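A minimal sketch of this step with scikit-learn is shown below. The fold count and shuffling follow the post; the random seed is an assumption.

```python
# 20-fold stratified cross-validation with shuffling for the shortlisted models.
from sklearn.model_selection import StratifiedKFold, cross_val_score

skf = StratifiedKFold(n_splits=20, shuffle=True, random_state=42)

shortlisted = ["Decision Tree", "K-nearest Neighbours", "Random Forest"]
for name in shortlisted:
    scores = cross_val_score(models[name], X, y, cv=skf, scoring="accuracy")
    print(f"{name}: max={scores.max():.4f}, mean={scores.mean():.4f}")
```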

Decision Tree Classifier :

Accuracy of Decision Tree Classifier

Decision tree classifier shows a maximum accuracy of 75.35% and a mean accuracy of 74.58%.

K-nearest Neighbour Classifier :

Accuracy of K-nearest Neighbours Classifier

K-nearest Neighbour Classifier shows a maximum accuracy of 74.48% and an average accuracy of 74.08%.

Random Forest Classifier :

The Random Forest Classifier shows a maximum accuracy of 75.37% and an average accuracy of 74.68%.

Thus, the Random Forest Classifier has the highest accuracy among the three classifiers, both in terms of maximum and average accuracy.

Hence, the Random Forest Classifier was selected for prediction.

Saving the Model for Prediction

The selected model (the Random Forest Classifier) was saved by pickling for future predictions.
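A minimal sketch of this step with Python's pickle module is shown below; the file name is an assumption.

```python
# Persist the trained Random Forest model to disk for later use.
import pickle

with open("illness_predictor.pkl", "wb") as f:
    pickle.dump(models["Random Forest"], f)

# Later, e.g. in the web app, the model can be reloaded:
with open("illness_predictor.pkl", "rb") as f:
    clf = pickle.load(f)
```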

Web App

A web app was developed for the project using Flask. It includes a UI for making predictions on user-provided data and also incorporates visualisation using D3.js.
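A minimal Flask sketch of a prediction endpoint is shown below. The route name, the JSON field handling and the model file name are assumptions for illustration, not the project's actual API.

```python
# Hypothetical prediction endpoint serving the pickled model.
import pickle
from flask import Flask, request, jsonify

app = Flask(__name__)

with open("illness_predictor.pkl", "rb") as f:
    clf = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    # Expect a JSON object whose values follow the 15-feature column order
    features = [list(request.get_json().values())]
    prediction = clf.predict(features)[0]
    return jsonify({"illness_type": str(prediction)})

if __name__ == "__main__":
    app.run(debug=True)
```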

Through this model, the project tries to identify high-risk zones that have a higher probability of common diseases like diarrhoea and dysentery among children under five years of age. Accordingly, various stakeholders, namely the government, district administrations, likely-to-be-affected families and non-profit organisations, can pool their resources to chart a new course of safety and, thereby, welfare in India.
