# MATLAB Benchmark Code for WiDS Datathon 2020

By Neha Goel

Hello all, I am Neha Goel, Technical Lead for AI/Data Science competitions on the MathWorks Student Competition team. MathWorks is excited to support WiDS Datathon 2020 by providing complimentary MATLAB Licenses, tutorials, and getting started resources to each participant.

**To request your complimentary license, go to the MathWorks site, click the “Request Software” button, and fill out the software request form.** *You will get your license within 72 business hours.*

The WiDS Datathon 2020 focuses on patient health through data from MIT’s GOSSIS (Global Open Source Severity of Illness Score) initiative. The Datathon is brought to you by the Global WiDS team, the West Big Data Innovation Hub, and the WiDS Datathon Committee, and is open until February 24, 2020.

The Datathon task is to train a model that takes as input the patient record data and outputs a prediction of how likely it is that the patient survives. In this blog post I will walk through basic starter code in MATLAB. Additional resources for other training methods are linked at the bottom of the blog post.

Register for the competition and download the data files from Kaggle. “*training.csv*” is the training data file and “*unlabeled.csv*” is the test data.

*Tip: Save the csv files as .xlsx to avoid end of file blank rows.*

# Step 1: Load Data

Once you download the files, make sure that they are on the MATLAB path. Here I use the readtable function to read the files and store them as tables. *TreatAsEmpty* specifies placeholder text that should be treated as an empty value in the numeric columns of the file. Table elements corresponding to the characters ‘NA’ will be set to ‘NaN’ when imported. You can also import data using the MATLAB Import tool.

`TrainSet = readtable('training.xlsx','TreatAsEmpty','NA');`
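To see what *TreatAsEmpty* does, here is a minimal sketch on a hypothetical three-row CSV file. The file contents and column values are made up for illustration and are not the Datathon data.

```matlab
% Write a tiny, made-up CSV file to a temporary location.
tmpFile = [tempname '.csv'];
fid = fopen(tmpFile,'w');
fprintf(fid,'encounter_id,age,bmi\n1,68,22.7\n2,NA,27.4\n3,25,NA\n');
fclose(fid);

% Read it back, treating the text NA as an empty (missing) numeric value.
T = readtable(tmpFile,'TreatAsEmpty','NA');

% Count the missing entries in each column.
nMissing = sum(ismissing(T),1);

delete(tmpFile);
```

The *age* and *bmi* columns stay numeric, and the ‘NA’ entries show up as NaN.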

# Step 2: Clean Data

The biggest challenge with this dataset is that the data is messy: 186 predictor columns and 91713 observations, with a lot of missing values. Data transformation and modelling will be the key areas to work on to avoid overfitting.

Using the summary function, I analyzed the types of the predictors, the min, max, median values and number of missing values for each predictor column. This helped me derive relevant assumptions to clean the data.

`summary(TrainSet);`
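With this many columns, the printed summary is long. As a complement, the fraction of missing values per column can be collected into a small table and sorted. This is only a sketch on a made-up table; `missingReport` and `FractionMissing` are hypothetical names.

```matlab
% Made-up table standing in for the training data.
T = table([1;2;3;4],[68;NaN;25;NaN],[22.7;27.4;NaN;31.0], ...
    'VariableNames',{'encounter_id','age','bmi'});

% Fraction of missing entries in each column.
missingFrac = mean(ismissing(T),1);

% Collect the result in a table and list the worst columns first.
missingReport = table(T.Properties.VariableNames',missingFrac', ...
    'VariableNames',{'Predictor','FractionMissing'});
missingReport = sortrows(missingReport,'FractionMissing','descend');
```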

There are many different approaches to work with the missing values and predictor selection. We will go through one of the approaches in this blog. You can also refer to this document to learn about other methods: Clean Messy and Missing Data.

*Note: The data cleaning approach demonstrated here is chosen arbitrarily to cut down the number of predictor columns.*

**Remove the character columns of the table**

The reason behind this is that the algorithm I chose to train the model is *fitclinear*, and it only accepts a numeric matrix as the predictor input.

`TrainSet = removevars(TrainSet, {'ethnicity','gender','hospital_admit_source','icu_admit_source','icu_stay_type','icu_type','apache_3j_bodysystem','apache_2_bodysystem'});`
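If you prefer not to list the character columns by name, one alternative is to keep only the numeric variables with `vartype`. The sketch below uses a made-up table; `numericOnly` is a hypothetical name.

```matlab
% Made-up table with one text column between two numeric columns.
T = table([1;2],{'M';'F'},[22.7;27.4], ...
    'VariableNames',{'encounter_id','gender','bmi'});

% Keep only the numeric variables; text columns are dropped.
numericOnly = T(:,vartype('numeric'));
```

Note that this drops every non-numeric column, so check the result if you want to keep some of them for encoding later.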

**Remove minimum values from all the vitals predictors**

After analyzing the **WiDS Datathon 2020 dictionary.csv** file provided with the Kaggle data, I noticed that the even columns from column 42 to 168 correspond to minimum values of predictors in the **vital** category.

`TrainSet = removevars(TrainSet, [42:2:168]);`
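Index-based removal depends on the exact column order. A name-based variant is sketched below, assuming the minimum columns share a `_min` suffix. That suffix is an assumption for illustration and should be checked against the dictionary file. The toy column names are made up.

```matlab
% Made-up table with max/min pairs for two vitals.
T = table([1;2],[80;90],[60;55],[120;130],[100;95], ...
    'VariableNames',{'encounter_id','d1_diasbp_max','d1_diasbp_min', ...
    'd1_sysbp_max','d1_sysbp_min'});

% Flag the columns whose names end in _min (assumed naming convention).
isMinCol = endsWith(T.Properties.VariableNames,'_min');

% Remove the flagged columns by name.
T = removevars(T,T.Properties.VariableNames(isMinCol));
```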

**Remove the observations which have 30 or more missing predictors**

The other assumption I made is that observations (patients) with 30 or more missing predictor values can be removed.

`TrainSet = rmmissing(TrainSet,1,'MinNumMissing',30);`
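The effect of *MinNumMissing* is easier to see on a small example. In this sketch with made-up data, the threshold is 2 instead of 30, so only rows with at least two missing values are dropped.

```matlab
% Made-up table: row 1 has two missing values, rows 2 and 3 have one each.
T = table([1;2;3],[NaN;70;NaN],[NaN;1.8;1.7],[60;NaN;65], ...
    'VariableNames',{'id','a','b','c'});

% Remove rows (dimension 1) that have 2 or more missing entries.
cleaned = rmmissing(T,1,'MinNumMissing',2);
```

Only the first row is removed; the rows with a single missing value are kept for the fill step.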

**Fill the missing values**

The next step is to fill in all the NaN values. One approach is to use the fillmissing function to fill data using linear interpolation. Other approaches include replacing NaN values with mean or median values and removing the outliers using the Curve Fitting app.

`TrainSet = fillmissing(TrainSet,'linear');`
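To compare the linear interpolation used here with the median replacement mentioned above, here is a small sketch on a made-up numeric matrix. The variable names are hypothetical.

```matlab
% Made-up matrix with one missing value in each column.
X = [1 10; NaN 20; 3 NaN; 5 40];

% Linear interpolation along each column, using the neighbouring rows.
XLinear = fillmissing(X,'linear');

% Median replacement: one fill value per column, ignoring NaN.
colMedian = median(X,1,'omitnan');
XMedian = fillmissing(X,'constant',colMedian);
```

Linear interpolation fills a gap from the rows above and below it, so the result depends on row order. Median replacement does not, which may be a consideration when the rows are unrelated patients.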

In this step I move the label column *hospital_death* to the last column of the table, because for some algorithms in MATLAB and in the Classification Learner app the last column is the default response variable.

`TrainSet = movevars(TrainSet,'hospital_death','After',114);`
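The hard-coded position 114 only holds for the exact cleaning steps above. If you remove a different set of columns, a sketch like the following, on a made-up table, uses `width` to find the last column instead.

```matlab
% Made-up table with the label in the first column.
T = table([0;1],[1;2],[22.7;27.4], ...
    'VariableNames',{'hospital_death','encounter_id','bmi'});

% Move the label after the last column, whatever the table width is.
T = movevars(T,'hospital_death','After',width(T));
```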

# Step 3: Create Training Data

Once I have the cleaned training data, I separate the label column *hospital_death* from the training set and create two separate variables: *XTrain* (predictor data) and *YTrain* (class labels).

`XTrain = removevars(TrainSet,{'hospital_death'}); YTrain = TrainSet.hospital_death;`

# Step 4: Create Test Data

Download the *unlabeled.csv* file from Kaggle. Read the file using the *readtable* function to store it as a table.

`XTest = readtable('unlabeled.xlsx','TreatAsEmpty','NA');`

I used a similar approach for cleaning test data as the training data above. *XTest* is the test data with no label predictor.

**Remove the character columns of the table**

As the *unlabeled.csv* file contains the *hospital_death* column with NA values, I removed it along with the other character type columns.

`XTest = removevars(XTest, {'hospital_death','ethnicity','gender','hospital_admit_source','icu_admit_source','icu_stay_type','icu_type','apache_3j_bodysystem','apache_2_bodysystem'});`

**Remove minimum values from all the vitals predictors**

After removing the *hospital_death* column, the minimum values of the **vital** category are now offset so they correspond to the odd columns from column 41 to 167.

`XTest = removevars(XTest, [41:2:167]);`

**Fill the missing values**

`XTest = fillmissing(XTest,'linear');`
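Because the training and test tables are cleaned separately, it is worth checking that they end up with the same predictor columns in the same order. A minimal sketch on made-up tables (the `Toy` names are hypothetical):

```matlab
% Made-up stand-ins for the cleaned training and test predictors.
XTrainToy = table([1;2],[22.7;27.4],'VariableNames',{'encounter_id','bmi'});
XTestToy  = table([3;4],[30.1;19.8],'VariableNames',{'encounter_id','bmi'});

% Compare column names and order.
sameColumns = isequal(XTrainToy.Properties.VariableNames, ...
    XTestToy.Properties.VariableNames);

% List any columns present in one table but not the other.
onlyInTrain = setdiff(XTrainToy.Properties.VariableNames, ...
    XTestToy.Properties.VariableNames);
onlyInTest = setdiff(XTestToy.Properties.VariableNames, ...
    XTrainToy.Properties.VariableNames);
```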

In MATLAB you can train a model using two different methods.

- Using custom MATLAB machine learning algorithm functions
- Training the model using the Classification Learner app

Here I walk through the steps for both methods. I would encourage you to try both approaches and train the model using different algorithms and parameters. It will help in optimizing and comparing the scores of different models.

# Step 5 — Option 1: Using custom algorithms

A binary classification problem can be approached using various algorithms such as decision trees, SVM, and logistic regression. Here I train a *fitclinear* classification model, which trains linear binary classification models on high-dimensional predictor data.

Convert the tables to numeric matrices, because the *fitclinear* function takes only a numeric matrix as the predictor input.

`XTrainMat = table2array(XTrain); XTestMat = table2array(XTest);`

**Fit the model**

The name-value pair input arguments of the function give you options for tuning the model. Here I set the solver to *sparsa* (Sparse Reconstruction by Separable Approximation), which uses lasso regularization by default. To optimize the model, I do some hyperparameter optimization.

Setting ‘*OptimizeHyperparameters*’ to ‘*auto*’ optimizes over *{lambda, learner}*. The *acquisition function* name ‘expected-improvement-plus’ used below modifies the optimizer's behavior when it is overexploiting an area.

You can further cross-validate the model from within the input arguments using the cross-validation options: *crossval*, *KFold*, *CVPartition*, etc. Check out the fitclinear documentation to learn about the input arguments.

`Mdl = fitclinear(XTrainMat,YTrain,'ObservationsIn','rows','Solver','sparsa','OptimizeHyperparameters','auto','HyperparameterOptimizationOptions',struct('AcquisitionFunctionName','expected-improvement-plus'))`
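As an illustration of the cross-validation options, the sketch below trains a 5-fold cross-validated lasso model on synthetic data. The data, the fixed *Lambda* value, and the logistic learner are assumptions chosen to keep the example quick. This is not the benchmark configuration.

```matlab
% Synthetic data: 200 observations, 5 predictors, binary label.
rng(0);                                   % for reproducibility
n = 200;
X = randn(n,5);
y = double(X(:,1) + 0.5*X(:,2) + 0.3*randn(n,1) > 0);

% 5-fold cross-validated linear model with lasso regularization.
CVMdl = fitclinear(X,y,'Learner','logistic','Solver','sparsa', ...
    'Regularization','lasso','Lambda',1e-3,'KFold',5);

% Average out-of-fold classification error.
cvError = kfoldLoss(CVMdl);
```

The out-of-fold error gives a less optimistic estimate than the error on the data used for training.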

**Predict on the Test Set**

Once you have your model ready, you can perform predictions on your test set using the predict function. It takes as input the fitted model and test data with the same predictors as the training data. The output is the predicted *labels* and *scores*.

`[labelOpt1,scoresOpt1] = predict(Mdl,XTestMat);`
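The scores have one column per class, in the order given by the model's *ClassNames* property. The sketch below uses synthetic data and assumes a logistic learner, for which the two score columns are posterior probabilities. With other learners the scores are on a different scale.

```matlab
% Synthetic training and test data with a binary label.
rng(2);
XTr = randn(150,3);
yTr = double(XTr(:,1) - XTr(:,3) > 0);
XNew = randn(10,3);

% Fit a logistic-regression linear model and predict on new data.
Mdl = fitclinear(XTr,yTr,'Learner','logistic');
[labels,scores] = predict(Mdl,XNew);

% Column 2 corresponds to the second entry of Mdl.ClassNames (here, 1).
positiveScore = scores(:,2);
```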

# Step 5 — Option 2: Using Classification Learner App

The second method of training the model is to use the Classification Learner app. It lets you interactively train, validate, and tune classification models. Let’s see the steps to work with it.

- On the **Apps** tab, in the Machine Learning group, click **Classification Learner**.
- Click **New Session** and select data (**TrainSet**) from the workspace. Specify the response variable (**hospital_death**).
- Select the validation method to avoid overfitting. You can choose either **holdout validation** or **cross-validation**, selecting the number of k-folds.
- On the **Classification Learner tab**, in the **Model Type** section, select the algorithm to be trained, e.g. *logistic regression*, *All svm*, *All Quick-to-train*.
- You can also try transforming features by **enabling PCA** to reduce dimensionality.
- The model can further be improved by changing parameter settings in the **Advanced dialog box**.
- Once all required options are selected, click **Train**.
- The history window on the left displays the different models trained and their accuracy.
- Performance of the model on the validation data can be evaluated by **Confusion Matrix** and **ROC Curve**.
- To make predictions on the test set I export the model by selecting **Export Model** on the *Classification Learner* tab.

**Predict on the Test Set**

The exported model is saved as **trainedModel** in the workspace. You can then predict labels and scores using **predictFcn**.

The **label** output contains the predicted labels for the test set. The **scores** output contains the scores of each observation for both the positive and negative class.

`[labelOpt2,scoresOpt2] = trainedModel.predictFcn(XTest)`

After a classification algorithm has trained on data, we examine the performance of the algorithm on our test dataset. To inspect the classifier performance more closely I plotted a Receiver Operating Characteristic (ROC) curve. By definition, a ROC curve shows true positive rate versus false positive rate for different thresholds of the classifier output.

The AUC (Area Under Curve) is the area enclosed by the ROC curve. A perfect classifier has AUC = 1 and a completely random classifier has AUC = 0.5. Usually, your model will score somewhere in between. The range of possible AUC values is [0, 1].

A confusion matrix plot is used to understand how the currently selected classifier performed in each class. To view the confusion matrix after training a model, you can use the MATLAB plotconfusion function.
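If you only need the counts rather than a plot, a different function, `confusionmat` from Statistics and Machine Learning Toolbox, returns them as a matrix. Here is a sketch on made-up labels:

```matlab
% Made-up true and predicted labels for five observations.
yTrue = [0;0;1;1;1];
yPred = [0;1;1;1;0];

% Rows are true classes, columns are predicted classes.
[C,order] = confusionmat(yTrue,yPred);

% Fraction of correct predictions (diagonal over total).
accuracy = sum(diag(C)) / sum(C(:));
```

Here `order` lists the class labels in the row and column order of `C`.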

# Step 6: Evaluate Model

To evaluate the model, MATLAB has the perfcurve function. It calculates the false positive rate, true positive rate, thresholds, and AUC score. The input arguments to the function include the test labels, the scores, and the positive class label. For self-evaluation you need true labels (*YTest*) for the observations you score. Since *XTest* is unlabeled, one way to get them is to hold out a labeled subset of the training data and predict on it. I used the scores generated by **option 2** above, which correspond to the *trainedModel* created by the Classification Learner app.

`[fpr,tpr,thr,auc] = perfcurve(YTest,scoresOpt2(:,2),'1');`
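One way to obtain labeled data for self-evaluation is to hold out part of the labeled training set. The sketch below does this on synthetic data with `cvpartition`, using a *fitclinear* logistic model as a stand-in for whichever model you trained. All variable names and the 30% holdout fraction are assumptions.

```matlab
% Synthetic labeled data standing in for the cleaned training set.
rng(1);
n = 300;
X = randn(n,4);
y = double(X(:,1) - X(:,2) + 0.5*randn(n,1) > 0);

% Hold out 30% of the labeled observations for evaluation.
c = cvpartition(y,'HoldOut',0.3);
XTr = X(training(c),:);   yTr = y(training(c));
XHold = X(test(c),:);     yHold = y(test(c));

% Train on the remaining data and score the held-out observations.
Mdl = fitclinear(XTr,yTr,'Learner','logistic');
[~,scores] = predict(Mdl,XHold);

% ROC curve and AUC, with class 1 as the positive class.
[fpr,tpr,thr,auc] = perfcurve(yHold,scores(:,2),1);
```

Calling `plot(fpr,tpr)` afterwards draws the ROC curve.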

I get an **AUC of 0.85** and the below ROC Curve.

*Note: the auc calculated through this function might differ from the auc calculated on Kaggle leaderboard.*

# Step 7: Kaggle submission

Create a table of the results based on the IDs and prediction scores. The desired file format for submission is:

*encounter_id, hospital_death*

You can place all the test results in a MATLAB table, which makes it easy to visualize and to write to the desired file format. I stored the positive-class scores (the second column of the scores).

`testResults = table(XTest.encounter_id,scoresOpt2(:,2),'VariableNames',{'encounter_id','hospital_death'});`

Write the results to a CSV file. This is the file you will submit for the challenge.

`writetable(testResults,'testResults.csv');`
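Before uploading, it can help to read the file back and confirm the header and row count. This is a sketch with made-up IDs and scores, written to a temporary file; the variable names are hypothetical.

```matlab
% Made-up IDs and predicted scores.
ids = [101;102;103];
probs = [0.12;0.87;0.40];
submission = table(ids,probs, ...
    'VariableNames',{'encounter_id','hospital_death'});

% Write to a temporary CSV file and read it back.
outFile = [tempname '.csv'];
writetable(submission,outFile);
check = readtable(outFile);
delete(outFile);

% Confirm the header matches the required format.
headerOk = isequal(check.Properties.VariableNames, ...
    {'encounter_id','hospital_death'});
```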

Thanks for following along with this code! We are excited to find out how you will modify this starter code and make it yours. I strongly recommend looking at our Resources section below for more ideas on how you can improve our benchmark model.

Feel free to reach out to us in the Kaggle forum or email us at studentcompetitions@mathworks.com if you have any further questions.

# Resources

*Originally published at **https://blogs.mathworks.com** on January 17, 2020.*