Machine Learning Technique for Classifying Longitudinal Trajectories

This post is intended to be a prelude to a journal paper that I will be writing on the subject. I am choosing to publish it on Medium first to organize these concepts in a less formal format.

Motivation

In medical applications there are cases where a patient comes in for multiple assessments and a practitioner would like a classification of some condition. There are powerful machine learning techniques that use data from these assessments to make classifications. However, traditional machine learning techniques treat each assessment, or observation, as completely independent. This means a classification can be made at each assessment, but not using the data accumulated across all of the assessments. The method discussed in this post takes into account data from every available observation to make an overall prediction for the patient. When a new observation is recorded for the patient, the prediction can be updated using this new data.

The motivation for this method was established while researching differences between EEG signals in infants who would go on to develop autism and those of typically developing infants (Bosl 2014). The study involved taking multiple EEG readings of each patient between 3 and 36 months of age. The challenge with the data is that each patient had an irregular quantity and sequence of observations. One patient might have been measured at 3, 18, 21, and 24 months, while another might have been measured at just 9 and 18 months. Each observation contained meaningful data, and it seems reasonable to expect more accurate predictions the more observations a patient had.

Making the problem harder was the fact that at each age we expected different EEG features to be significant. So each age must be considered with a completely different set of features derived from the raw EEG signal. Our goal was to create a method that could generate a prediction from the first observation of each patient, then update that prediction with the data from each subsequent observation.

Method

The method that we developed utilizes two very well-studied techniques from statistics and machine learning: k-nearest neighbors and the Bayesian binomial-beta model. In this method we deal only with binary classifications.

Algorithm:

  1. The features from each training/labeled observation are binned into their respective ages (3, 6, 9 months, etc.).
  2. The first observation of the patient being classified is compared against the training data for its observation age. The labels of the k nearest neighbors to the observation are recorded.
  3. The recorded labels are treated as successes or failures and used to update a binomial-beta prior. The updated beta posterior is then treated as the prior for the next observation.
  4. Steps 2 and 3 are repeated for each of the remaining observations for the patient being classified.
  5. The analytical mean of the final beta posterior is the probability of the class labeled ‘success’. (A code sketch of the full procedure follows below.)
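
For concreteness, here is a minimal Python sketch of the procedure. The function names and data layout (a dict mapping each age bin to its training features and labels) are my own illustration, not code from the study:

```python
import numpy as np

def knn_success_count(x, train_X, train_y, k=10):
    """Count 'success' labels (coded 1) among the k nearest
    training points to x, using Euclidean distance."""
    dists = np.linalg.norm(train_X - x, axis=1)
    nearest = np.argsort(dists)[:k]
    return int(train_y[nearest].sum())

def classify_trajectory(observations, binned_train, k=10):
    """Run the updating classifier over one patient's observations.

    observations: dict mapping age -> feature vector
    binned_train: dict mapping age -> (train_X, train_y) for that age bin
    Returns the posterior mean, i.e. P(class == 'success').
    """
    a, b = 1.0, 1.0                      # uninformative Beta(1, 1) prior
    for age in sorted(observations):
        train_X, train_y = binned_train[age]
        y = knn_success_count(observations[age], train_X, train_y, k)
        a, b = a + y, b + (k - y)        # conjugate binomial-beta update
    return a / (a + b)                   # analytical mean of final posterior
```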

This method is very simple and straightforward for readers who are familiar with k-nearest neighbors and binomial-beta Bayesian models. A brief description of each is given below in the context of this method.

KNN

K-nearest neighbors is a very common supervised classification technique. It uses the observations in a labeled training set that are closest to the observation in question (x) to generate a prediction. The ‘closeness’ measure is typically Euclidean distance. In most classification applications of KNN, the prediction, Y, is simply the majority label of the k closest points to x.

In our method we take the k nearest labels and, rather than compute the majority, treat each label as a ‘success’ or ‘failure’ in a binomial trial. The success/failure counts at each patient observation are used to update a binomial-beta prior, as sketched below.
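
As a toy illustration of this counting step (using scikit-learn's NearestNeighbors; the data here is made up):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Toy training set for one age bin: 2-D features, binary labels.
train_X = np.array([[0.1, 0.2], [1.5, 1.1], [0.3, -0.4],
                    [1.2, 0.9], [-0.5, 0.0], [1.0, 1.3]])
train_y = np.array([0, 1, 0, 1, 0, 1])

nn = NearestNeighbors(n_neighbors=3).fit(train_X)
_, idx = nn.kneighbors([[1.1, 1.0]])      # the new observation x
successes = int(train_y[idx[0]].sum())    # 'success' count for the update
```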

Bayesian Binomial-Beta

In Bayesian inference a prior distribution is formed using prior knowledge about a system (if there is no prior knowledge, an uninformative prior can be used). Data is then used to update that prior distribution to form a posterior distribution, which can be used to draw inferences about a parameter of interest. In our case we are interested in the probability that an infant is at risk of developing autism. We can use the success/failure data from a binomial trial to form a posterior distribution for this probability. The derivation is as follows:
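
In outline: with a uniform Beta(1, 1) prior, p(θ) = 1, and a binomial likelihood for y successes in n trials,

p(θ | y) ∝ p(y | θ) p(θ) = C(n, y) θ^y (1 − θ)^(n − y) · 1 ∝ θ^y (1 − θ)^(n − y)

which is the kernel of a Beta(y + 1, n − y + 1) distribution.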

Credit: James Wilson

This derivation shows the posterior that comes from an uninformative prior, which we would use for a patient’s first observation. After that, we can use the posterior distribution as the prior for the next observation. The posterior after the next observation is then Beta(a + y, b + n − y), where a and b are the alpha and beta parameters of the prior, y is the number of successes for that observation, and n is the number of trials; in our case n is equal to the k that we choose to use.
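
For example, with k = 10 neighbors and y = 4 successes at the first observation, the uniform prior updates as Beta(1, 1) → Beta(1 + 4, 1 + 10 − 4) = Beta(5, 7), which is exactly the first step of the example below.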

Example

To demonstrate this technique I’ve created a simple example with a two-dimensional dataset. In this example we have a patient with observations at ages 3, 6, 9, 15, 21, and 24 months. First, k-nearest neighbors is run at each observation to obtain the ‘success’/‘failure’ counts.

The counts for each age are used to update the Beta prior distribution. The Beta distributions for each age are plotted below.

Beta(1, 1) → Beta(5, 7) → Beta(14, 8) → Beta(20, 12) → Beta(29, 13) → Beta(31, 21) → Beta(36, 26)

The mean of the final posterior distribution is 36/(36 + 26) ≈ .58. This can be interpreted as the overall probability that the patient has autism.
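
A quick check of that mean (assuming SciPy is available):

```python
from scipy.stats import beta

posterior = beta(36, 26)     # final posterior from the chain above
print(posterior.mean())      # 0.5806..., the patient's P(autism)
```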

Data

The aim of this method is to use all past observations to improve classification accuracy compared to simply using the most recent observation. To test the effectiveness of this method we generated several simulated datasets, each designed to study a particular aspect of how the method performs. The data was simulated in the following manner:

X ~ Uniform(-2,2)

Typical ~ Normal(X, 1)

Autism ~ Normal(X+𝜹, 1)

where 𝜹 is a constant.
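
A sketch of this generating process in Python (assuming X is drawn once per patient and each observation is a 9-dimensional feature vector, as the data description below suggests; the exact implementation in the study may differ):

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_patient(ages, autism, delta=1.5, n_features=9):
    """Generate one patient's trajectory: a baseline X ~ Uniform(-2, 2),
    then at each age a feature vector ~ Normal(X + delta * autism, 1)."""
    x = rng.uniform(-2, 2)
    shift = delta if autism else 0.0
    return {age: rng.normal(x + shift, 1.0, size=n_features) for age in ages}

# e.g. an 'autism' patient observed at a random subset of ages
ages = sorted(rng.choice(range(3, 37, 3), size=4, replace=False))
patient = simulate_patient(ages, autism=True)
```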

We wanted to study how the model performed under three different factors:

  • Different levels of separation, i.e. different levels of 𝜹. By changing 𝜹 we were able to test the sensitivity of our model.
  • Full observations versus observations at random ages. Either each patient has an observation at every age (3–36 months) or only at a random subset of those ages. In practice, random observations seem more common; we were interested in whether, and by how much, the results improve with a ‘full trajectory’.
  • Including ages that are uninformative, meaning ages where 𝜹 is reduced. We wanted to see how the model recovers from uninformative ages: if there is very little separation between ‘typical’ and ‘autism’ at age 9, how well can the model use the following, more informative observations to recover?

Each dataset consists of 2000 patients in total. Patients were labeled autism at a rate of 1 in 5, so each dataset is roughly an 80:20 split between the two classes. The following 12 datasets were generated and studied:

[Table: list of the 12 datasets studied]

Each observation consists of 9 numbers generated randomly in the manner described above. Here is an example of data with 𝜹 = 1.5, random ages, and no uninformative ages:

[Figure: example of the simulated data]

Results

To evaluate this model we randomly split the data into 80% training data and 20% testing data: the data from 1600 patients was used to train the models, and those models were tested on the data from the remaining 400 patients. Remember, each patient can have multiple observations. Each dataset was randomly split and evaluated 100 times. The plots below show 95% confidence intervals around the average classification accuracy across these trials.
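
The intervals could be computed in several ways; a normal-approximation interval over the 100 split accuracies would look like this (dummy accuracy values stand in for real results):

```python
import numpy as np

# Stand-in for the accuracies from the 100 random 80/20 splits.
accs = np.random.default_rng(0).normal(0.85, 0.02, size=100)

mean = accs.mean()
half = 1.96 * accs.std(ddof=1) / np.sqrt(len(accs))  # 95% CI half-width
print(f"accuracy: {mean:.3f} ± {half:.3f}")
```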

The model’s accuracy at each age is plotted next to the performance of a gradient boosting classifier. The gradient boosting classifier evaluated each age as if the observations at that age were completely independent of all other ages. The updating model uses all of the observations that came before to make a prediction, so the age-12 prediction for a patient with observations at ages 3, 6, 12, and 18 uses only ages 3, 6, and 12. This means that, if the model is working as intended, the classification accuracy should increase as more observations are added.
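
A sketch of such a per-age baseline (toy data standing in for one age bin, following the simulation scheme above; scikit-learn assumed):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)

# Toy stand-in for one age bin: ~20% 'autism', 9 features per observation.
def make_bin(n=200, delta=1.5, n_features=9):
    y = (rng.uniform(size=n) < 0.2).astype(int)
    loc = rng.uniform(-2, 2, size=(n, 1)) + delta * y[:, None]
    return rng.normal(loc, 1.0, size=(n, n_features)), y

X_train, y_train = make_bin()
X_test, y_test = make_bin()

# Baseline: each age bin classified independently, no information shared.
clf = GradientBoostingClassifier().fit(X_train, y_train)
print("per-age baseline accuracy:", clf.score(X_test, y_test))
```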

Plots 1–6 | All ages with equal separation (𝜹)

As we would expect, the gradient boosting classifier’s results appear constant across ages. This makes sense because the average amount of data and the average separation between classes are roughly equal at each age. The updating model’s accuracy improves at each age for separation levels 𝜹 = 1.5 and 𝜹 = 2. However, its accuracy for 𝜹 = 1 appears roughly equivalent to the gradient boosting results. Interestingly, both models beat the 80% accuracy that would result from predicting all ‘typical’, meaning there is enough signal for classification accuracy slightly better than choosing the majority class. However, at this separation level the updating model does not improve its results over time.

For 𝜹 = 1.5 and 𝜹 = 2 the accuracy of the updating model clearly improves over time. At the first observation the accuracy is not significantly different from the gradient boosting result, but with each subsequent observation it clearly improves. This indicates that the algorithm is working as intended.

Plots 7–12 | Some less informative ages

The plots in the left column have less separation (𝜹 = 1) for ages 6, 9, and 12; the first observation, at age 3, has normal separation. The information gathered from the first observation appears to be carried through the following three less informative ages, as the accuracy stays higher than the gradient boosting results. At age 15, when the separation returns to normal, the accuracy of the two algorithms appears roughly equal. After that, the updating model’s accuracy starts to increase with each new observation.

Conclusion

This method certainly has some limitations. For instance, it is not sensitive to trends: if one class has a concave-up trend and the other has an overlapping concave-down trend, this method will only detect differences at the ages that are sufficiently separated, even though polynomial coefficients fit to the two trajectories would be dramatically different. This method also inherits the limitations and assumptions of k-nearest neighbors, most notably poor results with high-dimensional data. In addition, it is advisable to standardize features to ensure distance measurements are not biased by the magnitudes of the features.
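
A minimal sketch of that standardization step with scikit-learn (toy numbers):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train_X = np.array([[3.0, 400.0], [1.0, 100.0], [2.0, 250.0]])
x_new = np.array([[2.5, 300.0]])

scaler = StandardScaler().fit(train_X)   # fit on training data only
train_Xs = scaler.transform(train_X)     # zero mean, unit variance per feature
x_news = scaler.transform(x_new)         # so no feature dominates the distance
```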

A couple of applications where we think this algorithm is relevant:

  1. EEG research: There is a lot of randomness in EEG data depending on the state of the patient. By taking multiple longitudinal measurements into account, the true classification can emerge from the data.
  2. Blood marker classification: The levels of different proteins, enzymes, and other markers vary with diet, lifestyle, and genetics. After deciding to implement some treatment, a longitudinal blood work study could be used to predict whether the treatment is having the desired effect. This method could provide an ongoing assessment that is robust to the inherent stochasticity in the data.

There is a surprisingly limited number of machine learning techniques that can account for longitudinal data. The updating method we have presented is simple, and perhaps naive in a few ways, but we believe it is a good starting place for further research in this area. With so many important applications, particularly in medical fields, collecting longitudinal data, there is plenty of motivation to develop this research further.