Automating the Classification of Harmful Brain Activity Using CNN models

Zoe S
The Quantastic Journal
6 min read · Jul 18, 2024

This article explores using a CNN model to expedite the diagnosis of neurological disorders by classifying harmful brain activity from EEG-derived spectrogram images, as part of a Kaggle competition with Harvard Medical School.

Motivation

Around 3 billion people worldwide suffer from some type of neurological disorder (ND), and they often face long and complex diagnostic journeys, with an average wait of two years for a definitive diagnosis. This project aims to reduce the time to diagnosis by leveraging a CNN model to identify harmful brain activity patterns from spectrogram images (produced from raw EEG data), potentially improving patient outcomes.

Background to EEG and Spectrogram Imagery

Diagnosing an ND is a long and complicated process, requiring many different kinds of tests. These tests can range from non-invasive methods, such as EEGs, ultrasounds, or simple reflex tests, to more invasive procedures like spinal taps or biopsies.

Different tools for diagnosing an ND.

One of the first tests that is typically done on a patient is an electroencephalogram (EEG), which is performed by attaching several sensors to the head and measuring the electrical activity of the brain. The raw data collected from this test is then transformed into a readable image called a spectrogram, which is then interpreted by multiple neuro-specialists. If there is harmful brain activity present, they will try to classify it into one of several categories (seizure, LPD, GPD, LRDA, GRDA, Other).

Typical EEG test, which produces a spectrogram image.

Since this can be one of the first steps in the diagnostic process, correctly classifying the spectrogram image is crucial, as this will affect the rest of the patient’s diagnostic journey. An incorrect classification can lead the patient down the wrong path, which only prolongs their frustration and suffering and delays intervention and treatment.

Our goal for this project was to develop a CNN model that can assist neuro-specialists in classifying spectrograms, and therefore reduce the time and resources needed to diagnose a patient.

What is a spectrogram?

A spectrogram is an image with three axes: frequency (Hz), time (sec), and intensity. Music is a good analogy for understanding these axes. Frequency is the number of oscillations per second, analogous to pitch (how high or low the sound is), while intensity is the amount of brain activity at a specific frequency, analogous to how loud or quiet the music is at a certain pitch. A full spectrogram image is composed of 4 sub-images, each representing a region of the brain.

Spectrogram Image.
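To make this concrete, here is a minimal sketch of how a spectrogram can be computed from a single raw EEG channel using SciPy's short-time Fourier transform. The sampling rate, signal length, and synthetic data below are assumptions for illustration, not details taken from the competition dataset.

```python
import numpy as np
from scipy.signal import spectrogram

# Hypothetical single EEG channel sampled at fs Hz.
# fs = 200 is an assumption for illustration only.
fs = 200
eeg_channel = np.random.randn(fs * 600)  # 10 minutes of synthetic signal

# Short-time Fourier transform: frequency (Hz) x time (sec) x intensity.
freqs, times, power = spectrogram(eeg_channel, fs=fs, nperseg=fs * 2)

# Log-scale the power so both faint and strong activity are visible,
# which is how spectrograms are typically displayed.
log_power = 10 * np.log10(power + 1e-10)
print(log_power.shape)  # (n_frequencies, n_time_bins)
```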

Difficulties in Classifying Spectrograms

The main difficulty in classifying spectrogram images is that the correct label is often ambiguous. Some images clearly point to a specific harmful brain activity, but usually this is not the case. Our dataset contained about 11k images, and for each image a team of 5–16 specialists voted on a class (seizure, LPD, GPD, LRDA, GRDA, or Other). About 98% of those images did not receive unanimous votes, meaning that in 98% of cases there was some disagreement between the specialists about which harmful brain activity was present. So even for a team of experienced neuro-specialists who are highly trained in reading spectrograms, it is not always obvious which category an image falls into.

How then can a CNN model help us classify images, if even the labels are not pointing to a specific class?

This question requires a reframing of the problem. If 98% of our data does not have a clear-cut label, it wouldn't make sense to simply predict the class with the highest number of votes, because that gives a false sense of certainty. For example, in a real-world scenario: if 40% of the specialists voted for seizure, 20% for LPD, 20% for LRDA, and 20% for GPD, there is quite a lot of disagreement, and it would be important to conduct other tests to investigate further. Similarly, if our model classified a spectrogram as a seizure while in reality it was only 40% confident, this would be very misleading for a doctor. It is important to reflect the reality, which is that there is uncertainty.

Therefore, instead of simply classifying the image, we had to predict the distribution of votes. Our target is the true distribution of votes, and by comparing that with our prediction, we can evaluate the performance of the model. The metric used for comparing probability distributions is called the Kullback–Leibler divergence (KL divergence). The closer the value is to zero, the closer the probability distributions are to each other.

KL divergence formula: D(P ‖ Q) = Σᵢ P(i) · log( P(i) / Q(i) )
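Below is a minimal sketch, in Python, of how vote counts can be turned into a target distribution and scored against a prediction with KL divergence. The vote counts and predicted probabilities are made-up numbers for illustration.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-15):
    """KL(P || Q) = sum_i P(i) * log(P(i) / Q(i))."""
    p = np.asarray(p, dtype=float)
    q = np.clip(np.asarray(q, dtype=float), eps, 1.0)
    mask = p > 0  # terms with P(i) = 0 contribute nothing
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Hypothetical example: votes from 10 specialists over the six classes
# (seizure, LPD, GPD, LRDA, GRDA, Other), turned into the target distribution.
votes = np.array([4, 2, 2, 2, 0, 0])
target = votes / votes.sum()          # [0.4, 0.2, 0.2, 0.2, 0.0, 0.0]

prediction = np.array([0.5, 0.2, 0.1, 0.1, 0.05, 0.05])  # model's softmax output
print(kl_divergence(target, prediction))  # lower is better; 0 means identical
```

Note that classes with zero votes contribute nothing to the sum, so the score only penalizes the model on classes that actually received votes.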

Results

With our preprocessed data, target variables, and performance metric ready, we were able to compare several models. We started with a Random Forest model as our baseline and then moved on to more sophisticated CNN models. Our training did not involve any fine-tuning; we used only the architecture of each model. To ensure the reliability of our results, we employed 5-fold cross-validation for each of the models.
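As a rough illustration of that evaluation setup, the sketch below runs 5-fold cross-validation with a Random Forest baseline that predicts vote distributions and scores each fold with mean KL divergence. The synthetic data, feature dimensionality, and hyperparameters are placeholders, not the values we actually used.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold
from scipy.stats import entropy  # entropy(p, q) == KL(p || q)

# Synthetic stand-ins for illustration: 1,000 flattened spectrograms and
# their vote distributions over the six classes (the real data has ~11k images).
rng = np.random.default_rng(0)
X = rng.random((1000, 400))                  # flattened image features
y = rng.dirichlet(np.ones(6), size=1000)     # per-image vote distributions

kfold = KFold(n_splits=5, shuffle=True, random_state=42)
fold_scores = []
for fold, (train_idx, val_idx) in enumerate(kfold.split(X)):
    model = RandomForestRegressor(n_estimators=100, random_state=42)
    model.fit(X[train_idx], y[train_idx])
    preds = np.clip(model.predict(X[val_idx]), 1e-15, None)
    preds = preds / preds.sum(axis=1, keepdims=True)   # renormalize each row
    kl = np.mean([entropy(t, p) for t, p in zip(y[val_idx], preds)])
    fold_scores.append(kl)
    print(f"fold {fold}: mean KL = {kl:.3f}")

print("average KL over 5 folds:", np.mean(fold_scores))
```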

Comparison table of results.

The model with the best performance was EfficientNetB0, with a KL divergence of 0.78. For comparison, the lowest value achieved on the competition’s leaderboard was 0.27. Although this may seem like a significant difference, our model’s performance was within the top 4% of submissions.

EfficientNetB0 architecture.
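For reference, a model along these lines could be assembled in Keras as sketched below: the EfficientNetB0 architecture without pretrained weights, a softmax head over the six classes, and KL divergence as the training loss so the output matches the competition metric. The input size, optimizer, and learning rate are assumptions for illustration, not our exact training configuration.

```python
import tensorflow as tf

NUM_CLASSES = 6               # seizure, LPD, GPD, LRDA, GRDA, Other
INPUT_SHAPE = (512, 512, 3)   # assumed image size, for illustration only

# Use only the EfficientNetB0 architecture (weights=None means no pretrained
# weights), with global average pooling and a softmax head so the model
# outputs a probability distribution over the six classes.
backbone = tf.keras.applications.EfficientNetB0(
    include_top=False, weights=None, input_shape=INPUT_SHAPE, pooling="avg"
)
outputs = tf.keras.layers.Dense(NUM_CLASSES, activation="softmax")(backbone.output)
model = tf.keras.Model(backbone.input, outputs)

# Training with KL divergence as the loss directly optimizes the competition
# metric: targets are vote distributions, predictions are softmax distributions.
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
    loss=tf.keras.losses.KLDivergence(),
)
model.summary()
```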

Conclusions

How could we have further improved the performance of our model?

Teams that achieved better results used both the spectrogram images from the dataset and the raw EEG data provided (we used only the spectrogram images). Instead of an image composed of 4 sub-images, as we had, each of their images was composed of 8 sub-images. The additional 4 sub-images were also spectrograms, but generated from the raw EEG data by the participants themselves. These were slightly different from the spectrograms given in the dataset, which gave the model more information and allowed for better predictions. The downside of using larger images is the additional RAM and GPU resources required to train the model.

How else could this data be used?

It would be interesting to apply an unsupervised learning approach to this data, rather than a supervised one. Our model was limited to six pre-defined labels, but perhaps this limitation hides patterns and relationships that we are not currently able to see. Especially in neuroscience, where even the basics of brain function are largely a mystery, an unsupervised learning approach could potentially reveal valuable insights.

Acknowledgements

Thank you to Kaggle and HMS for hosting the competition and providing the dataset, and to the ITC staff for their guidance and mentorship during this project.

Our team:

Zoe Stankowska (LinkedIn) (GitHub)

Or Gewelber (LinkedIn) (GitHub)

Sacha Koskas (LinkedIn) (GitHub)
