A Deep Learning Model to Detect Spinal Cord Compression in Cervical Spine MRI Scans

Jan 12, 2019

Highlights

I trained a deep convolutional neural network to detect cervical spinal cord compression on MRI scans and achieved 93% accuracy on this classification task.
The model was tested on a dataset of MRIs including patients with cervical spine disease and healthy control patients. The model identified patients with cervical disease with high sensitivity (97%) and specificity (85%).
This deep learning model could be used in a primary care setting to rapidly interpret cervical spine MRI scans and flag patients with abnormal MRI scans for further review.

Introduction

A schematic representation of the constellation of anatomic changes that occur in DCM that lead to compression of the cervical spinal cord.

Degenerative cervical myelopathy (DCM) is a chronic disease that causes progressive non-traumatic compression of the cervical spinal cord. As the compression of the spinal cord worsens, DCM can cause neurologic deficits, impaired mobility, and a significant decline in quality of life.

The CSM-International and CSM-North American clinical trials are the two largest clinical trials that studied clinical outcomes after surgical decompression of the spinal cord in DCM. Patients were included in the study if they had 1 or more clinical signs of myelopathy and imaging evidence of cervical spinal cord compression. Each patient had an MRI scan of the cervical spine and then went on to have surgery. The patients were then assessed 6 months, 12 months, and 24 months following surgery.

Data Representation

Each patient had a pre-operative MRI of the cervical spine that at a minimum included T2-weighted and T1-weighted sequences with axial and sagittal series. Unfortunately, the MRIs were stored in various formats. The majority were DICOM files, but many were stored as a tiled series of JPEGs or PNGs. In addition, some MRIs were missing or corrupted. I included only the MRIs that were stored as DICOM files, which limited the dataset to 289 patients.

I chose to represent each MRI as a series of independent axial 2D images, considering each axial slice independently of the other slices within the scan. This let me make use of existing deep learning models such as VGG16 or ResNet50, and I thought it would be a reasonable compromise. The downside of this approach is that any feature that manifests predominantly along the Z-axis is lost. I extracted the T2-weighted axial sequence for each patient and stored it as a new set of DICOM files. This was done manually using OsiriX Lite.

Summary of MRI parameters extracted from the T2-weighted axial DICOM files across the study population

Data Labeling

A number of pathologic changes can be identified in an MRI scan of a patient with DCM. The full range of imaging findings is summarized in this 2016 article from Neurosurgical Focus (https://www.ncbi.nlm.nih.gov/pubmed/27246488).

To summarize, the structural changes related to DCM that can be detected on MRI include:

Spinal cord compression
Cervical stenosis
Cord signal change
Ligamentous pathology
Spondylolisthesis
Sagittal alignment

I chose to focus the deep learning model on detecting spinal cord compression for the following reasons:

Spinal cord compression is highly sensitive for myelopathy. The following 2010 study of 103 patients (https://www.ncbi.nlm.nih.gov/pubmed/20150835) found that spinal cord compression was 100% sensitive and 79.6% specific for clinical myelopathy.
Spinal cord compression can be reliably graded on T2-weighted axial images using a number of grading systems. The inter-rater reliability of these grading systems is greater than 80% and in some studies was over 95%. (https://www.ncbi.nlm.nih.gov/pubmed/27246488)
Even though spinal cord compression is not 100% specific for clinical myelopathy, the presence of spinal cord compression is a concerning finding that warrants continued follow-up.

For these reasons I believed that a deep learning model capable of reliably detecting spinal cord compression would serve as a useful screening tool for identifying patients who had symptoms of clinical myelopathy or were at risk of developing it.

To standardize the data labeling I used the qualitative criteria outlined in this 2010 study (https://www.ncbi.nlm.nih.gov/pubmed/20150835). Importantly, I did not differentiate between partial spinal cord compression and circumferential spinal cord compression. Instead, I defined spinal cord compression as any indentation of the spinal cord parenchyma that changed the contour of the spinal cord perimeter. Labelers assessed each T2-weighted axial slice and assigned a label of:

1: evidence of partial or circumferential spinal cord compression or
0: no spinal cord compression.

Results of Labeling

Two labelers independently labeled 110 patients, corresponding to 5635 individual axial images. The remaining 173 patients were not labelled at this stage and were kept for model testing.

The two labelers had excellent agreement (96.4%) on images that were not compressed. Agreement was still good (88.1%) on compressed images. I examined the images where the labelers disagreed and found that they tended to be ones with minimal partial compression.
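For readers who want to run a similar check on their own labels, here is a minimal sketch of how agreement between two labelers could be computed. The article does not say which formula produced the percentages above, so the per-class "specific agreement" used here, along with Cohen's kappa, is an assumption, and the function name `agreement_stats` is hypothetical.

```python
def agreement_stats(labels_a, labels_b):
    """Agreement between two labelers on binary (0/1) image labels.

    Assumption: per-class agreement is computed as "specific agreement",
    2 * both_agree / (2 * both_agree + disagreements). The article does
    not state which definition it used.
    """
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("label lists must be non-empty and the same length")

    n = len(labels_a)
    both_pos = sum(1 for a, b in zip(labels_a, labels_b) if a == 1 and b == 1)
    both_neg = sum(1 for a, b in zip(labels_a, labels_b) if a == 0 and b == 0)
    disagree = n - both_pos - both_neg

    overall = (both_pos + both_neg) / n
    pos_den = 2 * both_pos + disagree
    neg_den = 2 * both_neg + disagree
    pos_agreement = 2 * both_pos / pos_den if pos_den else float("nan")
    neg_agreement = 2 * both_neg / neg_den if neg_den else float("nan")

    # Cohen's kappa corrects overall agreement for chance agreement.
    rate_a = sum(labels_a) / n
    rate_b = sum(labels_b) / n
    expected = rate_a * rate_b + (1 - rate_a) * (1 - rate_b)
    kappa = (overall - expected) / (1 - expected) if expected < 1 else float("nan")

    return {
        "overall": overall,
        "compressed": pos_agreement,
        "not_compressed": neg_agreement,
        "kappa": kappa,
    }


# Toy labels for illustration only, not data from the study.
stats = agreement_stats([1, 1, 0, 0, 0, 1, 0, 0], [1, 0, 0, 0, 0, 1, 0, 1])
```

Reporting kappa alongside the raw percentages is useful when one class is much more common than the other, since raw agreement can look high by chance alone.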

Recap

In the first part of this report I described the method of data representation and the process by which I prepared data. In summary I collected MRI scans from patients with degenerative cervical myelopathy (DCM) from the CSM-International and CSM-North American trials. I then extracted the T2-weighted axial sequence from each patient. I focused on identifying spinal cord compression in these axial images because spinal cord compression is a highly sensitive and specific finding for clinical myelopathy. Two labelers went through a subset of the images and labeled each T2-weighted axial image based on a pre-determined set of qualitative criteria to identify images showing spinal cord compression.

Summary of the dataset of labelled and unlabelled images

Model Architecture

I looked at established deep convolutional neural networks (CNNs) and after some comparison decided to focus on ResNet50 because of its good performance on the ImageNet database and its relatively small memory requirements. Previous studies have achieved good results classifying MRI and CT images with transfer learning from ImageNet weights. I attempted to do the same and tested various degrees of fine-tuning. I placed a priority on model simplicity, so I tried to get the best performance I could from a single ResNet50 CNN before creating more complex models through ensembling.

The ResNet family of CNNs has become commonplace since ResNet placed first in the ILSVRC 2015 competition. The architecture makes use of residual units, which avoid the problem of accuracy degrading as networks get deeper. A downside of ResNet50 is that given its depth I would be unable to train the model from scratch. That’s OK because I was intending to use pre-trained weights for some of the layers anyway.

Model Training

I split the labeled dataset into a training/validation cohort with 80% of the data and reserved 20% for model testing. I trained a number of model architectures and used overall accuracy on the testing dataset as the metric to compare models. I used Keras v2.2.4 with a TensorFlow v1.5 backend for model implementation. I used data augmentation with random scaling, rotation, and horizontal flips during model training. The following architectures were tested.

Model 4, which had two fully connected layers with 512 units each, performed the best with 92.99% accuracy. There is certainly room for improvement here, but I started to run into memory constraints on my GPU with deeper networks, so for now I settled for this performance. I was pleasantly surprised that I achieved ~93% accuracy with a relatively simple network configuration.

Detecting Patients With Myelopathy

So far I have tested the model on individual T2-weighted axial slices and have achieved 93% accuracy at identifying spinal cord compression. However, I have not yet demonstrated that the model would be useful in a real clinical setting.

In the real world patients can present to their primary-care physician with a wide variety of symptoms that may be suggestive of cervical myelopathy. These patients will often undergo an MRI of the cervical spine. Specialist radiologists then interpret the MRI scans and identify abnormal scans, which can be a laborious and time-consuming process.

I wanted to determine if the model would be able to distinguish between healthy patients and patients with a confirmed diagnosis of DCM. I used a dataset of 32 healthy control patients who underwent MRIs of the cervical spine. I also used the 179 patients enrolled in the CSM-International and CSM-North American studies who had a confirmed diagnosis of cervical myelopathy. The model was not trained on any of these images. I thus had two cohorts of patients that I would attempt to classify with the model:

Healthy Control — 32 patients
Cervical Myelopathy — 179 patients

For each patient I applied the convolutional neural network model to each T2-weighted axial slice. The model output a class prediction for each slice. The number of slices per patient ranged from 18 to 82 with a median of 43. I used a simple threshold to generate a patient-level prediction: if the model identified more than one slice as showing spinal cord compression, the patient was labeled abnormal.
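The patient-level rule described above can be sketched in a few lines. The function name `patient_prediction` and the `min_positive_slices` parameter are hypothetical; the default of 2 encodes "more than one slice" from the text.

```python
def patient_prediction(slice_predictions, min_positive_slices=2):
    """Collapse per-slice class predictions (0 or 1) into one patient label.

    Returns 1 (abnormal) when at least `min_positive_slices` slices were
    predicted as compressed, otherwise 0 (normal). The default of 2
    corresponds to "more than one slice".
    """
    positive_slices = sum(1 for p in slice_predictions if p == 1)
    return 1 if positive_slices >= min_positive_slices else 0


# Toy slice predictions for illustration only.
one_positive = patient_prediction([0, 0, 1, 0, 0])   # a single positive slice
two_positive = patient_prediction([0, 1, 1, 0, 0])   # two positive slices
```

Requiring more than one positive slice makes the patient-level call less sensitive to a single false positive among the dozens of slices in a scan, at the cost of possibly missing very focal compression.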

Table 2 — Performance Characteristics of the Model at distinguishing between healthy and diseased patients based on a cervical spine MRI.

The model was able to distinguish between patients in the healthy control cohort and the diseased cohort with high sensitivity (0.9665) and high specificity (0.8529).
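As a reminder of how these two numbers are defined, here is a small sketch that computes sensitivity and specificity from patient-level labels and predictions. The function name is hypothetical and the example values are toy data, not the study's results.

```python
def sensitivity_specificity(y_true, y_pred):
    """Sensitivity and specificity for binary labels (1 = diseased, 0 = healthy)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one diseased and one healthy patient")

    sensitivity = tp / (tp + fn)  # fraction of diseased patients flagged
    specificity = tn / (tn + fp)  # fraction of healthy patients cleared
    return sensitivity, specificity


# Toy example: 4 diseased patients (3 flagged), 2 healthy patients (1 cleared).
sens, spec = sensitivity_specificity([1, 1, 1, 1, 0, 0], [1, 1, 1, 0, 0, 1])
```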

Predicting Surgical Outcomes

Patients with degenerative cervical myelopathy are often treated with surgery, especially if they have moderate or severe symptoms. Most, but not all, patients have an improvement in their symptoms with surgery. Others have attempted to develop clinical prediction models to predict outcome after surgery for DCM. One such paper used a logistic regression model to predict clinical improvement after surgery based on pre-operative age, duration of symptoms, disease severity, psychiatric comorbidities, impairment of gait, and smoking status.

I hypothesized that I would be able to predict surgical outcome in the cohort of 279 patients by combining pre-operative clinical variables with radiographic features automatically generated by the model.

I used the modified Japanese Orthopedic Association (mJOA) score to measure surgical outcome. The mJOA is commonly used by experts in the field. An mJOA score of 15–18 indicates mild myelopathy, 8–14 indicates moderate myelopathy, and 3–7 indicates severe myelopathy. Previous studies have established an improvement in the mJOA by at least 2 points as being a ‘clinically significant’ improvement. I labeled patients who improved by at least 2 points in the mJOA at 6 months after surgery as ‘clinically improved’ after surgery.
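The outcome label defined above is simple enough to express directly. This is a minimal sketch; the function names are hypothetical, and the severity bands are the ones quoted in this article.

```python
def mjoa_severity(score):
    """Map an mJOA score to the severity bands quoted in this article."""
    if 15 <= score <= 18:
        return "mild"
    if 8 <= score <= 14:
        return "moderate"
    if 3 <= score <= 7:
        return "severe"
    raise ValueError("score is outside the 3-18 range described in the text")


def clinically_improved(preop_mjoa, mjoa_6mo, min_gain=2):
    """True when the mJOA improved by at least `min_gain` points at 6 months."""
    return (mjoa_6mo - preop_mjoa) >= min_gain


# Toy example: a patient going from 12 to 14 counts as clinically improved.
label = clinically_improved(12, 14)
```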

Clinical Features

Pre-operative mJOA score
Duration of symptoms prior to surgery
Age
Weight
Height
Pre-operative gait impairment
Duration of hospital stay
Lhermitte phenomenon
Bilateral sensory impairment
Hoffman sign
Gender
Hand atrophy
Motor weakness
Spasticity
Hyperreflexia
Smoking status
Marital status
Education
Operation length

Radiographic Features

For each patient I applied the convolutional neural network model on each T2-weighted axial slice. The model output a vector of class predictions for each patient. I then generated a number of summary features from the vector of class predictions.

Percent of the cervical spinal cord with spinal cord compression
Mean value of the positive class probabilities
Standard deviation of the positive class probabilities
Skew of the positive class probabilities
Kurtosis of the positive class probabilities
Location of the maximum compressed segment.
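The summary features listed above could be computed from one patient's per-slice positive-class probabilities roughly as follows. This is a sketch under several assumptions that the article does not spell out: "percent of the cord compressed" is taken to be the percentage of slices at or above a 0.5 probability cutoff, skew and kurtosis use population moments (with excess kurtosis), and "location" is the index of the slice with the highest probability. The function name and dictionary keys are hypothetical.

```python
import math


def radiographic_features(probs, threshold=0.5):
    """Summarize one patient's per-slice positive-class probabilities."""
    n = len(probs)
    if n == 0:
        raise ValueError("need at least one slice")

    mean = sum(probs) / n
    variance = sum((p - mean) ** 2 for p in probs) / n  # population variance
    std = math.sqrt(variance)

    if std > 0:
        skew = sum((p - mean) ** 3 for p in probs) / n / std ** 3
        # Excess kurtosis: 0 for a normal distribution.
        kurtosis = sum((p - mean) ** 4 for p in probs) / n / std ** 4 - 3.0
    else:
        skew = 0.0
        kurtosis = 0.0

    percent_compressed = 100.0 * sum(1 for p in probs if p >= threshold) / n
    max_slice_index = max(range(n), key=lambda i: probs[i])

    return {
        "percent_compressed": percent_compressed,
        "mean_prob": mean,
        "std_prob": std,
        "skew": skew,
        "kurtosis": kurtosis,
        "max_slice_index": max_slice_index,
        # Position along the scan from 0.0 (first slice) to 1.0 (last slice).
        "max_slice_position": max_slice_index / (n - 1) if n > 1 else 0.0,
    }


# Toy probabilities for a four-slice scan, for illustration only.
features = radiographic_features([0.1, 0.9, 0.8, 0.2])
```

Normalizing the slice index to a 0 to 1 position is one way to make the location feature comparable across scans with different slice counts.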

I trained a random forest model to predict surgical outcome at 6 months. I trained two models — one with just the clinical features, and a second with the clinical features and the automatically generated radiographic features. I split the data into a 75% training/validation cohort and 25% testing cohort. I tuned both models using a grid search strategy. I then trained the models using 10-fold cross-validation and compared performance on the testing dataset between the models with the area under the ROC curve.
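For readers less familiar with the comparison metric, the area under the ROC curve equals the probability that a randomly chosen improved patient receives a higher predicted score than a randomly chosen non-improved patient, with ties counted as half. A minimal sketch of that definition follows; the function name is hypothetical and the example uses toy scores, not model outputs from this study.

```python
def roc_auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    y_true holds binary labels (1 = improved, 0 = not improved) and
    scores holds the model's predicted probability of improvement.
    """
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    if not positives or not negatives:
        raise ValueError("need at least one example of each class")

    wins = 0.0
    for pos_score in positives:
        for neg_score in negatives:
            if pos_score > neg_score:
                wins += 1.0
            elif pos_score == neg_score:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Toy example: three of the four positive/negative pairs are ranked correctly.
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

Computing this once per model on the same held-out test cohort gives the two numbers being compared.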

Comparison of area under the ROC curve on the clinical features only model and the clinical + radiographic features model

The top 10 features from the clinical + radiographic features model out of 23 total features.

I ranked the 23 features used by the clinical + radiographic features model based on the feature importance. Out of the top 10 features used by the model, 5 were radiographic features.

Summary

I used a dataset of 5635 labeled MRI images from 110 patients to train a deep convolutional neural network to detect cervical spinal cord compression. I achieved high (93%) accuracy on this classification task.
I tested the model on a dataset of MRIs from 179 patients with cervical myelopathy and 32 healthy control patients. The model identified patients with cervical myelopathy with high sensitivity (97%) and specificity (85%).
I used the model to generate radiographic features for the patients in the dataset. I incorporated these radiographic features to improve a clinical prediction model to predict patient improvement 6 months after surgery.
The deep learning model could be used in a primary care setting to rapidly interpret cervical spine MRI scans and flag patients with abnormal MRI scans for further review.