About the “Brain MRI segmentation” Dataset on Kaggle

Shiza Charania
Mar 2, 2022

I recently built a brain MRI segmentation project that segments tumors from MRI scans with 93% accuracy. Feel free to check out the article I wrote about that project:

In this article, however, I will be diving deeper into the open-source dataset that I used. It is called “Brain MRI Segmentation” and can be found on Kaggle: https://www.kaggle.com/mateuszbuda/lgg-mri-segmentation.

This dataset was described in a research paper, which I discuss in this article and have linked at the bottom of the page.

Obtaining this Dataset

The Kaggle contributor for this particular dataset is Mateusz Buda, who is a Senior Machine Learning Engineer at IQVIA. The dataset was obtained from The Cancer Imaging Archive (TCIA) and The Cancer Genome Atlas (TCGA).

He and the other contributors of this dataset used preoperative imaging and genomic data of 110 patients from 5 institutions with lower-grade gliomas (LGGs). For clarification, a lower-grade glioma is a cancer that develops in the brain and tends to be slow-growing.

Something interesting I found is that, according to the research paper about this dataset, 120 patients were identified initially, but 10 of them didn’t have genomic cluster information available. For clarification, genomic clusters are groups of two or more genes within an organism’s DNA that encode similar polypeptides (chains of amino acids joined by peptide bonds).

Note: the entire dataset is split into 22 subsets, each containing exactly 5 cases (5 patients’ data). However, in my project, I took all the files from the folder and then split them into training, validation, and testing sets (using the 60-20-20 rule).
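
That 60-20-20 split can be sketched as follows (the helper name and placeholder file names are my own; in the project, the list of files comes from glob):

```python
import random

def split_dataset(files, seed=42):
    """Shuffle a copy of the file list and split it 60-20-20."""
    files = list(files)
    random.Random(seed).shuffle(files)
    n_train = int(len(files) * 0.6)
    n_val = int(len(files) * 0.2)
    return (files[:n_train],
            files[n_train:n_train + n_val],
            files[n_train + n_val:])

# Placeholder file names, just to show the sizes of the splits
files = [f"case_{i}.tif" for i in range(100)]
train, val, test = split_dataset(files)
print(len(train), len(val), len(test))  # 60 20 20
```

Shuffling before splitting matters here: without it, whole patient folders (which are stored consecutively) could end up entirely in one split.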

Composition of the Dataset

As for the composition of the dataset, it contains a total of 7858 .tif files (.tif is an image format, like .jpg or .png).

Out of these files, 3929 of them belong to images (the MRI scans) and the other 3929 belong to the masks.

If you visit the dataset page on Kaggle, it says there are 7860 files. To verify this, I used the glob library to get the paths of all the files in the dataset (rooted at the root_path variable), and then found the number of mask and image files, respectively.

import glob

root_path = '/content/lgg-mri-segmentation/kaggle_3m/'

all_files = glob.glob(root_path + "*/*.tif")
print(len(all_files))

mask_files = glob.glob(root_path + "*/*_mask.tif")
print(len(mask_files))

image_files = [f for f in all_files if "_mask" not in f]
print(len(image_files))

Number of Tumours vs. Non-Tumours

To be more specific and check how many images in the dataset have tumours and how many don’t, I created a function called diagnosis. If the mask contains a tumor, it appends “1” to a list and returns 1; otherwise, it appends “0” and returns 0.

import cv2
import numpy as np

tumour_count = []

def diagnosis(path):
    # A mask with any non-zero pixel contains a tumour
    if np.max(cv2.imread(path)) > 0:
        tumour_count.append("1")
        return 1
    else:
        tumour_count.append("0")
        return 0

I found the diagnoses of all the image and mask files and saved them in a Pandas data frame.

import pandas as pd

files_df = pd.DataFrame({"image_path": image_files,
                         "mask_path": mask_files,
                         "diagnosis": [diagnosis(x) for x in mask_files]})
print(files_df)
You can see the diagnosis in the data frame on the far right (Image by Author)

To count the diagnoses:

n_tumours = tumour_count.count("1")
n_nontumours = tumour_count.count("0")
print("Tumours: " + str(n_tumours), "...........", "Non-Tumours: " + str(n_nontumours))

This prints out “Tumours: 1373 ……….. Non-Tumours: 2556”. On top of that, to visualize it in a graph, we can use matplotlib.

import matplotlib.pyplot as plt

plt.bar(["Tumours - " + str(n_tumours), "Non-Tumours - " + str(n_nontumours)],
        [n_tumours, n_nontumours], color=["green", "red"])
(Image by Author)

I noticed that the dataset contains far more images without tumors than with them, so I decreased the number of non-tumor images in my project. After this reduction, the ratio was 1373 images with tumors to 846 images without tumors.
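
A rough sketch of that undersampling step (the helper name is my own, and the 846 figure from the article is passed in as a parameter rather than derived here):

```python
import random

def undersample_non_tumours(paths, diagnoses, n_keep=846, seed=0):
    """Keep every tumour image but only n_keep randomly chosen non-tumour images."""
    tumours = [p for p, d in zip(paths, diagnoses) if d == 1]
    non_tumours = [p for p, d in zip(paths, diagnoses) if d == 0]
    kept = random.Random(seed).sample(non_tumours, n_keep)
    return tumours + kept

# Toy example: 5 tumour images, 10 non-tumour images, keep 4 non-tumours
paths = [f"img_{i}.tif" for i in range(15)]
diagnoses = [1] * 5 + [0] * 10
balanced = undersample_non_tumours(paths, diagnoses, n_keep=4)
print(len(balanced))  # 9
```

Sampling randomly (rather than dropping, say, the last files) avoids accidentally removing all the non-tumor scans of particular patients.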

FLAIR

Other than the numbers, the dataset contains brain MRI images together with manual FLAIR abnormality segmentation masks. FLAIR (fluid-attenuated inversion recovery) is an advanced MRI sequence that shows T2-hyperintense tissue as bright while suppressing (darkening) the cerebrospinal fluid (CSF) signal. This enables the detection of brain lesions/abnormalities, which appear as dark or light spots that don’t look like normal brain tissue.

FLAIR was used because contrast enhancement of an LGG tumor is rare.

(Image by Author)

Along with the .tif files in the data set, tumor genomic clusters and patient data are provided in a .csv file (however, I did not use this file for my project).

Resolution

The resolution of these images is 256 x 256 pixels. I verified this by reading every image and mask in the dataset, checking its size, and appending that size to a list.

image_sizes = []
mask_sizes = []

for data in image_files:
    image = cv2.imread(data)
    image_sizes.append(image.shape[0])

for data in mask_files:
    mask = cv2.imread(data)
    mask_sizes.append(mask.shape[0])

Now, in image_sizes and mask_sizes, we have a list of all the sizes for images and masks, respectively. If we print out the lists, there will be 3929 elements, all containing the number 256.

Both these lists go on

We can also confirm this by finding the average size of all the images and masks.

# images
sum_img_sizes = 0
for i in image_sizes:
    sum_img_sizes += i
print(sum_img_sizes / len(image_sizes))

# masks
sum_mask_sizes = 0
for j in mask_sizes:
    sum_mask_sizes += j
print(sum_mask_sizes / len(mask_sizes))

The code above prints out 256.0 for both averages.
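
A shorter way to confirm the same thing is to collapse each list to its set of unique values; if only 256 remains, every file shares that resolution (the lists below are stand-ins for the ones built above):

```python
# Stand-ins for the size lists built from the dataset
image_sizes = [256] * 3929
mask_sizes = [256] * 3929

# A single-element set means every entry is identical
print(set(image_sizes), set(mask_sizes))  # {256} {256}
```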

Ground Truth

The ground truth for training the automatic segmentation model was obtained from manual segmentation. The images came from the patients, but the ground-truth masks were manually annotated by a medical school graduate with experience in neuroradiology, using software in their laboratory. To validate these masks, a board-eligible radiologist reviewed them and modified those identified as incorrect.

These masks are .tif files containing the accurate segmentation of tumors in the brain MRI images. They have 3 channels (RGB), but in my project, I converted them to grayscale images to reduce the number of computations performed per channel.
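
Since all three channels of a binary mask are identical, the conversion can be sketched by simply keeping one channel (on real files, cv2.cvtColor with cv2.COLOR_BGR2GRAY does the same job; the array below is synthetic):

```python
import numpy as np

def mask_to_grayscale(mask_rgb):
    # All three channels of a binary mask carry the same values,
    # so one channel gives the same information with a third of the data.
    return mask_rgb[:, :, 0]

# Synthetic 4x4 mask with a small "tumour" region set in all channels
mask_rgb = np.zeros((4, 4, 3), dtype=np.uint8)
mask_rgb[1:3, 1:3, :] = 255
mask_gray = mask_to_grayscale(mask_rgb)
print(mask_gray.shape)  # (4, 4)
```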

The dataset on Kaggle does not contain any labels, but the images and masks can help derive the diagnosis (whether it contains a tumor or not) — I calculated the diagnoses for every file, which was discussed in the “Number of Tumours vs. Non-Tumours” segment.

To visualize the ground-truth masks and the images they correspond to, I used the subplot method in matplotlib:

plt.figure(figsize=(20, 20))
for i in range(len(image_files[:10])):
    plt.subplot(1, 10, i + 1)
    image = cv2.imread(image_files[i])
    plt.imshow(image)
    plt.axis('off')
    plt.subplots_adjust(wspace=0, hspace=0)
plt.show()

plt.figure(figsize=(20, 20))
for i in range(len(mask_files[:10])):
    plt.subplot(1, 10, i + 1)
    mask = cv2.imread(mask_files[i])
    plt.imshow(mask)
    plt.axis('off')
    plt.subplots_adjust(wspace=0, hspace=0)
plt.show()

After running this code, you can see that there are two horizontal “strips” — one for the images and one for the masks (ground truth).

(Image by Author)

State of the Art and Goals of Using this Dataset

The contributors anticipated that their findings would be a stepping stone toward proving the association between tumor shape features extracted from MRI and its genomic subtypes.

Firstly, they had to segment tumors from MRI scans. The current standard is to do this manually, which is costly, time-consuming, and not always accurate: according to some further research I conducted, manual segmentation is not flawless and can miss important details in an MRI scan.

Thus, they looked to deep learning for medical imaging, which is known for its capability in this space. This would make the process faster and cheaper, and could potentially match (or even surpass) the performance of a skilled radiologist.

These were the phases the contributors/researchers took for automating the segmentation process:

  1. Image preprocessing
  2. Segmentation — U-Net model
  3. Post-processing algorithm to remove false positives
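
As an illustrative sketch of the third phase (this is not the paper’s actual post-processing algorithm, just the general idea of discarding tiny predicted regions as likely false positives; the function name and threshold are my own):

```python
import numpy as np

def suppress_small_predictions(pred_mask, min_area=10):
    """Zero out a predicted mask whose total tumour area is below
    min_area pixels, treating it as a likely false positive."""
    if np.count_nonzero(pred_mask) < min_area:
        return np.zeros_like(pred_mask)
    return pred_mask

tiny = np.zeros((8, 8), dtype=np.uint8)
tiny[3, 3] = 1  # a single-pixel "tumour" prediction
print(suppress_small_predictions(tiny).sum())  # 0
```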

Researching this dataset was a lot of fun, and I highly recommend reading some of the sources that helped me:

For reference to the code I embedded into this article, feel free to check out my Colab notebook:
https://colab.research.google.com/drive/1WmQcn2rv8S8PKhfoXR5IsgMQ-MVyeYm6?usp=sharing


Shiza Charania

Computer Vision Enthusiast | Biomedical Imaging Developer at UHN and TMU | TKS Activator