UMIE datasets — the world’s largest dataset of diagnostic imaging

Barbara Klaudel
TheLion.AI
Jun 17, 2024


Last week, we released the largest-to-date dataset of radiological imaging!

Link to the project

✅ More than 1 million medical images

✅ 40+ labels and 15 annotation masks

✅ Ready-to-use preprocessing pipelines

✅ A unified ontology for masks and labels

The dataset collects more than a million CT, MRI, and X-ray images for classification and segmentation. The data comes from 20 open-source datasets. We unified the labels and masks to follow the RadLex ontology published by the Radiological Society of North America (RSNA). We also release the pipelines that unify the datasets into a common format. The pipelines can be composed of reusable steps, so anyone can easily expand the dataset with new data. The steps cover the most common formatting tasks, e.g. extracting masks from XML and converting DICOMs to PNG. We simplify the process to a drag-and-drop.

We officially released the dataset during our presentation at the Data Science Summit ML edition. Here are the annotated slides from our presentation.

  • In the past, ML models followed the paradigm of creating an individual model for each task.
  • Right now, a new paradigm is taking over the previous one — the paradigm of foundation models.
  • Foundation models are generalizable. A single model is meant to solve multiple kinds of problems.
  • Currently, medical AI is trying to follow the global trend of creating foundation models.
  • However, medical foundation models are slightly different from everyday-use models like ChatGPT.
  • E.g., the slide shows “a foundation model” that works only with eye imaging and can predict 10 different kinds of diseases.*
  • A bit disappointing, isn’t it?

* The paper is great, btw. The criticism is aimed at the general trend.

  • Why don’t we have better medical foundation models?
  • The problem lies in the lack of data to train them.
  • The following slides explain the data problems in more depth.
  • Back in 2019, our team worked on a system for predicting whether a kidney tumor is malignant.
  • As it turned out, a couple of other teams were working on the same problem.
  • I read through 30 such papers, and here is what I found:
  • Out of 30 papers, 24 relied on their own secret dataset.
  • Out of 30 papers, 12 used KITS-23, an open-source dataset (still the only one available as open source as of today).
  • How huge a large-scale dataset could we have had if all the authors had open-sourced their data?
  • On the other hand, how many great solutions would we have missed if it hadn’t been for the team behind KITS-23?

* Btw, this reasoning led to the idea of UMIE 2 years later.

  • The few datasets that are available as open source come from amazing platforms, such as The Cancer Imaging Archive, Kaggle, Stanford AIMI, or Grand Challenge.
  • The Cancer Imaging Archive: a service that de-identifies and hosts a large archive of medical images of cancer accessible for public download.
  • Kaggle: A huge repository of community-published models, data & code (including non-medical stuff).
  • Stanford AIMI Shared Datasets: Stanford AIMI shares annotated data to foster transparent and reproducible collaborative research to advance AI in medicine.
  • Grand Challenge: a platform for end-to-end development of machine learning solutions in biomedical imaging.
  • Some of the datasets, in practice, turn out not to be available anymore…
  • A huge problem is that each team releases the data in their own way.
  • Example 1: KITS-23.
  • This is a relatively straightforward dataset. Images and associated masks are stored in the NIfTI format.
  • Tumor histologic types can be retrieved from the KITS.json file via the “tumor histologic type” key and used as labels (see the sketch after this list).
  • Example 2: Stanford COCA — it gets messy…
  • Images are stored in DICOMs.
  • Masks are stored as individual points of a region of interest in an XML file in an array of arrays of dicts of arrays of dicts of arrays of array of dicts (I might have missed some more dicts or arrays 🤔 ). *
  • XMLs are messy. Really. Stay tuned to learn how to decode XMLs with UMIE.

* If you think that other datasets from Stanford AIMI share a common formatting style, you are very wrong.
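
For the KITS-23 example above, here is a minimal sketch of how those labels could be pulled out of the JSON metadata. The file layout and key names follow the description in the slides and should be verified against the actual dataset:

```python
import json

# Load the per-case metadata shipped with KITS-23.
with open("kits.json") as f:
    cases = json.load(f)

# Map each case to its tumor histologic type, to be used as a
# classification label. The key names mirror the slide text and are
# assumptions; check the real file for the exact spelling.
labels = {case["case_id"]: case["tumor histologic type"] for case in cases}
```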

  • Last example: Chest Xray14.
  • Images are stored as PNGs. Labels are in an associated CSV file.
  • Sounds simple? Not really 😂
  • Actually, Chest Xray14 is meant for multilabel classification, i.e. each image can have multiple labels.
  • However, all labels are stored in the same “Finding Labels” column, separated with the “|” symbol.
  • Sounds straightforward? OK, try to get the very basic info about which unique labels appear in the dataset (see the result on the slide and the sketch after this list).
  • So, I guess you see by now that unifying multiple imaging datasets is hard. Luckily, you don’t have to. Just use UMIE datasets.
  • Last but not least, there is no common ontology for annotating medical data.
  • Each group chooses whichever labels they see fit.
  • This method generates a couple of problems.
  • Problem 1 — Mental shortcuts: The name of a label may be obvious to you but not to everyone else. E.g., KITS-23 uses the mysterious label “cc_papillary”. After a thorough investigation (including searching the tumor bible — the AJCC Cancer Staging Manual), we concluded that it means a tumor with characteristics of both ccRCC and papillary RCC.*
  • Problem 2 — Regionalism: radiology around the world has its own regionalisms, i.e. a concept may exist in one country but be entirely unused in others. E.g., our team created a model for information extraction from Polish radiology reports. For our paper, we had to translate the examples into English. When we tried to translate “serce podparte na przeponie” (word-for-word: heart supported by the diaphragm), we couldn’t find a translation anywhere. The only result we got was a forum of Polish radiologists working in the UK who had the same problem we did.*
  • Problem 3 — Different granularity: Chest Xray14 chose to identify multiple radiological observations, including pneumonia. CoronaHack chose to identify subtypes of pneumonia — pneumonia caused by viruses and pneumonia caused by bacteria.*

*yes, we do solve these problems in UMIE datasets. Stay tuned.
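
To make the Chest Xray14 exercise above concrete, here is a minimal pandas sketch for listing the unique labels, assuming the dataset’s standard Data_Entry_2017.csv metadata file and its “Finding Labels” column:

```python
import pandas as pd

# Standard Chest Xray14 metadata file.
df = pd.read_csv("Data_Entry_2017.csv")

# Each row may carry several findings joined with "|":
# split, flatten, and deduplicate to recover the label set.
unique_labels = sorted(df["Finding Labels"].str.split("|").explode().unique())
print(unique_labels)
```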

We introduce 🎉UMIE datasets🎉, the first publicly available installment in our UMIE series.

  • UMIE datasets combine 20+ open-source datasets that amount to more than a million images.
  • It is the world’s largest dataset of annotated radiological imaging to date.
  • It collects CT, MRI, and X-ray images of many organs.
  • Each image has an associated label, a mask, or both.
  • For each supported dataset, we created a preprocessing pipeline transforming the dataset from the source format to the format shared across all of the UMIE datasets.
  • Our pipelines are composed of reusable steps, so if you have a new dataset that you would like to use to extend your own UMIE, just select the relevant steps.
  • E.g., if you have a dataset with DICOM images and XML masks (do you remember example 2, Stanford COCA?), simply select the relevant ready-to-use steps (see the pipeline sketch after this list).
  • get_file_paths and create_file_tree — 2 obligatory steps that ensure every UMIE dataset has the same file tree.
  • convert_dcm2png — do you remember that the DICOM standard allows you to store images in JPEG, JPEG 2000, or raw byte format with the most significant byte coming first (Big Endian) or last (Little Endian), among some 20 other transfer syntaxes? Do you know that there are 3*3*2 different ways to write a nested structure in DICOM? You are safe to forget it; this step handles it for you (a sketch of its core logic follows this list).
  • create_masks_from_xml — do you remember the dicts of arrays of arrays of dicts […] from example 2: Stanford COCA? Forget them, we handle it for you.
  • add_umie_ids — adds unique identifiers, so that names follow the same convention across datasets and no two images share a name.
  • Choose whether you want to remove_imgs_with_no_annotations or create_blank_masks.
  • Technical detail: our pipelines and steps are based on sklearn.pipeline.
  • Check our repo for a deep dive on how to create new steps.
  • We prepared a set of 20 reusable steps that should cover the vast majority of formats one can think of.
  • Using our pipeline to transform a new dataset should be as simple as “drag-and-drop” of relevant steps to a new pipeline.
  • Since we work with datasets coming from institutions from all over the world, a huge challenge was ensuring that the same label is not defined twice under different names.
  • For this purpose, we translated all of the labels and masks to a common ontology. We chose the RadLex ontology, published by Curtis Langlotz’s team and the Radiological Society of North America (RSNA).
  • The RadLex ontology covers the majority of possible radiological observations and clinical findings and assigns each of them a unique identifier — RadLex ID.
  • We consulted a radiologist and assigned one or more RadLex IDs to each source label and mask.
  • RadLex is neither perfect nor complete. For some labels we could assign a very precise RadLex ID; for others, we had to use a more general name.
  • E.g. in the slide, we have an example of CoronaHack labels translated to RadLex.
  • For viral pneumonia, there is a precise RadLex ID — “pneumonia viral”. For bacterial pneumonia, there is no RadLex ID.
  • We had to reformulate the problem as multilabel classification, i.e. each image can have multiple labels. So, an image with viral pneumonia has 2 labels in UMIE: pneumonia and pneumonia viral. An image with bacterial pneumonia has only 1 label — pneumonia (see the mapping sketch after this list).
  • Some medical open-source datasets do not allow redistribution, i.e. you can use their data to train your model, but you cannot re-release the data, whether modified or in its original form.
  • To work around this, we took inspiration from how ImageNet solves the same problem.
  • ImageNet consists of images scraped from Google. Since the ImageNet creators do not own these data, they do not publish the photos themselves, only links to their locations on the Internet.
  • At UMIE, we share instructions for downloading the data from the original source, together with preprocessing scripts that unify the data into the common UMIE format.
  • We are going to publish the datasets in UMIE format whose licenses allow redistribution on HuggingFace. We are currently looking into the licenses.
  • Why do we think creating a large-scale medical dataset is important?
  • So far, the default strategy for working with medical images has been to use models pretrained on the ImageNet dataset.
  • ImageNet collects natural images that are significantly different from radiological imaging.
  • We hope that with UMIE datasets, we can pretrain an encoder dedicated to medical imaging tasks and hence decrease the amount of data and compute required to fine-tune a new imaging model.
  • We will soon be releasing selected UMIE datasets on HuggingFace.
  • We also plan to open-source our model.
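
As promised in the convert_dcm2png bullet, here is a sketch of what the core of such a step can look like. It assumes pydicom (with the pylibjpeg/gdcm plugins installed for compressed transfer syntaxes) and a single-frame grayscale image; it is an illustration, not the actual UMIE implementation:

```python
import numpy as np
import pydicom
from PIL import Image
from pydicom.pixel_data_handlers.util import apply_modality_lut

def dcm_to_png(dcm_path, png_path):
    ds = pydicom.dcmread(dcm_path)
    # pixel_array decodes whatever transfer syntax the file uses
    # (endianness, compression, ...) -- the mess this step hides.
    arr = apply_modality_lut(ds.pixel_array, ds).astype(np.float32)
    # Naive rescale to 8 bits; a real pipeline would apply a proper
    # window/level, e.g. from the WindowCenter/WindowWidth tags.
    arr -= arr.min()
    if arr.max() > 0:
        arr /= arr.max()
    Image.fromarray((arr * 255).astype(np.uint8)).save(png_path)
    return png_path
```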
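
And here is how such steps can be “dragged and dropped” into a pipeline. The step names mirror the article, but the classes below are hypothetical stand-ins (reusing dcm_to_png from the previous sketch); the real implementations and signatures live in the UMIE repo:

```python
from pathlib import Path

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline

class GetFilePaths(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for the get_file_paths step."""

    def fit(self, X, y=None):
        return self

    def transform(self, dataset_root):
        # Collect every DICOM file under the dataset root.
        return sorted(Path(dataset_root).rglob("*.dcm"))

class ConvertDcm2Png(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for the convert_dcm2png step."""

    def fit(self, X, y=None):
        return self

    def transform(self, dcm_paths):
        # Delegate to the conversion sketched above.
        return [dcm_to_png(p, p.with_suffix(".png")) for p in dcm_paths]

# Select only the steps your source format needs:
pipeline = Pipeline([
    ("get_file_paths", GetFilePaths()),
    ("convert_dcm2png", ConvertDcm2Png()),
])
png_paths = pipeline.fit_transform("path/to/raw_dataset")
```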
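
Finally, the CoronaHack-to-RadLex translation described above boils down to a small multilabel mapping. The source label strings below are illustrative placeholders, and the real table would carry the corresponding RadLex IDs (look them up at radlex.org):

```python
# Illustrative mapping, not the actual UMIE table: viral pneumonia gets
# both the general and the specific RadLex term, bacterial pneumonia
# only the general one (RadLex has no bacterial-pneumonia term).
SOURCE_TO_RADLEX = {
    "viral pneumonia": ["pneumonia", "pneumonia viral"],
    "bacterial pneumonia": ["pneumonia"],
    "normal": [],
}

def to_umie_labels(source_label: str) -> list[str]:
    """Return the unified multilabel annotation for one image."""
    return SOURCE_TO_RADLEX[source_label]

assert to_umie_labels("viral pneumonia") == ["pneumonia", "pneumonia viral"]
```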

If you enjoyed the article and like our project, consider giving our repo a star on GitHub and sharing the project with others 🙏

We are currently in the alpha stage. Any feedback is very much appreciated!
