Unleashing the Potential of Biological Hierarchies in Histopathology: Introducing the OCELOT Dataset

Aaron Valero · Lunit Team Blog · Jun 16, 2023 · 5 min read

Introduction

Did you know that computational pathology (CPATH) involves working with extremely high-resolution images? And when I say ‘extremely high,’ I really mean it. A Whole-Slide Image (WSI), which is a scanned image of a biopsy sample, typically consists of billions of pixels. Moreover, for some tasks, such as biomarker discovery, a full understanding of the cell and tissue content is critical. Dealing with such large images and analysis requirements presents two significant challenges in machine learning (ML). Firstly, there’s a computational challenge since ML algorithms cannot directly process these images due to hardware limitations. Secondly, acquiring dense annotations is a highly expensive and time-consuming process.
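To get a feel for the scale, here is a back-of-envelope calculation. The 100,000 × 100,000-pixel size is an illustrative assumption (WSI dimensions vary widely in practice):

```python
# Rough memory footprint of a hypothetical Whole-Slide Image (WSI).
# The 100,000 x 100,000 dimensions are illustrative, not a fixed standard.
width, height, channels = 100_000, 100_000, 3  # RGB, 8 bits per channel

pixels = width * height
raw_bytes = pixels * channels

print(f"{pixels / 1e9:.0f} billion pixels")      # 10 billion pixels
print(f"{raw_bytes / 1e9:.0f} GB uncompressed")  # 30 GB uncompressed
```

At tens of gigabytes per uncompressed slide, it is clear why no current GPU can ingest a WSI in one pass.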

These challenges are especially apparent in two fundamental CPATH tasks: cell detection and tissue segmentation. Both play a vital role in histopathology quantification, enabling accurate diagnosis and treatment planning, and both require dense annotations to successfully train an ML model.

To tackle these challenges, ML models traditionally rely on small tiles extracted from the WSI for training and inference. This approach reduces the image size to a manageable level and, in turn, the annotation effort (indeed, densely annotating an entire WSI for detection and segmentation is virtually impossible). The image below illustrates a typical tile and its relative size compared to the full WSI.

Image source: Liu et al.
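The tiling strategy can be sketched as follows; the tile size and stride are illustrative choices, not values prescribed by the paper:

```python
def tile_coordinates(slide_w, slide_h, tile=1024, stride=1024):
    """Yield top-left (x, y) corners of a grid of tiles covering a slide.

    Tiles that would overrun the border are clamped so every pixel is
    covered; tile/stride values here are illustrative, not from OCELOT.
    """
    for y in range(0, slide_h, stride):
        for x in range(0, slide_w, stride):
            yield min(x, slide_w - tile), min(y, slide_h - tile)

# A 4096 x 4096 region yields a 4 x 4 grid of 1024-pixel tiles.
coords = list(tile_coordinates(4096, 4096))
print(len(coords))  # 16
```

A real pipeline would additionally skip background tiles and read each region lazily from the slide file rather than loading the whole image.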

Despite the mentioned constraints, ML models have demonstrated remarkable performance. In fact, the performance is comparable to that of human experts in tasks such as cell detection and tissue segmentation. However, it is important to note that pathologists evaluate WSIs differently. They utilize sophisticated visualization tools that allow them to zoom in and out freely. These tools enable human experts to examine fine details of cell morphologies and larger tissue structures at the same time, which is crucial in their analysis. Pathologists have discovered through their experience that considering various zoom levels is essential for understanding the tumor microenvironment. In other words, analyzing the architecture of the tissues, and the hierarchical relations between cells and tissues is essential.

Thus, we find ourselves questioning whether we are limiting the capabilities of traditional CPATH algorithms by imposing size constraints. If that is the case, how much potential are we missing out on? And can we strive for better performance?

OCELOT: a pathologist-inspired dataset

To explore this topic, we developed the OCELOT dataset, consisting of paired samples for two distinct tasks at varying zoom levels. The first task involves cell detection at high magnification. The second task focuses on tissue segmentation at a four-times lower magnification, facilitating the generation of larger-context segmentation masks. Each sample in the dataset is meticulously annotated in pairs, ensuring that every cell detection tile has a corresponding surrounding tissue tile.

Our primary objective in creating this dataset is to enable the algorithm to comprehend two different zoom levels while proficiently performing two critical tasks in CPATH. We firmly believe that a paired dataset is crucial for investigating multi-task approaches. Furthermore, to the best of our knowledge, this dataset represents the first endeavor to propose a multi-task and multi-level dataset specifically tailored for learning cell and tissue relationships. Here is an illustrative example of a paired sample from the dataset:

Example pair in the OCELOT dataset with the corresponding annotations. On the left, we have the tissue tile and on the right, we have the corresponding cell tile. The cell tile lives in the red square in the tissue tile.
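Since the cell tile occupies a sub-region of the tissue tile (the red square above) and the tissue tile sits at four-times lower magnification, mapping a cell-tile pixel into tissue-tile coordinates is a simple affine transform. The parameter names below are hypothetical, not the dataset's actual metadata schema:

```python
def cell_to_tissue_coords(x_cell, y_cell, offset_x, offset_y, ratio=4):
    """Map a pixel (x_cell, y_cell) in the cell tile to the tissue tile.

    offset_x/offset_y: the cell tile's top-left corner expressed in
    tissue-tile pixels (hypothetical metadata fields); ratio: the
    magnification ratio between the two tiles (4x in OCELOT).
    """
    return offset_x + x_cell / ratio, offset_y + y_cell / ratio

# A detection at (400, 200) in the cell tile, whose region starts at
# (128, 64) in the tissue tile, lands at (228.0, 114.0) there.
print(cell_to_tissue_coords(400, 200, 128, 64))  # (228.0, 114.0)
```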

Our paper¹, titled “OCELOT: Overlapped Cell on Tissue Dataset,” accepted at CVPR 2023, introduces not only the OCELOT dataset but also a range of multi-task deep learning architectures. These architectures demonstrate remarkable performance improvements over the single detection task model (cell-only model) across multiple benchmarks, achieving substantial improvement margins.
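One of the simplest ways to exploit the pairing — and the intuition behind several of the architectures in the paper — is to crop the tissue prediction to the cell tile's region, upsample it, and hand it to the cell branch as an extra context channel. A minimal NumPy sketch, with plain arrays standing in for network feature maps and all shapes and names chosen for illustration rather than taken from the paper's implementation:

```python
import numpy as np

def inject_tissue_context(cell_image, tissue_probs, offset, ratio=4):
    """Concatenate an upsampled tissue-probability crop onto a cell tile.

    cell_image:   (H, W, C) cell tile at high magnification.
    tissue_probs: (H', W') tissue prediction at `ratio`-times lower mag.
    offset:       (row, col) of the cell region inside tissue_probs.
    All names and shapes are illustrative, not the paper's interface.
    """
    h, w = cell_image.shape[:2]
    r0, c0 = offset
    crop = tissue_probs[r0:r0 + h // ratio, c0:c0 + w // ratio]
    upsampled = np.kron(crop, np.ones((ratio, ratio)))  # nearest-neighbor 4x
    return np.concatenate([cell_image, upsampled[..., None]], axis=-1)

cell = np.zeros((64, 64, 3))
tissue = np.random.rand(32, 32)  # 128x128 field of view at 1/4 magnification
out = inject_tissue_context(cell, tissue, offset=(4, 4))
print(out.shape)  # (64, 64, 4)
```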

Comparison of a single-task baseline and one of our proposed multi-task approaches. Performance is measured as the population mean F1 score for cell detection (higher is better).
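For reference, the F1 score combines detection precision and recall into a single number; the counts below are made-up figures for illustration, not results from the paper:

```python
def detection_f1(tp, fp, fn):
    """F1 score from true-positive, false-positive, false-negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Illustrative counts: 80 matched detections, 20 spurious, 20 missed.
print(round(detection_f1(tp=80, fp=20, fn=20), 3))  # 0.8
```

The "population mean" variant aggregates these counts over the whole evaluation set before computing F1, rather than averaging per-image scores.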

It is important to note that our multi-task algorithm generalizes well even to a large-scale IHC dataset, namely CARP, a Lunit Inc. internal dataset.

During our research, we discovered several intriguing findings:

  1. Adding larger-scale image input improves cell detection.
  2. Including the annotations of the second task, namely tissue segmentation annotation, improves performance too.
  3. Using both larger-scale image input as well as tissue annotation yields even greater enhancements.

Our work clearly shows that we haven’t yet reached the ceiling in cell detection performance.

What comes next?

There is still much to be explored in our investigation. For example, we suggest three directions:

  1. Whether the multi-task and multi-level approach also improves tissue segmentation performance is not yet confirmed. However, pathologists’ expertise indicates that high magnification levels (which capture cell-level information) are also valuable for identifying larger tissue structures more accurately.
  2. Generalize the multi-task approach for unpaired data.
  3. Train ML models with more tasks and/or magnifications. Our approaches are still limited to two magnifications and two tasks. What would happen if an algorithm could freely explore pathologist visualization tools?

At Lunit Inc., we firmly believe in the potential of this research direction, and we consider OCELOT a great benchmark for embarking on this journey. As a testament to our confidence, we are currently hosting the OCELOT challenge² at MICCAI 2023. We are excited to see how creatively participants combine two challenging tasks at distinct magnifications. By working together, we aim to discover the next-generation machine learning model in computational pathology.

References

[1] OCELOT: Overlapped Cell on Tissue dataset, https://openaccess.thecvf.com/content/CVPR2023/html/Ryu_OCELOT_Overlapped_Cell_on_Tissue_Dataset_for_Histopathology_CVPR_2023_paper.html

[2] OCELOT 23 Grand Challenge, https://ocelot2023.grand-challenge.org/

[3] Download OCELOT for free, https://zenodo.org/record/7844149
