Analytics Vidhya

Analytics Vidhya is a community of Generative AI and Data Science professionals. We are building the next-gen data science ecosystem https://www.analyticsvidhya.com

MHIST: A New Public Histopathological Image Dataset for the ML Community

A minimalist histopathology image analysis dataset (MHIST)

MNIST is one of the most popular benchmarking datasets for machine learning practitioners. Until now, due to a variety of reasons, no such standardized dataset existed for histopathological images.

On January 29th, 2021, researchers from Dartmouth College and Dartmouth-Hitchcock Medical Center posted a paper on the arXiv.org preprint repository describing the new MHIST dataset, along with basic benchmarking data. They made the MHIST dataset freely available to the ML community on their GitHub site.

MHIST is a binary classification dataset of 3,152 fixed-size (224 x 224 pixels) images of colorectal polyps, each with a gold-standard label determined by the majority vote of seven board-certified gastrointestinal pathologists. MHIST also includes each image’s annotator agreement level. As a minimalist dataset, MHIST occupies 354 MB of disk space.

The two classes of images are hyperplastic polyps and sessile serrated adenomas; the former are benign, and the latter are premalignant lesions of the colon. Differentiating between a hyperplastic polyp and a sessile serrated adenoma can be challenging. In their paper, the authors reported that in 16.7% of the cases, four of the seven pathologists suggested one entity and three suggested the other. Fortunately, they had seven pathologists rather than six classifying the lesions, so a majority diagnosis was always available for the difficult cases.
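To make the labeling scheme concrete, here is a minimal sketch of a majority vote over seven binary annotations. The function name `majority_vote` and the example votes are made up for illustration; they are not taken from the paper or the dataset.

```python
from collections import Counter

ALLOWED_LABELS = {"HP", "SSA"}  # the two MHIST classes


def majority_vote(votes):
    """Return (majority label, number of annotators who chose it).

    Requires an odd number of binary votes, so a tie is impossible.
    """
    if len(votes) % 2 == 0:
        raise ValueError("an odd number of annotators is needed to rule out ties")
    if not set(votes) <= ALLOWED_LABELS:
        raise ValueError("votes must be 'HP' or 'SSA'")
    label, count = Counter(votes).most_common(1)[0]
    return label, count


# Hypothetical votes for one difficult image: a 4-to-3 split.
votes = ["SSA", "HP", "SSA", "HP", "SSA", "SSA", "HP"]
label, agreement = majority_vote(votes)
print(label, f"{agreement}/{len(votes)}")  # SSA 4/7
```

With six annotators, the same image could end in a 3-to-3 tie, which is why the sketch rejects an even number of votes.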

The authors performed various ML experiments, with the best results obtained using a pretrained ResNet18 (AUC 92.7%).

To access the MHIST dataset, you have to register on their GitHub site by providing the requested personal data and agreeing to the Dataset Research Use Agreement.

https://bmirds.github.io/MHIST/

Once approved, you get access to the dataset files: annotations.csv, image.zip, and MD5SUMs.txt. The annotations.csv file contains the image file names, the corresponding majority-vote labels, and the degree of agreement among the pathologists. The image.zip file contains the 3,152 image files, and MD5SUMs.txt contains checksums that can be used to verify that the dataset was downloaded correctly.
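The checksum step can be sketched with Python's standard library. This is only an illustration: it assumes that MD5SUMs.txt uses the common `md5sum` layout (a hex digest, whitespace, then a file name on each line), which I have not confirmed against the actual file, and the helper names are my own.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_md5sums(text):
    """Parse '<hex digest>  <file name>' lines into {file name: digest}.

    Assumption: the file follows the usual md5sum output format.
    """
    expected = {}
    for line in text.splitlines():
        parts = line.split(maxsplit=1)
        if len(parts) == 2:
            digest, name = parts
            expected[name.strip().lstrip("*")] = digest.lower()
    return expected


def verify(folder, md5sums_name="MD5SUMs.txt"):
    """Return {file name: True/False} for every entry in the checksum file."""
    folder = Path(folder)
    expected = parse_md5sums((folder / md5sums_name).read_text())
    return {
        name: (folder / name).exists() and md5_of_file(folder / name) == digest
        for name, digest in expected.items()
    }
```

Calling `verify("path/to/download/folder")` (a placeholder path) returns one entry per listed file; any `False` value suggests a missing or corrupted download.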

annotations.csv

There are roughly twice as many hyperplastic polyps (HP) as sessile serrated adenomas (SSA) in the MHIST dataset.

Diagnosing a sessile serrated adenoma can be difficult, even for experienced pathologists, as the distribution of annotator agreement shows. For over 200 cases, only four of the seven pathologists considered the polyp to be a sessile serrated adenoma.
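The class balance and the agreement distribution can be tallied straight from annotations.csv. The sketch below uses only the standard library; the column names are my assumption about the file's header and should be adjusted to match the real one, and the sample rows are invented so that the snippet runs without the dataset.

```python
import csv
import io
from collections import Counter

# Assumed column names; check them against the real annotations.csv header.
LABEL_COL = "Majority Vote Label"
VOTES_COL = "Number of Annotators who Selected SSA (Out of 7)"


def summarize(rows):
    """Count images per class and per number of SSA votes."""
    labels = Counter()
    ssa_votes = Counter()
    for row in rows:
        labels[row[LABEL_COL]] += 1
        ssa_votes[int(row[VOTES_COL])] += 1
    return labels, ssa_votes


# Invented rows standing in for the real file.
sample = io.StringIO(
    "Image Name,Majority Vote Label,Number of Annotators who Selected SSA (Out of 7)\n"
    "example_1.png,HP,0\n"
    "example_2.png,HP,3\n"
    "example_3.png,SSA,4\n"
    "example_4.png,SSA,7\n"
)
labels, ssa_votes = summarize(csv.DictReader(sample))
print(dict(labels))               # {'HP': 2, 'SSA': 2}
print(sorted(ssa_votes.items()))  # [(0, 1), (3, 1), (4, 1), (7, 1)]
```

For the real data, replace `sample` with `open("annotations.csv", newline="")`. Images with three or four SSA votes are the narrow-majority cases discussed above.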

To do a quick check of the dataset, I ran an ML model using transfer learning with pretrained ResNet18 for 100 epochs.

Confusion matrix

This simple model achieved 88% accuracy.
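For readers who want to go from a confusion matrix to an accuracy figure, the arithmetic is just the diagonal sum divided by the total. The counts below are placeholders chosen for easy arithmetic; they are not the values from my experiment.

```python
def accuracy(cm):
    """Accuracy from a square confusion matrix.

    cm[i][j] is the number of images of true class i predicted as class j.
    """
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total


# Placeholder counts (rows: true HP, true SSA; columns: predicted HP, predicted SSA).
cm = [
    [90, 10],
    [20, 80],
]
print(f"{accuracy(cm):.1%}")  # 85.0%
```

Because the two classes are imbalanced, accuracy alone can be misleading, which is one reason the paper reports AUC.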

Despite the promise that machine learning shows in healthcare and related fields, a bottleneck slows the rate of progress: access to the high-quality datasets needed to train and test machine learning algorithms. Numerous datasets exist, but few are easily accessible to researchers. This is mainly due to the nature of healthcare datasets themselves; because they contain identifiable information, access to them is restricted by several measures that protect patients' privacy.

In one of my previous posts, I described the lung and colon cancer dataset (LC25000) that my colleagues and I made available for ML researchers. I am sure that any ML researcher will welcome the new MHIST dataset, which is standardized and can serve as a viable tool for new ML algorithm creation and model benchmarking.

Thank you for taking the time to read this post.

Andrew

@tampapath

Reference: Wei, Jerry; Suriawinata, Arief; Ren, Bing; Liu, Xiaoying; Lisovsky, Mikhail; Vaickus, Louis; Brown, Charles; Baker, Michael; Tomita, Naofumi; Torresani, Lorenzo; Wei, Jason; Hassanpour, Saeed. A Petri Dish for Histopathology Image Analysis. arXiv preprint arXiv:2101.12355 (2021).


Published in Analytics Vidhya


Written by Andrew A Borkowski

Pathologist and Deep Learning Enthusiast
