Check your model — checklist for responsible deep learning modeling

Weronika Hryniewska
Published in ResponsibleML
5 min read · Jul 26, 2021

Well-prepared checklists significantly improve the quality of the modeling process. They help to avoid, or quickly detect and fix, errors. For this reason, we prepared a checklist for the responsible analysis of lung images with deep learning models, based on the studies we analyzed and the errors we found in them.

Following the guidelines proposed in the paper (Hryniewska et al., 2021), we created a GitHub repository that can be maintained by the community working on AI models for image analysis in healthcare. This repository is a starting point for further development of the proposed checklist to meet the evolving challenges in responsible modeling.

We would like to encourage you to go to this GitHub repository and share a filled-in checklist for your paper or data resource. In the next paragraphs, we will tell you what the checklist looks like and how to fill it in.

What does the checklist look like?

We use the following notation:

[R] — indicates that the point should be consulted with a field expert / radiologist
[D] — indicates that the point should be consulted with a model developer

— —

Data resources

  • [D] Do the data and the associated information provide sufficient diagnostic quality? If the images are in DICOM, does the header provide the needed information? If not, is it provided in some other way?
  • [R] Are low-quality images (e.g., blurred, too dark, or too bright) rejected?
  • [D] Is the dataset balanced in terms of sex and age?
  • [R] Does the dataset contain only one type of image (CT or X-ray)?
  • [R] Are the lung structures visible (“lung” window) on CT images?
  • [D] Are images of children and of adults labeled as such within the dataset?
  • [R] Are images correctly categorized with respect to the class of pathology?
  • [D] Are AP/PA projections described for every X-ray image?
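For the question about low-quality images, an automated pre-filter can flag candidates for rejection before a radiologist makes the final call. A minimal sketch in NumPy; the thresholds are illustrative assumptions, not values from the paper:

```python
import numpy as np

def flag_low_quality(image, dark_thr=0.05, bright_thr=0.95, contrast_thr=0.05):
    """Flag an image as a low-quality candidate from simple statistics.
    `image` is a 2-D grayscale array scaled to [0, 1]; thresholds are
    illustrative and should be tuned with a radiologist."""
    mean, std = image.mean(), image.std()
    if mean < dark_thr:
        return "too dark"
    if mean > bright_thr:
        return "too bright"
    if std < contrast_thr:
        return "low contrast (possibly blurred or uniform)"
    return None  # passes the automated pre-filter

# Example: a nearly black image is flagged as a rejection candidate.
print(flag_low_quality(np.full((256, 256), 0.01)))  # → "too dark"
```

Such a filter only shortlists images for review; the checklist still asks a radiologist ([R]) to confirm the rejections.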

Image preprocessing

  • [D] Is the data preprocessing described?
  • [D] Are artifacts (such as captions) removed?
  • Data augmentation (if needed)
  • [D] Are the lungs fully present after transformations?
  • [R] Are lung structures visible after brightness or contrast transformations?
  • [D] Are only sensible transformations applied?

Transfer learning (if used)

  • [D] Is the transfer learning procedure described?
  • [D] Is the applied transfer learning appropriate for this case (i.e., were images of the same type and content used to train the original model)?

Model performance

  • [D] Are at least a few of the metrics proposed in (Albahri, 2020) used?
  • [D] Is the model validated on a different database than the one used for training?

Domain quality of model explanations

  • [R] Are other structures (e.g., bowel loops) misinterpreted as lungs in segmentation?
  • [R] Are all areas marked as highly explanatory located inside the lungs?
  • [R] Are artifacts (cables, breathing tubes, image compression, embedded markup symbols) misidentified as part of the explanations?
  • [R] Are areas indicated as explanations consistent with opinions of radiologists?
  • [R] Do explanations accurately indicate lesions?
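The location checks above can be partly automated when a lung segmentation mask is available. The sketch below is our illustration, not the paper's method; the top-5% cutoff is an assumed parameter. It measures what fraction of the most-salient pixels of an explanation fall inside the lungs:

```python
import numpy as np

def explanation_inside_lungs(saliency, lung_mask, top_frac=0.05):
    """Fraction of the most-salient pixels that fall inside the lung mask.
    `saliency` is a 2-D relevance map, `lung_mask` a boolean lung
    segmentation; `top_frac` (illustrative) keeps the top 5% of pixels."""
    k = max(1, int(top_frac * saliency.size))
    thresh = np.partition(saliency.ravel(), -k)[-k]  # k-th largest value
    top = saliency >= thresh
    return (top & lung_mask).sum() / top.sum()

# Toy example: all salient pixels lie in the upper half of the image.
sal = np.zeros((10, 10)); sal[:5, :] = 1.0
mask = np.zeros((10, 10), dtype=bool); mask[:5, :] = True
print(explanation_inside_lungs(sal, mask))  # → 1.0
```

A low score suggests the model explains its prediction with regions outside the lungs (artifacts, cables, markup), which a radiologist should then inspect.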

How to fill in the checklist?

Using the checklist, analyze which points are fulfilled by the study and by the dataset used to train the neural network in your paper. If you are not the creator of the data resource or study, you can evaluate only the information contained in the article; in that case, stress that you are not sure about the answers to some of the questions.

To add a new study or data resource to the repository, find the JSON template in the datasets_checklist or papers_checklist folder. In the JSON file, state exactly which points from the checklist are fulfilled; details about filling in the checklist are presented in the paragraphs below. Then create a pull request with the JSON file attached, and justify your answers with comments in the pull request description. Your submitted pull request will be verified by community members, who may ask for corrections or clarifications. All discussions will be publicly visible.
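For illustration only, a filled-in entry might be built as below. The field names are invented for this sketch and are not the repository's actual JSON template, so always start from the template in the repository:

```python
import json

# Hypothetical checklist entry -- field names are illustrative,
# not the actual schema of the repository's JSON template.
entry = {
    "paper": "Example et al., 2021",
    "checklist": {
        "data_resources": {
            "diagnostic_quality": "Y",
            "low_quality_rejected": "?",   # no information provided
            "balanced_sex_and_age": "Y?",  # balanced for sex only
            "single_image_type": "Y",
        },
        "image_preprocessing": {
            "preprocessing_described": "Y",
            "artifacts_removed": "N",
        },
        "transfer_learning": {"procedure_described": "n/a"},
    },
}

print(json.dumps(entry, indent=2))
```

Each answer in the pull request description should then be backed by a quote from, or pointer into, the paper or dataset documentation.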

— —

We use the following values:
“Y” means yes (if the answer is only probable, an additional “?” is appended),
“N” means no (if the answer is only probable, an additional “?” is appended),
“?” means no information is provided,
“n/a” means the issue does not apply to the particular publication.

— —

The question regarding balance in the dataset has two components (sex and age). Sometimes a dataset is balanced with respect to one criterion but not the other. In such cases, please put “Y?”. Do the same when much of the metadata is missing but the existing data are balanced.

We would like to stress that, for the question regarding data augmentation, you should put “N” when a horizontal flip was applied.
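This rule is mechanical enough to encode. In the sketch below the transform vocabulary is our own illustration (the checklist does not define one); a horizontal flip forces an “N” because mirroring a chest image moves the heart to the patient's right, which is anatomically wrong outside rare situs inversus:

```python
# Illustrative vocabulary of augmentation names -- not an official list.
SENSIBLE = {"small_rotation", "brightness", "contrast", "crop", "zoom"}
NOT_SENSIBLE = {"horizontal_flip", "vertical_flip"}

def augmentation_answer(applied):
    """Map a set of applied augmentation names to a checklist answer.
    A horizontal flip mirrors the body, so it forces an 'N'."""
    if any(t in NOT_SENSIBLE for t in applied):
        return "N"
    if all(t in SENSIBLE for t in applied):
        return "Y"
    return "?"  # unknown transform: no information to judge

print(augmentation_answer({"small_rotation", "horizontal_flip"}))  # → "N"
```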

In transfer learning, the main criterion is whether the authors used a model pre-trained on the ImageNet dataset. This is not recommended, as natural-scene images differ significantly from medical images. The biggest difference is that X-ray and CT images are grayscale, unlike the images in ImageNet.
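The mismatch is visible at the tensor level: a grayscale scan has one channel, while ImageNet-pretrained models expect three, so the common workaround is simply to replicate the channel. A small NumPy sketch (shapes illustrative):

```python
import numpy as np

# A grayscale CT/X-ray slice has one channel; ImageNet-pretrained
# models expect three. Replicating the channel makes the shapes fit,
# but the pretrained RGB filters then see three identical inputs,
# so color-sensitive features learned on natural scenes are wasted.
rng = np.random.default_rng(0)
xray = rng.random((512, 512))           # 1-channel grayscale image
as_rgb = np.stack([xray] * 3, axis=-1)  # shape (512, 512, 3)

assert as_rgb.shape == (512, 512, 3)
assert np.array_equal(as_rgb[..., 0], as_rgb[..., 2])  # channels identical
```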

Summary

The paper mentions a long list of problems in modeling, but this analysis is not intended to criticize any of the mentioned articles; these are state-of-the-art papers, often published in prestigious journals. However, analyzing them makes one look critically at the standards in AI for healthcare, or rather the lack of them.

We hope that this paper will initiate the development of standards for responsible AI solutions in healthcare. In this paper, we showed that verification of XAI solutions for medical images is not only important but necessary. We believe that if the proposed checklist is taken into account when building models, we will get better models.

The content of this article is taken from the work (Hryniewska et al., 2021). To read more about the checklist and see how other studies and data resources were assessed, see that paper. If you use any part of this article, please cite:

Hryniewska, W., Bombiński, P., Szatkowski, P., Tomaszewska, P., Przelaskowski, A., & Biecek, P. (2021). Checklist for responsible deep learning modeling of medical images based on COVID-19 detection studies. Pattern Recognition, 118, 108035. https://doi.org/10.1016/j.patcog.2021.108035


Weronika Hryniewska
PhD Student at the Warsaw University of Technology. Interested in XAI and DL in medicine.