Prevention is better than cure: a case study of abnormality detection in the chest

Weronika Hryniewska
ResponsibleML
Aug 14, 2021

In this article, we would like to demonstrate how a series of simple tests for data imbalance exposes faults in the data acquisition and annotation process. We analyzed in detail a single use case — a Kaggle competition related to the detection of abnormalities in X-ray lung images.

Complex models are able to learn artifacts, and it is difficult to remove this bias during or after training. Errors made at the data collection stage also make it difficult to validate the model correctly.

Problems in the training set can be divided into two groups: inconsistency among radiologists and data quality problems.

Inconsistency among radiologists

Unequal division of annotation work between radiologists

Different label distributions among radiologists. The plots show the number of images annotated by each radiologist, grouped by whether an illness was found, by age, and by sex.

As visible in the figures above, the radiologists can be divided into three groups.

The first group, R8-R10, worked on the same part of the X-ray dataset and annotated most of the images in it, both with and without findings. Each of these radiologists annotated more than 6,000 images, and together they annotated 95% of all findings detected in this dataset.

The next group, R1-R7, detected almost no lesions (R2 found 3, the rest none).

The last group, R11-R17, annotated fewer than 2,000 images each, with a high fraction of ‘no findings’ images.

Unclear annotation rules

When we compare the class labels given by different radiologists for the same image, the consistency is remarkably low. Even in group R8-R10 (the radiologists who annotated 95% of all findings), a radiologist agreed with both colleagues on all classes in only 46% of images.
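As a rough way to reproduce such an agreement figure, the sketch below compares, per image, the sets of class labels assigned by R8, R9, and R10. It assumes an annotation table with image_id, rad_id, and class_name columns (the layout of the competition’s train.csv); adjust the names if your file differs.

```python
import pandas as pd

# Annotation table; image_id, rad_id, and class_name follow the layout
# of the competition's train.csv -- adjust if your column names differ.
df = pd.read_csv("train.csv")

# Keep only the three radiologists who produced most of the findings.
trio = df[df["rad_id"].isin(["R8", "R9", "R10"])]

# For every (image, radiologist) pair, collect the set of assigned class labels.
labels = (
    trio.groupby(["image_id", "rad_id"])["class_name"]
        .apply(frozenset)
        .unstack("rad_id")
        .dropna()  # keep images annotated by all three radiologists
)

# Full agreement: all three label sets for an image are identical.
agree = labels.apply(lambda row: row.nunique() == 1, axis=1)
print(f"Full agreement on {agree.mean():.1%} of {len(agree)} images")
```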

Different labels for the same pathology

Another effect of unclear annotation rules is significantly overlapping definitions of anomalies. The classes ILD and Pulmonary fibrosis strongly overlap, as do Consolidation and Infiltration. The most vivid example is ‘Lung opacity’, which covers six other classes!

If such overlaps are not clearly resolved at the beginning of the annotation process, inconsistency is inevitable.

Lesions present on chest images with the ‘no findings’ label

Our expert radiologist analyzed 10 randomly selected images annotated by each of the seventeen radiologists (R1-R17). He found some abnormalities in the images that were annotated as having ‘no findings’. The exact numbers are visible in the table below.

Examples of lesions found on images checked by three radiologists and classified as ‘No finding’. The image on the left should be annotated with the consolidation/pneumonia label, and the image on the right as Other lesion (actually dextrocardia).

One bounding box for all lesions of the same type, or one for each lesion

Some radiologists use a single box to cover several anomalies, while others mark each anomaly separately.

This inconsistency influences model quality. The metric chosen for the competition, mAP at IoU 0.4, means that a predicted bounding box counts as correct only if it overlaps a ground-truth box with an intersection over union of at least 40%. The problem is that if the radiologists’ annotations (the ground truth) do not meet this requirement consistently, how can an AI model trained on such noisy labels achieve a good result?

Examples of inconsistency between radiologists related to the use of a single box to mark many anomalies of the same class. In the left image, there are two big boxes, one for each lung, plus many small boxes; in the right one, there is a single box covering both lungs.
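To make the 40% overlap requirement concrete, here is a minimal sketch of the IoU computation for boxes given as (x_min, y_min, x_max, y_max); the example coordinates are made up, but they show how a single box around a whole lung barely overlaps a per-lesion box.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x_min, y_min, x_max, y_max)."""
    ix_min, iy_min = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix_max, iy_max = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# A whole-lung box versus a small per-lesion box (illustrative coordinates):
whole_lung = (100, 100, 500, 900)
single_lesion = (120, 150, 220, 260)
print(iou(whole_lung, single_lesion))  # ~0.03, far below the 0.4 threshold
```

Two annotations of the same finding, one drawn in each style, can therefore fail to overlap at the required threshold even though both are clinically correct.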

Data quality

Lesion localization imbalance

In the dataset, there are 14 annotated anomaly classes. In regular clinical practice, all except two (aortic enlargement and cardiomegaly) are distributed similarly on both sides of the chest, so there should be a similar number of lesions in the right lung and in the left one. However, the heatmaps in the figure below, computed for anomalies that should appear symmetrically, do not show this symmetry.

Examples of lesions that should be present symmetrically in both lungs. Before the heatmaps were calculated, images from the training set were centered.
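One simple way to quantify such an imbalance is to check on which side of the image each bounding-box center falls. The sketch below assumes the annotation table has x_min and x_max columns and that a hypothetical img_width column (image width in pixels, taken from the DICOM metadata) has been merged in.

```python
import pandas as pd

# Bounding-box annotations with a hypothetical `img_width` column merged in.
boxes = pd.read_csv("train_with_sizes.csv")
boxes = boxes[boxes["class_name"] != "No finding"]

# Side of the image on which the box center lies. Note that on a standard
# frontal chest X-ray the patient's right lung appears on the left of the image.
x_center = (boxes["x_min"] + boxes["x_max"]) / 2
boxes["side"] = (x_center < boxes["img_width"] / 2).map({True: "left", False: "right"})

# Compare the counts for classes that should be roughly symmetric.
symmetric = boxes[~boxes["class_name"].isin(["Aortic enlargement", "Cardiomegaly"])]
print(symmetric.groupby(["class_name", "side"]).size().unstack("side"))
```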

Children present in the dataset

In the training dataset, there are 107 images of children (ages 1–17). This might be a problem because child anatomy differs from adult anatomy (e.g., the shape of the heart, the mediastinum, and bone structure), and so do the technical aspects of a child’s X-ray (the position of the hands) (Hryniewska et al., 2021). The model might learn to exploit such differences. As children are not simply small adults, these images should be removed in order not to introduce additional noise during model training.

According to (Nguyen et al., 2020), pediatric X-rays should have been removed from the data during the data filtering step, but we found that they were accidentally left in.
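A minimal sketch of such a filtering step with pydicom is shown below; the train/ directory and the .dicom extension are assumptions, and PatientAge is an optional header field that is often missing in this dataset, so the check can only be best-effort.

```python
from pathlib import Path
import pydicom

def is_pediatric(dicom_path, cutoff=18):
    """Return True when the DICOM header reports an age below `cutoff` years."""
    ds = pydicom.dcmread(dicom_path, stop_before_pixels=True)
    # PatientAge (0010,1010) is an Age String such as '012Y'; it is often absent.
    age = str(getattr(ds, "PatientAge", "") or "")
    if age.endswith("Y") and age[:-1].isdigit():
        return int(age[:-1]) < cutoff
    return False  # unknown age or non-year units: keep the image, flag it for review

adult_files = [p for p in Path("train").glob("*.dicom") if not is_pediatric(p)]
```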

Children’s lungs versus adults’ lungs

Two monochromatic color spaces

Another valid concern is Photometric Interpretation, which specifies the intended interpretation of the image pixel data. Some images are of type MONOCHROME1 (17%) and some of type MONOCHROME2. The difference is that in the first case the lowest pixel value is interpreted as white, and in the second case as black. If this is not taken into consideration, the same tissue appears with opposite intensities in different images, which may lead to poorly performing models.

MONOCHROME1 versus MONOCHROME2
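A minimal sketch of the normalization step with pydicom, inverting MONOCHROME1 images so that, as in MONOCHROME2, larger pixel values always mean brighter pixels:

```python
import numpy as np
import pydicom

def load_pixels(dicom_path):
    """Read pixel data and give it a consistent orientation of intensities."""
    ds = pydicom.dcmread(dicom_path)
    pixels = ds.pixel_array.astype(np.float32)
    if getattr(ds, "PhotometricInterpretation", "") == "MONOCHROME1":
        # In MONOCHROME1 the lowest value is white, so flip the scale.
        pixels = pixels.max() - pixels
    return pixels
```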

Missing or wrong metadata in DICOMs

Some images suffer from missing metadata in the extensive DICOM header, most often the lack of age or sex.

68% of the observations have no information about age, and 17% have none about sex. The sex parameter is set to O (other) for 34% of the images; the rest of the dataset is fairly balanced (M: 26%, F: 23%).
There are also many instances where the age equals 0 or far exceeds 100 (e.g., 238). This leaves only 25% of the images with a valid age between 1 and 99.
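The figures above can be reproduced with a simple audit of the DICOM headers. Here is a minimal sketch with pydicom; the train/ path and the .dicom extension are assumptions.

```python
from pathlib import Path
import pydicom

valid_age = missing_age = missing_sex = other_sex = total = 0

for path in Path("train").glob("*.dicom"):
    ds = pydicom.dcmread(path, stop_before_pixels=True)
    total += 1

    # PatientAge is an Age String such as '045Y'; treat anything unparsable as missing.
    age = str(getattr(ds, "PatientAge", "") or "").rstrip("Y")
    if not age.isdigit():
        missing_age += 1
    elif 1 <= int(age) <= 99:
        valid_age += 1

    sex = str(getattr(ds, "PatientSex", "") or "")
    if not sex:
        missing_sex += 1
    elif sex == "O":
        other_sex += 1

print(f"valid age (1-99): {valid_age / total:.0%}, missing age: {missing_age / total:.0%}")
print(f"missing sex: {missing_sex / total:.0%}, sex == 'O': {other_sex / total:.0%}")
```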

The lack of reliable information about age or sex is unfavorable because such attributes might be correlated with certain diseases, or with having a disease at all. For example, for younger people the probability of having lesions is significantly lower than for older people.

Density plots of age grouped by the existence of an illness. The probability of a young person having a lesion is lower than that of an older person.
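Such a plot can be recreated with seaborn from a per-image metadata table; the image_metadata.csv file and the age and has_finding column names below are assumptions for the sake of the sketch.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical per-image table with a numeric `age` column and a boolean
# `has_finding` column (True when any lesion was annotated on the image).
meta = pd.read_csv("image_metadata.csv")

# Compare the age distributions of images with and without findings,
# restricted to plausible ages only.
sns.kdeplot(data=meta[meta["age"].between(1, 99)], x="age",
            hue="has_finding", common_norm=False, fill=True)
plt.xlabel("Age [years]")
plt.title("Age density by presence of a finding")
plt.show()
```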

Parts of clothes present in the X-rays

Undesirable artifacts, presented in the figures below, can easily be avoided during image acquisition by asking the patient to remove all items of clothing that may influence X-ray imaging, for example chains, bras, and clothes with buttons or zippers. If artifacts cannot be prevented, they can be removed during image preprocessing, before the image is shown to the model.

Examples of clothing artifacts. From the left: buttons, a zipper, and the underwire (bone) of a bra.

Letters present in the X-rays

Letters and/or annotations present in some lung images should be removed during preprocessing to prevent a neural network from learning those patterns. The model should learn how to differentiate labels by focusing on image features, not on descriptions in the images.

Example of letter artifacts.

Conclusions

The quality of a model is inherently bound to the quality of the data on which it is trained, so the development of a reliable model should begin at the data acquisition and annotation stage. At the model development stage, we cannot make the model satisfy all responsible AI and fairness requirements if the data and their annotations are of insufficient quality.

Bibliography

Hryniewska, W., Bombiński, P., Szatkowski, P., Tomaszewska, P., Przelaskowski, A., & Biecek, P. (2021). Checklist for responsible deep learning modeling of medical images based on COVID-19 detection studies. Pattern Recognition, 118, 108035. https://doi.org/10.1016/j.patcog.2021.108035

Nguyen, H. Q., Lam, K., Le, L. T., Pham, H. H., Tran, D. Q., Nguyen, D. B., Le, D. D., Pham, C. M., Tong, H. T. T., Dinh, D. H., Do, C. D., Doan, L. T., Nguyen, C. N., Nguyen, B. T., Nguyen, Q. V., Hoang, A. D., Phan, H. N., Nguyen, A. T., Ho, P. H., … Vu, V. (2020). VinDr-CXR: An open dataset of chest X-rays with radiologist’s annotations. http://arxiv.org/abs/2012.15029

Weronika Hryniewska

PhD Student at the Warsaw University of Technology. Interested in XAI and DL in medicine.