Prevention is better than cure: a case study of abnormality detection in chest X-ray images
In this article, we would like to demonstrate how a series of simple tests for data imbalance exposes faults in the data acquisition and annotation process. We analyzed in detail a single use case — a Kaggle competition related to the detection of abnormalities in X-ray lung images.
Complex models are able to learn artifacts and it is difficult to remove this bias during or after the training. Errors made at the data collection stage make it difficult to validate the model correctly.
Problems in the training set can be divided into two groups: problems related to consistency among radiologists, and data quality problems.
Inconsistency among radiologists
Unequal division of annotation work between radiologists
As visible in the figures above, the radiologists can be divided into three groups.
The first group, R8-R10, worked on the same part of the X-ray dataset and annotated most of the images, both with and without findings. Each of them annotated more than 6,000 images, and together these three radiologists annotated 95% of all findings detected in the dataset.
The next group, R1-R7, detected almost no lesions (R2 found 3; the rest found none).
The last group, R11-R17, each annotated fewer than 2,000 images, with a high fraction of ‘no findings’ images.
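Checks like the ones above require nothing more than simple aggregations over the annotation table. A minimal sketch, assuming a table with one row per annotation and the column names ‘image_id’, ‘rad_id’, and ‘class_name’ (these names and the toy data are illustrative, not the competition’s exact schema):

```python
import pandas as pd

# Toy annotation table mimicking the assumed format: one row per annotation.
ann = pd.DataFrame({
    "image_id":   ["a", "a", "b", "b", "c", "c"],
    "rad_id":     ["R8", "R9", "R8", "R10", "R1", "R11"],
    "class_name": ["Cardiomegaly", "Cardiomegaly", "Consolidation",
                   "No finding", "No finding", "No finding"],
})

# How many distinct images each radiologist annotated.
images_per_rad = ann.groupby("rad_id")["image_id"].nunique()

# How many actual findings (non-'No finding' rows) each radiologist made.
findings = ann[ann["class_name"] != "No finding"]
findings_per_rad = findings.groupby("rad_id").size()

print(images_per_rad.to_dict())
print(findings_per_rad.to_dict())
```

Sorting `findings_per_rad` and comparing it against `images_per_rad` is enough to reveal the three groups described above.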
Unclear annotation rules
When the class labels given by different radiologists for the same image are compared, the consistency is remarkably low. In group R8-R10 (the radiologists who annotated 95% of all findings), a radiologist agreed with both colleagues on all classes in only 46% of images.
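Such an agreement figure can be computed by comparing per-image label sets across annotators. A sketch under the same assumed column names as before (toy data; a real check would run over the full annotation table):

```python
import pandas as pd

# Two images, each labeled independently by three radiologists.
ann = pd.DataFrame({
    "image_id":   ["a"] * 3 + ["b"] * 3,
    "rad_id":     ["R8", "R9", "R10"] * 2,
    "class_name": ["Cardiomegaly", "Cardiomegaly", "Cardiomegaly",
                   "Consolidation", "Infiltration", "Consolidation"],
})

# The set of class labels each radiologist assigned to each image.
label_sets = ann.groupby(["image_id", "rad_id"])["class_name"].agg(frozenset)

# An image counts as 'agreed' only if all annotators produced identical sets.
agreed = label_sets.groupby("image_id").nunique() == 1
print(agreed.mean())  # fraction of fully agreed images; here 0.5
```

Comparing label *sets* rather than individual labels matters: two radiologists may agree on one finding while disagreeing on a second one in the same image.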
Different label for the same pathology
Another effect of unclear annotation rules is significantly overlapping definitions of anomalies. The classes ILD and pulmonary fibrosis strongly overlap, as do consolidation and infiltration. The most vivid example is “lung opacity”, which covers six other classes!
If such relations are not clearly stated at the beginning of the annotation process, inconsistency is inevitable.
Lesions present in images labeled ‘no findings’
Our expert radiologist analyzed 10 randomly selected images annotated by each of the seventeen radiologists (R1-R17). He found abnormalities in some of the images annotated as ‘no findings’. The exact numbers are shown in the table below.
One bounding box for all lesions of the same type, or one for each lesion
Some radiologists use a single box to cover a few anomalies; others mark each anomaly separately.
This influences model quality. The metric chosen for the competition, mAP at IoU 0.40, means that a predicted bounding box must overlap the ground-truth box with an intersection over union of at least 0.40. The problem is: if the radiologists’ annotations (the ground truth) are themselves inconsistent at this threshold, how can an AI model trained on such noisy labels achieve a good result?
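The effect of merged boxes on the metric is easy to see numerically. A minimal sketch (the box coordinates are invented for illustration): if one radiologist draws a single box over two lesions and the model predicts an accurate box around just one of them, the overlap can fall below the 0.40 threshold and the correct prediction is scored as a miss.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))  # overlap width
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))  # overlap height
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

# One merged ground-truth box covering two lesions vs a tight prediction
# around only one of them.
gt_merged = (0, 0, 100, 100)   # single box drawn over both lesions
pred      = (0, 0, 50, 50)     # accurate box around one lesion
print(iou(gt_merged, pred))    # 0.25 -> counted as a miss at IoU 0.40
```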
Data quality
Lesion localization imbalance
In the database, there are 14 annotated anomaly classes. In regular clinical practice, all except two (aortic enlargement and cardiomegaly) are distributed similarly on both sides: there should be a similar number of lesions in the right lung as in the left one. However, the heatmaps in the figure below, which should show the anomalies appearing symmetrically in both lungs, reveal that this is not the case.
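A simple left/right tally is a cheap proxy for such heatmaps. A sketch, assuming bounding-box columns ‘x_min’/‘x_max’ and a known image width (names and toy values are illustrative): each box is assigned to a lung side by comparing its horizontal center to the image midline.

```python
import pandas as pd

# Toy bounding boxes; in real data these come from the annotation table
# and the image width from the DICOM header.
boxes = pd.DataFrame({
    "class_name": ["Nodule/Mass", "Nodule/Mass", "Infiltration"],
    "x_min":     [100, 1400, 200],
    "x_max":     [300, 1700, 500],
    "img_width": [2048, 2048, 2048],
})

center = (boxes["x_min"] + boxes["x_max"]) / 2
# Note: the patient's right lung appears on the left half of a frontal X-ray.
boxes["side"] = center.lt(boxes["img_width"] / 2).map(
    {True: "right lung", False: "left lung"})

counts = boxes.groupby(["class_name", "side"]).size()
print(counts)
```

A strongly skewed `counts` for a class that should be symmetric is a red flag for annotation or acquisition bias.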
Children present in the dataset
In the training dataset, there are 107 images of children (ages 1–17). This might be a problem, as child anatomy differs from adult anatomy (e.g., the shape of the heart, mediastinum, and bone structure), and so do technical aspects of a child’s X-ray, such as the position of the hands (Hryniewska, 2021). The model might learn such relationships. As children are not small adults, these images should be removed so as not to introduce additional noise during model training.
According to (Nguyen, 2020), pediatric X-rays should have been removed during the data filtering step, but we found that some were accidentally left in.
Two monochromatic color spaces
Another valid concern is Photometric Interpretation, the DICOM attribute that specifies the intended interpretation of the image pixel data. Some images are of type MONOCHROME1 (17%) and the rest of type MONOCHROME2. The difference is that in the first case the lowest pixel value is interpreted as white, and in the second case as black. If this is not taken into consideration, it may degrade model performance.
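The fix is to convert everything to one convention before training. A minimal sketch that inverts MONOCHROME1 pixel arrays so the whole dataset follows the MONOCHROME2 convention (in practice the interpretation string would be read from the DICOM header, e.g. via pydicom’s `PhotometricInterpretation` attribute; the array here is a toy example):

```python
import numpy as np

def normalize_photometric(pixels: np.ndarray, interpretation: str) -> np.ndarray:
    """Return pixels in the MONOCHROME2 convention (lowest value = black).

    'interpretation' is the value of the DICOM PhotometricInterpretation tag;
    MONOCHROME1 images are inverted so the dataset becomes consistent.
    """
    if interpretation == "MONOCHROME1":
        return pixels.max() - pixels
    return pixels

raw = np.array([[0, 100], [200, 255]], dtype=np.uint16)
print(normalize_photometric(raw, "MONOCHROME1"))  # [[255 155] [ 55   0]]
```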
Missing or wrong metadata in DICOMs
Some images suffer from missing metadata in the extensive DICOM header, most often the patient’s age or sex.
68% of the observations do not have information about age, and 17% about sex. The sex parameter is set to O (other) for 34% of the images. The rest of the dataset is fairly balanced (M: 26%, F: 23%).
There are many instances where the age equals 0 or far exceeds 100 (e.g., 238). This leaves only 25% of the images with valid ages between 1 and 99.
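Such a sanity check can be scripted against the DICOM age field, which is encoded as a string like “045Y” (the 1–99 validity range mirrors the check described above; the function name is ours):

```python
import re

def parse_patient_age(value):
    """Parse a DICOM PatientAge string such as '045Y' into years.

    Returns None for missing, malformed, or implausible values
    (0 or above 99), mirroring the sanity check described above.
    """
    m = re.fullmatch(r"(\d{1,3})([DWMY]?)", value or "")
    if not m:
        return None
    n, unit = int(m.group(1)), m.group(2) or "Y"
    years = n if unit == "Y" else 0   # D/W/M-coded ages are under one year
    return years if 1 <= years <= 99 else None

print(parse_patient_age("045Y"))   # 45
print(parse_patient_age("238Y"))   # None (implausible)
print(parse_patient_age(""))       # None (missing)
```

Applying this to every header and counting the `None` results reproduces the kind of statistics quoted above.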
The lack of reliable information about age or sex is unfavorable because such attributes might be correlated with certain diseases, or with having a disease at all. For example, younger people have a significantly lower probability of having lesions than older people.
Parts of clothing present in the X-rays
Undesirable artifacts, shown in the figures below, can easily be avoided during image acquisition by asking the patient to remove any items of clothing that may influence X-ray imaging, for example chains, bras, and clothes with buttons or zippers. If artifacts cannot be prevented, they can be removed during image preprocessing, before the image is shown to the model.
Letters present in the X-rays
Letters and annotations present in some lung images should be removed during preprocessing to prevent a neural network from learning those patterns. The model should learn to differentiate labels by focusing on image features, not on text printed in the images.
Conclusions
The quality of a model is inherently bound to the quality of the data on which it is trained. Development of a reliable model should begin with data acquisition and annotation. At the model development stage, we cannot make the model fulfill all responsible AI and fairness rules if the data and their annotations are of insufficient quality.
Bibliography
Hryniewska, W., Bombiński, P., Szatkowski, P., Tomaszewska, P., Przelaskowski, A., & Biecek, P. (2021). Checklist for responsible deep learning modeling of medical images based on COVID-19 detection studies. Pattern Recognition, 118, 108035. https://doi.org/10.1016/j.patcog.2021.108035
Nguyen, H. Q., Lam, K., Le, L. T., Pham, H. H., Tran, D. Q., Nguyen, D. B., Le, D. D., Pham, C. M., Tong, H. T. T., Dinh, D. H., Do, C. D., Doan, L. T., Nguyen, C. N., Nguyen, B. T., Nguyen, Q. V., Hoang, A. D., Phan, H. N., Nguyen, A. T., Ho, P. H., … Vu, V. (2020). VinDr-CXR: An open dataset of chest X-rays with radiologist’s annotations. http://arxiv.org/abs/2012.15029