Stacking model-based classifiers for dealing with multiple sets of noisy labels

Giulia Montani
Data Reply IT | DataTech
9 min read · Jun 19, 2024

In the healthcare field, ground truth labels are difficult to obtain due to resource limitations and ambiguity in the data. Manual labeling is common in this field; relevant examples include, but are not limited to, medical images such as X-rays, MRIs, and histopathology slides. Medical professionals may encounter cases where a patient’s symptoms or test results do not fit neatly into a predefined diagnostic category, making it difficult to assign a definitive label. While human experts bring invaluable domain knowledge and expertise to the labeling process, the procedure introduces errors and inconsistencies, commonly referred to as noise, thus resulting in multiple sets of noisy labels.

To solve the classification task by leveraging the availability of multiple data annotators, a novel density-based ensemble methodology is introduced, constructed by combining model-based classifiers trained separately on the single sets of noisy labels. The next section illustrates the notation and the framework used.

Ensemble model-based discriminant analysis

Ensemble model procedure

In a multi-class classification problem with multiple annotators, each data point is assigned one of the G classes by each of the M annotators, resulting in M potentially distinct labels for each observation. Considering a set of N observations, denote with x = {x1, …, xN} the feature matrix and with
ym = {y1m, …, yNm} the set of noisy labels provided by the m-th annotator. Operationally, the ensemble model is constructed as follows:

  • Step 1 — Fit the base learners
    The base learners of the ensemble model are obtained by employing a model-based discriminant analysis approach to build a classifier for each annotator’s training set {x, ym}, m = 1, …, M. Assuming a Gaussian distribution for each class, training the model involves estimating via maximum likelihood (ML) the parameters of the multivariate normal distributions for each class: the means, the covariance matrices, and the mixing proportions.
  • Step 2 — Weighted average of the base learners’ parameters
    Assuming the weight associated with each annotator is known, compute the ensemble parameters as convex combinations of those obtained by fitting separate models for each annotator.
  • Step 3 — Classification
    Update the sample labels with the Maximum a Posteriori (MAP) rule (see the sketch after this list).
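
As a minimal sketch of these steps in formulas, assuming standard notation for Gaussian model-based classification (the exact parameterization may differ from the original paper): Step 1 yields ML estimates of the mixing proportions, means, and covariance matrices for each annotator m; Step 2 combines them with weights w1, …, wM; Step 3 relabels via MAP.

```latex
% Step 2: ensemble parameters as convex combinations of the
% per-annotator ML estimates, with w_m >= 0 and \sum_m w_m = 1
\hat{\tau}_g = \sum_{m=1}^{M} w_m \hat{\tau}_g^{(m)}, \quad
\hat{\mu}_g = \sum_{m=1}^{M} w_m \hat{\mu}_g^{(m)}, \quad
\hat{\Sigma}_g = \sum_{m=1}^{M} w_m \hat{\Sigma}_g^{(m)}

% Step 3: MAP rule, with \phi the multivariate Gaussian density
\hat{y}_i = \arg\max_{g = 1, \dots, G} \;
\frac{\hat{\tau}_g \, \phi(x_i ; \hat{\mu}_g, \hat{\Sigma}_g)}
     {\sum_{h=1}^{G} \hat{\tau}_h \, \phi(x_i ; \hat{\mu}_h, \hat{\Sigma}_h)}
```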

The final output of the procedure includes the set of ensemble parameters and the estimated posterior probabilities. The stacked model-based classifier can then be readily employed for predicting the class of an unlabeled observation via the MAP rule. We have outlined the steps necessary for building the ensemble model under the assumption of known weights. In practical situations, the weights need to be selected. Various strategies for this purpose are proposed in the following section.

How to determine annotators’ weights

In this section, four distinct strategies to generate weights are proposed. Annotators’ expertise is the key factor in ensuring that higher weights are assigned to highly skilled annotators. Two solutions require a-priori information, such as partial knowledge of the ground-truth labels or of the annotators’ level of expertise. By contrast, the remaining two approaches are entirely data-driven. The data-driven computation of the weights serves as a proxy for annotators’ expertise, thus automatically inferring it as a by-product of the learning process.

Partial knowledge of ground truth labels

In the medical field, partial availability of ground truth labels frequently arises when a group of patients undergoes specialized and informative diagnostic exams in addition to manual labeling by the annotators. Nonetheless, these exams may be invasive or time-consuming, so much so that only a subset of the training set is subjected to them, leading to an incomplete view of the ground truth labels.

In this context, the weight generation procedure is approached using a scoring system (Algorithm 1). It counts the number of agreements between each annotator’s labeling and the true class, resulting in M integers that represent the associated scores s = {s1, …, sM}. A high score indicates that an annotator has consistently assigned the correct label, thereby reflecting his/her expertise in the labeling process.
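
A minimal Python sketch of this scoring system, assuming an (N, M) label matrix; the function name and array layout are illustrative, not the article’s Algorithm 1 verbatim:

```python
import numpy as np

def partial_ground_truth_weights(labels, true_labels, known_idx):
    """Score each annotator by agreement with the partially known
    ground truth, then normalize the scores into convex weights.

    labels:      (N, M) array, labels[i, m] is annotator m's label for unit i
    true_labels: (N,) array, valid only at positions in known_idx
    known_idx:   indices of the units whose true label is available
    """
    # Score s_m: number of agreements with the true class
    scores = (labels[known_idx, :] == true_labels[known_idx, None]).sum(axis=0)
    # Normalize so the weights sum to one (convex combination in Step 2)
    return scores / scores.sum()
```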

Knowledge of expertise level

The choice of weights outlined in this section is based on a-priori knowledge of the annotators’ level of expertise. The approach leverages this information to assign unequal weights that reflect the expertise level.
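
As an illustrative choice (not prescribed by the article), stated expertise levels e1, …, eM could simply be normalized into convex weights:

```latex
w_m = \frac{e_m}{\sum_{h=1}^{M} e_h}, \qquad e_m > 0
```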

This approach allows for an informed decision-making process in which the opinions of annotators with different levels of expertise are appropriately taken into account. Nonetheless, since the expertise level is stated rather than estimated, it can be subjective and prone to biases. For this reason, some data-driven approaches are proposed in the following subsections.

Majority Voting

The following approach to weight determination employs a data-driven strategy based on the principle of majority voting (MV). MV is a decision-making approach that revolves around selecting the class with the highest number of votes. Our approach draws parallels with the standard MV framework while focusing on a distinct objective: instead of reducing the multiple sets of labels to a single one through MV, we leverage MV at an earlier stage, namely during the determination of the weights to be assigned to each annotator. We again make use of a scoring system that ultimately leads to the identification of the more reliable annotators. In detail, for each training unit, we first identify the supposedly true label (denoted as the majority label) by applying majority voting to the labels provided by the M annotators. Having determined the set of majority labels, each annotator is assigned a score reflecting the frequency with which his/her label equals the majority label.

Even if there is substantial disagreement among annotators, MV would still compute a majority label, without accounting for the uncertainty introduced by the conflicting annotations. To mitigate this drawback, one option is to set a threshold for the minimum percentage of agreement required to actually compute the majority label; this threshold can vary depending on the application and the desired level of agreement.
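
A minimal Python sketch of this weighting scheme, including the optional agreement threshold (the function name and default threshold are illustrative assumptions):

```python
import numpy as np

def majority_voting_weights(labels, min_agreement=0.5):
    """Derive annotator weights from agreement with the majority label.

    labels:        (N, M) integer array of the M annotators' labels
    min_agreement: minimum fraction of votes required for a majority
                   label to be considered valid; units below the
                   threshold are excluded from the scoring
    """
    n, m = labels.shape
    scores = np.zeros(m)
    for i in range(n):
        values, counts = np.unique(labels[i], return_counts=True)
        top = counts.argmax()
        # Skip units whose agreement falls below the required threshold
        if counts[top] / m < min_agreement:
            continue
        # Unit score for every annotator matching the majority label
        scores += (labels[i] == values[top])
    # Frequency of agreement, normalized into convex weights
    return scores / scores.sum()
```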

Iterative algorithm

The final procedure presented for assessing the annotators’ contribution to the ensemble learner involves an iterative algorithm, where weights and sample labels are sequentially refined until no more changes occur.

The solution begins by training the M base learners and computing the initial set of labels y0 = {y01, …, y0N} using majority voting. Subsequently, the iterative phase alternates a weight generation step through a scoring system with the re-estimation of the ensemble model, which is then used to compute a new set of labels yt at iteration t via the MAP rule (see Algorithm 2).
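
A Python sketch of this loop, under stated assumptions: classes are coded 0, …, G-1, and `fit_ensemble` and `map_rule` are placeholders for the estimation steps described above, not functions from the article.

```python
import numpy as np

def iterative_weights(x, labels, fit_ensemble, map_rule, max_iter=100):
    """Alternate weight generation and ensemble re-estimation until the
    labels stop changing (a sketch in the spirit of Algorithm 2).

    labels:       (N, M) array of the annotators' noisy labels
    fit_ensemble: callable (x, labels, weights) -> ensemble parameters
    map_rule:     callable (x, params) -> (N,) array of MAP labels
    """
    n, m = labels.shape
    # Initialization: majority voting gives the starting labels y^0
    y = np.array([np.bincount(labels[i]).argmax() for i in range(n)])
    for _ in range(max_iter):
        # Weight generation: score each annotator against current labels
        scores = (labels == y[:, None]).sum(axis=0).astype(float)
        weights = scores / scores.sum()
        # Re-estimate the ensemble model and relabel via the MAP rule
        params = fit_ensemble(x, labels, weights)
        y_new = map_rule(x, params)
        if np.array_equal(y_new, y):  # stop when no label changes
            break
        y = y_new
    return weights, y
```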

In its current form, the method comes with a drawback: the uncertainty of the estimated labels is not taken into account. To address this issue, a different scoring system is also investigated where, instead of assigning a unit score for every agreement, we make direct use of the probability of class assignment estimated for the m-th annotator. Using probabilities instead of integer values enables the distinction between varying levels of confidence among annotators.
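
A minimal sketch of the probability-based variant, assuming the base learners expose an (N, G, M) array of estimated class posteriors (an illustrative layout, not the article’s notation):

```python
import numpy as np

def soft_scores(posteriors, y):
    """Probability-based scoring: instead of a unit score per agreement,
    accumulate the class-assignment probability each base learner gives
    to the current label.

    posteriors: (N, G, M) array, posteriors[i, g, m] is the probability
                of class g for unit i under annotator m's model
    y:          (N,) array of current labels
    """
    n = len(y)
    # Sum, over units, the probability each annotator's model assigns
    # to the current label y_i; higher confidence yields higher scores
    return posteriors[np.arange(n), y, :].sum(axis=0)
```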

Application

The considered real-data application concerns the identification of gastrointestinal lesions in regular colonoscopic videos. In detail, a group of M = 7 clinicians (4 experts and 3 novices) was tasked with reviewing the recordings of N = 76 patients, providing assessments to determine whether the lesions were benign or malignant. The dataset is publicly available in the University of California Irvine Machine Learning repository.

The study aims to identify particular lesions called polyps. There are three distinct types of polyps: hyperplastic, adenoma, and serrated adenoma. Adenomatous polyps are the type most likely to develop into cancer if left untreated. Hyperplastic polyps are typically benign, but the serrated subtype has a slightly increased potential for progression to cancer, especially when the polyps are large or found in certain locations of the colon.

Model evaluation

In this first analysis, the base learner employed is a particular model of the Eigenvalue Decomposition Discriminant Analysis (EDDA) family known as Linear Discriminant Analysis (LDA), which assumes equal covariance matrices among the classes.

The primary aim of the study is to maximize accuracy, ensuring the most precise classification of these lesions by leveraging the sets of noisy labels. The predictive performance is assessed through a training-test split: the training set has 50 observations, and the accuracy is calculated on the remaining data, which forms the test set. We repeat the analysis for 50 different train-test splits. The average accuracies and standard deviations for the base learners and the ensemble models are shown below:

Observing the second table, a clear pattern emerges: every ensemble model achieves higher accuracy than the base learners. The reasons why Majority Vote and Expert/Novice achieve slightly lower values arise from two different issues. For the former, the level of disagreement between the labels may be high, making the scoring system based on majority voting ineffective. For Expert/Novice, the weights are fixed based on pre-existing knowledge of annotator expertise, which might not be accurate. To further explore the uncertainties that have emerged regarding the annotators’ expertise, we undertake a comprehensive analysis of the outputs generated by the scoring systems with the following boxplot:

The Partial Ground Truth (PGT) strategy exhibits greater variability, which can be attributed to the fact that, in each simulation run, a subset of the data (approximately 10%) for which the true labels are known is randomly sampled. As a result, the PGT scores vary more due to the influence of this subset on the overall scoring distribution. Comparing the boxes obtained with PGT and the two iterative algorithms, which are the strategies with the highest accuracy, the distinction between the four experts on the left and two of the novices on the right (the first and the third) is evident. However, the second novice appears to be inaccurately identified as inexperienced. Given the significant difference highlighted by the iterative algorithm (ItAlg), and considering the relatively uniform scores of MV together with its relatively low accuracy, it was deemed necessary to reconsider the assumptions regarding the expertise level of the annotators. Thus, the Expert/Novice method is reevaluated, also considering the sixth annotator as an expert. This reassessment results in an increase in accuracy from 64% to 68%, which represents the best performance achieved in the analysis.

Conclusion

Through the real data application, notable improvements in predictive performance have been observed compared to employing single sets of noisy labels. It also emerged that, on the one hand, the Expert/Novice (EN) method outperforms the others when the experience level of the annotators is known; on the other hand, when this information is uncertain, it is better to rely on data-driven approaches. All in all, thanks to the data-driven weight generation procedures, we are not only able to improve the classification accuracy but, as a by-product of the learning process, we can also disentangle the degree of expertise of the annotators involved in the study.
