Human-in-the-loop Machine Learning

Marco Brambilla
Off-the-grid: digital vs. physical
Dec 12, 2017

A crucial requirement for supervised machine learning is access to good data. An equally important need is to feed the training phase with appropriate input from domain experts.

The crucial question then is:

If you have access to a highly valuable domain expert, how can you maximize the value you extract from their time?

As a typical example of a supervised approach, we pick classification. Here are some specific use cases where domain experts are put to good use.

Labeling

Active learning is a good solution: you ask the expert to label the samples that are most likely to help with the classification problem. For instance, you ask for labels on the items that are closest to the decision boundary. However, this kind of approach tends to maximize accuracy, but not necessarily recall. This is particularly true when you are searching for rare items (i.e., minority classes, or extreme class imbalance): in that setting, random selection of samples actually works better in terms of recall.
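To make the idea of "closest to the decision boundary" concrete, here is a minimal sketch of uncertainty sampling for a binary classifier, next to the random-selection baseline mentioned above. It assumes you already have a predicted positive-class probability for each unlabeled item; the function names and the example scores are hypothetical and only serve as illustration.

```python
import random


def uncertainty_sampling(probabilities, k):
    """Return the indices of the k items whose predicted positive-class
    probability is closest to 0.5, i.e. nearest the decision boundary."""
    ranked = sorted(range(len(probabilities)),
                    key=lambda i: abs(probabilities[i] - 0.5))
    return ranked[:k]


def random_sampling(n_items, k, seed=0):
    """Baseline: pick k distinct items uniformly at random."""
    rng = random.Random(seed)
    return rng.sample(range(n_items), k)


# Hypothetical classifier scores for six unlabeled items.
scores = [0.02, 0.48, 0.91, 0.55, 0.10, 0.70]
ask_expert = uncertainty_sampling(scores, 2)   # items 1 and 3
baseline = random_sampling(len(scores), 2)
```

Note that the uncertainty-based picker never looks at items the model is already confident about, which is one way to see why a rare class hiding in a "confident" region may never reach the expert.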

Redefining Class Labels

Experts can also help clarify whether two different classes must absolutely be kept apart, or whether they should or could be merged. This can be done through a matrix with one row and one column per class, where each cell specifies the constraint for that pair of classes as a value (-1 = absolutely keep the classes apart; +1 = merge the classes). You then apply expectation maximization over the overall system of classes.

This is called constraint-based classification, and it’s actually a semi-supervised method for rethinking class definitions.
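As a rough illustration of how the expert's answers could be encoded, the sketch below builds the constraint matrix and derives the merged class groups from it. Two things here are assumptions rather than part of the method described above: the value 0 is used for "no constraint given", and the merge step is a plain union-find over the +1 cells, not expectation maximization. The class names are hypothetical.

```python
def build_constraints(classes, merge_pairs, apart_pairs):
    """Symmetric matrix: +1 = merge, -1 = keep apart, 0 = no constraint (assumed)."""
    index = {name: i for i, name in enumerate(classes)}
    n = len(classes)
    matrix = [[0] * n for _ in range(n)]
    for a, b in merge_pairs:
        matrix[index[a]][index[b]] = matrix[index[b]][index[a]] = 1
    for a, b in apart_pairs:
        matrix[index[a]][index[b]] = matrix[index[b]][index[a]] = -1
    return matrix


def merge_classes(classes, matrix):
    """Group classes connected by +1 cells; fail if a -1 cell is violated."""
    n = len(classes)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if matrix[i][j] == 1:
                parent[find(j)] = find(i)
    for i in range(n):
        for j in range(i + 1, n):
            if matrix[i][j] == -1 and find(i) == find(j):
                raise ValueError(
                    f"conflicting constraints for {classes[i]} and {classes[j]}")
    groups = {}
    for i, name in enumerate(classes):
        groups.setdefault(find(i), []).append(name)
    return list(groups.values())


classes = ["cat", "kitten", "dog", "wolf"]
constraints = build_constraints(classes,
                                merge_pairs=[("cat", "kitten")],
                                apart_pairs=[("dog", "wolf")])
merged = merge_classes(classes, constraints)
```

With these example constraints, "cat" and "kitten" end up in one group while "dog" and "wolf" stay separate. A real constraint-based classifier would treat the cells as soft preferences and let the data weigh in, which is where the expectation maximization step comes in.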

(Very) Noisy Labels

A weakness of classification is that it relies on domain experts understanding the data. But in some fields (for instance, detecting problems in medical images such as MRI scans), only a small fraction (say 20–30%) of problems can be detected even by expert doctors. This is critical, because the success rate of surgery may drop by 50% when the relevant features are not visually detected on the MRI scans. In this case you have a problem of wrong, missing, or noisy labels. Again, you can apply a semi-supervised technique.
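One simple way to move from noisy labels to a semi-supervised setting is to flag the labels that look suspicious and treat those items as unlabeled. The sketch below does this with a nearest-neighbor vote. This is an illustrative assumption, not the technique presented in the keynote, and the points, labels, and function names are all hypothetical.

```python
def flag_suspect_labels(points, labels, k=3):
    """Return indices whose label disagrees with the majority label
    of their k nearest neighbors (squared Euclidean distance)."""
    suspects = []
    for i, p in enumerate(points):
        others = [j for j in range(len(points)) if j != i]
        others.sort(key=lambda j: sum((a - b) ** 2
                                      for a, b in zip(p, points[j])))
        votes = [labels[j] for j in others[:k]]
        majority = max(set(votes), key=votes.count)
        if majority != labels[i]:
            suspects.append(i)
    return suspects


def to_semi_supervised(labels, suspects, unlabeled=None):
    """Replace suspicious labels with an 'unlabeled' marker."""
    suspect_set = set(suspects)
    return [unlabeled if i in suspect_set else y
            for i, y in enumerate(labels)]


# Two tight clusters; item 3 sits in the first cluster but carries label 1.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
          (1.0, 1.0), (1.1, 1.0), (1.0, 1.1), (1.1, 1.1)]
labels = [0, 0, 0, 1, 1, 1, 1, 1]

suspects = flag_suspect_labels(points, labels, k=3)
cleaned = to_semi_supervised(labels, suspects)
```

The cleaned label list can then be handed to any semi-supervised learner that accepts unlabeled items. With binary labels, an odd k avoids tied votes.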

This story is inspired by a keynote speech by Carla E. Brodley, from Northeastern University, given at the IEEE BigData Conference 2017.
