Why We Must Reframe the Language We Use to Talk About “Data Labelling”

Ross Young
Published in The Startup
Jan 10, 2021

The language often used to describe the activity of building datasets for supervised or semi-supervised machine learning can be reductionist: data is simply “labelled” (i.e., a feature input is matched to a label output). We argue instead that to label data for machine learning is to do much more than annotate and assign labels. Reframing the language around data labelling matters. The “small” vocabulary change from labelling to teaching goes a long way towards changing systems that currently reduce the impact and visibility of the humans who teach machine learning models.

As introduced in my first blog post, we are proponents of the concept of machine teaching, which considers how humans complete a task, how they interact with data to train models more efficiently, and how they infuse their subject matter expertise into model development. The latter can be subtle: it can be something as simple as observing dataset limitations, such as a poor data distribution or reduced noise resulting from limited representation. Consider a model that, in production, will be used to identify ice cream flavours from images. To train this model, humans are given a teaching task in which they are presented with an image of ice cream and asked to identify the flavours they observe (based on assumptions about the colour of the ice cream in the image, which has its own limitations that we won’t delve into for this example). While interacting with the data, teachers quickly recognize patterns such as the over-representation of one flavour: perhaps they regularly see vanilla or chocolate ice cream, but rarely see neapolitan.

Similarly, they may notice that the ice cream is always shown on a cone. They know from real-world experience, however, that it can be served in many other ways: in a bowl, in a sandwich, in an ice cream float, on a brownie, or as a banana split.

Identifying this noise (or lack thereof) in a dataset can be critical for ensuring that a model does not overfit to certain situations. After all, we don’t want our production model to only be able to identify vanilla ice cream on a cone. Insights from machine teaching are valuable precisely because they surface during the teacher’s interaction with the data, rather than being identified programmatically or statistically after a large-scale effort to assign label outputs has already been completed.
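To make the distribution point concrete, here is a minimal sketch (in Python, using an entirely hypothetical flavour dataset and an arbitrary threshold of our own choosing) of the kind of statistical audit that a teacher’s observation anticipates. The point is that a teacher notices the same imbalance while interacting with the data, before any such check is ever run.

```python
# A minimal sketch (not from the original post) of a post-hoc distribution
# check: count label frequencies in a hypothetical flavour dataset and flag
# classes that fall below an arbitrary share threshold.
from collections import Counter

# Hypothetical label outputs produced during a teaching task.
labels = ["vanilla", "chocolate", "vanilla", "vanilla",
          "chocolate", "vanilla", "neapolitan", "vanilla"]

counts = Counter(labels)
total = sum(counts.values())

for flavour, count in counts.most_common():
    share = count / total
    flag = "  <-- under-represented" if share < 0.15 else ""  # arbitrary cutoff
    print(f"{flavour:12s} {count:3d} ({share:5.1%}){flag}")
```

Running this on the toy data flags neapolitan at 12.5% of the dataset, exactly the kind of imbalance a teacher would have mentioned after a handful of examples.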

Valuing the role of machine teachers

Deep learning generally requires a large amount of labelled data, which is time consuming and expensive to acquire. Furthermore, as machine learning becomes more widely used, new applications can be slowed by low-data regimes where labelled data (and data in general) is limited. As always, there are trade-offs between cost, quality and speed when acquiring data.

This often leads to a reliance on outsourcing and crowdsourcing platforms that prioritize low-cost, high-speed acquisition of labels, often to the detriment of quality. The roles of annotators and crowd workers are typically considered menial, or not valuable enough to count as core IP in product development (since the work is often outsourced to exploitable labour markets, or even, in one example, to prisoners), and the workers themselves are often disenfranchised by platforms that limit the depth of their insight and contributions, and, similarly, their compensation.

A classic example of the pitfalls of these systems is ImageNet, created in 2007, which includes more than fourteen million photographs and associated crowdsourced labels. With a labelling environment and system that encouraged high-speed, low-cost procurement of labels, undervalued (and, as a result, unskilled) human interaction with the data produced label outputs that sacrificed quality, yielding a vast amount of biased and flawed data with long-lasting ramifications for the research community: an outcome recently (and infamously) exposed by ImageNet Roulette, which led to the removal of roughly half of the 1.2 million images categorized as people in the dataset.

It is important to note that because we were an internal team at Element AI, working alongside AI practitioners, subject matter experts and developers, we had the advantage of directly communicating insights and iterating on how we approached a problem. Many workers in similar roles are stripped of the ability to communicate insights by the system they are using, and stripped of the value of their insights and contributions by being reduced to piecemeal work.

We benefit, along with our AI practitioner colleagues, when we, as teachers, are compensated not for the speed of our work, but for our value as human experts who can identify the potential for bias from the insights and patterns we observe, and who can critically assess when an approach could be modified to better solve the problem at hand. When building systems that rely on human labour to collect or annotate data, it is critical to secure consent, compensate engagement fairly, and value and recognize contributions equally. Ghost work is real, and without addressing it in how systems are designed and how labour is accessed, it will continue unchecked, perpetuating biased models and flawed metrics of success.

Valuing human insights can improve how bias is detected (and mitigated) and how concept or data drift is recognized, ultimately leading to better training outcomes. Encouraging systems that are designed to value human teaching, by making visible how humans interact with data and assign outputs, will in turn lead to interfaces and visualization tools (we will discuss this more in our next blog post) that support insight capture and storage at both the data point and dataset level. A common concern with such a design is an increase in cost and a reduction in speed in the procurement of data; however, in the broader context of model development and deployment, these trade-offs are often negligible when the resulting model operates with predictable, explainable results in production.
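As a rough illustration of what “insight capture and storage at both the data point and dataset level” might look like in practice, here is a minimal Python sketch of a hypothetical schema. The names and fields are ours, invented for illustration; this is not a description of any existing tool or of how Element AI implemented it.

```python
# A minimal sketch of a hypothetical schema for capturing teacher insights
# alongside labels, at both the data-point and dataset level.
# Illustration only; not a description of any actual tooling.
from dataclasses import dataclass, field
from typing import List

@dataclass
class TeachingRecord:
    """A single label output plus the teacher's observations about that item."""
    item_id: str
    label: str
    insights: List[str] = field(default_factory=list)  # e.g. "served in a bowl, not a cone"

@dataclass
class TeachingSession:
    """Labelled items plus observations that apply to the dataset as a whole."""
    records: List[TeachingRecord] = field(default_factory=list)
    dataset_notes: List[str] = field(default_factory=list)  # e.g. "neapolitan is under-represented"

    def add_record(self, record: TeachingRecord) -> None:
        self.records.append(record)

# Usage: the teacher assigns a label and attaches an observation in the same step.
session = TeachingSession()
session.add_record(TeachingRecord("img_0042", "vanilla", ["always shown on a cone"]))
session.dataset_notes.append("no examples of ice cream in a bowl or sandwich")
```

The design choice worth noting is that insights are first-class data captured at the moment of interaction, rather than something reconstructed statistically after the labelling effort is over.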
