What to Expect When Non-Experts Train AI

Jonggi Hong
Sparks of Innovation: Stories from the HCIL
5 min read · May 23, 2020

Exploring non-experts’ perception of machine teaching through an object recognition task.

Illustration created by Elizabeth Beers (Department of Computer Science at UMD).

Machine learning is a popular and ever-advancing field that can impact every area of science. At its simplest, machine learning refers to teaching a model to perform some task using training data. The task can range from transforming images to recognizing motions to mimicking natural human speech. Common to all of these is that a model's success depends on the quality of its training data. Machine teaching, then, is the study of that teaching aspect: a task in which a teacher with knowledge trains a student with a set of examples, where both the teacher and the student can be either a machine or a person. The student might be a machine learning model, such as a speech recognizer, and the teacher might be a human researcher using a collection of speech data samples.

Machine teaching is typically conducted by experts in machine learning; however, it’s possible for a non-expert to teach a model through what’s called a teachable interface. Using such an interface allows someone to easily provide training examples to a model if they have domain knowledge — meaning, knowledge about the environment the examples come from. A teachable interface empowers AI-infused systems (a term coined by Amershi et al. for “systems that have features harnessing AI capabilities that are directly exposed to the end user”) by democratizing the training tasks and providing end-users with opportunities to customize the models to their own input.
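To make this concrete, here is a minimal sketch of how a teachable object recognizer might work under the hood, assuming a pretrained feature extractor and a nearest-neighbor classifier fit on a handful of user photos. The file names and labels are hypothetical, and this is not the interface used in our study.

```python
# Minimal teachable object recognizer sketch (illustrative, not the study interface).
# A pretrained network turns each user photo into a feature vector, and a
# nearest-neighbor classifier is fit on the few examples the user provides.
import torch
from torchvision import models, transforms
from PIL import Image
from sklearn.neighbors import KNeighborsClassifier

backbone = models.resnet18(weights="IMAGENET1K_V1")
backbone.fc = torch.nn.Identity()  # keep only the feature extractor
backbone.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def embed(path):
    """Turn one photo into a feature vector."""
    with torch.no_grad():
        x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
        return backbone(x).squeeze(0).numpy()

# Hypothetical user-provided training photos, a few per object class.
train_photos = {"mug": ["mug_01.jpg", "mug_02.jpg"],
                "keys": ["keys_01.jpg", "keys_02.jpg"]}
X = [embed(p) for label, paths in train_photos.items() for p in paths]
y = [label for label, paths in train_photos.items() for _ in paths]

classifier = KNeighborsClassifier(n_neighbors=1).fit(X, y)
print(classifier.predict([embed("new_photo.jpg")]))  # e.g., ['mug']
```

The user never touches the model itself; the only thing they control is which photos go into the training set, which is exactly why the quality and diversity of those examples matters so much.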

Though a teachable interface gives users control, the performance of an application built on one can suffer from users' misconceptions about collecting training examples for a classification model. The model is typically a black box to end-users, so it is hard for them to observe or anticipate how their teaching strategies affect its performance. Consequently, when collecting training examples, non-experts usually rely on their intuitions about teaching, which, given their lack of machine learning knowledge, may include misconceptions about how to train and test a model. A teachable interface should therefore be carefully designed to help end-users overcome these misconceptions, which requires understanding their machine teaching strategies. We conducted a behavioral analysis with end-users to identify these patterns.

How did we collect the data?

We recruited 100 participants from Amazon Mechanical Turk. The participants trained and tested an object recognizer by taking photos through an online teachable interface. To observe how their behavior changed with repeated use of the interface, participants were asked to train and test a model twice. To encourage them to build a robust object recognizer, participants were told they would receive extra compensation based on the recognizer's performance on a hidden test set.

Photos collected in our user study.

What attributes do we analyze?

We examined whether the participants diversified the size, location, perspective, and illumination of the photos. In addition, we investigated other attributes related to their photo-taking strategies, such as framing, background features, whether a hand appears in the photo, and whether the object's logo is visible. For the full list of attributes, see Tables 3, 4, and 5 in our paper.
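As an aside, the binary coding we used (described further below) can be pictured as a simple per-participant record of which variation dimensions appear at all in a training set. The sketch below is purely illustrative; the class and field names are ours, not part of the published coding scheme.

```python
# Illustrative sketch of binary coding of variation dimensions for one
# participant's training set (not the actual analysis code from the paper).
from dataclasses import dataclass

@dataclass
class TrainingSetCoding:
    varied_size: bool
    varied_location: bool
    varied_viewpoint: bool
    varied_illumination: bool

    def num_dimensions_varied(self) -> int:
        return sum([self.varied_size, self.varied_location,
                    self.varied_viewpoint, self.varied_illumination])

# Example: a participant who varied size and viewpoint but kept the
# location and lighting the same across all photos of an object.
p01 = TrainingSetCoding(varied_size=True, varied_location=False,
                        varied_viewpoint=True, varied_illumination=False)
print(p01.num_dimensions_varied())  # 2
```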

What are non-experts’ teaching and debugging strategies?

A majority of the participants were aware of the importance of diversity in a training set. Based on the analysis of photos, we observed that many participants (N=77) diversified the photos for an object. Most of them varied the size (N=65) or viewpoint (N=63). Around half varied the location (N=39) and about a quarter varied the illumination (N=19). However, only 11 of the 77 participants diversified all four variation dimensions that we measured. On the other hand, some participants (N=5) thought that eliminating diversity would make it easier for the machine to learn.

Do teaching strategies evolve through iteration?

Fewer participants varied the background in the second training attempt than in the first. The subjective feedback showed that some participants (N=22) did not want to change anything because the performance was good enough, and others (N=19) had no idea what to change. While some participants explicitly mentioned that they would incorporate more diversity in the second training attempt, the difference was not captured in our analysis. This may be because the change was too small, or because of a limitation of our coding scheme, which coded each variation as binary, indicating only whether a training set includes that variation at all.

Our suggestions for future teachable interfaces

The results of the user study help us better understand non-experts’ interactions with a teachable interface. We highlight the following insights for the design of future teachable interfaces:

Account for teaching strategies: Our findings show that non-experts mainly include the most discriminative features of the objects (i.e., logos) in the photos. At the same time, they incorporate variations in distance, background, viewpoint, and illumination to provide samples near the decision boundary. This parallels patterns in how humans generalize the visual features of objects (identified in a prior study). While the majority of the participants diversified the photos, many of them applied the variation inconsistently across the three objects, indicating that non-experts may use inconsistent training strategies across classes.

Anticipate misconceptions: Our analysis reveals some misconceptions about collecting training examples. One of the main misconceptions concerns consistency within a set of training examples. Some participants interpreted consistency as providing almost identical photos to the model. While repeating the same information can be effective for teaching a person, a lack of diversity is a problem for a machine learning model, causing it to make wrong predictions when the input deviates from the training samples. Other misconceptions involve how a machine characterizes objects. For example, some non-experts took photos showing the contents inside a container, and others made text readable, as if the recognizer could read it.

Help users craft evaluation examples: We observed that testing examples were less diverse than training examples, or not diverse at all. This contrasts with the strategies of machine learning experts, who typically try to test a system exhaustively with varied samples. It is not surprising that some participants did not want to change their strategies in the second training attempt, since they had not observed any problems with test samples that were similar to the representative samples in their training sets. Support for crafting evaluation examples should help users collect test samples that fit the purpose of the system. If the purpose is personalization, the test samples should be consistent with future use cases. If the purpose is building a model for general use or exploring machine learning concepts with non-experts, the support should focus on finding model-breaking examples that expose weaknesses in the model.
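As one hedged illustration of what such help could look like, the sketch below groups a user's test photos by a variation condition and reports accuracy per condition, so that a weak spot (say, dim lighting) becomes visible. The condition labels, file names, and the predict function are hypothetical, not features of our interface.

```python
# Illustrative sketch: per-condition accuracy over a user's test photos,
# to surface conditions where the model breaks (not code from the study).
from collections import defaultdict

def accuracy_by_condition(test_items, predict):
    """test_items: (photo, true_label, condition) triples; predict: photo -> label."""
    correct, total = defaultdict(int), defaultdict(int)
    for photo, true_label, condition in test_items:
        total[condition] += 1
        correct[condition] += int(predict(photo) == true_label)
    return {c: correct[c] / total[c] for c in total}

# Hypothetical usage with the classifier sketched earlier:
# report = accuracy_by_condition(
#     [("mug_far.jpg", "mug", "far away"),
#      ("mug_dark.jpg", "mug", "dim lighting"),
#      ("keys_hand.jpg", "keys", "held in hand")],
#     lambda p: classifier.predict([embed(p)])[0])
# A low score for "dim lighting" would point the user at a model-breaking case.
```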

This post is a summary of our CHI 2020 paper:
