Automating crowdsourcing quality control

Stepan Nosov
Published in Toloka Tech
Aug 5, 2022

In one of our previous posts we talked about evaluating your crowd-labeled data. Now that we're familiar with those concepts, let's discuss the next question: how do you improve the quality of your labeling? Toloka's customizable projects offer many settings that affect the results you get. Of course, you can simply edit the instructions for your project or set a higher price for tasks, and this will change how annotators behave on your tasks. But usually the better way is to use quality control tools such as:

  • Overlap: giving the same task to multiple annotators.
  • Filters: e.g., by demographic attributes or by the percentage of correct answers in a training pool.
  • Blocking by heuristic rules: e.g., an annotator answers too fast or makes too many mistakes on control tasks.

Using these features is usually a crucial part of getting high-quality data from crowdsourcing. For example, if you don't block annotators who answer too fast, some of them may just put random labels on your data to make money faster. This quickly leads to poor-quality data, so quality control really matters. Unfortunately, setting up quality control is usually quite a complicated task: you have several rules, and each rule can have several parameters, for example:
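To make this concrete, here is what a single rule looks like when configured through Toloka-kit: a collector that tracks accuracy on control ("golden set") tasks, threshold conditions, and a ban action. Every numeric value below (the 80% accuracy threshold, the minimum of 5 answered control tasks, the 1-day ban) is a made-up example of exactly the kind of parameter you have to choose.

```python
import toloka.client as toloka

# A typical quality control rule: ban annotators whose accuracy on
# control ("golden set") tasks drops below a threshold.
# All numeric values here are example parameters, not recommendations.
pool = toloka.Pool()  # assume the pool itself is configured elsewhere
pool.quality_control.add_action(
    collector=toloka.collectors.GoldenSet(history_size=10),
    conditions=[
        toloka.conditions.GoldenSetCorrectAnswersRate < 80,
        toloka.conditions.GoldenSetAnswersCount >= 5,
    ],
    action=toloka.actions.RestrictionV2(
        scope='PROJECT',
        duration=1,
        duration_unit='DAYS',
        private_comment='Low accuracy on control tasks',
    ),
)
```

Each such rule multiplies the number of knobs you have to tune, and the knobs interact with each other.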

In practice, there are usually two ways to find good quality control settings for your project:

  1. If you are an experienced crowdsourcing user, you can probably come up with reasonable parameters off the top of your head.
  2. Otherwise, you most likely just try different settings on small batches of data until you find the best option.

So, yes, usually you need to repeat a few boring steps to find a suitable solution. We at Toloka thought: why not do it automatically? In this post I will introduce AutoQuality, our automated solution for finding optimal quality control parameters for your Toloka project.

AutoQuality

AutoQuality treats finding optimal quality control settings as an optimization problem and solves it with random search. Each quality control rule is determined by numerical parameters: for example, for a rule that bans annotators based on their correctness on control tasks, the parameter might be a threshold value in percent. Random search means that we sample random quality control configurations and compare them. So the AutoQuality algorithm is (a conceptual code sketch follows the list below):

  • For each configurable parameter, define some distribution.
  • Sample parameters from this distribution.
  • Label a small batch of data with these parameters.
  • Pick the parameters that give the best quality/speed/cost trade-off.
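Here is a conceptual sketch of that loop. It is not the actual AutoQuality implementation: `label_batch` and `score` are hypothetical stand-ins for labeling a small batch in Toloka and for folding quality, speed, and cost into one number.

```python
import random

def random_search(param_distributions, n_trials, label_batch, score):
    """Conceptual sketch of a random search over quality control settings.

    param_distributions: dict mapping parameter name to a no-argument sampler.
    label_batch: hypothetical function that labels a small batch with given params.
    score: hypothetical function that folds quality, speed, and cost into one number.
    """
    best_params, best_score = None, float('-inf')
    for _ in range(n_trials):
        # Sample one quality control configuration.
        params = {name: sample() for name, sample in param_distributions.items()}
        results = label_batch(params)   # label a small batch of tasks
        trial_score = score(results)    # higher is better
        if trial_score > best_score:
            best_params, best_score = params, trial_score
    return best_params

# Hypothetical parameters and their distributions:
distributions = {
    'golden_set_accuracy_threshold': lambda: random.uniform(60, 95),
    'fast_response_threshold_seconds': lambda: random.randint(5, 60),
}
```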

AutoQuality creates a new pool (an autoquality pool) for each set of sampled parameters. All of these pools are exactly the same except for the quality control rules being optimized by the algorithm. Then you need to provide some of your tasks for labeling. AutoQuality runs several pools in parallel to speed up the algorithm. The cost of one AutoQuality launch depends on the project settings, but it is usually around $3–$15. The most important question is: how do we pick the best parameters? Our method is:

  1. Calculate a bunch of different metrics for each pool.
  2. Rank pools according to each of them.
  3. Use smart aggregation to find the best average rank.

First of all, we need to calculate quality metrics. For a classification task, for example, you could use consistency or accuracy on control tasks (you can read more about them in the article mentioned above). So we can just take the mean rank over these metrics, right? Well, yes, but actually no. The problem is that the pool with the best quality ranks is usually just the pool with very strict and expensive quality control, like overlap=10 and banning every annotator who makes a single mistake. That will probably give you high quality, but it will also give you high costs and a slow crowdsourcing process. So we added other metrics and ranks that penalize settings you wouldn't really want to use in practice, and aggregated all the ranks into a single final score.
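The exact aggregation formula is internal to AutoQuality, but conceptually the final score is a weighted combination of per-metric ranks, roughly:

\text{final\_rank}(p) = \sum_{k} w_{k} \cdot \text{rank}_{k}(p)

where p is an autoquality pool and rank_k(p) is its rank under the k-th metric, including both quality metrics (accuracy on control tasks, consistency) and the penalty metrics mentioned above (cost, speed, share of banned annotators, and so on). The pool with the best final rank wins.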

Using this approach, AutoQuality can provide a high-quality configuration which could be useful in real projects.

Experiments

Our idea sounds good, but how does it work in practice? We tested AutoQuality on our internal benchmarks, where it performed almost as well as professional crowdsourcing specialists: the quality decrease was less than 5%.

So, if you're a beginner at crowdsourcing, AutoQuality can help you find good enough initial settings for your project. But what if you're an experienced user? Can AutoQuality still be useful for you?

We conducted several experiments with Toloka's customers to find out how AutoQuality can help in their production crowdsourcing processes. We compared configurations created by requesters with configurations found by AutoQuality, and the results were really interesting. In some cases, AutoQuality was able to simplify the quality control settings with no decrease in quality, which allowed customers to save money or receive labeled data faster. In other experiments, AutoQuality found settings that brought more annotators to the same project. Sometimes these settings were really strange from a human point of view. For example, for one easy task, AutoQuality suggested using overlap=2 and a few basic filters, without any complex quality control rules at all. For the customer (and for us too) this was a rather unexpected and unintuitive result. But it worked! Our experiments showed that people tend to choose intuitive but biased quality control settings that don't necessarily work best. So one more use of AutoQuality is to help users look at their projects from a different angle.

Of course, for now the algorithm has some limitations:

  1. AutoQuality can only optimize simple parameters. It can’t cover multi-stage projects and complex cases.
  2. Quality control settings found might degrade over time as annotators adapt to them.
  3. The settings found might be "shortsighted", e.g., block annotators too often. Treat them as a first approximation and tune them later.

How to try it?

We've added an implementation of AutoQuality to Toloka-kit, our Python library for interacting with the Toloka API. You can find documentation about it in the Toloka-kit docs. Now let's look at how to use it. First of all, you need to install Toloka-kit with the additional requirements for AutoQuality:
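Assuming a standard pip setup; the extras name below is how the optional AutoQuality dependencies are usually pulled in, but check the Toloka-kit docs if your version differs.

```
pip install "toloka-kit[autoquality]"
```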

In this example we will use the IMDB Movie Reviews dataset. Let’s do some preparation and create a project:
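Here is a sketch of that preparation under a few assumptions: the reviews live in a local CSV with `review` and `label` columns, and the interface is a minimal Template Builder layout rather than the exact project from this post (the Template Builder helpers may also differ slightly between Toloka-kit versions).

```python
import pandas as pd
import toloka.client as toloka
import toloka.client.project.template_builder as tb

toloka_client = toloka.TolokaClient('YOUR_OAUTH_TOKEN', 'PRODUCTION')

# Load the IMDB reviews; the file name and column names are assumptions.
dataset = pd.read_csv('imdb_reviews.csv')  # columns: review, label

# Minimal labeling interface: show the review text, ask for pos/neg.
interface = toloka.project.TemplateBuilderViewSpec(
    view=tb.ListViewV1([
        tb.TextViewV1(tb.InputData('text')),
        tb.RadioGroupFieldV1(
            tb.OutputData('label'),
            [
                tb.GroupFieldOption('pos', 'Positive'),
                tb.GroupFieldOption('neg', 'Negative'),
            ],
            validation=tb.RequiredConditionV1(),
        ),
    ])
)

project = toloka.Project(
    public_name='Is the movie review positive or negative?',
    public_description='Read the review and decide whether it is positive or negative.',
    public_instructions='Select "Positive" if the review praises the movie, "Negative" otherwise.',
    task_spec=toloka.project.task_spec.TaskSpec(
        input_spec={'text': toloka.project.StringSpec()},
        output_spec={'label': toloka.project.StringSpec(allowed_values=['pos', 'neg'])},
        view_spec=interface,
    ),
)
project = toloka_client.create_project(project)
```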

AutoQuality requires you to use a training pool on your project:
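A sketch of creating and filling the training pool; the durations, counts, and hint text below are example values.

```python
training = toloka.Training(
    project_id=project.id,
    private_name='IMDB reviews training',
    may_contain_adult_content=False,
    assignment_max_duration_seconds=600,
    mix_tasks_in_creation_order=True,
    shuffle_tasks_in_task_suite=True,
    training_tasks_in_task_suite_count=10,
    task_suites_required_to_pass=1,
    retry_training_after_days=2,
    inherited_instructions=True,
)
training = toloka_client.create_training(training)

# Training tasks need known answers and a hint shown on mistakes.
# Column names (review, label) follow the assumed CSV layout above.
training_tasks = [
    toloka.Task(
        pool_id=training.id,
        input_values={'text': row.review},
        known_solutions=[toloka.task.BaseTask.KnownSolution(output_values={'label': row.label})],
        message_on_unknown_solution='Re-read the review: does it praise or criticize the movie?',
    )
    for row in dataset[:20].itertuples()
]
toloka_client.create_tasks(training_tasks, allow_defaults=True)
```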

After that, we need to prepare a base pool. This pool will be cloned by AutoQuality to several pools with different quality control settings:
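And a sketch of the base pool. It contains no hand-tuned quality control rules, on the assumption that AutoQuality will add the rules it is optimizing to the cloned pools; the reward, filter, and mixer settings are example values.

```python
import datetime

base_pool = toloka.Pool(
    project_id=project.id,
    private_name='IMDB reviews base pool',
    may_contain_adult_content=False,
    will_expire=datetime.datetime.utcnow() + datetime.timedelta(days=365),
    reward_per_assignment=0.02,
    assignment_max_duration_seconds=600,
    filter=toloka.filter.Languages.in_('EN'),
    defaults=toloka.Pool.Defaults(default_overlap_for_new_task_suites=3),
)
base_pool.set_mixer_config(
    real_tasks_count=4,
    golden_tasks_count=1,
    training_tasks_count=0,
)
base_pool.set_training_requirement(
    training_pool_id=training.id,
    training_passing_skill_value=75,
)
base_pool = toloka_client.create_pool(base_pool)
```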

For now, AutoQuality works with a subset of the most important and popular quality control rules: overlap, bans based on assignment submit time, bans based on consistency with the majority vote, bans based on accuracy on control tasks, and a filter by accuracy on training. Additionally, AutoQuality can find a threshold for a filter by accuracy on an exam. Basic usage is really simple: create an AutoQuality object, then call the setup_pools method, which will create the autoquality pools:
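A minimal sketch of that step. The constructor argument names below are assumptions and may differ between Toloka-kit versions, so check the library docs; only `setup_pools` is taken directly from the description above.

```python
from toloka.autoquality import AutoQuality

aq = AutoQuality(
    toloka_client=toloka_client,   # argument names are assumptions
    project_id=project.id,
    base_pool_id=base_pool.id,
    training_pool_id=training.id,
)

# Clone the base pool into several autoquality pools,
# each with its own sampled quality control settings.
aq.setup_pools()
```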

After that, you need to provide a batch of data to the algorithm. Usually AutoQuality requires 300–500 tasks. If you want to use control tasks in your project, you should also provide enough of them:
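A sketch of that step, continuing the example above. The `create_tasks` method name on the AutoQuality object is an assumption; the idea is simply that the provided tasks get distributed across the autoquality pools.

```python
# Regular tasks: AutoQuality usually needs a few hundred of them.
tasks = [
    toloka.Task(input_values={'text': row.review})
    for row in dataset[20:420].itertuples()
]

# Control (golden) tasks with known answers, since the base pool mixes them in.
golden_tasks = [
    toloka.Task(
        input_values={'text': row.review},
        known_solutions=[toloka.task.BaseTask.KnownSolution(output_values={'label': row.label})],
        infinite_overlap=True,
    )
    for row in dataset[420:470].itertuples()
]

# Method name is an assumption; see the Toloka-kit docs for the exact call.
aq.create_tasks(tasks + golden_tasks)
```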

So now you can just run AutoQuality. When it completes, check the best parameters found and compare them with the others. Finally, archive all the created pools if you don't want to see them in the UI.
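A sketch of those final steps; apart from `setup_pools`, the method and attribute names here are assumptions based on the description above, so verify them against the Toloka-kit docs.

```python
# Launch all autoquality pools and wait for them to finish.
aq.run()

# Inspect the winning configuration (attribute name is an assumption)
# and compare it with the metrics of the other pools.
print(aq.best_pool_params)

# Clean up: archive the temporary pools so they don't clutter the UI
# (method name is an assumption).
aq.archive_autoquality_pools()
```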

AutoQuality usage with the default settings is pretty easy, isn't it? But if you want more control over the process, you can customize the AutoQuality object. For example, let's make a launch cheaper by reducing the number of created pools:
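A sketch of a cheaper launch; the `n_iter` argument name (the number of autoquality pools to create) is an assumption.

```python
aq = AutoQuality(
    toloka_client=toloka_client,
    project_id=project.id,
    base_pool_id=base_pool.id,
    training_pool_id=training.id,
    n_iter=5,  # fewer pools means a cheaper launch, but a coarser search
)
```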

Also, you can set a custom distribution for any parameter:
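A sketch of overriding a distribution. It assumes the distributions are passed as a dict of scipy.stats distributions keyed by parameter name; both the argument name and the parameter key below are assumptions, so check the defaults shipped with Toloka-kit for the real format.

```python
import scipy.stats as sps

aq = AutoQuality(
    toloka_client=toloka_client,
    project_id=project.id,
    base_pool_id=base_pool.id,
    training_pool_id=training.id,
    # Hypothetical parameter key: the ban threshold (in %) for accuracy on
    # control tasks, sampled uniformly between 70 and 95.
    parameter_distributions={
        'golden_set_correct_answers_rate_threshold': sps.uniform(loc=70, scale=25),
    },
)
```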

You can also change the functions that calculate metrics or ranks. This is very powerful functionality: you can rewrite these functions to adapt AutoQuality to your project, for example if it is not a classification task. In our case, let's just modify the ranking function to give preference to cheaper pools. Don't forget to write your new rank to the main_rank column so that AutoQuality knows how to choose the best pool:
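A sketch of such a ranking function. It assumes the function receives a pandas DataFrame with one row per autoquality pool and metric columns such as accuracy on control tasks and cost, and that AutoQuality reads the final ordering from the main_rank column; the argument name `ranking_func`, the column names, and the "higher rank is better" convention are all assumptions.

```python
import pandas as pd

def cheap_first_ranks(metrics_df: pd.DataFrame) -> pd.DataFrame:
    """Rank autoquality pools, giving extra weight to cheaper ones.

    Assumes metrics_df has 'golden_set_accuracy' and 'cost' columns
    and that a larger rank value means a better pool.
    """
    ranks = pd.DataFrame(index=metrics_df.index)
    # Higher accuracy on control tasks gets a larger (better) rank.
    ranks['accuracy_rank'] = metrics_df['golden_set_accuracy'].rank(ascending=True)
    # Lower cost gets a larger (better) rank, counted twice to prefer cheap pools.
    ranks['cost_rank'] = metrics_df['cost'].rank(ascending=False)
    ranks['main_rank'] = (ranks['accuracy_rank'] + 2 * ranks['cost_rank']) / 3
    return ranks

aq = AutoQuality(
    toloka_client=toloka_client,
    project_id=project.id,
    base_pool_id=base_pool.id,
    training_pool_id=training.id,
    ranking_func=cheap_first_ranks,  # argument name is an assumption
)
```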

In conclusion

Let’s summarize the content of this post:

  1. Experiments demonstrate that automatic quality control configuration works.
  2. AutoQuality works by trying many parameter configurations and picking the one that produces the best annotation quality.
  3. We tested our method on internal benchmarks and on real data labeling tasks with external customers.
  4. Long-term efficiency of such parameter settings remains to be studied.

We will continue to develop AutoQuality. It still has some limitations that we plan to remove in the future: AutoQuality should become a more universal and powerful, but still easy-to-use, tool that simplifies working with quality control. We would really appreciate it if you tried it and shared your feedback.
