Prefect + Toloka: Painless Automation of Data Pipelines

Automate the creation of high-quality large datasets for ML

Evgeniya Sukhodolskaya
Toloka
8 min read · May 19, 2022


According to Andrew Ng, AI systems are made up of two parts: data and code, where the code part consists of machine learning algorithms and models. A lot of effort is already invested in the models — the variety of architectures for machine learning models is insanely huge and continuing to expand rapidly. It’s time to shift our attention to the data-centric approach as an equally promising direction for AI development.

Image by Mohamed Hassan from Pixabay

The data-centric approach pursues the same goal as the model-centric one — finding the best solution possible for an AI-related problem. However, it focuses on systematically improving the data — not the algorithm — to achieve the desired performance.

There are two major aspects of data that affect the outcome: quality and quantity. Both are extremely hard to achieve at a high level, because most available real-world data is scarce, noisy, inconsistent, or even incorrect, and new data is non-trivial and expensive to gather.

Fortunately, there are several powerful tools and techniques for collecting and labeling data that do a decent job of addressing both quality and quantity.

Crowdsourcing as an instrument of the data-centric approach

Image by Gordon Johnson from Pixabay

Humans are naturals at generating data — millions of terabytes are generated by people every single day. We are quite good at performing classification, data generation, and analysis tasks in fields like natural language processing and computer vision. We don’t even need any specific training, because we have naturally learned how to do these tasks since birth.

However, when it comes to gathering high-quality datasets in specific fields, the crowd force seems uncontrollable. The data produced by the crowd appears to have unmeasurable quality and to be riddled with fraud and unintentional mistakes. Frequently, the solution for companies is to hire trustworthy experts to collect and label data. Individuals facing this dilemma usually resort to doing the work themselves.

A more cost- and time-efficient way is to master the science of the crowd. The keys to success when using crowdsourcing involve decomposition of complex tasks, choosing the right data aggregation methods, and creating a precise system of quality control rules.

Toloka — a crowdsourcing platform

Toloka is a cloud-based data labeling platform that provides all the tools you need for designing a successful crowdsourcing task.

Image by author

Toloka’s main entities are project, pool, and task. Each project contains a description of one particular problem: instructions for the labelers and the task interface that they will see. A project is composed of pools, which are containers for atomic tasks. A pool defines the number of atomic tasks N that the labeler sees at once, the price for markup of these N tasks, filters that select a particular crowd with certain skills for this kind of problem, and a set of quality control rules (for instance, labelers who respond too fast are banned on the project). Typical tasks might ask labelers to classify an animal in a picture, upload a single conversation sample, or outline an object in an image.
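To make these entities more concrete, here is a minimal sketch of configuring a pool with Toloka-Kit, Toloka's open-source Python client (introduced in more detail below). The project ID, prices, and thresholds are placeholders, and the quality control rule mirrors the fast-response example above:

```python
import datetime

import toloka.client as toloka

toloka_client = toloka.TolokaClient('YOUR_OAUTH_TOKEN', 'PRODUCTION')

# A pool inside an existing project (IDs and prices are placeholders)
pool = toloka.Pool(
    project_id='12345',
    private_name='Animal classification, batch 1',
    may_contain_adult_content=False,
    will_expire=datetime.datetime.utcnow() + datetime.timedelta(days=30),
    reward_per_assignment=0.05,           # price for one page of N tasks
    assignment_max_duration_seconds=600,  # time limit per page
    defaults=toloka.Pool.Defaults(default_overlap_for_new_task_suites=3),
    filter=toloka.filter.Languages.in_('EN'),  # select English speakers
)
pool.set_mixer_config(real_tasks_count=10)     # N tasks shown at once

# Quality control rule: ban labelers who respond suspiciously fast
pool.quality_control.add_action(
    collector=toloka.collectors.AssignmentSubmitTime(
        history_size=5, fast_submit_threshold_seconds=20),
    conditions=[toloka.conditions.FastSubmittedCount > 1],
    action=toloka.actions.RestrictionV2(
        scope='PROJECT', duration=10, duration_unit='DAYS',
        private_comment='Responses are suspiciously fast'),
)

pool = toloka_client.create_pool(pool)
```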

Crowdsourcing pipelines and their challenges

Let’s say a requester — the author of a data labeling task — has a one-off project, like gathering a medium-sized dataset of facial expressions for an experiment. It would be convenient enough to manually create a few projects in the Toloka web interface. To cover the quantity aspect of the desired dataset, the requester could select a broad group of labelers, such as a crowd from multiple countries. To achieve the desired quality, they could use a validation crowdsourcing project, where a different group of skilled labelers will accept or decline items generated in the first project. Due to the one-time nature of the project, the requester could manually monitor the completion of the tasks, perform failure management, and download accepted results as soon as they are accessible.

However, a lot of production tasks require a continuous incoming flow of high-quality, up-to-date data in fields where the context changes rapidly. For example, in an advertising context, relevance and sentiment analysis play a huge role in successful targeting (just imagine seeing ads for discount plane tickets after reading a story about a plane crash). In that case, you can’t use a recommendation model trained on a frozen dataset. When you need to continually improve and enrich your data, manual management of crowdsourcing projects in the web application becomes inconvenient, time-consuming, and error-prone, and a huge need for automated workflows arises.

Toloka provides Toloka-Kit — an open-source Python library for interaction with the Toloka API. You can use it to create automatable workflows consisting of crowdsourcing pipelines and all other parts needed to process and make use of the data. However, it’s essential to understand that crowdsourcing pipelines have special features, which introduce some challenges in pipeline management.
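For instance, a minimal Toloka-Kit script that uploads tasks to an existing pool, opens it, and waits for the results could look like the sketch below. The OAuth token, pool ID, and input field name are placeholders:

```python
import time

import toloka.client as toloka

toloka_client = toloka.TolokaClient('YOUR_OAUTH_TOKEN', 'PRODUCTION')

# Upload tasks to an existing pool (pool ID and input field are placeholders)
tasks = [
    toloka.Task(pool_id='67890', input_values={'image': url})
    for url in ['https://example.com/cat.jpg', 'https://example.com/dog.jpg']
]
toloka_client.create_tasks(tasks, allow_defaults=True)  # use the pool's default overlap

# Open the pool and poll until the crowd has finished all the tasks
pool = toloka_client.open_pool('67890')
while pool.status != toloka.Pool.Status.CLOSED:
    time.sleep(60)
    pool = toloka_client.get_pool(pool.id)

# Download the submitted responses as a pandas DataFrame
results = toloka_client.get_assignments_df(pool.id)
```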

Image by author

The first challenge is error handling. A classic crowdsourcing pipeline has four steps:

1. Running a crowdsourcing project for dataset enrichment.
2. Running a validation project on the results, where the generated data is accepted or rejected by labelers.
3. Aggregating the labels produced in the validation stage (say, an object is accepted only if a majority of the labelers who saw it accepted it).
4. Applying the results of the validation so that qualified data labelers from the enrichment project receive a bonus.

Now imagine that an error occurs in the fourth step. Can we simply restart the sequence from the beginning to fix it? Technically, we can, but in practice we will lose both money (crowdsourcing requires fair payment for each page of tasks done) and time (even with parallelization across a vast number of labelers, crowdsourcing tasks take seconds, minutes, hours, or even days in some cases). The only effective way to handle errors is to cache and log all the intermediate results.

Secondly, crowdsourcing pipelines require sensitive monitoring. New fraud methods are invented with frightening frequency; labelers are intolerant of outdated instructions and unfriendly interfaces, and they expect personal feedback. One of the toughest challenges is to keep your project rated highly by the crowd. So for crowdsourcing pipelines, you should always be able to calibrate a system of alerts.

Prefect — a workflows orchestration system

A workflow is easier to manage when its runs are supported by tools designed for that purpose: in most cases, building your own framework for workflow automation means reinventing the wheel and, honestly speaking, is a hard path to follow.

Prefect Core is a workflow management system that provides the essential functionality of logging, caching, retries, and notifications based on the current state of the run.

One of the key concepts of Prefect Core is a task; workflows (or, in Prefect terminology, flows) are composed of tasks, forming directed acyclic graphs of dependencies. For those acquainted with Python, the best analogue of a task is a function: a task encapsulates the logic of a single, related action that can optionally consume input data and optionally produce an output. Task inputs and outputs can be cached: input data is cached by default, while caching for output data is easy to configure.
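Here is a minimal sketch using the Prefect 1 ("Prefect Core") API described in this article; the task bodies are placeholders standing in for real extraction and labeling logic:

```python
from datetime import timedelta

from prefect import task, Flow
from prefect.engine.results import LocalResult

# Each task is a Python function; failed runs are retried automatically
@task(max_retries=3, retry_delay=timedelta(seconds=30))
def extract_candidates():
    return ['object-1', 'object-2']  # placeholder for a real data source

# Output caching: persist the task's result to disk so a restarted run
# can pick up where it left off (requires checkpointing to be enabled,
# e.g. via the PREFECT__FLOWS__CHECKPOINTING environment variable)
@task(checkpoint=True, result=LocalResult(dir='./prefect-results'))
def label_with_crowd(candidates):
    return [(c, 'label') for c in candidates]  # placeholder for a Toloka call

# Tasks wired together inside a Flow form a directed acyclic graph
with Flow('enrichment-pipeline') as flow:
    candidates = extract_candidates()
    labeled = label_with_crowd(candidates)

flow.run()
```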

Image by author

Another fundamental concept is a state. It plays a significant role in the failure management mechanism provided by Prefect: tasks are designed to be automatically retried when they enter a failure state. Whether a task starts is determined by a trigger function that analyzes upstream task states. Alerts and notifications are sent by state handlers, functions that are executed whenever an object's state changes.
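A sketch of both mechanisms, again in the Prefect 1 API; the send_alert helper is a hypothetical stand-in for a real notification channel, such as Prefect's built-in slack_notifier:

```python
from prefect import task
from prefect.engine.state import Failed
from prefect.triggers import any_failed

def send_alert(message: str):
    print(message)  # stand-in for a real channel (Slack, email, ...)

# A state handler runs on every state change of the object it is attached
# to; this one raises an alert whenever the task enters a Failed state
def alert_on_failure(task_obj, old_state, new_state):
    if isinstance(new_state, Failed):
        send_alert(f'{task_obj.name} failed: {new_state.message}')
    return new_state

@task(state_handlers=[alert_on_failure])
def validate_labels():
    ...

# A trigger decides whether a task should start based on upstream states;
# this cleanup task runs only if at least one upstream task failed
@task(trigger=any_failed)
def clean_up_partial_results():
    ...
```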

Prefect Core's features cover the needs that arise when handling a single crowdsourcing pipeline. However, for orchestrating multiple workflows, consider Prefect Cloud or Prefect Server, an open-source lightweight version of Prefect Cloud. They extend Prefect Core with persistence and orchestration layers and add a UI and database backend on top of the Prefect Core engine, allowing you to manage, schedule, and run multiple flows from a web UI.
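Assuming a flow object like the one above, handing it over to Prefect Server or Cloud is a one-liner (the project name is a placeholder, created beforehand with the prefect CLI):

```python
# Register the flow with a Prefect Server/Cloud project so it can be
# scheduled and monitored from the web UI ("crowdsourcing" is a placeholder
# project name, created beforehand with: prefect create project "crowdsourcing")
flow.register(project_name='crowdsourcing')
```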

Prefect & Toloka

For running Toloka pipelines on Prefect, the Toloka team released an open-source toloka-prefect library. You can use the library to maintain project and pool lifecycles and assign and manage crowdsourcing tasks. Below you can see an example of operations that describe a simplified lifecycle of a pool.

Image by author
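The exact names of the library's operations are best checked in the toloka-prefect repo; as a hand-rolled illustration of the same idea, Toloka-Kit calls can be wrapped in Prefect tasks like this (a sketch with a placeholder token and pool ID):

```python
import time

import toloka.client as toloka
from prefect import task, Flow

TOKEN = 'YOUR_OAUTH_TOKEN'  # placeholder

@task
def open_pool(pool_id: str) -> str:
    # Start showing the pool's tasks to labelers
    toloka.TolokaClient(TOKEN, 'PRODUCTION').open_pool(pool_id)
    return pool_id

@task
def wait_pool(pool_id: str) -> str:
    # Block until the crowd has finished every task in the pool
    client = toloka.TolokaClient(TOKEN, 'PRODUCTION')
    pool = client.get_pool(pool_id)
    while pool.status != toloka.Pool.Status.CLOSED:
        time.sleep(60)
        pool = client.get_pool(pool_id)
    return pool_id

@task
def get_results(pool_id: str):
    # Download all submitted responses as a pandas DataFrame
    return toloka.TolokaClient(TOKEN, 'PRODUCTION').get_assignments_df(pool_id)

with Flow('toloka-labeling') as flow:
    pool_id = open_pool('12345')  # placeholder pool ID
    results = get_results(wait_pool(pool_id))
```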

Using this integration, the four-step classic Toloka pipeline mentioned earlier could be implemented in the Prefect UI like this:

Image by author

Let’s look at the steps in a single run of this workflow.

On the upper-right side, the existing data enrichment project (project_id) is used to launch a new pool with objects extracted from the candidates database table using SQLite queries. These objects contain the enrichment templates, which will be paired with objects generated by the crowd.
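As a sketch of this step, reusing the toloka imports and client from the earlier snippets (the database file, table schema, input field name, and enrichment_pool_id are all assumptions):

```python
import sqlite3

# Pull unlabeled objects from the candidates table
# (the database file and table schema are assumptions)
conn = sqlite3.connect('pipeline.db')
rows = conn.execute('SELECT id, template FROM candidates').fetchall()

# Turn each candidate into a Toloka task for the enrichment pool
tasks = [
    toloka.Task(pool_id=enrichment_pool_id, input_values={'template': template})
    for _, template in rows
]
toloka_client.create_tasks(tasks, allow_defaults=True)
```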

The labeled results are transferred to an existing validation project (val_project_id), where a new validation pool with binary classification tasks is launched. In the validation pool, each labeler accepts or rejects the pairs generated in the enrichment project.

Since each pair is validated by several Tolokers, the validation project results require aggregation. The rejected results are selected from the aggregated results and applied to the enrichment project. The purpose of this step is twofold: to decline payment for labelers who performed poorly, and to put the incorrectly labeled objects back in the candidates table to participate in the next workflow run. The accepted results trigger the payment procedure for labelers in the enrichment project, and these results are sent to the results table.
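A minimal sketch of the majority-vote decision and its application, assuming the validation verdicts have already been grouped by the enrichment assignment they refer to (collecting them from the validation pool is omitted, and toloka_client comes from the earlier snippets):

```python
from collections import Counter

# Validation verdicts grouped by enrichment assignment (placeholder data)
votes = {
    'assignment-1': ['OK', 'OK', 'BAD'],
    'assignment-2': ['BAD', 'OK', 'BAD'],
}

for assignment_id, verdicts in votes.items():
    majority = Counter(verdicts).most_common(1)[0][0]
    if majority == 'OK':
        # Accepting triggers payment for the labeler in the enrichment project
        toloka_client.accept_assignment(assignment_id, 'Well done, thank you!')
    else:
        # Rejected objects go back to the candidates table for the next run
        toloka_client.reject_assignment(assignment_id, 'Rejected by a majority of validators')
```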

The workflow can be reused as many times as needed to create a dataset with the desired quality and quantity. If you use the Prefect scheduling functionality and set up alerts, the workflow will be almost fully automated with only minimal intervention on your part.
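For example, with Prefect 1 schedules, re-running the flow daily takes only a few extra lines (a sketch, assuming the tasks from the previous snippets):

```python
from datetime import timedelta

from prefect import Flow
from prefect.schedules import IntervalSchedule

# Re-run the whole crowdsourcing workflow once a day
schedule = IntervalSchedule(interval=timedelta(days=1))

with Flow('crowdsourcing-pipeline', schedule=schedule) as flow:
    ...  # enrichment -> validation -> aggregation -> acceptance tasks

flow.run()  # with a schedule attached, runs repeat at the given interval
```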

In conclusion

The data-centric approach has a lot of potential for solving AI-related problems. One of the primary instruments behind this approach is crowdsourcing — gathering data, labeling data, and evaluating data quality with the help of the crowd. Crowdsourcing workflows have special challenges that can be solved with the right set of tools. With Toloka and Prefect, you can orchestrate crowdsourcing workflows painlessly and free up more time for creativity and experiments.

For details of the example presented in this article, watch this video.

For another example of a Toloka-Prefect integration, go to the GitHub repo.
