Gartner Features Toloka in its Latest Hype Cycle for Data Science and Machine Learning Report
The report by the highly respected technology research and consulting company examines how the current business landscape relies increasingly on innovations in data science and Machine Learning (ML). Within data science and ML, data labeling in particular has become a pivotal factor in technological progress and AI product improvement worldwide.
The report explicitly outlines how data annotation has risen to become a technological must. Gartner concludes that labeling has paved the way for rapid AI development by supporting classification, segmentation, transformation, and augmentation procedures that facilitate data preparation for ML algorithms. The report further notes that the need for training data — ideally labeled quickly and cost-effectively — has increased dramatically over the past few years.
The reason is that data labeling offers a long-awaited boost to the development of AI-based solutions across a wide range of industries, including situations when AI is not, in and of itself, of primary interest. This means that through training algorithms, data labeling has played a major role not only in the development of AI-specific products, such as voice assistants, but also in fields like law and medicine, where it has acted as an industry disruptor. Industry disruption here means reinventing the way businesses operate by adding entirely new technologies to established practices, such as matching legal documents or helping to interpret medical X-rays.
Gartner lists Toloka as one of the companies worthy of mention in this domain, alongside Scale AI, Appen, and Amazon’s MTurk.
Toloka’s data-driven principles
According to Toloka’s CEO Olga Megorskaya, the success of AI development relies on three key aspects:
- Algorithms that underlie the training of models.
- Hardware needed to run those algorithms.
- Data used by the algorithms as the basis for ML training.
While the first two have already become commodities (one can use any of many open-source libraries and cloud services to run standard algorithms), training data remains a bottleneck in the industry. What’s more, with the other two pillars equally accessible to any player on the market, training data is the only aspect that varies in quality and complexity and ultimately defines the AI product. One can argue that whoever manages training data production more effectively also has the upper hand in the long-run competition for business dominance.
Historically, many web-based AI solutions have relied on algorithms trained on user behavior logs, clicks, and other data that gets collected automatically. But the more AI expands its area of application and moves into the offline world, the more often we face situations where the only way to obtain training data is to have it labeled by humans. So, the question of effectively managing training data production is really the question of effectively managing data-labeling processes.
Data labeling and production pipelines
For an MVP of most AI products, it makes sense to train the model on a data set that can be collected quickly and with minimal effort. For a more mature AI product, however, it is essential to set up a process of continuous improvement in which the data is updated regularly.
When we look at the AI production pipeline, we see that data labeling is required at every stage: first, when you collect the training set; then, when you validate the quality of the trained model; and finally, when you monitor how the model behaves in real life after production deployment, a stage many start-ups tend to forget about.
Here are a few tried-and-tested strategies that can improve the quality of the AI-based product by improving its data:
- Collect as much data as possible for the training set — more data is always better than less data.
- Update your training set with fresh data regularly — the vast majority of AI applications are subject to context drift, which means you can’t build a good AI product in 2021 that’s trained on data from a decade ago.
- Validate the quality of your model and swiftly deploy updated models to production — those companies that update their products more frequently also grow at a higher pace.
- Always control the quality of your AI solution in production.
This means that any AI product’s long-term success relies on having the right infrastructure that can facilitate scalable, flexible, and cost-effective data labeling.
Expertise in modern AI production
On Kaggle, you compete as a user within a given data set; in the real world, however, businesses compete in the context of complete production pipelines. This means that you may have the best engineers and the most incredible computing power, but your model’s success can never exceed the quality of the data it was trained on. As a result, knowing how to build complex data-labeling pipelines becomes a prerequisite for success. This is exactly why product management in ML and AI today is essentially all about data management. And since data management draws on data labeling, this is something that a company, just like the ML algorithms themselves, shouldn’t want to surrender entirely to a third party.
Labeling with Toloka
At Toloka, we aim to provide businesses with an all-inclusive environment for data labeling. It may not seem obvious, but environments that can meet the needs of different companies aren’t easy to build. To provide a scalable, on-demand workforce that produces high-quality labels 24/7 in all major languages, the following two components are necessary:
- A global crowdforce that’s ready to tackle all sorts of labeling tasks as soon as they get posted. This is only possible in an open environment where a multitude of requesters create demand by supplying assignments for a multitude of performers. In this setup, the former motivates the latter to stay on the platform.
- Reliable methods and instruments for automated quality management on a large scale.
People management as an engineering task
One of the biggest questions in crowdsourcing is how to set up a process that effectively harnesses the small individual contributions of millions of independent performers and yields stable, scalable, high-quality results that are resistant to the mistakes of any single performer. Toloka believes that the key is a combination of mathematics, engineering, and effective management of production pipelines. In other words, it is mainly the processes that need to be managed, not the people who perform the tasks.
The good news is that with the right decomposition of these processes, appropriate pre-training of the crowd, and adequate aggregation of the results, you can achieve a high level of data-labeling quality that can then be reproduced at scale. The main ingredient here is automation: whatever can be automated should be automated.
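To make the aggregation step concrete, here is a minimal sketch of the simplest aggregation scheme: several independent performers label each item (the labeling "overlap"), and a majority vote absorbs individual mistakes. The function name, data shapes, and example labels are hypothetical illustrations; real aggregation models used in production crowdsourcing go well beyond plain majority voting.

```python
from collections import Counter

def aggregate_majority(labels_by_item):
    """Aggregate overlapping labels per item by majority vote.

    labels_by_item maps an item id to the list of labels assigned
    by independent performers. Returns, per item, the winning label
    and the share of performers who agreed with it.
    """
    results = {}
    for item, labels in labels_by_item.items():
        label, votes = Counter(labels).most_common(1)[0]
        results[item] = (label, votes / len(labels))
    return results

# Three performers label each image; one mistake on img_1 is outvoted.
raw = {
    "img_1": ["cat", "cat", "dog"],
    "img_2": ["dog", "dog", "dog"],
}
consensus = aggregate_majority(raw)
```

Because the agreement share doubles as a rough confidence score, the same mechanism feeds directly into automated quality control: items where performers disagree can be flagged for extra review.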
As an example, let’s take a look at the search relevance data-labeling pipeline.
Arguably, search relevance evaluation is one of the most sophisticated tasks in applied Machine Learning and data labeling: search queries are practically infinite in both number and variety. This makes it all the more important to correctly assign tasks to labelers who already possess some level of relevant expertise and can therefore correctly evaluate the content in question. The pipeline in this case is as follows:
- Multiple levels of evaluation and verification are applied.
- Judgments are explained.
- Explanations are verified.
- Confidence of results is estimated automatically.
- Results with low confidence are re-evaluated.
The trick is that once this type of pipeline has been designed and put in place, it can run and scale up automatically for as long as it’s needed. The money put into the pipeline emerges at the other end as tangible output. And just like that, managing the efforts of thousands of performers turns from a people-management task into a purely engineering challenge.
Since Toloka provides its data-labeling infrastructure to thousands of clients, we can observe how the labeling needs differ from one stage of business development to the next:
- Experienced AI teams with seasoned engineers seek the flexibility to fully integrate data labeling into their production pipelines. For such clients, Toloka offers a wide range of settings and tools via its interface, API, or Python client.
- Clients who are still in the nascent stages of their AI production seek fast results. They don’t want to put a lot of effort into training their first models, and so they mostly go for Toloka App Services. These are pipelines for standard use cases where everything that needs to be addressed — decomposition, training of the crowd, quality control, and aggregation — is already in place to provide maximum quality with minimal effort.
Support at every stage
Data labeling requirements for any business change over time. Initially, it all comes down to getting the highest possible quality, spending the least amount of time, and making the least amount of effort. But with more mature and well-established AI products, scalability and flexibility take center stage.
Today, data labeling lies at the core of AI production. And just like any other essential part of a business, it should not be outsourced. Optimizing data-labeling processes has a direct effect on how efficiently the whole business operates. We at Toloka provide data-labeling infrastructure with a global crowd of millions of performers, sophisticated tools for automated quality management, and ready-made solutions for the most common labeling tasks and use cases.
Gartner does not endorse any vendor, product or service depicted in its research publications, and does not advise technology users to select only those vendors with the highest ratings or other designation. Gartner research publications consist of the opinions of Gartner’s research organization and should not be construed as statements of fact. Gartner disclaims all warranties, expressed or implied, with respect to this research, including any warranties of merchantability or fitness for a particular purpose.