Data Labeling: A Lighthouse on the Rocky Shores of the AI-Driven Legal-Tech Industry

The global legal-tech market is booming: revenues rose from $17 billion in 2019 and are projected to reach $25 billion by 2025.

Kate Saenko
Toloka Tech
4 min read · Jan 18, 2022


This doesn’t come as a surprise: thousands of AI-oriented tech startups are currently entering the market from all directions. The legal sphere in particular has benefited tremendously from these emerging technologies, with the US, UK, and EU leading the way.

Within the legal industry, there are five areas that AI sets out to address. All five involve Natural Language Processing (NLP), a blend of linguistics and computer science that allows machines to work with written and spoken language (personal assistants like Alexa and Siri are arguably the best-known examples). The five legal areas where NLP is currently applied are:

  • Research

Acquiring information pertinent to specific legal cases and court proceedings by analyzing all available legal data.

  • Review

Going through legal contracts to check for accuracy and ensure the most favorable terms for the client.

  • Advice

Offering legal strategies and solutions based on the various details and circumstances supplied by the client.

  • Automation

Producing legal documents by collecting answers from the client, usually in the form of a questionnaire.

  • E-discovery

Matching digitally stored documents to ongoing legal inquiries and investigations.

AI-backed services attempt to speed up these legal processes by either eliminating the need for human supervision or substantially reducing it. However, few people, even within the legal field itself, know that one crucial ingredient must be present to make these NLP solutions work: labeled data.

For any AI system to execute the task at hand, there has to be a training algorithm that tells the machine how to teach itself, and for this to happen, the machine needs to be fed a lot of data. In the legal field, that could mean supplying the AI with different types of legal documents so that it learns to tell a birth certificate from a driver’s license, an eyewitness testimony, or a subpoena. The more data there is, and the more accurately it is labeled, the better the software can do its job. Without data, the machine cannot do its part even with a great training model, much like a functioning car without gas.
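To make this concrete, here is a deliberately minimal sketch of what "learning from labeled data" means. The document snippets, labels, and scoring rule are all hypothetical illustrations, not any real legal-tech system: the model simply counts which words appear under each human-assigned label and classifies new text by vocabulary overlap.

```python
from collections import Counter, defaultdict

# Hypothetical labeled training set: each example pairs a snippet of
# document text with a label assigned by a human annotator.
labeled_data = [
    ("certificate of live birth issued by the county registrar", "birth_certificate"),
    ("operator license class c expires on date of birth", "drivers_license"),
    ("the witness testified that the defendant was present", "testimony"),
    ("you are commanded to appear before the court", "subpoena"),
]

def train(examples):
    """Build per-label word counts from the labeled examples."""
    counts = defaultdict(Counter)
    for text, label in examples:
        counts[label].update(text.lower().split())
    return counts

def classify(model, text):
    """Pick the label whose training vocabulary overlaps the text most."""
    words = set(text.lower().split())
    return max(model, key=lambda label: sum(model[label][w] for w in words))

model = train(labeled_data)
print(classify(model, "the court commanded the witness to appear"))  # subpoena
```

With only four labeled snippets the model is trivially brittle; the point of the sketch is that both the quantity and the accuracy of the labels directly bound what the classifier can learn.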

Methodology and drawbacks

Data labeling is admittedly a painful subject for most data scientists, because many labeling methods in use today, namely the in-house and outsourcing approaches, are unwieldy and ineffective from the perspective of fast-paced business. At the same time, roughly 80% of AI project time is usually spent doing just that: working with the data. As a result, businesses often end up with either small sets of accurately labeled data or large sets of poorly labeled, “noisy” data. While some less technical fields can afford to sacrifice quality for quantity, the legal field cannot: legal data always needs to be both abundant and accurate. And this is a real problem for many in the field.

For instance, one software engineer named Sam, who works for a company specializing in legal annotation, says that they face a number of serious obstacles on a daily basis. First, hiring experts who can label and validate the data is extraordinarily costly. Second, the whole process is slow, with each project taking more than a month to complete on average. Third, communicating effectively and maintaining a shared mindset among the labelers is a constant struggle. And lastly, legal data labeling is very niche, so there’s no one to turn to for advice or a second opinion.

Other companies working in the legal field echo Sam’s dissatisfaction. The whole enterprise is extremely challenging, they confirm, because there is (1) a dearth of legal data and, simultaneously, (2) a shortage of experts willing to label that data. The first stems from legal red tape and largely fragmented access to legal documents; some companies have to spend up to $1 million to overcome this and build a usable data set. The second means endless back-and-forth cycles between data scientists and the few legal specialists there are, often on projects running in parallel, to check whether the labeled data will be sufficient to meet the model’s demands.

A new hope

In short, the whole process is barely affordable, very slow-going, and rather disorganized. This is because the usual labeling approaches are inefficient in terms of both time and money, and they rely mostly on external specialists with little to no connection between them.

An alternative to all this is crowdsourcing: a business-savvy approach to data labeling that draws on the efforts of millions of independent performers and delivers stable, scalable, high-quality results that are resistant to the mistakes of any individual performer.

Toloka believes that the key to this challenge is a combination of maths, engineering, and effective business processes. It is the process that needs to be managed, not the people who perform the tasks.

The good news is that with the right decomposition of these processes, appropriate pre-training of crowd workers, and proper aggregation of results, you can achieve high levels of labeling quality that can be reproduced at scale. The key ingredient is automation: whatever can be automated should be automated.
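The aggregation step above can be illustrated with the simplest possible scheme, majority voting: give the same task to several independent performers and take the most common answer, so that one worker's mistake is absorbed by the others. This is a minimal sketch with made-up labels; real crowdsourcing platforms use more sophisticated aggregation models that also weigh each worker's reliability.

```python
from collections import Counter

def aggregate(votes):
    """Majority vote: return the label that the most workers chose."""
    return Counter(votes).most_common(1)[0][0]

# Three crowd workers independently label the same contract clause.
# One of them makes a mistake, but aggregation absorbs it.
votes = ["favorable", "favorable", "unfavorable"]
print(aggregate(votes))  # favorable
```

Overlap (how many workers see each task) is the knob here: more votes per task cost more but make the aggregated label more resistant to individual errors.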

Find out more about Toloka’s technology and check out some of the use cases.
