How to Outsource Data Annotation: Choosing the Right Data Labeling Vendor
We all know that training data preparation is one of the least enjoyable chores in the machine learning process. While having humans-in-the-loop to execute tasks like labeling unstructured data is often an essential step in preparing training data for your model, its tedious and time-consuming nature makes it a task not ideally suited for small teams of highly skilled & well-paid data scientists or engineers. This is why many organizations choose to outsource their data annotation projects in order to leverage lower-cost labor at scale. Although working with these external teams comes with its own set of challenges, there are a few steps all organizations can take to optimize their annotation partnerships.
Here at BasicAI, we work with organizations on a daily basis who are facing the challenges associated with finding and working with the ideal data labeling teams. That’s why we’re launching this three-part blog series outlining some of the best practices we use when helping our clients select and partner with annotation vendors for their projects.
As a quick overview, we’ve found that there are three essential steps to effectively working with outsourced vendor teams:
- Choose the Right Vendor
- Effectively Communicate Guidelines
- Monitor & Manage for Success
These three points alone are certainly an oversimplification, so in the coming weeks, we will dive into the details involved in each phase to demonstrate how any organization can successfully work with BPOs for their machine learning needs. We’ll start today by talking about how to choose the right annotation vendor.
Choosing the Right Vendor
Choosing the right partner is essential for any business engagement. The greater the alignment between your teams the better your final outcomes are likely to be. This is why the first step to successfully outsourcing your data annotation projects MUST be choosing the right data labeling partner.
Understand Your Requirements
This may seem obvious, but just as with any other purchase, it is essential to know exactly what your needs and expectations are before commencing your annotation vendor search. Your team should develop an RFP (Request for Proposal) document to provide a detailed overview of the project as well as your expectations for the work to be delivered. You can then utilize this document to compare each vendor you speak with against the same criteria.
There are many factors to consider when putting together your RFP document. We’d recommend including answers to the following questions, at a minimum.
Detailed Project Overview
- What type of data are you working with?
- What file format(s) is your data in?
- What type of annotations need to be executed?
- Will any special domain knowledge be required to label your data?
- What is the objective of this training data?
Know Your Project Timeline
- What is the target completion date for the data to be annotated?
- How can the project be broken into phases?
Know Your Data Volume or Project Budget
- How many labeled files will you require for your completed training data set?
- Do you have a fixed budget for the project?
Answering these questions will be essential to equipping your team for the next step in the process: finding and evaluating only those annotation companies with the right experience and capabilities to match your needs.
Evaluate Vendors on Experience
One of the biggest missteps we’ve seen some organizations take when hiring contracted labor forces is to underestimate the need for skill and domain expertise. While data labeling may often seem like a simple task, it does require great attention to detail and a special set of skills to execute efficiently and accurately on a large scale. Determining who has these necessary skills can be overwhelming when there are hundreds, if not thousands, of BPO (Business Process Outsourcing) companies offering affordable rates to label data for AI projects. So how do you know which companies would be best suited for your needs?
Let’s start by talking about data annotation experience. BPO companies advertising data annotation services range from brand-new startups to large organizations offering other services, from document scanning and customer support all the way to web development and software testing. While this varied experience is certainly not a bad thing, it is extremely important to gain a solid understanding of how long each vendor has been working specifically in the data annotation space and to find out more details about their annotation teams.
The good news is since you’ve already completed your RFP and know exactly what your needs are, you can easily ask a few key questions to gauge if any vendor could be a good fit for your project. Here is a shortlist of the questions we recommend asking:
- How long have you been working in data annotation for AI?
- What languages do you support?
- Do you have any security certifications for handling sensitive data?
- Can you share case studies from previous projects you’ve completed?
- Does your team have any special domain expertise?
- How do you train your annotation teams for each project?
- How many of your annotators are full-time vs. part-time?
- Do you have experience using X platform for data labeling projects?
Once you’ve received answers to these and any other questions you find to be relevant based on the details of your RFP, you can then select the top candidates (we recommend 3) and move to the next and most crucial step in choosing a data labeling partner.
Take a Test Drive
Words and pitch decks don’t mean much when your company is getting ready to commit a large budget and the success of your project to a contracted partner. This is why it is crucial to evaluate their actual performance. Luckily for you, most vendors in the annotation services industry are willing to provide small sample runs to demonstrate their ability to execute on your needs. We cannot stress enough how important it is for you to take full advantage of this opportunity!
There are a few steps you’ll want to take to ensure you are receiving as much information as possible in order to make the most informed decision on which team to select.
First, we’d recommend pulling a small but diverse sample set (10–50 files, depending on the complexity of your dataset) from your full data set, one that showcases the full range of scenarios your annotators will need to label.
For example, if you are working with 2D images of vehicles driving on a road, we’d recommend pulling images that showcase different times of day, weather conditions, densities of vehicles, and any other variable factors which may change how the labelers perceive the images. Your team will know your dataset best, so it is usually recommended to have your team select this sample set.
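If your files carry scenario metadata, selecting that sample can be automated rather than eyeballed. Here is a minimal sketch in Python, assuming each file record has a hypothetical `"scenario"` tag (e.g. time of day or weather condition); the field name and grouping logic are illustrative, not part of any particular platform:

```python
import random
from collections import defaultdict

def select_sample(files, key=lambda f: f["scenario"], per_group=5, seed=42):
    """Pick up to `per_group` files from each scenario group so the
    sample covers the full range of conditions in the dataset."""
    groups = defaultdict(list)
    for f in files:
        groups[key(f)].append(f)
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sample = []
    for group in groups.values():
        rng.shuffle(group)
        sample.extend(group[:per_group])
    return sample

# Hypothetical metadata records for a driving-image dataset.
files = [
    {"path": "img_001.jpg", "scenario": "day_clear"},
    {"path": "img_002.jpg", "scenario": "night_rain"},
    {"path": "img_003.jpg", "scenario": "day_clear"},
]
print(len(select_sample(files, per_group=1)))  # → 2 (one file per scenario)
```

Sending every vendor the same fixed-seed sample also keeps the comparison fair, since each team labels identical files.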
Second, make sure that your annotation guidelines are documented and dialed in as much as possible before beginning a sample test. While it is true that your annotation partner will often identify edge cases or outliers in your guidelines as they complete the labeling work, the more thoroughly you can document these factors in advance, the more accurately you can evaluate each vendor on their ability to understand and deliver on your requirements.
Also keep in mind that having your guidelines locked in before the start of your full production project will help avoid days, weeks, or even months of delays. Remember that every change to the guidelines translates into time your annotation partner will have to invest in retraining their labelers or reworking previously submitted files.
Third, and possibly most importantly, be sure to have each potential partner use an annotation software with performance tracking and audit features built-in so that you can fully evaluate their efficiency and the quality of their work. Simply reviewing a few labeled files and getting a cost estimate may seem like enough to help you choose a team, but these points won’t show you how much time it took that team to label each file, nor will they show you how many passes it took to deliver annotations of acceptable quality.
We often see other companies choose the lowest cost vendor only to be disappointed when the low quality or lack of team efficiency leads to costly delays in the delivery of the final labeled data sets. This is why we recommend looking beyond the dollar and evaluating every sample test & bid with the following criteria:
- What percentage of possible labels were actually labeled?
- How accurately placed were the labels?
- How often did the annotator properly tag each label?
- How long did it take to place each label on average?
- How long did it take to label each file on average?
- How long did it take to perform a quality audit on each file?
- How many annotators can the vendor commit to our project?
- How many different labelers worked on the sample test?
- Will the annotators from the sample test be responsible for training the rest of the annotation team for the actual project, or will the project be executed by a different set of annotators entirely?
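The quantitative criteria above can be rolled into a simple scorecard so that vendor bids are compared on the same numbers. Below is a hedged sketch, assuming you have counted labels against your own reference annotations and pulled timing data from the platform’s audit log; the `SampleResult` fields are hypothetical names, not a real platform API:

```python
from dataclasses import dataclass

@dataclass
class SampleResult:
    """One vendor's output on the sample test (hypothetical fields)."""
    labels_placed: int        # labels the vendor actually drew
    labels_expected: int      # labels present in your reference set
    labels_correct: int       # placed labels with the right tag/placement
    minutes_per_file: float   # average labeling time, from the audit log
    rework_passes: int        # passes needed to reach acceptable quality

def score(result: SampleResult) -> dict:
    """Summarize a sample test as coverage %, accuracy %, and effort."""
    coverage = result.labels_placed / result.labels_expected
    accuracy = result.labels_correct / result.labels_placed
    return {
        "coverage_pct": round(100 * coverage, 1),
        "accuracy_pct": round(100 * accuracy, 1),
        "minutes_per_file": result.minutes_per_file,
        "rework_passes": result.rework_passes,
    }

vendor_a = SampleResult(labels_placed=480, labels_expected=500,
                        labels_correct=456, minutes_per_file=3.2,
                        rework_passes=1)
print(score(vendor_a))
# → {'coverage_pct': 96.0, 'accuracy_pct': 95.0,
#    'minutes_per_file': 3.2, 'rework_passes': 1}
```

Running the same scorecard over each shortlisted vendor’s sample test turns “look beyond the dollar” into a side-by-side comparison of quality and efficiency, not just price.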
By fully evaluating all of the above factors in addition to cost, you should be well equipped to choose the best team to help move your project forward.
While the workflow we’ve outlined in this article does require a significant investment of time, we have found it is the best approach to saving time in the long run by cutting down on any excessive back and forth communication with vendors and minimizing the need for reworking files. Most importantly, selecting the best vendor from the outset should lead to your team receiving the highest possible quality for your crucial training data.
Of course, if your team is tight on time, BasicAI is always happy to help you select the perfect data labeling partner. Contact us today to ask about receiving bids from some of our 60+ vetted, professional annotation vendors by submitting one simple RFP. We’ll even help put together the RFP for you!