How Not to Mess Up With Your Data Annotation

Asli Solmaz-Kaiser
Published in The Startup
7 min read · Mar 18, 2020

The process of labelling data so that a machine can be trained to make certain predictions is called data labelling or data annotation. The more accurate the labels, the more precise the predictions of the machine learning algorithm will be.

Growth of Data Annotation Market

Riding the slipstream of rapid growth in AI and machine learning, the data annotation market is booming. According to Cognilytica:

The market for third-party data labelling solutions is $1.7B in 2019 growing to over $4.1B by 2024.

Meanwhile, there are many established start-ups in this field, with valuations reaching up to $1B. These companies focus mainly on autonomous transportation, drones and retail, industries that need labelled data for their specific ML use cases.

Big Techs Now in the Play

Recently, the tech giants have also entered the growing data annotation market. At the end of last year, Google launched the beta version of its AI Platform Data Labeling Service. As data labelling is one of the first steps in developing machine learning algorithms, the partner a company chooses to work with initially has a good chance of winning the follow-on business: the development of the AI system itself. Through AutoML, Google is an important player in ML algorithm development as well.

The same goes for AWS, which launched Amazon SageMaker Ground Truth at the end of last year. The system builds on Amazon Mechanical Turk, which has been on the market for quite a while and uses crowdsourcing for data handling.

Adapted from the photo by Yoel J Gonzalez on Unsplash

As a result of increasing competition, the “incumbents” of the data annotation industry are adding more services to their portfolios, such as listings of public datasets, a business that Google has also recently entered.

In this fast-changing environment it is not easy to select the right data annotation strategy. Nevertheless, you can start by asking yourself the following questions to make sure you do not mess up with your data annotation job:

1. How sensitive is my data?

One of the first questions to ask when starting a data annotation job concerns the sensitivity of the dataset to be annotated. Personally Identifiable Information (PII) is sensitive information associated with an individual person, such as a face, that can be used to uniquely identify, contact, or locate that person. With growing privacy regulations, this type of information should be handled and stored with care. You can select from the three strategies below for annotating data containing PII:

  • Insourcing — Having your own employees annotate the data is naturally the most secure option, as the data never leaves the company's servers. The downside is that it can come at a high cost and take longer, depending on your in-house annotator capacity.
  • Outsourcing — Outsourcing to a reliable partner with its own data annotation team is another option. Though most data annotation companies rely on crowdsourcing, some perform all annotation activities and quality control with their own employees. If you choose this path, procedures for handling sensitive information can be agreed upon with the outsourcing partner, or you can select a partner with the relevant security certifications, such as SOC 2 Type 1 or Type 2.
  • De-identification and crowdsourcing — Once de-identified, the information can be maintained in a way that does not allow association with a specific person, and is then no longer considered sensitive. Some data annotation companies provide this type of service, for instance through automatic blurring of faces in images (a minimal sketch of this idea follows this list).
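As a rough illustration of the de-identification option, the minimal Python sketch below blurs detected faces with OpenCV before images leave your servers. The file paths, the Haar-cascade detector and the blur strength are illustrative assumptions; a production pipeline would want a stronger face detector and a human review step.

```python
# Minimal sketch: de-identify images by blurring detected faces before
# sending them to an external annotation workforce.
# Assumes opencv-python is installed; paths are placeholders.
import cv2

def blur_faces(input_path: str, output_path: str) -> None:
    image = cv2.imread(input_path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

    # Pre-trained frontal-face detector shipped with OpenCV.
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
    )
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

    for (x, y, w, h) in faces:
        roi = image[y:y + h, x:x + w]
        # A strong Gaussian blur makes the face unrecognisable while
        # keeping the rest of the scene usable for annotation.
        image[y:y + h, x:x + w] = cv2.GaussianBlur(roi, (51, 51), 0)

    cv2.imwrite(output_path, image)

blur_faces("raw/street_scene.jpg", "deidentified/street_scene.jpg")
```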

Diligently analyzing the information before sending it out, and selecting a secure way to transmit, share and store it, will help you not to mess up with your data annotation task in the first place.

2. To what extent can my annotation job be automated?

Data annotation is a human business, or is it? With increasing levels of machine intelligence, it is getting easier to train machines to detect certain patterns or objects.

Through applications like AutoML, where thousands of people upload training data for their ML algorithms every day, the underlying models are becoming more intelligent, and less and less human data annotation is required, at least for the most common objects such as cars, faces and traffic lights. According to Cognilytica, over 30% of current labelling tasks will be automated or performed by AI systems by 2024.

The advantages of this are twofold:

  • Data annotation tools with automatic detection of objects or scenes — Some data annotation tools can automatically identify many objects, such as bikes, telephones and buildings, and scenes, such as parking lots, beaches and cities, without requiring a human annotator. Some tools also provide pre-built intelligence such as celebrity face recognition. When analyzing video, specific activities such as “delivering a package” or “playing soccer” can also be identified. If your use case fits one of these commonly used functionalities, you just need to upload your data to the system and the labelling will be done automatically, without intervention from a human annotator (a pre-labelling sketch follows this list).

This advantage is currently available mostly for computer vision. For more custom applications, e.g. detection of specific plant types, more human annotation will still be required.

  • Less training data needed — Through transfer learning, once the machine has learned how to recognise patterns in an image, learning to recognise new objects is much faster.

For instance, while I was training an AutoML model for X-ray lung disease detection, only around 100 images per label were enough to get reliable results. This is largely due to transfer learning: the machine has already learned how to learn from previous trainings (see the fine-tuning sketch further below).
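To make the pre-labelling idea concrete, here is a minimal sketch that uses an off-the-shelf detector pre-trained on COCO to propose boxes for common objects, so that human annotators only review and correct them. The model choice (torchvision's Faster R-CNN, torchvision 0.13+), the 0.8 confidence threshold and the image path are assumptions for illustration, not any specific vendor's workflow.

```python
# Minimal sketch: pre-label common objects (cars, people, traffic lights, ...)
# with a detector pre-trained on COCO, so humans only review and correct.
import torch
from PIL import Image
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.transforms.functional import to_tensor

model = fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

def pre_label(image_path: str, score_threshold: float = 0.8):
    image = to_tensor(Image.open(image_path).convert("RGB"))
    with torch.no_grad():
        prediction = model([image])[0]
    # Keep only confident detections; label ids follow the COCO category list.
    return [
        {"box": box.tolist(), "coco_label_id": int(label), "score": float(score)}
        for box, label, score in zip(
            prediction["boxes"], prediction["labels"], prediction["scores"]
        )
        if score >= score_threshold
    ]

print(pre_label("images/parking_lot.jpg"))
```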
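And here is a minimal sketch of the transfer-learning point: freeze a backbone pre-trained on ImageNet and retrain only the final layer on a small labelled set, on the order of 100 images per class. The dataset path, model choice and hyperparameters are illustrative assumptions, not the AutoML setup used for the X-ray example above.

```python
# Minimal sketch: transfer learning with a frozen pre-trained backbone,
# retraining only the classifier head on a small labelled dataset.
import torch
from torch import nn
from torchvision import datasets, models, transforms

transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])
train_data = datasets.ImageFolder("data/small_labelled_set", transform=transform)
loader = torch.utils.data.DataLoader(train_data, batch_size=16, shuffle=True)

model = models.resnet18(weights="DEFAULT")
for param in model.parameters():
    param.requires_grad = False          # freeze the pre-trained backbone
model.fc = nn.Linear(model.fc.in_features, len(train_data.classes))

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

model.train()
for epoch in range(5):
    for images, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```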

By understanding the portfolio of offerings on the market that minimise human annotation, you can reduce this task, and maybe even eliminate it completely, saving cost and time, and avoid messing up with your data annotation task by doing everything manually.

3. How do I ensure ethical compliance?

How important is it for you to know the working conditions of the data annotators? If your company imposes ethical compliance requirements on all of its suppliers, this may be an important consideration when selecting your data annotators. Most human data annotation work is crowdsourced to countries with cheaper labour, such as the Philippines and India, at rates as low as $5/hour.

There are data annotation companies that use their own employees, in large cities or in emerging countries, to do the annotation work. These companies state that they offer fair working conditions for their employees in terms of gender equality, maternity leave, minimum wage, etc. Though they provide a better basis for a fairly treated workforce, it is worth noting that such statements are by nature self-declarations rather than third-party ethical compliance certifications.

Careful evaluation of the fit between your ethical compliance needs and the solution provider makes sure you do not mess up with your data annotation job.

4. How accurate will the annotations be?

The more accurate the data annotation / training data, the better the predictions of the ML algorithm will be. In practice, one object is typically annotated 3–5 times, and the most frequently chosen annotation is taken as the ground truth. Considering that annotating one object costs on average $0.08, this repeated work is a significant cost factor.
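In code, the consensus step is essentially a majority vote over the redundant annotations. The small sketch below derives a ground-truth label from five annotations and estimates the redundancy cost at the roughly $0.08-per-annotation figure above; the data structure is an assumption for illustration.

```python
# Minimal sketch: majority-vote consensus over redundant annotations,
# plus the cost of annotating the same object several times.
from collections import Counter

COST_PER_ANNOTATION = 0.08  # USD, average figure cited above

def consensus(annotations: list[str]) -> tuple[str, float]:
    """Return the most frequent label and the share of annotators who chose it."""
    label, votes = Counter(annotations).most_common(1)[0]
    return label, votes / len(annotations)

labels_for_one_object = ["car", "car", "truck", "car", "car"]
label, agreement = consensus(labels_for_one_object)
cost = len(labels_for_one_object) * COST_PER_ANNOTATION

print(f"ground truth: {label}, agreement: {agreement:.0%}, cost: ${cost:.2f}")
```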

By working with companies that have experience with datasets similar to yours, you can benefit from a well-practised annotation team and possibly a pre-trained ML algorithm. This leads to more precise annotations with less manual work, saving cost and time.

Choosing a data annotation supplier with experience in your field will help you better predict the outcome and avoid messing up with your data annotation job.

5. What level of specialisation is required?

For some datasets, specialist knowledge is required for annotation; for example, a radiologist's knowledge is needed to annotate medical images. For these special use cases, you may need not only a specialist team but also a specialised data annotation tool, for instance, in the medical case, a tool capable of 3D annotation on CT and MRI scans.

Validating the level of specialised knowledge and tools that are required for your data annotation job will help you look for the right partner or choose the right strategy in order not to mess up with it.

Conclusion

Data annotation plays a crucial role in the development of robust ML algorithms. In order not to mess up with it, it is important to analyse the sensitivity of your data, the extent to which the job can be automated, the importance of ensuring ethical compliance, and the expected accuracy of annotations with the right partner, as well as to identify whether specialised knowledge and tools are needed.

Not messing up with your data annotation will save you a lot of trouble in the future.
