Ground Truth Gold — Intelligent data labeling and annotation
by Mohan Reddy, Chief Technical Officer, The Hive
We all understand the unreasonable effectiveness of data, and data is often said to be the new oil. For ML models, however, it is labeled data that is the most precious commodity. Modern ML models require large amounts of task-specific training data, and creating these labels by hand is often too slow and expensive. The biggest lesson learned from building many AI startups within The Hive portfolio is that the hardest part of building AI products is not the AI or the algorithms but data preparation and labeling. This three-part series will cover our research in this area, techniques for labeling data, and how The Hive is tackling the problem.
The intelligent enterprise is the hallmark of enterprise automation: both operational and strategic decision-making can be automated through artificial intelligence (AI) based extraction of high-level inferences from real-time streams of raw data. The biggest thrust to intelligent computing has come from the availability of data. A significant corpus of historical information in a specific domain enables an AI application to extract key concepts; recognize entities, associations, and hierarchies; and generate what we call smart data by merging domain knowledge with ontologies.
Referring to Yann LeCun's famous cake analogy from NIPS 2016 (Image 1): “If intelligence was a cake, unsupervised learning would be the cake, supervised learning would be the icing on the cake, and reinforcement learning would be the cherry on the cake. We know how to make the icing and the cherry, but we don’t know how to make the cake. A key element we are missing is predictive (or unsupervised) learning: the ability of a machine to model the environment, predict possible futures and understand how the world works by observing it and acting in it.”
Unsupervised learning is the holy grail and the only sustainable way to proliferate AI by building algorithms with human-like creative abilities. However, it is very hard, and we are nowhere near that goal (Image 2). Pure unsupervised learning is difficult because it is hard to know ahead of time what to train for: a model needs to extract different features, those features are application- or task-dependent, and supervision is needed to learn them effectively. Hence it is very important to understand the data, its variations, and its behavior.
In supervised learning, the input data or training examples come with labels, and the goal of learning is to predict the label for new, unseen examples. Labeling data is expensive and error-prone, and data quality issues can lead to “garbage in, garbage out” in machine learning.
For example, retinal images are used to develop automated diagnostic systems for conditions such as diabetic retinopathy, age-related macular degeneration, and retinopathy of prematurity. To do that, we need images annotated structurally and labeled by condition; the same applies to CT images. This is a time-consuming task: it requires identifying very small structures and usually takes hours for an expert to annotate a single image carefully, making it very expensive to build a decent-sized labeled dataset. We also need several experts to label the same image to ensure the correctness of the diagnosis, so acquiring a dataset for a given medical task costs several times what it takes to annotate a single image. The problem is even harder in traditional enterprise settings because of data sparsity, data quality issues, and a lack of domain experts.
There are many techniques for addressing the cost and scalability of labeling, and in this series we will talk about:
1. Pre-training models / transfer learning
2. Weak supervision
3. Active learning
The idea behind pre-training is to train a neural network on a cheap and large dataset in a related domain, or on noisy data in the same domain. This solves the cold-start problem by bootstrapping the network with a rough model of the data; in this first pass, accuracy may not be high. The parameters of the network are then fine-tuned on a much smaller, more expensive dataset specific to the problem domain. Using a pre-trained network generally makes sense when the tasks or datasets have something in common.
A common recipe in computer vision is to use a CNN as a feature extractor: the last fully connected layer is removed, and the rest of the CNN serves as a fixed feature extractor for the new dataset. A new classifier is then trained on these features, and the weights can be further fine-tuned on the new dataset by continuing the back-propagation.
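The frozen-feature-extractor recipe can be sketched in a few lines. This is a minimal, self-contained illustration, not a real CNN: a fixed random projection with a ReLU stands in for the pretrained network body (in practice this would be, say, a torchvision model with its final fully connected layer removed), and a small logistic-regression "head" is trained on top of the frozen features. All names and data here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a pretrained CNN body: a frozen feature extractor.
# A fixed random projection + ReLU plays that role so the example
# stays self-contained; its weights are never updated.
W_frozen = rng.normal(size=(20, 8))

def extract_features(X):
    return np.maximum(X @ W_frozen, 0.0)  # frozen: no gradient updates

# Small, expensive labeled dataset in the target domain (synthetic).
X = rng.normal(size=(200, 20))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

# New "head": logistic regression trained on the frozen features
# via plain gradient descent on the cross-entropy loss.
F = extract_features(X)
w, b, lr = np.zeros(F.shape[1]), 0.0, 0.1
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(F @ w + b)))
    w -= lr * F.T @ (p - y) / len(y)
    b -= lr * np.mean(p - y)

acc = np.mean(((1.0 / (1.0 + np.exp(-(F @ w + b)))) > 0.5) == y)
print(f"training accuracy of the new head: {acc:.2f}")
```

Fine-tuning, by contrast, would also let the gradients flow into `W_frozen` with a smaller learning rate once the head has converged.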
This approach of transfer learning works very well and has produced well-documented results in computer vision. It can also be adapted to other domains with different kinds of data, such as sensor data, business process data, and language data. We are currently working on general language-modeling tasks combined with domain-driven noisy labeled data for Q&A for support engineers in the data center domain.
In the next blog we will present an approach that takes advantage of existing annotations when the data are similar but the label sets are different. This approach is based on label embeddings, which reduce the setting to a standard domain adaptation problem.
Weak Supervision
Weak supervision is the practice of programmatically generating training data using heuristics, rules of thumb, existing databases, ontologies, and similar sources. It is also known as distant supervision or self-supervision.
The idea of weak supervision for information extraction is not new. Craven and Kumlien (1999) introduced it by matching the Yeast Protein Database (YPD) to the abstracts of papers in PubMed and training a naive-Bayes extractor. Hoffmann et al. (2010) describe a system that dynamically generates lexicons in order to handle sparse data, learning over 5,000 Infobox relations with an average F1 score of 61%. Yao et al. (2010) perform weak supervision while using selectional preference constraints to jointly reason about entity types. Another notable mention is the NELL system (Never-Ending Language Learner), which, instead of learning a probabilistic model, bootstraps a set of extraction patterns using semi-supervised methods for multitask learning.
The Snorkel system is noteworthy here and has been gaining a lot of traction. Part of the DAWN project, Snorkel enables users to train models without hand-labeling any training data: users define labeling functions using arbitrary heuristics, and Snorkel denoises their outputs without access to ground truth, incorporating the first end-to-end implementation of the recently proposed machine learning paradigm of data programming.
In a Snorkel system:
- Subject Matter Experts write labeling functions (LFs) that express weak supervision sources like distant supervision, patterns, and heuristics.
- Snorkel applies the LFs over unlabeled data and learns a generative model to combine the LFs’ outputs into probabilistic labels.
- Snorkel uses these labels to train a discriminative classification model, such as a deep neural network.
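The three steps above can be illustrated with a toy sketch. This is not the real Snorkel API: the labeling functions, the ticket-triage task, and all names below are hypothetical, and a simple majority vote over non-abstaining labeling functions stands in for Snorkel's generative label model, which additionally learns per-function accuracies and correlations to produce probabilistic labels.

```python
import numpy as np

# Hypothetical task: label support tickets URGENT (1) or ROUTINE (0).
# Each labeling function (LF) encodes one heuristic and may abstain (-1),
# in the spirit of Snorkel's labeling functions.
ABSTAIN, ROUTINE, URGENT = -1, 0, 1

def lf_keyword_outage(text):
    return URGENT if "outage" in text or "down" in text else ABSTAIN

def lf_keyword_howto(text):
    return ROUTINE if "how do i" in text else ABSTAIN

def lf_exclamation(text):
    return URGENT if text.count("!") >= 2 else ABSTAIN

lfs = [lf_keyword_outage, lf_keyword_howto, lf_exclamation]

tickets = [
    "production is down!! need help now!!",
    "how do i reset my password",
    "minor ui glitch on settings page",
]

# Label matrix: one row per example, one column per LF.
L = np.array([[lf(t.lower()) for lf in lfs] for t in tickets])

# Stand-in for the generative label model: majority vote over
# non-abstaining LFs; ties and all-abstain rows stay unlabeled (-1).
def majority_vote(row):
    votes = row[row != ABSTAIN]
    if len(votes) == 0:
        return ABSTAIN
    counts = np.bincount(votes, minlength=2)
    if counts[0] == counts[1]:
        return ABSTAIN
    return int(np.argmax(counts))

labels = [majority_vote(row) for row in L]
print(labels)  # the third ticket stays unlabeled: no LF fired
```

The resulting (possibly noisy) labels would then be used to train a discriminative model, such as a deep neural network, that generalizes beyond the heuristics' coverage.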
Active learning is a special case of semi-supervised learning: an approach for reducing the amount of supervision needed for a task by having the model select which data points should be labeled. It addresses the data labeling challenge by modeling the process of obtaining labels for unlabeled data: the system requests the labels of just a few, carefully chosen points in order to produce an accurate predictor. Reinforcement learning is a natural fit for framing active learning, with a deep recurrent neural network typically used as the function approximator for the action-value function.
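The core query loop can be sketched with pool-based uncertainty sampling, one of the simplest active learning strategies. This is a minimal, self-contained illustration on synthetic data (not the reinforcement learning formulation mentioned above): a tiny logistic regression is refit each round, and the point whose predicted probability is closest to 0.5 is sent to a simulated oracle for labeling. All names and data are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic unlabeled pool: 2-D points; the "oracle" labels by sign of x0.
X_pool = rng.normal(size=(300, 2))
y_oracle = (X_pool[:, 0] > 0).astype(float)

def fit_logreg(X, y, steps=300, lr=0.5):
    # Plain gradient descent on logistic regression.
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        w -= lr * X.T @ (p - y) / len(y)
        b -= lr * np.mean(p - y)
    return w, b

# Start from a tiny randomly labeled seed set.
labeled = list(rng.choice(len(X_pool), size=4, replace=False))

# Uncertainty sampling: each round, query the pool point the current
# model is least sure about (predicted probability closest to 0.5).
for _ in range(10):
    w, b = fit_logreg(X_pool[labeled], y_oracle[labeled])
    probs = 1.0 / (1.0 + np.exp(-(X_pool @ w + b)))
    uncertainty = -np.abs(probs - 0.5)
    for idx in np.argsort(uncertainty)[::-1]:
        if idx not in labeled:
            labeled.append(int(idx))  # ask the oracle for this label
            break

w, b = fit_logreg(X_pool[labeled], y_oracle[labeled])
acc = np.mean(((X_pool @ w + b) > 0) == y_oracle)
print(f"pool accuracy after {len(labeled)} labels: {acc:.2f}")
```

The design choice here is the acquisition function: swapping the uncertainty score for, e.g., expected model change or query-by-committee disagreement yields the other classic active learning variants, while the loop structure stays the same.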
In this post we looked at various approaches to data labeling. In the next post we’ll talk about how to generate training data for Named Entity Recognition (NER) using Snorkel, focusing on classifying independent objects in an enterprise application use case. We’ll also discuss an active-learning-based CNN+LSTM language parser for NLP.