It all Boils Down to the Training Data
Is your model not performing well? Try digging into your data. Instead of getting marginal improvements in performance by searching for state-of-the-art models, drastically improve your model’s accuracy by improving the quality of your data.
Since most data scientists are adapting off-the-shelf algorithms to specific business applications, one of the most difficult challenges they face today is creating a continuous workflow that consistently feeds high-quality training data into their algorithms. At the same time, your model is learning, and you want to leverage that increasingly capable model to label the rest of your data set. Building annotation infrastructure that integrates with your model, and managing the resulting workflow, are the most challenging parts of machine learning.
Iteration => Accuracy & Consistency
The axiom of “garbage in, garbage out” can be masked in training. Even when fed random noise, such as random labels or unstructured pixels, certain models are capable of overfitting to the point of attaining 0% training error (Understanding Deep Learning Requires Rethinking Generalization). This is because recent high-capacity models like deep neural networks can memorize even massive data sets. While these models commit no errors on the training data, when tested on new data they perform no better than random guessing.
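To make the memorization effect concrete, here is a minimal sketch. It swaps the deep neural network for a 1-nearest-neighbor classifier, which is a much simpler model but memorizes its training set in the same spirit. The data set sizes, the seed, and all function names are assumptions chosen for illustration.

```python
import random

def nearest_neighbor_predict(train_x, train_y, point):
    """Return the label of the training point closest to `point` (1-NN)."""
    best_index = min(
        range(len(train_x)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], point)),
    )
    return train_y[best_index]

def error_rate(train_x, train_y, eval_x, eval_y):
    """Fraction of eval points that the 1-NN memorizer labels incorrectly."""
    wrong = sum(
        nearest_neighbor_predict(train_x, train_y, x) != y
        for x, y in zip(eval_x, eval_y)
    )
    return wrong / len(eval_x)

rng = random.Random(0)  # fixed seed so the run is repeatable

def random_dataset(n, dims=5):
    """Features and labels are both pure noise, so there is nothing to learn."""
    xs = [[rng.random() for _ in range(dims)] for _ in range(n)]
    ys = [rng.randint(0, 1) for _ in range(n)]
    return xs, ys

train_x, train_y = random_dataset(300)
test_x, test_y = random_dataset(200)

# Each training point is its own nearest neighbor, so nothing is misclassified.
train_error = error_rate(train_x, train_y, train_x, train_y)
# On fresh noise the memorized labels carry no information.
test_error = error_rate(train_x, train_y, test_x, test_y)

print(f"training error: {train_error:.2f}")
print(f"test error:     {test_error:.2f}")
```

The training error comes out at zero even though the labels are meaningless, while the test error hovers around one half. A perfect training score alone tells you nothing about label quality.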
Therefore, iteration and rigorous QA/QC processes are essential to a proper data labeling workflow. “Quality evaluation methods can be classified in three main families: (i) automatic, (ii) by direct inspection of the job provider and (iii) methods using the crowd itself as evaluator” (Worker Ranking Determination in Crowdsourcing Platforms using Aggregation Functions). Since, in most cases, automated evaluation without human input is either impossible or guarantees only minimal quality, we will discuss how to implement QA/QC methods from the latter two categories to improve confidence in the quality of your training data.
- Test questions
- Direct inspection
Test questions and direct inspection are QA/QC methods that fit into category (ii), where the job provider, or data scientist, is directly responsible for evaluating quality. Test questions are a standard technique among companies: a set of data is correctly labeled by the data scientist and then distributed randomly among labelers to test their accuracy. Direct inspection is the process of visually inspecting your labeled data to gauge accuracy.
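Scoring labelers against test questions can be sketched in a few lines. The data layout, the labeler names, and the 0.8 accuracy threshold below are hypothetical; adapt them to however your labeling tool exports results.

```python
from collections import defaultdict

def score_labelers(gold_answers, submissions):
    """Return {labeler: accuracy} computed only on test (gold) questions.

    gold_answers: {item_id: correct_label} prepared by the data scientist.
    submissions:  iterable of (labeler, item_id, label) tuples.
    """
    correct = defaultdict(int)
    attempted = defaultdict(int)
    for labeler, item_id, label in submissions:
        if item_id not in gold_answers:
            continue  # regular task, not a test question
        attempted[labeler] += 1
        if label == gold_answers[item_id]:
            correct[labeler] += 1
    return {name: correct[name] / attempted[name] for name in attempted}

# Hypothetical gold set and labeler submissions.
gold_answers = {"img_01": "car", "img_02": "dress", "img_03": "car"}
submissions = [
    ("alice", "img_01", "car"),
    ("alice", "img_02", "dress"),
    ("alice", "img_03", "car"),
    ("alice", "img_99", "car"),  # not a test question, ignored
    ("bob", "img_01", "car"),
    ("bob", "img_02", "car"),
    ("bob", "img_03", "dress"),
]

scores = score_labelers(gold_answers, submissions)
# Flag labelers whose accuracy on test questions falls below the threshold.
flagged = sorted(name for name, acc in scores.items() if acc < 0.8)
print(scores, flagged)
```

Because test questions are mixed in randomly, labelers cannot tell which items are being graded, so accuracy on the gold set is a reasonable proxy for accuracy on the rest.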
Visual screening is a basic capability that everyone should have, both for preprocessing data and for reviewing labels for accuracy afterward. In the article Why You Need To Improve Your Training Data, And How To Do It, Pete Warden recommends randomly browsing through your data. This basic practice can reveal valuable information about your data set, such as “unbalanced number of examples in different categories, corrupted data (for example PNGs labeled with JPG file extensions), incorrect labels, or just surprising combinations.” For more practical tips on improving your data quality, read his article. While most open source tools do not provide this essential feature, Labelbox is a repository of labeled data where you can visually browse and manage your data in one place.
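Some of the checks that random browsing surfaces can be partly automated before a human looks at anything. The sketch below counts examples per category, flags files whose extension disagrees with their leading bytes, and draws a random sample for manual review. The record layout and function names are assumptions; the PNG and JPEG signatures are the standard file headers.

```python
import random
from collections import Counter

PNG_MAGIC = b"\x89PNG\r\n\x1a\n"
JPEG_MAGIC = b"\xff\xd8\xff"

def sniff_format(header):
    """Guess the image format from the first bytes of a file."""
    if header.startswith(PNG_MAGIC):
        return "png"
    if header.startswith(JPEG_MAGIC):
        return "jpg"
    return "unknown"

def extension_mismatch(filename, header):
    """True when the extension disagrees with the sniffed format.

    Files with an unrecognized header are flagged as well.
    """
    ext = filename.rsplit(".", 1)[-1].lower()
    ext = "jpg" if ext == "jpeg" else ext
    return sniff_format(header) != ext

def screen(records, sample_size, seed=0):
    """Summarize label balance and pull a random sample for manual review.

    records: list of (filename, header_bytes, label) tuples.
    """
    label_counts = Counter(label for _, _, label in records)
    mismatched = [name for name, header, _ in records
                  if extension_mismatch(name, header)]
    sample = random.Random(seed).sample(records, min(sample_size, len(records)))
    return label_counts, mismatched, sample

# Hypothetical records; in practice the header is the first bytes of each file.
records = [
    ("a.jpg", JPEG_MAGIC + b"\x00", "car"),
    ("b.jpg", PNG_MAGIC + b"\x00", "car"),  # a PNG with a JPG extension
    ("c.png", PNG_MAGIC + b"\x00", "dress"),
]
label_counts, mismatched, sample = screen(records, sample_size=2)
print(label_counts, mismatched)
```

Checks like these do not replace looking at the data. Incorrect labels and surprising combinations still need human eyes, which is what the random sample is for.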
While the QA/QC methods of category (ii) are extremely useful, they have two inherent drawbacks. First, they do not scale, since the resources the job provider, or data scientist, can devote to evaluating the accuracy of crowdsourced labels are finite. Second, in order to perform these methods, the correct answers must already be known.
Consensus, on the other hand, is both inherently scalable and useful when the correct answers are unknown. Consensus requires multiple annotators to provide labels for the same piece of data. With that information, consensus computes Intersection over Union (IoU), a measure of how much the labels overlap, to average out idiosyncrasies across labelers and recover a cleaner signal. In other words, the answers to the same question are compared to determine the rate of agreement. High agreement is indicative of a high-quality data set, while low agreement typically points to poor data quality but can also indicate ambiguous examples. Labelbox offers a built-in consensus tool so you can monitor your quality metrics in real time. Read more about how the Labelbox Consensus tool works here.
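As a rough illustration of an agreement rate built on IoU, the sketch below averages the pairwise IoU of bounding boxes drawn by several annotators on the same image. This is an assumption about one reasonable way to compute agreement, not a description of how the Labelbox Consensus tool is implemented; the box format and function names are hypothetical.

```python
from itertools import combinations

def iou(box_a, box_b):
    """Intersection over Union of two (x_min, y_min, x_max, y_max) boxes."""
    ix_min, iy_min = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix_max, iy_max = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    # Width and height clamp to zero when the boxes do not overlap.
    intersection = max(0, ix_max - ix_min) * max(0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union else 0.0

def consensus_score(boxes):
    """Mean pairwise IoU across all annotators' boxes for one item."""
    pairs = list(combinations(boxes, 2))
    if not pairs:
        raise ValueError("consensus needs at least two annotations")
    return sum(iou(a, b) for a, b in pairs) / len(pairs)

# Three hypothetical annotators label the same object; the third is offset.
annotations = [(0, 0, 10, 10), (0, 0, 10, 10), (5, 0, 15, 10)]
score = consensus_score(annotations)
print(f"consensus: {score:.3f}")
```

Items with a low score are good candidates for review: either one labeler is off, or the example itself is ambiguous.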
Diminishing Marginal Returns
Google published a study showing that even when you think you have enough data, adding more can make your model perform even better (The Unreasonable Effectiveness of Data). And yet, the answer is more complicated than “more is always better.”
The core question to ask is not whether you have enough data, but whether you have hit the efficient frontier, where the marginal cost of labeling exceeds the marginal gain in model performance. To visualize this, plot the model’s performance on held-out evaluation data as the training set grows. For example, start with 1000 samples to train your model and evaluate it on 200 held-out samples to measure your starting accuracy. Then collect another 1000 samples, retrain on all 2000, and evaluate again on the same held-out samples. The model is expected to do better with 2000 examples because it is learning to see natural variations in the data and to filter out idiosyncrasies while attending more closely to the signal.
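One way to turn this comparison into a stopping rule is sketched below. The accuracy figures, the cost per label, and the dollar value assigned to a percentage point of accuracy are all made-up numbers, and the function names are hypothetical. The point is only to show the marginal cost versus marginal gain arithmetic.

```python
def marginal_gains(curve):
    """Accuracy gained per additional labeled sample between checkpoints.

    curve: list of (num_training_samples, held_out_accuracy), sorted by size.
    """
    gains = []
    for (n_prev, acc_prev), (n_next, acc_next) in zip(curve, curve[1:]):
        gains.append((n_next, (acc_next - acc_prev) / (n_next - n_prev)))
    return gains

def first_unprofitable_size(curve, cost_per_label, value_per_accuracy_point):
    """Smallest training-set size whose last batch cost more than it returned.

    value_per_accuracy_point: what one percentage point of accuracy is worth.
    Returns None if every batch paid for itself.
    """
    for size, gain_per_sample in marginal_gains(curve):
        value_per_sample = gain_per_sample * 100 * value_per_accuracy_point
        if value_per_sample < cost_per_label:
            return size
    return None

# Hypothetical results: train on 1000, 2000, ... samples, always evaluating
# on the same 200 held-out samples.
curve = [(1000, 0.70), (2000, 0.80), (3000, 0.85), (4000, 0.86)]
stop_at = first_unprofitable_size(
    curve, cost_per_label=0.05, value_per_accuracy_point=20.0
)
print(stop_at)
```

With these numbers, the batch that brings the training set to 4000 samples adds only one point of accuracy, which is worth less than the labels cost. That flattening is the signal that you have reached the efficient frontier for the current model and data.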
It is common practice to use a labeling service where you outsource the data and get labeled data in return. However, if you are outsourcing your data labeling, but have no way of measuring the quality of the labeling service, you are essentially gambling with your investment.
Outsourced labeling services can be a good go-to for basic object classification, like labeling cars or dresses. If you need to generate a large labeling task force on a specific subject matter, there are different Business Process Outsourcing (BPO) firms that can accommodate particular specialized knowledge categories. Through Labelbox, you can connect with our partner BPOs, monitor the quality of your outsourced data labeling services, and create and manage your own workflow all on a single unified platform.
To Sum it Up, Clean it Up
Your model is only as good as your training data. Now that you know how to ensure that your training data is consistent enough, accurately labeled, and sufficient in size, go clean it up!