by Patrice Simard, CEO, Intelus
One humbling lesson I have learned in AI is that until I *interact* with both my model and a large corpus of unlabeled data, I don’t know what problem I am solving. Class definition — a critical part of the problem — is a continuous refinement process done through interactive exploration and discovery of data.
For example, if the problem is to classify documents related to “Gardening,” what should we do with documents related to botanical gardens, rock gardens, bonsai tree trimming, and so on? Positives? Negatives? What is “gardening,” really? Do cocktail recipes or cheese tray arrangements qualify as “cooking recipes”? How about the positives you don’t know about or didn’t think of? Does your perfect recipe classifier happen to miss insect-based recipes?
Finding the flaws and biases of one’s model through customer complaints is frustrating at best. At worst, it is embarrassing. My experience with ML started with optimizing benchmarks, which I did for more than a decade (I had the world record on the MNIST data set for many years). The benchmark task is to optimize algorithms *given* the classes and the labeled data. Looking at the data and defining classes is someone else’s problem. Optimizing algorithms on benchmarks is a great research tool. What I have realized since then is that to solve a specific problem, the only legitimate approach is to actively explore the data.
The following two experiences were critical in changing my perspective:
- As Chief Scientist in Live Labs and later in Ad Center at Microsoft, I often advised teams on how to build their AI models. Teams typically already had a solution using state-of-the-art algorithms when they came to me (sometimes citing scientific work I had contributed to). When it came to discussing performance, I learned to dread the “98+% accuracy” claim. What would invariably follow is that I would manually label some fresh data from the deployment distribution using the latest labeling directives. My human labels would typically agree with only 80% of the labels predicted by the latest model. I would then have the team blindly relabel the contentious examples using their own labeling directives, and the ruling would mostly be in my favor. What is the performance of the system then, 80% or 98%? The discussions that ensued were often difficult, and there were many tentative explanations: ambiguous data is progressively discarded; labeling directives evolve independently of training sets; the distribution changes and falls out of sync with the collected test sets; overtraining results from test-set information leaks; and so on. What I learned from these experiences is that the main source of error is, surprisingly, rarely the algorithm. The main sources of error are the semantics of the problem, the selection of data, and simple process mistakes.
- In 2014, two researchers on my team, Todd Kulesza and Saleema Amershi, ran a beautiful experiment: they asked several members of the team (me included) to label a data set. A month later, the same participants were asked to label a different data set that had a 75% overlap with the previous one. The surprise was that, for the same labeler and the same examples, the consistency between the two sets of labels was only 81.7% (a sketch of this kind of agreement measurement follows this list). How is that possible? We would each have guessed north of 90%. The discrepancy comes from being overconfident in our ability to define a problem unambiguously. As it turns out, defining class semantics is difficult, and the semantics evolve continuously as new data is discovered. Doing it successfully and consistently requires discipline (and tool assistance).
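The agreement numbers above are easy to reproduce on your own data. Below is a minimal sketch, in Python, of the measurement I am describing: compare two labeling passes on their overlapping examples and report the fraction that match. The function name and example ids are hypothetical, not the tooling used in the experiment; the point is how simple, and how sobering, this check is.

```python
# Sketch (not the experiment's actual tooling): given two labeling passes over
# partially overlapping data, compute the consistency on the shared examples.

def labeling_consistency(labels_a: dict, labels_b: dict) -> float:
    """Fraction of examples labeled identically in both passes.

    labels_a and labels_b map example ids to labels; only ids present in
    both passes (the overlap) are compared.
    """
    overlap = set(labels_a) & set(labels_b)
    if not overlap:
        raise ValueError("no overlapping examples to compare")
    agreed = sum(1 for ex_id in overlap if labels_a[ex_id] == labels_b[ex_id])
    return agreed / len(overlap)


# Hypothetical usage: the same labeler, one month apart.
pass_1 = {"doc-1": "gardening", "doc-2": "not-gardening", "doc-3": "gardening"}
pass_2 = {"doc-2": "gardening", "doc-3": "gardening", "doc-4": "not-gardening"}
print(labeling_consistency(pass_1, pass_2))  # 0.5 on the two shared examples
```

The same function works just as well for comparing fresh human labels against model predictions, which is the audit described in the first experience above.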
With the realization that class definition is both difficult and critical to performance came another realization: we do not have the right tools to define, let alone evolve, class definitions.
When I worked at Microsoft, labeling directives in Bing were often 100-page documents with dozens (and sometimes hundreds) of revisions. Yet labeling directives and labels were not synced in a common source-controlled repository. Data sets were grown organically, and which label within a data set came from which revision of the labeling directives was anyone’s guess.
Data collection was also an arcane art. Which queries were used to select which examples was not tracked (the sampling was not uniform). Yet without tracking this information, bias cannot be analyzed and corrected. As far as I know, these issues are endemic and present in every company that builds custom models, large or small. Some of these issues are summarized in this paper.
To define and refine class semantics, we need new tools that allow interactive exploration of unlabeled data. The exploration can leverage models and semantics in intermediate stages of construction to bootstrap the process, but this requires some subtlety. For brevity, I will limit this discussion to requirements.
To be effective, an interactive exploration tool needs to:
1. Assist the user’s search for false positives or false negatives.
2. Track how each training set example was collected, so that bias can be analyzed and corrected later.
3. Allow serendipitous grouping of examples that have related semantics (as suggested in Kulesza et al.’s 2014 paper), so that these groups can later become categories, be refined (split), redefined (moved), renamed, or deleted.
4. Source-control schema building (class definition), labeling, and featuring together.
Without requirement 1, the user wastes time labeling uninformative examples when the classes are lopsided. Requirement 2 is necessary to monitor and possibly undo selection bias. Requirement 3 allows the schema to evolve without requiring previously labeled examples to be relabeled. Requirement 4 enables reproducibility and allows multiple semantics to coexist and be compared (no relabeling is needed, thanks to 3).
These four requirements are complementary. The first two enable exploration and discovery of data without fear of bias. The last two enable experimentation with semantics without fear of making irrevocable semantic choices. Requirements 1 and 3 are assistance tools that make users more efficient; 2 and 4 are tracking tools that record user actions so they can be audited and corrected.
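To make requirements 2 and 4 concrete, here is a minimal sketch, in Python, of the record such a tool could keep for every labeled example. The class and field names are my own assumptions rather than an existing tool’s API; the point is that the selection provenance and the schema revision live next to the label itself, in the same source-controlled repository.

```python
# Sketch of the data model implied by requirements 2 and 4: every label records
# how its example was selected and which schema revision it was labeled against.
# All names are hypothetical.

from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class SchemaRevision:
    revision_id: str        # e.g. a commit hash in the same repo as the labels
    class_names: list[str]  # the classes as defined at this revision
    directives: str         # the labeling directives text (or a pointer to it)


@dataclass
class LabeledExample:
    example_id: str
    label: str
    schema_revision: str             # requirement 4: label is tied to a schema revision
    selection_method: str            # requirement 2: e.g. "uniform", "query", "model-uncertainty"
    selection_query: str | None = None  # the query or score that surfaced this example
    labeled_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


# Hypothetical usage: an example surfaced by a keyword query, labeled under revision "a1b2c3".
ex = LabeledExample(
    example_id="doc-42",
    label="gardening",
    schema_revision="a1b2c3",
    selection_method="query",
    selection_query="bonsai",
)
```

With records like these, analyzing selection bias reduces to grouping labels by their selection method and query, and reproducing a semantic reduces to checking out a schema revision together with the labels made against it.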
Building such tools requires a different mindset. Optimizing for interactivity is different from optimizing for benchmark performance. Interactivity requires sub-second response times, which most state-of-the-art high-performance systems cannot achieve for either training or sampling over large unlabeled data sets. Interactive exploration does not require accuracy, but it does require responsiveness, tracking, and integration.
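As an illustration of what interactive-speed training can look like, here is a minimal sketch assuming scikit-learn (my choice of library, not a tool described in this article): a hashed bag-of-words representation and an incrementally trained linear classifier can fold in a handful of new labels and re-score an unlabeled pool in well under a second. It trades accuracy for responsiveness, which is exactly the trade-off argued for above.

```python
# Sketch of a responsive inner loop for interactive labeling, assuming scikit-learn.
# A hashed, stateless vectorizer plus an incrementally updated linear model keeps
# both training and re-scoring fast enough for sub-second interaction.

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

vectorizer = HashingVectorizer(n_features=2**18)  # stateless: no fitting pass over the corpus
model = SGDClassifier(loss="log_loss")            # supports incremental updates via partial_fit

CLASSES = ["positive", "negative"]                # hypothetical two-class schema


def incorporate_labels(texts, labels):
    """Fold a small batch of fresh labels into the model (milliseconds, not minutes)."""
    X = vectorizer.transform(texts)
    model.partial_fit(X, list(labels), classes=CLASSES)


def rescore_pool(unlabeled_texts):
    """Re-rank the unlabeled pool so the user can hunt for likely false negatives."""
    X = vectorizer.transform(unlabeled_texts)
    scores = model.decision_function(X)
    return sorted(zip(scores, unlabeled_texts), reverse=True)
```

A production tool would need far more (tracking, schema versioning, smarter sampling), but even this much shows that responsiveness comes from choosing lightweight, incrementally updatable components rather than from the highest-accuracy model available.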
These tool requirements may not be sufficient to build a high-performance model, but they are in my opinion the best weapons we have to fight the largest sources of errors, namely semantic definition, data selection, and the lack of process discipline for ML development.
For additional reading, I suggest a forward-looking article I wrote for insideBIGDATA.com that transforms my concerns into what is hopefully a roadmap for organizations looking to unleash the power of their data in the coming year.