How Clara Labs is on the Cutting Edge of AI by Keeping Humans in the Loop
Note: This post was jointly developed with Erik Trautman, a technical entrepreneur specializing in business strategy and product development in the AI and blockchain spaces.
Jason Laska is a data scientist who is used to exploring the boundaries of machine learning.
After completing his Ph.D. in electrical engineering with a focus in signal processing, he dove straight into the chaos of implementation at Dropcam, which made a motion-activated home security camera and was later acquired by home automation powerhouse Nest Labs for over half a billion dollars. At Dropcam, he used machine learning to categorize detected actions and prioritize which events should drive push notifications to customers. As one of the largest inbound customers on Amazon AWS, Dropcam was a massively high volume, high throughput data science challenge.
That’s part of what makes what he’s doing now at Clara Labs, the San Francisco-based digital assistant startup that focuses on candidate scheduling in recruiting workflows, so fascinating — it’s machine learning at a completely different scale and with a completely different paradigm.
In an environment where data keeps getting bigger and the celebrated “cutting edge” has focused on million-example training sets and the scaling problems of massive models, the team at Clara Labs is quietly pushing the frontier in a different direction. The unique challenges of their product, which requires an exceptionally high level of accuracy without starting with a Google-sized data set, have led them to architect their systems and models to work with humans instead of trying to replace them.
This paradigm is typically called Human-in-the-Loop (“HITL”/“HIL”).
Rather than diluting the technical rigor of their solution, keeping humans in the loop has forced Clara to push the edges of both operational process and technical systems design. They have figured out how to marry human and machine for a result that’s stronger than either alone by creating a sophisticated task allocation system and using feedback loops from their human “workers” to improve their models.
Some of the toughest challenges facing ML-driven businesses are keeping processes efficient and continuously improving models. The HITL approach has surfaced best practices which address both of these problems.
In the following sections, we’ll take a look at those best practices plus how this framework can be used for almost any high-accuracy application that has to iteratively generate its own training data.
About the Product
Before diving into the technical side, it’s important to understand the context of how Clara Labs’ product works and how that drives its technical requirements.
At its most basic, Clara provides an email-based service which can be asked to perform sophisticated scheduling tasks. If you are chatting with a recruiting candidate and setting up the next interview or a coffee chat for next week, you CC “Clara” and she seamlessly figures out what you’re looking for, matches the available options against your preferences and follows up to make sure everything is properly set up.
Anyone who has been through a hiring process can attest to how easy it is for either party to drop the ball and lose track of conversation threads, so this addresses a major pain point for recruiting workflows.
The challenge with this kind of scheduling, where you’re asking humans to hand off a very valuable relationship, is that the bar for quality is extremely high. For example, a single misunderstanding of the meaning of “I’ll be out of town next week but let’s grab a coffee if you’re free the following” could force you to explain to a very frustrated candidate why you stood them up at Blue Bottle from 3000 miles away.
As Jason says,
“In a conversational system, the penalty for making intuitive mistakes is very high. People won’t trust a product that just doesn’t make sense. It’s very different from sorting videos or images based on some quality or importance metric. In those cases, you’re continuously optimizing the metric to improve customer experience, engagement, or revenue. For us, some classes of mistakes mean some people won’t continue using the product at all.”
Because of this, the team at Clara Labs has used a blended approach which combines machine learning models with human workers from the beginning. But the humans aren’t just a stand-in for some future model… they are a highly integrated part of an ongoing process of model training and quality control.
The Anatomy of Scheduling
Let’s lift the hood on what happens behind the scenes with Clara Labs’ product so we can explore the technical guts with proper context.
As with most engineering problems, the biggest challenge is actually defining the problem.
Once Clara has been brought into the loop via an email, their systems need to understand what is actually going on and what needs to be done. Setting up scheduling is fairly complex because you have to know:
- Who is being scheduled (the emailer or a third party?)
- What kind of event they are looking for (coffee, interview, “chat”…)
- The dates and times being suggested (is “4/10” April or October?)
- If there is positive (“…let’s try…”) or negative (“…can’t do…”) intent
- What type of response actually needs to be returned (propose a time? confirm a time?)
- …and so on.
All of this annotation information gets pulled out of the email text by a model which has been trained on thousands of prior emails. Because of the difficulty of this task, it is ripe for the HITL approach, which we’ll dig deeper into in the following sections.
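To make that output concrete, the fields above might be captured in a schema like the following sketch. The names, types, and example values here are illustrative, not Clara Labs’ actual data model:

```python
from dataclasses import dataclass
from enum import Enum

class Intent(Enum):
    POSITIVE = "positive"   # e.g. "...let's try..."
    NEGATIVE = "negative"   # e.g. "...can't do..."

@dataclass
class EmailAnnotation:
    """Scheduling attributes extracted from one email (illustrative schema)."""
    scheduling_for: str    # the emailer, or a named third party
    event_type: str        # "coffee", "interview", "chat", ...
    proposed_times: list   # candidate datetimes parsed from the text
    intent: Intent         # positive or negative scheduling intent
    response_type: str     # "propose_time", "confirm_time", ...
    confidence: float      # model confidence, used for routing later

ann = EmailAnnotation(
    scheduling_for="third_party",
    event_type="coffee",
    proposed_times=["2018-04-10T15:00"],  # is "4/10" April or October?
    intent=Intent.POSITIVE,
    response_type="propose_time",
    confidence=0.87,
)
```

A structured record like this is what the downstream scheduling logic consumes, regardless of whether a model or a human produced each field.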
After a conversational system like this understands the request and the information it contains, it then needs to make decisions based on that information. In this case, that means the actual scheduling step.
Customers set up their calendar in Clara alongside their preferences for best times, off-limit times, preferred locations and other parameters as part of their onboarding. Clara’s systems match those constraints against expected availability for the third party and come to best-case resolutions of any scheduling conflicts that surface (many problems, particularly multi-party scheduling, do not resolve cleanly).
Although determining that next action is an interesting aspect of the system, in this post we’ll focus on the natural language understanding (HITL) component in particular.
The whole process typically ends with a quick quality-assurance (QA) double-check and then the email suggesting a solution is on its way.
The Secret Sauce of Human-in-the-Loop at Clara Labs
Because one of the most difficult — and important — parts of this scheduling problem is natural language understanding, that’s where the secret sauce lies.
Let’s frame the problem this way: you start off with a text-based email thread and need to return a set of high-accuracy annotations which fill in the key attributes such as intent, duration and location so the scheduling logic can do its job. What’s the best way to architect this?
The first step is to set up an NLP pipeline which takes the natural language of the email and begins pulling out the important bits. This is a nontrivial task because nobody uses perfect grammar and intent is often unclear even to other humans. Couple this with a relatively small initial training set (it’s very expensive to annotate scheduling email sets ahead of time) and it’s a recipe for low confidence outputs, particularly at first. NLP tools like spaCy are a good start but they aren’t the whole story.
This gap is where active learning comes in. In short, active learning relies on human intervention to label the underlying data set where the model is least certain, so a relatively low labeling effort can produce high coverage of labeled data. This is an especially useful approach for cases like Clara’s, where the efficiency gain it provides allows them to optimally integrate human task workers into the real-time loop.
You can get a quick primer on active learning in the slides from this 2009 tutorial.
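As a concrete illustration, the simplest active-learning strategy, least-confidence sampling, fits in a few lines. The function and example data below are hypothetical:

```python
def least_confident(predictions, k=2):
    """Pick the k examples the model is least sure about for human labeling.

    predictions: list of (example_id, class_probabilities) pairs.
    Uncertainty here is 1 - max probability (least-confidence sampling).
    """
    scored = [(1 - max(probs), ex_id) for ex_id, probs in predictions]
    scored.sort(reverse=True)                 # most uncertain first
    return [ex_id for _, ex_id in scored[:k]]

preds = [
    ("email-1", [0.97, 0.03]),   # confident: no human review needed
    ("email-2", [0.55, 0.45]),   # ambiguous: worth a human label
    ("email-3", [0.60, 0.40]),   # also ambiguous
]
queue = least_confident(preds, k=2)
# queue holds "email-2" and "email-3", the two least confident examples
```

The labels that come back from humans on exactly these hard examples are the ones that improve the model fastest.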
In Clara’s case, instead of simply running the NLP stack on an incoming email and then blindly passing its output forward, they use that output to dynamically allocate annotation tasks to human workers and then they decide on next steps with the combined results in hand.
“From day one we have been designing the system such that our NLP pipeline and workers share the same tasking framework.”
Time and cost constraints mean it’s not feasible to have multiple human workers perform the same job as the model every time, so they determine which sets of NLP predictions are most likely to need checking and then distribute only those specific annotations as tasks among their human workers. Outputs from the NLP pipeline that have high enough confidence can be passed through to the next step without need for human intervention.
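That routing step might look something like this minimal sketch; the threshold and names are illustrative, not Clara’s actual system:

```python
def route_predictions(predictions, threshold=0.9):
    """Split (annotation, confidence) pairs into auto-accepted results
    and a queue of tasks for human workers.

    The 0.9 threshold is illustrative; in practice it would be tuned
    per annotation type.
    """
    accepted, human_queue = [], []
    for annotation, confidence in predictions:
        if confidence >= threshold:
            accepted.append(annotation)      # trust the model's output
        else:
            human_queue.append(annotation)   # send to a worker for checking
    return accepted, human_queue

accepted, queue = route_predictions([
    ("intent=positive", 0.97),
    ("time=4/10 (April?)", 0.62),
])
# the high-confidence intent passes through; the ambiguous date is queued
```

Only the low-confidence annotations incur human cost, which is what makes the hybrid economically viable.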
The task assignment algorithm leverages the idea that different workers excel at different tasks or task components. An expectation-maximization algorithm for estimating worker quality, refined for managing crowdsourced workers, is run regularly to update the system’s best guess at each worker’s capabilities.
Essentially, based on the performance history of each human worker in their pool, they can identify which workers are likely to perform the best on this specific type of annotation task and both allocate the task and weight the aggregation of worker results accordingly.
“At the end, we get a confusion matrix for each annotator for every variable they are responsible for. So for each person, we know how likely they are to, say, confuse PM with AM and we recompute these regularly automatically based on all the data that has been annotated over the previous week.”
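The full treatment uses expectation maximization in the style of Dawid and Skene; the sketch below shows just the counting step against current consensus labels, with hypothetical names and data:

```python
from collections import defaultdict

def confusion_matrices(annotations, consensus):
    """Estimate each worker's confusion matrix against consensus labels.

    This is one M-step of a Dawid-Skene-style EM, shown as a single
    pass for clarity.

    annotations: list of (worker, item, given_label) triples.
    consensus:   dict item -> current best-guess label.
    Returns worker -> {(true_label, given_label): probability}.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for worker, item, label in annotations:
        counts[worker][(consensus[item], label)] += 1
    matrices = {}
    for worker, cells in counts.items():
        # normalize rows so each true label's probabilities sum to 1
        row_totals = defaultdict(int)
        for (true, _), n in cells.items():
            row_totals[true] += n
        matrices[worker] = {
            cell: n / row_totals[cell[0]] for cell, n in cells.items()
        }
    return matrices

anns = [
    ("alice", "e1", "PM"), ("alice", "e2", "AM"), ("alice", "e3", "PM"),
]
consensus = {"e1": "PM", "e2": "PM", "e3": "PM"}
m = confusion_matrices(anns, consensus)
# alice confuses PM with AM one time in three on this tiny sample
```

In the real system these matrices would be recomputed regularly over the previous week’s annotations, as Jason describes.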
Once the output from the human workers has been bucketed alongside the NLP model’s output, another algorithm effectively aggregates the responses to determine the viability of the ensemble result. If threshold accuracy hasn’t been achieved, the task is passed back into the loop and elevated to stronger workers until the appropriate threshold has been exceeded.
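One simple way to aggregate such an ensemble is an accuracy-weighted vote with an escalation fallback. This sketch captures the general idea, not Clara’s actual algorithm; the weighting scheme and threshold are illustrative:

```python
from collections import defaultdict

def aggregate(votes, accuracy, threshold=0.8):
    """Combine model and worker votes, weighting each by estimated accuracy.

    votes:    dict voter -> label (the NLP model is just another voter).
    accuracy: dict voter -> estimated accuracy from the worker-quality step.
    Returns (label, confidence), or ("ESCALATE", confidence) when the
    ensemble does not clear the threshold and stronger workers are needed.
    """
    weights = defaultdict(float)
    for voter, label in votes.items():
        weights[label] += accuracy[voter]
    label = max(weights, key=weights.get)
    confidence = weights[label] / sum(weights.values())
    if confidence >= threshold:
        return label, confidence
    return "ESCALATE", confidence

decision, conf = aggregate(
    votes={"model": "PM", "alice": "PM", "bob": "AM"},
    accuracy={"model": 0.7, "alice": 0.9, "bob": 0.3},
)
# two reliable voters agree on "PM", so the ensemble clears the threshold
```

When the confidence falls short, the same task simply re-enters the loop with a different set of voters.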
The annotations resulting from this ensemble combination move two directions:
- Forward to the scheduling step so the core product can do its job.
- Backward to the core NLP models where they are used as labeled data to train it more effectively.
The second of these is important because it takes a process — bringing humans into the loop — which might otherwise seem like an inefficient economic sinkhole and turns it into a way of continuously improving training data, something that is crucially important for any production model.
The granularity of the active learning approach also allows Clara Labs to optimize their workforce composition and utilization over time, resulting in dramatic cost reductions alongside the obvious accuracy gains. This starts with dynamic task allocation to their human annotators:
“You actually discover which annotators are best at all the different components of the task so, for example, you can bump it back to a specific annotator if they are online. It’s possible to obtain fewer annotations and still get the desired accuracy by picking the right person. This allows for a dynamic scheduling of annotators.”
It also produces results that improve both individual performance and overall system design:
“When we first ran experiments with this approach, it clearly identified the most productive workers. Now we’re able to give workers scores that show them their probability of success on different areas in a task and we can construct economic incentives that reward the behaviors that matter most.”
Best Practices: Data
A theme that Jason continuously emphasized is how high the stakes are for delivering accurate model results and how that affects every part of their team’s process. Any data scientist worth their salt knows that the core of an accurate model isn’t the fancy algorithm or hardware but the data used to train and retrain it.
“You should be constantly thinking of how you can take care of your data… New data scientists have a habit of not looking closely enough at the source data when building their models.”
Keeping humans in the loop provides Clara Labs with solutions for a number of challenges typically posed by data. Most obviously, the feedback loop of the ensemble of task outputs continuously improves the quality of the model the human workers support.
It also solves the data quality problem which typically plagues teams trying to rely on third-party labeling. Specifically, by tying the economic performance of their human workers to the success of their tasks, they create long term incentives for high accuracy which simply don’t exist with third-party labelers.
Tackling New Problems
Jason’s experience in the industry has also allowed him to sidestep a number of other common problems faced by new entrants to the field. He advises data scientists approaching new problems to slow things down and:
- Explore the Data: Dive DEEP into the source data, looking manually through dozens of examples and understanding the relationships that they might be able to work with. Break it into smaller pieces and find the highest value nuggets.
“Look at the data! What do you have? What can you do with it? What questions need to be asked about it? What kind of labels will you need? Day-one bootstrapping is getting into the weeds. Do not assume this data looks like anything that has been written about before.”
- Explore Hypotheses: Make hypotheses and evaluate them as manually as possible directly with data before building any models. This is the stage where the cost of iteration and experimentation is low relative to spending weeks or months codifying shoddy assumptions into a model.
“What kinds of things might you want to predict about this data and why? Once you have those hypotheses, you can start using data mining techniques. Gensim is a great package for topic modeling and simple classifiers. Break apart your data into pieces to make it more manageable. You want to find what your highest value chunk is to try and explore and go from there. In that process, you will get starting points for training sets.”
- Start Simple: Try testing simpler approaches first to ease the path to production. You might find these simpler solutions sufficient to solve the problem at hand or, just as usefully, validate that you didn’t need to solve that particular problem anyway.
“If your task is classification, try using a random forest or SVM with scikit-learn for the first version if that scales for you, THEN move on to more advanced methods once you’ve seen it work.”
Only after mucking around properly in the source data and with lower fidelity algorithms do you know enough to start taking on more complex modeling.
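Following the spirit of that advice, a first-pass classifier can be very short. This scikit-learn sketch uses invented toy emails and labels, purely for illustration:

```python
# A minimal first-pass text classifier: bag-of-words features plus a
# random forest. The data is made up; a real training set would come
# from the annotation loop.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

emails = [
    "let's grab coffee next week",
    "can't do Tuesday, sorry",
    "happy to chat, try Thursday",
    "no longer interested, please cancel",
]
labels = ["positive", "negative", "positive", "negative"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)

# bootstrap=False so each tree sees the full (tiny) training set
clf = RandomForestClassifier(n_estimators=50, bootstrap=False, random_state=0)
clf.fit(X, labels)
train_acc = clf.score(X, labels)
```

If a baseline like this already solves the problem, you have saved weeks; if it doesn’t, you now have an honest benchmark for anything fancier.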
Dealing with Data Issues
Simply having a good process to get started isn’t enough because data is never perfect and certainly doesn’t stay that way. Several key problems often come up in these types of applications:
- Data Bias: Bias in models doesn’t come from biased algorithms; it comes from biased source data.
- Data Distortions: It’s extremely rare to have a well-balanced data set where everything you want to classify is well represented. Typically, a handful of popular classes dominate the training set while the outliers which represent the “fat tail” of remaining cases will be disproportionately underrepresented.
- Data Drift: Over time, data can drift away from your training set. For example, new product features may result in new customer behaviors which in turn lead to new kinds of examples.
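A crude guard against the drift and distortion issues above is to compare label frequencies in the training set against a recent production window. The tolerance and the frequency-based check below are illustrative; a real system might use a proper statistical test:

```python
from collections import Counter

def drift_report(train_labels, recent_labels, tolerance=0.1):
    """Flag classes whose frequency shifted more than `tolerance`."""
    def freqs(labels):
        total = len(labels)
        return {k: v / total for k, v in Counter(labels).items()}
    train_f, recent_f = freqs(train_labels), freqs(recent_labels)
    flagged = {}
    for label in set(train_f) | set(recent_f):
        delta = recent_f.get(label, 0.0) - train_f.get(label, 0.0)
        if abs(delta) > tolerance:
            flagged[label] = round(delta, 3)
    return flagged

report = drift_report(
    train_labels=["propose"] * 80 + ["confirm"] * 20,
    recent_labels=["propose"] * 50 + ["confirm"] * 30 + ["cancel"] * 20,
)
# "cancel" never appeared in training: a new behavior the model hasn't seen
```

A check like this won’t fix drift, but it tells you when the training set has quietly stopped describing reality.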
Jason also highlighted a key problem which isn’t well represented in conventional wisdom: as much as the source data itself, the pre-processing we do to clean up that data is often responsible for the very problems laid out above.
For example, an initial data set might be cleaned up to remove what appears to be noise, which could exacerbate class distortions:
“You may have chosen to remove certain classes of data during the cleaning process because they didn’t seem important to the problem at the time, which may mean that you have a blind spot if you re-use that set for a different problem.”
A data scientist needs to be particularly cognizant of how the data has been manipulated through the whole pipeline prior to entering the model or risk missing issues which can reduce accuracy as things change over time:
“One of the big problems with machine learning systems is that the model is not the whole product, it is the final output of a data cleaning process and data manipulation process… it’s a whole pipeline. There’s an offline part of that pipeline which can introduce bias if you don’t know what processing came earlier.
“This is fine in an academic setting because you have a data set everybody can work against for their algorithms. But in the practical setting, it’s unfortunate because you’re losing a huge amount of fidelity of what happened between the creation of that data set and the harvesting of it and that’s incredibly valuable information.
“If you’re not snapshotting the data a model was trained on and you’re not tracking the whole pipeline used for creating those data sets you may not even know how much you’ve influenced that data set to begin with.”
At Clara Labs, they are obsessed with tracking the evolution of their data sets and with continuously improving them using new labels from production. That’s part of why the feedback loop between human task workers and the NLP models is so important and why it’s something they don’t ever expect to fully remove.
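One lightweight way to make training data traceable is to fingerprint every snapshot with a content hash plus the version of the pipeline that produced it. The scheme below is an illustrative sketch, not Clara’s tooling:

```python
import hashlib
import json

def snapshot_id(records, pipeline_version):
    """Deterministic fingerprint of a data set plus the code that built it."""
    # canonicalize so the same records always hash the same way
    canonical = json.dumps(sorted(records, key=json.dumps), sort_keys=True)
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return f"{pipeline_version}-{digest[:12]}"

records = [{"text": "let's grab coffee", "label": "positive"}]
sid = snapshot_id(records, pipeline_version="clean-v2")
# a trained model can now record exactly which snapshot it was trained on
```

Stamping every model with such an identifier answers, months later, the question "what data and what cleaning produced this?"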
Best Practices: Process and Production
It can be tempting to think of data science as a one-way process of acquiring training data, building a model, then tossing the model “over the wall” to engineering for implementation in production. This approach is rife with problems and quickly breaks down when it meets the realities of production.
This problem was one of many highlighted in the 2014 paper “Machine Learning: The High-Interest Credit Card of Technical Debt” and it was further emphasized by Jason:
“If you’re working on that next best algorithm and nothing is in production yet, your colleagues may have no idea how to use what you’re building and they are *not* focused on your problems.”
The high bar for accuracy combined with the pace of a rapidly growing startup has required Clara Labs to build strong and iterative processes which unify their team around best practices.
Engineering + Data Science = ❤
Unifying engineering and data science starts with the org structure. Jason heads both functions at Clara Labs, which works well for smaller teams. In larger teams, it may mean embedding the data scientists into the engineering teams directly.
To further strengthen the integration between these teams, Jason stressed the need for cross-education. It is important for those developing models to understand the constraints and challenges of the production environment and for those building resilient software to understand the behaviors of stochastic models.
At Clara Labs, they do everything from monthly demo days to cross-functional lunch-and-learns to providing an annual subsidy that offsets the cost of conferences, workshops, books and other educational pursuits. In addition:
“We have weekly sprint retros that include a final 10–15 minute presentation by someone on the team with the title “What am I thinking about?” — this may be a new project still being designed/spec’d, something mid-flight, an in depth review of part of the stack, or something the teammate just wants to teach us about! This item was partially inspired by the way some academic groups function (and a lot borrowed from best practices around managing creative processes).”
Production Best Practices
Many of the problems resulting in unhappy data scientists building shelfware models or taking on massive technical debt stem from a process which isolates data science from the production context in which their models need to perform.
The team at Clara Labs has surfaced a number of best practices to address this:
- Get to Deployment ASAP: Streamline the deployment and retraining process so it’s optimized for speed and iteration.
“If we want to update a model, we actually deploy a new version of the pipeline even though it may just be a re-trained model.”
- Baseline in Production: Use an existing model in production as a control against which to test the performance of a newly deployed model so you can be sure that you are making an “apples-to-apples” comparison.
“Establish the baseline performance. If you have a baseline, that means you have something in production working and all the systems around it have been architected… One strategy is to ‘dark launch’, where you have the model running in production but not driving automation.”
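The ‘dark launch’ pattern from the quote can be sketched as follows, with hypothetical names: the candidate model sees live traffic, but only the baseline drives automation, and disagreements are logged for offline comparison.

```python
def serve(request, baseline, candidate, disagreements):
    """Return the baseline's decision; run the candidate silently in shadow."""
    live = baseline(request)
    shadow = candidate(request)
    if shadow != live:
        # log disagreements for offline apples-to-apples comparison
        disagreements.append((request, live, shadow))
    return live   # automation only ever acts on the baseline's output

log = []
result = serve(
    "schedule coffee 4/10",
    baseline=lambda r: "propose_time",
    candidate=lambda r: "confirm_time",
    disagreements=log,
)
# result is the baseline's answer; the disagreement is recorded for review
```

Because the candidate never drives automation, a bad new model costs you nothing but a log file, while a good one earns its promotion with production evidence.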
- Mirror Production Constraints: Minimize the time spent re-developing models by building and testing the models in environments that closely mimic production.
“It’s hard to be someone who can go deep on algorithms and someone who can go deep on systems so you often have a hand-off that occurs where you ‘throw it over the wall’. Someone built this thing in Matlab or Python and the production engineer might say you’re using too much of a resource — bandwidth, cpu, max run time — and during that iteration process you sometimes have to reconsider the original technique itself.”
Jason acknowledged that it isn’t easy with today’s tooling to accomplish an ideal flow between data science and engineering but stressed that it’s worth trying, especially when the product changes regularly and models need to allow for high rates of experimentation.
He highlighted Dan Crankshaw’s Clipper project out of Berkeley, which seeks to tie the application code together with the models so they share similar deployment processes, as an example of an approach that is starting to gain more traction in the community. One advantage:
“If you have a large amount of ‘glue code’ between your model and the data source in real time, that glue code should probably be the same in the training code and the production code.”
HITL for New Ideas
One more advantage of the human-in-the-loop architecture specifically is that it allows for much easier testing of new ideas than a production flow which has been hardened completely around a single model.
When they want to experiment with a new model or product feature, the team at Clara Labs can quickly spin up a test with entirely human workers to baseline its effectiveness. Then, as parameters are honed and the new idea or feature is validated, a model is seamlessly integrated into the workflow and continuously improved like all the others.
This can save weeks or months of model building, and there is no friction during the hand-off period when the model begins to step in, because the processes are all designed from the start to work interchangeably with either human or model-driven outputs.
For a fast-moving startup, that kind of iteration can make all the difference.
Closing the Loop
Jason’s migration from high-throughput big data at Dropcam to high-accuracy human-in-the-loop at Clara Labs highlights that progress doesn’t travel only along the dimensions of bigger training sets, faster algorithms and more automation. The team’s work partnering models with human workers to annotate and retrain data sets shows how a more focused and iterative approach can solve a whole new frontier of challenging and high-stakes problems with machine learning.
Keeping Up with Progress
Of course, no matter what direction progress moves, it can be challenging to keep up. Jason offered his approach for filtering knowledge:
“There’s just way too much out there to read so you have to deploy an ‘explore versus exploit’ strategy. I try to pick papers that have nuggets of insight and authors who are actually doing something different rather than the next best performing result. Good stuff is often surfaced by following those authors on arXiv and Twitter.”
You can follow Jason’s “good stuff” in a number of places. He has discussed the technical guts of datetime disambiguation at Clara Labs and spoken about detailed strategies for integrating people with ML systems. He will be speaking specifically on designing the text annotation tasks for HITL products later in 2018.