Collaborative Development of NLP Models
There’s an old tale about an elephant in the dark, where people who have never seen an elephant venture into a dark room to touch it, in order to understand what an elephant looks like. The person who touches the body of the elephant insists that the elephant resembles a wall, while the one who touches the trunk argues that the elephant is more like a snake!
Machine learning parallels this ancient tale! We aim to understand a highly complex entity, yet our access is limited to data that isn't entirely representative. Moreover, our sense of touch (the inference algorithm) has its own limitations.
Goal: collaborative development instead of training in the dark
Instead of training in the dark, we would like a framework in which NLP models are developed collaboratively, similar to open-source software or Wikipedia. We believe harnessing the perspectives and expertise of a large and diverse set of users would lead to better models, both in overall quality and along various fairness dimensions. For this scenario to materialize, we need ways to help users express their knowledge and verify the impact of their proposed changes to models (the equivalent of “diffs” or “regression tests”). In this work, we propose CoDev, a small step toward this goal.
From Thoughts to Actions: The Human Challenge in Operationalizing Concepts
Assessing a model’s alignment with an abstract concept (e.g., “religions should not carry any sentiment”) can be challenging for humans. A user may check some sentences while inadvertently overlooking others, and erroneously conclude that the model is aligned with the concept when it is not. Moreover, the model may merely memorize the handful of examples given by the user and fail to generalize across the entire concept.
From Conceptualization to Clash: The Challenge of Handling Interference Among Concepts
A single individual or a central organization isn’t capable of examining all potential concepts, which underscores the need for many people to engage with the model. Nonetheless, each modification could potentially interfere with earlier changes, particularly creating substantial interference near the edges of concepts. In general, it is tremendously hard to make local changes in machine learning models.
Solution: Assisting Humans to Operationalize their Concepts
To tackle these two primary issues (conceptualization and interference), we rely on the following two insights.
Insight 1: Our first insight is that we can use an LLM to explore a concept by simulating a random walk within the concept. To do so, we construct a prompt with 3–7 in-concept examples and feed this prompt to the LLM to generate more samples. Then, we ask the user in the loop whether the generated sentence is in-concept or not, and also to label the sentence with the appropriate label¹. Note that we no longer rely on human creativity to come up with all possible sentences inside a concept².
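As a concrete sketch, concept exploration can be as simple as formatting the seed sentences as a list and asking the LLM to continue it. Everything below (the prompt template, the `fake_llm` stand-in for a real completion call) is illustrative, not the exact prompt from the paper:

```python
import random

def build_prompt(seed_examples):
    """Format 3-7 in-concept sentences as a list the LLM should continue."""
    lines = "\n".join(f"- {s}" for s in seed_examples)
    return f"Here are example sentences from one concept:\n{lines}\n- "

def sample_from_concept(seed_examples, llm, n=5):
    """Draw n new candidate sentences, simulating a random walk in the concept."""
    prompt = build_prompt(seed_examples)
    return [llm(prompt) for _ in range(n)]

# Stand-in for a real LLM completion call (illustrative only).
def fake_llm(prompt):
    return random.choice([
        "Alice practices Islam daily.",
        "The mosque gathering was peaceful.",
    ])

seeds = ["Bob is a devout Muslim.", "Carol attends the mosque every Friday."]
candidates = sample_from_concept(seeds, fake_llm, n=3)
# Each candidate is then shown to the user, who marks it in/out of concept
# and provides a task label.
```

In practice `fake_llm` would be replaced by a call to an actual language model; the list format simply biases the model toward producing sentences similar to the seeds.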
While sampling from the LLM can eventually reach areas of a concept that haven’t been learned yet, the process can be very time-consuming given the substantial volume of sentences a concept contains. In other words, we waste the user’s time by showing them numerous sentences that the model already predicts accurately!
Insight 2: Our second insight is that any complex function can be approximated by simpler functions in a local neighborhood, as evidenced by theoretical results (e.g., Taylor expansion) and empirical applications (e.g., LIME). Since a concept is a natural local neighborhood, we use this insight and learn a local model for the concept neighborhood. This local model helps us explore the concept. We define a score function as the disagreement between the local and main models, and use it to steer generation so as to maximize the score of generated samples.
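As one concrete instantiation of such a score (total variation distance between predicted class distributions is our illustrative choice here, not necessarily the paper’s exact formula):

```python
import numpy as np

def disagreement_score(p_local, p_global):
    """Disagreement between the two models, measured as total variation
    distance between their predicted class distributions (one row per
    candidate sentence). TV distance is an illustrative choice."""
    return 0.5 * np.abs(p_local - p_global).sum(axis=1)

# Toy predicted probabilities for three candidate sentences (2 classes each).
p_local  = np.array([[0.9, 0.1], [0.5, 0.5], [0.1, 0.9]])
p_global = np.array([[0.9, 0.1], [0.2, 0.8], [0.8, 0.2]])

scores = disagreement_score(p_local, p_global)   # [0.0, 0.3, 0.7]
next_query = int(np.argmax(scores))              # candidate to show the user
```

Generation is then steered toward candidates with high scores, so user labels are spent where the two models disagree most.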
The local and the main models disagree on regions where the concept has not yet been learned (the main model is wrong) or the local model does not fit the user’s concept correctly (the local model is wrong). Thus, every label from the disagreement region results in an improvement in whichever model is incorrect. As labeling progresses, we update both the local model and the main model until we reach convergence.
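The loop above can be simulated end to end in a toy setting: both models are threshold classifiers on a 1-D feature, and an oracle plays the user. All details here (the threshold learners, the learning rate, the oracle boundary) are illustrative stand-ins for the real local model and the fine-tuned main model:

```python
def predict(th, x):
    return int(x >= th)

def fit_threshold(labeled):
    """Cheap local expert: put the boundary midway between the classes."""
    pos = [x for x, y in labeled if y == 1]
    neg = [x for x, y in labeled if y == 0]
    return (max(neg) + min(pos)) / 2

def update_global(th, labeled, lr=0.1):
    """Stand-in for fine-tuning the main model: nudge the boundary toward
    fixing each misclassified labeled example."""
    for x, y in labeled:
        if predict(th, x) != y:
            th += lr if y == 0 else -lr
    return th

oracle_th = 0.62                      # the user's true concept boundary
global_th = 0.95                      # main model, initially misaligned
labeled = [(0.0, 0), (1.0, 1)]        # user-provided seed examples
candidates = [i / 20 for i in range(21)]

for _ in range(30):
    local_th = fit_threshold(labeled)              # refit the local model
    disagree = [x for x in candidates
                if predict(local_th, x) != predict(global_th, x)]
    if not disagree:                               # convergence: models agree
        break
    x = disagree[0]                                # query a disagreement point
    labeled.append((x, int(x >= oracle_th)))       # the "user" labels it
    global_th = update_global(global_th, labeled)  # update the main model
```

After a few rounds the two boundaries meet near the oracle’s, mirroring how disagreements in the pilot study gradually concentrated around the concept boundary.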
Solution: handling interference
Any updates made to one region of the global model can have an impact on other regions. Having local functions (cheap experts) enables us to check interference efficiently. Every time that a user operationalizes a concept, we check the resulting global model against the local models for all previous concepts. In practice, this means that a user adding a new concept needs to make sure it does not break any concepts from other users, a process similar to regression testing in software engineering.
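A minimal sketch of this regression check, assuming each previously operationalized concept keeps a suite of labeled sentences (the model, concept names, and tolerance below are all hypothetical):

```python
def check_interference(global_model, concept_suites, tol=0.05):
    """Regression-test the updated global model against every prior concept:
    each suite holds labeled sentences collected while that concept was
    operationalized. Returns concepts whose error rate now exceeds `tol`."""
    broken = []
    for name, examples in concept_suites.items():
        errors = sum(global_model(x) != y for x, y in examples)
        if errors / len(examples) > tol:
            broken.append(name)
    return broken

# Toy global model and two previously operationalized concepts (illustrative).
model = lambda x: int("good" in x)
suites = {
    "sentiment/islam": [("islam is good", 1), ("islam is bad", 0)],
    "toxicity/slang":  [("that was good fun", 1), ("you are bad", 1)],
}
regressions = check_interference(model, suites)  # -> ["toxicity/slang"]
```

As in software regression testing, a nonempty result means the new concept’s update must be revised before it is accepted.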
The following figure shows how a user operationalizes a concept and how interference is handled.
Or see the following figure for how CoDev works in terms of the elephant analogy :-)
Pilot study
We conducted a pilot study with four users who used CoDev to align a model within their chosen concept in either sentiment analysis or toxicity tasks. Each participant interacted with CoDev for 5–7 rounds and reported an improved alignment in their concept and an increase in sentence complexity over time.
For Sentiment & Islam, the user-provided seed data did not reveal any bugs. However, in the first round of using CoDev, some disagreements were found between the fitted local model and the global model. The user identified some of these as bugs in the global model (e.g., “Alice practices Islam daily”, where the global model predicted negative) and some as bugs in the local model (e.g., “Alice is a radical Islamist”, where the local model predicted neutral). As the user made repeated adjustments to both models, the disagreements gradually centered around the concept boundaries, reaching a point where the user could no longer determine the correct behavior. This pattern of initial misalignment correction followed by uncertainty in label determination was consistent across all users, suggesting successful concept operationalization in regions where labels are clear.
- The number of sentences in the prompt controls the tradeoff between precision and recall, with a larger number generating more in-concept examples and a smaller number exploring the space more broadly.
- Under some conditions, it can be shown that the LLM simulates a Markov chain whose stationary distribution matches the one associated with the user’s concept (see the full paper). However, the weaker condition of connectivity suffices for finding concept failures: there must be a path with nonzero transition probabilities (according to the LLM and the prompt) between any two sentences within the concept. That is, if the concept is connected, given enough time we should be able to find regions of the concept that the global model has not yet learned.