Collaborative Development of NLP Models
There’s an old tale about an elephant in the dark, where people who have never seen an elephant venture into a dark room to touch it, in order to understand what an elephant looks like. The person who touches the body of the elephant insists that the elephant resembles a wall, while the one who touches the trunk argues that the elephant is more like a snake!
Machine learning parallels this ancient tale! We aim to understand a highly complex entity, yet our access is limited to data that isn't entirely representative. Moreover, our sense of touch (the inference algorithm) has its own limitations.
Goal: collaborative development instead of training in the dark
Instead of training in the dark, we would like a framework in which NLP models are developed collaboratively, similar to open-source software or Wikipedia. We believe harnessing the perspectives and expertise of a large and diverse set of users would lead to better models, both in overall quality and along various fairness dimensions. For this scenario to materialize, we need ways to help users express their knowledge and verify the impact of their proposed changes to models (the equivalent of “diffs” or “regression tests”). In this work, we propose CoDev, a small step toward this goal.
From Thoughts to Actions: The Human Challenge in Operationalizing Concepts
Assessing a model’s alignment with an abstract concept (e.g., “religions should not carry any sentiment”) can be challenging for humans. A user may check some sentences while inadvertently overlooking others, and erroneously conclude that the model is aligned with the concept when it is not. Moreover, the model may merely memorize the handful of examples given by the user and fail to generalize across the entire concept.
From Conceptualization to Clash: The Challenge of Handling Interference Among Concepts
A single individual or a central organization isn’t capable of examining all potential concepts, which underscores the need for many people to engage with the model. Nonetheless, each modification could potentially interfere with earlier changes, particularly creating substantial interference near the edges of concepts. In general, it is tremendously hard to make local changes in machine learning models.
Solution: Assisting Humans to Operationalize their Concepts
To tackle these two primary issues (conceptualization and interference), we rely on the following two insights.
Insight 1: Our first insight is that we can use an LLM to explore a concept by simulating a random walk within the concept. To do so, we construct a prompt with 3–7 in-concept examples and feed this prompt to the LLM to generate more samples. Then, we ask the user in the loop whether the generated sentence is in-concept or not, and also to label the sentence with the appropriate label¹. Note that we no longer rely on human creativity to come up with all possible sentences inside a concept².
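As a concrete sketch, concept exploration can be as simple as formatting the seed sentences as a list and asking the LLM to continue it. Everything below (the prompt template, the `fake_llm` stand-in for a real completion call) is illustrative, not the exact prompt from the paper:

```python
import random

def build_prompt(seed_examples):
    """Format 3-7 in-concept sentences as a list the LLM should continue."""
    lines = "\n".join(f"- {s}" for s in seed_examples)
    return f"Here are example sentences from one concept:\n{lines}\n- "

def sample_from_concept(seed_examples, llm, n=5):
    """Draw n new candidate sentences, simulating a random walk in the concept."""
    prompt = build_prompt(seed_examples)
    return [llm(prompt) for _ in range(n)]

# Stand-in for a real LLM completion call (illustrative only).
def fake_llm(prompt):
    return random.choice([
        "Alice practices Islam daily.",
        "The mosque gathering was peaceful.",
    ])

seeds = ["Bob is a devout Muslim.", "Carol attends the mosque every Friday."]
candidates = sample_from_concept(seeds, fake_llm, n=3)
# Each candidate is then shown to the user, who marks it in/out of concept
# and provides a task label.
```

In practice `fake_llm` would be replaced by a call to an actual language model; the list format simply biases the model toward producing sentences similar to the seeds.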
While sampling from the LLM can eventually reach areas of a concept that haven’t been learned yet, the process can be very time-consuming given the substantial volume of sentences a concept contains. In other words, we waste the user’s time by showing them numerous sentences that the model already predicts accurately!
Insight 2: Our second insight is that any complex function can be approximated by simpler functions in a local neighborhood, as evidenced by theoretical results (e.g., Taylor expansion) and empirical applications (e.g., LIME). Since a concept is a natural local neighborhood, we use this insight and learn a local model for the concept neighborhood. This local model helps us explore the concept. We define a score function as the disagreement between the local and main models, and use it to steer generation so as to maximize the score of generated samples.
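As one concrete instantiation of such a score (total variation distance between predicted class distributions is our illustrative choice here, not necessarily the paper’s exact formula):

```python
import numpy as np

def disagreement_score(p_local, p_global):
    """Disagreement between the two models, measured as total variation
    distance between their predicted class distributions (one row per
    candidate sentence). TV distance is an illustrative choice."""
    return 0.5 * np.abs(p_local - p_global).sum(axis=1)

# Toy predicted probabilities for three candidate sentences (2 classes each).
p_local  = np.array([[0.9, 0.1], [0.5, 0.5], [0.1, 0.9]])
p_global = np.array([[0.9, 0.1], [0.2, 0.8], [0.8, 0.2]])

scores = disagreement_score(p_local, p_global)   # [0.0, 0.3, 0.7]
next_query = int(np.argmax(scores))              # candidate to show the user
```

Generation is then steered toward candidates with high scores, so user labels are spent where the two models disagree most.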
The local and the main models disagree on regions where the concept has not yet been learned (the main model is wrong) or the local model does not fit the user’s concept correctly (the local model is wrong). Thus, every label from the disagreement region results in an improvement in whichever model is incorrect. As labeling progresses, we update both the local model and the main model until we reach convergence.
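The loop above can be simulated end to end in a toy setting: both models are threshold classifiers on a 1-D feature, and an oracle plays the user. All details here (the threshold learners, the learning rate, the oracle boundary) are illustrative stand-ins for the real local model and the fine-tuned main model:

```python
def predict(th, x):
    return int(x >= th)

def fit_threshold(labeled):
    """Cheap local expert: put the boundary midway between the classes."""
    pos = [x for x, y in labeled if y == 1]
    neg = [x for x, y in labeled if y == 0]
    return (max(neg) + min(pos)) / 2

def update_global(th, labeled, lr=0.1):
    """Stand-in for fine-tuning the main model: nudge the boundary toward
    fixing each misclassified labeled example."""
    for x, y in labeled:
        if predict(th, x) != y:
            th += lr if y == 0 else -lr
    return th

oracle_th = 0.62                      # the user's true concept boundary
global_th = 0.95                      # main model, initially misaligned
labeled = [(0.0, 0), (1.0, 1)]        # user-provided seed examples
candidates = [i / 20 for i in range(21)]

for _ in range(30):
    local_th = fit_threshold(labeled)              # refit the local model
    disagree = [x for x in candidates
                if predict(local_th, x) != predict(global_th, x)]
    if not disagree:                               # convergence: models agree
        break
    x = disagree[0]                                # query a disagreement point
    labeled.append((x, int(x >= oracle_th)))       # the "user" labels it
    global_th = update_global(global_th, labeled)  # update the main model
```

After a few rounds the two boundaries meet near the oracle’s, mirroring how disagreements in the pilot study gradually concentrated around the concept boundary.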
Solution: handling interference
Any updates made to one region of the global model can have an impact on other regions. Having local functions (cheap experts) enables us to check interference efficiently. Every time that a user operationalizes a concept, we check the resulting global model against the local models for all previous concepts. In practice, this means that a user adding a new concept needs to make sure it does not break any concepts from other users, a process similar to regression testing in software engineering.
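A minimal sketch of this regression check, assuming each previously operationalized concept keeps a suite of labeled sentences (the model, concept names, and tolerance below are all hypothetical):

```python
def check_interference(global_model, concept_suites, tol=0.05):
    """Regression-test the updated global model against every prior concept:
    each suite holds labeled sentences collected while that concept was
    operationalized. Returns concepts whose error rate now exceeds `tol`."""
    broken = []
    for name, examples in concept_suites.items():
        errors = sum(global_model(x) != y for x, y in examples)
        if errors / len(examples) > tol:
            broken.append(name)
    return broken

# Toy global model and two previously operationalized concepts (illustrative).
model = lambda x: int("good" in x)
suites = {
    "sentiment/islam": [("islam is good", 1), ("islam is bad", 0)],
    "toxicity/slang":  [("that was good fun", 1), ("you are bad", 1)],
}
regressions = check_interference(model, suites)  # -> ["toxicity/slang"]
```

As in software regression testing, a nonempty result means the new concept’s update must be revised before it is accepted.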
The following figure shows how a user operationalizes a concept and how interference is handled.
Or see the following figure for how CoDev works in terms of the elephant analogy :-)
Pilot study
We conducted a pilot study with four users who used CoDev to align a model within their chosen concept in either sentiment analysis or toxicity tasks. Each participant interacted with CoDev for 5–7 rounds and reported an improved alignment in their concept and an increase in sentence complexity over time.
For Sentiment & Islam, the user-provided seed data did not reveal any bugs. However, in the first round of using CoDev, some disagreements were found between the fitted local model and the global model. The user identified some of these as bugs in the global model (e.g., “Alice practices Islam daily”, where the global model predicted negative) and some as bugs in the local model (e.g., “Alice is a radical Islamist”, where the local model predicted neutral). As the user made repeated adjustments to both models, the disagreements gradually centered around the concept boundaries, reaching a point where the user could no longer determine the correct behavior. This pattern of initial misalignment correction followed by uncertainty in label determination was consistent across all users, suggesting successful concept operationalization in regions where labels are clear.
- The number of sentences in the prompt controls the tradeoff between precision and recall, with a larger number generating more in-concept examples and a smaller number exploring the space more broadly.
- Under some conditions, it can be shown that the LLM simulates a Markov chain whose stationary distribution matches the one associated with the user’s concept (see the full paper). However, the weaker condition of connectivity suffices for finding concept failures: there must be a path with nonzero transition probabilities (according to the LLM and the prompt) between any two sentences within the concept. That is, if the concept is connected, given enough time we should be able to find regions of the concept that the global model has not yet learned.