Exploring Potential Bias in Virtual Assistants
Co-authored by David Mueller, Richard Walls, Katherine Brown & Vignesh Harish
In order for an AI solution to be successful, it needs to be trustworthy. Users must be able to trust that the solution was designed with their experience in mind. Since AI is trained on sample data that is selected and provided by human designers, there is always a possibility that a solution will inherit human biases. In the case of a chatbot, training data that reflects the communication habits of one particular group, at the expense of others, creates a biased solution that can undermine users’ trust.
Many chatbots engage in unstructured interactions that can be extremely difficult to assess for bias. Often, there is no real record of who is using a chatbot or whether they regard the interaction as successful. Even when a chatbot solicits feedback from users, there is rarely any way to determine whether a negative response results from bias or some other problem. For instance, even a chatbot that obtains a favorable rating 80% of the time can be biased if its 20% unfavorable results are disproportionately concentrated in an underrepresented community of users.
In this post, we’ll review a recent pilot study to look for bias in data from the Content Catalog for IBM’s Watson Assistant. First, we’ll look at the background of the study and the challenges of detecting bias in virtual assistants. We’ll discuss the methodology that our team devised to detect bias in an assistant and examine the results of our test run of the process. Finally, we’ll look at next steps for this process and how our work might help developers detect, monitor, and mitigate bias in their chatbots.
Our group is a cross-functional team of IBM developers that was formed as part of IBM’s JumpStart program and tasked with investigating potential bias in the Content Catalog. We began this project with several rounds of user research. We interviewed developers, business unit leaders, and end users to understand the challenges presented by bias, both for those who design virtual assistants and those who interact with them. While we heard of a range of experiences involving bias from end users, on the development side, many of the recurring concerns centered on the ability to get reliable and accurate training data for the assistant.
Virtual assistants are only as effective as the user data that they are trained on. In cases where there is little useful training data, such as call logs or chat records, developers might rely on sample content from Watson Assistant’s Content Catalog. The Content Catalog is a collection of generic intents, or user goals, that are associated with common use cases, such as customer care, account management, and basic conversational interactions. This content can be useful for training a chatbot to answer routine questions. However, there’s no way to account for the variety of ways in which any question might be asked in a real-world scenario. If a particular demographic tends to use a term or phrase that a chatbot has not been trained on, the chatbot might consistently underperform with this group, creating a biased solution. Currently, there is no mechanism or process to evaluate Watson Assistants for bias.
Watson Assistant responds to user requests by using intents, or goals that are expressed in a user’s input, such as answering a question or processing a bill payment. An assistant is trained to recognize utterances, or user inputs, that can be associated with certain goals (check account balance, get store hours, etc.) and connect these utterances with predefined intents. However, if the statements the chatbot is trained on reflect the communication practices of only some groups of users, while excluding others, the chatbot is biased. For example, a chatbot that recognizes Where is the nearest bar? but not Is this the closest pub? might be biased against British consumers.
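To make the idea concrete, here is a minimal sketch of how intent coverage works. This is purely illustrative and is not the actual Watson Assistant training format: each intent is paired with the example utterances the assistant learns from, and an utterance a group commonly uses but that appears nowhere in the examples is a potential blind spot.

```python
# Illustrative sketch only -- not the actual Watson Assistant training
# format. Each intent is paired with example utterances it learns from.
training_examples = {
    "store_hours": [
        "What time do you open?",
        "Are you open on Sundays?",
    ],
    "store_location": [
        "Where is the nearest bar?",
        "Is this the closest pub?",  # phrasing more common among British users
    ],
}

def covered(utterance, examples):
    """Naive check: does the exact utterance appear in any intent's examples?"""
    return any(utterance in utts for utts in examples.values())
```

A real assistant generalizes beyond exact matches, but the principle holds: phrasings absent from the training set are the ones most likely to be misunderstood.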
More seriously, an assistant that is implemented in a health care application and performs poorly with seniors could cause problems not only for the population it serves, but for the care provider that implemented it and the developers who designed it. Even when an assistant is trained on actual user interactions (such as chat logs) there’s no guarantee that it will be prepared to serve all the different users who need to access it today.
Our project was to evaluate how a chatbot that is trained on Watson Assistant Content Catalog data performs with different demographic groups and explore the results for potential bias. To do this, we designed a survey to collect sample user interactions with chatbots, along with demographic data for each survey participant. Our chatbot was then tested on these user responses and evaluated for accuracy. The results were cross-referenced with the demographic data to identify areas where the chatbot performs poorly.
By connecting demographic information with actual user utterances, we hoped to identify not only whether certain demographics are being poorly served, but which specific utterances may not be getting the correct response. Different age, gender, and language groups have different ways of communicating. If an assistant is not trained to recognize utterances that are common to a particular user group, its performance may show bias against that group. The process that we outline in this post can help developers and businesses identify where their assistant needs improvement and what sorts of utterances must be targeted for retraining in order to improve.
When we designed our survey, we wanted to include questions that were closely aligned with the pre-defined intents from Watson Assistant’s Content Catalog. With this strategy, the responses that we received could provide direct insight into the variety of interactions that different demographic groups might have with a Watson Assistant while trying to accomplish goals associated with these intents.
In order to minimize the introduction of bias into the survey, we chose the following five intents by using a random selection method:
- Customer care
- Cancel account
- Negative feedback
- Hours
- Location
As another measure to reduce bias, each of the intent-based questions included a generic scenario that was designed to elicit an open interpretation of how to interact with the chatbot. To collect a significant amount of data for each intent, we asked each participant to provide three responses to each scenario-based question.
For example, to collect responses that might be associated with the Hours and Location intents, we asked the following scenario-based question:
You are planning a trip to a restaurant in another city. What would you ask a chatbot to help plan your trip? Please provide three examples.
By leaving the question open-ended, we hoped to gather a wider variety of interactions than we would have by directly prompting participants to ask about business hours or location. A more direct request could have biased participants to simply reorder the prompt as their response, replacing more genuine responses with a generic What are your hours and location?
To provide the proper context for understanding bias, we collected a set of demographic information from each participant. This information was solicited on a voluntary, anonymous basis. Each demographic question focused on a characteristic that is commonly subject to bias: age, gender identity, language, and education. When combined with the intent-based questions, the demographics provide a representative data set that can be used to improve a Watson Assistant.
Testing Watson Assistant
After collecting demographically tagged user responses through our survey, we used them to perform a blind test of a Watson Assistant with the Watson Assistant Testing Tool. The Watson Assistant Testing Tool is a set of scripts that run against a Watson Assistant instance to test its performance on a specified collection of utterances. The Watson Assistant bot we tested was trained on the same intents (though not the same utterances) that we used to solicit our sample user responses in the survey. The testing tool feeds our sample utterances to Watson Assistant, which then attempts to match them with intents. The intents that Watson associates with each utterance are the predicted intents. The testing tool determines the assistant’s accuracy by comparing these predicted intents with the golden intents, which were previously identified as the correct response for each utterance by a human verifier.
The accuracy of Watson Assistant’s performance on each utterance is a simple yes-or-no split: did the assistant associate the utterance with the golden intent? Each correctly identified utterance receives a score of 1, while each incorrect guess receives a 0. The testing tool tallies these scores as a measure of the assistant’s accuracy for each utterance and intent. Watson Assistant also assigns a confidence score to each predicted intent, which reflects how confident the assistant is that the predicted intent is correct. This confidence score is directly related to how closely a given utterance matches the assistant’s training data. The testing tool pulls in this score from Watson Assistant and incorporates it into the test results.
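The scoring step can be sketched in a few lines of Python. The record format below is an assumption for illustration, not the Testing Tool’s actual schema:

```python
# Sketch of the blind-test scoring described above; the record format
# here is an assumption, not the Testing Tool's actual schema.
records = [
    {"utterance": "When do you close?", "predicted": "Hours",
     "golden": "Hours", "confidence": 0.92},
    {"utterance": "Shut my account down", "predicted": "Customer care",
     "golden": "Cancel account", "confidence": 0.41},
]

def score(record):
    # 1 if the predicted intent matches the golden intent, otherwise 0
    return 1 if record["predicted"] == record["golden"] else 0

accuracy = sum(score(r) for r in records) / len(records)  # fraction correct
```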
The confidence score provides some context around the accuracy score. A low confidence score on an accurate answer may indicate the assistant simply made a lucky guess and might not be reliable in another similar case. Conversely, a high confidence score on an inaccurate response may mean the assistant has completely misunderstood an utterance and needs to be retrained. Confidence scores range from 0 (no confidence) to 1 (maximum confidence). For our purposes, the confidence score helped us identify and track certain utterances that were giving the assistant trouble within a particular demographic. Therefore, improving the assistant’s performance on these utterances might improve its overall performance across that demographic.
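The two failure modes described above can be turned into a simple triage rule. The thresholds here are hypothetical choices for illustration and were not part of the study:

```python
# Hypothetical triage using the confidence score; the 0.3 and 0.8
# thresholds are illustrative choices, not values from the study.
LOW, HIGH = 0.3, 0.8

def triage(record):
    correct = record["predicted"] == record["golden"]
    if correct and record["confidence"] < LOW:
        return "lucky guess"        # right answer, weak support in training
    if not correct and record["confidence"] > HIGH:
        return "confident mistake"  # likely misunderstanding; retrain
    return "correct" if correct else "miss"
```

Utterances flagged as lucky guesses or confident mistakes are the natural candidates to trace back to a demographic and target for retraining.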
Running the Numbers
The CSV output from our survey consisted of 15 sample utterances and four demographic markers for each participant. First, we removed the demographic information and fed the sample utterances into the Watson Assistant Testing Tool to test our Watson Assistant against them. Our assistant was a test instance that was trained on the same intents that we targeted when we collected our sample utterances. However, since this was a blind test, the assistant was not previously trained on the sample utterances themselves. We used the Testing Tool’s blind test feature to output confidence and accuracy scores for each utterance to a new CSV file. We then worked in a Jupyter Notebook, using Python libraries such as NumPy and plot.ly, to merge the test output and survey output CSV files and create data visualizations. These visualizations helped us track variations in our assistant’s performance across different age groups, gender identities, levels of education, and first languages.
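The merge step can be sketched with the standard library alone. The column names (participant_id and so on) are hypothetical stand-ins for the headers our survey tool actually produced:

```python
import csv
import io

# Minimal sketch of the merge step. Column names (participant_id, etc.)
# are hypothetical stand-ins, not our survey tool's actual headers.
survey_csv = io.StringIO(
    "participant_id,age_group,gender\n"
    "1,25-34,F\n"
    "2,55-64,M\n")
results_csv = io.StringIO(
    "participant_id,utterance,accuracy,confidence\n"
    "1,where are you located,1,0.88\n"
    "2,whereabouts is your shop,0,0.35\n")

# Index the demographics by participant so each test result can be tagged
demographics = {row["participant_id"]: row for row in csv.DictReader(survey_csv)}

merged = [{**row, **demographics[row["participant_id"]]}
          for row in csv.DictReader(results_csv)]
```

With the demographics attached to each scored utterance, the merged rows can be grouped and plotted directly.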
Our results showed relatively uniform accuracy and confidence scores across every demographic set and across all of the questions. There was no obvious evidence of bias in our test assistant at this high level. The following graphs show the spread of average confidence and accuracy scores across all demographics for all sample utterances. You can see that there is little variation in the overall scores.
The following swarm plot shows the distribution of predicted intents by confidence score for all sample utterances across the gender identity demographic. In this plot, each predicted intent is represented by a colored dot. The dot’s place on the y axis indicates the average of all confidence scores from each time the assistant assigned this intent to an utterance from the survey. Since the confidence score reflects how closely a given utterance is associated with the assistant’s training data, the plot can be used to map actual utterances to strengths and weaknesses in the assistant’s training.
Here, as elsewhere in our study, the distribution shows no statistically significant difference between men and women. There is no obvious bias. However, if an assistant were not trained on certain utterances that are common to one demographic group, we might expect to see a clustering of low confidence scores in that demographic that is not present in the others. From there, we could trace the predicted intents with those confidence scores back to the actual utterances that produced them and retrain the assistant to improve its confidence with that demographic.
We further analyzed each specific question, exploring for question-specific bias. The following graph shows the variation in average confidence score across the age group demographic for question four, Please provide three examples of how you would ask a chatbot to clarify its response. This prompt was designed to gather sample utterances that could be associated with one of our selected intents.
Here you can see a variation of over 15 percentage points in average confidence score between the age demographics with the highest and lowest confidence. Due to the very small sample size in our study, this still does not represent a statistically significant degree of variation. However, such a result, returned against an actual Watson Assistant in production and serving an enterprise-level client base, might indicate a significant weakness in performance for one age group that affected a large number of users. In this case, the next step would be to determine why the assistant was showing lower average confidence with this group of users. Was there a particular utterance, term, or word that caused the assistant to perform disproportionately worse with this demographic?
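This kind of per-group comparison is straightforward to automate. The following sketch computes the average confidence per age group and flags a spread wider than 15 points for follow-up; the rows and the threshold are illustrative, not values from our data:

```python
from collections import defaultdict
from statistics import mean

# Sketch: average confidence per age group, flagging a spread wider
# than 15 points. The rows and the threshold are illustrative.
rows = [
    {"age_group": "18-24", "confidence": 0.81},
    {"age_group": "18-24", "confidence": 0.79},
    {"age_group": "65+", "confidence": 0.62},
    {"age_group": "65+", "confidence": 0.66},
]

by_group = defaultdict(list)
for row in rows:
    by_group[row["age_group"]].append(row["confidence"])

averages = {group: mean(scores) for group, scores in by_group.items()}
spread = max(averages.values()) - min(averages.values())
needs_review = spread > 0.15  # trace low-scoring utterances back for retraining
```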
Recipe for Exploring Potential Bias
Our pilot study did not uncover any obvious bias in the Watson Assistant Content Catalog. Our methodology is reusable to help developers evaluate their Watson Assistants for bias and to improve their training sets to mitigate any bias that is detected. This methodology could be particularly useful for a Watson Assistant solution that is already in production, and for which certain demographic information may already be a part of user profiles (for example, in a financial or health services institution). Following this procedure in such a case would obviate the need to create a survey and would also allow developers to focus on specific intents that may already be troublesome for their users. Furthermore, by running this process iteratively (identifying possible bias, retraining the assistant, and running the process again), developers can generate a record of diagnosis and improvement to share with clients and regulators.
The following diagram shows the steps that a developer can follow to explore and mitigate potential bias in a Watson Assistant that is in a production scenario where some user demographics are available.
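The iterative loop can be outlined in code. The callables here (run_blind_test, retrain) are hypothetical placeholders for the real test and retraining steps, not a Watson Assistant API:

```python
# Illustrative outline of the iterative review loop. run_blind_test and
# retrain are hypothetical placeholders, not a real Watson Assistant API.
def review_cycle(run_blind_test, retrain, max_rounds=3, threshold=0.15):
    history = []
    for round_num in range(max_rounds):
        per_group_confidence = run_blind_test()  # e.g. {"18-24": 0.8, ...}
        spread = (max(per_group_confidence.values())
                  - min(per_group_confidence.values()))
        history.append((round_num, spread))
        if spread <= threshold:
            break  # no demographic lags far behind the others
        retrain()  # target utterances from the lowest-scoring group
    return history  # a record of diagnosis and improvement to share
```

Returning the per-round spread gives developers the audit trail of diagnosis and improvement mentioned above.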
The challenge of bias in AI, and in Watson Assistant in particular, presents a classic example of a wicked problem. It is context-dependent, socially entangled, and impossible to solve uniformly. There is no silver-bullet solution. However, it is our hope that by creating greater visibility and explainability around how Watson Assistants perform across different demographics, we can give developers the tools and insight to design AI solutions that are more representative of all the users they serve.
To learn more about the challenges of bias in chatbots and other AI solutions, explore the following resources:
- This is how AI bias really happens — and why it’s so hard to fix, MIT Technology Review
- Racial Bias in Conversational Artificial Intelligence, Towards Data Science
- AI and bias, IBM Research
- Bias in AI: How we Build Fair AI Systems and Less-Biased Humans, THINKPolicy blog
- How I’m fighting bias in algorithms, Joy Buolamwini’s TED Talk