using a visual analytic workflow to discover bias in gender prediction models
Jun Yuan
Machine learning techniques are widely used to make crucial predictions and discover patterns from massive data collected from social media. In recent years, more concerns have been raised about potential biases introduced into machine learning models. High-performance prediction models may produce discriminative predictions. For example, Amazon’s recruiting algorithms have been reported to discriminate against resumes from female applicants. Such discriminative behaviors are often hard to detect due to the complexity of state-of-the-art machine learning techniques. The issue becomes more severe when non-technical users apply the models without examining the models themselves. To improve the transparency, trust, and responsible application of machine learning models in social media research and other industries, one potential approach is to use visual analytic techniques to translate the black-box nature of machine learning models into a human-understandable interactive interface. In this way, both non-technical and seasoned machine learning model users can leverage the power of visual pattern recognition to discover potential bias in prediction models.
workflow and expectations
Our work consists of the following steps:
- collection and descriptive analysis of the tweet gender prediction data
- tweet data tokenization and preparation for classification model training
- classification model training using three different algorithms
- model explanation generation for classification models
- visual analytics design and implementation
- bias discovery usage demonstration using the interactive visual interface
For step 1, we want to get an overview of the word frequencies in the dataset between the male and female groups to understand the ground truth. For step 2, we want to combine the useful text information and convert the text into vectors as standard input for classification models. We also want to prepare the training and testing sets for model performance evaluation. For step 3, we want to train and evaluate three classification algorithms with different complexity levels. We measure their accuracy and F1 performance at this stage. For step 4, we will use LIME (a model-agnostic ML explainer) to derive an explanation for each prediction from each classification model. The goal is to see whether the explanations for the same input vary across models. For step 5, we want to export all the explanation data and design visualizations to systematically discover classification bias patterns on keywords between the male and female groups. For step 6, we want to demonstrate our interface by exploring a list of keywords and analyzing the potential biases. The details of each step are explained in the following sections.
1. collection and descriptive analysis of the tweet gender prediction data
The data is downloaded from Kaggle: https://www.kaggle.com/crowdflower/twitter-user-gender-classification. The data features include user names, random tweets, and the description texts of the Twitter accounts. The author of the dataset used a crowdsourcing approach to generate the ground-truth gender labels (male, female, brand, unknown) for Twitter accounts. Participants judged the gender based on the profile of the account. The dataset’s confidence feature indicates the percentage of participants who gave the same gender label for a Twitter account. Since we first want to train machine learning classification models that best mimic human predicting behavior, we only select the records with 100% confidence in the gender labels (meaning all participants agreed on the label). To focus on the potential model bias inherited from the data, we select the records labeled either male or female as the ground truth for model training. A total of 53,849 records are used in the following steps.
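A minimal sketch of this filtering step is shown below; the file name and column names (`gender`, `gender:confidence`) are assumptions based on the Kaggle dataset description and should be adjusted to the actual schema.

```python
import pandas as pd

# Load the Kaggle CSV (file name and column names are assumed).
df = pd.read_csv("twitter_gender_classification.csv", encoding="latin-1")

# Keep only records where every annotator agreed on the label
# (confidence == 1.0) and the label is either "male" or "female".
df = df[(df["gender:confidence"] == 1.0) & df["gender"].isin(["male", "female"])]

print(len(df), "records selected")
```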
2. tweet data tokenization and preparation for classification model training
We first combine the Twitter username, the user description text, and the tweet content into a single long text string. We then tokenize each string using TF-IDF. In this way, each word in the text string is converted to a numerical value, and each text string becomes a vector of numerical values. We also randomly split the data into 80% training data and 20% testing data.
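The sketch below illustrates this step with scikit-learn, continuing from the filtering sketch above; the column names (`name`, `description`, `text`) and the random seed are assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Concatenate username, profile description, and tweet text into one string
# per account (column names are assumed).
corpus = (df["name"].fillna("") + " "
          + df["description"].fillna("") + " "
          + df["text"].fillna(""))
labels = df["gender"]

# TF-IDF converts each combined string into a sparse numerical vector.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

# Random 80/20 split for training and testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, random_state=42)
```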
3. classification model training using three different algorithms
We first use the training data from the previous step to train three classification models:
- logistic regression
- K-nearest neighbor classifier
- random forest classifier
The models are selected to represent different complexity levels among popular classification models. We then compute accuracy and F1 scores for the models on the testing set.
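A minimal sketch of the training and evaluation step, assuming the scikit-learn implementations of the three models and the split from step 2; the hyperparameters and the choice of positive label for F1 are assumptions.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "k-nearest neighbor": KNeighborsClassifier(),
    "random forest": RandomForestClassifier(),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    # pos_label="female" is an arbitrary but explicit choice for binary F1.
    print(name,
          "accuracy:", round(accuracy_score(y_test, pred), 3),
          "F1:", round(f1_score(y_test, pred, pos_label="female"), 3))
```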
Results show that logistic regression has the highest F1 score and accuracy. Based on this result, one might conclude that logistic regression is the best model to deploy in real-life usage. However, we have no idea whether the model makes biased predictions or not. We want to examine the classification models further regarding their potential biases.
4. model explanation generation for classification models
Once the models are trained, we use LIME to generate explanations for each record under each model. Here is a brief explanation of the LIME technique. LIME takes the entire dataset and the index of a specific record as input. Based on the dataset and the chosen record, LIME generates hundreds of random samples in the neighborhood of the chosen record. These random samples are then fed into the classification model, which returns a prediction for each sample. In our case, the prediction outcome is male or female. A linear regression is then fitted on the random samples and their predictions to determine each feature's importance to the prediction. In our case, the importance is the weight of each word contributing to the male or female prediction. The following example shows the standard LIME explanation output for record number 9081 in the dataset, as predicted by the logistic regression, KNN, and random forest models, respectively.
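One way to produce an explanation like this is with the `lime` package's text explainer, as sketched below; wrapping the fitted vectorizer and a trained model in a pipeline so LIME can work on raw text is an assumption about the setup, and the class ordering follows scikit-learn's alphabetical ordering of string labels.

```python
from lime.lime_text import LimeTextExplainer
from sklearn.pipeline import make_pipeline

# Chain the fitted vectorizer and a trained model so LIME can pass raw text
# in and receive class probabilities back.
pipeline = make_pipeline(vectorizer, models["logistic regression"])

explainer = LimeTextExplainer(class_names=["female", "male"])

record_index = 9081  # the example record discussed in the text
explanation = explainer.explain_instance(
    corpus.iloc[record_index],   # raw combined text of the record
    pipeline.predict_proba,      # classifier function over raw text
    num_features=7)              # top 7 influential words

# Each pair is (word, signed weight toward the second class).
print(explanation.as_list())
```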
In this example, even though all three models produce the correct prediction, we discover that “Basketball” is considered more strongly correlated with the male prediction by the logistic regression model. This might be problematic if the model predicts further input data based on such discriminative correlations. This raises a question: is “basketball” always correlated with male predictions? To answer this question and to systematically explore other potential discriminative correlations, we repeat the LIME process for the three models over all 53,849 records. We export the results to a spreadsheet and import them into Tableau. The runtime of the LIME process varies from model to model: logistic regression, K-nearest neighbor, and random forest take 2, 4, and 10 hours, respectively. Since the LIME function calls the model’s prediction function for each generated sample, a more complex model costs more time.
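The batch export can be sketched as a plain loop over all records and models, continuing from the previous sketches; the output file name and column layout here are assumptions about how the data is handed to Tableau.

```python
import csv

# Repeat the explanation step for every record and every model, then write the
# word weights to a CSV that can be imported into Tableau.
with open("lime_explanations.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["model", "record", "true_gender", "word", "weight"])
    for model_name, model in models.items():
        pipeline = make_pipeline(vectorizer, model)
        for i in range(len(corpus)):
            exp = explainer.explain_instance(
                corpus.iloc[i], pipeline.predict_proba, num_features=7)
            for word, weight in exp.as_list():
                writer.writerow([model_name, i, labels.iloc[i], word, weight])
```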
5. visual analytics design and implementation
Above is a screenshot of the interface that we designed and implemented. It shows an example of the visualization of the logistic regression model’s LIME explanations. The top-row visualizations show bar charts of word counts. The length of a bar indicates how often the word appears among the top 7 influential words, and the color indicates the average contribution of the word when it is considered influential. For example, the words “and,” “may,” and “of” are the top three influential keywords for both female and male predictions. However, “of” is the third most frequent word for female predictions but the first for male predictions. In the top two visualizations, users can scroll through all the influential keywords and compare them between the male and female groups.
The bottom visualizations form the word bias explorer. Users can search for a word of interest in the search bar on the right and compare the word between the male and female groups. The length of a bar indicates the average contribution, and the bubble plots show the contribution distributions for each keyword and gender group combination. For example, we can search for and select “education” and related keywords in the search bar. We then see that “educated” is, on average, a stronger positive indicator for male prediction than for female prediction, while “education” is, on average, a stronger negative indicator for female prediction than for male prediction. The bubble plots serve as a significance check: if the bubbles for one gender differ from those of the other gender in mean and variance, we are more confident in the conclusion. Combining both pieces of evidence, the model likely has a stronger correlation between the word “education” and male prediction.
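The aggregation behind these views can also be approximated outside Tableau; a minimal sketch, assuming the exported CSV and the grouping by ground-truth gender used above (the grouping choice is an assumption about how the interface splits the groups).

```python
import pandas as pd

# Load the exported explanation weights (columns as assumed above).
exp_df = pd.read_csv("lime_explanations.csv")

# Average signed contribution, spread, and frequency of each word per gender
# group, mirroring the bars and bubbles in the word bias explorer.
summary = (exp_df
           .groupby(["word", "true_gender"])["weight"]
           .agg(["mean", "std", "count"])
           .reset_index())

print(summary[summary["word"].isin(["education", "educated"])])
```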
6. bias discovery usage demonstration using the interactive visual interface
We collect a list of keywords for checking potential discrimination:
- professor
- boss
- money
- CEO
- teacher
- happy
- sad
- finance
- computer
- driver
We use the interface introduced in step 5 to examine the logistic regression model with the keyword list.
We discover that some of these keywords are regarded as influential to the prediction even though they should not be considered important. For example, the model correlates “boss” more strongly with male prediction. If not corrected, the model could lead to discrimination against female bosses.
Discussion
There are limitations to our workflow. We simplified the input data for model training: in the original crowdsourcing experiment, participants were given the entire Twitter profile on which to base their judgments, whereas we only take the text information as input. The Twitter background setting, follower count, following count, and other features may also influence human judgment. Such a limitation on the input data may also constrain the performance of the models, and poorly performing models may lead to unreliable LIME explanations. In addition, foreign-language vocabulary is tokenized and shows up in similar-word searches. For future work, we would like to train better models and account for foreign vocabularies.
Conclusion
We design and demonstrate a visual analytic workflow to discover biases in classification models. For future work, we would like to design visualizations that support comparing multiple models in one view. There are also concerns about the ethical issue of exposing user information without consent. We will take these into account when designing future visual analytic workflows and interfaces.