Using AI to tackle gender bias in job adverts

Testing OpenAI GPT-3 Private Beta in solving real-life problems

Tomasz Tkaczyk
Apr 12 · 10 min read

Can a job description be sexist? Put another way: can the words we choose to describe jobs send the wrong message to female candidates, or inadvertently put off most male applicants? A great deal of academic research has gone into answering these important questions, and it yields strong evidence that gendered wording in job advertisements exists and sustains gender inequality, especially to the disadvantage of women.

Examples of gender-biased job descriptions used in a study. The left ad targets women, whereas the right one is male-biased.

Recently, I was happy to receive a message from OpenAI inviting me to use the private beta of their (in)famous GPT-3 model. I was immensely excited to start working with it, so I began to look for a first project. A discussion with my girlfriend, who works in venture capital looking for talented founders from diverse backgrounds, made me aware of the gendered language problem. I learned that, especially in the tech domain, which is still male-dominated, employers struggle to craft their messaging in a gender-neutral way.

While the impact of bias on the candidate pipeline is clear, spotting the bias in language isn’t always obvious or intuitive. Research from ZipRecruiter shows that gender-neutral job adverts receive up to 42% more applications than their more biased counterparts. This means that refining the wording of a job posting can help cut the time required to fill the position. It also suggests that removing gender-biased terminology not only addresses the larger issue of inequality in the workplace but also greatly broadens the pool of candidates who decide to apply.

There is a limited selection of free and paid tools that can help you avoid bias when writing job descriptions. Unfortunately, the free tools are based on simple keyword search, which helps only by highlighting words that might carry some bias. This approach does not take context into account, nor does it provide any suggestions on how to address the identified issues. Paid augmented writing tools like textio.com offer a lot more, such as wide synonym pools and gender tone scoring. This, however, does not come at a low price. I wondered whether it would be possible to come up with a more affordable (possibly free) solution to tackle this important issue.

Can a biased AI help you solve this issue?

When GPT-3 was released in mid-2020, it almost immediately made international headlines. It was described as an intelligent machine that can write English better than an average person. The ultimate benchmark was a kind of “Turing test” in which a group of people was asked to spot a generated, made-up news article among genuine ones. The results of this experiment showed that the generated news was virtually indistinguishable from human-written text.

As many articles and studies suggest, GPT-3 suffers from many bias-related problems:

…the machine(GPT-3) guessed that professions demonstrating higher levels of education were heavily male leaning (such as banker, or professor), while professions such as steward, nurse, receptionist, and housekeeper were heavily female-leaning.

Considering that the model was trained on huge amounts of data scraped straight from websites like Reddit, it would be naive to assume that the model won’t learn things that it shouldn’t.

There is definitely a lot of work to be done in the future to ensure that these complex AI models, which learn from human-generated data, won’t pick up any of our biases and preconceptions.

Still, it does not mean that such models cannot be used to tackle the problem of gender bias today — if used properly. Therefore, I decided that my first project would be to build a gender-sensitive paraphrasing tool. The goal of the tool is to take a possibly biased input and automagically paraphrase it into a gender-neutral form.

Experiment

In contrast to the classical ML development cycle in which a model is either trained or fine-tuned on millions of examples, GPT-3 gives you a totally different experience when it comes to model development.

In essence, it allows you to communicate your intent in plain English and lets the model figure out what the task is. Luckily, OpenAI provides a great sandbox environment, depicted below, where you can experiment with different inputs and parameters:

The interface of the GPT-3 Model Playground

The title of the official paper describing GPT-3 states that “Language models are few-shot learners”. This means three things. The first is that GPT-3 is “just” a language model. In other words, given an input text, it predicts what word from a known vocabulary will come next. Just like the autocomplete keyboard on your phone.

Secondly, in order for the model to work, you have to prompt it with a task description (Zero-shot). In my task description, I have used a short instruction that includes a long list of words that should be avoided.

Prompt to instruct the model for sentence paraphrasing.

Lastly, we must provide a list of examples (few-shot) to prime the model. For this, I used a selection of rephrased statements from research that I found on the Internet.
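To make the structure concrete, here is a minimal sketch of how such a prompt could be assembled: a task description with a word list, a few example pairs, and finally the sentence to rewrite. The function name `build_prompt`, the instruction wording, the word list, and the example pair are all illustrative assumptions, not the actual prompt used in this project.

```python
from typing import List, Tuple


def build_prompt(
    instruction: str,
    avoid_words: List[str],
    examples: List[Tuple[str, str]],
    sentence: str,
) -> str:
    """Join a task description, few-shot examples, and a new input into one prompt."""
    # Zero-shot part: the task description plus the words to avoid.
    blocks = [f"{instruction} Avoid these words: {', '.join(avoid_words)}."]
    # Few-shot part: one Input/Output pair per example.
    for biased, neutral in examples:
        blocks.append(f"Input: {biased}\nOutput: {neutral}")
    # The last block is left open so the model completes the "Output:" line.
    blocks.append(f"Input: {sentence}\nOutput:")
    return "\n###\n".join(blocks)


# Hypothetical values, made up for illustration only.
prompt = build_prompt(
    instruction="Rewrite each sentence in gender-neutral language.",
    avoid_words=["dominant", "decisive", "strong"],
    examples=[("We want a dominant sales leader.", "We want an excellent sales leader.")],
    sentence="Decisive, with strong communication skills.",
)
print(prompt)
```

Because the prompt ends with an open "Output:" line, the most likely continuation for a language model is a rewritten version of the last input.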

The initial results have really surprised me. The model seems to be able to understand the task and return very promising results. The prompt was able to turn the model into a paraphrasing bot capable of reformulating a whole job advertisement in a sentence-by-sentence manner.

Very first results of gender bias removal. Here the word “decisively” is clearly biased towards male applicants.

Considering that GPT-3 is a paid API, where you pay based on the number of tokens (including both prompt and response), the next logical step is to make the prompt cost-effective. To spare some tokens, I replaced “Input” and “Output” with A and B, and I removed unnecessary spaces and trailing characters (###). Then, I pruned the list of biased words, removed some rare phrases, and left only the most blatant ones. The final iteration is available on my GitHub.

In the end, the prompt text was around 1,000 tokens long. While this is still well below the maximum input length of 2,048 tokens, it can be considered ‘expensive’. OpenAI currently offers six engine types at different price points. Each engine has its own spectrum of capabilities, with ‘DaVinci’ being the most capable and ‘Ada’ being the fastest but also the least clever. The most expensive model is priced at $0.06 per 1,000 tokens, whereas the “medium” option, ‘Curie’, costs $0.006.

I’ve played with all of these engines and got different results from each of them. Unfortunately, the cheapest engine, priced at $0.0008 per 1,000 tokens, was completely unusable, as it constantly repeated itself and barely understood the task. The high-end model was undoubtedly the best, but the gap between the high-end and mid-range engines was much smaller than the gap to the cheapest one. By my rough impression, ‘Curie’ is about half as good, but considering that it is also 10 times cheaper, it is worth requesting 2–5 suggestions at once and sticking with the mid-range option. Hopefully, one of the paraphrases will be at least OK.
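The trade-off can be checked with a bit of arithmetic. The sketch below uses the per-1,000-token prices quoted above; the mapping of the cheapest price to ‘Ada’, the 40-token answer length, and the billing rule (prompt counted once, each suggestion counted separately) are assumptions made for illustration.

```python
# Prices in USD per 1,000 tokens, as quoted in the text above.
# Assumption: the cheapest quoted price belongs to the 'Ada' engine.
PRICE_PER_1K_TOKENS = {"davinci": 0.06, "curie": 0.006, "ada": 0.0008}


def request_cost(engine: str, prompt_tokens: int, completion_tokens: int, n: int = 1) -> float:
    """Estimate the USD cost of one request that returns n suggestions."""
    # Assumption: the prompt is billed once and every suggestion separately.
    billed_tokens = prompt_tokens + n * completion_tokens
    return billed_tokens / 1000 * PRICE_PER_1K_TOKENS[engine]


# A 1,000-token prompt with 40-token answers (the answer length is a guess).
davinci_single = request_cost("davinci", 1000, 40)
curie_five = request_cost("curie", 1000, 40, n=5)
print(f"davinci, 1 suggestion:  ${davinci_single:.4f}")
print(f"curie,   5 suggestions: ${curie_five:.4f}")
```

Under these assumptions, five mid-range suggestions still cost considerably less than a single high-end one, because the long prompt dominates the bill.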

Evaluation

After playing some more with the model, I quickly realized two things. Firstly, the output of the model is more or less correct for only ~75% of the requests: roughly every fourth try, the model simply returns the exact same sentence. To address this problem, I increased the “Frequency Penalty” parameter from 0 to 0.75, which penalizes the model for repeating tokens that have already appeared in the text so far.
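As a rough sketch, the settings for one request could be collected like this. The keys follow the parameter names of the Completions endpoint as far as I know them; apart from the frequency penalty of 0.75, every value (engine, number of suggestions, temperature, token limit, stop sequence) is a placeholder assumption, and `build_request` is a hypothetical helper, not part of any library.

```python
from typing import Any, Dict


def build_request(prompt: str, engine: str = "curie", n: int = 3) -> Dict[str, Any]:
    """Collect the parameters for one paraphrasing request in a plain dict."""
    return {
        "engine": engine,            # mid-range engine, placeholder choice
        "prompt": prompt,
        "n": n,                      # number of suggestions per request
        "max_tokens": 60,            # placeholder upper bound for one sentence
        "temperature": 0.7,          # placeholder, not a tuned value
        "frequency_penalty": 0.75,   # the value mentioned in the text
        "stop": ["\n"],              # assumption: each answer fits on one line
    }


params = build_request("A: Decisive, with strong communication skills.\nB:")
print(params["frequency_penalty"], params["n"])
```

Keeping the settings in one dict makes it easy to log exactly which combination produced which suggestions.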

Secondly, when a sentence contains multiple ‘biased’ terms, GPT-3 tends to focus on just one of them, leaving the other problems untouched. Take this sentence as an example: “Decisive, with strong communication skills.” Both “decisive” and “strong” carry some bias. In such a case, GPT-3 likes to respond by addressing only one of the two adjectives: “Is decisive, and has good relationship skills to work closely with others”.

Sometimes it takes up to three requests to find an answer that fixes all the problems in the sentence.
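One simple way to handle this is to request several suggestions and keep the one with the fewest remaining flagged words. The sketch below does this with a plain keyword check; the word list is a tiny illustrative sample, the third candidate is made up, and `flagged_terms` and `pick_best` are hypothetical helper names.

```python
import re
from typing import List

# Tiny illustrative word list; a real list would be much longer.
FLAGGED = {"decisive", "decisively", "strong", "dominant"}


def flagged_terms(text: str) -> List[str]:
    """Return the flagged words that occur in the text, in alphabetical order."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return sorted(words & FLAGGED)


def pick_best(candidates: List[str]) -> str:
    """Pick the candidate with the fewest flagged words (first one wins ties)."""
    return min(candidates, key=lambda c: len(flagged_terms(c)))


original = "Decisive, with strong communication skills."
candidates = [
    "Is decisive, and has good relationship skills to work closely with others",
    "Makes sound decisions and communicates well with others",  # made-up example
]
print(flagged_terms(original))
print(pick_best(candidates))
```

A keyword check like this ignores context, so it only works as a filter on top of the model, not as a replacement for reading the suggestions.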

Another observation is that GPT-3 often misses the opportunity to change the sentence structure and instead just swaps the biased terms for synonyms. This led me to adjust one of the most important parameters of the model: “Temperature”. Temperature ranges between 0 and 1 and controls the randomness of the results. One can think of it as a measure of the creative liberty allowed in autocompletion. In practice, values close to 0 deliver nearly deterministic results for the same input, while increasing the temperature allows more “creativity” in the responses. However, with values approaching 1, the results differ slightly each time the request is made. Unfortunately, allowing such randomness in the model output makes it very challenging to test and validate how reliably the model solves a given problem.
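The effect of temperature can be illustrated with the standard way it is applied in language models: the scores for the candidate next words are divided by the temperature before being turned into probabilities. This is a generic sketch of that technique, assuming GPT-3 follows the usual scheme; the three scores are made up.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive in this sketch")
    scaled = [x / temperature for x in logits]
    highest = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - highest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up scores for three candidate next words.
scores = [2.0, 1.0, 0.5]
cold = softmax_with_temperature(scores, 0.2)
warm = softmax_with_temperature(scores, 1.0)
print([round(p, 3) for p in cold])
print([round(p, 3) for p in warm])
```

At a low temperature almost all probability mass sits on the top-scoring word, which is why the output barely changes between requests; at a temperature of 1 the alternatives get a realistic chance of being picked.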

Lastly, in some isolated cases, GPT-3 delivered terribly wrong results. I caught it changing a perfectly reasonable, if slightly biased, statement, “Independent and open-minded”, to “Chronic job-hopper and failure”. Such events ultimately convinced me that the model cannot be fully trusted or let loose without supervision. GPT-3 simply requires a human in the loop to check the outcomes, as they might be far from what is expected.

In order to evaluate the performance of various parameter combinations, engines, and prompts, I developed a small Python app that lets me manually request a set of paraphrases for a list of biased sentences that weren’t included in the prompt. While testing manually might seem tedious, I decided not to automate it for a few reasons.

Primarily, I wanted to understand the impact these changes had on the model outputs. Acquiring that ‘feel’ for the knobs enabled me to improve step by step, just like a gradient descent algorithm.

Secondly, it would be very challenging to create a robust testbed. It might be trivial for a testing program to spot obviously wrong answers, e.g. when the model repeats itself or returns an empty string. However, for every paraphrased sentence there is more than one correct answer, and it is definitely not trivial to measure how good or bad each option actually is. Essentially, there is no better way to see if the model behaves as expected than by eyeballing the list of suggestions.
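The trivial part of such a check is easy to sketch. The hypothetical helper below flags only the two obvious failures mentioned above, an empty answer and an answer that merely repeats the input; it says nothing about the quality of a genuine paraphrase.

```python
def is_trivial_failure(source: str, paraphrase: str) -> bool:
    """Return True if the answer is empty or just repeats the source sentence."""

    def normalize(text: str) -> str:
        # Ignore case, repeated whitespace, and a trailing full stop.
        return " ".join(text.lower().split()).strip(" .")

    cleaned = normalize(paraphrase)
    return cleaned == "" or cleaned == normalize(source)


source = "Decisive, with strong communication skills."
print(is_trivial_failure(source, "decisive, with strong communication skills"))
print(is_trivial_failure(source, "   "))
print(is_trivial_failure(source, "Makes sound decisions and communicates well."))
```

Everything that passes this filter still needs a human look, which is exactly the part that is hard to automate.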

Finally, I am pretty sure that building a simple proof-of-concept app is the best way to convince anyone that something is either useful or useless. Therefore, this Python app quickly became the backend of a PoC web app, which I plan to release soon after publishing this article. Below you can get a glimpse of the prototype:

Final thoughts

After hours of testing the adjustments to the prompt, examples, and hyper-parameters, I finally came up with an (im)perfect combination and a conclusion.

First things first: I need to admit that GPT-3 is a remarkable piece of engineering. It has supported the premise that simply scaling up a deep learning model and giving it far more data makes it more performant and capable. It caught the media’s attention the same way Watson did in 2011 and AlphaGo did in 2016. This time, however, the hope was that OpenAI had come up with something more sophisticated than a narrow AI only capable of excelling at a very specific task. GPT-3 seemed to be a big step towards artificial general intelligence, as it displayed signs of reasoning. But the reality is a little more nuanced than that.

As of today, it is not possible to fine-tune the model, which makes it harder to improve its reliability. To me, all the fancy and advanced applications, like writing SQL queries, are just a great stunt, because the model’s understanding of SQL is an illusion. The model does not contain any structured knowledge base; all it does is predict the next piece of text. In this case, GPT-3 was fed enough data from internet forums to provide answers to simple problems, but that is not enough to actually solve more complex queries.

I’m convinced that paraphrasing text and creative writing are among the best use cases where GPT-3 can shine. Given that GPT-3 already has a grasp of language, a good prompt and an input sentence give it all the ingredients to paraphrase any sentence. However, because of the model’s chaotic and unstable behavior, using it right now in a software engineering context would be very challenging. It would be hard to build a paraphrasing application based solely on this model, but it can be very useful in combination with other techniques. One cannot simply ignore decades of research in NLP and NLU and rely only on the language model. I’m pretty sure that a hybrid system leveraging multiple techniques is the best architecture to address the problem of bias removal. I’m also positive that GPT-3 can play a vital role in making such applications stand out.

As a follow-up to this article, I plan to release a demo application where you can use GPT to paraphrase biased job adverts. If you are interested and want to give it a try, please follow me on Medium, and stay tuned.

Axel Springer Tech
