Gen AI Series: Bias

Over the past few months, there have been many explanations of the pros and cons of generative AI (Gen AI), from grandiose visions of AI overlords harnessing the powers of SkyNet to the less doomsday-like approach of embracing the free market and letting innovation lead the way. The call for a 6-month moratorium on Gen AI has been at the forefront of our 24-hour news cycle. Comparisons to the Ford Model T replacing the job of the poor horseshoe blacksmith come to mind.

The recent call for a 6-month pause has gained momentum

However, the AI narrative is not monolithic; it does not play out across our global discourse as a simple fight between good and evil. Depending on one’s technical, economic, and political leanings, there seems to be a narrative that fits each flavor of thought. Regulate, demote, stop, pause, inspect. These are all partial solutions to a much more complex problem.

Gen AI is not a go or no-go question; there is nuance and dialogue needed in the AI community

Gen AI is exciting, and companies that implement it will no doubt provide better and more timely service to their customers. This article is not about the benefits that any individual company seeks to gain, but rather about a community-level dialogue on the areas where Gen AI is apt to bring debate. The following discussion dives into eight areas that I label BADASSES. These include:

  • Bias
  • Alternatives
  • Disinformation
  • Attribution
  • Security
  • Siloing
  • Explainability
  • Sensationalism

As we learn more about the new technologies that present themselves each day, along with the new models, architectures, and infrastructure landscapes, understanding these 8 areas will become increasingly important for companies looking to leverage Gen AI gains. Different domains will weigh these threads differently, even though Gen AI solutions directly affect people’s lives regardless of the domain they fall into. The government domain (politics and national security) will worry more about disinformation, while academia will need to be more aware of attribution and sensationalism. My hope is that by touching on these 8 threads, a more coherent understanding of Gen AI’s place in modern society will emerge.

Let’s Dig In!

Bias

Bias is present in all elements of society. Our personal preferences and past experiences culminate in predisposed notions of what is correct. These preferences make their way into the types of news we follow, the clothing we wear, the information we absorb through social media, and the thoughts we have on how society should look and sound. These collective preferences have created the modern internet, embodied in the text and images stored on websites. Data is collected at a mass scale for the training of Gen AI solutions (see the ChatGPT model summary later). It is these preferences that are embodied in Gen AI models and returned as bias.

Large-scale data collection is nothing new. We have heard for years about the National Security Agency’s (NSA) ability to collect and store data. Ten years ago, it was reported that the NSA “touches” over 29 petabytes of data daily. In fact, its Utah data center is reported to be able to store a yottabyte (one billion trillion megabytes). One can only assume that capacity has grown at an exponential pace. In the private sector, most companies with high processing capabilities collect data at an astonishingly higher rate than ever before. The world’s largest retailer (Walmart), with over 20,000 stores worldwide, generates 2.5 petabytes of data each hour! Data-centric companies are using data at a rate never before seen. So how does large-scale data collection affect the development of Gen AI?

National Security Agency Utah Data Center

As a technical background, GPT-3 leverages Common Crawl, a 321-terabyte (TB) collection of raw text from the internet. However, this data tends to be skewed towards younger users from developed countries. GPT-2 relied heavily on Reddit data, whose users, according to a 2016 Pew Research Center survey, are 67 percent male, and 64 percent of those men are between the ages of 18 and 29. The GPT framework also utilizes Wikipedia, where only 15 percent of contributors are female. Finally, WebText, another key input to the GPT models, is built from news sources, Wikipedia, and works of fiction. A recent study by Gehman et al. (2020) found that a portion of sampled text from these sources had toxicity scores of 50 percent or greater. All of this is important because it lays the foundation for how we view bias and the data it lives in.
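For readers who want to make that kind of measurement concrete, here is a minimal sketch of scoring a corpus sample for toxicity, assuming the open-source detoxify package; the sample documents and the 0.5 threshold are illustrative, not the exact setup from the study above.

```python
# Minimal sketch: estimate the share of a web-text sample at or above a
# toxicity threshold. Assumes `pip install detoxify`; sample docs are made up.
from detoxify import Detoxify

sample_docs = [
    "A perfectly ordinary paragraph about local weather.",
    "An aggressive, insult-laden forum comment.",
]

model = Detoxify("original")  # pretrained toxic-comment classifier
scores = [model.predict(doc)["toxicity"] for doc in sample_docs]

toxic_share = sum(score >= 0.5 for score in scores) / len(scores)
print(f"Share of sampled documents with toxicity >= 0.5: {toxic_share:.1%}")
```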

The issue of bias in ML is nothing new. Take, for example, Tay, a fun-loving bot that Microsoft released in 2016. Tay was trained on public Twitter posts and was supposed to emulate the nature of society and how we interact on Twitter. It took less than 24 hours for Tay to go from a fun, upbeat friend who sends you cool tweets to a racist, fascism-loving bot from hell (see below for how it began and how it ended). Tay taught the ML community that training data matters, and that simply throwing the entirety of the public internet at an algorithm is bound to surface its worst corners if left unchecked by engineers.

We have seen this story play out since Microsoft’s unwitting fumble with Tay. Dr. Joy Buolamwini of MIT has done tremendous research and education in this field, promoting a more inclusive use of AI technologies and frequently bringing to light the inherent racial bias that ML systems contain. What does it mean when AI systems from leading companies fail to correctly classify the faces of Oprah Winfrey, Michelle Obama, and Serena Williams, yet classify white celebrities with shocking precision?

Tay

Bloomberg recently released a report that took a deep-dive approach to Stability AI’s generative text-to-image model, Stable Diffusion. The one-sentence summary of the inherent biases is telling:

“The world according to Stable Diffusion is run by White male CEOs. Women are rarely doctors, lawyers or judges. Men with dark skin commit crimes, while women with dark skin flip burgers.”

The Bloomberg report analyzed 5,000 AI-generated faces representing 16 different types of jobs, from janitor to CEO. The images were rated on a six-point skin-tone index (the Fitzpatrick Skin Scale), from light to dark, and those ratings were compared across occupations against statistics reported by the US Bureau of Labor Statistics. When set against the actual distribution of skin tones implied by US government statistics, the results reveal a shocking level of bias in Stable Diffusion’s generated images.
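As a rough sketch of that comparison methodology (all numbers below are invented placeholders, not Bloomberg’s or the BLS’s actual figures), one can tabulate the skin-tone distribution of generated images for an occupation and set it against a reference labor-force distribution:

```python
# Sketch: compare generated-image skin-tone shares for one occupation
# against a reference distribution. All numbers are invented placeholders.
import pandas as pd

generated = pd.Series(  # share of generated "CEO" images per Fitzpatrick bin
    {"I": 0.45, "II": 0.30, "III": 0.12, "IV": 0.07, "V": 0.04, "VI": 0.02},
    name="generated",
)
reference = pd.Series(  # hypothetical labor-force reference distribution
    {"I": 0.25, "II": 0.25, "III": 0.18, "IV": 0.14, "V": 0.11, "VI": 0.07},
    name="reference",
)

comparison = pd.concat([generated, reference], axis=1)
comparison["over_representation"] = comparison["generated"] - comparison["reference"]
print(comparison)
```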

“When prompted to generate images of a “terrorist,” the model consistently rendered men with dark facial hair, often wearing head coverings — clearly leaning on stereotypes of Muslim men. According to a 2017 report from the Government Accountability Office, radical Islamic extremists committed 23 deadly terrorist attacks on US soil since Sept. 11, 2001 — but far-right extremists, including White supremacists, committed nearly three times as many during the same time frame.”

The bias doesn’t only affect ethnicity. As seen below, when compared against Bureau of Labor Statistics data, Stable Diffusion tends to generate more women for lower-status jobs (dishwasher, cashier, housekeeper) and skews toward men for positions that require higher education and carry higher social status.

But there are solutions to this problem, as Dr. Buolamwini describes in her “Safe Face Pledge,” whereby companies promise to evaluate the data they train models on for racial biases. For Gen AI to build on the work that pioneers of racial equality and justice movements have pushed through society, companies must address the data they train on. Take, for example, large language models (LLMs), a type of Gen AI. As evidenced by the recent Chinchilla paper, tuning an LLM algorithm can only get you so far; the real improvement comes from more training data. For companies training LLMs, relying solely on the sheer volume of gigabytes they throw at the model won’t be enough. It will take a concerted effort to give these models the right data to improve their performance.

Wrong type of Chinchilla
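As a back-of-the-envelope illustration of the Chinchilla finding mentioned above, the paper’s widely cited rule of thumb is roughly 20 training tokens per model parameter; the model sizes below are arbitrary examples.

```python
# Back-of-the-envelope Chinchilla rule of thumb: compute-optimal training uses
# roughly 20 tokens per parameter (Hoffmann et al.). Model sizes are examples.
TOKENS_PER_PARAM = 20

for params_billion in (1, 7, 70):
    tokens_billion = params_billion * TOKENS_PER_PARAM
    print(f"{params_billion}B-parameter model -> ~{tokens_billion}B training tokens")
```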

What is the solution to overcoming this deluge of training data bias in the LLMs that are deployed today?

  1. Commit to ethical training data curation
  2. Join open communities
  3. Document dataset curation
  4. Release training dataset samples to the public
  5. Hold data in escrow accounts

Commit to ethical training data curation

Ethical training data curation borrows from academic sampling methodologies frequently found in public opinion research and survey analytics. By leveraging bedrock sampling methodologies from leaders in the political science arena, notably Leslie Kish, Ph.D., a more robust sampling framework can be put in place to better represent underrepresented groups. The world of political opinion research has leveraged these methodologies for half a century to create accurate samples of the American electorate. By looking to the pioneering work of people like Stan Greenberg, Bill Clinton’s pollster (and his firm Greenberg Quinlan Rosner), or Jan van Lohuizen and Fred Steeper, George W. Bush’s pollsters, we can borrow from methodologies that have proven successful in understanding the American electorate. These methodologies, as they apply to choosing data sampling allocations and distributions, can help create a more equitable and accurate data representation. When combined with the model-parameter and data-size recommendations from the Chinchilla paper, gains in LLM accuracy, capability, and cost can go hand in hand with more diverse and ethically appropriate data collection.
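Here is a minimal sketch of what that survey-style allocation could look like for training data; the strata, target shares, and toy corpus are invented for illustration, not a prescribed standard.

```python
# Minimal sketch of stratified sampling for training-data curation.
# Strata, target shares, and the toy corpus are invented for illustration.
import pandas as pd

corpus = pd.DataFrame({
    "text": [f"doc_{i}" for i in range(10_000)],
    "stratum": ["young_urban"] * 7_000 + ["older_rural"] * 3_000,
})

# Target allocation chosen to counteract the corpus skew
target_shares = {"young_urban": 0.5, "older_rural": 0.5}
sample_size = 2_000

samples = []
for stratum, share in target_shares.items():
    pool = corpus[corpus["stratum"] == stratum]
    n = int(sample_size * share)
    samples.append(pool.sample(n=n, random_state=42, replace=len(pool) < n))

balanced_sample = pd.concat(samples, ignore_index=True)
print(balanced_sample["stratum"].value_counts(normalize=True))
```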

Join open communities

Joining open communities where data collection ideas are shared is critical. Open collaboration on nuclear deterrence in the 1980s proved fruitful, helping achieve some of the largest drawdowns of fissile-material stockpiles in human history. A similar approach to open collaboration, even when competing under the traditional Prisoner’s Dilemma constraint where the optimal individual choice is to cheat, will help in curating better datasets. Open participation and open sharing of code (i.e., Reagan’s “trust but verify” approach) can lead to more robust training datasets.

Document dataset curation

Any software engineer or machine learning engineer knows that proper documentation of code is worth its “weight” in gold. The same holds for how data is curated. Just as DevOps and MLOps filled the gaps in software development and ML model infrastructure, respectively, a new era of data documentation must fill the gap for training data. This documentation should answer four key questions: where, when, how, and why. Where was the data collected, over what duration, with what methods (such as filtering and permissions), and why was it collected?
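One lightweight way to operationalize those four questions is a machine-readable record attached to every dataset. The sketch below uses field names and example values of my own invention, not an established standard.

```python
# Sketch of a lightweight dataset record answering where/when/how/why.
# Field names and example values are illustrative, not an established standard.
from dataclasses import dataclass, asdict
import json

@dataclass
class DatasetRecord:
    name: str
    where: str  # source of the data
    when: str   # collection window
    how: str    # collection method, filtering, permissions
    why: str    # intended purpose

record = DatasetRecord(
    name="web_text_sample_v1",
    where="Publicly crawlable web pages (example source)",
    when="2023-01-01 to 2023-03-31",
    how="Crawled with robots.txt honored; profanity filter applied; PII removed",
    why="Pretraining corpus for an internal LLM experiment",
)

print(json.dumps(asdict(record), indent=2))
```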

Release training dataset samples to the public

AI companies should share a random sample of training data for a proper audit of their processes. The sampling methodology should follow an openly agreed-upon standard, shaped by the input of multiple leading, independent organizations that can bring clarity and statistical best practices to the sampling methods used. Furthermore, opening these datasets up to evaluation by communities such as Kaggle has proven effective at engaging machine learning practitioners.
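Here is a sketch of what a reproducible, auditable sample release could look like, with the seed published alongside the sample so outside reviewers can verify it; the seed, fraction, and file paths are placeholders.

```python
# Sketch: draw a reproducible audit sample from a training corpus.
# The seed, sample fraction, and file paths are placeholders.
import pandas as pd

SEED = 20230101          # published with the sample so auditors can re-derive it
SAMPLE_FRACTION = 0.001  # e.g., release 0.1% of the corpus

corpus = pd.read_parquet("training_corpus.parquet")      # hypothetical path
audit_sample = corpus.sample(frac=SAMPLE_FRACTION, random_state=SEED)
audit_sample.to_parquet("public_audit_sample.parquet")   # shared publicly
```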

Hold data in “escrow” accounts

AWS SageMaker JumpStart recently launched LLM management and productionization capabilities that allow data and model weights to live in “escrow,” similar to a buyer and seller putting their money in an escrow account to ensure the proper functioning of a home sale. Since users are concerned about companies training future models on their input prompts, and companies do not want their model weights leaked to the public (as happened with Facebook’s LLaMA), this concept has gained traction. AWS lets the user place their data on AWS-managed infrastructure while the model weights, for example AI21 and Cohere model weights, are stored in similarly isolated capacity on AWS. At the point of inference, the model weights are called from the escrow account and applied to the user’s input data. This offers the flexibility to use the model while keeping the user data private and the model weights the intellectual property of the provider.
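As a hedged sketch of what that pattern can look like with the SageMaker Python SDK’s JumpStart interface: the model ID and instance type below are illustrative placeholders, not real catalog entries, and some proprietary models require additional steps such as accepting an end-user license.

```python
# Sketch of the "escrow" deployment pattern via SageMaker JumpStart.
# The model_id and instance_type are illustrative placeholders.
from sagemaker.jumpstart.model import JumpStartModel

# The provider's weights ship inside a packaged container deployed into *your*
# AWS account, so prompts never leave your environment and you never handle
# the raw weights directly.
model = JumpStartModel(model_id="example-proprietary-llm")  # hypothetical ID
predictor = model.deploy(instance_type="ml.g5.2xlarge")

response = predictor.predict({"inputs": "Summarize this customer email ..."})
print(response)
```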


Where do we go from here?

Bias in data collection impacts LLMs to an astonishing degree. While there are specific actions that can be taken immediately, pending regulation, most notably in the EU and the US Congress, will attempt to shape the future of Gen AI. Time will tell where Gen AI is headed, but if GDPR and CCPA are any indication, more regulation is around the corner. In the meantime, there remain areas for additional exploration. Human-in-the-loop processes that allow for more human involvement in data curation, stress-testing datasets for undisclosed bias, and other creative means of data discovery in the model-building process will help organizations uncover new areas for growth.

Author’s note: Special thanks go out to Steven Pais, Rishi Sheth, and Claire Salling for their technical review and sounding-board sessions as this draft moved from PROD to DEV :)

--

Nicholas Beaudoin
Eviden Data Science and Engineering Community

Nicholas is an accomplished data scientist with 10 years in federal and commercial consulting practice. He specializes in ML operations (MLOps).