What Sites Were Used For Training Google Bard AI?

Joe Jacob
4 min read · Feb 11, 2023


Google Bard, whose existence is rooted in the LaMDA language model, raises questions about the provenance of its training dataset, known as Infiniset. The mystery surrounding the source and acquisition of this data prompts us to ponder the ethics of AI development and the transparency of technology companies.

While the LaMDA research paper from 2022 describes the composition of the datasets used to train the language model, only 25% of the data can be traced back to named, publicly available sources: the C4 web-crawl dataset and Wikipedia. This raises further questions about the origin and quality of the remaining 75% of the data used to train the AI system.

The Infiniset Dataset

Google’s Infiniset Dataset is a source of philosophical reflection, particularly with regard to its application in AI technology. The language model LaMDA, which powers Google Bard, was trained on this dataset with the intention of improving its ability to participate in dialogue.

The choice to use Infiniset, a blend of selected Internet content, raises questions about the motivations and biases behind the selection process, and the impact these choices may have on the training of the AI system. It invites us to contemplate the responsibility of technology companies in shaping the future of AI, and the ethical considerations that must be taken into account when designing and training these systems.

The LaMDA research paper (PDF) explains why this composition of content was chosen.

The research paper on LaMDA, the language model that underlies Google Bard, uses the terms “dialog” and “dialogs,” the spellings favored within computer science. This terminology invites us to contemplate the role of AI systems in human communication, and the potential impact these systems may have on our society and relationships.

The language model was pre-trained on an extensive corpus of “public dialog data and web text,” totaling 1.56 trillion words. This staggering amount of information serves as a reminder of the power and influence of AI technology, and prompts us to consider the responsibilities that come with developing these systems. It raises questions about the implications of training AI on such a vast amount of data, and the potential biases and inaccuracies that may be inherent in the training process.

The dataset is composed of the following mix (a rough word-count breakdown is sketched after the list):

  • 12.5% C4-based data
  • 12.5% English language Wikipedia
  • 12.5% code documents from programming Q&A websites, tutorials, and others
  • 6.25% English web documents
  • 6.25% Non-English web documents
  • 50% dialogs data from public forums
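
To make these proportions concrete, here is a small, purely illustrative Python sketch that converts the published percentages into approximate word counts, using the 1.56 trillion word total from the paper. The per-slice figures are simple arithmetic on the published numbers, not counts Google has released.

```python
# Illustrative only: converts the published LaMDA mix percentages into
# approximate word counts. The 1.56 trillion total and the percentages come
# from the LaMDA paper; the derived per-slice figures are simple arithmetic.
total_words = 1.56e12

mix = {
    "C4-based data": 0.125,
    "English Wikipedia": 0.125,
    "Programming Q&A / code documents": 0.125,
    "English web documents": 0.0625,
    "Non-English web documents": 0.0625,
    "Public forum dialogs data": 0.50,
}

for name, share in mix.items():
    print(f"{name}: {share:.2%} ~ {total_words * share / 1e9:.0f} billion words")

# Only C4 and Wikipedia are named sources, so the traceable share is:
named = mix["C4-based data"] + mix["English Wikipedia"]
print(f"Traceable to named sources: {named:.0%}")
```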

The origin of the majority of the data used to train LaMDA, the language model behind Google Bard, remains shrouded in mystery. Only 25% of the data is from named sources, specifically the C4 dataset and Wikipedia. The remaining 75% of the Infiniset data consists of words scraped from the Internet.

The research paper offers no concrete information on the method used to obtain this data, the specific websites it was obtained from, or any other details about the scraped content. Google merely provides generalized descriptions, such as “Non-English web documents.”

This lack of transparency about the source of the data has led to it being described as “murky,” that is, obscure or uncertain. Although there are hints that give a general idea of which websites are included in the scraped data, it is not possible to know for sure.

C4 Dataset

C4 is a dataset developed by Google in 2020. C4 stands for “Colossal Clean Crawled Corpus.”

C4 is a filtered version of Common Crawl, an openly available web crawl dataset.
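
For readers who want to see what C4 actually contains, a copy of the corpus is hosted on the Hugging Face Hub. The sketch below is an assumption on my part: it relies on the community-maintained allenai/c4 mirror and the Hugging Face datasets library, neither of which is mentioned in the LaMDA paper. It streams a handful of records so the raw text and source URLs can be inspected without downloading the full corpus.

```python
# A minimal sketch, assuming the allenai/c4 mirror on the Hugging Face Hub
# and the `datasets` library. Streaming avoids downloading the full corpus.
from datasets import load_dataset

c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)

for i, record in enumerate(c4):
    if i >= 3:
        break
    # Each record carries the page text plus the URL it was crawled from.
    print(record["url"])
    print(record["text"][:200], "...\n")
```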

The following statistics about the C4 dataset come from a second research paper that analyzed the corpus’s contents.

The top 25 websites in C4, ranked by number of tokens, are listed here; a sketch of how such a ranking can be derived follows the list:

  1. patents.google.com
  2. en.wikipedia.org
  3. en.m.wikipedia.org
  4. www.nytimes.com
  5. www.latimes.com
  6. www.theguardian.com
  7. journals.plos.org
  8. www.forbes.com
  9. www.huffpost.com
  10. patents.com
  11. www.scribd.com
  12. www.washingtonpost.com
  13. www.fool.com
  14. ipfs.io
  15. www.frontiersin.org
  16. www.businessinsider.com
  17. www.chicagotribune.com
  18. www.booking.com
  19. www.theatlantic.com
  20. link.springer.com
  21. www.aljazeera.com
  22. www.kickstarter.com
  23. caselaw.findlaw.com
  24. www.ncbi.nlm.nih.gov
  25. www.npr.org
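
The ranking above was produced by counting tokens per website across the corpus. As a rough illustration of how such a ranking is derived, the sketch below groups a small streamed sample of C4 by domain and tallies whitespace-split words. The allenai/c4 mirror, the sample size, and the crude tokenization are all my assumptions, so the output will not match the table exactly.

```python
# Rough illustration of how a "top sites by token count" table can be built.
# Assumptions: the allenai/c4 mirror on the Hugging Face Hub, a small sample,
# and whitespace splitting as a crude stand-in for a real tokenizer.
from collections import Counter
from urllib.parse import urlparse
from datasets import load_dataset

c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)

tokens_per_domain = Counter()
for i, record in enumerate(c4):
    if i >= 10_000:  # sample only; the full corpus has hundreds of millions of pages
        break
    domain = urlparse(record["url"]).netloc
    tokens_per_domain[domain] += len(record["text"].split())

for domain, count in tokens_per_domain.most_common(25):
    print(f"{domain}: {count:,} tokens")
```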

Google does not specify what sites are in the Programming Q&A Sites category that makes up 12.5% of the dataset that LaMDA trained on.

So we can only speculate.

Stack Overflow and Reddit seem like obvious choices, especially since they were included in the MassiveWeb dataset.

But the final two categories, the English and non-English web documents, are not explained.

The research paper provides only limited information, describing this 12.5% of the dataset simply as “English and non-English language web pages.” This sparse information serves as a reminder of the complexities and challenges associated with the development of AI systems.

It prompts us to consider the ethical and philosophical implications of training AI systems on large amounts of data, and the responsibilities that come with these efforts. The vagueness of Google’s information about the source of the training data raises questions about the accuracy, impartiality, and transparency of the AI systems we are creating.

Should Google be transparent about its datasets?

The use of websites in training AI systems has led to concerns among publishers who fear that their sites may become obsolete. While the validity of these concerns is yet to be determined, they represent a legitimate anxiety among publishers and those involved in search marketing.

Given this, there is growing debate over whether Google should be more transparent about the datasets used to train AI systems, and whether such systems will have a lasting impact on the future of the web.


Joe Jacob

Self-taught Data Scientist | I want to help individuals and businesses to solve problems through technology. Sharing insights and inspiring innovation.