Wiki-Intelligence and the Foundation of Knowledge

Sam Bobo
Speaking Artificially
Aug 14, 2023
Imagined by Bing Image Creator powered by DALL-E

English Class

High School English classes always belabored the 5-paragraph essay as the standard method for teaching concise writing: the introductory paragraph, three body paragraphs, and a conclusion. Soon thereafter, those essays turned into term papers with the additional requirement of providing reliable external sources and using correct citation syntax. Why do I bring up English class in a blog centered on Artificial Intelligence? The answer: Wikipedia.

Teachers never hesitated to discredit Wikipedia. To be fair, in the early 2000s Wikipedia was still emergent as internet popularity was growing, and teachers were wary of user-created, community-sourced knowledge as a tertiary reference simply because of the perceived unreliability of its content. Whether factual or not, that notion loomed over education for a long time.

History Class

The site was founded in January 2001 by Jimmy Wales and Larry Sanger as a spin-off of an earlier project called Nupedia, a free online encyclopedia that relied on expert contributors and peer review. Nupedia was slow and cumbersome to produce, and Wales and Sanger wanted to create a more open and collaborative platform that anyone could edit. They decided to use wiki software, which allows users to create and modify web pages without any technical skills. They named the site Wikipedia, a portmanteau of wiki (from a Hawaiian word meaning “quick”) and encyclopedia.

The idea of Wikipedia was revolutionary at the time, as it challenged the traditional notions of authority, quality, and reliability in knowledge production. Instead of relying on a centralized editorial board or a fixed set of experts, Wikipedia adopted a decentralized and democratic model of governance, where anyone can contribute, edit, or delete content, as long as they follow certain policies and guidelines. These include verifiability (citing reliable sources), neutrality (representing multiple viewpoints fairly), no original research (only summarizing existing sources), and respect for other editors (avoiding personal attacks or vandalism).

Wikipedia also relies on a community of volunteers, who perform various roles and tasks to maintain and improve the website. These include administrators (who can delete pages, block users, or protect pages from editing), bureaucrats (who can grant or revoke user rights), stewards (who can oversee cross-wiki issues), arbitrators (who can resolve disputes among editors), checkusers (who can investigate user accounts), oversighters (who can hide revisions from public view), bots (automated programs that perform repetitive tasks), and many others.

In effect, Wikipedia transformed into an open-source encyclopedia of content, relying on a system of the masses to keep the information reliable and accurate, and growing to nearly 6.6 million English-language articles as of 2023.

Certainly, the concept of Wikipedia in execution unveils a number of flaws far too common in society today, including underrepresentation, bias, and more, but those are outside the scope of this blog post.

Computer Science Class

Enough time spent in History and English class; it’s time to shift focus to Computer Science, specifically Artificial Intelligence!

As referenced in previous blog posts, my career in Artificial Intelligence was catalyzed by my early entry into IBM Watson, only a couple of years after the division’s emergence. During my tenure with the organization, I witnessed the strategic acquisition of AlchemyAPI, a start-up aimed at providing common Natural Language Processing capabilities via APIs, specifically sentiment analysis, entity extraction, keyword extraction, and more. This set of APIs bolstered IBM Watson’s already market-leading Conversational AI capabilities at the time and gave the broader integrator ecosystem, eager to infuse applications with Artificial Intelligence, the ability to extract such information from sentences, entire documents, or whole corpora. Why do I reference AlchemyAPI? One of the underlying sources of training data for those NLP capabilities was none other than English Wikipedia, among others. In fact, many of the entities AlchemyAPI extracted actually linked back to Wikipedia for further details!
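To make “entity extraction” concrete, here is a minimal sketch of the idea using spaCy as a modern, open-source stand-in; this is not AlchemyAPI’s actual (now retired) API, and the model and example sentence are simply illustrative choices of mine.

```python
# A minimal sketch of entity extraction, using spaCy as a stand-in for the
# kind of NLP-as-an-API capability AlchemyAPI offered (not its actual API).
import spacy

# Small English pipeline; install first with: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

text = "Wikipedia was founded in January 2001 by Jimmy Wales and Larry Sanger."
doc = nlp(text)

# Each recognized entity carries a surface form and a type label,
# analogous to what an entity-extraction endpoint would return as JSON.
for ent in doc.ents:
    print(ent.text, ent.label_)
# Typically surfaces spans like "Jimmy Wales" (PERSON) and "January 2001" (DATE).
```

Services like AlchemyAPI wrapped exactly this kind of capability behind an HTTP endpoint and, as noted above, often linked extracted entities back to their Wikipedia pages for further detail.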

Listening to the Towards Data Science podcast, I discovered an episode interviewing Angela Fan, an AI researcher at Meta, where she explained the inner workings of crafting Wikipedia entries about known people, landmarks, concepts, etc., and their direct impact on Artificial Intelligence systems.

Effectively, authors were required to:

  • Maintain a specific structure in the content they crafted to optimize for training data ingestion
  • Be hyper-aware of biases and representation issues that could creep into the copy and, in turn, shape the training data and its corresponding output
  • Create a broad enough knowledge base to fuel AI training and not create unnecessary underrepresentation of specific fields, people, perspectives, etc.

It’s powerful!

The episode delves further into Generative AI, using the Wikipedia corpus and transformer models to generate new Wikipedia entries that further augment that corpus, but that is outside the scope of this blog post.

Wikipedia, along with other common datasets like Common Crawl, is vital to the progression of Natural Language Processing capabilities within Artificial Intelligence. These massive corpora help weave knowledge graphs, vectorize language to power LLMs, and act as credible (yes, credible) reference sources in queries.
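As a rough sketch of what “vectorizing language” looks like in practice, the snippet below embeds passages from WikiText-103, a public corpus extracted from English Wikipedia, and retrieves the passage most similar to a question. The dataset, the all-MiniLM-L6-v2 embedding model, and the query are illustrative choices of mine, not the pipeline behind any particular LLM.

```python
# Illustrative only: embedding Wikipedia-derived text and retrieving by
# semantic similarity, one common way these corpora power NLP systems.
from datasets import load_dataset
from sentence_transformers import SentenceTransformer, util

# WikiText-103 is a public corpus extracted from English Wikipedia articles.
corpus = load_dataset("wikitext", "wikitext-103-raw-v1", split="validation")
passages = [t for t in corpus["text"] if len(t.strip()) > 100][:1000]

# A compact sentence-embedding model turns each passage into a vector.
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(passages, convert_to_tensor=True)

# Embed a query and surface the most semantically similar passage.
query = model.encode("Who founded Wikipedia?", convert_to_tensor=True)
scores = util.cos_sim(query, embeddings)[0]
print(passages[int(scores.argmax())][:200])
```

The same embedding idea underlies retrieval-augmented generation, where Wikipedia passages are handed to an LLM as the credible reference sources mentioned above.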

Peering Into the Future

The year 2023, with the advent of Generative AI, has sparked massive interest in the field of Artificial Intelligence, specifically around Natural Language Processing. This catalyzed interest has accelerated both new approaches and long-running, controversial conversations, including:

  • Corpus building and data access — what data is considered public versus private? What is the legality of web-scraping tools? Who owns data published on the internet (the contributors? the platform? no one?), and who should monetize it? This is directly leading to sites such as Reddit charging for API access and other social media companies tightening access to, and monetization of, their content. Furthermore, should we generate synthetic data to train future AI models, and what are the ramifications?
  • Influx of AI-generated data — with deepfakes, quick monetization schemes, and the sheer generation of garbage information on the internet, what becomes the future of truth and credibility? What measures should be put in place to identify what is AI-generated and what is not? Is this a vicious cycle of “garbage in, garbage out?”

What is becoming apparent is that open projects such as Wikipedia and Common Crawl play a vital role in NLP training. The collective action and governance of these organizations will help preserve the integrity of some of the training data available for building Foundation Models and other NLP-powered systems.

Yes, teachers and educators had a right to limit reliance on Wikipedia early in its lifetime; however, the institution has grown to show significant value to society and its evolution.

Proprietary data will always prove dominant in providing AI-powered advantages and capabilities within solutions, but we need a starting place, and that is where Wikipedia comes in.

I hope that after reading this blog, you understand the history of Wikipedia, its emerging importance in the world of Artificial Intelligence, and the continued ramifications this institution has on the field!


Product Manager of Artificial Intelligence, Conversational AI, and Enterprise Transformation | Former IBM Watson | https://www.linkedin.com/in/sambobo/