Generating Conspiracy Theories With Machine Learning (GPT-2)
DISCLAIMER: This project is solely for the purpose of entertainment and is not meant to be used for ill will. I do not support the use of this application to spread pseudoscience, disinformation, or other content that is contrary to public health or safety. I do not condone the use of this app to incite or cause harassment, physical harm, or reputational harm. Further, I do not approve of using this app to advance false claims about historic events and facts. Any text generated from this application can be inaccurate or inappropriate. Take caution with this application and use at your own risk.
In this article, I walk you through how I built a web application that generates conspiracy theories. To give a quick summary of this project, I…
- Built a dataset by using an API to gather over 10,000 comments from public forum posts
- Fine-tuned GPT-2 on the dataset, achieving a training loss of less than 0.1
- Tuned generation parameters so the model produces more human-like theories
- Built a Streamlit application for the model
- Deployed the Streamlit web application and model on Google App Engine and Google Kubernetes Engine
Want to give it a go or play with the code? Check out these links:
Web App: https://share.streamlit.io/lognorman20/con_gen/main/app.py
Source Code: https://github.com/lognorman20/con_gen
Table of Contents
1 Introduction
2 Data Collection & Pre-Processing
3 Model Selection & Development
4 Model Deployment
5 Conclusion
Introduction
Recently, I’ve been reading conspiracy theories on various topics, from the Earth being flat to birds not existing. One theory I read was about the Earth being hollow and aliens living inside of it. At this point, I felt like anything could be a conspiracy theory, even the thought that “I” am writing this article.
A key aspect of conspiracy theories is that the justification behind a theory doesn’t even need to make sense for the theory to gain traction. This led me to try to generate the reasoning behind conspiracy theories using machine learning.
From questioning the rapid technological development of the Industrial Revolution to doubting the trustworthiness of COVID-19 vaccines, conspiracy theories have always been around. They’ve made an impact on society in different ways, but at no other time in history have humans been able to communicate with each other as quickly and efficiently as we can today. With the rise of social media applications like Twitter, YouTube, Instagram, Reddit, and others, it has never been easier to share ideas with little to no repercussions. Furthermore, these applications allow users to comment anonymously as they please, adding another layer of freedom.
While this liberty can have consequences, it also gives people who otherwise wouldn’t have a voice the opportunity to speak up for what they believe in. Popular public forum websites like Reddit allow users to anonymously post about what they believe and discuss current events with other users. Social media is a breeding ground for new ideas, and it was only a matter of time before conspiracy theories made their way onto these platforms.
My approach to generating conspiracy theories took advantage of Reddit, the social media platform and public forum. I used Reddit to create a dataset, then trained the machine learning model GPT-2 on the dataset. Finally, I deployed the model using Streamlit on Google Cloud Platform.
Data Collection & Pre-Processing
Before I could generate anything, I needed a dataset to work with. I elected to use the social media platform Reddit to build this dataset for several reasons:
- Reddit has a strong community of conspiracy theorists on the subreddit r/conspiracy
- Reddit already has an API that can be used to easily create the dataset
- Reddit is anonymous, unfiltered, and has years worth of ideas waiting to be collected
If you’d like, check out the subreddit r/conspiracy to know more about the community. Also, if you want to follow along with the code, you can do so on this project’s Github repo.
The r/conspiracy community has separated discussions of different conspiracy theories into what are called “roundtables.” In these roundtables, users express their ideas, share information, and debate the topic at hand.
Each of these roundtables is a public forum where anyone can read, share, or comment on ideas. The length of comments varies from a short burst of text to a research essay with citations, which makes them perfect for building a dataset.
I used the Python Reddit API Wrapper (PRAW) to pull all of the comments from these roundtables. It turned out to be a short and simple Python script, and after running it I had gathered over 10,000 comments spanning years of discussion across the r/conspiracy roundtables.
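To give a sense of what that collection step looks like, here is a minimal sketch using PRAW. The credentials, post IDs, and file paths are placeholders, not the exact values from my script; check the repo for the real version.
import praw

# Hypothetical credentials and roundtable post IDs -- the real script in the repo may differ
reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="con_gen data collection",
)

roundtable_ids = ["abc123", "def456"]  # IDs of r/conspiracy roundtable posts

comments = []
for post_id in roundtable_ids:
    submission = reddit.submission(id=post_id)
    # Expand every comment thread, including those hidden behind "load more" links
    submission.comments.replace_more(limit=None)
    for comment in submission.comments.list():
        comments.append(comment.body)

with open("data/raw/comments.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(comments))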
While I did have a dataset, not all of the data could be used to train the model. Some comments were only a few words long or contained hyperlinks and special characters, so I also wrote a script to clean the dataset.
The biggest issue I had when cleaning this dataset was deciding how long a meaningful comment should be. Some comments were less than ten words but brought up good points. In the end, I decided to drop any comment with fewer than 170 characters, because that was the minimum length I found for a comment to add value to the conversation. This also prevents the model from generating very short conspiracy theories. After running this script, I had about 6,000 comments in the dataset.
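A rough sketch of that cleaning pass is below. The regular expressions and file paths are illustrative assumptions; only the 170-character cutoff comes directly from the approach described above.
import re

MIN_LENGTH = 170  # minimum number of characters for a comment to be kept

with open("data/raw/comments.txt", "r", encoding="utf-8") as f:
    raw_comments = f.read().split("\n")

cleaned = []
for comment in raw_comments:
    # Strip hyperlinks
    comment = re.sub(r"http\S+", "", comment)
    # Strip special characters, keeping basic punctuation
    comment = re.sub(r"[^a-zA-Z0-9 .,!?'\"]", "", comment).strip()
    # Drop comments too short to add value to the conversation
    if len(comment) >= MIN_LENGTH:
        cleaned.append(comment)

with open("data/processed/corpus.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(cleaned))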
Initially, I was afraid that this dataset was too small; however, I felt it would suffice for the purposes of this project.
After creating the dataset, I was ready to build the model.
Model Selection & Development
First, I needed to learn how natural language generation actually works. To do this, I watched some YouTube videos, read articles and papers, and explored the documentation of various machine learning models.
There are multiple methods, but the one that interested me the most was GPT-2. This model had been getting a lot of attention from the natural language processing community due to its improvements over traditional deep learning methods. As noted in this notebook by HuggingFace, GPT-2 has an “improved transformer architecture and massive unsupervised training data” with over 300M parameters, showing that the model was trained heavily on lots of quality data.
There have also been plenty of implementations of GPT-2, from generating news stories to question answering and more. On top of that, GPT-2 was trained on millions of web pages, giving it a broad base of language knowledge. With this in mind, I decided to use GPT-2 to generate the conspiracy theories.
To get started, I used the tool aitextgen. It allowed me to fine-tune GPT-2 on the dataset in just a few lines of code in a Google Colab Notebook:
from aitextgen import aitextgen

# defining the model
ai = aitextgen(tf_gpt2="124M", to_gpu=True)

file_name = "../data/processed/corpus.txt"

# training model
ai.train(file_name,
         line_by_line=False,
         from_cache=False,
         num_steps=6000,
         generate_every=1000,
         save_every=1000,
         save_gdrive=False,
         learning_rate=1e-3,
         fp16=False,
         batch_size=1,
         )

ai.save()
After training, the model reached a loss of less than 0.1. Not too shabby for a small dataset.
The generated text isn’t always the best; it commonly strays from the input topic or outputs nonsense. This could be because the dataset is too small, or because the model could be further optimized. It is also important to note that natural language generation (NLG) is still a growing field of research within machine learning.
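For reference, generating text with the fine-tuned model looks roughly like the sketch below. The model folder name and the specific sampling values (temperature, top_p) are examples of the knobs I tuned, not necessarily the exact settings used in the app.
from aitextgen import aitextgen

# Load the fine-tuned model saved after training (folder name is an assumption)
ai = aitextgen(model_folder="trained_model")

# Sampling parameters control how "human-like" vs. repetitive the output feels
ai.generate(
    n=3,                        # number of theories to generate
    prompt="The moon landing",  # seed text for the theory
    max_length=200,             # cap on generated tokens
    temperature=0.9,            # higher = more creative / random
    top_p=0.95,                 # nucleus sampling threshold
)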
Model Deployment
Upon training the model and successfully generating text, I moved on to the biggest challenge of this project: deployment. I wanted to get a web application out quickly, so I decided to use Streamlit because of its ease of use and ability to scale quickly.
I used Streamlit’s documentation and public forum threads to build a basic web application and tested it out on my local machine.
This web application took a while to build because I ran into several issues:
- The requirements.txt file did not have the versions/packages necessary to run on Streamlit
- The model was too large to store in a regular GitHub repo, so I needed to learn to use Git LFS
- The website was slow because I wasn’t caching the model and other large elements (see the caching sketch below)
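Caching fixed the speed issue. Here is a minimal sketch of the pattern, assuming a helper that loads the fine-tuned model; the decorator, folder name, and app layout are illustrative rather than the exact code from the repo (newer Streamlit versions offer st.cache_resource for the same purpose).
import streamlit as st
from aitextgen import aitextgen

@st.cache(allow_output_mutation=True)
def load_model():
    # Load the fine-tuned GPT-2 model once and reuse it across reruns
    return aitextgen(model_folder="trained_model")

ai = load_model()

st.title("Conspiracy Theory Generator")
prompt = st.text_input("Enter a topic to generate a theory about")

if st.button("Generate"):
    theory = ai.generate_one(prompt=prompt, max_length=200)
    st.write(theory)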
After getting a working version of the web application, I decided to deploy it on Google Cloud Platform (GCP). I’d never used GCP before. The hardest part for me was setting up Docker and making sure that my application was properly containerized.
I deployed the web application on both Google Kubernetes Engine and Google App Engine to see which one I preferred. I chose App Engine because it requires far fewer steps than Kubernetes. I also got a custom domain for my application and was able to route it successfully.
Right now, you can find the application deployed at:
https://share.streamlit.io/lognorman20/con_gen/main/app.py
Check it out!
Conclusion
All in all, I enjoyed this project. I was able to explore the creativity of conspiracy theories and teach it to a computer. Conspiracy theories will always be present, especially with the current state of social media. Who knows, maybe the Earth really is flat.
What are your thoughts on this project? Let me know in the comments.
Feel free to reach out on LinkedIn and follow my work on GitHub.
Interested in working with me? Send me a message on LinkedIn or Upwork.
If you enjoyed reading about this project, I encourage you to follow me on Medium, where I write about my experiences in the world of coding. While you’re at it, check out my other projects, like Predicting Sneaker Prices or building a Spotify Song Recommendation System!