Introduction and Motivation
The motivation for this project was to explore natural language processing and text generation techniques. We thought that a pertinent application of NLP, given the current news environment, was to analyze President Donald Trump's tweets. From this data, we were able to cluster his tweets into distinct topics and generate original tweets in his linguistic style relating to each of these topics.
For our dataset, we used a repository we found on GitHub containing all of Donald Trump’s tweets from 2015 to 2017. This repository is updated hourly, providing us with the most up-to-date data available. The data was given in JSON format as follows:
For our purposes, we only needed the “text” field of the JSON, so we used Python’s JSON module in order to parse the JSON files we pulled from the repo.
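A minimal sketch of that parsing step is below. The file name is illustrative, and we assume each file holds a JSON array of tweet objects; only the "text" field is kept.

```python
import json

with open("trump_tweets_2016.json", "r", encoding="utf-8") as f:  # file name is illustrative
    tweets = json.load(f)

texts = [tweet["text"] for tweet in tweets]  # keep only the tweet text
print(len(texts), "tweets loaded")
```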
Using just this data, we were fairly successful generating tweets. However, we noticed that a few of the tweets seemed confusing or slightly incoherent. This wasn’t necessarily a bad thing, given Donald Trump’s general tweeting style, but we thought it would be beneficial to gather more data in order to improve the clarity of the tweets. We decided to find transcripts of his speeches, which (most of the time) contain full sentences. This helped us construct more understandable tweets.
We found a website that contains all of his speeches from the campaign trail for the 2016 election. This website didn’t have an API, so we had to scrape the site’s links using Python and BeautifulSoup. This proved to be a bit tricky due to the site’s formatting, which had oddly nested <p> tags for each new paragraph. Parsing it required finding the tag with class=’displaytext’ and removing all of the <p> paragraph tags and the <i> italics tags. The texts also contained a lot of punctuation we didn’t want to include in our model, such as ellipses, square brackets, etc. We had to remove these characters. After all of the text cleanup was done, we tokenized all of the text into sentence tokens and wrote each sentence to a file. Adding this text seemed to help with the quality of our tweets, while also making our topics somewhat less defined, probably due to the more generic context of most campaign speeches.
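A rough sketch of that scraping and cleanup flow is shown below. The URL is a placeholder, and the selectors simply follow the description above rather than the site's exact markup.

```python
import re

import requests
from bs4 import BeautifulSoup
from nltk.tokenize import sent_tokenize  # requires nltk.download("punkt")

def scrape_speech(url):
    """Extract and sentence-tokenize the transcript body from one speech page."""
    soup = BeautifulSoup(requests.get(url).text, "html.parser")

    # The transcript lives in an element with class="displaytext";
    # unwrap the oddly nested <p> and <i> tags so only the raw text remains.
    body = soup.find(class_="displaytext")
    for tag in body.find_all(["p", "i"]):
        tag.unwrap()

    text = body.get_text(" ")
    text = re.sub(r"\.\.\.|\[.*?\]", " ", text)  # drop ellipses and bracketed asides
    return sent_tokenize(text)

# sentences = scrape_speech("http://example.com/speech")  # placeholder URL
# with open("speeches.txt", "w") as out:
#     out.write("\n".join(sentences))
```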
Topic Creation from Tweets
In order to create Trump-like tweets within a certain topic, we had to identify how topics could be extracted from our data sources. We decided that we would try to translate texts into representative vectors, cluster these vectors, then identify the central topics of these clusters. We attempted three methods in topic creation: Doc2Vec, Term Frequency-Inverse Document Frequency (TF-IDF), and Latent Dirichlet Allocation (LDA).
We found work around topic modeling to be quite helpful as we defined possible approaches. Specifically, Brandon Rose and Rik Nijessen wrote helpful tutorials about finding common topics in large texts (movie reviews and news articles, respectively). These served as helpful guides as we approached topic identification within our data sources.
Method 1: Doc2Vec with K-means
Doc2Vec is a variant of Word2Vec. It stems from Quoc Le and Tomas Mikolov's paper proposing the 'paragraph vector', which seeks to overcome the weaknesses of bag-of-words models by tracking word order and acknowledging word semantics. Each document is represented by a single vector that captures the words it contains.
For this model, we removed all links and stop words such as "a" and "the", and made all words lowercase. We then trained a Doc2Vec model, using a decreasing learning rate in order to more effectively reach the optimum. After building the Doc2Vec model, we translated each tweet into a numerical vector. We then clustered these vectors into 15 groups using K-means and extracted the most common words of the texts within each cluster. A sampling of the groupings with the 10 most common words found in the grouped texts is shown below.
While it is possible to distinguish some differences in the clusters (perhaps we could annotate the fifth cluster as “interview focused” and the last cluster as “fox news”), we see that many words like “@realdonaldtrump” and “trump” are consistent throughout the clusters. This is not ideal.
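For reference, a rough sketch of the Doc2Vec-plus-K-means pipeline described above (gensim 4.x API). The hyperparameters and the learning-rate schedule are illustrative rather than our exact settings, and `cleaned_tweets` stands in for the preprocessed token lists.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.cluster import KMeans

# `cleaned_tweets` is assumed to be a list of lowercased token lists with links/stopwords removed.
docs = [TaggedDocument(words=tokens, tags=[i]) for i, tokens in enumerate(cleaned_tweets)]

model = Doc2Vec(vector_size=100, min_count=2, alpha=0.025, min_alpha=0.025)
model.build_vocab(docs)
for epoch in range(10):
    # Manually decay the learning rate on each pass, as described above.
    model.train(docs, total_examples=model.corpus_count, epochs=1)
    model.alpha -= 0.002
    model.min_alpha = model.alpha

vectors = [model.dv[i] for i in range(len(docs))]            # one vector per tweet
labels = KMeans(n_clusters=15, random_state=0).fit_predict(vectors)
```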
Method 2: Term Frequency-Inverse Document Frequency with K-means
TF-IDF is a measure of a word’s importance to a single text within a set of texts. A word’s TF-IDF weight for a single text increases if it appears in a text frequently, but is counterbalanced by its prevalence in the overall set of texts. This measure is effective in identifying the words that best discriminate texts in a collection.
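A common formulation of the weight (scikit-learn's default adds smoothing and normalization, so exact values differ slightly) is:

\text{tf-idf}(t, d) = \text{tf}(t, d) \times \log\frac{N}{\text{df}(t)}

where tf(t, d) counts how often term t appears in text d, df(t) is the number of texts containing t, and N is the total number of texts.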
We began this method by removing the stopwords from the texts. We then used NLTK's Snowball Stemmer to identify the root of each word. From these, a look-up table is created in which a single stem (like 'walk') is associated with multiple tokens (like 'walked', 'walking', 'walks'). SKLearn's TfidfVectorizer is then used to transform the texts into a matrix that accounts for the frequency of each word in a single text and the word's prevalence in the overall set of texts. We applied K-means clustering to the matrix representing the vectorized tweets. As with the Doc2Vec method, we created 15 topics with K-means and found the most common words of the texts within each cluster. A sampling of these groupings with the most common words is as follows:
It was easier to extract topics from these clusters than the clusters formed from the Doc2Vec. As seen above, Cluster 2 could be “interview” or “fox”, Cluster 9 could be “hillary”, Cluster 11 could be “make america great again”, and Cluster 12 could be “policy decisions”. One challenge we see is that some of the words are duplicated. However, this could be alleviated (in a slightly hacky way) just by using the top 10 unique words to define the clusters.
Another thing we saw from our clusters was that some had 150x as many tweets in them as others. We attempted nested k-means clustering on the clusters that were larger than 1500 tweets, but this still produced unevenly sized clusters with vague topics.
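For reference, a rough sketch of this TF-IDF-plus-K-means pipeline. The vectorizer settings shown here are illustrative, and `texts` stands in for the combined tweets and speech sentences.

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

stemmer = SnowballStemmer("english")
stop_words = set(stopwords.words("english"))

def tokenize_and_stem(text):
    """Lowercase, drop stopwords and non-alphabetic tokens, then stem what remains."""
    tokens = [t.lower() for t in nltk.word_tokenize(text) if t.isalpha()]
    return [stemmer.stem(t) for t in tokens if t not in stop_words]

vectorizer = TfidfVectorizer(tokenizer=tokenize_and_stem, max_df=0.8, min_df=2)
tfidf_matrix = vectorizer.fit_transform(texts)            # texts = tweets + speech sentences

km = KMeans(n_clusters=15, random_state=0).fit(tfidf_matrix)
labels = km.labels_                                        # cluster assignment for each text
```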
Method 3: Latent Dirichlet Allocation
LDA views each text as a mixture of topics and each topic as a mixture of words, all of which can be overlapping.
Similarly to TF-IDF, we began our implementation of LDA by tokenizing the text. We then used gensim's corpora module to create a bag-of-words corpus that includes all words appearing in at least five of the texts and in no more than 80% of the texts. We used this corpus to generate 15 topics with LDA. As a sample of our results, the following graphs show two of the topics with the 10 most important words and their weights.
- One topic, with words: cruz, debate, donald, gop, lead, new, poll, rubio, trump, via
- Another topic, with words: crowd, great, honor, iowa, join, makeamericagreatagain, great, thank, today, trump2016
While these clusters do a better job of capturing the topics than the Doc2Vec method, the topics aren't as clear as when we used TF-IDF. Furthermore, LDA does not actually cluster the texts into groups; rather, it defines topics across the corpus as a whole, meaning extra steps would be necessary to actually assign texts to each of these topics.
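For reference, a minimal sketch of this LDA pipeline with gensim. The number of passes is illustrative, and `tokenized_texts` stands in for the tokenized tweets and speech sentences.

```python
from gensim import corpora, models

dictionary = corpora.Dictionary(tokenized_texts)
dictionary.filter_extremes(no_below=5, no_above=0.8)   # keep words in >=5 texts and <=80% of texts

corpus = [dictionary.doc2bow(tokens) for tokens in tokenized_texts]
lda = models.LdaModel(corpus, num_topics=15, id2word=dictionary, passes=10)

for topic_id, words in lda.print_topics(num_topics=15, num_words=10):
    print(topic_id, words)
```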
Final Design Decision
After trying the three methods above, we decided to use TF-IDF with k-means as it created the most defined cluster topics. The following is a visualization of each topic, manually summarized from the most popular tokens in the cluster, along with the size of each cluster.
Potential Future Steps
Some future steps that could be taken to improve this include combining these methods to improve topic definition. For example, we could combine TF-IDF and Doc2Vec as two features for each of the tweets, then cluster them into groups. After finding groups, we could use LDA to identify primary and secondary topics in each of the groups (rather than just selecting the most frequent words).
Alternatively, early on in the project we tried clustering with only Donald Trump tweets (not any of his speeches). This led to much more defined cluster topics, but caused challenges in the tweet generation portion of the project because the dataset was too small. Since adding the speeches seemed to cause murkier topics, we could try a method that first clusters tweets and identifies topic centroids. We could then try mapping Trump speeches to these centroids to place them in the pre-defined clusters. This would allow us to augment our dataset for tweet generation while still maintaining well-defined topics.
Many political pundits and commentators have remarked that the current U.S. President’s tweets, while varied in tone and topic, remind people of text potentially generated using probabilistic models, such as a Markov chain. We pursued two methods for generating topic-specific tweets in the style of Donald Trump, Long Short Term Memory models and Markov Chains, and we will discuss both in the following sections.
Method 1: Long Short Term Memory (LSTM) Text Generation
The first method we pursued for text generation was a generative language model based on a Long Short Term Memory (LSTM) network. A significant limitation of traditional neural networks in NLP applications is that they accept a fixed-size vector as input and produce a fixed-size vector as output. In addition, all inputs and outputs are independent of each other, and the network does not keep track of sequences of states or how current states may be affected by previous states. Recurrent neural networks (RNNs) address this issue by allowing operations over sequences of vectors. Essentially, an RNN is a chain of repeating neural network modules, with each module passing information to the next. This allows the network to persist information and generate outputs based on previous inputs. Because of these valuable properties, RNNs are often used in applications like speech recognition, language modeling, and translation.

However, RNNs have a downside. In theory they can make use of information in long sequences, but in practice, as the gap between past relevant information and the current state grows, RNNs are unable to learn to connect information that is too far in the past. A specific type of RNN, the Long Short Term Memory network (LSTM), is able to capture these long-term dependencies and thus shows particular promise in NLP applications. In this section, we will discuss the data preparation, training set, and parameter tuning necessary to train an LSTM model to generate coherent tweets in the linguistic style of Donald Trump. For reference, blog posts by Christopher Olah, Andrej Karpathy, Denny Britz, and Trung Tran are immensely helpful for both conceptual understanding and implementation of an LSTM model.
Data Preparation for Training
In order to generate tweets relating to each of the 12 topic clusters, 12 separate LSTM models had to be trained over each cluster's relevant tweets. Here, we will be showing our work training on the 1st cluster, which pertains to "Making America Great Again." We trained a word-level LSTM for our purposes, rather than a character-level model. This tutorial shows the impressive capabilities of a character-level LSTM model to train over a large corpus (the novel Alice in Wonderland) and then to generate original and coherent text character by character. That model trained over the approximately 150,000 characters contained in the Alice in Wonderland corpus. On the other hand, our corpus consists of tweets and speech snippets in various topic clusters. In the Make America Great Again cluster, for example, there are 529 tweets. In total, this amounts to around 57,000 characters, only about a third of the corpus used in the tutorial. With a much more condensed corpus, word-level training gave us a higher chance of coherency in our generated text.
In the screenshot below, we can see a few of this cluster’s tweets tokenized by “words”. Our definition of “words” in this section will include actual words as well as crucial punctuation (.,;?!) and special characters related to Twitter (@#).
Two dictionaries were created for each unique word to map it to a numerical value and vice versa. This allowed for sequences of words to later be encoded as one-hot vectors before inputting them into the model.
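A minimal sketch of those two lookup tables; `all_words` stands in for every token in the cluster's corpus, and the one-hot helper is only for illustration.

```python
import numpy as np

vocab = sorted(set(all_words))                       # every unique "word" in the cluster
word_to_ix = {w: i for i, w in enumerate(vocab)}     # word -> numerical value
ix_to_word = {i: w for w, i in word_to_ix.items()}   # numerical value -> word

def one_hot(word):
    """Encode a single word as a one-hot vector over the vocabulary."""
    vec = np.zeros(len(vocab))
    vec[word_to_ix[word]] = 1.0
    return vec
```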
We used the Keras API to create and train our model. The training set for the model consists of sequences of words of length seq_len as inputs and the next word in that sequence as outputs. When formatting the training input and output set, we tried two different methods and compared their performance. Both methods require a pre-defined seq_len. This is a parameter that was tuned carefully, but for this section let's consider seq_len=5. We will demonstrate the training set generation for both methods using the following two tweets, which appeared back to back in the Make America Great Again cluster.
Input Method 1
In the first method, we built the training set tweet by tweet, maintaining separation between tweets. With the above tweets and the provided seq_len, this method would generate the training set shown in the table below. The inputs are sequences of length 5 and the outputs are the words that immediately follow the input sequences. Once tweet 1 has been fully traversed to the final word (!), the next tweet is traversed separately.
The length of the training set using Method 1 is described in the equation below. Keep in mind that if a particular tweet has fewer words than seq_len, that tweet will not be included in the training set.
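Written out (our reconstruction, with |t| denoting the number of words in tweet t):

N_{1} = \sum_{t \in \text{tweets}} \max\big(0,\ |t| - \text{seq\_len}\big)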
Input Method 2
In the second method, we ignored the differentiation between separate tweets and treated all tweets as one continuous corpus. The table below shows the training set generated with this method. Unlike Method 1, when the end of tweet 1 is reached, no distinction is made: the next input sequence simply spans the end of tweet 1, and the next output is the first word of tweet 2.
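A minimal sketch of this sliding-window construction; `corpus_words` stands in for the concatenated token list of every tweet in the cluster, and `word_to_ix` is the lookup table from earlier.

```python
import numpy as np

seq_len = 5
X, y = [], []
for i in range(len(corpus_words) - seq_len):
    X.append([word_to_ix[w] for w in corpus_words[i : i + seq_len]])  # input: seq_len consecutive words
    y.append(word_to_ix[corpus_words[i + seq_len]])                   # output: the word that follows

X = np.array(X)
y = np.array(y)
```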
The length of the training set using Method 2 is described in the equation below. Note that this training set will be significantly larger than that of Method 1.
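Written out (our reconstruction, with W the total number of words across all tweets in the cluster):

N_{2} = W - \text{seq\_len}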
Our original hypothesis was that Method 1 would produce the most topic-specific and coherent tweets due to the distinction between separate tweets. We assumed that back-to-back tweets in the dataset, although in the same topic cluster, may be about completely discrete events or subtopics. However, in practice, Method 2 was far more successful. We attributed this to the size of the training set from Method 2 being much larger than that of Method 1. In addition, Trump's linguistic tweet style is already fairly scattered and sporadic, so training over the tweets continuously was not as problematic as expected.
The number of words in the input sequences trained on by the model, or seq_len, was a very important parameter to tune to ensure successful generation of coherent text. Visualizing the number of words in Trump’s tweets and some related statistics proved helpful in making educated tuning decisions.
On average, Trump’s tweets contain 17 words. There is a high risk of overfitting if a seq_len greater than or equal to 13 is chosen, as this runs the risk of memorizing word-for-word some of his shorter tweets (at least 25%). An example of output on a model that had been overfit was:
The tweet’s second sentence is an exact copy of a sentence in one of Trump’s existing tweets that the model trained on. On the other hand, there is a high risk of underfitting if a seq_len less than 5 is chosen. A model that had been underfit resulted in completely incoherent outputs and often looping and repetitive words and characters, like:
We found the most coherent tweets were generated from models that were trained over inputs of length 7–10 words.
In terms of parameters for the LSTM, our most successful model was created with 3 hidden layers (LAYERS_NUM) each with 700 hidden states (HIDDEN_DIM). Exact specifications are shown in the configuration code below. Most successful results were generated once models had trained for upwards of 100 epochs.
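A rough Keras sketch consistent with those settings is below. The layer arrangement (stacked LSTMs feeding a time-distributed softmax over the vocabulary) follows the tutorials referenced earlier; the optimizer and input encoding are assumptions rather than our exact configuration.

```python
from keras.models import Sequential
from keras.layers import LSTM, Dense, TimeDistributed, Activation

HIDDEN_DIM = 700
LAYERS_NUM = 3
VOCAB_SIZE = len(vocab)   # number of unique "words" in this cluster

model = Sequential()
model.add(LSTM(HIDDEN_DIM, input_shape=(None, VOCAB_SIZE), return_sequences=True))
for _ in range(LAYERS_NUM - 1):
    model.add(LSTM(HIDDEN_DIM, return_sequences=True))
model.add(TimeDistributed(Dense(VOCAB_SIZE)))
model.add(Activation("softmax"))
model.compile(loss="categorical_crossentropy", optimizer="rmsprop")
```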
To generate a tweet, a word is randomly selected from the corpus as an input seed to the model which outputs the next most likely word to follow. This sequence, now of length two words, is fed back into the model to generate the following word, and this process is repeated up to a particular length.
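A sketch of that generation loop, assuming the stacked-LSTM architecture above with one-hot inputs; details such as choosing the most likely word greedily (rather than sampling) are illustrative.

```python
import numpy as np

def generate_words(model, num_words):
    """Seed with a random word, then repeatedly feed the growing sequence back into the model."""
    ix = [np.random.randint(VOCAB_SIZE)]
    words = [ix_to_word[ix[-1]]]
    X = np.zeros((1, num_words, VOCAB_SIZE))
    for t in range(num_words - 1):
        X[0, t, ix[-1]] = 1.0                          # add the latest word to the input sequence
        probs = model.predict(X[:, : t + 1, :])[0, t]  # distribution over the next word
        ix.append(int(np.argmax(probs)))
        words.append(ix_to_word[ix[-1]])
    return words
```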
The length of a generated tweet was an interesting constraint to consider. Twitter has recently raised their tweet character limit to 280 from 140. In the following graph, we can see that Trump has taken advantage of this character bump.
Rather than randomly selecting a character limit for our generated tweets between [140, 280], or even [0, 280], which runs the risk of an excessive number of very short tweets, it was logical to generate tweets with lengths in keeping with Trump's existing habits. We created a distribution of the likelihood of Trump tweeting 0, 1, 2, …, 280 characters and, based on this probability distribution, randomly selected a character limit for each generated tweet.
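A minimal sketch of that sampling step; `historical_lengths` stands in for a list with the character count of every tweet in the dataset.

```python
import numpy as np

lengths, counts = np.unique(historical_lengths, return_counts=True)
probabilities = counts / counts.sum()        # empirical distribution of tweet lengths

char_limit = np.random.choice(lengths, p=probabilities)
```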
Once the character limit had been reached by the sequence of words produced by the model, the array of generated words was passed to a helper function that arranged the individual words, punctuation, and special characters into a properly formatted sentence. That process is shown below.
Below, are a few examples of the most successful outputs from our LSTM model based on the Make America Great Again topic cluster.
After parameter tuning, the model definitely began producing fairly coherent tweets that were linguistically reflective of Trump and also generally topic specific. However, there was still some trouble producing consistently coherent text. The corpus per topic was likely too small to generate a robust enough model. As shown in this tutorial, LSTMs generate the most successful outputs after training on a novel or even a series of novels, rather than just a few hundred tweet-length texts. Also, the training time required for each separate model was slightly prohibitive for the timeline of this project. It took about 2 hours to train a model for >100 epochs using a costly AWS GPU instance. Because of some of these limitations, we also pursued Markov Chains as an alternative method for tweet generation.
Method 2: Markov Chains
A Markov chain contains a set of states and moves successively, one step at a time, from one state i to another state j with a transition probability p. Markov chains are memoryless (the Markov property): predictions for a future state are based solely on the present state and are conditionally independent of the rest of the chain's history. Markov chains can predict our next typed words (as smartphone text auto-suggestion systems do) and also generate text given a sample document.
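A toy sketch of that idea applied to text, where each word is a state and transitions are learned from observed word pairs; the seed text and starting state are made up purely for illustration.

```python
import random
from collections import defaultdict

# Record which words were observed to follow each word.
transitions = defaultdict(list)
words = "make america great again we will make america proud again".split()
for current, nxt in zip(words, words[1:]):
    transitions[current].append(nxt)

# Walk the chain: the next word depends only on the current word (the Markov property).
state = "make"
generated = [state]
for _ in range(5):
    state = random.choice(transitions[state])
    generated.append(state)
print(" ".join(generated))
```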
For our project, we tried two Markov chain approaches: first, a simple Markov chain implementation in a Dart package known as markov (by Filip Hracek); and second, a Python package named Markovify (created by Jeremy Singer-Vine).
Method 1a: Markov (Dart implementation)
According to its creator, Filip Hracek, the markov (stylized with lowercase) Dart package is a Markov chain implementation that generates words and punctuation and is tuned for tweets. Dart is a general-purpose programming language used to build web, server, and mobile applications; we looked into Dart because Filip Hracek has created a Trump tweet generator in the past, but without the topic clustering that we have proposed to do. This package works by converting a byte stream of text into unicode lines and piping the stream into a Markov chain generator of order 2. Order refers to the number of past states that the future state depends on. Then, new text is generated from the tokenized input stream and formatted with valid English syntax before output. An example of the output is below:
As the above screenshot shows, the following tweets were generated given their inputs:
- “Our way of life is under threat by Radical Islam. Hillary Clinton can’t close the deal with Bernie– and destroyed City.” (generated from a collection of Donald Trump tweets from 2016–2017)
- “Crooked Hillary said Loudly, and always very short (stamina). Media is protecting her! Frm Intel Comm Inspector General https://t.co/b0tLW5TvhX” (generated from a collection of Donald Trump tweets from 2016–2017)
- “‘Hillary Clinton Had Gun Control Supporters Planted In Town Hall Audience’ https:/t.co/1GVq74iW8a’” (generated from an early cluster of tweets with Hillary Clinton as cluster topic)
Method 1b: Markovify (Python package)
Markovify is a Python package billed as a “simple, extensible Markov chain generator” used to build Markov models from large corpora of text. Markovify works by reading raw text as a string, splitting the input into sentences, and generating sentences based on parameters such as state size, generated sentence length, number of attempts to create a sentence that does not overlap with the original text, and number of words in common with the original text, among others.
For the purposes of our project, we generated tweets using Markovify by specifying a cluster from our overall dataset as the text input. After manually tuning the model's parameters, we found that a state size of 3, a maximum of 1000 generation attempts, and a maximum overlap of 15 words generated tweets that were subjectively the closest to the tone and syntax of Donald Trump's actual tweets. To determine the length of the output, we used the probability distribution of the length of Donald Trump's historical tweets and used that as a parameter in Markovify. A sample of the results from our tweet generation using Markovify is below:
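A sketch of this setup; the file name is illustrative, `char_limit` is assumed to be sampled from the tweet-length distribution described earlier, and NewlineText assumes one text per line in the cluster file.

```python
import markovify

with open("cluster_texts.txt") as f:          # one tweet/sentence per line for this cluster
    corpus = f.read()

model = markovify.NewlineText(corpus, state_size=3)

tweet = model.make_short_sentence(
    max_chars=char_limit,    # sampled from the distribution of Trump's historical tweet lengths
    tries=1000,              # maximum number of generation attempts
    max_overlap_total=15,    # limit word-for-word overlap with the original text
)
print(tweet)
```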
Evaluation of Markov (Dart) vs. Markovify (Python)
While the Dart package markov was useful in demonstrating how we could use Markov chains to begin generating tweets, the Markovify package was more useful for the purposes of our project because of its Python support. Additionally, a limitation of the Dart package was that tuning the parameters of the tweet (especially length) became a bottleneck, as the implementation of the Markov chain and subsequent tweet generation was in a completely new language that our group members did not yet know.
Conclusion: Final Design and Future Work
We feel that the tweets written by our tweet generator closely resemble the style of tweets written by the President himself. We found that clustering his tweets’ topics using TF-IDF with k-means produced the most defined cluster topics and that generating tweets using the Markovify package gave us the most coherent generated tweets.
In the future we would like to be able to provide any topic to our generator, not just the ones we restrict it to currently. We would also like to try another model we found for tweet topic clustering, but did not have time to implement, called Twitter-LDA. While traditional LDA methods are designed to be used on long documents, this model is specifically designed to be applied to short documents like tweets and performs better on them than traditional LDA methods.
Furthermore, this project could be expanded to identify how politicians tweet about certain topics similarly or differently. Our work could also be used to determine how specific topics can cause polarization or popularity in tweets.
This project has been completed as part of the University of Texas at Austin’s data science tech core within Electrical and Computer Engineering.