The MidTerms on Twitter


Inferring political support from Twitter has been among the most widely studied applications of NLP (Natural Language Processing) on the platform, with many valid approaches developed in the years since its wide adoption. A number of these approaches rely on sentiment analysis, the classification of tweets as positive or negative, to measure support for parties and leaders. Others classify users rather than tweets, modeling users as voters instead of treating individual tweets as indicators of support. Few of them, however, measure up to the task of analyzing contentious national parliamentary elections where local and national leaders, parties and issues are all at play. The system proposed here aims to tackle one such election, the 2018 US House of Representatives election. The approach followed in its development requires considerable domain knowledge but is completely language agnostic, i.e. none of the tools, techniques and data acquisition strategies used are exclusive to one language. The standout features of the system are its reliance on regional data and its parallel yet distinct analyses of local support for national figures and factions on one side and for the congressional candidates actually running for office on the other. Both analyses rely, however, on a relatively complex pipeline, which is explained in this paper. The review of this pipeline is divided into three main areas plus validation.


How the system queries for geotagged tweets and for tweets containing the candidates’ names or handles.


How the NLP models perform, how they are trained and how the training and testing data was acquired.


How the analyzed data is displayed on a map, and how to make as much information as possible available while avoiding clutter and confusion.

*Bonus* Full system “validation”

How to validate the system using districts where the outcome is forecast with a high degree of confidence by reputable sources.


The methodology for searching tweets is sometimes overlooked in this kind of analysis because it poses mainly technical rather than theoretical challenges. It is nonetheless worth a thorough discussion, as it may introduce biases that are difficult to detect in later stages of the process. As explained before, the system gauges both local support for national causes and the popularity of the candidates running, but it does so separately, analyzing different data gathered in different ways. This approach mirrors a longstanding practice of US polling, where the voters interviewed are asked both about the favorability of the candidates on the ballot and about the likelihood that they would support a generic candidate from one party rather than the other. The two methodologies will be referred to as “On Ballot” (OB) and “Generic Ballot” (GB) respectively. Again borrowing from US polling practice, the analysis focuses on a relatively small number of districts. Indeed, the races for many congressional seats are not very competitive, due to factors such as the incumbent’s advantage. By consulting reputable sources such as the Cook Political Report, it is therefore possible to cut the number of districts to analyze from 435, the number of seats in the House, to about 68, the races rated lean or toss-up as of Oct 10th.

Query composition for On Ballot tweets

The source of data for both approaches is the Twitter standard search API. While GB, as will be explained later, puts much emphasis on geotagged local data, OB places no regional limitation on the query. In fact the queries for OB contain only the specific candidate’s first and last name and his/her Twitter handles, e.g.

"French Hill" OR @RepFrenchHill OR @ElectFrench

It is reasonable to query in this way on the assumption that most congressional candidates don’t receive national attention and that the vast majority of the Twitter activity around a specific election stems from local voters. Furthermore, this activity is on the whole sufficient for analysis: in every district an average of about 7k tweets per candidate is found over the span of a week.
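The composition of an OB query can be sketched in a few lines. This is only an illustration: the function name `build_ob_query` is hypothetical and not taken from the system's code, and it covers just the string assembly, not the API call itself.

```python
def build_ob_query(full_name, handles):
    """Combine a quoted full name with one or more Twitter handles using OR."""
    # Quoting the name asks the search for the exact phrase.
    terms = ['"{}"'.format(full_name)]
    for handle in handles:
        # Accept handles written with or without the leading "@".
        terms.append("@" + handle.lstrip("@"))
    return " OR ".join(terms)


query = build_ob_query("French Hill", ["RepFrenchHill", "@ElectFrench"])
print(query)  # "French Hill" OR @RepFrenchHill OR @ElectFrench
```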

Query composition for Generic Ballot tweets

When it comes to GB tweets the search queries become more complicated. As with OB, the standard Twitter search API is used; this time, however, the queries contain national leaders and parties and are limited to a certain region, the specific congressional district being queried. For example this is the query for tweets on Republicans.


In order to limit the search to the congressional district, the geocode field of the Twitter search API query is used. The field, however, takes only a pair of coordinates for the center of the search circle and a float for its radius. As very few congressional districts can be approximated well with a single circle, several queries are needed to cover the area of a single district. The approximation may not always fit the district’s shape exactly, but guided by population density it suits the use case and the analysis well.

Approximation of PA-10
PA-10-a 40.03392360399664,-76.78482055664064,16.274661km
PA-10-b 40.19594518732199,-77.05535888671876,16.238963km
PA-10-c 40.28752567143796,-76.73475783783944,14.080184km
PA-10-d 40.53970542053963,-76.78781683556737,16.456645km
PA-10-e 40.395718433470364,-76.94387265946717,6.962575km
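
As a rough sketch of how such a multi-circle approximation could be turned into search parameters, the snippet below formats each circle as a geocode string and merges the results of the separate queries. The helper names are hypothetical, and the deduplication step is an assumption: circles that overlap may plausibly return the same tweet more than once, so tweets are keyed on their `id` field.

```python
# Circles approximating PA-10, copied from the listing above:
# name -> (latitude, longitude, radius in km)
PA_10_CIRCLES = {
    "PA-10-a": (40.03392360399664, -76.78482055664064, 16.274661),
    "PA-10-b": (40.19594518732199, -77.05535888671876, 16.238963),
    "PA-10-c": (40.28752567143796, -76.73475783783944, 14.080184),
    "PA-10-d": (40.53970542053963, -76.78781683556737, 16.456645),
    "PA-10-e": (40.395718433470364, -76.94387265946717, 6.962575),
}


def geocode_param(lat, lon, radius_km):
    """Format one circle as the 'lat,lon,radiuskm' string used by the geocode field."""
    return "{},{},{}km".format(lat, lon, radius_km)


def merge_unique(batches):
    """Merge the tweets returned for each circle, keeping every tweet id once."""
    seen = {}
    for batch in batches:
        for tweet in batch:
            # setdefault keeps the first copy of a tweet and ignores repeats.
            seen.setdefault(tweet["id"], tweet)
    return list(seen.values())


geocodes = [geocode_param(*circle) for circle in PA_10_CIRCLES.values()]
```

Each string in `geocodes` would be sent with the same text query, one request per circle, and the batches passed to `merge_unique` before analysis.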


When it comes to the actual analysis of the tweets gathered, both GB and OB use very similar RNN-LSTM binary classifiers, fed by a Word2Vec embedding layer itself trained on tweets and trained to identify democratic or republican leaning tweets. This architecture has proven accurate and adaptable in a number of NLP tasks, notably sentiment analysis, and when trained on enough tweets, hundreds of thousands, it is capable of reaching satisfactory classification accuracy in this use case. The architecture of the models is, however, a less important variable than the quality and quantity of the training and testing data, so before a more thorough exploration of the models an explanation of the techniques and methods used to gather this data is in order.

Training data acquisition for On Ballot and Generic Ballot models

The models use different training data gathered in very similar ways. The main technique consists of downloading the entire Twitter timeline, going back about 5–6 months, of users with a clear political affiliation. Among these are the accounts of the politicians themselves, of partisan political pundits, and of the most well-known activists and organizers, e.g. @KamalaHarris @davidhogg111 @billmaher for democrats and @tedcruz @TuckerCarlson @DonaldJTrumpJr for republicans. The accounts selected for this election yielded around 160k tweets which, based on the unequivocal political leaning of the users writing them, were labeled as democratic (0) or republican (1). Training the RNNs with this data alone, however, was not enough to reach accuracy beyond the high sixties on the testing set. To improve on this result by increasing the variety of the data, 120k other tweets were added to the training set. This second batch differs from the first in that these tweets belong to lesser-known users of the platform who are nevertheless much better versed in the vernacular of daily political interactions on Twitter, e.g. @StormResist @deejay90192 (D) and @Robfortrump2020 @qanon76 (R). Adding this new data allowed the GB model to surpass 70% accuracy and to settle in the low to mid seventies on the testing set. Furthermore, adding a confidence threshold of 75% raises the accuracy to nearly 80%, at the cost of around one quarter of the tweets, which are discarded because they don’t pass the threshold. The OB model, in addition to the training set used for GB, was trained on a further 140k tweets coming from the congressional candidates themselves, to make the model more aware of the vocabulary of day-in day-out campaigning. Despite this additional training, the OB model does not match the accuracy of the GB model across the board, because of the similarities in the lexicon used by candidates of both sides. “Turn out the vote” tweets and tweets containing poll information are especially difficult to classify.
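
The confidence threshold described above can be sketched as a simple filter. The sketch assumes the classifier returns two probabilities per tweet, one for each class; the function name and the input format are illustrative, not the system's actual interface.

```python
def apply_threshold(predictions, threshold=0.75):
    """Keep only predictions whose top-class probability reaches the threshold.

    predictions: list of (p_democrat, p_republican) pairs, one per tweet.
    Returns (label, confidence) pairs, with 0 = democrat and 1 = republican.
    """
    kept = []
    for p_dem, p_rep in predictions:
        # The predicted label is the class with the higher probability.
        label, confidence = (0, p_dem) if p_dem >= p_rep else (1, p_rep)
        if confidence >= threshold:
            kept.append((label, confidence))
    return kept


sample = [(0.9, 0.1), (0.4, 0.6), (0.2, 0.8)]
print(apply_threshold(sample))  # [(0, 0.9), (1, 0.8)]
```

The middle tweet is dropped because its best class only reaches 0.6, which mirrors the trade-off in the text: fewer tweets survive, but the ones that do are classified more reliably.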

Testing data acquisition and labeling for Generic and On Ballot models

The only reliable validation of NLP models like these is evaluation on hand-labeled real-world data. The queries described in the search section were used to acquire the tweets, and the testing set was formed maintaining a balance between republican and democratic leaning tweets and between tweets that mentioned republicans and tweets that mentioned democrats.

Training, testing and exploration of Generic Ballot and On Ballot models

No tweet of the training set reaches the model before pre-processing and tokenization. Pre-processing entails removing unwanted characters and stop words and making the tweets as uniform as possible by, for example, lowercasing all uppercase letters.

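import re

def pre_process(text):
    text = text.lower()
    # drop URLs
    text = re.sub(r"http\S+", "", text)
    # keep only letters, digits and whitespace
    text = re.sub(r"[^a-zA-Z0-9\s]", "", text)
    # remove_stop is the stop-word removal helper, defined elsewhere
    text = remove_stop(text)
    return text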

Tokenization is a more complicated process. As the RNN cannot learn directly from text, the tweets need to be converted to sequences of integers, and the role of the tokenizer is to keep track of the correspondence between numbers and words. Not every word is worth keeping, however, as it might just be a name or handle unnecessary for the analysis; therefore only the 15000 most frequent words are tokenized, while the others are discarded. 15000 is a relatively high number compared to the 3–5k typical of sentiment analysis models, but this abundance is key to capturing the more nuanced nature of political opinions. Another important parameter of the tokenization process is the maximum length, in number of words, allowed for a tweet. Every tweet, when tokenized, is converted to a sequence of integers of fixed length: the ones that don’t contain enough words are padded with 0s at the end and the ones that are too long are either discarded or cut short. The maxlen allowed should therefore strike a balance between capturing complex messages in long tweets and keeping the padding of short tweets to a minimum, as padding may hinder training speed and effectiveness.
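
To make the two parameters concrete, here is a minimal pure-Python sketch of the idea: a vocabulary limited to the most frequent words, and sequences truncated or padded with 0s at the end up to a fixed length. It is written from the description above rather than from the system's code, and the function names are hypothetical; a real pipeline would more likely use a library tokenizer.

```python
from collections import Counter


def fit_vocab(texts, num_words):
    """Map the num_words most frequent words to integer ids starting at 1.

    Id 0 is reserved for padding.
    """
    counts = Counter(word for text in texts for word in text.split())
    most_common = counts.most_common(num_words)
    return {word: idx for idx, (word, _) in enumerate(most_common, start=1)}


def to_sequence(text, vocab, maxlen):
    """Convert a tweet to a fixed-length list of ids, padded with 0s at the end."""
    # Words outside the vocabulary are simply dropped.
    ids = [vocab[word] for word in text.split() if word in vocab]
    # Tweets that are too long are cut short.
    ids = ids[:maxlen]
    return ids + [0] * (maxlen - len(ids))


texts = ["vote blue vote blue", "vote red"]
vocab = fit_vocab(texts, num_words=2)            # {"vote": 1, "blue": 2}
print(to_sequence("vote red vote blue", vocab, maxlen=5))  # [1, 1, 2, 0, 0]
```

In the example "red" falls outside the two-word vocabulary and disappears, while the two trailing zeros show the padding that a larger maxlen adds to every short tweet.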

After pre-processing and tokenization the training set reaches the Word2Vec embedding layer, which feeds the RNN. The 128-wide embedding layer reduces the complexity of incoming tweets by learning an abstract representation of every word, which facilitates training of the network.

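from keras.models import Sequential
from keras.layers import Embedding, Dropout, LSTM, Dense

max_features = 15000  # vocabulary size kept by the tokenizer
embed_dim = 128
lstm_out = 256
lstm_out2 = 64
model = Sequential()
# X is the padded matrix of tokenized tweets; the Embedding call is assumed from the surviving input_length fragment
model.add(Embedding(max_features, embed_dim, input_length = X.shape[1]))
model.add(Dropout(0.5, noise_shape=None, seed=None))
model.add(LSTM(lstm_out, dropout=0.3, recurrent_dropout=0.3, return_sequences=True))
model.add(LSTM(lstm_out2, dropout=0.3, recurrent_dropout=0.3, return_sequences=False))
# assumption: a two-class softmax output and the adam optimizer, neither of which appears in the listing
model.add(Dense(2, activation='softmax'))
model.compile(loss = 'categorical_crossentropy',
              optimizer = 'adam',
              metrics = ['accuracy'])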

The RNN itself is composed of two layers, whose sizes can range between 128 and 256 and between 32 and 64 respectively. A number of techniques are used to improve training, such as dropout, which limits the risk that the network comes to rely always on the same kind of signals, and the use of a second layer, which helps with abstraction in situations where the literal meaning of the tweet might be a red herring. However, as mentioned before, the exact details of the RNN don’t have as high an impact on accuracy as the quality of the training data. And since the accuracy metric is only as good as the testing set it comes from, it is worth taking an anecdotal look at which tweets of the testing set the model labels correctly and which ones it fails to classify.

The network does not have any problems with tweets containing clear partisan markers e.g. #MAGA or #MarchForOurLives.

@RealJamesWoods I love the idea of this clown thinking about a White House Run! This is the dementia of the democrats. They don’t see their insanity. Bring it on @SenBooker! Most republicans would love to watch you run! You are a very entertaining comedian!
1 0.9939819

The confidence drops on tweets with a more ambiguous meaning.

When I tweet angrily, it’s because I have to keep my mouth shut at family gatherings. Because the people I’m related to who did vote for Trump:
Don’t see why it was a bad idea (still)
Aren’t actually bad people (much as some would like to believe)
Are people I still love
0 0.84874254

The model is also able to classify snarky or satirical tweets.

@FrankDangelo23 @realDonaldTrump I Was Going To Be A Liberal For Halloween But My Head Wont Fit Up My Ass.
1 0.9390376

More policy dense tweets are caught too.

.@BryanSteilforWI, we’ve been calling on you to pledge not to make any cuts to Social Security or Medicare for days now. You’ve been silent.
I’ll ask again. House Republicans have proposed cutting Social Security and Medicare by $541 billion. Will you promise to oppose this cut?
0 0.94953746

Where the system stumbles, though, is on quotes of contrasting opinions or references to other statements.

If you’re being nominated for a nonpartisan position as a neutral arbiter on the Supreme Court, attacking “the left,” “Democrats,” “the Clintons” and “the media” in your opening statement while noting there will be reprisal for years to come is, well, disqualifying.
1 0.93661255

Looking at these examples one might be tempted to assume the network actually understands political opinions, but the system works only on a syntactic level: the same opinion expressed in very uncommon language might be classified differently. To test the merits and limitations of the model for yourself there’s a demo.


Nevertheless, as stated in the intro, none of the steps and techniques used to train, test and validate the models are language specific or domain specific: as long as there are users with a clear leaning who tweet enough about the subject, no opinion extraction task is off limits.

*Bonus* Full System “Validation”

No surefire way exists to validate an aggregate system like this one, aside from the election itself, of course. It is however possible to get a sense of the effectiveness of the system by trying to classify the polar opposites of the spectrum, i.e. districts that reputable sources such as the aforementioned Cook Political Report rate as LIKELY/SAFE and where the outcome can therefore be forecast with confidence. A testing set of sorts can thus be compiled from, for example, 10 such districts. Validating the system on a set of this kind yielded satisfactory results, as all ten districts were classified correctly.
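
A check of this kind boils down to comparing two mappings from district to party. The sketch below is illustrative only: the function name is hypothetical and the district labels are placeholders, not the districts actually used in the validation set.

```python
def validation_accuracy(predicted, expected):
    """Share of districts whose predicted party matches the expected rating."""
    shared = [district for district in expected if district in predicted]
    if not shared:
        raise ValueError("no districts in common")
    hits = sum(predicted[district] == expected[district] for district in shared)
    return hits / len(shared)


# Placeholder labels, "D" or "R" per district.
expected = {"district-1": "D", "district-2": "R", "district-3": "R"}
predicted = {"district-1": "D", "district-2": "R", "district-3": "D"}
score = validation_accuracy(predicted, expected)
```

With ten safe districts all classified correctly, as reported above, the function would return 1.0.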

5 districts in the validation set
the other 5 districts in the validation set



















The product of the system is a GeoJSON file with the favorability score of each side saved in its fields. At this point there are a number of different ways to display the map. One option is to color the map in solid blue or red to show which districts lean democratic and which republican, similarly to the visualization done for the Italian parliamentary election of 2018.
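
As an illustration of what saving the scores in the file's fields could look like, the sketch below writes two scores into the `properties` of each district feature. The property names `district`, `dem_score` and `rep_score` are assumptions made for the example; the actual field names used by the system are not given here.

```python
import json


def add_scores(feature_collection, scores):
    """Write per-district scores into the properties of matching features.

    scores: mapping district id -> (democratic score, republican score).
    """
    for feature in feature_collection["features"]:
        district = feature["properties"].get("district")
        if district in scores:
            dem, rep = scores[district]
            feature["properties"]["dem_score"] = dem
            feature["properties"]["rep_score"] = rep
    return feature_collection


# A minimal FeatureCollection; real district features would carry a polygon geometry.
collection = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "geometry": None, "properties": {"district": "PA-10"}},
    ],
}
add_scores(collection, {"PA-10": (0.48, 0.52)})
geojson_text = json.dumps(collection)
```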

However, as only two colors are displayed in this case, a good way to enrich the map is to color the districts on a spectrum between blue and red, to visualize not only which party is more likely to win the seat but also the degree of confidence of the prediction.
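
One simple way to obtain such a spectrum is a linear interpolation between pure blue and pure red based on the republican share of the two scores. This is a sketch of the idea, not necessarily the scale used for the published map, and the function name is hypothetical.

```python
def lean_color(dem_score, rep_score):
    """Return a hex color between blue (democratic) and red (republican)."""
    total = dem_score + rep_score
    # With no data at all, fall back to the midpoint of the spectrum.
    rep_share = 0.5 if total == 0 else rep_score / total
    red = round(255 * rep_share)
    blue = round(255 * (1 - rep_share))
    return "#{:02x}00{:02x}".format(red, blue)


print(lean_color(1, 0))  # #0000ff
print(lean_color(0, 1))  # #ff0000
```

Districts where the two scores are close end up in shades of purple, which conveys the uncertainty of the prediction at a glance.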

The map is available at


The general applicability of the principles and methods of this kind of analysis forms the basis of a more complex project, which attempts to provide a user-friendly, integrated way to set up, carry out and understand regional Twitter political alignment analysis. The project, in its very early, incomplete, much less than MVP state, is available at TwitterSentMapper.