Have you seen http://wso2.com/election2016/ yet? We have been building a Twitter analytics site for the US 2016 election using the WSO2 Analytics platform. It listens to prominent hashtags such as #Election2016, analyzes the tweets, and presents key facts in a form you can digest within a few seconds.
All of this is done using the WSO2 Analytics platform. This post describes the design, inner workings, and algorithms that power the site.
The following shows the overall architecture of the site.
We use the ESB to connect to the Twitter streaming API via a connector and pull tweets for hashtags such as #Election2016. (ESB stands for “Enterprise Service Bus”, a tool used to integrate different systems together; we use WSO2 ESB.) The ESB cleans up the data and sends it to the WSO2 DAS receiver, which forwards it to CEP for real-time processing and writes it to disk for batch processing.
WSO2 DAS stands for “Data Analytics Server”. It is an analytics platform that supports both real-time (streaming, without storing) and batch (after storing) data processing. Please see Introducing WSO2 Analytics Platform: Note for Architects for more information.
CEP stands for Complex Event Processing, a technology that lets users write queries over data streams using an SQL-like query language. You can find more information in the WSO2 CEP User Guide.
Queries such as the community graph are implemented as Spark SQL queries that run every 12 hours; the Twitter Analytics site pulls those results periodically via a REST API. Queries such as the most popular tweets and the real-time word clouds run as real-time queries, and their results are sent to the site immediately over a WebSocket connection.
The following sections describe each element of the Twitter Analytics site, the algorithms used, and their implementation using Spark SQL and the Siddhi query language.
How many people are talking?
First of all, let’s gauge the attention each party receives in the tweetosphere.
For each Tweep, we guess their bias towards a candidate by looking at their tweets and counting the “biased hashtags” they have used. For example, we classify Tweeps who have used “#FeelTheBern” often as Bernie Sanders supporters, while we classify Tweeps who have used “#MakeAmericaGreatAgain” a lot as Trump supporters. We do this calculation when we first see a Tweep, and the result is stored and cached for later use.
Then, we calculate the counts using CEP over a 24-hour window and send updates to the site immediately via a WebSocket channel.
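The production queries run in Siddhi, but the hashtag-counting idea can be sketched in a few lines of Python. The hashtag lists and candidate names below are illustrative placeholders, not the site's actual configuration:

```python
from collections import Counter

# Illustrative biased-hashtag lists; the real site's lists are not shown here.
BIASED_TAGS = {
    "sanders": {"#feelthebern"},
    "trump": {"#makeamericagreatagain"},
    "clinton": {"#imwithher"},
}

def guess_affiliation(tweets):
    """Count biased hashtags across a Tweep's tweets and return the
    candidate with the most hits, or None if no biased tag appears."""
    hits = Counter()
    for text in tweets:
        for word in text.lower().split():
            word = word.strip(".,!?;:")  # drop trailing punctuation
            for candidate, tags in BIASED_TAGS.items():
                if word in tags:
                    hits[candidate] += 1
    if not hits:
        return None
    return hits.most_common(1)[0][0]
```

The result is computed once per Tweep and cached, so later tweets from the same user can be attributed without rescanning their history.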
Real-time Top Tweets
OK, next up: the most popular tweets. We rank them using a variation of the Reddit ranking algorithm.
The basic idea is that each tweet is given a rank that is proportional to the interactions it receives and inversely proportional to its age. We have tweaked the algorithm to work with high tweet rates, about 200k tweets per day. You can find the final equation we used for the ranking below.
This model surfaces the most interesting recent tweets at any given moment; for example, older tweets stay at the top only if they have received a lot of interactions.
We implement the algorithm in a streaming fashion using CEP (Complex Event Processing), and any updates are sent to the site immediately via WebSocket.
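To give a feel for the ranking, here is a rough Python sketch of a Reddit-style “hot” score: log-scaled interactions plus a linear age bonus, so a tweet needs roughly 10x the interactions to outrank one that is a full decay period newer. The constants and epoch below are illustrative only, not the tweaked equation the site actually uses:

```python
import math
from datetime import datetime, timezone

EPOCH = datetime(2016, 1, 1, tzinfo=timezone.utc)  # arbitrary reference point

def hot_rank(interactions, created_at, decay_seconds=45000):
    """Reddit-style rank: log10 of interactions plus an age term.
    Because interactions are log-scaled, an old tweet needs
    exponentially more interactions to keep up with newer ones."""
    score = math.log10(max(interactions, 1))
    age = (created_at - EPOCH).total_seconds()
    return score + age / decay_seconds
```

With this shape, a recent tweet with few interactions beats an equally scored older one, while a heavily retweeted older tweet can still hold the top spot for a while.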
Real-time Word Clouds
Next up: what is each candidate's community talking about?
A word cloud shows the top N most frequent words, with each word's size scaled according to its frequency. We show four word clouds, one per candidate, built from the tweets under each candidate's community hashtags.
When selecting words, we only use nouns and adjectives. We used the Ark Tweet NLP library for part-of-speech (POS) tagging, and then used the tags to keep only nouns and adjectives.
Top words and their frequencies are calculated in a streaming fashion, with the system sending real-time updates to the site. One challenge is that counting word frequencies requires the system to keep track of a large number of counters: one per word. To handle this, we use a TopK sketch, a probabilistic data structure that approximates a large number of counters within a fixed, relatively small memory footprint. We use the implementation from Stream Lib.
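Stream Lib's TopK implementation (StreamSummary) is based on the Space-Saving algorithm. A minimal Python sketch of the same idea shows the eviction step that keeps memory bounded:

```python
class SpaceSavingTopK:
    """Space-Saving sketch: tracks at most `capacity` counters, so
    memory stays fixed no matter how many distinct words stream past.
    Individual counts can be overestimates, but the top-K ordering
    is a good approximation for skewed word distributions."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.counts = {}

    def offer(self, item):
        if item in self.counts:
            self.counts[item] += 1
        elif len(self.counts) < self.capacity:
            self.counts[item] = 1
        else:
            # Evict the minimum counter; the newcomer inherits its count + 1.
            victim = min(self.counts, key=self.counts.get)
            self.counts[item] = self.counts.pop(victim) + 1

    def top(self, k):
        return sorted(self.counts.items(), key=lambda kv: -kv[1])[:k]
```

Because frequent words are offered often, they quickly reclaim a slot even if they were once evicted, which is why the structure works well for word clouds.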
Great! How is the Twitter community structured?
The following community graph is built from retweets. Each node is a Tweep, and each edge's weight is the number of retweets between the two Tweeps it connects. Each node is scaled by the number of retweets the corresponding Tweep has received and colored based on the user's party affiliation.
We recalculate the graph once every hour using Spark SQL queries. We decide candidate affiliations from a Tweep's old tweets, as described under “How many people are talking?”.
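The actual aggregation is a Spark SQL GROUP BY over the stored tweets; the following plain-Python sketch shows the same edge-building logic on hypothetical (retweeter, author) pairs:

```python
from collections import Counter

def build_retweet_graph(retweets):
    """Aggregate (retweeter, original_author) pairs into weighted,
    undirected edges plus a per-node count of retweets received
    (used for node sizing). Plain-Python stand-in for the site's
    Spark SQL aggregation, for illustration only."""
    edges = Counter()
    received = Counter()
    for retweeter, author in retweets:
        # Sort the endpoints so (a, b) and (b, a) share one edge.
        edges[tuple(sorted((retweeter, author)))] += 1
        received[author] += 1
    return edges, received
```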
To visualize the graph, we use the D3.js force layout. Since the graph is very big, we include only the top 200 Tweeps in the visualization.
Most Shared Links
People share lots of links on Twitter. Good! What are they?
This section shows the most shared links on Twitter in the last 24 hours. The Twitter API automatically expands short URLs, so we do the counting on the actual URLs. Counting is done using CEP, and updates are sent to the site via a WebSocket channel as the list changes.
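In Siddhi this is a query over a 24-hour time window; as an illustration, here is a minimal Python equivalent that counts URLs over a trailing window and evicts expired events as new ones arrive:

```python
from collections import Counter, deque

class SlidingWindowCounter:
    """Counts items over a trailing time window (e.g. 24 h of shared
    URLs). A minimal Python stand-in for a Siddhi time-window query."""

    def __init__(self, window_seconds):
        self.window = window_seconds
        self.events = deque()   # (timestamp, item), in arrival order
        self.counts = Counter()

    def add(self, timestamp, item):
        self.events.append((timestamp, item))
        self.counts[item] += 1
        self._expire(timestamp)

    def _expire(self, now):
        # Drop events that have fallen out of the trailing window.
        while self.events and self.events[0][0] <= now - self.window:
            _, old = self.events.popleft()
            self.counts[old] -= 1
            if self.counts[old] == 0:
                del self.counts[old]

    def top(self, k):
        return self.counts.most_common(k)
```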
Finally, the big one. How do people feel towards candidates?
This section shows how media sentiment towards each candidate changes over time. Once every 6 hours, we retrieve the top ten news stories under “election 2016” on Google News and use them to calculate the sentiment.
From those articles, we select the sentences that refer to candidates. Then we use positive and negative word lists to calculate the sentiment. The sentiment calculation is done using CEP.
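The real word lists are large; with tiny illustrative lists, the per-sentence scoring can be sketched as:

```python
# Tiny illustrative word lists; the site uses much larger curated lists.
POSITIVE = {"win", "strong", "great", "support"}
NEGATIVE = {"scandal", "attack", "weak", "lose"}

def sentence_sentiment(sentence):
    """Score a sentence as (#positive - #negative) words, normalized
    by sentence length; > 0 reads as positive, < 0 as negative."""
    words = [w.strip(".,!?").lower() for w in sentence.split()]
    if not words:
        return 0.0
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return score / len(words)
```

Per-candidate sentiment over time is then just the average of these scores across the sentences that mention that candidate in each time bucket.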
We have tried several sentiment-calculation techniques (Stanford CoreNLP, OpenNLP, positive/negative word lists, and AFINN) against manually ranked sentiments of 20 election-related articles. The word lists worked best in the election scenario. We will continue to work on improving this.
We plan to run this until the election is decided. Our goal is to build something that shows what the election chatter on Twitter looks like. Let us know if you have any thoughts on how to do it better.
Also, if you would like to build an analytics system like this, check out our products at http://wso2.com/analytics. They are free and open source under the Apache license. If you want, we provide commercial support too. We would love to hear what you did if you end up using WSO2 products.