Real-time Analytics with Google Cloud

Sergey Abakumoff
7 min read · Sep 27, 2016


The Internet era has changed the way we obtain and process information. Huge volumes of data are free and ubiquitous, and cloud computing has put practically infinite computing power, storage, and sophisticated tools at everyone’s disposal on a pay-as-you-go basis. This story explains how to leverage that computing power and publicly available data to build a tool for real-time analytics of the presidential-debate feedback posted to Twitter. The aim of this article is to show how easy it is to implement pretty interesting, helpful (or malicious!) and even lucrative applications using the Google Cloud Platform, which is the quintessential marvel of the Internet age.

Architecture

Among other mind-blowing services such as BigQuery or the Natural Language API, Google provides a set of tools for super-fast information exchange and data processing.

  • Pub/Sub: a real-time messaging service that lets you send up to 10K messages/sec, ensures encryption, guarantees delivery and runs on Google’s infrastructure.
  • DataFlow: a real-time, auto-scaling data-processing service for building fast, reliable and secure Extract-Transform-Load pipelines that integrate with other Google services. For example, a pipeline can receive messages from the Pub/Sub service, process them using the Natural Language API and save the outcome in a BigQuery dataset.

Another player in the ensemble I put together is the Twitter Streaming API, which makes it possible to intercept the real-time stream of tweets that are sent from a particular location, posted by a specific user, or contain certain phrases, for example “Hillary Clinton” or “Donald Trump”. At this point you have probably figured out the whole picture, as it’s quite simple, but just in case, here is the sequence diagram of the architecture that performs sentiment and syntax analysis of a stream of tweets and saves the results in a BigQuery dataset. All of it, except for the Twitter back end and the client app (which is pretty simple), happens in real time in the Google Cloud.

Implementation

I believe that the key point of this story is how simple the code that orchestrates the system shown in the picture above turned out to be. The sources are available in the GitHub repository and you can use them under the terms of the MIT license. I wrote only about 150 lines of code and copy-pasted 200 more. For example, take a look at the Node.js function that connects to a Twitter stream of debate-related messages and, as soon as the tweets are delivered, sends the tweet objects, automatically serialized to JSON strings, to the Pub/Sub service. Note that the code is only interested in tweets sent from US cities. This is done to make sure that the subsequent data analysis can compare the data associated with different cities or states of the country that hosts the debate and the election. In addition, it helps to avoid exceeding the Natural Language API limit of 1,000 requests per 10 seconds, even though Pub/Sub and DataFlow can handle 10K tweets per second!
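Roughly, the function looks like this. This is a minimal sketch rather than the repo’s exact code: it assumes the `twitter` and `@google-cloud/pubsub` npm packages (the repo may use an older client library), and the track phrases are illustrative.

```js
// Sketch of the stream client described above (not the repository's exact code).
const Twitter = require('twitter');

const twitterClient = new Twitter({
  consumer_key: process.env.TWITTER_CONSUMER_KEY,
  consumer_secret: process.env.TWITTER_CONSUMER_SECRET,
  access_token_key: process.env.TWITTER_ACCESS_TOKEN_KEY,
  access_token_secret: process.env.TWITTER_ACCESS_TOKEN_SECRET,
});

function startTracking(topic) {
  // Track debate-related phrases on the public statuses/filter stream.
  const stream = twitterClient.stream('statuses/filter', {
    track: 'Hillary Clinton,Donald Trump,debates',
  });

  stream.on('data', (tweet) => {
    // Keep only tweets geo-tagged with a US place, as explained above.
    if (!tweet.place || tweet.place.country_code !== 'US') {
      return;
    }
    // Publish the whole tweet object to Pub/Sub as a JSON string.
    topic.publishMessage({ data: Buffer.from(JSON.stringify(tweet)) })
      .catch((err) => console.error('publish failed:', err));
  });

  stream.on('error', (err) => console.error('twitter stream error:', err));
}

module.exports = { startTracking };
```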

What is the “topic” argument passed to this function? In the Pub/Sub service, a publisher sends messages to a topic, and subscriber components create a subscription to a topic in order to receive messages from it. Here is the code that creates the “debates_tweets” topic and passes it to the startTracking function shown above:
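Again, a minimal sketch assuming the current `@google-cloud/pubsub` client; the module path for the stream client is illustrative.

```js
// Create (or fetch) the "debates_tweets" topic and start the stream client.
const { PubSub } = require('@google-cloud/pubsub');
const { startTracking } = require('./twitter-stream'); // the module sketched above

const pubsub = new PubSub();

async function main() {
  // get({ autoCreate: true }) returns the topic, creating it if it does not exist yet.
  const [topic] = await pubsub.topic('debates_tweets').get({ autoCreate: true });
  startTracking(topic);
}

main().catch(console.error);
```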

The code that runs the DataFlow pipeline, which subscribes to the “debates_tweets” topic, extracts the tweet text from each incoming message, processes it with the Natural Language API and saves the results in a BigQuery table, is also quite simple, although it was quite an experience to write Java code again after not touching the language for years.

The messages that the pipeline receives from the subscription are tweet objects serialized to JSON strings. Note that the code does not try to fully parse them and put the fields into separate columns of a table row; it saves the strings as-is in a single column. The rationale is that the data end up in BigQuery, which allows JavaScript functions to be included in SQL queries, and parsing JSON with JS code is a cakewalk. Likewise, the code serializes the syntax-analysis results to JSON strings and keeps them in a single column; I found that a convenient way to store arrays of variable length. The sentiment analysis is represented by two float values that can be stored in two separate columns.
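As an illustration, a hypothetical schema for such a table might look like this; the column names are mine, not necessarily the ones used in the pipeline.

```js
// Hypothetical table layout matching the description above:
// raw tweet and syntax results stay as JSON strings, sentiment gets two FLOAT columns.
const debateTweetsSchema = {
  fields: [
    { name: 'tweet',     type: 'STRING' }, // full tweet object, serialized to JSON
    { name: 'syntax',    type: 'STRING' }, // NL API syntax annotations, serialized to JSON
    { name: 'polarity',  type: 'FLOAT'  }, // sentiment polarity
    { name: 'magnitude', type: 'FLOAT'  }, // sentiment magnitude
  ],
};

module.exports = { debateTweetsSchema };
```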

Debate Night

At 9pm EST on September 26 I started the pipeline and ran the Twitter stream client to harvest the tweets posted during the first presidential debate. Two hours later the BigQuery table had been populated with 7K rows of data that include information about the authors of the tweets, their locations, and the sentiment and syntax analysis results. I just needed a good tool for data exploration, analysis, and visualization, and Google offers that as well! Cloud Datalab is a browser-based tool that can be used to interactively explore, transform, analyze, and visualize data using BigQuery and Python. The amount of data I collected is not that big, but Datalab still proved extremely helpful. Here is a sample of the SQL+JS query that selects the most common nouns used in the tweets, together with its output.
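To give an idea of the JavaScript half of such a query, a legacy-SQL UDF that parses the stored syntax JSON and emits one row per noun might look roughly like the sketch below; the `syntax` column name and the function name are assumptions, not the notebook’s actual code.

```js
// Hypothetical BigQuery legacy-SQL UDF: parse the syntax JSON stored in the
// `syntax` column and emit one output row per noun found in the tweet text.
function extractNouns(row, emit) {
  var tokens = [];
  try {
    tokens = JSON.parse(row.syntax).tokens || [];
  } catch (e) {
    return; // skip rows with malformed JSON
  }
  tokens.forEach(function (token) {
    if (token.partOfSpeech && token.partOfSpeech.tag === 'NOUN') {
      emit({ noun: token.lemma.toLowerCase() });
    }
  });
}
// The surrounding SQL would then GROUP BY noun and ORDER BY COUNT(*) DESC
// to surface the most common nouns.
```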

You can find the full analysis notebook here; GitHub nicely visualizes it. These are just early, very basic steps though; the full analysis will take some time and may become the subject of my next post, but the initial results are included in this story. First of all, here are some syntax-analysis visualizations.

As for sentiment analysis, the Google API returns two float values for a given text: polarity and magnitude. Roughly speaking, polarity ranges from -1.0 (negative) to 1.0 (positive) and captures the overall emotional leaning of the text, while magnitude starts at 0, is unbounded, and reflects the overall strength of emotion, so longer texts tend to have larger magnitudes.

Here are the plots of polarity and magnitude distribution for both candidates:

It’s up to the reader to interpret these plots. What I can see here is a lot of negative feedback in the tweets about Mr. Trump.

Imagine the unimaginable

Let’s ponder for a minute what could be done using the same technologies, and even the very same code, described in this article. This is actually the main purpose of this story; the debate stuff was just used to get some attention ;)

  • Startup idea: let’s say it’s Friday night in New York and a young hipster is choosing a destination to go out. The app on his mobile phone shows an interactive map that highlights the places from which the most positive tweets are being sent. On the server side the app uses real-time Twitter analytics, just like the one this story is about. In addition, the app takes into account the analysis of photos shared via Twitter, done using the Google Vision API, which provides insight from images, including facial expressions of emotion!
  • Surveillance and investigation: the Twitter API allows you to track the tweets of a specific user. By analyzing the information extracted from the stream of messages and photos that he or she posts on a daily basis, it’s possible, for example, to predict (using Google Cloud Machine Learning) an attempt to commit a crime or suicide, or simply to watch a person’s life. If you are a fan of the “Mr. Robot” TV show, you probably recall Elliot’s manual hack of Fernando Vera. I am pretty sure that it’s feasible to automate such a tedious task:

Fernando Vera, Shayla’s supplier. One of the worst human beings I’ve ever hacked. His password? “eatadick6969.” Aside from the massive amounts of money he spends on porn and webcams, he does all his drug transactions through emails, IMs, Twitter. The fact that the cops haven’t caught him yet is beyond me. If they had half a brain cell, they’d be able to crack his gang’s simplistic code, if it can even be called that. After only a couple of hours of timing his tweets with related news articles, I figured out that “biscuit” and “clickety” clearly referenced guns. “Food,” “sea shells” or “gas” for bullets. “rock to sleep early.” I haven’t made the direct connection to a hit yet, but the math of guns plus bullets usually adds up to one thing.

  • Real-time brand monitoring is a piece of cake that would take a day or two to implement.

By the way, if you are a startup investor, private investigator or brand rep who is interested in collaborating, feel free to contact me. I will quit my boring job and we will watch the world together. Oppressive governments are not welcome :p
