Twitter Analytics (Part 2)

Published in

Google Cloud - Community

3 min readOct 14, 2017

Twitter is a real good tool to find out what’s going on in the real world. That’s why i choose to make a real-time analysis of the user’s state of mind during the last 2 soccer matches of the Argentinian National Team, using Python and Google Cloud infrastructure.

This is the second part of the article ( to see how this story begins visit : https://medium.com/google-cloud/twitter-analytics-part-1-801c9d494487 )

Second Step: Analyze the Data

1) Google Datalab

Google Datalab offers the possibility to easily access the data in BQ and analyze it in a Jupiter notebook. We can also use a large number of existing packages for statistics, machine learning and data processes. It’s also easy to deploy within the Google Cloud platform : https://cloud.google.com/datalab/docs/quickstarts

First thing to do is make sure you have all the packages installed in your project, I mean Pandas, Seaborn, Numpy, Matplotlib and Scikit-learn (just needed if you want to make some ML project). If not, you can always install or update a package opening a notebook and typing:

!pip install lib-name

Second you have to make the data accessible into the notebook. This we can do it like this:

%%sql — module data_name -d standard
#SQL query
SELECT
created_normalize,
text
FROM
[table_name]

And then fetch the data by coding in python

import datalab.bigquery as bq
my_data_frame = bq.Query(data_name).to_dataframe()

As you can see in this table i have a field named “created_normalize” that is because I process the data first in python to normalize the dates.

Ok so now I have a data frame with the result of the query what next? I started by change dates into a time line in minutes with the start of the match as 0.

min_datetime = my_data_frame[‘created_normalize’].min()
difference = 20
my_data_frame[‘time_line’] = [ int( ( ( x — min_datetime ).total_seconds())/60)-difference for x in my_data_frame[‘created_normalize’]]

Note I used a 20 min gap, that’s because I started recording 20 minutes before the match starts.

This added one new column to the data set ‘time_line’, which let us easily plot the amount of tweets for minutes for example.

my_data_frame[‘tweets’] = my_data_frame[‘time_line’]
ax = sns.kdeplot(my_data_frame[‘tweets’], shade=True)
ax.set(xlabel=’minutes’, ylabel=’density’,title=’Tweet\’s distribution’)
plt.axvline(x=0, color=’g’)
plt.axvline(x=47, color=’g’)
plt.axvline(x=66, color=’g’)
plt.axvline(x=71, color=’r’)
plt.axvline(x=74, color=’b’)
plt.axvline(x=115, color=’g’)
plt.show()

Green lines represent the start/end of each half, red is a goal of Venezuela and blue line represents a goal of Argentina.

Or we can look for mentions of each Argentinian player.

We can do much more with this data, the rest I leave it to you…

If you want the data set of these two matches I will make it public for everybody.

Twitter Analytics (Part 2)

Second Step: Analyze the Data

1) Google Datalab

Written by Patricio Navarro