Extracting data from Twitter with Tarantulla

Leticia Lapenda
Oncase
Jun 4, 2018 · 6 min read

If you are here, you are probably working on a project that requires extracting data from social networks such as Twitter. The problem is that you may not know where to begin, or you may be looking for a fast-to-deploy solution that needs only a few configuration steps. In either case, we have good news: you’ve just found what you’ve been looking for!

Here at Oncase, we developed a solution that extracts data from social network pages quickly and with very little prior setup. The solution, called Tarantulla, is composed of several modules, each one specific to the social network to be analyzed: Twitter, YouTube, Facebook, or even plain web pages.

You might be wondering why we chose the name Tarantulla! Tarantulla is a spider, and just like the webs spiders weave, which connect many points, our solution connects many sources, forming a module that is robust yet flexible, since its submodules, such as Tarantulla-Twitter, can also be used independently.

In this post we will talk about Tarantulla-Twitter, which, as you can imagine, is Tarantulla’s module for Twitter data extraction.

To run the solution, you need access keys for Twitter’s User Timeline API, Python 3.x (>= 3.4), pip3, tweepy, and a Linux operating system. Additionally, if you wish to organize the data into a database, you can use Pentaho Data Integration 8 (with JDK 8).
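For example, assuming Python 3 and pip3 are already installed, the tweepy dependency can be pulled in with a single command:

$ pip3 install tweepy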

The module receives as input the Twitter page name to be analyzed and returns the following fields:

  • twitter id
  • timestamp
  • publication date
  • publisher screen name
  • publisher full name
  • tweet content
  • hashtags
  • language
  • locale
  • category
  • tweet url
  • number of retweets
  • number of favorites
  • engagement

Let’s play with Tarantulla-Twitter! For this analysis we selected three Twitter publishers well known in the technology world: AndroidPIT.com (@AndroidPITcom), TechTudo (@TechTudo) and GizmodoBR (@GizmodoBR). AndroidPIT.com is a US publisher, while TechTudo and GizmodoBR are Brazilian.

After running Tarantulla-Twitter and storing the returned tweets in a PostgreSQL database, we were curious to analyze the data. We found a total of 3,238 tweets from AndroidPIT.com, 3,222 from TechTudo and 3,158 from GizmodoBR. Since the number of tweets was similar across the three tech pages, we could continue the analysis with confidence: had there been a large discrepancy, the publishers with more publications in the database would have had a higher chance of holding the popular tweets, for example.

First of all, we investigated the average number of retweets (RTs) per publisher. It is curious to notice that AndroidPIT.com has a low average number of retweets compared to the Brazilian publishers.

Similarly, when analyzing the average number of favorites per publisher, we saw that the Brazilian publishers are ahead; however, while GizmodoBR seems to be the most retweeted, TechTudo wins when we consider the number of favorites.

The results consistently show that AndroidPIT.com has fewer interactions than the other two publishers. What may lead to this difference?

A further analysis of these publishers’ pages looked at how many followers each of them has. It makes sense that a publisher with many followers receives more interactions, right?

As we expected, AndroidPIT.com is the least followed page, with 16.8K followers, compared to TechTudo (399K) and GizmodoBR (163K). This is a possible reason why AndroidPIT.com has fewer retweets and favorites than the others.

This leads to another question: how could we define an engagement metric that compares these three publishers in a "fairer" way?

We should consider not only the raw amount of interactions on each page, but also the number of followers. We suggest the following metric, named engagement:
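As a rough sketch of the idea (one possible way to express it; treat it as an illustration rather than the exact formula we used), a publisher’s interactions can be normalized by its follower count:

def engagement(retweets, favorites, followers):
    # Illustrative metric (assumption): total interactions per follower
    return (retweets + favorites) / followers

With a metric like this, a small page whose followers interact a lot can score higher than a large page with mostly passive followers.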

That said, let’s visualize the engagement graph.

It is interesting to notice that AndroidPIT.com overtakes TechTudo in this new analysis, which suggests AndroidPIT.com’s page receives more engagement from its followers than TechTudo’s. Besides, GizmodoBR wins the title of most popular publisher!

Just out of curiosity: of the top 10 tweets with the highest engagement in this database, 2 are from AndroidPIT.com, 5 from TechTudo and 3 from GizmodoBR. Notice that, although TechTudo stays behind the other publishers in overall engagement, it likely has some tweets of great interest to the public!

You may be wondering what the most discussed themes in these tweets are! Yes, we also analyzed this: Sony Xperia XZ2, Samsung Galaxy S9, CBLoL, an alien mummy, FIFA 18, Android Oreo, an artificial uterus and Earth’s possible last survivors. Diverse, right!?

Configuring Tarantulla-Twitter

If you enjoyed this and wish to perform your own analysis with your data of interest, keep reading: we will briefly explain how to configure and run the Tarantulla-Twitter solution. If you need a more detailed explanation, visit Oncase’s page on GitHub.

Let’s follow some steps for solution deployment and configuration. There are mainly 3 steps:

  1. Clone Git repository
  2. Edit file with publishers
  3. Edit file with API keys

Just in case you want to save the results into a database, there are 2 additional steps:

  4. Edit file with database information
  5. Execute SQL script

It is worth remembering that database integration is done through PDI (Pentaho Data Integration), a platform for building and accelerating data pipelines.

Let’s explain each step:

1. Initially, clone the repository, preferably into ‘/opt/git’.
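For example (replace the placeholder with the repository URL from Oncase’s GitHub page):

$ cd /opt/git
$ git clone <TARANTULLA TWITTER REPOSITORY URL>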

2. You must set up the file config-timeline.json. The fields in this file hold the information the solution needs to know which pages to collect data from (publishers) and where to put the collected data during execution (temp_output). Besides, it is important to define which Python should be used (python-command), especially if you have more than one Python installation on your machine. The file looks like the example below:

{
  "temp_output": "../data/",
  "python-command": "python3",
  "publishers": [
    {
      "_twitter_screen": "AndroidPITcom",
      "name": "AndroidPIT US"
    }
  ]
}
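To track more than one page, just add more entries to the publishers list. For instance, a configuration covering the three publishers analyzed above could look like this (the name values are only display labels of our choosing):

{
  "temp_output": "../data/",
  "python-command": "python3",
  "publishers": [
    { "_twitter_screen": "AndroidPITcom", "name": "AndroidPIT US" },
    { "_twitter_screen": "TechTudo", "name": "TechTudo" },
    { "_twitter_screen": "GizmodoBR", "name": "GizmodoBR" }
  ]
}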

3. Edit file api-keys.json

You must have the API access keys that Tarantulla-Twitter will use. Edit the file api-keys.json with your keys.

You will have to provide 4 keys: the Access Token, the Access Token Secret, the Consumer Key and the Consumer Key Secret.

It is important to remember that Twitter’s user_timeline API returns at most the 3,200 most recent tweets from a given publisher.
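To give an idea of how these pieces fit together, here is a minimal tweepy sketch (not Tarantulla’s own code; the key values and screen name are placeholders) that authenticates with the four keys and pages through a publisher’s timeline up to that limit:

import tweepy

# Placeholders: use the values from your api-keys.json
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_KEY_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True)

# user_timeline stops at roughly the 3,200 most recent tweets
for tweet in tweepy.Cursor(api.user_timeline, screen_name="AndroidPITcom",
                           tweet_mode="extended").items():
    print(tweet.id, tweet.created_at, tweet.full_text)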

4. Edit file config-db.json

Edit the file config-db.json with the database name, the schema and table names that will be used, as well as the password and other relevant settings.

5. Execute SQL script

The SQL script has a CREATE TABLE clause that creates the table for the project. Remember to change this script according to the schema and table names you wish to use.

Executing Tarantulla-Twitter

Now it is all set! You can already execute Tarantulla-Twitter. If you wish to use the solution with PDI, run:

$ <PDI_HOME>/./kitchen.sh  -file="<YOUR TARANTULLA TWITTER FOLDER>/etl/main.kjb"

If you have set PDI_HOME to /opt/Pentaho/design-tools/data-integration, it is enough to run:

$ <YOUR TARANTULLA TWITTER FOLDER>/scripts/etl.sh job ../etl/main.kjb

And without PDI:

$ python3 user_timeline_api.py

That’s it! We hope Tarantulla-Twitter is the solution you were looking for! If you have any questions do let us know! See you soon!
