Tarantulla-YouTube: a new way to extract data from YouTube
In this post, we will discuss one of the modules in the Tarantulla suite: Tarantulla-YouTube, which is responsible for extracting data from the video platform. If you don’t know Tarantulla yet, take a look at this other post, where we explain the main idea behind the solution.
To be ready for execution, Tarantulla-YouTube requires access keys for the YouTube Data API, Python (>=3.5) and, additionally, Pentaho Data Integration 8 (in case you want to organize the returned data into a database).
The solution returns the following fields:
- channel ID
- channel name
- publication date
- playlist ID
- video ID
- number of views
- number of likes
- number of dislikes
- number of comments
- publisher name
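The per-video count fields above correspond to the `statistics` object that the YouTube Data API returns for a `videos.list` request. As a rough illustration (this is not Tarantulla’s actual code, just a sketch that follows the public API’s response format):

```python
# Illustrative only: pull the count fields listed above out of one item of a
# YouTube Data API videos.list response (requested with part="statistics").
def extract_stats(item: dict) -> dict:
    stats = item.get("statistics", {})
    return {
        "video_id": item["id"],
        "views": int(stats.get("viewCount", 0)),
        "likes": int(stats.get("likeCount", 0)),
        "dislikes": int(stats.get("dislikeCount", 0)),
        "comments": int(stats.get("commentCount", 0)),
    }

sample = {"id": "abc123", "statistics": {"viewCount": "40094", "likeCount": "512"}}
print(extract_stats(sample)["views"])  # 40094
```

Counts arrive as strings in the API response, which is why the sketch converts them to integers before use.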
Let’s show how Tarantulla-YouTube works with a simple example. For this, we’ve selected 3 publishers: Engadget (https://www.youtube.com/user/engadget) and AndroidPIT.com (https://www.youtube.com/user/AndroidPITcom), two US channels, and GizmodoBR (https://www.youtube.com/user/GizmodoBR), a Brazilian channel.
First of all, we investigated the average number of views per video for each channel. Engadget videos have, on average, more views than the videos on the other channels: 40,094.56 views against 12,280.51 for GizmodoBR and 32,034.8 for AndroidPIT.com.
It’s important to emphasize, however, that Engadget has a noticeably higher number of subscribers: 519k, against 4k for GizmodoBR and 29k for AndroidPIT.com, which may suggest Engadget is more popular overall, leading to more views per video.
We also analyzed the average number of comments and likes on each channel’s videos. Engadget keeps its position as popularity leader, followed by AndroidPIT.com. It is interesting to observe that the graphs show similar percentages for each channel on both metrics, which suggests the numbers of likes and comments may be somehow related.
We were also curious to investigate the 10 most-viewed videos and, as expected, all of them are from Engadget! Among the most common themes: sex robots, Google Glass, iPhone X, IBM Watson, PlayStation 4, Samsung Galaxy Gear, Sony Xperia and CES.
What do you think? Do you believe Tarantulla-YouTube can help you reach your application’s goals? If so, keep reading: we will briefly explain how to configure and execute the solution.
If you wish to know more details about the solution deployment, we suggest that you visit Oncase’s Github page:
tarantulla-youtube - Tarantulla module for extracting YouTube links by topic of interest. github.com
Let’s follow the steps for solution deployment and configuration. There are mainly 3 steps:
1. Clone the Git repository
2. Edit the file with publishers
3. Edit the file with API keys
If you want to save the results into a database, there are 2 additional steps:
4. Edit the file with database information
5. Execute the SQL script
It is worth remembering that database integration is done through PDI — Pentaho Data Integration — a platform to accelerate data pipelines.
Let’s explain each step:
1. Clone the repository, preferably into ‘/opt/git’.
2. Edit file config-users.json
You must set up the file config-users.json. Its fields hold the information the solution needs to know which pages to collect data from (publishers) and where to put the collected data during execution (temp_output). It is also important to define which Python interpreter should be used (python-command), especially if you have more than one Python installation on your machine. Each publisher entry includes a name field, for example:
"name": "AndroidPIT US",
3. Edit file api-keys.json
You must have the API access keys that Tarantulla-YouTube will use. Edit the file api-keys.json, filling in your keys.
You should set YTAPIKEY, the field which stores the API access key.
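The file might look like the sketch below; only the YTAPIKEY field name is confirmed by the text, so check the repository for the exact layout:

```json
{
    "YTAPIKEY": "your-youtube-data-api-key-here"
}
```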
It is worth remembering that the YouTube Data API has a request limit based on units, in which each API operation costs a particular number of units. The cost per operation can be checked at: https://developers.google.com/youtube/v3/determine_quota_cost. The API limits usage to 1,000,000 units per day, 30,000 units per second per user, or 3,000,000 units per 100 seconds.
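As a back-of-the-envelope check of how far the daily quota goes, you can divide it by an operation’s unit cost. The 100-unit cost below is only illustrative; consult the quota cost page above for the real per-operation values:

```python
# Simple quota budgeting: how many API calls of a given unit cost fit into
# the daily limit cited above. The per-call cost is an illustrative figure.
DAILY_QUOTA = 1_000_000  # units per day

def max_calls(cost_per_call: int, quota: int = DAILY_QUOTA) -> int:
    """Number of whole API calls of the given cost that fit in the quota."""
    return quota // cost_per_call

print(max_calls(100))  # 10000 calls per day at 100 units each
```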
4. Edit file config-db.json
Edit the file config-db.json, informing the database name, the schema and table names to be used, as well as the password and other relevant settings.
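A hypothetical sketch of config-db.json: the database, schema, table and password entries come from the description above, while the remaining field names and all values are illustrative, so verify them against the repository:

```json
{
    "host": "localhost",
    "port": 5432,
    "database": "tarantulla",
    "schema": "public",
    "table": "youtube_stats",
    "user": "tarantulla",
    "password": "change-me"
}
```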
5. Execute the SQL script
The SQL script has a CREATE TABLE clause that creates a table for the project. Remember to change this script according to the schema and table names you wish to use.
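To illustrate the kind of table involved, here is a sketch that runs the DDL against an in-memory SQLite database; the real script targets whatever database you configured in the previous step, and the column names and types here are merely inferred from the fields listed at the start of the post:

```python
import sqlite3

# Hypothetical DDL mirroring the fields Tarantulla-YouTube returns; adapt the
# schema/table names and column types to your own database before using it.
DDL = """
CREATE TABLE youtube_stats (
    channel_id       TEXT,
    channel_name     TEXT,
    publication_date TEXT,
    playlist_id      TEXT,
    video_id         TEXT,
    views            INTEGER,
    likes            INTEGER,
    dislikes         INTEGER,
    comments         INTEGER,
    publisher_name   TEXT
)
"""

conn = sqlite3.connect(":memory:")  # in-memory database, just for the demo
conn.execute(DDL)
conn.execute(
    "INSERT INTO youtube_stats (video_id, views) VALUES (?, ?)",
    ("abc123", 40094),
)
print(conn.execute("SELECT COUNT(*) FROM youtube_stats").fetchone()[0])  # 1
```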
Now everything is set! You can run Tarantulla-YouTube and gather the desired data!
If you wish to execute with PDI, run:
$ <PDI_HOME>/kitchen.sh -file="<YOUR TARANTULLA YOUTUBE FOLDER>/etl/main.kjb"
If you have set PDI_HOME to /opt/Pentaho/design-tools/data-integration, it is enough to run:
$ <YOUR TARANTULLA YOUTUBE FOLDER>/scripts/etl.sh job ../etl/main.kjb
Otherwise, without PDI:
$ python3 statsMain.py
We hope this post has been useful to you! If you have any questions, do get in touch! See you soon!