How to track tweets about movies with Apache Airflow: a Data Engineering Project

Lucca Miorelli
6 min read · Nov 3, 2022


Introduction

Alongside my fellow data engineering friends Samuel and Luiz, I started wondering: what is the public opinion on the best films ever? Is there a way we can measure that?

Of course, that is a complex thing to figure out, but we decided to put our data engineering skills to the test and try to answer it.

Photo by freestocks.org

Goals

Despite our curiosity, we limited our efforts to the data engineering scope — leaving the NLP work for data scientists.

So our main goal is to build and maintain an ETL pipeline that stores data about famous movies and the tweets that mention them.

How to frame this problem?

This engineering problem can be approached with a few simple and standard steps:

  1. Extract the data from multiple sources.
  2. Transform the data into usable formats.
  3. Load the data in a database.
  4. Orchestrate it to run programmatically.

Extract

Basically, we have 2 main data sources here:

  1. IMDb Website: the world’s most popular source of movie and TV reviews.
  2. Twitter: one of the most used social media platforms.

By the way, we extracted IMDb’s Top 1000 movies directly from a Kaggle dataset. This is an immutable dataset that we used for didactic purposes, even though IMDb has its own API that could also be orchestrated.

For Twitter data, however, we wanted to get data regularly and monitor what people say week by week. For that purpose, we used Tweepy, a Python library for the Twitter API. The main methods used were search_recent_tweets and get_recent_tweets_count.
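A minimal sketch of that extraction step, assuming a bearer token stored in an environment variable and a hypothetical movie title (the names and values here are illustrative, not the project’s actual code):

```python
import os

import tweepy

# Authenticate against the Twitter API v2 (the env var name is an assumption)
client = tweepy.Client(bearer_token=os.environ["TWITTER_BEARER_TOKEN"])

movie_title = "The Godfather"  # hypothetical title read from the movies table
query = f'"{movie_title}" -is:retweet lang:en'

# Recent tweets (last ~7 days) mentioning the movie
tweets = client.search_recent_tweets(
    query=query,
    tweet_fields=["created_at"],
    max_results=100,
)

# Daily tweet counts for the same query and window
counts = client.get_recent_tweets_count(query=query, granularity="day")
```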

Transform

We already have the data sources, but we still have to clean the data, match it across tables, save it to our database and make this job run regularly. For that, we use our good old friend Python (more on specific libs later).
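To give an idea of this step, here is a rough sketch, assuming the raw tweets come in as dictionaries and a hypothetical movie_id is attached to each search:

```python
import pandas as pd

# Hypothetical raw payload: one dict per tweet returned by the search
raw_tweets = [
    {"id": 1588888888888888888, "created_at": "2022-10-28T12:34:56Z", "text": "..."},
]

df = pd.DataFrame(raw_tweets)

# Enforce types, rename to the target schema and attach the movie's foreign key
df["created_at"] = pd.to_datetime(df["created_at"], utc=True)
df = df.rename(columns={"id": "tweet_id"})
df["movie_id"] = 42  # hypothetical id from the imdb_kaggle table
df = df.drop_duplicates(subset="tweet_id")
```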

Photo by Oskar Yildiz on Unsplash

Load

As the final step of our ETL process, we stored the data in an RDS instance in our personal AWS account. That way, we could relate tables and query them whenever needed.
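As a rough sketch of the load step (the connection string, credentials and dataframe below are placeholders, not the project’s real values):

```python
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical RDS endpoint and credentials
engine = create_engine(
    "postgresql+psycopg2://user:password@my-rds.us-east-1.rds.amazonaws.com:5432/movies"
)

# In the real pipeline this dataframe comes from the Transform step
df = pd.DataFrame(
    {
        "movie_id": [42],
        "tweet_id": [1588888888888888888],
        "created_at": ["2022-10-28T12:34:56Z"],
    }
)

# Append the new rows to the tweets_per_movie table
df.to_sql("tweets_per_movie", engine, if_exists="append", index=False)
```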

Orchestrate

Orchestration, finally, is responsible for repeating the ETL process and triggering the scripts at the desired frequency. We made use of Apache Airflow for this.

Airflow works based on DAGs (Directed Acyclic Graphs), which represent the sequence in which tasks should run.

Main Tools

Since Python is our daily work tool, we made use of several libs, such as:

  • Pandas: for basic data wrangling.
  • SQLAlchemy: for communicating with our PostgreSQL database.
  • Boto3 and AWS Wrangler: for communicating with our AWS services.
  • Plotly Dash: to visualize data.

We also made use of Tweepy, a Python library for the Twitter API.

Orchestration

We decided to use Apache Airflow to orchestrate this pipeline, running it locally on our machines through multiple Docker containers.

Apache Airflow is an open-source workflow management platform for data engineering pipelines in Python. Docker, on the other hand, is an open-source platform used to containerize applications. Finally, Docker Compose is a tool that makes it easier to run multi-container applications. With it, we can run Apache Airflow on our local machine while reducing the complexity of its installation.

You can check more information on how to run the Docker containers for Apache Airflow in this project’s README file or in Airflow’s official documentation.

Relational Database

We decided to store our data in a PostgreSQL database on an RDS (Relational Database Service) instance on AWS. If you are not familiar with this AWS resource, it is a managed database service that simplifies database configuration and maintenance. Since we’re making use of cloud services, this also increases our database’s availability. You can check out more about RDS here.

Data Analysis

Since data visualization was not the main purpose of this project, we used basic Plotly / Dash features to visualize the data we were storing in the PostgreSQL database. Plotly Dash’s official website can be found here.
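For illustration, a tiny Dash app along those lines might look like this (the dataframe is made up here; in the project it would be read from the database):

```python
import pandas as pd
import plotly.express as px
from dash import Dash, dcc, html

# Made-up weekly counts standing in for count_tweets_per_movie
df = pd.DataFrame(
    {
        "movie": ["The Godfather", "The Godfather", "Pulp Fiction"],
        "start": ["2022-10-17", "2022-10-24", "2022-10-24"],
        "tweet_count": [1200, 950, 800],
    }
)

fig = px.bar(df, x="start", y="tweet_count", color="movie", barmode="group")

app = Dash(__name__)
app.layout = html.Div([html.H1("Tweets per movie"), dcc.Graph(figure=fig)])

if __name__ == "__main__":
    app.run_server(debug=True)
```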

Project’s Architecture

The project’s final architecture looks like this:

Photo by the author

The full application runs around our database: it stores a static table with IMDb’s Top 1000 movies, ingested by an ETL job that ran once. The Dockerized Apache Airflow pipeline, meant to run every 7 days, reads movie names from this table and searches for tweets that mention them. These tweets are stored in the same database schema, though in another table. Finally, some of this data is shown in a Plotly Dash dashboard.

Development

After instantiating the database, we stored the data in a schema like this:

Photo by the author

In this schema, the imdb_kaggle table is a static table that contains data about the Top 1000 movies on IMDb. The other two tables relate to it through the movie_id column and are continuously appended to by the Twitter pipeline running on Apache Airflow (a join sketch follows the column lists below).

count_tweets_per_movie

  • movie_id: the foreign key with which we relate tables.
  • movie: string with the name of the movie.
  • end, start: the end and start dates of each search window.
  • tweet_count: the actual count of tweets mentioning the movie.

tweets_per_movie

  • tweet_id: a unique identifier assigned to the tweet by Twitter at publication time.
  • created_at: timestamp of when the tweet was created.
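To make the relationship concrete, a query joining the weekly counts back to the movies table could look like this (the connection string and the imdb_kaggle column names, such as imdb_rating, are assumptions):

```python
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical connection string to the same RDS database
engine = create_engine("postgresql+psycopg2://user:password@host:5432/movies")

# Join weekly counts back to the static IMDb table via movie_id
query = """
    SELECT c.movie, c.start, c.tweet_count, i.imdb_rating
    FROM count_tweets_per_movie AS c
    JOIN imdb_kaggle AS i ON i.movie_id = c.movie_id
"""
weekly_counts = pd.read_sql(query, engine)
```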

Also, a DAG was configured on Apache Airflow with PythonOperators for getting tweets regularly. These PythonOperators call Python functions for the extracting, transforming and loading tasks. Below, we have the DAG’s graph as shown in Airflow’s UI:

Photo by the author
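A minimal sketch of such a DAG, with hypothetical task ids and placeholder callables (the real project’s functions and naming may differ):

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract(**context):
    """Fetch recent tweets for each movie (placeholder)."""


def transform(**context):
    """Clean the raw tweets into the target schema (placeholder)."""


def load(**context):
    """Append the transformed rows to PostgreSQL (placeholder)."""


with DAG(
    dag_id="twitter_movies_etl",          # hypothetical DAG id
    start_date=datetime(2022, 10, 1),
    schedule_interval=timedelta(days=7),  # run every 7 days
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    load_task = PythonOperator(task_id="load", python_callable=load)

    extract_task >> transform_task >> load_task
```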

Conclusion

With this project, we practiced some data engineering concepts like ETL jobs and orchestration with varied tools: Python, Docker, Docker Compose, Apache Airflow, PostgreSQL and AWS services. Although there’s much room for improvement in this application, its infrastructure can scale into a robust orchestrated pipeline. And with that, we achieved our main goal.

Although this is a good solution to the problem as we framed it, there could be better architectures for it: for example, using a data lake and serverless AWS services. That improvement is a clear next step for this project.

Lastly, a brief visualization of the count_tweets_per_movie table can be seen below:

Thanks for reading this article! We hope it contributed some data engineering ideas, just as we learned a lot from building it. Feel free to reach out to me on social media!

GitHub | LinkedIn
