Data Engineering Project: Twitter Airflow Data Pipeline

Nisha Sreedharan
13 min read · Aug 13, 2023


Well, social media these days is abuzz with the legendary feud between Meta’s CEO Mark Zuckerberg and X’s owner Elon Musk. It has even escalated to talk of a cage fight between the two tech giants. We all know how that could turn out.

Musk Vs Zuckerberg Cage Fight

Well, their fight doesn’t change the daily tasks of a data engineer at all :p We still have to cater to irritating requests at work like — ‘can you quickly pull this data for me?’

So if you want to take up a few extra projects on weekends to make yourself feel like a real data engineer, you have to follow a few legends who make great project use cases. I personally love the videos made by Darshil Parmar on data engineering. They will take you from a basic to an intermediate level in a matter of hours! I worked on one such project last weekend — a Twitter data pipeline! It’s honestly a great pipeline for beginners in data engineering. It’s not transformation or code intensive, but it helps you understand the basic flow required for building great, robust data pipelines. If you want to follow along with the full video on YouTube, you can watch it here — https://www.youtube.com/watch?v=q8q3OFFfY6c. In this article I’ll go over the steps you need to follow to complete this project in detail.

Twitter/X data pipeline flow diagram

In this project we’ll extract data from Twitter and process it in Python. We’ll use a package named Tweepy to extract the data from Twitter and Pandas for the transformation. After that we’ll install Apache Airflow on an EC2 instance and write the output data to Amazon S3. The assumption in this project is that you have a basic understanding of Python and AWS. If you are completely new to these, I would suggest a quick and free course to get some basic understanding of Python and AWS services before starting this project.

PRE-REQUISITES FOR THIS PROJECT:

To get started on the project you will need the following installed on your computer/laptop (assuming you have a laptop and a decent internet connection):

  1. Python installed on your local system. It would be great if you also have an IDE installed to help you run your Python code. I personally use Visual Studio Code for running my Python code. However, if you have issues installing Python or an IDE on your laptop, don’t worry. You can use Jupyter Notebooks instead!
  2. Jupyter Notebook: Jupyter Notebook is a web-based interactive platform on which we can run and execute code snippets. It runs notebook documents via the web browser and is a great place to test code before putting it into production.

You can follow this link to install Jupyter Notebook on your system — https://jupyter.org/install. If you are on a Mac, here are the two commands you need to start it (assuming you have the latest version of pip installed on your laptop):

pip install notebook
jupyter notebook

3. AWS Account: In this project we’ll be using EC2 and Amazon S3. If you already have an AWS account, great. Otherwise you can create an AWS account that is free tier for 12 months. You can refer to an earlier article of mine — https://medium.com/aws-tip/data-engineering-create-aws-account-5108fd9084fa — which has a step-by-step process for creating your AWS free tier account.

4. Since you will be extracting data from Twitter/X, you should also have an active Twitter account to log in to the platform.

BASICS OF AIRFLOW

Apache Airflow is an open-source workflow management platform originally developed at Airbnb. It was later open-sourced and joined the Apache Incubator, where it continues to be used and enhanced by developers.

Apache airflow orchestration tool

Airflow is an orchestration tool to build, schedule, monitor and manage different data pipelines. To gain a deeper understanding of Airflow please refer to the official documentation — https://airflow.apache.org/docs/apache-airflow/stable/tutorial/index.html.

I’ll briefly discuss the components of Airflow so that you have a basic understanding before starting the project.

AIRFLOW COMPONENTS

The basic components of Airflow are the DAG, the Task and the Operator.

DAG (directed acyclic graph): A set of tasks with a defined execution order. This is the main flow of your data pipeline.

Task: The basic unit of work in Airflow, e.g. reading data from an API or writing data to a database.

Operator: A template that defines how a task gets executed, e.g. a PythonOperator to run Python code in your DAG, or a BashOperator to run Bash commands.

Dag vs task vs operator in Airflow

For a high-level understanding of the fundamentals, I would suggest you read this link — https://airflow.apache.org/docs/apache-airflow/stable/tutorial/fundamentals.html.
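To make these terms concrete, here is a minimal, hypothetical DAG (not part of this project’s code) that wires two tasks together with operators:

from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

# A DAG groups tasks and defines when they run.
with DAG(
    dag_id="example_dag",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Each operator instance is a task: one unit of work.
    extract = BashOperator(task_id="extract", bash_command="echo 'extract data'")
    load = BashOperator(task_id="load", bash_command="echo 'load data'")

    # The >> operator sets the execution order: extract runs before load.
    extract >> load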

Remember that if you are new to these terms and technologies, everything might seem like an uphill task, and there will be a voice in your head screaming ‘hey! we can’t do this!!!!’. Your only job while executing this project or learning the technology is to ignore that voice, get into the project execution and keep at it. Trust me, fake it till you make it is very relevant in any field where you are starting afresh!

PROJECT EXECUTION:

1. Create a Twitter/X developer account to get the API keys.

On Google, search for ‘Twitter API’ and sign in (or create an account) to get access to the API screen.

twitter developer platform for API
free account to get the data from twitter

Please understand that if you are following the YouTube video, it was created back when extracting Twitter/tweets data through the API was free. The access model has since changed, and the free version will not allow you to extract and read tweets. So if you want to pay for extracting data from Twitter’s developer API, please follow the process explained by Darshil in the project and refer to his GitHub account for the original code — https://github.com/darshilparmar/twitter-airflow-data-engineering-project/tree/main. However, I have subscribed to the basic account, which is free. The main objective of this project is to understand the data flow, which is why I am not making the Python script very compute intensive. If you want to complete the project using the free data API, please refer to my GitHub link — https://github.com/nishasreedharandata/twitter-x-airflow-aws-data-pipeline/tree/main. I have modified the code to only read data that is available with the basic (free) subscription.

To create the basic account you need to give a detailed reason for using the data

While you are creating the basic Twitter account, it’s necessary to give a detailed explanation of why you want the data and what you plan to do with it. You can formulate your own reasons or take help from our best friend — ChatGPT.

Once the account is created, you can see the dashboard with your existing projects. If there is no project yet, click on ‘Add new project’ and create a demo project to get started. In most cases, though, a default project will already be visible on your developer portal’s landing page. From that default project, go to the project settings to get the following tokens for authenticating your API calls to retrieve the data:

  1. API Key
  2. API Key Secret
  3. Access Token
  4. Access Token Secret

Copy and paste these values somewhere safe; you will need them when writing the ETL script to retrieve the data.
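Rather than hard-coding the keys in your script, one option is to read them from environment variables. Here is a minimal sketch, assuming you export variables with these (hypothetical) names in your shell first:

import os

# Hypothetical variable names; set them in your shell before running the script,
# e.g. export TWITTER_API_KEY="...", so the credentials never live in the code itself.
consumer_key = os.environ["TWITTER_API_KEY"]
consumer_secret = os.environ["TWITTER_API_KEY_SECRET"]
access_key = os.environ["TWITTER_ACCESS_TOKEN"]
access_secret = os.environ["TWITTER_ACCESS_TOKEN_SECRET"]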

2. Start with the Python/ETL script to extract Twitter data

When we start coding on our local machine, I would suggest you create a folder and store all your files/code in it. Before starting on the code, we first need to install the packages required to run it.

pip install pandas
pip install tweepy
pip install s3fs

Once the packages are installed you can start with the basic code for extracting the data in a file named twitter_etl.py.

import tweepy
import pandas as pd
import json
from datetime import datetime
import s3fs


# Credentials from the Twitter developer portal (keep these out of version control)
consumer_key = ""      # API Key
consumer_secret = ""   # API Key Secret
access_key = ""        # Access Token
access_secret = ""     # Access Token Secret


# Twitter authentication: the consumer (API) key pair goes to OAuthHandler,
# the access token pair goes to set_access_token
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_key, access_secret)

# Creating an API object
api = tweepy.API(auth)

# Fetching a user's public profile with get_user and a screen_name
screen_name = "elonmusk"
user2 = api.get_user(screen_name=screen_name)

print("user id", user2.id)
print("user desc", user2.description)
print("follower count", user2.followers_count)

The auth object helps us establish the connection with the Twitter API. Paste the values you stored earlier for the API keys and access tokens into the code above. Please remember this code is written for the free tier of Twitter’s developer API. If you are using the paid version, I would suggest this link as well — https://datascienceparichay.com/article/get-data-from-twitter-api-in-python-step-by-step-guide/ — to help you go deeper into extracting tweets. If you try to read tweets using the basic account, you will most likely run into this error — https://stackoverflow.com/questions/76528610/403-forbidden-453-you-currently-have-access-to-a-subset-of-twitter-api-v2-endp — which is why I have modified the code to only read user details from Twitter instead of tweets. If you want to add a few more enhancements while using the free basic version, I would suggest reading this link as well — https://blog.quantinsti.com/python-twitter-api/.

Code for reading tweets from paid version of developer api

At this point, to check the Twitter connection, I would suggest you run this code on your system by navigating to the folder and using the command:

python3 twitter_etl.py

If you are able to see the results, then we are good to go! If not, please send me your errors in the comments and I’ll try to help you out.

Once this is completed, we need to modify the code to arrange the result into a dataframe. Please use my GitHub link to complete that section of the code; a rough sketch of the idea follows.
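Here is a minimal sketch (not the exact GitHub version) of how the user details could be wrapped into a run_twitter_etl() function and arranged into a Pandas dataframe; the column names below are my own choice:

import tweepy
import pandas as pd

def run_twitter_etl():
    # Credentials from the Twitter developer portal (fill these in)
    consumer_key = ""
    consumer_secret = ""
    access_key = ""
    access_secret = ""

    # Authenticate and create the API object
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_key, access_secret)
    api = tweepy.API(auth)

    # Pull the public profile details for a user
    user = api.get_user(screen_name="elonmusk")

    # Arrange the extracted fields into a single-row dataframe
    refined_user = {
        "user_id": user.id,
        "description": user.description,
        "followers_count": user.followers_count,
    }
    df = pd.DataFrame([refined_user])
    print(df.head())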

3. Start with AWS

To get started with the Airflow part of the project, let’s launch an EC2 instance on AWS. First, go to the AWS console and click on EC2 to open the interface for creating an EC2 instance. Please ensure you select a region that is close to where you live.

click on ‘launch instance’ to open a new instance

Create a name for the instance and select Ubuntu as the OS for the machine.

Select ubuntu as your OS

For the machine size, if you are going for the free tier, I would recommend t2.micro. You can choose a bigger size if you are okay with paying for it, but in our use case the data size is not very large.

select t2.micro to use free tier ec2 instance

Create a key pair, which will be used later to log in to the instance. The name I used is airflow_twitter_ec2_key.pem. Ensure that this .pem file is placed in your project folder once it’s downloaded. Once that is done, choose to allow traffic from all sources for now. Please note that for security reasons, allowing all traffic is never a good idea, but this is just a learner project, so we’ll select all for now.

allow traffic from all in the security section for this project

And Voila! The instance would be successfully created.

EC2 is successfully created!

Once the instance is launched, it will show as running in the instance state; this might take a few minutes. To connect to the instance, click on it and then click on Connect. Choose the ‘SSH client’ option and copy the command given to enter your EC2 instance from your machine.

Once you ssh into the machine, you should see a prompt like ubuntu@ip-<> . That means you have entered your EC2 instance.
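For reference, the connect command looks roughly like this (the DNS below is a placeholder; use the exact command shown in the console):

chmod 400 airflow_twitter_ec2_key.pem
ssh -i "airflow_twitter_ec2_key.pem" ubuntu@ec2-xx-xx-xx-xx.compute-1.amazonaws.com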

Now we’ll install the required packages in Ubuntu machine.

sudo apt-get update
sudo apt install python3-pip
sudo pip install apache-airflow
sudo pip install pandas
sudo pip install s3fs
sudo pip install tweepy

To check whether Airflow is correctly installed, just type airflow on your command line and you should see something like this:

airflow is successfully installed

To start airflow you have two options:

  1. Run the command — “airflow standalone”. This starts Airflow and creates a login ID and password for you.
  2. In case the command above doesn’t run, try these instead:
airflow users  create --role Admin --username admin_user --email admin --firstname admin --lastname admin --password admin


airflow webserver --port 8080
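Note that the webserver only serves the UI; tasks are executed by the scheduler. If you go the manual route and your DAGs later never get picked up, you may also need to initialize the metadata database once and run the scheduler in a separate terminal:

airflow db init       # one-time initialization of the metadata database
airflow scheduler     # keep this running so triggered and scheduled tasks actually execute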

To open the Airflow UI, go to the EC2 instance, copy the Public DNS from there and paste it into your browser along with port 8080 (e.g. http://<public-dns>:8080).

use public dns of ec2 along with port 8080(which is airflow’s port)

We can see that this doesn’t open the UI. Any guesses why this happens?

Well, this is because the EC2 security group is not allowing inbound traffic on port 8080! To solve this issue, go to EC2 and click on the instance ID. In there, click on Security and open the security group.

security in EC2

Go to the inbound rules and start editing. Add a new rule allowing access to port 8080 from ‘Anywhere’ or ‘My IP’ and save the rule. Once you do that, you may have to restart Airflow. After that, you will be able to see the login page for the Airflow UI:

Airflow UI visible

Using the username and password, you can log in to the Airflow UI. Now we’ll start writing the DAG code for Airflow. Before that, we should convert our ETL Python/pandas code into a function (refer to the code on GitHub for reference). Start writing your DAG on your local machine in a file named twitter_dag.py.

from datetime import timedelta
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.utils.dates import days_ago
from datetime import datetime
from twitter_etl import run_twitter_etl

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2020, 11, 8),
    'email': ['airflow@example.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=1)
}

dag = DAG(
    'twitter_dag',
    default_args=default_args,
    description='Our first DAG with ETL process!',
    schedule_interval=timedelta(days=1),
)

run_etl = PythonOperator(
    task_id='complete_twitter_etl',
    python_callable=run_twitter_etl,
    dag=dag,
)

run_etl

Now we need to make another change to our Python ETL code so that we can write the final data to an S3 bucket. So we first create an S3 bucket in AWS. The last line of the ETL code in twitter_etl.py writes our final dataframe to the S3 bucket we create below.

create s3 bucket in aws
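For reference, with s3fs installed that last write can be as simple as the call below. This is a minimal sketch; the bucket name is a placeholder for the one you create, and the stand-in dataframe represents the df built inside run_twitter_etl():

import pandas as pd

# Stand-in dataframe; in the real ETL this is the df built from the Twitter user data
df = pd.DataFrame([{"user_id": 1, "description": "demo", "followers_count": 0}])

# With s3fs installed, pandas can write straight to an s3:// path.
# Replace the bucket name with your own; the EC2 instance needs write access to it (set up below).
df.to_csv("s3://your-twitter-airflow-bucket/refined_twitter_data.csv", index=False)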

Also, since we are running the code on our EC2 machine, we need to ensure the EC2 machine has access to write to our S3 bucket. For that, go to your EC2 instance and then to Actions → Security → Modify IAM role. Click on ‘Create new IAM role’. Choose EC2 as the trusted entity and click Next. Search for S3 and select AmazonS3FullAccess, then search for EC2 and select AmazonEC2FullAccess. Give the IAM role a name and save the changes.

S3 create IAM role

If you go back to the Modify IAM role screen for your EC2 instance, you will now find the role you created in the drop-down. Select it and click on ‘Update IAM role’.

IAM ROLE FOR EC2

Well, now that we have fully empowered EC2 and S3 to talk to each other, let’s get our focus back to Airflow. Airflow is already running in one terminal session on the EC2 machine, so open another terminal and ssh into the EC2 instance again to copy our code (the DAG and ETL scripts) over to Airflow. Once inside the instance, look for the airflow folder; a simple ‘ls’ command will show it. We need to go into that folder, modify one line in the airflow.cfg file, and copy twitter_dag.py and twitter_etl.py into a DAGs folder. To complete that, follow the steps below:

cd airflow
sudo nano airflow.cfg
# set this line in the .cfg file:
dags_folder = /home/ubuntu/airflow/twitter_dags
# save and exit

# create the new folder
mkdir twitter_dags
# go into this folder
cd twitter_dags/

# create the dag and etl files in this folder and copy the content into them

sudo nano twitter_dag.py   # copy the DAG content

sudo nano twitter_etl.py   # copy the ETL content

This is required so that the Airflow instance on EC2 can read and execute our Twitter code. Once this is completed, you may have to stop and restart Airflow on EC2 to start seeing this DAG in the Airflow UI.

Yay! You can now see the twitter_dag in airflow’s list of DAGs

Once the twitter DAG is visible you can click on it and explore various features of Airflow.

You can view the airflow DAG code in the UI

To run the DAG, click the ‘run’ button at the top right of the screen in the DAG UI. Once the DAG is triggered, you can follow its state through the colours of its tasks, e.g. dark green means a task completed successfully, light green means it is still running, and yellow means it is retrying. You can click on a task to look at its logs as well.

DAG starting to execute

Once the execution is finished, check the S3 bucket to see if the data has been written. That is a surefire way to know that the code executed correctly!

final output file written to s3
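If you have the AWS CLI configured locally, you can also check from the terminal (the bucket name is a placeholder):

aws s3 ls s3://your-twitter-airflow-bucket/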

And scene!

Hope you enjoy working on this project. You will encounter a few errors along the way, but that’s where the learning actually happens. Please do comment and let me know if you enjoyed this project. I’ll be happy to assist you in solving any errors as well.

Happy learning!


Nisha Sreedharan

Decent Programmer. Great Data Engineer. Lazy Traveller. Hardworking Sleeper. Reluctant Reader.