How to Explore Twitter Streaming Data using Python and Spark

Bennett Holiday · Published in Analytics Vidhya · 7 min read · Sep 10, 2021

This tutorial explores Twitter streaming data using Spark and Python.

Data is being generated at an unprecedented rate, and analyzing it correctly to deliver meaningful insights at the right time can produce valuable solutions across many domains. Real-time streaming data is widely used across a range of industries, from health care and banking to media and retail. Netflix, for example, provides real-time recommendations tailored to individual preferences. Likewise, Amazon tracks its users’ interactions with its products and promptly recommends related items, as does every business that streams large amounts of data and relies on analytics.

Apache Spark is an efficient framework for analyzing large amounts of data in real time and performing various analyses on it. Many resources discuss Spark’s popularity in the big data space, but it is worth highlighting that its core features include real-time big data processing with Resilient Distributed Datasets (RDDs), streaming, and machine learning. In this tutorial, we will demonstrate how Spark’s streaming component can be used in conjunction with PySpark to solve a business problem.

Motivation

In today’s society, the importance of social media cannot be denied. Most businesses collect feedback from their Twitter followers in order to gain insight into and better understand their customers. Social media feedback changes rapidly, and the ability to analyze it in real time is essential for the success of any business. There are several ways to discover how people react to a new product, brand, or event. For example, the sentiment expressed in tweets about a particular topic, product, brand, or event can indicate the level of interest in or trust toward a product. Hence, we use this premise in our tutorial and extract the trending #tags related to our desired topic every few minutes, since hashtagged tweets are more engaging.

Implementation

We will use Tweepy to access Twitter’s streaming API and Spark’s streaming component with a TCP socket to receive tweets. We will load the tweets into an RDD and then retrieve the most popular hashtags. After that, we use Spark SQL to save the top hashtags to a temporary table. Finally, we visualize the results using Python’s visualization tools.

The following image illustrates the overall architecture of our program.

End-to-end architecture

There are two parts to this tutorial.

Part 1: Retrieve Tweets from the Twitter API

Step 1: Import the necessary packages

We will use tweepy, a Python module that connects to the Twitter API, along with its Stream, StreamListener, and OAuthHandler classes to build our data streaming pipeline in receive_tweets.py. The socket module is required to communicate between the Twitter API and our local machine, and the json module to work with the JSON objects the API returns.
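The article’s code listings are not reproduced here, so the following is a minimal sketch of what the imports in receive_tweets.py might look like, assuming tweepy 3.x (StreamListener was removed in tweepy 4.0):

```python
# receive_tweets.py -- imports (sketch, assuming tweepy 3.x)
import socket  # TCP socket used to pass tweets to Spark
import json    # each incoming message is a JSON-encoded tweet

from tweepy import OAuthHandler, Stream
from tweepy.streaming import StreamListener
```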

Step 2: Insert Twitter developer credentials

Insert the credentials for your Twitter developer account. To obtain credentials, you must first create a Twitter developer account, and then request credentials from the Development Portal.
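The values below are placeholders; substitute the keys generated in your own Developer Portal:

```python
# Placeholder credentials -- replace with the keys from your
# own Twitter developer account.
consumer_key = "YOUR_CONSUMER_KEY"
consumer_secret = "YOUR_CONSUMER_SECRET"
access_token = "YOUR_ACCESS_TOKEN"
access_token_secret = "YOUR_ACCESS_TOKEN_SECRET"
```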

Step 3: Create a StreamListener instance

TweetsListener is our subclass of StreamListener; it connects to the Twitter API and handles one tweet at a time. An instance is created automatically as soon as we activate the Stream.

The TweetsListener class contains three methods: __init__, on_data, and on_error. The on_data method receives the incoming JSON payload, which contains a single tweet, and determines which parts of the tweet to retain, such as the tweet’s message, comments, or hashtags. on_error keeps the stream alive when errors occur, and __init__ sets up the socket over which tweets are forwarded.
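A sketch of the listener class, again assuming tweepy 3.x. Here only the tweet text is forwarded over the socket, with a newline appended so the Spark side can split the stream into individual tweets:

```python
class TweetsListener(StreamListener):
    """Forwards the text of each incoming tweet over a TCP socket."""

    def __init__(self, csocket):
        self.client_socket = csocket  # socket created in receive_tweets.py

    def on_data(self, data):
        try:
            msg = json.loads(data)  # one JSON object per tweet
            # Append a newline so socketTextStream() can split tweets.
            self.client_socket.send((msg["text"] + "\n").encode("utf-8"))
            return True
        except BaseException as e:
            print("Error on_data: %s" % str(e))
            return True

    def on_error(self, status):
        print(status)
        return True  # keep the stream alive on errors
```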

Step 4: Send data from Twitter

In order to retrieve data from the Twitter API, we must first authenticate our connection using the credentials defined earlier. Following authentication, we stream tweet objects that contain a specified keyword and language. TweetsListener then handles each tweet as it arrives.
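A sketch of the sending function; track and languages are tweepy’s filter parameters for keyword and language:

```python
def send_data(c_socket, keyword):
    # Authenticate with the credentials defined above.
    auth = OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    # Stream only English tweets that contain the keyword.
    twitter_stream = Stream(auth, TweetsListener(c_socket))
    twitter_stream.filter(track=keyword, languages=["en"])
```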

Step 5: Start streaming

Streaming the data from the Twitter API requires creating a listening TCP socket on the local machine (the server) with a predefined local IP address and port. A socket consists of a server side, which is our local machine, and a client side, which is the Twitter API. The open socket on the server side listens for connections from the client. Once the client side is up and running, the socket receives data from the Twitter API based on the topic or keywords defined in the StreamListener instance.
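Putting the pieces together, a sketch of the server-side socket; the IP address and port must match what read_tweets.py connects to in Part 2:

```python
if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    host = "127.0.0.1"  # local machine
    port = 5555         # must match the port used in Part 2
    s.bind((host, port))

    print("Listening on port: %s" % str(port))
    s.listen(1)
    c, addr = s.accept()  # blocks until the Spark client connects
    print("Received request from: " + str(addr))

    send_data(c, keyword=["football"])
```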

Part 2: Preprocess Tweets and Find Trending #tags

Step 1: Import the necessary packages

In the second part, we will locate Spark on our local machine using the findspark library, and then import the necessary packages from PySpark.

Our tutorial makes use of Spark Streaming (the DStream API) together with Spark SQL, for which we import the pyspark.sql module.
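A sketch of the imports for read_tweets.py:

```python
# read_tweets.py -- imports (sketch)
import findspark
findspark.init()  # locate the local Spark installation

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.sql import SQLContext
from pyspark.sql.functions import desc
```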

Step 2: Initiate SparkContext

We now initiate the SparkContext. The SparkContext is the entry point for all Spark functionality. When we run any Spark application, a driver program starts that serves as the main function, and the SparkContext is initiated there. A SparkContext represents a connection to a Spark cluster and can be used to create RDDs, accumulators, and broadcast variables on that cluster. Under the hood, the SparkContext uses Py4J to launch a JVM and create a JavaSparkContext. Only one SparkContext may be active at a time. We then initiate a StreamingContext with a 10-second batch interval, which means that the input stream will be divided into batches every 10 seconds during streaming.
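A sketch of the initialization; the application name is arbitrary:

```python
sc = SparkContext(appName="TwitterTrendingHashtags")  # name is arbitrary
sc.setLogLevel("ERROR")         # keep the notebook output readable
ssc = StreamingContext(sc, 10)  # 10-second batch interval
sqlContext = SQLContext(sc)     # needed later for the temporary table
```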

Next, we will assign the streaming input source, and then set up the incoming data in lines.

We are using the same port number (5555) that we specified in the first part to send the tweets, and the same IP address, since we are running the application locally. We also use the window() function, which applies our transformations over a sliding window of data, to specify that we analyze the tweets from the previous minute (60 seconds) each time we identify the top 10 #tags.
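A sketch of the input setup:

```python
# Connect to the socket opened by receive_tweets.py and
# analyze a sliding window over the last 60 seconds of tweets.
socket_stream = ssc.socketTextStream("127.0.0.1", 5555)
lines = socket_stream.window(60)
```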

Now that we have completed all the prerequisites, let us see how we can clean the tweets, keep only the words that begin with #, and save the top 10 hashtags in a temporary SQL table.

To store the hashtag counts, we create a namedtuple. Then we use flatMap() to tokenize each tweet into words and filter out the words that do not begin with #. We use lambda functions to keep these transformations concise. Finally, foreachRDD(), an output operation that applies a function to each RDD in the stream, converts each RDD to a DataFrame, which is stored in a temporary table titled “tweets”.
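A sketch of the processing pipeline just described; the namedtuple supplies the tag and count column names for the DataFrame, and the empty-RDD check guards against batches with no tweets:

```python
from collections import namedtuple

Tweet = namedtuple("Tweet", ("tag", "count"))

def process_rdd(rdd):
    # Skip empty batches; toDF() fails on an empty RDD.
    if not rdd.isEmpty():
        (rdd.toDF()
            .sort(desc("count"))
            .limit(10)
            .registerTempTable("tweets"))  # temporary table "tweets"

(lines.flatMap(lambda text: text.split(" "))              # tokenize
      .filter(lambda word: word.lower().startswith("#"))  # hashtags only
      .map(lambda word: (word.lower(), 1))
      .reduceByKey(lambda a, b: a + b)                    # count hashtags
      .map(lambda rec: Tweet(rec[0], rec[1]))
      .foreachRDD(process_rdd))

ssc.start()  # begin receiving and processing the stream
```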

We can now run receive_tweets.py, and after that we can begin streaming by running read_tweets.py.

Note: We must wait a few minutes after running receive_tweets.py to ensure that we have enough tweets to work with.

Tweets streaming

Next, we import our visualization libraries and draw the graph of the top 10 #tags relevant to the topic of “football”. For learning purposes, we check the top 10 #tags only five times, once per minute. You can see that the #tags do not change that often; for more interesting results, keep the stream running longer:
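A sketch of the visualization loop, assuming matplotlib, seaborn, and pandas are installed:

```python
import time
import matplotlib.pyplot as plt
import seaborn as sns

# Refresh the chart five times, once per minute, for demonstration.
for _ in range(5):
    time.sleep(60)  # give the 60-second window time to refresh
    top_10_df = sqlContext.sql("select tag, count from tweets").toPandas()
    plt.figure(figsize=(10, 8))
    sns.barplot(x="count", y="tag", data=top_10_df)
    plt.show()
```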

Here are the results:

Top 10 trending #tags for the topic “football”

Congrats! You have made it! The code is available at my GitHub.

Note: Both parts were performed using Jupyter Notebook. You can find the versions of Java, Python, Spark, and Winutils used in this tutorial on GitHub.

Key Takeaways

This was a simple example of how Apache Spark’s Streaming component works. The same approach can be applied, in more sophisticated ways, to other kinds of input streams, such as user interaction data from popular websites like YouTube or Amazon (media, retail). Another use case for Apache Spark is the stock market, where streaming massive amounts of data and running a variety of analyses in real time is crucial for brokerage firms.

Final thoughts

In this tutorial, we used Spark and Python to identify trending #tags for the topic of football. If you wish, you can change the keyword and search for tweets related to topics of your interest. Experiment with different keywords and gain some insight into them!

Thanks for reading! If you enjoyed this piece, a quick 👏 would mean the world to me. Stay in the loop by subscribing to my profile and don’t hesitate to comment or reach out. Happy reading!

