Tutorial: Gathering text data w/ Python & Twitter Streaming API

John Naujoks
Published in The Startup · Oct 30, 2019

Oh, Tweets. Not the most elegant form of communication, but a concise and robust way to get real-time feedback and information. I regularly attend conferences, like San Diego Comic Con, and Twitter is invaluable for navigating all the activity happening at the same time. On a normal day, Twitter can feel a bit much for me, but it is a fascinating way to get live information.

Whether profound or silly, tweets can provide information and reactions about what is happening in the world. Let’s say you are planning a small music festival. By performing some NLP tasks on tweets while the event is taking place, you could:

  • Perform sentiment analysis to see whether attendees are responding positively or negatively.
  • Look for common questions that need to be addressed.
  • Find other interesting patterns that may be helpful, like a certain artist’s set going poorly, bathrooms in one area not working, or one entrance moving much slower than the others.

You can pay a team to look into these things manually, but we have the ability to do these types of tasks in real time with a little computer assistance. In this demo, I am going to walk through creating a script to use the Twitter Streaming API to get a live feed of tweets based on selected topics.

1. Set up necessary packages

from tweepy.streaming import StreamListener
from tweepy import OAuthHandler
from tweepy import Stream
import datetime
import csv

To start, we are going to need some packages to help build our streaming script. Tweepy is an excellent Python package for interacting with the Twitter API. For this, we need a few classes from it: StreamListener and Stream (for building our stream) and OAuthHandler (for authentication with Twitter). You will need to register on Twitter for developer credentials, which we’ll be inserting later. datetime and csv will be used for handling the output and putting the tweets from our stream into a file.

2. Select our details

class StdOutListener(StreamListener):
    def on_status(self, status):
        # Only keep English tweets from users with at least 500 followers
        if status.lang == "en" and status.user.followers_count >= 500:
            # Altering tweet text so that it keeps to one line
            text_for_output = "'" + status.text.replace('\n', ' ') + "'"
            csvwriter.writerow([status.id,
                                status.user.screen_name,
                                # Using datetime to parse it to just get date
                                status.created_at.strftime('%m/%d/%y'),
                                status.user.followers_count,
                                text_for_output])
        return True

    def on_error(self, status_code):
        if status_code == 420:
            # Returning False in on_error disconnects the stream
            return False

When we activate our stream, the class above handles every tweet that matches our criteria. The on_status method processes each incoming tweet, while on_error checks that the stream is still working: if Twitter returns a 420 status code (meaning we are being rate limited), returning False disconnects the stream and the script stops.

When our stream is going, each tweet comes through as a Status object, which carries a number of details. Here is a quick breakdown of the details contained in each tweet:

  • id or id_str: unique tweet id, returned as a number or a string
  • created_at: date and time posted
  • text: the actual tweet text
  • user: a dictionary of details about the individual who sent the tweet, including screen_name, followers_count, and all other profile details
  • geo, place, coordinates: location info for the tweet, if available
  • retweeted_status: if the tweet is a retweet, a dictionary of details about the original tweet, including the user dictionary for the person who wrote it

I recommend familiarizing yourself with the structure of the Status object, because it offers a lot of detail; you just have to carve out the parts you want. You can grab the whole record with status._json, but that will make your files very large from the start and crowd them with data that may not be relevant to what you are looking into.
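For example, here is a minimal sketch of checking for a retweet inside on_status and crediting the original author (the hasattr check matters because retweeted_status only exists on retweets):

# Inside on_status: retweeted_status only exists when the tweet is a retweet
if hasattr(status, 'retweeted_status'):
    original_author = status.retweeted_status.user.screen_name
    print("Retweet of @" + original_author)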

In the example code above, each status is checked to make sure it is in English (my primary language, so the easiest for me to interpret) and that the user has at least 500 followers. The follower count is a guess: the articles I checked online claimed very different numbers for the average user’s follower count. My goal in limiting to users with at least 500 followers is to (1) keep my data flow to a reasonable volume and (2) lean toward tweets from users who are more active and engaged. This will vary with your topic, but for broad topics it can be good to add some additional conditions.

With my conditionals in place and my attributes selected, my on_status function is now set to write a new line to a csv every time it gets a tweet from the data stream.

3. Provide authorization

# Keys and tokens come from your Twitter developer account
listener = StdOutListener()
auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
stream = Stream(auth, listener)

As mentioned before, you will need to set up a developer account with Twitter to be able to access and interact with the Streaming API. Here our listener class is instantiated and given the authorization credentials. You want to keep your keys local and off of GitHub: keep them in a separate file and make sure that file is listed in your .gitignore so they stay private.
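One common pattern (a sketch; the file name twitter_keys.py is my own choice here, not from the original demo) is to keep the credentials in a small module that never gets committed:

# twitter_keys.py -- listed in .gitignore so it never reaches GitHub
consumer_key = "YOUR_CONSUMER_KEY"
consumer_secret = "YOUR_CONSUMER_SECRET"
access_token = "YOUR_ACCESS_TOKEN"
access_token_secret = "YOUR_ACCESS_TOKEN_SECRET"

# Then, in the streaming script:
from twitter_keys import consumer_key, consumer_secret, access_token, access_token_secret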

4. Open the gates!

# newline='' keeps the csv module from inserting blank rows on Windows
csvwriter = csv.writer(open("our_blank.csv", "a", newline=''))
csvwriter.writerow(['twitter_id', 'name', 'created_at', 'followers_count', 'text'])
stream.filter(track=['star wars'])

Alright, now we are ready to open the stream and have it populate an empty csv file. We open the file and wrap it in a csv writer, write a header row with the column names we want at the top of the file, and then stream.filter does its thing to bring in tweets. For this little test, I just used ‘star wars’ as the text to filter on, but track can take a single term or a list of terms. In no time at all, you get a csv file with plenty to work with.
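A list of terms acts as an OR filter, so a sketch like this would pull in tweets matching any of them (the extra terms are just my illustration):

# track accepts a list; a tweet matching any one term comes through
stream.filter(track=['star wars', 'mandalorian', 'lightsaber'])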

Try to be as specific and concise as possible with your filter terms. For specific timed events you may want everything, but be mindful of how heavy the flow can get if you are gathering a lot of detail; using ‘star wars’ as the filter, this demo gave me about 50 records in about 3 minutes. Also note that the script we created here has to be stopped manually, or you can add a bit of code to have it stop after a certain number of records, as sketched below. No matter what, you now have the power to wade bravely into the endless stream of Twitter!
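As a rough sketch (my own addition, not part of the original demo), you could subclass the listener with a counter and disconnect once you hit a cap. Note that this counts every status received, not just the ones written to the csv:

class LimitedListener(StdOutListener):
    def __init__(self, max_tweets=100):
        super().__init__()
        self.count = 0
        self.max_tweets = max_tweets

    def on_status(self, status):
        super().on_status(status)
        self.count += 1
        # Returning False from on_status disconnects the stream
        return self.count < self.max_tweets

# Use it in place of StdOutListener:
# stream = Stream(auth, LimitedListener(max_tweets=200))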

I put all the demo code for using the Twitter Streaming API in one gist here.
