Scraping Twitter with TweetScraper and Python

Kevin Crystal
3 min read · Jun 11, 2019


Social media can be an incredible source of real-time updates on current events, but accessing the data often presents challenges. If you need to scrape Twitter and are struggling with other packages, TweetScraper is a solid option for quickly collecting a large number of historical tweets.

My teammates and I were tasked with using social media to identify power outages, with the ultimate goal of improving the ability of emergency management officials to allocate resources in real time. We identified Twitter as the platform most likely to yield a large number of posts related to the subject, and enthusiastically set out to begin collecting tweets. Our enthusiasm was tested, however, as we ran into several stumbling blocks.

The first logical choice was to use Twitter’s official Search API. Unfortunately, the free version only allows access to the seven most recent days of historical tweets, with further access requiring a costly subscription. Given that our deadline was less than two weeks away, and that we did not have the resources to buy a subscription, there was no way this method would allow us to amass enough data to train a model.

Next, we tried the TwitterScraper package. This seemed promising, and, indeed, our classmates got excellent results with TwitterScraper. Unfortunately for my team, we all overtaxed the service during our initial experimentation with the package, and our computers were locked out from further access before we could refine our query to get meaningful results. This was a major bummer, but the deadline loomed and we still had our hearts set on using Twitter. Enter TweetScraper!

TweetScraper doesn’t have the same built-in Python functionality that TwitterScraper does, so after installing the package as instructed at https://github.com/jonbakerfish/TweetScraper, we had to run our initial queries in the terminal. One issue we encountered was that we couldn’t use search operators to refine the search (the documentation indicates this is possible, but our attempts to use the operators did not generate results), so we had to stick to fairly basic queries, implicitly specifying the location by including the name of a major utility as one of the search terms.
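For reference, a TweetScraper run from the terminal looks something like this, following the usage shown in the project’s README (executed from inside the TweetScraper directory; the search terms below, including the utility name, are illustrative placeholders rather than our exact query):

```
scrapy crawl TweetScraper -a query="power outage ConEd"
```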

Running this command several times returned 4,374 tweets ranging from 2007-07-20 to the time of the scrape. TweetScraper returned each tweet as an individual JSON file, saved to the folder “../TweetScraper/Data/tweet”. Our main obstacle as burgeoning Python programmers was figuring out how to access these JSON files.

After copying the folder to the project directory, we accomplished this with the following code:
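A minimal sketch of that loading step, assuming each file in the copied folder holds a single JSON-encoded tweet (the folder path and variable names below are illustrative, not our verbatim code):

```python
import json
import os

# Folder copied from ../TweetScraper/Data/tweet into the project directory
DATA_DIR = './tweet'

tweets = []
for filename in os.listdir(DATA_DIR):
    # Each file holds one tweet serialized as a JSON object
    with open(os.path.join(DATA_DIR, filename)) as f:
        tweets.append(json.load(f))

print(f'Loaded {len(tweets)} tweets')
```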

Finally, we converted the list of tweets to a workable Pandas dataframe:
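Since each tweet is a dictionary of fields, the list converts in one line (assuming the `tweets` list from the step above):

```python
import pandas as pd

# Each tweet dict becomes a row; its keys become the column names
df = pd.DataFrame(tweets)
df.head()
```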

Thanks, TweetScraper. You saved the day!
