Scraping Tweets by Location in Python using snscrape
A deep dive into the snscrape Python wrapper, how to use it to search for tweets by location, and why it sometimes doesn’t work as expected.
snscrape is a library that allows anyone to scrape tweets without requiring personal API keys. It can return thousands of tweets in seconds and has powerful search tools that allow for highly customisable searches. It is currently lacking in documentation, especially around scraping tweets by location, so I hope to provide a thorough introduction to the topic.
Introduction to snscrape in Python
Throughout this article I’ll be using the development version of snscrape, which can be installed using
pip install git+https://github.com/JustAnotherArchivist/snscrape.git
Note: this requires Python 3.8 or higher.
Some knowledge of the Pandas module is also necessary.
Only three packages are required, shown below:
To find the first (i.e. most recent) 100 tweets that contain the phrase data science, we can use the following code:
Which can be shortened to the following line:
Outputting the first five results we can start to see the information this line gives us:
But that isn’t all! In total, 21 columns of data are returned for each tweet.
I’d recommend having a play around to see what each of these is, or having a browse of this article to find out more. I’ll go a bit deeper into the user field later on.
Advanced Search Features
The TwitterSearchScraper uses Twitter’s advanced search; for an in-depth rundown of all its capabilities, check out this table. I cannot recommend it enough!
Scraping by Location
When filtering by location there are two options: you can use the near:city tag alongside within:radius, or geocode:lat,long,radius. After extensive testing I can confirm that they yield identical results when used correctly (or should I say, as Twitter interprets them).
As an example, say you wanted to find all tweets about pizza in Los Angeles. This can be achieved using the following code:
Instead of using the city’s name you can use its coordinates:
To compare the results we can use an inner merge on the two DataFrames:
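A minimal sketch of the comparison, using toy DataFrames in place of the two scraped result sets. An inner merge on all shared columns keeps only rows that appear in both, so the length of the result tells us how many tweets the two searches have in common:

```python
import pandas as pd

# Toy stand-ins for the two scraped DataFrames; in the real comparison each
# would hold the 50 tweets returned by the near: and geocode: searches
df_near = pd.DataFrame({"id": [101, 102, 103], "content": ["a", "b", "c"]})
df_geocode = pd.DataFrame({"id": [101, 102, 103], "content": ["a", "b", "c"]})

# With no 'on' argument, merge joins on every shared column, so only
# rows identical in both frames survive the inner merge
merged = df_near.merge(df_geocode, how="inner")
print(len(merged))  # 3: every row appears in both frames
```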
This returns 50, i.e. both DataFrames contain exactly the same rows.
What exactly is this location?
There are two ways to get a location from Twitter: a geo-tag attached to a specific tweet, or a user’s location as part of their profile. According to Twitter, only 1–2% of tweets are geo-tagged, so it isn’t a great metric to rely on; on the other hand, a significant number of users have a location in their profile, but they can enter whatever they want. Some are kind to people like us and will write ‘London, England’ or similar, while others are less useful, putting things like ‘My Parents Basement’.
All the documentation I could find indicated that using a location as part of the search would only find geo-tagged tweets; however, this isn’t the case. From my investigations I have discovered that Twitter has some algorithms working as part of its advanced search that can infer a user’s location from their profile, and it assumes all of their tweets come from there. This means that when you search for tweets using coordinates or a city name, the search returns tweets that are geo-tagged from that location, or that were tweeted by users with that location (or somewhere close by) in their profile.
As an example, when I searched for tweets near:"London" I managed to find examples of both:
The first tweet is geo-tagged, and the user does not have a location as part of the profile, i.e. the tweet was found because of its geo-tag. The second tweet is not geo-tagged, and was found because the user has their location in their profile.
Getting location from a scraped tweet
If you want to get the user’s location having scraped the tweet, this is also possible using snscrape. In the example below I scrape 50 tweets from within 10 km of LA, store them in a DataFrame, and then create a new column for the user’s location.
Looking at the first five rows, we can see that although the locations are not all formatted the same way, they can all be interpreted as Los Angeles.
When it doesn’t work as expected
The way Twitter uses this search tag isn’t obvious, and when iterating through counties and unitary authorities in England I found the results to be inconsistent. For example, when searching for tweets near:Lewisham, all the tweets appear to be geo-tagged and come from Hobart, Australia (see below); that’s more than 17,000 km away! I found that using city names worked as expected, but towns, villages and even countries returned suspicious results.
When using snscrape to scrape tweets by location, I’d always recommend using the geocode tag with latitude and longitude coordinates, plus a radius to bound the search area. This will provide the most accurate results possible given the data available.
Conclusion
This simple yet incredibly powerful Python module allows for some very specific searches. Again, I’d recommend checking out this table for a clear rundown of its full capabilities. Twitter has done the hard work of converting user-input locations into real places, allowing us to find them by name or coordinates. Tweets provide a great source of information, and when used in conjunction with a tool as powerful as snscrape, they allow a wide range of interesting projects to be completed without much data science experience or knowledge of the subject! Happy scraping :)
All the code mentioned in the article can be found here: https://github.com/satomlins/snscrape-by-location
Find me on LinkedIn
References
snscrape GitHub: https://github.com/JustAnotherArchivist/snscrape
Twitter’s advanced search operators: https://github.com/igorbrigadir/twitter-advanced-search
Reference article on how to use snscrape Python wrapper: https://medium.com/better-programming/how-to-scrape-tweets-with-snscrape-90124ed006af