Mining Twitter Data without API keys

Getting Twitter Big Data for Analysis with a Single Line of Command.

Victor E. Irekponor
Analytics Vidhya
8 min read · Nov 15, 2019


Social Media platforms: Huge mine of textual big data

Data is the new oil, and it is everywhere. Over the last decade, more than 90 percent of big data has been generated by people living in urban areas. With the advent of the Internet of Things (IoT) and the increased use of the internet, social media has become an integral part of people's daily lives. Millions of unstructured data points are sent to the cloud every second, providing free and largely unfiltered information on practically any discourse. On Twitter alone, an average of around 6,000 tweets are sent every second, which corresponds to over 350,000 tweets per minute, 500 million tweets per day, and around 200 billion tweets per year. See stats here.

Imagine getting this data onto your personal computer without having to write any bulky code: just a few commands from the command line or terminal.

I recently had the opportunity to work on a team project centered on community resilience and studentification, and I needed to download and analyse old, backdated Twitter data spanning about 10 years. Naively, I went about it the standard way: first creating a developer account with Twitter, then using Tweepy to query Twitter's RESTful API to download user tweets along with metadata such as retweets, likes, favorites, date and time, username, et cetera.

I soon discovered that going down that road would not help with my task, which was to download Twitter data on a specific discourse over the last 10 years and feed it into several analysis pipelines I had built. The reason was Twitter's rate limits and time constraints: with the standard API you cannot mine tweets older than seven days, and a single search query returns at most 100 results from the current week. Not very helpful to me, I must say.

Thank you Calculator, but the answer is not very helpful. :-)

So I went into my default research mode. I had a problem to solve: getting old tweets programmatically. That was when I found the GetOldTweets-python tool, originally developed by Jefferson Henrique for Python 2.x and later ported to Python 3.x by Dmitri Mottl. Probing deeper, I saw that it was essentially a tool that queries Twitter's search engine the way a browser does, using Python libraries such as pyquery, urllib, requests, and lxml. Interesting, and very helpful for my task.

The tool works by rapidly querying Twitter search through the browser interface, looking for the keywords, usernames, or hashtags you specify in the command, until it reaches the end of your search window. When you run a search on Twitter, a scroll loader starts; as you scroll down you get more and more tweets, all through calls to a JSON provider. By replaying those calls, the tool can reach the deepest and oldest tweets.

At that stage in my research, I had found just what I was looking for. In software development and engineering, you don't try to reinvent the wheel, so I jumped on it, but I found it difficult to write a working implementation for my own use case; I wasn't getting the results I needed. I also figured that a lot of people are, or will one day be, in my shoes, so why not contribute to open source and make things easier even for folks without a strong Python or programming background?

And that led to my own improvement fork, which is the crux of this article.

So we are just getting started.

TL;DR

This improvement fork ensures that downloading millions of old, backdated tweets for your analysis becomes relatively easy and stress-free on Windows, Ubuntu, or macOS, straight from the command line. It is basically a one-liner command. Just follow the simple steps, which I have already enumerated on my GitHub but will walk through here for practicality's sake.

Requirements.

The first prerequisite is to have any Python 3.x version installed on your local machine, with the environment variable path already set, so you can fire up Python interactively from your Command Prompt or Terminal without getting any errors. The easiest way to do that on Windows is to run the Python installer again and tick the box saying "Add Python to environment variables" under the advanced options; this answer on Stack Overflow shows it more clearly. On Linux, if you're using the Ubuntu distribution (which I think is the most popular among programmers who prefer Linux), just run sudo apt-get install python3.6 in your terminal, and after installation you should be able to fire up Python 3 by typing "python3".
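A quick way to confirm that the command on your machine points at a working Python 3.x interpreter is to print the version from inside Python itself:

# run inside the interactive interpreter (or as a one-liner)
# to confirm which Python the "python"/"python3" command points to
import sys
print(sys.version)  # should report some 3.x version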

After doing that, the next major requirement is to install PyQuery and lxml for handling requests and XML/HTML documents. This is easily done from the Terminal or Command Prompt: pip install pyquery and pip install lxml.
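If you want to sanity-check the installation before cloning the repo, a minimal snippet like the one below (the sample HTML is just an illustration) will fail immediately if either library is missing:

# minimal sanity check that pyquery and lxml are installed correctly
from pyquery import PyQuery as pq
import lxml.etree

doc = pq("<html><body><p>hello twitter</p></body></html>")
print(doc("p").text())          # -> "hello twitter"
print(lxml.etree.LXML_VERSION)  # prints the installed lxml version tuple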

Once you've completed these prerequisites, head over to my GitHub here to fork and/or clone my improvement fork of Mottl's Python 3 version, which is itself a fork of the original package written by Jefferson Henrique.

I’d like to have some Nigerian Jollof and Turkey, please :-)

There's a README file in my repo that covers the implementation and some basic examples, but I'll give a short hands-on tutorial on downloading old, backdated Twitter data here.

You can start downloading old, backdated Twitter data for any kind of analysis by following the steps below:

  1. Clone or download the repo to your local machine.
  2. Unzip it.
  3. Navigate to the unzipped Optimized-Modified-GetOldTweets3-OMGOT-master folder.
  4. cd into the GetOldTweets3-0.0.10 folder inside it, and fire up a command prompt or terminal right there.
  5. There you go; you can start playing around with your dataset now.

Run the commands from the GitHub README to start downloading your Twitter data, which will be saved on your computer as an output.csv file by default. You can also give the downloaded dataset a name of your own by passing "name".csv to the --output argument. For instance, --output dataset.csv saves the downloaded file as dataset.csv in your working directory.

Now, let's get our hands dirtier with more examples. Let's say we need to download tweets by a specific user.

We simply follow steps 1-4 above and type in the following command, culled from the GitHub repo:

python GetOldTweets3.py --username "mo4president"  --maxtweets 50 --output dataset.csv

The command above queries Twitter's search engine and downloads tweets made by @mo4president. The --maxtweets parameter specifies how many tweets you need, and the --output parameter specifies the output file name. Leaving out the --maxtweets argument downloads all of the user's tweets; here we specified 50, so we only get 50 rows of tweet info.
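If you would rather call the tool from a Python script than from the command line, the underlying GetOldTweets3 package also exposes a TweetCriteria/TweetManager interface. A minimal sketch, assuming the package from the cloned folder is importable (e.g. installed with pip install -e .):

import GetOldTweets3 as got

# equivalent of: python GetOldTweets3.py --username "mo4president" --maxtweets 50
criteria = (got.manager.TweetCriteria()
            .setUsername("mo4president")
            .setMaxTweets(50))

tweets = got.manager.TweetManager.getTweets(criteria)
for tweet in tweets[:5]:
    print(tweet.date, tweet.text)  # each result carries date, text, retweets, etc.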

Also, if you need to download tweets on a particular keyword, e.g. "rent", over a period of time (say from 2014 till date, about 5 years' worth of old tweets) and within a geographic location, you can use this command, also culled from my GitHub repo, after executing steps 1-4 above:

python GetOldTweets3.py --querysearch "rent" --near "6.52, 3.37" --within 40km --maxtweets 350 --output rent.csv --since 2014-01-01 --until 2019-08-31

The command above fetches all the tweets containing the word or hashtag "rent" within a 40 km radius of the latitude/longitude coordinates of Lagos state (because that is what we specified), from 2014 till date, and stores the results in a rent.csv file saved in the current working directory. The --maxtweets argument returns 350 tweets here, because that is what was specified.
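The keyword, location, and date-range filters map onto the Python interface in the same way; a sketch, again assuming the upstream GetOldTweets3 package layout:

import GetOldTweets3 as got

# mirrors the --querysearch/--near/--within/--since/--until flags above
criteria = (got.manager.TweetCriteria()
            .setQuerySearch("rent")
            .setNear("6.52, 3.37")
            .setWithin("40km")
            .setSince("2014-01-01")
            .setUntil("2019-08-31")
            .setMaxTweets(350))

tweets = got.manager.TweetManager.getTweets(criteria)
print(len(tweets), "tweets fetched")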

Essentially a one-liner command to start downloading tweets
The output in CSV format

Running this same process, I was able to gather enough data on a variety of discourses, represented by keywords, and analyse them to see trends relating to community resilience and studentification, among other analyses, over a 10-year period.

As shown above, after downloading the Twitter data in its raw, unstructured state, you'll need to do a bit of DataFrame manipulation with the pandas library in Python to drop some columns and probably do some basic exploratory data analysis.
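For example, a small pandas sketch along these lines works; the column names are assumptions based on a typical GetOldTweets3 export, so check the header row of your own CSV first:

import pandas as pd

# load the file produced by the --output flag
df = pd.read_csv("rent.csv")

print(df.shape)    # rows x columns
print(df.columns)  # inspect the header before dropping anything

# keep only the fields we care about for analysis (assumed column names)
keep = [c for c in ["date", "username", "text", "retweets", "favorites"] if c in df.columns]
df = df[keep]

print(df.head())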

It is worth noting that this modified tool now automatically takes care of tweet text cleaning and the removal of unwanted characters.

Then you might want to pass the tweet texts into a sentiment analysis pipeline built around VADER or TextBlob for further analysis, or do document-topic-word modelling using Python's Gensim library and the Latent Dirichlet Allocation (LDA) model. But all of these analyses depend on your proficiency and the direction of your project as a data scientist or researcher.
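As a quick illustration of the sentiment step, here is a minimal sketch using the vaderSentiment package, where df is the hypothetical DataFrame from the pandas snippet above:

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

# score each tweet; the "compound" value ranges from -1 (negative) to +1 (positive)
df["sentiment"] = df["text"].astype(str).apply(
    lambda t: analyzer.polarity_scores(t)["compound"]
)

print(df[["text", "sentiment"]].head())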

In the next couple of posts, I will look at pretty much all the analyses that can be carried out on the tweet texts to derive better insights, and also how to handle noise in your dataset. For now, I hope you find this article helpful in gathering your Twitter "big" data.

Code Contributors:

Irekponor Victor, University of Lagos, Nigeria.

Aminu Israel, Lagos State University, Nigeria.

Olah Femi, University of Lagos, Nigeria.

UPDATE: 13-10-2020.

This tool (optimizedGetOldTweets) is currently broken because Twitter upgraded their API endpoints from v1.1 to v2, and as a result some of the endpoints used in this tool were deprecated. Here are some more details on their new API. Things moved pretty fast, and I haven't been able to fix this yet. But for the time being, if you're here, check out this addition I made to the GitHub README.

I believe it would go a long way in helping you in your twitter data mining escapades. 😊.
