Read this before you start developing using Twitter data.

Akansha Jain
Published in Learn with Akansha
4 min read · Feb 24, 2018

Twitter is arguably the best open data source available today, with 1.3 billion registered users and 157 million daily active users.

Researchers, developers, and students like me, all around the world, are making use of the real-time nature of Twitter data. Twitter users are often open and candid about their experiences when they tweet, which makes it a good data source for analysing the behavior and attitudes of a population on any given topic. On top of that, it is fast, cost-effective, and, as a successful global social media platform, offers wide coverage.

Now, the question is: how do you get this data? Programmatically, it’s not that difficult, and Twitter is generous enough to provide API services for it. There are also various wrappers for accessing these APIs from different programming languages such as Python, PHP, and Java. As a hopeful future data scientist and current Python enthusiast, I really like using tweepy for this purpose.

Recently, I started writing code to extract Twitter data for my project (with the help of the internet, obviously) without knowing about the different APIs, their limits, and their features, and I ended up confused. So I decided to do some research and help anyone who is a beginner, is confused, or is just curious to understand Twitter’s APIs. Twitter currently provides the following APIs to access its data:

  1. Twitter’s Search API
  2. Twitter’s Streaming API
  3. Twitter’s Firehose API

Let’s look at each of them.

Twitter’s Search API gives you access to a data set that already exists, built from tweets that have already been posted, i.e. historical data. It is part of Twitter’s REST API. You pull the data from the API, and the results are the tweets that satisfy your search criteria. The criteria can be a list of keywords (including hashtags), usernames, or locations, similar to how you search in the Twitter search bar. The standard API returns data from at most the past 7 days; premium plans can give you the past 30 days, and the enterprise plan gives access to data going back to 2006. Some other limits and features to note: response format: JSON; requires authentication: yes; rate limited: yes; requests per 15-minute window (user auth): 180; requests per 15-minute window (app auth): 450.
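
To make the pull model concrete, here is a minimal sketch that only builds the request URL for the standard search endpoint (search/tweets.json in API v1.1) and works out how many tweets the rate limits allow per window. It sends nothing over the network and skips authentication entirely. The helper names are made up for illustration, and the cap of 100 tweets per request is the standard search maximum for the count parameter, not a figure from this article.

```python
from urllib.parse import urlencode

# Standard (v1.1) search endpoint; a real request also needs OAuth credentials.
SEARCH_URL = "https://api.twitter.com/1.1/search/tweets.json"
MAX_COUNT = 100  # largest `count` the standard search accepts per request


def build_search_url(query, count=MAX_COUNT, result_type="recent"):
    """Return the GET URL for a keyword/hashtag search (hypothetical helper)."""
    params = {
        "q": query,
        "count": min(count, MAX_COUNT),
        "result_type": result_type,
    }
    return SEARCH_URL + "?" + urlencode(params)


def max_tweets_per_window(requests_per_window, per_request=MAX_COUNT):
    """Upper bound on tweets retrievable in one 15-minute window."""
    return requests_per_window * per_request


url = build_search_url("#python")
user_auth_cap = max_tweets_per_window(180)  # user auth: 180 requests/window
app_auth_cap = max_tweets_per_window(450)   # app auth: 450 requests/window

print(url)
print(user_auth_cap, app_auth_cap)
```

So even in the best case, the standard Search API tops out at 18,000 tweets per 15 minutes with user auth and 45,000 with app auth.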

Twitter’s Streaming API pushes data to you as tweets happen, in near real time. You register a set of criteria (keywords, usernames, locations, named places, etc.), and as tweets match those criteria they are pushed directly to you. This is a push of data by Twitter rather than a pull initiated by the end user, as is the case with the Search API. The major drawback is that the Streaming API provides only a sample of the tweets being posted, roughly 1% of all tweets, in near real time. The actual percentage you receive varies heavily with the criteria you request and the current traffic. Some other limits and features to note: response format: JSON; requires authentication: yes; rate limited: yes.
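
The push model is easier to see in code. The sketch below is a local simulation, not the Twitter API: a hard-coded list stands in for the live feed, and a callback receives each tweet that matches the registered keywords. The function names and the fake tweets are invented for illustration, and the plain substring match is a simplification of whatever matching Twitter actually applies.

```python
def matches(tweet_text, track_terms):
    """True if any tracked term appears in the tweet text (case-insensitive)."""
    text = tweet_text.lower()
    return any(term.lower() in text for term in track_terms)


def run_stream(incoming_tweets, track_terms, on_tweet):
    """Push each matching tweet to the callback as it 'arrives'."""
    delivered = 0
    for tweet in incoming_tweets:
        if matches(tweet["text"], track_terms):
            on_tweet(tweet)  # the consumer never asks; data is pushed to it
            delivered += 1
    return delivered


# Stand-in for the live feed; a real stream would not end.
fake_feed = [
    {"id": 1, "text": "Learning Python with tweepy"},
    {"id": 2, "text": "Coffee time"},
    {"id": 3, "text": "PYTHON 3 is great"},
]

received = []
count = run_stream(fake_feed, ["python"], received.append)
print(count, [t["id"] for t in received])
```

With a wrapper like tweepy, the callback role is played by a listener object, but the shape is the same: you register criteria once and then react to whatever arrives.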

The Twitter Firehose was handled by data providers such as GNIP and DataSift until 2015, when Twitter stopped giving third parties access to 100% of its data. Instead, Twitter now offers the Decahose stream, which is very similar to the Streaming API but delivers a 10% random sample of the real-time Twitter Firehose through a streaming connection. Historical PowerTrack, on the other hand, provides access to the entire historical archive of public Twitter data, back to the first Tweet, using the same rule-based filtering system.

The Decahose stream and Historical PowerTrack are enterprise APIs, available only within Twitter’s managed access levels. To use them, you must first set up an account with Twitter’s enterprise sales team.

Now you are ready to decide which API to use, provided you have a clear understanding of the type of data you need and of your project requirements. For example, does your project need real-time data, as stock market analysis does, or does it need historical data, say for sentiment analysis of a past event or a popular hashtag? You should be able to say that the Streaming API solves the former problem and the Search API the latter. If your requirements are higher and you are building an enterprise-level application, you can contact the Twitter team for Firehose access.
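
The decision above can be written down as a small helper. This is only my summary of the article as code: the function name, its parameters, and the returned labels are hypothetical, and the 7-day and 30-day cut-offs are the standard and premium search windows mentioned earlier.

```python
def choose_api(needs_realtime, days_back=0, enterprise=False):
    """Suggest a Twitter API for a project (illustrative helper only)."""
    if needs_realtime:
        # Live data: sampled stream, or a managed enterprise stream.
        if enterprise:
            return "Enterprise stream (contact Twitter)"
        return "Streaming API"
    # Historical data: the tier depends on how far back you need to go.
    if days_back <= 7:
        return "Search API (standard)"
    if days_back <= 30:
        return "Search API (premium)"
    return "Search API (enterprise)"


print(choose_api(needs_realtime=True))                 # stock market analysis
print(choose_api(needs_realtime=False, days_back=3))   # recent hashtag
print(choose_api(needs_realtime=False, days_back=365)) # event from last year
```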

Another thing I observed after implementing both the Search and Streaming APIs: the Search API only checks the text of the tweet against the keywords in your criteria, while the Streaming API may also return tweets that contain the keyword only in a linked tweet or URL rather than in the original text. Relatedly, the Search API documentation says: “Please note that Twitter’s search service and, by extension, the Search API is not meant to be an exhaustive source of Tweets. Not all Tweets will be indexed or made available via the search interface.”
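
Here is a rough sketch of the difference I mean, using a made-up tweet dictionary. It is not how Twitter implements matching; it only contrasts a check on the tweet text with a broader check that also looks at expanded URLs and a quoted tweet. The field names (entities, urls, expanded_url, quoted_status) follow the v1.1 tweet JSON, while the function names, the short link, and the example URL are invented.

```python
def text_only_match(tweet, keyword):
    """Match the keyword against the tweet's own text only."""
    return keyword.lower() in tweet["text"].lower()


def broad_match(tweet, keyword):
    """Also look at expanded URLs and the quoted tweet, if present."""
    kw = keyword.lower()
    parts = [tweet["text"]]
    for url in tweet.get("entities", {}).get("urls", []):
        parts.append(url.get("expanded_url", ""))
    quoted = tweet.get("quoted_status")
    if quoted:
        parts.append(quoted.get("text", ""))
    return any(kw in part.lower() for part in parts)


# Invented example: the keyword appears only in the linked URL.
tweet = {
    "text": "Great read, check this out https://t.co/abc123",
    "entities": {"urls": [{"expanded_url": "https://example.com/python-tips"}]},
}

print(text_only_match(tweet, "python"))  # False
print(broad_match(tweet, "python"))      # True
```

If the two APIs give you different result sets for the same keyword, a difference like this is one place to look.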

You must register a Twitter application and authenticate it to access Twitter’s APIs. Also remember that access to these APIs is rate limited, which means a single application can make only a limited number of requests within a given time window. If you exceed the limit, you may encounter errors, and the application’s access can even get blocked.
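
One way to stay inside the limits is to read the rate-limit headers Twitter sends with each response (x-rate-limit-remaining and x-rate-limit-reset, the latter a Unix timestamp) and wait when the quota is used up. The sketch below assumes you already have those headers in a plain dictionary with lower-case keys; the function name is hypothetical and no request is made.

```python
import time


def seconds_until_reset(headers, now=None):
    """Seconds to wait before the next request; 0.0 if quota remains."""
    now = time.time() if now is None else now
    remaining = int(headers.get("x-rate-limit-remaining", 1))
    if remaining > 0:
        return 0.0
    # Quota exhausted: wait until the window resets (epoch seconds).
    reset_at = int(headers.get("x-rate-limit-reset", now))
    return float(max(0, reset_at - now))


# Example headers as they might look after the last allowed request.
headers = {"x-rate-limit-remaining": "0", "x-rate-limit-reset": "1000"}
wait = seconds_until_reset(headers, now=940)
print(wait)  # 60.0
```

In a real client you would call time.sleep(wait) before retrying. As far as I know, tweepy can also do this waiting for you through its wait_on_rate_limit option.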

That’s it! After reading this, I am sure you now have a deeper understanding of Twitter’s various APIs. Follow the links embedded in the article on the important terms to learn more. You can follow me on Twitter, see my work on GitHub, and connect with me professionally on LinkedIn.

Akansha Jain
Learn with Akansha

Senior Data Scientist, Builder.ai | Master’s in Data Analytics at Indian Institute of Information Technology & Management, Kerala.