Using Twint for Twitter data gathering

Michael Mejia
Apr 22, 2020

Twitter is a social media platform that has become an ever-growing resource for data collectors. When you first attempt to collect tweets from Twitter, you realize that the Twitter API limits the number of tweets you can obtain. Some websites have millions of tweets saved, but using those archives usually creates a more significant, more time-consuming data cleaning process. Twitter's API also does not give you much to work with when it comes to the location of a tweet. Twint does not return the exact location of a tweet either, but it does give a general area.

For example, you might want tweets from a particular city, but the average person who lives in that city won't end their tweets with a hashtag of the city name. This is where Twint becomes useful. Twint gives two options for retrieving tweets by location. The first is simply the name of the city itself. This can work well for major cities, but for a smaller city that shares its name with a city in another country, it might return mixed results. The second option is to enter the latitude and longitude of the location you are interested in and apply a radius. Both options work, but it's best to test each one to see which produces better results.

From Twint's GitHub page:

Some of the benefits of using Twint vs Twitter API:

  • Can fetch almost all tweets (Twitter API limits to last 3200 tweets only);
  • Fast initial setup;
  • Can be used anonymously and without Twitter sign up;
  • No rate limitations.

Unfortunately, at the time of this blog, Twint does not work in Jupyter notebooks. I recommend PyCharm or any other IDE/text editor that you prefer. I will include the code I used for Twint, as most of the examples given by Twint are done on the command line. Before we begin, we need to have a few libraries installed and Twint's repository cloned.

git clone https://github.com/twintproject/twint.git
cd twint
pip3 install . -r requirements.txt

Then just pip install Twint.

pip3 install twint

This is the code I used when searching for different keywords in tweets.

import twint
import datetime
import pandas as pd

# Configure: keywords to search for
list_ = ["outage", "power out", "power outage", "pwr out", "power failure", "without power", "power is out",
         "electric is out", "electricity out", "without electricity", "lost electricity", "lost power",
         "#poweroutage", "electricity", "failure", "utility", "utilities", "power lines"]

After importing Twint, I created a list of key terms I was looking for on Twitter. Your use might be different such as only searching tweets from a single user or hashtag. I was searching for tweets that included these keys. You can imagine that this would already surpass the 3200 tweets limit for the Twitter API.
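As a quick aside, if you only wanted the tweets of a single user or a single hashtag, a minimal sketch could look like the following (the username and hashtag here are placeholders, not part of my project):

# Minimal sketch of a single-user or hashtag search; placeholder values.
c = twint.Config()
c.Username = "some_user"       # pull tweets from one account
# c.Search = "#poweroutage"    # or search by hashtag instead
c.Pandas = True
twint.run.Search(c)
print(twint.storage.panda.Tweets_df.head())

Back to the keyword setup: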

count = 2   # used later when naming the CSV files
base = "2018-12-31 12:49:05"
base1 = datetime.datetime.strptime(base, '%Y-%m-%d %H:%M:%S')
date_list = [base1 - datetime.timedelta(days=x) for x in range(360)]   # one datetime per day, counting back 360 days

The “count” variable will be used later on when naming CSV files. The next three lines of code may or may not be used. The only reason I included them is in case you want to search tweets down to the hour and minute of a particular day or days. For example, maybe you want to find tweets posted right before a significant event. These lines create a list of datetime objects starting from whatever value you choose; a short sketch of how you might use them follows below. After that, we get to the Twint portion that we will use most often.
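Here is that sketch (not from my original script); the keyword is a placeholder, and passing a full timestamp to Since and Until only works on newer versions of Twint, so double-check on your install:

# Sketch: search a one-day window ending at the base timestamp.
c = twint.Config()
c.Search = "power outage"                                # placeholder keyword
c.Pandas = True
c.Since = date_list[1].strftime('%Y-%m-%d %H:%M:%S')     # 2018-12-30 12:49:05
c.Until = date_list[0].strftime('%Y-%m-%d %H:%M:%S')     # 2018-12-31 12:49:05
twint.run.Search(c)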

for i in list_:
    print(i)
    c = twint.Config()
    c.Search = i
    c.Pandas = True
    c.Since = "2015-01-01"
    c.Until = "2020-01-01"
    # c.Location = True
    # c.Limit = 1000
    c.Near = "Houston"
    c.Custom_csv = ["id", "user_id", "username", "date", "tweet"]
    c.Output = f"hh{count}.csv"

Twint only allows a search for one term or Twitter user at a time, which means a loop is needed to go through all the strings in my list. The c variable (a twint.Config object) defines what we want Twint to return. I am more familiar with pandas DataFrames, so I set c.Pandas = True so that the output is stored as a DataFrame. c.Search is the term that Twint will search for; since I have a list to loop through, I placed the variable i in its place. I have c.Location commented out, but if set to True, Twint will attempt to pull the geolocation of the tweet. This works for some tweets, but not all; it seems to work mostly when users are on public Wi-Fi or hotspots. c.Limit limits the number of tweets you pull. c.Near is used when giving the name of a city. If you were looking for a more exact location, then you would use:

Geo                  (string) - Geo coordinates (lat,lon,km/mi.)

which would end up being c.Geo = "lat,lon,radius", with lat and lon being the coordinates and radius including a unit (km or mi).
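For example, a rough sketch of the coordinate option might look like this (the coordinates and radius below are just illustrative values for Houston, not the ones from my script):

# Sketch of a coordinate-based search; lat/lon/radius are illustrative.
c = twint.Config()
c.Search = "power outage"           # placeholder keyword
c.Pandas = True
c.Geo = "29.7604,-95.3698,20km"     # roughly downtown Houston, 20 km radius
twint.run.Search(c)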

c.Custom_csv = ["id", "user_id", "username", "date", "tweet"]
c.Output = f"hh{count}.csv"

This part of the code lets me pick which columns I want returned. I recommend searching for a single keyword first and displaying the output; the column names I chose might not be what you need, or there may be other columns that matter more for your project.
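One quick way to see which columns are available (again, just a sketch with placeholder values) is to run a small search and print the DataFrame's columns:

# Peek at the columns Twint returns before deciding which ones to keep.
c = twint.Config()
c.Search = "power outage"   # placeholder keyword
c.Pandas = True
c.Limit = 20                # Twint pulls tweets in increments of 20
twint.run.Search(c)
print(twint.storage.panda.Tweets_df.columns)

Continuing inside the loop: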

    c.Output = f"hh{count}.csv"

    # run the search for this keyword, then save the results as a DataFrame
    twint.run.Search(c)

    Tweets_df = twint.storage.panda.Tweets_df
    Tweets_df.to_csv(f"hh_{count}.csv")
    count += 1

As a reminder, we are still inside that for loop! c.Output writes the CSV entries for that one keyword in the list. The bottom portion is an example of Twint returning a DataFrame, in case you have other processing you wish to do before sending it directly to a CSV.
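As an example of that kind of in-between processing, here is a hedged sketch of a couple of clean-up steps you might apply before writing the CSV (it assumes the default id and tweet columns in Twint's DataFrame output):

# Sketch: light clean-up before saving; assumes 'id' and 'tweet' columns exist.
Tweets_df = twint.storage.panda.Tweets_df
Tweets_df = Tweets_df.drop_duplicates(subset="id")         # drop duplicate tweets
Tweets_df = Tweets_df[Tweets_df["tweet"].str.len() > 0]    # drop empty tweet text
Tweets_df.to_csv(f"hh_{count}.csv", index=False)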

You might be thinking that, depending on the number of entries in your list, this will produce a lot of CSV files. That is true, so I added a snippet of code after the loop to combine all the entries into one file.

filenames = [f"hh_{i}.csv" for i in range(2,23)]
combined_los = pd.concat( [pd.read_csv(f"/home/michaelmejia/PycharmProjects/project/{f}") for f in filenames] )

The starting value of your count variable (together with the length of your keyword list) determines that range. Now you have a single CSV file with all of your entries combined. Twint is capable of much more, like searching the tweets of one user or the tweets of that user's followers; a rough sketch of the followers idea follows below. I recommend Twint's GitHub page if you would like to dive deeper into Twint.
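As a rough illustration of the followers idea (not something from my script), twint.run.Followers collects a user's followers, and you can then loop a search over those usernames. The output format can vary between Twint versions, so treat this strictly as a sketch and check what the file actually contains:

# Sketch: collect follower usernames, then search each one's tweets.
# Assumes Followers writes one username per line to the output file.
c = twint.Config()
c.Username = "some_user"          # placeholder account
c.Output = "followers.txt"
twint.run.Followers(c)

with open("followers.txt") as f:
    followers = [line.strip() for line in f if line.strip()]

for name in followers[:5]:        # just the first few, to keep it small
    uc = twint.Config()
    uc.Username = name
    uc.Pandas = True
    twint.run.Search(uc)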
