In-Depth Analysis

Scraping all COVID-19 symptom tweets using Python

Many COVID-19 patients have been documenting their road to recovery on Twitter. This article details how to scrape all that information and store it as a CSV file for decision making and further analysis.

Twitter is a gold mine of data when it comes to people's sentiments. The information captured on Twitter is more structured than data from most other social media platforms, and Twitter also provides an easy way of retrieving it.

This article explains a quick and straightforward way of scraping all COVID symptom related tweets from Twitter in Python using Dmitry Mottl's GetOldTweets3.

Source: Twitter

If you want to jump straight to the code, you can access the Jupyter Notebook for this tutorial on my GitHub here.
The notebook contains the final version of the code described in this article.

GetOldTweets3 was created by Dmitry Mottl and is an improved fork of Jefferson Henrique's GetOldTweets-python.
This package allows you to retrieve a larger number of tweets, including tweets older than a week.
The fields below show the information relating to a tweet that can be retrieved.
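Here is a sketch of those fields (attribute names as listed in the GetOldTweets3 README; the exact set may vary by version):

# Fields exposed on a GetOldTweets3 Tweet object (per the package README):
# tweet.id         - tweet identifier
# tweet.permalink  - URL of the tweet
# tweet.username   - author's handle
# tweet.to         - user the tweet replies to, if any
# tweet.text       - tweet body
# tweet.date       - datetime the tweet was posted
# tweet.retweets   - retweet count
# tweet.favorites  - favorite (like) count
# tweet.mentions   - users mentioned in the tweet
# tweet.hashtags   - hashtags used in the tweet
# tweet.geo        - geo information, if available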

Prerequisites

Using GetOldTweets3 requires no authorization from Twitter; you just need to pip install the library and you can get started right away.
You will also need pandas or PySpark to manipulate the data, which we will discuss later on.

For this, run the following lines of code:

# Pip install GetOldTweets3 if you don't already have the package
# !pip install GetOldTweets3
# Imports
import GetOldTweets3 as got
import pandas as pd
import findspark
findspark.init()
import pyspark
# Always import these for every PySpark analysis
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
conf = pyspark.SparkConf().setAppName('appName').setMaster('local')
sc = SparkContext.getOrCreate(conf=conf)
spark = SparkSession(sc)
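As an aside, newer PySpark code usually builds the session in one step; a minimal equivalent sketch of the same local setup:

# Build the SparkSession directly; the SparkContext comes along with it
spark = SparkSession.builder.master('local').appName('appName').getOrCreate()
sc = spark.sparkContext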

Scraping tweets from a text search query

To query all COVID symptom tweets I focused on three items: the query word, the count of tweets, and the location.

With these three variables I was able to retrieve COVID-19 symptom tweets for the top major hotspots (London, New York, and Paris).

The query below retrieves all tweets near Paris containing the words "COVID symptoms":

text_query = 'COVID symptoms'
count = 7000
geocode = 'Paris'
# Creation of query object
tweetCriteria = got.manager.TweetCriteria().setQuerySearch(text_query).setMaxTweets(count).setNear(geocode)
# Creation of list that contains all tweets
tweets = got.manager.TweetManager.getTweets(tweetCriteria)
# Creating list of chosen tweet data
text_tweets = [[tweet.date, tweet.text, tweet.id, tweet.username, tweet.geo] for tweet in tweets]
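To sanity-check the results before going further, you can print a few of the scraped tweets (a minimal sketch):

# Print the date and text of the first three scraped tweets
for tweet in tweets[:3]:
    print(tweet.date, tweet.text)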

Manipulating the tweets file further

To store our data in a Python dataframe we use the code snippet below:

tweets_df = pd.DataFrame(text_tweets, columns=['Datetime', 'Text', 'TweetID', 'username', 'geo'])

If you are a pandas expert you can write the file to CSV directly; a minimal sketch of that route follows.
I prefer PySpark so as to utilize its processing features, so I convert the pandas dataframe above to a Spark dataframe and then write it to a CSV file, as shown after the sketch:
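A pandas-only write would look roughly like this (the output path here is hypothetical):

# Write the pandas dataframe straight to CSV, skipping Spark entirely
tweets_df.to_csv('C:\\Users\\brono\\First_batch\\Finalextract2_pandas.csv', index=False)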

# Convert the pandas dataframe to a Spark dataframe
tweets_df_spark = spark.createDataFrame(tweets_df)
# Append the tweets to the CSV output (header=True so the later read with header=True works;
# inferSchema is a read option, not a write option, so it is dropped here)
tweets_df_spark.coalesce(1).write.save(path='C:\\Users\\brono\\First_batch\\Finalextract2.csv', format='csv', mode='append', header=True)

Please note that I am appending the tweets to an already existing file, so each run adds its new tweets to the same output. Also note that, because this is Spark, the Finalextract2.csv path is actually a directory of part files; coalesce(1) ensures each write produces a single part file.

Removing duplicates and cleaning the tweets file

To put all this together, I added the following lines of code to remove duplicates for tweets scraped twice.
The final two lines read the appended file, remove duplicates, and write the clean data to a new folder.
Please note that we overwrite the data in this file every time we run the code:

Finaldf = spark.read.csv("C:\\Users\\brono\\First_batch\\Finalextract2.csv", inferSchema = True, header = True)
Finaldf = Finaldf.dropDuplicates(subset=['TweetID'])

Finaldf.sort("TweetID").coalesce(1).write.mode("overwrite").option("header", "true").csv("C:\\Users\\brono\\First_batch\\Cleaned_data.csv")

Putting the code together

I compiled the above into one function shown below:

def text_query_to_csv(text_query, count, geocode):
    # Creation of query object
    tweetCriteria = got.manager.TweetCriteria().setQuerySearch(text_query).setMaxTweets(count).setNear(geocode)
    # .setSince(newest_date1).setUntil(newest_date1)
    # Creation of list that contains all tweets
    tweets = got.manager.TweetManager.getTweets(tweetCriteria)
    # Creating list of chosen tweet data
    text_tweets = [[tweet.date, tweet.text, tweet.id, tweet.username, tweet.geo] for tweet in tweets]
    # Creation of dataframe from tweets
    tweets_df = pd.DataFrame(text_tweets, columns=['Datetime', 'Text', 'TweetID', 'username', 'geo'])

    # Create spark dataframe
    tweets_df_spark = spark.createDataFrame(tweets_df)
    # Append the tweets dataframe to the CSV file
    tweets_df_spark.coalesce(1).write.save(path='C:\\Users\\brono\\First_batch\\Finalextract2.csv', format='csv', mode='append', header=True)
    # Read the appended file, remove duplicates, and write the clean data as one file
    Finaldf = spark.read.csv("C:\\Users\\brono\\First_batch\\Finalextract2.csv", inferSchema=True, header=True)
    Finaldf = Finaldf.dropDuplicates(subset=['TweetID'])
    Finaldf.sort("TweetID").coalesce(1).write.mode("overwrite").option("header", "true").csv("C:\\Users\\brono\\First_batch\\Cleaned_data.csv")

The final code is published on GitHub: https://gist.github.com/cheruiyot/369e5d99489ef55558ce1d5df2087c64

To access the more than 9,000 tweets already scraped, you can reach out to me.

Bena Brin

I am a Risk Consultant working for a Swiss Fintech. I help banks fight fraud using big data technologies. Data Science / Machine Learning.