Twitter…Eat, Sleep, Retreat

Paul Ellis · Published in The Startup · Feb 3, 2021 · 6 min read

I’ve pulled together the discussion and process I used to extract tweets from Twitter over a 24-hour period during the Covid-19 outbreak, specifically for the UK and Ireland. I have omitted the set-up of Docker and MongoDB (we’ll save that for another time) and have only included the insertion of records into MongoDB. I would certainly recommend that you also check out Ty Shaikh’s Udemy course, which utilises Redis and includes an interface to view the results.

The extraction of Twitter data is initially based upon a bounding box whose coordinates, for completeness, include the UK and Ireland. The coordinates were obtained from https://boundingbox.klokantech.com/, utilising the CSV copy & paste option.
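For reference, Twitter’s streaming API expects the box as [SW longitude, SW latitude, NE longitude, NE latitude]; the values below are approximate stand-ins for a UK-and-Ireland box rather than the exact coordinates from the run:

# Approximate bounding box covering the UK and Ireland, in the order
# Twitter's streaming API expects: SW lon, SW lat, NE lon, NE lat.
# These values are illustrative, not the originals from the run.
UK_IRELAND_BBOX = [-11.0, 49.8, 2.1, 61.0]

This list would later be passed to the stream as stream.filter(locations=UK_IRELAND_BBOX).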

Twitter’s streaming API will not combine a bounding box filter with another filter using a logical ‘and’; it only permits a logical ‘or’. An additional filter was therefore created from a list of hashtag keywords which had been collated. I have taken the liberty of adding a couple more hashtags pertinent to the exercise, which was monitoring Covid-related tweets.
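As a sketch, with a few illustrative hashtags standing in for the full collated list:

# Illustrative subset of the collated hashtag keywords, lower-cased
# so matching is case-insensitive; the real list was longer.
KEYWORDS = {"covid19", "coronavirus", "lockdown", "covid"}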

There was the option of using an alternative API which would have read the previous 24 hours, but for this task I chose to utilise an active stream with two termination clauses. The first is the required 24 hours: when the current time exceeds a calculated end date and time, the stream ends. The second guards against too large a volume of data: an additional count clause, with its value set to 1,000,000.

The end date was set using the timedelta function.
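A minimal sketch of both termination clauses, assuming a tweepy StreamListener (class and attribute names are illustrative):

from datetime import datetime, timedelta

import tweepy

# Stop streaming after 24 hours or 1,000,000 tweets, whichever is first.
END_TIME = datetime.utcnow() + timedelta(hours=24)
MAX_TWEETS = 1_000_000

class CovidStreamListener(tweepy.StreamListener):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.count = 0

    def on_status(self, status):
        # Returning False from on_status disconnects the stream.
        if datetime.utcnow() > END_TIME or self.count >= MAX_TWEETS:
            return False
        self.count += 1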

To avoid duplicating text data for key words, the script omitted re-tweets. Similarly, the field containing hashtags was extracted and used to filter on the selected hashtags.
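Sketching this inside on_status, re-tweets can be detected via the retweeted_status attribute, and the tweet’s hashtag entities intersected with the keyword set above:

class CovidStreamListener(tweepy.StreamListener):
    # Extending the on_status sketch from above.
    def on_status(self, status):
        if datetime.utcnow() > END_TIME or self.count >= MAX_TWEETS:
            return False
        # Omit re-tweets to avoid duplicated keyword text.
        if hasattr(status, "retweeted_status"):
            return
        # Keep only tweets whose hashtags intersect the selected keywords.
        hashtags = {h["text"].lower()
                    for h in status.entities.get("hashtags", [])}
        if not hashtags & KEYWORDS:
            return
        self.count += 1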

Another enhancement was the selection of key data from the tweets. Rather than leaving the default of selecting all data, and given the proposed task, it was more pertinent to code a tweet container based upon selected criteria. The advantage is both the space saved and the performance gain when later querying the data. The code can be amended to utilise any field from the Twitter stream.
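The selected fields below are an assumption; the pattern is a plain dictionary of chosen fields inserted into MongoDB via pymongo:

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
collection = client["twitter"]["tweets"]  # database/collection names assumed

def make_tweet_doc(status):
    # Keep only the fields needed for the analysis; this saves space
    # and speeds up later queries. Amend to capture any other field.
    return {
        "id": status.id_str,
        "created_at": status.created_at,
        "text": status.text,
        "user": status.user.screen_name,
        "hashtags": [h["text"] for h in status.entities.get("hashtags", [])],
    }

# Inside on_status: collection.insert_one(make_tweet_doc(status))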

Finally, an error handler was applied for failures in the Twitter stream.
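A common tweepy pattern for this, as a sketch, logs the HTTP status code and disconnects on a 420 rate-limit response:

class CovidStreamListener(tweepy.StreamListener):
    def on_error(self, status_code):
        # Log the failure; returning False disconnects the stream.
        print(f"Stream error: {status_code}")
        if status_code == 420:
            # Back off on rate limiting rather than risk a longer ban.
            return False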

So, moving forward one day, we have accomplished the first task: extracting 24 hours’ worth of Twitter data. The data is of course UTF-8 encoded and contains images and URLs. Given that the task is to obtain keywords, we exported the ‘text’ column content to a file.
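One way to perform that export, assuming the pymongo collection sketched earlier (the output filename is also illustrative):

# Export the 'text' field of every stored tweet to a flat UTF-8 file,
# one tweet per line.
with open("tweet_text.txt", "w", encoding="utf-8") as f:
    for doc in collection.find({}, {"text": 1, "_id": 0}):
        f.write(doc["text"].replace("\n", " ") + "\n")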

We will now read the data file and perform the following operations (see the sketch after the list):

  1. Convert the imported data into lower-case tokens
  2. Remove all tokens which are not alphabetic
  3. Remove all stop words
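A sketch of those three steps using NLTK, assuming the tweet_text.txt export above:

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download("punkt")
nltk.download("stopwords")

with open("tweet_text.txt", encoding="utf-8") as f:
    raw = f.read()

# 1. Lower-case tokens; 2. alphabetic tokens only; 3. drop stop words.
tokens = [t.lower() for t in word_tokenize(raw)]
tokens = [t for t in tokens if t.isalpha()]
stop_words = set(stopwords.words("english"))
words = [t for t in tokens if t not in stop_words]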

We now have a complete list of words from tweets published over the last 24 hours in the UK and Ireland which relate to the Covid-19 outbreak. To ensure that only text and no hashtag data was included, I reviewed the overall tweets and took a sample of hashtags, for example #Kensington:

https://t.co/xlTOXLscPk You are not alone in this #lockdown @CcKensington still caring and serving our community. #Kensington #coronavirus…

Sure enough, the word Kensington did not appear in the list of Twitter words.

The final step is to export those key words in advance of the proposed map-reduce implementation.
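The export itself is brief; tweet_words.txt is the filename the map-reduce step below consumes:

# Write one word per line for the map-reduce job to consume.
with open("tweet_words.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(words))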

In order to execute the map-reduce script, we fire up Docker and launch a Jupyter notebook.

We then upload the text file and create the script to load, summarise and analyse the key words.

The map-reduce job takes the tweet_words.txt file and executes successfully, creating the file output/part-0000.
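A minimal word-count job in the same spirit, sketched with the mrjob library (whether the original script used mrjob is an assumption):

from mrjob.job import MRJob

class MRWordCount(MRJob):
    # Map each word to a 1, then sum the 1s per word in the reducer.
    def mapper(self, _, line):
        for word in line.split():
            yield word, 1

    def reducer(self, word, counts):
        yield word, sum(counts)

if __name__ == "__main__":
    MRWordCount.run()

Run locally with something like python wordcount.py tweet_words.txt -o output, which writes the part-* result files into the output directory.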

The Python script loads the part-0000 text file and removes the symbols “)”, “(” and “ ‘ ” before exporting to the file tweets.txt.
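A sketch of that clean-up step, assuming the paths above:

# Strip the tuple-style punctuation from the map-reduce output and
# write a plain "word count" line per entry.
with open("output/part-0000", encoding="utf-8") as src, \
        open("tweets.txt", "w", encoding="utf-8") as dst:
    for line in src:
        dst.write(line.replace("(", "").replace(")", "").replace("'", ""))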

We then import the data and sort it using the ‘count’ column.
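Assuming tweets.txt holds whitespace-separated word/count pairs, the pandas step might look like:

import pandas as pd

# Load the cleaned word counts and sort by frequency, highest first.
df = pd.read_csv("tweets.txt", sep=r"\s+", names=["word", "count"])
df = df.sort_values("count", ascending=False)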

As we only require the top 50 words, we will export this data to a new pandas DataFrame.
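For example:

# Keep only the 50 most frequent words in a fresh DataFrame.
top50 = df.head(50).reset_index(drop=True)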

With the proliferation of dashboards within industry, plotting the output seemed a beneficial way of providing another perspective on the collected data.

In the first example, a basic bar chart was created.
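A minimal version using pandas’ built-in matplotlib plotting (figure size and labels are illustrative choices):

import matplotlib.pyplot as plt

ax = top50.plot.bar(x="word", y="count", figsize=(12, 6), legend=False)
ax.set_ylabel("count")
plt.tight_layout()
plt.show()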

Given that the image is quite small, an alternative was to utilise cufflinks, which is more malleable and allows you to filter the data.
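With cufflinks bound to pandas, the same DataFrame renders as an interactive plotly chart; a sketch:

import cufflinks as cf

cf.go_offline()  # render plotly charts offline inside the notebook

# Interactive bar chart; hover, zoom and legend toggling come for free.
top50.iplot(kind="bar", x="word", y="count", title="Top 50 tweet words")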

Paul
