How to collect data like a spy — Part 3

How to collect social media data like a pro

We will configure NiFi to collect tweets from Twitter. The overall flow will look like this (there is a rough Python sketch of the same shape just after the list):

  • Connect to the Twitter API
  • Filter the Twitter API by search criteria
  • Extract fields from the tweet JSON
  • Keep only the tweets that have a geolocation in the JSON
  • In a parallel flow, extract locations and people from the tweet via an NLP processor
  • Store all the data in AWS S3
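
Before diving into the individual processors, here is a rough mental model of that flow written as plain Python. Every function below is just a stand-in for a NiFi processor; none of this code is part of the flow itself.

```python
# A rough mental model of the NiFi flow, written as plain Python.
# Every function is a stand-in for a NiFi processor, not real code from this series.

def stream_filtered_tweets(terms):
    """Stand-in for the Twitter processor using the Filter Endpoint."""
    yield {
        "text": "Hello from the Hoe",
        "user": {"screen_name": "someone"},
        "coordinates": {"type": "Point", "coordinates": [-4.143, 50.371]},
    }

def has_geo_location(tweet):
    """Stand-in for RouteOnAttribute: keep only tweets with coordinates."""
    return tweet.get("coordinates") is not None

def store_in_s3(tweet):
    """Stand-in for PutS3Object."""
    print("would upload:", tweet["text"])

for tweet in stream_filtered_tweets(["Plymouth", "London"]):
    if has_geo_location(tweet):
        store_in_s3(tweet)
```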

Twitter Configuration

First of all, we need to configure Twitter and get an access key to allow us to query the API. Follow one of the many guides on how to get a Twitter access key.

Twitter Processor

Configure the NiFi Twitter processor to retrieve tweets

Set the Twitter Endpoint to: Filter Endpoint

This will only retrieve tweets from the API that match the search criteria in “Terms to Filter On”, which is “Plymouth, London” in the configuration above. You could filter by ID instead if you like.

Put in your Consumer Key, Consumer Secret, Access Token and Access Token Secret.

The Twitter processor is now configured.
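
If you are curious what the processor is doing behind the scenes, the sketch below is a rough Python equivalent. It assumes Twitter's v1.1 statuses/filter streaming endpoint and the requests-oauthlib library; neither is part of the NiFi flow, and the credentials are placeholders.

```python
import json
import requests
from requests_oauthlib import OAuth1

# Placeholders: use the four credentials from your Twitter app.
auth = OAuth1("CONSUMER_KEY", "CONSUMER_SECRET", "ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")

# "track" mirrors the processor's "Terms to Filter On" property.
response = requests.post(
    "https://stream.twitter.com/1.1/statuses/filter.json",
    auth=auth,
    data={"track": "Plymouth,London"},
    stream=True,
)

for line in response.iter_lines():
    if line:  # the stream sends blank keep-alive lines
        tweet = json.loads(line)
        print(tweet.get("text"))
```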

Extract Tweet Data

Now we need to extract the tweet data from the JSON response from the API.

This is what the JSON response looks like.

Twitter JSON Response

To route the flow on this data, we need to extract it from the flowfile and put it into attributes. This is achieved using the EvaluateJSONPath processor, which should be configured as follows:

EvaluateJSONPath Processor Configuration

We have created five additional attributes: twitter.handle, twitter.lat, twitter.long, twitter.msg and twitter.user. Each uses a JSON path to identify the data in the JSON file.

If we want to extract the total number of followers for this user, we would need to create an additional attribute, twitter.followers.count, with a value of $.user.followers_count.
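
As a sanity check, the sketch below shows how JSON paths like these map onto a heavily abridged tweet, using the jsonpath-ng Python library rather than NiFi. The exact paths are an assumption based on the standard tweet JSON; compare them against your own EvaluateJSONPath configuration.

```python
# A minimal sketch of what EvaluateJSONPath does, using the jsonpath-ng library
# (not used by NiFi itself). The paths below are plausible mappings for the
# standard tweet JSON; check them against your own processor configuration.
from jsonpath_ng import parse

tweet = {
    "text": "Hello from the Hoe",
    "user": {"name": "Jane", "screen_name": "jane", "followers_count": 42},
    "coordinates": {"type": "Point", "coordinates": [-4.143, 50.371]},  # [long, lat]
}

paths = {
    "twitter.msg": "$.text",
    "twitter.user": "$.user.name",
    "twitter.handle": "$.user.screen_name",
    "twitter.long": "$.coordinates.coordinates[0]",
    "twitter.lat": "$.coordinates.coordinates[1]",
    "twitter.followers.count": "$.user.followers_count",
}

attributes = {}
for name, path in paths.items():
    matches = parse(path).find(tweet)
    attributes[name] = matches[0].value if matches else ""

print(attributes)
```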

Now that the data is extracted into attributes, we can start routing the flowfile based on it.

Our data flow now looks like this.

First two stages of Twitter flowfile

Process the Data

We only want to process tweets that have a geolocation in the JSON response, so we will route only the flowfiles that meet this criterion to the next step.

We are going to use RouteOnAttribute to check whether the tweet JSON contains a geolocation.

We should configure the RouteOnAttribute processor to route a flowfile only when all the conditions are met.

Use RouteOnAttribute to only process JSON files with Geo Location

The values for the properties use ‘Expression Language’; see this guide for a more detailed explanation.

We will route to ‘matched’ only when the lat, long and tweet attributes are not empty, so the processor will only pass tweets that contain all three to the next processor.
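
In NiFi each property would typically hold an Expression Language test such as ${twitter.lat:isEmpty():not()} (shown here only as a plausible example, since the exact expressions are in the screenshot above). The sketch below restates the same “route to matched only when all conditions are met” logic in plain Python.

```python
# A plain-Python restatement of the RouteOnAttribute check: a flowfile is
# routed to "matched" only when every required attribute is non-empty.

def route(attributes, required=("twitter.msg", "twitter.lat", "twitter.long")):
    conditions = [bool(str(attributes.get(name, "")).strip()) for name in required]
    return "matched" if all(conditions) else "unmatched"

print(route({"twitter.msg": "hi", "twitter.lat": "50.371", "twitter.long": "-4.143"}))  # matched
print(route({"twitter.msg": "hi", "twitter.lat": "", "twitter.long": "-4.143"}))        # unmatched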

Now our data flow should look like this.

First three steps of the data flow

Store the Data

Now we are going to store the JSON files in an Amazon S3 bucket for processing with AWS Athena (Part Four).

Configure the Object Key to be ${filename}, which is the current filename of the flowfile. Each API response will have created its own flowfile.

In the Bucket parameter, put the name of an S3 bucket that your access key has permission to list and write files to.

We now need to put our AWS credentials into the Access Key and Secret Key properties. For now we will just use these, but in another flow I will show you how to use the AWS credentials store, so we only need to enter the details once into our data flow.
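
For comparison, here is roughly what PutS3Object does, written with the boto3 Python library (an assumption; boto3 is not part of the flow). The bucket name, filename and credentials are placeholders.

```python
# A rough equivalent of PutS3Object using boto3; bucket, key and credentials are placeholders.
import json
import boto3

s3 = boto3.client(
    "s3",
    aws_access_key_id="ACCESS_KEY",        # in NiFi: the Access Key property
    aws_secret_access_key="SECRET_KEY",    # in NiFi: the Secret Key property
)

tweet = {"text": "Hello from Plymouth"}
filename = "example-tweet.json"            # NiFi uses ${filename} per flowfile

s3.put_object(
    Bucket="your-tweet-bucket",            # the Bucket property
    Key=filename,                          # the Object Key property
    Body=json.dumps(tweet).encode("utf-8"),
)
```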

So now we have fetched the tweets, identified which ones have a geolocation, and put them into an S3 bucket.

On to Part Four: how to search the data using AWS Athena.

The Series

Part One — How to collect data like a spy
Part Two — Getting NiFi up and running
Part Three — How to collect social media data like a pro
Part Four — Creating a database with AWS Athena
Part Five — Connecting RStudio to Athena
Part Six — Creating Maps of the Data in RStudio
Part Seven — Creating an interactive dashboard for your data
