Developing a Serverless Twitter Streamer and Performing Sentiment Analysis
Recently I was watching football and listening to the announcers talk about rule changes, and I wondered what people really thought about the changes to the game. That got me thinking about how we could measure how happy people were about football (and all of the other interesting data we could gather), and ultimately how we could collect and store that data.
One way of collecting people’s feelings (for better or for worse) is with Twitter, where people feel free to express their opinions. For this post, I wanted to come up with a good way of collecting and storing tweets and analyzing their sentiment in real time. While I was working on this streamer I was using an older computer that died on me halfway through the project. Because of that I couldn’t rely on my local machine, and I wanted to make the solution serverless so I didn’t have to depend on any infrastructure.
Building a serverless Twitter scraper wasn’t as straightforward as I initially hoped. Lambda functions time out after 15 minutes, and Kinesis streams need something streaming data to them in order to work. There were solutions that used EC2 instances or ran Python locally, but that defeated the purpose of a serverless implementation. The solution I settled on was to use Lambda functions to collect and store the data, while using Step Functions to start a new Lambda as the previous one timed out. I could repeat this a specified number of times in order to collect the data for an entire game. Using this process I could easily stream tweets in real time without running any computer or instance. The whole process can be described in a CloudFormation template file that deploys all of the resources at once. Using the template file allows for a modular design and easy updating in the future. The code can be found on my GitHub. If you want to deploy this in your own environment you can clone the repo and make the changes you need to track your own hashtags:
Before we go too far into explaining the code we need to configure a few things. First, you need Twitter API keys, which you can create here:
We will store our Twitter API keys in AWS Secrets Manager and access them from our Lambda. To do this, go to the AWS console and open the Secrets Manager page. From here we will enter the API keys as key/value pairs for each item:
Then we will name the secret. For this application I named it prod/twit_api. The secret keys need to be named:
If you change this name or how you named the secret keys, you will have to update them in the template.yaml.
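As a sketch of how a Lambda can read these keys at runtime, something like the following works. Note that the key names in KEY_NAMES and the region are assumptions for illustration — match them to whatever you actually entered in Secrets Manager:

```python
import json

SECRET_NAME = "prod/twit_api"  # the secret name used in this post

# Key names here are assumptions -- match them to the key/value
# pairs you entered in Secrets Manager.
KEY_NAMES = ("consumer_key", "consumer_secret",
             "access_token", "access_token_secret")

def parse_secret(secret_string):
    """Turn the JSON SecretString into a dict of Twitter credentials."""
    creds = json.loads(secret_string)
    return {name: creds[name] for name in KEY_NAMES}

def get_twitter_keys(secret_name=SECRET_NAME, region="us-east-1"):
    """Fetch and parse the Twitter API keys from Secrets Manager."""
    import boto3  # imported lazily so parse_secret can be tested offline
    client = boto3.client("secretsmanager", region_name=region)
    response = client.get_secret_value(SecretId=secret_name)
    return parse_secret(response["SecretString"])
```

The Lambda's execution role needs secretsmanager:GetSecretValue permission on the secret for this call to succeed.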
Second you need to configure the AWS CLI as described here:
You will also need to configure the SAM CLI as described here:
You will also need to create an S3 bucket for SAM to drop its deployment artifacts in. SAM only needs the account you deploy with to be able to read and write this bucket, so it does not strictly need to be public; in my setup I opened it up by navigating to the S3 interface in the console and making sure that none of the boxes were checked in the permissions:
The code has been used to stream various teams. You will want to change the TEAM1_HASHTAGS and TEAM2_HASHTAGS variables in the template.yaml file to the teams you want to track. The code is set up to stream each team separately for visualization. For example if I wanted to track the Chicago Bears and the Green Bay Packers I would set the hashtags like:
TEAM1_HASHTAGS: '#ChicagoBears,#Bears,#BearsFootball'
TEAM2_HASHTAGS: '#GreenBayPackers,#Packers,#PackersFootball'
You should also consider how long you want to stream for. This is controlled by the count in the step function and the Lambda timeout. The Lambda timeout determines how long each streaming Lambda runs, and the count determines the number of times it executes. For example, if I set the timeout to 900 seconds (15 minutes) and the count to 5, this code will run for 15 * 5, or 75 minutes. You should update both the timeout for the streaming Lambda and the count in the step function to get the duration you want.
Great, now that we have everything set up, we are ready to deploy our CloudFormation solution. CloudFormation allows us to specify the different pieces of the deployment and deploy them all at once. We will go over each piece of this template to get a better feel for what we are deploying in this project. If you aren’t interested in seeing the inner workings of this code you can skip to the deployment section to see the commands to deploy.
This project can be broken up into 3 pieces:
- Stream and initial storage
- Dealing with sentiment
- Visualize the results
Streaming and Initial Storage
The first part of this code is the Lambda that streams tweets. This Lambda function is code that could be taken out of the Lambda format and run locally. It tracks the specified hashtags and, when a new tweet is found, stores it in DynamoDB. We can add attributes in the future and it won’t break our table (this functionality will be used when we add sentiment to our table).
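A minimal sketch of what that streaming Lambda can look like, assuming the tweepy 3.x streaming API, a DynamoDB table named tweets with a tweet_id key, and credentials passed as environment variables (all of those names are assumptions — in this post the credentials actually live in Secrets Manager):

```python
import os

def build_item(tweet_id, text, created_at, team):
    """Shape a tweet into the DynamoDB item we store. Extra attributes
    (like sentiment) can be added later without breaking the table."""
    return {"tweet_id": str(tweet_id), "text": text,
            "created_at": created_at, "team": team}

def lambda_handler(event, context):
    # tweepy/boto3 imported inside the handler so build_item stays testable offline
    import boto3
    import tweepy

    table = boto3.resource("dynamodb").Table("tweets")  # table name is an assumption
    team1 = os.environ["TEAM1_HASHTAGS"].split(",")
    team2 = os.environ["TEAM2_HASHTAGS"].split(",")

    class TeamListener(tweepy.StreamListener):  # tweepy 3.x streaming API
        def on_status(self, status):
            # Tag the tweet with whichever team's hashtags it mentions
            team = "team1" if any(h.lower() in status.text.lower()
                                  for h in team1) else "team2"
            table.put_item(Item=build_item(status.id, status.text,
                                           str(status.created_at), team))

    # For brevity the keys come from environment variables here;
    # the post pulls them from Secrets Manager instead.
    auth = tweepy.OAuthHandler(os.environ["CONSUMER_KEY"],
                               os.environ["CONSUMER_SECRET"])
    auth.set_access_token(os.environ["ACCESS_TOKEN"],
                          os.environ["ACCESS_TOKEN_SECRET"])
    stream = tweepy.Stream(auth=auth, listener=TeamListener())
    stream.filter(track=team1 + team2)  # blocks until the Lambda times out
```

The blocking filter call is the reason the Lambda timeout doubles as the streaming duration.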
There is another lambda that is used in this step that will track the number of times the tweet streaming lambda has executed. This lambda will pass information on the number of executions from one lambda call to the next.
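This counter Lambda can be as small as incrementing a number in the event and passing it along. The field names count and max_count here are assumptions — they just need to match whatever the state machine checks:

```python
def lambda_handler(event, context):
    """Track how many times the streaming Lambda has run.

    The state machine feeds the output of one execution in as the input
    of the next, so the running count lives in the event itself.
    """
    count = event.get("count", 0) + 1
    max_count = event.get("max_count", 5)  # default of 5 runs is an assumption
    return {"count": count, "max_count": max_count}
```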
The process of calling the lambdas is handled with step functions. In this instance the state machine we have created looks like:
The Step Functions state machine defines the flow of information. In this instance the state machine checks whether the total count has been reached; if it has, it ends the execution, and if not, it executes the stream-tweets Lambda.
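As a sketch, the Amazon States Language definition behind that loop looks roughly like this (expressed here as a Python dict; the state names and the $.count field are assumptions, and the Resource ARNs are placeholders):

```python
import json

STATE_MACHINE = {
    "StartAt": "TrackCount",
    "States": {
        # Increment the execution count (the counter Lambda)
        "TrackCount": {
            "Type": "Task",
            "Resource": "<count-lambda-arn>",  # placeholder ARN
            "Next": "DoneYet",
        },
        # End if we've hit the total count, otherwise stream again
        "DoneYet": {
            "Type": "Choice",
            "Choices": [{
                "Variable": "$.count",
                "NumericGreaterThanEquals": 5,  # the step function count
                "Next": "Finish",
            }],
            "Default": "StreamTweets",
        },
        # The streaming Lambda runs until it times out, then we loop back
        "StreamTweets": {
            "Type": "Task",
            "Resource": "<stream-lambda-arn>",  # placeholder ARN
            "Next": "TrackCount",
        },
        "Finish": {"Type": "Succeed"},
    },
}

definition_json = json.dumps(STATE_MACHINE, indent=2)
```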
Once the state machine is executed we will be able to store tweets in a DynamoDB table in real time.
Dealing with Sentiment
Once we have tweets populating the DynamoDB in real time we need to analyze the new information in real time. To do this we will create a lambda which will poll the DynamoDB table and check for new rows. We can use this lambda to analyze the sentiment of each new tweet and store this information back in the DynamoDB table.
To perform the sentiment analysis we will use an AWS service called Comprehend. This service has an API you can call that returns positive/negative/neutral scores. This simplifies our process: instead of having to train our own model, we can leverage this pretrained one.
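One common way to wire this up is with DynamoDB Streams, which invokes a Lambda with each new row; a sketch under that assumption, again using an assumed table named tweets with a tweet_id key and a text attribute:

```python
def new_tweet_text(record):
    """Pull the tweet text out of a DynamoDB Streams record, or return
    None if the record isn't a freshly inserted tweet."""
    if record.get("eventName") != "INSERT":
        return None
    return record["dynamodb"]["NewImage"]["text"]["S"]

def lambda_handler(event, context):
    import boto3  # imported lazily so new_tweet_text can be tested offline
    comprehend = boto3.client("comprehend")
    table = boto3.resource("dynamodb").Table("tweets")  # table name is an assumption

    for record in event.get("Records", []):
        text = new_tweet_text(record)
        if text is None:
            continue
        # Comprehend returns Positive/Negative/Neutral/Mixed scores in [0, 1]
        scores = comprehend.detect_sentiment(
            Text=text, LanguageCode="en")["SentimentScore"]
        # Write the scores back onto the same row
        table.update_item(
            Key={"tweet_id": record["dynamodb"]["Keys"]["tweet_id"]["S"]},
            UpdateExpression="SET positive = :p, negative = :n",
            # stored as strings since DynamoDB doesn't accept Python floats
            ExpressionAttributeValues={":p": str(scores["Positive"]),
                                       ":n": str(scores["Negative"])},
        )
```

Because we only SET new attributes, the original tweet item is untouched — this is the schemaless flexibility mentioned above.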
From here we can go in several different directions. If we are just interested in sentiment as the game goes on, we can collect tweets and visualize the sentiment scores we just computed. We could also enrich our records with additional information, or bring in other visualization tools like Tableau or QuickSight. Because the DynamoDB table doesn’t require a specific schema, we could update the lambda function we used to score the tweets and the whole process would still work.
For this example we are going to publish metrics (positive and negative sentiment for each team) that can be visualized in CloudWatch.
Visualizing the Results
One way we can visualize the data in real time is by using Cloudwatch metrics. Metrics allow us to publish a single value to Cloudwatch and visualize it on a plot. This is a somewhat simple application, but in our case very useful. To create the Cloudwatch plot we will publish the metric in our lambda that evaluates sentiment.
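A sketch of publishing those metrics from the sentiment Lambda. The Sentiment namespace matches the name used in the CloudWatch console later in this post, but the exact metric and dimension names here are assumptions:

```python
def build_metric_data(team, positive, negative):
    """Shape one tweet's sentiment scores as CloudWatch metric data."""
    dims = [{"Name": "Game_Sentiment", "Value": team}]  # dimension name is an assumption
    return [
        {"MetricName": f"{team}_Positive", "Dimensions": dims, "Value": positive},
        {"MetricName": f"{team}_Negative", "Dimensions": dims, "Value": negative},
    ]

def publish_sentiment(team, positive, negative):
    """Push one tweet's scores to CloudWatch as custom metrics."""
    import boto3  # imported lazily so build_metric_data can be tested offline
    cloudwatch = boto3.client("cloudwatch")
    cloudwatch.put_metric_data(
        Namespace="Sentiment",
        MetricData=build_metric_data(team, positive, negative),
    )
```

Each tweet publishes a single value per metric; CloudWatch then averages those values over whatever period we pick when plotting.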
Now that we have gone over each piece of the code we are ready to deploy our template. To deploy you need the files that contain the lambdas, the requirements.txt (which describes which packages are needed for the environment), and the template file. With all of these files in our directory we are ready to deploy. To do this:
Navigate to the folder which contains the requirements.txt file and type:
pip install -r twit_stream/requirements.txt -t .
Then navigate to the folder with the template.yaml file and type (changing yours3bucket to the name of the bucket you created earlier):
sam build
sam package --template-file template.yaml --s3-bucket yours3bucket --output-template-file packaged.yaml
sam deploy --template-file ./packaged.yaml --stack-name twitstream-stack --capabilities CAPABILITY_IAM
Great! Now you can navigate to your AWS console and see all of the pieces. To kick off our job we will navigate to the Step Functions page in the console. Here we can click Start Execution and our job will execute.
If we watch the step functions we should see the execution start and we can watch it go through the process. When the code is complete we will see the job succeeded and each step in the step function is green.
To see the sentiment that we pushed as a metric, we will go to CloudWatch in the console. From there we will find our Sentiment namespace and click Game_Sentiment. From here we can choose which metric we want to see: positive and negative sentiment for each of Team1 and Team2. CloudWatch will average the values it receives from our Lambda within the period we specify. Put another way, if the period is set to 5 minutes it will average the tweet sentiment over those 5 minutes. Increasing this period will smooth out the plot, but if it’s too long it could smooth away real changes.
We can use our process to see the sentiment for a given team. The previous plot shows the positive and negative sentiment for a specific team during a recent football game. Based on the sentiment for one of the teams, can you guess who won this game?
This was a game between the LA Rams and the Seattle Seahawks, and the sentiment shown is for Seattle. The result was a win for the Rams, who led throughout the game. Seattle scored in the third quarter, bringing the game a little closer, which resulted in the increase in positive and decrease in negative sentiment. Using our approach we can get insight into how the game is progressing without knowing the score.
Predicting the score of a football game using Twitter is a fun application, but this idea can also be applied in the business world. The same approach can be used to determine how your customers feel about your product and a competitor’s. Deploying the latest and greatest algorithm is great, but data science is more about deploying the solution that makes the most sense. In this case going serverless decreased the cost and complexity of the infrastructure and made the solution easy to update and extend in the future.