Fun with Twitter Bots: Part-2

Creating a twitter app, storing tweets, and analyzing tweets https://github.com/kimoyerr/twitterbot

Krishna Yerramsetty
7 min read · Mar 4, 2019

This is part 2 of my original article about creating a Twitter bot. The Heroku app created in the first part retweeted tweets based on a keyword search. In this article, I will slightly modify the original Heroku app to store tweets that match the search criteria by connecting to a MongoDB database deployed in the cloud.

I drew inspiration and knowledge from the following excellent blog posts on this subject:

  1. https://itnext.io/improve-your-twitter-bot-with-mongodb-1f1e51e632d4
  2. https://github.com/JulianJoseph/ReTweeter/wiki/MongoDB-Integration
  3. https://ljvmiranda921.github.io/notebook/2017/02/24/twitter-streaming-using-python/
  4. https://chrisalbon.com/python/other/mine_a_twitter_hashtags_and_words/

As before, I tried to borrow what worked for me from these posts, and create my own version.

The above is a simplified depiction of the workflow. My first post describes how to set up your Twitter app and get the Python scripts running with Tweepy. In the rest of this post, I will talk about setting up the MongoDB cluster, creating Python scripts that use PyMongo and Tweepy to retrieve and store tweets, and finally deploying this app on Heroku.

Setting up your MongoDB Atlas Account and Free Cluster

MongoDB is one of the most popular NoSQL document-oriented database solutions. Documents are stored in a JSON-like format, which has the advantage of human readability while being easy to parse automatically. MongoDB Atlas is a Database as a Service (DBaaS) offering built around MongoDB that makes it easy to create and manage the clusters hosting the database. There are other DBaaS providers for MongoDB, but MongoDB Atlas worked for me, and I can attest that it took me just 15 minutes to set up my database cluster.

Create MongoDB Atlas account and cluster

I followed the guide here to set up my account and create a free-tier cluster. MongoDB Atlas also lets you create projects and associate specific clusters with them. Here is how my projects are set up:

The project TwitterBot has one cluster associated with it, which I named TweetDB. I chose the M0 tier (shared RAM, 512 MB storage) for this cluster; it is free forever but lacks features such as backups and sharding. If these features are important to you, you can select one of the paid options. I've only tested the free M0 tier and will try other options in the future.

Connecting to your cluster

Now that we have our first cluster, let's make sure we can connect to it from our local computer or from our Heroku app. Select the cluster you just created and click on the 'CONNECT' tab, and you will see several options to connect: with the Mongo Shell, with MongoDB Compass, or using your own application (Python, Java, Node.js, etc.). The last option is what we want, since we will be using Python and PyMongo to connect. Once you select this option, you will see a screen similar to the one below:

Copy the "Short SRV connection string" and keep it safe. Replace the password placeholder with the password for your MongoDB Atlas account. We will use this string in our Python script (in the next section below) to connect Python to our cluster. Also, make sure to whitelist the IP addresses of the machines from which you will connect to MongoDB Atlas by selecting the Security tab on your cluster. Adding 0.0.0.0/0 will let any IP address access your cluster:
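
For reference, the Short SRV connection string has roughly the following shape; the username, password, and cluster host below are placeholders, not real values:

mongodb+srv://<username>:<password>@<your-cluster>.mongodb.net/test?retryWrites=true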

Create a Database and Collection

MongoDB is structured in terms of collections, unlike relational databases, which are organized in tables. A single database can have multiple collections, and each collection holds multiple documents in JSON format:

In our case, let's keep it simple by using one database and one collection within that database. I called my database "tweets" and my collection "crispr_tweets", to collect all tweets related to CRISPR.
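
To make this concrete, here is the shape of a document that will end up in the crispr_tweets collection. The fields match the script later in this post; the values here are made up:

{
  "id_str": "1102587259162054656",
  "text": "New CRISPR base-editing results look promising...",
  "created_at": "2019-03-04 15:30:00",
  "user_name": "some_user",
  "location": "Boise, ID"
}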

Setting up Python scripts that search tweets and store them in MongoDB Atlas

Here I will walk through the steps for writing a simple script that listens for all tweets with certain keywords and then pushes these tweets to the MongoDB database created above.

Install Tweepy and PyMongo

I used the easy-to-use Tweepy library to get public tweets and search them, and PyMongo, which is the recommended way to interact with MongoDB from Python, to store them. To install Tweepy and PyMongo in a conda environment, follow these steps:

conda create -n twitterbot
# Activate the environment before installing packages
conda activate twitterbot
conda install -c conda-forge tweepy
conda install -c anaconda pymongo
# To deactivate the environment when done
conda deactivate

To test that you are able to connect to your MongoDB cluster and database, type this within a Python session once you activate your environment:

import pymongo

# MongoDB Atlas connection
mongo_client = pymongo.MongoClient("Your Short SRV connection string")  # Add your SRV string, with the password filled in, as detailed in the section above
my_db = mongo_client.your_database_name  # Change this to your database name
print(my_db)
my_collection = my_db.your_collection_name  # Change this to your collection name
print(my_collection)

Python script for running the bot on your local environment

Next, create a script to:

  1. Create a Twitter stream listener to download tweets and metadata in real time
  2. Search for tweets with a specific keyword
  3. Insert these filtered tweets and their metadata into a MongoDB collection

The Twitter API can run in streaming mode, where it establishes a persistent connection with your client and constantly pushes data to it. This is generally preferable when you want high data throughput with minimal latency. Tweepy's StreamListener class can be used to connect to Twitter's streaming API and receive the data. You need to create your own class that inherits from StreamListener and override its on_status method to extract the tweet text and metadata and insert them into your MongoDB collection. The Twitter streaming API sends several kinds of data in JSON format, including tweet data, keep-alive signals, and system messages, all of which are consumed by Tweepy's StreamListener object. It is also advisable to catch Twitter's rate-limiting errors and disconnect your client when they occur, to ensure you are not locked out by Twitter.

Here is the code I used to do all this. Please fill in your own app's keys and tokens to run it, and remember not to publish your keys, tokens, or connection strings 🔐🔐🔐
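
In case the gist embed does not render, here is a minimal sketch of what streaming_bot.py looks like, assuming Tweepy 3.x. The environment-variable names (CONSUMER_KEY, MONGO_SRV, etc.) and the hard-coded "crispr" keyword are my placeholders, so adapt them to your own setup:

import os

import pymongo
import tweepy

# Read secrets from environment variables so they never appear in the code
auth = tweepy.OAuthHandler(os.environ["CONSUMER_KEY"], os.environ["CONSUMER_SECRET"])
auth.set_access_token(os.environ["ACCESS_TOKEN"], os.environ["ACCESS_TOKEN_SECRET"])

# Connect to the MongoDB Atlas cluster, then to the database and collection created above
mongo_client = pymongo.MongoClient(os.environ["MONGO_SRV"])
collection = mongo_client.tweets.crispr_tweets

class StreamToMongoListener(tweepy.StreamListener):
    """Stores a few selected fields from each matching tweet in MongoDB."""

    def on_status(self, status):
        # Keep only the fields we care about, to save space on the free cluster
        doc = {
            "id_str": status.id_str,
            "text": status.text,
            "created_at": status.created_at,
            "user_name": status.user.screen_name,
            "location": status.user.location,
        }
        collection.insert_one(doc)

    def on_error(self, status_code):
        # HTTP 420 means we are being rate-limited; returning False disconnects
        # the stream so Twitter does not lock us out
        if status_code == 420:
            return False

listener = StreamToMongoListener()
stream = tweepy.Stream(auth=auth, listener=listener)
stream.filter(track=["crispr"])  # the keyword(s) to listen for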

As you can see from the code above, I am not storing all the metadata associated with each tweet. Instead, I store only the tweet text, tweet ID, creation time, the creator's location, and the creator's user name. This is to make sure I can store enough tweets without running out of space on my MongoDB cluster, which only has 512 MB of storage.

Now, you can run the above script locally on your computer, or you can run it on a remote server. There are several options for running your apps remotely, generally referred to as Platform as a Service (PaaS) solutions. Good examples in this category are Heroku, Google App Engine, and AWS Elastic Beanstalk. They provide a simplified framework for getting your app up and running without you having to worry about setting up the infrastructure, operating system, and other middleware.

Running your app on Heroku

I used Heroku here since I hadn't used it before, and so far it has been quite an easy experience. Refer to Part-1 of this post, where I show how to set up Heroku and deploy an app that runs a Python script named bot.py. I modified my original Heroku app's Procfile from Part-1 to run the new streaming_bot.py script instead of the original bot.py script.
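
Assuming the same worker-dyno setup as in Part-1, the updated Procfile is just a one-line change:

worker: python streaming_bot.py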

Now that we have all the pieces in place, start the Heroku app by running heroku ps:scale worker=1 -a 'your-heroku-app-name'. This will run the Python script in your GitHub repo and start streaming and storing tweets based on the keyword searches you specified within that script. To stop the Heroku app at any time, run heroku ps:stop worker -a 'your-heroku-app-name'.
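
To confirm the worker is actually streaming, you can tail the dyno's output with the standard Heroku CLI logs command:

heroku logs --tail -a 'your-heroku-app-name'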

That was quite easy in retrospect, but I had to learn a few things along the way, which is always good. I am slowly collecting all CRISPR tweets using this app. My plan is to do some analyses on these tweets in the coming weeks. Thanks for reading!

Full code can be found at: https://github.com/kimoyerr/twitterbot

