Streaming 40 tweets a second from Twitter with Redis, Python, and Elasticsearch
--
Check out Real-time.ml to see Tweet streaming in action!
Ever wanted to see what average people are saying in real time on Twitter? This summer I was really interested in exploring the Twitter streaming API. After watching the first Democratic debate, I thought it would be awesome to see the Tweets people post in real time about the debates! In this post I’ll walk you through how to ingest tweets in real time the easy way.
Prerequisites: Docker, Docker Compose, and a Twitter developer account.
The first step is to clone and navigate to the repo:
git clone https://github.com/cpgeier/real-time-stream.git && cd real-time-stream
This repo has a submodule, which is the main repo I use for real-time.ml. To get the streaming scripts, you also need to pull the submodule into your folder:
git submodule update --init
The next step is to copy or move the stream.env.list file outside of the real-time-stream folder. This ensures you don’t commit any of your secrets to GitHub. You could set up a .gitignore or use git-secrets, but this is a simple solution for now.
Right now, YOUR_CONSUMER_KEY, YOUR_CONSUMER_SECRET, and friends are placeholders for the keys and tokens from a Twitter app. To get the keys and tokens, you will need to register a Twitter app. After the app is created, you’ll be able to access this page with your keys and tokens:
With these values you can then fill out the four keys and tokens of your stream.env.list file.
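For reference, a filled-in stream.env.list looks something like the sketch below. The variable names on the left are my guess at the template’s names (keep the names the repo’s template actually uses), and the placeholders stay as-is until you paste in your own values:

```
# stream.env.list — sketch; variable names follow the repo's template
CONSUMER_KEY=YOUR_CONSUMER_KEY
CONSUMER_SECRET=YOUR_CONSUMER_SECRET
ACCESS_TOKEN=YOUR_ACCESS_TOKEN
ACCESS_TOKEN_SECRET=YOUR_ACCESS_TOKEN_SECRET
ES_URI=https://admin:admin@elasticsearch:9200/
```

Because this file holds live credentials, it should never be committed, which is exactly why it gets moved out of the repo folder.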
We’re almost done! While you’re inside the real-time-stream folder run:
docker-compose build
If you want to jump right into the Tweets run this command:
docker-compose up
However, I would suggest taking a minute to look at the docker-compose file below, which is included in the repository.
This Compose file creates five services. The last three use external images:
- Redis: In-memory caching
- Open Distro for Elasticsearch: Document storage and search
- Kibana: Visualizing elasticsearch data
I chose Redis for caching the Twitter stream because of how fast Redis is. The ingesting script needs to be as fast as possible because the Twitter API will disconnect you if you aren’t processing Tweets quickly enough. Indexing Tweets into Elasticsearch is a slower process, which is why you need this middle caching layer.
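To make the idea concrete, here is a minimal, runnable sketch of the buffering pattern, using an in-memory deque as a stand-in for the Redis list so no server is needed. The real scripts talk to the redis service with redis-py; the function names and batch size here are my own:

```python
from collections import deque

# Stand-in for the Redis list: appendleft ~ LPUSH, pop ~ RPOP.
buffer = deque()

def ingest(tweet_json):
    # Fast path: the Twitter listener only enqueues and returns immediately,
    # so it never falls behind the stream.
    buffer.appendleft(tweet_json)

def index_batch(batch_size=100):
    # Slow path: the indexer drains the buffer at its own pace; in the real
    # pipeline this batch would be bulk-indexed into Elasticsearch.
    batch = []
    while buffer and len(batch) < batch_size:
        batch.append(buffer.pop())
    return batch

for i in range(5):
    ingest('{"id": %d}' % i)
print(index_batch())  # tweets come back in arrival order
```

Because the producer never waits on Elasticsearch, a slow indexing step can’t cause the Twitter API to drop the connection.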
I chose Open Distro for Elasticsearch because it is open-source and an easy solution for searching and filtering text data. I also included the Kibana image to provide an easy way to see the tweets coming into Elasticsearch.
The other two services are built from Dockerfiles in the real-time.ml/stream/ submodule folder:
- twitter_to_redis: Python file to ingest streaming data from Twitter
- redis_to_elasticsearch: Python file to index Twitter data in Elasticsearch
They are relatively simple Python scripts turned into Docker containers. The twitter_to_redis.py file contains the most interesting logic:
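Before digging into the script, it helps to know how Twitter’s track keywords behave: commas act as OR, and spaces within a phrase act as AND. Here is a simplified, stdlib-only illustration of that rule — this is not the repo’s code, and the keywords are just examples:

```python
def matches_track(text, track_terms):
    # Simplified model of Twitter's `track` matching: a tweet matches if
    # ANY term matches, and a term matches if ALL of its space-separated
    # words appear in the tweet. (Real matching also considers hashtags,
    # URLs, and screen names.)
    words = set(text.lower().split())
    return any(all(w in words for w in term.lower().split())
               for term in track_terms)

TRACK = ["democratic debate", "debates"]  # example keywords

print(matches_track("Watching the democratic debate tonight", TRACK))  # True
print(matches_track("The debate was long", TRACK))                     # False
```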
Modify the TRACK variable if you want to see Tweets using other keywords. Refer to track on the Twitter API page to see how these keywords work. One cool thing to note about this script is how you connect to the Redis service.
cache = redis.Redis(host='redis', port=6379)
You might say: wait, that’s not a hostname. Shouldn’t it look something like “0.0.0.0” or “localhost”?
Docker networking actually figures all of that out for you, and you can refer to services by their container name!
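As a stripped-down sketch of how such a compose file wires services together — the build path and settings here are illustrative, not copied from the repo — every service on the same Compose network can resolve the others by service name:

```yaml
version: '3'
services:
  redis:
    image: redis            # other containers reach this as host "redis"
  twitter_to_redis:
    build: ./real-time.ml/stream/twitter_to_redis   # illustrative path
    env_file:
      - ../stream.env.list
    depends_on:
      - redis
```

Compose puts all of these services on a shared network and registers each service name in DNS, which is why `host='redis'` just works.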
This also explains the URLs you see below in the other files.
# From docker-compose.yaml
ELASTICSEARCH_URL: https://elasticsearch:9200
ELASTICSEARCH_HOSTS: https://elasticsearch:9200
# From stream.env.list
ES_URI=https://admin:admin@elasticsearch:9200/
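Everything the indexer needs — credentials, host, and port — is packed into that one URI. A quick stdlib sketch of how it decomposes (the repo’s script may parse it differently):

```python
from urllib.parse import urlsplit

# ES_URI from stream.env.list bundles credentials, host, and port in one URL.
parts = urlsplit("https://admin:admin@elasticsearch:9200/")

print(parts.hostname)  # "elasticsearch" — the Docker service name, not an IP
print(parts.port)      # 9200
print(parts.username, parts.password)  # Open Distro's default admin/admin
```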
So after you’ve learned a bit about how this application works, you can deploy the docker-compose.yaml file with:
docker-compose up
It will take a minute or two for the Elasticsearch server to start up, and there will be a lot of warning logs you can ignore in the beginning. Be patient; as long as no container has exited, after a minute or two you should see logs like:
You are now streaming tweets from Twitter into Elasticsearch! To see the fruits of your hard work go to:
http://localhost:5601/
And log in with admin and admin as your username and password, respectively.
After specifying the twitter index in Kibana, you can start to visualize your data in the Discover tab! You can now filter and explore live Twitter data!
If you got this far in the tutorial, congrats! Be sure to star the repo on GitHub so other people can find it. During the second Democratic debate I saw the rate climb above 40 Tweets per second, and this architecture worked flawlessly. In the future, I hope to do more with this data, as the tools used in this post can easily be adapted to collect tweets for other keywords.
Stay tuned for more blog posts on how I built real-time.ml!