Creating a Web Application to Display TV Show Ratings for All Episodes
I have been looking for a project that could showcase my technical abilities when applying for work as a developer. Recently I have been seeing these charts created on /r/dataisbeautiful, but the creators were using spreadsheets and Photoshop to make them. Here are a few examples from Reddit:
They are very creative and aesthetically pleasing, but it seems difficult to gather this data for the thousands of TV shows on IMDb.
Perfect inspiration for creating software to replace tedious human work!
So I created TVCharts, a full-stack web application that takes the raw data files from IMDb datasets, stores them in a database, and displays them as charts of episode ratings over the entire run of a series.
Here are the results:
The tech stack used for this project:
- React / Redux
- NodeJS / Express
- MongoDB
- Nginx / letsencrypt
- Docker
- AWS CDN (S3 + CloudFront)
The full project is available for review (and testing) on GitHub: https://github.com/ianthekid/tvcharts
Front-end
This project uses React (bootstrapped with create-react-app) for the front-end. The interface requirements are simple: search for TV shows and display the “chart” page above.
We will need to search a database, so we will set up an API to retrieve data from a MongoDB database hosted on the back-end. The API functions in client/src/lib/api.js communicate with server/api/api.js.
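For illustration, here is a minimal sketch of what one of those client-side calls could look like; the endpoint path, header name, and REACT_APP_API_KEY variable are assumptions for the example, not the exact code in client/src/lib/api.js:
// hypothetical client-side API helper (endpoint and variable names assumed)
export async function fetchEpisodes(parentTconst) {
  const res = await fetch(`/api/show/${parentTconst}`, {
    headers: { 'x-api-key': process.env.REACT_APP_API_KEY }
  });
  if (!res.ok) throw new Error(`API request failed: ${res.status}`);
  return res.json(); // array of episode objects with ratings and vote counts
}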
First, we add a key for the API with an .env variable and then build the assets to deploy. Running yarn build outputs the assets into server/frontend for our Docker deployment.
$ cd client
$ cp .env.example .env
$ yarn install
$ yarn build
Here’s how the React components piece together:
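As a rough, assumed sketch of that structure (the actual component names in client/src will differ, and the Redux store wiring is omitted):
// rough sketch only; component names are placeholders, Redux store omitted
import React, { useState } from 'react';
import Search from './components/Search'; // searches titles through the API
import Chart from './components/Chart';   // renders episode ratings for one show

export default function App() {
  const [show, setShow] = useState(null);
  return (
    <div>
      <Search onSelect={setShow} />
      {show && <Chart parentTconst={show.tconst} />}
    </div>
  );
}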
Back-end
The back-end is made up of separate services that deploy into their own containers, so we can continuously update them and refresh our data from IMDb without having to completely rebuild everything every time.
These services are:
- API: This communicates between our database and our front-end
- Data: This service has only one job: pull in the latest data from IMDb
- Nginx: This serves the front-end and routes API requests, as well as generating our SSL certificate and pushing updates to our CDN (AWS)
- MongoDB: Database to query and sort IMDb datasets
Docker
All of these services can be created and updated with the docker-compose tool, which builds each image/container to deploy. To build and launch for the first time:
$ cd server
$ cp .env.example .env
$ docker-compose build
$ docker-compose up
Within each service, we also handle application packages and dependencies with the service's own Dockerfile.
I learned a valuable lesson with Docker on this project: the important difference between RUN, CMD, and ENTRYPOINT. When combining multiple services, it is much easier to manage the order of processes by launching your applications from Bash scripts than to try to magically get the timing right so that RUN steps complete properly before the first CMD runs.
How to use IMDb Data
One reason I liked this project idea is that IMDb doesn’t have a public API, at least not one regular folks can easily get access to for retrieving data. I found a few unofficial third-party IMDb APIs, but none of them seem to have up-to-date ratings data for TV episodes from the last few years.
IMDb does, however, publicly supply data as raw, tab-delimited files (.tsv) on its datasets page, https://www.imdb.com/interfaces/, which is updated frequently.
We only need the datasets that include TV show data:
- title.basics.tsv.gz: Basic information from IMDb such as title, date, and record type (movie, TV show, episode, etc.)
- title.episode.tsv.gz: Simple relational records linking TV shows and their episodes by season and episode number
- title.ratings.tsv.gz: The average rating and vote count for all records
For our application, we download each of them with server/data/start.sh:
echo "Downloading data source files..."
dir="./files"
imdb="https://datasets.imdbws.com"
files="basics ratings episode"
for file in $files
do
  # download each dataset, extract it, and move the .tsv into the files directory
  mkdir -p $dir/chunks/$file
  wget -q -O $file.tsv.gz $imdb/title.$file.tsv.gz
  gunzip $file.tsv.gz
  mv $file.tsv $dir
done
There are, however, a few problems to solve if you want to read and sort through these files: each one contains millions of records. That might not be an issue for a single user looking up a specific TV show’s data in Excel, but to make the data useful in a web interface, we need a database.
Dealing with Millions of Records
Tackling a problem with large datasets forced me to avoid relying on the normal “good-enough” solutions for transforming data.
“We do these things not because they are easy, but because we thought they were going to be easy when we first started them.”
- every programmer starting a ‘side project’
For example, with small datasets in NodeJS you can load an entire file into memory and loop through each line before saving the results. With millions of records, that approach will make your laptop fan go supersonic and will likely run out of memory before finishing.
MongoDB has an import function that accepts .tsv files directly. However, millions of those records are not needed for this project; for example, we don’t need all of the movies and other non-TV records in our database. We can also make our queries even faster by sanitizing the data and defining data types before ingesting them into MongoDB.
JavaScript/Node is not multi-threaded by nature, but multi-threading is not needed here since we are dealing with text files. We can just break them into chunks to process, like so:
#./server/data/start.sh
echo "Start creating file chunks..."
chunkSize=100000
chunkGroup=500000
for file in $files
do
  lines=$(cat $dir/$file.tsv | wc -l)
  loops=$(((lines/chunkGroup)+1))
  for (( i=0; i<loops; i++ ))
  do
    start=$(( i*chunkGroup ))
    max=$(( start+chunkGroup ))
    # launch each group of lines as its own background node process
    node chunks.js $file $chunkSize $start $max &
  done
done
# wait for all background chunk processes to finish
wait
echo "file chunks created"
So instead of a multi-threaded solution, we are using multi-process iteration: the & at the end of the node command launches each batch in the background, and the wait at the end of the loop blocks until they have all finished.
The result is that the same Node script runs as multiple processes over smaller batches. This significantly reduces memory consumption compared to one monolithic Node process and maximizes CPU utilization, which makes it several times faster if you deploy to a VPS with multiple CPUs.
Chunks
In server/data/chunks.js we iterate through each file at the chunk sizes defined above.
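The chunk parameters come in on the command line from start.sh; here is a hypothetical sketch of reading them (variable names are assumptions):
// hypothetical: arguments passed as `node chunks.js $file $chunkSize $start $max`
const [,, file, chunkSizeArg, startArg, maxArg] = process.argv;
const chunkSize = parseInt(chunkSizeArg, 10);
const start = parseInt(startArg, 10);
const max = parseInt(maxArg, 10);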
We are not out of the woods yet with memory consumption. These are smaller chunks, but still hundreds of thousands of records. So we will read the file line-by-line as a data stream instead of as the entire file buffer:
const fs = require('fs');
const readline = require('readline');

readline.createInterface({
  input: fs.createReadStream(`${file}.tsv`, {encoding: 'UTF-8'})
})
.on('line', function(line) {
  //process data
})
.on('close', function () {
  //wrap up
});
To process our data, we will only deal with data within the chunk parameters:
//total is the running count of lines read so far
if(total >= start && total < max) {
count++;
chunk += line+'\n';
if(count == chunkSize) {
//Chunk file as JSON obj
let jsonChunk = tsvJSON(`${tsvHeader}\n${chunk}`);
let jsonFile = `${dir}/chunks/${file}/${file}-${fCnt}.json`;
//reset loop counter and chunk for next file
chunk='';
count=0;
doneCnt++;
fCnt++;
//write chunk to file as JSON
fs.writeFile(jsonFile, JSON.stringify(jsonChunk), () => {});
}
}
We then use the tsvJSON() function to turn our tab-delimited data into JSON objects that we can save as separate files. While we’re at it, we can clean up the data by skipping null values (\N) and transforming text values into the relevant Int and Float types so our database is easier to query and sort.
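As a minimal sketch of what such a tsvJSON() helper could look like (the actual implementation in chunks.js may differ):
// minimal sketch of a TSV-to-JSON helper; not the exact code from chunks.js
function tsvJSON(tsv) {
  const lines = tsv.trim().split('\n');
  const headers = lines[0].split('\t');
  return lines.slice(1).map(line => {
    const row = {};
    line.split('\t').forEach((val, i) => {
      if (val === '\\N') return;                              // skip IMDb null values
      if (/^\d+$/.test(val)) val = parseInt(val, 10);         // e.g. numVotes, seasonNumber
      else if (/^\d+\.\d+$/.test(val)) val = parseFloat(val); // e.g. averageRating
      row[headers[i]] = val;
    });
    return row;
  });
}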
The result? We have valid .json files that are formatted to easily import into MongoDB.
MongoDB
For small projects, I have always been a fan of Google’s Firebase and Firestore as a Document database solution. It’s free for small projects, has a library for every popular language, and makes scaling a database very quick and easy.
However, their free limit is 20K records a day. This won’t work with millions of records (not for free at least).
Luckily our database needs are simple:
- We don’t need to have any users or external access since we are only using it locally within our containers.
- Since we rely on IMDb datasets as our source of information, we can just drop the entire database and re-create it every time.
The Docker mongo image is turnkey for us here, so all we really need to do is wire it up to our two services:
- API: Read and query data connecting locally through our Docker network
- Data: Clear out the old data and import our .json data chunks
Connecting is simple over our Docker network:
// ./server/mongo.js
// this function will return a 'client' we can use in our API
MongoClient.connect('mongodb://mongo:27017')
Then we can initialize our database every time we launch our Data container service with server/data/mongoInit.js.
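Here is a rough sketch of what that initialization could look like with the Node MongoDB driver; the paths and collection names are assumed from the chunking step, and the actual mongoInit.js may differ:
// hypothetical sketch of re-importing one dataset's JSON chunks
const fs = require('fs');
const path = require('path');
const { MongoClient } = require('mongodb');

async function importCollection(file) {
  const client = await MongoClient.connect('mongodb://mongo:27017');
  const db = client.db('tvratings');
  await db.collection(file).deleteMany({}); // clear out the old data
  const dir = path.join('./files/chunks', file);
  for (const chunkFile of fs.readdirSync(dir)) {
    const docs = JSON.parse(fs.readFileSync(path.join(dir, chunkFile), 'utf8'));
    await db.collection(file).insertMany(docs); // import one .json chunk
  }
  await client.close();
}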
We then connect to and query the data with our API container service in server/api/api.js, which can group together the three IMDb datasets we started with.
Example: to get a single TV show chart from the IMDb datasets, we need data from each of the collections we created. We also need to reference the parent/child relationship between a show and its episodes using IMDb’s tconst IDs.
Here is how we can retrieve that data using our MongoDB setup:
const db = client.db('tvratings');
const episode = db.collection('episode');
episode.aggregate([
{ $match : {'parentTconst': req.params.parentTconst} },
//get rating
{ $lookup: {
from: 'ratings',
localField: 'tconst',
foreignField: 'tconst',
as: 'rating'
}},
{ $unwind: "$rating" },
//get title info
{ $lookup: {
from: 'basics',
localField: 'tconst',
foreignField: 'tconst',
as: 'title'
}},
{ $unwind: "$title" },
//format output
{ $project: {
_id : 1,
tconst : 1,
parentTconst : 1,
seasonNumber: 1,
episodeNumber: 1,
averageRating: "$rating.averageRating",
numVotes: "$rating.numVotes",
title : "$title.primaryTitle"
}}
]).toArray().then(data => {
var results = (data) ? data : 500;
res.send(results);
if(client) client.close();
});
This outputs an array of objects for all episodes with their respective ratings and vote counts. Our React front-end handles the sorting and grouping of seasons in client/src/lib/api.js, which transforms this into an object keyed by season number, since our data is sequential in nature.
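A simplified sketch of that transformation (the actual code in client/src/lib/api.js may differ):
// simplified sketch: group the API response by season number
function groupBySeason(episodes) {
  const seasons = {};
  episodes
    .slice() // avoid mutating the original response
    .sort((a, b) =>
      a.seasonNumber - b.seasonNumber || a.episodeNumber - b.episodeNumber)
    .forEach(ep => {
      (seasons[ep.seasonNumber] = seasons[ep.seasonNumber] || []).push(ep);
    });
  return seasons; // e.g. { 1: [episode, ...], 2: [...], ... }
}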
Deploying — The Home Stretch
Well, technically we are already live with the docker-compose up command from the beginning. But for explanation purposes, the “go live” magic happens in the Nginx container service.
I decided to build this like I would an active application with development and production builds. The server/.env variables determine how the Nginx service deploys the assets, as well as the configuration files for Nginx in this container.
Assuming nothing is set, the development build is very simple: the latest client build is copied to the /var/www/html directory and served at the root of the DOMAIN variable set in server/.env.
SSL Certificate
I also included a production workflow that will use or generate a Let’s Encrypt SSL certificate by running this script:
$ docker exec -it nginx ./certbot.sh
This checks for an existing certificate or generates a new one using the webroot verification method:
certbot certonly --webroot -w /var/www/html -d $DOMAIN --email $EMAIL --non-interactive --agree-tos
CDN Hosting (AWS S3 + CloudFront)
For production builds, another script rewrites all of the front-end asset URLs to use the CloudFront endpoint, then syncs the latest assets to the S3 bucket.
$ docker exec -it nginx ./aws.sh
The sync method checks for existing or missing files and mirrors the source /var/www/html to the target S3 bucket.
That’s it!
I built TVCharts as a way of showcasing my technical abilities to companies looking to hire a front-end or full-stack developer.
If these skills are a good fit for your development needs, please review the code in the GitHub repo or email me directly at ian@ianray.com.