Design Decisions — Rick Flag

Links, votes, comments

Omar Elgabry
OmarElgabry's Blog
Jul 28, 2018

Introduction

Rick Flag is an application like Reddit, Hacker News, Indie Hackers, or a combination of these.

Application features

  • Submit links (each has a title, URL, and votes)
  • Upvote and comment on links
  • Has a home page with Top & Recent links (across all groups)
  • Has groups (a set of related links; much like a subreddit)
  • Each group has a list of Top & Recent links

Basically, a user submits a link to one of the groups.

Facts & Numbers

  • 10M active users
  • 100K requests per second are spent on reads
  • 10K requests per second are spent on writes
  • 50K different groups

Users consume more than they produce. But still, the write load is quite high.

Architecture Overview

High-level architecture overview

1. CDN

The CDN caches not only the asset files (images, CSS, and JS), but also the entire HTML pages.

It pings the website periodically, caches the content, and delivers the content to users instead of hitting the web server.

We can only cache the public pages (for non-logged-in users), since every logged-in user gets their own personalized page.

So, instead of having the DNS point to the load balancer IPs, point it to the CDN. The CDN will return a cached response (the whole HTML page) without the request ever reaching the load balancer.

As the content is the same for all users, you push the content away from you and closer to the user; thus, it's faster to deliver.

We can also have a cache in front of the web servers, which caches the entire HTML page (with CSS and JS). Options include Varnish and Squid.
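As a rough illustration, here's a minimal sketch in Python of that edge-cache decision. The cookie name, the in-memory page store, and render_from_app_server are assumptions made up for the example, not any real CDN API:

    # Toy edge cache: only anonymous requests (no session cookie) are
    # served from the cached HTML; logged-in users always hit the origin.
    page_cache: dict[str, str] = {}  # url -> rendered HTML (hypothetical store)

    def render_from_app_server(url: str) -> str:
        return f"<html>page for {url}</html>"  # stand-in for the origin call

    def handle_request(url: str, cookies: dict[str, str]) -> str:
        if "session_id" in cookies:              # logged-in: personalized page
            return render_from_app_server(url)   # never served from the cache
        if url in page_cache:                    # anonymous: try the cache first
            return page_cache[url]
        html = render_from_app_server(url)       # miss: fetch once...
        page_cache[url] = html                   # ...then cache it for everyone
        return html

    print(handle_request("/top", cookies={}))                     # cacheable
    print(handle_request("/top", cookies={"session_id": "abc"}))  # dynamic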

2. Load Balancer (+ more)

3. Web app servers (+ more)

4. Job Queues & Servers (+ more)

Job queues store a list of jobs that need to be processed asynchronously. Mostly, “write” operations.

The job servers are dedicated servers that carry out the work: they pull jobs from the queue, write to the database, and update the cache as well.
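A minimal in-process sketch of that split, using Python's standard library (a real deployment would use a dedicated queue like Redis, RabbitMQ, or SQS; the job shape here is invented for illustration):

    import queue
    import threading

    jobs: queue.Queue = queue.Queue()

    def job_server() -> None:
        while True:
            job = jobs.get()            # pull the next job off the queue
            if job is None:             # sentinel value: shut down
                break
            write_to_database(job)      # persist the write...
            update_caches(job)          # ...then refresh the caches
            jobs.task_done()

    def write_to_database(job: dict) -> None:
        print("db write:", job)         # stand-in for the real database call

    def update_caches(job: dict) -> None:
        print("cache update:", job)     # stand-in for Memcache/Cassandra updates

    threading.Thread(target=job_server, daemon=True).start()
    jobs.put({"action": "submit_link", "title": "Hello", "url": "http://..."})
    jobs.join()                         # wait until the job is processed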

5. Caching

Caching is the name of the game here. That’s the most important piece.

What do we cache? Everything: the user metadata, and the lists of links in the groups and on the home page.

We’ll be using Cassandra for the pre-computed home and group pages (top & recent), each of which contains a list of links. Memcache is used for everything else.

Cassandra is a distributed NoSQL database. It has a ring of nodes, distributes the data across these nodes, and replicates the data across them.

Why Cassandra? …

  1. Concurrent read operations can be served from different nodes as a result of replication.
  2. If any node is down, the data is still available on another node.
  3. It uses “consistent” hashing, which maps a key to a set of nodes (instead of a specific node). When losing a node, only the data on that node gets redistributed, not all the keys. Unlike with the “modular” hashing in Memcache: if we lose a node there, all the keys get rehashed and reinserted (see the sketch below).
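Here's a toy consistent-hash ring in Python to make point 3 concrete (no virtual nodes or replication, which a real Cassandra cluster would have; all names are illustrative):

    import bisect
    import hashlib

    def _hash(value: str) -> int:
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    class Ring:
        def __init__(self, nodes: list[str]) -> None:
            self._points = sorted((_hash(n), n) for n in nodes)

        def node_for(self, key: str) -> str:
            # Walk clockwise to the first node at or after the key's hash.
            hashes = [h for h, _ in self._points]
            i = bisect.bisect_left(hashes, _hash(key)) % len(self._points)
            return self._points[i][1]

        def remove(self, node: str) -> None:
            self._points = [(h, n) for h, n in self._points if n != node]

    ring = Ring(["node-a", "node-b", "node-c"])
    print(ring.node_for("link:42"))  # some node on the ring
    ring.remove("node-b")            # only keys owned by node-b move;
    print(ring.node_for("link:42"))  # with hash(key) % N, almost all keys would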

Why do we cache pages?

Because these pages will be requested over and over again. Solution: pre-computed caches for pages.

6. Data processing

This is responsible for running batch processing every few minutes across a huge amount of data to get the top links of a group, or across all the groups.

We can use MapReduce.

Map: Maps a function over a list of data. Reduce: Given the ‘Map’ output, apply an (aggregate) function to reduce it to a result.
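A toy map/shuffle/reduce pass in Python over vote events, just to illustrate the shape of the job (a real run would go over far more data on a Hadoop cluster; the input format is made up):

    from collections import defaultdict

    votes = [("link-1", 1), ("link-2", 1), ("link-1", 1)]  # (link_id, +1) events

    # Map: emit a (key, value) pair per input record.
    mapped = [(link_id, count) for link_id, count in votes]

    # Shuffle: group the values by key.
    grouped: dict[str, list[int]] = defaultdict(list)
    for link_id, count in mapped:
        grouped[link_id].append(count)

    # Reduce: aggregate each key's values, then rank the totals.
    totals = {link_id: sum(counts) for link_id, counts in grouped.items()}
    top_links = sorted(totals, key=totals.get, reverse=True)[:10]
    print(top_links)  # ['link-1', 'link-2']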

The good news is that Amazon offers managed Hadoop clusters (Amazon EMR).

7. Database

A relational database, like MySQL, would work fine. The good news is that the data is split by feature: users, links, votes, and comments, and each database is a master-slave setup. The slave replica is for availability.

No joins; they're not needed. If a join is needed, it's done in application code (a sketch follows below). This makes it easy to scale each database independently.
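A sketch of such an application-level join, with plain dicts standing in for the two separate databases (table and field names are invented):

    links_db = {1: {"id": 1, "title": "Hello", "user_id": 7}}  # links database
    users_db = {7: {"id": 7, "name": "omar"}}                  # users database

    def links_with_authors(link_ids: list[int]) -> list[dict]:
        links = [links_db[i] for i in link_ids]           # query 1: links
        user_ids = {link["user_id"] for link in links}
        users = {uid: users_db[uid] for uid in user_ids}  # query 2: users (batched)
        for link in links:                                # the "join", in code
            link["author"] = users[link["user_id"]]
        return links

    print(links_with_authors([1]))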

8. Cloud storage (+ more)

The CDN caches the page, and the page references CSS and JS files, so the CDN will fetch these files from, say, Amazon S3.

No static files are stored on the app servers. That would be an overhead, wasting resources on handling connections just to fetch static files.

Part I — Reading Pipeline

In our case, we have 4 major reading actions:

  1. Top & Recent links in home page
  2. Top & Recent links of a group
  3. A link with its comments
  4. User data.

The reading pipeline

Alright. So, we have caches sitting in front of the database. Two different caches serve two different purposes.

The simple objects, like a user or a link, are served from Memcache, while the home and group pages (top and recent links) are stored in Cassandra.

Initially, we'll warm up the cache by running the most popular queries. A similar technique, also used to test the application, is to save the latest, say, 10K requests and replay them (see the sketch below).
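A hedged sketch of that replay idea in Python (the log file name and its one-URL-per-line format are assumptions for the example):

    import urllib.request

    with open("last_10k_requests.log") as log:  # one URL per line
        for line in log:
            url = line.strip()
            try:
                urllib.request.urlopen(url, timeout=5)  # the response warms the caches
            except OSError:
                pass                                    # ignore failures while warming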

The database is the slowest piece. Try to avoid it. Simple queries shouldn't hit the database.

So, how does it work?

The home page & the group pages (top and recent links). A user goes to the home page or a group page, a request is sent, and it is usually served from Cassandra.

Cassandra already has a list of links (top and recent): across all the groups for the home page, and per group for each group page.

In Cassandra, every key-value pair is stored on N replica nodes, and the size of each list is intentionally limited to, say, 1K links. It's not likely that a user will keep clicking “next →” until page 100 (assuming each page has 10 links).

Displaying a specific link with its comments & user data. These are served from Memcache.

What if there’s a cache miss?

Since Cassandra replicates the data across different nodes, chances are the data is in the cache. And if a node is down, the data still exists on another node.

The home page. Get a list of all the (recent) links from the database (maybe stored in an index?), then go to Memcache: given a list of link ids, return the links (title, date of creation, votes, etc.). Then, update the cache.

The group pages. Same, except we need the recent links for a single group.

These individual links are stored and sorted by date of creation.

Only the top links (home or group page) can't simply be fetched from the database; they're computed periodically by MapReduce. We'll get into that later.

A cache miss on Memcache will hit the database. That happens when someone tries to view a link (and its comments) that was posted a long time ago.
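Putting the read path together, here's a sketch with plain dicts standing in for Cassandra, Memcache, and the database (the key scheme and helper names are made up):

    cassandra: dict[str, list[int]] = {}  # stand-in for the precomputed page lists
    memcache: dict[str, dict] = {}        # stand-in for the object cache
    database = {"links": {1: {"id": 1, "title": "Hello", "group_id": 5}}}

    def recent_links_for_group(group_id: int) -> list[dict]:
        key = f"group:{group_id}:recent"
        link_ids = cassandra.get(key)
        if link_ids is None:                           # page cache miss
            link_ids = [i for i, l in database["links"].items()
                        if l["group_id"] == group_id]  # stand-in for the DB query
            cassandra[key] = link_ids                  # re-cache the page
        links = []
        for link_id in link_ids:
            link = memcache.get(f"link:{link_id}")
            if link is None:                           # object cache miss
                link = database["links"][link_id]      # last resort: the database
                memcache[f"link:{link_id}"] = link
            links.append(link)
        return links

    print(recent_links_for_group(5))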

Part II — Writing Pipeline

The write pipeline is (and should be) async.

It means that when a user submits or updates a link, or upvotes, a job is inserted into a queue; the job servers pull the jobs out and do the work in the background.

What’s the work involved?

Write to the database.

At the same time, update both caches: Cassandra & Memcache.

Cassandra updates the list of recent links for the group (and the home page). Memcache stores the individual link's data.
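A sketch of the work a job server does for a “submit link” job, again with dicts standing in for the three stores (the job fields and key scheme are assumptions):

    database: dict[int, dict] = {}        # stand-in for the links database
    memcache: dict[str, dict] = {}        # stand-in for Memcache
    cassandra: dict[str, list[int]] = {}  # stand-in for the Cassandra page lists

    def process_submit_link(job: dict) -> None:
        link = {"id": job["id"], "title": job["title"], "url": job["url"]}
        database[link["id"]] = link                      # 1. durable write
        memcache[f"link:{link['id']}"] = link            # 2. cache the link object
        key = f"group:{job['group_id']}:recent"
        recent = cassandra.get(key, [])
        cassandra[key] = ([link["id"]] + recent)[:1000]  # 3. prepend, cap at 1K links

    process_submit_link({"id": 1, "title": "Hello", "url": "http://...", "group_id": 5})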

What about the top links in a group and home page?

That’s where the MapReduce comes into play.

Every few minutes, we pull data from a links database replica (dedicated to MapReduce). Then, we run MapReduce jobs to compute the top links (this could also compute the home page's recent links). Finally, the result is stored in Cassandra.

We give Amazon's Hadoop cluster the data; it loads the data, runs the jobs, returns the output, and shuts all the machines down.

Inserting or updating a link can be done by updating the relevant precomputed page in Cassandra. This is a small change; there's no need to re-run the query against the database.

Replication lag?

Updating the cache at write time guarantees that the next request(s) will be served from the cache, and not from the (possibly lagging) database replica.

Cache stampede?

That's when multiple cache misses, coming from concurrent requests, put a high load on the database.

Solutions? As we said:

  • Update the cache at the same time as inserting into or updating the database.
  • Initially, warm up the cache by running the most frequent queries.

Race condition?

That's when two different requests access the same data at the same time.

At the database level, transactions can handle such a situation. But the cache can't.

A solution could be:

  1. Requests 1 & 2 both get the value associated with a key. When you get a value for a key, you also get a unique hash value indicating the latest change made (like a commit id).
  2. Request 1 updates the value, passing that unique hash value along. When an existing value is updated, its unique hash value changes, indicating that a new change has been made.
  3. Request 2 then tries to update the value, passing its unique hash value (which is now outdated).
  4. So, this operation will be refused. Request 2 has to get(key) again (maybe in a while loop?), which in turn returns the new unique hash.
get(key)                 // returns: value, unique
cas(key, value, unique)  // returns: true/false, whether the operation succeeded
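Here's a toy check-and-set loop in Python following those steps. Real memcached exposes this through its gets/cas commands; the store below is simulated so the retry logic is visible end to end:

    import itertools

    store: dict[str, tuple] = {}  # key -> (value, unique token)
    tokens = itertools.count(1)

    def gets(key: str) -> tuple:
        return store[key]                   # returns the value and its token

    def cas(key: str, value, unique: int) -> bool:
        if store[key][1] != unique:         # someone else changed it first
            return False
        store[key] = (value, next(tokens))  # accept, and issue a new token
        return True

    def upvote(key: str) -> None:
        while True:                         # retry until our cas() wins
            value, unique = gets(key)
            if cas(key, value + 1, unique):
                return

    store["link:1:votes"] = (0, next(tokens))
    upvote("link:1:votes")
    print(store["link:1:votes"][0])  # 1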

Thank you for reading! If you enjoyed it, please clap 👏 for it.
