System Design Notes #1. Gradual Improvement. Part 1
How to scale a System from a single server to millions of active users. Part 1: Load Balancer, Web Server scaling, DB replication, and caching.
This is a series of notes about system design. My main motivation for making these notes is to better understand system design topics. You may also find them useful if you need to refresh some common topics or to get an outline of the subject.
Single Server
This is the starting point for many software projects. A single-server system assumes you have only one Web Server, possibly even without a Database.
This is how this system works:
- A user wants to reach the “mrbalov.io” web page. The browser queries DNS for the corresponding IP address.
- DNS responds with the IP address for the “mrbalov.io” domain.
- The browser then uses that IP address to reach the Web Server.
- The Web Server responds with the HTML page or JSON data.
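The steps above can be sketched as a toy simulation. The DNS table, the IP address, and the `web_server` function are all illustrative stand-ins for real infrastructure:

```python
# A hardcoded DNS table standing in for a real DNS resolver (the IP is made up).
DNS_TABLE = {"mrbalov.io": "93.184.216.34"}

def resolve(domain: str) -> str:
    """Steps 1-2: the browser asks DNS and gets an IP address back."""
    return DNS_TABLE[domain]

def web_server(ip: str, path: str) -> str:
    """Steps 3-4: the browser reaches the server, which responds with HTML."""
    return f"<html>Hello from {ip}{path}</html>"

ip = resolve("mrbalov.io")
print(web_server(ip, "/"))
```

In a real browser, the resolver result is cached at several layers (browser, OS, ISP), but the overall flow is the same.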
Database
The previous system doesn’t store any data. The only thing it does is respond with static data: HTML or JSON. To store data (e.g., users, Welsh Corgis, etc.), we need to add a Database to our system. Let’s do it!
There are two types of databases:
- Relational (MySQL, PostgreSQL)
- Non-Relational, or NoSQL (Redis, Memcached, MongoDB)
Relational databases are useful for structured data with relations between entities. For instance, the “users” table may have a “corgiId” field pointing to a certain entity in the “corgis” table.
Non-Relational databases are useful for unstructured data. They also help when low latency is required or when the amount of data is huge.
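The relational example from the text can be sketched with an in-memory SQLite database. The table and column names follow the text; the data values are made up:

```python
import sqlite3

# "users" has a "corgiId" column referencing an entity in "corgis".
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE corgis (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE users  (id INTEGER PRIMARY KEY, name TEXT,
                     corgiId INTEGER REFERENCES corgis(id));
""")
conn.execute("INSERT INTO corgis VALUES (1, 'Biscuit')")
conn.execute("INSERT INTO users VALUES (1, 'Alice', 1)")

# The relation lets us join a user to their corgi in one query.
row = conn.execute("""
    SELECT users.name, corgis.name
    FROM users JOIN corgis ON users.corgiId = corgis.id
""").fetchone()
print(row)  # ('Alice', 'Biscuit')
```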
Scaling
There are two types of scaling:
- Vertical (scale-up)
- Horizontal (scale-out)
Vertical scaling means that you add more power to an existing server: CPU, RAM, and so on. There are several major drawbacks to this approach:
- High cost
- Hard scaling limit
- Single point of failure (if a server goes down, nothing can replace it)
- Lack of redundancy (no spare capacity to absorb growing load)
Horizontal scaling means adding more servers that work simultaneously. This approach solves the problems above, but it may be rather complex to set up. The first problem we need to address is routing requests to the right server.
Load Balancer
The purpose of a Load Balancer is to distribute connections between Web Servers.
Note that, for the sake of security, the Load Balancer communicates with the Web Servers over private IPs.
This is how the Load Balancer can help our imagined System:
- If one of the Web Servers goes down, the Load Balancer routes traffic to the remaining healthy Server.
- When traffic increases suddenly, more Web Servers may be added to the pool.
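Both points can be illustrated with a toy round-robin Load Balancer. The server IPs are made up, and `mark_down` is a hypothetical stand-in for a real health-check mechanism:

```python
import itertools

class LoadBalancer:
    """A toy round-robin balancer over a pool of Web Server private IPs."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.healthy = set(servers)        # updated by health checks
        self._cycle = itertools.cycle(self.servers)

    def mark_down(self, server):
        """Simulate a health check discovering a dead server."""
        self.healthy.discard(server)

    def route(self):
        """Return the next healthy server, skipping dead ones (failover)."""
        for _ in range(len(self.servers)):
            server = next(self._cycle)
            if server in self.healthy:
                return server
        raise RuntimeError("no healthy servers")

lb = LoadBalancer(["10.0.0.1", "10.0.0.2"])
lb.mark_down("10.0.0.1")
print(lb.route())  # only "10.0.0.2" remains after the failover
```

Adding capacity for a traffic spike is just appending another private IP to the pool.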
So, now the Web Tier has the following properties:
- Failover — the ability to recover when a Web Server goes down.
- Redundancy — spare capacity (“empty space”) that allows the system to handle suddenly increased traffic.
The next step is to improve the Data Tier.
Database Replication
Database replication means keeping multiple copies of a database. The purpose of replication is to separate read operations from mutating operations (insert, update, delete). This helps because most systems have far more read operations than mutating ones.
The master/slave approach is the most common. This is described in the diagram below.
Database replication has several advantages:
- Performance (reads, which dominate the workload, are spread across replicas)
- Reliability (if one of the DBs dies, there are other operational ones)
- Availability (when replicated DBs are in different locations)
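A minimal sketch of the master/slave routing logic, assuming hypothetical endpoint names: mutations go to the master, while reads are spread across the replicas.

```python
import random

# Illustrative endpoints; a real system would discover these from config.
MASTER = "db-master:5432"
REPLICAS = ["db-replica-1:5432", "db-replica-2:5432"]

def pick_endpoint(query: str) -> str:
    """Route mutating queries to the master, reads to a random replica."""
    mutating = query.lstrip().upper().startswith(("INSERT", "UPDATE", "DELETE"))
    return MASTER if mutating else random.choice(REPLICAS)

print(pick_endpoint("INSERT INTO corgis VALUES (2, 'Waffle')"))  # the master
print(pick_endpoint("SELECT * FROM corgis"))                     # a replica
```

In practice this routing is usually handled by the database driver or a proxy, and replication lag means a replica may briefly serve stale data after a write.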
Caching
This is a common technique for decreasing response time. We are going to add a Cache Tier to our imagined System to improve the response time for reading data from the Database.
There are some useful considerations when adding a Cache Tier to the System.
- The cache is useful when data is read frequently.
- Think about the expiration policy: the time data stays cached. It should be neither too short nor too long.
- When the Cache Tier is distributed, there may be problems with consistency.
- The Cache Tier may become a Single Point of Failure (SPOF). To avoid this, the Cache Tier can be distributed across multiple nodes.
- The eviction policy should also be considered: it defines which data to remove when storage is full. Popular eviction policies include Least Frequently Used (LFU) and First In First Out (FIFO).
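The read path with a cache in front of the Database is often implemented as cache-aside with a TTL expiration policy. Below is a sketch under that assumption; `load_from_db` is a hypothetical stand-in for a real Database query:

```python
import time

CACHE: dict[str, tuple[float, object]] = {}
TTL_SECONDS = 60.0  # expiration policy: how long an entry stays valid

def load_from_db(key):
    """Stand-in for an expensive Database read."""
    return f"value-for-{key}"

def get(key):
    """Cache-aside read: serve from cache if fresh, else fall back to the DB."""
    entry = CACHE.get(key)
    if entry and time.monotonic() - entry[0] < TTL_SECONDS:
        return entry[1]                      # cache hit
    value = load_from_db(key)                # cache miss: read the DB...
    CACHE[key] = (time.monotonic(), value)   # ...and populate the cache
    return value

print(get("corgi:1"))  # first call misses and hits the DB
print(get("corgi:1"))  # second call is served from the cache
```

A production Cache Tier (e.g., Redis or Memcached) handles TTLs and eviction for you; the choice of TTL directly trades freshness against DB load, as the considerations above describe.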
Intermediate Summary
This is the System diagram we have now. This is not the final solution. And, actually, the process of improving any System is endless! In the next chapters, we are going to improve the current state of things.
This was the first part of the “Gradual Improvement” paper. In the next part, we will take a look at the CDN, Stateful/Stateless Web Tier, Data Centers, DB scaling, and other topics.