Building for Billions: Mastering System Design for Massive Scale

Bhargav Sharma
9 min read · Feb 15, 2024

--

Level 1. Foundational Structure:

Let’s begin with a basic setup: when a web browser or mobile app requests a website, it queries the Domain Name System (DNS) to obtain the IP address mapped to the site. For instance, entering 173.194.203.106 in the browser takes you to Google.com. Once the IP is retrieved, HTTP requests are sent directly to the web server, which returns HTML pages or JSON data. It’s a straightforward process for a small user base.
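The Level 1 flow can be sketched with a toy in-memory DNS table standing in for a real resolver (the domain-to-IP mapping and the request handling here are purely illustrative):

```python
# Minimal sketch of the Level 1 request flow: DNS lookup, then an HTTP
# request to the resolved IP. The DNS table is a stand-in for a real resolver.
DNS_TABLE = {"example.com": "93.184.216.34"}

def resolve(domain: str) -> str:
    """Step 1: DNS lookup — map a domain name to an IP address."""
    return DNS_TABLE[domain]

def fetch(domain: str, path: str = "/") -> str:
    """Step 2: send an HTTP request to the resolved IP (simulated)."""
    ip = resolve(domain)
    # A real client would open a TCP connection to `ip` and send this request;
    # the web server would respond with HTML pages or JSON.
    request = f"GET {path} HTTP/1.1\r\nHost: {domain}\r\n\r\n"
    return f"request to {ip}: {request.splitlines()[0]}"

print(fetch("example.com"))
```

With a single server, that is the entire system: one lookup, one request, one response.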

Fig 1: Level 1

Limitations:

1. Scalability: The basic setup described in Level 1 may not efficiently handle large traffic volumes or a rapidly growing user base.
2. Limited Functionality: The basic structure may support only static content delivery or simple data retrieval.

Hence, as the user base grows, one server is no longer enough; we need separate servers: one for web/mobile traffic and another for the database.

Level 2. Introducing Database:

Separating the web traffic server from the data server enables independent scaling. But which database should you choose? It hinges on your specific needs:

Relational Databases:

  • Structured Data: When your data is well-organized and follows a clear structure.
  • Complex Queries: For applications requiring complex queries or transactions.

Non-Relational Databases:

  • Massive Data Storage: Opt for non-relational databases when dealing with huge volumes of data.
  • Speed Matters: When your application demands lightning-fast response times, non-relational databases shine.

Fig 2: Level 2

Limitations:

  1. Single Point of Failure: It introduces a potential single point of failure. If the database server experiences downtime or becomes overloaded, it can disrupt the entire system’s functionality, impacting user experience and service availability.
  2. Scalability Challenges

Even with the Level 2 setup, scalability issues persist as the user base expands, and the risk of database downtime looms. This is where horizontal and vertical scaling step in.

Organization’s Toolkit for the same:

  • Relational databases: MySQL, PostgreSQL
  • Non-relational databases: MongoDB, Cassandra

Level 3. Vertical and Horizontal Scaling:

Vertical scaling, referred to as “scale up”, is the process of adding more power (CPU, RAM, etc.) to your servers.

  • Advantage: Simplicity and suitability for low traffic periods.
  • Disadvantages:
    a. Hard limit on server resources.
    b. Lack of failover and redundancy, risking complete downtime if one server fails.

Horizontal scaling, referred to as “scale-out”, allows you to scale by adding more servers to your pool of resources.

  • Advantage: Overcomes limitations of vertical scaling.
  • Disadvantage: Adds operational complexity (load balancing, coordination across servers), though in return it provides resilience and availability by distributing the workload across multiple servers.

Scaling can address situations where a user cannot access a website because one of the web servers is offline. In such cases, additional servers can be dynamically provisioned or resources can be allocated to other servers to ensure continuous availability and smooth operation of the website. Additionally, a load balancer can redirect traffic away from the offline server to the functioning ones, further enhancing reliability and user experience.

Organization’s Toolkit for the same:

  • Vertical scaling: Amazon RDS, Google Cloud SQL
  • Horizontal scaling: MongoDB Sharding, Cassandra Cluster

Level 4. Addition of Load Balancer:

A load balancer evenly distributes incoming traffic among a set of web servers.
Users connect directly to the public IP of the load balancer, eliminating direct access to the web servers by clients. This setup enhances security, as communication between servers occurs via private IPs. Private IPs are accessible only within the same network, ensuring they are not reachable over the Internet. The load balancer communicates with the web servers exclusively through these private IPs.
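A minimal sketch of this routing logic, assuming round-robin distribution with simple health tracking (the private IPs and the `LoadBalancer` class are made up for illustration):

```python
from itertools import cycle

class LoadBalancer:
    """Toy round-robin load balancer that skips servers marked unhealthy."""
    def __init__(self, servers):
        self.servers = servers
        self.healthy = set(servers)       # health checks would update this
        self._ring = cycle(servers)

    def mark_down(self, server):
        self.healthy.discard(server)

    def mark_up(self, server):
        self.healthy.add(server)

    def route(self):
        # Walk the ring until a healthy server is found.
        for _ in range(len(self.servers)):
            server = next(self._ring)
            if server in self.healthy:
                return server
        raise RuntimeError("no healthy servers available")

# Hypothetical private IPs of the web servers behind the balancer.
lb = LoadBalancer(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
lb.mark_down("10.0.0.2")  # an offline server stops receiving traffic
```

Clients only ever see the load balancer's public IP; the private IPs above never leave the internal network.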

Level 4: Setup with Load Balancer

Limitations:

The limitation of the above setup is that while we have addressed the needs of the web tier, we have neglected the data tier. Thus far we have been using a single database (not depicted in the diagram), which lacks support for failover and redundancy. Data replication is a common technique used to address this issue.

Organization’s Toolkit for the same:

  • Load balancers: AWS Elastic Load Balancing, NGINX Load Balancer

Level 5. Data Replication:

Data replication is the process of copying data from one location to another. There are various methodologies for achieving this, with one popular approach being the master/slave methodology.
In this approach, a master database generally supports only write operations. A slave database receives copies of the data from the master and supports only read operations. All data-modifying commands (insert, delete, update) must be sent to the master database.

Level 5

Advantages:

  • Better performance: In the master-slave model, write and update operations occur on master nodes, while read operations are distributed among slave nodes, enabling parallel query processing and enhancing performance.
  • Reliability and high availability: Replicating data across multiple locations preserves it in the event of a disaster, so the website keeps operating even if one database server goes offline.

What if one database goes offline?

  • If a slave database goes offline, traffic is redirected to another available slave. If only one slave is operational, traffic redirection occurs to the master database.
  • In the event of the master database going offline, a slave is promoted to master. However, there could be discrepancies in data between the promoted slave and the previous master. Therefore, data recovery scripts should be executed to synchronize the data before the promotion.
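The routing and promotion rules above can be sketched as follows (the `ReplicatedDB` class and server names are hypothetical; real replication tools handle the data synchronization this sketch omits):

```python
import random

class ReplicatedDB:
    """Toy master/slave router: writes go to the master, reads to slaves."""
    def __init__(self, master, slaves):
        self.master = master
        self.slaves = list(slaves)

    def route_write(self):
        # insert/delete/update always go to the master
        return self.master

    def route_read(self):
        # Reads are spread across slaves; fall back to the master if none remain.
        return random.choice(self.slaves) if self.slaves else self.master

    def fail_slave(self, name):
        self.slaves.remove(name)

    def promote_slave(self):
        # Master went offline: promote a slave (after running data
        # recovery scripts to synchronize it, which this sketch skips).
        self.master = self.slaves.pop(0)

db = ReplicatedDB("master-1", ["slave-1", "slave-2"])
```

If `master-1` goes down, `db.promote_slave()` makes `slave-1` the new write target while `slave-2` keeps serving reads.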

Note: Database administrators or automated configuration tools manage the assignment of master and slave roles in database replication setups. Traffic redirection during database failures is typically overseen by the load balancer.

The addition of a CDN can further scale the above setup by addressing the limitation of static file delivery.

Organization’s Toolkit for the same:

  • Replication tools: MySQL Replication, MongoDB Replica Sets
  • Disaster recovery solutions: AWS RDS Multi-AZ, Google Cloud Spanner

Level 6. Addition of CDN and Caching layer:

A CDN is a network of geographically dispersed servers used to deliver static content. CDN servers cache static content like images, videos, CSS, JavaScript files, etc.

Working:

When a user accesses a website, the CDN server nearest to them serves the static content. If the content is not already cached on the CDN server, it retrieves it from the origin server and stores it locally for a specified Time-to-Live (TTL) period. This process optimizes content delivery, reducing latency and improving user experience.
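The cache-with-TTL behavior can be sketched like this (the `TTLCache` class is a toy stand-in for a CDN edge node; the injectable clock exists only to make expiry easy to demonstrate):

```python
import time

class TTLCache:
    """Toy CDN edge cache: entries expire after `ttl` seconds."""
    def __init__(self, ttl, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock          # injectable for testing
        self.store = {}             # key -> (value, cached_at)

    def get(self, key, fetch_origin):
        entry = self.store.get(key)
        now = self.clock()
        if entry and now - entry[1] < self.ttl:
            return entry[0], "HIT"          # fresh copy at the edge
        value = fetch_origin(key)           # miss or expired: go to origin
        self.store[key] = (value, now)      # re-cache with a new timestamp
        return value, "MISS"
```

A short TTL keeps content fresh at the cost of more origin fetches; a long TTL saves bandwidth but risks serving stale files, which is the trade-off behind the "cache expiry" consideration below.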

Level 6

Considerations of using a CDN:

  • Cost management: Ensuring the CDN expenses are optimized within budget constraints.
  • CDN Fallback: Implementing backup plans in case the CDN service encounters issues.
  • Setting appropriate cache expiry: Configuring cache lifetimes to balance the freshness of content with bandwidth optimization.

Organization’s Toolkit for the same:

  • CDN providers: Akamai, Fastly, Cloudflare
  • CDN services: Amazon CloudFront, Cloudflare CDN
  • Caching solutions: Redis, Memcached, Varnish Cache

Level 7. Stateful to Stateless architecture:

When scaling the web tier horizontally, it’s wise to move session data out of the web tier and store it in persistent storage such as a relational database or a NoSQL store.

Stateful architecture:

A stateful server remembers client data (state) from one request to the next.

Level 7: Stateful Architecture

Limitation:
Now, if user A’s HTTP request needs authentication, it must be directed to Server 1, which holds the session. If the request is routed to another server, such as Server 2, user A’s authentication will fail.

Stateless architecture:

In a stateless architecture, each server operates independently without relying on locally held state. Relational or NoSQL databases are used to store and manage user state.

Level 7: Stateless Architecture

In this stateless architecture, HTTP requests from users can be sent to any web server, which fetches state data from a shared data store.
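A minimal sketch of the stateless pattern, using a plain dictionary as a stand-in for a shared store such as Redis (the function names and session format are made up for illustration):

```python
# Shared session store (a stand-in for Redis or a database). The web
# servers hold no state of their own, so any server can serve any request.
SESSION_STORE = {}

def login(session_id, user):
    """Authenticate a user and persist the session in the shared store."""
    SESSION_STORE[session_id] = {"user": user}

def handle_request(server_name, session_id):
    """Any web server can look up the session and serve the request."""
    session = SESSION_STORE.get(session_id)
    if session is None:
        return f"{server_name}: 401 not authenticated"
    return f"{server_name}: hello {session['user']}"
```

Because the state lives outside the web servers, the request that logged in via one server can be served by any other, which is exactly what makes auto-scaling the web tier straightforward.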

Level 7

Once state data is extracted from the web servers, auto-scaling the web tier becomes straightforward by dynamically adding or removing servers according to traffic demand. As our website grows and attracts users from around the world, it’s important to have multiple data centers to ensure availability and provide a better experience for users in different places.

Organization’s Toolkit for the same:

  • Session management tools: Redis for session storage
  • NoSQL databases: MongoDB for storing state data

Level 8. Including Data Centers:

A data center is a physical location that stores and processes an organization’s data and applications.
GeoDNS, also known as geo-routed DNS, is a handy service that directs users to the nearest data center based on their location. It works by resolving domain names to IP addresses, ensuring users are routed to the closest server for faster access and better performance.
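The GeoDNS idea can be sketched with a simple country-to-region routing table (the region names, country codes, and IPs below are all made up for illustration; real GeoDNS services resolve location from the client's IP):

```python
# Toy GeoDNS: resolve a domain to the data center nearest the client.
DATA_CENTERS = {
    "us-east":  "203.0.113.10",
    "eu-west":  "203.0.113.20",
    "ap-south": "203.0.113.30",
}
ROUTING = {"US": "us-east", "CA": "us-east",
           "DE": "eu-west", "FR": "eu-west",
           "IN": "ap-south"}

def geo_resolve(client_country, default="us-east"):
    """Return the IP of the data center serving the client's region."""
    dc = ROUTING.get(client_country, default)
    return DATA_CENTERS[dc]
```

Failover fits the same shape: if a data center goes down, its entries in the routing table are repointed at a healthy one.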

Level 8: Inclusion of Data Centers

Additionally, these data centers serve as backup options in case another data center experiences downtime or encounters issues.

Challenges for Multi-Data Center Setup:

  • Traffic redirection: Utilizing GeoDNS to direct users to the nearest data center efficiently.
  • Data synchronization: Ensuring data consistency across multiple centers, even during failover, through replication.
  • Test and deployment: Crucial to test website/application performance across diverse locations and employ automated deployment tools for consistent service delivery.

To further scale our system, we need to decouple different components of the system so they can be scaled independently.

Organization’s Toolkit for the same:

  • Datacenter providers: AWS, Google Cloud, Microsoft Azure
  • GeoDNS services: Amazon Route 53 Traffic Flow, Cloudflare GeoDNS

Level 9. Message Queue:

A message queue, a durable component stored in memory, facilitates asynchronous communication. It acts as a buffer, distributing asynchronous requests.
The fundamental architecture is straightforward: Input services, termed producers or publishers, generate messages and publish them to the message queue. Subsequently, other services or servers, referred to as consumers or subscribers, connect to the queue and execute actions specified by the messages.
Decoupling the producer and consumer lets each scale independently, and queues can be placed in front of databases or servers.
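The producer/consumer pattern can be sketched with Python's in-process `queue` module (a stand-in for RabbitMQ, Kafka, or SQS; the message format and sentinel-based shutdown are illustrative choices):

```python
import queue
import threading

message_queue = queue.Queue()   # the buffer between producers and consumers
processed = []

def producer(n):
    """Publish n messages to the queue."""
    for i in range(n):
        message_queue.put({"task": "resize_image", "id": i})

def consumer():
    """Pull messages off the queue and execute them."""
    while True:
        msg = message_queue.get()
        if msg is None:                 # sentinel: shut down the worker
            break
        processed.append(msg["id"])     # "execute" the task
        message_queue.task_done()

worker = threading.Thread(target=consumer)
worker.start()
producer(5)
message_queue.put(None)
worker.join()
```

The producer never waits for the work to finish, and the consumer can run on a different server entirely; that is the decoupling the queue buys.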

Organization’s Toolkit for the same:

  • Message queue systems: RabbitMQ, Apache Kafka, Amazon SQS

Level 10. Logging, Metrics and Automation:

Now that our website has grown to support a sizable business, it’s important to invest in tools for logging, metrics, and automation.

Logging:

Keeping an eye on error logs is crucial for spotting system issues. You can check these logs on each server or use tools to bring them all together for easier management.
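A minimal sketch of bringing per-server logs together, using Python's standard `logging` module with one shared handler (the server names are made up; in production the handler would ship logs to a central aggregator such as the ELK stack rather than an in-memory buffer):

```python
import io
import logging

# One shared destination for every server's error logs.
log_buffer = io.StringIO()
handler = logging.StreamHandler(log_buffer)
handler.setFormatter(logging.Formatter("%(levelname)s %(name)s %(message)s"))

def make_server_logger(server_name):
    """Give each web server its own named logger, all feeding one handler."""
    logger = logging.getLogger(server_name)
    logger.setLevel(logging.ERROR)   # error logs are the ones worth watching
    logger.addHandler(handler)
    logger.propagate = False         # don't also print to the root logger
    return logger

web1 = make_server_logger("web-1")
web2 = make_server_logger("web-2")
web1.error("database connection timed out")
web2.error("disk usage at 95 percent")
```

Because every logger is named after its server, the aggregated stream still tells you which machine each error came from.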

Metrics:

Gathering different kinds of metrics helps us understand how well the system is doing.

  • Host-level metrics: CPU, memory, disk I/O, etc.
  • Aggregated-level metrics: the performance of the entire database tier, cache tier, etc.
  • Key business metrics: daily active users, retention, revenue, etc.

Automation:

As systems grow, automation becomes super helpful for boosting productivity.

Level 10

As the data grows every day, your database gets more overloaded. It is time to scale the data tier.

Organization’s Toolkit for the same:

  • Logging tools: ELK Stack (Elasticsearch, Logstash, Kibana), Splunk

Level 11. Database scaling:

Similar to servers, databases can also scale both vertically and horizontally.

Vertical scaling:

Vertical scaling, or scaling up, involves increasing the power of a database server by adding resources like CPU, RAM, and disk space. For instance, Amazon Relational Database Service (RDS) offers database servers with up to 24 TB of RAM, enabling them to store and manage large volumes of data efficiently.

Horizontal scaling:

Horizontal scaling, also known as sharding, involves adding more servers to a database system.
Sharding breaks large databases into smaller parts called shards, each with its unique data. To locate specific data, the database server uses a hash function. While sharding helps scale databases, it comes with challenges:

  • Resharding data: Sometimes, a shard can’t hold more data due to rapid growth, requiring redistribution of data.
  • Celebrity problem: Heavy access to certain shards can overload servers, especially if popular data ends up on the same shard.
  • Join and de-normalization: Join operations become difficult across multiple shards, so databases are often simplified to speed up queries.
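The hash-function routing described above can be sketched as follows (a toy four-shard setup; a stable hash like SHA-256 is used instead of Python's salted built-in `hash()` so the key-to-shard mapping stays consistent across processes):

```python
import hashlib

def shard_for(key: str, n_shards: int) -> int:
    """Map a key to a shard index with a stable hash function."""
    digest = hashlib.sha256(key.encode()).hexdigest()
    return int(digest, 16) % n_shards

# Four shards, each holding its own slice of the data.
shards = [dict() for _ in range(4)]

def put(key, value):
    shards[shard_for(key, len(shards))][key] = value

def get(key):
    return shards[shard_for(key, len(shards))].get(key)

put("user:42", {"name": "alice"})
```

Note that the modulo scheme is what makes resharding painful: changing `n_shards` remaps almost every key, which is why production systems often use consistent hashing instead.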

Organization’s Toolkit for the same:

  • Vertical scaling solutions: Amazon RDS with provisioned IOPS
  • Horizontal scaling solutions: Amazon DynamoDB, Google Cloud Bigtable

Ready to scale your systems like a pro? Thank you for reading to the end. Follow for more.


Bhargav Sharma

A student with a passion for technology and data-driven solutions.