System Design: Useful concepts for building services at scale! - Part 1

umang goel
8 min read · Oct 29, 2021


Building large-scale systems is not a one-day job; it is a process involving numerous refinements and improvements. There are some common techniques and concepts that come into play, or can be taken into consideration, while building an application. In this article we will discuss some important concepts and techniques that every developer should know for building large-scale distributed systems.

CAP

This theorem states that in a distributed system it is impossible to provide more than two of the three guarantees: Consistency, Availability, and Partition tolerance.

Consistency: All nodes see the same data at the same time.

Availability: Every request receives a response, whether success or failure.

Partition Tolerance: The system continues to work even if there is a network breakdown between some of the nodes. NOTE: Partition in this case refers to a communication breakdown between nodes.

A system is consistent if the data on all nodes is the same. But if there is a network issue, data will not be propagated to one or more nodes, leaving stale data on some of them. To keep the data consistent, writes to some of the nodes must be stopped to avoid serving stale data, which in turn reduces the availability of the system. Thus the right CAP trade-off needs to be made when building a large system.

Examples: CA: RDBMS, AP: Cassandra, CP: MongoDB.

Long polling vs Web-sockets vs SSE

In the traditional polling approach, whenever a client needs some information from a server, it calls the downstream service periodically to check whether the information is available. This overloads the downstream service with HTTP calls, since it keeps returning empty responses to the caller until the data is available. Long polling, WebSockets, and SSE are communication mechanisms that can be used for such scenarios.

Long Polling: In this approach the client opens a connection with the server over HTTP, and the server holds the request open until it is ready with the data, at which point it pushes the data to the client through this connection. After the response is received, the client sends another long-poll request if it needs more data. If the server is not able to respond in time, the connection times out.
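A bare-bones long-polling loop from the client's side, sketched with the third-party requests package; the URL and response shape are hypothetical stand-ins:

```python
# Minimal long-polling client sketch (assumes the `requests` package and
# a hypothetical /api/updates endpoint that holds the request open).
import requests

def long_poll(url: str = "https://example.com/api/updates") -> None:
    while True:
        try:
            # The server holds this request open until data is ready
            # or the 30-second timeout elapses with nothing to send.
            resp = requests.get(url, timeout=30)
            if resp.ok:
                print("got update:", resp.json())
        except requests.exceptions.Timeout:
            pass  # timed out with no data: immediately poll again
```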

Web-Sockets: In this approach a two-way communication channel is opened between the client and the server over TCP. A WebSocket handshake establishes the connection, after which data can be exchanged between the two parties at any time, reducing the overhead of repeated requests.
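To make the two-way channel concrete, here is a minimal sketch using the third-party websockets package; the wss:// URL is a made-up endpoint:

```python
# Minimal WebSocket client sketch (assumes the `websockets` package and a
# hypothetical wss:// endpoint). Either side can send at any time.
import asyncio
import websockets

async def chat() -> None:
    async with websockets.connect("wss://example.com/socket") as ws:
        await ws.send("hello")       # client -> server, no new handshake
        reply = await ws.recv()      # server -> client over the same channel
        print("server said:", reply)

asyncio.run(chat())
```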

Server-Sent Events: In this approach the client establishes a one-way connection with the server over HTTP, and the server sends data to the client as it becomes available. If the client wants to send data to the server, that has to be handled separately.
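A rough SSE consumer built directly on requests; a production client would also handle reconnection and the Last-Event-ID header, and the URL here is hypothetical:

```python
# Minimal Server-Sent Events consumer sketch. SSE is plain HTTP: the
# server keeps the response open and writes "data: ..." lines as events.
import requests

with requests.get("https://example.com/events", stream=True) as resp:
    for line in resp.iter_lines(decode_unicode=True):
        if line and line.startswith("data:"):
            print("event payload:", line[len("data:"):].strip())
```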

HTTP vs HTTPS

HTTP is the communication protocol that provides a way to communicate over the World Wide Web.

As a request-response protocol, HTTP gives users a way to interact with web resources such as HTML files by transmitting hypertext messages between clients and servers. HTTP clients generally use Transmission Control Protocol (TCP) connections to communicate with servers.

Note: HTTP is a layer-7 protocol whereas TCP is a layer-4 protocol. TCP manages the data streams, while HTTP describes what those data streams contain.

HTTPS is HTTP over a secure layer (TLS, originally SSL). Communication over plain HTTP is susceptible to various types of network attacks because the data is sent in raw form over the network. To make the communication secure, the data exchanged is encrypted before it is sent over the network.

Example: suppose you want to access https://youtube.com.

  1. When you make this request, you are requesting the secure pages from YouTube.
  2. YouTube's web server sends its public key along with its SSL certificate. The certificate is digitally signed by a third party called a Certificate Authority (CA) using the CA's private key. Browsers come preinstalled with the public keys of all major CAs, so once the browser receives the certificate it can verify the digital signature. Once the signature is verified, the certificate can be trusted.
  3. A green padlock appears in the browser. It indicates that the web server's public key belongs to youtube.com and not to some other server.
  4. Now that trust is established, the browser creates a symmetric key (a shared secret key) and, before sending it to the YouTube web server, encrypts a copy of it with the web server's public key.
  5. The YouTube web server decrypts the symmetric key using its private key.
  6. Now YouTube has exactly the same key as the browser.
  7. From now on, all communication is encrypted and decrypted using this shared key at both the browser and the server end.
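You can watch the result of this handshake from Python's standard-library ssl module, which performs the certificate verification described above before any application data flows:

```python
# Open a TLS connection and inspect the certificate the handshake verified.
import socket
import ssl

ctx = ssl.create_default_context()  # loads the preinstalled root CA trust store
with socket.create_connection(("youtube.com", 443)) as sock:
    with ctx.wrap_socket(sock, server_hostname="youtube.com") as tls:
        print(tls.version())                  # negotiated protocol, e.g. TLSv1.3
        print(tls.getpeercert()["subject"])   # identity the certificate binds
```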

Chain of Trust

The Chain of Trust refers to how an SSL certificate links back to a trusted root Certificate Authority.

Root Certificate — A root certificate is a digital certificate that belongs to the issuing Certificate Authority. It comes pre-downloaded in most browsers and is stored in what is called a “trust store.” The root certificates are closely guarded by the Certificate Authorities.

Intermediate Certificate — These act as middlemen between the protected root certificates and the server certificates issued to the public. There will always be at least one intermediate certificate in a chain, but there can be more than one.

Server Certificate — The server certificate is the one issued for the specific domain the user needs coverage for.

How chaining works: Your browser comes preinstalled with the public keys of all major root CAs. Whenever a certificate needs to be verified, the browser begins with the lowest-level certificate and traces back up to the root CA, checking whether the digital signature of the root certificate can be verified using its preinstalled public key.

If it can’t be chained back to a trusted root, the browser will issue a warning about the certificate.
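As a toy illustration of the chain walk (real verification also checks signatures, validity windows, and revocation; every name below is made up):

```python
# Each certificate names its issuer; verification succeeds only if the
# walk ends at a root the client already trusts.
TRUST_STORE = {"Root CA X"}  # stands in for the browser's preinstalled roots

ISSUER = {
    "youtube.com": "Intermediate CA Y",   # server cert -> intermediate
    "Intermediate CA Y": "Root CA X",     # intermediate -> root
    "Root CA X": "Root CA X",             # roots are self-signed
}

def chains_to_trusted_root(subject: str) -> bool:
    seen = set()
    while subject not in seen:
        if subject in TRUST_STORE:
            return True
        seen.add(subject)
        subject = ISSUER.get(subject, subject)
    return False  # the browser would warn about this certificate

print(chains_to_trusted_root("youtube.com"))  # True
```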

Streams vs Message Queues

In distributed systems, services communicate using events. Each service generates events that may require action to be taken on some other service. To pass these events between services, either streaming or queuing is used.

Message queues are usually used for point-to-point communication. There can be multiple producers and consumers of the messages, but the queue evenly distributes messages across the consumers and guarantees that each message is consumed only once. If all consumers are busy, messages are stored in the queue itself and delivered once a consumer is free. Examples of message queues include RabbitMQ, ActiveMQ, and IronMQ.

In a message queue, once a message is consumed it is removed from the queue, so if processing fails the message cannot be replayed. Message queues are used when:

  1. Each message needs to be delivered to exactly one consumer.
  2. A guarantee on the ordering of message delivery is not needed.
  3. Consumers can be scaled independently.

Streams, on the other hand, work like log managers: events are stored in log files and can be read by multiple consumers. Unlike queues, the same message can be read by more than one consumer. With the right setup, a stream delivers messages to all its subscribers in a specified order. Examples are Kafka, Kinesis, etc.

Streams are used when:

  1. The same message must be delivered to more than one consumer.
  2. Ordering of message delivery is needed.
  3. Multiple kinds of processing need to happen on the same message.
  4. Streams retain messages for a specified period, so even if processing fails, the message remains available in the stream for replay.
  5. Generally messages are segregated into topics (and partitions within a topic), and each partition is consumed by only one consumer in a group because delivery order matters. This can become a bottleneck when dynamic scaling is needed.

Queues and streams can also be used for throttling load and for rate limiting. Both provide a way for services to communicate by generating events, but it is very important to choose the right one for the use case.
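As a sketch of the stream model, here is what publish and consume look like with the third-party kafka-python package; the broker address, topic, and group name are assumptions:

```python
# Stream sketch: consumers in different groups each get their own copy of
# every message; consumers in the same group split the partitions.
from kafka import KafkaConsumer, KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("order-events", b'{"orderId": 42, "status": "PLACED"}')
producer.flush()

consumer = KafkaConsumer(
    "order-events",
    bootstrap_servers="localhost:9092",
    group_id="audit-service",       # a second group would replay the same events
    auto_offset_reset="earliest",   # replay from the start of retention
)
for record in consumer:
    print(record.offset, record.value)
```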

Synchronous vs Asynchronous flows

What do we mean when someone says "we are calling the downstream service in sync"?

When the communication between two systems happens in real time, it is called "sync".

And what is "async"?

Communication that happens independent of time is called "async".

From a microservices perspective, when one service needs to communicate with another service in real time, whether over HTTP or TCP, it is a sync flow. Whenever a service needs to do sequential processing, or cannot move forward without receiving a response from the downstream, a sync flow should be used. Example: on a checkout page, payment can only be collected for an order after the availability of the items has been verified, so the payment flow can only be triggered after the item-availability check; hence it is a sync process.

Async, on the other hand, is fire-and-forget: the communication is not completed in the main process but handed to another thread, or an event is pushed to a stream or message queue. The sending service is thus not blocked and continues doing its work. Example: the audit-generation flow for an update event can be implemented async.
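A toy contrast between the two flows; the service URLs and the in-process queue below are hypothetical stand-ins for real downstream dependencies:

```python
# Sync: block on the downstream response. Async: hand the event to a
# worker (here a thread + queue; in production, a stream or message queue).
import queue
import threading
import requests

audit_queue: "queue.Queue[dict]" = queue.Queue()

def audit_worker() -> None:
    while True:
        event = audit_queue.get()
        print("writing audit record:", event)  # e.g. persist somewhere durable
        audit_queue.task_done()

threading.Thread(target=audit_worker, daemon=True).start()

def checkout(order: dict) -> None:
    # Sync flow: payment cannot start until availability is confirmed.
    resp = requests.get("https://inventory.internal/check",
                        params={"sku": order["sku"]})
    if resp.json().get("available"):
        requests.post("https://payments.internal/charge", json=order)
    # Async flow: fire-and-forget audit event; the caller does not wait.
    audit_queue.put({"event": "CHECKOUT", "order": order})
```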

Redundancy and Replication

Failures are bound to happen at some point in a distributed system. Redundancy and replication can be used to reduce the probability of a single point of failure and to reduce downtime when a failure occurs.

Redundancy: This is the duplication of important components of a distributed system, such as load balancers, application servers, databases, and streams. Redundancy makes the system more resilient by reducing its downtime in a crisis. Having redundancy for every component may be overkill and expensive, so it is important to choose which components need to be duplicated and which do not.

Replication: Sharing information between the duplicated resources so that the data stays in a consistent state. This is mostly used for databases (see the sketch after the list below).

Factors to take into consideration for replication:

  1. Criticality of the data: is it acceptable for the system to be eventually consistent?
  2. Cost
  3. Need for the data across regions
  4. Performance of the system, as writing data to replicas may increase latencies.
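Here is the latency trade-off from point 4 as a toy sketch, with plain dicts standing in for real databases:

```python
# Synchronous replication acknowledges a write only after every replica
# has it (consistent, slower); asynchronous replication acknowledges after
# the primary alone (faster, eventually consistent).
import threading

primary: dict = {}
replicas = [dict(), dict()]

def write_sync(key, value) -> None:
    primary[key] = value
    for replica in replicas:          # wait for every replica before the ack
        replica[key] = value

def write_async(key, value) -> None:
    primary[key] = value              # ack immediately after the primary write
    for replica in replicas:          # replicas catch up in the background
        threading.Thread(target=replica.__setitem__, args=(key, value)).start()
```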

Rate Limiting and Throttling

A rate limiter is used to control the rate of traffic sent by a client or a service.

If the request count exceeds the threshold defined by the rate limiter, all the excess calls are blocked (see the token-bucket sketch at the end of this section). Here are a few examples:

• A user can write no more than 2 posts per second.
• You can create a maximum of 10 accounts per day from the same IP address.

Benefits of using an API rate limiter:

• Prevent resource starvation caused by Denial of Service (DoS) attacks. A rate limiter prevents DoS attacks, whether intentional or unintentional, by blocking the excess calls.

• Reduce cost. Limiting excess requests means fewer servers and more resources allocated to high-priority APIs. Example: if an API internally calls a third-party service that charges per call, rate limiting can help reduce the cost.

• Reduce server load by filtering out excess requests caused by bots or users’ misbehaviour.
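One common way to implement such limits is a token bucket, sketched below as an in-process class; real deployments usually keep the counters in a shared store such as Redis so every instance sees the same budget:

```python
# Token-bucket rate limiter sketch: tokens refill at a fixed rate, each
# request spends one token, and requests with no token are rejected
# (typically with HTTP 429).
import time

class TokenBucket:
    def __init__(self, rate: float, capacity: int) -> None:
        self.rate = rate                  # tokens added per second
        self.capacity = capacity          # maximum burst size
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

limiter = TokenBucket(rate=2, capacity=2)   # "no more than 2 posts per second"
print([limiter.allow() for _ in range(4)])  # roughly [True, True, False, False]
```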

We will continue discussing more concepts in Part 2 of this series. Stay tuned.

You can also suggest in the comments any other concepts we should discuss. Thanks for reading!
