System Design: Useful concepts for building services at scale! - Part 2

umang goel
8 min read · Dec 15, 2021

In the previous article we discussed some of the important concepts needed for building services at scale. If you haven’t read it, please follow this link.

Now we will continue with some more important concepts:

Database Locking

Why locks?

To protect the integrity and atomicity of data when the database receives concurrent read and write requests for the same record.

Types of locking:

Pessimistic Locking: Take a lock so that no one else can read or write the record. This type of locking is more suitable when there is a large number of write operations on a single record.

Pros: Maintains the integrity of the data in a better way, as no one will have access to stale data.

Cons: Can lead to deadlock conditions if not handled properly.
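In a relational database, pessimistic locking typically takes the form of `SELECT ... FOR UPDATE`. The same idea can be sketched in-process with a mutex; the `Account` class below is a hypothetical illustration, not tied to any particular database:

```python
import threading

class Account:
    """Toy record guarded by a pessimistic lock: every reader and
    writer must acquire the lock first, so no one sees stale data."""
    def __init__(self, balance):
        self._lock = threading.Lock()
        self._balance = balance

    def withdraw(self, amount):
        # Exclusive access for the whole read-modify-write sequence.
        with self._lock:
            if self._balance >= amount:
                self._balance -= amount
                return True
            return False

    def balance(self):
        with self._lock:
            return self._balance
```

Because the lock spans the entire check-then-update sequence, two concurrent withdrawals can never both succeed against the same funds.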

Optimistic Locking: Allow multiple users to read the data, but while updating use some version-control mechanism to identify whether the data being updated is stale. If it is stale, the update cannot be committed. This is more useful when the chances of concurrent updates to the same record are low.

Pros: All services can read the data and do their processing at the same time.

Cons: Additional effort and handling are needed to roll back failed transactions.
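The version-control mechanism can be sketched as a record carrying a version number, where an update commits only if the version has not changed since the caller read it (a compare-and-swap style check). This is a minimal illustration; real databases usually implement it with a version or timestamp column:

```python
class VersionedRecord:
    """Optimistic locking sketch: an update succeeds only when the
    caller's version matches the record's current version."""
    def __init__(self, value):
        self.value = value
        self.version = 0

    def read(self):
        return self.value, self.version

    def update(self, new_value, expected_version):
        if self.version != expected_version:
            return False  # stale read: caller should re-read and retry
        self.value = new_value
        self.version += 1
        return True
```

A writer holding a stale version gets a failed update and must re-read the record before retrying, which is exactly the rollback-and-retry handling mentioned in the cons above.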

NOTE: If you don’t care about previously applied changes to the data, then simply update the data; no locking mechanism is needed.

Observability

Software systems have become an integral part of any organisation. As the organisation evolves, its software systems also evolve and become more complex in nature. With multiple components coming into play to perform a task in a distributed system, it becomes more difficult to monitor the system. Monitoring involves looking into the health of the system, identifying application issues, tracking the complete end-to-end flow of a request, etc. Different components may have a different variety of monitoring tools and alerting mechanisms to monitor, discover, identify, and debug any issue.

Observability refers to the various mechanisms used not only to trace issues in the system but also to monitor and track the overall correctness of any system, whether it is a monolith or a microservice-based system.

Logging, metrics, and traces are often used interchangeably when talking about observability, but each of them works in a unique way and has different outcomes. Using these systems alone cannot guarantee an observable system, but a good understanding and use of these powerful tools will help build better systems.

Failures in distributed systems rarely happen because of one single event generated by a specific component. A failure usually involves multiple triggers from one or more highly disjoint components in the system, so looking only at the point of failure might not give the right insight into its cause. In order to reach the root cause, it is necessary to:

  1. Analyse the symptom at a granular level.
  2. Track the request lifecycle across the various components in the system.
  3. Analyse the interactions between those components.

Thus, building good observability into the system is also a very important aspect of software development.

Caching

Caching refers to storing data in a way that it can be accessed with minimum latency. Data stored in a cache is usually short-lived, but retrieval is faster compared to getting the data from the data source. Caches exist at every level of the system architecture: CDN, Nginx, backend servers, frontend, etc.

Application server cache: The cache is placed directly on a request-layer node. This type of cache is the fastest to access but resides in the memory of the server itself. In a distributed environment with a number of application server nodes, a distributed cache (e.g. Hazelcast) or a centralised cache (e.g. ElastiCache, Redis) can be used.

Caches usually store data temporarily and are not the source of truth. So if there is any update in the datastore, the data in the cache also needs to be updated; thus caches are mostly eventually consistent. There are different mechanisms to update the data in a cache:

Write-through Cache: Any data that is updated will be written to the database and cache at the same time.

Write-around cache: Here the data is updated in the datastore, and the cache is refreshed only after the cached entry has expired or been manually purged.

Write-back cache: This is a cache-first approach in which data is written to the cache first and periodically flushed to the datastore.
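The write-through strategy, for instance, can be sketched in a few lines. This is a hypothetical in-memory illustration, with a plain dict standing in for the database:

```python
class WriteThroughCache:
    """Write-through sketch: every write goes to the backing store
    and the cache in the same operation, so the cache never serves
    data the store does not have."""
    def __init__(self, store):
        self.store = store  # dict standing in for the database
        self.cache = {}

    def put(self, key, value):
        self.store[key] = value  # write to the database...
        self.cache[key] = value  # ...and to the cache at the same time

    def get(self, key):
        if key in self.cache:
            return self.cache[key]       # cache hit
        value = self.store.get(key)      # miss: fall back to the store
        if value is not None:
            self.cache[key] = value      # populate for next time
        return value
```

A write-around variant would skip the `self.cache[key] = value` line in `put`, and a write-back variant would skip the store write and flush the cache periodically instead.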

Bulkhead pattern

This pattern refers to isolating flows so that one flow doesn’t impact other flows in the system. This contains the damage to a specific part of the system while the rest remains intact.

A cloud-based application may include multiple services, with each service having one or more consumers. Excessive load or failure in a service will impact all consumers of the service.

Example:

An application has two APIs:

  1. GET : v1/users/{user-id}
  2. PUT : v1/users/{user-id}

Now suppose the load on the PUT API is high compared to the load on the GET API. An application server’s thread pool is of a fixed size, say 100 threads. If for some reason the PUT API’s performance degrades and its threads are not freed up, GET requests will also see the impact and might start degrading. This is where the pattern comes to the rescue: by giving the PUT and GET APIs separate, fixed-size thread pools, degradation of the PUT API will not lead to degradation of the GET API.
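The separate thread pools can be sketched with Python’s `concurrent.futures`. The handler functions and pool sizes below are illustrative assumptions, not part of the original example:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical request handlers for the two APIs.
def handle_get(user_id):
    return {"user": user_id}

def handle_put(user_id, body):
    return {"user": user_id, "updated": True}

# Bulkhead: separate fixed-size pools, so a slow PUT path cannot
# exhaust the threads serving GET requests.
get_pool = ThreadPoolExecutor(max_workers=80)
put_pool = ThreadPoolExecutor(max_workers=20)

def dispatch(method, *args):
    """Route each request to its own isolated pool."""
    if method == "PUT":
        return put_pool.submit(handle_put, *args)
    return get_pool.submit(handle_get, *args)
```

If every PUT worker hangs, the 80 GET workers keep serving reads; the failure is contained within one bulkhead.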

The benefits of this pattern include:

  • Isolates consumers and services from cascading failures.
  • Allows you to preserve some functionality in the event of a service failure. Other services and features of the application will continue to work.
  • Allows you to deploy services that offer a different quality of service for consuming applications. A high-priority consumer pool can be configured to use high-priority services.

Bloom filters

These are probabilistic data structures used to test whether an element is a member of a set. A negative answer is definite: if the filter says the element is not present, it is certainly not there. A positive answer, however, is only a “maybe”: the element may or may not actually be present (false positives are possible, false negatives are not).

How it works:

  1. Pass the object through a hash function to get a hash value.
  2. Mark the bit in a bit array at the index equal to that hash.
  3. To check whether an object is present, hash the incoming object with the same function and see whether the bit at that index is set. If it is not set, the object is definitely absent.
  4. A set bit alone doesn’t guarantee presence, as many objects can map to the same index. To increase confidence, multiple hash functions are used.
  5. Each object then sets bits at the indices given by all of the hash functions.
  6. To check membership, the object is passed through all the hash functions and all the corresponding bits are checked. Only if every bit is set is the object possibly present.
  7. The false-positive rate can be tuned by choosing the number of hash functions and the bit-array size according to the expected volume of data.
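The steps above can be sketched as a minimal Bloom filter. The sizes, the number of hash functions, and the use of salted SHA-256 as the hash family are illustrative choices; production implementations use faster non-cryptographic hashes:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k salted hash functions over one bit array.
    A False answer is definite; a True answer may be a false positive."""
    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [False] * size

    def _indexes(self, item):
        # Derive k independent indices by salting the hash with i.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for idx in self._indexes(item):
            self.bits[idx] = True

    def might_contain(self, item):
        # Every bit must be set for the item to be possibly present.
        return all(self.bits[idx] for idx in self._indexes(item))
```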

Usages: weak password detection, blacklisting URLs, internet caching protocols, etc.

Distributed transactions

For monolithic services, the data is usually stored in a single database. If a flow involves multiple operations, they can easily be organised in a transaction, and if one of the operations fails, the whole change can be rolled back.

In the above diagram there is only a single transaction, and in case of any error the operation will be rolled back. If there are parallel transactions, locks on the records keep data isolation and consistency intact.

But in a microservice-based system, where the data is distributed across applications, it becomes difficult to perform operations in the form of a transaction.

Example: Service A -> Payment service, Service B -> Order service

In this example, both services are independent of each other and maintain their data independently. Now there is a possibility that the payment is completed but the order is not, in which case the payment for that order ID needs to be reversed. But as the data lives in different services, there has to be some mechanism that can help manage transactions in a distributed environment as well.

Commonly used mechanisms for maintaining transactions across services are:

  1. Two phase commit
  2. SAGA pattern
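The SAGA pattern, for instance, replaces one big transaction with a sequence of local steps, each paired with a compensating action that undoes it if a later step fails. The sketch below is a hypothetical illustration of the payment/order example; the function names are made up:

```python
def run_saga(steps):
    """steps: list of (action, compensation) pairs. On failure,
    run the compensations of completed steps in reverse order."""
    done = []
    for action, compensation in steps:
        try:
            action()
            done.append(compensation)
        except Exception:
            for comp in reversed(done):
                comp()  # undo earlier steps, newest first
            return False
    return True

# Illustrative scenario: payment succeeds, order creation fails,
# so the saga compensates by refunding the payment.
log = []

def charge():       log.append("payment charged")
def refund():       log.append("payment refunded")
def create_order(): raise RuntimeError("order service unavailable")
def cancel_order(): log.append("order cancelled")

ok = run_saga([(charge, refund), (create_order, cancel_order)])
```

After this run, `ok` is `False` and the log shows the payment was charged and then refunded; the order compensation never runs because its step never completed.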

SQL vs NoSQL

While designing a system, some questions often arise around database choices, like “Which database shall I use: SQL or NoSQL?” and “Will the database I choose be able to handle the load, and is it scalable?”. There cannot be a generic guideline on the basis of which you can say “I will use this type of DB”. There is no one-size-fits-all solution, and the choice will vary on a case-by-case basis.

RDBMS are structured, and data is stored in the form of rows and columns. Examples: SQL Server, Oracle, MySQL, etc.

Non-RDBMS are unstructured and can have dynamic schemas. These can be further categorised on the basis of the way they store the data:

  1. Key Value stores: Redis, Dynamo
  2. Document Based: MongoDB
  3. Column DB: Cassandra, HBase
  4. GraphDB: Neo4j

Characteristics of RDBMS:

  1. Data is stored in rows and columns
  2. Schema is fixed.
  3. ACID compliant.

Characteristics of Non-RDBMS:

  1. Data can be stored in different storage models like key-value, documents, graph etc.
  2. Schemas can be fixed or dynamic
  3. ACID is sacrificed for performance and scale.

Some reasons to choose RDBMS:

  1. ACID compliance is the most important requirement.
  2. Data is structured and doesn’t change much.
  3. Data can be stored in a normalised manner.

Some reasons to choose Non-RDBMS:

  1. Storing large amounts of data with little or no structure at all.
  2. Scaling is needed.
  3. Rapid development pace: the schema need not be prepared and well defined upfront.
  4. Consistency can be sacrificed.

Partitioning and Sharding

Partitioning refers to dividing the data across machines based on some key. This helps improve the maintenance, availability, and scalability of the datastore. With partitioned data, it becomes convenient to scale horizontally by adding more servers.

Partitioning methods:

  1. Horizontal Partitioning
  2. Vertical Partitioning

There can be a number of partition criteria, such as:

  1. Key based
  2. List Based
  3. Round Robin
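Key-based partitioning, for example, derives the partition for a record from a hash of its key, so the same key always lands on the same server. The server names below are illustrative placeholders:

```python
import hashlib

# Hypothetical fleet of partition servers.
SERVERS = ["db-0", "db-1", "db-2", "db-3"]

def partition_for(key):
    """Map a record key to a server deterministically via its hash."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return SERVERS[int(digest, 16) % len(SERVERS)]
```

Note that with this simple modulo scheme, adding or removing a server remaps most keys; consistent hashing is the usual refinement when the server set changes often.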

Challenges with partitioning:

  1. As the data is stored across servers, executing joins becomes a major challenge.
  2. There is no relation between the data across the schemas, and there are no foreign key constraints. So all the relationship logic has to be maintained by the application server itself.
  3. Operations that involve multiple records might be slow, as they are executed on multiple servers.

Sharding and partitioning are used interchangeably most of the time, but there is a slight difference between the two concepts. When a database grows beyond the point where it becomes difficult to manage, it can be sharded to achieve horizontal scaling. Sharding is a process that includes choosing a shard key, creating the shards, and distributing the data across them. Each shard can be an independent cluster in itself, and how the data is distributed across shards can be handled either at the application level or at the database level, if the database provides that mechanism.

You can also suggest in the comments any other concepts we could discuss. Thanks for reading!

