Growing Your Small Systems into a Large Distributed System in a Reliable Way
When we start building systems, especially in startups, it is very important to know how to scale them, and even more important to make sure they can scale up or down with traffic. Unfortunately, we sometimes focus on delivering more, faster, and compromise on the things that matter in the long run, like performance and scalability.
Nowadays growth can be very rapid, and our systems get overwhelmed by the growing demands of scale and performance. When the product starts to thrive, everyone has to spend extra time fixing things, new development gets deprioritized, and our customers suffer from a slow application. It is painful.
I have seen this rapid growth multiple times in my career and have been part of those changes, learning several lessons the hard way. Here are a few of them.
Murphy’s Law is a familiar adage that always holds true: “Anything that can go wrong will go wrong.”
So what actually matters to your system as you grow? There are a few basic parameters we can use to judge the stability of a system.
- Availability: Your system should have high availability. Uptime percentage is key for customer experience, not to mention that an application is not useful if no one can use it.
- Performance: Your system should continue to function and perform its tasks even under heavy load. Speed is crucial for customer experience.
- Reliability: The system should continue to work correctly (performing the correct function at the desired level of performance) even in the face of adversity (hardware or software faults, and even human error). In simple words, “continuing to work correctly, even when things go wrong.”
We can scale our services in two ways to quickly support increased traffic:
Scale Vertically: Add more power (CPU, RAM) to your existing machines so each one has more capacity to serve requests.
Scale Horizontally: Add more nodes so the system can serve more traffic or process more data/events.
When we talk about scale, though, simple vertical or horizontal scaling alone won't take our services very far; we hit a roadblock after a certain limit. There are several other things to consider if we want to handle increased traffic effectively.
This post is a collection of practices I’ve found useful to reliably operate a large system.
Monitoring
Before reacting to increased scale by scaling up or fixing things, we first need to figure out what issues our systems actually have.
In order to have a reliable system that is also available, we need to be able to detect errors quickly and for that, we need to gain observability in our system. We should categorize our monitoring into multiple buckets.
- Infrastructure Metrics: Measure the utilization of each part of our infrastructure, for example the CPU usage, memory, disk space, and network bandwidth used by our application. If one or more machines or virtual machines are overloaded, parts of the distributed system degrade, so these basics are always worth monitoring.
- Application Performance Metrics: APM is a set of actions and software that checks the well-being of an app based on its availability, responsiveness, and behavior. With APM, you can quickly identify potential problems before they affect your end-users. Answering the question “is this backend service healthy?” is a pretty common one. Observing things like how much traffic is hitting the endpoint, the error rate, and endpoint latency all provide valuable information on service health.
- Business Metrics: While monitoring services tells us a lot about how correctly the service seems to work, it tells nothing about whether it works as intended, allowing for “business as usual”. Identifying business events that the service enables and monitoring these business events is one of the most important monitoring steps.
Without monitoring, you are just firing shots without aiming. You first have to find the target before you can hit it.
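To make the application-metrics bucket concrete, here is a minimal sketch of recording latency and error counts per endpoint. The in-memory registry and the names (`observe_request`, the endpoint labels) are illustrative; in production you would push these numbers to a real metrics backend such as StatsD, Prometheus, or CloudWatch.

```python
import time
from collections import defaultdict

# Toy in-process registry; production code would push these to a real
# metrics backend (StatsD, Prometheus, CloudWatch, ...).
latencies = defaultdict(list)
counters = defaultdict(int)

def observe_request(endpoint, handler):
    """Record latency plus success/error counts for one call to `handler`."""
    start = time.monotonic()
    try:
        result = handler()
        counters[f"{endpoint}.success"] += 1
        return result
    except Exception:
        counters[f"{endpoint}.error"] += 1      # feeds the error-rate alarm
        raise
    finally:
        latencies[f"{endpoint}.latency_ms"].append(
            (time.monotonic() - start) * 1000)

# Usage: observe_request("GET /orders", lambda: {"orders": []})
```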
Alerting & Auto Scaling
Monitoring is a great tool for inspecting the current state of the system, but it's really just a stepping stone: the goal is to automatically detect when things go wrong and raise alerts that people can act on. Setting proactive alarms helps you take action before things actually break.
We also need to add only relevant, actionable alarms; otherwise they get neglected.
Based on those alarms, you can also trigger autoscaling when you reach a certain threshold, which helps handle momentary traffic spikes automatically, as the sketch below illustrates.
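This is a rough sketch of an actionable alarm: the check only fires after several consecutive breaches, so a single momentary spike doesn't page anyone or trigger scaling. The metric source, the thresholds, and the `scale_out` hook are all assumptions, not any specific vendor's API.

```python
def evaluate_cpu_alarm(samples, threshold=75.0, periods=3):
    """Fire only after `periods` consecutive breaches so a single
    momentary spike doesn't page anyone or trigger scaling."""
    recent = samples[-periods:]
    return len(recent) == periods and all(s > threshold for s in recent)

def on_alarm(scale_out):
    cpu_samples = [62.0, 78.5, 81.2, 90.1]   # e.g. fetched from monitoring
    if evaluate_cpu_alarm(cpu_samples):
        scale_out(extra_instances=2)          # hypothetical autoscaler hook

# Usage: on_alarm(lambda extra_instances: print(f"adding {extra_instances}"))
```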
Splitting Services
Breaking functionality into multiple services can drastically improve performance and scalability, and reduce costs. Rather than scaling the whole system, we can scale only the most problematic services, and each service can use the architecture, programming language, and data store that is optimal for its job.
There are a couple of things to consider before starting with microservices: a microservices-based architecture introduces a new set of problems of its own, so define proper service boundaries when splitting services or designing from scratch.
Embracing Asynchronicity
When we make a synchronous call, the execution path is blocked until a response is returned, and this blocking has a resource overhead, mainly memory and context switches. We can't always design our systems using async calls only, but where we can, they make the system more efficient. This way we reduce the choking of our systems and also ensure that operations complete with eventual consistency.
Asynchronicity enables message producers and consumers to be scaled independently. You can add more processing servers to serve requests faster while still enjoying the benefit of a non-blocking client.
Asynchronous flow also increases fault tolerance: a processing server can shut down and requests from clients will still be accepted. Once the server is back up, the message queue delivers the pending requests to it for processing.
Asynchronous processing introduces a non-blocking request handling process into your applications, giving you a more decoupled system that can be easily scaled for performance.
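Here is a minimal in-process sketch of the producer/consumer pattern using only Python's standard library; in a real system the queue would be a broker such as SQS, Kafka, or RabbitMQ, and the workers would be separate servers rather than threads.

```python
import queue
import threading

tasks = queue.Queue()   # stand-in for a broker such as SQS, Kafka, RabbitMQ

def accept_order(order_id):
    """Producer: enqueue and return immediately; the client is not blocked."""
    tasks.put(order_id)
    return {"status": "accepted", "order": order_id}

def worker():
    """Consumer: workers scale independently of the producers."""
    while True:
        order_id = tasks.get()
        print(f"processed order {order_id}")   # the slow work happens here
        tasks.task_done()

for _ in range(4):                             # four consumers, one producer
    threading.Thread(target=worker, daemon=True).start()

accept_order(101)
tasks.join()                                   # demo only: wait for drain
```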
Timeouts, Sleep & Retries
Proactive timeouts help us avoid waiting indefinitely for a service to respond. Sometimes a few calls get stuck due to congestion or network issues, which can be avoided by dropping the call and retrying. Likewise, when one faulty node causes problems in a cluster, a retry will often land on a healthy node and safeguard the system.
Any network may suffer from transient errors, delays, and congestion. When one service calls another, the request may fail, and if a retry is initiated the second request may succeed. That said, it's important not to implement retries naively (in a tight loop) without any delay between retries.
The reason is that we should be considerate of the called service. Multiple other services may call it simultaneously, and if all of them just keep retrying, the result is a "retry storm": the callee gets bombarded with requests that may overwhelm it and bring it down. To avoid a retry storm, it's common practice to use exponential backoff, possibly combined with a circuit breaker, as in the sketch below.
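A small sketch of retrying with exponential backoff and full jitter, which spreads retries out in time so simultaneous callers don't retry in lockstep; the function name and default values are illustrative.

```python
import random
import time

def call_with_backoff(call, max_attempts=5, base_delay=0.1, max_delay=5.0):
    """Retry `call` with exponential backoff and full jitter so that
    simultaneous callers don't retry in lockstep."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise                                  # out of attempts
            delay = min(max_delay, base_delay * 2 ** attempt)
            time.sleep(random.uniform(0, delay))       # full jitter

# Usage: call_with_backoff(lambda: flaky_remote_call())  # hypothetical call
```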
Circuit Breaker
It’s common for software systems to make remote calls to software running in different processes, probably on different machines across a network. If you have many callers on an unresponsive supplier, then you can run out of critical resources leading to cascading failures across multiple systems.
The circuit breaker pattern is the solution to this problem. A service client should invoke a remote service via a proxy that functions in a similar fashion to an electrical circuit breaker. When the number of consecutive failures crosses a threshold, the circuit breaker trips, and for the duration of a timeout period, all attempts to invoke the remote service will fail immediately. After the timeout expires the circuit breaker allows a limited number of test requests to pass through. If those requests succeed then the circuit breaker resumes normal operation. Otherwise, if there is a failure the timeout period begins again.
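A bare-bones sketch of the state machine described above; real implementations (resilience4j, Polly, pybreaker, and similar libraries) add half-open request limits and metrics, but the core logic is the same.

```python
import time

class CircuitBreaker:
    """Trip after `failure_threshold` consecutive failures; while open,
    fail fast for `reset_timeout` seconds, then allow a trial request."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None

    def call(self, remote_call):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None      # half-open: let a trial request through
        try:
            result = remote_call()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()      # trip the breaker
            raise
        self.failures = 0              # any success closes the circuit
        return result
```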
Kill Switch / Feature Flag
Feature Flag is a technique to turn some functionality of your application off, via configuration, without deploying new code. This practice doesn’t provide a 100% guarantee that our code is bug-free but it does have the effect of reducing the risk of deploying a new bug to production. Further, in case we enabled the feature flag and we see new errors in the system, it’s easy to disable the flag and “go back to normal” which is a huge win from an operational perspective.
Using feature flags, we can also disable or enable individual features, which helps us manage traffic better at high scale.
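A minimal kill-switch sketch, assuming flags are loaded from configuration; the flag name and the two code paths are hypothetical.

```python
import json

# Flags come from configuration (a file, env var, or a flag service), so
# flipping one requires no deployment. The names here are made up.
FLAGS = json.loads('{"new_checkout_flow": false}')

def is_enabled(flag, default=False):
    return FLAGS.get(flag, default)

def legacy_checkout(cart):
    return f"legacy checkout for {cart}"

def new_checkout(cart):
    return f"new checkout for {cart}"

def checkout(cart):
    if is_enabled("new_checkout_flow"):
        return new_checkout(cart)      # risky new path, easy to switch off
    return legacy_checkout(cart)       # safe fallback when the flag is off

print(checkout(["book", "pen"]))
```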
Caching
Caching is key to scaling your applications and is the most common, proven way to improve their performance and scalability. It works because most applications are read-heavy and perform many repetitive operations, so performance improves greatly when we avoid redoing resource-intensive work: operations that are expensive due to IO (accessing a database or service) or due to CPU-intensive computation. I have personally scaled APIs to serve millions of requests per minute at a latency of around 5 to 6 milliseconds with the help of caching. It also improves customer experience and satisfaction by reducing the overall latency of the APIs.
When you are thinking about scale, distributed caching makes scaling the cache itself a small hurdle: you can scale it horizontally with ease and serve your hot data directly from the cache.
Consider the 80–20 rule: roughly 80% of requests touch about 20% of the data, so caching that hot 20% gives your overall system an outsized boost.
If you are planning to use in-memory (in-process) caching instead, be very cautious: you have to scale it vertically, which is challenging, and you need proper eviction policies plus a way to keep caches consistent across instances in a distributed environment. It has a lot of benefits but also comes with significant limitations. A cache-aside sketch follows below.
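Here is a minimal cache-aside sketch with a TTL, using a plain dict in place of Redis or Memcached; `get_product` and the TTL value are illustrative.

```python
import time

cache = {}          # stand-in for Redis/Memcached
TTL_SECONDS = 60

def get_product(product_id, load_from_db):
    """Cache-aside: serve from cache when fresh, otherwise read through."""
    entry = cache.get(product_id)
    if entry is not None and time.monotonic() - entry[1] < TTL_SECONDS:
        return entry[0]                       # hit: skip the expensive read
    value = load_from_db(product_id)          # miss: go to the database
    cache[product_id] = (value, time.monotonic())
    return value

print(get_product(42, lambda pid: {"id": pid, "name": "Widget"}))
```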
Use Content Delivery Network (CDN)
CDN servers keep cached copies of content (such as images, web pages, etc.) and serve it from the location nearest to each user, which improves page load time because data is retrieved close to where it is requested. It also increases the availability of content, since it is stored in multiple locations.
Scale Your Databases
- Ensure the right set of indexes are in place.
- Tune your database default parameters to gain optimal performance.
- Check for the notorious N+1 queries (see the sketch after this list).
- Upgrade the database version to get the best that DB can offer.
- Evaluate the need for Horizontal scaling using Replicas and Sharding.
- For relational databases, consider the right partitioning strategy to boost query performance.
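To illustrate the N+1 item above, the sketch below uses SQLite from Python's standard library; the schema is made up, but the shape of the problem and the batched fix carry over to any ORM or driver.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total REAL);
    INSERT INTO users VALUES (1, 'Ada'), (2, 'Lin');
    INSERT INTO orders VALUES (1, 1, 9.5), (2, 1, 12.0), (3, 2, 7.25);
""")

user_ids = [row[0] for row in conn.execute("SELECT id FROM users")]

# N+1 anti-pattern: one query per user, i.e. N extra round trips.
for uid in user_ids:
    conn.execute("SELECT * FROM orders WHERE user_id = ?", (uid,)).fetchall()

# Fix: fetch all the orders in a single batched query.
placeholders = ",".join("?" for _ in user_ids)
orders = conn.execute(
    f"SELECT * FROM orders WHERE user_id IN ({placeholders})", user_ids
).fetchall()
print(orders)
```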
Separate Databases For Reading And Writing Concerns
In most user-facing systems, the number of data reads is orders of magnitude higher than the number of writes. We can significantly improve such systems by copying our data to a separate read-only database and using it to serve read requests. Multiple reader/replica instances can share the load, giving you easy horizontal scaling even for an RDBMS.
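A toy sketch of read/write routing; the connection targets are placeholders, and a real router would live in your data-access layer or driver. One caveat worth encoding: replicas lag slightly behind the primary, so reads that must see your own writes should go to the primary.

```python
import random

PRIMARY = "primary-db:5432"                   # placeholder connection targets
REPLICAS = ["replica-1:5432", "replica-2:5432"]

def target_for(query):
    """Send writes to the primary, spread reads across replicas.
    Replicas lag slightly, so read-your-own-writes should use the primary."""
    is_write = query.lstrip().upper().startswith(
        ("INSERT", "UPDATE", "DELETE"))
    return PRIMARY if is_write else random.choice(REPLICAS)

print(target_for("SELECT * FROM orders"))                      # a replica
print(target_for("INSERT INTO orders (total) VALUES (9.5)"))   # the primary
```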
Sharding
Sharding is another widely used scaling technique. In short, it is logical data partitioning based on a selected set of values, often called partition or shard keys, which allow us to split the data into multiple autonomous partitions. To get the most out of database sharding, we should be careful when selecting shard keys: data must be distributed evenly across the shards, otherwise the largest unbalanced shards will slow our applications down.
Sharding enables horizontal scaling and efficient data management: the query load can be distributed across many processors, and throughput scales by adding more nodes.
It can be applied in most databases, such as MySQL, MongoDB, Redis, etc. A simple shard-routing sketch is shown below.
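This is a minimal sketch of hash-based shard routing, assuming four shards and `user_id` as the shard key. Note that plain modulo hashing reshuffles most keys when the shard count changes, which is why production systems often use consistent hashing instead.

```python
import hashlib

SHARDS = ["shard-0", "shard-1", "shard-2", "shard-3"]

def shard_for(user_id):
    """Hash the shard key so rows spread evenly instead of hot-spotting
    (e.g. sequential IDs all landing on the newest shard)."""
    digest = hashlib.md5(str(user_id).encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

print(shard_for(12345))    # the same user always maps to the same shard
```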
Quality From The Beginning
When we start building a system, we mostly focus on building it quickly, and then we keep adding more and more features. In the process we often forget to keep quality, performance, and scale in mind. With every change and every new feature release, we should consider all the NFRs (non-functional requirements): add proper unit tests, follow TDD, and make sure queries are optimized. Make sure your code maintains its quality from the beginning.
Avoid External Dependencies
Nowadays everyone is trying to adopt a microservice architecture, but this increases the number of external service calls. Every external dependency is another potential cause of failure in your service and also contributes to cascading failures. That is another reason to define proper boundaries during microservices design, and to avoid external calls as much as possible: add an external call only when it is the only way.
Load & Performance Testing
Load testing is important in the Software Development Lifecycle because of the following reasons:
- It simulates real user scenarios.
- It evaluates how the performance of an application can be affected by normal and peak loads.
- It helps save money by identifying bottlenecks and defects on time.
It is always advisable to test, analyze, and fix bugs during the Software Development Lifecycle before deploying an application to the real world and having it fail with end users. As part of a continuous integration cycle, it is good practice to run automated load tests to see whether code changes affect performance.
Load testing also tells us how much load we can actually serve. Based on that, we can use a virtual waiting room, auto-scaling, and alarms so that we can take appropriate action to support the increased load. Otherwise, your services might simply fall over under the spike.
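As one way to automate this, here is a minimal Locust (https://locust.io) script sketch; the endpoints, task weights, and host are placeholders for your own application.

```python
# locustfile.py -- run with:
#   locust -f locustfile.py --host https://staging.example.com
from locust import HttpUser, task, between

class ShopUser(HttpUser):
    wait_time = between(1, 3)              # think time between user actions

    @task(3)                               # reads weighted 3x heavier
    def browse_products(self):
        self.client.get("/products")       # placeholder endpoint

    @task(1)
    def place_order(self):
        self.client.post("/orders", json={"product_id": 1, "qty": 2})
```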
Virtual Waiting Room
Virtual waiting room technology is used to prevent sites from crashing during periods of unexpected traffic spikes. These tools detect abnormal traffic and provide a waiting page where users can wait until the site is able to handle additional visitors. Companies use these tools to ensure their site is capable of maintaining availability in unexpected situations or during special events such as an online sale or breaking news story.
Capacity Estimation
Capacity estimation is essential to scale. We should find out the number of requests expected, in RPM or RPS, and assess and plan our resources accordingly, with a proper calculation of what the service needs: the right number of servers, CPU cores, RAM, disk size, database size, cache size, network bandwidth, and so on.
We should estimate these numbers for both steady load and peak load.
During capacity estimation we sometimes under-provision and sometimes over-provision. If we under-provision, we lose revenue and leave our customers unsatisfied; if we over-provision, we waste money. To balance this out we can also leverage auto-scaling based on the alarms mentioned above, but regardless, capacity estimation is needed for a stable environment. A back-of-the-envelope sketch follows.
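The arithmetic itself is simple; every number below is an assumption you would replace with your own measurements from load testing.

```python
# Every number below is an assumption to replace with your own measurements.
peak_rps = 5000            # expected peak requests per second
rps_per_server = 400       # measured per instance via load testing
target_utilization = 0.70  # keep ~30% headroom for spikes and failures

servers = peak_rps / (rps_per_server * target_utilization)
print(f"peak needs {servers:.1f} servers -> provision {int(servers) + 1}")
# -> peak needs 17.9 servers -> provision 18
```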
Slow Rollout
Roll out your code to a subset of users first and make sure it is stable in production. Once you have validated that everything is going well, that users like it, and you have gathered their feedback, gradually roll the feature out to the complete user base. During this process we usually find bugs before they impact all users, fix them quickly, test again with the limited set of users, and then roll out. This lets us roll out with confidence.
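One common way to implement a slow rollout is to bucket users deterministically by hashing their ID, so each user keeps a consistent experience while the percentage grows; a small sketch, with the percentages as examples.

```python
import hashlib

def in_rollout(user_id, percent):
    """Deterministically bucket users into 0-99 so each user keeps the
    same experience while the rollout percentage grows."""
    bucket = int(hashlib.sha1(str(user_id).encode()).hexdigest(), 16) % 100
    return bucket < percent

# Start at 5%, then raise to 25, 50, 100 as confidence grows.
print(in_rollout(user_id=42, percent=5))
```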
Test Your Service Resiliency
We should also thoroughly test our services' resiliency to see how they behave when a random server terminates. This does happen during peak loads, and to be truly resilient you need to examine it before it happens to you.
We also call this Chaos Engineering.
Chaos Engineering is the discipline of experimenting with a system in order to build confidence in the system’s capability to withstand turbulent conditions in production. To read more you can check out https://principlesofchaos.org/
One very common resiliency check can be performed with Chaos Monkey, which randomly terminates instances in production to ensure that engineers implement their services to be resilient to instance failures. You can check out https://netflix.github.io/chaosmonkey/
Even After All This, Systems Will Go Down. How To Act In Those Critical Times?
Look at your monitoring tools, figure out the cause of the issues, and take a decisive call. The easiest short-term mitigation is often to provision more hardware and scale horizontally as well as vertically.
Conclusion
Various factors play a role in scaling our applications and helping them cope with increased demand. I hope the tips above will help you when you're looking to take your application to the next level.