Three pillars of a “good” software product

Apurva Misra · Published in Analytics Vidhya · Jan 21, 2022

How many times has it happened that your manager mentioned terms like SLA or SLO and it went over your head?

I recently started reading this awesome book, Designing Data-Intensive Applications by Martin Kleppmann, and thought I would share a bit of what I am learning! I think this post will help you out if you are working in any capacity with a technical product. In this blog, I am going to go over the definitions of Reliability, Scalability and Maintainability, why they are important, and how we can get them right in our products.

Reliability

A system needs to keep working correctly even when faced with unexpected circumstances, e.g. hardware/software faults or human error. It is important to anticipate these faults and design a system that is fault-tolerant/resilient. Kleppmann does mention that there is no completely fault-tolerant system:

If the entire planet Earth (and all servers on it) were swallowed by a black hole, tolerance of that fault would require web hosting in space

It is in our best interest, however, to deliberately trigger faults and make sure the error handling works and bugs are taken care of. Chaos Monkey by Netflix lets you terminate instances randomly to uncover faults and helps in building more resilient systems.
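Chaos Monkey itself operates on cloud infrastructure, but the core idea can be sketched in a few lines of Python: randomly kill one of your own worker processes and check whether the rest keep serving. A minimal sketch, where the worker loop is just a stand-in for a real service:

```python
import random
import time
from multiprocessing import Process

def worker(worker_id: int) -> None:
    # Stand-in for a real service process handling requests.
    while True:
        print(f"worker {worker_id} handling requests")
        time.sleep(1)

if __name__ == "__main__":
    # Start a small pool of redundant workers.
    workers = [Process(target=worker, args=(i,)) for i in range(3)]
    for p in workers:
        p.start()

    # Chaos step: after a few seconds, terminate one worker at random
    # and watch whether the remaining workers keep the "service" alive.
    time.sleep(3)
    victim = random.choice(workers)
    print(f"terminating worker pid={victim.pid}")
    victim.terminate()

    # Observe for a few more seconds, then shut everything down.
    time.sleep(3)
    for p in workers:
        p.terminate()
```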

Hardware errors: Currently, a lot of companies rely on cloud platforms for their infrastructure needs. There are instances when a virtual machine provided through the platform goes offline without warning. A system that can work around the downtime and come back online one node at a time is resilient against hardware faults. Redundancy is also important: deploying a service on multiple machines so that if one of them has an issue the service does not go offline completely, for example deploying a pizza-ordering chatbot on multiple machines.
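On the client side, redundancy only helps if requests actually fail over to a healthy machine. A minimal sketch of that idea, using the requests library and hypothetical replica URLs for the pizza chatbot:

```python
import requests

# Hypothetical replicas of the same pizza-ordering service.
REPLICAS = [
    "https://pizza-bot-1.example.com",
    "https://pizza-bot-2.example.com",
    "https://pizza-bot-3.example.com",
]

def place_order(order: dict) -> dict:
    # Try each replica in turn; if one machine is down,
    # the next one keeps the service available.
    last_error = None
    for base_url in REPLICAS:
        try:
            response = requests.post(f"{base_url}/orders", json=order, timeout=2)
            response.raise_for_status()
            return response.json()
        except requests.RequestException as exc:
            last_error = exc
    raise RuntimeError("all replicas are unavailable") from last_error
```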

Software errors: These are harder to anticipate and can be correlated across nodes, leading to multiple failures at once. Say the pizza-ordering chatbot has a software bug that causes a failure. We would want to make sure the pizza request was saved somewhere so it can be handled when the system comes back online and the customer does not have to spend the evening cursing your platform. One solution is a message queue (while also making sure the number of incoming messages equals the number of outgoing messages).
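In production this would be a durable queue such as RabbitMQ or Kafka, but the idea can be sketched with Python's standard-library queue: the order is saved to the queue before any processing happens, so a crash in the handler does not lose it.

```python
import queue

# In-memory stand-in for a durable message queue (RabbitMQ, Kafka, ...).
orders = queue.Queue()

def handle_order(order: dict) -> None:
    # Stand-in for the chatbot's order-processing logic.
    print(f"baking pizza for {order['customer']}")

def receive_order(order: dict) -> None:
    # Save the request before doing any processing.
    orders.put(order)

def process_orders() -> None:
    while not orders.empty():
        order = orders.get()
        try:
            handle_order(order)
        except Exception:
            # If processing crashes, put the order back so nothing is
            # lost and it can be retried once the bug is fixed.
            orders.put(order)
            raise

receive_order({"customer": "Alice", "pizza": "margherita"})
process_orders()
```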

Human errors: Humans are known to be unreliable, and unfortunately we are part of every system from design to operations. What can we do to make systems resilient to human errors?

  • Designing systems that minimize the opportunities for error. For example, the Ops team updating infrastructure using code (Infrastructure as Code) rather than doing it manually. However, if the interface is too restrictive, humans tend to work around it: if an Ops team member makes an infrastructure change manually, the code goes out of sync, which leads to more errors down the road.
  • Providing a sandbox environment to explore and experiment in.
  • Allowing quick rollback in case of errors, and deploying gradually so that problems surface before they affect everyone (see the sketch after this list).
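A gradual rollout can be as simple as routing a small, configurable fraction of traffic to the new version and keeping a one-line switch for rollback. A minimal, hypothetical sketch:

```python
import random

# Fraction of traffic sent to the new version; start small and
# increase it only if no issues show up.
ROLLOUT_FRACTION = 0.05
ROLLED_BACK = False  # flip to True for a quick rollback

def choose_version() -> str:
    if ROLLED_BACK:
        return "v1"
    return "v2" if random.random() < ROLLOUT_FRACTION else "v1"

# Roughly 5% of requests hit v2 until we gain confidence.
print([choose_version() for _ in range(20)])
```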

Scalability

Scalability describes whether a system can handle an increase in load. A phrase that gets thrown around frequently in technical teams is 'performance testing' or 'load testing' before any solution goes into production. Generally, there are numbers like requests per second, the ratio of reads to writes in a database, etc. that the team is trying to optimize around.

  1. Batch processing: In these systems we care about throughput, that is, the number of records processed per second. For example, for cron jobs that run overnight, we care about how long it takes to process a certain number of records (see the sketch after this list).
  2. Online systems: In these systems we care about how long the processing takes. If we are shopping online, we want our recommendations to appear in under a second rather than having to wait for them. Response time is the time between a client sending a request and receiving the result back. In general, response time is not one specific number for a system but a distribution of values.
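For the batch case, throughput is straightforward to measure. A minimal sketch, where process_record is just a stand-in for real per-record work:

```python
import time

def process_record(record: int) -> int:
    # Stand-in for real per-record work.
    return record * 2

records = range(1_000_000)

start = time.perf_counter()
for record in records:
    process_record(record)
elapsed = time.perf_counter() - start

# Throughput: records processed per second.
print(f"processed {len(records)} records in {elapsed:.2f}s "
      f"({len(records) / elapsed:,.0f} records/sec)")
```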

Latency vs response time: Response time is what the client sees, and it includes network delays and queueing delays, while latency is just the duration that the request is waiting to be handled. Therefore, response time is a better reflection of what the client experiences.

Locust is an open-source tool that lets you carry out performance tests on your API endpoints; below is a screenshot of the response time statistics observed for an endpoint. One thing that deserves our attention is that the results are aggregated into percentiles. Why is that important? Why don't we just take an average?
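For reference, a minimal locustfile is only a few lines; the /recommendations endpoint and the host name below are hypothetical placeholders:

```python
# locustfile.py -- a minimal Locust sketch.
from locust import HttpUser, task, between

class ShopUser(HttpUser):
    # Each simulated user waits 1-3 seconds between requests.
    wait_time = between(1, 3)

    @task
    def get_recommendations(self):
        # Locust records the response time of every request and
        # aggregates them into percentiles in its statistics view.
        self.client.get("/recommendations")
```

Running `locust -f locustfile.py --host https://your-api.example.com` simulates users hitting the endpoint and reports the response times as percentiles, like the ones discussed below.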

50th percentile = 250 ms: this means 50% of the requests take less than 250 ms and the other 50% take more than 250 ms.

Similarly,

99th percentile = 670 ms: this means that out of 100 requests, 99 of them take less than 670 ms while 1 takes more than 670 ms.

Percentiles help us quantify user experience. A high percentile of response time, for example the 99.9th percentile, tells us that 1 out of 1000 requests takes longer than a certain threshold. This is where the company needs to ask whether it is okay with one out of every 1000 requests having a slower response. Percentiles are often used in Service Level Objectives (SLOs) and Service Level Agreements (SLAs), which define the expected performance and availability of a service.
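To see why the average can be misleading, here is a small sketch that compares the mean with the 50th and 99th percentiles using NumPy; the response times are made up:

```python
import numpy as np

# Hypothetical response times in milliseconds collected during a load test.
response_times_ms = np.array([120, 180, 240, 250, 260, 300, 410, 530, 670, 900])

p50, p99 = np.percentile(response_times_ms, [50, 99])
mean = response_times_ms.mean()

# The mean hides the slow tail; the high percentiles expose it.
print(f"mean = {mean:.0f} ms, p50 = {p50:.0f} ms, p99 = {p99:.0f} ms")
```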

Maintainability

How many times have you been made to work on a legacy system that is complex, full of baked-in assumptions, and whose original authors have all left the company?

Maintainability is about following best practices to enhance simplicity and evolvability (the ease of making future alterations). For example, making sure that the right abstractions are used to simplify the system, and that the code is modular and follows DRY (Don't Repeat Yourself).

Maintainability includes one more aspect: operability, which means making it easier for the operations team to keep the system running. That includes monitoring the health of the system, tracking down problems to their source, being aware of the inter-dependencies in the system, keeping the system up to date with security patches (remember the recent Log4j security bug?) and more.

Simplicity doesn't always have to mean reducing the functionality of the system. It just means that for a problem with a certain complexity we do not implement a solution with far more complexity. This is especially relevant to the data science field: when the model isn't able to handle all the edge cases, a lot of code is thrown in to fill that gap with assumptions. This leads to a hairball that the next employee is made to untangle, until they decide it is time to rewrite the whole thing from scratch.

To summarise, as an engineer working on software it is important to stress test your system and find the faults that need to be handled, do load testing to make sure your clients are not left hanging, and be the bigger human and write code that the person after you would love to work with.

Keep Testing & Keep Coding!

If you want to connect with me, you can reach out on LinkedIn.
