On Assessing a System Architecture Design

Edwin Tunggawan
Cermati Group Tech Blog
17 min read · Oct 19, 2023

Back in 2022, I wrote this article on how to make technical decisions and solve problems, based on the conversations I had with my team members at Cermati Fintech Group. The purpose of the article was to summarize the general steps for identifying and approaching problems, in order to improve our ability to identify problems worth solving and to devise appropriate solutions for them.

In September 2023, I had a discussion with a team member in one of our one-on-one sessions. I was asked the following question: suppose we have designed a solution for a problem we're trying to solve; how do we know if the design is good in terms of aspects not explicitly defined in the project requirements, for example security or extensibility?

I felt I could expand the answer I gave during the session into something more comprehensive, so I decided to put it in written form. And since many other people might benefit from it as well, it became this article.

As a token of appreciation for the person who asked me the question, I'd like to start this article with a quote from Miyamoto Musashi that came to mind when I decided to write this article based on the question.

“A man cannot understand the art he is studying if he only looks for the end result without taking the time to delve deeply into the reasoning of the study.”
- Miyamoto Musashi

Miyamoto Musashi as depicted by Inoue Takehiko in Vagabond.

Cost of Development and Operations

One of the most obvious ways to check whether a system design is good enough is to look at how much it would cost us to develop, and how much it would cost us to operate later.

The development cost can be estimated by:

  • The number of people we’re going to need in the development process.
  • The number of hours each person is expected to work on it.
  • The cost of hardware and software required for development.

Take the estimated salary of the people working on the project, derive their hourly rates, and then multiply those by the estimated number of hours each person is expected to put into the project. So if the project requires a lot of people to put their hours into it, or if it requires very close supervision from people in very senior roles in the company, the project had better be really valuable.
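
To make the arithmetic concrete, here's a minimal back-of-the-envelope sketch in Python. Every role, rate, and number of hours below is made up for illustration, not an actual figure from any of our projects.

```python
# Hypothetical development cost estimate; every number below is made up.

team = [
    # (role, hourly_rate_usd, estimated_hours)
    ("senior engineer (supervision)", 80, 80),
    ("engineer", 35, 480),
    ("engineer", 35, 480),
]

labor_cost = sum(rate * hours for _, rate, hours in team)
hardware_and_software_cost = 3_000  # dev machines, licenses, test devices

total_development_cost = labor_cost + hardware_and_software_cost
print(f"Estimated development cost: ${total_development_cost:,}")
# 80 * 80 + 35 * 480 * 2 + 3,000 = $43,000
```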

Similarly, the operations cost can be estimated by:

  • The number of people we’re going to need for the operations.
  • The number of hours of training required for the operators to be able to operate it well.
  • The cost of infrastructure (supporting hardware, software, and other components) required for the operations.
  • The maintenance cost of the system.

The hours spent training the operators are hours during which the operators aren't yet operating the system, so we get no value out of the system from the operators-in-training during this period. To optimize this, we should make the training as cheap as possible, which translates to making the operations as intuitive as possible so the operators require less training.

The cost of infrastructure can be estimated by the cost of hardware, the cost of software licenses we need to pay, and other things that should be put in place in order for the system to run and to be operable. Optimizing for the infrastructure cost usually comes with optimizing for hardware cost, network resource usage, and software license cost.

The maintenance cost can be estimated from the estimated salary of the people we need for the maintenance tasks, the number of hours we expect them to spend maintaining the system, and the number of people required to do it. Optimizing it translates to making the system as low-maintenance as possible and keeping the technical skill bar required of the maintainers as low as possible.

We shared an example case of hardware cost optimization at Cermati Fintech Group in another article. Hardware resource usage optimization usually involves making our software as efficient as possible in terms of computing resource usage, and software license cost optimization usually involves setting up the system in a way that minimizes the license fees we have to pay (if any; if we strictly use community open-source software, it's usually free).

The following diagram shows the setup for the VoIP servers' graceful shutdown solution we implemented for the case of a power outage, as described in a section of the article mentioned in the previous paragraph. We leveraged a Raspberry Pi device as a beacon that receives and responds to periodic pings from the VoIP servers, which we used to implement our own graceful shutdown solution. Ours deviated from the official solution recommended by the UPS brand because the official solution was more complicated to implement, required more expensive hardware, and was designed to work primarily with that UPS brand (which means we might have needed to come up with another solution anyway if we purchased a UPS from a different brand later).

The diagram shows how power was supplied to the beacon and the VoIP servers, and how the VoIP servers can detect if there’s a power outage by periodically pinging the power beacon.
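
To sketch the mechanism in code: assuming the beacon runs on mains power only (so it goes dark during an outage while the UPS-backed VoIP servers keep running), a small daemon on each server could look roughly like the Python below. This is not our actual implementation, and the beacon address, intervals, and failure threshold are all made up.

```python
# Hypothetical sketch of the power-outage detection loop on a VoIP server.
# Repeated failed pings to the (non-UPS-backed) beacon are treated as a
# signal that mains power is out and a graceful shutdown should begin.
import subprocess
import time

BEACON_HOST = "10.0.0.5"       # made-up address of the Raspberry Pi beacon
PING_INTERVAL_SECONDS = 10
MAX_CONSECUTIVE_FAILURES = 6   # roughly a minute of silence before reacting

def beacon_is_up() -> bool:
    # Send a single ICMP ping with a short timeout.
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "2", BEACON_HOST],
        capture_output=True,
    )
    return result.returncode == 0

def main() -> None:
    failures = 0
    while True:
        failures = 0 if beacon_is_up() else failures + 1
        if failures >= MAX_CONSECUTIVE_FAILURES:
            # Mains power is probably out; shut down before the UPS drains.
            subprocess.run(["shutdown", "-h", "now"])
            return
        time.sleep(PING_INTERVAL_SECONDS)

if __name__ == "__main__":
    main()
```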

Optimizing for development cost sometimes means sacrificing operations cost and vice versa, so trade-offs might need to be made. Which one to optimize for should depend on the success criteria for the project. But in the case illustrated by the diagram above, we managed to cut both development and operations costs by coming up with our own solution, which was simpler to implement, cheaper in terms of required hardware, and easier for the call center network team to maintain.

Robustness of the Solution

The system will need to be deployed somewhere, and the environment where it's deployed might be quite hostile, with a lot of moving parts that can go wrong and with people trying to break, circumvent, or exploit the system.

This brings us to the matters of reliability and security. For both matters, we first need to identify the potential problems and threats we need to anticipate based on how the system is designed in order to be able to devise a plan to address them.

A few examples of things we can check on our design for reliability:

  • If the load gets so high that the system processes everything very slowly, can we easily add more processing capacity to the system?
  • If any of the external services the system relies on are down or inaccessible, how will the system handle the situation?
  • If there are abnormalities in the system’s behavior (e.g. error rate too high, response time too slow), how will we handle the situation?
  • Have we set up the necessary measures to ensure the system will be up according to the SLA we planned for it?

And for security:

  • What are the sensitive operations performed in the system and have we ensured that those operations can only be performed by authorized users?
  • What sensitive data do we keep in the system and have we ensured that the data is properly handled when at rest, in transit, and in use?
  • What critical points are there in the system, and what measures have we prepared to secure those critical points?
  • Suppose the system is breached: what have we prepared to allow us to detect the incident, contain it, and investigate it?

Reliability and security overlap with each other. Take, for example, the ability to handle an unexpectedly large amount of traffic: the surge might come from legitimate traffic caused by viral content or a promo on our site (which falls under the reliability domain), or from DDoS traffic generated by malicious actors (which falls under the security domain).

What we usually consider a reliability problem can easily be reframed as a security problem once we introduce a malicious actor into the scene. This is because reliability mainly deals with the availability aspect (and in some cases, also the confidentiality and integrity aspects) of the CIA (Confidentiality, Integrity, Availability) triad of information security. But unlike security, where we usually need to assume the existence of malicious actors whom we need to detect and respond to, reliability doesn't require an adversary in our model; the emphasis is usually on faulty components and the unexpected scenarios that trigger faulty behaviors and cause the system to behave in unexpected ways.

An example of this in a system architecture context is how we architect our production environment to serve requests to our web applications.

An illustration of our load-balancing architecture.

In the diagram above, we can see that the traffic from the Internet will pass through a WAF-enabled Cloud Load Balancer provided by GCP. Behind the Cloud Load Balancer, we have two OpenResty pods on the Kubernetes cluster that will forward the traffic to the upstream service that’s supposed to serve the request.

The reliability considerations for this setup:

  • We’re using a GCP-managed Cloud Load Balancer instead of deploying our own load balancer (Nginx or OpenResty) on a VM and exposing it to the Internet because we expect the Cloud Load Balancer to have a lower chance of failure due to unexpected issues from our side during our day-to-day operations.
  • We’re positioning the OpenResty load balancer as a second layer load balancer and an API gateway deployed on top of Kubernetes behind the Cloud Load Balancer, where we made sure to have two concurrent pods so we have a backup pod if one of the pods is down due to a failure.
  • Because the OpenResty pods are deployed on Kubernetes, any pod that fails will be restarted by Kubernetes, allowing it to recover automatically. (A minimal sketch of such a deployment follows this list.)
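
To illustrate the redundancy and self-healing points above, here's a minimal sketch of a two-replica Deployment expressed with the official Kubernetes Python client. The names, image, and probe endpoint are placeholders, not our actual configuration.

```python
# Hypothetical two-replica OpenResty Deployment; names, image, and probe
# endpoint are placeholders rather than our production setup.
from kubernetes import client, config

def build_openresty_deployment() -> client.V1Deployment:
    container = client.V1Container(
        name="openresty",
        image="openresty/openresty:1.21.4.1-alpine",
        ports=[client.V1ContainerPort(container_port=80)],
        # Kubernetes restarts the pod automatically if this probe fails.
        liveness_probe=client.V1Probe(
            http_get=client.V1HTTPGetAction(path="/healthz", port=80),
            period_seconds=10,
            failure_threshold=3,
        ),
    )
    template = client.V1PodTemplateSpec(
        metadata=client.V1ObjectMeta(labels={"app": "openresty"}),
        spec=client.V1PodSpec(containers=[container]),
    )
    spec = client.V1DeploymentSpec(
        replicas=2,  # one pod can fail while the other keeps serving
        selector=client.V1LabelSelector(match_labels={"app": "openresty"}),
        template=template,
    )
    return client.V1Deployment(
        api_version="apps/v1",
        kind="Deployment",
        metadata=client.V1ObjectMeta(name="openresty"),
        spec=spec,
    )

if __name__ == "__main__":
    config.load_kube_config()  # assumes a reachable cluster and kubeconfig
    client.AppsV1Api().create_namespaced_deployment(
        namespace="default", body=build_openresty_deployment()
    )
```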

As for the security considerations:

  • We can easily configure a WAF (web application firewall) on the Cloud Load Balancer while offloading the responsibility for maintaining the Cloud Load Balancer to GCP, so we can focus on monitoring malicious traffic and tuning the WAF rules.
  • With the Kubernetes deployment setup, we can move to a newer OpenResty version relatively easily and with less risk of causing downtime whenever we need to apply security updates, which translates to faster security update rollouts.
  • We also added strict egress firewall rules to the VPC network to minimize the surface an attacker could use to exfiltrate data or open a reverse shell connection to their servers. Kubernetes also helps sanitize the runtime environment from any payload injected into a service's containerized runtime, since pods are restarted if they become unresponsive and during redeployments.

In the context of the software code implementation, one thing that's usually tied to both reliability and security is how we handle requests from our users in our back-end service. For reliability, it usually involves making the code performant and scalable, while anticipating scenarios that might end up causing system failures and intermittent issues. For security, it usually involves adding the appropriate security controls to ensure only users who are authorized to perform the operations are allowed to do so, while also making sure that the technical implementation we're going for isn't known to be vulnerable to attacks by malicious actors.
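
As a small code-level illustration of the security side, here's a hedged sketch of an authorization check guarding a sensitive operation. The decorator, role names, and the refund operation are hypothetical, not taken from our codebase.

```python
# Hypothetical authorization guard for a sensitive back-end operation.
# The user model, roles, and operation below are illustrative only.
from dataclasses import dataclass, field
from functools import wraps

@dataclass
class User:
    username: str
    roles: set = field(default_factory=set)

class AuthorizationError(Exception):
    pass

def require_role(role: str):
    """Reject the call unless the acting user holds the required role."""
    def decorator(func):
        @wraps(func)
        def wrapper(user: User, *args, **kwargs):
            if role not in user.roles:
                raise AuthorizationError(
                    f"{user.username} is not allowed to call {func.__name__}"
                )
            return func(user, *args, **kwargs)
        return wrapper
    return decorator

@require_role("payments-admin")
def refund_transaction(user: User, transaction_id: str) -> None:
    print(f"{user.username} refunded transaction {transaction_id}")

if __name__ == "__main__":
    refund_transaction(User("alice", {"payments-admin"}), "trx-123")  # allowed
    try:
        refund_transaction(User("bob"), "trx-456")  # rejected
    except AuthorizationError as err:
        print(err)
```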

Compliance with Regulations and Industry Standards

As Cermati Fintech Group is operating in the fintech space, which is quite heavily regulated, we have to consider what regulations and standards we need to comply with in order to operate within the financial industry.

Since we’re based in Indonesia, we mainly need to comply with the rules and regulations enforced by regulatory bodies associated with the Indonesian government. Aside from the government regulations, there are also industry standards we’re committed to comply with such as ISO 27001 and PCI-DSS. Compliance with the regulations and standards is one of the major aspects we need to consider when designing our system.

The regulations and standards usually cover:

  • What data we may and may not keep in our system, and how the data should be stored, accessed, and processed.
  • How the employees’ access to the system is managed and how the system operators’ activities are monitored.
  • How long we should retain the data related to users’ transactions and our system logs for cases where the data is needed for investigation purposes.
  • What security components we must put in place for the system.

There can be multiple subsystems in the company, and the scope of compliance we're dealing with might need to be localized to a certain vertical or a certain network segment. Hence, how we design the software, the network, and the platforms is crucial in determining how difficult it will be for us to comply with the standards moving forward.

So, when we’re reviewing an architecture design from this perspective, some things to consider:

  • What regulations and standards does the system need to comply with?
  • What are the compliance requirements for the regulations and standards? Have they all been properly implemented in the system?
  • Which parts of the system should we expect to be regularly audited? How easy is it for us to collect evidence from the system for the auditors to review? How easy is it for us to demonstrate how the system works to the auditors if needed?

For an example of how we approach this, let's take a look at a case study illustrating how we designed our network to comply with PCI-DSS. We can't really talk about the details of our overall system architecture design and the security we set up for PCI-DSS compliance here, since we'd be revealing confidential information about our security implementation, so we'll settle for an illustration that focuses on the network segmentation and how it impacts the compliance efforts and audit processes.

The following is an illustration of what the network architecture looks like for the vertical we're pushing to comply with PCI-DSS.

An illustration of how the network was segmented on a high level.

In this setup, we deploy all of the services we need for the said vertical in the same network segment, and the database instances for all services are located in another shared network segment. If we deploy the services and database instances that handle credit card payment processing and data storage in these same networks, we must ensure that everything able to communicate directly with the payment-related services and database instances is properly secured and subject to very strict access policies.

It is a bit difficult to ensure the policies are applied correctly when all of the resources are located within the same network segment. It also complicates compliance management and the audit process: the segment contains many components unrelated to credit card payments that we might not be able to exclude from the PCI-DSS audit, because of the risk of those unrelated components accessing credit card payment information without us being aware of it.

The personnel we grant access to manage and maintain the components unrelated to credit card payment processing might need to follow extra procedures required for PCI-DSS compliance when accessing the network segment for maintenance. This situation also exposes the payment processing services and database instances to people who shouldn't have access to them. If the vertical has a lot of people working on it, we'll have quite a lot of extra employees whose activities in the network we must monitor to ensure the integrity of the payment system.

We can make it easier to manage by designing the system as follows.

An illustration of a network that has been segmented to have the payment processing environment and database instances isolated from the rest of the network.

By isolating the deployment environment of the payment processing services and database instances from the rest of the services and database instances required by the vertical, we can cleanly separate the system components that deal with credit card payments. This enables us to easily identify which assets we need to put more emphasis on during audits.

With this setup, it's also more convenient for us to separate the network access policies for the payment processing environment from the network access policies we previously made for the application deployment environment. We can also focus our network access monitoring efforts for PCI-DSS compliance on just the DMZ and the payment processing environment, instead of monitoring everything that happens in the application deployment environment, where most of the activity should have little relevance to the payment processing system.

But for this infrastructure setup to work, the application business logic and service dependencies might need to be adjusted accordingly to allow the payment services to be deployed separately from the rest of the services.
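
To make the segmentation idea slightly more concrete, here's a minimal sketch of an ingress rule for an isolated payment environment, expressed as a Kubernetes NetworkPolicy built with the official Python client. The namespace and label names are placeholders, and the segmentation described above lives at the network infrastructure level rather than necessarily inside Kubernetes, so treat this purely as an illustration of "only the DMZ may talk to the payment segment".

```python
# Hypothetical NetworkPolicy: only pods in namespaces labeled as part of
# the DMZ may reach the payment services; all other ingress is denied.
# Namespace and label names are placeholders.
from kubernetes import client, config

def build_payment_ingress_policy() -> client.V1NetworkPolicy:
    return client.V1NetworkPolicy(
        api_version="networking.k8s.io/v1",
        kind="NetworkPolicy",
        metadata=client.V1ObjectMeta(
            name="allow-dmz-only", namespace="payment-processing"
        ),
        spec=client.V1NetworkPolicySpec(
            pod_selector=client.V1LabelSelector(),  # all pods in namespace
            policy_types=["Ingress"],
            ingress=[
                client.V1NetworkPolicyIngressRule(
                    _from=[
                        client.V1NetworkPolicyPeer(
                            namespace_selector=client.V1LabelSelector(
                                match_labels={"zone": "dmz"}
                            )
                        )
                    ]
                )
            ],
        ),
    )

if __name__ == "__main__":
    config.load_kube_config()  # assumes a reachable cluster and kubeconfig
    client.NetworkingV1Api().create_namespaced_network_policy(
        namespace="payment-processing", body=build_payment_ingress_policy()
    )
```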

Solving Problems Strategically

In the previous sections, we mostly discussed the system's technical design from an infrastructure perspective. This is because infrastructure decisions generally don't reveal much about the procedures and business logic we implement, which could expose confidential information about how our company operates.

A lot of the trickier parts of the system are tied to the specific software functions that need to be implemented for the system to bring the best return on investment, given the exact business problems we're dealing with. Infrastructure setup, on the other hand, is mostly generic and context-agnostic with respect to the business processes running on it. So for this section, the case studies will refer to some of the projects we've publicly shared in our previous articles.

When we’re reviewing the architecture design of a software service from strategic problem-solving perspectives, there are a few things we might want to check:

  • What is the purpose of the service? Do we foresee that it might need to do something outside the scope of its current intended purpose? If yes, should we consider expanding the scope of the service, or should we prepare it so it will be easier to split it into multiple services later?
  • How do we expect it to be extended later? How should we design it so it can be extended in the way we’re expecting with minimum effort?
  • Suppose we need to introduce changes to how the service behaves later, how do we ensure backward compatibility to give the service’s consumers enough time to adjust to the new behavior?
  • Are there components of the service that might be reusable somewhere else? Are there components from somewhere else that might be reusable for the service?

One example we can look at for this section is BCL, a CLI tooling development and distribution framework we created to build and distribute our internal CLI development tools.

The following diagram shows BCL’s workflow.

How BCL works when publishing and installing packages: it reads the build configuration from the BUILD file when publishing a package, and reads the package list from BCLFile when installing packages.

BCL was built quite early, when we were still trying to figure out how our development infrastructure was supposed to be set up. Because we didn't have a clear vision of how our development infrastructure was going to be shaped at the time, we tried to avoid committing to a technology stack that would require significant time and effort to learn and maintain.

One of us had experience with a framework called Bash CLI and proposed using it for some of our initial internal developer tools, but we found it a bit inconvenient to use due to the lack of automatic scaffolding capabilities for setting up the project structure. It also didn't come with the package management and distribution capabilities we needed to manage and distribute the internal tools we were going to build. So we decided to improve it further and make it into BCL, with the added automatic scaffolding and package distribution capabilities.

We built BCL by leveraging Git along with other basic commands commonly provided by default on Linux and other UNIX-based operating systems as much as possible, so BCL's users are only required to install Git and nothing else for BCL to be functional. We used Git to implement BCL's package distribution capabilities, which include publishing packages to, and installing packages from, a Git repository configured to be a BCL package repository. We decided to use GitHub for hosting the BCL package repository to avoid having to maintain extra infrastructure components, which allowed us to commit less to learning and maintaining any new piece of technology at the time.
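
As a rough sketch of the idea (BCL itself is implemented in Bash, and its internals aren't reproduced here), installing a package from a Git-hosted package repository can be as simple as cloning a tag into a local tools directory. The repository URL, tag naming scheme, and directory layout below are made up.

```python
# Hypothetical sketch of Git-based package installation, in the spirit of
# BCL's approach. The repository URL and layout are placeholders.
import pathlib
import subprocess

PACKAGE_REPO = "git@github.com:example-org/bcl-packages.git"  # placeholder
INSTALL_DIR = pathlib.Path.home() / ".local" / "cli-tools"

def install_package(name: str, version: str) -> None:
    """Fetch one published package version by cloning its Git tag."""
    target = INSTALL_DIR / name / version
    target.parent.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        [
            "git", "clone",
            "--depth", "1",                    # no history needed
            "--branch", f"{name}/{version}",   # tag naming is hypothetical
            PACKAGE_REPO,
            str(target),
        ],
        check=True,
    )

if __name__ == "__main__":
    install_package("deploy-helper", "1.2.0")
```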

While BCL itself was built with Bash, the packages it distributes are not bound to Bash. As long as the developer sets a tool up so it can be distributed with BCL, any programming language can be used to build it. This gives us the flexibility to use whatever technology we consider most appropriate for each internal tool we build.

Another example we can look at is IAMD, an IAM (Identity and Access Management) dashboard we built to streamline our IAM workflow and scale our IAM operations. The following diagram shows how IAMD’s modules interact with each other.

An illustration of how IAMD’s components interact with each other.

IAMD is built with JavaScript to make it easy for engineers outside the infrastructure platform team (which is my team) to contribute to the development of the system, mainly in extending and integrating it into new target platforms. JavaScript is a relatively common programming language and the majority of our engineers are familiar with it, so we decided to build the IAMX (Identity and Access Management eXecutor) core and connector modules with JavaScript, and the rest of the application followed that decision.

We wanted to be able to scale the development effort for the IAMX connector modules, which can be integrated into the system as a kind of plugin to IAMD via the IAMX core module. This is because we have a lot of target platforms we need to manage, and many of them are not owned by the infrastructure platform team. It's more appropriate for the integration to be implemented by the team that owns the platform, so we designed it so that each team can implement their own IAMX connector module and let us know when they want it plugged into IAMD, enabling IAMD to manage their platform's IAM as well.

We also added interface standard versioning capabilities to anticipate cases where we need to extend the connector modules' functionalities in the future. This way, we can maintain backward compatibility with connector modules that were implemented against older versions of the interface standard and might lack some of the functionality we expect from newer modules.
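
To illustrate the versioned-interface idea (IAMX itself is written in JavaScript; the Python below is only a sketch of the concept), a registry can accept connectors declaring different interface versions and degrade gracefully for older ones. All names and version semantics here are hypothetical.

```python
# Hypothetical illustration of versioned connector interfaces; names and
# version semantics are made up, and IAMX itself is written in JavaScript.
from abc import ABC, abstractmethod

class ConnectorV1(ABC):
    interface_version = 1

    @abstractmethod
    def grant_access(self, username: str, resource: str) -> None: ...

    @abstractmethod
    def revoke_access(self, username: str, resource: str) -> None: ...

class ConnectorV2(ConnectorV1):
    interface_version = 2

    # V2 adds the ability to enumerate existing grants; V1 connectors
    # remain usable, the caller just can't list grants through them.
    @abstractmethod
    def list_grants(self, username: str) -> list: ...

class ConnectorRegistry:
    def __init__(self):
        self._connectors = {}

    def register(self, name: str, connector: ConnectorV1) -> None:
        self._connectors[name] = connector

    def list_grants(self, name: str, username: str) -> list:
        connector = self._connectors[name]
        if connector.interface_version < 2:
            # Backward compatibility: degrade gracefully for old modules.
            return []
        return connector.list_grants(username)
```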

More details on this can be found in this article explaining IAMD’s implementation.

Conclusion

If we revisit the initial question of how we know if the system we designed is already good and secure, the answer is a bit nuanced.

In this article, we've discussed a few things we might need to consider when assessing whether the system architecture design we made is good enough for the cases it needs to handle once it's deployed. Suppose we've already identified and addressed every constraint we can think of, and the design is already optimized for the strategy we plan to pursue. Does that mean the system is as good as it could be? Well, probably still no.

When designing systems, we’re primarily limited by our technical and contextual knowledge regarding how to build the system, how the system is going to be used, and what possible disturbances and threats to the system we need to address.

We can always improve our technical knowledge by expanding and solidifying our fundamental and technological knowledge and skills, and we can always improve our contextual knowledge by learning more about the business problems the system is supposed to solve and the general direction of the organization we're building the system for. But it just isn't reasonable for us to master every piece of technical knowledge and every skill there is, and the business context of the system we're building will definitely evolve over time, not to mention there could be a lot of ambiguities in the context that we need to clear up in the first place. Therefore, expecting perfection in the design isn't reasonable.

However, we can incorporate technical and contextual knowledge we don't yet have into the system's design by leveraging the knowledge of other people in the organization, which we can do by asking them to review our system design and give us feedback. The resulting design might still not be the best it could be. But as long as it solves the organization's problems within the acceptable budget allocation, while making it as easy as possible for us to operate, extend, and replace it as the situation later calls for, to the best of our knowledge and capabilities at the time, we can consider it a good enough design given the constraints.

