Security in Software Development and Infrastructure System Design
Concerns about security and privacy are growing among technology users. Because Cermati is a financial technology company that handles a large amount of sensitive financial data, security is one of our main concerns when designing and implementing our systems.
The idea for this article came from a coworker of mine, our engineering manager Michaela Nathania. She asked me to share what I know about information security with our engineering team, either through one of our internal tech talks or in writing. I consider myself a better writer than speaker, and I think writing it down lets me deliver the message in a more scalable way for the long term. So here it is, in the form of an article.
Security itself is a broad field covering the aspects of people, process, and technology. We’re going to cover part of the technology aspect of information security.
A typical image that pops in many people’s heads when talking about security in digital systems is an attacker performing their tricks using various tools — such as Nmap as shown in the following screenshot.
But this article isn’t talking about that side of information security work. We’re going to focus on security in software development and IT infrastructure system design, which lies on the other side of the information security work.
Security and Risk Management
Before we dig deeper into the topic, we first need to understand the relationship between security and risk management. In terms of information security, we can consider a system secure when it fulfills the CIA (Confidentiality-Integrity-Availability) triad.
A system is considered secure when it fulfills the requirements regarding confidentiality, integrity, and availability.
- Confidentiality: Is the read access to resources correctly implemented so that they can only be read by the authorized users?
- Integrity: Is the write access to resources correctly implemented so that they can’t be written or overwritten without the consent of authorized users?
- Availability: Are the resources guaranteed to be available for access by the authorized users whenever needed?
Perfectly ensuring the confidentiality, integrity, and availability of the data in a system can be very expensive. Sometimes we don’t have enough resources to secure every system component, especially when the system we’re trying to secure is huge and the resources at hand are limited. At times like this, we need to take a risk management approach so that we can secure the system effectively with the resources available.
Security and risk management work side by side, where risk management principles are to be applied to relevant contexts in order to minimize the possible negative consequences from the abuse of a system.
To decide which part of the system needs to be prioritized, we first need to assess the risks we have to deal with in different parts of the system. As shown in the risk assessment matrix above, we can rate the severity of a risk based on the likelihood of the incident happening and the consequences that follow should it happen.
After assessing the risks on different parts of our system, we can focus on parts that have higher risk severity levels when securing the system — since that’s where security breaches would have the most impact.
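The prioritization described above can be sketched in a few lines of code. This is only an illustration, assuming a 1 to 5 scale for both likelihood and consequence and a severity score equal to their product. The component names, the ratings, and the thresholds are all made up for the example and do not come from a real assessment.

```javascript
// Hypothetical components with assumed 1-5 ratings for illustration only.
const components = [
  { name: 'internal-reporting-job', likelihood: 2, consequence: 2 },
  { name: 'public-web-endpoint', likelihood: 4, consequence: 5 },
  { name: 'admin-back-office', likelihood: 3, consequence: 4 },
];

// Severity is modeled as likelihood x consequence (an assumed convention).
function severityScore(component) {
  return component.likelihood * component.consequence;
}

// Assumed thresholds for mapping a score to a matrix label.
function severityLevel(score) {
  if (score >= 15) return 'high';
  if (score >= 6) return 'medium';
  return 'low';
}

// Returns a new array sorted from the highest to the lowest severity score.
function prioritize(list) {
  return list
    .map((c) => {
      const score = severityScore(c);
      return { ...c, score, level: severityLevel(score) };
    })
    .sort((a, b) => b.score - a.score);
}

const ranked = prioritize(components);
ranked.forEach((c) => console.log(`${c.name}: ${c.score} (${c.level})`));
```

The components at the top of the ranked list are the ones to secure first. The exact scoring formula matters less than applying the same scale consistently across the whole system.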
Secure System Architecture Design
The mindset of security and risk management can be applied starting from the design phase of the system. Translating the requirements (including the security requirements) into a workable system design before we proceed with the implementation is a good start for secure system development.
The image above shows the security mechanisms at work when a user is accessing a web-based application. Common security concerns of a software system or an IT infrastructure system still revolve around the CIA triad as described in the previous section.
When designing a system, we first need to see the general architecture of the system that should be implemented for the business requirements to be fulfilled.
The above image shows the general architecture of a microservices-based web application — a common approach for today’s HTTP-based web applications and services.
Suppose we’re designing a microservices-based system and trying to plan for the system security from the architecture design. We start by performing a risk assessment to see which parts of the system have the highest risk. The system consists of an API gateway, an authentication service, a user configuration service, a payment service, and a transaction service.
The five services serve as different components and functions of the system, each carrying its own risks. But let’s focus on the service that serves as the front-line defense of the system: the API gateway.
The API gateway is the one accepting requests directly from the public Internet, and the machine it’s deployed on is at greater risk of being compromised by an attacker than the other services, which are deployed on machines that are not directly exposed to the public Internet. The API gateway needs to parse and process requests securely so that attackers can’t exploit the request parser by sending a malformed HTTP request.
A malformed request that’s not properly handled may cause the API gateway to crash or be manipulated into executing instructions it’s not supposed to execute. It’s a good idea to put the API gateway behind a firewall that can help filter out malicious requests and stop exploit attempts before they reach the API gateway. The firewall itself might be exploitable, though, so pick something that’s already battle-tested and quick to patch whenever a vulnerability is found.
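As an illustration of rejecting malformed input at the edge, here is a minimal sketch of a validation step a gateway could run before forwarding a request. It assumes the raw HTTP message has already been parsed by a well-tested HTTP server or library, and that the body arrives as a string. The function name `validateRequest`, the size limit, and the allowed methods are hypothetical choices for this example.

```javascript
// Assumed limits for illustration; real values depend on the services behind the gateway.
const MAX_BODY_BYTES = 1024 * 1024; // 1 MiB
const ALLOWED_METHODS = new Set(['GET', 'POST', 'PUT', 'DELETE']);

// Returns { ok: true, payload } for acceptable requests, or
// { ok: false, status, reason } so the gateway can answer without crashing.
function validateRequest({ method, headers = {}, body = '' }) {
  if (!ALLOWED_METHODS.has(method)) {
    return { ok: false, status: 405, reason: 'method not allowed' };
  }
  if (Buffer.byteLength(body, 'utf8') > MAX_BODY_BYTES) {
    return { ok: false, status: 413, reason: 'body too large' };
  }
  if (body.length === 0) {
    return { ok: true, payload: null };
  }
  const contentType = String(headers['content-type'] || '');
  if (!contentType.startsWith('application/json')) {
    return { ok: false, status: 415, reason: 'unsupported content type' };
  }
  try {
    // JSON.parse throws on malformed input, so the error is contained here.
    return { ok: true, payload: JSON.parse(body) };
  } catch (err) {
    return { ok: false, status: 400, reason: 'malformed JSON body' };
  }
}

const accepted = validateRequest({
  method: 'POST',
  headers: { 'content-type': 'application/json' },
  body: '{"amount": 100}',
});
const rejected = validateRequest({
  method: 'POST',
  headers: { 'content-type': 'application/json' },
  body: '{"amount": ',
});
```

Returning a result object instead of throwing keeps one bad request from taking the process down, which is the failure mode described above.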
While the other services also have their own risks we should handle, the API gateway and the authentication service are to be prioritized due to the higher risks they pose to the whole system if compromised.
By putting the API gateway at the front line, with some extra protection such as firewall rules, we avoid exposing every service to direct access. Since only the API gateway receives traffic directly from the public Internet, we can focus on protecting it against malformed requests and on ensuring that the requests it forwards to each service are already safe.
Imagine if we let every single service be directly accessible from the public Internet. We’d need to ensure every single one of them meets the same implementation security standards for handling raw requests. This setup would become much more expensive to maintain as the number of services grows, since we’d need to secure every single one of them instead of just one key service that acts as a bridge between the public Internet and the services in the internal network.
Poorly planned system security at the architectural level leaves us with the extra work of securing many things we wouldn’t have to worry about had we designed the system architecture properly from the start.
Risks in Lines of Code
Thomas Dullien, a.k.a. Halvar Flake, gave a keynote at Black Hat Asia 2017 titled Why We are Not Building a Defendable Internet. The talk is available on Black Hat’s YouTube channel and can be found here. I took some points from the talk (starting at 22:12 in the video) as inspiration for this section.
Halvar talked about applying risk management principles when developing software and IT infrastructure. Working in software development, we’re generally rewarded for writing code and shipping it to users. On the infrastructure management side, we’re generally rewarded for adding new capabilities to the IT infrastructure that add value to the business. The work of cleaning up messy old code and keeping system components up to date is often overlooked, because its value to the organization is not easily measurable from the business point of view.
Every line of code we write, and every library or piece of system software we install that runs somewhere in our system, adds more complexity to the system. The more complex the system is, the higher the likelihood that we’re overlooking risks and attack surfaces that might one day bring ruin to the whole system. There’s a risk lurking in every line of code we’re running, yet we keep adding more and more code for our system to run.
When developing software, we generally translate requirements into lines of code. Our lines of code might do exactly what the business requirements need them to, but they might also carry unintended bugs or unhandled edge cases. Being more thoughtful about the code we’re writing helps us consider edge cases we might otherwise miss.
Similar to when developing software, when developing IT infrastructure we generally look for software and hardware configurations that can best solve our problems. Sometimes the easiest way to get things to work isn’t really the best way. We should consider the risks when configuring a server in a certain way, and why we would choose this particular setup compared with the others.
The unintended side effects of our code and server configuration might put our organization, business, and even users at risk. We start with a few poorly made decisions when writing code and configuring servers. After a while, we have a huge system built on one poorly made decision after another.
At this scale, it takes a considerable amount of resources to review the decisions we’ve made in the past and see what risks they carry. On top of that, some risks don’t come from an individual bad decision but from a few bad decisions combined, which is harder to notice when reviewing module by module.
One way to manage the problem that might arise from complexity is to avoid complexity. By designing our system architecture to have the modules structured in a manageable way, we can minimize the chance of having an invisible risk that comes from a few modules working together in an unexpected way. We then can focus our effort on only a few critical parts of the system.
We can also avoid introducing new unanticipated risks by strictly analyzing the possible consequences of adding new modules to the system and removing the modules we no longer use.
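As a small illustration of spotting modules we no longer use, the sketch below compares a list of declared dependencies against the packages that the source code actually imports. It is a rough approximation under stated assumptions: it only recognizes static `require('...')` calls and `import ... from '...'` statements, and the names `findUnusedDependencies` and `packageNameOf` are hypothetical helpers written for this example.

```javascript
// Matches require('x') and import ... from 'x' specifiers.
// Dynamic or computed imports are not detected by this simplification.
const IMPORT_PATTERN = /(?:require\(\s*|from\s+)['"]([^'"]+)['"]/g;

// Reduces 'lodash/get' to 'lodash' and '@scope/pkg/sub' to '@scope/pkg'.
function packageNameOf(specifier) {
  const parts = specifier.split('/');
  return specifier.startsWith('@') ? parts.slice(0, 2).join('/') : parts[0];
}

// declared: array of package names; sources: array of source file contents.
function findUnusedDependencies(declared, sources) {
  const used = new Set();
  for (const source of sources) {
    for (const match of source.matchAll(IMPORT_PATTERN)) {
      const specifier = match[1];
      // Relative and absolute paths refer to our own files, not packages.
      if (specifier.startsWith('.') || specifier.startsWith('/')) continue;
      used.add(packageNameOf(specifier));
    }
  }
  return declared.filter((name) => !used.has(name));
}

const unused = findUnusedDependencies(
  ['lodash', 'moment', '@scope/util'],
  [
    "const get = require('lodash/get');",
    "import helper from '@scope/util/sub';",
    "const local = require('./local');",
  ]
);
console.log(unused);
```

A candidate reported by a check like this still needs a human review before removal, since a package may be loaded in ways a text search cannot see.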
Identifying Vulnerabilities
In the previous sections, we talked about how to minimize the risks that might arise from vulnerabilities in our system and how to avoid building vulnerabilities into it. But how do we identify a vulnerability?
One thing to remember is that not all vulnerabilities are bugs. A vulnerability might actually be an intended behavior of the system that exposes the organization or the users to a security risk. To see whether a certain behavior of the system is a vulnerability or not, we need to analyze the context where the component is going to be executed.
const _ = require('lodash');

let deletedItems = request.body.objectIds || [];

_.forEach(deletedItems, (itemId) => {
  // assume ObjectModel is a sequelize model
  // with paranoid: true
  ObjectModel.findById(itemId).then(item => {
    // skip IDs that don't match any record
    if (!item) return;
    item.cancelPendingTasks();
    item.destroy();
  });
});
Suppose that the snippet above is intended for a web-based application’s back-office function, where an organization’s admin staff delete invalid ObjectModel records entered into the system by the application’s users. The snippet falls short on integrity: it allows a staff member to delete every ObjectModel record without the system recording who deleted them. A malicious admin could abuse this to wipe out all of the records.
Even if ObjectModel is configured with the paranoid: true option, which turns the destroy() call into a soft delete (see the Sequelize docs for reference), it can still be costly to the organization’s operations to have a malicious staff member soft-delete all of the records over and over again through the endpoint containing this snippet.
To avoid this, we can introduce a logging mechanism in the flow.
const _ = require('lodash');
const sessionManager = require('./sessionManager');
const actionLogManager = require('./actionLogManager');

let deletedItems = request.body.objectIds || [];

_.forEach(deletedItems, (itemId) => {
  // assume ObjectModel is a sequelize model
  // with paranoid: true
  ObjectModel.findById(itemId).then(item => {
    // skip IDs that don't match any record
    if (!item) return;
    // record who is deleting what before the deletion runs
    actionLogManager.log({
      actor: sessionManager.activeUser,
      object: item,
      action: actionLogManager.actions.DELETE
    });
    item.cancelPendingTasks();
    item.destroy();
  });
});
By logging the activities in the system, we can keep track of which staff member performed a delete operation on a given ObjectModel record. We put the logging before the actual tasks are executed so that we can keep track of the attempts even if the deletion tasks fail. If we put the logging after the records are deleted, we risk having the records successfully deleted while the log write fails, which violates the integrity requirement since we can no longer trace the changes to the deleted records.
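The ordering argument above only holds if the log write has actually completed before the deletion starts. Here is a minimal sketch of that idea, assuming both the log write and the delete return promises. The names `deleteWithAudit` and `createInMemoryAuditLog`, as well as the entry fields, are hypothetical and are not part of Sequelize or of the snippet above.

```javascript
// In-memory stand-in for a real audit log store (illustration only).
function createInMemoryAuditLog() {
  const entries = [];
  return {
    entries,
    async write(entry) {
      entries.push({ ...entry, at: new Date().toISOString() });
    },
  };
}

// Writes the audit entry first and deletes only after the write succeeded.
// If auditLog.write rejects, the error propagates and destroy() is never called.
async function deleteWithAudit(item, actor, auditLog) {
  await auditLog.write({ actor: actor.id, objectId: item.id, action: 'DELETE' });
  await item.destroy();
}

// Usage with a fake record that only flips a flag when destroyed.
const auditLog = createInMemoryAuditLog();
const record = {
  id: 42,
  destroyed: false,
  async destroy() {
    this.destroyed = true;
  },
};

deleteWithAudit(record, { id: 7 }, auditLog)
  .then(() => console.log(auditLog.entries.length, record.destroyed))
  .catch((err) => console.error('deletion aborted:', err.message));
```

With this ordering, a failed log write leaves the record untouched, so a deletion can never happen without a matching log entry.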
This flow might already be good enough if we assume that our admin staff are the only ones who can run this code. But if we also allow the web application’s public users to access the same record deletion endpoint, we have an insecure direct object reference vulnerability: users can delete any record simply by providing its ID, with no check on whether they are authorized to delete it.
const _ = require('lodash');
const sessionManager = require('./sessionManager');
const actionLogManager = require('./actionLogManager');

let deletedItems = request.body.objectIds || [];
let activeUser = sessionManager.activeUser;

_.forEach(deletedItems, (itemId) => {
  // assume ObjectModel is a sequelize model
  // with paranoid: true
  ObjectModel.findById(itemId).then(item => {
    // skip IDs that don't match any record
    if (!item) {
      return;
    }
    // only staff or the record's owner may delete it
    if (!activeUser.isStaff() && !item.ownedBy(activeUser)) {
      return;
    }
    actionLogManager.log({
      actor: activeUser,
      object: item,
      action: actionLogManager.actions.DELETE
    });
    item.cancelPendingTasks();
    item.destroy();
  });
});
This way, we can prevent a non-staff user from deleting records they don’t own. While the code snippet we have here doesn’t do it, we can also add logging when a non-staff user attempts to delete records owned by other users, to help us detect malicious behavior in the system.
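Logging denied attempts could look roughly like the sketch below. It is a simplified, self-contained illustration in which `isStaff` and `ownerId` are plain properties and `securityLog.warn` is a hypothetical logger interface. These names are assumptions for the example and differ from the methods used in the snippet above.

```javascript
// Staff may delete anything; other users may delete only what they own.
function canDelete(user, item) {
  return user.isStaff || item.ownerId === user.id;
}

// Returns true when the deletion may proceed. Denied attempts are recorded
// so that repeated probing of other users' records becomes visible.
function authorizeDelete(user, item, securityLog) {
  if (canDelete(user, item)) {
    return true;
  }
  securityLog.warn({ actor: user.id, objectId: item.id, action: 'DELETE_DENIED' });
  return false;
}

// Usage with an array-backed stand-in for the logger.
const deniedAttempts = [];
const securityLog = { warn: (entry) => deniedAttempts.push(entry) };

const owner = { id: 1, isStaff: false };
const stranger = { id: 2, isStaff: false };
const item = { id: 100, ownerId: 1 };

const ownerAllowed = authorizeDelete(owner, item, securityLog);
const strangerAllowed = authorizeDelete(stranger, item, securityLog);
console.log(ownerAllowed, strangerAllowed, deniedAttempts.length);
```

Keeping the denial log separate from the deletion log makes it easier to alert on unusual spikes in rejected requests.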
While we can perform analysis on each code snippet and test each configuration manually to find vulnerabilities, it might take some time to manually analyze everything when we’re checking the security of an existing system — especially if the system is big.
We can refer to OWASP’s lists of common vulnerabilities to get familiar with typical web application weaknesses faster, so we can avoid the same mistakes when developing our own web-based applications and services. We can also refer to MITRE’s CVE database to check for publicly known vulnerabilities in our dependencies and infrastructure systems.
Conclusion
Building a secure system is not easy, and there will never be enough resources to make a system perfectly secure. But by performing a risk assessment on the system we’re trying to secure, we’ll be able to identify which parts of the system need to be prioritized.
The risk assessment approach can be used for performing a security assessment on an existing system, but it’s also useful when we’re trying to design a system from scratch. By applying the principles to our system architecture design and adding mechanisms to mitigate possible issues, we can avoid possible severe risks in the system from the start.
Even for a system that’s designed with security in mind from the beginning, the system will grow more and more complex as time goes on. The complexity adds more risks, as a more complex system’s behavior tends to be less predictable. We can manage the system’s complexity through maintenance work: restructuring parts of the system to simplify the overall design and the interactions between components, and removing parts that are no longer used.
Still, some vulnerabilities might remain in the system. These vulnerabilities can be identified by reviewing the code and configurations we have — they need to be reviewed according to the context of their runtime environment and purposes.
An organization typically allocates more resources to development and operations than to security. That’s understandable, since delivering products and services is the means by which the organization fulfills its purpose. When perfectly securing the system isn’t economical for the organization, it’s a sensible decision to focus on securing the critical components of the system while keeping those components manageable.