Automating Configuration Management at Scale

Edwin Caliwag
Published in DBS Tech Blog
Nov 25, 2020 · 12 min read

Security configuration management in DBS involves compliance checks based on guidelines set by international organisations to enhance overall security. These security configuration definitions are converted into an automated capability that scans servers across the bank for non-compliance reporting and rectification. The process is depicted in Figure 1.

Figure 1 Security Configuration Management Process

In this article, we will look at how we addressed the challenges of security configuration management through automation. We developed an in-house security system, SecureSys, which enforces security conformance end to end. In particular, we will focus on the following:

a) The architecture that enabled the automation and allowed us to scale this configuration management.

b) The processes that govern configuration management and how automation is applied across them, from policy management and enforcement (auto-healing) to reporting and the handling of deviations.

Brief History of Security Configuration Management in DBS

Prior to 2018, security configuration management was a manual process consisting of regularly generating reports, reviewing them and subsequently remediating the security configurations in between cycle periods. This end-to-end effort required a team of approximately 13. Furthermore, as the number of servers we used increased every year, the total time required for this lengthy scanning and reporting process also increased. This led us to search for, and develop, a new solution that would manage the process more efficiently.

SecureSys

Figure 2 SecureSys Framework

SecureSys is the overarching framework that underpins the automation. It is a highly customised approach to the management of security configurations. Figure 2 illustrates the 5 components within SecureSys, each tightly coupled with the others.

1. The web portal is the gateway for users to access the self-service capability offered by the three engines.

2. The policy engine handles the subscription of the node to a policy and the subsequent enforcements.

3. The reporting engine provides the collection and viewing of scan reports.

4. The deviation engine manages the workflow related to handling exceptions and deviations.

5. Puppet acts as the automation orchestrator engine and serves as the backend support for all the other components.

This automated capability is applied to approximately 18,000 operating systems (OS) and 15,000 sub-systems (i.e. databases and middleware software) across on-premise data centres in 6 countries as well as the public cloud environment.

Figure 3 The complexity of systems and sub-systems requiring security configuration checks

Automation Orchestration (Puppet)

We discovered that to accommodate in-house processes, we would require a great deal of customisation in order to automate security configuration management. Following multiple rounds of evaluation, we narrowed our selection down to one specific tool: Puppet. Puppet provides the capability to manage policy enforcement and reversion to the baseline security configuration, whilst also handling deviations.

Figure 4 Enterprise Puppet Architecture

Figure 4 illustrates the overall Enterprise Puppet architecture which is set up to support the regional architecture; it is simple and highly scalable. Puppet is split into 3 layers:

a) On the topmost layer is the Master of Masters (MOM); we have 2 instances of this component to provide high availability (active-passive). The role of the MOM is to coordinate and manage the compile masters, host the certificate authority for agent registration and communication, run the orchestration service and manage the Puppet database.

b) The compile masters on the second layer perform the building and compiling of the Puppet code into a catalogue; they handle the compilation of the tasks that the agent needs to run on the endpoint. The catalogue and the results of each catalogue run are synchronised between these compile masters. The compile masters sit behind a load balancer, which allows for horizontal scaling of the infrastructure to support more agents making catalogue requests. Should you need to expand the coverage of the Puppet infrastructure, it is as simple as adding a new compile master.

c) The third layer comprises the agents running on each server endpoint. Agents are authenticated through the internal certificate mechanism and run two services: the Puppet service, which is the main agent daemon, and the PXP service, which enables the execution of actions on remote nodes.

With this 3-layered approach, we can scale the architecture easily as we implement it in the public cloud. Auto-scaling is straightforward there, as we can simply spin compile masters up and down using AMIs baked with the application on the image itself (the building of the AMI is handled through an automated AMI pipeline).

On-premise scaling is semi-automated and handled through capacity management. With the number of checks we have, a compile master with 4 CPUs and 16GB of RAM can support 1,000 to 1,500 nodes.

Policy Engine

Referring back to Figure 2, 3 different engines lie on top of the Puppet automation orchestrator, the first of which is the Policy Engine. The agents allow a server to subscribe to a policy based on the group definition identified by a Puppet fact, run these policies against the server, report its state and automatically enforce the standards. With this strategy set, we have packaged Puppet by default as a layered product that is part of the standard server build. All OS platforms are equipped with the agent and are set to automatically register with the MOM and acquire the policies assigned to their profile.
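To illustrate the idea of fact-based policy subscription, the following is a minimal, hypothetical sketch; the fact name, group values and class names are placeholders for illustration and are not the actual DBS classification logic:

```puppet
# Hypothetical site.pp sketch: classify nodes into policy groups using a custom fact.
# 'policy_group' and the compliance_linux::* class names are assumptions, not DBS code.
node default {
  case $facts['policy_group'] {
    'rhel_prod':    { include compliance_linux::mandatory }
    'rhel_nonprod': { include compliance_linux::best_practice }
    default:        { notify { "No policy group assigned to ${trusted['certname']}": } }
  }
}
```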

While we have made many references to policies, let us now look at the context of these policies in relation to configuration management. For every piece of software released into the production environment, there are checks that need to be carried out to ensure that it conforms to the security standards of the enterprise. These security standards are patterned after the best practices of other successful organisations or internationally recognised security standards, such as the Defense Information Systems Agency's (DISA) Security Technical Implementation Guides (STIGs). A STIG is a compilation of guidelines aimed at standardising security settings to improve security. These guidelines are then translated into the context applicable to DBS, specifically regarding the configuration and settings applied to the software build. This ensures that configuration settings are consistent across all deployments. The number of policies ranges from 5 to 200, depending on the complexity of the platform. The majority of the policies are executed and enforced every 30 minutes, though this parameter is configurable and depends on the amount of risk you are willing to tolerate.
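For reference, this 30-minute cadence corresponds to the Puppet agent's runinterval setting. A minimal sketch of pinning it through the puppetlabs-inifile module is shown below; this is an assumption for illustration only, as the article does not describe how the interval is actually managed at DBS:

```puppet
# Hypothetical sketch: set the agent run interval to 30 minutes (1800 seconds)
# using the ini_setting resource from the puppetlabs-inifile module.
ini_setting { 'agent runinterval':
  ensure  => present,
  path    => '/etc/puppetlabs/puppet/puppet.conf',
  section => 'agent',
  setting => 'runinterval',
  value   => '1800',
}
```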

Example:

To see it in action, let us look at one of the STIG guidelines, finding V-72301, which is classified as "High" severity. It states: "The Red Hat Enterprise Linux operating system must not have the Trivial File Transfer Protocol (TFTP) server package installed if not required for operational support."

The resolution of the finding above is twofold:

1. The package must not be installed on the server. As all server builds are standard and automated, the TFTP package is already removed from the image itself.

2. To add an additional layer of security, we need to ensure that even if the package somehow gets installed, the system must not allow the service to be enabled. One way to do this is to ensure that the service is disabled (commented out) in the /etc/services file.

Let us examine the code in Figure 5. The code is generally divided into 3 main sections:

A. The first section (lines 14–17) is the keys section, where constant values are defined in a global configuration file known as Hiera. The value set in the Hiera file acts as the standard configuration value that is compared against the runtime value of the system.

B. The next section checks the desired operation mode for this check. There are 2 modes: 'noop', short for no operation, which usually executes the 'notify' action in Puppet and allows the actual configuration values to be extracted for reporting; and 'fix', which executes the 'auto-healing'. In fix mode, any drift of the configuration from the baseline or approved deviated values is corrected. These modes allow us to control with high specificity which checks must be auto-healed and which must not.

C. The last section deals with the actual actions to be carried out for the 'noop' and 'auto-healing' modes. Note that the auto-healing clause calls a subroutine named replace_matching_line_with_backup, which in this example takes a backup of the /etc/services file and comments out the TCP and UDP lines for the TFTP service.

Figure 5 TFTP service disabling code
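Since the figure is an image, the sketch below is a hedged reconstruction of what such a check might look like, following the three sections A, B and C described above. The class name, Hiera keys, the custom fact and the signature of replace_matching_line_with_backup are assumptions based on the description, not the actual DBS code:

```puppet
# Hypothetical sketch of the TFTP check described above (not the actual DBS policy code).
class compliance_linux::tftp_services_disabled {

  # A. Keys section: baseline value and operation mode looked up from Hiera.
  $mode     = lookup('compliance_linux::tftp_services_disabled::mode')      # 'noop' or 'fix'
  $baseline = lookup('compliance_linux::tftp_services_disabled::baseline')  # e.g. 'commented'

  # Runtime value: whether the tftp lines in /etc/services are still active
  # ('tftp_services_entry' is an assumed custom fact, for illustration only).
  $actual = $facts['tftp_services_entry']

  # B. Mode section: decide between reporting only and auto-healing.
  if $mode == 'noop' {
    # C. noop action: report the actual value so the reporting engine can flag any drift.
    notify { "tftp_services_check: baseline=${baseline} actual=${actual}": }
  } else {
    # C. Auto-healing action: back up /etc/services and comment out the TFTP entries.
    # replace_matching_line_with_backup is the in-house helper referenced in the article;
    # its parameters here are assumed.
    replace_matching_line_with_backup('/etc/services', '^tftp\s+69/(tcp|udp)')
  }
}
```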

Now that we are familiar with the anatomy of the code, the next challenge is to code (for both noop and auto-healing actions) more than 1,000 configuration checks across different platforms. To support this process, development and deployment follow a CI/CD framework. Developers and SMEs use an internal Bitbucket repository, subdivided per platform, to store their Puppet code. There are multiple branches to support targeted deployment across different environments. All source code merges are reviewed and, once a merge is approved, a webhook is triggered to deploy the code to the MOM for later replication to the compile masters, automating the deployment process. This capability allows for a faster time-to-market for new or enhanced policies and for parallel deployments.

Reporting Engine

One of the features of Enterprise Puppet is the management console. It is a web application used for managing the entire infrastructure, including certificate authorisation and the database. Exposing this console to users in the organisation introduces complex role-based access control; using just the default permissions, we cannot control what users can view and execute. Our objectives for a user console were to enable users to view security reports on the systems they manage, allow them to activate the auto-healing capabilities (for servers still in noop) and manage deviations. As such, our team developed a derivative of the console, which we named SecureSys (Security Systems). The SecureSys portal mainly handles the reporting and deviation management capabilities and is built on top of the Puppet database. Multiple user personas are defined in the portal, from executive management to auditors, information security personnel, risk managers, system administrators, tech support and application teams. This enables us to provide custom console views based on location and business unit. The portal allows for a single view of the reports and the required fixes (if needed) on each endpoint.

Figure 6 Report table in SecureSys

The table above shows the summary reporting per node on compliance status. Note that for the mandatory policies, the number of deviations is 0, as auto-healing is enabled and enforced on these machines. Currently, the mandatory checks (high-risk checks) are set to auto-revert whilst best-practice checks are in noop mode. By the end of this year, all OS-related best-practice checks will be reclassified as mandatory checks, so auto-healing will be enforced on these checks as well.

A sample report is shown in Figure 7.

Figure 7 Sample SecureSys Report

The report shows the description of the policy, what configuration is being checked and the actual values scanned on the server. If the baseline and actual values match, the configuration is reported as compliant.

Figure 8 is an example of a non-compliant setting for a best-practice (noop) check related to the file permission on /var/log/secure. The policy requires that the permission on this file is set to '600'; however, when Puppet scans the runtime value, it determines that the permission is set to '644'. This triggers a notification of non-compliance, and the finding is flagged in the database entry.

Figure 8 Non-compliance finding

The recommended course of action is either to remove the excess permissions (read for group and others) or to apply for a deviation if the setting is required for the application to work. Deviations can also be raised through the portal; the details are explained in the next section.
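Before moving on, here is a minimal sketch of how the first option (correcting the permission) could be expressed as a Puppet check; the class name and the Hiera key used for the mode lookup are placeholders, as the actual DBS policy code is not shown in this article:

```puppet
# Hypothetical sketch: enforce mode 0600 on /var/log/secure.
class compliance_linux::var_log_secure_perms {
  # Placeholder Hiera key; defaults to report-only behaviour.
  $mode = lookup('compliance_linux::var_log_secure_perms::mode', String, 'first', 'noop')

  file { '/var/log/secure':
    ensure => file,
    owner  => 'root',
    group  => 'root',
    mode   => '0600',
    # In noop mode Puppet only reports the drift; in fix mode it corrects the permission.
    noop   => ($mode == 'noop'),
  }
}
```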

Deviation Engine

We are cognisant of the fact that not all applications can comply fully with existing configuration settings, especially if the application is an off-the-shelf product. There will nearly always be 1 or more settings that need to vary from the baseline values for the application to work. Within DBS, we may allow these deviations if explicit approval is sought from the security team and corresponding authorisation from the head of department is granted for the acceptance of the risk. The entire process of discovering deviations (when the agent is in 'noop' mode), raising a deviation request, routing it for approval and actual implementation is done through the SecureSys portal.

It is important to note that to discover and allow for the discrepancies, the agent must not be executing the auto-healing of the baseline value, otherwise the value will be reverted to the baseline following each Puppet cycle. So, you might be wondering: how do we keep the new value as part of the routine auto-healing check without affecting the rest of the nodes that use the same policy? The answer lies in how Puppet manages the hierarchy of data sources. This hierarchy is maintained in the hiera.yaml file and can be structured depending on how you organise and group the nodes.

Figure 9 Hiera file entry

In the diagram above, you will see that Puppet looks up 6 different sources in the hierarchy. Puppet follows the order in which the sources are written, i.e. the 'by host' value takes precedence over the 'by OS' value, and the value is taken from the source with the highest precedence.
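Since Figure 9 is an image, the following is a hedged sketch of what a 6-level hiera.yaml like this could look like; the level names, paths and custom fact names are illustrative assumptions, not the actual DBS hierarchy:

```yaml
# Hypothetical hiera.yaml sketch: sources are searched top to bottom,
# so per-node values override per-OS and common values.
version: 5
defaults:
  datadir: data
  data_hash: yaml_data
hierarchy:
  - name: "Per Node"
    path: "nodes/%{trusted.certname}.yaml"
  - name: "Per Group"
    path: "groups/%{facts.policy_group}.yaml"        # assumed custom fact
  - name: "Per Environment"
    path: "environments/%{facts.environment_tier}.yaml"  # assumed custom fact
  - name: "Per Country"
    path: "countries/%{facts.country}.yaml"           # assumed custom fact
  - name: "Per OS"
    path: "os/%{facts.os.name}.yaml"
  - name: "Common"
    path: "common.yaml"
```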

The code snippet below describes the check on the values defined in /etc/securetty.

Figure 10 Code snippet of policy check

Let us say that all baseline value settings are defined in the "Per OS" yaml file, in this case the Linux.yml file, and the baseline value for compliance_linux::2_1_2::x_2_01_02_02_comp is set to "console,tty" as per Figure 11:

Figure 11 Hiera values for /etc/securetty check
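As the figure is an image, here is a hedged sketch of the corresponding Per-OS Hiera entry; only the key and the "console,tty" value are taken from the text, and the file layout is assumed:

```yaml
# Hypothetical sketch of the "Per OS" data file (Linux.yml) for the /etc/securetty check.
compliance_linux::2_1_2::x_2_01_02_02_comp: "console,tty"
```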

Assume that the application will not work with these terminal types and will only work with a pseudo-terminal (pty). The challenge is to prevent the configuration from being auto-healed and reverted to the original settings after the next Puppet cycle. In addition, an 'intentional change' alert will be generated and reported. One possible solution could be to update the value definitions in Hiera (at the platform level); however, any changes made to that file will affect all nodes that subscribe to that configuration.

To address this, we make use of the hierarchy of data sources to define where the value will be taken from. Look again at the definitions in the hiera.yaml file in Figure 9 and notice that the first entry is described as "Per Node". This definition has the highest precedence, therefore all settings defined here supersede the definitions in the files below it. It is also worth noting that the filename of this configuration file is set to the certificate name (i.e. the host name of the endpoint), making all the configurations specified in the file applicable only to the node whose certificate name matches the filename. In practice, we can now set the host yaml file as follows:

Figure 12 New value in host.yaml file as part of deviation
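Again, because the figure is an image, the following is a hedged sketch of such a per-node override; the deviated value shown is illustrative only, as the actual approved value is not given in the article:

```yaml
# Hypothetical sketch of the per-node override file, named after the node's certificate
# name (e.g. nodes/<host>.yaml). This value supersedes the "console,tty" Per-OS baseline.
compliance_linux::2_1_2::x_2_01_02_02_comp: "console,tty,pts/0"   # illustrative deviated value
```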

Once the deviation strategy is identified, we can automate the process by integrating the workflow into SecureSys and having the backend execution carried out by a Puppet task. Once a request is approved in the portal, it automatically calls an API which then executes the following tasks:

1. Fetch the information regarding the deviation (host name, key, value)

2. Create the <host>.yaml file in the MOM and append the key-value pair

3. Trigger a file-sync to replicate the file to all compile masters

Changes take effect after the next run; all executions moving forward will refer to the <host>.yaml file first for the values to be used for auto-healing, thus allowing the deviation.

We have made significant progress with our SecureSys framework, moving from a monolithic and manual approach to configuration management to the automated and scalable solution that we have today. We have seen substantial benefits after implementing this approach, having reduced the equivalent effort of 13 staff to 3. All drift in mandatory configurations is auto-healed, removing toil for both infrastructure and application teams, who no longer need to coordinate with each other for fixes. Reporting overheads are also reduced, and reports can be generated in near real-time, with scans running every 30 minutes. Although there is definitely still room for improvement, with all these frameworks in place, I believe we can keep up with the ever-changing needs of our ever-growing organisation.

(This article has also been translated into Traditional Chinese: 6國跨雲15000套系統組態管理如何自動化?星展大公開)


Edwin Caliwag leads a team of full stack developers developing automation solutions at DBS.