How to make your Accounts Engine highly available?

Jerry van Angeren
Published in ING Blog · 7 min read · Oct 11, 2022

Whenever our Accounts Engine becomes unavailable, customers are not able to view details of their accounts. This includes insight into transactions and balances, as well as booking new transactions. Working at Core Bank at ING in the Netherlands, we are responsible for the Accounts Engine (AE).

Current landscape

In a time before ‘everything’ was ‘always’ online and before transactions were done ‘now’ instead of ‘in a couple of hours’, an AE did not have to be available all the time. It could go down for maintenance or repairs, and human intervention was always necessary. Nowadays, customers want to have access to their data at all times.

Within our domain, teams are always looking for improvements to ensure (or improve) availability of our services. Our system was retrofitted with a high-availability (HA) solution based on Apache ZooKeeper, NGINX, custom-built agents and a dashboard. The solution decides where to send traffic based on load, availability and various application-specific parameters, like the state of data replication.

In our situation, we have one server acting as ‘primary’ host, processing all mutational requests. In addition, we have multiple servers available for various other purposes. Data replication allows us to have the same data on all of them, so we are able to allocate one of the others for inquiry requests. All servers are able to act as primary or inquiry host, allowing us to assign either responsibility to them.

The solution consists of the following components:

  • ZooKeeper cluster, as a distributed application toolkit that connects all components
  • NGINX as a configurable (reverse) proxy
  • ‘HA Engine’: Agents for monitoring, decision-making and managing the AE layer
  • ‘HA Client’: Agents for reconfiguring NGINX on an API level
  • ‘HA Dash’: Monitoring and management dashboard

Overview of the current landscape

As we are relying on data replication between the different AE hosts, the primary and inquiry ones should always contain the same data. If this is not the case, for example due to network latency, customers might miss transactions, which is not acceptable. Our HA Engine compares the data on the inquiry host with the expected data from the primary host to ensure the replication has completed entirely. Whenever the engine detects a processing backlog or missing data, a ZooKeeper event is triggered towards the ZooKeeper cluster to force all traffic towards the primary host. The HA Client is connected via a listener to the same ZooKeeper cluster and receives the created event. Based on this event, it updates the local NGINX configuration to route the inquiry traffic towards the primary host as well.
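To make this mechanism concrete, the sketch below shows what an HA Client of this kind could look like in Java: it watches a routing znode and, on a change, rewrites the local NGINX upstream configuration and reloads NGINX. The znode path, configuration file location and reload command are assumptions for illustration, not our actual implementation.

```java
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Minimal sketch of an HA Client: watch a routing znode and rewrite the local
// NGINX upstream configuration whenever the routing state changes.
// Znode path, file location and reload command are illustrative assumptions.
public class HaClientSketch implements Watcher {

    private static final String ROUTING_ZNODE = "/ae/routing/inquiry-target";               // hypothetical
    private static final String NGINX_UPSTREAM_CONF = "/etc/nginx/conf.d/ae-upstream.conf"; // hypothetical

    private final ZooKeeper zk;

    public HaClientSketch(String connectString) throws Exception {
        this.zk = new ZooKeeper(connectString, 15000, this);
    }

    public void start() throws Exception {
        // Register the initial watch and apply the current routing state.
        applyRoutingTarget();
    }

    @Override
    public void process(WatchedEvent event) {
        if (event.getType() == Event.EventType.NodeDataChanged
                && ROUTING_ZNODE.equals(event.getPath())) {
            try {
                applyRoutingTarget();
            } catch (Exception e) {
                e.printStackTrace(); // in the real solution this would be monitored
            }
        }
    }

    private void applyRoutingTarget() throws Exception {
        // Read the host that inquiry traffic should go to and re-register the watch.
        String target = new String(zk.getData(ROUTING_ZNODE, this, null), StandardCharsets.UTF_8);
        // Point the inquiry upstream at that host (the primary host when replication lags).
        String conf = "upstream ae_inquiry {\n    server " + target + ";\n}\n";
        Files.write(Paths.get(NGINX_UPSTREAM_CONF), conf.getBytes(StandardCharsets.UTF_8));
        // Graceful reload so existing connections are drained.
        new ProcessBuilder("nginx", "-s", "reload").inheritIO().start().waitFor();
    }
}
```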

In case of maintenance, the state of the system can be changed manually via our management dashboard (HA Dash). Clicking buttons on the dashboard triggers events towards the ZooKeeper cluster, which are picked up by the HA Client in the same way as described above.

It is also possible that a disaster happens, e.g. losing a server or, even worse, an entire datacenter. Our implementation should ensure availability for our customers in these cases as well. For example, when the inquiry host goes down, a different server should take over this functionality, or all requests should be forced towards the primary host. We have already encountered this in real life, and our solution ensured availability for our customers.

Why change?

ING’s Container Hosting Platform (ICHP) is a platform based on Kubernetes. It allows its users to use all of Kubernetes’ features, e.g. self-healing, horizontal scaling and automated rollouts or rollbacks. Details on this platform can be found in this Case Study.

We, too, are planning to migrate to ICHP, but this comes with a few challenges. To start with, we currently run all of our components (API, HA Client and NGINX) on the same server, and we did not like the thought of hosting that many components in a single container. Secondly, ICHP containers are immutable, preventing us from updating files within them. Our current solution relies on the ability to update NGINX’s configuration file on the server, which will no longer be possible in ICHP. Finally, during NGINX restarts a small number of requests would fail. After a retry, these requests would proceed successfully, meaning none of our customers ever experienced any impact from this issue. Nevertheless, we wanted to remove the possibility of this happening altogether.

In the new solution, we solve the first issue by moving the functionality of the HA Client and NGINX into the API itself. The second problem is solved by simply getting rid of NGINX, removing the need for updating its configuration file. This also takes care of the final issue: not having NGINX means far fewer (if any) NGINX restarts.

In addition, getting rid of the HA Client and NGINX will be a significant simplification of our IT landscape, which is a continuous goal of ours.

Hackathon time!

To allocate dedicated time for this new setup, we conducted a hackathon with all involved colleagues. As we required input from various angles, the group contained colleagues from both business and IT. This allowed us to come up with requirements first and start the implementation of the first proof of concept afterwards.

Gathering requirements

Implementing a new solution would change the overall architecture and setup of our core business. As such, we had to come up with requirements that needed to be proven in our proof-of-concept implementation.

1. Should be backwards compatible and activated by a configuration change
During the implementation of the new client, all existing functionality must remain usable. Once released, we should be able to change the application configuration in order to start using the new solution. Our application framework, used for connecting the API towards our AE, has to support this.
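As a minimal sketch of such a switch, assuming a hypothetical property name and class names (the actual configuration mechanism of our application framework is not shown here):

```java
import java.util.Properties;

// Sketch of a backwards-compatible switch: the legacy NGINX-based routing stays
// the default, and the new embedded ZooKeeper client is only activated through a
// configuration change. Property name and class names are illustrative assumptions.
interface RoutingStrategy {
    String resolveInquiryHost();
}

class NginxRoutingStrategy implements RoutingStrategy {
    // Legacy behaviour: requests keep going through the local NGINX proxy.
    public String resolveInquiryHost() { return "localhost"; }
}

class ZooKeeperRoutingStrategy implements RoutingStrategy {
    // New behaviour: the host is resolved from the ZooKeeper-managed state.
    public String resolveInquiryHost() { return "inquiry-host-from-zookeeper"; }
}

public final class RoutingStrategyFactory {
    public static RoutingStrategy fromConfig(Properties config) {
        // Flipping this single property activates the new solution.
        String mode = config.getProperty("ae.routing.mode", "nginx");
        return "zookeeper".equalsIgnoreCase(mode)
                ? new ZooKeeperRoutingStrategy()
                : new NginxRoutingStrategy();
    }
}
```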

2. Should respond to ZooKeeper events
Whenever the ZooKeeper cluster indicates a change of state in the available nodes, this has to be processed by the client. This should happen in the background to prevent impact on online requests. After new connections have been established, the obsolete ones are replaced.

These events will include:

  • Change of primary server
  • Change of inquiry server
  • Disable dedicated inquiry server

After a state update, the new state must be stored in a configuration file on the server. In case the connection towards ZooKeeper breaks, the last known configuration remains in use. Once the connection towards ZooKeeper has been restored, normal operating mode should continue.
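As an illustration of this behaviour, the sketch below watches a state znode, processes every change on the ZooKeeper client's event thread rather than on an online request thread, and persists the result to a local file as the last known configuration. The znode path, file location and payload format are assumptions.

```java
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

import java.nio.file.Files;
import java.nio.file.Path;

// Sketch of the event handling in the new client: every state change published
// via ZooKeeper is processed in the background and the result is persisted as
// the "last known configuration".
// Znode path, file location and payload format are illustrative assumptions.
public class HaStateWatcher implements Watcher {

    private static final String STATE_ZNODE = "/ae/ha/state";                                      // hypothetical
    private static final Path LAST_KNOWN_CONFIG = Path.of("/opt/ae/last-known-config.properties"); // hypothetical

    private final ZooKeeper zk;

    public HaStateWatcher(ZooKeeper zk) throws Exception {
        this.zk = zk;
        applyCurrentState(); // register the initial watch and apply the current state
    }

    @Override
    public void process(WatchedEvent event) {
        if (event.getType() == Event.EventType.NodeDataChanged
                && STATE_ZNODE.equals(event.getPath())) {
            try {
                applyCurrentState();
            } catch (Exception e) {
                // ZooKeeper unreachable: the last known configuration simply stays in use.
                e.printStackTrace();
            }
        }
    }

    private void applyCurrentState() throws Exception {
        // Read the new primary/inquiry assignment and re-register the watch.
        byte[] state = zk.getData(STATE_ZNODE, this, null);
        // e.g. "primary=host-a\ninquiry=host-b", or an indication that the
        // dedicated inquiry server has been disabled.
        Files.write(LAST_KNOWN_CONFIG, state);
        // ... establish the new connections in the background, then replace the old ones.
    }
}
```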

3. Should start application despite connection issues with ZooKeeper cluster
In normal operating mode, during startup, the application connects to the ZooKeeper cluster and retrieves the primary and inquiry hosts. However, in case of issues with connecting towards ZooKeeper, it still needs to be able to start. In this case, the last known configuration file from the server will be used for creating connections.
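A sketch of this startup behaviour, reusing the hypothetical file location from the previous example:

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

// Sketch of the startup fallback: try the ZooKeeper cluster first and, if it
// cannot be reached, start anyway using the last known configuration stored on
// the server. File location and property format are illustrative assumptions.
public abstract class StartupConfigLoader {

    private static final Path LAST_KNOWN_CONFIG = Path.of("/opt/ae/last-known-config.properties");

    public Properties load() throws Exception {
        try {
            return loadFromZooKeeper();              // normal operating mode
        } catch (Exception zooKeeperUnavailable) {
            // ZooKeeper cannot be reached: fall back to the last known configuration.
            Properties fallback = new Properties();
            try (InputStream in = Files.newInputStream(LAST_KNOWN_CONFIG)) {
                fallback.load(in);
            }
            return fallback;
        }
    }

    // Retrieval of the primary and inquiry hosts from ZooKeeper is omitted here.
    protected abstract Properties loadFromZooKeeper() throws Exception;
}
```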

4. Should support operating modes
Despite the requirement of being always online, sometimes we need to do maintenance on our systems. During this time, we need a different type of behavior from our client. Therefore, some configuration should be in place to cater for this.

We have introduced a number of operating modes for this purpose.
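The exact set of modes is specific to our setup; as a purely illustrative sketch, assuming only a normal and a maintenance mode and a hypothetical property name:

```java
import java.util.Properties;

// Purely illustrative sketch of configurable operating modes; the names and the
// behaviour described in the comments are assumptions, not our actual modes.
public enum OperatingMode {

    // Normal operating mode: follow the routing state published via ZooKeeper.
    NORMAL,

    // Maintenance: ignore routing events and keep all traffic on the primary
    // host while maintenance is being performed.
    MAINTENANCE;

    // Hypothetical property, e.g. ae.ha.operating-mode=MAINTENANCE
    public static OperatingMode fromConfig(Properties config) {
        return valueOf(config.getProperty("ae.ha.operating-mode", "NORMAL"));
    }
}
```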

5. Should be secure
As always, we want to process all data securely. For this, all connections have to be encrypted. In our case, this applies to connections both towards the ZooKeeper cluster and towards our primary/inquiry servers.
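On the ZooKeeper side, a Java client can be switched to TLS with the standard system properties supported by ZooKeeper 3.5 and later; the keystore locations and passwords below are placeholders. How the connections towards our primary/inquiry servers are encrypted depends on our application framework and is not shown here.

```java
// Sketch: enabling TLS on the ZooKeeper client side using the standard system
// properties supported by ZooKeeper 3.5+. Keystore locations and passwords are
// placeholders; in practice they would come from a secrets store.
public class SecureZooKeeperSettings {

    public static void apply() {
        System.setProperty("zookeeper.client.secure", "true");
        System.setProperty("zookeeper.clientCnxnSocket",
                "org.apache.zookeeper.ClientCnxnSocketNetty");
        System.setProperty("zookeeper.ssl.keyStore.location", "/path/to/client-keystore.jks");
        System.setProperty("zookeeper.ssl.keyStore.password", "changeit");
        System.setProperty("zookeeper.ssl.trustStore.location", "/path/to/client-truststore.jks");
        System.setProperty("zookeeper.ssl.trustStore.password", "changeit");
    }
}
```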

6. Should display the current connections
We want to visualize which connections are currently in use by the application. Every state change will trigger an update from the application towards the ZooKeeper cluster. These are processed by our monitoring dashboard to visualize the current status at all times.
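As a sketch of how the application could publish this state, the example below writes the connections currently in use to a status znode that a dashboard can read; the znode path, payload format and ACL handling are simplified assumptions.

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

import java.nio.charset.StandardCharsets;

// Sketch: publish the connections currently in use so the dashboard can
// visualize them. Znode path, payload format and ACLs are simplified assumptions.
public class ConnectionStatePublisher {

    private static final String STATUS_ZNODE = "/ae/ha/status/instance-1"; // hypothetical

    private final ZooKeeper zk;

    public ConnectionStatePublisher(ZooKeeper zk) {
        this.zk = zk;
    }

    public void publish(String primaryHost, String inquiryHost) throws Exception {
        byte[] payload = ("primary=" + primaryHost + "\ninquiry=" + inquiryHost)
                .getBytes(StandardCharsets.UTF_8);
        try {
            // Create the status node on the first update ...
            zk.create(STATUS_ZNODE, payload, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        } catch (KeeperException.NodeExistsException alreadyPresent) {
            // ... and simply overwrite it afterwards (-1 accepts any version).
            zk.setData(STATUS_ZNODE, payload, -1);
        }
    }
}
```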

Designing the solution

Based on these requirements, the new solution should look like this. All required logic resides in the ZooKeeper client (HA Client), which is part of the application itself. Our scope was limited to the application part; the setup of our primary and inquiry hosts therefore remains unchanged.

Overview of the new ZooKeeper client

First version

During the hackathon, we decided to focus on the first three requirements from the list above. The proof-of-concept implementation was done in our own development environment, so we agreed that the other requirements were of lesser importance at this point. If we could not prove the feasibility of those first three, the entire proposal would not be a viable solution anyhow.

At the end of the hackathon, we concluded that we managed to fulfill all three requirements. Our sandbox environment allowed us to simulate the various ZooKeeper events, and they were processed correctly by our implementation. On top of that, we were already able to implement the monitoring requirement as well. This allowed us to easily check the dashboard for the current state of the connections.

Next steps

Since the hackathon, we have continued development of our new ZooKeeper client. The remaining security requirement was a burning topic, and we have now managed to encrypt all connections towards the ZooKeeper cluster and the AE servers. With these changes, we have fulfilled all requirements we set upfront. Nevertheless, there is still a significant improvement we would like to make.

Currently, we have multiple inquiry instances available but only actively use one. The other instances are only used when a ZooKeeper event occurs that causes the configuration to update. It would be better to take advantage of all active inquiry hosts and distribute our load across them. In case of issues with one host, the client could then automatically route traffic to the others instead. This would contribute to always being available for our customers.
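A possible shape for this future improvement is a simple client-side balancer that round-robins over the healthy inquiry hosts and falls back to the primary host when none are available; the host list and health check below are placeholders.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Predicate;

// Sketch of the envisioned improvement: spread inquiry traffic over all healthy
// inquiry hosts and fall back to the primary host if none are available.
// The host list and health check are placeholders for whatever the client knows.
public class InquiryLoadBalancer {

    private final List<String> inquiryHosts;
    private final Predicate<String> isHealthy;
    private final AtomicInteger counter = new AtomicInteger();

    public InquiryLoadBalancer(List<String> inquiryHosts, Predicate<String> isHealthy) {
        this.inquiryHosts = inquiryHosts;
        this.isHealthy = isHealthy;
    }

    public String nextHost(String primaryFallback) {
        // Simple round-robin over the inquiry hosts, skipping unhealthy ones.
        for (int i = 0; i < inquiryHosts.size(); i++) {
            String candidate = inquiryHosts.get(
                    Math.floorMod(counter.getAndIncrement(), inquiryHosts.size()));
            if (isHealthy.test(candidate)) {
                return candidate;
            }
        }
        // No healthy inquiry host left: force the request towards the primary host.
        return primaryFallback;
    }
}
```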
