Over the past few days, the KIE Foundation Team reached an important milestone: building a production-ready HA infrastructure for Business Central.
This journey began many months ago. Our main goal was to provide HA in a simplified cluster setup that leverages the provided infrastructure, especially in a containerized environment like OpenShift.
This effort required a new architecture for the core components of Business Central, including file-system, indexing, messaging and locking mechanisms.
Before diving into the technical details, let’s take a quick look at what HA means from the Business Central perspective and how you can set it up in your own environment.
What is High Availability from the BC perspective?
High Availability can mean different things depending on the context. That is why it is important to make clear what is currently supported in Business Central.
Our current HA implementation has two major goals:
- All data committed on one node is reflected on the others and is always preserved;
- Provide an infrastructure that lets the user work in a fail-safe environment. If one server goes down, users will be able to continue working on any other available server/pod.
 In some extreme corner cases, the creation of a new project or asset can still fail. This work is tracked in AF-2142.
 UI state that has not been committed will be lost, but the user will be able to keep working on a different node. For instance, if you are modifying an asset and the node fails, you will be redirected to the home screen (but on a different node).
Overview of the architecture
Before diving into the details, let’s have a quick overview of the new clustered setup of Business Central and do a quick setup of our infrastructure.
There are four main components of this architecture:
- Business Central: our authoring web application that allows you to operate over individual components in a unified web-based environment: to create, manage and edit your Processes, to run, manage, and monitor Process instances, generate reports, and manage the Tasks produced, as well as create new Tasks and notifications.
- Message System (based on AMQ): allows us to notify all members of the cluster about important events that occur on other nodes, like “A new file system has been created”. This lets us refresh the UI in real time instead of waiting for a user refresh to show the new information.
- Indexing Server (Infinispan): allows the user to quickly search for any Business Central information, including asset and project content;
- Shared File System (based on NFS/GlusterFS): provides a single source of truth where all projects and configuration are stored.
 Ceph will be the default storage for OpenShift 4.
Business Central HA Deep Dive
As we already mentioned, there are four main components in this architecture. Let’s dive into the implementation details of each of them, discuss infrastructure configuration and requirements, share some fine-tuning tips and, finally, cover what is currently not supported.
Lucene is the default indexing engine. For standalone deployment, Lucene will index assets content and properties so the user can navigate/search faster in Business Central library, simple as that.
But Lucene is not suitable for a clustered environment. It has limitations like “just one indexer per index”. Because Lucene stores information in its own file system structure, the first node that opens an index holds it until that node closes it, so different nodes of the cluster cannot write to or read from the same index files.
So this was not suitable for an HA environment. We needed to index assets from different nodes at the same time and keep the speed we had in standalone mode, so we decided to use Infinispan, a distributed in-memory key/value data store that uses Protobuf schemas and the HotRod protocol for high-speed communication.
Infinispan also uses Lucene under the hood as an indexing engine, so it was perfect for us. Now we can query the same index from different nodes and add or modify entries. Of course, we needed to adapt our code to this new indexing engine, so we created an abstraction layer made of MetadataIndexEngine and IndexProvider.
The first one contains all the methods we need to create, update and delete assets in the index engine. The second one is an interface that lets us implement how we communicate with each index engine, so we have one implementation for Lucene (LuceneIndexProvider) and another one for Infinispan (InfinispanIndexProvider).
If in the future we need a different technology, we can simply create a new IndexProvider that implements the protocol, transformations, etc. that the new index engine needs.
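To make the split between the engine and the provider concrete, here is a minimal sketch of the idea. It is not the real Business Central code: the type names SketchIndexProvider, InMemoryIndexProvider and SketchMetadataIndexEngine, and every method on them, are hypothetical and chosen only for illustration. The in-memory provider stands in for where a Lucene or Infinispan implementation would plug in.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical, simplified provider contract (not the real IndexProvider API).
interface SketchIndexProvider {
    void index(String assetId, String content);
    void delete(String assetId);
    List<String> findByTerm(String term);
}

// Stand-in provider backed by a map, only to keep the sketch runnable.
// A Lucene- or Infinispan-backed provider would implement the same contract.
class InMemoryIndexProvider implements SketchIndexProvider {
    private final Map<String, String> entries = new ConcurrentHashMap<>();

    @Override
    public void index(String assetId, String content) {
        entries.put(assetId, content);
    }

    @Override
    public void delete(String assetId) {
        entries.remove(assetId);
    }

    @Override
    public List<String> findByTerm(String term) {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, String> entry : entries.entrySet()) {
            if (entry.getValue().contains(term)) {
                hits.add(entry.getKey());
            }
        }
        return hits;
    }
}

// Hypothetical engine facade: callers use it and never see a concrete provider.
class SketchMetadataIndexEngine {
    private final SketchIndexProvider provider;

    SketchMetadataIndexEngine(SketchIndexProvider provider) {
        this.provider = provider;
    }

    void assetCreatedOrUpdated(String assetId, String content) {
        provider.index(assetId, content); // create and update both re-index the asset
    }

    void assetDeleted(String assetId) {
        provider.delete(assetId);
    }

    List<String> search(String term) {
        return provider.findByTerm(term);
    }
}
```

With this shape, swapping the index technology only changes which provider is passed to the engine constructor; the calling code stays the same.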
Shared File System
Our source of truth is Git, and Git is based on the file system, so to keep it synchronized across the cluster we need a shared file system mount point.
In this case, we have two flavours: NFS and GlusterFS. Both are very similar. For an on-premise installation NFS is easier than GlusterFS, while for OpenShift/cloud GlusterFS is the right choice.
Later in this post, you will find out how NFS and Gluster need to be configured to work with Business Central. Gluster has a few more steps because we need to configure it to behave like a file system for databases.
Last but not least, the messaging system. In this case, I will redirect you to this blog post from Eder Ignatowicz, who laid the groundwork for this magic to work.
None of the clustering would be possible without this piece, which lets us replicate CDI events across all nodes in the cluster. Knowing that a new project was created, that indexing has finished, that permissions changed or that an asset was locked all depends on it. For this to happen we need ActiveMQ, a multi-protocol, Java-based messaging service that lets us collect all the events sent and broadcast them to all active nodes.
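The broadcast idea can be sketched in a few lines. This is only an in-memory illustration, not the actual ActiveMQ or CDI integration: SketchTopic stands in for the broker, and ClusterEvent, SketchNode and the origin node id carried by each event are assumptions made for the example. The point it shows is that every node receives every event and skips the ones it produced itself.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical event envelope: a payload plus the id of the node that produced it.
final class ClusterEvent {
    final String originNodeId;
    final String payload;

    ClusterEvent(String originNodeId, String payload) {
        this.originNodeId = originNodeId;
        this.payload = payload;
    }
}

// In-memory stand-in for the broker topic; a real setup would go through ActiveMQ.
class SketchTopic {
    private final List<SketchNode> subscribers = new CopyOnWriteArrayList<>();

    void subscribe(SketchNode node) {
        subscribers.add(node);
    }

    void publish(ClusterEvent event) {
        for (SketchNode node : subscribers) {
            node.onMessage(event); // delivered to every node, the sender included
        }
    }
}

// Hypothetical cluster member that fires events and records the remote ones it sees.
class SketchNode {
    final String nodeId;
    final List<String> remoteEvents = new ArrayList<>();
    private final SketchTopic topic;

    SketchNode(String nodeId, SketchTopic topic) {
        this.nodeId = nodeId;
        this.topic = topic;
        topic.subscribe(this);
    }

    void fire(String payload) {
        topic.publish(new ClusterEvent(nodeId, payload));
    }

    void onMessage(ClusterEvent event) {
        if (event.originNodeId.equals(nodeId)) {
            return; // this node produced the event, so it is not a remote event
        }
        remoteEvents.add(event.payload);
    }
}
```

If node-a fires "project-created", node-b records it as a remote event while node-a ignores its own copy.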
Business Central had several changes too. It wasn’t only a bunch of external components that magically made it work in a clustered environment; we had to modify many things under the hood to achieve this goal.
We needed to improve our locking system to lock each space individually instead of the whole niogit file system, improve some caches, and detect whether the events received came from the same node that generated them or from a different one. We also changed the way projects are deleted, so we can dispose of all the file system and Git infrastructure before executing the actual deletion, and the way we start the indexing process, so two nodes don’t run the same procedure and waste resources.
It was an interesting and hard time, but I’ve learned a lot about this high-concurrency, distributed ecosystem, and achieving this goal is very, VERY rewarding.
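As an illustration of locking per space rather than globally, here is a minimal sketch. It only coordinates threads inside a single JVM and says nothing about how Business Central coordinates locks across nodes; the SpaceLocks class and its methods are hypothetical.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Supplier;

// Hypothetical registry holding one lock per space instead of one global lock.
class SpaceLocks {
    private final ConcurrentMap<String, ReentrantLock> locks = new ConcurrentHashMap<>();

    private ReentrantLock lockFor(String spaceName) {
        // Lazily create a lock the first time a space is seen.
        return locks.computeIfAbsent(spaceName, name -> new ReentrantLock());
    }

    // Runs the action while holding only the lock of the given space.
    <T> T withSpaceLock(String spaceName, Supplier<T> action) {
        ReentrantLock lock = lockFor(spaceName);
        lock.lock();
        try {
            return action.get();
        } finally {
            lock.unlock();
        }
    }

    boolean isLocked(String spaceName) {
        return lockFor(spaceName).isLocked();
    }
}
```

While one thread works inside withSpaceLock("spaceA", ...), operations on "spaceB" are not blocked, which is the benefit of the finer granularity.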
Infrastructure Configuration and Requirements
Let’s explain some key aspects of HA infrastructure configuration and requirements.
These are the hardware requirements, configuration and fine-tuning for Business Central.
For each BC node we recommend:
- CPUs: 4 cores
- RAM: 8 GB
Business Central has its own user management capabilities: you can create and delete users, roles and groups. But this solution has a single Business Central instance in mind. If you want a fully HA solution you will need an SSO server. Does that mean that Business Central will not work without SSO? No, it will work, but you will lose the replication of credentials across all the nodes. If you are on node A and it goes down, or the load balancer sends your next request to a different node, you will need to log in again. SSO solves those problems and also gives you other login strategies, like social network login.
Limit Threads (Advanced Configuration for Business Central)
By default, there is a thread limit for the Executor Services (they let us execute operations asynchronously in multiple threads), but you can modify it. The managed and unmanaged executors should have the same limit. A limit of zero means that there is no limit.
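The "zero means no limit" convention can be sketched with plain java.util.concurrent executors. This is an assumption about how such a limit could be mapped to a thread pool, not the actual Business Central implementation; the ExecutorLimits class is hypothetical.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical helper mapping a configured limit to an executor.
class ExecutorLimits {
    static ExecutorService newExecutor(int limit) {
        if (limit < 0) {
            throw new IllegalArgumentException("limit must be >= 0");
        }
        // 0 = unbounded (threads created on demand); otherwise a fixed upper bound.
        return limit == 0
                ? Executors.newCachedThreadPool()
                : Executors.newFixedThreadPool(limit);
    }
}
```

Calling ExecutorLimits.newExecutor(0) gives a pool that grows on demand, while ExecutorLimits.newExecutor(4) never runs more than four tasks at once.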
One of the key components of the cluster infrastructure is the message queue. It is in charge of CDI events replication across the entire cluster.
These are the hardware requirements, configuration and fine-tuning for the message queue.
These are the default parameters. If appformer-jms-connection-mode is NONE (which is the default), the message queue is not enabled, so the app will not work in clustered mode.
There are three possible values for appformer-jms-connection-mode:
- NONE: as we explained before. Cluster is disabled.
- REMOTE: Cluster is enabled.
- JNDI: Cluster is enabled.
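The three modes above map naturally to an enum. The sketch below is an illustration only: the JmsConnectionMode type, its methods and the case-insensitive parsing are assumptions, and only the property name and the three values come from this post.

```java
import java.util.Locale;

// Hypothetical enum mirroring the three values of appformer-jms-connection-mode.
enum JmsConnectionMode {
    NONE, REMOTE, JNDI;

    // Clustering is on for every mode except NONE.
    boolean clusterEnabled() {
        return this != NONE;
    }

    // A missing or blank value falls back to NONE, the default described above.
    static JmsConnectionMode parse(String raw) {
        if (raw == null || raw.trim().isEmpty()) {
            return NONE;
        }
        return valueOf(raw.trim().toUpperCase(Locale.ROOT));
    }
}
```

A caller could evaluate JmsConnectionMode.parse(System.getProperty("appformer-jms-connection-mode")).clusterEnabled() to decide whether the node should join the cluster.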
These are the hardware requirements, configuration and fine-tuning for the indexing server (Infinispan).
- CPUs: 2 cores
- RAM: 4 GB
Don’t forget to also increase -Xmx and -Xms to 4 GB!
Infinispan doesn’t need any change to work with Business Central, but if you change its username or password you should use these system properties to override the default values already configured.
Shared File System
The Business Central source of truth is stored in the file system. To have that information on each node we need to share that file system, and the solution is NFS or GlusterFS. They basically offer the same thing and are both viable options.
In the /etc/exports file of your NFS server you should have a line like this, where /opt/nfs/kie is the path to be shared and “*” stands for the IPs that can access that shared folder.
On each client, the shared file system should be mounted in a directory that already exists:
One more consideration: in standalone-full-ha.xml you will need to add the property that binds the .niogit directory to the NFS shared directory. This is also valid for the GlusterFS configuration.
GlusterFS is a scalable network file system and a replacement for NFS. It is the preferred file system for OpenShift 3.x. The parameters specified below are mandatory. If they are not set, we can’t ensure the right behavior of the file system. Those configurations are VERY IMPORTANT.
How can I quickly check if my cluster is ready and working?
Open Business Central on two different machines, import the Mortgages project from the samples and open the same file on both nodes (e.g. Dummy rule.drl). As soon as you start editing the file on one node, it will be locked on the other node. Locking a file is one of the cluster messaging use cases that we will explore in detail in the next section. As soon as you save the file on one node, you will get the updated asset on the other.
Your Own Docker Image
For sure I will not be responsible for this! But if you want to create your own docker image, this is a Docker Compose example that will let you deploy a cluster of two Business Central instances, Infinispan and ActiveMQ.
Use it at your own risk
What can users expect from Business Central (and future work)
There are still some features and corner cases that we need to improve. One of them is recovering the state of a project that was being created when a Business Central node went down. If Business Central stops while a project is being created, the project will remain in a corrupt state (and will be impossible to delete via the UI). These are the Jiras we opened for the failure recovery tasks: AF-2143 and AF-2144.
Currently, HA on OpenShift only works on version 3.11. We are working to validate and integrate it with OCP 4.2+. The major impact is that we will rely on Ceph Storage instead of GlusterFS (used on OCP 3.11).
Nothing should affect the user from the storage point of view. We are also planning to work on some performance improvements, including sharing index building and processing across all the nodes in the cluster.
The end of the road
It was a very long journey, but we finally arrived at our destination. It wasn’t easy. Modifying an existing application with more than 10 years of code to adapt it to the new generation of infrastructure demands many changes. We depend heavily on the file system, and detaching it and making it distributed was a very hard task.
I will probably write about that experience, because going from a monolith to the cloud and to microservices is not as simple a task as many suggest.