RDFox High-Availability Setup using Kubernetes

Nick Form
Sep 18, 2020 · 10 min read

As of today, users have the option to run RDFox in Docker using official images from Oxford Semantic Technologies. In this article I will describe how to deploy these images to a multi-zone Kubernetes cluster to achieve a high-availability, read-only configuration.

RDFox is a high-performance knowledge graph and semantic reasoning engine. It is an in-memory solution that allows flexible, incremental addition and retraction of data as well as incremental reasoning, and its algorithms have been mathematically validated at the University of Oxford. Since v3, RDFox has also offered the ability to incrementally save updates to persistent storage for easier restarts.

Previously, customers wishing to run RDFox in Docker had to build their own images from the official release distributions, resulting in additional application development and maintenance effort. This morning we announced some good news on that front: the release of the official images mentioned above.

Kubernetes is the most popular container orchestration platform, with as-a-service offerings from all three major cloud vendors and a large ecosystem of supporting tools. Although originally best at orchestrating stateless containers, the platform has gradually added support for workloads which require stable, persistent storage through the StatefulSet resource type. Using this, developers can ensure that each replica within a set has its own stable storage and network identity, better matching the requirements of replicated data stores.

The Goal

In this article, I will walk through how to build and deploy a high-availability, read-only RDFox service to a Kubernetes cluster. The target cluster used to test the setup was built using the Amazon EKS Architecture Quick Start, which provisions one Kubernetes node in each of the three availability zones within the chosen region. To achieve our desired setup, we will define a StatefulSet specifying three RDFox replicas. Kubernetes will automatically distribute these across the region’s three availability zones, provisioning an Elastic Block Store (EBS) volume in the correct zone to act as each replica’s server directory.

Although tested on AWS, the configuration should be readily adaptable to other cloud providers or on-premise Kubernetes clusters. To help with this, I will point out the parts of the configuration which are specific to AWS services.

Note that the described setup is intended as an example only and omits details which would be important in a production setup, such as security controls and resource limit configuration. With that caveat out of the way, let’s dive into some YAML!

Defining the Objects

As noted above, Kubernetes’s support for stateful workloads is via the StatefulSet resource type which will be our main resource. Every instance of this type of resource requires its own headless Service object to be responsible for the network identities of the pods in the set. In addition, we will define a second, load-balanced Service for clients that don’t care which instance they’re talking to. Finally, we will define an Ingress resource to expose the load-balanced service to the outside world for a quick test.

In addition to the four main objects, which we will define in YAML, we will manually add secrets to the cluster to hold role credentials and a license key. We will also use a pre-populated Elastic File System (EFS) volume containing the data to be loaded into each replica during the initialisation stage. EFS is chosen for this role because, unlike EBS, its file systems are accessible across availability zones, enabling us to share a single copy of the initialisation data with all three replicas.

To start with, let’s examine the headless Service which we’ll name rdfox-set. It is defined as follows:
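A sketch of this definition, reconstructed from the description that follows (the label, port and selector values come from the text; the original snippet’s exact layout may differ slightly):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: rdfox-set
  labels:
    app: rdfox-app
spec:
  ports:
    - port: 80
      targetPort: rdfox-endpoint
  clusterIP: None
  selector:
    app: rdfox-app
```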

The purpose of this service is to define a network domain within which the Pods belonging to our StatefulSet will be assigned stable host names. On lines 9 and 10, we specify that the service should listen on port 80 and that requests should be routed to the port named rdfox-endpoint on the selected Pods. We will define this port later, in our StatefulSet. Line 11, specifying clusterIP: None is what defines this as a headless service. The selector specified on lines 12–13 tells Kubernetes that we want traffic for this service to be routed to Pods labelled with app: rdfox-app. All pretty simple so far but here comes the big one…

Our StatefulSet object is where the bulk of our configuration lives. It begins as follows:
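A sketch of this opening section (the resource name matches the kubectl output shown later; the exact field layout is assumed):

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: rdfox-stateful-set
spec:
  selector:
    matchLabels:
      app: rdfox-app
  serviceName: rdfox-set
  replicas: 3
```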

The above lines declare the StatefulSet’s type, name, Pod selector, the name of the headless Service we created for it and finally the number of replicas we want. The definition continues as follows:
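A sketch of the Pod template, reconstructed from the discussion that follows. The volume names, mount paths and securityContext details are assumptions, the line numbers cited in the text refer to the original stateful-set-pt-2.yml and align only approximately here, and the wiring of the first-role-credentials secret into the init container is omitted because the text does not describe its exact mechanism:

```yaml
  template:
    metadata:
      labels:
        app: rdfox-app
    spec:
      containers:
        - name: rdfox
          image: oxfordsemantic/rdfox:3.1.1
          # Override the default CMD to point RDFox at the mounted license
          args: ["-license-file", "/license/RDFox.lic", "daemon"]
          ports:
            - containerPort: 12110
              name: rdfox-endpoint
          volumeMounts:
            - name: license
              mountPath: /license
              readOnly: true
            # The image's default server directory location (path assumed)
            - name: server-directory
              mountPath: /home/rdfox/.RDFox
      initContainers:
        - name: init-server-directory
          image: oxfordsemantic/rdfox-init:3.1.1
          securityContext:
            runAsUser: 0
            capabilities:
              add: ["CHOWN", "SETUID", "SETGID"]
          env:
            # The companion image accepts the license via this variable
            - name: RDFOX_LICENSE_CONTENT
              valueFrom:
                secretKeyRef:
                  name: rdfox-license
                  key: RDFox.lic
          volumeMounts:
            - name: server-directory
              mountPath: /home/rdfox/.RDFox
            # Pre-populated EFS file system containing initialize.rdfox
            - name: init-data
              mountPath: /data
              readOnly: true
      volumes:
        - name: license
          secret:
            secretName: rdfox-license
        - name: init-data
          # AWS-specific: the claim name for the EFS volume is hypothetical
          persistentVolumeClaim:
            claimName: rdfox-init-data
```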

This section defines the template for the Pods that the StatefulSet will manage. It begins by ensuring that they all carry the label app: rdfox-app so that they are matched by the selectors defined in the earlier part of the StatefulSet and ultimately in both Services. After that we begin, on line 5, the spec field for the template, which determines what each replica Pod will contain.

The containers field beginning on line 6 defines the main rdfox container using the official Docker image for RDFox v3.1.1, oxfordsemantic/rdfox:3.1.1. It exposes the default port for the image (12110) with name rdfox-endpoint, matching the definition in our headless Service resource. It also specifies a volume mount for the server directory on line 18 to the default server directory location of the image. Since we need each replica to have a different logical volume mounted in this role, the name used here refers not to one of the existing volumes, declared in lines 45–54 of this section, but to a PersistentVolumeClaim declared in the volumeClaimTemplates field in the last section of this resource’s definition below.

The initContainers field beginning on line 20 declares an initialisation step for each Pod that belongs to the StatefulSet. This container, named init-server-directory, must complete successfully before the StatefulSet controller will attempt to start the main rdfox container within each Pod. It specifies oxfordsemantic/rdfox-init:3.1.1, the companion for oxfordsemantic/rdfox:3.1.1, as its image. The companion image is provided to make it easy to prepare the server directory before mounting it to containers using the main image. This includes changing the ownership of the directory to the default user for the image and initialising the directory using RDFox. Some data store containers include scripts within their main image to make their initialisation step invisible to users. Although this is slightly more convenient, it means that the containers must be started as root and then retain superuser capabilities throughout their lifetime, even though those capabilities are only needed while the container starts up. For RDFox, we recommend running only the companion image as root with the CAP_CHOWN, CAP_SETUID and CAP_SETGID capabilities, and then running the main image as its default non-root user.

In order to be able to prepare the server directory for the main container, the init-server-directory container mounts the Pod’s server directory in exactly the same way as the rdfox container. Another feature of the companion image is to look for a file at container path /data/initialize.rdfox and, if present, pass it to the contained RDFox process, which will then attempt to execute it in the RDFox shell. To take advantage of this, our initialisation container mounts the pre-populated EFS file system mentioned earlier, which contains such a script, to container path /data, the default root directory for the RDFox shell. The initialize.rdfox script in this EFS file system is as follows:
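A plausible sketch of that script, based on the description in the next paragraph (the exact RDFox shell syntax, particularly for access control, and the data and rule file names are assumptions):

```
dstore create family
active family
import data.ttl
import rules.dlog
# Create the special guest role and grant it read access to the
# server's resources so anonymous REST calls succeed (syntax assumed)
role create guest
grant read ".*" guest
```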

This creates a data store called family and populates it with the example data and rules from the Getting Started guide for RDFox, which are also stored inside the mounted volume. It also creates the special guest role and allows it to read all of the server’s resources. This will allow us to make calls to the REST service anonymously. All of this is persisted to the server directory which, when mounted to the main rdfox container, then has everything needed for RDFox to load the data store and access control policies in daemon mode.

The final thing to discuss from the above block of YAML is the approach to mounting the license, which is done in different ways for the rdfox and init-server-directory containers. Our official recommendation for mounting the license is to bind-mount it to /opt/RDFox/RDFox.lic so that it will be found by the executable in the same directory. This works well when launching containers using docker run, but Kubernetes does not allow mounting of single files to existing directories, so trying this approach leads to a situation where the image’s entrypoint executable is hidden by the mount and the container can’t start. To work around this, our definition mounts the license volume (defined on lines 46–51) to the rdfox container at path /license and then overrides the default CMD for the image to explicitly set the license-file server parameter to /license/RDFox.lic. In future, RDFox will accept the license via an environment variable, RDFOX_LICENSE_CONTENT, avoiding the need to override the default command in most circumstances. The companion image used in the init-server-directory container already accepts this variable, and lines 34–38 of stateful-set-pt-2.yml instead map the rdfox-license secret into the container via it.

The last part of our StatefulSet definition is the volumeClaimTemplates field discussed earlier. It looks like this:
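A sketch consistent with the description that follows (the requested storage size is illustrative):

```yaml
  volumeClaimTemplates:
    - metadata:
        name: server-directory
      spec:
        accessModes:
          - ReadWriteOnce
        storageClassName: gp2
        resources:
          requests:
            storage: 10Gi
```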

Here we find our first piece of AWS-specific configuration in the use of the gp2 StorageClass. The gp2 resource, which is installed by default onto the clusters built by the EKS Quick Start template, relates to the Elastic Block Store. Using it in the template for our server-directory PersistentVolumeClaim tells Kubernetes to create a new EBS volume in the same availability zone as the node that is running the Pod to fulfil this role. To port the example configuration to another cloud provider, set up the most suitable equivalent StorageClass for that provider on your cluster and use it in place of gp2 in this template. The official Kubernetes documentation for the StorageClass resource type contains details of many alternatives.

The complete definition of our StatefulSet resource is visible here.

Our load-balancing Service definition will be responsible for distributing requests to the replicas. In essence, this is our high-availability service. It is a pretty vanilla Kubernetes Service. As with our headless Service, it routes traffic to the port named rdfox-endpoint on pods labelled app: rdfox-app. Unlike our headless service, though, we set its type to NodePort. The definition is:
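A sketch of this Service (the resource name rdfox-service is an assumption; the port, target port and selector mirror the headless Service):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: rdfox-service
spec:
  type: NodePort
  ports:
    - port: 80
      targetPort: rdfox-endpoint
  selector:
    app: rdfox-app
```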

We now have a service that could be used by other containers within the cluster, which is sufficient for many use cases. For the purposes of demonstration, though, we also define the following Ingress resource to allow us to reach the service from the outside world at the imaginary domain rdfox-kubernetes.example.org:
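A sketch of such an Ingress (the annotations shown are typical of the alb-ingress-controller of that era, the certificate ARN is a placeholder, and the backend service name is assumed):

```yaml
apiVersion: networking.k8s.io/v1beta1
kind: Ingress
metadata:
  name: rdfox-ingress
  annotations:
    kubernetes.io/ingress.class: alb
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTPS": 443}]'
    alb.ingress.kubernetes.io/certificate-arn: <certificate-arn>
spec:
  rules:
    - host: rdfox-kubernetes.example.org
      http:
        paths:
          - backend:
              serviceName: rdfox-service
              servicePort: 80
```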

This resource is another place where AWS-specialisation is seen, specifically in the annotations on lines 5–8. These configure the behaviour of the alb-ingress-controller, a component which makes our desired Ingress definition a reality using an Application Load Balancer on AWS. Deleting these lines would still leave us with a valid Ingress resource; however, other providers may need equivalent custom annotations. For the above declaration to work correctly on AWS, we would also need a TLS certificate for the stated domain in AWS Certificate Manager.

Deploying

We now have four files defining our main resources which, for convenience, we gather in a directory called RDFoxKubernetes on a host where we have kubectl configured to control our target cluster. Before we push our resource definitions to the cluster, we first need to create the secrets they depend on.

To create the rdfox-license secret, we add a valid, in-date RDFox license key to file RDFox.lic within our working directory and run:

kubectl create secret generic rdfox-license --from-file=./RDFox.lic

Likewise, to create the credentials for the first role, we add the desired role name to file rolename and the desired password to file password, both within our working directory, and then run:

kubectl create secret generic first-role-credentials \
--from-file=./rolename --from-file=./password

Finally we create our StatefulSet and accompanying resources with:

kubectl apply -f RDFoxKubernetes

The StatefulSet controller on our cluster will now set about bringing our cluster into the desired state we have declared in the manifests. For each replica, this will involve provisioning a fresh EBS volume to fulfil the server-directory PersistentVolumeClaim declared in our StatefulSet’s template, running the initialisation container to populate the new volume and finally launching the main RDFox container. The replicas will be assigned integers from 0 to 2 and the controller will not attempt to create replicas with higher indices until all lower-indexed replicas are up and healthy.

We can follow the state of this process as follows:

$ kubectl get statefulsets
NAME                 READY   AGE
rdfox-stateful-set   1/3     1m

When this shows that rdfox-stateful-set has 3 out of 3 pods ready, we can look up the name assigned to our ingress with:

kubectl get ingress

The entry under the column headed ADDRESS for the rdfox-ingress resource is the name of a public-facing load balancer created specifically for the ingress. We can set this as the value of a DNS A record for our imaginary rdfox-kubernetes.example.org domain and then, allowing some time for DNS records to update, call our service from any host with internet access. For a simple test, let’s curl the API that lists the server’s data stores to check that our family data store is present as expected:

$ curl https://rdfox-kubernetes.example.org/datastores?Name
"family"

Success! 😁

Cleaning Up

Once we’re done with our test deployment, we can clean up the resources with:

kubectl delete -f RDFoxKubernetes

This will delete all the resources we explicitly declared, but not the PersistentVolumeClaims that were created from the template in our StatefulSet. According to the official Kubernetes documentation, this choice was made

…to ensure data safety, which is generally more valuable than an automatic purge of all related StatefulSet resources.

These can be listed with kubectl get pvc and deleted with kubectl delete pvc <pvc-name>. We can also now delete the DNS record we added.

Conclusion

We’ve seen that RDFox can be deployed into a high-availability, read-only setup using Kubernetes. The newly-published official Docker images from Oxford Semantic Technologies help make this a convenient deployment option and we look forward to hearing from users about their experiences of running RDFox in this way.

Want to learn more? You can request an RDFox license here and try this for yourself. Alternatively, you can learn more about RDFox here or on our medium publication.

Team and Resources

The team behind Oxford Semantic Technologies started working on RDFox in 2011 at the Computer Science Department of the University of Oxford with the conviction that flexible and high-performance reasoning was a possibility for data intensive applications without jeopardising the correctness of the results. RDFox is the first market-ready knowledge graph designed from the ground up with reasoning in mind. Oxford Semantic Technologies is a spin out of the University of Oxford and is backed by leading investors including Samsung Venture Investment Corporation (SVIC), Oxford Sciences Innovation (OSI) and Oxford University’s investment arm (OUI). The author is proud to be a member of this team.

Oxford Semantic Technologies

A high performance knowledge graph and semantic reasoning engine

Written by Nick Form

Software Engineer at Oxford Semantic Technologies

Oxford Semantic Technologies develop RDFox, the first market-ready high-performance knowledge graph designed from the ground up with semantic reasoning in mind. Founded in 2017 as a spin-out of the University of Oxford with a mission to bring cutting-edge research to industry.
