As of today, users have the option to run RDFox in Docker using official images from Oxford Semantic Technologies. In this article I will describe how to deploy these images to a multi-zone Kubernetes cluster to achieve a high-availability, read-only configuration.
RDFox is a high-performance knowledge graph and semantic reasoning engine. It is an in-memory solution that supports flexible incremental addition and retraction of data, as well as incremental reasoning, and its algorithms have been mathematically validated at the University of Oxford. Since v3, RDFox has also offered the ability to incrementally save updates to persistent storage for easier restarts.
Previously, customers wishing to run RDFox in Docker had to build their own images from the official release distributions, resulting in additional application development and maintenance effort. This morning we announced some good news on that front: official RDFox Docker images are now publicly available.
Kubernetes is the most popular container orchestration platform, with -as-a-service offerings from all three major cloud vendors and a large ecosystem of supporting tools. Although originally best at orchestrating stateless containers, the platform has gradually added support for workloads which require stable, persistent storage through the StatefulSet resource type. Using this, developers can ensure that each replica within a set has its own stable storage and network identity, better matching the requirements of replicated data stores.
In this article, I will walk through how to build and deploy a high-availability, read-only RDFox service to a Kubernetes cluster. The target cluster used to test the setup was built using the Amazon EKS Architecture Quick Start, which provisions one Kubernetes node in each of the three availability zones within the chosen region. To achieve our desired setup, we will define a StatefulSet specifying three RDFox replicas. Kubernetes will automatically distribute these across the region’s three availability zones, provisioning an Elastic Block Store (EBS) volume in the correct zone to act as each replica’s server directory.
Although tested on AWS, the configuration should be readily adaptable to other cloud providers or on-premise Kubernetes clusters. To help with this, I will point out the parts of the configuration which are specific to AWS services.
Note that the described setup is intended as an example only and omits details which would be important in a production setup, such as security controls and resource limit configuration. With that caveat out of the way, let's dive into some YAML!
Defining the Objects
As noted above, Kubernetes’s support for stateful workloads is via the StatefulSet resource type which will be our main resource. Every instance of this type of resource requires its own headless Service object to be responsible for the network identities of the pods in the set. In addition, we will define a second, load-balanced Service for clients that don’t care which instance they’re talking to. Finally, we will define an Ingress resource to expose the load-balanced service to the outside world for a quick test.
In addition to the four main objects, which we will define in YAML, we will manually add secrets to the cluster to hold role credentials and a license key. We will also use a pre-populated Elastic File System (EFS) volume containing the data to be loaded into each replica during the initialisation stage. EFS is chosen for this role because, unlike EBS, its file systems are accessible across availability zones, enabling us to share a single copy of the initialisation data with all three replicas.
The Headless Service
To start with, let's examine the headless Service, which we'll name rdfox-set. It is defined as follows:
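(The listing below is a reconstruction from the commentary that follows rather than the original file; field placement is chosen to match the line numbers referenced in the text.)

```yaml
apiVersion: v1
kind: Service
metadata:
  name: rdfox-set
  labels:
    app: rdfox-app
spec:
  ports:
    - port: 80
      targetPort: rdfox-endpoint
  clusterIP: None
  selector:
    app: rdfox-app
```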
The purpose of this service is to define a network domain within which the Pods belonging to our StatefulSet will be assigned stable host names. On lines 9 and 10, we specify that the service should listen on port 80 and that requests should be routed to the port named rdfox-endpoint on the selected Pods. We will define this port later, in our StatefulSet. Line 11, specifying clusterIP: None, is what defines this as a headless service. The selector specified on lines 12–13 tells Kubernetes that we want traffic for this service to be routed to Pods labelled with app: rdfox-app. All pretty simple so far, but here comes the big one…
The Stateful Set
Our StatefulSet object is where the bulk of our configuration lives. It begins as follows:
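Reconstructed from the description that follows (the original listing is not reproduced here), this opening section might read:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: rdfox-stateful-set
spec:
  selector:
    matchLabels:
      app: rdfox-app
  serviceName: rdfox-set
  replicas: 3
```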
The above lines declare the StatefulSet’s type, name, Pod selector, the name of the headless Service we created for it and finally the number of replicas we want. The definition continues as follows:
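The sketch below reconstructs this part of the definition (stateful-set-pt-2.yml) from the commentary that follows. The server directory mount path, the RDFOX_ROLE and RDFOX_PASSWORD environment variable names, and the init-data claim name are assumptions rather than the original values, and field placement is chosen to match the line numbers referenced in the text:

```yaml
template:
  metadata:
    labels:
      app: rdfox-app
  spec:
    containers:
      - name: rdfox
        image: oxfordsemantic/rdfox:3.1.1
        args: ["-license-file", "/license/RDFox.lic", "daemon"]  # overrides the image CMD
        ports:
          - name: rdfox-endpoint
            containerPort: 12110
            protocol: TCP
        volumeMounts:
          - name: license
            mountPath: /license
            readOnly: true
          - name: server-directory                # fulfilled by the volumeClaimTemplates field
            mountPath: /home/rdfox/.RDFox        # assumed default server directory of the image
    initContainers:
      - name: init-server-directory
        image: oxfordsemantic/rdfox-init:3.1.1
        env:
          - name: RDFOX_ROLE                     # assumed variable name
            valueFrom:
              secretKeyRef:
                name: first-role-credentials
                key: rolename
          - name: RDFOX_PASSWORD                 # assumed variable name
            valueFrom:
              secretKeyRef:
                name: first-role-credentials
                key: password
          - name: RDFOX_LICENSE_CONTENT
            valueFrom:
              secretKeyRef:
                name: rdfox-license
                key: RDFox.lic
        volumeMounts:
          - name: server-directory
            mountPath: /home/rdfox/.RDFox
          - name: init-data                      # pre-populated EFS file system
            mountPath: /data
            readOnly: true
    volumes:
      - name: license
        secret:
          secretName: rdfox-license
          items:
            - key: RDFox.lic
              path: RDFox.lic
      - name: init-data
        persistentVolumeClaim:
          claimName: rdfox-init-data             # assumed name of the EFS-backed claim
```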
This section of the definition defines the template for the Pods that the StatefulSet will manage. It begins by ensuring that they all carry the label app: rdfox-app so that they are matched by the selectors defined in the earlier part of the StatefulSet and ultimately in both Services. After that we begin, on line 5, the spec field for the template, which determines what each replica Pod will contain.

The containers field beginning on line 6 defines the main rdfox container using the official Docker image for RDFox v3.1.1, oxfordsemantic/rdfox:3.1.1. It exposes the default port for the image (12110) with name rdfox-endpoint, matching the definition in our headless Service resource. It also specifies a volume mount for the server directory on line 18 to the default server directory location of the image. Since we need each replica to have a different logical volume mounted in this role, the name used here refers not to one of the existing volumes, declared in lines 45–54 of this section, but to a PersistentVolumeClaim declared in the volumeClaimTemplates field in the last section of this resource's definition below.
The initContainers field beginning on line 20 declares an initialisation step for each Pod that belongs to the StatefulSet. This container, named init-server-directory, must complete successfully before the StatefulSet controller will attempt to start the main rdfox container within each Pod. It specifies oxfordsemantic/rdfox-init:3.1.1, the companion for oxfordsemantic/rdfox:3.1.1, as its image. The companion image is provided to make it easy to prepare the server directory before mounting it to containers using the main image. This includes changing the ownership of the directory to the default user for the image and initialising the directory using RDFox. Some data store containers include scripts within their main image to make their initialisation step invisible to users. Although this is slightly more convenient, it means that the containers must be started as root and then retain superuser capabilities throughout their lifetime, even though they are only used as the container starts up. For RDFox, we recommend running only the companion image as root with CAP_SETGID capabilities and then running the main image as its default non-root user.
In order to be able to prepare the server directory for the main container, the init-server-directory container mounts the Pod's server directory in exactly the same way as the rdfox container. Another feature of the companion image is to look for a file at container path /data/initialize.rdfox and, if present, pass it to the contained RDFox process, which will then attempt to execute it in the RDFox shell. To take advantage of this, our initialisation container mounts the pre-populated EFS file system mentioned earlier, which contains such a script, to the default shell root container path /data. The initialize.rdfox script in this EFS file system is as follows:
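A sketch of such a script follows; the data and rule file names, the data store type, and the exact access-control commands are illustrative assumptions rather than the original script:

```
# Create the 'family' data store and make it active
dstore create family
active family

# Load the Getting Started example data and rules from the mounted volume
import data.ttl
import rules.dlog

# Create the special guest role and grant it server-wide read access
# (assumed syntax for the access-control commands)
role create guest
grant read ">" guest
```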
This creates a data store called family and populates it with the example data and rules from the Getting Started guide for RDFox, which are also loaded inside the mounted volume. It also creates the special guest role and allows it to read all of the server's resources. This will allow us to make calls to the REST service anonymously. All of this is persisted to the server directory which, when mounted to the main rdfox container, then has everything needed for RDFox to load the data store and access control policies at startup.
The final thing to discuss from the above block of YAML is the approach to mounting the license, which is done in different ways for the rdfox and init-server-directory containers. Our official recommendation for mounting the license is to bind-mount it to /opt/RDFox/RDFox.lic so that it will be found by the executable in the same directory. This works well when launching containers using docker run, but Kubernetes does not allow mounting of single files to existing directories, so trying this approach leads to a situation where the image's entrypoint executable is hidden by the mount and the container can't start. To work around this, our definition mounts the license volume (defined on lines 46–51) to the rdfox container at path /license and then overrides the default CMD for the image to explicitly set the license-file server parameter to /license/RDFox.lic. In future, RDFox will accept the license via the environment variable RDFOX_LICENSE_CONTENT, avoiding the need to override the default command in most circumstances. The companion image used in the init-server-directory container already accepts this variable, and lines 34–38 of stateful-set-pt-2.yml instead map the rdfox-license secret into the container via this environment variable.
The last part of our StatefulSet definition is the volumeClaimTemplates field discussed earlier. It looks like this:
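A sketch of the claim template, reconstructed from the description below (the requested volume size is an assumption):

```yaml
volumeClaimTemplates:
  - metadata:
      name: server-directory
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: gp2
      resources:
        requests:
          storage: 10Gi
```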
Here we find our first piece of AWS-specific configuration in the use of the gp2 StorageClass. The gp2 resource, which is installed by default onto the clusters built by the EKS Quick Start template, relates to the Elastic Block Store. Using it in the template for our server-directory PersistentVolumeClaim tells Kubernetes to create a new EBS volume in the same availability zone as the node that is running the Pod to fulfil this role. To port the example configuration to another cloud provider, set up the most suitable equivalent StorageClass for that provider on your cluster and use it in place of gp2 in this template. The official Kubernetes documentation for the StorageClass resource type contains details of many alternatives.
The complete definition of our StatefulSet resource is visible here.
The Load-Balanced Service and Ingress
Our load-balancing Service definition will be responsible for distributing requests to the replicas. In essence, this is our high-availability service. It is a pretty vanilla Kubernetes Service. As with our headless Service, it routes traffic to the port named
rdfox-endpoint on pods labelled
app: rdfox-app. Unlike our headless service, though, we set its type to
NodePort. The definition is:
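A sketch of the definition (the Service name rdfox-service is an assumption):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: rdfox-service
spec:
  type: NodePort
  ports:
    - port: 80
      targetPort: rdfox-endpoint
  selector:
    app: rdfox-app
```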
We now have a service that could be used by other containers within the cluster, which is sufficient for many use cases. For the purposes of demonstration, though, we also define the following Ingress resource to allow us to reach the service from the outside world at the imaginary domain rdfox-kubernetes.example.org:
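A sketch of such an Ingress follows. The annotation set is illustrative of what the alb-ingress-controller expects, the certificate ARN is a placeholder, and the backend Service name is assumed to match the load-balanced Service:

```yaml
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: rdfox-ingress
  annotations:
    kubernetes.io/ingress.class: alb
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/certificate-arn: <certificate-arn>
spec:
  rules:
    - host: rdfox-kubernetes.example.org
      http:
        paths:
          - backend:
              serviceName: rdfox-service
              servicePort: 80
```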
This resource is another place where AWS-specialisation is seen, specifically in the annotations on lines 5–8. These configure the behaviour of the alb-ingress-controller, a component which makes our desired Ingress definition a reality using an Application Load Balancer on AWS. Deleting these lines would still leave us with a valid Ingress resource; other providers may, however, need equivalent custom annotations. For the above declaration to work correctly on AWS, we would need to have a TLS certificate for the stated domain in AWS Certificate Manager.
We now have four files defining our main resources which, for convenience, we gather in a directory called RDFoxKubernetes on a host where we have kubectl configured to control our target cluster. Before we push our resource definitions to our cluster, we first need to create the secrets they depend on.
To create the rdfox-license secret, we add a valid, in-date RDFox license key to file RDFox.lic within our working directory and run:
kubectl create secret generic rdfox-license --from-file=./RDFox.lic
Likewise, to create the credentials for the first role, we add the desired role name to file rolename and the desired password to file password, both within our working directory, and then run:
kubectl create secret generic first-role-credentials \
  --from-file=./rolename \
  --from-file=./password
Finally we create our StatefulSet and accompanying resources with:
kubectl apply -f RDFoxKubernetes
The StatefulSet controller on our cluster will now set about bringing our cluster into the desired state we have declared in the manifests. For each replica, this will involve provisioning a fresh EBS volume to fulfil the
server-directory PersistentVolumeClaim declared in our StatefulSet’s template, running the initialisation container to populate the new volume and finally launching the main RDFox container. The replicas will be assigned integers from 0 to 2 and the controller will not attempt to create replicas with higher indices until all lower-indexed replicas are up and healthy.
We can follow the state of this process as follows:
$ kubectl get statefulsets
NAME READY AGE
rdfox-stateful-set 1/3 1m
When this shows that the rdfox-stateful-set has 3 out of 3 Pods ready, we can look up the name assigned to our Ingress with:
kubectl get ingress
The entry under the column headed ADDRESS for the rdfox-ingress resource is the name of a public-facing load balancer created specifically for the Ingress. We can set this as the value of a DNS A record for our imaginary rdfox-kubernetes.example.org domain and then, allowing some time for DNS records to update, call our service from any host with internet access. For a simple test, let's curl the API that lists the server's data stores to check that our family data store is present as expected:
$ curl https://rdfox-kubernetes.example.org/datastores?Name
Once we’re done with our test deployment, we can clean up the resources with:
kubectl delete -f RDFoxKubernetes
This will delete all the resources we explicitly declared, but not the PersistentVolumeClaims that were created from the template in our StatefulSet. According to the official Kubernetes documentation, this choice was made
…to ensure data safety, which is generally more valuable than an automatic purge of all related StatefulSet resources.
These can be listed with kubectl get pvc and deleted with kubectl delete pvc <pvc-name>. We can also now delete the DNS record we added.
We’ve seen that RDFox can be deployed into a high-availability, read-only setup using Kubernetes. The newly-published official Docker images from Oxford Semantic Technologies help make this a convenient deployment option and we look forward to hearing from users about their experiences of running RDFox in this way.
Team and Resources
The team behind Oxford Semantic Technologies started working on RDFox in 2011 at the Computer Science Department of the University of Oxford with the conviction that flexible, high-performance reasoning was possible for data-intensive applications without jeopardising the correctness of the results. RDFox is the first market-ready knowledge graph designed from the ground up with reasoning in mind. Oxford Semantic Technologies is a spin-out of the University of Oxford and is backed by leading investors including Samsung Venture Investment Corporation (SVIC), Oxford Sciences Innovation (OSI) and Oxford University's investment arm (OUI). The author is proud to be a member of this team.