As more workloads are being run in Docker containers, using orchestration environments like Kubernetes has spiked in popularity.
UPDATE: June 2020; added references to the helm chart, and corrected some details.
Databases in Container Orchestrators
Orchestration environments though introduce some special challenges for databases, and Neo4j in particular. This article aims to cover some of those and the extra considerations you might want to keep in mind. Most of the examples here will talk about Kubernetes specifics, but architecturally, the considerations are the same whether it’s Kubernetes or any other orchestration manager.
We’ll take a look at the architecture of a Neo4j deployment, both from “inside of Kubernetes” and “outside” perspectives, and talk about how that impacts querying.
Questions? Comments? Come discuss on the Neo4j Community Site on this thread.
Looking at Neo4j inside of Kubernetes
The diagram below shows how the internal deploy of a typical Neo4j cluster in Kubernetes works. Each pod is placed in a StatefulSet depending on whether it is a core or read replica node. Each pod has a persistent volume claim which maps to an underlying disk which stores the data.
Looking at Neo4j from outside of Kubernetes
Within Kubernetes, everything is straightforward; the pods can be connected to directly via internal DNS. The “app2” in the box below has no trouble dealing with the cluster because the auto-assigned DNS routes for it.
The outside app1 though has a different set of issues. The Kubernetes cluster manager has some public ingress IP address, and we need a way of routing traffic into the right cluster node, and this is where things become a bit tricky.
This section assumes some familiarity with how bolt+routing works in Neo4j. If you need a brush up on any of the ideas of how Neo4j query routing works, please have a look at Querying Neo4j Clusters. In that article, we described how the “public advertised address” of a node interacts with the Routing Table shown in the diagram above.
In Kubernetes setups, there is a specific problem: internal DNS in the table above cannot be routed by
app1 outside of the cluster. So even if there is an IP ingress, attempts to use
bolt+routing Neo4j drivers will fail. App1 will get a routing table with un-routable information, and will fail to connect.
The external exposure instructions posted in the Neo4j Helm repository chart a way of solving this.
- Request / allocate static IP addresses for each member
- Send custom configuration to the pods to tell them to advertise themselves according to that static IP address
- Configure a load balancer for each pod that exposes that pod by the static IP address
Once we can get this far, we can also register DNS and associate the static IP addresses with that DNS, and do things like HTTPS/SSL if we need.
Now that we’ve covered Neo4j networking and bolt+routing specifically, let’s get into other things that make databases different and what to look out for.
Orchestration Managers and Databases
As orchestration managers were born and grew up, they typically handled workloads that had a lot of stateless microservices. Developers would create small containers that held a single service. Typically that service did not store any data (if it did, it talked to a database deployed outside). These services, because they were stateless, could be deployed in fleets, and willy nilly killed and restarted. Typical deployments might have many replica containers, fronted by a load balancer. Clients would talk to the load balancer, and end up getting forwarded to any of the microservice instances, it didn’t matter which.
Orchestration manager features play directly to this sweet spot: Kubernetes with the concepts of
ReplicaSets(a set of replicas of a given container) and
LoadBalancers(allowing access to any replica through a single network ingress).
Databases are Different
Databases (including Neo4j) are different in three key ways here!
They’re Stateful: They must store data, that’s their whole reason for being. So in orchestration managers we need to think about concepts like disks, persistent volume claims, and so on.
Their Pods are Not Interchangeable — They either host different services, or have different capabilities. In the Neo4j architecture, for example, there are leaders and followers, where leaders need to process the writes.
If you need to talk to a particular cluster member though, this limits the usefulness of abstractions like load balancers, because they don’t know or care about any differences. In several databases, this necessitates the whole idea of a “routing driver” that handles this logic at the application or driver level. Which in turn complicates the orchestration networking configuration.
They scale differently — with micro services, suppose I have a function
hello which returns “hello world”. I can scale this service up and down very quickly. Launching the container takes seconds, and there’s no stored data to synchronize. With databases, when you “scale up” you may need to copy most or all of the database contents to the new node before it can fully participate in the cluster. Doing that “scale up” may also place extra load on the other cluster members while they replicate data and check in with their new peer. When you scale down, your cluster finds that a member has gone missing, and it will recover but each of the other instances needs to update its member list.
None of that is happening with your typical micro service. As a result, up-front capacity planning for databases is more important, as it isn’t going to be a good idea to rapidly add/remove 10 instances of a 5TB database.
Noisy Neighbors, Co-Residency, and HA
In an orchestrator it can also be important to think about Affinity and Anti-Affinity rules to spread cluster members out across physical hardware. With microservices with many copies this may matter less — but if you’re running a 3-node clustered database for high availability, it is very important to ensure that not all 3 pods are running on the same physical server. If they are, and the physical server dies, then the database is completely down despite your attempts to guarantee HA!
Spreading things out (and avoiding co-residency) also helps to insulate you from so-called “Noisy Neighbors” or other workloads on the same machine that are soaking up shared resources.
Disks: Thin vs. Thick Provisioning
In virtualized environments, whether in the cloud or on-prem, when you create a persistent volume claim you take for granted that you get a disk and it “just works”. But how does that actually work?
Under the covers, your virtual disk is usually allocated from some kind of storage solution, or SAN. Many of those tend to use thin provisioning by default.
Over-allocation or over-subscription is a mechanism that allows a server to view more storage capacity than has been physically reserved on the storage array itself. This allows flexibility in growth of storage volumes, without having to predict accurately how much a volume will grow. Instead, block growth becomes sequential. Physical storage capacity on the array is only dedicated when data is actually written by the application, not when the storage volume is initially allocated. The servers, and by extension the applications that reside on them, view a full size volume from the storage but the storage itself only allocates the blocks of data when they are written.
This is great for a wide range of applications but very bad for databases. It’s possible in some storage environments for your database to try to write data and to fail, because the underlying storage isn’t available. This can occur when the storage layer becomes over-subscribed and can be a catastrophic problem for the database, because its core underlying assumption is that it has a disk that it can use — which turns out not to be true. If you have a clustered Neo4j setup and this is only happening to one node, you probably won’t lose data (because the HA capabilities of the database are keeping the data safe on the other two nodes) but you’re going to see lots of strange errors which make no sense, and you’ll have operational issues.
The fix is to ensure that when you allocate disk, you’re doing it in a “thick” provisioning way, meaning that when the Persistent Volume is created, you’re guaranteed to have an exclusive hold on it. How to do this will differ between cloud environments and storage solutions; and so depending on your kubernetes provider or method of hosting, this may or may not apply to you, but it’s something to keep a careful eye on.