How to Restore Neo4j Backups on Kubernetes and GKE

This article is the companion to How to Backup Neo4j Running in Kubernetes, so if you’re looking for resources on creating backup sets in the first place, that’s the place to start.

Maintaining a Neo4j cluster in Kubernetes over time usually means keeping a backup schedule in case things go wrong. So suppose something has now gone wrong: how will you recover? This article steps you through exactly that.

If you haven’t tried Neo4j before, you can launch it directly on GKE from the Kubernetes marketplace. The code behind how things work on GKE, along with the examples we provide, can be found here and adapted to your needs.

Approach

To restore a backup in Neo4j, we use the standard neo4j-admin restore tool; in a Kubernetes environment we’re simply running it inside of a container. Backup sets are typically stored either as a .tar.gz file or as a raw directory. When taking backups, the neo4j-admin tool writes the backup as a directory of files, but people will often compress those sets and upload them to cloud storage.
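Under the hood, the restore step is the same command you would run by hand. As a minimal sketch (the backup path and database name below are placeholders):

# Restore a backup directory into the default database (Neo4j 3.4 syntax).
# --force overwrites any existing database, so use it deliberately.
neo4j-admin restore --from=/backups/my-backupset --database=graph.db --force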

Using an initContainer to restore data to Neo4j

We will use a specialized Docker container with the neo4j-admin tool as an initContainer on our Neo4j pods in Kubernetes. The init container’s tasks are simple:

  • Mount the data drive where Neo4j expects to find its data (/data)
  • Download the backup set from Google Cloud Storage, uncompressing it if needed
  • Restore the graph database to /data using neo4j-admin

In this way, when the Neo4j docker container starts, it finds its graph database right where it expects it, and the Neo4j containers themselves remain unmodified.
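Sketched as a shell script, the init container’s job boils down to something like the following; this is illustrative of the flow, not the exact contents of the real container, and the paths and database name are assumptions:

#!/bin/bash
# Authenticate to Google Cloud using the mounted service key.
gcloud auth activate-service-account --key-file "$GOOGLE_APPLICATION_CREDENTIALS"

# Fetch the backup set and uncompress it if it is a tarball.
gsutil cp "$REMOTE_BACKUPSET" /tmp/backupset.tar.gz
tar -zxf /tmp/backupset.tar.gz -C /tmp

# Restore into /data, where the Neo4j container expects to find its store.
neo4j-admin restore --from="/tmp/$BACKUP_SET_DIR" --database=graph.db --force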

The approach I implemented assumes the backup sets are stored on Google Cloud Storage, but with some simple modifications the script can work with any cloud storage provider.

The Restore Container

The restore container expects a few environment variables as parameters:

  • GOOGLE_APPLICATION_CREDENTIALS — the path to a file on disk containing the JSON service key that permits access to the backup set.
  • REMOTE_BACKUPSET — the URL of the set we want to restore, for example gs://my-bucket/my-backup.tar.gz

That’s all that’s required. There are some optional parameters you can read more about in the restore container’s README file. They allow you to control behavior like where to find the backup in a compressed set, and whether or not to force-overwrite an existing database if one is found.
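If the container supports standalone use, you could exercise it outside of Kubernetes by passing the same variables to plain Docker; the bucket, local paths, and the idea of running it this way are assumptions here:

docker run --rm \
    -e GOOGLE_APPLICATION_CREDENTIALS=/auth/credentials.json \
    -e REMOTE_BACKUPSET=gs://my-bucket/my-backup.tar.gz \
    -v $HOME/.google/my-service-key.json:/auth/credentials.json \
    -v /my/local/data:/data \
    gcr.io/neo4j-k8s-marketplace-public/causal-cluster/restore:3.4

Any of the optional parameters from the README can be passed the same way with additional -e flags.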

Configuring the Restore Container

To make that initContainer function, we need two pieces: the initContainer spec itself, and one extra shared volume. Let’s take them individually:

Init Container

  - name: restore-from-file
    image: gcr.io/neo4j-k8s-marketplace-public/causal-cluster/restore:3.4
    imagePullPolicy: Always
    volumeMounts:
    - name: datadir
      mountPath: /data
    - name: restore-service-key
      mountPath: /auth
    env:
    - name: REMOTE_BACKUPSET
      value: gs://my-cloud-bucket/my-backupset.tar.gz
    - name: BACKUP_SET_DIR
      value: my-backupset
    - name: GOOGLE_APPLICATION_CREDENTIALS
      value: /auth/credentials.json
    # CAUTION: Read documentation before proceeding with this flag.
    - name: FORCE_OVERWRITE
      value: "true"

If you’re running this in your own environment, you’ll want to build the restore container yourself and change the image reference here to the container registry of your choice. What this spec does is create the restore container, mount the Neo4j container’s /data volume into it (this is what allows it to manipulate the persistent volume used by Neo4j), and then specify some environment variables to configure the restore process.
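Building and pushing your own copy of the image is the usual Docker workflow; for example, assuming you have the restore container’s Dockerfile checked out locally (the project and tag names below are placeholders):

docker build -t gcr.io/my-project/restore:3.4 .
docker push gcr.io/my-project/restore:3.4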

Here we’ve specified a BACKUP_SET_DIR. This tells the restore tool how to interpret the contents of the backup set: when the set is uncompressed, the tool knows to look in that subdirectory for the database backup. We have customers who use a variety of different scripts and methods to produce backups, and the naming inside the archives isn’t always consistent, so it can be specified. If you copy an entire uncompressed backup set to cloud storage as a raw directory, then you can set REMOTE_BACKUPSET to that directory and simply skip BACKUP_SET_DIR.
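For instance, if the whole backup set lives in the bucket as an uncompressed directory, the environment section might look like this instead (the bucket and directory names are placeholders):

    env:
    - name: REMOTE_BACKUPSET
      value: gs://my-cloud-bucket/my-backupset
    - name: GOOGLE_APPLICATION_CREDENTIALS
      value: /auth/credentials.json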

Service Account Key

A very important bit here is the /auth mapping. This is how we’ll get our Google application credentials into the restore container. The credentials themselves are a generated Google Cloud service key, which we downloaded as JSON. When creating the service key, we of course have to give it permission to read from the bucket where the backup set is stored. We’ll create a Kubernetes secret containing the key like this:

MY_SERVICE_ACCOUNT_KEY=$HOME/.google/my-service-key.json
kubectl create secret generic restore-service-key \
    --from-file=credentials.json=$MY_SERVICE_ACCOUNT_KEY

This creates a Kubernetes secret called “restore-service-key”, which you’ll see mounted at /auth inside of the restore container. When we mount it, that volume will appear to contain a single file called credentials.json with the contents of our local key. The last piece needed is to add that secret as a volume to the core container, like this:

volumes:
- name: "restore-service-key"
  secret:
    secretName: "restore-service-key"

A variant of this can be found in the Neo4j on Google GKE marketplace templates on GitHub.

With this volume in place, when the initContainer mounts that secret at /auth, it can see the contents of the service key and thus has authenticated access to the backup sets.
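If you don’t yet have such a key, something along these lines would create one and grant it read access to the bucket; the service account, project, and bucket names are placeholders:

# Create a JSON key for an existing service account.
gcloud iam service-accounts keys create $HOME/.google/my-service-key.json \
    --iam-account=restore-sa@my-project.iam.gserviceaccount.com

# Allow that account to read objects from the backup bucket.
gsutil iam ch \
    serviceAccount:restore-sa@my-project.iam.gserviceaccount.com:objectViewer \
    gs://my-cloud-bucket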

Running the Restore Container

With the initContainer in place, your Neo4j cluster is now self-healing in several senses: if a pod crashes or dies, it will of course be restarted by Kubernetes, and the initContainer will restore it to the last backup that you’ve specified.

As part of a good maintenance routine, you can keep a “latest backup” in a known spot on Google Cloud Storage (for example) and always point the initContainer at that latest backup set. In the event of a crash, the node will be back up and running in no time, with the data that you expect.
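One simple way to maintain that convention is to copy each new backup over a fixed “latest” object as the last step of your backup job; the bucket and file names here are placeholders:

gsutil cp gs://my-cloud-bucket/backup-2018-11-01.tar.gz \
    gs://my-cloud-bucket/latest/my-backupset.tar.gz

The REMOTE_BACKUPSET variable in the initContainer then always points at gs://my-cloud-bucket/latest/my-backupset.tar.gz.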

Restore Considerations

Neo4j Causal Clusters

A common way to deploy Neo4j is to restore from the last backup whenever a container initializes. This works well for a cluster, because it minimizes how much catch-up is needed when a node is launched: any difference between the last backup and the rest of the cluster is provided via catch-up.

Single Node Installs of Neo4j

For single nodes, take care here. If a node crashes and you automatically restore from backup, force-overwriting what was previously on the disk, you will lose any data that the database captured between the last backup and the crash. As a result, for single-node instances of Neo4j you should either perform restores manually when you need them, or keep a very regular backup schedule to minimize this data loss. If data loss is not acceptable under any circumstances, simply do not automate restores for single-node deploys.

Auth Files

As of the Neo4j 3.4 series, data backups do not include authorization information for your cluster. That is, the usernames and passwords associated with the graph are not included in the backup, and hence are not restored when you restore.

This is something to be aware of, although when launching a cluster you’re typically providing startup auth information and separate configuration anyway. If you create users, groups, and roles, you may want to separately take copies of the auth files so that they can be restored when your cluster starts up.
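In Neo4j 3.4 the auth files live under the data directory, so one approach is simply to copy them out of a running pod and stash them alongside your backups; the pod name below is a placeholder:

# Copy the auth file out of a running core pod for safe keeping.
kubectl cp my-neo4j-core-0:/data/dbms/auth ./auth
# Optionally keep roles as well, if you use them.
kubectl cp my-neo4j-core-0:/data/dbms/roles ./roles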

Alternatively, users may configure their systems to use an LDAP provider, in which case there is no need to back up any auth information.