Automating FaunaDB backups

Jeremy Hall
9 min read · Oct 2, 2020

FaunaDB is an amazing database for serverless applications. It’s fully managed, offers globally distributed ACID transactions, and provides a native GraphQL layer, which simplifies many use cases for developers. In this article, I’m going to explain how I automated point-in-time backups of my Fauna database using Fauna’s FDM tool and Github Actions.

But first, I want to explain why I use FaunaDB, and why I am talking about needing backups:

The advantage of a managed and serverless database

Being a completely managed database, I don’t need to worry about running my own servers, performing server maintenance, load balancing, performance tuning, deploying bug fixes and updates, or many of the other time-consuming and scary tasks involved in operating non-managed databases. I don’t even need to worry about regions, since it is globally distributed and replicated, pulling data from the region nearest to the client requesting it.

Fauna has its own operations team that worries about the reliability of the service so I don’t have to, including behind-the-scenes service-wide backup infrastructure, offsite storage of backups, multi-region data replication, and multi-cloud deployment. This means you are unlikely to lose data due to an issue with FaunaDB.

In fact, FaunaDB has a very interesting “temporality” feature which allows you to time travel through your database’s structure and data. This means you can query your database as it was in the past, and potentially recover from issues using this feature. The length of time that document history is stored is user-configurable on a per-collection (FaunaDB’s equivalent of a table in normal DB-speak) basis, and defaults to 30 days.
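As a rough sketch of what this looks like in practice (run from Fauna’s dashboard shell or a driver; the retention value, timestamp, and document ID below are made up, and MonthlyPropertyRoom is simply one of my collections), you can adjust a collection’s history retention and then query the database at a past point in time:

// Keep 90 days of document history for this collection
Update(Collection("MonthlyPropertyRoom"), { history_days: 90 })

// Read a document as it existed at a moment in the past
At(
  Time("2020-09-01T00:00:00Z"),
  Get(Ref(Collection("MonthlyPropertyRoom"), "1234567890"))
)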

Why do I need to backup FaunaDB at all if it has so much redundancy?

While I’m not too worried about losing data due to an issue on Fauna’s end, given their layers of redundancy, I am worried that an update I push could delete or corrupt data. By storing snapshots of my database before updating, I know I can revert to a known-safe point in time.

I could choose to set FaunaDB’s temporality feature to store my data indefinitely and use it to revert to a past state if something were to happen. However, I may choose to use that feature for something other than backups in the future, in which case having incorrect or corrupted historical data would cause me headaches. Also, the events in Fauna’s temporality feature are mutable. This is an amazingly powerful feature which potentially allows rewriting of corrupted data and errors. However, I prefer my backups to be immutably stored point-in-time snapshots, both for their simplicity, and in case things go really wrong.

What are Github Actions?

Github Actions (documentation link here) are scripts that live in your Github repository and are triggered by events on Github such as pull requests, repository pushes, scheduled timers, and so on. I use Github Actions to automatically build and deploy my frontend application and to update my FaunaDB schema whenever changes are pushed.

In their simplest form, Github Actions are scripts that live in a specific part of your project directory (.github/workflows). When you have a .yml file in that directory, it will automatically be recognized by Github as a workflow file. You can have multiple of these, and each one can contain multiple jobs, which can contain multiple steps.

Workflows, Jobs, and Steps are defined in YAML. In the end, you are organizing things to run steps, which can be bash commands, scripts, or third-party actions shared by Github or the Github community. These steps run in the context of an environment that you define at the beginning of a job using a runs-on declaration; this context is essentially a virtual machine (or container) that your commands run in. Github provides hosted environments (including Ubuntu, macOS, and Windows Server images; click the “included software” links on that page to explore what is available in each environment) which already have a ton of software and tools installed for you out of the box.
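As a minimal, purely illustrative sketch of the pieces just described (the workflow name, trigger, and commands here are placeholders):

# .github/workflows/example.yml
name: Example workflow
on: [push]                        # trigger: run on every push to the repository
jobs:
  example-job:                    # a workflow can contain multiple jobs
    runs-on: ubuntu-latest        # the environment the job's steps run in
    steps:
      - uses: actions/checkout@v2 # a step using a shared action: check out the repo
      - name: Say hello
        run: echo "Hello from a workflow step"   # a step running a plain bash command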

The Fauna Data Manager

The FaunaDB Data Manager (FDM) is a utility provided directly by the FaunaDB team that has both a GUI and a CLI for reading and writing batch data to and from your FaunaDB databases.

The FDM is a Java-based application which makes it easy to download a snapshot of a database. I need to mention that the FDM is currently in ‘Preview’, and isn’t recommended for use on your production databases.

The backup process

The FDM provides a command line interface (CLI), so I can run it as a command in a bash shell to download my database into an export folder:

./fdm -source key=$FAUNA_SERVER_KEY -dest path=../fauna_export

The above command uses a Fauna server key as the source (the key actually embeds the information about which database it refers to, in addition to authentication information), and a path to save the backup data to. There are several options available which you can use to tailor this process to your use case, including:

Sources (docs link)

  • The local file system (with collections and documents stored in JSON or CSV format)
  • A Fauna database using a key (as shown above)
  • An AWS S3 bucket containing JSON or CSV data
  • A JDBC URL to allow pulling of data directly from a different database

Destinations (docs link)

  • The local file system (outputs collections and their rows in JSON format)
  • A Fauna database

Additionally, the dryrun option can be used to test if a source can be read without writing any data.

Finally, schema and document policies can specify what to do if existing data is being written to. This means you can choose to prioritize existing data, replace it with the source data, or do nothing.

The data format of the FDM as explained by Fauna:

When the FaunaDB Data Manager creates a backup of a FaunaDB database to a filesystem, it creates one file per collection in the source database named after the source database collection.

Each exported file contains one JSON document per line, representing all of the documents that exist in the source database.

Basically, we end up with a bunch of JSON files in the export directory which can then be backed up:

This is an example export of a Fauna database using the FDM. Each of these files represents a collection from the Fauna database (with the exception of the fauna_schema file).
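For example, the export directory might look roughly like this (the collection names other than MonthlyPropertyRoom are made up):

fauna_export/
├── MonthlyPropertyRoom.json
├── Property.json
├── User.json
└── fauna_schema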

And the contents of a JSON file, in this case the MonthlyPropertyRoom.json file:

Each line represents a document stored in the database’s MonthlyPropertyRoom collection.
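Purely as an illustration (the field names here are invented, and the exact envelope the FDM writes around each document may differ), a line looks something like:

{"month": "2020-10", "propertyId": "abc123", "available": true}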

Automating the backup process

Now that we know how to backup our database with the FDM, we can write a workflow file to automate this on a schedule. The basic steps are as follows:

  1. Some basic housekeeping, including setting the trigger for our workflow, checking out the Github project files, setting up Java in the environment, and setting up Google Cloud Platform so that we can save the backups to storage buckets.
  2. Optional: download and unzip the FDM application (I’ve chosen to save it to my project instead of downloading it each time)
  3. Perform the backup
  4. Zip the folder and its files up
  5. Save the zipped backup payload to Google Cloud Storage

I’ve chosen to save the backups to Google Cloud Storage, but you can store them almost anywhere accessible on the internet.

Here is the workflow file to perform the scheduled backup:
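A sketch of such a workflow follows. Treat it as a starting point rather than a drop-in file: it assumes the FDM has been committed to the repository in an fdm/ folder, the schedule, bucket name, and GCP secret names are placeholders, and you should confirm action versions and inputs against their own documentation.

# .github/workflows/backup.yml (sketch; adjust names, paths, and versions to your setup)
name: Scheduled FaunaDB backup
on:
  schedule:
    - cron: '0 3 * * *'                 # run daily at 03:00 UTC (choose your own schedule)
jobs:
  backup:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2       # check out the repo, including the committed FDM
      - uses: actions/setup-java@v1     # the FDM is a Java application
        with:
          java-version: '11'
      - uses: GoogleCloudPlatform/github-actions/setup-gcloud@master
        with:
          service_account_key: ${{ secrets.GCP_SA_KEY }}   # base64-encoded service account key
          project_id: ${{ secrets.GCP_PROJECT_ID }}
      - name: Run the FDM backup
        run: |
          cd fdm
          chmod +x ./fdm
          ./fdm -source key=${{ secrets.FAUNA_ADMIN_KEY }} -dest path=../fauna_export
      - name: Zip the export folder
        run: zip -r fauna_export.zip fauna_export
      - name: Upload the zipped backup to Google Cloud Storage
        run: gsutil cp fauna_export.zip gs://MY_BACKUP_BUCKET/fauna-backup-$(date +%F).zip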

Repository secrets that you can access from a workflow file, like secrets.FAUNA_ADMIN_KEY, are created from the Settings tab in your repository (you need admin-level permissions).

Documentation for the GoogleCloudPlatform/github-actions/setup-gcloud action can be found here. One of the parameters is the service_account_key, which is a base64-encoded key for a service account that has permission to write to your chosen cloud storage bucket.
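On Linux, for instance, you can produce that base64 string from the downloaded service account JSON key file with something like this (the filename is a placeholder):

base64 -w 0 my-service-account.json
# -w 0 disables line wrapping so the key is emitted as a single line (GNU coreutils)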

The restore process

Here is an example of restoring from FDM-written JSON files on the filesystem:

./fdm -source path=../fauna_export -dest key=<admin_key>

Currently, the FDM can only restore collections and their associated documents when using the file system as a source. This isn’t a problem if your target database already has a schema set up and just needs to import backed-up data. If you are using a new database, you will need to import your GraphQL schema and set up any additional roles, functions, and indexes (a sketch of the schema import is shown after this list). You can accomplish this:

  • With a script using Fauna’s fauna-shell tool or one of their driver libraries
  • Manually by using Fauna’s provided web-dashboard shell
  • Manually using the NEW INDEX, NEW FUNCTION, and NEW ROLE web-dashboard UI tools
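For the GraphQL schema specifically, a minimal sketch of the import, assuming your schema lives in a schema.graphql file and using Fauna’s GraphQL import endpoint (the key variable name is a placeholder), looks like:

curl -X POST https://graphql.fauna.com/import \
  -H "Authorization: Bearer $FAUNA_ADMIN_KEY" \
  --data-binary "@schema.graphql"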

FDM tool caveats

Because the FDM is in ‘preview’ and is a relatively new tool, there are some limitations in its use. I think these five are what most users should be aware of:

  • Document history is not processed. Only the most recent version of each document is exported or copied.

  • Document credentials will be lost.

  • Child databases are not processed. To process a child database, run the FaunaDB Data Manager with an admin key for that child database.

  • When exporting a FaunaDB database to the local filesystem, only collections and their associated documents are exported. A copy of the schema documents describing collections, indexes, functions, and roles is copied to the file fauna_schema. Currently, that schema file cannot be used during import.

  • GraphQL schema metadata is not fully processed. This means that if you import an exported database, or copy one FaunaDB database to another, you need to import an appropriate GraphQL schema into the target database in order to run GraphQL queries.

Now I’d like to break down why these aren’t issues for my own use-case:

  • Document history is not processed

FaunaDB has a robust event-history feature built in. You can basically time-travel in your database for as long as you have set up history to be saved (default of 30 days). I don’t really use this feature, so it’s not an issue to lose this information during the backup process.

  • Document credentials will be lost

Fauna’s security system, called attribute-based access control (ABAC), relies on associating credentials in the form of passwords with documents. This credential metadata is not stored in the backups, so after restoring a database we may need to do something like emailing users to create new passwords to log in.

I use a third-party authentication service (Firebase Authentication) and non-user generated passwords, which means I could regenerate the credentials if needed.

  • Child databases are not processed

I store all my data in a single database, and don’t make use of the child-database functionality.

  • Indexes, Functions, and Roles can’t be restored from backup (although they are written to the fauna_schema file when backing up).
  • GraphQL schema metadata is not fully processed

I have automated my FaunaDB schema setup, including pushing the graphql-schema file to Fauna when deploying.

Please be aware that your own situation might be different from mine, so you need to evaluate if the current state of the FDM would allow you to use it or not.

Billing implications

The FDM tool is a wrapper for accessing the database, the same as if you were accessing it through a self-written script. This means that you can incur a lot of reads when downloading every document in your database. These reads count against your quota the same as any others, and can quickly spike your usage above the free tier. I have spoken with Fauna representatives who assure me that they are aware of how undesirable this situation is, and that they are working on a solution.

Conclusion

I’ve shown that you can in fact back up your FaunaDB data in a few easy steps. A lot of the complexity in setting up this process is related to storing the backups in Google Cloud Storage; you should be able to adapt this script to your own needs and save to another provider such as Amazon’s S3 without much difficulty. While the FDM is a very recently released tool, it still covers the basics of backing up and restoring your database.

Once the Fauna team adds the ability to completely back up and restore your database with FDM (including document history, the database schema, and GraphQL schema/metadata), it will become even easier to recover from mistakes made in the development and deploy process. In the meantime, however, it’s easy enough to work around those issues.

If you have questions about Fauna that you’d like answered by the community or team members, you can visit the forums at https://forums.fauna.com/ or join the Slack channel by getting an invite at https://community.fauna.com/.


Jeremy Hall

A full-stack developer who enjoys working with the latest serverless technologies and web frameworks to enable small teams to develop huge systems.