Consistent AEM backups

Wim Symons
VRT Digital Products
8 min read · Jan 10, 2019

Here at VRT, we run several high-traffic websites like https://vrtnws.be, https://sporza.be and https://vrtnu.be. Under the hood, these websites are all running on Adobe Experience Manager (AEM).

One of our major architectural requirements is recoverability. To reach that goal, we need to be able to restore a backup in a reliable way.

As AEM keeps its data in separate stores, it's hard to figure out whether a backup is actually restorable: the data in those stores needs to be consistent with each other.

Such restores failed quite a number of times in the past, so we automated the consistency check and made it part of our backup procedure.

Because our websites are mainly news related, and news never stops, our AEM author instance is running nearly 24x7. That means our backups have to be created while AEM is up and running as well. We do that using AEM’s online backup.

Storage configuration

But before we dig into the details, a bit of background on our underlying storage configuration. AEM supports many possible storage configurations, but we follow Adobe's best practice: the TarMK segment store combined with a shared, Amazon S3 backed data store. Each instance keeps its node and content data in its own local TarMK segment store, while binaries live in the data store, an S3 bucket shared by all instances.

As we are running AEM on Linux servers on AWS, our backups are easily scheduled through cron, where a Bash script takes care of the necessary steps.
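The exact schedule and script location differ per environment, but the crontab entry itself is a one-liner. A sketch, with a hypothetical script path:

# hypothetical crontab entry: run the backup script every night at 02:30
# and append its output to a log file for troubleshooting
30 2 * * * /data/scripts/aem-online-backup.sh >> /var/log/aem-online-backup.log 2>&1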

The backup runs at night, when there is little or no repository activity. If you try to create a backup during heavy write activity, one of the last stages of the backup may fail because Jackrabbit Oak (the storage engine of AEM) can't block writes long enough.

Backup order

We back up all our AEM publish and author instances. But here, too, you have to think about consistency. To make sure data is consistent between author and publish instances after a restore, you must create the backups in the correct order.

If you were to back them up at the same time, you risk ending up with content in your publish instance backup that is not in your author instance backup. This means you can't remove that content by unpublishing it, as it simply doesn't exist on the author instance.

If you back up the author instance before the publish instances, the same applies.

You must back up your publish instances before your author instance.

The content on your publish instances will be a bit older, but all content is available in your author instance, and you can republish the newer content from there.

Exposing JMX

AEM can start an online backup by means of a JMX call. Calling JMX directly is cumbersome, but there is a better option: we can expose JMX through an HTTP REST API by starting our AEM instance with the Jolokia JVM agent.
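Enabling the agent comes down to adding a -javaagent option to the JVM options in the AEM start script. A minimal sketch, assuming the agent jar lives in /data/tools (8778 is Jolokia's default port, and it is the port used in the curl calls below):

# in crx-quickstart/bin/start: attach the Jolokia JVM agent, listening on localhost only
CQ_JVM_OPTS="${CQ_JVM_OPTS} -javaagent:/data/tools/jolokia-jvm-agent.jar=port=8778,host=localhost"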

Now the backup script can easily call the correct JMX operation(s) through curl calls.

In the snippets below, we’ll be using some other tooling as well:

  • jq, a command-line JSON processor;
  • oak-run, an offline tool for handling Jackrabbit Oak repositories.

Summary

In short, we take an incremental online backup, verify the consistency of the segment store and the data store, compress and upload the segment store backup as a ZIP file to an Amazon S3 bucket, and finally rotate the backup files stored in that bucket.

I won’t be posting the complete script here, but rather highlight the important bits and pieces.

Speeeed!

To execute the online backup as fast as possible (although it will consume more system resources during the backup), we first set the backup delay to zero.

curl -s -S -f -m 60 "http://localhost:8778/jolokia/write/com.adobe.granite:type=Repository/BackupDelay/0/"

Start the backup

To start the online backup, you can execute this call (the sed expression is there because Jolokia requires slashes in path arguments to be escaped as !/):

escaped_backup_dir=$(echo "/data/backup/aem" | sed 's/\//\!\//g')
curl -s -S -f -m 60 "http://localhost:8778/jolokia/exec/com.adobe.granite:type=Repository/startBackup(java.lang.String)/$escaped_backup_dir"

This starts an online backup of our AEM instance and places an uncompressed copy of all files into /data/backup/aem.

We leave the whole /data/backup/aem folder on disk, even after the backup is completed, verified and uploaded to S3. This massively speeds up consecutive backups, as AEM is smart enough to only back up the difference. The extra storage cost is a small price to pay for that speed. We have about 45 GB of data to back up, but an incremental backup only takes about 5 minutes to complete.

Wait until it’s finished…

As starting the online backup is an asynchronous operation, our next step is to wait until it has completed. The backup process creates a backupInProgress.txt file inside the backup folder you specified, in our case /data/backup/aem/backupInProgress.txt. Once the backup is finished, this file is removed again. Waiting for that is easily done with two loops.

# wait max 60 seconds for backup to start
timeout=60
while [[ ! -f /data/backup/aem/backupInProgress.txt ]] && [[ $timeout -gt 0 ]]
do
  sleep 1
  timeout=$((timeout - 1))
done

# wait max 10 minutes for backup to finish
timeout=10
while [[ -f /data/backup/aem/backupInProgress.txt ]] && [[ $timeout -gt 0 ]]
do
  sleep 60
  timeout=$((timeout - 1))
done
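These loops only wait; they don't fail when the backup overruns. A small guard afterwards lets the script abort instead of verifying and uploading a half-finished backup:

# if the marker file is still present, the backup did not finish within 10 minutes
if [[ -f /data/backup/aem/backupInProgress.txt ]]
then
  echo "online backup did not finish in time, aborting" >&2
  exit 1
fi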

Was it OK?

Next, we check if the backup was successful with this call:

curl -s -S -f -m 60 http://127.0.0.1:8778/jolokia/read/com.adobe.granite:type=Repository | jq -e -r '.value.BackupWasSuccessful'
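Because jq runs with -e here, the exit status of the whole pipeline follows the value of BackupWasSuccessful (non-zero when it is false or missing), so the script can simply bail out on a failed backup; for example:

# abort when the Repository MBean reports the last backup as unsuccessful
if ! curl -s -S -f -m 60 http://127.0.0.1:8778/jolokia/read/com.adobe.granite:type=Repository \
  | jq -e -r '.value.BackupWasSuccessful' > /dev/null
then
  echo "online backup was not successful, aborting" >&2
  exit 1
fi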

If that command returns successfully, we can continue with the most important part of this process: verifying the consistency of the backup.

As our storage is split into a segment store and a data store, we need to verify both parts. Important note: we run these checks on the data in our backup folder, not on the files used by our running AEM instance. Using any of the offline tools on the files of a running instance would corrupt the repository, leaving us with a broken AEM instance.

For these checks, we use the oak-run tool. It is very important to use the same version of oak-run as the version of the oak-core bundle in your AEM instance. Experience has taught me that a later patch version works as well. As we are running AEM 6.3.3.1, which ships with oak-core 1.6.14, we use oak-run 1.6.14.

First, we do an offline consistency check of the segment store:

cd /data/backup/aem/crx-quickstart/repository/segmentstore
java -jar /data/tools/oak-run-1.6.14.jar check

Next, we do a consistency check of the data store against the references found in the segment store.

Here we can't use the oak-run runnable jar as such, because we need to add extra libraries to the Java classpath to use the S3 data store connector. The jars and the configuration of this connector are installed in the subfolders of the /data/apps/aem/crx-quickstart/install folder.

extra_classpath="$(find /data/apps/aem/crx-quickstart/install/ -name '*.jar' | tr '\n' ':')"
classpath="${extra_classpath}/data/tools/oak-run-1.6.14.jar"
java -classpath "$classpath" org.apache.jackrabbit.oak.run.Main datastorecheck --consistency --store /data/backup/aem/crx-quickstart/repository/segmentstore --repoHome /data/backup/aem/crx-quickstart/repository --s3ds /data/apps/aem/crx-quickstart/install/crx3/org.apache.jackrabbit.oak.plugins.blob.datastore.SharedS3DataStore.config > /tmp/oak_run_datastorecheck.log 2>&1

Unfortunately this command does not return a correct exit code (hopefully this will be fixed in a later version), so we have to check the command output using grep.

grep -F "Consistency check found 0 missing blobs" /tmp/oak_run_datastorecheck.log

Once we have checked that both stores are consistent, we finally have a verified, consistent backup.

UPDATE June 7, 2019:

As we were preparing our upgrade to AEM 6.4, we noticed that the consistency check of the data store failed. Were we missing data? Or was the reporting wrong? We contacted Daycare support, and after a lot of discussion and debugging, we found the culprit. AEM 6.4 and later have a second data store garbage collection mechanism that cleans up Lucene index blobs at a faster pace than the regular data store garbage collection. The check reports these blobs as missing because they are still referenced in the oldest generation of the segment store, but they have already been removed from the data store. After we disabled this functionality by adding -Doak.active.deletion.disabled=true to the JVM startup options, the inconsistencies were no longer reported.
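On a standard Linux installation, that flag goes into the same set of JVM options in the start script, for example:

# in crx-quickstart/bin/start: disable active deletion of Lucene index blobs
CQ_JVM_OPTS="${CQ_JVM_OPTS} -Doak.active.deletion.disabled=true"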

Finish up

The next step would be to upload the /data/backup/aem folder as a dated and compressed archive to Amazon S3.

Once that has completed successfully, the last thing to do is to list all compressed archives in our backup S3 bucket and remove the ones older than 14 days.
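I won't reproduce our exact script here either, but a sketch of both steps with the AWS CLI (bucket name and archive naming are made up for illustration) looks roughly like this:

# compress the backup folder into a dated ZIP archive and upload it to S3
backup_date=$(date +%Y-%m-%d)
archive="/data/backup/aem-backup-${backup_date}.zip"
(cd /data/backup && zip -q -r "$archive" aem)
aws s3 cp "$archive" "s3://example-aem-backups/aem-backup-${backup_date}.zip"
rm -f "$archive"

# remove archives older than 14 days from the backup bucket
cutoff=$(date -d '14 days ago' +%Y-%m-%d)
aws s3 ls "s3://example-aem-backups/" | awk '{print $4}' | while read -r key
do
  archive_date=${key#aem-backup-}
  archive_date=${archive_date%.zip}
  if [[ "$key" == aem-backup-*.zip ]] && [[ "$archive_date" < "$cutoff" ]]
  then
    aws s3 rm "s3://example-aem-backups/$key"
  fi
done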

Data store backup?

The attentive reader might already have spotted a major gap in our restore story: we have 14 days of full backups of the segment store, but we don't have any backup of our data store.

At VRT we store a lot of data in our AEM instance: not only text, but also audio snippets and images (for video we only store metadata). This means we have far more data in our data store than in our segment store: about 2 TB at the time of writing. As this is a vital part of our backup, we must be sure never to lose it. But due to the sheer volume of data, we can't back up the data store the same way we back up the segment store.

You might assume we simply rely on the 99.999999999% durability that the S3 service provides. But no, we went a little bit further.

When data store garbage collection runs on your AEM instance, it deletes all unreferenced objects in S3, rendering older segment store backups unusable.

To avoid that, our data store S3 bucket is versioned, so we can restore the data in the bucket to any point in time. S3 itself doesn't offer a built-in point-in-time restore, but the open-source tool s3-pit-restore fills that gap nicely.

In real life, we haven't actually had to use s3-pit-restore on our production data yet. We would never restore a backup older than one day, as we would lose too much content. And for a backup that recent, data store garbage collection wouldn't have removed the referenced blobs anyway, as it only deletes data older than 24 hours by default.

As we keep 14 days of segment store backups, we also need to keep 14 days of S3 version history. That is easily implemented with an S3 lifecycle rule that expires non-current object versions after 14 days.
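A sketch of such a rule, applied with the AWS CLI (the bucket name is a placeholder):

# expire non-current object versions in the data store bucket after 14 days
aws s3api put-bucket-lifecycle-configuration \
  --bucket example-aem-datastore \
  --lifecycle-configuration '{
    "Rules": [
      {
        "ID": "expire-noncurrent-versions",
        "Filter": { "Prefix": "" },
        "Status": "Enabled",
        "NoncurrentVersionExpiration": { "NoncurrentDays": 14 }
      }
    ]
  }'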

And because our S3 buckets are versioned, we can also use AWS S3 cross-region replication. For both the segment store backup bucket and the data store bucket, we have set up a replica in another AWS region. This mitigates the risk of losing all our backup data if an entire AWS region were to fail.

This was my first post on Medium. I hope you enjoyed it. If you have questions, or would like to discuss further, leave a comment or connect via Twitter @wimsymons.
