Elasticsearch Backup: Snapshot and Restore on AWS S3

Federico Panini
Oct 18, 2016 · 9 min read

An updated version of this tutorial is available here; it covers the same topic for the latest Elasticsearch versions.


Elasticsearch Snapshot & Restore

Fair warning before you start reading this tutorial/article: it is very long. The idea is to explain how to do a snapshot and restore for Elasticsearch on AWS S3. The article is divided into these sections:

  1. ES Repositories
  2. ES cloud-aws plugin
  3. AWS S3 IAM Role
  4. AWS S3 User Policy
  5. ES setup backup Repository with AWS IAM ROLE
  6. ES setup backup Repository with AWS User
  7. ES create a snapshot
  8. ES restore a snapshot
  9. ES restore a snapshot on different cluster

have a nice read!

Elasticsearch has a smart solution for backing up single indices or entire clusters to a remote shared filesystem, AWS S3, or HDFS. The snapshots Elasticsearch creates are not very resource-consuming and are relatively small.

The idea behind these snapshots is that they are not "archives" in a strict sense: a snapshot can only be read by a version of Elasticsearch that is capable of reading the index version stored inside it.

So you can follow this quick scheme when you want to restore ES snapshots:

  • A snapshot of an index created in 2.x can be restored to 5.x.
  • A snapshot of an index created in 1.x can be restored to 2.x.
  • A snapshot of an index created in 1.x can not be restored to 5.x.

So pay close attention when you create a snapshot from 1.x: you cannot restore it directly into a 5.x cluster. You should first import it into a 2.x cluster, and then you can use reindex-from-remote, available in the new 5.x release (the link could change due to a new release of ES not yet out!).
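The compatibility rules above can be sketched in a few lines. This is an illustrative helper, not an Elasticsearch API: since ES skipped majors 3 and 4, compatibility is modeled over the actual release order, and a snapshot can be restored on the same major or the next one in sequence.

```python
# Major versions of Elasticsearch in release order (3.x and 4.x were skipped).
ES_MAJORS = [1, 2, 5]

def can_restore(snapshot_major: int, cluster_major: int) -> bool:
    """A snapshot is restorable on the same major or the next one in sequence."""
    s = ES_MAJORS.index(snapshot_major)
    c = ES_MAJORS.index(cluster_major)
    return c - s in (0, 1)

# The three cases from the list above:
assert can_restore(2, 5)       # 2.x -> 5.x: OK
assert can_restore(1, 2)       # 1.x -> 2.x: OK
assert not can_restore(1, 5)   # 1.x -> 5.x: not restorable directly
```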

To start backing up indices you must know the vocabulary behind it:

  • repository: a logical container in which the real backup data (the snapshots) is stored. A repository can contain multiple snapshots.
  • snapshot: the backup of the data itself.

Elasticsearch Repositories

Every backup in Elasticsearch is stored inside a so-called "snapshot repository", a container that defines the filesystem (or virtual filesystem) where the snapshots will be stored. When you create a repository you have several options available to define it. You can back a repository with a:

  • Shared filesystem
  • AWS S3
  • Hadoop HDFS
  • Microsoft Azure

In this tutorial, we will use AWS S3 as a repository to store our snapshots.

Elasticsearch cloud-aws plugin

Plugins are a way to extend Elasticsearch's functionality; in this case, the cloud-aws plugin allows us to set up a repository on AWS S3.


This plugin is not only useful for Elasticsearch's snapshot & restore functionality but also during cluster setup on AWS, where it allows the cluster to auto-discover new members when they are turned on.

The plugin directory is located here:

/usr/share/elasticsearch/plugins

and cloud-aws is a subdirectory of it:

/usr/share/elasticsearch/plugins/cloud-aws

If you see the above directory, it means you have already installed the plugin; otherwise you need to set it up ;)

It's pretty easy, just type:

sudo /usr/share/elasticsearch/bin/plugin install cloud-aws
Elasticsearch 2.4.1 cloud-aws plugin install

If you see an error during plugin install (@ WARNING: plugin requires additional permissions @), don't worry: it comes from the Java security manager. It's not an error; it's just asking you to confirm that you want to give the plugin the necessary permissions to run in the context of the security manager.

In my case I have Elasticsearch 2.4.1, so the plugin installed is version 2.4.1. At the time of writing, plugins follow Elasticsearch's versioning, so if you have installed Elasticsearch 2.4.1 you have to use the corresponding plugin version (but don't worry, the installer does it for you). Now type:

/usr/share/elasticsearch/bin/plugin list

Installed plugins in /usr/share/elasticsearch/plugins:
cloud-aws

The Elasticsearch plugin application has 3 commands :

  1. install
  2. remove
  3. list

Remember to stop Elasticsearch before installing or removing plugins. Ok, now the plugin is installed. Before you set up the repository and start taking snapshots of your indices, you have to consider two different strategies for accessing Amazon S3.


Use AWS S3 (Authorise bucket access from your application)

AWS: IAM Roles VS Users

In AWS, you have many options to allow users or servers to access AWS resources. The idea is to set up a snapshot repository on Amazon AWS S3 and restore from that specific location. To do that, the servers that have to access S3 must be authorized. Thanks to Amazon, we have several options:

  • using IAM (Identity Access Management) Roles
  • using a specific user with specific roles/policies

The first approach is the one I suggest, as it is more reliable: nothing needs to be set up on the server other than the AWS CLI, and even that is not mandatory because the Elasticsearch plugin does the job pretty well. The difference between the two approaches is:

  • An IAM Role is attached to the EC2 instance at the moment you start it up.
  • A User needs to present its credentials whenever it accesses S3.

So basically, if you need to start a new EC2 instance or a new fleet of them, use an IAM Role; if you want to set up Elasticsearch "snapshotting" on an already existing ES cluster, you have to use a User with a specific policy.

AWS S3: Policy

Before getting into the details of IAM Roles or Users, in both cases we need to define a Policy. A Policy is a set of rules you can define and apply to all the services AWS offers, with many levels of granularity, to grant access to those resources.

“A policy is a document that formally states one or more permissions.”

If you want to understand what a policy is and how to use it, I suggest the well-written AWS documentation.

There are many pre-defined Policies for the whole AWS service stack. In our case we want to grant our EC2 instances access to S3:

AmazonS3FullAccess
Default AWS S3 policy — allow everything policy :)

This pre-defined policy is quite easy to understand: it allows every action on all S3 resources. Easy peasy... but this kind of approach is not the one we want on AWS. If you dig deeper into the AWS documentation you will quickly learn that it's better to grant access only to the resources we actually need. AWS Policies allow us to be more fine-grained and define a more detailed policy.

The policy which grants access to a specific S3 bucket

This second, custom Policy shows how much more precisely we can control what users or roles can do in our AWS VPC/environment. You can go into even more detail and define exactly which actions are allowed. Read the documentation carefully; on AWS this is a very important topic.

Another important tip when you work with S3 policies is to allow the s3:ListAllMyBuckets action:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "s3:ListAllMyBuckets",
      "Resource": "arn:aws:s3:::*"
    },
    {
      "Effect": "Allow",
      "Action": "s3:*",
      "Resource": [
        "arn:aws:s3:::bucket-snapshot",
        "arn:aws:s3:::bucket-snapshot/*"
      ]
    }
  ]
}
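If you manage more than one snapshot bucket, the same scoped policy can be generated programmatically. This is an illustrative sketch (the helper is hypothetical, not an AWS API; it just assembles the JSON document, reusing the bucket-snapshot name from the example above):

```python
import json

def snapshot_bucket_policy(bucket: str) -> dict:
    """Build an IAM policy like the one above, scoped to a single bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing all buckets is needed alongside the bucket-scoped grant.
                "Effect": "Allow",
                "Action": "s3:ListAllMyBuckets",
                "Resource": "arn:aws:s3:::*",
            },
            {
                # Full S3 access, but only on this bucket and its objects.
                "Effect": "Allow",
                "Action": "s3:*",
                "Resource": [
                    f"arn:aws:s3:::{bucket}",
                    f"arn:aws:s3:::{bucket}/*",
                ],
            },
        ],
    }

print(json.dumps(snapshot_bucket_policy("bucket-snapshot"), indent=2))
```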

Elasticsearch setup backup Repository with AWS IAM ROLE

IAM role : “elasticsearch-to-s3”

When you set up a new EC2 instance on AWS, you have the option to choose an IAM Role for the machine to use. In our case we set up an IAM role with the policy defined previously; the role name is elasticsearch-to-s3 (the name you chose when you created the custom Policy).

Remember: an IAM role can only be attached at EC2 startup; you cannot add a role to an EC2 instance after it has started. As a reference, see the AWS forum thread about attaching IAM Roles to existing EC2 instances.

Ok now we are ready to create a new Elasticsearch snapshot Repository :

curl -XPUT 'http://localhost:9200/_snapshot/s3_repository?verify=false&pretty' -d'
{
  "type": "s3",
  "settings": {
    "bucket": "bucket-snapshot",
    "region": "eu-west-1"
  }
}'

As you can see, we used very few parameters:

  • type: "s3", to select the AWS S3 repository type
  • settings.bucket: the AWS S3 bucket name
  • settings.region: eu-west-1 (the region where the bucket lives)

Well done, your Elasticsearch snapshot repository (named s3_repository) is created! Now you can back up Elasticsearch!


Elasticsearch setup backup Repository with AWS User credentials

After you create a custom policy you can attach it to a newly created user. Create the new user (the access key and secret access key shown are not valid, so don't try to use them; they are displayed only for tutorial purposes ;) ):

Create a new AWS user.

and attach the already created policy to it:


Now you are ready to set up your Elasticsearch snapshot repository:

curl -XPUT 'http://localhost:9200/_snapshot/s3_repository?verify=false&pretty' -d'
{
  "type": "s3",
  "settings": {
    "bucket": "bucket-snapshot",
    "region": "eu-west-1",
    "access_key": "...",
    "secret_key": "..."
  }
}'

As you can see, we used very few parameters:

  • type: "s3", to select the AWS S3 repository type
  • settings.bucket: the AWS S3 bucket name
  • settings.region: eu-west-1 (the region where the bucket lives)
  • access_key: the user's access key
  • secret_key: the user's secret key

The difference from the IAM Role approach is that you have to specify an access key and secret key, which is not great: if you decide to change the user's credentials for whatever reason, you have to update all of your repositories.

Well done, your Elasticsearch snapshot repository (named s3_repository) is created! Now you can back up Elasticsearch!
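The contrast between the two registration bodies can be sketched with a small hypothetical helper (not an Elasticsearch client API): with an IAM role the settings stay credential-free, while the user-based variant embeds the keys, which is exactly why rotating credentials forces you to update every repository.

```python
import json

def s3_repo_settings(bucket, region, access_key=None, secret_key=None):
    """Build the repository registration body; credentials are only
    embedded when using a User instead of an IAM role."""
    settings = {"bucket": bucket, "region": region}
    if access_key and secret_key:
        settings["access_key"] = access_key
        settings["secret_key"] = secret_key
    return {"type": "s3", "settings": settings}

# IAM role: no credentials in the repository definition.
iam_body = s3_repo_settings("bucket-snapshot", "eu-west-1")

# User: credentials baked into the repository definition (placeholders here).
user_body = s3_repo_settings("bucket-snapshot", "eu-west-1", "...", "...")

print(json.dumps(iam_body, indent=2))
```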


Elasticsearch Backup: create a Snapshot

curl -XPUT "http://localhost:9200/_snapshot/s3_repository/snap1?pretty&wait_for_completion=true"

The command to generate a snapshot is pretty simple: it is just an HTTP PUT request against a REST endpoint. The above request will create a snapshot of the whole cluster, all indices included. You can add some parameters to snapshot only the desired indices.

curl -XPUT "http://localhost:9200/_snapshot/s3_repository/snap1?pretty&wait_for_completion=true" -d'
{
  "indices": "products,index_1,index_2",
  "ignore_unavailable": true,
  "include_global_state": false
}'
  • wait_for_completion: tells the request to return only after the snapshot has completed, which can take a long time if you are snapshotting a lot of data. If you omit this parameter, the request returns immediately while the snapshot runs in the background.
  • indices: specifies which indices to back up.
  • ignore_unavailable: if an index doesn't exist, skip to the next index in the list; when set to false, a missing index breaks execution instead.
  • include_global_state: setting it to false prevents the global cluster state from being put in the snapshot, which allows restoring the snapshot on another cluster with different attributes.
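These parameters map one-to-one onto the request URL and body. As an illustrative sketch (the helper is hypothetical, not an Elasticsearch client API), this assembles the snapshot request without sending it:

```python
import json

def snapshot_request(repo, snapshot, indices,
                     wait_for_completion=False,
                     ignore_unavailable=True,
                     include_global_state=False):
    """Assemble (url, body) for the snapshot PUT shown above."""
    url = f"http://localhost:9200/_snapshot/{repo}/{snapshot}?pretty"
    if wait_for_completion:
        # Query-string flag: block until the snapshot finishes.
        url += "&wait_for_completion=true"
    body = json.dumps({
        "indices": ",".join(indices),
        "ignore_unavailable": ignore_unavailable,
        "include_global_state": include_global_state,
    })
    return url, body

url, body = snapshot_request("s3_repository", "snap1",
                             ["products", "index_1", "index_2"],
                             wait_for_completion=True)
print(url)
```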

If you want to check the status of the snapshot, just type:

curl http://localhost:9200/_cat/snapshots/s3_repository?v

The output is the list of snapshots available to be restored.


Elasticsearch Restore the Snapshot

curl -XPOST http://localhost:9200/_snapshot/s3_repository/snap1/_restore

The /_restore endpoint will restore all the indices in the snapshot into the cluster; you also have access to more options to be more precise during the restore phase.

curl -s -XPOST --url "http://localhost:9200/_snapshot/s3_repository/snap1/_restore" -d'
{
  "indices": "index_1,index_2",
  "ignore_unavailable": true,
  "include_global_state": false,
  "rename_pattern": "(\\w+)",
  "rename_replacement": "$1_dev"
}'

These options allow renaming indices on restore: match a pattern, then apply a substitution. In this particular example, the regex matches any run of word characters and appends "_dev" to the capturing group.
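The rename rule uses $1-style backreferences; Python's re module uses \1 instead, but the effect is the same. A quick sketch of what the pattern above does to an index name:

```python
import re

def rename_index(name: str) -> str:
    """Mirror of the restore body's rename rule: capture the whole index
    name with (\\w+) and append "_dev" to the captured group."""
    return re.sub(r"(\w+)", r"\1_dev", name)

print(rename_index("index_1"))  # index_1_dev
print(rename_index("index_2"))  # index_2_dev
```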

Elasticsearch Restore to a different cluster

The interesting part is that you can restore the snapshot on another cluster if you want. All you have to do is register the repository containing the snapshot on the new cluster and start the restore process.

We have implemented this solution: we snapshot our production cluster and use those snapshots to restore our cluster environment on development VMs. It works pretty well and it's super fast.



You can find an example of a restore script here, and an example of an AWS S3 policy here. If you need more Elasticsearch references, have a look at the ES website.
