Elasticsearch Backup : Snapshot and Restore on AWS S3

Elasticsearch Snapshot & Restore

Just in case you are about to start reading this tutorial: beware, it is very long. The idea is to explain how to snapshot and restore Elasticsearch on AWS S3. The article is divided into these sections :

  1. ES Repositories
  2. ES cloud-aws plugin
  3. AWS S3 IAM Role
  4. AWS S3 User Policy
  5. ES setup backup Repository with AWS IAM ROLE
  6. ES setup backup Repository with AWS User
  7. ES create a snapshot
  8. ES restore a snapshot
  9. ES restore a snapshot on different cluster

Have a nice read!

Elasticsearch has a smart solution for backing up single indices or entire clusters to a remote shared filesystem, S3 or HDFS. The snapshots ES creates are not very resource-consuming and are relatively small.

The idea behind these snapshots is that they are not an “archive” in the strict sense: a snapshot can only be read by a version of Elasticsearch that is able to read the index version stored inside it.

So you can follow this quick scheme if you want to restore ES snapshots :

  • A snapshot of an index created in 2.x can be restored to 5.x.
  • A snapshot of an index created in 1.x can be restored to 2.x.
  • A snapshot of an index created in 1.x cannot be restored to 5.x.

So pay close attention when you create a snapshot from 1.x: you cannot restore it directly on a 5.x cluster. You should first restore it on a 2.x cluster, and then you can use reindex-from-remote, available in the new 5.x release (the link could change due to a new release of ES not yet released!).
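The compatibility scheme above can be written down as a small guard to run before attempting a restore. This is only a sketch: the function name `can_restore` is made up for illustration, it covers only the versions discussed in this article, and it assumes that restoring on the same major version is always allowed.

```shell
# can_restore SNAPSHOT_MAJOR TARGET_MAJOR
# Exit status 0 if a snapshot taken on SNAPSHOT_MAJOR can be restored
# directly on a cluster running TARGET_MAJOR, 1 otherwise.
# (Hypothetical helper; it only knows the 1.x / 2.x / 5.x cases listed above.)
can_restore() {
  case "$1:$2" in
    1:1|2:2|5:5) return 0 ;;  # same major version (assumed to be fine)
    1:2|2:5)     return 0 ;;  # one major step forward, as in the list above
    *)           return 1 ;;  # everything else, e.g. 1.x -> 5.x
  esac
}

# Example:
#   can_restore 1 5 || echo "restore on 2.x first, then use reindex-from-remote"
```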

In order to start backing up indices you must know the terminology behind it :

  • repository : a logical container inside which you store the real backup data (the snapshots). A repository can contain multiple snapshots.
  • snapshot : the backup of the data itself.

Elasticsearch Repositories

Every backup in Elasticsearch is stored inside a so-called “snapshot repository”, a container that defines the filesystem or virtual filesystem the snapshots will be stored in. When you create a repository you have several options for defining it. You can define a repository backed by :

  • Shared filesystem
  • AWS S3
  • Hadoop HDFS
  • Microsoft Azure

In this tutorial we will use AWS S3 as a repository to store our snapshots.

Elasticsearch cloud-aws plugin

Plugins are a way to extend Elasticsearch functionality; in this case the cloud-aws plugin allows you to set up a repository on AWS S3.

This plugin is not only useful for Elasticsearch snapshot & restore, but also during cluster setup on AWS, where it allows the cluster to auto-discover new members when they are turned on.

The plugin directory is located here :

/usr/share/elasticsearch/plugins

and cloud-aws is a subdirectory of it :

/usr/share/elasticsearch/plugins/cloud-aws

If you see the above directory, it means that you have already installed the plugin; otherwise you need to set it up ;)

It is pretty easy, just type :

sudo /usr/share/elasticsearch/bin/plugin install cloud-aws
Elasticsearch 2.4.1 cloud-aws plugin install
If you see a warning during the plugin install (@ WARNING: plugin requires additional permissions @) don't worry, it comes from the Java security manager. It's not an error, it's just asking you to confirm that you want to give the plugin the permissions it needs to run in the context of the security manager.

In my case I have Elasticsearch 2.4.1, so the installed plugin is version 2.4.1. At the time of writing, plugins follow Elasticsearch’s version, so if you have installed Elasticsearch 2.4.1 you have to use the corresponding plugin version (but don’t worry, the installer will do it for you). Now type :

/usr/share/elasticsearch/bin/plugin list
Installed plugins in /usr/share/elasticsearch/plugins:
cloud-aws
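
The directory check described above can be scripted, for example in a provisioning script. This is a minimal sketch: `plugin_installed` is a hypothetical helper name, and the default path assumes the package layout used in this article.

```shell
# plugin_installed PLUGIN_NAME [PLUGINS_DIR]
# Exit status 0 if PLUGINS_DIR contains a subdirectory named PLUGIN_NAME.
# PLUGINS_DIR defaults to the location used in this article (an assumption
# about your install; adjust it if Elasticsearch lives elsewhere).
plugin_installed() {
  plugin_name="$1"
  plugins_dir="${2:-/usr/share/elasticsearch/plugins}"
  [ -d "$plugins_dir/$plugin_name" ]
}

# Example:
#   plugin_installed cloud-aws || \
#     sudo /usr/share/elasticsearch/bin/plugin install cloud-aws
```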

The Elasticsearch plugin application has 3 commands :

  1. install
  2. remove
  3. list

Remember to stop Elasticsearch before installing/removing plugins. Now that the plugin is installed, before you set up the repository and start taking snapshots of your indices, you have to consider two different strategies for accessing Amazon S3.


Use AWS S3 (Authorise bucket access from your application)

AWS : IAM Roles VS Users

In AWS you have many options to allow [users|servers] to access AWS resources. The idea is to set up a snapshot repository on Amazon S3 and to restore from that specific location. In order to do that, the servers that have to access S3 must be authorised. Thanks to Amazon we have several options :

  • using IAM (Identity and Access Management) Roles
  • using a specific user with specific roles/policies

The first approach is the one I suggest, as it is more reliable: we don’t need to set up anything on the server other than the AWS CLI, and even that is not mandatory, because the Elasticsearch plugin does the job pretty well. The difference between the two approaches is :

  • IAM Role is attached to the EC2 instance at the moment of starting it up.
  • USER needs its own credentials whenever you need to access S3.

So basically, if you need to start a new EC2 instance or a new fleet of EC2 instances, use an IAM role; if you want to add Elasticsearch “snapshotting” to an already existing ES cluster, you have to use a user with a specific policy.

AWS S3 : Policy

Before entering into the details of IAM roles or users we need, in both cases, to define a policy. A policy is a set of rules, which you can define for all the services AWS offers and with many levels of granularity, that grants access to these resources.

“A policy is a document that formally states one or more permissions.”

If you want to understand what a policy is and how to use it, I suggest the well-written AWS documentation.

There are many pre-defined policies for the whole AWS service stack. In our case we want to grant our EC2 instances access to S3 services :

AmazonS3FullAccess
Default AWS S3 policy — allow everything policy :)

This pre-defined policy is quite easy to understand: it allows every S3 action on every S3 resource. Easy peasy… but this kind of approach is not the one we want on AWS. If you go deeper into the AWS documentation you will quickly learn that it’s better to grant access only to the resources we need. AWS policies allow us to be more fine-grained and define a more detailed policy.

Policy which grant access to a specific S3 bucket

This second, custom policy shows how we can control more precisely what users or roles can do in our AWS VPC/environment. You can go into even more detail and define which specific actions are allowed, and so on. Read the documentation carefully, this is a very important topic on AWS.

Another important TIP when you work with S3 policies is to allow the s3:ListAllMyBuckets action.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "s3:ListAllMyBuckets",
      "Resource": "arn:aws:s3:::*"
    },
    {
      "Effect": "Allow",
      "Action": "s3:*",
      "Resource": [
        "arn:aws:s3:::bucket-snapshot",
        "arn:aws:s3:::bucket-snapshot/*"
      ]
    }
  ]
}
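
If you manage more than one bucket, the same policy can be generated from the bucket name instead of being edited by hand. The helper below is only a sketch: `s3_snapshot_policy` is a hypothetical name, and it prints exactly the two statements shown above with the bucket name substituted.

```shell
# s3_snapshot_policy BUCKET_NAME
# Prints a policy document that allows listing buckets and grants full
# access to BUCKET_NAME and the objects inside it (same rules as above).
s3_snapshot_policy() {
  bucket="$1"
  cat <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "s3:ListAllMyBuckets",
      "Resource": "arn:aws:s3:::*"
    },
    {
      "Effect": "Allow",
      "Action": "s3:*",
      "Resource": ["arn:aws:s3:::${bucket}", "arn:aws:s3:::${bucket}/*"]
    }
  ]
}
EOF
}

# Example: s3_snapshot_policy bucket-snapshot > policy.json
```

The resulting file can be pasted into the AWS console, or, if you use the AWS CLI, passed to `aws iam create-policy` with `--policy-document file://policy.json` (the policy name is yours to choose).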

Elasticsearch setup backup Repository with AWS IAM ROLE

IAM role : “elasticsearch-to-s3”

When you set up a new EC2 instance on AWS, there is an option to choose the IAM role the machine will use. In our case we set up an IAM role with the policy defined previously; the role name is : elasticsearch-to-s3 (a name that you choose yourself when you create it).

Remember : at the time of writing, an IAM role can only be assigned when the EC2 instance is launched; you cannot add a role to an instance that is already running. For reference, see the AWS forum thread about attaching an IAM role to existing EC2 instances.

Ok now we are ready to create a new Elasticsearch snapshot Repository :

curl -XPUT 'http://localhost:9200/_snapshot/s3_repository?verify=false&pretty' -d'
{
  "type": "s3",
  "settings": {
    "bucket": "bucket-snapshot",
    "region": "eu-west-1"
  }
}'

As you can see we used very few parameters:

  • type : we used “s3” to specify AWS S3 service
  • settings.bucket : AWS S3 bucket name
  • settings.region : the AWS region of the bucket (eu-west-1 in this example)

Well done, your Elasticsearch snapshot repository (named s3_repository) is created! Now you can back up Elasticsearch!
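
Since the repository was registered with verify=false, it can be worth checking it by hand before relying on it. The sketch below wraps two standard snapshot endpoints (reading the repository definition, and `_verify`); the function names and the `ES_URL` variable are made up for this example.

```shell
# Base URL of the cluster; override it in the environment if needed.
ES_URL="${ES_URL:-http://localhost:9200}"

# repo_url REPOSITORY_NAME: prints the REST endpoint of a snapshot repository.
repo_url() {
  printf '%s/_snapshot/%s' "$ES_URL" "$1"
}

# show_repo REPOSITORY_NAME: prints the settings the repository was registered with.
show_repo() {
  curl -s -XGET "$(repo_url "$1")?pretty"
}

# verify_repo REPOSITORY_NAME: asks the nodes to check they can access the repository.
verify_repo() {
  curl -s -XPOST "$(repo_url "$1")/_verify?pretty"
}

# Example:
#   show_repo s3_repository
#   verify_repo s3_repository
```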


Elasticsearch setup backup Repository with AWS User credentials

After you create a custom policy you can attach it to a newly created user. So create a new user (the access key and secret access key are not valid, so don’t try to use them; they are shown only for the purpose of this tutorial ;) ):

Create a new AWS user.

and attach the already created policy to it :

Now you are ready to set up your Elasticsearch snapshot repository.

curl -XPUT 'http://localhost:9200/_snapshot/s3_repository?verify=false&pretty' -d'
{
  "type": "s3",
  "settings": {
    "bucket": "bucket-snapshot",
    "region": "eu-west-1",
    "access_key": "...",
    "secret_key": "..."
  }
}'

As you can see we used very few parameters:

  • type : we used “s3” to specify AWS S3 service
  • settings.bucket : AWS S3 bucket name
  • settings.region : the AWS region of the bucket (eu-west-1 in this example)
  • settings.access_key : user’s access key
  • settings.secret_key : user’s secret key

The difference from the IAM role is that you have to specify access_key and secret_key, which is not very good… if you decide to change the user’s credentials for whatever reason, you have to update all of your repositories.

Well done, your Elasticsearch snapshot repository (named s3_repository) is created! Now you can back up Elasticsearch!


Elasticsearch do a Backup : create a Snapshot

curl -XPUT "http://localhost:9200/_snapshot/s3_repository/snap1?pretty&wait_for_completion=true"

The command to create a snapshot is pretty simple : it is a plain HTTP PUT request against a REST endpoint. The above request creates a snapshot of the whole cluster, with all its indices. You can add some parameters to snapshot only the desired indices.

curl -XPUT "http://localhost:9200/_snapshot/s3_repository/snap1?pretty&wait_for_completion=true" -d'
{
  "indices": "products,index_1,index_2",
  "ignore_unavailable": true,
  "include_global_state": false
}'

  • wait_for_completion : tells the request to wait for the snapshot to complete before returning status information, which could be a problem if you are snapshotting a lot of data. If you omit this parameter the request returns immediately.
  • indices : specifies which indices to back up.
  • ignore_unavailable : if true, an index that doesn’t exist is skipped and the snapshot continues with the next index in the list; if false, the request fails.
  • include_global_state : setting it to false prevents the global cluster state from being stored in the snapshot, which allows restoring the snapshot on another cluster with different attributes.

If you want to check the status of the snapshot just type :

curl "http://localhost:9200/_cat/snapshots/s3_repository?v"

The output is the list of snapshots available to be restored.
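
If you start a snapshot without wait_for_completion, a script has to poll for the result itself. The following sketch reads the snapshot info endpoint and extracts its "state" field with sed; the helper names are made up, the 10-second interval is arbitrary, and a real script would be better served by a proper JSON parser.

```shell
ES_URL="${ES_URL:-http://localhost:9200}"

# extract_state: reads snapshot info JSON on stdin and prints the first
# "state" value found (for example SUCCESS, IN_PROGRESS, PARTIAL or FAILED).
extract_state() {
  sed -n 's/.*"state"[[:space:]]*:[[:space:]]*"\([A-Z_]*\)".*/\1/p' | head -n 1
}

# wait_for_snapshot REPOSITORY SNAPSHOT
# Polls until the snapshot is no longer IN_PROGRESS, prints the final state,
# and returns 0 only if that state is SUCCESS.
wait_for_snapshot() {
  repo="$1"
  snap="$2"
  while :; do
    state="$(curl -s "$ES_URL/_snapshot/$repo/$snap?pretty" | extract_state)"
    case "$state" in
      IN_PROGRESS) sleep 10 ;;
      SUCCESS)     echo "$state"; return 0 ;;
      *)           echo "${state:-UNKNOWN}"; return 1 ;;
    esac
  done
}

# Example: wait_for_snapshot s3_repository snap1
```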


Elasticsearch Restore the Snapshot

curl -XPOST http://localhost:9200/_snapshot/s3_repository/snap1/_restore

The /_restore endpoint restores all the indices in the snapshot; more options are available if you want finer control over the restore phase.

curl -s -XPOST --url "http://localhost:9200/_snapshot/s3_repository/snap1/_restore" -d'
{
  "indices": "index_1,index_2",
  "ignore_unavailable": true,
  "include_global_state": false,
  "rename_pattern": "(\\w+)",
  "rename_replacement": "$1_dev"
}'

These options allow you to rename indices on restore by matching a pattern and applying a substitution. In this particular example the regex matches a run of word characters (here, the whole index name) and appends “_dev” to the captured group.
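
To preview what the rename will produce before running a restore, the same substitution can be approximated locally with sed. This is only an approximation: Elasticsearch applies a Java regular expression, while the sketch below uses a POSIX character class in place of \w, and `preview_rename` is a hypothetical helper.

```shell
# preview_rename INDEX_NAME
# Prints the name the index would get with rename_pattern "(\w+)" and
# rename_replacement "$1_dev". [[:alnum:]_] stands in for Java's \w.
preview_rename() {
  printf '%s\n' "$1" | sed -E 's/([[:alnum:]_]+)/\1_dev/g'
}

# Example:
#   preview_rename index_1
```

Note that an index name containing a character outside the word class, such as a hyphen, is matched in several pieces, so each piece gets its own “_dev” suffix; pick a stricter pattern if your index names look like that.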

Elasticsearch Restore to a different cluster

The interesting part is that, if you want, you can restore the snapshot on another cluster. All you have to do is register the repository that holds the snapshot on the new cluster and start the restore process.
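
The two steps can be sketched as one function that talks to the target cluster. Everything here is illustrative: the function names are made up, the target cluster is assumed to reach S3 through an IAM role (so no keys appear in the body), and the example host is a placeholder.

```shell
# repo_body BUCKET REGION: prints the JSON body used to register an S3 repository.
repo_body() {
  printf '{"type":"s3","settings":{"bucket":"%s","region":"%s"}}' "$1" "$2"
}

# restore_on_cluster TARGET_URL REPOSITORY SNAPSHOT BUCKET REGION
# 1. registers the existing bucket as a repository on the target cluster
# 2. restores the given snapshot from it and waits for completion
restore_on_cluster() {
  target="$1"; repo="$2"; snap="$3"; bucket="$4"; region="$5"
  # -f makes curl return a non-zero status on HTTP errors, so step 2 is skipped
  curl -sS -f -XPUT "$target/_snapshot/$repo" \
       -d "$(repo_body "$bucket" "$region")" || return 1
  curl -sS -f -XPOST "$target/_snapshot/$repo/$snap/_restore?wait_for_completion=true"
}

# Example (dev-es is a placeholder host name):
#   restore_on_cluster http://dev-es:9200 s3_repository snap1 bucket-snapshot eu-west-1
```

Keep in mind that a restore fails if an open index with the same name already exists on the target cluster, so close or delete it first, or rename on restore.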

We have implemented this solution: we snapshot our production cluster and use these snapshots to restore our environment on development VMs. It works pretty well and it’s super fast.


You can find an example of a restore script here, and an example of AWS S3 policy here. If you need more Elasticsearch reference please have a look here on ES website.