Elasticsearch 7.x Backup — “Snapshot & Restore” a cluster on AWS S3

Federico Panini
Jun 4, 2020 · 13 min read

Set up snapshot and restore functionalities on Elasticsearch.

Elasticsearch Backup — Snapshot and Restore on AWS S3

In 2016 I wrote an article about Elasticsearch backup, and it still attracts quite a bit of interest from readers. I decided to start a new series of articles with backup as the main topic.

The old article covered the Snapshot & Restore functionality based on Elasticsearch 2.4.x and the then-upcoming 5.0. Since that was 4 years ago, I chose to refresh this tutorial and make it the first of a new series.

I will also prepare short articles on how to use the snapshot & restore functionality with different cloud providers. This article is based on Elasticsearch 7.x; that doesn't mean it can't work on older versions, but I focused on the latest one.

Elasticsearch Snapshot & Restore

Elasticsearch has a smart solution to back up single indices or entire clusters to a remote shared filesystem, S3, or HDFS. The snapshots ES creates are not very resource consuming and are relatively small.

The idea behind these snapshots is that they are not "archives" in a strict sense; a snapshot can only be read by a version of Elasticsearch that is capable of reading the index version stored inside it.

So you can follow this quick scheme if you want to restore ES snapshots:

Snapshots of indices created with ES 1.x cannot be restored to 5.x or 6.x, snapshots of indices created in 2.x cannot be restored to 6.x or 7.x, and snapshots of indices created in 5.x cannot be restored to 7.x or 8.x.

So pay close attention when you create a snapshot from 1.x: you cannot restore it directly on a 5.x, 6.x, or 7.x cluster. Following the scheme above, you would first restore the snapshot into a 2.x cluster and then use reindex-from-remote, available since 5.x, to bring the data into a newer cluster.
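As a rough sketch of that last step, a reindex-from-remote call issued from the new cluster could look like this (the host oldcluster:9200 and the index legacy_index are hypothetical examples, and the old cluster's address must be whitelisted in reindex.remote.whitelist on the new cluster):

curl -X POST "localhost:9200/_reindex?pretty" -H 'Content-Type: application/json' -d'
{
  "source": {
    "remote": { "host": "http://oldcluster:9200" },
    "index": "legacy_index"
  },
  "dest": {
    "index": "legacy_index"
  }
}
'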

To start backing up indices, you must know the terminology behind it:

Elasticsearch Snapshots

The backed-up data is stored in a structure called a SNAPSHOT. Snapshots are incremental, which means each snapshot only stores the data that is not already part of an earlier snapshot. The incremental nature of snapshots allows making them frequently without creating a lot of overhead.

Snapshots are incremental

Elasticsearch Repository

Every backup inside Elasticsearch is stored in a so-called "snapshot repository", a container that defines where, on a filesystem or a virtual filesystem, the snapshots will be stored. When you create a repository you have many options available to define it: you can back it with a shared filesystem or, through plugins, with stores such as AWS S3, HDFS, Azure, or Google Cloud Storage.

In this tutorial, we will discover AWS S3 as a repository to store our snapshots.


Elasticsearch S3 plugin

Since my last article, the cloud-aws plugin has been split into two different plugins: repository-s3 (for snapshots) and discovery-ec2 (for EC2 discovery).

The repository-s3 plugin has changed a lot over time and has been improved. We will try to analyze it in detail and dig into every relevant configuration.

How to install the plugin

Since my last article, the plugin has been improved and changed a lot. Elastic moved from the cloud-aws plugin to a new set of repository plugins that abstract the underlying filesystem. There is a shared-filesystem repository, which I suggest using only for test purposes, and then many other options; here we will look at the S3 plugin.

The plugin installation is quite easy: just a quick, simple command executed in a terminal window:

sudo bin/elasticsearch-plugin install repository-s3
Install the repository-S3 plugin

The elasticsearch-plugin application has 3 commands:

  1. install
  2. remove
  3. list

Remember to stop Elasticsearch before installing or removing plugins. Now that the plugin is installed, before setting up the repository and taking a snapshot of your indices, you have to look at two different strategies to access Amazon S3.
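For completeness, the other two commands follow the same pattern as the install one:

# show the plugins currently installed on this node
sudo bin/elasticsearch-plugin list
# uninstall the S3 repository plugin
sudo bin/elasticsearch-plugin remove repository-s3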

Configure S3 Plugin

The repository-s3 plugin provides a repository type named s3, which may be used when creating a repository. The repository defaults to using ECS IAM role or EC2 IAM role credentials for authentication, or you can use explicit user credentials.

The only mandatory setting is the bucket name:

curl -X PUT "localhost:9200/_snapshot/my_s3_repository?pretty" -H 'Content-Type: application/json' -d'
{
"type": "s3",
"settings": {
"bucket": "my_bucket"
}
}
'

Before creating the repository, make sure you have correctly set up the client settings, which allow the repository to store the data on the right backend: in our case, AWS S3.

Client Settings

Again, things have improved a lot since my last article, and a new concept was introduced in ES 5.5: the client configuration. A client holds the settings used to connect to an external system for backup purposes, in our case S3.

By default there is a client called default, and you can address its properties in the form s3.client.CLIENT_NAME.SETTING_NAME; using the default client, the properties look like this:

s3.client.default.max_retries
s3.client.default.protocol
s3.client.default.endpoint
....

The client settings should be specified in elasticsearch.yml, all of them except the secure settings access_key and secret_key; for those you have to use the Elasticsearch keystore.

For example, our sample file looks like this (at the end of the article there’s a link to a GitHub repository with a full sample):

cluster.name: "docker-cluster"
network.host: 0.0.0.0
s3.client.default.endpoint: s3-eu-west-1.amazonaws.com

The configuration above is just a simple example; the minimum configuration required for the client is the endpoint, access_key, and secret_key settings.

If you want to use a custom client with a different name, or you need to define more than one, you reference it when creating the repository:

curl -X PUT "localhost:9200/_snapshot/my_s3_repository?pretty" -H 'Content-Type: application/json' -d'
{
"type": "s3",
"settings": {
"bucket": "my_bucket",
"client": "my_alternate_client"
}
}
'

and then set custom client settings this way:

s3.client.my_alternate_client.max_retries
s3.client.my_alternate_client.protocol
s3.client.my_alternate_client.endpoint
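As a sketch, assuming the client is named my_alternate_client as above, the non-secure settings go in elasticsearch.yml and the secure ones in the keystore:

# elasticsearch.yml
s3.client.my_alternate_client.endpoint: s3-eu-west-1.amazonaws.com

# secure settings, added on every node with the keystore CLI
bin/elasticsearch-keystore add s3.client.my_alternate_client.access_key
bin/elasticsearch-keystore add s3.client.my_alternate_client.secret_key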

If, instead, you want to use an instance role or a container role to access S3, you should leave these credential settings unset, which is how they are by default.

Different ways to access AWS S3

AWS S3 repository Authentication

This article is about creating a snapshot in a repository which, in our case, is AWS S3. How can we access S3? How should we allow Elasticsearch to access it?

There are several possibilities, and it depends on your needs: you can rely on an IAM role (an EC2 instance role or ECS container role), or you can use IAM user credentials (an access key and secret key).

In both cases, you have to implement a custom policy or use a default one already defined in your AWS account.

AWS ships a managed policy called AmazonS3FullAccess, which is an easy one to attach to a role or a user, but the downside is that you are granting full access when you only need a small subset of actions. My suggestion is to keep permissions as strict as you can.

IAM Roles

The best way to grant access to S3 for Elasticsearch snapshots is with an IAM role. Through an IAM role you can attach a policy, or a set of policies, and then grant permission to entities you trust.

A very good example would be to create an IAM role and attach to it the S3 policies you need (in our case, a custom policy with the minimum actions needed to allow Elasticsearch to manage snapshots).

There's a good explanation of this in an official AWS blog post. From that post:

IAM roles enable your applications running on Amazon EC2 to use temporary security credentials. IAM roles for EC2 make it easier for your applications to make API requests securely from an instance because they do not require you to manage AWS security credentials that the applications use.

Elasticsearch keystore

Over the last couple of major Elasticsearch versions, a CLI tool called elasticsearch-keystore has been released (it lives in the bin directory). The keystore is node-specific, so you have to apply the configuration on every node of the cluster.

This tool is really powerful and lets you manage, in a safe way, all the secure settings you need while setting up the service.
Let's take an example: for the sake of this tutorial we need to store the AWS access key and secret key.

bin/elasticsearch-keystore add s3.client.default.access_key
bin/elasticsearch-keystore add s3.client.default.secret_key

The tool has a lot of options you can use; in this case, instead of adding the AWS secrets manually on every node, I preferred to set these values in the Dockerfile.

#Dockerfile
...
...
RUN echo ".." | bin/elasticsearch-keystore add --stdin --force s3.client.default.access_key
RUN echo "..." | bin/elasticsearch-keystore add --stdin --force s3.client.default.secret_key

A custom AWS IAM Policy

I won't explain in this tutorial what an AWS IAM policy is and how it works, as it is out of scope. If you want to understand it better, you can follow the AWS Policies and Permissions documentation.

I will show you how to define a custom policy with only the actions needed for snapshots to work.

In the AWS Console go to the AWS Security credentials and access the “Policies” menu item.

Access the “Policies menu item”

You will find a lot of system policies defined by default that you can play with; again, my suggestion is to define a more granular policy only for the snapshot functionality we are dealing with right now.

Create a new policy; you can use the visual editor or enter the JSON directly in the editor.

You can add a custom JSON or you can use the visual editor

Below is the JSON I use to grant access to the AWS S3 bucket I previously created.

This is the gist link https://gist.github.com/p365labs/1542d6382e21ad5b4cdf1b82ef12d0fc

In the Resource field you can specify, for a more granular definition, the name of the bucket you are allowing ES to access for snapshot purposes.
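For reference, here is a sketch of that kind of policy, based on the actions the repository-s3 plugin documentation recommends (the gist linked above remains the authoritative version; es2s3 is the bucket created later in this article):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket",
        "s3:GetBucketLocation",
        "s3:ListBucketMultipartUploads",
        "s3:ListBucketVersions"
      ],
      "Resource": "arn:aws:s3:::es2s3"
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:DeleteObject",
        "s3:AbortMultipartUpload",
        "s3:ListMultipartUploadParts"
      ],
      "Resource": "arn:aws:s3:::es2s3/*"
    }
  ]
}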

Let’s take our snapshot

Before you start playing with Elasticsearch, make sure you have created an AWS S3 bucket with public access turned off. The bucket name for this example is es2s3.

I named it “es2s3”

Again, in the bucket settings, make sure to block all public permissions: simply click on the bucket name, open the Permissions tab, and check that "Block public access" is ON for everything.

Public access has been denied to this bucket
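If you prefer the command line to the console, a rough AWS CLI equivalent for creating the bucket and blocking public access would be:

# create the bucket in eu-west-1 (the region used in this article)
aws s3api create-bucket --bucket es2s3 --region eu-west-1 \
  --create-bucket-configuration LocationConstraint=eu-west-1

# turn on every "block public access" switch for the bucket
aws s3api put-public-access-block --bucket es2s3 \
  --public-access-block-configuration BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true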

Build the containers for the first time:

docker-compose build

then run the containers

docker-compose up

Now check the Elasticsearch version:

curl localhost:9200

and this will be the response:

{
  "name" : "es01",
  "cluster_name" : "es-docker-cluster",
  "cluster_uuid" : "A9AICCuxTi2lITqr2OJS2w",
  "version" : {
    "number" : "7.6.2",
    "build_flavor" : "default",
    "build_type" : "docker",
    "build_hash" : "ef48eb35cf30adf4db14086e8aabd07ef6fb113f",
    "build_date" : "2020-03-26T06:34:37.794943Z",
    "build_snapshot" : false,
    "lucene_version" : "8.4.0",
    "minimum_wire_compatibility_version" : "6.8.0",
    "minimum_index_compatibility_version" : "6.0.0-beta1"
  },
  "tagline" : "You Know, for Search"
}

If you don't see errors in the logs, the configured client is able to access the AWS S3 bucket. Now you can set up the repository where you will take your snapshots.

curl -X PUT "localhost:9200/_snapshot/mycustom_s3_repo?pretty" -H 'Content-Type: application/json' -d'
{
"type": "s3",
"settings": {
"bucket": "es2s3"
}
}
'

If everything goes well this will be the response:

{
  "acknowledged" : true
}
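Elasticsearch already verifies the repository when you register it, but you can trigger the check again at any time to confirm that all nodes can write to the bucket:

curl -X POST "localhost:9200/_snapshot/mycustom_s3_repo/_verify?pretty"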

Now you are ready to start snapshotting your Elasticsearch indices; first, check out the repository configuration:

curl "localhost:9200/_snapshot?pretty"

and this should be the response:

{
  "mycustom_s3_repo" : {
    "type" : "s3",
    "settings" : {
      "bucket" : "es2s3"
    }
  }
}

The next step is to create a snapshot with the right curl command:

curl -X PUT "localhost:9200/_snapshot/mycustom_s3_repo/snapshot_1?wait_for_completion=true&pretty"

With the wait_for_completion parameter set to true, the request waits until the snapshot completes; if you set it to false, the response returns immediately.
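If you go for wait_for_completion=false, you can check on the snapshot later, for example with the snapshot status API:

curl "localhost:9200/_snapshot/mycustom_s3_repo/snapshot_1/_status?pretty"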
This is the response after creating the snapshot:

{
  "snapshot" : {
    "snapshot" : "snapshot_1",
    "uuid" : "WiNVFShuRzmNBmnkxoC20A",
    "version_id" : 7060299,
    "version" : "7.6.2",
    "indices" : [ ],
    "include_global_state" : true,
    "state" : "SUCCESS",
    "start_time" : "2020-05-30T11:44:43.972Z",
    "start_time_in_millis" : 1590839083972,
    "end_time" : "2020-05-30T11:44:44.173Z",
    "end_time_in_millis" : 1590839084173,
    "duration_in_millis" : 201,
    "failures" : [ ],
    "shards" : {
      "total" : 0,
      "failed" : 0,
      "successful" : 0
    }
  }
}

The response tells us the snapshot completed successfully; you can also check on S3 what is inside the bucket. There are some files… have a look:

S3 Bucket after the first snapshot executions

OK, but… we have no indices, so of course it was easy :) Now let's add some data to a simple index and discover how to restore it:

curl -X POST "localhost:9200/person/_bulk?pretty" -H 'Content-Type: application/json' -d'
{ "index":{} }
{ "name":"john doe","age":25 }
{ "index":{} }
{ "name":"mary smith","age":32 }
{ "index":{} }
{ "name":"robin green","age":15 }
{ "index":{} }
{ "name":"fred white","age":68 }
'

This will create a simple index, person, with just 4 documents. Let's take a new snapshot and see what happens:

curl -X PUT "localhost:9200/_snapshot/mycustom_s3_repo/snapshot_2?wait_for_completion=true&pretty"
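You can also list every snapshot stored in the repository and confirm that both snapshot_1 and snapshot_2 are there:

curl "localhost:9200/_snapshot/mycustom_s3_repo/_all?pretty"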

Now, to validate the idea of having a backup system in place, let's delete the index and restore it.

curl -X DELETE localhost:9200/person

then execute a

curl localhost:9200/_cat/indices

The cluster right now will be empty!

curl -X POST "localhost:9200/_snapshot/mycustom_s3_repo/snapshot_2/_restore?pretty"
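The call above restores everything contained in the snapshot. As a sketch, the restore API also accepts a body to restore only selected indices and optionally rename them on the way in (the restored_ prefix here is just an example):

curl -X POST "localhost:9200/_snapshot/mycustom_s3_repo/snapshot_2/_restore?pretty" -H 'Content-Type: application/json' -d'
{
  "indices": "person",
  "rename_pattern": "(.+)",
  "rename_replacement": "restored_$1"
}
'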

If you now list the indices and run a query, you should get a meaningful result:

curl localhost:9200/_cat/indices

The index person has been restored; the 4 documents are in there.

curl "localhost:9200/person/_search?pretty&q=john"

with this result:

{
  "took" : 31,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 1.2039728,
    "hits" : [
      {
        "_index" : "person",
        "_type" : "_doc",
        "_id" : "baNzZXIBB04z4g6GU0tF",
        "_score" : 1.2039728,
        "_source" : {
          "name" : "john doe",
          "age" : 25
        }
      }
    ]
  }
}

WOW! It works. The complex part is not the snapshot and restore itself; it is understanding what repositories, snapshots, and clients are, and how to configure them. In the next section, you will find a link to a GitHub repository with the configuration files to start a cluster on your own and run some tests.

I'm still preparing a new set of articles on snapshot and restore management and how to deal with snapshot orchestration… so stay tuned.

Example Repository

I have prepared a repository on GitHub you can use as an example to start implementing your own Elasticsearch snapshot.

There you will find 3 simple files which, with the help of docker-compose, create an Elasticsearch cluster and set up the repository and the client correctly for you.

FROM docker.elastic.co/elasticsearch/elasticsearch:7.6.2
RUN /usr/share/elasticsearch/bin/elasticsearch-plugin install --batch repository-s3
COPY --chown=elasticsearch:elasticsearch elasticsearch.yml /usr/share/elasticsearch/config/
RUN echo "YOUR_ACCESS_KEY" | bin/elasticsearch-keystore add --stdin --force s3.client.default.access_key
RUN echo "YOUR_SECRET_KEY" | bin/elasticsearch-keystore add --stdin --force s3.client.default.secret_key

It defines which Elasticsearch version you will use, installs the repository-s3 plugin, copies the elasticsearch.yml configuration file into the container, and, before the cluster starts, adds the AWS access_key and secret_key to the keystore using the elasticsearch-keystore CLI so that Elasticsearch can back up the indices to an AWS S3 bucket.

version: '2.2'
services:
  es01:
    build: .
    container_name: es01
    environment:
      - node.name=es01
      - cluster.name=es-docker-cluster
      - discovery.seed_hosts=es02,es03
      - cluster.initial_master_nodes=es01,es02,es03
      - bootstrap.memory_lock=true
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
    ulimits:
      memlock:
        ....
        ....
        ....

This is the docker-compose file that creates the Elasticsearch cluster. If you want more information on it and on how to tweak the Elasticsearch docker-compose file, please have a look at the official Elasticsearch Docker documentation.
On top of this, Elastic provides a full list of available Docker images.

cluster.name: "docker-cluster"
network.host: 0.0.0.0
s3.client.default.endpoint: s3-eu-west-1.amazonaws.com

This is the Elasticsearch configuration file. In our example we are adding only one piece of information to it: the S3 endpoint for the AWS region we are working in.
