Elasticsearch — Indices and documents

Eleonora Fontana · Betacom · Dec 7, 2020

Introduction

As explained in our previous article, data in Elasticsearch is stored in documents. Those familiar with relational databases may think of them as rows of a table. A collection of documents with similar characteristics is called an index, which sits somewhere between a table and a schema/database. In this article you will learn how to create, delete, update and retrieve both indices and documents in Elasticsearch.

In the first section we will discuss how to manage indices and go through some examples, while the last three sections will focus on documents.

Managing indices

Suppose we need to store all the products available in a warehouse. The first thing to do is create an index.

An index can be created with the following request, where indexName stands for the name we would like to give the index: PUT indexName. For our example we will execute PUT products.
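For our example, the request and a typical acknowledgement response look like this (the response values are illustrative):

# Creating the index
PUT products

# Typical response (illustrative)
{
  "acknowledged": true,
  "shards_acknowledged": true,
  "index": "products"
}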

At index creation time we can specify the index settings such as the number of shards (chunks into which Elasticsearch divides indices, as before refer to this article) and the number of replicas for each shard. It can be done with the following request:

PUT indexName
{
  "settings": {
    "number_of_shards": 2,
    "number_of_replicas": 2
  }
}

Please remember that in a production environment it is good practice to stick to the default settings.

Indices can be deleted with a DELETE indexName request, e.g. DELETE products.

Creating documents

Once we create an index, we need to populate it with documents. Creating a document means indexing it, which can be done with the following request (note the POST verb, since we are letting Elasticsearch generate the document ID):

POST indexName/_doc
{
  "field_1": "value_1",
  "field_2": "value_2"
}

Elasticsearch will then create the _id field while storing the document. The _id uniquely identifies a document and by default is generated automatically. (As a side note, indexing into an index that does not exist yet creates that index on the fly; this behavior is controlled by the action.auto_create_index cluster setting.) Even with automatic generation available, we can specify the _id value ourselves when indexing a document:

PUT indexName/_doc/ID
{
  "field_1": "value_1",
  "field_2": "value_2"
}

Let’s index a couple of documents into our products index.

# Indexing a document with an auto-generated ID
POST /products/_doc
{
  "name": "Coffee Maker",
  "price": 64,
  "in_stock": 10
}

# Indexing a document with a custom ID
PUT /products/_doc/100
{
  "name": "Toaster",
  "price": 49,
  "in_stock": 4
}

How does Elasticsearch know where to save or search for documents? The answer is routing: the process of determining which shard a document resides in. When we index a document, Elasticsearch uses the default routing formula to calculate the shard on which the document should be saved:

shard_num = hash(_routing) % number_of_primary_shards

where the _routing parameter is a piece of document metadata equal to the _id by default.

This default routing strategy ensures that documents are distributed evenly over the shards of the index. Please note that the number of shards of an index cannot be changed once documents have been indexed, because the shard count appears in the routing formula: changing it would make the formula point to the wrong shard for existing documents.
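To build intuition, here is a minimal Python sketch of this routing idea. It is not Elasticsearch's actual implementation (which internally uses Murmur3 hashing); CRC32 simply stands in as a deterministic hash function:

```python
import zlib

def route(routing_value: str, num_primary_shards: int) -> int:
    """Pick a shard for a document, mimicking hash(_routing) % num_shards."""
    # zlib.crc32 is a stand-in for Elasticsearch's internal hash function.
    return zlib.crc32(routing_value.encode("utf-8")) % num_primary_shards

# Routing is deterministic: the same _id always lands on the same shard.
print(route("100", 2) == route("100", 2))  # True

# Changing the shard count moves documents to different shards, which is
# why the number of shards cannot be changed after indexing documents.
moved = [i for i in map(str, range(1000)) if route(i, 2) != route(i, 3)]
print(len(moved) > 0)  # True
```

Because the formula is deterministic, both reads and writes for a given _id always resolve to the same shard, with no lookup table needed.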

The storing process works as follows: the coordinating node routes the write request to the primary shard of the relevant replication group; the primary shard validates and indexes the document, and then forwards the operation to its replica shards in parallel.

Let’s recall some definitions from the previous article:

  • a coordination node is responsible for request processing by managing the delegation of the necessary work;
  • when a shard has been replicated one or more times, the original shard is referred to as the primary shard;
  • a replication group is the set given by a primary shard and all its replicas.

Since Elasticsearch is distributed and many operations happen asynchronously, a lot of things can go wrong, such as hardware failures. If such a failure comes at the wrong time, things can get ugly and Elasticsearch needs to be able to handle it, especially because both the risk and consequences increase the more nodes a cluster contains, and the more writes an index needs to handle. Elasticsearch solves those issues with something called primary terms and sequence numbers.

Primary terms are a way to distinguish between old and new primary shards when the primary shard of a replication group has changed: the primary term for a replication group is essentially a counter of how many times its primary shard has changed. A sequence number is a counter that is incremented for each write operation. Together they enable Elasticsearch to know in which order operations happened on a given primary shard.

However, with large indices, comparing every operation to bring shards back in sync would be slow, so Elasticsearch speeds the process up by maintaining global and local checkpoints. A global checkpoint belongs to a replication group and is the sequence number up to which all of its shards are aligned. A local checkpoint belongs to a single shard copy and is the sequence number of the last write operation it has performed.

Each document can go through many versions, but only the latest one is stored. The _version metadata field shows the version number of a document. If we delete a document, Elasticsearch retains its version number for 60 seconds by default, so if we index a document with the same _id within that window, the version will be incremented rather than reset to 1.
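For example, re-indexing a document under the same _id increments its version (responses trimmed to the relevant field; values illustrative):

# First indexing request: the response contains "_version": 1
PUT products/_doc/100
{
  "name": "Toaster",
  "price": 49,
  "in_stock": 4
}

# Indexing the same _id again: the response now contains "_version": 2
PUT products/_doc/100
{
  "name": "Toaster",
  "price": 49,
  "in_stock": 4
}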

Updating documents

Documents can be updated via different methods, depending on our needs. If we already know the _id of the document we need to update, we can use the Update API:

POST indexName/_update/ID
{
  "doc": {
    "fieldToUpdate": "newValue"
  }
}

The previous request can both update existing fields and create new ones for the document with _id equal to ID. Let’s see an example and update the document with _id 100:

# Updating an existing field
POST products/_update/100
{
  "doc": {
    "in_stock": 3
  }
}

# Adding a new field
POST products/_update/100
{
  "doc": {
    "tags": ["electronics"]
  }
}

Note that updating a document actually means replacing it since documents are immutable.

Another way to update documents with the Update API is the scripted update. For example, we can subtract 1 from the in_stock field of the document with _id 100 via the following request:

POST products/_update/100
{
  "script": {
    "source": "ctx._source.in_stock--"
  }
}

But what if we need to subtract a number other than 1? We can pass arguments to the script through the params key:

POST products/_update/100
{
  "script": {
    "source": "ctx._source.in_stock -= params.quantity",
    "params": {
      "quantity": 4
    }
  }
}

Let’s now take a moment to analyze the request output, using the last update request we wrote as an example.
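A typical response looks like the following (the exact values vary):

{
  "_index": "products",
  "_id": "100",
  "_version": 3,
  "result": "updated",
  "_shards": {
    "total": 3,
    "successful": 2,
    "failed": 0
  },
  "_seq_no": 7,
  "_primary_term": 1
}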

The response contains a result field which shows how the update went. Its value can be either "updated", meaning everything went fine and the document was modified, or "noop", meaning the document did not change and no operation was performed.

Note that a scripted update always reports "updated", even when nothing actually changed. It is possible to set ctx.op = 'noop' inside the script to make the "noop" result visible for scripted updates as well:

POST /products/_update/100
{
  "script": {
    "source": """
      if (ctx._source.in_stock == 0) {
        ctx.op = 'noop';
      }

      ctx._source.in_stock--;
    """
  }
}

Please also note that a scripted update only works if the document already exists; to create the document when it is missing, add an upsert block, whose content is indexed as a new document if the _id is not found:

POST /products/_update/101
{
  "script": {
    "source": "ctx._source.in_stock++"
  },
  "upsert": {
    "name": "Blender",
    "price": 399,
    "in_stock": 5
  }
}

We can also replace a document entirely using the same request we used to create it, specifying the _id of the document we want to overwrite:

PUT /products/_doc/100
{
  "name": "Toaster",
  "price": 79,
  "in_stock": 4
}

Another way to update documents is the Update By Query API, which applies a scripted update to every document matching a query. We will discuss queries in a later article, so for now just know that the match_all query retrieves all the documents in the specified index. Let's see a couple of examples.

POST products/_update_by_query
{
  "script": {
    "source": "ctx._source.in_stock--"
  },
  "query": {
    "match_all": {}
  }
}

# Ignoring version conflicts
POST products/_update_by_query
{
  "conflicts": "proceed",
  "script": {
    "source": "ctx._source.in_stock--"
  },
  "query": {
    "match_all": {}
  }
}

Internally, when an Update By Query request is executed, Elasticsearch first takes a snapshot of the index, then searches each shard for matching documents and updates them with bulk requests, batch by batch.

Note that if the update fails at some point, the process stops, but the documents already updated remain updated: there is no rollback.

The snapshot is used to detect conflicts caused by other write requests modifying a document between the snapshot and the update. By default such a version conflict aborts the execution; we can let it proceed instead by specifying "conflicts": "proceed" in the request body, as in the second example above.

Finally, the Bulk API allows us to index, update or delete many documents with a single request. The request body is in NDJSON format: each line is a JSON object, lines are separated by "\n" or "\r\n", and the last line must end with a newline as well. If one action fails, the remaining actions are still executed. Let's see some examples.

# Indexing documents
POST /_bulk
{ "index": { "_index": "products", "_id": 200 } }
{ "name": "Espresso Machine", "price": 199, "in_stock": 5 }
{ "create": { "_index": "products", "_id": 201 } }
{ "name": "Milk Frother", "price": 149, "in_stock": 14 }
# Updating and deleting documents
POST /_bulk
{ "update": { "_index": "products", "_id": 201 } }
{ "doc": { "price": 129 } }
{ "delete": { "_index": "products", "_id": 200 } }
# Specifying the index name in the request path
POST products/_bulk
{ "update": { "_id": 201 } }
{ "doc": { "price": 129 } }
{ "delete": { "_id": 200 } }
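When building a bulk request from code rather than by hand, the NDJSON body can be assembled as in the minimal Python sketch below (the action and document values are the ones from the examples above; the helper function is our own, not part of any client library):

```python
import json

def build_bulk_body(actions):
    """Serialize (action, document) pairs into an NDJSON bulk body.

    `document` may be None for actions without a body, such as delete.
    """
    lines = []
    for action, document in actions:
        lines.append(json.dumps(action))
        if document is not None:
            lines.append(json.dumps(document))
    # The Bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"

body = build_bulk_body([
    ({"update": {"_index": "products", "_id": 201}}, {"doc": {"price": 129}}),
    ({"delete": {"_index": "products", "_id": 200}}, None),
])
print(body.endswith("\n"))  # True
```

The resulting string can then be sent as the body of a POST /_bulk request with the Content-Type header set to application/x-ndjson.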

Retrieving and deleting documents

Knowing the document _id, it is possible to retrieve it and get all its fields by GET indexName/_doc/ID. For example, we can execute GET products/_doc/100 and get the product with ID equal to 100.
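The response contains the document metadata plus its fields under _source (values illustrative):

{
  "_index": "products",
  "_id": "100",
  "_version": 1,
  "_seq_no": 0,
  "_primary_term": 1,
  "found": true,
  "_source": {
    "name": "Toaster",
    "price": 49,
    "in_stock": 4
  }
}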

Let's see how data is read when we execute a GET request. The coordinating node uses routing to identify the replication group that holds the document; Adaptive Replica Selection (ARS) then picks the best available shard copy within that group to serve the read, and the response is sent back to the client by the coordinating node.

Documents can be deleted using either the DELETE indexName/_doc/ID command or the Delete By Query API. Here's an example of the latter:

POST products/_delete_by_query
{
  "query": {
    "match_all": {}
  }
}

Conclusion

In this article we discussed how to manage indices and documents, the basic blocks of every Elasticsearch implementation.

Curious to find out more? Subscribe to our publication and be sure not to miss the next article, in which you will learn how to handle all the Elasticsearch data types.
