Dating Elasticsearch

First encounter

When confronted with a new type of storage, one way to tackle the learning curve is to become familiar with a set of basic commands that let you find out by yourself what is there. In a PostgreSQL context, for example, this means learning variations of the "\d" meta-commands (an important one being "\dt", of course), as well as using a lot of "select count(*) from XXXX" to get quantitative results. Basic questions like "what" and "how much" are then easily answered, and one can move on to deeper usage with more assurance.

Elasticsearch is no different, and being able to "inspect" a cluster is always very useful. A nice thing about Elasticsearch is its API protocol: it's HTTP!
So one can easily experiment with curl or httpie (https://github.com/jkbrzt/httpie). Usually, the first command typed is simply a check to see if the cluster is alive. Let's imagine we have a cluster at an address ES_ADDRESS; the command will then be:

curl -XGET 'http://ES_ADDRESS:9200'
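With httpie, the equivalent is about as simple as it gets:

http GET http://ES_ADDRESS:9200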

Although 9200 is the usual Elasticsearch HTTP port, some clusters are configured differently (9300, for instance, is the default port of the internal transport protocol, not of the HTTP API), and in such a case one needs to adapt. The answer gives you a few pieces of information about the cluster itself, including which version is running.

{
  "status" : 200,
  "name" : "elasticsearch-node0",
  "cluster_name" : "es-cluster",
  "version" : {
    "number" : "1.7.3",
    "build_hash" : "05d4530971ef0ea46d0f4fa6ee64dbc8df659682",
    "build_timestamp" : "2015-10-15T09:14:17Z",
    "build_snapshot" : false,
    "lucene_version" : "4.10.4"
  },
  "tagline" : "You Know, for Search"
}

The node name can be configured; otherwise Elasticsearch will automatically provide one (which can even be funny!).

In Elasticsearch, the way to think about data should trigger a Pavlovian reflex around two words: INDEX/TYPE. Of course, there is much more to Elasticsearch than this, but it should definitely be the first thing in mind when looking around data stored in such a cluster. A list of indices can then be obtained with:

curl -XGET 'http://ES_ADDRESS:9200/_cat/indices'
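A useful touch: the _cat endpoints accept a v option that adds column headers to the output, which helps a lot when discovering what each column means:

curl -XGET 'http://ES_ADDRESS:9200/_cat/indices?v'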

Looking at how the data is stored inside ("the schema") can be done with:

curl -XGET 'http://ES_ADDRESS:9200/_mappings'

However, reading the result in this form can be difficult, as the returned JSON is pretty dense. This is where one definitely needs to keep in mind the "pretty" option. Note also that both _mapping and _mappings (plural) are valid… at least at this point.

curl -XGET 'http://ES_ADDRESS:9200/_mappings?pretty=true'

If Python is installed on your machine, another option is to use:

curl -XGET 'http://ES_ADDRESS:9200/_mappings' | python -m json.tool

Indeed, in some cases the pretty option will not pretty-print all returned fields, but only the first-level ones (for example the _source one when getting an object; we'll come back to this later). Unfortunately, even with all the pretty options, a lot of data can be returned, and reading it can be tedious. In such cases, Pavlov comes to the rescue and one needs to think: INDEX/TYPE. That is, practically:

curl -XGET 'http://ES_ADDRESS:9200/INDEX/_mappings?pretty=true'

and (please be aware of the singular form here):

curl -XGET 'http://ES_ADDRESS:9200/INDEX/TYPE/_mapping?pretty=true'

It’s flirt time!

Still with the idea of exploring the data in mind (think exploration rather than ingestion: apart from a brief look at deletion, this article does not write anything), the first way to get data is… to ask for it. Remember this is an HTTP API, so a GET with a path to the data element should work. And indeed, if we have an object stored in an index, under a given type, with the ID OBJECTID, one can get it with:

curl -XGET 'http://ES_ADDRESS:9200/INDEX/TYPE/OBJECTID?pretty=true'

In this case, the pretty option is not so helpful, as it returns something like this (note how the _source content stays on compact lines):

{
  "_index" : "INDEX",
  "_type" : "TYPE",
  "_id" : "OBJECTID",
  "_version" : 1349,
  "found" : true,
  "_source":{
    "key1":"value1","key2":"value2","key3":"value3","key4":"value4",
    "key5":"value5","key6":"value6","key7":"value7","key8":"value8"
  }
}

The _source field, where the useful data lies, can indeed be large! A way to reduce it is to ask only for the needed fields. There are a few ways to do this.

curl -XGET 'http://ES_ADDRESS:9200/INDEX/TYPE/OBJECTID?pretty=true&fields=field1'

will return the asked-for fields… but in a specific "fields" attribute:

{
  "_index" : "INDEX",
  "_type" : "TYPE",
  "_id" : "OBJECTID",
  "_version" : 1349,
  "found" : true,
  "fields" : {
    "field1" : [ "value1" ]
  }
}

This may not be what we want, especially for generic processing. Luckily a more modern option exists: source filtering.

curl -XGET 'http://ES_ADDRESS:9200/INDEX/TYPE/OBJECTID?pretty=true&_source_include=field1'

In this case, the returned data really will be a subset of the full document:

{
  "_index" : "INDEX",
  "_type" : "TYPE",
  "_id" : "OBJECTID",
  "_version" : 1349,
  "found" : true,
  "_source":{"field1":"value1"}
}

As the name hints, one can also use a _source_exclude option to remove a few fields. If only inclusion is needed, it is possible to use _source instead of _source_include.
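For instance, to fetch everything except one field (a sketch reusing the key8 field from the sample document above):

curl -XGET 'http://ES_ADDRESS:9200/INDEX/TYPE/OBJECTID?pretty=true&_source_exclude=key8'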

It is also possible to get multiple records at once through the mget API call. In that case, one should pass data in the body of the query (yes, yes, in a GET!) to specify the records to retrieve. The answer will be of the form {"docs":[ RECORD1, RECORD2 ]}.

curl -XGET 'http://ES_ADDRESS:9200/INDEX/TYPE/_mget?pretty=true&_source_include=field1' -d '{"ids" : ["OBJECTID1", "OBJECTID2"]}'

Deleting data is simply done with the DELETE HTTP verb. A nice thing is that it works at several granularities. For example, a whole index (and all the data within) can be dropped with:

curl -XDELETE 'http://ES_ADDRESS:9200/INDEX/'
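The same verb also works at the single-record level, and, at the other extreme, on the special _all name to reset a whole cluster by wiping every index at once (assuming the cluster configuration allows it; be very careful with that one). Sketches with the placeholders used so far:

curl -XDELETE 'http://ES_ADDRESS:9200/INDEX/TYPE/OBJECTID'
curl -XDELETE 'http://ES_ADDRESS:9200/_all'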

Counting data can also be done in multiple ways. For a given type, the easiest way is certainly to use the _count API.

curl -XGET 'http://ES_ADDRESS:9200/INDEX/TYPE/_count'

It will return data in a form like:

{"count":336908,"_shards":{"total":5,"successful":5,"failed":0}}

Not surprisingly, it can also be used on a full index:

curl -XGET 'http://ES_ADDRESS:9200/INDEX/_count'

Still, at this point we are running generic queries over many objects, and we are far from "asking for objects with given properties". After all, this is largely why we are here, as the word "Search" is part of the name of the technology we use. The _search API is basically used with a POST call, with the search criteria passed in the body of the request. (A simpler form using GET and a query string exists, but it offers only a limited subset of the query capabilities; it is shown just below.) We will not go through the full search DSL in this article, but simply learn how to do a basic request like "find objects that have this attribute equal to this value".
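For the record, the simpler query-string form looks like this (a sketch using the placeholder field and value of the examples below):

curl -XGET 'http://ES_ADDRESS:9200/INDEX/_search?q=FIELD:VALUE&pretty=true'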

From a high-level point of view there are two kinds of constructs in the Elasticsearch DSL: the query and the filter. One gives you a scored (weighted) answer, while the other acts as a boolean filter. The two can be combined in many ways, such as filtered queries, boolean combinations and so on… Again, we will not look at this part in detail in this article and will stay at a very high level.
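As one illustration of such a combination, here is a sketch of a filtered query in the 1.x DSL, combining a match query with a term filter (the field names are the placeholders used throughout this article):

curl -XPOST 'http://ES_ADDRESS:9200/INDEX/_search?pretty=true' -d '{"query" : {"filtered" : {"query" : {"match":{"FIELD":"VALUE"}}, "filter" : {"term":{"OTHERFIELD":"VALUE2"}}}}}'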

The main query we want to learn for now is one that looks for objects with a field matching a given value. We will search here over the full index.

curl -XPOST 'http://ES_ADDRESS:9200/INDEX/_search' -d '{"query" : {"match":{"FIELD":"VALUE"}}}'

or with a filter (do not hesitate to experiment with the two forms to get a feel for all the differences):

curl -XPOST 'http://ES_ADDRESS:9200/INDEX/_search' -d '{"filter" : {"term":{"FIELD":"VALUE"}}}'

If one wanted to restrict the search to a certain type of object, one would logically have:

curl -XPOST 'http://ES_ADDRESS:9200/INDEX/TYPE/_search' -d '{"query" : {"match":{"FIELD":"VALUE"}}}'

Search can also be conducted over multiple indices. This is very useful when daily indices are created for a given type of data. For example, a software stack may keep one main index for its core data sets (e.g. users, sessions…), and a daily one for data like metric beacons, named according to a pattern such as 'DAILYINDEX-YYYY-MM-DD'. So if one wants to search for data in all the daily indices (and in practice it is advised not to do this on a production cluster), it is possible to use a wildcard pattern like:

curl -XPOST 'http://ES_ADDRESS:9200/DAILYINDEX-*/TYPE/_search' -d '{"query" : {"match":{"FIELD":"VALUE"}}}'
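A comma-separated list of indices also works, for more selective multi-index searches (a sketch with hypothetical daily index names):

curl -XPOST 'http://ES_ADDRESS:9200/DAILYINDEX-2015-10-14,DAILYINDEX-2015-10-15/TYPE/_search' -d '{"query" : {"match":{"FIELD":"VALUE"}}}'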

Results are usually returned in the following JSON format:

{
  "took" : 244,
  "timed_out" : false,
  "_shards" : {
    "total" : 1640,
    "successful" : 1640,
    "failed" : 0
  },
  "hits" : {
    "total" : 678718,
    "max_score" : 13.094422,
    "hits" : [ {
      "_index" : "INDEX",
      "_type" : "TYPE",
      "_id" : "OBJECTID100",
      "_score" : 13.094422,
      "_source":{"FIELD":"VALUE","OTHERFIELD":"VALUE1"}
    }, {
      "_index" : "INDEX",
      "_type" : "TYPE",
      "_id" : "OBJECTID307",
      "_score" : 12.08772,
      "_source":{"FIELD":"VALUE","OTHERFIELD":"VALUE2"}
    },
    …
    ]
  }
}

At this point, it is useful to remember that the response from the Elasticsearch cluster is paginated, and only a limited number of records will be returned (10 by default). A naive way to get all the records (certainly not one to use in real life) would be to first count them, and then ask for that full count of records in one request. Again, keep in mind this is really bad practice, presented here for learning purposes only! Querying the count can be done in one of the two following ways:

curl -XPOST 'http://ES_ADDRESS:9200/INDEX/TYPE/_search?pretty=true&search_type=count' -d '{"query" : {"match":{"FIELD":"VALUE"}}}'

or the equivalent:

curl -XPOST 'http://ES_ADDRESS:9200/INDEX/TYPE/_search?pretty=true' -d '{"size":0, "query" : {"match":{"FIELD":"VALUE"}}}'

In both cases, the hits field of the answer will contain an empty hits array but a proper total. Redoing the query above with the size parameter set to the total of existing records will then return all of them, as sketched after the snippet below.

"hits" : {
  "total" : 678718,
  "max_score" : 0.0,
  "hits" : [ ]
}
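The follow-up query then simply reuses that total (once more, do not do this on a real data set):

curl -XPOST 'http://ES_ADDRESS:9200/INDEX/TYPE/_search?pretty=true' -d '{"size":678718, "query" : {"match":{"FIELD":"VALUE"}}}'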

The proper way of doing such a query leads to the scan and scroll type of request, which is explained in the next section.

Here comes the snake

Although being able to interact with Elasticsearch from the command line is really convenient, there always comes a time when programming allows one to go faster. Any simple HTTP library in any language would do the job, but some wrappers exist that can make life a little bit easier, especially for complex operations. We'll consider here Python as our language of choice, together with the elasticsearch library. It works with both Python 2.X and Python 3.X, and can be installed simply in the following way:

sudo pip install elasticsearch

Using the elasticsearch Python library is pretty straightforward. It stays very close to the original API, removes some verbosity in the expression of requests, and surfaces errors through exceptions. A typical usage would be:

import sys

from elasticsearch import Elasticsearch

es = Elasticsearch(hosts=[ES_ADDRESS])
query_data = {"query": {"match": {"FIELD": "VALUE"}}}

try:
    r = es.search(index='INDEX', doc_type='TYPE', body=query_data)
    # Grab the _source of the first hit
    first_result = r['hits']['hits'][0]['_source']
except Exception:
    sys.exit()

Now, where the library becomes more than useful is when it offers helpers that make the most complex Elasticsearch calls easy to use. A typical case is the scan-scroll search type (strictly speaking, the scan search type combined with the scroll API). This search mode allows retrieval of large amounts of results (think "cursor" in the SQL world, where data is retrieved a few rows at a time). The scroll parameter passed to Elasticsearch specifies how long the search context should be kept around between two calls.
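For reference, a hedged sketch of the raw exchange the helper encapsulates: the first call returns a _scroll_id, which is then sent back as the raw request body to the scroll endpoint, over and over, until no more hits come back (SCROLL_ID below stands for that returned value):

curl -XGET 'http://ES_ADDRESS:9200/INDEX/TYPE/_search?search_type=scan&scroll=10s' -d '{"query" : {"match":{"FIELD":"VALUE"}}}'
curl -XGET 'http://ES_ADDRESS:9200/_search/scroll?scroll=10s' -d 'SCROLL_ID'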

The Python elasticsearch library takes advantage of the properties of the language to return results through an iterator. A helper encapsulates the internal sequence of queries (reusing the scroll ID provided by the first query, and so on), and it goes the following way:

from elasticsearch import Elasticsearch
import elasticsearch.helpers

es = Elasticsearch(hosts=[ES_ADDRESS])
query_data = {"query": {"match": {"FIELD": "VALUE"}}}

for i, doc in enumerate(elasticsearch.helpers.scan(
        es, query_data, index='INDEX', doc_type='TYPE', scroll='10s')):
    print('==Result {idx}'.format(idx=i))
    print(doc)

Each document is returned in the now-familiar Elasticsearch form, with "_index", "_id", "_type" and "_source" keys.

Another case of helpful encapsulation is the bulk API. In a nutshell, it gives a way to do many index, update or delete operations in one call. However, it requires careful construction of the JSON data to be sent, which is very sensitive to things like newline placement: a misplaced \n can break the whole query! The Elasticsearch bulk helpers give a true feeling of the syntax to be used, while avoiding such traps.
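For reference, a hedged sketch of the raw bulk call for deletions (each action sits alone on its own line, and the body must end with a newline; --data-binary preserves the newlines, where plain -d might not):

curl -XPOST 'http://ES_ADDRESS:9200/_bulk' --data-binary '{"delete":{"_index":"INDEX","_type":"TYPE","_id":"OBJECTID1"}}
{"delete":{"_index":"INDEX","_type":"TYPE","_id":"OBJECTID2"}}
'

With the helper, the same kind of deletion goes the following way: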

from elasticsearch import Elasticsearch
import elasticsearch.helpers

def _delete_bulk_handler(index_name, type_name):
    # Do whatever is needed to find the IDs of all records to delete and enumerate them
    for a_record_to_delete_id in all_records_to_delete:
        yield {"_op_type": "delete", "_index": index_name,
               "_type": type_name, "_id": a_record_to_delete_id}

es = Elasticsearch(hosts=[ES_ADDRESS])
for success, error in elasticsearch.helpers.streaming_bulk(
        client=es, actions=_delete_bulk_handler(INDEX, TYPE)):
    if not success:
        print(error)

All helpers can of course be combined: for example, a streaming bulk can consume a scan helper to build a search-then-delete operation (this in fact exists under the name delete-by-query API, but that API will be removed in version 2.0 of Elasticsearch).
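A minimal sketch of that combination, reusing the placeholders from above (a learning aid, not a drop-in replacement for the real delete-by-query):

from elasticsearch import Elasticsearch
import elasticsearch.helpers

es = Elasticsearch(hosts=[ES_ADDRESS])
query_data = {"query": {"match": {"FIELD": "VALUE"}}}

def _matching_docs_as_delete_actions():
    # Scan over every matching document and turn each hit into a bulk delete action
    for doc in elasticsearch.helpers.scan(
            es, query_data, index='INDEX', doc_type='TYPE', scroll='10s'):
        yield {"_op_type": "delete", "_index": doc['_index'],
               "_type": doc['_type'], "_id": doc['_id']}

for success, error in elasticsearch.helpers.streaming_bulk(
        client=es, actions=_matching_docs_as_delete_actions()):
    if not success:
        print(error)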

Conclusion

It is clear that we have barely scratched the surface of what Elasticsearch can do, and more articles would certainly be needed, for example to provide in-depth coverage of the query DSL. The few commands given here should however make you feel more at ease experimenting with Elasticsearch, and tweaking it to get the results you want.
