Elasticsearch notes from trainings

Alexey Novakov
Jul 30, 2017 · 4 min read

Attending a training is almost always a positive experience, especially when the training is about Elasticsearch (ES from here on) and is run by Elastic, the company behind it.

I have gathered some notes from past trainings (Core and Advanced Developer) that I think could be useful for somebody else too.


Core Developer

Day 1

  1. _update document endpoint

Use it when you need to change documents that are already indexed. A partial update can be faster than sending and reindexing the same data again, especially in a clustered environment.
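A minimal sketch of a partial update request, built as plain Python data so that it can be sent with any HTTP client. The index, type, id and field names (`orders`, `order`, `42`, `status`) are made up for illustration, and the path assumes the ES 5.x layout `/{index}/{type}/{id}/_update`.

```python
import json

def build_update_request(index, doc_type, doc_id, changed_fields):
    """Return (method, path, body) for a partial document update."""
    path = "/{}/{}/{}/_update".format(index, doc_type, doc_id)
    # Only the changed fields are sent; ES merges them into the
    # existing _source of the document.
    body = {"doc": changed_fields}
    return "POST", path, json.dumps(body)

method, path, body = build_update_request(
    "orders", "order", "42", {"status": "shipped"}
)
print(method, path, body)
```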

2. GET _mget (multi get): can be used to avoid multiple round trips when requesting multiple records based on their _id.
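As a rough illustration, the body of a multi get request against a single index and type can be as small as a list of ids. The ids below are hypothetical.

```python
def build_mget_body(ids):
    """Body for GET /{index}/{type}/_mget: one request for many ids."""
    return {"ids": list(ids)}

# One round trip instead of three separate GET requests.
mget_body = build_mget_body(["1", "2", "3"])
print(mget_body)
```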

3. Override the operator in “match” if one needs the query terms linked with “and” instead of the default “or”

{
  "match": {
    "field_name": {
      "query": "word1 word2",
      "operator": "and",
      "minimum_should_match": "2"
    }
  }
}

The “and” operator can be used for an SQL-like query: condition1 and condition2. It does not help if we have nested conditions; then the bool query is the right choice.
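For nested conditions, a bool query can combine clauses. The sketch below builds "title contains word1 and word2, and the tag is either x or y"; the field names `title` and `tags` and the helper functions are hypothetical.

```python
def match(field, text, operator="or"):
    """A match clause; operator is 'or' (the default) or 'and'."""
    return {"match": {field: {"query": text, "operator": operator}}}

def all_of(*clauses):
    """Every clause has to match (SQL-like AND)."""
    return {"bool": {"must": list(clauses)}}

def any_of(*clauses):
    """At least one clause has to match (SQL-like OR)."""
    return {"bool": {"should": list(clauses), "minimum_should_match": 1}}

nested_query = {
    "query": all_of(
        match("title", "word1 word2", operator="and"),
        any_of(match("tags", "x"), match("tags", "y")),
    )
}
print(nested_query)
```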

4. Elasticsearch provides a migration plugin to help migrate to ES 5.0.

If you are still stuck on an older ES version, the migration plugin might be very handy. Client queries have to be analysed manually, because the migration plugin does not know about them.

5. New data types in ES 5.0: half_float (half precision) and scaled_float (saves space, as it is a long under the hood combined with a scaling factor; use case: a price field)
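A small sketch of what the scaling means in practice. The mapping uses a hypothetical `price` field with a scaling factor of 100 (two decimal places), and `stored_value` only imitates the "multiply and round to a long" step to show the idea; it is not ES code.

```python
price_mapping = {
    "properties": {
        "price": {"type": "scaled_float", "scaling_factor": 100},
        "rating": {"type": "half_float"},
    }
}

def stored_value(value, scaling_factor):
    """Imitate the idea: the value is multiplied by the factor and rounded to an integer."""
    return int(round(value * scaling_factor))

print(stored_value(19.99, 100))
```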

6. Percolator queries: ES can store queries and match them against provided documents. It is like an inverted flow: someone may be interested in which of the queries stored in ES match a provided document. For each stored query the result is boolean: it either matches the document or it does not.
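The inverted flow could look roughly like this. The type names `doctype` and `queries`, the field names and the sample texts are assumptions for illustration; `document_type` is included on the assumption of an ES 5.x cluster and may not be needed on other versions.

```python
# Mapping: one type describes the documents, the other holds stored queries.
percolator_mapping = {
    "mappings": {
        "doctype": {"properties": {"message": {"type": "text"}}},
        "queries": {"properties": {"query": {"type": "percolator"}}},
    }
}

# A stored query is indexed like a normal document.
stored_query = {"query": {"match": {"message": "bonsai tree"}}}

def percolate(document, field="query", document_type="doctype"):
    """Search body asking: which stored queries match this document?"""
    return {
        "query": {
            "percolate": {
                "field": field,
                "document_type": document_type,
                "document": document,
            }
        }
    }

percolate_body = percolate({"message": "A new bonsai tree in the office"})
print(percolate_body)
```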

7. The “_all” field includes the values of all document fields

It makes sense to disable the “_all” field to save disk space when searches always target specific fields

8. _meta

A special field in the mapping that can be used for comments/documentation. ES does not use it for search.
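A sketch of a mapping that disables “_all” and carries a “_meta” note, assuming the ES 5.x layout with a type level. The type name `order`, the field names and the contents of `_meta` are invented for the example.

```python
import json

order_mapping = {
    "mappings": {
        "order": {
            # Do not build the catch-all field, which saves disk space.
            "_all": {"enabled": False},
            # Free-form notes for humans; not used for search.
            "_meta": {
                "owner": "orders-team",
                "comment": "price is stored in EUR",
            },
            "properties": {
                "customer": {"type": "keyword"},
                "price": {"type": "scaled_float", "scaling_factor": 100},
            },
        }
    }
}
print(json.dumps(order_mapping, indent=2))
```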

9. Index management.

  • A common use case is to create one index per day (daily indexes). Customer orders are one example.
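Daily index names are usually derived from a prefix and the date. The naming pattern `orders-YYYY.MM.DD` below is only one possible convention, not something prescribed by ES.

```python
from datetime import date, timedelta

def daily_index(prefix, day):
    """Index name for one day, e.g. orders-2017.07.30."""
    return "{}-{:%Y.%m.%d}".format(prefix, day)

def last_n_days(prefix, today, n):
    """Index names for today and the n - 1 days before it."""
    return [daily_index(prefix, today - timedelta(days=i)) for i in range(n)]

recent = last_n_days("orders", date(2017, 7, 30), 3)
print(",".join(recent))  # usable as the index part of a search URL
```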

10. Phonetic encoder. Can be used on people's names to create phonetic "synonyms" during indexing. Use case: when a search for a person's name contains typos/mistakes, ES can still find the matching documents
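A sketch of index settings with a phonetic token filter, assuming the analysis-phonetic plugin is installed. The filter and analyzer names (`name_phonetic`, `person_name`) and the choice of the `double_metaphone` encoder are illustrative.

```python
phonetic_settings = {
    "settings": {
        "analysis": {
            "filter": {
                "name_phonetic": {
                    "type": "phonetic",
                    "encoder": "double_metaphone",
                    # Keep the original token next to the phonetic one.
                    "replace": False,
                }
            },
            "analyzer": {
                "person_name": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "name_phonetic"],
                }
            },
        }
    }
}
print(phonetic_settings["settings"]["analysis"]["analyzer"]["person_name"])
```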

Day 2

1. An Elasticsearch cluster uses a strong consistency model (or close to it). The “update” flow also resembles a distributed transaction. A client receives a successful response to a write operation once the cluster has successfully stored the data and replicated it across its replicas.

2. You can speed up indexing by temporarily disabling replication to other nodes.
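One way to do this, sketched here as an assumption, is to set the number of replicas to 0 before a heavy load and restore it afterwards. The bodies below are meant for PUT /{index}/_settings; the replica count of 1 is only an example.

```python
def replica_settings(count):
    """Settings body that changes the number of replicas of an index."""
    return {"index": {"number_of_replicas": count}}

before_bulk_load = replica_settings(0)  # no copies while loading
after_bulk_load = replica_settings(1)   # restore replication afterwards
print(before_bulk_load, after_bulk_load)
```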

3. One can use the URL parameter “preference=_local” to query local shards, if any are available, even if these local shards are replicas

4. Split Brain issue

minimum_master_nodes: the number of master-eligible nodes needed to form a quorum for electing a master node. It can be set at runtime. It is a very important setting that helps to avoid the “split brain” issue, where parts of the cluster each have their own master node.
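The usual quorum rule is "half of the master-eligible nodes plus one". A small sketch of the arithmetic and of a cluster settings body that applies it at runtime; the full setting name `discovery.zen.minimum_master_nodes` is assumed from the zen discovery module of that ES generation.

```python
def minimum_master_nodes(master_eligible_nodes):
    """Quorum: strictly more than half of the master-eligible nodes."""
    return master_eligible_nodes // 2 + 1

def cluster_settings_body(master_eligible_nodes):
    """Body for PUT /_cluster/settings."""
    return {
        "persistent": {
            "discovery.zen.minimum_master_nodes":
                minimum_master_nodes(master_eligible_nodes)
        }
    }

quorum_body = cluster_settings_body(3)
print(quorum_body)
```

With three master-eligible nodes the quorum is two, so a single isolated node can never elect itself.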

5. Sort by “_doc” to skip the default sorting by score.

6. Pagination: from + size is limited to 10 000 by default. Use scroll searches for unlimited pagination
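The scroll loop itself is simple: keep asking for the next page until a page comes back empty. The sketch below keeps the HTTP part out by taking two hypothetical callables, `first_page` and `next_page`, and is exercised here with fake in-memory responses that follow the `_scroll_id` / `hits.hits` response layout.

```python
# Body for the initial search; sorting by _doc is the cheapest order.
scroll_search_body = {"size": 1000, "sort": ["_doc"]}

def scroll_all(first_page, next_page):
    """Yield every hit. first_page() and next_page(scroll_id) return responses."""
    response = first_page()
    while response["hits"]["hits"]:
        for hit in response["hits"]["hits"]:
            yield hit
        response = next_page(response["_scroll_id"])

# Fake responses standing in for a real cluster.
fake_pages = iter([
    {"_scroll_id": "s1", "hits": {"hits": [{"_id": "1"}, {"_id": "2"}]}},
    {"_scroll_id": "s2", "hits": {"hits": [{"_id": "3"}]}},
    {"_scroll_id": "s3", "hits": {"hits": []}},
])
scrolled_ids = [
    hit["_id"]
    for hit in scroll_all(lambda: next(fake_pages),
                          lambda scroll_id: next(fake_pages))
]
print(scrolled_ids)
```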

7. Use size = 0 to speed up aggregations, so that ES does not need to collect hits from the cluster nodes
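For example, a search body that only asks for a terms aggregation could look like this. The aggregation name `orders_per_status` and the field `status` are hypothetical.

```python
aggregation_only = {
    "size": 0,  # return no hits, only the aggregation result
    "aggs": {
        "orders_per_status": {"terms": {"field": "status"}}
    },
}
print(aggregation_only)
```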

8. Parent child types

If the system is under frequent writes and they are supposed to be fast, then the parent_child mapping can be used. However, queries are going to be slower in that case, due to sparse data in ES

Otherwise, go with a flat structure in ES, without the parent_child mapping

Advanced Developer

  1. Painless scripts
  • Use params in a script instead of literals, so that ES can cache the compiled script instead of recompiling it for every new value.
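A sketch of the difference: the script text stays identical for every call and only `params` changes. The field `counter` and the parameter name `amount` are made up; the key `inline` is an assumption based on ES 5.x, and newer versions may expect `source` instead.

```python
SCRIPT_SOURCE = "ctx._source.counter += params.amount"

def increment_script(amount):
    """Update body whose script text never changes between calls."""
    return {
        "script": {
            "lang": "painless",
            "inline": SCRIPT_SOURCE,
            "params": {"amount": amount},
        }
    }

one = increment_script(1)
five = increment_script(5)
# Same script text, different params: the compiled script can be reused.
print(one["script"]["inline"] == five["script"]["inline"])
```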

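2. Updates in Elasticsearch work only with the _source field, because Elasticsearch is an abstraction on top of Apache Lucene: Lucene segments are immutable, so an update reads _source, applies the change and reindexes the whole document.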

3. Scoring is one of the expensive operations when querying ES. Use the “filter” clause when looking for a particular data set, as filter clauses skip scoring.
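A minimal example of a query that only filters and does not score. The field names `status` and `created` and the values are invented for illustration.

```python
filter_only_query = {
    "query": {
        "bool": {
            # Clauses under "filter" only answer yes/no; no score is computed.
            "filter": [
                {"term": {"status": "shipped"}},
                {"range": {"created": {"gte": "2017-07-01"}}},
            ]
        }
    }
}
print(filter_only_query)
```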

4. Multi-cluster search

The tribe node is going to be deprecated. Use a local cluster or a single node to query local and remote data clusters, using the cluster name as a special prefix on the index name, so that the client can explicitly control which cluster it wants to query.
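The prefix takes the form `cluster_name:index_name`. A small sketch that builds the search path for a mix of local and remote indexes; the cluster alias `europe` and the index name `orders` are hypothetical, and the remote cluster is assumed to be configured already.

```python
def search_path(*targets):
    """targets are (cluster, index) pairs; cluster None means the local cluster."""
    names = [
        index if cluster is None else "{}:{}".format(cluster, index)
        for cluster, index in targets
    ]
    return "/{}/_search".format(",".join(names))

cross_cluster_path = search_path((None, "orders"), ("europe", "orders"))
print(cross_cluster_path)
```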

5. Percolator queries do not accumulate any state. A percolator resembles a stored function/procedure that can be triggered and returns a boolean value answering whether the input document matches the stored query or not.

6. Geo aggregation by geo shapes is expensive when doing a “distance” aggregation by diameter. Try to use a box shape search/aggregation instead of a diameter to get better query performance.
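One way to swap a distance search for a box, sketched under simplifying assumptions: the box is the square that encloses the circle, one degree of latitude is taken as roughly 111.32 km, and the poles and the antimeridian are ignored. The field name `location` and the coordinates are made up.

```python
import math

KM_PER_DEGREE_LAT = 111.32  # rough average, good enough for a sketch

def bounding_box(lat, lon, radius_km):
    """Square around (lat, lon) that encloses a circle of radius_km."""
    dlat = radius_km / KM_PER_DEGREE_LAT
    dlon = radius_km / (KM_PER_DEGREE_LAT * math.cos(math.radians(lat)))
    return {
        "top_left": {"lat": lat + dlat, "lon": lon - dlon},
        "bottom_right": {"lat": lat - dlat, "lon": lon + dlon},
    }

def box_filter(field, lat, lon, radius_km):
    """geo_bounding_box clause to use in place of a distance clause."""
    return {"geo_bounding_box": {field: bounding_box(lat, lon, radius_km)}}

geo_clause = box_filter("location", 52.52, 13.405, 10)
print(geo_clause)
```

The box returns a few more points than the circle would (the corners), which is often an acceptable trade for the cheaper check.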

SE Notes by Alexey Novakov

Software Engineering notes on Scala, JVM and other goodies
