ElasticSearch as a single data store, and why your find-by-email query does not work

I’m addicted to data.
There, I said it. I like collecting it, counting it and slicing it. But most of all, I like visualizing it. That’s why I decided to go with ElasticSearch for a recent data-focused project of mine.

It wasn’t long before I realized that most of the data I was storing was going to end up aggregated and grouped in many different ways when served back to the application. Therefore, selecting a data store that could serve this purpose was a top priority.

ElasticSearch

While many modern NoSQL databases support grouping, aggregations and map-reduce, ElasticSearch stood out for a few reasons:

  1. Unlike other NoSQL databases, ElasticSearch’s flexibility really comes into play when you’re not 100% sure of what questions (queries) you’re going to run against the data you collect.
  2. Out-of-the-box support for histograms and date interpolation saves me from doing time calculations at the application level.
  3. Previous experience from other projects.

SQL vs. NoSQL Schema Design
Way before JSON was common, and even before words like “non-relational” or “schema-less” turned into buzzwords, we were used to thinking in terms of rows, columns and normalized data. With every given challenge, we started by thinking about what different types of entities we were handling, and about their relationships via foreign keys and JOIN queries.

While NoSQL removed a few of these questions by introducing concepts like schema-less databases and data replication (not in terms of high availability, but in terms of keeping subsets of your data wherever you need it, to avoid JOINs), the one thing it did not solve was indexing.

So why not MongoDB?
MongoDB is a highly popular and commonly used NoSQL, schema-less database. It stores BSON (Binary JSON), supports replication, and has become a natural choice for many, especially JavaScript/NodeJS developers who use JSON all across their stack instead of casting their objects into rows and columns and back again when storing into SQL databases. With that said, when using MongoDB you’re still faced with a few challenges familiar from the relational database world:

  1. Schema design — Since we usually aim for fast “read” queries, we base our schema design on the questions our application is going to ask. That is actually a key stage in NoSQL schema design (not only for MongoDB). It’s not that rare that you’d plan a schema and then get a new product spec (or even a wireframe) from which you’ll understand schema changes are required, because we generally don’t want to (or simply can’t) “JOIN” things.
  2. Indexing — Following up on our schema, we now need to explicitly declare how we plan to retrieve our data, because we want our queries to perform at their best. So we use indexes, and if a scenario comes up where we want to query by more than one field, we create a compound index (see the sketch below). Sometimes we even create multiple compound indexes, because the flexibility required by our application’s needs is bound to change over time.
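
To make that second point concrete, here is a minimal sketch using the official mongodb NodeJS driver; the collection and field names are hypothetical:

// Sketch: a compound index supporting queries that filter by userId
// and sort by createdAt (hypothetical collection and field names).
var MongoClient = require("mongodb").MongoClient;

MongoClient.connect("mongodb://localhost:27017/mydb", function (err, db) {
  if (err) throw err;
  db.collection("events").createIndex({ userId: 1, createdAt: -1 }, function (err, indexName) {
    if (err) throw err;
    console.log("created index:", indexName);
    // A query like the following can now use the index:
    // db.collection("events").find({ userId: someId }).sort({ createdAt: -1 })
    db.close();
  });
});

Note that the field order in a compound index matters, which is exactly why developers end up having to construct their queries in a specific order to keep hitting the index.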

Back to the main topic. If I chose to go with MongoDB, I would very quickly find myself constantly adding more and more indexes to collections (tables) that contain massive amounts of data (millions of records), just to support potential feature requests that might go away a few weeks later. Another pain point is keeping track of which indexes are really needed, which can be dropped, and how each developer on my team needs to construct their queries in a specific order to avoid full-table scans. Maintenance is time consuming.

For the sake of a simple stack!
If most of my data is going to be aggregated anyway, and 95% of my collections ARE going to be stored in ElasticSearch, do I really need another database just to keep user records and accounts? It depends.

As always, the reality of the virtual world is not that simple. The answer to whether it is valid to use ElasticSearch alone, and have it keep track of everything that is not “search oriented”, depends on various factors; the main ones are your tolerance for data loss, the ability to easily modify existing records and, most importantly, eventual consistency.

I just saved it, but it’s not there!
ElasticSearch is “eventually consistent”. When you index (save) a document into ElasticSearch, that document is saved multiple times: on a shard and its replica(s). However, the information you save is only made available for search at the next index refresh.

An index refresh is an operation that makes the latest changes applied to an index available for search (meaning they’ll be reflected in search query results). ElasticSearch refreshes every index automatically based on its refresh interval, which is set to 1 second by default. But 1 second can sometimes be too long for your application.

For our use case, think of saving user accounts in ElasticSearch: if your app creates a user record and immediately tries to search for it, it will probably not be returned. Even if you’re blocking execution until the operation response arrives, it might still be too early! A simple way to bypass such cases is to tweak the data flow within your app; you could, for example, generate a user token and let the user through without explicitly fetching their record back. So there are simple ways to overcome these issues, but they’re still worth knowing about.
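
If you really do need read-after-write behavior for a specific operation, the index API also accepts a refresh parameter that forces a refresh before the call returns; use it sparingly, as frequent refreshes are expensive. A minimal sketch, assuming the same elasticSearchClient and the “users” index used throughout this post:

// Sketch: force a refresh so the new user record is immediately
// visible to search (costly if done on every write).
elasticSearchClient.index({
  index: "users",
  type: "sometype",
  refresh: true, // refresh the relevant shards before returning
  body: {
    email: "user@domain.com",
    createdAt: new Date()
  }
}, function (err, response) {
  if (err) return console.error(err);
  console.log("indexed document id:", response._id);
});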

OK, it should return by now
So you create a user account and tweak your application logic accordingly. Your user enjoys your application for a while and then logs out.
When they come back the next day, you want to verify their credentials by looking up their account with the email address provided in the login form, but guess what… it’s not there!

OK, I lied.
It is there, I swear. So why isn’t it returning?
Let’s take a look at this NodeJS query:

elasticSearchClient.search({
  index: "users",
  type: "sometype",
  body: {
    query: {
      bool: {
        must: [
          {
            term: {
              email: "user@domain.com"
            }
          }
        ]
      }
    }
  }
})
...

If you’re familiar with ElasticSearch, this should look OK to you. Right? Right!
Actually, there’s nothing wrong with this code; it will find the user you’re looking for… if (and only if) your schema is defined correctly!


Dynamic Mapping and Tokenizers
ElasticSearch is smart. In fact, when you index data into it, it will try to detect and determine the data type of each field using dynamic mapping.
It then uses tokenizers to break the indexed data into individual terms; for example, a textual sentence can be broken down into individual words, using whitespace as a separator to detect the individual search terms within it.

If this is a lot to take in, just remember that ElasticSearch does not “just” store the data you index; it also applies certain rules to it, manipulates and breaks it down into searchable terms, and stores that information next to the original payload.

Analyzed fields
So we now know ElasticSearch applies different kinds of “magic” to our data, but sometimes this works against our needs when we query for that information.
This is exactly the issue when relying on the default mapping to index information like an email address: ElasticSearch breaks it down into individual terms, using characters like “@” as dividers between words.

Consider this example from the official ElasticSearch documentation:
By default, indexing this text:

Email me at john.smith@global-international.com

will be broken down into the following terms:

[ Email, me, at, john.smith, global, international.com ]

It’s easy to see there’s no sign of the text ever containing an email address, and therefore the document will never be returned when querying for the address as an exact value.
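
You can reproduce this yourself with the _analyze API, which returns the terms produced for a given text. A minimal sketch (the exact request shape varies a bit between ElasticSearch and client versions):

// Sketch: inspect how the standard tokenizer breaks the text down.
elasticSearchClient.indices.analyze({
  body: {
    tokenizer: "standard",
    text: "Email me at john.smith@global-international.com"
  }
}, function (err, response) {
  if (err) return console.error(err);
  // Logs the individual terms, e.g. Email, me, at, john.smith, ...
  console.log(response.tokens.map(function (t) { return t.token; }));
});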

Mapping to the rescue

Fortunately, ElasticSearch lets us define our mapping manually, so we can tell it how we’d like it to store the information and, in turn, how we plan to search for that information.

When we define our mapping, we need to consider how exactly we are going to query for that email address later. Consider the two following scenarios:

  1. Finding documents whose body contains an email address within a paragraph of text
  2. Finding documents that have a field with a specific (and exact) email value

To find a partial substring that is an email address within a larger string, we can use the uax_url_email tokenizer; you can find the official example in the ElasticSearch documentation.
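
As a rough sketch (the index, analyzer and field names here are hypothetical), a custom analyzer built on uax_url_email keeps URLs and email addresses as single tokens inside larger blocks of text:

// Sketch: a custom analyzer that keeps email addresses intact within
// free text ("messages" and "email_analyzer" are hypothetical names).
elasticSearchClient.indices.create({
  index: "messages",
  body: {
    settings: {
      analysis: {
        analyzer: {
          email_analyzer: {
            type: "custom",
            tokenizer: "uax_url_email",
            filter: ["lowercase"]
          }
        }
      }
    },
    mappings: {
      sometype: {
        properties: {
          body: {
            type: "string",
            analyzer: "email_analyzer"
          }
        }
      }
    }
  }
});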

However, for our use case, which is finding a document by a specific email address’s exact value, we can take a different approach: telling ElasticSearch to store the string we provide AS IS!
No tokenizing, no breaking it down into anything; just use it as an exact value.

This can be achieved by manually defining the mapping for the index in the following way:

elasticSearchClient.indices.create({
  index: "users", // index name
  body: {
    mappings: {
      sometype: { // document type, matching the type we query below
        properties: {
          email: {
            // Prevent the email field from being analyzed
            type: "string",
            index: "not_analyzed"
          }
        }
      }
    }
  }
});

Now we can query for the document using the exact same query as before:

elasticSearchClient.search({
  index: "users",
  type: "sometype",
  body: {
    query: {
      bool: {
        must: [
          {
            term: {
              email: "user@domain.com"
            }
          }
        ]
      }
    }
  }
})
...

Now you know how to prepare your index mapping and query for documents by an exact email address value, preventing it from being analyzed by the default ElasticSearch tokenizer.

As for using ElasticSearch as a single data store… from my own experience, I’d suggest you consider an additional data store that lets you immediately query documents you’ve just added.


That’s a wrap! Drop me a line if you found this useful, or if you’d like to suggest any other examples to add here.

Eyal