Why and how to denormalize indexing with elasticsearch-rails

Federico Chavez · Published in Wolox · Apr 4, 2019 · 6 min read

The first project I worked on at Wolox was engaging because I had to implement Elasticsearch for geolocation and full-text search. Until that moment, I had only read about the technology but never applied it.

By the time I started to investigate and work on it, the data had been modeled by copying the exact table structure of the relational database into documents, and the first attempt was to make joins between documents. However, Elasticsearch does not support joins, since it is not a relational database.

So the questions were: How can you connect models? How can you combine data from different models? In this article, I will show you how to take advantage of Elasticsearch to achieve quick queries across multiple models by grouping them into a single index, in a way that is easier to process and read. This strategy is called denormalization.

What is denormalization?

Denormalization is a technique used to improve the response time of a query at the expense of adding redundant data. The main idea is to combine the data of two or more tables into a single index, in order to shape the expected response in advance and filter it more efficiently. The concept of denormalization applies not only to Elasticsearch, but also to document, column, or key-value databases.
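As a toy sketch with hypothetical data (plain Ruby hashes, using the ids this article assigns later: Oregon is state 30, credit card is payment type 1): in normalized form a sale stores only foreign keys, while the denormalized document embeds the related data so a query never has to join.

```ruby
# Normalized: the sale row holds only foreign keys, so answering
# "sales in Oregon paid by credit card" requires joins at query time.
normalized_sale = { id: 2, amount: 2000, state_id: 30, payment_type_id: 1 }

# Denormalized: the related rows are copied (redundantly) into the
# document itself, so the same question becomes a flat filter.
denormalized_sale = {
  id: 2,
  amount: 2000,
  state: { id: 30, name: 'Oregon' },
  payment_type: { id: 1, description: 'Credit card' }
}

# The filter now reads directly from the document; no joins needed.
denormalized_sale[:state][:name]      # => "Oregon"
denormalized_sale[:payment_type][:id] # => 1
```

The redundancy is the price: if Oregon is renamed, every embedded copy must be updated, which is exactly the synchronization problem addressed later in this article.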

We need to start from a normalized database, meaning a database in third normal form: if your database is not correctly modeled, you have a different problem, and this strategy will be useless.

Basically, you denormalize only the queries that need to be fast, not the whole database. This is important because in a relational database you model the entire business with entities and the relations among them, whereas here you model queries (we will see an example of this below).

You have to consider that Elasticsearch does not guarantee ACID properties; it sacrifices them to gain performance. This implies that it is the programmer's responsibility to keep the information consistent between Elasticsearch and Ruby on Rails.

How to implement denormalized indexing with Elasticsearch and Rails

This tutorial uses Ruby 2.4.4 and Elasticsearch 6.3.0. For the Elasticsearch integration with ActiveModel/ActiveRecord and Ruby on Rails, it uses the elasticsearch-rails gem, plus the elasticsearch-model gem to ease the integration with the model classes of a Ruby on Rails application.

For example, suppose we have the following ERD (the corresponding Rails models are listed below):

Imagine that you need to build the following report:

Return all the sales that:

  • Were made between 2014/01/01 and 2019/01/01.
  • Belong to the states of Oregon or Washington.
  • Were paid by credit card.
  • Were results of the campaigns in social media.
  • Had a budget greater than $1000.

If we build this query on a relational database, we need joins across six tables plus a filter on the sales of the last five years. That means processing hundreds of thousands of records if our application is used often. Even with indexes, the performance of this query degrades because the joins take too long.

Instead of doing nested joins, we will replace them with denormalized indexing in Elasticsearch to improve performance.

The models in Rails are:

# models/client.rb
class Client < ApplicationRecord
  has_many :sales
end

# models/branch.rb
class Branch < ApplicationRecord
  has_many :sales
  belongs_to :state
end

# models/state.rb
class State < ApplicationRecord
  has_many :branches
  belongs_to :country
end

# models/country.rb
class Country < ApplicationRecord
  has_many :states
end

# models/sale.rb
class Sale < ApplicationRecord
  belongs_to :client
  belongs_to :payment_type
  belongs_to :campaign
  belongs_to :branch
end

# models/payment_type.rb
class PaymentType < ApplicationRecord
  has_many :sales
end

# models/campaign_type.rb
class CampaignType < ApplicationRecord
  has_many :campaigns
end

# models/campaign.rb
class Campaign < ApplicationRecord
  has_many :sales
  belongs_to :campaign_type
end

To model this query in Elasticsearch, we will set it up for the expected response. To do this, we will add all the requested data in a single index in such a way that allows us to filter sales by:

  • State.
  • Date of the sale.
  • Payment type.
  • Campaign type.
  • Budget.

We will do it in the sale model.

# models/sale.rb
class Sale < ApplicationRecord
  include Elasticsearch::Model            # provides .mapping, .import, #__elasticsearch__
  include Elasticsearch::Model::Callbacks # keeps the index in sync on save/destroy

  belongs_to :client
  belongs_to :payment_type
  belongs_to :campaign
  belongs_to :branch
  has_one :state, through: :branch # expose state so as_indexed_json can embed it

  def as_indexed_json(_options = nil)
    as_json(
      only: [:id, :date, :amount],
      include: {
        state: {
          only: [:id, :name, :code]
        },
        payment_type: {
          only: [:id, :description]
        },
        campaign: {
          only: [:budget, :description],
          include: {
            campaign_type: { only: [:id, :description] }
          }
        }
      }
    )
  end

  mapping dynamic: :strict do
    indexes :id, type: 'integer'
    indexes :date, type: 'date'
    indexes :amount, type: 'float'
    indexes :state do
      indexes :id, type: 'integer'
      indexes :name, type: 'text'
      indexes :code, type: 'text'
    end
    indexes :payment_type do
      indexes :id, type: 'integer'
      indexes :description, type: 'text'
    end
    indexes :campaign do
      indexes :budget, type: 'float'
      indexes :description, type: 'text'
      indexes :campaign_type do
        indexes :id, type: 'integer'
        indexes :description, type: 'text'
      end
    end
  end
end

Then you have to create the index and import the data from the database to Elasticsearch:

Sale.__elasticsearch__.create_index!(force: true)

Sale.import

The #mapping method defines how the sale document and its fields will be indexed, including each field's type. The #as_indexed_json method serializes the model instance to the JSON that gets stored in the index.
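For a concrete sale (hypothetical values), the hash produced by #as_indexed_json, and therefore the document stored in the sales index, would look like this, mirroring the mapping above:

```ruby
# Sketch of the serialized document for one hypothetical sale,
# limited to the fields whitelisted in #as_indexed_json.
indexed_sale = {
  'id' => 2,
  'date' => '2018-09-21',
  'amount' => 2000.0,
  'state' => { 'id' => 22, 'name' => 'Washington', 'code' => 'WA' },
  'payment_type' => { 'id' => 1, 'description' => 'Credit card' },
  'campaign' => {
    'budget' => 3000.0,
    'description' => 'Facebook',
    'campaign_type' => { 'id' => 3, 'description' => 'social media' }
  }
}

# Every filter the report needs is answerable from this single document:
indexed_sale['campaign']['campaign_type']['id'] # => 3
indexed_sale['state']['code']                   # => "WA"
```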

You also have to keep both databases in sync, which implies adding logic to update operations so that Elasticsearch stays up to date. For example, if a field of the Sale model changes, such as the amount, the change will be reflected in the associated document in the sales index. But if the name of the state associated with a sale changes, it will not be reflected in the sales index. This happens because the elasticsearch-rails gem only updates the attributes of the Sale model, and state is not an attribute of Sale. You need a trigger to update the documents in Elasticsearch. For example, if a state's name changes, the code below will update all the sales associated with that state:

# models/state.rb
class State < ApplicationRecord
  # ...
  after_commit :update_sales

  private

  def update_sales
    Sale.update_states(self)
  end
end

# models/sale.rb
class Sale < ApplicationRecord
  # ...
  def self.update_states(state, options = {})
    options[:index] ||= index_name
    options[:type] ||= document_type
    options[:wait_for_completion] ||= false
    options[:body] = {
      conflicts: :proceed,
      query: {
        match: {
          'state.id': state.id
        }
      },
      script: {
        lang: :painless,
        source: 'ctx._source.state.name = params.state.name',
        params: { state: { name: state.name } }
      }
    }
    __elasticsearch__.client.update_by_query(options)
  end
end

Any change in the State model will execute #update_by_query on the associated sales. The #update_by_query method searches for all the sales that match the query:

query: {
  match: {
    'state.id': state.id
  }
}

And it executes the script that updates the state's name in all the sale documents found:

script: {
  lang: :painless,
  source: 'ctx._source.state.name = params.state.name',
  params: { state: { name: state.name } }
}

The line options[:wait_for_completion] ||= false tells Elasticsearch to perform the update asynchronously.

Finally, we will build the query for the report. Let us suppose that the id of the credit-card payment type is 1, the ids of the states Oregon and Washington are 30 and 22 respectively, and the id of the social-media campaign type is 3. Note the minimum_should_match: 1: when a bool query contains must clauses, the should clauses only influence scoring unless at least one of them is explicitly required:

GET sales/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "range": {
            "date": {
              "gte": "2014-01-01",
              "lte": "2019-01-01"
            }
          }
        },
        { "match": { "payment_type.id": 1 } },
        { "match": { "campaign.campaign_type.id": 3 } },
        { "range": { "campaign.budget": { "gte": 1000 } } }
      ],
      "should": [
        { "match": { "state.id": 30 } },
        { "match": { "state.id": 22 } }
      ],
      "minimum_should_match": 1
    }
  }
}

And we get this response:

{
  "took": 4,
  "timed_out": false,
  "_shards": {
    "total": …,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": …,
    "max_score": …,
    "hits": [
      {
        "_index": "sales",
        "_type": "sales",
        "_id": "2",
        "_score": …,
        "_source": {
          "id": 2,
          "date": "2018-09-21",
          "amount": 2000,
          "state": {
            "id": 22,
            "name": "Washington, D.C.",
            "code": "WA"
          },
          "payment_type": {
            "id": 1,
            "description": "Credit card"
          },
          "campaign": {
            "budget": 3000,
            "description": "Facebook",
            "campaign_type": {
              "id": 3,
              "description": "social media"
            }
          }
        }
      }
    ]
  }
}
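The same report can also be issued from Rails. A minimal sketch, assuming the ids above and the Sale model defined earlier: the query body is a plain Ruby hash, and Sale.search (provided by elasticsearch-model) sends it to the sales index.

```ruby
# The report query expressed as a Ruby hash, mirroring the JSON above.
report_query = {
  query: {
    bool: {
      must: [
        { range: { date: { gte: '2014-01-01', lte: '2019-01-01' } } },
        { match: { 'payment_type.id': 1 } },
        { match: { 'campaign.campaign_type.id': 3 } },
        { range: { 'campaign.budget': { gte: 1000 } } }
      ],
      should: [
        { match: { 'state.id': 30 } },
        { match: { 'state.id': 22 } }
      ],
      minimum_should_match: 1 # require Oregon OR Washington
    }
  }
}

# In the Rails console (assuming the index exists and has been imported):
# Sale.search(report_query).records
```

Keeping the query as a hash makes it easy to build the filters dynamically, for example from user-supplied report parameters.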

Conclusion

Grouping data from different models into a single index allows us to shape the response in a sophisticated way and speeds up complex queries. This helps us avoid joins across multiple tables in a relational database.

But as I mentioned before, this comes at a cost: you are adding redundant data, and therefore occupying more space on your server. Bear in mind that if you implement Elasticsearch, it is because you need high performance and big-data handling. You must determine whether buying more storage is worth it for your business.
