Why and how to denormalize indexing with elasticsearch-rails

Federico Chavez · Published in Wolox · Apr 4, 2019 · 6 min read

The first project I worked on at Wolox was engaging because I had to implement Elasticsearch for geolocation and full-text search. Until that moment, I had only read about the technology but never applied it.

By the time I started to investigate and work on it, the data had been modeled by copying the exact table structure of the relational database into documents, and the first attempt was to make joins between documents. However, Elasticsearch does not support joins, since it is not a relational database.

So the questions were: How can you connect models? How can you combine data from different models? In this article, I will show you how to take advantage of Elasticsearch to achieve quick queries across multiple models by grouping them into a single index, in a way that is easier to process and read. This strategy is called denormalization.

What is denormalization?

Denormalization is a technique used to improve the response time of a query at the expense of adding redundant data. The main idea is to combine the data of two or more tables into a single index, in order to shape the expected response in advance and filter it more efficiently. The concept of denormalization applies not only to Elasticsearch, but also to document, column, or key-value databases.
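As a toy sketch with hypothetical data (plain Ruby hashes, using the ids this article assigns later: Oregon is state 30, credit card is payment type 1): in normalized form a sale stores only foreign keys, while the denormalized document embeds the related data so a query never has to join.

```ruby
# Normalized: the sale row holds only foreign keys, so answering
# "sales in Oregon paid by credit card" requires joins at query time.
normalized_sale = { id: 2, amount: 2000, state_id: 30, payment_type_id: 1 }

# Denormalized: the related rows are copied (redundantly) into the
# document itself, so the same question becomes a flat filter.
denormalized_sale = {
  id: 2,
  amount: 2000,
  state: { id: 30, name: 'Oregon' },
  payment_type: { id: 1, description: 'Credit card' }
}

# The filter now reads directly from the document; no joins needed.
denormalized_sale[:state][:name]      # => "Oregon"
denormalized_sale[:payment_type][:id] # => 1
```

The redundancy is the price: if Oregon is renamed, every embedded copy must be updated, which is exactly the synchronization problem addressed later in this article.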

We need to start from a normalized database, meaning a database in third normal form: if your database is not correctly modeled, you have a different problem, and this strategy will be useless.

Basically, you denormalize only the queries that need to be fast, not the whole database. This is important because in a relational database you model the entire business with entities and the relations among them, whereas here you model queries (we will see an example of this below).

You have to consider that Elasticsearch does not guarantee ACID properties; it sacrifices them to gain performance. This implies that it is the programmer's responsibility to keep the information consistent between Elasticsearch and Ruby on Rails.

How to implement denormalized indexing with Elasticsearch and Rails

This tutorial uses Ruby 2.4.4 and Elasticsearch 6.3.0. For the Elasticsearch integration with ActiveModel/ActiveRecord and Ruby on Rails, it uses the elasticsearch-rails gem, plus the elasticsearch-model gem to ease the integration with the model classes of a Ruby on Rails application.

For example, suppose we have the following ERD (the corresponding Rails models are listed below):

Imagine that you need to build the following report:

Return all the sales that:

  • Were made between 2014/01/01 and 2019/01/01.
  • Belong to the states of Oregon or Washington.
  • Were paid by credit card.
  • Were results of the campaigns in social media.
  • Had a budget greater than $1000.

If we build this query on a relational database, we need joins across six tables plus a filter on the sales of the last five years. That means processing hundreds of thousands of records if our application is used often. Even with indexes, the performance of this query degrades because the joins take too long.

Instead of doing nested joins, we will replace them with denormalized indexing in Elasticsearch to improve performance.

The models in Rails are:

# models/client.rb
class Client < ApplicationRecord
  has_many :sales
end

# models/branch.rb
class Branch < ApplicationRecord
  has_many :sales
  belongs_to :state
end

# models/state.rb
class State < ApplicationRecord
  has_many :branches
  belongs_to :country
end

# models/country.rb
class Country < ApplicationRecord
  has_many :states
end

# models/sale.rb
class Sale < ApplicationRecord
  belongs_to :client
  belongs_to :payment_type
  belongs_to :campaign
  belongs_to :branch
end

# models/payment_type.rb
class PaymentType < ApplicationRecord
  has_many :sales
end

# models/campaign_type.rb
class CampaignType < ApplicationRecord
  has_many :campaigns
end

# models/campaign.rb
class Campaign < ApplicationRecord
  has_many :sales
  belongs_to :campaign_type
end

To model this query in Elasticsearch, we will set it up for the expected response. To do this, we will add all the requested data in a single index in such a way that allows us to filter sales by:

  • State.
  • Date of the sale.
  • Payment type.
  • Campaign type.
  • Budget.

We will do it in the sale model.

# models/sale.rb
class Sale < ApplicationRecord
  include Elasticsearch::Model            # provides .mapping, .import, #__elasticsearch__
  include Elasticsearch::Model::Callbacks # keeps the index in sync on save/destroy

  belongs_to :client
  belongs_to :payment_type
  belongs_to :campaign
  belongs_to :branch
  has_one :state, through: :branch # expose state so as_indexed_json can embed it

  def as_indexed_json(_options = nil)
    as_json(
      only: [:id, :date, :amount],
      include: {
        state: {
          only: [:id, :name, :code]
        },
        payment_type: {
          only: [:id, :description]
        },
        campaign: {
          only: [:budget, :description],
          include: {
            campaign_type: { only: [:id, :description] }
          }
        }
      }
    )
  end

  mapping dynamic: :strict do
    indexes :id, type: 'integer'
    indexes :date, type: 'date'
    indexes :amount, type: 'float'
    indexes :state do
      indexes :id, type: 'integer'
      indexes :name, type: 'text'
      indexes :code, type: 'text'
    end
    indexes :payment_type do
      indexes :id, type: 'integer'
      indexes :description, type: 'text'
    end
    indexes :campaign do
      indexes :budget, type: 'float'
      indexes :description, type: 'text'
      indexes :campaign_type do
        indexes :id, type: 'integer'
        indexes :description, type: 'text'
      end
    end
  end
end

Then you have to create the index and import the data from the database to Elasticsearch:

Sale.__elasticsearch__.create_index!(force: true)

Sale.import

The #mapping method defines how the sale document and its fields will be indexed, including each field's type. The #as_indexed_json method serializes the model instance to the JSON that gets stored in the index.
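For a concrete sale (hypothetical values), the hash produced by #as_indexed_json, and therefore the document stored in the sales index, would look like this, mirroring the mapping above:

```ruby
# Sketch of the serialized document for one hypothetical sale,
# limited to the fields whitelisted in #as_indexed_json.
indexed_sale = {
  'id' => 2,
  'date' => '2018-09-21',
  'amount' => 2000.0,
  'state' => { 'id' => 22, 'name' => 'Washington', 'code' => 'WA' },
  'payment_type' => { 'id' => 1, 'description' => 'Credit card' },
  'campaign' => {
    'budget' => 3000.0,
    'description' => 'Facebook',
    'campaign_type' => { 'id' => 3, 'description' => 'social media' }
  }
}

# Every filter the report needs is answerable from this single document:
indexed_sale['campaign']['campaign_type']['id'] # => 3
indexed_sale['state']['code']                   # => "WA"
```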

You also have to keep both databases in sync, which implies adding logic to update operations so that Elasticsearch stays up to date. For example, if a field of the Sale model changes, such as the amount, the change will be reflected in the associated document in the sales index. But if the name of the state associated with a sale changes, it will not be reflected in the sales index. This happens because the elasticsearch-rails gem only updates the attributes of the Sale model, and state is not an attribute of Sale. You need a trigger to update the documents in Elasticsearch. For example, if a state's name changes, the code below will update all the sales associated with that state:

# models/state.rb
class State < ApplicationRecord
  # ...
  after_commit :update_sales

  private

  def update_sales
    Sale.update_states(self)
  end
end

# models/sale.rb
class Sale < ApplicationRecord
  # ...
  def self.update_states(state, options = {})
    options[:index] ||= index_name
    options[:type] ||= document_type
    options[:wait_for_completion] ||= false
    options[:body] = {
      conflicts: :proceed,
      query: {
        match: {
          'state.id': state.id
        }
      },
      script: {
        lang: :painless,
        source: 'ctx._source.state.name = params.state.name',
        params: { state: { name: state.name } }
      }
    }
    __elasticsearch__.client.update_by_query(options)
  end
end

Any change in the State model will execute #update_by_query on the associated sales. The #update_by_query method searches for all the sales that match the query:

query: {
  match: {
    'state.id': state.id
  }
}

And it executes the script that updates the state's name in all the sale documents found:

script: {
  lang: :painless,
  source: 'ctx._source.state.name = params.state.name',
  params: { state: { name: state.name } }
}

The line options[:wait_for_completion] ||= false tells Elasticsearch to perform the update asynchronously.

Finally, we will build the query for the report. Let us suppose that the id of the credit-card payment type is 1, the ids of the states Oregon and Washington are 30 and 22 respectively, and the id of the social-media campaign type is 3. Note the minimum_should_match: 1: when a bool query contains must clauses, the should clauses only influence scoring unless at least one of them is explicitly required:

GET sales/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "range": {
            "date": {
              "gte": "2014-01-01",
              "lte": "2019-01-01"
            }
          }
        },
        { "match": { "payment_type.id": 1 } },
        { "match": { "campaign.campaign_type.id": 3 } },
        { "range": { "campaign.budget": { "gte": 1000 } } }
      ],
      "should": [
        { "match": { "state.id": 30 } },
        { "match": { "state.id": 22 } }
      ],
      "minimum_should_match": 1
    }
  }
}

And we get this response:

{
  "took": 4,
  "timed_out": false,
  "_shards": {
    "total": …,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": …,
    "max_score": …,
    "hits": [
      {
        "_index": "sales",
        "_type": "sales",
        "_id": "2",
        "_score": …,
        "_source": {
          "id": 2,
          "date": "2018-09-21",
          "amount": 2000,
          "state": {
            "id": 22,
            "name": "Washington, D.C.",
            "code": "WA"
          },
          "payment_type": {
            "id": 1,
            "description": "Credit card"
          },
          "campaign": {
            "budget": 3000,
            "description": "Facebook",
            "campaign_type": {
              "id": 3,
              "description": "social media"
            }
          }
        }
      }
    ]
  }
}
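The same report can also be issued from Rails. A minimal sketch, assuming the ids above and the Sale model defined earlier: the query body is a plain Ruby hash, and Sale.search (provided by elasticsearch-model) sends it to the sales index.

```ruby
# The report query expressed as a Ruby hash, mirroring the JSON above.
report_query = {
  query: {
    bool: {
      must: [
        { range: { date: { gte: '2014-01-01', lte: '2019-01-01' } } },
        { match: { 'payment_type.id': 1 } },
        { match: { 'campaign.campaign_type.id': 3 } },
        { range: { 'campaign.budget': { gte: 1000 } } }
      ],
      should: [
        { match: { 'state.id': 30 } },
        { match: { 'state.id': 22 } }
      ],
      minimum_should_match: 1 # require Oregon OR Washington
    }
  }
}

# In the Rails console (assuming the index exists and has been imported):
# Sale.search(report_query).records
```

Keeping the query as a hash makes it easy to build the filters dynamically, for example from user-supplied report parameters.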

Conclusion

Grouping data from different models into a single index allows us to shape the response in a sophisticated way and speeds up complex queries. This helps us avoid joins across multiple tables in a relational database.

But as I mentioned before, this comes at a cost: you are adding redundant data, and therefore occupying more space on your server. Bear in mind that if you implement Elasticsearch, it is because you need high performance and big-data handling. You must determine whether buying more storage is worth it for your business.
