ElasticSearch (ELK) for database people

Raj Samuel · Published in Nerd For Tech · Sep 23, 2021

ElasticSearch is Google search for the enterprise. It is a "database" widely used by enterprises with customer-facing apps that need low response times when searching structured enterprise data and textual data.

What other databases don't solve
Both SQL and NoSQL databases require special design considerations to allow fast searches on the data they hold. All of those come down to two solutions: indexes and partitions. Indexes enable needle-in-a-haystack searches. Partitions enable bucketing to narrow down the search scope. Many products allow combining these two techniques in interesting ways.

Amazon Redshift implements something called a zone map, somewhat similar to an index in implementation but logically closer to partition pruning. When a search is requested, the zone map skips blocks of database records where the requested record cannot be present. Redshift does this by storing records sorted on the zone-mapped column, which lets it track that column's upper and lower bounds on each database block; those bounds are then used for skipping.

Database search — drawback and workarounds
Each of these techniques, indexes and partitions, has its own derivatives depending on the database product, but all of them point to one limitation: having to know in advance which column(s) have a higher chance of being searched. The database is deliberately designed with this knowledge, in both SQL and NoSQL systems. So new search criteria require design changes that are sometimes just not doable.

Some workarounds have been popular to mitigate this. One method is creating as many indexes as you have search columns. Another is creating as many tables as you need partition schemes for narrow searches. A third, and the predominant method in web apps, is to cache data closer to the application server (products like Coherence, in-memory extensions, etc.).

The drawback in all of these workarounds is the need to keep additional data structures in sync, impacting response times of every database write operation.

ElasticSearch
ElasticSearch is a product that indexes all fields all the time. There are no tables or constraints, just the index.
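
For example, a record (a document) can be written without declaring any schema first; ES infers a mapping and indexes every field it finds. A minimal sketch, run from Kibana (the people index and its fields are made up for illustration):

PUT people/_doc/1
{
  "first_name": "Tom",
  "last_name": "Walker",
  "city": "Boston"
}

All three fields are searchable immediately; there is no CREATE TABLE and no CREATE INDEX step.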

If you have an app that requires searching on an arbitrary number of fields, holds a large amount of data, and may even contain textual data, ElasticSearch provides a good solution.

However, ElasticSearch is just that: an elastic way to search. All the other things we have come to expect from a database, like transaction consistency and data integrity, aren't available. Web applications that provide search usually have two database systems: one (or many) for the application's data, and an ElasticSearch database. A Logstash process keeps the data in ElasticSearch synced up with the main databases.

A simple example of search is autocomplete: when a user types in the first few letters of a city name, the UI fills it in with a list of potential cities that begin with those characters. The app does that using ElasticSearch, as sketched below. All other data requests go to the primary database(s).
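
A sketch of such a query, assuming a cities index with a city_name field (a match_phrase_prefix query is the simplest way to express "begins with"; production autocomplete often uses ES's completion suggester or edge n-grams instead):

GET cities/_search
{
  "query": {
    "match_phrase_prefix": {
      "city_name": "new"
    }
  }
}

Depending on what's indexed, this would return New York, New Orleans, Newark and so on.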

ELK Stack
The ELK stack consists of:

  • ElasticSearch, the server that holds data records (documents) in an index so that apps can run searches via HTTP requests (non-HTTP access is also supported via a proprietary binary protocol developed by Elastic)
  • Logstash, an ETL interface for ES that picks up data from external sources like MySQL and performs CRUD operations on ES
  • Kibana, a visualization tool that lets you query and do manual or ad hoc CRUD operations on ES.

AWS ElasticSearch (now OpenSearch) comes with ES and Kibana. You have to install Logstash separately on another server (EC2 or elsewhere) or use AWS Lambda for automated loads.

Example ES query run on Kibana
GET _search
{
  "query": {
    "match": {
      "first_name": "Tom"
    }
  }
}
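
The response is JSON too. Abbreviated, and with a shape that varies slightly by ES version, it looks something like this; matching documents come back under hits.hits, each carrying its original JSON in _source:

{
  "took": 4,
  "hits": {
    "total": { "value": 1, "relation": "eq" },
    "hits": [
      {
        "_index": "people",
        "_id": "1",
        "_score": 0.9,
        "_source": { "first_name": "Tom", "last_name": "Walker" }
      }
    ]
  }
}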

Example Logstash config
input {
  jdbc {
    jdbc_connection_string => "jdbc:mysql://1189-webservr-drupal-01.mince.com:3306/rsys?useSSL=false"
    jdbc_user => "test"
    jdbc_password => "E!test@"
    jdbc_driver_library => "/usr/share/logstash/driver/mysql-connector-java-5.1.42-bin.jar"
    jdbc_driver_class => "com.mysql.jdbc.Driver"
    schedule => "* * * * *"    # cron-style schedule: poll MySQL every minute
    statement => "SELECT * FROM city WHERE updated_at > :sql_last_value"
    tracking_column => "updated_at"    # its last value is saved and injected as :sql_last_value
    tracking_column_type => "timestamp"
    use_column_value => true
    type => "city"
  }
}

output {
  if [type] == "city" {
    elasticsearch {
      user => "estest"
      password => "test%&es"
      hosts => "elasticsearch:9200"
      index => "cities"
      document_type => "document"
      document_id => "%{rx_ipid}"    # use a column from the source row as the document id
    }
  }
}

The input configuration specifies where to read from, which records to read, and so on. The output configuration has the settings for the output, that is, writing to ES. The config file can be created under any filename (e.g. config.txt) in the Logstash install folder, for example /etc/logstash/bin/config.txt. It is usually what counts for "code" in ElasticSearch and is part of the application's code pipeline.
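
To try the pipeline out, Logstash is pointed at the config file with its -f flag (the path here matches the example above):

bin/logstash -f /etc/logstash/bin/config.txt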
