Logstash — Denormalize documents (Part 1)

Ingrid Jardillier
2 min read · May 2, 2024


This is a three-part article. You can find the other parts of this article:

* Part 1: highlights the need for denormalization
* Part 2: exposes the problems of not using denormalization
* Part 3: shows how to implement denormalization

In this first part of Logstash — Denormalize documents, we will walk through a simple example that highlights the need for denormalization.

Feel free to use my ELK Docker Compose setup to reproduce this example: https://github.com/ijardillier/docker-elk

Why denormalization?

When we ingest data, we may need to transform it so that it is fully usable and relevant. Denormalization is a way of creating as many documents as there are items in an array field. This flattening improves querying on the resulting field.
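For example, taking one of the records we will use below, a document whose prize field contains two items would be split into two documents, each carrying a single prize:

{"surname":"Curie","prize":[{"year":1903,"category":"physics"},{"year":1911,"category":"chemistry"}]}

becomes:

{"surname":"Curie","prize":{"year":1903,"category":"physics"}}
{"surname":"Curie","prize":{"year":1911,"category":"chemistry"}}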

That’s what we will explain in the different parts of this article.

Simple example

Index template

This index template defines the mapping used to store prizes, with just a few fields, so we can understand what happens without denormalization:

PUT _index_template/prizes
{
  "index_patterns": ["prizes-*"],
  "template": {
    "mappings": {
      "properties": {
        "id": {
          "type": "long"
        },
        "firstname": {
          "type": "keyword",
          "ignore_above": 256
        },
        "surname": {
          "type": "keyword",
          "ignore_above": 256
        },
        "gender": {
          "type": "keyword",
          "ignore_above": 256
        },
        "prize": {
          "properties": {
            "category": {
              "type": "keyword",
              "ignore_above": 256
            },
            "year": {
              "type": "integer"
            }
          }
        }
      }
    }
  }
}
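You can quickly check that the template has been created (from Kibana Dev Tools, for instance):

GET _index_template/prizes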

Data

We have the following prizes.json file containing all our prizes:

{"id":1,"firstname":"Pierre","surname":"Curie","gender":"male","prize":[{"year":1903,"category":"physics"}]}
{"id":2,"firstname":"Marie","surname":"Curie","gender":"female","prize":[{"year":1903,"category":"physics"},{"year":1911,"category":"chemistry"}]}
{"id":3,"firstname":"Frédéric","surname":"Joliot","gender":"male","prize":[{"year":1935,"category":"chemistry"}]}
{"id":4,"firstname":"Irène","surname":"Joliot-Curie","gender":"female","prize":[{"year":1935,"category":"chemistry"}]}

You can see that one of our JSON objects (Marie Curie's) contains two prizes, in two different categories and in two different years.

Logstash configuration

This Logstash configuration simply reads the file as JSON content and sends the resulting documents to Elasticsearch:

input {
  file {
    id => "prizes"
    path => "/usr/share/logstash/pipeline/file/prizes.json"
    mode => "read"
    codec => "json"
    start_position => "beginning"
    sincedb_path => "/dev/null"
  }
}

filter {
  json {
    source => "message"
    remove_field => ["message"]
  }
  mutate {
    remove_field => ["@timestamp", "@version", "event", "host", "log"]
  }
}

output {
  stdout {
    codec => rubydebug { metadata => true }
  }
}

output {
  elasticsearch {
    index => "prizes-original"
    hosts => ["https://es01:9200","https://es02:9200","https://es03:9200"]
    ssl_certificate_authorities => ["/usr/share/logstash/certs/ca/ca.crt"]
    user => "elastic"
    password => "${ELASTIC_PASSWORD}"
  }
}

In this configuration, we:

  • read the file from the beginning and don’t use sincedb (each time you restart Logstash, it will re-read the file)
  • parse the JSON content to extract fields
  • keep only the useful fields so we can focus on the important stuff
  • send documents to stdout and to Elasticsearch, in a prizes-original index (see the quick check below).
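Once Logstash has processed the file, a simple search (run from Kibana Dev Tools, for instance) should return our four original documents:

GET prizes-original/_search
{
  "query": {
    "match_all": {}
  }
}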

The next part (Part 2) of this article will expose the problems of not using denormalization.
