Migrating data from Elasticsearch (OpenSearch) to S3 via Logstash

Sogo Ogundowole · Published in Hacktive Devs · Jul 19, 2022 · 4 min read

Data migration is one of the core components of Data Engineering today, because most of a data engineer's day-to-day activities revolve around creating and maintaining ETL/ELT data pipelines.

Data migration needs vary, but they ultimately depend on the questions end-users (or stakeholders) need answered. That could be gauging the performance of some services, sales or customer analysis, log analysis against certain metrics, backups, or even forecasting future outcomes from the available data. Whatever the need may be, the data has to find its way to an access point before stakeholders can make use of it.

What is Logstash?

According to AWS,

Logstash is a light-weight, open-source, server-side data processing pipeline that allows you to collect data from a variety of sources, transform it on the fly, and send it to your desired destination. It is most often used as a data pipeline for Elasticsearch, an open-source analytics and search engine.

It is a core part of the ELK stack (Elasticsearch, Logstash and Kibana) from Elastic.co.

Requirements

In this article, we will stick to two assumptions:

  • There is an existing S3 bucket.
  • There is also an existing Elasticsearch DB setup.

Logstash requires one of these Java versions (you can check which one you have with the command shown after the list):

  • Java 11.
  • Java 17 (see Using JDK 17 for settings info)
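A quick sanity check, assuming java is already on your PATH:

```
# prints the installed Java version; it should report 11 or 17
java -version
```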

When installing Logstash, note that a version conflict with your Elasticsearch version may cause errors, especially where OpenSearch is being used. For this article, Elasticsearch v7.11 (the managed AWS OpenSearch service) is being used; your own setup may be on a different version, or may not be OpenSearch at all.

The corresponding compatible version of Logstash is 7.10.2, which can be downloaded here. Once downloaded, extract the archive and move the folder to a location of your choice on your computer.

Setup

For this to work, a .conf file is needed to hold your migration configuration. The file has its own JSON-like syntax that is pretty easy to grasp, and it needs two mandatory sections: input and output. Below is a sample of what the .conf could look like:
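A minimal sketch of example.conf; the endpoint, credentials, index, query, bucket and prefix below are placeholders to replace with your own values:

```
input {
  elasticsearch {
    # placeholder endpoint for the managed OpenSearch/Elasticsearch domain (port 443)
    hosts    => ["https://my-domain.us-east-1.es.amazonaws.com:443"]
    user     => "my_es_user"
    password => "my_es_password"
    index    => "my-index"
    # pulls everything; replace with the query that returns the data you need
    query    => '{ "query": { "match_all": {} } }'
  }
}

output {
  s3 {
    access_key_id     => "MY_AWS_ACCESS_KEY"
    secret_access_key => "MY_AWS_SECRET_KEY"
    region            => "us-east-1"
    bucket            => "my-target-bucket"
    # folder path / naming convention inside the bucket, using the event timestamp
    prefix            => "elasticsearch-export/%{+YYYY-MM-dd}"
    # rotate the output file once it reaches roughly 50 MB
    size_file         => 50000000
    canned_acl        => "private"
    codec             => "json"
  }
}
```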

Breakdown of inputs and outputs

The input here is Elasticsearch because it is our source.

  1. hosts: the URL of the hosted DB; 443 is the port.
  2. query: the exact query that pulls the data we need from Elasticsearch. Note that the query will only work here if it also works in the Elasticsearch Dev Tools UI.
  3. user and password: the credentials of the Elasticsearch account.
  4. index: the target index on Elasticsearch.

The output is usually the destination of the data:
1. access_key_id: the AWS access key.
2. secret_access_key: the AWS secret key.
3. region: the AWS region the bucket lives in.
4. bucket: the S3 bucket name.
5. prefix: lets you specify a folder path inside your bucket, or simply a naming convention for your data. Here we use a timestamp convention to name the data.
6. size_file: the size in bytes at which the output file is rotated (partitioned), so that a single file doesn't grow too big.
7. canned_acl: the access control applied to the uploaded objects; it could be private, bucket-owner-full-control, etc.
8. codec: the output format; it could be json, csv, etc.

The official documentation covers the input and output plugins in more detail here.

Once this is done, the next step is to run Logstash with this .conf:
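For example, assuming the extracted folder is named logstash-7.10.2 and sits alongside example.conf (adjust the paths to match your machine):

```
# run Logstash against the migration config
./logstash-7.10.2/bin/logstash -f example.conf
```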

The command above assumes that Logstash lives in the extracted download folder and that you are running it from the directory containing example.conf.

The command will take some time to load and execute, depending on how large the data being pulled from Elasticsearch is. Also note that if your Elasticsearch is hosted on a private network, you may need to be connected to your VPN for the migration to succeed.

Environment Variables

To ensure no credential is left in plain text in the .conf file, the credentials can be exported into the working environment. For example:
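The variable names below are arbitrary; pick any names you like and mirror them in the .conf:

```
# export the credentials into the shell session that will run Logstash
export ES_USER="my_es_user"
export ES_PASSWORD="my_es_password"
export AWS_ACCESS_KEY_ID="my-aws-access-key"
export AWS_SECRET_ACCESS_KEY="my-aws-secret-key"
```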

The .conf file is in turn edited to reference those environment variables:
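Continuing the earlier sketch, the credential fields now read from the environment (Logstash expands ${VAR} references in the config at startup):

```
input {
  elasticsearch {
    hosts    => ["https://my-domain.us-east-1.es.amazonaws.com:443"]
    user     => "${ES_USER}"
    password => "${ES_PASSWORD}"
    index    => "my-index"
    query    => '{ "query": { "match_all": {} } }'
  }
}

output {
  s3 {
    access_key_id     => "${AWS_ACCESS_KEY_ID}"
    secret_access_key => "${AWS_SECRET_ACCESS_KEY}"
    region            => "us-east-1"
    bucket            => "my-target-bucket"
    prefix            => "elasticsearch-export/%{+YYYY-MM-dd}"
    size_file         => 50000000
    canned_acl        => "private"
    codec             => "json"
  }
}
```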


Filters

The filter is a transformation layer in Logstash that allows the input and output data to be manipulated to give the desired results. Suppose we would like to extract the _id of each doc from the index and add it as a field on each source object in the output JSON; the filtering would look like this:
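A sketch of that filter; it assumes docinfo => true has been added to the elasticsearch input (which keeps document metadata such as _id under the event's @metadata), and doc_id is just an illustrative field name:

```
filter {
  mutate {
    # copy the Elasticsearch document _id onto the event as a regular field;
    # requires docinfo => true on the elasticsearch input
    add_field => { "doc_id" => "%{[@metadata][_id]}" }
  }
}
```

With codec => "json" on the S3 output, each object written to the bucket will then include a doc_id field alongside the original _source fields.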


The filter layer is very useful, especially when you would like to handle some basic transformation or pull extra metadata into the source data. More on Logstash filtering can be read here.

Thanks for reading.
