Logstash pipeline from SQL Server to Elasticsearch

Amin Mirzaee
7 min read · Aug 23, 2021


One of my colleagues wanted to transfer the data of a table from SQL Server to Elasticsearch using Logstash, but he couldn't find a complete document on how to do it, so he asked me for help. Once it was working correctly, I decided to write a document for everyone who needs to solve a similar problem. This document is about that problem.

What is Logstash?

The Logstash website says “Logstash is a free and open server-side data processing pipeline that ingests data from a multitude of sources, transforms it, and then sends it to your favorite stash.” In my opinion, Logstash is a tool that helps you transfer or collect data from one or more sources (inputs) to one or more destinations (outputs), with some simple filters that you can apply in the pipeline.

Logstash uses a configuration file to define the pipeline. So, the only thing you have to do is write the configuration file and point Logstash at it. A Logstash configuration file usually has three parts: inputs and outputs, which must be defined, and filters, which can be defined if needed.

In our scenario, the input is JDBC and the output is Elasticsearch. The best documentation for these plugins is the official Logstash documentation; however, there are some tips and points I am going to cover for these plugins, especially JDBC.

Logstash Configuration

This is the Logstash configuration file to read from SQL Server and write to Elasticsearch.
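A minimal sketch of such a configuration is shown below. The database name MyDatabase, the People table and its Id column, the index name people, and all credentials and paths are placeholders for illustration; replace them with your own values. The individual options are explained in the sections that follow.

input {
  jdbc {
    # Placeholder connection details -- replace with your own server, database, and credentials
    jdbc_connection_string => "jdbc:sqlserver://.;databaseName=MyDatabase;"
    jdbc_user => "logstash_reader"
    jdbc_password => "StrongPassword"
    # Path to the driver jar; can be omitted if the jar was copied into ./logstash-core/lib/jars/
    jdbc_driver_library => "./logstash-core/lib/jars/mssql-jdbc-9.2.1.jre8.jar"
    jdbc_driver_class => "com.microsoft.sqlserver.jdbc.SQLServerDriver"
    # Read at most 1000 new rows per run, continuing from the last tracked Id
    statement => "SELECT TOP (1000) * FROM dbo.People WHERE Id > :sql_last_value ORDER BY Id"
    use_column_value => true
    tracking_column => "id"
    tracking_column_type => "numeric"
    last_run_metadata_path => "./config/.people_last_run"
    # Run the query every 30 seconds
    schedule => "*/30 * * * * *"
  }
}

output {
  elasticsearch {
    hosts => ["http://localhost:9200"]
    index => "people"
    user => "elastic"
    password => "ElasticPassword"
    # Upsert documents by id so rows that are read again update existing documents
    action => "update"
    doc_as_upsert => true
    document_id => "%{id}"
  }
}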

There are two parts in this configuration: input and output. You can use one or more plugin blocks in each part. Each block in the input part reads data from a source, and each block in the output part writes data to a destination. I am going to describe the configuration of each part and what it does.

1. input

In this section, you configure which sources must be read. You can define one or more sources here. For example, you can have more than one JDBC config, each of which reads from a different table.

1.1. jdbc

This is a Logstash plugin that lets you read data from a relational database. With this configuration, you can read data from databases like MySQL, PostgreSQL, SQL Server, etc. How the data is read, and how often, is configured in this section.

1.1.1. jdbc_connection_string

The connection string of the database you want to read from. It usually starts with the database type, which in my scenario is jdbc:sqlserver, followed by the server name, which is my local server (.). If you want to connect to another server, you must insert the database server IP instead of the dot. Another part of the connection string is databaseName, which specifies which database you want to read data from.

This option must be entered correctly and completely. If the server IP address or any other part is not entered correctly, the connection will not work.
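For example, a connection string for the local server and one for a remote server could look like this (MyDatabase and the IP address are made-up placeholders):

  jdbc_connection_string => "jdbc:sqlserver://.;databaseName=MyDatabase;"
  jdbc_connection_string => "jdbc:sqlserver://192.168.1.10:1433;databaseName=MyDatabase;"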

1.1.2. jdbc_user

The username of a user that has permission to run the query in the database. If you are using Windows Authentication on a local server, you can omit it by removing it from the configuration file or putting # at the beginning of the line. Still, I suggest using a dedicated user that only has permission to run select queries on the tables you are using.

1.1.3. jdbc_password

The password of the user that has permission to run the query in the database. As with jdbc_user, if you are using Windows Authentication on a local server, you can omit it by removing it from the configuration file or putting # at the beginning of the line.

1.1.4. jdbc_driver_library

JDBC uses a Java driver library to connect to and work with databases. You have to download the Microsoft JDBC Driver for SQL Server (other databases like PostgreSQL, MySQL, etc. have their own JDBC drivers, which you should download separately if you want to use the Logstash JDBC plugin with them).

After downloading and extracting the library, you should copy mssql-jdbc-9.2.1.jre8.jar to the path ./logstash-core/lib/jars/, or set the path of the file in this configuration option.
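If you prefer to keep the jar somewhere else, you can point the option at it instead (the path below is just an example location):

  jdbc_driver_library => "/opt/drivers/mssql-jdbc-9.2.1.jre8.jar"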

1.1.5. jdbc_driver_class

The JDBC plugin needs to know which driver class implements the driver interface. For the library linked above, it is com.microsoft.sqlserver.jdbc.SQLServerDriver.

1.1.6. statement

This section contains the query that Logstash runs to collect data from the database. The query can execute a stored procedure or a statement defined in this file. For this example, I used a select query that gets 1000 rows of the table on each run.

This part has a special feature that I like. If you want to read the data in chunks, or to track the table so that you pick up new rows as they are added, this plugin saves the last value of the tracking column and, when it runs the statement query, replaces :sql_last_value with that saved value. So on every run you have the last id of the table, which you can use in the statement query.
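For example, assuming a People table tracked by an ascending Id column, a chunked statement could look like this (ordering by the tracking column matters, because the plugin stores the tracking value from the last row it reads):

  statement => "SELECT TOP (1000) * FROM dbo.People WHERE Id > :sql_last_value ORDER BY Id"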

1.1.7. tracking_column

The name of the column whose last value you want to track and use. You must enter the column name in lower case; otherwise, it will not work.

1.1.8. tracking_column_type

The type of the tracking column, used to manage the last value. It can be numeric (the default) or timestamp.

1.1.9. use_column_value

If you want to use the tracking feature of this plugin, you must set this config to true.

1.1.10. last_run_metadata_path

The path of a file where the tracking value is stored, so it can still be used after Logstash is restarted.
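Putting the tracking options together for an integer Id column, the relevant part of the jdbc block could look like this (the metadata path is just an example location):

  use_column_value => true
  tracking_column => "id"
  tracking_column_type => "numeric"
  last_run_metadata_path => "./config/.people_last_run"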

1.1.11. schedule

Sometimes you want to control the calendar for executing the query: when the query starts and how it repeats. The JDBC plugin of Logstash uses rufus-scheduler to manage the scheduling of execution. I could not find any useful and complete documentation about how to configure this parameter in the JDBC plugin of Logstash, but I will write down what I figured out.

The scheduler config has 6 parts separated by spaces. The parts are (seconds) (minutes) (hours) (day of month) (month) (day of week), and a timezone can optionally be appended at the end. Each part can be *, a number, or another cron-style expression.

  • If it is *, it fires every time that unit comes around.
  • If it is a number, it fires when that specific value is reached.
  • If it is * divided by a number (*/n), it skips n-1 values and then fires, i.e. it fires every n units.
  • If it is two numbers separated by a dash (-), it fires for the values between those numbers.

To show how it works, I wrote some examples:

  • * * * * * *: Execute every second
  • */30 * * * * *: Execute every 30 seconds
  • 7 */10 * * * *: Execute at second 7 of every 10th minute
  • 0 20 4 * * *: Execute every day at 04:20:00 AM

You can use these parameters to run your query at specific times and intervals.
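For example, to run the query every 30 seconds, the option inside the jdbc block would look like this:

  schedule => "*/30 * * * * *"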

2. output

In this section, you configure which destinations must be written to. You can define one or more destinations here. All data will be transferred to all plugins defined in this section.

2.1. elasticsearch

This configuration explains where Elasticsearch is and how to connect and write to it.

2.1.1. hosts

The addresses of the Elasticsearch cluster are listed in this config, and it must be set.

2.1.2. index

The name of the Elasticsearch index that the data should be added to is set in this section.

2.1.3. user

The username of the Elasticsearch user that can write data to the index goes here. If your Elasticsearch does not have security and user management enabled, you can omit this config by removing it or adding # at the beginning of the line.

2.1.4. password

The password of the Elasticsearch user that can write data to the index goes here. Like the user, it can be omitted if your Elasticsearch does not have security and user management enabled.

2.1.5. doc_as_upsert

If this property is set to true, the document will be updated when the document id already exists; otherwise, the document will be inserted.

2.1.6. document_id

The field that is used as the document id for the upsert goes here.
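As a sketch, assuming an index called people and a column id used as the document id, the output block could look like this (note that in the elasticsearch output plugin, doc_as_upsert takes effect together with action => "update"):

  elasticsearch {
    hosts => ["http://localhost:9200"]
    index => "people"
    user => "elastic"                  # omit if security is disabled
    password => "YourElasticPassword"  # omit if security is disabled
    action => "update"                 # update existing documents by id
    doc_as_upsert => true              # insert the document if the id does not exist yet
    document_id => "%{id}"
  }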

How to run?

Before running Logstash, you should check that the config file and the SQL driver are located in the right paths. I usually save the config file in the `config` directory and the driver in logstash-core/lib/jars, and I recommend you use these paths as well. Finally, open a terminal, move to your Logstash path, and run these commands:

On Windows:

cd <<LogstashPath>>
.\bin\logstash.bat -f .\config\logstash.conf

On Linux:

cd <<LogstashPath>>
bin/logstash -f config/logstash.conf

After running the command, Logstash checks the pipeline, requirements, and configuration files. If everything is correct, Logstash prints each query it runs against SQL Server. If the scheduler is set, the query is repeated on each run and shown in the console; otherwise, the program finishes successfully after a single run.

Conclusion

Logstash is a useful tool for moving and managing data from many sources to many destinations. Only a few configurations need to be written to run Logstash. As this document shows, reading from SQL Server and writing to Elasticsearch is easy to configure and manage with Logstash. The only thing is that you should keep in mind the tips this document mentioned, and now you know them 😉

