ElasticSearch RealTime Rivers With MSSQL Server

MindingData
Feb 8, 2015


In my day-to-day job, the biggest headache I have on my hands is how to index large amounts of data into ElasticSearch in real time. For the most part we use MongoDB at work, which allows us to use MongoDB's "oplog" to read transactions as they happen. The oplog, or "operation log", is basically a capped collection that works like a transaction log for the database. You can essentially "tail" this collection with no timeout, waiting for records to be inserted: more or less long polling a queue.

In MSSQL, we have no such luxury. Although we aren’t going to be moving to MSSQL anytime soon (Postgres seems to be the preferred option), I’ve worked with MSSQL for most of my programming life, so I wanted to work out what options I would have should I/we ever decide to go down that route.

Existing Solutions

Off-the-shelf rivers for MSSQL -> ElasticSearch are crap. Period. There is a "JDBC River" plugin for ElasticSearch floating around, but…
1. ElasticSearch river plugins are deprecated (or about to be).
2. They simply run a SELECT * on a timed interval to move data to ElasticSearch. That's not real time!

Onward and Upward

Database triggers seemed like the de facto choice for doing something in real time. The main problem is that the trigger has to be fast. A trigger is part of the transaction, so the trigger itself certainly can't be calling ElasticSearch (although… not going to lie… I did have a quick search on how to make HTTP requests from Transact-SQL). The solution quickly became somewhat clear.

My thought process for how it would work was…
1. An edit to a table fires a trigger.
2. That trigger writes to a queue/table (sketched after this list).
3. An outside application/service reads from that queue/table and fires the data to ElasticSearch.
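
To make that concrete, here is a minimal sketch of the table-based version, with a hypothetical Products table and a shared "river" table. All names and columns here are illustrative only, and the next paragraph explains why this design falls over.

```sql
-- Hypothetical shared "river" table that an outside service would poll.
CREATE TABLE RiverTable
(
    Id        INT IDENTITY(1,1) PRIMARY KEY,
    TableName SYSNAME  NOT NULL,
    RecordId  INT      NOT NULL,
    QueuedAt  DATETIME NOT NULL DEFAULT GETUTCDATE()
);
GO

-- Hypothetical trigger: any edit to Products appends the affected IDs.
CREATE TRIGGER trg_Products_River ON Products
AFTER INSERT, UPDATE, DELETE
AS
BEGIN
    SET NOCOUNT ON;
    -- "inserted" covers INSERTs/UPDATEs, "deleted" covers DELETEs/UPDATEs.
    INSERT INTO RiverTable (TableName, RecordId)
    SELECT 'Products', Id FROM inserted
    UNION
    SELECT 'Products', Id FROM deleted;
END;
```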

The problem with the trigger writing to a table, and an outside application then reading that table, was mostly speed. If multiple tables are being edited, they're all going to be writing to the same "river table", so locks are going to occur and grind everything down. As for the application reading it, it's going to have to poll continuously. I did find some nifty examples of using tables as queues, but it seemed somewhat backwards to work things like this.

Enter Service Broker

The Service Broker feature of MSSQL was extremely foreign to me. In fact, it was only by chance that I had heard about it a week prior from a colleague. Service Broker provides messaging and queuing between (and within) instances of MSSQL Server. Luckily, that's exactly what we needed (well… we needed the queuing functionality). What if the trigger on our table could write a message to such a queue? Since Service Broker is so close to the metal of the DB, it should be fast, right? Then our outside application can read from this queue and flick the data onto ElasticSearch! Woo! Now we're onto something.
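
For readers as new to this as I was: Service Broker needs a small pile of one-time setup, roughly a message type, a contract, a queue, and a service bound to that queue. A hedged sketch, with every name (MyDb, RiverMessage, RiverContract, RiverQueue, RiverService) purely illustrative:

```sql
-- Turn Service Broker on for the database (needs exclusive access,
-- hence the ROLLBACK IMMEDIATE).
ALTER DATABASE MyDb SET ENABLE_BROKER WITH ROLLBACK IMMEDIATE;
GO

-- A message type: payloads must be well-formed XML.
CREATE MESSAGE TYPE RiverMessage VALIDATION = WELL_FORMED_XML;

-- A contract saying who may send that message type on a conversation.
CREATE CONTRACT RiverContract (RiverMessage SENT BY INITIATOR);

-- The queue itself, plus a service bound to it through the contract.
CREATE QUEUE RiverQueue;
CREATE SERVICE RiverService ON QUEUE RiverQueue (RiverContract);
```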

Without describing the specific pains of working with the Service Broker, let's just say that it was a maze to get it going. There seems to be very little documentation on best practices, or on just how things are supposed to work. Every piece of code you write for the Service Broker is very, very boilerplate. I searched around for a C# library to interact with it, to no avail.

Up and Running

Once I managed to understand just how the Service Broker works, the process turned out to be very, very easy.

  1. A database trigger writes a message to a Service Broker queue. The message contains the ID of the record that was edited and the table it came from (a trigger along these lines is sketched after this list).
  2. A C# application long polls this queue until it gets a message (the RECEIVE statement it executes is also sketched below).
  3. Once it gets a message, it queries the database for that record using the ID and table name from the message. If the record can't be found, we know it was deleted, so we send a delete command to ES. If it can be found, we fetch the row and send it to ElasticSearch.
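
Here is roughly what step 1 could look like, building on the setup sketched earlier. The Products table and its Id column are still hypothetical, and a production version would reuse conversations rather than opening a new one per statement.

```sql
CREATE TRIGGER trg_Products_EsRiver ON Products
AFTER INSERT, UPDATE, DELETE
AS
BEGIN
    SET NOCOUNT ON;

    -- Collect the IDs touched by this statement into one XML message.
    DECLARE @message XML =
        (SELECT 'Products' AS TableName, Id AS RecordId
         FROM (SELECT Id FROM inserted
               UNION
               SELECT Id FROM deleted) AS changed
         FOR XML PATH('Change'), ROOT('Changes'), TYPE);

    IF @message IS NULL RETURN;  -- nothing actually changed

    -- Open a (self-addressed) conversation and send the message.
    DECLARE @handle UNIQUEIDENTIFIER;
    BEGIN DIALOG CONVERSATION @handle
        FROM SERVICE RiverService
        TO SERVICE 'RiverService'
        ON CONTRACT RiverContract
        WITH ENCRYPTION = OFF;

    SEND ON CONVERSATION @handle
        MESSAGE TYPE RiverMessage (@message);
END;
```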
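And step 2: the C# service doesn't busy-poll. It executes a WAITFOR(RECEIVE) statement, which blocks server-side until a message arrives or the timeout elapses. Something along these lines (the 5-second timeout is an arbitrary choice):

```sql
DECLARE @handle  UNIQUEIDENTIFIER,
        @msgType SYSNAME,
        @payload XML;

-- Blocks until a message arrives or 5 seconds pass (then it returns
-- nothing, and the application simply issues the statement again).
WAITFOR (
    RECEIVE TOP (1)
        @handle  = conversation_handle,
        @msgType = message_type_name,
        @payload = CAST(message_body AS XML)
    FROM RiverQueue
), TIMEOUT 5000;

SELECT @payload;  -- hand the change back to the application
```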

You can see my proof-of-concept code up on GitHub: https://github.com/mindingdata/mssql_elasticsearch_river

What's Next

There is so much that I need to fix with this that it's pretty unreal. But I'm happy that it works. Just to rattle off some real annoyances:

  1. The message should say what happened to the data (e.g. a deletion), so that the application doesn't need to query the DB to find out.
  2. It only supports an ID of type int (not that big of a deal in the MSSQL world… but still).
  3. The C# application is not transactional. It pops a record off the queue, and if anything fails, say goodbye to that message forever. This is bad. Very bad. (One possible fix is sketched after this list.)
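
For that third annoyance, one plausible fix is to wrap the RECEIVE in a transaction and only commit once ElasticSearch has accepted the change; a rollback puts the message back on the queue. A sketch of the idea (note that Service Broker disables a queue after five consecutive rollbacks, so a real version also needs poison-message handling):

```sql
DECLARE @payload XML;

BEGIN TRANSACTION;

WAITFOR (
    RECEIVE TOP (1) @payload = CAST(message_body AS XML)
    FROM RiverQueue
), TIMEOUT 5000;

-- ... the application indexes/deletes the document in ElasticSearch here ...

COMMIT TRANSACTION;    -- only once ES has accepted the change
-- ROLLBACK TRANSACTION instead if the ES call fails, re-queuing the message
```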

If you want to jump in and make this thing actually usable, feel free to throw in a pull request. Obviously all the hardcoded values aren't ideal, but for a POC I wasn't too worried about ease of configuration.
