How to sync your MySQL data to Elasticsearch

siddontang
Mar 26, 2017 · 4 min read

I have been using MySQL for many years. The performance is very very high in many scenarios, but not in full-test search or some sophisticated query. I considered introducing Sphinx to solve the problem before, but gave up quickly because I found it was not easy for me to learn and use Sphinx (or maybe I was too lazy at that time :simle: ). Luckily, I met Elaticsearch(ES).

ES uses Lucene to supply powerful quick search, supports horizontally scale, has a great web to analize and visualize your data, and above all, it is very easy to use. I can set up the ES cluster without knowing anything about search, about Java (aha, I don’t like Java!).

Although ES is my best choice here, it has no data to search, all my data are saved in MySQL. So the next problem is how can I sync my data from MySQL to ES in real time? If I insert, update, delete a row in MySQL, ES must know it immediately.

At first, I consider using MySQL trigger or UDF, of course, the way can’t work, so I focus on MySQL binlog.

MySQL Binlog

To sync data from MySQL to ES through binlog, I must only use row format binlog. If I use statement or mixed format, I can only know the execution query, can’t know the changed data.

Row format contains full, noblob and minimanl three imagetypes. The default image is full, and the last two are to reduce space. I prefer using full image because it has full complete data, and the space occupied seems not to be a big problem in large hard disk, and the binlog can be purged periodically.

Syncing MySQL binlog is very easy. Following MySQL replication protocol, I can write a client, register to MySQL master as a MySQL slave, receive MySQL binlog event continuously, then parse the binlog event and get the real changed data.

I don’t explain the concrete implementation here, you can know MySQL protocol, binlog events, etc. from MySQL Client/Server Protocol, and I have already supplied according replication feature in my open source project go-mysql.

MySQL dump

If we want to sync data from a empty MySQL, using binlog is easy, but it can’t work for a MySQL which running for a long time and the old binlogs are purged. If we sync with binlog directly, we will lose the early data.

We can refer the common backup way in MySQL to fix this problem, we can use mysqldump to retrieve the whole snapshot of the current running MySQL, parse the dump file and get all data, then sync binlog from the position saved in the dump file.

The whole flow has already been implemented in canal in go-mysql too.

go-mysql-elasticsearch

Here, we can sync MySQL data easily with go-mysql canal component, and what remains to do is easy, just sending the synced data from MySQl to ES.

Communicating with ES is simple, the ES API is restful style and the data format is JSON, so we can write a client quickly, with no difficulty.

What we only need to care is how to set the sync rule, that is, which databases and tables do we want to sync? how do we handle the changed row data from MySQL then send to ES? etc…

The sync rule is configured in a TOML format file. First, we need to set which tables we want to sync in source section like:

In the above example, we will sync some tables t, tfield, tfiler in test database. We can also use wildcard to match different tables, E.g, we may split data into multi tables t_0000 to t_1024 for sharding, all sharding tables have same schema and can be saved in one ES type, so we can use wildcard t_[0-9]{4} to match all these tables.

For a table, we must specify the according ES index and type. If not, we will use schema name as ES index and type.

In the above example, we will sync the data in table t to the test index and t type in ES (We can get data in ES like GET /test/t/1).

Customize Field mapping

By default, we will use MySQL table column name as ES field name, but we can use another name, like:

In the above example, we map the column id of the table tfiled to the field with name es_id, and map tags to es_tags.

You may also notice the tag list here, which mean coverting the column data to ES array type. This is commonly used in MySQL varchar type, we save data like “a,b,c”, and want to use [a, b, c] in ES.

Filter Field

We can also sync some columns to ES, like:

For the above table tfilter, we will only sync the data in id and name columns.

Aggregate multi tables

As we talk above, we can aggregate multi tables, like:

We will sync the table which name matches t_[0-9]{4} , of course, all the table schemas must be the same.

Epilogue

Using go-mysql-elasticsearch, we only need to define rule in the config file, then we can easily sync MySQL data to ES. For more powerful usage like parent-child relationship sync, please visit go-mysql-elasticsearch homepage.

Of course, go-mysql-elasticsearch is not perfect, it can’t work well with some DDL condition, needs more sync rule, etc. Welcome your advice and improvement.

Welcome to a place where words matter. On Medium, smart voices and original ideas take center stage - with no ads in sight. Watch
Follow all the topics you care about, and we’ll deliver the best stories for you to your homepage and inbox. Explore
Get unlimited access to the best stories on Medium — and support writers while you’re at it. Just $5/month. Upgrade