Clickhouse full text search based on Manticore

Published in

DataDenys

6 min readJul 22, 2022

Clickhouse lacks full text search feature, and I’m not sure it should support something like that. If you want to search on text, there’s a lot of options to choose from. One of them is Manticore — opensource Sphinx Search fork. Let’s take a look on how to use Clickhouse and Manticore together.

Clickhouse table

Imagine we have a table with a text column and multiple numeric columns:

CREATE TABLE test ( `text` String, `a` UInt32, `b` UInt32 )
ENGINE = MergeTree ORDER BY (a, b)

And at some point we need to do a full text search based on text column. Even if we do not need “full” full text search, but only want to search based on substring, it’s recommended to use separate search index for that. Why? Because executing LIKE queries will kill Clickhouse performance.

As Clickhouse does great job inserting large amounts of data into tables, we would want to keep that benefit. That’s why creating full text index should be a background asynchronous process:

Since new data will arrive constantly, we need to do indexing from time to time. We have to decide on reindexing frequency based on common sense and requirements. Popular case is to do that daily, but it depends.

Populating table with test data

We have generated ~ 29.5 million rows for our table. text column valuesof different length are of the following distribution (we used bar() function for that):

As we can see, most of our 29.5 million docs are from 0 to 3k chars length. Our table takes around 16Gb on disk:

SELECT formatReadableSize(SUM(bytes)) size FROM system.parts
WHERE active AND table = 'test'

Gives us:

┌─size──────┐
│ 15.91 GiB │
└───────────┘

Search on text column works slow as expected:

SELECT count(*) FROM test WHERE text LIKE '%human%'

Takes way too much time:

Manticore index setup

Manticore search is available on Ubuntu (or take a look how to install it on your own shit):

apt install manticore

By default, Manticore works with real time search indexes. As Manticore supports Mysql protocol, the process of creating, populating index and searching data looks exactly like creating tables and inserting/selecting data.

Real time indexes is not our case, as we need to index data from different source and it’s much better to do indexing in big batches from performance point of view:

In order to enable plain indexes, we need the following config (available at /etc/manticoresearch/manticore.conf):

# searchd {...}source test_src {
  type = tsvpipe
  tsvpipe_command = clickhouse-client -q "SELECT crc64(text)-1, text FROM test FORMAT TSV"
  tsvpipe_field = text
}index test_idx {
  type   = plain
  source = test_src
  path   = /var/lib/search/test
}

Here we’ve used piping to redirect Clickhouse query execution right into Manticore indexer.

One issue with indexing is that each document must have it’s own ID to identify each document:

In Manticore IDs are signed non-zero 64-bit integers. If we don’t have any ID available (like in our case), we must generate one somehow. We’ve used crc64() (8 bytes is more than enough for our case) to generate intID based on text signature of text column.

Also, do not forget to remove data_dir param from default config searchd block to enable plain indexes.

Restart Manticore:

service manticore restart

In case of no errors let’s index our documents:

sudo -u manticore indexer --all --rotate

Here we ask Manticore indexer to reindex all of declared indexes (we have only one, but still) and send signal to search server to reload newely generated indexes. This is quite a heavy operation, took my desktop 40 minutes to finish (and created 25G index on disk):

using config file '/etc/manticoresearch/manticore.conf'...
indexing index 'test_idx'...
collected 29501000 docs, 25438.5 MB
creating lookup: 29501.0 Kdocs, 100.0% done
WARNING: sort_hits: increasing mem_limit may improve performance
sorted 3755.1 Mhits, 100.0% done
total 29501000 docs, 25 438 529 425 bytes
total 2373.637 sec, 10717109 bytes/sec, 12428.60 docs/sec
total 89908 reads, 6.419 sec, 132.1 kb/call avg, 0.0 msec/call avg
total 98425 writes, 33.644 sec, 443.2 kb/call avg, 0.3 msec/call avg
rotating indices: successfully sent SIGHUP to searchd (pid=657156).

Testing text search performance

Searching Manticore is a simple as executing SQL queries in Mysql. First, let’s connect to search server (listening on 9306 port by default) using mysql client:

mysql -h 127.0.0.1 -P 9306

We can list all available search indexes with SHOW TABLES:

Let’s try the same query we executed using Clickhouse in Manticore:

select count(*) from test_idx where match('human')

And we can see dramatic performance improvement:

We can see Manticore performs 100x better than Clickhouse LIKE query.

Never mind difference in results compared to Clickhouse. I’ve used crc32() instead of crc64() while testing this. And obviously crc32() gave around 3% duplicate values for ID, don’t be as dumb as me, use crc64().

Additional search filtering

In practice you almost certainly would like to have additional filtering capabilities. You should consider 2 options here:

SELECT…WHERE IN (…)

If your text search gives you small set of results (under couple of thousands rows), just add generated ID column to Clickhouse and then use SELECT ... WHERE id IN ([returned-by-manticore]).

Manticore attributes

Add Manticore attributes to be able to filter right in Manticore. Useful for cases when full text search alone gives you millions of documents or more. Attributes allows you to shrink down resultset from Manticore.

Adding attributes to Manticore is easy. Let’s add our a and b columns:

source test_src {
  type = tsvpipe
  tsvpipe_command = clickhouse-client -q "SELECT crc64(text)-1, text, a, b FROM test FORMAT TSV"
  tsvpipe_field = text
  tsvpipe_attr_uint = a
  tsvpipe_attr_uint = b
}

Now after reindexing, we would be able to additionally filter on a and b right in Manticore:

select count(*) from test_idx where match('human') and a > 100 and b < 200

Index optimization

Since indexing all original data from Clickhouse is quite a long process and can take hours or even days, you might want to use main/delta approach. In this scenario, you do full reindex rarely, but reindex small fresh portion of data frequently:

There’s also several indexer options you’d want to tune to improve your indexing performance.

Denys Golotiuk: I’m planning to write more on main/delta indexing schemas in separate article soon.

Summary

Using Manticore together with Clickhouse can be a good solution when you need full text search on Clickhouse data. Use plain indexes instead of real time ones as indexing big batches of text data will perform better. Consider using main/delta (articles on that soon) approach to make text index more time relevant. Do not forget to tune indexing process to improve performance.