How we improved search performance by 2x

By Tobi Knaup

If you haven’t used our moving map search recently, you should check it out now, because we made it more than twice as fast! In fact, every search is faster now, but it’s most noticeable in map mode. So how did we do that? Let’s start with some background on our setup. We’re a Rails shop, and we use Sphinx as our search engine. The two are connected through ThinkingSphinx, an excellent Ruby gem that provides an easy-to-use query interface and a DSL for defining indexes. The queries we run are a little different from those of an average website, because every single one filters results with spatial constraints (latitude/longitude). We also make heavy use of facets for the various filter options, such as room type, neighborhood, or amenities.

Why was it slow?

Sphinx works great for most common use cases, but it’s not optimized for spatial queries. While it gives you some basic functions to query and rank by distance, it doesn’t perform any spatial indexing. The latitude and longitude fields are just floats, and spatial queries have to scan the whole index, which is of course neither performant nor scalable. It also turns out that the configuration generated by ThinkingSphinx doesn’t allow Sphinx to make use of multiple processor cores. While it may sound like this setup doesn’t fit our performance requirements at all, Sphinx is very fast in general. Rewriting or switching to a different engine wasn’t an option for us at the time, so we wanted to make surgical changes to get the maximum out of it. We got help from Vlad and Rich, who are experts in tuning Sphinx.
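To see why a full scan hurts, consider what a per-row distance filter has to compute when there is no spatial index: a great-circle distance for every single document. Here is a minimal Ruby sketch of the haversine formula (our own illustration, not Sphinx’s actual internals):

```ruby
# Great-circle distance in meters between two lat/lng points (haversine).
# Without a spatial index, something like this must run once per document
# in the index for every query -- an O(n) scan.
EARTH_RADIUS_M = 6_371_000.0

def haversine_distance(lat1, lng1, lat2, lng2)
  to_rad = Math::PI / 180.0
  dlat = (lat2 - lat1) * to_rad
  dlng = (lng2 - lng1) * to_rad
  a = Math.sin(dlat / 2)**2 +
      Math.cos(lat1 * to_rad) * Math.cos(lat2 * to_rad) * Math.sin(dlng / 2)**2
  2 * EARTH_RADIUS_M * Math.asin(Math.sqrt(a))
end
```

A spatial index (such as an R-tree or geohash buckets) avoids this by pruning most candidates before any distance is computed, which is exactly what Sphinx doesn’t do for us here.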

How we optimized it

The first objective was to allow Sphinx to use all available processor cores. To achieve this, we split the search index into multiple parts and configured Sphinx to use them as a distributed index. Sphinx then uses one thread to search each partial index, and merges the results afterwards. Here is an example configuration snippet that makes use of two cores:

searchd
{
  …
  dist_threads = 2
  …
}

source h_core_0
{
  …
  sql_query = SELECT … FROM hostings WHERE id % 2 = 0
  …
}

source h_core_1 : h_core_0
{
  sql_query = SELECT … FROM hostings WHERE id % 2 = 1
}

index h_core_0
{
  source = h_core_0
  path = /home/sphinx/db/h_core_0
}

index h_core_1
{
  source = h_core_1
  path = /home/sphinx/db/h_core_1
}

index h_core
{
  type = distributed
  local = h_core_0
  local = h_core_1
}

What’s important here is to set dist_threads to the number of processor cores, and to configure one partial index per core. It’s easy to split your data into multiple indexes if you have an id column with auto_increment: simply use the mod operator (%) in the source config blocks. Another big performance boost came from upgrading Sphinx from 0.9.9 to 2.0. It’s currently in “stable beta”, which basically means that core features are production quality, whereas some newly added features might be less tested. The Sphinxsearch guys recommended it, and since we weren’t using any of the cutting-edge features, we felt confident using it in production. The only downside to these changes is that we had to say goodbye to the ThinkingSphinx index configuration DSL, which doesn’t support these advanced settings.
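The mod-based split is easy to reason about: every id lands in exactly one partial index, and together the partitions cover every row. A quick Ruby sketch of that invariant (purely illustrative, assuming auto_increment ids):

```ruby
# Mirror the `WHERE id % N = k` source queries: route each id to one
# of N partial indexes, one per processor core.
def partition_for(id, num_cores)
  id % num_cores
end

ids = (1..10).to_a
partitions = ids.group_by { |id| partition_for(id, 2) }
# partitions[0] holds the even ids, partitions[1] the odd ids;
# no id appears in more than one partition.
```

Because the partitions are disjoint and complete, searchd can scan them in parallel (one thread each, up to dist_threads) and merge the results without deduplication.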


The future

There are a few ways to get even more performance out of Sphinx. It has its own query language, SphinxQL, which allows you to bundle queries and execute them together. This is really helpful for combining multiple facet queries. It would require major changes in our app and getting rid of ThinkingSphinx though, so we’ll save that for a later date. Another way to get more parallelism and scalability is to split the index across multiple machines. This works similarly to same-machine distributed indexes and is easy to set up. Although Sphinx has been great for us so far, the lack of spatial indexing will become a problem at some point. We’re currently exploring other architectures that provide this feature. Stay tuned.
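As a rough illustration of what bundled facet queries could look like, here is a sketch that builds several SphinxQL GROUP BY statements intended for a single batched round trip. The index name, attribute names, and the batching detail are assumptions for illustration, not our production code:

```ruby
# Build one batch of SphinxQL facet queries -- one GROUP BY per facet --
# so a single round trip can replace several separate searches.
# Index and attribute names here are hypothetical; real keyword input
# would also need escaping before interpolation.
FACETS = %w[room_type neighborhood]

def facet_queries(index, keywords)
  FACETS.map do |attr|
    "SELECT #{attr}, COUNT(*) FROM #{index} " \
    "WHERE MATCH('#{keywords}') GROUP BY #{attr}"
  end
end

queries = facet_queries("h_core", "loft")
```

Since searchd’s SphinxQL listener speaks the MySQL wire protocol (port 9306 by default), a client such as the mysql2 gem could then send these statements to Sphinx together.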

Check out all of our open source projects over at and follow us on Twitter: @AirbnbEng + @AirbnbData

Originally published at on September 30, 2011.