How we built Search at Kit

Published in Kit Blog, Mar 17, 2016

Kit launched in November as a platform for people to talk about the products they use and love. Creators have joined the site to recommend a diverse range of products from podcasting tools to coffee beans to canoes to ice cream to whatever this is. Our team decided it was time to build search.

I’m hoping this will be useful to other engineers who are evaluating the current landscape of search offerings, and help them navigate some of the obstacles in implementing a solution.

Technical Selection

Kit’s stack is hosted entirely on AWS, so I wanted to look at the different search products Amazon offers before considering something external. Amazon’s first search offering was Cloudsearch, and their newer offering as of October of last year is Elasticsearch. They share a lot in terms of features as they are ultimately both derived from Apache Lucene, but once you get into an actual implementation, the differences become more stark.

Cloudsearch is very attractive, given our very small team size, because it is easy to set up and comes with many tools out of the box that provide visibility into the data and query logic. It also performs many operations like reindexing automatically.

Elasticsearch requires a little more work to set up, even within the Amazon managed environment, because indexes must be created and maintained by hand. However, Elasticsearch supports more data types than Cloudsearch. It also provides a great deal of control over the tokenization and analysis of your search queries, and ultimately gives a developer control over the search experience that Cloudsearch does not. Elasticsearch also offers a number of advanced features, such as “more like this,” which finds documents similar to a given one, and “did you mean,” which suggests alternative searches when a user’s initial query doesn’t match any known terms.

Ultimately I felt that the advanced features and customizability offered by Elasticsearch made it worth the extra investment.

Implementation

One of the big advantages of Elasticsearch over Cloudsearch is that you can run it locally instead of relying on connectivity to the AWS infrastructure and paying for development instances. If you are going to run Elasticsearch locally, note that AWS does not run the latest version of Elasticsearch. As of this writing, the latest version of Elasticsearch is 2.2.1, and AWS only offers 1.5.2. To avoid discrepancies between your development and production environments, make sure you install a version of Elasticsearch earlier than 2.0.0 when testing your code locally. Pro tip for Mac users: if you’re using Homebrew, the following installs Elasticsearch 1.7, which is pretty close:

brew install homebrew/versions/elasticsearch17

Indexing documents

I’m a big fan of Scala: it has a concise, functional syntax and still allows you to use any library made for Java. I found plenty of Elasticsearch client libraries had already been written for Java, and I settled on the Jest API client. My examples below use a client created with the JestClientFactory, but the operations described should be generic enough to apply to any Elasticsearch client.

We have three primary document types to search: the recommendations of products, the collections of those recommendations, and then the users themselves. I think of our document types kind of like a database schema. They can be changed in the future, and the metadata around them might change as well — the type of data stored in the field, if it should be searchable, and if it should be returned to a search client. For example, we might decide that our username field should be searchable, but the image associated with that user should not:

"username": {
   "type": "string",
   "include_in_all": true
},
"imageUrl": {
   "type": "string",
   "include_in_all": false
}

We store this mapping in our codebase, alongside things like our database evolutions. You can then apply the mapping to your index under a specified type, “User” in this case, using a client created with the Jest client factory:

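import java.io.File
import io.searchbox.indices.mapping.PutMapping

// read the mapping JSON that lives alongside the code
val userMapping = scala.io.Source.fromFile(new File("conf/mappings/user.json")).getLines().mkString("")
val putMapping = new PutMapping.Builder(
   "index", // the name of your index
   "User", // the name of your type
   userMapping
).build()
// client is the JestClient created with the JestClientFactory
client.execute(putMapping)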

After you’ve set up your index, you can insert your documents. Make sure you send them in batches: you’ll hit an HTTP max content length error if you try to send all your data at once, and it’s the kind of thing you might only encounter in production once you have a certain amount of data.
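One way to batch the documents, sketched here with the standard library only: group them into fixed-size chunks and build one newline-delimited body per chunk in the shape the Elasticsearch `_bulk` endpoint expects. The `Doc` wrapper, the `BulkBatches` object, and the batch size are hypothetical names and values chosen for illustration, not Kit’s actual code, and the sketch assumes ids need no JSON escaping and each document’s JSON is already on a single line.

```scala
object BulkBatches {
  // Hypothetical wrapper: a document id plus its JSON source on a single line.
  final case class Doc(id: String, json: String)

  // Builds one _bulk body: an action line followed by the source line, per document.
  // Assumes ids contain no characters that need JSON escaping.
  def bulkBody(index: String, docType: String, docs: Seq[Doc]): String =
    docs.map { d =>
      val action = s"""{"index":{"_index":"$index","_type":"$docType","_id":"${d.id}"}}"""
      action + "\n" + d.json + "\n"
    }.mkString

  // Splits the documents into groups of at most batchSize, one body per group.
  def batches(index: String, docType: String, docs: Seq[Doc], batchSize: Int): List[String] =
    docs.grouped(batchSize).map(bulkBody(index, docType, _)).toList
}
```

Each resulting body would be sent as its own request. A batch size that keeps every body comfortably below the cluster’s HTTP content length limit has to be found by measuring your own documents.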

You may also want to think about reindexing documents periodically in case your data types or mappings change. It’s not possible to reinitialize the mapping on an index without first deleting all the existing documents. To work around this, you can query your documents through an alias to your index, and set up a system that rebuilds the index from scratch under a new name and repoints the alias when the rebuild is complete. Because the old and new indexes exist side by side during the rebuild, make sure your Elasticsearch cluster’s instance types have at least twice the disk space your data needs. For more details see this guide on index aliases.
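The alias switch at the end of a rebuild can be sketched as a single request body for the `_aliases` endpoint, which removes the alias from the old index and adds it to the new one in one call. The `AliasSwap` object and the versioned naming scheme are assumptions made for this example.

```scala
object AliasSwap {
  // Hypothetical naming scheme: the alias name plus a version suffix.
  def versionedName(alias: String, version: Long): String = s"${alias}_v$version"

  // Body for POST /_aliases that repoints `alias` from oldIndex to newIndex in one request.
  def swapBody(alias: String, oldIndex: String, newIndex: String): String =
    s"""{"actions":[{"remove":{"index":"$oldIndex","alias":"$alias"}},{"add":{"index":"$newIndex","alias":"$alias"}}]}"""
}
```

Queries keep targeting the alias name throughout, so search clients never need to know which versioned index is live. The old index can be deleted once the swap has succeeded.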

Querying documents

With this setup, querying documents works in much the same way. You construct a query in JSON using the Elasticsearch query language and use your client to execute it:

import io.searchbox.core.Search

val queryJson = s"""{
   "query": {
      "query_string": {
         "fields": ["_all"],
         "query": "bacon sandwich",
         "default_operator": "AND"
      }
   },
   "highlight": {
      "fields": {
         "*": {}
      }
   }
}"""
val search = new Search.Builder(queryJson)
   // multiple indices or types can be added.
   .addIndex(SearchIndexer.index)
   // you can specify how many records should be returned
   .setParameter("size", 200)
   .build()
// run the search with the same client used for indexing
val result = client.execute(search)
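In a real search box the query text comes from the user instead of being hard-coded, so it needs escaping before it is placed inside the JSON string. The following is a minimal sketch of that step using only the standard library. `QueryJson`, `escapeJson`, and `queryString` are hypothetical helpers, not part of Jest or Elasticsearch.

```scala
object QueryJson {
  // Escapes a string so it can sit safely inside a JSON string literal.
  def escapeJson(s: String): String =
    s.map {
      case '"'          => "\\\""
      case '\\'         => "\\\\"
      case '\n'         => "\\n"
      case '\r'         => "\\r"
      case '\t'         => "\\t"
      case c if c < ' ' => "\\u" + "%04x".format(c.toInt)
      case c            => c.toString
    }.mkString

  // Builds the same query_string query as above around escaped user input.
  def queryString(userInput: String): String =
    s"""{"query":{"query_string":{"fields":["_all"],"query":"${escapeJson(userInput)}","default_operator":"AND"}}}"""
}
```

This only keeps the JSON well-formed. The `query_string` query still interprets its own operator syntax in the text, so input with unbalanced quotes or stray operators may need separate handling.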

Security

Now that I’ve sung its praises, I do have to point out a glaring flaw in the AWS implementation of Elasticsearch. It lacks many of the security options that are available for other AWS products, which was a bit of a surprise. Most notably, AWS does not currently offer a way to spin up an Elasticsearch cluster within a VPC, which is the traditional model for securing services on AWS. AWS does allow you to sign requests with an IAM user’s credentials, but this is not supported by any of the Elasticsearch clients that we looked at.

I recommend my friend Scott’s guide on securing our Elasticsearch cluster by IP addresses. He illustrates some advanced techniques for routing requests to your Elasticsearch cluster through an nginx node running within the VPC, which works even when the machines that need to reach the cluster are part of a dynamically scaling group. Hopefully this is just a temporary solution until AWS adds VPC support to their offering or provides alternative security measures.

Monitoring

Once you have everything implemented, you want to make sure that your team is properly alerted if the state of the Elasticsearch cluster indicates that customers might be affected. Once again we are using an AWS product for this, Cloudwatch. Some useful alarms for keeping an eye on your cluster’s health might be:

* ClusterStatus.red
* Free Storage Space <= 1 GB
* CPU Usage > 50%
* Memory Pressure > 75%

You can find some more Elasticsearch-specific metrics and suggested values here and here.
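It can help to keep the alarm thresholds listed above in code as well, where they can be reviewed and tested alongside the rest of the search setup. The sketch below is plain Scala and does not call Cloudwatch. The `Metrics` case class and its field names are hypothetical, and the CPU and memory values are assumed to be percentages.

```scala
object ClusterAlarms {
  // Hypothetical snapshot of the metrics named in the alarm list.
  final case class Metrics(
    statusRed: Boolean,
    freeStorageGb: Double,
    cpuPercent: Double,
    memoryPressurePercent: Double
  )

  // Returns the names of the alarms whose thresholds are crossed.
  def firing(m: Metrics): List[String] =
    List(
      "ClusterStatus.red"  -> m.statusRed,
      "Free Storage Space" -> (m.freeStorageGb <= 1.0),
      "CPU Usage"          -> (m.cpuPercent > 50.0),
      "Memory Pressure"    -> (m.memoryPressurePercent > 75.0)
    ).collect { case (name, true) => name }
}
```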

Conclusion

Even as the sole developer on a small team, it was pretty easy to get up and running with a fast and scalable search interface for our website in a short period of time.