Going Beyond grep for Searching Source Code — Zoekt

Nikos Katirtzis
The Hotels.com Technology Blog
Sep 2, 2018

Software engineering is more about reading code than writing it, and part of this process is finding the code that you should read. — Han-Wen Nienhuys

The good old days when entire websites (including hotels.com) were served by a single web service are gone.

It’s clear that most companies are moving from monoliths to microservices. This shift has numerous benefits, but it also brings new challenges. One of them is that IDEs can no longer locate source code effectively once it is spread across several projects. This is where source code search engines come into play.

Why do you need a source code search engine?

First, I couldn’t agree more with Han-Wen Nienhuys, author of the Zoekt source code search engine, who said:

Software engineering is more about reading code than writing it, and part of this process is finding the code that you should read.

How true is this? I’m sure you’ve found yourself quite often looking at implementations by other teams when trying to implement something similar in your apps.

In fact, according to a recent Google survey, the average developer performs more than 10 searches on a typical weekday. Narrowing down to specific use cases, examples where a source code search engine can help include:

  • Understanding code dependencies in order to avoid breaking changes.
  • Finding references to hosts/code that will be deprecated.
  • Avoiding duplicating existing code.
  • Sharing coding solutions and styles.
  • Finding bad/poor quality code that should be eliminated.
  • Locating security problems (esp. hardcoded keys/passwords).

At Hotels.com we have conducted several PoCs on different source code search engines trying to find the optimal solution for us. A service that we particularly liked and which would scale to at least a few thousand repositories is presented in this article.

Zoekt: A fast trigram-based source code search engine

Zoekt is a fast, trigram-based source code search engine. It is open source (though not an official Google product) and its source code is available on GitHub.

Zoekt has been designed with the following principles in mind:

  • Coverage: the code that is of interest to you should be available for searching.
  • Speed: search should return useful results quickly, so you can iterate on queries and stay productive.
  • Approximate queries: matching should be done case insensitively, on arbitrary substrings, so you don’t have to know what you are looking for in advance. However, the service should support both case-sensitive and case-insensitive searches.
  • Filtering: you should have the option to winnow down results by composing more specific queries.
  • Ranking: interesting results (e.g. function definitions, whole word matches) should be at the top.

and provides the following features that implement the above-mentioned principles:

  • Coverage: comes with tools that mirror parts of common Git hosting sites.
  • Speed: it focuses on speed and achieves sub-50ms results on large codebases by using an index based on positional trigrams.
  • Approximate queries: it fully supports regular expressions and substring patterns, and can do case-sensitive and case-insensitive matching.
  • Filtering: it allows you to filter queries by adding extra atoms (e.g. lang:java limits results to Java source code), and to filter out terms with the minus symbol (-). It provides a rich query language with boolean operators, etc.
  • Ranking: it uses ctags to find declarations which are then boosted in the search ranking.
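To make the "positional trigrams" idea more concrete, here is a minimal Python sketch of how such an index can answer a case-insensitive substring query. It is a simplified illustration of the general technique, not Zoekt's actual implementation: the function names (build_index, search) and the sample documents are made up for this example, and the sketch only targets ASCII text. The index records where every three-character sequence occurs. A query is answered by looking up its first and last trigram, keeping only positions that are the right distance apart, and then verifying each candidate against the real text.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercased trigram to the (doc_id, offset) pairs where it occurs."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        lowered = text.lower()
        for offset in range(len(lowered) - 2):
            index[lowered[offset:offset + 3]].append((doc_id, offset))
    return index


def search(index, docs, query):
    """Return sorted (doc_id, offset) matches for a case-insensitive substring."""
    needle = query.lower()
    if len(needle) < 3:
        raise ValueError("this sketch needs a query of at least 3 characters")
    first = index.get(needle[:3], [])
    # Positions of the last trigram, used to prune candidates by distance.
    last = set(index.get(needle[-3:], []))
    gap = len(needle) - 3
    hits = []
    for doc_id, offset in first:
        # The last trigram must start exactly `gap` characters after the first.
        if (doc_id, offset + gap) not in last:
            continue
        # Verify the surviving candidate against the real text.
        if docs[doc_id].lower().startswith(needle, offset):
            hits.append((doc_id, offset))
    return sorted(hits)


# Hypothetical sample documents, used only for illustration.
docs = {
    "Dockerfile": "FROM golang:1.10\nRUN go build ./...\n",
    "README.md": "Images are built from Golang base images.\n",
}
index = build_index(docs)
print(search(index, docs, "golang"))  # [('Dockerfile', 5), ('README.md', 22)]
```

Because only the positions of two trigrams are read from the index, most of the text never needs to be scanned, which is the intuition behind the speed of this kind of index.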

The home page of Zoekt pretty much covers its searching options by providing useful examples:

Zoekt’s home page (source: cs.bazel.build).

Zoekt in Action!

Step 1: Assume a scenario in which we’re looking for usages of the golang base image in Docker files.

Query 1: from golang (source: cs.bazel.build).

Not really useful, right?

Step 2: Let’s use a simple regex and do a case-sensitive search…

Query 2: FROM\s+golang case:yes (source: cs.bazel.build).

Nice. We don’t really care about README files though. I mean, in general we care about documentation, but not in this case…
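To see why the case-sensitive regex narrows things down, the same pattern can be tried locally. The sketch below uses Python's re module rather than Zoekt's own regex engine, and the sample lines are invented for illustration; for a pattern this simple the behaviour should be the same.

```python
import re

# Hypothetical lines that could appear in Dockerfiles and documentation.
lines = [
    "FROM golang:1.10 AS build",
    "FROM    golang",
    "This guide explains how to migrate from golang/dep to Go modules.",
    "Imported from GoLang sources",
]

pattern = r"FROM\s+golang"

# Case-insensitive matching also picks up ordinary prose.
insensitive = [line for line in lines if re.search(pattern, line, re.IGNORECASE)]

# Case-sensitive matching (the effect of case:yes) keeps only the FROM instructions.
sensitive = [line for line in lines if re.search(pattern, line)]

print(len(insensitive), len(sensitive))  # 4 2
```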

Step 3: Let’s narrow this down to Docker files.

Query 3: FROM\s+golang case:yes file:Dockerfile (source: cs.bazel.build).

Interesting. As you can see, the file name of the first result is Dockerfile.build. But what if we’re only looking for exact matches, or at least want to exclude files with that name?

Step 4: Then we can use a regex on the file name!

Query 4: FROM\s+golang case:yes file:^Dockerfile$ (source: cs.bazel.build).

and/or even use the minus symbol to exclude specific files:

Query 5: FROM\s+golang case:yes file:Dockerfile -file:Dockerfile.build (source: cs.bazel.build).

Awesome.
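The difference between the three file filters can be reproduced with a few lines of Python. This is only a sketch of the filtering logic: the select helper and the file names are hypothetical, and it does not model exactly which part of the path Zoekt applies the pattern to.

```python
import re

# Hypothetical file names, used only for illustration.
paths = ["Dockerfile", "Dockerfile.build", "docs/Dockerfile.md", "README.md"]


def select(paths, include, exclude=None):
    """Keep paths that match `include` and do not match `exclude` (both regexes)."""
    kept = []
    for path in paths:
        if not re.search(include, path):
            continue
        if exclude is not None and re.search(exclude, path):
            continue
        kept.append(path)
    return kept


# file:Dockerfile is a substring match, so Dockerfile.build is included.
print(select(paths, r"Dockerfile"))

# file:^Dockerfile$ anchors both ends and keeps only the exact name.
print(select(paths, r"^Dockerfile$"))

# file:Dockerfile -file:Dockerfile.build drops the unwanted variant.
print(select(paths, r"Dockerfile", exclude=r"Dockerfile.build"))
```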

Step 5: Finally, we can winnow down to specific repos by using the r symbol.

Query 6: FROM\s+golang case:yes file:^Dockerfile$ r:GoogleCloudPlatform (source: cs.bazel.build).

Note: The repo (r) option is case sensitive!

Thanks Zoekt :)

Hidden gems

#1: Zoekt has a cool feature that prevents long lines from slowing down your browser. Note the “…(X bytes skipped)…” in the example below.

Query 7: f:json r:codesearch (source: cs.bazel.build).

#2: It also detects duplicates and suppresses them. Note the “Duplicate result” tag.

Query 8: f:__init__\.py$ r:grpc (source: cs.bazel.build).

#3: It fully supports regular expressions and has no issues running long, complex regexes.

Query 9: jdbc:.*:([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}) (source: cs.bazel.build).
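A pattern like this is easier to trust once you have tried it on a few strings. The sketch below runs the same expression (written with plain ASCII hyphens in the character classes) through Python's re module, which behaves like Zoekt's regex engine for a pattern of this kind as far as I know. The connection strings and the find_hardcoded_ips helper are made up for illustration.

```python
import re

JDBC_IP = re.compile(r"jdbc:.*:([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3})")

# Hypothetical configuration lines.
samples = [
    "url=jdbc:sybase:Tds:10.0.12.7:5000/bookings",
    "url=jdbc:mysql://10.0.12.7:3306/bookings",
    "url=jdbc:postgresql://db.internal.example:5432/bookings",
]


def find_hardcoded_ips(lines):
    """Return the address captured by the pattern for every matching line."""
    found = []
    for line in lines:
        match = JDBC_IP.search(line)
        if match:
            found.append(match.group(1))
    return found


print(find_hardcoded_ips(samples))  # ['10.0.12.7']
```

Only the first sample matches: the pattern expects a colon directly before the address, so URLs that put "//" in front of the host would need a slightly different expression.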

#4: The r symbol shows info for the repositories including the date when these were last indexed, the number of files indexed, and their size.

Query 10: r:GoogleCloudPlatform (source: cs.bazel.build).

Resources needed to run Zoekt

Zoekt continuously hits the disk/indexes for searching. This means you’ll need a local SSD to store the index files and to achieve low I/O communication latency. However, the service is not CPU or memory intensive which means you won’t need an expensive instance.

AWS offers storage-optimised instances with an instance store, such as the i3 family. An i3.xlarge EC2 instance (4 vCPUs, 30.5 GiB memory, 950 GB NVMe SSD instance store) is more than enough for us, but what you need will depend on the number of repos you have and their size.

The setup for us looks like the one below:

AWS setup for Zoekt.

Specifically:

  • Zoekt is served by an i3.xlarge EC2 instance.
  • Indexes are stored in the (ephemeral) instance store to ensure low I/O communication latency.
  • We’re also using an EBS disk where repositories and logs are stored (we need to store them permanently).

Cloning the repos for the first time is obviously the most time-consuming task. Indexing is fast, and once it has been done for the first time, Zoekt re-indexes incrementally. Re-indexing is therefore a matter of seconds or minutes, depending on how many repos have been modified and how many updates there are. The service is configurable as to how often it clones new repos and fetches repo changes.
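The reason incremental re-indexing is cheap can be illustrated with a small sketch. It assumes, purely for illustration, that the indexer remembers the commit each repo was last indexed at and skips repos whose current commit is unchanged. This is not a description of Zoekt's internals, and the repos_to_reindex helper, repo names, and commit ids are hypothetical.

```python
def repos_to_reindex(indexed_heads, current_heads):
    """Return the repos that are new or whose HEAD differs from the indexed one."""
    return sorted(
        name
        for name, head in current_heads.items()
        if indexed_heads.get(name) != head
    )


# Hypothetical state: commit ids recorded at the last indexing run...
indexed_heads = {"booking-api": "a1b2c3", "search-ui": "d4e5f6"}
# ...and the commit ids seen after the latest fetch.
current_heads = {"booking-api": "a1b2c3", "search-ui": "0f9e8d", "pricing": "778899"}

print(repos_to_reindex(indexed_heads, current_heads))  # ['pricing', 'search-ui']
```

With a few thousand repositories and only a handful of them changing between runs, the work per run stays small.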

You can find more details about the technical design of the service or even answers to common questions under Zoekt’s GitHub repo. Also, there’s a great presentation from the author of the search engine where he explains the differences between the most common approaches used for source code searching/indexing.

Still not convinced you need a source code search engine? Then join us at Voxxed Days Thessaloniki in November where we’ll talk about “Improving your team’s source code searching capabilities”!

References

Zoekt public repo

Zoekt demo website

Zoekt Technical design details

Zoekt FAQ

Han-Wen Nienhuys’ presentation on Zoekt
