Code Search

a search engine for code, written in Rust

Oct 26, 2020

TLDR

Trigram-based search engine for code
Ranking partly based on PageRank, with imports acting as links between files
Backend and indexer are ~3k lines of Rust with almost no external dependencies (essentially just regex, protobuf + gRPC)
Web frontend is ~1k lines of Javascript with zero dependencies/frameworks
Also built a terminal UI + vim plugin to allow searching from the terminal/editor
Github here or you can try it yourself here

Overview

I’m a software engineer, and I spend a lot of time reading and trying to understand code. In my experience, the bigger the codebase, the more time is required to understand the code before you can start writing.

Reading and understanding unfamiliar code is perhaps the most critical task in software engineering, but it’s not easy to find the right code or follow logic through the codebase. That’s why I built code search, a search engine for code.

A demo of code search

Matching

Code search uses a trigram-based index: every sequence of three consecutive characters (a “trigram”) is put into an index, which is essentially an on-disk mapping of trigram → filenames[].

As an example, the search term Some(x) is split into the trigrams Som, ome, me(, e(x, and (x). Only files containing ALL of these trigrams are considered as possible matches.

Code search uses this index to narrow down the possible matches, then runs a full regular-expression based search on them.
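To make the narrowing step concrete, here is a minimal in-memory sketch of a trigram index. It is not the actual code search implementation: the names trigrams, TrigramIndex, add_file and candidates are made up for illustration, and the real index lives on disk rather than in a HashMap.

```rust
use std::collections::{BTreeSet, HashMap};

/// Split a string into overlapping three-character windows.
fn trigrams(text: &str) -> Vec<String> {
    let chars: Vec<char> = text.chars().collect();
    chars.windows(3).map(|w| w.iter().collect()).collect()
}

/// Hypothetical in-memory stand-in for the on-disk trigram -> filenames mapping.
#[derive(Default)]
struct TrigramIndex {
    postings: HashMap<String, BTreeSet<String>>,
}

impl TrigramIndex {
    /// Record every trigram of `contents` as occurring in `filename`.
    fn add_file(&mut self, filename: &str, contents: &str) {
        for t in trigrams(contents) {
            self.postings
                .entry(t)
                .or_default()
                .insert(filename.to_string());
        }
    }

    /// Files that contain every trigram of the query.
    /// Queries shorter than three characters have no trigrams and
    /// return an empty set in this sketch.
    fn candidates(&self, query: &str) -> BTreeSet<String> {
        let mut result: Option<BTreeSet<String>> = None;
        for t in trigrams(query) {
            let files = match self.postings.get(&t) {
                Some(files) => files,
                // A trigram nobody contains means no file can match.
                None => return BTreeSet::new(),
            };
            result = Some(match result {
                None => files.clone(),
                Some(acc) => acc.intersection(files).cloned().collect(),
            });
        }
        result.unwrap_or_default()
    }
}
```

The candidate set can still contain false positives (a file may hold all the trigrams without holding the full query string), which is why a full regular-expression pass over the candidates is still needed afterwards.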

Ranking

Once all possible matches are identified, they have to be ranked. I think this is the most important part of the search process — if the best result is at the bottom, or on page 2, it might as well not appear in the results at all.

Code search uses a number of ranking strategies (see the scoring function here), but the main ones are:

Keyword matching

Keyword match density and proximity between keywords
Whether a keyword is an exact match or an approximate match (e.g. search_utils matching SearchUtils is an inexact match)
Whether the keyword is a complete, partial or interior match (e.g. result myTerminology matching the keyword term is an interior match, the lowest quality. Results terminal or xterm matching keyword term are partial matches, which are a bit better)
When a function, variable, or structure definition matches the keyword, the ranking is boosted
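The match-quality distinctions above could be sketched roughly like this. The MatchKind enum, the normalize helper and the classify function are hypothetical names, and treating "partial" as "keyword at the start or end of the identifier" is my reading of the examples, not necessarily how the real scoring function decides it.

```rust
/// Ordered from worst to best, so variants can be compared with `<`.
#[derive(Debug, PartialEq, Eq, PartialOrd, Ord)]
enum MatchKind {
    Interior, // keyword buried inside the identifier (lowest quality)
    Partial,  // keyword at the start or end of the identifier
    Complete, // keyword is the whole identifier
}

/// Lowercase and drop underscores, so `search_utils` and `SearchUtils`
/// compare equal (an approximate match rather than an exact one).
fn normalize(ident: &str) -> String {
    ident
        .chars()
        .filter(|c| *c != '_')
        .flat_map(|c| c.to_lowercase())
        .collect()
}

/// Classify how `keyword` matches `ident`, or None if it does not match.
fn classify(ident: &str, keyword: &str) -> Option<MatchKind> {
    let ident = normalize(ident);
    let keyword = normalize(keyword);
    if keyword.is_empty() || !ident.contains(keyword.as_str()) {
        return None;
    }
    if ident == keyword {
        Some(MatchKind::Complete)
    } else if ident.starts_with(keyword.as_str()) || ident.ends_with(keyword.as_str()) {
        Some(MatchKind::Partial)
    } else {
        Some(MatchKind::Interior)
    }
}

/// True only when the identifier matches the keyword character for character.
fn is_exact(ident: &str, keyword: &str) -> bool {
    ident == keyword
}
```

A ranking function could then add a larger bonus for Complete than for Partial or Interior, and a further bonus when is_exact holds.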

Downranking of junk results

Machine-generated files are detected by heuristics like average line length, file size or extension and are heavily downranked
Test files are identified using filenames and other language-specific heuristics and are downranked
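As a rough illustration of these junk-detection heuristics, here is a small sketch. Every threshold, suffix and filename pattern in it is an assumption picked for the example; the post does not state the values or rules the real indexer uses.

```rust
// Hypothetical thresholds and suffixes, chosen only for illustration.
const MAX_AVG_LINE_LEN: usize = 200;
const MAX_FILE_BYTES: usize = 1_000_000;
const GENERATED_SUFFIXES: [&str; 3] = [".min.js", ".pb.go", ".lock"];

/// Guess whether a file was produced by a tool rather than written by hand.
fn looks_machine_generated(filename: &str, contents: &str) -> bool {
    if GENERATED_SUFFIXES.iter().any(|s| filename.ends_with(*s)) {
        return true;
    }
    if contents.len() > MAX_FILE_BYTES {
        return true;
    }
    // Average line length in bytes; treat an empty file as one line.
    let line_count = contents.lines().count().max(1);
    contents.len() / line_count > MAX_AVG_LINE_LEN
}

/// Guess whether a file is a test, based on its path alone.
fn looks_like_test(filename: &str) -> bool {
    let name = filename.rsplit('/').next().unwrap_or(filename);
    filename.contains("/tests/")
        || name.starts_with("test_")
        || name.contains("_test.")
        || name.contains(".test.")
}
```

Files flagged by either function would have their score multiplied by some penalty factor rather than being dropped, so they can still surface when nothing better matches.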

PageRank

Import relationships between files are extracted using heuristic regular expressions, and a directed graph of files is constructed where the graph nodes are source files and the edges are imports
You can see my implementation of PageRank here, but the main idea is that files which are imported a lot by other files have a high pagerank, which makes them more likely to show up at the top of the search results.
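For readers who have not seen PageRank before, here is a textbook power-iteration sketch over an import graph. It is a generic version, not the implementation linked above; the pagerank function name, the adjacency-list representation and the handling of files with no imports are all assumptions made for the example.

```rust
/// Textbook PageRank by power iteration.
/// `imports[i]` lists the indices of the files that file `i` imports,
/// so rank flows from the importing file to the imported one.
fn pagerank(imports: &[Vec<usize>], damping: f64, iterations: usize) -> Vec<f64> {
    let n = imports.len();
    if n == 0 {
        return Vec::new();
    }
    // Start with rank spread evenly over all files.
    let mut rank = vec![1.0 / n as f64; n];
    for _ in 0..iterations {
        // Every file gets a small baseline share each round.
        let mut next = vec![(1.0 - damping) / n as f64; n];
        for (file, targets) in imports.iter().enumerate() {
            if targets.is_empty() {
                // A file that imports nothing spreads its rank over all
                // files, so the total rank stays equal to 1.
                let share = damping * rank[file] / n as f64;
                for r in next.iter_mut() {
                    *r += share;
                }
            } else {
                // Otherwise its rank is split among the files it imports.
                let share = damping * rank[file] / targets.len() as f64;
                for &t in targets {
                    next[t] += share;
                }
            }
        }
        rank = next;
    }
    rank
}
```

With three files where files 0 and 1 both import file 2, calling pagerank(&[vec![2], vec![2], vec![]], 0.85, 50) gives file 2 the highest rank, which is the "imported a lot" effect described above.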

Terminal UI and vim plugin

Although code search is designed to mostly be used in the browser, it’s useful to be able to do searches within the editor as well. I use vim, so I built a terminal UI for code search which can be invoked from within vim.

You can page between search results with j and k, and hitting enter returns the filename and line number to stdout so vim can jump to that line.

Some example queries

std::fs::read_to_string (github, code search)

Github incorrectly splits the query into std, fs, and read_to_string and returns anything that matches all three in the same file
Github’s highest ranked results are test files
By contrast, code search correctly finds usage of the full string in a commonly imported library

queue.proto (github, code search)

Github incorrectly splits this into queue and proto and returns a bunch of random results which have both words. But it doesn’t return the file queue.proto which actually exists in the repository!
Code search gives queue.proto as the first result, and doesn’t even return the other random garbage

recordio (github, code search)

Github returns some random usages of the recordio library, but the library itself is the 8th result
Code search returns the library itself in the first 3 results

read_to (github, code search)

Github returns ZERO matches for this query, because it looks for complete tokens matching read_to and misses matches like read_to_string or read_to_end
Code search finds plenty of useful results
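The difference between the two behaviours can be shown with a small sketch. The tokenizer here is an assumption about how a text-oriented engine might split code (on anything that is not alphanumeric or an underscore); it is not a description of Github's actual implementation, and the function names are hypothetical.

```rust
/// Split text into tokens on anything that is not alphanumeric or '_'.
fn tokens(text: &str) -> Vec<&str> {
    text.split(|c: char| !(c.is_alphanumeric() || c == '_'))
        .filter(|t| !t.is_empty())
        .collect()
}

/// Whole-token matching: the query must equal an entire token.
fn token_match(text: &str, query: &str) -> bool {
    tokens(text).iter().any(|t| *t == query)
}

/// Substring matching: the query may appear anywhere in the text.
fn substring_match(text: &str, query: &str) -> bool {
    text.contains(query)
}
```

For the line `let s = fs::read_to_string(path)?;`, token_match with the query read_to finds nothing because the only relevant token is read_to_string, while substring_match finds it.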

read_proto (github, code search)

Github returns a bunch of miscellaneous usages of the function. The function definition is at the very bottom of the results.
Code search returns the function’s definition as the first result

Want to try it?

Just check out the readme (also on github), or just run this command in your code directory to try it instantly:

docker run -p 9898:9898 -v $PWD:/code colinmerkel/code_search

NB: right now it only has language intelligence for javascript, typescript, python, rust, and protobuf, so if you’re using another language, the rankings won’t be as good.

If you try it, let me know if you have any suggestions for improving it!

FAQ

Can’t I just use my IDE’s built in search?

Sure, but your IDE’s search functionality is probably just equivalent to a grep over the files in your currently open repository. Many companies have lots of code spread over tens or hundreds of repositories, and unlike your IDE, code search will search it all. And your IDE’s search results are typically not sorted by relevance.

Can’t I just use Github’s search functionality?

Github search appears to be based on a text-oriented search strategy, which is a terrible fit for searching code. For example, searching for the term my_variable on Github does not match my_variable_name, so lengthening the search query can actually return more results, which is counterintuitive. Also, the search results do not appear to be sorted by relevance, which in large codebases makes the search almost useless.

Code search matches keywords more accurately and most importantly ranks the results so that the most relevant file is at the top.

Don’t you know about sourcegraph?

Yes! Actually I recommend checking out sourcegraph, it’s pretty nice and has a huge amount of interesting features. But from my testing, code search gives better rankings of search results:

Sourcegraph tends to return hidden files (e.g. .cargo-checksum.json), machine-generated files (e.g. files with extremely long lines), tests and test data as high-ranking results. Code search heavily downranks these.
Sourcegraph requires precise knowledge of your build system in order to understand relationships between files, which is annoying to set up and fragile. Code search uses a heuristic approach, which is less accurate but more robust and requires no setup.