Code Search
TLDR
- Trigram-based search engine for code
- Ranking partly based on PageRank, with imports acting as links between files
- Backend and indexer are ~3k lines of Rust with almost no external dependencies (essentially just regex and protobuf + gRPC)
- Web frontend is ~1k lines of Javascript with zero dependencies/frameworks
- Also built a terminal UI + vim plugin to allow searching from the terminal/editor
- Github here or you can try it yourself here
Overview
I’m a software engineer, and I spend a lot of time reading and trying to understand code. In my experience, the bigger the codebase, the more time is required to understand the code before you can start writing.
Reading and understanding unfamiliar code is perhaps the most critical task in software engineering, but it’s not easy to find the right code or follow logic through the codebase. That’s why I built code search, a search engine for code.
Matching
Code search uses a trigram-based index, which means every sequence of three consecutive characters (a “trigram”) is put into an index, which is essentially an on-disk mapping of trigram → filenames[].
As an example, the search term Some(x) is split into the trigrams Som, ome, me(, e(x, and (x). Only files containing ALL of these trigrams are considered as possible matches.
Code search uses this index to narrow down the possible matches, then runs a full regular-expression based search on them.
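The two-stage process above can be sketched in a few lines of Python. This is only an illustration, not the actual Rust implementation: the index here is an in-memory dict rather than an on-disk structure, and the names trigrams, build_index, candidate_files and search are made up for the example.

```python
import re
from collections import defaultdict

def trigrams(text):
    """Return the set of all 3-character substrings of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}

def build_index(files):
    """Map each trigram to the set of filenames containing it.

    `files` is a dict of filename -> contents (an in-memory stand-in
    for the on-disk index described in the post).
    """
    index = defaultdict(set)
    for name, contents in files.items():
        for tri in trigrams(contents):
            index[tri].add(name)
    return index

def candidate_files(index, all_names, term):
    """Files containing ALL trigrams of the term (may include false positives)."""
    tris = trigrams(term)
    if not tris:
        # Terms shorter than 3 characters have no trigrams, so the index can't narrow anything.
        return set(all_names)
    return set.intersection(*(index.get(t, set()) for t in tris))

def search(index, files, term):
    """Narrow down with the index, then confirm each candidate with a regex scan."""
    pattern = re.compile(re.escape(term))
    candidates = candidate_files(index, files, term)
    return sorted(name for name in candidates if pattern.search(files[name]))

# Hypothetical example files.
FILES = {
    "a.rs": "let v = Some(x);",
    "b.rs": "Some(); e(x)",   # has every trigram of Some(x), but not the string itself
    "c.rs": "fn main() {}",
}
INDEX = build_index(FILES)
```

Here candidate_files(INDEX, FILES, "Some(x)") returns both a.rs and b.rs, because b.rs happens to contain all five trigrams in different places. The regex pass in search then rejects b.rs, which is why the second stage is needed.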
Ranking
Once all possible matches are identified, they have to be ranked. I think this is the most important part of the search process — if the best result is at the bottom, or on page 2, it might as well not appear in the results at all.
Code search uses a number of ranking strategies (see the scoring function here), but the main ones are:
Keyword matching
- Keyword match density and proximity between keywords
- Whether a keyword is an exact match or an approximate match (e.g. search_utils matching SearchUtils is an inexact match)
- Whether the keyword is a complete, partial or interior match (e.g. result myTerminology matching the keyword term is an interior match, the lowest quality. Results terminal or xterm matching keyword term are partial matches, which are a bit better)
- When a function, variable, or structure definition matches the keyword, the ranking is boosted
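To make the complete/partial/interior distinction concrete, here is a rough Python sketch. The real scoring function may work quite differently. The normalization rule (lowercase, drop underscores) and the function names are assumptions made for this example.

```python
def normalize(identifier):
    """Lowercase and drop underscores, so search_utils and SearchUtils compare equal.

    This normalization is an assumption for the sketch, not the project's actual rule.
    """
    return identifier.replace("_", "").lower()

def is_exact(keyword, identifier):
    """True if the keyword appears verbatim, with no normalization needed."""
    return keyword in identifier

def match_kind(keyword, identifier):
    """Classify a match from best to worst: 'complete', 'partial', 'interior', or None."""
    kw, ident = normalize(keyword), normalize(identifier)
    if kw not in ident:
        return None
    if kw == ident:
        return "complete"
    if ident.startswith(kw) or ident.endswith(kw):
        return "partial"   # keyword sits at the start or end, e.g. terminal or xterm
    return "interior"      # keyword is buried in the middle, e.g. myTerminology
```

With this sketch, match_kind("term", "myTerminology") is "interior", while match_kind("term", "terminal") and match_kind("term", "xterm") are both "partial". A ranker could then give each kind a different weight.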
Downranking of junk results
- Machine-generated files are detected by heuristics like average line length, file size or extension and are heavily downranked
- Test files are identified using filenames and other language-specific heuristics and are downranked
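A minimal sketch of what such heuristics might look like, in Python. The thresholds, suffix list and filename patterns below are hypothetical values picked for illustration, not the ones code search actually uses.

```python
import os

# Hypothetical thresholds and suffixes, chosen only for this example.
MAX_AVG_LINE_LENGTH = 200
MAX_FILE_SIZE = 1_000_000
GENERATED_SUFFIXES = (".min.js", ".lock", ".pb.go")

def looks_generated(filename, contents):
    """Guess whether a file is machine-generated from its extension, size and line length."""
    if filename.endswith(GENERATED_SUFFIXES):
        return True
    if len(contents) > MAX_FILE_SIZE:
        return True
    lines = contents.splitlines()
    if not lines:
        return False
    average = sum(len(line) for line in lines) / len(lines)
    return average > MAX_AVG_LINE_LENGTH

def looks_like_test(filename):
    """Guess whether a file is a test from common filename conventions."""
    base = os.path.basename(filename).lower()
    stem = base.rsplit(".", 1)[0]          # strip the final extension
    return (stem.startswith("test_")
            or stem.endswith("_test")
            or stem.endswith(".test"))     # e.g. foo.test.ts
```

A ranker could multiply a file's score by a small factor when either function returns True. Language-specific checks (for example, looking for test annotations inside the file) would go beyond what this filename-only sketch covers.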
PageRank
- Import relationships between files are extracted using heuristic regular expressions, and a directed graph of files is constructed where the graph nodes are source files and the edges are imports
- You can see my implementation of PageRank here, but the main idea is that files which are imported by many other files (especially by files that are themselves highly ranked) get a high PageRank, which makes them more likely to show up at the top of the search results.
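For readers unfamiliar with PageRank, here is a small Python sketch of the idea applied to imports. It is not the project's implementation: the import regex only handles simple Python-style imports, mapping module foo to file foo.py is a simplifying assumption, and the damping factor and iteration count are conventional defaults rather than values taken from code search.

```python
import re

# Hypothetical heuristic: matches "import foo" and "from foo import bar" at line start.
IMPORT_RE = re.compile(r"^\s*(?:from|import)\s+([A-Za-z_][\w.]*)", re.MULTILINE)

def import_graph(files):
    """Build {file: set of files it imports} from a dict of filename -> contents."""
    graph = {name: set() for name in files}
    for name, contents in files.items():
        for module in IMPORT_RE.findall(contents):
            target = module.split(".")[0] + ".py"   # assumption: module foo lives in foo.py
            if target in files and target != name:
                graph[name].add(target)
    return graph

def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank over a {node: set of outgoing nodes} graph."""
    n = len(graph)
    if n == 0:
        return {}
    rank = {node: 1.0 / n for node in graph}
    for _ in range(iterations):
        # Rank held by files that import nothing is spread evenly over all files.
        dangling = sum(rank[node] for node, out in graph.items() if not out)
        new_rank = {node: (1 - damping) / n + damping * dangling / n for node in graph}
        for node, out in graph.items():
            for target in out:
                new_rank[target] += damping * rank[node] / len(out)
        rank = new_rank
    return rank

# Hypothetical example repository.
FILES = {
    "utils.py": "def helper(): pass\n",
    "a.py": "import utils\n",
    "b.py": "from utils import helper\nimport a\n",
    "c.py": "import utils\n",
}
RANKS = pagerank(import_graph(FILES))
```

In this toy repository utils.py is imported by every other file, so it ends up with the highest rank, and a.py outranks c.py because b.py imports it.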
Terminal UI and vim plugin
Although code search is designed to mostly be used in the browser, it’s useful to be able to do searches within the editor as well. I use vim, so I built a terminal UI for code search which can be invoked from within vim.
You can page between search results with j and k, and hitting enter prints the filename and line number to stdout so vim can jump to that line.
Some example queries
std::fs::read_to_string (github, code search)
- Github incorrectly splits the query into std, fs, and read_to_string and returns anything that matches all three in the same file
- Github’s highest ranked results are test files
- By contrast, code search correctly finds usage of the full string in a commonly imported library
queue.proto (github, code search)
- Github incorrectly splits this into queue and proto and returns a bunch of random results which have both words. But it doesn’t return the file queue.proto, which actually exists in the repository!
- Code search gives queue.proto as the first result, and doesn’t even return the other random garbage
recordio (github, code search)
- Github returns some random usages of the recordio library, but the library itself is the 8th result
- Code search returns the library itself in the first 3 results
read_to (github, code search)
- Github returns ZERO matches for this query, because it looks for complete tokens matching read_to and misses matches like read_to_string or read_to_end
- Code search finds plenty of useful results
read_proto (github, code search)
- Github returns a bunch of miscellaneous usages of the function. The function definition is at the very bottom of the results.
- Code search returns the function’s definition as the first result
Want to try it?
Just check out the readme (also on github), or just run this command in your code directory to try it instantly:
docker run -p 9898:9898 -v "$PWD":/code colinmerkel/code_search
NB: right now it only has language intelligence for javascript, typescript, python, rust, and protobuf, so if you’re using another language, the rankings won’t be as good.
If you try it, let me know if you have any suggestions for improving it!
FAQ
Can’t I just use my IDE’s built in search?
Sure, but your IDE’s search functionality is probably just equivalent to a grep over the files in your currently open repository. Many companies have lots of code spread over tens or hundreds of repositories, and unlike your IDE, code search will search it all. Your IDE’s search results are also typically not ranked by relevance.
Can’t I just use Github’s search functionality?
Github search appears to be based on a text-oriented search strategy that matches whole tokens, which is a terrible fit for searching code. For example, searching the term my_variable on github does not match my_variable_name, so lengthening the search query can actually return more results, which is counterintuitive. Also, the search results appear to be sorted randomly, which in large codebases makes the search almost useless.
Code search matches keywords more accurately and most importantly ranks the results so that the most relevant file is at the top.
Don’t you know about sourcegraph?
Yes! Actually I recommend checking out sourcegraph, it’s pretty nice and has a huge number of interesting features. But from my testing, code search gives better rankings of search results:
- Sourcegraph tends to return hidden files (e.g. .cargo-checksum.json), machine-generated files (e.g. files with extremely long lines), tests and test data as high-ranking results. Code search heavily downranks these.
- Sourcegraph requires precise knowledge of your build system in order to understand relationships between files, which is annoying to set up and fragile. Code search uses a heuristic approach, which is less accurate but more robust and requires no setup.