Document Search Engine

Christopher Galpin
Feb 19, 2018
engine results

I built a lightweight search service with Flask and Whoosh, a pure-Python search engine library. The engine is live, and fellow Python coders can easily repurpose the code for themselves.

I designed it with these principles in mind: human readable URLs, integrity and clarity of excerpts, exploration, completeness, and ease of use.

Excerpts with Integrity


Like the results, excerpts can be ordered by relevance or chronologically. They are complete paragraphs when viewing the best match or a (semantically) single result; otherwise they are composed of sentence fragments joined by “[…]” omission links. These fragments preserve formatting (e.g. italics).
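To make the fragment idea concrete, here is a minimal sketch in plain Python. It is not the engine's code: the function name `fragment_excerpt`, the naive sentence splitter, and the substring matching are all assumptions for illustration, and unlike the real excerpts it neither stems the terms nor preserves formatting.

```python
import re


def fragment_excerpt(paragraph, terms, marker="[…]"):
    """Keep only the sentences that mention a query term, marking each omission."""
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    wanted = [term.lower() for term in terms]

    parts = []
    omitted = False  # True while we are skipping non-matching sentences
    for sentence in sentences:
        if any(term in sentence.lower() for term in wanted):
            if omitted:
                parts.append(marker)
            parts.append(sentence)
            omitted = False
        else:
            omitted = True
    # Mark a trailing omission, but only if something was kept at all.
    if omitted and parts:
        parts.append(marker)
    return " ".join(parts)


text = "Dreams are vivid. The weather was dull. Dreams return at night."
print(fragment_excerpt(text, ["dreams"]))
# Dreams are vivid. […] Dreams return at night.
```

In the live engine the marker is a link that reveals the omitted text; here it is just a string.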

If excerpts expose more than half of a sufficiently long document, then the excerpt count is limited for copy protection.
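One way such a cap could work is sketched below. The function name, the 2,000-character threshold for a "sufficiently long" document, and the choice to measure exposure in characters are assumptions made for this example, not the engine's actual settings.

```python
def limit_excerpts(excerpts, document_length, min_length=2000, max_share=0.5):
    """Return the leading excerpts whose combined length stays within max_share.

    min_length and max_share are illustrative values only.
    """
    if document_length < min_length:
        return list(excerpts)  # short documents are shown without a cap

    budget = document_length * max_share
    kept, used = [], 0
    for excerpt in excerpts:
        if used + len(excerpt) > budget:
            break  # showing this excerpt would expose more than the allowed share
        kept.append(excerpt)
        used += len(excerpt)
    return kept


excerpts = ["x" * 400] * 5
print(len(limit_excerpts(excerpts, document_length=3000)))  # 3
print(len(limit_excerpts(excerpts, document_length=1000)))  # 5
```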

JavaScript formats and cites excerpts when they are copied to the clipboard. Those ordered by relevance are bulleted to emphasize that their order differs from the source text.

Clipboard citation:

“The conscious mind is not some prodigal child or poor relative of the self.” — NoPR Chapter 4: Session 621, October 16, 1972

The full document location is visible behind the expandable arrow:

expanded tier text of hit document

Advanced Queries

The engine uses Whoosh’s “Did you mean…?” suggestions, to which I added UK/US spelling variants. Whoosh also supports many advanced searches, such as date ranges, fuzzy terms and proximity.

The default search uses stemming and stop word removal:
you are welcome → welcom
to match “welcome”, “welcoming”, etc.

You can include stop words with ‘common’:
common:(you are welcome) → you ar welcom
Or skip stemming with 'exact':
exact:(you are welcome) → you are welcome

You can of course use quotation marks for a phrase:
common:"you are welcome" → "you ar welcom"

Elegant Linking

URLs are crafted to be as readable as possible by substituting HTML entities with characters that popular browsers permit.

Links to pages like /q/beer+OR+wine/20/ will always function correctly since the number refers to how many hits to skip, instead of a "page" which could change in length.
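Both ideas can be sketched with the standard library alone. The helper names `results_url` and `window`, the set of characters left unescaped, and the choice to omit the number when nothing is skipped are assumptions for illustration; the engine's own routing may differ.

```python
from urllib.parse import quote_plus


def results_url(query, skip=0):
    """Build a results URL whose trailing number is a hit offset, not a page."""
    # Leave a few readable characters unescaped; spaces become "+".
    path = "/q/" + quote_plus(query, safe=':"()') + "/"
    if skip:
        path += f"{skip}/"
    return path


def window(hits, skip, per_page=10):
    """Return the hits shown for a given offset."""
    return hits[skip:skip + per_page]


print(results_url("beer OR wine", 20))      # /q/beer+OR+wine/20/
print(results_url('heading:"part one"'))    # /q/heading:"part+one"/

hits = list(range(100))
# The same link starts at hit 20 whatever the page length happens to be.
print(window(hits, 20, per_page=10)[0], window(hits, 20, per_page=25)[0])  # 20 20
```

Because the link stores an offset, changing how many hits are shown per page never shifts where an old link begins.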

Searching from the URL bar is possible with OpenSearch, and the top excerpt is rendered in Facebook previews with Open Graph.

Encouraged Exploration

Key terms are displayed at the end of each document heading. These function both as a summary and as links for exploring the most distinctive excerpts from the document. They are generated from Whoosh’s tf-idf implementation, then cleaned of duplicates via stemming.
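The general technique looks roughly like the sketch below, written in plain Python rather than with Whoosh. The function names are hypothetical, the scoring is textbook tf-idf rather than Whoosh's implementation, and `crude_stem` is a toy stand-in for a real stemmer.

```python
import math
import re
from collections import Counter


def crude_stem(word):
    """Toy suffix stripper standing in for a real stemmer such as Porter's."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def key_terms(doc_id, corpus, count=3):
    """Rank one document's words by tf-idf, then drop words that share a stem."""
    tokens = {name: re.findall(r"[a-z]+", text.lower()) for name, text in corpus.items()}
    # Document frequency: in how many documents does each word occur?
    doc_freq = Counter(word for words in tokens.values() for word in set(words))
    term_freq = Counter(tokens[doc_id])
    n_docs = len(corpus)
    scores = {w: freq * math.log(n_docs / doc_freq[w]) for w, freq in term_freq.items()}

    # Highest score first; ties are broken alphabetically for stable output.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    terms, seen_stems = [], set()
    for word in ranked:
        stem = crude_stem(word)
        if stem in seen_stems:
            continue  # e.g. "dreaming" after "dreams"
        seen_stems.add(stem)
        terms.append(word)
        if len(terms) == count:
            break
    return terms


corpus = {
    "a": "dreams dreaming dreams probable selves",
    "b": "the probable weather",
    "c": "the weather report",
}
print(key_terms("a", corpus))  # ['dreams', 'selves', 'probable']
```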

Search queries that don’t look for text within the body of a document are displayed as a list:

64 results for heading:dream

A “more like” section is present at the end:

similar sessions

Expressive Indexing

I process Markdown formatted text files into Whoosh’s ‘documents’ (i.e. individual results) with a simple set of regular expressions that specify the portion to be processed, and identify headings. Headings are then categorized into tiers, if desired.

Here is an example specification:

(books\sb.txt would contain the Markdown)

Within the same tier, a new heading ends the previous. Anything else we specify with ‘end’. For example, here we end “part” once we reach the appendix. Similarly, the start of e.g. “Part One” ends the introduction, just as the start of “Part Two” ends the final chapter of “Part One”. These are very easy to debug with console output.
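The rules in this paragraph can be illustrated with a small, self-contained sketch. It is not the engine's specification format: the `TIERS` layout, the Markdown heading patterns, and the function name `split_documents` are assumptions, and only the names tier0, tier1, tier2 and the notion of an "end" pattern come from the text.

```python
import re

# Hypothetical specification: (tier name, heading pattern, optional end pattern).
TIERS = [
    ("tier0", re.compile(r"^# (Part \w+)$"), re.compile(r"^# Appendix")),
    ("tier1", re.compile(r"^## (.+)$"), None),
    ("tier2", re.compile(r"^### (Session \d+)$"), None),
]


def split_documents(text, tiers=TIERS):
    """Return one dict per deepest-tier heading, carrying its parent headings."""
    current = {}    # tier name -> heading currently in effect
    documents = []
    doc = None      # document currently collecting body lines, if any
    for line in text.splitlines():
        for depth, (name, start, end) in enumerate(tiers):
            match = start.match(line)
            ended = end is not None and end.match(line)
            if match or ended:
                # A heading (or explicit end) closes this tier and every deeper one.
                for deeper_name, _, _ in tiers[depth:]:
                    current.pop(deeper_name, None)
                doc = None
                if match:
                    current[name] = match.group(1)
                    if depth == len(tiers) - 1:
                        doc = dict(current, body=[])
                        documents.append(doc)
                break
        else:
            # Not a heading: body text belongs to the open document, if any.
            if doc is not None and line.strip():
                doc["body"].append(line)
    return documents


sample = """# Part One
## Chapter 1
### Session 1
First body.
### Session 2
Second body.
# Appendix
Notes."""

for document in split_documents(sample):
    print(document["tier0"], "/", document["tier1"], "/", document["tier2"], document["body"])
```

Printing each document as it is produced, as in the loop above, is the kind of console output that makes a specification easy to debug.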

‘tier2’ is the most specific heading and has its own searchable field. We’d like it to be mostly unique for concise URLs to individual documents, e.g. session:869, though it's not required. You may rename it to fit your application, e.g. chapter:dogs.

‘tier0’, ‘tier1’, and the lines within ‘headings_re’ identifying them are optional. They participate in the visual style of the short and expanded headings seen above, and search-wise are combined into heading, e.g.
heading:"part one" or heading:"chapter 22".


systemd unit /lib/systemd/system/search-engine.service with venv and gunicorn:

To auto-start: systemctl enable search-engine
To run: service search-engine start


If some of your documents have a deeper hierarchy than others, e.g. some are an entire chapter whereas others are a section (session in my case) within a chapter, then you’ll probably need to rename ‘session’ → ‘heading’, and ‘heading’ → ‘extra’, or something like that to make semantic sense to the user.

The HTML output is somewhat monolithic and is built largely within the code (although well organized) rather than in the template. I may move it to the template in the future.

Source files are expected to be in Markdown, but for HTML input you can simply omit the commonmark() call. I imagine it works fine on plain text, but I haven't tried it.

Happy coding.
