Document Search Engine

[Image: engine results]

I built a lightweight search service with Flask and Whoosh, a pure-Python search engine library. A live instance of the engine is available online. Fellow Python coders can easily repurpose the code for themselves.

I designed it with these principles in mind: human readable URLs, integrity and clarity of excerpts, ease of exploration, completeness, and ease of use.

Excerpts with Integrity


Excerpts are complete paragraphs when viewing either a single result (semantically if not literally) or the best match. Otherwise they are composed of sentence fragments joined by “[…]” omission links. Like the document hits themselves, excerpts can be ordered by relevance or chronologically. Italics within sentence fragments are preserved.

If excerpts expose more than half of a sufficiently long document, then the excerpt count is limited for copy protection.
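The limit described above could be sketched roughly as follows. This is only an illustration of the idea, not the engine's actual code: the function name `limit_excerpts`, the `min_doc_length` threshold, and measuring exposure in characters are all assumptions.

```python
def limit_excerpts(excerpts, doc_length, min_doc_length=2000, max_fraction=0.5):
    """Keep leading excerpts while their combined length stays within
    max_fraction of the document. Short documents are left untouched.

    min_doc_length and max_fraction are hypothetical defaults.
    """
    if doc_length < min_doc_length:
        return list(excerpts)
    budget = doc_length * max_fraction
    kept, used = [], 0
    for text in excerpts:
        if used + len(text) > budget:
            break  # adding this excerpt would expose more than the budget
        kept.append(text)
        used += len(text)
    return kept
```

Because the excerpts are cut from the front of the list, whichever ordering is in effect (relevance or chronological) decides which ones survive.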

JavaScript carefully formats and cites excerpts when they are copied to the clipboard. Those ordered by relevance are bulleted to emphasize that their order differs from the source text.

Clipboard citation:

“The conscious mind is not some prodigal child or poor relative of the self.” — NoPR Chapter 4: Session 621, October 16, 1972

The full document location is visible behind the expandable arrow:

[Image: expanded tier text of hit document]

Advanced Queries

The engine uses Whoosh’s “Did you mean…?” suggestions, to which I added UK/US spelling variants. Whoosh also supports many advanced searches, such as date ranges, fuzzy terms, and proximity.

The default search applies stemming and stop-word removal, so you are welcome → welcom, which matches “welcome”, “welcoming”, etc.

You can include stop words with ‘common’:

common:(you are welcome) → you are welcom

Or skip stemming with ‘exact’:

exact:(you are welcome) → you are welcome

You can of course use quotation marks for a phrase, e.g. common:"you are welcome".

Elegant Linking

URLs are crafted to be as readable as possible, e.g. /q/heading:(essay+4)+'conscious+mind'/, by replacing percent-encoded sequences with the literal characters that popular browsers permit in a URL.
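One way to get URLs of this shape is to percent-encode the query while exempting a small set of readable characters. The sketch below uses only the standard library. The helper names and the exact set of exempted characters are assumptions, not taken from the engine's source.

```python
from urllib.parse import quote_plus, unquote_plus

# Characters left unescaped for readability (an assumed set).
READABLE = ":()'"

def query_to_path(query):
    """Turn a search query into a readable /q/.../ path."""
    return "/q/" + quote_plus(query, safe=READABLE) + "/"

def path_to_query(path):
    """Recover the original query from a /q/.../ path."""
    return unquote_plus(path[len("/q/"):].rstrip("/"))

print(query_to_path("heading:(essay 4) 'conscious mind'"))
# → /q/heading:(essay+4)+'conscious+mind'/
```

Spaces become `+`, and anything outside the exempted set (including a literal `+` or `/`) is still percent-encoded, so the round trip through `path_to_query` stays lossless.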

Links to pages like /q/beer+OR+wine/20/ will always function correctly since the number refers to how many hits to skip, instead of a "page" which could change in length.
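A minimal sketch of skip-based paging, assuming the hits are available as a list. The names `page_of_hits` and `page_path` and the default page size are hypothetical.

```python
def page_of_hits(hits, skip=0, page_size=10):
    """Return the window of hits after the first `skip`,
    plus the skip value for the next link (None on the last window)."""
    window = hits[skip:skip + page_size]
    next_skip = skip + page_size if skip + page_size < len(hits) else None
    return window, next_skip

def page_path(query_path, skip):
    """Append the skip count to a query path such as /q/beer+OR+wine/."""
    return f"{query_path}{skip}/" if skip else query_path
```

Since the number in the URL is an offset, changing `page_size` later alters how many hits follow the offset but not where the link starts.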

Searching from the URL bar is possible with OpenSearch, and the top excerpt is rendered in Facebook previews with Open Graph.

Encouraged Exploration

Key terms are displayed at the end of each document heading. They serve both as a summary and as links for exploring the most distinctive excerpts from the document. They are generated from Whoosh’s tf-idf implementation, then cleaned of duplicates via stemming.
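To illustrate the idea, here is a standard-library stand-in for that pipeline: score a document's words by tf-idf against a corpus, then keep only the best-scoring word per stem. It does not use Whoosh. The `crude_stem` suffix stripper is a toy replacement for a real stemmer, and all names here are hypothetical.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def crude_stem(word):
    # Toy suffix stripper standing in for a real stemmer (e.g. Porter).
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def key_terms(doc, corpus, limit=5):
    """Rank the words of `doc` by tf-idf against `corpus`,
    keeping only the highest-ranked word for each stem."""
    doc_freq = Counter()
    for other in corpus:
        doc_freq.update(set(tokenize(other)))
    tf = Counter(tokenize(doc))
    scores = {
        word: count * math.log((1 + len(corpus)) / (1 + doc_freq[word]))
        for word, count in tf.items()
    }
    # Highest score first; ties broken alphabetically for stable output.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    terms, seen = [], set()
    for word in ranked:
        stem = crude_stem(word)
        if stem not in seen:
            seen.add(stem)
            terms.append(word)
    return terms[:limit]
```

A real deployment would also drop stop words before scoring. That step is omitted here to keep the sketch short.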

Search queries that don’t look for text within the body of a document are displayed as a list:

[Image: 64 results for heading:dream]

A “more like” section is present at the end of both these and single result pages:

[Image: similar sessions]

Expressive Indexing

Markdown formatted text files are processed into ‘documents’ (i.e. individual results) with a simple set of regular expressions that specify the portion to be processed, and identify headings. Headings are then categorized into tiers, if desired.

Here is an example specification:

(With a books\sb.txt file.)

Within the same tier, a new heading ends the previous one. Anything else we specify with ‘end’. For example, here we end “part” once we reach the appendix. Similarly, the start of e.g. “Part One” ends the introduction, just as the start of “Part Two” ends the final chapter of “Part One”. These specifications are very easy to debug with console output.

‘tier2’ is the most specific heading and has its own searchable field. We’d like it to be mostly unique for concise URLs to individual documents, e.g. session:869, though it's not required. You may rename it to fit your application, e.g. chapter:dogs.

‘tier0’, ‘tier1’, and the lines within ‘headings_re’ identifying them are optional. They participate in the visual style of the short and expanded headings seen above, and search-wise are combined into heading, e.g. heading:"part one" or heading:"chapter 22".
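The tier logic described above can be sketched with plain regular expressions. This is not the engine's actual specification format: the key names echo the text, but the dictionary layout, the patterns, and the function `split_documents` are assumptions. As simplifications, a single ‘end’ pattern stops processing entirely, and text outside any ‘tier2’ heading is discarded.

```python
import re

# Hypothetical specification; structure and patterns are assumptions.
SPEC = {
    "headings_re": {
        "tier0": r"^(Part \w+)$",
        "tier1": r"^(Chapter \d+)$",
        "tier2": r"^(Session \d+)$",
    },
    "end": r"^Appendix$",
}
TIERS = ("tier0", "tier1", "tier2")

def split_documents(text, spec=SPEC):
    """Return one dict per tier2 heading, carrying its enclosing tier0/tier1."""
    patterns = {tier: re.compile(rx) for tier, rx in spec["headings_re"].items()}
    end = re.compile(spec["end"])
    current = dict.fromkeys(TIERS)
    body, docs = [], []

    def flush():
        # Only text under a tier2 heading becomes a document.
        if current["tier2"] is not None:
            docs.append({**current, "body": "\n".join(body).strip()})
        body.clear()

    for line in text.splitlines():
        if end.match(line):
            break
        for depth, tier in enumerate(TIERS):
            match = patterns[tier].match(line)
            if match:
                flush()
                current[tier] = match.group(1)
                # A new heading also ends every deeper tier.
                for deeper in TIERS[depth + 1:]:
                    current[deeper] = None
                break
        else:
            body.append(line)
    flush()
    return docs
```

Printing each returned dict as it is produced gives the kind of console output that makes a specification easy to debug.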


systemd unit /lib/systemd/system/search-engine.service with venv and gunicorn:

Auto-start: systemctl enable search-engine
Run: service search-engine start


If some of your documents have a deeper hierarchy than others, e.g. some are an entire chapter whereas others are a section (or session in my case) within a chapter, then you’ll probably need to rename ‘session’ → ‘heading’, and ‘heading’ → ‘extra’, or something like that to make semantic sense to the user.

The HTML output is somewhat blobbed and is generated largely within the code rather than the template, although it is well organized there. I may move it to the template in the future.

Source files are expected to be in Markdown, but you can simply omit commonmark() for HTML. I would imagine it works fine on plain text, but I haven't tried it.

Happy coding.