Document Search Engine
I built a lightweight search service with Flask and Whoosh, a pure Python search engine library. You can see the engine live at search.sethtalks.com. Fellow Python coders can easily re-purpose the code for themselves.
I designed it with these principles in mind: human readable URLs, integrity and clarity of excerpts, ease of exploration, completeness, and ease of use.
Excerpts with Integrity
Excerpts are complete paragraphs when viewing either a “single” (semantic) result or the best match. Otherwise they are composed of sentence fragments joined by “[…]” omission links. They can be ordered by relevance or chronologically, just like the document hits themselves. Italics within sentence fragments are preserved.
If excerpts expose more than half of a sufficiently long document, then the excerpt count is limited for copy protection.
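One way such a limit could work is sketched below. This is not the engine's actual code: the function name, the `min_length` threshold for a "sufficiently long" document, and the character-based measurement are all assumptions made for illustration.

```python
def limit_excerpts(excerpts, document_length, min_length=2000, max_fraction=0.5):
    """Return the leading excerpts that fit within a fraction of a long document.

    min_length and max_fraction are assumed values, not the engine's settings.
    Documents shorter than min_length characters are left untouched.
    """
    if document_length < min_length:
        return list(excerpts)
    budget = document_length * max_fraction
    kept, used = [], 0
    for excerpt in excerpts:
        # Always keep at least one excerpt, then stop once the budget is exceeded.
        if kept and used + len(excerpt) > budget:
            break
        kept.append(excerpt)
        used += len(excerpt)
    return kept
```

Because the excerpts arrive already ordered (by relevance or chronologically), truncating the list keeps the ones the reader would see first.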
“The conscious mind is not some prodigal child or poor relative of the self.” — NoPR Chapter 4: Session 621, October 16, 1972
The full document location is visible behind the expandable arrow:
You can include stop words with ‘common’:
common:(you are welcome) → you are welcom
Or skip stemming with ‘exact’:
exact:(you are welcome) → you are welcome
You can of course use quotation marks for a phrase, e.g.
common:"you are welcome".
URLs are crafted to be as readable as possible, e.g.
/q/heading:(essay+4)+'conscious+mind'/, by replacing percent-encoded characters with characters that popular browsers permit in URLs.
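A minimal sketch of this kind of encoding, using only Python's standard library, might look like the following. The helper names and the exact set of characters left unescaped are assumptions; the engine's own rules may differ.

```python
from urllib.parse import quote_plus, unquote_plus

# Characters left unescaped for readability. This set is an assumption.
READABLE_SAFE = "():'"


def query_to_path(query):
    """Encode a query as a readable URL path, with spaces written as '+'."""
    return "/q/" + quote_plus(query, safe=READABLE_SAFE) + "/"


def path_to_query(path):
    """Reverse query_to_path: strip the '/q/' prefix and trailing slash, then decode."""
    return unquote_plus(path[len("/q/"):].rstrip("/"))
```

Anything outside the safe set, such as a literal `+` or `/` in the query, is still percent-encoded, so the round trip stays unambiguous.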
Links to pages like
/q/beer+OR+wine/20/ will always function correctly since the number refers to how many hits to skip, instead of a "page" which could change in length.
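The offset idea can be sketched as follows. The function names and the default of ten hits per page are hypothetical, chosen only to illustrate why a skip count stays stable when the page length changes.

```python
def page_of_hits(hits, skip=0, per_page=10):
    """Return the hits starting at offset `skip`, plus the next offset (or None)."""
    window = hits[skip:skip + per_page]
    next_skip = skip + per_page if skip + per_page < len(hits) else None
    return window, next_skip


def page_url(query_path, skip):
    """Build e.g. /q/beer+OR+wine/20/; the offset is omitted for the first hits."""
    return f"/q/{query_path}/{skip}/" if skip else f"/q/{query_path}/"
```

A link with an offset of 20 always starts at the twenty-first hit, whether the page shows 10 hits or 25.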
Key terms are displayed at the end of each document heading. These function as both a summary and links for exploring the most distinctive excerpts from the document. They are generated from Whoosh’s tf-idf implementation, then cleaned of duplicates via stemming.
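The de-duplication step could look roughly like this. It is a sketch only: `toy_stem` is a crude stand-in written for this example, not the stemmer the engine uses, and the input is assumed to be a list of terms already ranked by score.

```python
def toy_stem(word):
    """Crude stand-in for a real stemmer: strips a few common English suffixes."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def dedupe_by_stem(ranked_terms, stem=toy_stem):
    """Keep the first (highest ranked) term for each stem, preserving order."""
    seen = set()
    unique = []
    for term in ranked_terms:
        key = stem(term.lower())
        if key not in seen:
            seen.add(key)
            unique.append(term)
    return unique
```

Passing a proper stemmer as the `stem` argument would give better grouping than the toy version, which only handles a handful of suffixes.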
Search queries that don’t look for text within the body of a document are displayed as a list:
A “more like” section is present at the end of both these and single result pages:
Markdown formatted text files are processed into “documents” (i.e. individual results) with a simple set of regular expressions that specify the portion to be processed, and identify headings. Headings are then categorized into tiers, if desired.
Here is an example
Within the same tier, a new heading ends the previous. Anything else we specify with ‘end’. For example, here we end ‘part’ once we reach the appendix. Similarly, the start of e.g. “Part One” ends the introduction, just as the start of “Part Two” ends the final chapter of Part One. These are very easy to debug with console output.
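The tier rules described above can be illustrated with a small sketch. This is not the project's configuration or parser: the patterns, the `END_RE` mapping, and the function name are hypothetical, and only the names ‘tier0’, ‘tier1’, ‘tier2’, ‘end’, and ‘headings_re’ come from the text.

```python
import re

# Hypothetical patterns; the project's real 'headings_re' settings are not shown here.
HEADINGS_RE = {
    "tier0": re.compile(r"^# (Part \w+)"),
    "tier1": re.compile(r"^## (Chapter \d+)"),
    "tier2": re.compile(r"^### (Session \d+)"),
}
# Hypothetical 'end' rule: an appendix heading closes the current part.
END_RE = {"tier0": re.compile(r"^# Appendix")}
TIERS = ["tier0", "tier1", "tier2"]


def split_documents(text):
    """Return one dict per tier2 heading, carrying its enclosing tier0/tier1 headings."""
    current = dict.fromkeys(TIERS)
    body = []
    documents = []

    def flush():
        # Only text under a tier2 heading becomes a document.
        if current["tier2"] is not None:
            documents.append({**current, "body": "\n".join(body).strip()})
        body.clear()

    for line in text.splitlines():
        for depth, tier in enumerate(TIERS):
            heading = HEADINGS_RE[tier].match(line)
            ended = tier in END_RE and END_RE[tier].match(line)
            if heading or ended:
                flush()
                # A new or ended heading also closes every deeper tier.
                for deeper in TIERS[depth:]:
                    current[deeper] = None
                if heading:
                    current[tier] = heading.group(1)
                break
        else:
            body.append(line)
    flush()
    return documents
```

In this sketch, a new "Part" heading closes the open chapter and session, and text after the appendix heading is not turned into a document because no tier2 heading is active.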
‘tier2’ is the most specific heading and has its own searchable field. We’d like it to be mostly unique for concise URLs to individual documents, e.g.
session:869, though it's not required. You may rename it to fit your application, e.g.
‘tier0’, ‘tier1’, and the lines within ‘headings_re’ identifying them are optional. They participate in the visual style of the short and expanded headings seen above, and search-wise are combined into
heading:"part one" or
systemctl enable search-engine
service search-engine start
If some of your documents have a deeper hierarchy than others, e.g. some are an entire chapter whereas others are a section (or session in my case) within a chapter, then you’ll probably need to rename ‘session’ → ‘heading’, and ‘heading’ → ‘extra’, or something like that to make semantic sense to the user.
The HTML output is somewhat monolithic and is generated largely within the code rather than the template, although it is well organized there. I may move it into the template in the future.
Source files are expected to be in Markdown, but if your sources are already HTML you can simply omit the call to
commonmark(). I would imagine it also works fine on plain text, but I haven't tried it.