Solr + Python — A Tutorial
Update: I have pushed my Python code to GitHub (my repo). My implementation is a tad more advanced than this tutorial. See the readme file and code comments.
Solr is the popular, blazing-fast, open source enterprise search platform built on Apache Lucene (Solr website).
Goal
My goal is to demonstrate building an e-commerce gallery page with search, pagination, filtering and multi-select that mirrors the expectations of a typical user. See this article for a nice explanation of the multi-select filtering I am trying to implement.
Search should work for phrase queries like “mens shrt gap” or “gap 2154abc”, factoring in typos, various word forms (stemming) and phonetic spelling.
Solr Setup
Solr 7 is installed locally on my computer with an active connection to a database. Solr is using the deltaQuery feature (in db-data-config.xml) to detect changes in my database and import those records into Solr.
Web Development Setup
I have a basic Django/React app with Python 3. See this article for ideas on how to integrate Django with React. I recommend following these instructions to create your own Solr client.
I was considering using pySolr as a client, but it lacks good documentation and seems to have been neglected since 2015 (like most Solr libraries). Nevertheless, pySolr can work if you are ready to comb through the GitHub issues and codebase.
If you are using pySolr:
- Paste export DEBUG_PYSOLR='true' into your terminal before running your server, and you will be able to view the URL generated by pySolr.
- The URL you see in your terminal doesn’t seem to be clued in about URL encoding issues, so a query like Dolce & Gabbana will work on your website, but break when you paste the URL into a browser.
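A minimal sketch of that encoding issue, using only the standard library rather than pySolr. The base URL assumes a local core named products, which is a hypothetical name for illustration.

```python
from urllib.parse import urlencode, parse_qsl

# Hypothetical local core; adjust to your own Solr URL.
BASE_URL = "http://localhost:8983/solr/products/select"

raw_query = "Dolce & Gabbana"

# Naive concatenation leaves "&" unescaped, so the browser treats
# everything after it as a separate parameter.
broken_url = BASE_URL + "?q=" + raw_query

# urlencode escapes "&" as %26 and spaces as "+", keeping the value intact.
safe_url = BASE_URL + "?" + urlencode([("q", raw_query)])

print(broken_url)
print(safe_url)
```

Decoding the query string of safe_url with parse_qsl gives back the original value, which is a quick way to check a logged URL before pasting it into a browser.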
Facets & Facet Pivots
Facets are synonymous with product categories or specs. Solr has an option to return the available facets with their respective counts for a specific query. You can control the minimum number of products required in a facet by setting facet.mincount=<number>.
For example, if you are selling brand-name clothing, facets might refer to gender, style and material. If the search was for “mens casual gap”, the returned facets would already be constrained on gender and style.
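As a rough sketch, the request side of faceting could be built like this. The helper name facet_params is made up for illustration; the parameter names (facet, facet.field, facet.mincount) are standard Solr parameters.

```python
from urllib.parse import urlencode

def facet_params(query, facet_fields, mincount=1):
    """Build select parameters that ask Solr for facet counts."""
    params = [
        ("q", query),
        ("facet", "true"),
        ("facet.mincount", str(mincount)),
    ]
    # facet.field may be repeated, once per facet group.
    params += [("facet.field", field) for field in facet_fields]
    return params

params = facet_params("mens casual gap", ["gender", "style", "material"])
print(urlencode(params))
```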
facet.pivots are designed for nested categories, for example parent categories of brand:Gap and brand:Target with children like collection:Summer and collection:Winter.
If you have a bunch of documents, each with a brand and a collection, all you need to do is tell Solr facet.pivot='brand,collection' and you will get nested categories. Think how easy it will be to create a nested filter widget!
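To make the nested filter widget idea concrete, here is a small sketch that turns a pivot result into a nested dict. The sample data and counts are made up; the shape (field, value, count and an optional nested pivot list) follows what Solr returns under facet_counts, facet_pivot.

```python
def nest_pivot(pivot_entries):
    """Turn a Solr facet_pivot list into {value: {"count": n, "children": {...}}}."""
    nested = {}
    for entry in pivot_entries:
        nested[entry["value"]] = {
            "count": entry["count"],
            # Child levels live under the optional "pivot" key.
            "children": nest_pivot(entry.get("pivot", [])),
        }
    return nested

# Made-up sample in the shape returned for facet.pivot=brand,collection.
sample = [
    {"field": "brand", "value": "Gap", "count": 3, "pivot": [
        {"field": "collection", "value": "Summer", "count": 2},
        {"field": "collection", "value": "Winter", "count": 1},
    ]},
    {"field": "brand", "value": "Target", "count": 1, "pivot": [
        {"field": "collection", "value": "Winter", "count": 1},
    ]},
]

tree = nest_pivot(sample)
print(tree["Gap"]["children"]["Summer"]["count"])
```

A React component could then render the outer keys as parent checkboxes and each children dict as the indented options beneath them.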
Example Query
Let’s run through an example:
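What follows is a minimal sketch of such a query, not the original example. It assumes a local core named products, a page size of 24, and brand and collection fields; all of these are illustrative assumptions.

```python
from urllib.parse import urlencode

# Hypothetical local core; adjust to your own Solr URL.
SELECT_URL = "http://localhost:8983/solr/products/select"
PAGE_SIZE = 24  # assumed gallery page size

def build_query(search_text, page=1, brand=None):
    """Return select parameters as a list of (key, value) tuples."""
    params = [
        ("q", search_text or "*:*"),
        ("start", str((page - 1) * PAGE_SIZE)),  # offset of the first result
        ("rows", str(PAGE_SIZE)),
        ("facet", "true"),
        ("facet.field", "brand"),
        ("facet.pivot", "brand,collection"),
        ("facet.pivot.mincount", "1"),
    ]
    if brand:
        # Quote the value so brands with spaces stay a single phrase.
        params.append(("fq", 'brand:"{0}"'.format(brand)))
    return params

params = build_query("mens casual gap", page=3, brand="Gap")
print(SELECT_URL + "?" + urlencode(params))
```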
Phrase search will be discussed in the next section — Schema Modeling.
- I would suggest using tuples for each key-value pair as it will be easier to urlencode. It will also be easier to manipulate, particularly when you have a complicated fq with a ton of AND, OR logic (which will happen very soon if you are doing filtering).
- Each facet group will have its own fq field. This ensures that AND logic is applied across filter groups. To apply OR logic within a facet group, join the quoted values with OR inside that group's fq, for example brand:("Gap" OR "Target").
- facet.pivot.mincount allows you to control the minimum number of products required for a facet.pivot, but beware, if you set it to 0, your server will likely crash.
- I’ve found that field values needed to be formatted in quotes: 'fq': "brand: \"{0}\"".format(current_query['current_brand'])
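- facets are returned in flat arrays like ['brand', 'gap'], not a dict(), which I find inconvenient. Solr's default flat lists alternate value and count, so one way to format them is to pair up alternating items, for example dict(zip(items[::2], items[1::2])).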
- By default, if a user selects a facet in a facet group, Solr will return that facet group with only the selected facet, since the search has been narrowed down. But many times, a user would still like to view the unselected facets and associated counts, to enable multi-select. To allow this functionality, use tagging and excluding. See my StackOverflow answer for a possible implementation.
- To create price ranges as a filter with custom intervals, copy price to a new field with one of the Trie fieldTypes. The new field should have indexed and stored set to false, and docValues set to true. Then follow the instructions to add custom ranges.
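Several of the notes above (tuples, one fq per facet group, quoting values, tagging and excluding, and flat facet arrays) can be sketched together as follows. The helper names are made up for illustration and this is not the implementation from the linked answer; {!tag=...} and {!ex=...} are Solr's local-parameter syntax for tagging a filter and excluding it when counting a facet.

```python
def quote(value):
    """Wrap a field value in double quotes, escaping embedded quotes and backslashes."""
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'

def filter_params(selected):
    """Build one tagged fq per facet group: OR inside a group, AND across groups.

    `selected` maps a field name to the list of values the user ticked.
    """
    params = []
    for field, values in selected.items():
        if not values:
            continue
        joined = " OR ".join(quote(v) for v in values)
        # {!tag=...} names the filter so the matching facet can exclude it.
        params.append(("fq", "{{!tag={0}}}{0}:({1})".format(field, joined)))
    return params

def facet_field_params(fields):
    """Request each facet with its own filter excluded (multi-select counts)."""
    return [("facet.field", "{{!ex={0}}}{0}".format(f)) for f in fields]

def facets_to_dict(flat):
    """Convert a flat [value, count, value, count, ...] list to a dict."""
    return dict(zip(flat[::2], flat[1::2]))

selected = {"brand": ["Gap", "Target"], "gender": ["men"]}
print(filter_params(selected) + facet_field_params(["brand", "gender"]))
print(facets_to_dict(["gap", 12, "target", 5]))  # illustrative counts
```

Because each group is its own fq, Solr intersects the groups (AND), while the parenthesised OR list keeps every ticked value inside a group selectable.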
Schema Modeling
If you can get past the idea that fields exist simply to store properties of data, and embrace the idea that you can manipulate data so it can be found as users expect it, then you can begin to effectively program relevance rules into the search engine. (Relevant Search, Chapter 5)
We are ready to modify fields in our document schema to conform to the users’ perception of our products.
Take a look at the documentation about how to update the schema, particularly the sections on tokenizing and filtering. Learn about stemming filters. Ask yourself which tokenizers/filters are relevant for your situation, and whether they should be applied at query or index time.
I will be following a recommendation in the documentation to copy all fields a user might be interested in into a single copyall field. This solves the albino elephant issue and signal discordance:
…As we’ve stated, when users search, they typically don’t care how documents decompose into individual fields. Many search users expect to work with documents as a single unit: the more of their search terms that match, the more relevant the document ought to be. It may surprise you to know that search engine features that implemented this idea were late to the party. Instead, Lucene-based multifield search depended on field-centric techniques. Instead of the search terms, field-centric search makes field scores the center of the ranking function. In this section, we explore exactly why field-centric approaches can create relevance issues. You’ll see that having ranking functions centered on fields creates two problems:
The albino elephant problem — A failure to give a higher rank to documents that match more search terms.
Signal discordance — Relevance scoring based on unintuitive scoring of the constituent parts (title versus body) instead of scoring of the whole document or more intuitive larger parts, such as the entire article’s text or the people associated with this film. (Relevant Search, Chapter 6)
We will be using the Schema API through the Admin UI. You cannot edit the schema file manually (explanation). Here is the recipe for creating the copyall field:
1. Create a fieldType for the field. I am using the same fieldType for both index and query time. I have kept the stemming light to ensure that brand names stay intact.
2. Create a copyall field with facets as the fieldType. Set multiValued=true to allow multiple values in the field (as an array). Set omitNorms=true since users don’t care about the length of each field (docs), and we don’t want Solr to care either.
3. Create copyFields for every field in the data source that you want to be copied. Remember, there is no chaining of copyFields.
4. Repeat steps 1–3 if you want to create a copyall for phonetic spelling. Use an appropriate fieldType. I am using the Beider-Morse Filter.
5. Add a tie breaker of 1 to get a most-fields functionality. The docs provide a nice explanation.
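The same recipe can also be written as Schema API JSON commands, which may be easier to keep under version control than clicks in the Admin UI. This is a sketch under assumptions: the fieldType name text_light, the analyzer chain (standard tokenizer, lowercase, minimal English stemmer), the source fields brand and collection, and the copyall_phonetic field name are illustrative choices, not the ones used in this project.

```python
import json

# Hypothetical name; the analyzer chain is one possible "light stemming" setup.
FIELD_TYPE = "text_light"

add_field_type = {
    "add-field-type": {
        "name": FIELD_TYPE,
        "class": "solr.TextField",
        # A single "analyzer" applies to both index and query time.
        "analyzer": {
            "tokenizer": {"class": "solr.StandardTokenizerFactory"},
            "filters": [
                {"class": "solr.LowerCaseFilterFactory"},
                {"class": "solr.EnglishMinimalStemFilterFactory"},
            ],
        },
    }
}

add_field = {
    "add-field": {
        "name": "copyall",
        "type": FIELD_TYPE,
        "multiValued": True,
        "omitNorms": True,
    }
}

def copy_field_commands(sources, dest="copyall"):
    """One add-copy-field command per source field."""
    return [{"add-copy-field": {"source": s, "dest": dest}} for s in sources]

# Step 5: search both copy fields with edismax and a tie breaker of 1.
search_params = [
    ("defType", "edismax"),
    ("qf", "copyall copyall_phonetic"),  # second field name is assumed
    ("tie", "1.0"),
]

commands = [add_field_type, add_field] + copy_field_commands(["brand", "collection"])
for command in commands:
    # Each JSON body would be POSTed to /solr/<core>/schema.
    print(json.dumps(command))
```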
Some ideas:
- Add index-time boosts for more popular products that you want to rank higher in the search results.
- Use function queries to customize anything about your query, including relevancy scoring.
- Consider the N-Gram filter for typo tolerance.
- Consider the Edge-N-Gram filter for autocomplete.
- Consider using the text_en fieldType for regular English words (it is one of the many fieldTypes which come out of the box).
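To see why the two N-Gram filters help, here is a plain-Python illustration of the tokens they are built around. This is not Solr code and the order in which Solr emits grams may differ; it only shows the idea that a typo still shares some grams with the correct word, and that edge grams are simply prefixes.

```python
def ngrams(token, min_size, max_size):
    """All substrings of token with length min_size..max_size."""
    return [
        token[start:start + size]
        for size in range(min_size, max_size + 1)
        for start in range(len(token) - size + 1)
    ]

def edge_ngrams(token, min_size, max_size):
    """Prefixes of token with length min_size..max_size."""
    longest = min(max_size, len(token))
    return [token[:size] for size in range(min_size, longest + 1)]

# Autocomplete: every prefix a user might have typed so far.
print(edge_ngrams("shirt", 2, 4))

# Typo tolerance: "shrt" still shares bigrams with "shirt".
print(set(ngrams("shirt", 2, 2)) & set(ngrams("shrt", 2, 2)))
```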
Debugging and Workflow
- Check Analysis in the Admin UI for how particular terms are analyzed at index or query time.
- Add a console.log in your code to print the URL for every query. Set debugQuery=true and read the parsedquery and explain. All the math fun is lurking in the explain (see Relevant Search, Chapter 2).
- After re-configuring the schema, make sure to delete all docs in your index and do a fresh full-import from your database. This can be done in the Admin UI.
- If you need to debug the database import, use the Debug-Mode with verbose output.
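On the Python side, the query-debugging tip could look roughly like this. The core URL and helper names are assumptions for illustration; debugQuery is a standard Solr parameter, and the debug section of a JSON response carries parsedquery and explain keys.

```python
from urllib.parse import urlencode

# Hypothetical local core; adjust to your own Solr URL.
SELECT_URL = "http://localhost:8983/solr/products/select"

def debug_url(params):
    """Return the full request URL with debugQuery switched on."""
    return SELECT_URL + "?" + urlencode(list(params) + [("debugQuery", "true")])

def summarize_debug(response):
    """Pull the parsed query and per-document explanations from a decoded JSON response."""
    debug = response.get("debug", {})
    return debug.get("parsedquery"), debug.get("explain", {})

url = debug_url([("q", "mens casual gap")])
print(url)  # log this for every query while developing
```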
Further Reading
The examples in Relevant Search use Elasticsearch, but Appendix B provides mappings to Solr. If the book is too long, read chapters 5 & 6. These chapters tackle which strategy to use for matching multi-field (phrase) search with the most relevant results.