Solr + Python — A Tutorial

Avi Kaminetzky
Apr 19, 2018

Update: I have pushed my Python code to GitHub (my repo). My implementation is a tad more advanced than this tutorial. See the readme file and code comments.

Solr is the popular, blazing-fast, open source enterprise search platform built on Apache Lucene (Solr website).

Goal

My goal is to demonstrate building an e-commerce gallery page with search, pagination, filtering and multi-select that mirrors the expectations of a typical user. See this article for a nice explanation of the multi-select filtering I am trying to implement.

Search should work for phrase queries like “mens shrt gap” or “gap 2154abc”, factoring in typos, various word forms (stemming) and phonetic spelling.

Solr Setup

Solr 7 is installed locally on my computer with an active connection to a database. Solr uses the deltaQuery feature (in db-data-config.xml) to detect changes in my database and import those records into Solr.

Web Development Setup

I have a basic Django/React app with Python 3. See this article for ideas on how to integrate Django with React. I recommend following these instructions to create your own Solr client.

I was considering using pySolr as a client, but it lacks good documentation and seems to have been neglected since 2015 (like most Solr libraries). Nevertheless, pySolr can work if you are ready to comb through the GitHub issues and codebase.

If you are using pySolr:

  • Run export DEBUG_PYSOLR='true' in your terminal before starting your server, and you will be able to view the URL generated by pySolr.
  • The URL printed in your terminal does not appear to be URL-encoded, so a query like Dolce & Gabbana will work on your website, but break when you paste the URL into a browser.

Facets & Facet Pivots

Facets are synonymous with product categories or specs. Solr has an option to return the available facets with their respective counts for a specific query. You can control the minimum number of products required in a facet by setting facet.mincount=<number>.

For example, if you are selling brand-name clothing, facets might refer to gender, style and material. If the search was for “mens casual gap”, the facets returned would reflect the constraints that the query places on gender and style.
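To make the shape concrete, here is a minimal sketch of how facet counts for such a search might come back in Solr's JSON response, and one way to reshape them. The field names and counts are invented for illustration, and facet_list_to_dict is a hypothetical helper, not part of any library. Solr's default JSON output lists each facet field as a flat, alternating list of values and counts.

```python
# Hypothetical excerpt of a Solr JSON response. The fields and counts are
# invented for illustration only.
response = {
    "facet_counts": {
        "facet_fields": {
            "gender": ["mens", 42],
            "style": ["casual", 42],
            "material": ["cotton", 30, "denim", 9, "wool", 3],
        }
    }
}


def facet_list_to_dict(flat):
    """Turn a flat [value, count, value, count, ...] list into a dict."""
    return dict(zip(flat[::2], flat[1::2]))


# One dict of {value: count} per facet field.
facets = {
    field: facet_list_to_dict(flat)
    for field, flat in response["facet_counts"]["facet_fields"].items()
}
print(facets["material"])
```

A dict per facet group is usually easier to hand to a front-end filter widget than the flat list.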

facet.pivot is designed for nested categories. Suppose we had parent categories of brand:Gap and brand:Target, with children like collection:Summer and collection:Winter.

If you have a bunch of documents, each with a brand and a collection, all you need to do is tell Solr facet.pivot=brand,collection and you will get nested categories. Think how easy it will be to create a nested filter widget!
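As a rough sketch of the nested-widget idea, the snippet below walks the facet_pivot section of a Solr JSON response and builds a nested dict. The sample data is invented, and pivot_to_tree is a hypothetical helper. It assumes the usual pivot layout, where each entry carries field, value, count and an optional pivot list of children.

```python
# Request parameters for a brand -> collection pivot, as key-value tuples.
pivot_params = [
    ("facet", "true"),
    ("facet.pivot", "brand,collection"),
]

# Hypothetical excerpt of the matching response; the counts are invented.
facet_pivot = {
    "brand,collection": [
        {
            "field": "brand", "value": "Gap", "count": 10,
            "pivot": [
                {"field": "collection", "value": "Summer", "count": 6},
                {"field": "collection", "value": "Winter", "count": 4},
            ],
        },
        {"field": "brand", "value": "Target", "count": 3},
    ]
}


def pivot_to_tree(entries):
    """Recursively convert pivot entries into {value: {count, children}}."""
    return {
        entry["value"]: {
            "count": entry["count"],
            "children": pivot_to_tree(entry.get("pivot", [])),
        }
        for entry in entries
    }


tree = pivot_to_tree(facet_pivot["brand,collection"])
print(tree["Gap"]["children"]["Summer"]["count"])
```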

Example Query

Let’s run through an example:
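The following is a minimal sketch of such a query, not the exact code from my app. It assumes a hypothetical core named products on a local Solr install, and build_select_url is an illustrative helper. Parameters are kept as a list of tuples so that repeated keys such as facet.field survive, and urlencode takes care of characters like & and quotes.

```python
from urllib.parse import urlencode

# Hypothetical host and core name; adjust to your own Solr install.
SOLR_SELECT_URL = "http://localhost:8983/solr/products/select"


def build_select_url(query, brand, page=1, rows=24):
    """Build a search URL with one brand filter, facets and pagination."""
    params = [
        ("q", query),
        ("fq", 'brand:"{0}"'.format(brand)),
        ("facet", "true"),
        ("facet.field", "gender"),
        ("facet.field", "style"),
        ("facet.mincount", "1"),
        ("rows", str(rows)),
        ("start", str((page - 1) * rows)),  # zero-based offset of the page
        ("wt", "json"),
    ]
    return SOLR_SELECT_URL + "?" + urlencode(params), params


url, params = build_select_url("mens casual", "Dolce & Gabbana", page=2)
print(url)
```

Because the URL is encoded up front, it can be pasted into a browser as-is, which sidesteps the Dolce & Gabbana problem mentioned earlier.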

Phrase search will be discussed in the next section — Schema Modeling.

  • I would suggest using tuples for each key-value pair as it will be easier to urlencode. It will also be easier to manipulate, particularly when you have a complicated fq with a ton of AND, OR logic (which will happen very soon if you are doing filtering).
  • Each facet group will have its own fq field. This ensures that AND logic is applied across filter groups. Within a facet group, join the selected values with OR inside that group's single fq.
  • facet.pivot.mincount allows you to control the minimum number of products required for a facet.pivot, but beware, if you set it to 0, your server will likely crash.
  • I’ve found that field values needed to be formatted in quotes: 'fq': "brand: \"{0}\"".format(current_query['current_brand'])
  • Facets are returned in arrays like ['brand', 'gap'] rather than a dict(), which I find inconvenient, so it helps to reformat them before use.
  • By default, if a user selects a facet in a facet group, Solr will return that facet group with only the selected facet, since the search has been narrowed down. But a user will often still want to view the unselected facets and their associated counts, to enable multi-select. To allow this functionality, use tagging and excluding. See my StackOverflow answer for a possible implementation.
  • To create price ranges as a filter with custom intervals, copy price to a new field with one of the Trie fieldTypes. The new field should have indexed and stored set to false, and docValues set to true. Then follow the instructions to add custom ranges.
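The filtering points above can be sketched as follows. This is one possible approach under stated assumptions, not the code from my app: the helper names and the <field>_tag naming scheme are hypothetical. Each facet group gets one fq (AND across groups), the values inside a group are joined with OR, and Solr's {!tag=...} and {!ex=...} local params keep unselected facets visible for multi-select.

```python
def quote(value):
    """Wrap a field value in double quotes, escaping backslashes and quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return '"{0}"'.format(escaped)


def build_filter_params(selected):
    """One tagged fq per facet group; values within a group are ORed."""
    params = []
    for field, values in selected.items():
        if not values:
            continue  # nothing selected in this group, so no filter
        ored = " OR ".join(quote(v) for v in values)
        params.append(("fq", "{{!tag={0}_tag}}{0}:({1})".format(field, ored)))
    return params


def build_facet_params(facet_fields):
    """Facet on each field while excluding that field's own tagged filter."""
    params = [("facet", "true")]
    for field in facet_fields:
        params.append(("facet.field", "{{!ex={0}_tag}}{0}".format(field)))
    return params


selected = {"brand": ["Gap", "Target"], "gender": ["mens"], "style": []}
filter_params = build_filter_params(selected)
facet_params = build_facet_params(["brand", "gender", "style"])
for key, value in filter_params + facet_params:
    print(key, value)
```

Since everything is a list of tuples, the result can be appended to the rest of the query parameters and URL-encoded in one step.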

Schema Modeling

If you can get past the idea that fields exist simply to store properties of data, and embrace the idea that you can manipulate data so it can be found as users expect it, then you can begin to effectively program relevance rules into the search engine. (Relevant Search, Chapter 5)

We are ready to modify fields in our document schema to conform to the users’ perception of our products.

Take a look at the documentation about how to update the schema, particularly the sections on tokenizing and filtering. Learn about stemming filters. Ask yourself which tokenizers and filters are relevant for your situation, and whether each should be applied at query or index time.

I will be following a recommendation in the documentation to copy all fields a user might be interested in into a single copyall field. This solves the albino elephant issue, and signal discordance:

…As we’ve stated, when users search, they typically don’t care how documents decompose into individual fields. Many search users expect to work with documents as a single unit: the more of their search terms that match, the more relevant the document ought to be. It may surprise you to know that search engine features that implemented this idea were late to the party. Instead, Lucene-based multifield search depended on field-centric techniques. Instead of the search terms, field-centric search makes field scores the center of the ranking function. In this section, we explore exactly why field-centric approaches can create relevance issues. You’ll see that having ranking functions centered on fields creates two problems:

The albino elephant problem — A failure to give a higher rank to documents that match more search terms.

Signal discordance — Relevance scoring based on unintuitive scoring of the constituent parts (title versus body) instead of scoring of the whole document or more intuitive larger parts, such as the entire article’s text or the people associated with this film. (Relevant Search, Chapter 6)

We will be using the Schema API through the Admin UI. You cannot edit the schema file manually (explanation). Here is the recipe for creating the copyall field:

1. Create a fieldType for the field. I am using the same fieldType for both index and query time. I have kept the stemming light to ensure that brand names stay intact.

2. Create a copyall field with facets as the fieldType. Set multiValued=true to allow multiple values in the field (as an array). Set omitNorms=true since users don’t care about the length of each field (docs), and we don’t want Solr to care either.

3. Create copyFields for every field in the data source that you want to be copied. Remember, there is no chaining of copyFields.

4. Repeat steps 1–3 if you want to create a copyall for phonetic spelling. Use an appropriate fieldType. I am using the Beider-Morse Filter.

5. Add a tie breaker of 1 to get a most-fields functionality. The docs provide a nice explanation.
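Steps 1 to 3 can also be expressed as Schema API commands. The sketch below only builds the JSON payloads; each one would be POSTed to the core's /schema endpoint with a JSON content type. The analyzer chain (standard tokenizer, lowercase filter, minimal English stemmer) is my assumption of what a light-stemming fieldType could look like, not necessarily the one I used, and build_schema_commands and the source field names are hypothetical.

```python
import json


def build_schema_commands(field_type, dest_field, source_fields):
    """Return Schema API payloads for a fieldType, a field and copyFields."""
    add_field_type = {
        "add-field-type": {
            "name": field_type,
            "class": "solr.TextField",
            # No separate index/query analyzers: one chain is used for both.
            "analyzer": {
                "tokenizer": {"class": "solr.StandardTokenizerFactory"},
                "filters": [
                    {"class": "solr.LowerCaseFilterFactory"},
                    # Assumed light stemmer, chosen to keep brand names intact.
                    {"class": "solr.EnglishMinimalStemFilterFactory"},
                ],
            },
        }
    }
    add_field = {
        "add-field": {
            "name": dest_field,
            "type": field_type,
            "multiValued": True,
            "omitNorms": True,
        }
    }
    # One copyField per source; copyFields do not chain.
    add_copy_fields = [
        {"add-copy-field": {"source": source, "dest": dest_field}}
        for source in source_fields
    ]
    return [add_field_type, add_field] + add_copy_fields


commands = build_schema_commands("facets", "copyall", ["brand", "style"])
for command in commands:
    print(json.dumps(command, indent=2))
```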

Some ideas:

  • Boost products that are more popular so that they rank higher in the search results. Index-time boosts were removed in Solr 7, so on that version store popularity in a field and boost on it at query time.
  • Use function queries to customize anything about your query, including relevancy scoring.
  • Consider the N-Gram filter for typo tolerance.
  • Consider the Edge-N-Gram filter for autocomplete.
  • Consider using the text_en fieldType for regular English words (it is one of the many fieldTypes which come out of the box).

Debugging and Workflow

  • Check Analysis in the Admin UI for how particular terms are analyzed at index or query time.
  • Add a console.log in your code to print the URL for every query. Set debugQuery=true and read the parsedquery and explain sections. All the math fun is lurking in the explain (see Relevant Search, Chapter 2).
  • After re-configuring the schema, make sure to delete all docs in your index and do a fresh full-import from your database. This can be done in the Admin UI.
  • If you need to debug the database import, use the Debug-Mode with verbose output.
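On the Python side, a debug query might be sketched like this. The core name, the copyall_phonetic field name and both helper functions are assumptions for illustration. The parameters use the edismax parser with the tie breaker from the schema section, and the debug section of the JSON response is read defensively in case debugQuery was not set.

```python
from urllib.parse import urlencode

# Hypothetical host and core name; adjust to your own Solr install.
SOLR_SELECT_URL = "http://localhost:8983/solr/products/select"


def build_debug_url(query):
    """Build a search URL that asks Solr to explain its scoring."""
    params = [
        ("q", query),
        ("defType", "edismax"),
        ("qf", "copyall copyall_phonetic"),  # assumed field names
        ("tie", "1.0"),
        ("debugQuery", "true"),
        ("wt", "json"),
    ]
    return SOLR_SELECT_URL + "?" + urlencode(params)


def extract_debug(response):
    """Return (parsedquery, explain) from a decoded Solr JSON response."""
    debug = response.get("debug", {})
    return debug.get("parsedquery"), debug.get("explain", {})


url = build_debug_url("mens shrt gap")
print(url)  # log the URL for every query while debugging
```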

Further Reading

The examples in Relevant Search use Elasticsearch, but Appendix B provides mappings to Solr. If the book is too long, read chapters 5 & 6. These chapters tackle which strategy to use for matching multi-field (phrase) search with the most relevant results.
