Using Solr as a NoSQL Database Instead of a Search Engine
In the past ten years, we have seen a surge of solutions and startup companies revolving around a newly popularized database solution called NoSQL. NoSQL stands for Not Only SQL, as it is a non-relational database solution that uses query languages separate from traditional SQL, although it may share similarities or be used alongside SQL databases. You are likely familiar with at least one of the popular NoSQL databases, such as MongoDB, Cassandra, Redis, HBase, and Couchbase. However, there is one database that is often not considered when thinking about NoSQL solutions…
Solr
As an Apache Software Foundation project, Solr is one of the two most popular open source search engines in the world, with Elasticsearch as the only project capable of competing. Naturally, many engineers think of Solr as a search engine, and not as a NoSQL solution, despite it sharing similarities with popular NoSQL databases (document store, easy replication, and high availability). Although full-text search is where Solr truly shines, it can be very useful as a NoSQL database. It provides quick filtering, sorting, and faceting, at the expense of not being able to easily represent relationships between types of documents (where a relational database would excel).
Before I joined O’Reilly Media’s Search team, I mostly thought of Solr as a NoSQL solution, as my prior Solr experience was typically in a big data environment. As an intern at IBM, I was first introduced to Solr as a component of the IBM Open Platform, which is a suite of big data services revolving around Hadoop, similar to Cloudera’s CDH. I gained more exposure to Solr when I joined IBM Watson Health as a software engineer working on their Explorys applications.
I had the pleasure of building an application (with other IBMers) called Worklist, which helps health providers (doctors, nurses, etc.) target at-risk patients by organizing them in a spreadsheet-like system. We used Solr to support quick filtering and sorting, allowing patients to be categorized into groups (or “Worklists”) based on specific health criteria. For example, a doctor could create a Worklist for patients above the age of 65 with a systolic blood pressure between 130 and 140, making it easy to identify older patients at risk for the onset of hypertension. The doctor could even sort that Worklist by descending blood pressure to target the most at-risk patients within the group.
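In Solr terms, a Worklist like this maps naturally onto filter queries (the fq parameter) plus a sort. The snippet below is a minimal sketch of how such a request could be assembled, not the actual Worklist code: the field names `age` and `systolic_bp`, the page size, and the helper name are hypothetical, and it only builds the query string rather than calling a server.

```python
from urllib.parse import urlencode


def build_worklist_params(min_age_exclusive, bp_low, bp_high, rows=25):
    """Build Solr query parameters for a hypothetical Worklist.

    Field names (age, systolic_bp) are assumptions for illustration only.
    """
    return [
        ("q", "*:*"),  # match every document, then narrow with filters
        # {65 TO *] is an exclusive lower bound: strictly older than 65
        ("fq", f"age:{{{min_age_exclusive} TO *]"),
        # [130 TO 140] is an inclusive range on both ends
        ("fq", f"systolic_bp:[{bp_low} TO {bp_high}]"),
        # highest blood pressure first
        ("sort", "systolic_bp desc"),
        ("rows", str(rows)),
    ]


params = build_worklist_params(65, 130, 140)
query_string = urlencode(params)  # append to a collection's /select URL
print(query_string)
```

Keeping each criterion in its own fq, rather than folding everything into q, lets Solr cache and reuse each filter independently, which suits an interface where users mix and match the same criteria repeatedly.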
Although Worklist was used by doctors with only 20–30 patients, it was also often used by administrators or directors to manage the millions of patients who receive care at their organization. Each patient’s health record had 30–40 fields that could be used for filtering or sorting, and users usually created Worklists built from a large combination of filters and sorts (much more complex than the example I gave above). Grounded in big data technology, the older Explorys applications relied on pulling health data directly from HBase. Although sorting and filtering were possible with these HBase-powered applications, they were too slow when used at the scale of millions of health records. Solr provided three great features essential to the Worklist application:
Fast filtering and sorting. The “bread and butter” of Solr is that it utilizes an inverted index for quick multi-term searching, filtering, and sorting. Each document in Solr has a unique ID, and Solr’s filter queries (the fq parameter) store only document IDs (instead of the entire field or document), which allows Solr to quickly include or exclude documents based on any criteria, especially when the filter results are cached. Of course, Solr responses are far from instantaneous when filtering or sorting millions of documents, but they were much faster than the legacy HBase solutions.
Instantaneous faceting. Since Solr’s filtering is based on document IDs and caching, Solr can retrieve the count of documents that match a set of filter criteria almost instantly. The Worklist app had a filter dropdown where a health provider could add as many filters and boolean operators (AND, OR, etc.) as desired. Enabling this complex filtering made it crucial for users to get feedback before actually applying the filters. You can imagine how frustrating it would be if the user added filters, hit the “Apply” button, and the spreadsheet then showed 0 patients, forcing the user to reopen the filter window, change filters, hit “Apply” again, and hope for the best.
The most useful part of this feature was that health providers could see how many patients matched the complex filter criteria before actually applying filters to the active patient list. In this case, we did not even have to configure complex facet definitions, but rather just used the total count returned by Solr. By sending the parameter “rows=0” to Solr, we provided immediate feedback to our users while they were in the process of adding and configuring filters, which is essential to a robust filtering experience.
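The count-before-apply feedback can be sketched roughly as follows. This is an illustration under assumptions rather than the Worklist implementation: the helper names and field names are hypothetical, and the sample response is hand-written to mirror the shape of Solr’s JSON output (a `response` object carrying `numFound`), not captured from a real server.

```python
import json
from urllib.parse import urlencode


def build_count_params(filters):
    """Ask Solr only for the match count: rows=0 returns no documents."""
    params = [("q", "*:*"), ("rows", "0")]
    params.extend(("fq", f) for f in filters)  # one fq per pending filter
    return params


def extract_match_count(response_body):
    """Read the total hit count from a Solr JSON response body."""
    return json.loads(response_body)["response"]["numFound"]


# Filters the user has configured but not yet applied (hypothetical fields).
pending_filters = ["age:{65 TO *]", "systolic_bp:[130 TO 140]"]
count_params = build_count_params(pending_filters)
count_query = urlencode(count_params)

# Illustrative response body only; the count is made up for the example.
sample_response = (
    '{"responseHeader": {"status": 0},'
    ' "response": {"numFound": 1284, "start": 0, "docs": []}}'
)
match_count = extract_match_count(sample_response)
print(f"{match_count} patients match the pending filters")
```

Because no documents are returned, a request like this stays cheap enough to fire on every change in the filter window.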
High availability with easy replication. We never had to worry about a disruption in service from a Solr server going down or from inconsistent data across the different servers. We used SolrCloud, which uses multiple replicas (servers) that handle web requests and replicate from a master source. If one of the replicas goes down, the other replicas handle the load while Solr automatically creates a new replica as needed.
Experiences like this taught me that Solr can be a useful NoSQL database solution, and is often used with other big data services. With this experience, I thought that I was well-equipped to join O’Reilly Media and improve their Solr-based search engine. I found out the hard way that using Solr as a NoSQL database is drastically different from using it as a relevance-driven search engine, despite the fact that its filtering and faceting features are rooted in search technology (like the inverted index).
Creating a search engine with good relevancy requires much more than maintaining a Solr schema and supporting filtering, sorting, and faceting. I explore this more in my next Medium post, where I discuss an essential component of a good relevance-driven search engine, and how the Search team at O’Reilly has pivoted away from thinking of our Solr instance as a NoSQL database.