The Evolution of Coursera Search: Enabling Product Innovation Through Technical Innovation
At Coursera, millions of learners use search to discover courses. For learners with a specific intent, we need relevant results. For learners with less concrete goals, we need to give a feeling of serendipity by injecting novelty and diversity into the results. In this blog post, we detail how our new search system, powered by Algolia, allows us to iterate toward this future.
Previous Search System
Search at Coursera has undergone two major revamps. Our initial approach was to return all the course data and search on the frontend. This approach became untenable as our catalog grew to hundreds of courses. We then revamped and constructed a search system powered by Solr. The architecture is as follows:
We indexed data from our online systems. We extracted associated metadata such as the instructors’ names. We supported features such as spell checking, stemming, stop word filtering, and word canonicalization. As seen in the above figure, there is complexity around data retrieval and processing by relying upon online systems.
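To make the text-analysis features above concrete, here is a minimal sketch of an analysis chain covering stop word filtering, word canonicalization, and stemming. It is an illustration only: the stop word list, the canonicalization table, and the crude suffix-stripping stemmer are all assumptions made for this example (Solr ships its own analyzers), and spell checking is left out.

```python
import re

# Hypothetical resources, chosen only for this illustration.
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "the", "to"}
CANONICAL = {"ml": "machine learning", "js": "javascript"}
SUFFIXES = ("ing", "s")  # a deliberately crude stand-in for a real stemmer


def crude_stem(token):
    """Strip one known suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Lowercase, tokenize, canonicalize, drop stop words, then stem."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    analyzed = []
    for token in tokens:
        # Canonicalization may expand one token into several words.
        for word in CANONICAL.get(token, token).split():
            if word not in STOP_WORDS:
                analyzed.append(crude_stem(word))
    return analyzed


print(analyze("Introduction to ML for Beginners"))
# ['introduction', 'machine', 'learn', 'beginner']
```

Running the same chain over both documents and queries is what lets "ML for beginners" match a course titled "Machine Learning".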
For relevance tuning, our Solr schema contained fields with hand-tuned weights. For instance, the title of a course should have more influence on the score than the description. We also had a dynamic boosting system that allowed for behaviors like boosting the scores of documents in the learner’s native language. Lastly, a reranking module allowed for skills-based search by taking the Solr-scored entities and applying custom reorderings for specific types of queries.
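The three layers described above (weighted fields, dynamic boosts, and a reranking step) can be sketched in a few lines. This is a simplified illustration, not our production code: the weights, the multiplicative language boost, the `Course` fields, and the term-counting score are all assumptions made for the example, and Solr's real scoring is considerably more involved.

```python
from dataclasses import dataclass

# Hypothetical weights: a title match counts more than a description match.
FIELD_WEIGHTS = {"title": 5.0, "description": 1.0}
LANGUAGE_BOOST = 1.5  # hypothetical multiplier for the learner's language


@dataclass
class Course:
    title: str
    description: str
    language: str
    skills: tuple = ()


def base_score(course, query):
    """Add a field's weight once for every query term found in that field."""
    terms = query.lower().split()
    score = 0.0
    for field_name, weight in FIELD_WEIGHTS.items():
        words = getattr(course, field_name).lower().split()
        score += weight * sum(1 for term in terms if term in words)
    return score


def boosted_score(course, query, learner_language):
    """Apply the dynamic language boost on top of the base score."""
    score = base_score(course, query)
    if course.language == learner_language:
        score *= LANGUAGE_BOOST
    return score


def search(courses, query, learner_language, skill=None):
    """Score and sort, then optionally rerank courses teaching `skill` first."""
    scored = [(boosted_score(c, query, learner_language), c) for c in courses]
    scored = [pair for pair in scored if pair[0] > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    ranked = [course for _, course in scored]
    if skill is not None:
        # Stable sort: skill matches move up, score order is kept within groups.
        ranked.sort(key=lambda c: skill not in c.skills)
    return ranked


courses = [
    Course("Machine Learning", "Learn models from data", "en", ("machine learning",)),
    Course("Data Science Basics", "Intro to machine learning and statistics", "en"),
    Course("Aprendizaje Profundo", "Curso de machine learning", "es", ("machine learning",)),
]
print([c.title for c in search(courses, "machine learning", "en")])
# ['Machine Learning', 'Data Science Basics', 'Aprendizaje Profundo']
print([c.title for c in search(courses, "machine learning", "en", skill="machine learning")])
# ['Machine Learning', 'Aprendizaje Profundo', 'Data Science Basics']
```

Note that the reranking step runs after scoring and knows nothing about the scores themselves, which is part of what made the combined behavior hard to reason about.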
This system has powered our search for the last four years, but we faced some challenges:
- The indexing code became complicated and hard to iterate on: We do many data transformations, but because indexing pulls data from production APIs, we cannot reuse much of the logic that powers our data warehouse.
- Relevance tuning is complex: The dismax scoring system powering Solr is complicated. Boosting requires careful reading of documentation (pf vs. qf?), and reranking is a separate system outside of Solr. This means search engineers and data scientists need to keep a large amount of knowledge in context to make small changes to relevance rankings. Product managers and non-search engineers do not have visibility into relevance.
- Schema migration is an involved and complex process: We need support from our infrastructure team to add fields to our indices and to change Solr configurations.
- UI iteration is hard: We want to experiment with different types of search UI, like having rich contextual autocomplete. But this requires reconfiguring the data processing, indexing, APIs, and the front end.
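To illustrate the relevance-tuning challenge above, here is a sketch of the kind of request parameters a dismax-style query involves. `defType`, `qf`, `pf`, and `bq` are parameters of Solr's Extended DisMax query parser; the field names, weights, and the `build_solr_params` helper are hypothetical and do not reflect our actual schema.

```python
from urllib.parse import urlencode


def build_solr_params(user_query):
    """Assemble edismax parameters for a query (hypothetical fields and weights)."""
    return {
        "defType": "edismax",
        "q": user_query,
        # qf ("query fields"): the fields terms are matched in, with weights.
        "qf": "title^5 description^1",
        # pf ("phrase fields"): an extra boost when the query terms appear
        # close together as a phrase in these fields.
        "pf": "title^10",
        # bq ("boost query"): an additional clause that raises the score of
        # matching documents, e.g. those in the learner's language.
        "bq": "language:en^2",
    }


params = build_solr_params("machine learning")
query_string = urlencode(params)
print(query_string)
```

Even in this small example, three separate parameters influence the final score, and the reranking step is not visible here at all.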
The main requirements we identified as we looked at iterating on search are the following:
- For relevance tuning, we want the ability to easily combine document reranking with document relevance scores.
- On the backend, we should have the ability to easily modify documents and modify schemas without migration costs.
- On the front end, UI iteration should not be blocked by our search system. This means having a set of widgets that we can style, and a clean integration with the backend system that allows us to not have to think about data fetching and API calls.
In our current system, the median response time is several hundred milliseconds, which blocks experiences like search-as-you-type. Our new search system should return all the data necessary to power the search experience, while trimming the median response time to less than 10ms.
Where we are today with Algolia
Today, we’ve simplified the search system into three parts: processing, tuning, and display.
- Processing is consolidated within our Enterprise Data Warehouse (EDW). Data scientists and engineers can use familiar tools such as notebooks to process the data to be indexed for search. To begin, we’ve ported the processing logic from our Solr search system, including skills-based search. A standard export workflow then takes the processed data and populates the Algolia indices with it.
- Tuning is done through the Algolia UI. Algorithmic iterations happen here. For instance, skills-based search is implemented by adding a custom ranking criterion.
- Display uses the react-instantsearch library. This UI widget library lets us stop worrying about low-level concerns such as maintaining search state or calling APIs.
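The processing and tuning steps above can be sketched as follows. This is an illustration under stated assumptions, not our export workflow: the warehouse rows, the attribute names (`skill_score`, `enrollments`), and the helper functions are hypothetical. `objectID`, `searchableAttributes`, and `customRanking` are Algolia's names for the record identifier and two index settings. The sketch makes no network calls; it only simulates locally how custom ranking criteria act as successive tie-breakers among records that match a query equally well on textual criteria.

```python
# Hypothetical rows, as they might come out of a warehouse query.
warehouse_rows = [
    {"course_id": "c1", "title": "Machine Learning", "skill_score": 0.9, "enrollments": 1200},
    {"course_id": "c2", "title": "Applied Machine Learning", "skill_score": 0.9, "enrollments": 5000},
    {"course_id": "c3", "title": "Machine Learning for Business", "skill_score": 0.4, "enrollments": 9000},
]


def to_record(row):
    """Shape a warehouse row into a search record keyed by objectID."""
    record = {key: value for key, value in row.items() if key != "course_id"}
    record["objectID"] = row["course_id"]
    return record


# Index settings expressed as plain data (attribute names are hypothetical).
index_settings = {
    "searchableAttributes": ["title"],
    "customRanking": ["desc(skill_score)", "desc(enrollments)"],
}


def parse_criterion(criterion):
    """Turn 'desc(attr)' or 'asc(attr)' into (attr, is_descending)."""
    direction, attribute = criterion.rstrip(")").split("(")
    return attribute, direction == "desc"


def apply_custom_ranking(records, criteria):
    """Simulate tie-breaking: earlier criteria win, later ones break ties."""
    ranked = list(records)
    # Sorting by the last criterion first works because Python's sort is stable.
    for criterion in reversed(criteria):
        attribute, descending = parse_criterion(criterion)
        ranked.sort(key=lambda record: record[attribute], reverse=descending)
    return ranked


records = [to_record(row) for row in warehouse_rows]
ranked = apply_custom_ranking(records, index_settings["customRanking"])
print([record["objectID"] for record in ranked])
# ['c2', 'c1', 'c3']
```

Because the skill score is just another attribute on the record, changing how skills-based search behaves becomes a matter of recomputing that attribute in the warehouse or reordering the ranking criteria, rather than maintaining a separate reranking system.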
In the future
In the future, we envision a search system that is flexible in incorporating data, allows for algorithmic innovation, and powers all of our content discovery. We’re not there yet, but by factoring the search system into processing, tuning, and display, we’re one step closer.