How to add full text search to your website

Sam Dutton

Published in

Dev Channel

16 min read · Feb 9, 2018

Many types of website need efficient, accurate search.

This article explains server and client-side alternatives, and shows how to implement search that works offline.

tl;dr

There are lots of ways to do search:

A back-end search engine such as Elasticsearch or Solr.
Commercial search services such as Algolia and Amazon CloudSearch.
A database with built-in search, such as MySQL or MongoDB.
Application platforms and database services with add-on search functionality, such as Firebase and Cloudant.
Client-side JavaScript search libraries such as FlexSearch and Elasticlunr.
Client-side JavaScript libraries that synchronise data with a back-end database, such as PouchDB.
Google Custom Search Engine.

All the search engines, databases and managed services discussed in this article have integrations across multiple platforms, frameworks and languages — not just for the web. Whatever your target platforms, there are several key considerations when choosing a solution:

How often do you update your data? If updates are continual and search results absolutely must be kept up to date, you’ll probably need to use a server-side option.
Do you need to search content, data, metadata — or all three? Search engines can index content and metadata from binary files such as PDF and Word documents, or access data and content from databases, or extract metadata such as EXIF from images. JavaScript search libraries need JSON.
What are you searching, in what context? Log analysis, for example, is very different from consumer product search or a professional search service for a large collection of unstructured data such as an historical archive.
How much data do you need to search? If you can cache all of your data locally on the client (say less than 20MB) you could consider client-side search.
Would you like search to work offline? Maybe just for part of your site’s functionality, such as a store locator or customer account history? Offline search can be enabled by caching index data on the client (demo here).

This is just an overview of some of the issues. Pros and cons for each option are explored in more detail below.

What is ‘full text search’?

For a small amount of simple textual data it’s possible to provide basic search functionality via simple string matching. For example, using JavaScript you could store product data for a small online shop as an array of objects in a JSON file, then fetch the file and iterate over each object to find matches.
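
As a minimal sketch of that approach, the code below filters an array of product objects by plain substring matching. The product data, the field names (`name`, `description`) and the `simpleSearch` function are made up for illustration; in a real page the array would probably come from a fetched JSON file.

```javascript
// Hypothetical product data; in practice this might be fetched from a JSON file.
const products = [
  { id: 1, name: 'Quilted dog coat', description: 'Warm coat for large dogs' },
  { id: 2, name: 'Surfboard wax', description: 'Cold water wax for surfboards' },
  { id: 3, name: 'Wellington boots', description: 'Waterproof rubber boots' }
];

// Return every product whose name or description contains the query,
// ignoring case. Note that this scans all of the data for every query.
function simpleSearch(items, query) {
  const needle = query.trim().toLowerCase();
  if (needle === '') {
    return [];
  }
  return items.filter((item) =>
    [item.name, item.description].some((field) =>
      field.toLowerCase().includes(needle)
    )
  );
}

console.log(simpleSearch(products, 'coat').map((p) => p.name));
// [ 'Quilted dog coat' ]
```

A query for coats finds nothing here, even though a coat is in the list, which is exactly the kind of gap that plain string matching leaves open.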

This simplistic approach can be better than nothing, but successful search needs more flexible functionality to find relevant results:

Stemming: developing for mobile matches results for develop for mobile and vice versa.
Stopword handling: Search engines need to avoid irrelevant results caused by matching common words such as a and the. (Conversely, ignoring stopwords can also cause problems for queries such as The Who or To be or not to be.)
Basic fuzzy matching: service workers matches results for Service Worker.
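
To make the list above more concrete, here is a rough sketch of the text normalisation an indexer might apply to both documents and queries. The stopword list and the suffix-stripping rules are deliberately crude assumptions made for this example; real search engines and libraries use much more careful stemmers and longer, language-specific stopword lists.

```javascript
// A tiny illustrative stopword list; real indexers ship much longer ones.
const STOPWORDS = new Set(['a', 'an', 'and', 'for', 'of', 'the', 'to']);

// Crude suffix stripping, for illustration only. Real stemmers
// use far more careful rules than this.
function crudeStem(word) {
  for (const suffix of ['ing', 'ed', 's']) {
    if (word.length > suffix.length + 2 && word.endsWith(suffix)) {
      return word.slice(0, -suffix.length);
    }
  }
  return word;
}

// Lowercase, split on non-alphanumeric characters, drop stopwords, then stem.
function normaliseText(text) {
  return text
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter((token) => token !== '' && !STOPWORDS.has(token))
    .map(crudeStem);
}

console.log(normaliseText('developing for mobile')); // [ 'develop', 'mobile' ]
console.log(normaliseText('develop for mobile'));    // [ 'develop', 'mobile' ]
console.log(normaliseText('Service Workers'));       // [ 'service', 'worker' ]
```

Because the query and the document text end up as the same tokens, a search for developing for mobile can match develop for mobile, and service workers can match Service Worker.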

In practice some of these features may not work as well as expected, especially in multilingual implementations or for structured, shorter-length data. Some search engine developers have made the case for preferring alternatives.

High quality search implementations provide additional features on the input side:

Autocomplete: Typing progre… invokes suggestions including Progressive Web App.
Spellcheck and autosuggest: Typing Porgessive Web Apps suggests Progressive Web Apps, or does the search automatically. Spellcheck can work better than fuzzy matching, which can lead to irrelevant results. On e-commerce sites autosuggest can also be used for merchandising: to suggest products the customer may not be aware of.
Synonym search: PWA should match results for Progressive Web Apps, and vice versa. This is crucial for product search, where people tend to use different words for the same thing: shoes or footwear, wellies or wellingtons or gumboots.
Recent searches: A high proportion of searches are repeat searches: people often search for the same thing, even during the same session. You may want to provide recent-search suggestions — and potentially cache assets for recent searches, such as images or product data.
Phrase matching: How much distance can be allowed between words in a search phrase? For example, should a search for super moon match descriptions that include the words super and moon, but not the exact phrase super moon?
Scoped and faceted search: Provide the user with UI controls such as checkboxes and sliders to narrow the range of potential search results. For example, search only within a product type or brand, or find quilted coats, for a large dog, on sale, priced less than $50.00. Functionality like this combines full text search (products that match quilted coats) with metadata constraints (less than $50).
Search expressions: Enable search to use AND, OR, NOT and other operators such as quote marks around exact phrases. For example: grizzly -bears returns matches for grizzly but not grizzly bears. grizzly AND bears returns only results that mention both grizzly and bears, though not necessarily as a single phrase. Search expressions are often a core requirement for search tools such as library catalogues.
Advanced relevance features: Ensure results can be ranked properly, by measuring the frequency or position of the search term within documents, phrase word proximity, or with other techniques. For example, a surf shop may sell surfboard wax, but surfboard descriptions may also mention board and wax, leading to irrelevant results for board wax.
Spatial and geospatial searches: A query for London web developers matches results for web developers from the King’s Cross area or postcode N1C.
Aggregation, faceting and statistics: For example, count the number of matches for London Web Developers, or (for a customer database) calculate the median spend for customers from the UK who spent over £500 in the last year, grouped by region.
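
As one illustration from the list above, spellcheck suggestions can be built on edit distance: the number of single-character changes needed to turn one word into another. The sketch below uses the standard Levenshtein algorithm. The vocabulary, the `suggestTerm` function and the `maxDistance` threshold are assumptions made for this example, not the method used by any particular search engine.

```javascript
// Classic dynamic-programming Levenshtein distance: the minimum number of
// single-character insertions, deletions and substitutions between a and b.
function editDistance(a, b) {
  let previous = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const current = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      current[j] = Math.min(
        previous[j] + 1,       // deletion
        current[j - 1] + 1,    // insertion
        previous[j - 1] + cost // substitution or match
      );
    }
    previous = current;
  }
  return previous[b.length];
}

// Suggest the closest vocabulary term, or null if nothing is close enough.
// maxDistance is an arbitrary threshold chosen for this example.
function suggestTerm(input, vocabulary, maxDistance = 3) {
  const word = input.toLowerCase();
  let best = null;
  let bestDistance = Infinity;
  for (const term of vocabulary) {
    const distance = editDistance(word, term.toLowerCase());
    if (distance < bestDistance) {
      best = term;
      bestDistance = distance;
    }
  }
  return bestDistance <= maxDistance ? best : null;
}

// Hypothetical vocabulary, for example the terms in a search index.
const vocabulary = ['progressive', 'responsive', 'performance'];
console.log(suggestTerm('Porgessive', vocabulary)); // progressive
```

In practice the vocabulary would probably be drawn from the index itself, and a production implementation would want something faster than comparing the input against every term.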

For a global audience, all this functionality must work across different languages, character sets, text directionality and geographical locations, and potentially handle linguistic and cultural differences. Stemming for Japanese, for example, is very different from stemming for English. Search engines and search services all provide different approaches to internationalisation and localisation.

What is a search index?

It’s possible to search a small amount of data simply by scanning all of the data for every query. As the quantity of data increases, this becomes slow and inefficient.

In its simplest form a search indexer gets around this problem by analysing a data set and building an index of search terms (words or phrases) and their location within the data — a bit like an index at the back of a book. The search implementation can then look up the query in the index rather than scanning all of the data. Indexers can also implement features such as stop-word handling and stemming.
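
Here is a minimal sketch of that idea: an inverted index that maps each term to the set of documents containing it. The sample documents, the field names and the function names are assumptions for illustration. This is not how Lunr or any other library structures its index internally, and it leaves out stemming, stopwords and ranking.

```javascript
// Build an inverted index: each term maps to the set of ids of the
// documents that contain it. Field names here are assumptions.
function buildInvertedIndex(documents, fields) {
  const index = new Map();
  for (const doc of documents) {
    for (const field of fields) {
      const terms = String(doc[field] || '').toLowerCase().match(/[a-z0-9]+/g) || [];
      for (const term of terms) {
        if (!index.has(term)) {
          index.set(term, new Set());
        }
        index.get(term).add(doc.id);
      }
    }
  }
  return index;
}

// Look up each query term in the index and return the ids of documents
// that contain all of them (an implicit AND), without scanning the data.
function lookup(index, query) {
  const terms = query.toLowerCase().match(/[a-z0-9]+/g) || [];
  if (terms.length === 0) return [];
  let result = null;
  for (const term of terms) {
    const ids = index.get(term) || new Set();
    result = result === null
      ? new Set(ids)
      : new Set([...result].filter((id) => ids.has(id)));
  }
  return [...result];
}

// Hypothetical sample documents.
const plays = [
  { id: 'ham', title: 'Hamlet', text: 'To be or not to be' },
  { id: 'mac', title: 'Macbeth', text: 'Double double toil and trouble' },
  { id: 'r3', title: 'Richard III', text: 'Now is the winter of our discontent' }
];
const playIndex = buildInvertedIndex(plays, ['title', 'text']);
console.log(lookup(playIndex, 'double trouble')); // [ 'mac' ]
```

The index is built once, and each query then costs a few map lookups rather than a pass over every document.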

You can view an example here of a simple index built for this demo using the Lunr JavaScript library:

What is a document?

Confusingly, the word ‘document’ is used with two different meanings in relation to search engines:

Data representing a single item, a ‘record’ or ‘object’, for example:
{name: 'Fred Nerk', job: 'Web Developer'}
A binary file in a format such as PDF or Word.

Providing high quality search results for a set of binary files can be much more complex than searching structured textual data. Imagine a video archive catalogue consisting of millions of legacy files with multiple different binary formats and a variety of content structures — how can you provide consistent and accurate search across the entire document set?

What else do you need to think about?

Data update latency: The delay from when you add, update or delete data until your changes appear in search results. This latency can be very low for search engines, potentially sub-second, even with large data sets. Low update latency is crucial for large-scale e-commerce search, or for any context where search results must reflect data updates within strict time limits specified by SLAs — for example, a news site that may be issued with takedown notices. Client-side search may not be able to provide low enough update latency or guarantee that updates have been propagated. Maintaining client-side search may also be highly inefficient, since every data or index update must be distributed to and installed by all users.
Search accuracy and ranking: Successful search implementations need to trade off precision and recall, avoiding missing results as well as false positives. Understanding how to rank results — and what the indexer should actually ignore — is often crucial.
Metadata search: You may need to search metadata as well as content, for example image EXIF data or PDF author and title fields.
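
Precision and recall, mentioned above, can be measured for a test query once you know which documents are actually relevant. The sketch below uses the standard definitions. The function name, the document ids and the choice to report 0 for an empty set are assumptions made for this example.

```javascript
// precision = relevant results returned / all results returned
// recall    = relevant results returned / all relevant documents
function precisionAndRecall(returnedIds, relevantIds) {
  const relevant = new Set(relevantIds);
  const returned = new Set(returnedIds);
  let truePositives = 0;
  for (const id of returned) {
    if (relevant.has(id)) truePositives++;
  }
  return {
    // Reporting 0 for an empty set is a convention chosen for this sketch.
    precision: returned.size === 0 ? 0 : truePositives / returned.size,
    recall: relevant.size === 0 ? 0 : truePositives / relevant.size
  };
}

// Hypothetical example: the engine returned four documents, two of them
// relevant, and missed two other relevant documents.
console.log(precisionAndRecall([1, 2, 3, 4], [2, 4, 5, 6]));
// { precision: 0.5, recall: 0.5 }
```

Returning more results tends to raise recall and lower precision, which is the trade-off that ranking and indexing choices have to manage.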

FT.com’s recent search engine update needed expert tweaking to work well. Their blog post about the project describes how the implementation initially tended to return results that made sense but were not really relevant, such as a plethora of articles that mention Trump but are not ‘about’ him. They also had to ensure that page ranking preferred recent news stories rather than always returning what might otherwise seem to be ‘most relevant’.

So… What are the options?

Search engine

Run your own search engine on a server. The two most popular are Elasticsearch and Solr, both open source.

Pro

A search engine can also serve as a data store — though there are caveats.
Search engines can handle petabytes of data on hundreds of servers.
Queries are extremely fast, and updates can be searchable with very low latency.
Potentially relatively cheap and straightforward to set up and run if you already have a server team (or at least a dedicated system administrator). However, see the comments under managed search services below.

Con

Setup, maintenance and update cost. Large-scale, high volume installations can be complex!
Initial and ongoing infrastructure cost.
Search queries done on a server require reliable connectivity from the client, and obviously don’t work offline.
Modern search engines are extremely fast, but client connectivity bottlenecks and latency can make searches unresponsive.
Every search query requires a request to the server, which can incur significant data cost and battery usage on mobile clients. Instant search can make this worse.

Managed search service

Use a commercial service such as Algolia or Amazon CloudSearch, or a platform such as Firebase or Cloudant that integrates with third-party search services (Firebase uses Algolia).

Pro

It’s not simple to set up and maintain a large-scale search service able to withstand high concurrent demand and traffic spikes, along with complex, petabyte-scale data sets and a high volume of updates. It can be hard to find the right people to do the job.
DIY alternatives can incur greater infrastructure and human resource costs.
A managed service can be simpler and more reliable to scale.
Outsourcing search can simplify the management of setup, maintenance and updates — and reduce startup time.
In most cases, the performance of managed services should be at least as good as a DIY equivalent.
Managed services are potentially more reliable than self-hosted alternatives and should be able to guarantee minimum service levels.
It may be possible to include search relatively cheaply within existing cloud contracts.

Con

Potentially more expensive than DIY search engine options.
As with other server-side options: data cost, battery usage, and reliance on connectivity. Instant search can make this worse.

Database with built-in search

NoSQL databases including MongoDB support full text search. CouchDB can implement search using couchdb-lucene, or you can use pre-built alternatives such as Couchbase.

Full text search is also supported by open source relational databases such as MySQL and PostgreSQL as well as many commercial alternatives.

Pro

A database that supports full text search may be adequate for your needs, without the cost of setting up and maintaining a separate search engine.
There are good reasons not to use a search engine as a data store. If you need to store and update a significant volume of data, it’s likely you’ll need a database. If your database enables search, that may be all you need.

Con

Built-in or add-on database search functionality may not be powerful or flexible enough to meet your requirements.
A full-scale relational or NoSQL database may be overkill for smaller data sets with simpler use cases.
Same connectivity issues as other server-side options.

Google Site Search and Google Custom Search Engine

Google Site Search is deprecated, but Google Custom Search Engine (CSE) is still available. The differences between the two are explained here.

You can try CSE with the example here, which searches products from the Polymer Shop project.

If you don’t want ads and you’re happy to pay (or the free quota is enough), the CSE API might work for you.

Pro

Quick and easy to set up.
Fast, reliable search.
Ad-free for non profits and educational organisations.

Con

API version costs $5.00 per 1000 queries.
Unless you use the API and build your own search interface, you get ads at the top of search results.
No customisation possible for autosuggest or other input features.
No control over result ranking or layout now that Google Site Search is deprecated.

Client-side search

The Cache and Service Worker APIs enable websites to work offline and build resilience to variable connectivity. Local caching combined with client-side search can enable a number of use cases. For example:

Searching an online shop while offline.
Enabling customers to use a store locator or search their purchase history.
Providing resilience and fallbacks in case of connectivity dropouts.

Client-side search can be particularly compelling for a relatively small set of data that doesn’t change much. For example, the demo here searches Shakespeare’s plays and poems:

Client-side JavaScript full text search libraries include Lunr and Elasticlunr.

You provide a set of ‘documents’ in JSON format, such as a product list, then create an index. Here’s how to do that with the Elasticlunr Node module:

To initiate search on the client, you first need to fetch the index data and load it:

To enable offline search, the index file can be stored by the client using the Cache API. Alternatively, you could fetch document data and build the index on the client, then serialise and store that locally.
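
The second option, building on the client and then serialising and storing locally, might look something like the sketch below. It assumes a simple index held as a Map of Sets rather than a Lunr or Elasticlunr index (those libraries have their own index formats), and it takes any object with `getItem` and `setItem` methods as storage. The function names and the storage key are made up for this example.

```javascript
// Convert a Map-of-Sets index to a JSON string and back again.
function serialiseIndex(index) {
  return JSON.stringify([...index].map(([term, ids]) => [term, [...ids]]));
}

function deserialiseIndex(json) {
  return new Map(JSON.parse(json).map(([term, ids]) => [term, new Set(ids)]));
}

// Load the index from storage if present, otherwise build and store it.
// `storage` is anything with getItem/setItem, such as window.localStorage.
function loadOrBuildIndex(storage, key, buildIndex) {
  const stored = storage.getItem(key);
  if (stored !== null) {
    return deserialiseIndex(stored);
  }
  const index = buildIndex();
  storage.setItem(key, serialiseIndex(index));
  return index;
}

// In-memory stand-in for localStorage so the sketch also runs in Node.
function createMemoryStorage() {
  const data = new Map();
  return {
    getItem: (k) => (data.has(k) ? data.get(k) : null),
    setItem: (k, v) => { data.set(k, String(v)); }
  };
}

const storage = createMemoryStorage();
let builds = 0;
const build = () => {
  builds++;
  return new Map([['hamlet', new Set(['doc1'])]]);
};
loadOrBuildIndex(storage, 'search-index-v1', build); // builds and stores
const restored = loadOrBuildIndex(storage, 'search-index-v1', build); // reads from storage
console.log(builds, restored.get('hamlet').has('doc1')); // 1 true
```

Putting a version in the storage key is one simple way to force a rebuild when the underlying data changes, though a real site would also need a way to find out that it has changed.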

And finally:

WebSQL enabled fast text matching (demo) and full text search (demo).
However, the WebSQL standard has been discontinued and only ever had partial browser support.
Full text search in WebSQL is now being removed.

Pro

Relatively simple to set up and maintain.
No search engine, database or third party search service required.
Search is done on the client, so queries do not require a server round trip. This reduces server load and uses less radio on mobile devices.
Client-side search is resilient to connectivity vagaries.
Client-side search can be extremely fast for smaller data sets.
Offline search capability can be enabled by caching index files using client-side storage such as the Cache API or localStorage.
Potentially useful for content that isn’t frequently updated, or where updates (though important) are not time critical. For example, an e-commerce site could use client-side search to enable offline access to store locations.

Con

Only viable for a limited amount of data, though this could be up to tens of thousands of documents, potentially more, as long as the data is relatively static.
High latency for data updates compared to server-side alternatives.
Potential for stale data.
Client-side search incurs memory, storage and processing cost. For example, the demo here (which searches Shakespeare’s plays and poems) uses over 200MB of memory when running in Chrome on desktop.
Currently there’s no way to share a JavaScript object, such as a search index, across multiple pages, except by serialising and storing it. (Shared workers get around this, but aren’t implemented for Chrome for Android. SharedArrayBuffer might help, but has been disabled by default by all major browsers.) Unless you use a SPA architecture, you’ll need to recreate the index every time the user navigates to a new page, which is unworkable.
A server-side search service with an API can be used across multiple platforms and device types whereas client-side search may need to be implemented for both web and native, potentially with server-side fallbacks.
Offline use cases are not (yet) a priority for e-commerce or other types of sites, so why bother with client-side search?

Client-side search with automated replication

JavaScript libraries such as PouchDB and SyncedDB do much the same job as the client-side libraries described above, but they also offer the ability to automatically synchronise data on the client with a back-end database, optionally in both directions.

You can try an offline-enabled PouchDB demo here.

Pro

In theory ‘the best of both worlds’: search is done on the client, data is continually updated.
Enables offline search.
As with other types of client-side search, reduces the number of requests, and thereby potentially reduces server load and device radio usage.
Enables bidirectional synchronisation, potentially for multiple endpoints.

Con

Architecture is more complex than simply maintaining data as (for example) a JSON file: a database server is required.
As with other types of client-side search, processing incurs battery cost, and the potential size of the data set is limited by client-side storage quota.
For a client with good connectivity, potentially slower than a fast backend database or search engine.

What about UX and UI?

Query input

People have come to expect a high standard of design for search query input, particularly on shopping sites. Functionality such as synonym matching and autosuggest is now the norm.

For example, Asos does a great job of highlighting matches and suggesting other categories and brands:

This article has other great examples of high quality search, and provides sensible guidelines for search input design.

Make sure to understand the different types of search required by your users. For example, an online store needs to be flexible about the way people want to find what they’re looking for:

General text search: google phone, android mobile
Product name: pixel, pixel phone, pixel 2
Product specification: pixel 64GB kinda blue
Product number: G-2PW4100
Category: phone, mobile, smartphone

Search results

Search result content and presentation are critical:

Work out what people want and give it to them! If only one result is returned, it may be best to go directly to the relevant item — for example, when a customer enters a part number.
Give people as much information and functionality as possible in search results. For an online shop, that might include review ratings and style options, and the ability to make a one-click purchase. Don’t force people to go to the product page — every click loses users.
Check analytics for searches where users didn’t click on any results. Always ensure that configuration changes that work for some searches don’t mess up others. Accurate result ranking is fundamental to successful search.
Enable query term highlighting (using the mark element) and provide match context where that makes sense. This is crucial for longer documents.
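
Query term highlighting with the mark element, as suggested above, can be sketched as a small string transformation. The function names are made up for this example, and the code assumes the result is inserted into the page as HTML, which is why both the text and the matches are escaped.

```javascript
// Escape text for safe insertion into HTML.
function escapeHtml(text) {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// Escape characters that have special meaning in a regular expression.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Wrap each case-insensitive occurrence of a query term in <mark>.
function highlightTerms(text, query) {
  const terms = query.split(/\s+/).filter(Boolean).map(escapeRegExp);
  if (terms.length === 0) return escapeHtml(text);
  const pattern = new RegExp(`(${terms.join('|')})`, 'gi');
  // split() with a capturing group keeps the matches at odd indexes.
  return text
    .split(pattern)
    .map((part, i) => (i % 2 === 1 ? `<mark>${escapeHtml(part)}</mark>` : escapeHtml(part)))
    .join('');
}

console.log(highlightTerms('Bosch cordless drill & bits', 'drill'));
// Bosch cordless <mark>drill</mark> &amp; bits
```

This matches the literal query terms only. If the search itself uses stemming or synonyms, the highlighter needs to know which words actually matched.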

Screwfix provides checkout options right on the search results page, and automatically transforms a query (in this case bosch drill) into a filtered set of results, which each include review ratings and a sensible level of product detail:

By contrast, Made keeps results clean and uncluttered, which suits the brand:

FT.com orders news stories by date, and suggests sensible options for refining results:

Getty Images focuses on… images! Filtering options are provided based on available metadata, along with layout and display options:

Cross-functional considerations

Search doesn’t live in a vacuum. Successful implementations need good communication between different stakeholders:

Maintaining high quality search functionality requires continual cross-functional collaboration between subject experts, content providers, application developers, back-end developers, designers, marketing and sales teams. Different teams need to reconcile different priorities — and understand who has the final say!
Changes often cause problems, or unexpected knock-on effects. For example, increasing fuzziness and slop may improve some searches while causing other searches to return irrelevant results.
Make it easy for non-techies to adjust features such as autocomplete and synonym search, or to add merchandising and other featured content to searches.

Searching audio and video

If you have video or audio content, you can enable users to find it by searching metadata such as titles and descriptions.

There are two problems with this approach:

Search is only for metadata, not content.
If a match is found, it’s only to the ‘top level’ of an audio or video file.

If your audio or video has captions, subtitles or other kinds of ‘timed metadata’, search and navigation can be much more granular.

For example, the demo here searches Google developer video captions, and enables navigation to specific points within videos. The content in the demo is hosted on YouTube, which uses the SRT format for captions. You can see this in action if you view the transcript for a manually captioned YouTube video such as the one here:

Websites can use subtitles and captions with a track element (demo here) using the WebVTT format, which is very similar to SRT:
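
As a rough sketch of how caption search could work, the code below parses cues from a small WebVTT sample and finds the ones that match a query. The caption text and the function names are made up for this example, and the parser only handles plain cues, not styling blocks, regions or notes.

```javascript
// Convert a WebVTT timestamp (mm:ss.ttt or hh:mm:ss.ttt) to seconds.
function vttTimeToSeconds(timestamp) {
  const parts = timestamp.split(':').map(Number);
  return parts.reduce((total, part) => total * 60 + part, 0);
}

// Parse the cues from a simple WebVTT file.
function parseVttCues(vtt) {
  const cues = [];
  const blocks = vtt.replace(/\r\n?/g, '\n').split(/\n{2,}/);
  for (const block of blocks) {
    const lines = block.split('\n');
    const timingIndex = lines.findIndex((line) => line.includes('-->'));
    if (timingIndex === -1) continue; // header or other non-cue block
    const [start, rest] = lines[timingIndex].split('-->');
    const end = rest.trim().split(/\s+/)[0]; // ignore any cue settings
    cues.push({
      start: vttTimeToSeconds(start.trim()),
      end: vttTimeToSeconds(end),
      text: lines.slice(timingIndex + 1).join(' ').trim()
    });
  }
  return cues;
}

// Return the cues whose text contains the query, ignoring case.
function searchCues(cues, query) {
  const needle = query.toLowerCase();
  return cues.filter((cue) => cue.text.toLowerCase().includes(needle));
}

// Hypothetical caption file.
const sampleVtt = [
  'WEBVTT',
  '',
  '00:00:01.000 --> 00:00:04.000',
  'Welcome to this talk about service workers.',
  '',
  '00:01:05.500 --> 00:01:09.000',
  'The Cache API stores responses for offline use.'
].join('\n');

const cueMatches = searchCues(parseVttCues(sampleVtt), 'cache api');
console.log(cueMatches.map((cue) => cue.start)); // [ 65.5 ]
```

In a page, a matching cue's start time could be assigned to the video element's `currentTime` property to jump to that point in the video.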

Testing

Whatever you do, test changes and keep an eye on analytics and search logging. Build discount usability testing into your workflow.

Make it easy for non-techies to monitor and understand search statistics:

What are people searching for most? What are people not searching for?
When and why do some queries provide no results? Can these be fixed by adding synonyms or spellchecking suggestions?
When does a search lead to a sale or other type of ‘conversion’?
Check analytics for searches that result in users exiting the site rather than clicking on a result.
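
The questions above can often be answered from a simple search log. The sketch below assumes a hypothetical log format, with one entry per search recording the query, the number of results and whether the user clicked a result. The field names and the `summariseSearchLog` function are assumptions, not the output format of any particular analytics tool.

```javascript
// Each log entry is assumed to look like:
// { query: 'bosch drill', resultCount: 12, clicked: true }
function summariseSearchLog(entries) {
  const stats = new Map();
  for (const { query, resultCount, clicked } of entries) {
    const key = query.trim().toLowerCase();
    const s = stats.get(key) || { query: key, searches: 0, zeroResults: 0, noClick: 0 };
    s.searches++;
    if (resultCount === 0) s.zeroResults++;
    else if (!clicked) s.noClick++;
    stats.set(key, s);
  }
  const all = [...stats.values()];
  return {
    // Most frequent queries first.
    mostSearched: [...all].sort((a, b) => b.searches - a.searches),
    // Queries that returned nothing at least once: candidates for synonyms.
    zeroResultQueries: all.filter((s) => s.zeroResults > 0).map((s) => s.query),
    // Queries that returned results the user did not click.
    abandonedQueries: all.filter((s) => s.noClick > 0).map((s) => s.query)
  };
}

// Hypothetical log data.
const sampleLog = [
  { query: 'bosch drill', resultCount: 12, clicked: true },
  { query: 'Bosch Drill', resultCount: 12, clicked: false },
  { query: 'gumboots', resultCount: 0, clicked: false },
  { query: 'wellies', resultCount: 8, clicked: true }
];
const summary = summariseSearchLog(sampleLog);
console.log(summary.mostSearched[0].query); // bosch drill
console.log(summary.zeroResultQueries);     // [ 'gumboots' ]
console.log(summary.abandonedQueries);      // [ 'bosch drill' ]
```

In this made-up data, gumboots returns nothing while wellies returns results, which is the kind of gap a synonym entry could close.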

Find out more

Installing and using search engines and managed services

The two articles below explain in more detail how to set up and use several of the most popular search engines and managed services:

Articles, books and podcasts

24 best practice tips for ecommerce site search: Sensible compilation of UI and UX tips.
Building a new search for FT.com: How a successful news site rebuilt search functionality for 720,000 items of content.
Relevant Search: An actual printed book! Free chapters online, and a podcast.
Introduction to Information Retrieval: Nearly 10 years old, but still highly relevant and free to read online.
AI-Based Search Engines: Scientific American article about free AI-based scholarly search engines.
Voice Search
The future of search is voice and personal digital assistants
Why You Need To Prepare For A Voice Search Revolution
How audio-only interfaces are changing search and leading to different types of human–computer interactions. More stats about voice search here. According to comScore, ‘50% of all searches will be voice searches by 2020’.
Why shouldn’t I use ElasticSearch as my primary datastore?: Discussion of why you probably still need a database.
Implementing Client-Side Search with Vue.js: Simple code to add text search of a JSON dataset using fuzzaldrin-plus.
How Algolia tackled the relevance problem of search engines: How the Algolia search engine ranks results, with an interesting comment thread disputing some of their claims.
How Search Works: A video overview of Google search.

Thank you to all the people who helped with this article.