More about search engines

Sam Dutton
Feb 9, 2018 · 4 min read


This article is a companion to How to add full text search to your website.

It’s an overview from the perspective of a front end developer — and is in no way comprehensive!

The two most popular open source search engines are Elasticsearch and Solr. Both are based on the Apache Lucene Java library.

Large-scale, high-volume search can be difficult to implement and maintain, but both engines can scale to petabytes of data across hundreds of servers, providing extremely fast search and low latency when data is updated. You can find out more about the differences between the two here.

If you’re a front-end developer you may feel somewhat daunted at the prospect of installing a search engine. However, Elasticsearch and Solr are well documented and relatively easy to get up and running.

Installation and usage

The following instructions describe the installation for the Java versions of Elasticsearch and Solr.

An official Elasticsearch client is also available as a Node module, with similar functionality and a promise-based API. Likewise, Solr can be accessed via Node modules such as solr-client.

Elasticsearch and Solr use REST interfaces to perform search queries and also for administration and index management tasks:

  • CRUD operations on index data: create, read, update and delete.
  • Health checks and statistics.
  • Configuration of search engine settings and indexes.
  • Advanced features such as pagination, sorting and filtering.
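As a rough sketch of how those operations map onto the REST interface, here is the correspondence for Elasticsearch (the paths follow the curl examples later in this article):

```javascript
// A rough map of common operations onto Elasticsearch's REST interface.
// Paths follow the examples later in this article.
const operations = {
  create: { method: 'PUT', path: '/customer/doc/1' },
  read: { method: 'GET', path: '/customer/doc/1' },
  update: { method: 'POST', path: '/customer/doc/1/_update' },
  delete: { method: 'DELETE', path: '/customer/doc/1' },
  health: { method: 'GET', path: '/_cat/health' }
};

for (const [op, { method, path }] of Object.entries(operations)) {
  console.log(`${op}: ${method} ${path}`);
}
```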

Both engines also provide GUI apps for administration and query testing.

Elasticsearch

Install the search engine

On Mac OS X you can install Elasticsearch using Homebrew:

brew install elasticsearch

Test the installation

curl -XGET 'localhost:9200/_cat/health?v&pretty'

Create an index

curl -XPUT 'localhost:9200/customer?pretty'

An index in Elasticsearch is a collection of related documents: for example, customer data or a product catalog. This corresponds to a collection in Solr.

Delete an index

curl -XDELETE 'localhost:9200/customer?pretty'

Add a document

curl -XPUT 'localhost:9200/customer/doc/1?pretty' \
-H 'Content-Type: application/json' -d'
{
"name": "John Doe"
}
'

Retrieve a document

curl -XGET 'localhost:9200/customer/doc/1?pretty'

The response looks like this:

{
"_index" : "customer",
"_type" : "doc",
"_id" : "1",
"_version" : 1,
"found" : true,
"_source" : { "name": "John Doe" }
}
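In client code you would typically pull the document body out of this envelope. A minimal sketch, assuming the response has already been parsed from JSON:

```javascript
// Extract the document body from an Elasticsearch "get" response.
// Returns null when the document wasn't found.
function getSource(response) {
  return response.found ? response._source : null;
}

const response = {
  _index: 'customer',
  _type: 'doc',
  _id: '1',
  _version: 1,
  found: true,
  _source: { name: 'John Doe' }
};

console.log(getSource(response)); // { name: 'John Doe' }
```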

Replace a document

curl -XPUT 'localhost:9200/customer/doc/1?pretty' \
-H 'Content-Type: application/json' -d'
{
"name": "Jane Doe"
}
'

Update a document

curl -XPOST 'localhost:9200/customer/doc/1/_update?pretty' \
-H 'Content-Type: application/json' -d'
{
"doc": { "name": "Jane Doe", "age": 20 }
}
'

You can also use scripts to programmatically increment or otherwise change specific fields.

Delete a document

curl -XDELETE 'localhost:9200/customer/doc/2?pretty'

There’s also an API for deleting a set of documents matching specific queries.

Bulk operations are done like this:

curl -XPOST 'localhost:9200/customer/doc/_bulk?pretty' \
-H 'Content-Type: application/json' -d'
{"index":{"_id":"1"}}
{"name": "John Doe" }
{"index":{"_id":"2"}}
{"name": "Jane Doe" }
'
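The bulk body is newline-delimited JSON: an action line followed by a document line for each item, with a trailing newline. A sketch of assembling it programmatically from an array of documents:

```javascript
// Build an Elasticsearch bulk-index body (newline-delimited JSON):
// one action line, then one document line, per document.
function buildBulkBody(docs) {
  return docs
    .map(({ _id, ...doc }) =>
      JSON.stringify({ index: { _id } }) + '\n' + JSON.stringify(doc))
    .join('\n') + '\n';
}

const body = buildBulkBody([
  { _id: '1', name: 'John Doe' },
  { _id: '2', name: 'Jane Doe' }
]);
console.log(body);
```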

Search for documents

This example matches any document, and sorts results by name in ascending order:

curl -XGET 'localhost:9200/customer/_search?q=*&sort=name:asc&pretty'

The response includes information about the number of hits and time for the search, as well as the actual results:

{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 2,
"max_score" : 1.0,
"hits" : [
{
"_index" : "customer",
"_type" : "doc",
"_id" : "2",
"_score" : 1.0,
"_source" : {
"name" : "Jane Doe"
}
},
{
"_index" : "customer",
"_type" : "doc",
"_id" : "1",
"_score" : 1.0,
"_source" : {
"name" : "John Doe"
}
}
]
}
}
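In an app you would usually discard that envelope and keep only the matching documents. A minimal sketch, using an abbreviated version of the response above:

```javascript
// Pull the matching documents out of an Elasticsearch search response.
function getHits(response) {
  return response.hits.hits.map(hit => hit._source);
}

const response = {
  took: 1,
  hits: {
    total: 2,
    max_score: 1.0,
    hits: [
      { _id: '2', _score: 1.0, _source: { name: 'Jane Doe' } },
      { _id: '1', _score: 1.0, _source: { name: 'John Doe' } }
    ]
  }
};

console.log(getHits(response).map(doc => doc.name));
// [ 'Jane Doe', 'John Doe' ]
```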

You can make the same request using a JSON body:

curl -XGET 'localhost:9200/customer/_search?pretty' \
-H 'Content-Type: application/json' -d'
{
"query": { "match_all": {} },
"sort": [
{ "name": "asc" }
]
}
'

The number of results returned defaults to 10.
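To page through larger result sets you can set from and size in the query body (standard Elasticsearch parameters). A sketch of building the body for a given page:

```javascript
// Build a paginated Elasticsearch query body.
// `page` is zero-based; `size` defaults to Elasticsearch's default of 10.
function pagedQuery(page, size = 10) {
  return {
    query: { match_all: {} },
    sort: [{ name: 'asc' }],
    from: page * size,
    size: size
  };
}

console.log(JSON.stringify(pagedQuery(2, 10)));
```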

This request queries a list of accounts and only returns lastname values:

curl -XGET 'localhost:9200/accounts/_search?pretty' \
-H 'Content-Type: application/json' -d'
{
"query": { "match_all": {} },
"_source": ["lastname"]
}
'

Search queries are straightforward:

curl -XGET 'localhost:9200/customer/_search?pretty' \
-H 'Content-Type: application/json' -d'
{
"query": { "match": { "name": "Doe" } }
}
'

It’s also possible to filter results, and to aggregate results to group and extract statistics.
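For example, Elasticsearch's bool query combines full-text matching with filters, which don't affect scoring and can be cached. A sketch of building such a query body (the age field here is hypothetical):

```javascript
// A bool query: full-text match on name, filtered to documents whose
// (hypothetical) age field falls inside a range. Filters don't affect
// relevance scoring, so they're cheap and cacheable.
function filteredQuery(name, minAge, maxAge) {
  return {
    query: {
      bool: {
        must: { match: { name: name } },
        filter: { range: { age: { gte: minAge, lte: maxAge } } }
      }
    }
  };
}

console.log(JSON.stringify(filteredQuery('Doe', 18, 30), null, 2));
```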

Solr

Install the search engine

Download a binary release from the Apache Solr website and unpack it; the commands below are run from the unpacked directory.

Start an interactive session

./bin/solr start -e cloud

Test the installation

http://localhost:8983/solr/#/

Create a new document collection

./bin/solr create -c mydocs -s 2 -rf 2

In Solr a collection is a set of related documents: for example, customer data or a product catalog. This corresponds to an index in Elasticsearch.

Index some data

bin/post -c techproducts example/exampledocs/*

This indexes files in the example/exampledocs directory, which is the set of sample data provided with the Solr installation. Data can be in any format that Solr recognises: XML, JSON, CSV. Solr can also index plain text files and formats such as PDFs and Word documents.

Do a search

curl 'localhost:8983/solr/techproducts/select?q=foundation'

The response looks like this:

{
"responseHeader": {
"zkConnected": true,
"status": 0,
"QTime": 8,
"params": {
"q": "foundation"
}
},
"response": {
"numFound": 4,
"start": 0,
"maxScore": 2.7879646,
"docs":[{
"id": "0553293354",
"cat": ["book"],
"name": "Foundation",
"price": 7.99,
"price_c": "7.99,USD",
"inStock": true,
"author": "Isaac Asimov",
"author_s": "Isaac Asimov",
"series_t": "Foundation Novels",
"sequence_i": 1,
"genre_s": "scifi",
"_version_": 1574100232473411586,
"price_c____l_ns": 799
}]
}}
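Note that the response envelope differs from Elasticsearch's: the documents live under response.docs rather than hits.hits. A minimal sketch of extracting them, using an abbreviated version of the response above:

```javascript
// Pull matching documents out of a Solr select response.
// Unlike Elasticsearch, the docs live under response.docs.
function getDocs(solrResponse) {
  return solrResponse.response.docs;
}

const solrResponse = {
  responseHeader: { status: 0, QTime: 8 },
  response: {
    numFound: 4,
    start: 0,
    docs: [{ id: '0553293354', name: 'Foundation', author: 'Isaac Asimov' }]
  }
};

console.log(getDocs(solrResponse).map(doc => doc.name));
```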

Do a field search

curl 'localhost:8983/solr/techproducts/select?q=cat:electronics'

Add a single document, using JSON

curl -X POST -H 'Content-Type: application/json' \
'http://localhost:8983/solr/techproducts/update/json/docs' --data-binary '
{
"id": "1",
"title": "Doc 1"
}'

Add multiple documents

curl -X POST -H 'Content-Type: application/json' 'http://localhost:8983/solr/techproducts/update' --data-binary '
[
{
"id": "1",
"title": "Doc 1"
},
{
"id": "2",
"title": "Doc 2"
}
]'

Update existing documents

curl -X POST -H 'Content-Type: application/json' \
'http://localhost:8983/solr/techproducts/update' --data-binary '
{
"add": {
"doc": {
"id": "DOC1",
"my_field": 2.3
}
},
"add": {
"commitWithin": 5000,
"overwrite": false,
"doc": {
"id": "DOC2",
"title": "Doc 2"
}
},
"commit": {},
"optimize": { "waitSearcher": false },
"delete": { "id": "ID" },
"delete": { "query": "QUERY" }
}'

The second add command specifies that its document should be committed within 5 seconds (commitWithin: 5000) and that Solr shouldn't check for an existing document with the same uniqueKey (overwrite: false). Note that Solr's JSON update syntax allows a command name such as add or delete to be repeated within a single request.

Find out more here about working with JSON in Solr.


Sam Dutton

I am a Developer Advocate for Google Chrome. I maintain simpl.info: simplest possible examples of HTML, CSS and JavaScript. South Australian, living in London.