More about search engines
This article is a companion to How to add full text search to your website.
It’s an overview from the perspective of a front end developer — and is in no way comprehensive!
The two most popular open source search engines are Elasticsearch and Solr. Both are based on the Apache Lucene Java library.
Large-scale, high-volume search can be difficult to implement and maintain, but both engines can scale to petabytes of data across hundreds of servers, providing extremely fast search and low latency when data is updated. You can find out more about the differences between the two here.
If you’re a front-end developer you may feel somewhat daunted at the prospect of installing a search engine. However, Elasticsearch and Solr are well documented and relatively easy to get up and running.
Installation and usage
The following instructions describe the installation for the Java versions of Elasticsearch and Solr.
Elasticsearch is also available as a Node module with similar functionality and a promise-based API. Likewise, Solr can be used via Node modules such as solr-client.
Elasticsearch and Solr use REST interfaces to perform search queries and also for administration and index management tasks:
- CRUD operations on index data: create, read, update and delete.
- Health checks and statistics.
- Configuration of search engine settings and indexes.
- Advanced features such as pagination, sorting and filtering.
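To make the REST pattern concrete, the document CRUD operations can be summarised as an HTTP verb plus a URL path. The paths below follow the Elasticsearch examples later in this article (index name and document id are placeholders):

```python
# Illustrative map of Elasticsearch document CRUD operations to an
# HTTP verb and URL path, matching the curl examples in this article.
# {index} and {id} are placeholders for an index name and document id.
crud_endpoints = {
    "create/replace": ("PUT", "/{index}/doc/{id}"),
    "read": ("GET", "/{index}/doc/{id}"),
    "update": ("POST", "/{index}/doc/{id}/_update"),
    "delete": ("DELETE", "/{index}/doc/{id}"),
}

def endpoint(operation, index, doc_id):
    """Return the (verb, path) pair for a CRUD operation on one document."""
    verb, path = crud_endpoints[operation]
    return verb, path.format(index=index, id=doc_id)
```

For example, `endpoint("read", "customer", "1")` returns `("GET", "/customer/doc/1")`, which corresponds to the "Retrieve a document" curl request below.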
Both engines also provide GUI apps for administration and query testing.
Elasticsearch
Install the search engine
On Mac OS X you can install Elasticsearch using Homebrew:
brew install elasticsearch
Test the installation
curl -XGET 'localhost:9200/_cat/health?v&pretty'
Create an index
curl -XPUT 'localhost:9200/customer?pretty'
An index in Elasticsearch is a collection of related documents: for example, customer data or a product catalog. This corresponds to a collection in Solr.
Delete an index
curl -XDELETE 'localhost:9200/customer?pretty'
Add a document
curl -XPUT 'localhost:9200/customer/doc/1?pretty' \
-H 'Content-Type: application/json' -d'
{
"name": "John Doe"
}
'
Retrieve a document
curl -XGET 'localhost:9200/customer/doc/1?pretty'
The response looks like this:
{
"_index" : "customer",
"_type" : "doc",
"_id" : "1",
"_version" : 1,
"found" : true,
"_source" : { "name": "John Doe" }
}
Replace a document
curl -XPUT 'localhost:9200/customer/doc/1?pretty' \
-H 'Content-Type: application/json' -d'
{
"name": "Jane Doe"
}
'
Update a document
curl -XPOST 'localhost:9200/customer/doc/1/_update?pretty' \
-H 'Content-Type: application/json' -d'
{
"doc": { "name": "Jane Doe", "age": 20 }
}
'
You can also use scripts to programmatically increment or otherwise change specific fields.
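For example, a scripted-update request body could be built and sent as JSON along the lines of the following sketch. The "source" and "params" script fields follow recent Elasticsearch versions (older releases used "inline" instead of "source"), and the age field and increment value are illustrative:

```python
import json

# Sketch of a scripted-update request body, to be sent with
# POST /customer/doc/1/_update. The script increments the "age" field
# by a parameterised amount; the field name and amount are illustrative.
body = {
    "script": {
        "source": "ctx._source.age += params.increment",
        "params": {"increment": 1},
    }
}
payload = json.dumps(body)
```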
Delete a document
curl -XDELETE 'localhost:9200/customer/doc/2?pretty'
There’s also an API for deleting a set of documents matching specific queries.
Bulk operations are done like this:
curl -XPOST 'localhost:9200/customer/doc/_bulk?pretty' \
-H 'Content-Type: application/json' -d'
{"index":{"_id":"1"}}
{"name": "John Doe" }
{"index":{"_id":"2"}}
{"name": "Jane Doe" }
'
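Note that the bulk body is newline-delimited JSON: each action line is followed by its document source, and the payload must end with a newline. A sketch of how such a payload could be assembled:

```python
import json

def bulk_payload(docs):
    """Build an Elasticsearch _bulk request body from (id, source) pairs:
    one action line per document, each followed by the document source,
    newline-delimited, with the trailing newline the _bulk endpoint expects."""
    lines = []
    for doc_id, source in docs:
        lines.append(json.dumps({"index": {"_id": doc_id}}))
        lines.append(json.dumps(source))
    return "\n".join(lines) + "\n"

payload = bulk_payload([("1", {"name": "John Doe"}),
                        ("2", {"name": "Jane Doe"})])
```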
Search for documents
This example matches any document, and sorts results by name in ascending order:
curl -XGET 'localhost:9200/customer/_search?q=*&sort=name:asc&pretty'
The response includes information about the number of hits and time for the search, as well as the actual results:
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 2,
"max_score" : 1.0,
"hits" : [
{
"_index" : "customer",
"_type" : "doc",
"_id" : "2",
"_score" : 1.0,
"_source" : {
"name" : "Jane Doe"
}
},
{
"_index" : "customer",
"_type" : "doc",
"_id" : "1",
"_score" : 1.0,
"_source" : {
"name" : "John Doe"
}
}
]
}
}
You can make the same request using a JSON body:
curl -XGET 'localhost:9200/customer/_search?pretty' \
-H 'Content-Type: application/json' -d'
{
"query": { "match_all": {} },
"sort": [
{ "name": "asc" }
]
}
'
The number of results returned defaults to 10.
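To page through results you can set from (offset) and size (page length) in the request body; both are standard Elasticsearch search parameters. A sketch of building the body for an arbitrary page:

```python
import json

def paged_query(page, page_size=10):
    """Build a match_all search body for one page of results, using
    Elasticsearch's "from" (offset) and "size" (page length) parameters."""
    return {
        "query": {"match_all": {}},
        "sort": [{"name": "asc"}],
        "from": page * page_size,
        "size": page_size,
    }

# Page 2 (zero-based) with the default page size of 10: offset 20.
body = json.dumps(paged_query(page=2))
```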
This request queries a list of accounts and only returns lastname values:
curl -XGET 'localhost:9200/accounts/_search?pretty' \
-H 'Content-Type: application/json' -d'
{
"query": { "match_all": {} },
"_source": ["lastname"]
}
'
Search queries are straightforward:
curl -XGET 'localhost:9200/customer/_search?pretty' \
-H 'Content-Type: application/json' -d'
{
"query": { "match": { "name": "Doe" } }
}
'
It’s also possible to filter results, and to aggregate results to group and extract statistics.
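As one sketch of how filtering and aggregation combine, the request body below filters by a price range (a bool query's filter clause does not affect scoring) and groups the results with a terms aggregation. The field names ("price", "category") are illustrative, not part of the examples above:

```python
import json

# Illustrative search body: keep only documents whose price is between
# 10 and 100, and count the matching documents per category value.
body = {
    "query": {
        "bool": {
            "must": {"match_all": {}},
            "filter": {"range": {"price": {"gte": 10, "lte": 100}}},
        }
    },
    "aggs": {
        "by_category": {"terms": {"field": "category"}}
    },
}
payload = json.dumps(body)
```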
Solr
Install the search engine
Download the Solr distribution and unpack it; the following commands are run from the Solr installation directory.
Start an interactive session
./bin/solr start -e cloud
Test it
Point a browser at the Solr admin UI at localhost:8983/solr to check that the server is running.
Create a new document collection
./bin/solr create -c techproducts -s 2 -rf 2
In Solr a collection is a set of related documents: for example, customer data or a product catalog. This corresponds to an index in Elasticsearch.
Index some data
bin/post -c techproducts example/exampledocs/*
This indexes the files in the example/exampledocs directory, the set of sample data provided with the Solr installation. Data can be in any format that Solr recognises: XML, JSON or CSV. Solr can also index plain text files and formats such as PDF and Word documents.
Do a search
curl 'localhost:8983/solr/techproducts/select?q=foundation'
The response looks like this:
{
"responseHeader": {
"zkConnected": true,
"status": 0,
"QTime": 8,
"params": {
"q": "foundation"
}
},
"response": {
"numFound": 4,
"start": 0,
"maxScore": 2.7879646,
"docs":[{
"id": "0553293354",
"cat": ["book"],
"name": "Foundation",
"price": 7.99,
"price_c": "7.99,USD",
"inStock": true,
"author": "Isaac Asimov",
"author_s": "Isaac Asimov",
"series_t": "Foundation Novels",
"sequence_i": 1,
"genre_s": "scifi",
"_version_": 1574100232473411586,
"price_c____l_ns": 799
}]
}}
Do a field search
curl 'localhost:8983/solr/techproducts/select?q=cat:electronics'
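Solr query parameters such as q (the query), fl (the list of fields to return), sort and rows (the number of results) are passed in the URL, so they need to be URL-encoded. A sketch of building such a request URL:

```python
from urllib.parse import urlencode

# Build a Solr select URL with common query parameters. The parameter
# names are standard Solr query parameters; the field names in fl and
# the collection name match the techproducts examples above.
params = {
    "q": "cat:electronics",
    "fl": "id,name,price",
    "sort": "price asc",
    "rows": 5,
}
url = "http://localhost:8983/solr/techproducts/select?" + urlencode(params)
```

urlencode takes care of escaping characters like the colon in cat:electronics and the space in the sort clause.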
Add a single document, using JSON
curl -X POST -H 'Content-Type: application/json' \
'http://localhost:8983/solr/techproducts/update/json/docs' --data-binary '
{
"id": "1",
"title": "Doc 1"
}'
Add multiple documents
curl -X POST -H 'Content-Type: application/json' 'http://localhost:8983/solr/techproducts/update' --data-binary '
[
{
"id": "1",
"title": "Doc 1"
},
{
"id": "2",
"title": "Doc 2"
}
]'
Update existing documents
curl -X POST -H 'Content-Type: application/json' \
'http://localhost:8983/solr/techproducts/update' --data-binary '
{
  "add": {
    "doc": {
      "id": "DOC1",
      "my_field": 2.3
    },
    "commitWithin": 5000,
    "overwrite": false
  },
  "commit": {},
  "optimize": { "waitSearcher": false },
  "delete": { "id": "ID" },
  "delete": { "query": "QUERY" }
}'
Here commitWithin tells Solr to commit the document within 5 seconds, and overwrite: false skips the check for an existing document with the same uniqueKey.
Find out more here about working with JSON in Solr.