Apache Solr Fundamentals
Apache Solr is a search platform, built on the Apache Lucene information retrieval library, capable of handling unstructured and semi-structured data.
It is used to search and analyze huge amounts of data in real time.
Apart from full-text search, Solr also supports text analysis.
Solr is distributed in nature, which makes it easier to scale.
Apache Solr is truly open source and is managed by a community under the Apache Software Foundation. This is in contrast to Elasticsearch, which is managed by a company (Elastic).
Solr supports JSON, XML, CSV and optimized binary response formats.
Solr Simplified Architecture
As shown in the diagram, the main components are:
Index
The Apache Solr index is the “database” for managing structured and unstructured data. It stores data in a way that makes analysis and full-text search easier.
Query Parser
All queries submitted by clients are handled by the query parser.
Response Handler
The response handler is responsible for generating the response in the appropriate format (JSON/XML/CSV) for the client.
Update Handler
It is used for indexing, i.e. inserting, updating and deleting data in the index. For example, if we want our MySQL data to stay in sync with Apache Solr, we have to create an update handler responsible for the sync.
How Solr Stores Data
Apache Solr stores data in the index in an organized manner. Internally, data is organized as documents, where each document is a collection of fields. Corresponding to each document type is a schema that stores details about the fields and field types.
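To make the document/field/schema relationship concrete, here is a minimal sketch in Python. It is only an in-memory model and not how Solr stores anything internally; the SCHEMA dictionary and the validate function are hypothetical names introduced for illustration.

```python
# Hypothetical in-memory model: a schema maps field names to Python types,
# and a document is a plain dict of field values.
SCHEMA = {"id": str, "name": str, "department": str, "experience": int}

def validate(doc, schema=SCHEMA):
    """Return a list of problems found in doc; an empty list means it fits the schema."""
    problems = []
    for field, value in doc.items():
        if field not in schema:
            problems.append(f"unknown field: {field}")
        elif not isinstance(value, schema[field]):
            problems.append(f"{field}: expected {schema[field].__name__}")
    return problems

doc = {"id": "emp-001", "name": "Chris Nathan", "department": "Dev", "experience": 7}
print(validate(doc))                      # well-formed document
print(validate({"id": 1, "salary": 10}))  # wrong type for id, unknown field salary
```

In Solr itself the schema also decides how each field is analyzed and indexed, which this toy check does not attempt to model.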
Ingestion Approaches
Ingestion is the process of importing external data into the Solr index. There are at least 3 ways of doing this:
- For binary files (PDF, DOC, etc.), we can use tools like Solr Cell.
- For XML files, we can send HTTP requests.
- For custom ingestion, we may write our own program using the Solr client API.
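As an illustration of the custom ingestion route, the sketch below converts CSV text into a list of documents that could then be sent to Solr as JSON. It uses only the Python standard library rather than any Solr client; the function name csv_to_solr_docs, the column names and the sample rows are assumptions made for this example.

```python
import csv
import io
import json

def csv_to_solr_docs(csv_text, int_fields=("experience",)):
    """Turn CSV text with a header row into a list of dicts ready to be sent as JSON."""
    docs = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        doc = {key: value.strip() for key, value in row.items()}
        for field in int_fields:
            if field in doc:
                doc[field] = int(doc[field])  # numeric fields should not be sent as strings
        docs.append(doc)
    return docs

sample = (
    "id,name,department,experience\n"
    "emp-001,Chris Nathan,Dev,7\n"
    "emp-003,Naresh,Marketing,2\n"
)
payload = json.dumps(csv_to_solr_docs(sample), indent=2)
print(payload)
```

The resulting JSON array has the same shape as the request body used later when populating a collection.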
Facets and Constraints
Solr supports faceting. Faceting means arranging search results into categories. A facet represents a search result category. Constraints are the facet values.
As shown in the diagram, Shirts and Trousers are facets, while Full Sleeve, Half Sleeve, etc. are constraints, each associated with a count.
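A facet request is an ordinary query with a few extra parameters. The sketch below builds such a URL with the facet and facet.field parameters and converts the flat value/count list that Solr returns by default in JSON into a dictionary. The products collection, the sleeve field and the counts are hypothetical and used only for illustration.

```python
from urllib.parse import urlencode

def facet_query_url(base_url, field, query="*:*"):
    """Build a select URL that asks Solr for facet counts on one field."""
    params = {"q": query, "rows": 0, "facet": "true", "facet.field": field}
    return f"{base_url}/select?{urlencode(params)}"

def pair_up(flat_counts):
    """Convert a flat [value, count, value, count, ...] list into a dict."""
    return dict(zip(flat_counts[0::2], flat_counts[1::2]))

print(facet_query_url("http://localhost:8983/solr/products", "sleeve"))
print(pair_up(["Full Sleeve", 12, "Half Sleeve", 8]))
```

Setting rows to 0 skips the matching documents themselves, which is handy when only the constraint counts are needed.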
Let’s see Solr in action.
Download a binary release from https://solr.apache.org/downloads.html
Extract
tar -xvf solr-9.3.0.tgz
Start a node in cloud mode (-c)
cd solr-9.3.0
bin/solr start -c
Optionally, you may access the Solr Admin UI by navigating to http://localhost:8983/solr/#/
Step by Step
- Create a Collection
- Define Schema for Collection
- Populate Collection (Index some documents)
- Commit (make changes permanent)
- Execute Queries
In the real world, we would create a Django or Laravel app to perform the operations mentioned in the steps above. For now, we will use Postman for the purpose. You may download the Postman collection here.
Creating Collection
Make sure Apache Solr is up and running.
End Point: http://localhost:8983/api/collections
HTTP Verb: POST
Headers: Content-Type: application/json
Data:
{
"name": "employee",
"numShards": 1,
"replicationFactor": 1
}
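The same request can be issued from code instead of Postman. Below is a minimal sketch using only the Python standard library; it builds the request without sending it, and the helper name build_create_collection is an assumption introduced here.

```python
import json
import urllib.request

SOLR = "http://localhost:8983"  # assumes the local node started earlier

def build_create_collection(name, shards=1, replicas=1):
    """Build (but do not send) the POST request that creates a collection."""
    body = {"name": name, "numShards": shards, "replicationFactor": replicas}
    return urllib.request.Request(
        f"{SOLR}/api/collections",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

request = build_create_collection("employee")
print(request.get_method(), request.full_url)
# To actually send it against a running node:
# with urllib.request.urlopen(request) as response:
#     print(response.status, response.read().decode("utf-8"))
```

Keeping request construction separate from sending makes it easy to inspect the payload before it touches a live node.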
Defining Schema for Collection
End Point: http://localhost:8983/api/collections/employee/schema
HTTP Verb: POST
Headers: Content-Type: application/json
Data:
{
"add-field": [
{
"name": "name",
"type": "text_general",
"multiValued": false
},
{
"name": "department",
"type": "string",
"multiValued": false
},
{
"name": "designation",
"type": "string",
"multiValued": false
},
{
"name": "experience",
"type": "pint"
}
]
}
Note that the string type does not perform tokenization and is meant for fields used for exact matching and faceting. The text type performs tokenization and thus provides powerful partial matching. Visit this link to know more: Solr Field Types.
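The difference between the two field types can be pictured with a toy model. The sketch below is a rough approximation only: it lowercases and splits on whitespace, whereas Solr's real text_general analysis chain does more than that. Both function names are hypothetical.

```python
def matches_string(stored, query):
    """A string field is indexed as one exact, untokenized value."""
    return stored == query

def matches_text(stored, query):
    """Rough stand-in for a text field: lowercase, split on whitespace, match any token."""
    tokens = stored.lower().split()
    return query.lower() in tokens

print(matches_string("Chris Nathan", "Chris"))  # False: the whole value must match
print(matches_text("Chris Nathan", "chris"))    # True: one token matches
print(matches_string("Dev", "Dev"))             # True: exact match
```

This is why name is declared as text_general in the schema above, while department and designation, which are filtered and faceted on exact values, are declared as string.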
Populating Collection
End Point: http://localhost:8983/api/collections/employee/update
HTTP Verb: POST
Headers: Content-Type: application/json
Data:
[
{
"id": "emp-001",
"name": "Chris Nathan",
"department": "Dev",
"designation": "Analyst",
"experience": 7
},
{
"id": "emp-002",
"name": "Christina ",
"department": "Dev",
"designation": "Programmer",
"experience": 3
},
{
"id": "emp-003",
"name": "Naresh",
"department": "Marketing",
"designation": "Executive",
"experience": 2
}
]
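For reference, here is a hedged sketch of the same indexing request built in Python with the standard library. The helper name build_update is an assumption, and the check for an id field assumes the default configuration, where id is the unique key. The request is built but not sent.

```python
import json
import urllib.request

def build_update(collection, docs, base="http://localhost:8983"):
    """Build (but do not send) the POST request that indexes a list of documents."""
    missing = [doc for doc in docs if "id" not in doc]
    if missing:
        raise ValueError(f"{len(missing)} document(s) have no id field")
    return urllib.request.Request(
        f"{base}/api/collections/{collection}/update",
        data=json.dumps(docs).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

docs = [
    {"id": "emp-001", "name": "Chris Nathan", "department": "Dev",
     "designation": "Analyst", "experience": 7},
    {"id": "emp-003", "name": "Naresh", "department": "Marketing",
     "designation": "Executive", "experience": 2},
]
request = build_update("employee", docs)
print(request.get_method(), request.full_url, len(request.data), "bytes")
```

Passing the request to urllib.request.urlopen would send it to a running node.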
Committing the Changes
End Point: http://localhost:8983/api/collections/employee/config
HTTP Verb: POST
Headers: Content-Type: application/json
Data:
{
"set-property": {
"updateHandler.autoCommit.maxTime": 15000
}
}
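The request above configures auto-commit, so pending changes are committed automatically once they are at most 15000 milliseconds (15 seconds) old, rather than committing immediately. When an immediate commit is needed, one option is to post an explicit JSON commit command to the update handler. The sketch below builds such a request without sending it; the helper name build_commit is an assumption, and it targets the /solr/employee/update path.

```python
import json
import urllib.request

def build_commit(collection, base="http://localhost:8983"):
    """Build (but do not send) a request carrying an explicit JSON commit command."""
    return urllib.request.Request(
        f"{base}/solr/{collection}/update",
        data=json.dumps({"commit": {}}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

commit_request = build_commit("employee")
print(commit_request.full_url, commit_request.data.decode("utf-8"))
```

Frequent explicit commits are expensive, so relying on auto-commit as configured above is usually the better default for bulk indexing.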
Executing Queries
List all employees
http://localhost:8983/solr/employee/select?q=*
List all employees where department is ‘Dev’ (q=department:Dev)
http://localhost:8983/solr/employee/select?q=department:Dev
List only the name field (fl) of employees where department is Dev
http://localhost:8983/solr/employee/select?q=department:Dev&fl=name
Output (other fields omitted for brevity):
{
"numFound": 2,
"start": 0,
"numFoundExact": true,
"docs": [
{
"name": "Chris Nathan"
},
{
"name": "Christina "
}
]
}
List only the name and id fields (fl) of employees where department is Dev
http://localhost:8983/solr/employee/select?q=department:Dev&fl=name&fl=id
List only the name, id and experience fields (fl) of employees where department is Dev and experience is in the range 2 to 4 (fq=experience:[2 TO 4]). Remember that Solr queries are case sensitive: the TO keyword must be uppercase.
http://localhost:8983/solr/employee/select?q=department:Dev&fl=name&fl=id&fl=experience&fq=experience:[2 TO 4]
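When these URLs are built in code rather than typed into Postman or a browser, characters such as spaces, colons and square brackets need to be URL-encoded. The following sketch does that with the Python standard library; the helper name select_url is an assumption, and it joins the requested fields into a single comma-separated fl parameter, which Solr also accepts.

```python
from urllib.parse import urlencode

def select_url(collection, q, fl=(), fq=(), base="http://localhost:8983"):
    """Build an encoded select URL with optional field list and filter queries."""
    params = [("q", q)]
    if fl:
        params.append(("fl", ",".join(fl)))
    params.extend(("fq", filter_query) for filter_query in fq)
    return f"{base}/solr/{collection}/select?{urlencode(params)}"

url = select_url(
    "employee",
    "department:Dev",
    fl=["name", "id", "experience"],
    fq=["experience:[2 TO 4]"],
)
print(url)
```

The fq parameter filters the result set without affecting relevance scoring, which is why the range restriction goes there instead of into q.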
Worth considering:
Read the article How Proprietary Software Takes Away Your Freedom.
Excerpt from the article:
“On the Internet, proprietary software isn’t the only way to lose your computing freedom. Service as a Software Substitute, or SaaSS, is another way to give someone else power over your computing.
The basic point is, you can have control over a program someone else wrote (if it’s free), but you can never have control over a service someone else runs, so never use a service where in principle running a program would do.
SaaSS means using a service implemented by someone else as a substitute for running your copy of a program. The term is ours; articles and ads won’t use it, and they won’t tell you whether a service is SaaSS. Instead they will probably use the vague and distracting term “cloud,” which lumps SaaSS together with various other practices, some abusive and some ok. With the explanation and examples in this page, you can tell whether a service is SaaSS.”
Feel free to add a comment if you have any doubt, query or question. Any such discussion will help us grow together.
Happy Coding.