Hazelcast: In-Memory Computing to Build Scalable Auto-Complete Keyword Search Solution
Building Custom Scalable Auto-Complete Keyword Search Solution in Hazelcast
Introduction
In this blog, I’ll demonstrate the design and implementation of a back-end search service we’ve been using at FastPay (gofastpay.com) to power the auto-complete keyword search of all our front-end applications.
Prior to implementing such solution, we used to load all the data in our front-end applications and perform front-end filtering — a solution that quickly became non-scalable as our business grew. Building our own custom search solution on top of Hazelcast (https://hazelcast.com), a powerful in-memory computing platform, provided a seamless experience to both our clients and internal users.
Source Code
The source code of this project can be downloaded from:
Instructions on how to build/run a docker image of this project is provided in the README file.
The Problem: Searching Invoices
Let’s take this simple problem of searching invoices stored in a relational DB. Figure 1. shows a simple schema to store basic invoice information. And for simplicity assume that the user of our application wants to auto-complete search on:
- invoice number (invoice table)
- amount (invoice table)
- payee name (payee table)
- payor name (payor table)
1- Database design
2- Sample Data
For your convenience and in order to be able to run the source code and experiment with the solution, the project uses H2 in memory database. The schema shown above is created and a sample data is loaded from data.sql file upon running the project.
Once you run the project, The H2 console can be accessed from the URL: http://localhost:17600/h2-console
The following SQL retrieves all invoices and invoices related data from the DB.
SELECT *
FROM invoice
JOIN payor
ON ( invoice.payor_id = payor.payor_id )
JOIN payee
ON ( invoice.payee_id = payee.payee_id )
In Hazelcast we want to cache only selected columns, mainly, columns that the user will be searching on (i.e. invoice_number, amount, payee name, payor name). The above SQL can be re-written as:
SELECT invoice_id,
invoice_number,
amount,
payor.name AS payor_name,
payee.name AS payee_name
FROM invoice
JOIN payor
ON ( invoice.payor_id = payor.payor_id )
JOIN payee
ON ( invoice.payee_id = payee.payee_id )
3- What to cache in Hazelcast?
The invoice data selected from the DB shown above will be loaded into Hazelcast’s in memory distributed Map. Each DB row will be stored as a Map item. The key of the item is the invoice_id and the value is a JSON object containing the values of each field we would like to search on as shown below:
key: 1value:
{
“invoice_number” : “111”,
“payor_name” : “Sinclair"
“payee_name” : “Sonobi”,
“amonut” : 100.12
}key: 2value:
{
“invoice_number” : “222”,
“payor_name” : “Sinclair"
“payee_name” : “Adrenalin”,
“amonut” : 41.86
}.
.
.
.
.
When running on a cluster, Hazelcast distributes the data on multiple nodes and manages redundancy. Not only that, Hazelcast provides a powerful map/reduce operations that run on all nodes. This will be illustrated later on.
API — Defining The Search API
Now that we have defined the structure of the data in Hazelcast’s Map, let’s define the APIs needed to construct (i.e. load data from DB), search, and update (i.e. keep the data in sync with the DB) of such map. I’ll first define the APIs and I’ll provide examples in the next section.
- DELETE /invoice: Loads the data from the database into Hazelcast’s distributed map. This is usually a costly operation and usually is only needed once the cluster is launched.
- GET /invoice: Returns all the possible search values.
- PUT /invoice: Returns a subset of the possible search values based on a filter that the user of the service provides.
- POST /invoice: This is the search API. It takes an object containing all the values of the indices you want to search on and returns the keys (invoice_ids) of each Map element matching the search.
- PATCH /invoice: This is to update the map item(s) whenever a corresponding invoice(s) has been modified in the database.
APIs in Action — Examples
Note: You can experiment with the APIs by accessing the embedded swagger page: http://localhost:17600/swagger-ui.html
DELETE /invoice : Cache searchable terms in Hazelcast
This is the API to rebuild the Map in Hazelcast and cache all the data retrieved from the DB
Request:
curl -X DELETE "http://localhost:17600/v1/invoice" -H "accept: */*"
Response: 200 Ok
1 — GET /invoice: Get all unique searchable Keywords
This API returns all the searchable items and their corresponding possible values from Hazelcast. The result of this API are provided to the consumer application for the user to select from.
Request:
curl -X GET "http://localhost:17600/v1/invoice" -H "accept: */*"
Response:
2 — PUT: Get contextual keyword search
The GET /invoice API loads all possible search terms without filter or adding a context to the result. The PUT /invoice API is to compliment the GET API and further filter the possible search values based on a filter. The filter is of the form:
{
"amount": [
0
],
"invoiceNumber": [
"string"
],
"payeeName": [
"string"
],
"payorName": [
"string"
]
}
Let’s say the user types “A” and the UI filters on all loaded data in the previous step (i.e. GET /invoice). The choices of keywords that starts or contains “A” will be shown
The user selects “Adrenalin” as Payee (shown below). Once this choice is made, subsequent auto-complete choices should only pertain to the search invoices that have Adrenalin as payee. For instance, it should NOT show an amount of 341.00, since none of the Adrenalin invoices has this amount.
To get the new list of searchable values, PUT /invoice is called and the following object structure is passed on in the request body:
{
“payeeName” :[
“Adrenalin”
]
}
The full request using swagger is shown below
Request:
curl -X PUT "http://localhost:17600/v1/invoice" -H "accept: */*" -H "Content-Type: application/json" -d "{ \"payeeName\": [ \"Adrenalin\" ]}"
Response:
As shown, the new possible search values are a subset of all the search values cached in Hazelcast obtained in the GET request.
What is missing? Prefix Search instead of GET / PUT above implementation
One of the limitation of the current implementation is that the GET and PUT APIs load all the search values into the UI, a solution that is not scalable. Instead, we could have used Prefix Search construct in Hazelcast by which the consuming application (i.e. UI) does not load all possible searchable values, but only retrieves the values as the user types. The details of such implementation will be left for another blog post.
3a — POST /invoice: Search example-1
Next is to show the implementation of the actual search. This search takes as input an object of the form:
{
"amount": [
0
],
"invoiceNumber": [
"string"
],
"payeeName": [
"string"
],
"payorName": [
"string"
]
}
The request body is similar to the one used in the PUT /invoice request. The only difference is that the search POST /invoice API returns the keys (i.e. invoice_ids) of the items in the Hazelcast map. Those keys can be exchanged to get the actual invoice data from the database to be displayed in the UI.
For instance, searching for payeeName “Adrenalin” will return 4 invoices:
Request:
curl -X POST "http://localhost:17600/v1/invoice?page=0&pageSize=10" -H "accept: */*" -H "Content-Type: application/json" -d "{ \"payeeName\": [ \"Adrenalin\" ]}"
Response:
{
"invoiceId": [
2,
3,
4,
9
],
"totalAmount": 5138.96
}
4 invoices matches the search request. Notice how totalAmount is returned which is the sum of the amounts of all 4 invoices. Such aggregation of amount is achieved using Hazelcast’s powerful Aggregator construct that can carry such computation on a cluster. The code below shows the implementation of the total amount in this project
Aggregator<Map.Entry<Long, InvoiceKeywordIndexEntry>, BigDecimal> invoiceUsdTotaller =
Aggregators.bigDecimalSum("amount");
BigDecimal totalAmount = invoiceDataMap.aggregate(invoiceUsdTotaller, searchPredicate);
It is also possible to build a custom Aggregator class to perform more complex operations on the result set of the search predicate.
3b — POST /invoice: Search example-2
Let’s try another search request as shown below:
{
"payeeName": [
"Sonobi" , "Adrenalin"
],
"payorName": [
"Sinclair"
]
}
This search request can be translated as such:
Return all the invoice ids that have:
PayeeName=(“Sonobi” OR “Adrenalin”) AND (payorName=“Sinclair”)
From the database data, the result corresponds to invoice_ids (1,2, and 3). And the total invoice amount is 2141.98 as shown below
{
"invoiceId": [
1,
2,
3
],
"totalAmount": 2141.98
}
4 — PATCH: Update
The last construct is to update Hazelcast’s map entry whenever any of the cached fields changes in the DB. Let’s say that another process modified the amount of an invoice, then the process that modified the invoice should tell Hazelcast to update its items corresponding to that invoice by calling PATCH API and specifies one or many invoices ids to update. The search service will retrieve the latest requested invoice data from the DB and replace their corresponding Map entries.
The following curl shows simple call to patch multiple invoices in one command.
curl -X PATCH "http://localhost:17600/v1/invoice" -H "accept: */*" -H "Content-Type: application/json" -d "{ \"invoiceIds\": [ 1,2,3 ]}"
The input object to the curl command
Hazelcast on AWS ECS cluster
Configuring Hazelcast to run on ECS cluster requires the cluster’s VPC subnets IP range, AWS region, and that the ECS cluster be tagged by any key-value pair. Also, the network mode of each ECS Task be set to Host mode instead of Bridge mode.
The following JAVA code creates a configuration that is passed to Hazelcast’s map configuration to allow the auto-discovery of Hazelcast tasks running on the same cluster.