Hazelcast: In-Memory Computing to Build Scalable Auto-Complete Keyword Search Solution

Hussein Moghnieh, Ph.D.
8 min readAug 29, 2020

--

Building Custom Scalable Auto-Complete Keyword Search Solution in Hazelcast

Introduction

In this blog, I’ll demonstrate the design and implementation of a back-end search service we’ve been using at FastPay (gofastpay.com) to power the auto-complete keyword search of all our front-end applications.

Prior to implementing such solution, we used to load all the data in our front-end applications and perform front-end filtering — a solution that quickly became non-scalable as our business grew. Building our own custom search solution on top of Hazelcast (https://hazelcast.com), a powerful in-memory computing platform, provided a seamless experience to both our clients and internal users.

Source Code

The source code of this project can be downloaded from:

Instructions on how to build/run a docker image of this project is provided in the README file.

The Problem: Searching Invoices

Let’s take this simple problem of searching invoices stored in a relational DB. Figure 1. shows a simple schema to store basic invoice information. And for simplicity assume that the user of our application wants to auto-complete search on:

  • invoice number (invoice table)
  • amount (invoice table)
  • payee name (payee table)
  • payor name (payor table)

1- Database design

Figure 1. Simple DB schema to store invoice and invoice related data

2- Sample Data

For your convenience and in order to be able to run the source code and experiment with the solution, the project uses H2 in memory database. The schema shown above is created and a sample data is loaded from data.sql file upon running the project.

Once you run the project, The H2 console can be accessed from the URL: http://localhost:17600/h2-console

Figure 2. H2 DB console

The following SQL retrieves all invoices and invoices related data from the DB.

SELECT *
FROM invoice
JOIN payor
ON ( invoice.payor_id = payor.payor_id )
JOIN payee
ON ( invoice.payee_id = payee.payee_id )
Figure 3. Sample Invoice data

In Hazelcast we want to cache only selected columns, mainly, columns that the user will be searching on (i.e. invoice_number, amount, payee name, payor name). The above SQL can be re-written as:

SELECT invoice_id, 
invoice_number,
amount,
payor.name AS payor_name,
payee.name AS payee_name
FROM invoice
JOIN payor
ON ( invoice.payor_id = payor.payor_id )
JOIN payee
ON ( invoice.payee_id = payee.payee_id )
Figure 4. Searchable Invoice data to be cached in Hazelcast

3- What to cache in Hazelcast?

The invoice data selected from the DB shown above will be loaded into Hazelcast’s in memory distributed Map. Each DB row will be stored as a Map item. The key of the item is the invoice_id and the value is a JSON object containing the values of each field we would like to search on as shown below:

key: 1value:
{
“invoice_number” : “111”,
“payor_name” : “Sinclair"
“payee_name” : “Sonobi”,
“amonut” : 100.12
}
key: 2value:
{
“invoice_number” : “222”,
“payor_name” : “Sinclair"
“payee_name” : “Adrenalin”,
“amonut” : 41.86
}
.
.
.
.
.

When running on a cluster, Hazelcast distributes the data on multiple nodes and manages redundancy. Not only that, Hazelcast provides a powerful map/reduce operations that run on all nodes. This will be illustrated later on.

API — Defining The Search API

Now that we have defined the structure of the data in Hazelcast’s Map, let’s define the APIs needed to construct (i.e. load data from DB), search, and update (i.e. keep the data in sync with the DB) of such map. I’ll first define the APIs and I’ll provide examples in the next section.

  • DELETE /invoice: Loads the data from the database into Hazelcast’s distributed map. This is usually a costly operation and usually is only needed once the cluster is launched.
  • GET /invoice: Returns all the possible search values.
  • PUT /invoice: Returns a subset of the possible search values based on a filter that the user of the service provides.
  • POST /invoice: This is the search API. It takes an object containing all the values of the indices you want to search on and returns the keys (invoice_ids) of each Map element matching the search.
  • PATCH /invoice: This is to update the map item(s) whenever a corresponding invoice(s) has been modified in the database.

APIs in Action — Examples

Note: You can experiment with the APIs by accessing the embedded swagger page: http://localhost:17600/swagger-ui.html

DELETE /invoice : Cache searchable terms in Hazelcast

This is the API to rebuild the Map in Hazelcast and cache all the data retrieved from the DB

Request:

curl -X DELETE "http://localhost:17600/v1/invoice" -H  "accept: */*"

Response: 200 Ok

1 — GET /invoice: Get all unique searchable Keywords

This API returns all the searchable items and their corresponding possible values from Hazelcast. The result of this API are provided to the consumer application for the user to select from.

Request:

curl -X GET "http://localhost:17600/v1/invoice" -H  "accept: */*"

Response:

Figure 5. Response of GET /invoice API containing all possible search values

2 — PUT: Get contextual keyword search

The GET /invoice API loads all possible search terms without filter or adding a context to the result. The PUT /invoice API is to compliment the GET API and further filter the possible search values based on a filter. The filter is of the form:

{
"amount": [
0
],
"invoiceNumber": [
"string"
],
"payeeName": [
"string"
],
"payorName": [
"string"
]
}

Let’s say the user types “A” and the UI filters on all loaded data in the previous step (i.e. GET /invoice). The choices of keywords that starts or contains “A” will be shown

The user selects “Adrenalin” as Payee (shown below). Once this choice is made, subsequent auto-complete choices should only pertain to the search invoices that have Adrenalin as payee. For instance, it should NOT show an amount of 341.00, since none of the Adrenalin invoices has this amount.

To get the new list of searchable values, PUT /invoice is called and the following object structure is passed on in the request body:

{
“payeeName” :[
“Adrenalin”
]
}

The full request using swagger is shown below

Figure 6. PUT /invoice to retrieve possible searchable values for payeeName=Adrenalin

Request:

curl -X PUT "http://localhost:17600/v1/invoice" -H  "accept: */*" -H  "Content-Type: application/json" -d "{  \"payeeName\": [    \"Adrenalin\"  ]}"

Response:

As shown, the new possible search values are a subset of all the search values cached in Hazelcast obtained in the GET request.

Figure 7. PUT /invoice response displaying possible searchable values for payeeName=Adrenalin

What is missing? Prefix Search instead of GET / PUT above implementation

One of the limitation of the current implementation is that the GET and PUT APIs load all the search values into the UI, a solution that is not scalable. Instead, we could have used Prefix Search construct in Hazelcast by which the consuming application (i.e. UI) does not load all possible searchable values, but only retrieves the values as the user types. The details of such implementation will be left for another blog post.

3a — POST /invoice: Search example-1

Next is to show the implementation of the actual search. This search takes as input an object of the form:

{
"amount": [
0
],
"invoiceNumber": [
"string"
],
"payeeName": [
"string"
],
"payorName": [
"string"
]
}

The request body is similar to the one used in the PUT /invoice request. The only difference is that the search POST /invoice API returns the keys (i.e. invoice_ids) of the items in the Hazelcast map. Those keys can be exchanged to get the actual invoice data from the database to be displayed in the UI.

For instance, searching for payeeName “Adrenalin” will return 4 invoices:

Figure 8. POST /invoice to search for invoices with payeeName=Adrenalin

Request:

curl -X POST "http://localhost:17600/v1/invoice?page=0&pageSize=10" -H  "accept: */*" -H  "Content-Type: application/json" -d "{  \"payeeName\": [    \"Adrenalin\"  ]}"

Response:

{
"invoiceId": [
2
,
3
,
4
,
9
],
"totalAmount": 5138.96
}

4 invoices matches the search request. Notice how totalAmount is returned which is the sum of the amounts of all 4 invoices. Such aggregation of amount is achieved using Hazelcast’s powerful Aggregator construct that can carry such computation on a cluster. The code below shows the implementation of the total amount in this project

Aggregator<Map.Entry<Long, InvoiceKeywordIndexEntry>, BigDecimal> invoiceUsdTotaller =
Aggregators.bigDecimalSum("amount");

BigDecimal totalAmount = invoiceDataMap.aggregate(invoiceUsdTotaller, searchPredicate);

It is also possible to build a custom Aggregator class to perform more complex operations on the result set of the search predicate.

3b — POST /invoice: Search example-2

Let’s try another search request as shown below:

{

"payeeName": [
"Sonobi" , "Adrenalin"
],
"payorName": [
"Sinclair"
]
}

This search request can be translated as such:

Return all the invoice ids that have:

PayeeName=(“SonobiORAdrenalin”) AND (payorName=“Sinclair”)

From the database data, the result corresponds to invoice_ids (1,2, and 3). And the total invoice amount is 2141.98 as shown below

{
"invoiceId": [
1
,
2
,
3
],
"totalAmount": 2141.98
}

4 — PATCH: Update

The last construct is to update Hazelcast’s map entry whenever any of the cached fields changes in the DB. Let’s say that another process modified the amount of an invoice, then the process that modified the invoice should tell Hazelcast to update its items corresponding to that invoice by calling PATCH API and specifies one or many invoices ids to update. The search service will retrieve the latest requested invoice data from the DB and replace their corresponding Map entries.

The following curl shows simple call to patch multiple invoices in one command.

curl -X PATCH "http://localhost:17600/v1/invoice" -H  "accept: */*" -H  "Content-Type: application/json" -d "{  \"invoiceIds\": [    1,2,3  ]}"

The input object to the curl command

Figure 9. Patch /invoice to sync invoice 1,2, and 3 with the DB

Hazelcast on AWS ECS cluster

Configuring Hazelcast to run on ECS cluster requires the cluster’s VPC subnets IP range, AWS region, and that the ECS cluster be tagged by any key-value pair. Also, the network mode of each ECS Task be set to Host mode instead of Bridge mode.

The following JAVA code creates a configuration that is passed to Hazelcast’s map configuration to allow the auto-discovery of Hazelcast tasks running on the same cluster.

Figure 10. JAVA Hazelcast ECS cluster configuration

Future Blog: Prefix Search

--

--