Leveraging OpenSearch Point in Time for Consistent Marketing Ads

Toya Okeke · Published in SSENSE-TECH · Jul 26, 2024

OpenSearch provides numerous ways to ingest, search, and visualize data at scale. While searching a live dataset is effective, and in many cases necessary, continuous updates can cause long-running processes to capture those changes incorrectly. At SSENSE, we have integrated OpenSearch as our live product catalog for customers. It also powers automated jobs such as sending products to affiliates for marketing ads. However, constant changes to the catalog, like stock updates and listing/delisting products, can lead to duplicate or missing products being sent to affiliates. This article explores how Point in Time (PIT) search and orchestration solve this issue.

Generating Marketing Ads

Marketing ads are one of the many components of our marketing strategy at SSENSE. Sharing our product catalog with affiliates, such as Google and Instagram, allows us to expose products to current and potential customers.

At SSENSE, we share our catalog with affiliates by searching the live index in our OpenSearch cluster. The problem is that this index is always changing, as products are constantly updated. Because we have to paginate and scroll through the catalog while it changes, duplicate or missing products can be sent to affiliates, potentially leading to the rejection of the ads we intend to publish.

Upon discovering duplicates in our marketing products, we investigated further and determined that pagination over a live index was the root cause: as documents were added, removed, or re-ranked between page requests, the same product could appear on two pages while another was skipped entirely.

Deep Pagination and Scrolling

Scrolling through the SSENSE catalog is a resource-intensive task for OpenSearch. These queries fall in the realm of deep pagination, and the generic from and size OpenSearch queries that we use to search catalog pages by gender were never intended for it.

The from and size feature is not recommended for deep pagination because every page request forces OpenSearch to process all hits up to the requested offset, only to discard everything outside the requested page.
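
For illustration, here is roughly what a deep from/size page against the live alias looks like; the gender filter is a hypothetical stand-in for our catalog-page query. Every shard still has to fetch and rank everything before the offset, only to throw it away:

GET /<live_index_alias>/_search
{
  "from": 9900, # everything before this offset is still processed, then discarded
  "size": 100, # from + size is capped by index.max_result_window (10,000 by default)
  "query": {
    "match": { "gender": "women" } # hypothetical filter
  }
}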

The scroll method was the only way to handle deep pagination in OpenSearch version 2.3. scroll search results are frozen at the moment of the request, allowing users to paginate the response while ignoring live index updates. When we discovered the duplicate/missing marketing products, our OpenSearch cluster was not on a version that supported alternative methods, so we were forced to use scroll. However, OpenSearch no longer recommends this method because:

  • ❌ Scrolling is bound to the search query it was created with: a new context must be opened for each different query, which is memory intensive if done frequently and less flexible.
  • ❌ Scrolling only moves forward, so we cannot retry a failed batch; a retry simply returns the next page’s results.
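
For context, the scroll flow we relied on looks roughly like this; the 1m keep-alive is illustrative:

# open a scroll context; results are frozen at the moment of this request
GET /<live_index_alias>/_search?scroll=1m
{
  "size": 10000,
  "query": {
    # filter inputs
  }
}

# fetch the next page (forward only) using the scroll_id from the previous response
GET /_search/scroll
{
  "scroll": "1m",
  "scroll_id": "<scroll_id>"
}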

Considering these issues, it’s risky to expose scrolling in the Product Service that client applications call. And because scrolling is not integrated with our Product Service, it introduces technical debt: the Marketing Job (the process that sends products to affiliates for marketing ads) has to query OpenSearch twice.

Leveraging PIT

After upgrading to OpenSearch version 2.5, we were able to use Point in Time (PIT), which is strongly recommended over the scroll method. A PIT is a dataset fixed in time, allowing you to run any search query against it and obtain consistent results. This frozen dataset makes it easier to use deep pagination techniques and retrieve consistent results faster. PITs solve the deep pagination scenarios that scroll cannot because:

  • ✅ PITs are not bound to a search query; you can run any query against the same PIT, as the sketch below shows.
  • ✅ PITs support bi-directional paging (i.e., searching forward and backward), allowing you to retry a page if it fails.
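
As a minimal sketch of that first point, the same PIT can serve completely unrelated queries; the filters below are hypothetical:

# create the PIT once (keep_alive value is illustrative)
POST /<live_index_alias>/_search/point_in_time?keep_alive=10m

# query it one way...
GET /_search
{
  "query": {
    "term": { "gender": "women" } # hypothetical filter
  },
  "pit": { "id": "<pit_id>" }
}

# ...then a completely different way, against the same frozen dataset
GET /_search
{
  "query": {
    "term": { "inStock": true } # another hypothetical filter
  },
  "pit": { "id": "<pit_id>" }
}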

Based on these advantages, PIT integrates seamlessly with our Product Service, keeping marketing queries separate from the live catalog index. It also eliminates the technical debt, as the Marketing Job now queries OpenSearch only once.

This is a significant milestone! We now have an immutable dataset that we can use to fetch marketing products. That said, to manage PITs efficiently, we needed an automated system capable of creating, deleting, and searching PITs without putting too much strain on our OpenSearch cluster.

PIT Orchestration

The Product Worker is responsible for handling PIT creation and deletion. When creating a PIT, we must specify the target index and its TTL. We can do this using the alias of the live SSENSE index in OpenSearch with strict matching.

# expand_wildcards=none means it will ignore wildcards or regex patterns (strict matching)
POST /<live_index_alias>/_search/point_in_time?keep_alive=<ttl>&expand_wildcards=none
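
The create call responds with a pit_id, which every subsequent search references and which the Product Worker passes back when deleting a stale PIT. A sketch, with the response abridged:

# response (abridged)
{
  "pit_id": "<generated_pit_id>",
  "creation_time": <epoch_millis>
}

# deleting one or more stale PITs by ID
DELETE /_search/point_in_time
{
  "pit_id": ["<stale_pit_id>"]
}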

Since PITs are immutable and have an expiration, there are two key considerations to keep in mind when searching against them:

  1. There should always be a PIT to search against to prevent the Marketing Job from failing.
  2. It is essential to create a new PIT on a schedule to ensure the Marketing Job does not search against outdated data.

OpenSearch automatically removes expired PITs. By creating new PITs on a schedule and having the Marketing Job search against the most recent one, OpenSearch can clean up stale PITs without manual intervention.

Here is a simple architecture that achieves this:

  • ℹ️ A Lambda is triggered on a schedule; the schedule can repeat after a PIT expires and fire just before the Marketing Job starts.
  • ℹ️ There are manual triggers to create new and delete stale PITs if needed.
  • ℹ️ (Optional) We can add extra steps in the state machine(s) to check if a PIT exists or if there are stale PITs to delete before sending the request to OpenSearch.
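
For that optional check, OpenSearch exposes a listing endpoint that the state machine can call before creating or deleting anything; the Marketing Job can likewise use it to pick the most recent PIT by creation_time. Response abridged:

GET /_search/point_in_time/_all

# response (abridged): search against the PIT with the latest creation_time;
# anything older is a candidate for deletion
{
  "pits": [
    {
      "pit_id": "<pit_id>",
      "creation_time": <epoch_millis>,
      "keep_alive": <millis_remaining>
    }
  ]
}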

PIT Availability

A PIT can be deleted in four ways:

  1. When the PIT expires
  2. When the PIT is manually deleted
  3. When the node where the PIT is located dies
  4. When the cluster dies

For scenarios (3) and (4), the best approach is to optimize PIT queries so they do not overuse the node’s resources. This can be done by querying in small parallel batches, increasing the request intervals, or even increasing the size of the nodes in our cluster.

Situations (1) and (2) can be addressed using our Product Worker, where we automatically create PITs on a schedule. However, we want to avoid having gaps where no PIT exists.

Figure: PITs per Node with Gap

If there is no PIT to query, the Marketing Job will fail, so it’s crucial to ensure that one always exists. Overlapping our scheduled job start and PIT expiration times protects the Marketing Job from these unwanted failures. For example, if each PIT is created with a 48-hour keep_alive but a new one is created every 24 hours, a fresh PIT is always available before the previous one expires.

Figure: PITs per Node with Overlap

PIT Search

The Product Service is responsible for searching against the most recent PIT. This abstraction prevents race conditions where multiple queries could create or delete PITs simultaneously, causing potential inconsistencies in our Marketing Job or errors in our OpenSearch cluster.

There are two preferred methods for searching PITs:

  1. Using search_after and sort
  2. Using slice

If you don’t need results in any specific order, if you want the ability to jump from a page to a non-consecutive page, or if you would like to perform parallel search requests, I would recommend PIT slicing for your use case. But here is a short list of the pros and cons for each option.

PIT with search_after and sort

Using search_after and sort parameters with a PIT ID allows you to retrieve the desired page while controlling the order and number of documents per page.

GET /_search
{
  "size": 10000,
  "query": {
    # filter inputs
  },
  "pit": {
    "id": "some_id"
  },
  "sort": [
    { "@timestamp": { "order": "asc" } }
  ],
  "search_after": [
    "2021-05-20T05:30:04.832Z"
  ] # sort value from the last document hit of the previous query
}
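
Each document in the response carries the sort value that search_after consumes; the last hit of the current page seeds the request for the next one. Abridged:

# last hit of the current page (abridged)
{
  "_id": "<doc_id>",
  "sort": ["2021-05-20T05:30:04.832Z"] # becomes search_after in the next request
}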

PROS:

  • ✅ Gives control over how many documents are desired per page.
  • ✅ Provides a cursor for each document in the query response to start paging from.

CONS:

  • ⚠️ search_after parameter is determined by the sort value from the previous response, meaning only sequential paging can be done.
  • ⚠️ Sorting is a memory-intensive operation but is required to retrieve the sort value in the documents to use with search_after.

If sorting is required for your use case, consider leveraging this approach since it provides more control over the page size and where to start paginating while sorting. Keep in mind that you will lose the parallel processing ability since only sequential paging can be done.

PIT with slice

Slicing takes the PIT query result and slices the response into X number of pages, while trying to evenly distribute the documents per page.

GET /_search
{
  "slice": {
    "id": 0, # page number to return in the response (0 <= id < max)
    "max": 2 # total pages (2 <= max <= the index.max_slices_per_pit index-level setting)
  },
  "query": {
    # filter inputs
  },
  "pit": {
    "id": "some_id"
  }
}

PROS:

  • ✅ Sorting is not required to use slicing.
  • ✅ We can perform parallel requests to retrieve all pages by slice.id since each slice is treated independently.

CONS:

  • ⚠️ We cannot control how many documents are returned per slice, which impacts memory and CPU when making parallel requests over the network.
  • ⚠️ Sorting PIT slices can yield unexpected results since each slice is sorted independently.

If sorting is required for your use case, slicing can still be used. However, remember that slices are treated independently. Depending on the type of sort being done, you may not get the expected results.
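
To make the parallel option concrete, the two requests below can be issued concurrently, and together the slices cover every document the query matches:

# worker 1
GET /_search
{
  "slice": { "id": 0, "max": 2 },
  "query": {
    # filter inputs
  },
  "pit": { "id": "some_id" }
}

# worker 2, running in parallel
GET /_search
{
  "slice": { "id": 1, "max": 2 },
  "query": {
    # filter inputs
  },
  "pit": { "id": "some_id" }
}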

PIT Extending

Regardless of the PIT search method, our Marketing Job still needs to make multiple requests against the PIT to fetch the catalog, and we want to avoid having the PIT expire while the job runs. We achieve this by having the Product Service search against the most recent PIT while extending its TTL on each query. This way, older PITs can still expire naturally while the Marketing Job queries the Product Service.

GET /_search
{
  "query": {
    # filter inputs
  },
  "pit": {
    "id": "some_id",
    "keep_alive": "3s" # extended on each query to avoid expiration before the subsequent call
  }
}

Node Resource Consumption

When we create a PIT against the live index, it is randomly assigned to a data node in the cluster by default. All queries against the PIT are sent to that node, which impacts the node’s search rate and CPU usage.

Figures: Search Rate per Node and CPU per Node

The good news is that PITs reduce the risk of using all the resources in our cluster since they are assigned to a single node. This means only some customer queries would be impacted by the node going down. However, we still want to maintain an exceptional customer experience, so regulating queries against a PIT will prevent the node’s CPU from reaching its limit.

It is important to avoid requesting too many documents per page, regardless of whether we use PIT with search_after or PIT with slice. Parallel requests, or requests in close sequence, can exceed the JVM heap limit on the node and be rejected with an error like this:

{
  "error": {
    "root_cause": [
      {
        "type": "rejected_execution_exception",
        "reason": "cancelled task with reason: heap usage exceeded [812.2mb >= 438.2mb]"
      }
    ]
  }
}

The Impact of PIT Integration

After piecing all these components together, we are left with a fully automated Point in Time (PIT) orchestration and search system.

By leveraging this design in our marketing ads generation process, we are able to deliver consistent product documents to our affiliates while eliminating technical debt. We also separated the Marketing Job’s queries from SSENSE customers’ queries, reducing the risk of a subpar customer experience.

Leveraging PITs in our Marketing Job made our platform more robust, as we can now search a frozen catalog at scale with minimal impact on our customer experience. The integration of PITs with our Product Worker and Service also allows other processes to leverage PITs when necessary. There are other aspects of PIT integration to consider, such as managing socket timeouts and active socket TTLs, which could be the topic of a future article…

Editorial reviews by Catherine Heim, Luba Mikhnovsky & Mario Bittencourt.

Want to work with us? Click here to see all open positions at SSENSE!
