A Case of Multi-Tenant Search Index

Gandharv Srivastava
Capillary Technologies
12 min read · Oct 6, 2023

As a leading CRM platform, Capillary has always understood the importance of powerful search capabilities across various customer data stored in our comprehensive data platform. As our customer base grows, the need for advanced search features becomes even more vital, especially for users of our SaaS products.

In this blog post, we’ll explore how we at Capillary have improved our search features to better meet the diverse needs of our customers. We’ll carefully examine the challenges and complexities of our previous search workflows, uncover the pain points and bottlenecks we’ve addressed, and show how our solutions have improved the search experience to support searches for multiple tenants and their respective entities. Lastly, we’ll give you an overview of the updated architecture, emphasizing the significant improvements we’ve made to deliver a stronger and more efficient search solution.

Motivation

In Capillary’s data platform, data is rapidly ingested from various global regions, often on the order of hundreds of thousands of events per minute. These data sources include different types of information, like invoice details (transactions), customer details, and product information.

Events from different sources and interests are grouped and called entities. For example, bill transactions are “Transactional entity events,” customer-related events are “Customer entity events,” and so on.

Different search use cases for different tenants

As we welcome more customers, they come with their own specific search needs that require flexibility. These needs can involve various aspects and different types of data, as shown in the image above:

Tenant #1 needs to search for product details using SKU (stock keeping unit) codes, customer details using customer identifiers, and also perform two different types of searches on transaction details involving, let’s say, bill number and date of transaction.

Tenant #2, on the other hand, only requires a customer search based on first names and a transaction search using its own custom entity field.

Tenant #3 doesn’t want to perform any sort of searches.

Our old search system was limited to querying customer data, specifically retrieving customer details using identifiers like first name, mobile number, or email ID. This worked well for the use case of looking up a customer by whichever identifier they remembered. With the emergence of the new use cases discussed above, we needed to enhance this existing feature. Let’s first examine the various use cases we set out to solve.

Search Use Cases

Use case #1: Searches by combination of entity attributes

This use case involves searching for entities by combining multiple fields, usually using unique schema fields within transaction or customer data.

Within Capillary, a tenant can decide what information their entities should include to be stored within our core data platform. For instance, one tenant may choose to have only the first name and last name in their customer details, while another tenant might include additional fields like address, date of birth, and anniversary date.

These fields offer versatility to meet the specific needs of different industries and tenants. For example, in the airline industry, a tenant might use PNR codes, origin, and destination to search for flight details and handle retro claims. In hospitality, a tenant might search for guest stay information using criteria like the guest’s name and stay dates. These searches need to be highly flexible to handle various field types.

Our goal is to enable entity attribute searches that encompass the following key features:

  • Exact lookup searches using a single field or a combination of multiple fields
  • Range searches across specified fields
  • Multilingual search support
  • Data archival support for searchable data after a defined retention period

Use case #2: Prefix searches by single attributes

Every customer data platform should be able to search using essential identifiers like mobile numbers and email addresses. While our current system partially supported this, a new workflow gave us a chance to address existing issues, which we’ll discuss below.

These searches mainly focused on customer-related data, where the goal is to retrieve customer profiles using their identifiers, such as in a store or call center. These searches typically involve straightforward prefix-based text queries and do not demand the same degree of flexibility as outlined in use case #1.

Existing search workflow

Our old search system was mainly for finding customers using identifiers (use case #2). It included additional searchable data on top of the core platform data. When searching, we would query both the specific search data and the core data to get the results.

Existing workflow for search data ingestion
Existing workflow for searches

Pain points

Our existing workflow worked well for customer searches, but when new use cases emerged from various areas, we faced significant challenges in accommodating these custom search needs.

Limited Flexibility for New Search Types

Our existing workflow limited the variety of searches we could offer. With the old approach, each new type of search required indexing new fields in both our search and core databases. For instance, if searches were needed on two new fields, we had to create a new index on our core database and issue more queries against the relevant tables. Since ours is a multi-tenant system, adding a new index meant indexing the data of tenants who didn’t even require searches on those fields, which in turn put pressure on database resources. Additionally, each such extension required a data schema change within our search database, making every tenant’s search requirement a separate task for us to pick up.

Overwhelming Volume of Searchable Data

New demands required searching across multiple entities, significantly increasing the searchable data volume. To address this, we introduced new indexes and restructured core data, sometimes exceeding 100 million records. This led to memory pressure and poor search performance in our search database, whose architecture didn’t support sharding. We initially scaled it vertically, but as data continued to grow, serving searches from the search database was no longer resource-efficient. Data archiving wasn’t an option due to customer data policies.

To mitigate these challenges, we stopped new data replication to the search database and began relying more on core databases for our searches.

Pain points with existing workflows

Performance Challenges on the Core Platform

As the volume of searchable data increased, the performance of the search database began to degrade. This led to a heavier reliance on our core platform databases for search responses, which meant introducing more indexes on the core database. To prevent similar resource-related issues on our core platform databases, we had to scale them vertically, which became increasingly challenging and costly to manage within the existing workflow. This situation mandated a strategic shift to alleviate the strain on our infrastructure and ensure a smoother, more efficient user experience across the entire platform.

Approach

We will delve into each component in greater detail and provide illustrative examples to highlight our approach.

Search Criteria

Before searching for anything in our system, you must first define a flexible search criteria. The criteria decides which of all the fields available within a given entity can be searched. This schema provides the system with information about:

  • Entity Type: Defines which entity type’s fields will be used for the search
  • Fields for Searching: Defines the combination of fields to be used for the search
  • Field Requisite Information: Describes whether a field is mandatory for the search (INDEXED) or optional for narrowing results (FILTERABLE)
  • Indexing Strategy: Specifies how data should be indexed, described below
  • Archival Policy: If there’s a need for data archival, this policy is defined

In this criteria, certain fields are necessary for every search, while others can be used to fine-tune and narrow down results based on user preferences. This approach ensures that searches provide detailed entity information as per the user’s request. Each search criteria is independent, and its searchable data is agnostic of other criteria, with a focus on supporting multi-tenancy.
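
To make this concrete, here is a minimal sketch of a criteria definition for the airline example from use case #1, expressed as a Python dictionary. Every key name and value below is an illustrative assumption, not our actual production schema:

# Hypothetical search criteria for the airline tenant; all names here
# are assumptions for illustration, not Capillary's real schema.
airline_booking_criteria = {
    "tenantId": 1001,                    # which tenant owns this criteria
    "entityType": "TRANSACTION",         # entity whose fields are searched
    "indexingStrategy": "COMBINATION",   # see the strategies below
    "fields": [
        {"name": "pnr",         "type": "STRING", "requisite": "INDEXED"},
        {"name": "origin",      "type": "STRING", "requisite": "FILTERABLE"},
        {"name": "destination", "type": "STRING", "requisite": "FILTERABLE"},
    ],
    "archivalPolicy": {"ttlDays": 365},  # archive searchable data after a year
}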

Search Indexing Strategies

Search indexing strategies determine how we index searchable data. They ensure consistency in API signatures and data structures across various strategies, making it easier to expand the system for different use cases. Initially, we implemented two separate search strategies.

COMBINATION: This strategy aims to support searches across different combinations of fields, providing a flexible solution to meet the need for use case #1. Taking the example from before, the airline tenant could use a search criteria for bookings using origin, destination, and PNR. The same tenant can create another search criteria for searching bookings with the customer’s first name and last name as well. In parallel, a pharmaceutical company could search for products using product codes and brand names. All of these capabilities can be achieved by indexing the relevant fields and relevant data while keeping the database schemas and API signatures consistent.

PREFIX: We created this strategy to improve the customer identifier searches in our existing system discussed above. It mainly centers on finding information using customer names. Searches with this approach are simple and straightforward: just enter the first few letters of the field value, and the system quickly finds the relevant entity details.
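
To illustrate the difference between the two strategies, here is a minimal sketch of both as MongoDB queries; the connection string, collection, and field names are assumptions for illustration only:

import re
from pymongo import MongoClient

# Hypothetical collection holding searchable documents for all criteria.
coll = MongoClient("mongodb://localhost:27017")["search"]["search_data"]

def combination_search(tenant_id: int, criteria_id: int, fields: dict) -> list[dict]:
    # COMBINATION: exact lookup on a combination of indexed fields,
    # e.g. fields={"field1": "AB1C2D", "field2": "DEL"} for PNR + origin.
    return list(coll.find({"tenantId": tenant_id,
                           "criteriaId": criteria_id, **fields}))

def prefix_search(tenant_id: int, criteria_id: int, prefix: str) -> list[dict]:
    # PREFIX: an anchored regex ("^ram") stays index-friendly in MongoDB,
    # so the first few letters of a name narrow results quickly.
    return list(coll.find({
        "tenantId": tenant_id,
        "criteriaId": criteria_id,
        "field1": {"$regex": "^" + re.escape(prefix)},
    }))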

Architecture

New workflow for searches

With the new workflow, we first conduct searches on the indexed storage to fetch entity identifier information. Then, we make another query to the core platform database to gather additional entity details, using the entity identifier information obtained from the search platform. Because all queries to the core platform are now based on ID lookups, this results in quick and cost-effective requests to the core platform. This clear separation of concerns ensures that neither flow experiences performance interference from the other, promoting overall system efficiency.
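
In code, the two-step flow might look like the sketch below, assuming a MongoDB search collection and an ID-lookup helper for the core platform; all names are illustrative:

from pymongo import MongoClient

search_coll = MongoClient("mongodb://localhost:27017")["search"]["search_data"]

def fetch_core_entities(tenant_id: int, entity_ids: list[int]) -> list[dict]:
    # Placeholder for the cheap primary-key lookup on the core platform,
    # e.g. SELECT ... WHERE id IN (...).
    raise NotImplementedError

def search_entities(tenant_id: int, criteria_id: int, query: dict) -> list[dict]:
    # Step 1: resolve the query on the indexed storage; only lightweight
    # entity identifiers come back, never full entity details.
    hits = search_coll.find(
        {"tenantId": tenant_id, "criteriaId": criteria_id, **query},
        {"entityId": 1, "_id": 0},
    )
    entity_ids = [hit["entityId"] for hit in hits]

    # Step 2: hydrate the results with ID lookups on the core platform.
    return fetch_core_entities(tenant_id, entity_ids)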

Search Data Management

For the mentioned use cases, we needed a database with compound indexing for exact searches on each parameter. So, in terms of capabilities, we looked for a document database with strong indexing features. MongoDB and ElasticSearch were two options within our existing tech stack, but since these use cases didn’t need fuzzy or stemmed searching (full-text search), we decided to use MongoDB. We made sure that our existing core platform databases were not affected in the process.

Our system’s ability to define specific criteria for each tenant sets us up for future adaptability. Since the data for each criteria doesn’t depend on the others, if we ever need text or fuzzy searching in the future, we can easily move the data for that specific criteria to a database with full-text support like ElasticSearch. This lets us exploit the different capabilities of different databases and smoothly handle changing use cases.

Let’s take an example to better understand the MongoDB schema definition with the new workflow:

Sample search criteria definition for airline tenant use case
Sample search criteria definition for hospitality tenant use case
Sample search data for airline tenant criteria
Sample search data for hospitality tenant criteria

Notice that “field1” contains two types of data: “search data 1” (PNR, string) and “search data 2” (mobile, number). Because MongoDB doesn’t require a fixed schema, we can store both data types in the same collection and query them differently for different needs. Additionally, the “expireAt” field helps with data archiving, using MongoDB’s TTL (Time-to-Live) index feature. MongoDB’s schema flexibility allows us to efficiently store data from multiple tenants and various search criteria in a single collection, all managed with a single compound index. This flexibility simplifies data organization and indexing, making it easier to handle diverse information sources in the system.
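
A rough reconstruction of this idea in code, with assumed document and index shapes rather than our exact production schema:

from datetime import datetime, timedelta
from pymongo import ASCENDING, MongoClient

coll = MongoClient("mongodb://localhost:27017")["search"]["search_data"]

# One compound index serves every tenant and criteria, and the TTL index
# on "expireAt" lets MongoDB archive documents automatically.
coll.create_index([("tenantId", ASCENDING), ("criteriaId", ASCENDING),
                   ("field1", ASCENDING), ("field2", ASCENDING)])
coll.create_index("expireAt", expireAfterSeconds=0)

coll.insert_many([
    # Airline criteria: "field1" holds a PNR (string), with archival.
    {"tenantId": 1001, "criteriaId": 1, "entityId": 9001,
     "field1": "AB1C2D", "field2": "DEL",
     "expireAt": datetime.utcnow() + timedelta(days=365)},
    # Hospitality criteria: "field1" holds a mobile number (number);
    # no "expireAt" field, so this document never expires.
    {"tenantId": 2002, "criteriaId": 7, "entityId": 9002,
     "field1": 919800000000, "field2": "2023-10-01"},
])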

Data Inflow

We’ve adopted a multifaceted approach to ingesting searchable data into our search databases, comprising three primary pipelines:

Real time events and bulk inflow pipeline

Real-Time Events Pipeline: In this pipeline, we leverage Capillary’s existing event notification system. It ensures that data becomes searchable almost instantly upon entering the Capillary ecosystem. Data ingestion begins as soon as the search criteria definition is established, keeping searchability closely aligned with events in Capillary’s system in real time.

Bulk Flow Pipeline: Our Bulk Flow Pipeline is designed for situations where you want to make historical data searchable, especially if a tenant has been using Capillary’s system for a while but only wants to enable search capabilities later. This allows older data to be available for searches, which the real-time pipeline doesn’t handle. The Bulk Flow Pipeline achieves this by ingesting bulk data through Capillary’s OLAP (Online Analytical Processing) system. It collects data starting from the earliest defined point in time up to when the search criteria definition is created. By covering this entire time frame, it complements the real-time events pipeline, ensuring complete access to historical data for search purposes for any specific tenant.

Data reconciliation pipeline

Reconciliation Pipeline: In cases involving substantial bulk imports into Capillary’s system where real-time events aren’t automatically triggered, we’ve implemented a Reconciliation Pipeline. This pipeline conducts a detailed comparison between the core platform data and the search system data. Any pending data is seamlessly integrated into the search system. This meticulous approach ensures no data is overlooked or missed during searches.
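
A simplified sketch of that comparison, assuming both sides can enumerate entity identifiers for a tenant; every helper name here is hypothetical:

from pymongo import MongoClient

search_coll = MongoClient("mongodb://localhost:27017")["search"]["search_data"]

def list_core_entity_ids(tenant_id: int) -> set[int]:
    # Placeholder: enumerate entity IDs from the core platform.
    raise NotImplementedError

def ingest_into_search(tenant_id: int, criteria_id: int, entity_id: int) -> None:
    # Placeholder: fetch the entity from the core platform and index it.
    raise NotImplementedError

def reconcile(tenant_id: int, criteria_id: int) -> None:
    core_ids = list_core_entity_ids(tenant_id)
    indexed_ids = {doc["entityId"] for doc in search_coll.find(
        {"tenantId": tenant_id, "criteriaId": criteria_id},
        {"entityId": 1, "_id": 0})}
    # Anything present in the core platform but missing from the search
    # system is the "pending data" that gets integrated here.
    for entity_id in core_ids - indexed_ids:
        ingest_into_search(tenant_id, criteria_id, entity_id)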

Data Sharding

Criteria based sharding of searchable data

Our system’s flexible framework lets us define specific rules for each tenant, strategy, and search category. This allows us to efficiently distribute data across different database types, or across a single database running on multiple machines. By ensuring that each criteria’s dataset is agnostic of the others, we can distribute them carefully across various data instances, enabling sharding at the tenant or criteria level. Essentially, spreading data across different places and types of databases ensures that our search system can handle a growing amount of data without slowing down.
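
One way to realize this with MongoDB’s native sharding is a shard key on the tenant and criteria fields, so each criteria’s documents are distributed independently; the database and collection names below are assumptions:

from pymongo import MongoClient

# Run against a mongos router in a sharded cluster.
admin = MongoClient("mongodb://localhost:27017")["admin"]

# Shard the searchable data on (tenantId, criteriaId): since each
# criteria's data is independent, its documents can be placed on shards
# without queries for one criteria ever touching another's data.
admin.command("enableSharding", "search")
admin.command("shardCollection", "search.search_data",
              key={"tenantId": 1, "criteriaId": 1})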

Conclusion

Our project taught us valuable lessons about organizing searchable data. In our old system, we struggled with extensions and with managing growing data because it wasn’t structured for new use cases. Each extension required new indexes or changes to the entire table schema, incurring the additional cost of indexing data for tenants who didn’t require those indexes. With the new approach, we put a strong emphasis on search data modeling, which allowed us to index only the data needed for searching.

We learned how much multi-tenant systems need configurable search criteria. When search must stay flexible, the number of search criteria combinations keeps growing, so the traditional approach of a single data store with a fixed set of indexed fields does not scale well. Since indexes should mimic the search criteria, configurable criteria demand equally configurable indexes; a multi-tenant search index therefore needs search data flexible enough to support ever more use cases.

We also saw the benefit of keeping each criteria’s searchable data separate and agnostic, regardless of the tenant or entity. This sets us up well for future improvements like data sharding and making use of different types of databases for different sets of use cases.

Additional data pipelines and reconciliation also made our system more complete. With these flows we were able to include the data missed in older workflows.

Capillary’s new search platform enhances search capabilities for multiple tenants and entities. This robust platform seamlessly integrates search into various product features, offering new possibilities. By refining and enhancing it further, we’ll provide even more value for various use cases.

We hope this post provides valuable insights into our platform’s evolution. We’re dedicated to keeping you informed about our ongoing work, so stay tuned for updates.

Acknowledgements

Special thanks to Pardeep Singh and Prakhar Verma for designing, developing, and contributing to different parts of this project.
