Design a Scalable Full-Text Search using Azure Search & MongoDB

Implementing full-text search in a web application, using MongoDB as the document store and Azure Search as the full-text search index.

Maxim Vorobjov
Comae Technologies
5 min read · Apr 10, 2017


Recently at Comae, we have been adding full-text search to our service. The requirement was to build a search that can scale to millions of documents while providing stable performance and low response times.

The main prerequisites for the search are:

  • It should fit nicely with per-user document access-control permissions
  • It must not increase the time for saving documents.

In our application, we operate on large logical documents of 30–60MB, which describe computer memory snapshots. Each document exceeds MongoDB's 16MB BSON limit, so we split them into individual entities and store those in multiple typed collections (e.g. processes, services, etc.). We use MongoDB as document storage, and as long as we control the number of indexes, it works perfectly. As long as the indexes fit into RAM, MongoDB can hold millions of documents.

Our full-text search, on the other hand, has no hard requirements on document-saving performance, but it should provide predictable search times across 0x1FF (511) indexed fields, and it definitely should not be constrained by RAM.

Options

Pros & Cons

Initially, as open-source software advocates, we reviewed only two options: the MongoDB full-text search index and moving the search index to the Elasticsearch engine. But since we are hosted on Azure, we also added Azure Search to the comparison. Below is the table we filled in to provide a fair review.

We were quite surprised that Azure Search turned out to be a much cheaper option. Considering the SLA guaranteed by Microsoft Azure, the simple scalability strategy, and the absence of IT maintenance costs, we decided to go this way.

[Comparison table image: MongoDB full-text search vs. Elasticsearch vs. Azure Search; see the footnotes below]

*) Though there is elasticsearch-river-mongodb, the project is outdated: the latest MongoDB version it supports is 3.0.0, while we use 3.4.x.

**) We evaluated Elasticsearch and MongoDB based on the cost of 3x 28GB D12 instances (we already run MongoDB, so its cost is not included), and Azure Search on the Standard plan, which provides 25GB per document partition with 12 partitions. The cost of Elasticsearch IT maintenance is not included, so its real total would be higher.

Solution Overview

Our architecture is microservices-based, with communication layers built on a REST API and a Service Bus message queue (MQ). Since there is no automatic Azure Search indexer for MongoDB, our integration consists of two parts that fit into this architecture:

  1. On every new document created, we post the relevant activity to the MQ `fanout` exchange. The search service listens on its own dedicated queue bound to this exchange, reads the document, and replicates it to the index (sketched after the note below).
  2. A search request comes in via the REST API; we run the search against the Azure Search index, which returns a list of document ids with search hints; that list is then filtered through the Access Control List (ACL) service, and the final list of ids is hydrated with document contents from MongoDB.

Important Note: Since the list of ids is filtered through the ACL, it is important that the list is not sparse; therefore, we employ additional filters limiting the search to those areas where the user can potentially have documents. Also, we never ask the search to return just 10 documents; we fetch them 100 at a time, since we don't know in advance how many will be filtered out by the ACL.
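The first part of the integration can be sketched as follows. This is a minimal sketch assuming an AMQP broker (the `fanout` exchange suggests RabbitMQ-style semantics) and the amqplib client; the post does not name the exact client library, and the exchange and queue names here are illustrative:

import amqp from 'amqplib';

// Producer side: announce every newly created document.
async function publishDocumentCreated(documentId) {
  const connection = await amqp.connect(process.env.AMQP_URL);
  const channel = await connection.createChannel();
  // Fanout exchange: every bound queue receives a copy of the event
  await channel.assertExchange('documents', 'fanout', { durable: true });
  channel.publish('documents', '', Buffer.from(JSON.stringify({ documentId })));
  await channel.close();
  await connection.close();
}

// Consumer side: the search service replicates documents into the index.
async function consumeDocumentCreated(onDocumentCreated) {
  const connection = await amqp.connect(process.env.AMQP_URL);
  const channel = await connection.createChannel();
  await channel.assertExchange('documents', 'fanout', { durable: true });
  // The search service gets its own dedicated queue bound to the exchange
  const { queue } = await channel.assertQueue('search-indexer');
  await channel.bindQueue(queue, 'documents', '');
  channel.consume(queue, async (msg) => {
    const { documentId } = JSON.parse(msg.content.toString());
    await onDocumentCreated(documentId); // read from MongoDB, push to the index
    channel.ack(msg);
  });
}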

Project Configuration

We build our backend services with Node.js/Koa, actively using ES6 import/export, classes, and async/await; we also employ the Flow type-checking engine. With Node 7.7, the following .babelrc is enough:

{
  "plugins": [
    "transform-flow-strip-types",
    "transform-es2015-modules-commonjs",
    "transform-class-properties",
    "transform-object-rest-spread"
  ]
}

The only thing left is to add the SDK:

yarn add azure-search

Note: The type descriptions in the code below are nothing but Flow type annotations; they are stripped by transform-flow-strip-types at compile time and can just as well be removed manually.

azure-search module integration

Azure Search has a quite good and easy-to-use Node.js SDK, azure-search. Unfortunately, it is still callback-based, but we can quickly wrap it into promises:

// ./ComaeAzureSearch.js
import AzureSearch from 'azure-search';
import config from './config';

export default class ComaeAzureSearch implements SearchProviderI {
  client = AzureSearch(config.search);

  async addDocuments(index: string, doc: Object) {
    return new Promise((resolve, reject) =>
      this.client.addDocuments(index, [doc], (err, results) => {
        if (err)
          return reject(err);
        resolve(results);
      })
    );
  }

  // Full-text search returns the ids of the found documents in collections
  async searchIndex(index: string, search: string, $filter: string = '', $skip: number = 0, limit: number = 100) {
    return new Promise((resolve, reject) =>
      this.client.search(index, { search, $top: limit, $skip, $filter }, (err, results) => {
        if (err)
          return reject(err);
        resolve(results);
      })
    );
  }
}

// ./config.js
export default {
  search: {
    url: process.env.AZURE_SEARCH_URL,
    key: process.env.AZURE_SEARCH_KEY
  }
};
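With the wrapper in place, indexing and querying reduce to awaiting the promisified calls. A short usage sketch; the index name and document values here are hypothetical:

import ComaeAzureSearch from './ComaeAzureSearch';

const azure = new ComaeAzureSearch();
// Index a flattened document, then query it back
await azure.addDocuments('processes', { _id: '58eb2b0a9d1c4e0010a1b2c3', data_description: 'chrome.exe' });
const hits = await azure.searchIndex('processes', 'chrome');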

Adding documents to the index

Documents get appended to the index in the two following scenarios:

  • While storing the documents to MongoDB
  • When receiving a document-created event from the Service Bus

In our case, saving a document to the index is quite a straightforward process: it is a POST request to the Azure Search indexing REST API. The call does not wait for the actual index to be populated and returns quickly.

Note: Indexes in Azure Search are flat, while MongoDB documents have nested fields. Therefore, MongoDB documents have to be converted into indexable documents; this is easily achieved by maintaining a field mapping and applying it with the `lodash/get` module:

// ./index-utils.js
import get from 'lodash/get';

const fields = [
  { path: '_id', name: '_id' },
  { path: 'data.description', name: 'data_description' }
];

export function toIndexable(input) {
  return fields.reduce((res, field) =>
    ({ ...res, [field.name]: get(input, field.path) }), {});
}
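For example, a nested entity flattens like this (hypothetical document values):

const entity = { _id: '58eb2b0a9d1c4e0010a1b2c3', data: { description: 'chrome.exe memory snapshot' } };
toIndexable(entity);
// => { _id: '58eb2b0a9d1c4e0010a1b2c3', data_description: 'chrome.exe memory snapshot' }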

Given the _id of the document to index, storing it can be as simple as the following two lines of code:

const document = await collection.find({ _id: objectId }).next();
await indexer.addDocuments('index', toIndexable(document));

Searching for documents

Search in our case consists of three stages:

  1. Run the query against Azure Search (optionally filtered)
  2. Filter the results through the ACL (mandatory for us, optional for other projects)
  3. Hydrate the documents by ids

// ./search.js
import keyBy from 'lodash.keyby';
import { ObjectId } from 'mongodb';
import createDebug from 'debug';

const debug = createDebug('comae:search');

// `azure`, `acl` and `collection` come from the surrounding service context.
// In the Koa handler, userId is taken from ctx.state.user.sub.
async function search(userId: string, query: string, skip: number = 0, limit: number = 100) {
  try {
    // Query filters based on userId should go as the 3rd parameter
    const results = await azure.searchIndex('index', query, '', skip, limit);
    if (!results || !results.length)
      return [];
    const resources = results.map(({ id }) => id);
    // Keep only the resources the user is allowed to access
    const accessible = await acl.filter({ userId, resources });
    const ids = accessible.map((id) => ObjectId(id));
    // Hydrate the documents from MongoDB, preserving the search ranking order
    const byId = keyBy(
      await collection.find({ _id: { $in: ids } }).toArray(),
      doc => String(doc._id)
    );
    return ids.map(id => byId[String(id)]);
  } catch (err) {
    debug('Error in fulltextSearch %o', err);
    throw err;
  }
}

That's it: the search is now bound to the REST API and returns hydrated documents in an array. In our tests, it takes less than a second to get an answer to a search request, and since scalability is built in at every level, response times should not grow significantly with increasing volume.
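For reference, here is a minimal sketch of how the function above could be bound to a Koa route; the koa-router usage, route path, and query parameter names are illustrative assumptions:

import Router from 'koa-router';

const router = new Router();

router.get('/search', async (ctx) => {
  // The authenticated user id comes from the auth middleware
  const userId = ctx.state.user.sub;
  const { q, skip = 0, limit = 100 } = ctx.query;
  ctx.body = await search(userId, q, Number(skip), Number(limit));
});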

Note: `acl.filter` is our in-house development, providing very fast checks over huge ACLs.

Index Migration

The azure-search SDK provides the ability to check, create, and delete indexes, which we use at the service initialization stage. After a new index is created, it has to be populated with all the documents. Fortunately, there is a bulk update operation, and the indexing process is non-blocking, so the main cost is reading the documents from MongoDB and transferring them over the network.
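A minimal sketch of that initialization step, assuming the azure-search SDK's getIndex/createIndex calls; the index name and field schema are illustrative, mirroring the field mapping above:

// Create the index on service start if it does not exist yet
function ensureIndex(client) {
  return new Promise((resolve, reject) =>
    client.getIndex('index', (err, existing) => {
      if (existing)
        return resolve(existing);
      client.createIndex({
        name: 'index',
        fields: [
          { name: '_id', type: 'Edm.String', key: true },
          { name: 'data_description', type: 'Edm.String', searchable: true }
        ]
      }, (createErr, created) => createErr ? reject(createErr) : resolve(created));
    })
  );
}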

How to Make It Even Easier

Azure Search provides prebuilt indexers for its own hosted database services, and recently Microsoft announced MongoDB protocol support for DocumentDB. This dramatically simplifies the integration with Azure by removing this service's dependency on Service Bus.

We hope this blog post helps you integrate search into your applications, or at least gives you this design architecture to consider.
