Creating an Abstraction Layer Over Multiple Data Sources

Lev Eidelman Nagar
Zencity Engineering
Jun 16, 2021

At Zencity, the most basic unit of data is called a data atom. An atom can represent a social media post (such as a tweet), an email, or a news article. We aggregate, analyze, and process these atoms in various ways to derive useful insights for our local government clients.

Needless to say, this amounts to vast quantities of data, which poses several interesting challenges for us as developers: we want our clients to have a fast, smooth user experience, our developers to have solid infrastructure to build upon, and our costs to stay down.

The solution we came up with required us to use both MongoDB and Elasticsearch and to use the right data source in each scenario to squeeze out the best possible performance for each query.

In this blog, we’ll discuss how we created a layer of abstraction over our various data sources, the challenges we faced, and the solutions we came up with.

It’s the Circle of Life

Describing the way we gather our data atoms goes well beyond the scope of this article, but once a data atom is harvested and processed we store it in MongoDB. From there, the data is propagated to Elasticsearch through a process we will discuss in a future post.

Prior to adopting Elasticsearch, we used Azure Search, which we initially chose for its language analysis capabilities. As we kept using it, several downsides became apparent: an incomplete feature set, a lack of fine-grained control, and high cost. All of these pushed us to migrate to Elasticsearch, the industry standard; Azure Search is no longer in active use, but we keep it as a fallback.

Querying for data atoms can be done in either of two ways:

  • Query MongoDB directly
  • Send the request to a Search Microservice which will query Elasticsearch
A high-level overview of our architecture

Decisions, Decisions…

Elasticsearch shines when it comes to textual search. For instance, a user may search for atoms containing “covid-19” and Elasticsearch will produce fast results (inverted indexes anyone?). Elasticsearch is also capable of finding all the variants of a word (e.g. dog/dogs).

MongoDB, on the other hand, while capable of textual search, is unable to compete in speed with Elasticsearch and lacks some features like tokenizers and analyzers that come with a full-fledged search engine.
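To make the contrast concrete, here is a small sketch of the same “covid-19” lookup against both stores, using the Elasticsearch JavaScript client (v8-style API) and the official MongoDB driver. The index, collection, and field names here are assumptions for illustration, not our actual schema.

```typescript
import { Client } from "@elastic/elasticsearch";
import { MongoClient } from "mongodb";

// Full-text match: analyzed, tokenized, and served from an inverted index,
// so results come back fast and word variants can be matched.
async function searchAtomsInElasticsearch(es: Client) {
  const result = await es.search({
    index: "atoms",
    query: { match: { text: "covid-19" } },
  });
  return result.hits.hits;
}

// MongoDB $text search works, but it relies on a text index with far fewer
// analysis options (no custom tokenizers or analyzers) and is generally slower.
async function searchAtomsInMongo(mongo: MongoClient) {
  return mongo
    .db("zencity")
    .collection("atoms")
    .find({ $text: { $search: "covid-19" } })
    .toArray();
}
```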

In our system, there are some grey areas and edge cases where the choice isn’t obvious, and fine-tuning the decision process is currently our primary goal.

Taking Control

A common way of hiding multiple implementations behind the same interface is the Inversion of Control (IoC) principle: object creation is decoupled from object usage, with the concrete implementation injected from the outside. This allows us to build a common interface that all data sources adhere to, while abstracting away the intricacies of each data source.

Here is roughly what this looks like in our code:
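The TypeScript sketch below illustrates the idea only; the names (DataAtomProvider, MongoAtomProvider, SearchServiceAtomProvider) and signatures are illustrative stand-ins rather than our actual implementation.

```typescript
// Consumers depend only on the DataAtomProvider interface; the concrete
// provider (MongoDB or the search microservice) is chosen at wiring time.
import type { Collection, Filter } from "mongodb";

interface DataAtom {
  id: string;
  text: string;
  source: string;
}

// The MQL-style filter object doubles as our common query language (see below).
interface AtomQuery {
  filter: Record<string, unknown>;
  limit?: number;
}

interface DataAtomProvider {
  findAtoms(query: AtomQuery): Promise<DataAtom[]>;
}

// Runs the MQL filter directly against MongoDB.
class MongoAtomProvider implements DataAtomProvider {
  constructor(private readonly atoms: Collection<DataAtom>) {}

  async findAtoms({ filter, limit = 50 }: AtomQuery): Promise<DataAtom[]> {
    return this.atoms.find(filter as Filter<DataAtom>).limit(limit).toArray();
  }
}

// Forwards the query to the search microservice, which queries Elasticsearch.
class SearchServiceAtomProvider implements DataAtomProvider {
  constructor(private readonly baseUrl: string) {}

  async findAtoms(query: AtomQuery): Promise<DataAtom[]> {
    const res = await fetch(`${this.baseUrl}/atoms/search`, {
      method: "POST",
      headers: { "content-type": "application/json" },
      body: JSON.stringify(query),
    });
    return (await res.json()) as DataAtom[];
  }
}

// Wiring happens in one place; callers never know which data source they got.
function createAtomProvider(deps: {
  atomsCollection: Collection<DataAtom>;
  searchServiceUrl: string;
  useSearch: boolean;
}): DataAtomProvider {
  return deps.useSearch
    ? new SearchServiceAtomProvider(deps.searchServiceUrl)
    : new MongoAtomProvider(deps.atomsCollection);
}
```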

Lingua franca

Another key to being able to hide multiple implementations behind a unified interface is having a common query language. Since we rely heavily on MongoDB we decided to utilize the MongoDB Query Language (MQL) as our baseline query syntax and to translate the queries for data sources other than MongoDB.
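As a toy illustration of the translation step, the sketch below maps a tiny subset of MQL (equality and $in) onto an Elasticsearch bool/filter query. The field names are made up, and the real translator covers far more operators.

```typescript
type MqlFilter = Record<string, unknown>;

interface EsBoolQuery {
  bool: { filter: object[] };
}

function mqlToElasticsearch(filter: MqlFilter): EsBoolQuery {
  const clauses: object[] = [];

  for (const [field, condition] of Object.entries(filter)) {
    if (typeof condition === "object" && condition !== null && "$in" in condition) {
      // { sourceType: { $in: ["tweet", "email"] } }  ->  terms query
      clauses.push({ terms: { [field]: (condition as { $in: unknown[] }).$in } });
    } else {
      // { clientId: "abc123" }  ->  term query on the exact value
      clauses.push({ term: { [field]: condition } });
    }
  }

  return { bool: { filter: clauses } };
}

// The same MQL filter can be passed to MongoDB as-is, or translated for Elasticsearch:
const mql = { clientId: "abc123", sourceType: { $in: ["tweet", "news"] } };
console.log(JSON.stringify(mqlToElasticsearch(mql), null, 2));
```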

Future-Proofing

This architecture also allows us to easily add additional data sources as the system continues to grow and our needs evolve.

Adding an additional data source boils down to three steps (see the sketch after this list):

  1. Adding a provider which implements the common interface
  2. Implementing a translation mechanism between MQL and the desired source’s query syntax
  3. Creating an additional microservice that will talk to the new data source
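For instance, plugging in a hypothetical column store could look something like the sketch below, reusing the DataAtomProvider, AtomQuery, and DataAtom types from the earlier sketch; the client and translator here are stand-ins, not real components.

```typescript
// Step 2: translate MQL into the new source's query syntax (heavily stubbed here).
function mqlToColumnStore(filter: Record<string, unknown>): string {
  const clauses = Object.entries(filter).map(
    ([field, value]) => `${field} = '${String(value)}'`,
  );
  return `SELECT * FROM atoms WHERE ${clauses.join(" AND ")}`;
}

interface ColumnStoreClient {
  query(sql: string): Promise<DataAtom[]>;
}

// Step 1: a provider that implements the common interface.
class ColumnStoreAtomProvider implements DataAtomProvider {
  constructor(private readonly client: ColumnStoreClient) {}

  async findAtoms(query: AtomQuery): Promise<DataAtom[]> {
    // Step 3: talk to the new data source (in practice, via its own microservice).
    return this.client.query(mqlToColumnStore(query.filter));
  }
}
```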

Bringing It All Back Home

By having a single entry point into the data layer, we were able to improve our logging and monitoring. This in turn allowed us to find slow queries and migrate them to the more appropriate data source.
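One way to picture this (a sketch, not our actual code) is a thin decorator over the provider interface from above, so every query is timed and logged in one place.

```typescript
// Wraps any DataAtomProvider and records how long each query took,
// which is what surfaces the slow queries worth migrating.
class InstrumentedAtomProvider implements DataAtomProvider {
  constructor(
    private readonly inner: DataAtomProvider,
    private readonly providerName: string,
  ) {}

  async findAtoms(query: AtomQuery): Promise<DataAtom[]> {
    const start = Date.now();
    try {
      return await this.inner.findAtoms(query);
    } finally {
      console.log(
        `[atom-query] provider=${this.providerName} durationMs=${Date.now() - start}`,
      );
    }
  }
}
```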

Here is some data showing the stark difference in average query execution time between the two databases for the past two months:

  Provider        Avg. Query Time
  --------------  ---------------
  MongoDB         20.56s
  Elasticsearch   10.75s

Lessons Learned

In the process of optimizing and refining our data layer, we learned several lessons that we find valuable beyond their immediate context:

Use the right tool for the job

Although the difference in performance may seem negligible on paper, its effect on user experience is clearly felt by our clients.

Here are a couple of examples:

  1. Using Elasticsearch enabled us to build a search feature in our platform that scans many thousands of data atoms and responds, on average, within a second, often faster. For some of our bigger clients, this is a task MongoDB simply could not handle in a reasonable time.
  2. When dissecting data atoms according to textual criteria to generate client reports, Elasticsearch allowed us to retrieve relatively large, accurate result sets in mere seconds.

Abstract when necessary

We are often wary of premature abstractions and the added complexity they entail. In this case, however, the architecture described above paid off: it allowed us to write scalable, testable code, and it abstracts this logic away for developers on other teams, who can simply use the Data Atom Provider without having to concern themselves with what’s going on behind the scenes.
