The Good, the Bad and the MongoDB

Cristian Constantinescu
TrustYou Engineering
12 min read · Jul 28, 2020

Introduction

Feedback is everything at TrustYou. Much as others claim that “data is the new gold”, we understood early on the potential of collecting feedback from the market and, more importantly, of listening to that feedback in order to make better decisions and, ultimately, succeed.

The TrustYou platform consists of a plethora of modules; in this article we focus on a use case covering the Meta-Review, Badges and Reviews widgets. The first is a precomputed aggregation of reviews, “a review containing all reviews” if you like; the second is awarded to hotels that excel in one or more aspects of their service; the last presents the reviews themselves along with a set of statistics about them.

These components serve data that is precalculated by data pipelines: the pipelines crunch the underlying data and store the results in MongoDB, from where applications serve the information via APIs in HTML or JSON format.
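
To give a rough idea of the serving side, a precomputed document boils down to a key lookup plus JSON rendering. The sketch below is only an illustration; the collection and field names are hypothetical, not the actual TrustYou schema.

```python
import json

from pymongo import MongoClient

# Hypothetical sketch: the pipelines have already written the aggregated
# document, so the serving layer only needs a key lookup and JSON rendering.
client = MongoClient("mongodb://mongo-host:27017/")  # hypothetical host
meta_reviews = client["analytics"]["meta_reviews"]   # hypothetical names


def get_meta_review(hotel_id: str) -> str:
    doc = meta_reviews.find_one({"_id": hotel_id}, {"_id": 0})
    return json.dumps(doc or {})
```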

This all sounds fine and well, but what’s the link with the title? Why “the good”? Why “the bad”? Why MongoDB? Well, as with every story, all is fine while things work within reasonable constraints; however, looking back just a few months, the situation for these components was quite far from acceptable.

The Problem(s)

Historical tickets show that in the beginning, when the data pipelines were first implemented, execution times were reported as acceptable; everything was sunshine and rainbows.

However, as time passed, some features were added while others were removed, some technologies became outdated and new business requirements appeared. The result is that the system collecting the data for Meta-Reviews ended up taking more than 5 days, the bottleneck being the import of the data into the MongoDB store, which accounted for more than half of the effective runtime.

With an execution time spanning most of the working week, releases were difficult to do, mostly happening on weekends with the kids running around and throwing toys at the laptop screen. Due to the long runtime, the pipeline frequently overlapped with other daily jobs competing for the same MongoDB resources, causing a cascading effect that spread the impact to other modules in the ecosystem. The result was random errors that the on-call engineer needed to investigate.

The monitoring of the MongoDB cluster consisted merely of a basic set of metrics exposed by Ganglia, mostly describing the status of the server hosting the engine (network, RAM, CPU, etc.). While useful, these metrics ignore what is going on inside the MongoDB engine itself, especially within the key features that have the most impact, features that will become more obvious as the investigation moves on.
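
For context, many of the engine-level signals we later relied on can be pulled straight from the serverStatus command. The sketch below (PyMongo, with a hypothetical host name) shows the kind of internal metrics a host-level view does not cover; the exact key names can differ between MongoDB versions.

```python
from pymongo import MongoClient

# Minimal sketch: sample a few engine-level metrics that host monitoring misses.
client = MongoClient("mongodb://mongo-host:27017/")  # hypothetical host
status = client.admin.command("serverStatus")

print(status["opcounters"])                   # insert/update/query counters
print(status["connections"])                  # current vs. available connections
print(status["globalLock"]["currentQueue"])   # operations queued on the lock
# WiredTiger cache pressure (key names may vary by version):
print(status["wiredTiger"]["cache"]["bytes currently in the cache"])
```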

Understanding the Effects of the Problem

In order to illustrate the problem, the following charts are proposed for a more in-depth study. They show the runtime patterns of the pipelines overlapping at various points in time during the week.

The cycle repeats weekly, since we have a weekly pipeline that computes Meta-Reviews and Badges and a daily pipeline that processes reviews. The former graph shows the load averaged over a period of one minute, while the latter shows the bytes coming in on the network.

Pattern 1 is the Reviews pipeline running daily in isolation, pattern 2 is Meta-Review together with Reviews, and pattern 3 is Badges together with Reviews. Production incidents are reported in cases 2 and 3, manifesting as excessive runtime and workers timing out after many hours and losing their progress, mostly in the Meta-Review pipeline. Interestingly, no CPU, RAM or disk issues are reported during the incidents. The network shows spikes, which so far cannot be linked to a root cause. The jobs running during the crashes are the MongoDB import jobs from the different pipelines. These jobs simply take formatted JSON documents and bulk upsert them into the database. During the upsert the pipelines appear to compete for the limited MongoDB resources, which is a first clue towards the root cause of the problem.
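
Conceptually the import jobs are not doing anything exotic; a minimal PyMongo sketch of the kind of bulk upsert they perform (host, collection and batch size are hypothetical) looks like this:

```python
from pymongo import MongoClient, UpdateOne

client = MongoClient("mongodb://mongo-host:27017/")  # hypothetical host
collection = client["analytics"]["meta_reviews"]     # hypothetical names


def bulk_upsert(documents, batch_size=500):
    """Upsert formatted JSON documents in batches, keyed by _id."""
    ops = [UpdateOne({"_id": d["_id"]}, {"$set": d}, upsert=True) for d in documents]
    for i in range(0, len(ops), batch_size):
        collection.bulk_write(ops[i:i + batch_size], ordered=False)
```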

The main focus is the Meta-Review pipeline, identified as “Pattern 2”: a system required to process data at the scale of terabytes, with an average document size of around 60 KB, while waiting for write confirmation from all members of a geographically distributed replica set. Unfortunately, these constraints cannot be relaxed, since the data must be available globally and consistent at all times. Moreover, the entire data set is constantly refreshed, because the Meta-Review is an aggregation of reviews where even the slightest change in the underlying data can impact the overall statistics.
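
Waiting for acknowledgement from every replica set member is expressed through the write concern. A hedged PyMongo sketch of what that looks like follows; the five-member count and the names are made up for illustration.

```python
from pymongo import MongoClient
from pymongo.write_concern import WriteConcern

client = MongoClient("mongodb://mongo-host:27017/")  # hypothetical host

# Require acknowledgement from all members of a (hypothetical) 5-node replica
# set, with a timeout so a lagging member does not block the import forever.
strict = client["analytics"].get_collection(
    "meta_reviews",
    write_concern=WriteConcern(w=5, wtimeout=60000),
)
strict.replace_one({"_id": "hotel-123"}, {"_id": "hotel-123", "score": 86}, upsert=True)
```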

Getting to a Stable Starting Point

According to what has been presented so far, only the effects of the problem are crystal clear, while the root cause remains hidden. Consequently, even before trying to solve the problem, a step is needed to buy time. A contingency plan is developed to reduce the number of production incidents and allow the team to focus on the fix: increase visibility by adding a MongoDB monitoring solution with dashboards that clearly present the metrics, stabilize the pipeline by refactoring jobs so that each takes at most one hour before writing its checkpoint to disk, and add fault tolerance to our internal PyMongo-based library so that it survives connection drops and retries on timeouts for idempotent operations such as upserts.
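
The retry logic is simple in principle. The following is a minimal sketch of the idea, not the actual internal library: retry only idempotent operations, and back off between attempts.

```python
import time

from pymongo import errors


def retry_idempotent(operation, max_attempts=5, base_delay=1.0):
    """Retry an idempotent MongoDB operation (e.g. an upsert) on transient errors."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except errors.ConnectionFailure:  # covers AutoReconnect and NetworkTimeout
            if attempt == max_attempts:
                raise
            time.sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff

# Usage: retry_idempotent(lambda: collection.bulk_write(ops, ordered=False))
```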

Fortunately, in the case of the Reviews widget the pipeline does not handle aggregations but the reviews themselves. This allows a first “low hanging fruit” performance improvement: changing the processing methodology from “full” to “differential”, that is, instead of processing the entire database every day, processing only the differential set of new, modified or deleted reviews.
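
Sketching the idea in PyMongo terms (the delta inputs and names below are hypothetical; in reality the batch jobs compute the delta upstream):

```python
from pymongo import DeleteOne, MongoClient, UpdateOne

client = MongoClient("mongodb://mongo-host:27017/")  # hypothetical host
reviews = client["analytics"]["reviews"]             # hypothetical names


def apply_daily_delta(changed_docs, deleted_ids):
    """Write only the reviews that are new, modified or deleted since the last run."""
    ops = [UpdateOne({"_id": d["_id"]}, {"$set": d}, upsert=True) for d in changed_docs]
    ops += [DeleteOne({"_id": _id}) for _id in deleted_ids]
    if ops:
        reviews.bulk_write(ops, ordered=False)
```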

With these stabilization efforts the team bought critical time that, instead of being spent on firefighting production incidents, could now be invested in the quest to find the root cause.

Finding the Root Cause

A runtime that, in most cases, spans the better part of the working week leaves very little room for failure when attempting to fix the problem directly in production. Corrective pull requests need surgical precision in order not to lose significant time in the long production feedback loop. Learning by coding, failing fast and applying swift corrective actions is clearly not applicable in this scenario.

A new plan is required, one where every step has proven efficiency, backed by numbers from studies showing that the change brings a tangible performance improvement to the system.

Since production did not allow the required level of experimentation, the focus turned towards the existing staging system, which replicates the live configuration to some degree.

Client Side Wars

The starting point of the investigation was the hypothesis that the client-side setup is affecting performance. To verify this claim, a set of tests was created with different MongoDB clients and different parameters. The contenders: mongoimport (the standard tool from MongoDB), tymongoimport (same interface as mongoimport, implemented with PyMongo) and the Spark MongoDB connector.

After running a first batch of tests, the following results were obtained, compared against production runtime statistics:

Legend: cluster means geographical distribution, importer is the client program, write concern is how many members of the replica set need to confirm the write, and executor and all executors are the average import rate per executor and across all executors, respectively.

The absolute values may vary depending on underlying conditions such as hardware, network and other factors. However, a relative comparison on the staging system between different configurations running on top of the same infrastructure revealed some interesting conclusions: the client-side technology does not seem to have a big impact, since the results show comparable speeds; in contrast, the same cannot be said for replication. Once the US servers are added to the cluster, the transfer rates drop significantly, which is to be expected for such loads and constraints.

The PyMongo-based tymongoimport implementation was faster than the standard mongoimport, with the Spark MongoDB connector being the fastest. Unfortunately, the Spark MongoDB connector has a hidden problem that revealed itself only when working with bigger datasets: it removes null fields. This means that if a JSON field is set to null, that field will be removed from the upsert. This caused issues, since downstream dependencies rely on the presence of these fields even when null. There is a fix for this in Spark 3, but the pipelines use Spark 2, where this behavior cannot be controlled at the time of writing.
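
To make the null-field issue concrete, here is a hedged illustration of the behaviour we depend on: with a plain PyMongo upsert the explicit null survives in the stored document, whereas the connector version we used silently dropped such fields on write. Collection and field names are made up.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://mongo-host:27017/")  # hypothetical host
collection = client["analytics"]["meta_reviews"]     # hypothetical names

doc = {"_id": "hotel-123", "score": 86, "previous_score": None}

# PyMongo stores the explicit null, so downstream consumers that check for the
# presence of "previous_score" keep working.
collection.replace_one({"_id": doc["_id"]}, doc, upsert=True)
print(collection.find_one({"_id": "hotel-123"}))
# {'_id': 'hotel-123', 'score': 86, 'previous_score': None}
```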

After the first round of experimentation comparing client-side technologies, the decision, based on the collected metrics, is to stick with tymongoimport. The other contenders fall short either by being slow or by having unwanted behaviors that cannot be suppressed.

Playing Hot and Cold on the Server Side

A short recap of what has been achieved so far: three patterns from three different components influencing each other and causing performance issues have been identified; the situation has been stabilized so that the performance issues, while still present, no longer create production incidents; improved monitoring can now tell us more about what is going on inside the database engine; and the problem is certainly not a client-side issue.

Now is the time to turn the investigation to the server side. With an “on-premise” installation, everybody was aware right from the beginning that this would be expensive, since nothing moves without blocking time from the Infrastructure team. While cloud platforms will most probably become less expensive in the foreseeable future and replace on-premise setups, empowering developers to take full responsibility for entire solutions, the reality today is that there are legacy systems created many years ago that still serve a large client base. In our particular case we are talking about an on-premise MongoDB 3.2 installation running in a geographically distributed cluster mode.

The production release windows are mostly on weekends, and every release requires most of the working week to check the effect of a given change. Under such conditions there is little room for trial and error; we need surgical precision. But how can somebody achieve this level of precision with such a complex system? From experience, the most reliable way to predict the behavior of a complex system with high precision, without necessarily being an expert in it, is to replicate its configuration in a test environment that allows fast experimentation. Therefore the staging system is extended with replica set members in the US in addition to the existing servers in DE.
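
One cheap sanity check against the extended staging cluster is to confirm that the replica set really contains the transatlantic members and that they are healthy, using the standard replSetGetStatus command. The host names in the sketch are hypothetical.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://staging-mongo-de-1:27017/")  # hypothetical host

status = client.admin.command("replSetGetStatus")
for member in status["members"]:
    print(member["name"], member["stateStr"])
# Expected something like:
#   staging-mongo-de-1:27017 PRIMARY
#   staging-mongo-de-2:27017 SECONDARY
#   staging-mongo-us-1:27017 SECONDARY
```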

Great! Now there is a way to do fast prototyping; the only open question is which direction to take to converge as quickly as possible on results that justify the investment made so far. The team was aware of the need to move to newer MongoDB versions, the first step being a mandatory upgrade to 3.4, since upgrades are incremental. However, the missing piece of the puzzle was proof that this would work and already bring a significant performance improvement.

Luckily our company uses a well-organized ticketing system, from which some interesting information surfaced during the investigation about when exactly things turned really bad for runtime performance. The records revealed that at some point the average document size was 8 KB, whereas at the time of the investigation it was 60 KB. Why the difference? Apparently, in the beginning there was application-level compression in the form of encoding the JSON document body to BASE64. With newer storage engines this encoding is no longer necessary to save storage space, so the decision was made to remove it in favor of the built-in compression of WiredTiger. Shortly after that the execution time exploded, since the data on the wire was no longer compressed.
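
The exact encoding scheme is not important for the story, but as a hypothetical sketch of what such application-level compression looks like (assuming the JSON body was deflated and then base64-encoded, which would explain the 8 KB vs. 60 KB gap):

```python
import base64
import json
import zlib


def pack(review_body: dict) -> str:
    """Hypothetical application-level compression: JSON -> zlib -> base64 string."""
    raw = json.dumps(review_body).encode("utf-8")
    return base64.b64encode(zlib.compress(raw)).decode("ascii")


def unpack(blob: str) -> dict:
    return json.loads(zlib.decompress(base64.b64decode(blob)))

# Removing this layer let WiredTiger compress the data at rest instead,
# but left the documents uncompressed on the network.
```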

The Test of Faith

At the time the MongoDB upgrade from 3.2 to 3.4 was started, there was no clear proof that it would work; the simple boy scout rule was applied: migrate to newer technology and leave the place in a better condition than it was found.

It was during the study of the migration changelog for MongoDB 3.4 that the “Eureka” moment kicked in, when reading the following line:

Added message compression support for internal communication between members of a replica set or a sharded cluster as well as communication between mongo shell and mongod or mongos.

The link with the BASE64 story was immediate: this is it, the snappy intra-cluster compression is what we need to regain the performance that existed before the removal of the application-level compression.
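
Intra-cluster compression itself is a mongod startup setting (the networkMessageCompressors option introduced in 3.4), but the same idea can later be applied on the driver side as well. Below is a hedged sketch of the PyMongo counterpart; it assumes PyMongo 3.7+, the python-snappy package and a server version that supports compression with clients, so treat it as an illustration rather than what we ran at this point.

```python
from pymongo import MongoClient

# Ask the driver to negotiate snappy compression for its own traffic with the
# server; the server only accepts it if compression is enabled on its side.
client = MongoClient(
    "mongodb://mongo-host:27017/",  # hypothetical host
    compressors="snappy",
)
```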

Making it Faster!

Hell yeah, it was about time! Finally progress, finally a ray of confidence that the solution is really close. The application code base is upgraded to support both 3.2 and 3.4; we are ready to start!

The staging measurements with the new configuration reveal the following statistics:

That’s it! The proof that we can expect a considerable improvement in production. Obviously the magnitude of the actual performance increase in the live system will differ because of external factors not present in staging (for example, load), but the metric brings confidence that the upgrade and the configuration change will have a positive impact on the pipeline.

The upgrade to MongoDB 3.4 is a breeze: no downtime, no complications, no unexpected surprises, no issues whatsoever. Kudos to MongoDB, their upgrade process is really well suited to our purposes.

Fortunately, after the upgrade, performance improvements similar to those observed in staging are achieved in the production system.

Flying too close to the sun

Greek mythology offers many life lessons, such as the tale of Daedalus, who built wings for himself and his son Icarus to escape the labyrinth. Before flying out he gave his son only one rule: do not fly too close to the sun, or the wax holding the wings together will melt. Shortly after starting their journey, the flight ends in a fatal accident because the rule was not respected.

The temptation is of course great to continue in the same way, to push our luck. MongoDB 3.6 advertises a significant number of new features, one of them being zlib compression, which promises higher compression rates at the expense of more server-side resources.
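
The trade-off zlib advertises is easy to get an intuition for locally. The toy measurement below uses the standard-library zlib on a synthetic JSON payload and only illustrates the ratio-versus-CPU curve; it says nothing about MongoDB itself.

```python
import json
import time
import zlib

# Synthetic, repetitive JSON payload just to illustrate the trade-off.
payload = json.dumps(
    {"reviews": [{"id": i, "text": "great location " * 40} for i in range(500)]}
).encode("utf-8")

for level in (1, 6, 9):
    start = time.perf_counter()
    compressed = zlib.compress(payload, level)
    elapsed = time.perf_counter() - start
    print(f"level={level} ratio={len(payload) / len(compressed):.1f}x time={elapsed * 1000:.2f} ms")
```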

The same process is followed, producing the following statistics:

The collected metrics show that in staging, for our systems and workloads, the average transfer rate is lower with zlib than with snappy. While for different systems and payloads this might very well be a completely different story, we currently lack proof that a setup involving zlib would perform better in production. At the same time, MongoDB 3.6 with snappy seems to outperform production, although this may not be relevant, since they are essentially different systems under different loads.

With the indication that the new version can perform better, the production upgrade is proposed. Once on the new version, the results show that the suspicions were true: the system is not quite as fast, but the difference is insignificant for our current needs, so the last major version of the 3 series is kept as the production engine.

Future Work

Looking into the future, there are many aspects that can still be improved. The production system is significantly more powerful than staging, which makes a zlib test run a low-risk operation that may actually yield better performance. The fact that the staging metrics suggested the contrary should not be discouraging, since the difference is small and leaves a margin of risk without visibly impacting the runtime of the pipelines.

While adjusting the source code to be compatible with MongoDB 3.4 and then 3.6, it quickly became obvious that having a common component responsible for serializing the data to MongoDB would be really beneficial. Imagine, for instance, a solution decoupled from the batch ETL pipelines that streams the records to be upserted into the database: not just one code base to maintain when moving to a new version, but also shorter pipeline runtimes that allow more frequent releases.
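
A first cut of such a shared component could be little more than a generator-fed writer that batches upserts, so every pipeline streams records through the same code path. The sketch below is only an outline of the idea, with hypothetical names.

```python
from typing import Iterable

from pymongo import MongoClient, UpdateOne
from pymongo.collection import Collection


def stream_upserts(collection: Collection, records: Iterable[dict], batch_size: int = 500):
    """Consume a stream of records and upsert them in fixed-size batches."""
    batch = []
    for record in records:
        batch.append(UpdateOne({"_id": record["_id"]}, {"$set": record}, upsert=True))
        if len(batch) >= batch_size:
            collection.bulk_write(batch, ordered=False)
            batch = []
    if batch:
        collection.bulk_write(batch, ordered=False)

# Usage (hypothetical):
# stream_upserts(MongoClient(...)["analytics"]["meta_reviews"], record_iterator)
```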

New versions of MongoDB, starting with 4.0 and 4.2 and whatever follows, will bring better features and faster performance. Continuously upgrading the system makes a lot of sense in order to avoid performance problems similar to the ones tackled in this round of improvements.

Conclusion

All in all, thanks to the improvements brought to stability, processing logic and infrastructure, the system now runs stably and finishes within the specified time constraints. Consequently, the decision is made to put further improvement efforts aside for now and focus on capitalizing on the gains described in this article.

TL;DR

In the quest to find answers to complex performance problems, it is imperative to have deep knowledge of the underlying technology, to have a testing setup that allows verifying theoretical claims without affecting production and, last but not least, to follow and understand every detail carefully, because those details lead to the solution.
