Migrating Mountains of Mongo Data

At Addepar, we’re building the world’s most versatile financial analytics engine. To feed the calculations that give our clients an unprecedented view into their portfolios, we need data — from as many sources, vendors, and intermediaries as possible. Our market and portfolio data pipelines ingest benchmarks, security terms, accounting, and performance data from hundreds of integration partners. Behind this data pipeline is a database. And like every database, ours requires maintenance and care.

Maintaining an obsolete database instance is challenging due to lack of support, inferior performance, and a dwindling developer community. As our dataset grew and we faced increased stability and performance requirements, the engineering group at Addepar decided it was time to upgrade our venerable Mongo 2.4 (TokuMX 2.0) database to the latest and greatest Mongo 3.4. Every organization has at least one database saga, and we’re excited to share one of ours.

Dependency management extends to databases

Motivated by our need for an extremely reliable datastore that could handle complex and evolving schemas, we chose Mongo as the sole database for Addepar’s data ingestion pipeline five years ago. We use Mongo not only as a document store for persisting state, but also as a blob store for capturing intermediate, heterogeneous results from the pipeline. As each transformation stage in the ETL pipeline succeeds, its result is decorated with indexable metadata and placed in Mongo. This use case does not require transaction isolation but benefits from high write volumes, flexible schema definition, multiple indices, easy replication, and automatic failover. However, when we initially decided to use Mongo, the database only supported very coarse locks at the collection (i.e. table) level, limiting write throughput. We were not the only ones frustrated by Mongo’s limited support for concurrent operations. TokuMX, an open-source fork with MVCC semantics and index- and document-level locking, had been developed to fill some of Mongo’s gaps, and we were happy to adopt it.
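
To make this concrete, here is a minimal sketch of what persisting a stage result with indexable metadata might look like, assuming the modern MongoDB Java driver; the collection, field, and class names are illustrative placeholders, not our actual schema.

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;
import com.mongodb.client.model.Indexes;
import org.bson.Document;
import org.bson.types.Binary;

import java.util.Date;

public class StageResultStore {

    private final MongoCollection<Document> results;

    public StageResultStore(MongoClient client) {
        MongoDatabase db = client.getDatabase("pipeline");
        this.results = db.getCollection("stage_results");
        // Secondary indices let downstream stages look up intermediate results
        // by provider, stage, and run date without scanning the whole collection.
        results.createIndex(Indexes.ascending("provider", "stage", "runDate"));
    }

    /** Persist one stage's raw output blob alongside its indexable metadata. */
    public void saveStageResult(String provider, String stage, byte[] payload) {
        Document doc = new Document("provider", provider)
                .append("stage", stage)
                .append("runDate", new Date())
                // The payload itself is opaque to Mongo; only the metadata is indexed.
                .append("payload", new Binary(payload));
        results.insertOne(doc);
    }

    public static void main(String[] args) {
        // Hypothetical connection string; replace with your own deployment.
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            new StageResultStore(client)
                    .saveStageResult("acme-feed", "normalize", new byte[] {1, 2, 3});
        }
    }
}
```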

As our daily data intake grew, it became clear that TokuMX, long since deprecated by its new parent company, was no longer meeting our needs. Conflicting access patterns necessitated indices that were less than optimal for most queries. TokuMX’s index range locks would freeze large chunks of our collections, blocking critical jobs and yielding a worrisome number of operation timeouts. These performance issues made it even more critical that we be able to explore query plans and identify bottlenecks, but diagnostic tools for inspecting the database engine were limited in early versions of the software. Nobody enjoyed wondering why certain operations would sometimes take ten times longer than usual. We needed a new database.

Fortunately, since our adoption of TokuMX, the mainline Mongo distribution had made impressive gains. The newest version of the database offered a much-improved replacement storage engine with optimistic document-level locking and radically improved support for server-side aggregation and transformation of documents. I was thrilled to have access to more bulk-write options that could improve the performance of some of our most “chatty” database operations. Upgrading would resolve some of our performance concerns in the short term, fix bugs that had long since been patched in newer versions of the software, and allow us to continue scaling with our clients.
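
As a rough illustration of the kind of batching those bulk-write options enable (not our exact code), the helper below collapses many single-document upserts into one unordered bulk request; it assumes a recent MongoDB Java driver, and the `_id`-based upsert is just one possible shape.

```java
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.BulkWriteOptions;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.ReplaceOneModel;
import com.mongodb.client.model.ReplaceOptions;
import com.mongodb.client.model.WriteModel;
import org.bson.Document;

import java.util.ArrayList;
import java.util.List;

public class BulkUpserter {

    /** Collapse many chatty single-document upserts into one unordered bulk request. */
    public static void upsertAll(MongoCollection<Document> collection, List<Document> docs) {
        List<WriteModel<Document>> writes = new ArrayList<>();
        for (Document doc : docs) {
            writes.add(new ReplaceOneModel<>(
                    Filters.eq("_id", doc.get("_id")),
                    doc,
                    new ReplaceOptions().upsert(true)));
        }
        // An unordered bulk write lets the server apply operations without
        // stopping at the first failure, and reports all errors at once.
        collection.bulkWrite(writes, new BulkWriteOptions().ordered(false));
    }
}
```

Batching like this trades a little work building the request for far fewer round trips, which is exactly where chatty one-document-at-a-time operations tend to lose time.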

Planning the great migration

Before committing to the switch, we ran both TokuMX and Mongo 3.4 in parallel in a production-like environment. Using Java’s convenient dynamic proxy support, we transparently intercepted each application-level database request, performed the operation against both databases, and compared the document results and operation times on a separate thread pool so as not to block the application threads with additional computation. With the help of a few scripts and our log aggregation provider, Sumo Logic, we built a detailed view of the results (correctness) and performance characteristics (latency) of all of our database operations, broken down by service, data provider, and other attributes. Pairwise comparison of all requests showed that Mongo 3.4’s operations were on average faster, with less variance in their duration. While (expected and infrequent) transient failures complicated this simple approach, we could confidently proceed with the changeover knowing that the new database would perform at least as well as the old.
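
The comparison harness looked roughly like the sketch below: a JDK dynamic proxy fronts a DAO interface, executes each call against both stores, and pushes the comparison and timing work onto a separate executor so callers never wait on it. The wrapped interface, logging format, and error handling here are simplified placeholders rather than our production code.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;
import java.util.Objects;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public final class DualReadProxy {

    // Usage (hypothetical DAO interface):
    // PositionDao dao = DualReadProxy.wrap(PositionDao.class, tokuBackedDao, mongo34BackedDao);

    public static <T> T wrap(Class<T> daoInterface, T primary, T candidate) {
        // A real implementation would manage this executor's lifecycle and bound its queue.
        ExecutorService comparer = Executors.newSingleThreadExecutor();

        InvocationHandler handler = (proxyObj, method, args) -> {
            long start = System.nanoTime();
            Object primaryResult = method.invoke(primary, args);
            long primaryNanos = System.nanoTime() - start;

            // Replay the call against the candidate store asynchronously and compare.
            comparer.submit(() -> {
                try {
                    long candidateStart = System.nanoTime();
                    Object candidateResult = method.invoke(candidate, args);
                    long candidateNanos = System.nanoTime() - candidateStart;
                    boolean matches = Objects.deepEquals(primaryResult, candidateResult);
                    // Emit a structured line per call so a log aggregator can slice the results.
                    System.out.printf("dbcompare method=%s match=%s primaryMs=%d candidateMs=%d%n",
                            method.getName(), matches,
                            primaryNanos / 1_000_000, candidateNanos / 1_000_000);
                } catch (Exception e) {
                    System.out.printf("dbcompare method=%s candidateError=%s%n", method.getName(), e);
                }
            });

            // The application only ever sees the primary store's answer.
            return primaryResult;
        };

        return daoInterface.cast(Proxy.newProxyInstance(
                daoInterface.getClassLoader(), new Class<?>[] {daoInterface}, handler));
    }
}
```

Emitting one structured log line per call is what made it easy to break the correctness and latency comparisons down by method, service, and data provider in Sumo Logic.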

The portion of our product that relies on the Mongo database is not client-facing or driven by external requests, so we were able to take advantage of a weekend maintenance period to migrate our database offline. TokuMX’s on-disk format is incompatible with Mongo 3.4’s, ruling out an in-place swap. While it is possible to use a combination of database snapshots and the Mongo family’s “oplog” (akin to the binlog for our MySQL friends) to perform the migration with only several minutes of downtime, looser business requirements allowed us to use the simpler offline process. Still, we needed to be sure that our database would be back up and running before the end of the weekend. Without the database there would be no pipeline, and without the pipeline, our clients would not have their data in time for Monday’s market open.

Rehearsal makes execution all that much easier

When the migration weekend rolled around, the collaboration between our software engineering and DevOps teams paid off. We rolled up our prep work into a detailed script with expected run-times and pre-considered contingencies for possible hiccups. The migration went off with only a few bumps (remember to adjust your configuration management tool to avoid having services unexpectedly restart!). We started processing data again with time to spare until the next business day and with confidence that everything would proceed smoothly. We retired a deprecated, yet critical, dependency, strengthened the relationship between two of our teams, and set the stage for further performance improvements for our clients.

Our upgrade work positioned us to serve larger clients more reliably than we could before. Features in the latest Mongo distributions are allowing us to tighten up our operational standards while also evolving our models to help our clients make the most of their financial data. We’re looking forward to persisting our next billion documents, recording data on hundreds of billions of dollars of assets, even faster and more reliably than ever!

