Migrating Data-Intensive, High-Throughput APIs to Microservices

Maggie Lou
Nov 17

The history of Thumbtack’s backend codebase mirrors that of many of our fast-growing peers. When Thumbtack was founded in 2008, our codebase was developed as a monolith. For years, all application code lived in a single repository and the bulk of our data resided in a single Postgres database. Eventually, our growth required a new focus on scalability, and we began migrating to a microservice architecture.

Despite their many benefits, these migrations come with their own set of challenges. In this post we’ll walk through some of the challenges we encountered when migrating job preferences out of the monolith and into their own microservice. We’ll also touch on some potential risks that ended up having straightforward solutions due to the nature of our architecture.

Hopefully these learnings can help teams preempt these challenges in the future.

Why microservices?

As opposed to having a single monolithic codebase, in a microservice architecture each piece of business functionality is broken out into a self-contained service. This approach has many benefits for fast-moving and fast-growing companies, chief among them flexibility and scalability: each service can be developed, deployed, and scaled independently of the rest.

Example migration: Job preferences

Thumbtack prides itself on its ability to match customer projects to professionals who are uniquely qualified to meet the specific needs of that customer. To enable this level of detailed matchmaking, Thumbtack collects job preferences about each customer project.

For example, a customer looking for a house cleaner might specify the following job preferences: “1 bedroom”, “1 bathroom”, “no stairs”, “pets”.

Professionals then have the opportunity to tell us, in detail, the exact types of jobs they prefer to do. For example, a house cleaner might tell us that they want leads from customers with “1–3 bedrooms” but “no pets”.
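As a rough illustration, and not Thumbtack’s actual schema, job preferences can be modeled as answers to per-category questions, with a pro’s targeting expressed as the set of answers they are willing to accept. The structs and field names below are hypothetical:

```go
package preferences

// Preference is a single answer to a category question,
// e.g. question "bedrooms" with option "1".
type Preference struct {
	QuestionID string // e.g. "bedrooms", "pets"
	OptionID   string // e.g. "1", "no_pets"
}

// CustomerProject holds the preferences a customer selected for a request.
type CustomerProject struct {
	CategoryID  string // e.g. "house_cleaning"
	Preferences []Preference
}

// ProTargeting holds the options a professional will accept,
// keyed by question so each dimension can be checked independently.
type ProTargeting struct {
	CategoryID string
	Accepted   map[string]map[string]bool // questionID -> optionID -> accepted
}

// Matches reports whether every customer preference is acceptable to the pro.
func (t ProTargeting) Matches(p CustomerProject) bool {
	for _, pref := range p.Preferences {
		options, ok := t.Accepted[pref.QuestionID]
		if !ok {
			continue // the pro expressed no preference on this question
		}
		if !options[pref.OptionID] {
			return false
		}
	}
	return true
}
```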

Due to the high volume of customers on our site, as well as the potentially high number of job preference selections for each project category, our job preferences APIs are data-intensive and high-throughput. Job preferences are also one of the core tenets of our matching engine and a critical dimension of how we deliver a good customer experience. Given the size and importance of our platform to many of our users, we had to ensure availability and consistency of the product at all times, i.e. a zero-downtime migration.

These requirements resulted in the challenges we cover in this post: ensuring zero downtime, managing latency, handling the complexity introduced by caching, and optimizing memory usage.

Ensuring zero downtime

One key requirement of our migration was that the site remained fully operational. This was especially challenging because of the importance of the data and APIs being migrated.

Microservices are typically expected to independently manage their data. We had to extract job preference data from the monolithic Postgres database into a database owned by the microservice.

We followed a common double write, double read pattern. This allowed our site to stay live throughout the migration, as opposed to shutting it down for a maintenance period. The gradual process also allowed us to ensure a high level of accuracy in the migration. This would have been more challenging if we attempted to migrate all the data and logic in a single attempt.

This process entailed writing every change to both the monolith’s table and the new service’s database, backfilling existing data into the new database, reading from both sources and comparing the results, and finally cutting reads over to the new service once we were confident the data matched.

This process is complex, requiring additional network calls, the creation of new APIs, and a significant increase in load on the live production database in order to duplicate every read and write across two tables.

One feature at Thumbtack that facilitated this process was our robust feature flag architecture. By gating each step behind a feature flag, we could gradually ramp up the flags to roll out changes. While monitoring metrics, we could quickly turn off any step if we noticed bugs or unsustainable load on any component.
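Below is a simplified sketch of what this looks like in practice. The store and flag interfaces, flag names, and method signatures are illustrative, not our production code; the point is that every write goes to the monolith, and feature flags control whether it is mirrored to the new service and whether reads are compared or cut over.

```go
package migration

import (
	"context"
	"log"
	"reflect"
)

// PreferencesStore abstracts over the monolith's Postgres table and the new
// microservice's database (a hypothetical interface for this sketch).
type PreferencesStore interface {
	Write(ctx context.Context, projectID string, prefs []string) error
	Read(ctx context.Context, projectID string) ([]string, error)
}

// Flags is a stand-in for a feature flag client.
type Flags interface {
	Enabled(ctx context.Context, name string) bool
}

type Migrator struct {
	Monolith PreferencesStore
	Service  PreferencesStore
	Flags    Flags
}

// Write always writes to the monolith; when the double-write flag is ramped
// up it also mirrors the write to the new service so both stores stay in sync.
func (m *Migrator) Write(ctx context.Context, projectID string, prefs []string) error {
	if err := m.Monolith.Write(ctx, projectID, prefs); err != nil {
		return err
	}
	if m.Flags.Enabled(ctx, "prefs_double_write") {
		if err := m.Service.Write(ctx, projectID, prefs); err != nil {
			// The new store is still shadow-only, so log instead of failing.
			log.Printf("double write failed for %s: %v", projectID, err)
		}
	}
	return nil
}

// Read serves from the monolith until the read flag is ramped, optionally
// double reading from the new service and logging mismatches in the meantime.
func (m *Migrator) Read(ctx context.Context, projectID string) ([]string, error) {
	if m.Flags.Enabled(ctx, "prefs_read_from_service") {
		return m.Service.Read(ctx, projectID)
	}
	old, err := m.Monolith.Read(ctx, projectID)
	if err != nil {
		return nil, err
	}
	if m.Flags.Enabled(ctx, "prefs_double_read") {
		if shadow, err := m.Service.Read(ctx, projectID); err == nil && !reflect.DeepEqual(old, shadow) {
			log.Printf("read mismatch for project %s", projectID)
		}
	}
	return old, nil
}
```

Ramping the read-cutover flag last, once the mismatch logging has gone quiet, is what lets the final switch happen without downtime.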

Latency

In a monolith, any component can speak to another with a function call, and logic that spans diverse sets of data can use a quick database join. Communicating between microservices, however, requires network calls, which add latency and are less reliable.

The first time we tried to ramp up use of a particularly data-intensive API, the endpoint’s P95 latency was 200ms higher than that of its original implementation in the monolith. This was caused by a combination of network latency and implementation details.

The need to counteract this additional latency led us to pursue several performance optimizations, none of which are specific to microservices. One was to fetch data in parallel batched calls instead of executing requests sequentially. Another was the addition of a caching layer; the data in question is updated very infrequently, making it a particularly good use case for caching. Together these optimizations reduced the endpoint’s P95 latency by 400ms.
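As a sketch of the first optimization, a request for many identifiers can be split into fixed-size batches fetched concurrently rather than one request at a time. The fetchPreferences call and batch size below are assumptions for illustration:

```go
package client

import (
	"context"
	"sync"

	"golang.org/x/sync/errgroup"
)

const batchSize = 100

// fetchPreferences calls the job preferences service for one batch of IDs.
// Its signature is assumed for this example.
func fetchPreferences(ctx context.Context, ids []string) (map[string][]string, error) {
	// ... RPC to the microservice ...
	return map[string][]string{}, nil
}

// FetchAll fetches preferences for all IDs in parallel batches instead of
// issuing sequential requests.
func FetchAll(ctx context.Context, ids []string) (map[string][]string, error) {
	g, ctx := errgroup.WithContext(ctx)
	var mu sync.Mutex
	out := make(map[string][]string, len(ids))

	for start := 0; start < len(ids); start += batchSize {
		end := start + batchSize
		if end > len(ids) {
			end = len(ids)
		}
		batch := ids[start:end]
		g.Go(func() error {
			prefs, err := fetchPreferences(ctx, batch)
			if err != nil {
				return err
			}
			mu.Lock()
			for k, v := range prefs {
				out[k] = v
			}
			mu.Unlock()
			return nil
		})
	}
	return out, g.Wait()
}
```

The second optimization, the caching layer, is discussed in more detail in the next section.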

Check out this blog post for another interesting exploration of lowering microservice latency.

Caching complexity

While adding a cache dramatically improved performance, caches also add complexity to a system. Cache bugs can be subtle and difficult to debug, and if incorrect data is cached it may be hard to realize that the database and the cache have become inconsistent.

In our case, a hard-to-catch bug popped up in the form of a tricky race condition.

To resolve this, we changed the write path so that stale data can never outlive a short placeholder window. When a thread (Thread A) begins updating preferences in the database, it first overwrites the corresponding cache entry with a 0-byte placeholder that has a short expiration.

At the end of the write, Thread A can either explicitly set the updated data in the cache, using the Set operation, or let the 0B placeholder expire. If the placeholder expires, the next time a thread reads or writes from the cache there will be a cache miss; after reading the data from the database, that thread will write it back to the cache.
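A minimal sketch of this scheme, assuming a generic cache client and database layer (all of the names here are illustrative): readers treat the empty placeholder as a cache miss, but only repopulate the cache on a true miss, so a write in flight is never overwritten with stale data.

```go
package prefs

import (
	"context"
	"time"
)

// Cache and DB are stand-ins for the real cache client and database layer.
type Cache interface {
	Get(ctx context.Context, key string) (value []byte, found bool)
	Set(ctx context.Context, key string, value []byte, ttl time.Duration)
}

type DB interface {
	SavePreferences(ctx context.Context, key string, prefs []byte) error
	LoadPreferences(ctx context.Context, key string) ([]byte, error)
}

type Service struct {
	cache Cache
	db    DB
}

const (
	placeholderTTL = 5 * time.Second
	cacheTTL       = time.Hour
)

// UpdatePreferences is "Thread A" in the description above.
func (s *Service) UpdatePreferences(ctx context.Context, key string, prefs []byte) error {
	// 1. Overwrite the cached entry with a 0-byte placeholder that expires
	//    quickly, so concurrent readers stop serving the old value.
	s.cache.Set(ctx, key, []byte{}, placeholderTTL)

	// 2. Persist the new preferences to the database.
	if err := s.db.SavePreferences(ctx, key, prefs); err != nil {
		return err
	}

	// 3. Either explicitly Set the fresh value, or return here and let the
	//    placeholder expire so the next reader refills the cache.
	s.cache.Set(ctx, key, prefs, cacheTTL)
	return nil
}

// GetPreferences treats the 0-byte placeholder as a miss, but only refills
// the cache on a true miss so an in-flight write is never clobbered.
func (s *Service) GetPreferences(ctx context.Context, key string) ([]byte, error) {
	v, found := s.cache.Get(ctx, key)
	if found && len(v) > 0 {
		return v, nil
	}
	prefs, err := s.db.LoadPreferences(ctx, key)
	if err != nil {
		return nil, err
	}
	if !found {
		s.cache.Set(ctx, key, prefs, cacheTTL)
	}
	return prefs, nil
}
```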

Memory optimization

Because each microservice has a limited scope, each one is allocated much less memory than the monolith was. This shift highlighted some memory inefficiencies that previously weren’t a problem, and we began to receive alerts that our containers were running out of memory.

Upon further research we learned that the Go runtime doesn’t always immediately return freed memory to the OS (more details in this Stack Overflow post). This is not technically a memory leak: the memory is still available to the process that allocated it. But it causes the container’s reported memory usage to keep climbing, because the freed pages are never handed back to the operating system.

Even when it doesn’t currently need it, the runtime holds on to freed memory because coalescing and freeing memory to the kernel is expensive. If the runtime anticipates needing the memory again in the future, it attempts to optimize by just keeping it.

For our use case, the runtime would allocate a lot of memory, free it, and then never need that much memory again. We cache processed maps of data: the first time we read the data from the database, we need to keep both the raw database rows and the processed maps in memory, but once the data is cached we only need the processed maps. This means all of the extra memory the runtime holds onto after that first read is essentially wasted.

We resolved this by setting the environment variable GODEBUG=madvdontneed=1, which tells the runtime to release freed memory with MADV_DONTNEED so the kernel reclaims the pages immediately and the container’s reported memory usage drops accordingly.
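As a small diagnostic sketch (nothing Thumbtack-specific), runtime.MemStats exposes how much heap the runtime is holding onto without having returned it to the OS, and runtime/debug.FreeOSMemory can force the release for comparison:

```go
package main

import (
	"fmt"
	"runtime"
	"runtime/debug"
)

func reportMemory(label string) {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	// HeapIdle minus HeapReleased approximates memory the runtime has freed
	// but is still holding on to rather than returning to the OS.
	retained := m.HeapIdle - m.HeapReleased
	fmt.Printf("%s: in use = %d MiB, retained but unreturned = %d MiB\n",
		label, m.HeapInuse>>20, retained>>20)
}

func main() {
	// Simulate the one-time spike: build large intermediate data, then drop it.
	big := make([][]byte, 0, 512)
	for i := 0; i < 512; i++ {
		big = append(big, make([]byte, 1<<20)) // roughly 512 MiB in total
	}
	big = nil

	runtime.GC()
	reportMemory("after GC")

	// FreeOSMemory forces a GC and asks the runtime to return as much memory
	// as possible to the OS; with GODEBUG=madvdontneed=1 the kernel reclaims
	// the released pages immediately, so the container's RSS actually drops.
	debug.FreeOSMemory()
	reportMemory("after FreeOSMemory")
}
```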

Conclusion

Microservice migrations are fairly commonplace at growing tech companies. Despite their many benefits, they also come with new and unexpected challenges. To reap the benefits of a microservice architecture while minimizing regressions, organizations need to approach these migrations carefully.

At Thumbtack, the lessons from our past migrations continue to make each new migration smoother. We hope they can help other organizations with similar endeavors.

Reach out to us here if you love tackling technical challenges around scalability and marketplaces and want to join us as we develop the future of homecare!

About Thumbtack

Thumbtack (www.thumbtack.com) is a local services marketplace where customers find and hire skilled professionals. Our app intelligently matches customers to electricians, landscapers, photographers and more with the right expertise, availability, and pricing. Headquartered in San Francisco, Thumbtack has raised more than $400 million from Baillie Gifford, Capital G, Javelin Venture Partners, Sequoia Capital, and Tiger Global Management among others.
