Challenges and Learning: A Microservice DB Migration Project

Tim Collins
CODE + CONTOUR by IPSY
6 min read · Apr 21, 2021

Managing millions of subscriptions takes a lot of resources. Third-party order management systems may claim to handle large subscription volumes without any hiccups, but the truth is that most services cannot handle a large number of active users (over one million) without downtime, and those that can are costly.

Development of a Custom-Built Subscription Management System

In late 2013, we began to replace the third-party order management system with a purpose-built subscription and order management system that would be able to handle subscriptions natively.

Several years later, we had millions of subscriptions, and while the system was working, we decided it was time to start looking into what the next generation of our systems would look like. The current system had built up hundreds of millions of rows of individual orders and even more payments. It used large batch processes that required millions of rows to be created and then joined against millions of other rows all within a single query in one large monolithic database. Eventually, something was going to give.
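
To make the shape of the problem concrete, here is a minimal, hypothetical sketch of what such a batch statement can look like; the table and column names are invented for illustration and are not our actual schema.

```sql
-- Hypothetical shape of a monthly batch: create an order for every active
-- subscription by joining against payments, all in one statement and one
-- transaction. At millions of subscriptions, this join becomes the bottleneck.
INSERT INTO orders (subscription_id, order_month, total_cents)
SELECT s.id, DATE '2021-04-01', p.amount_cents
  FROM subscriptions s
  JOIN payments p ON p.subscription_id = s.id
 WHERE s.active
   AND NOT EXISTS (
         SELECT 1
           FROM orders o
          WHERE o.subscription_id = s.id
            AND o.order_month = DATE '2021-04-01');
```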

While we were designing the new systems, the dam finally broke and we began running into performance problems with the large joins. As the PostgreSQL planner started to falter under table and index bloat, we fought back by rebuilding indexes and tuning our autovacuum settings, but even these remediations eventually stopped helping. Our new system was still in the design phase, and we were already facing major problems with our existing processes.
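
The remediations we mean looked roughly like the following; the index and table names are hypothetical, and the settings are illustrative rather than the values we actually ran in production.

```sql
-- Replace a bloated index without blocking writes: build a fresh copy
-- concurrently, drop the old one, and take over its name.
CREATE INDEX CONCURRENTLY orders_subscription_id_idx_new
    ON orders (subscription_id);
DROP INDEX CONCURRENTLY orders_subscription_id_idx;
ALTER INDEX orders_subscription_id_idx_new RENAME TO orders_subscription_id_idx;

-- Make autovacuum visit a hot table far more often than the global defaults.
ALTER TABLE orders SET (
  autovacuum_vacuum_scale_factor  = 0.01,
  autovacuum_analyze_scale_factor = 0.005
);
```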

So what did we do next?

We tried multiple things. First, we got creative with our large batch jobs. We had written them as functions in PostgreSQL 9. We thought that if we could break them up into smaller batches, we might be able to alleviate some of the pain caused by the large joins inside large transactions, and we would stop locking so many rows at once. The catch was that PostgreSQL functions do not allow multiple separate transactions; they are not true stored procedures with transaction control.

We brainstormed and came up with the idea of leveraging a PostgreSQL feature for connecting to a remote database and initiating a transaction there. The trick was to have the function make a remote call to the SAME database; that remote call would run in its own transaction. By doing so, we were able to limit the number of rows touched in any one transaction, which allowed us to limp along while we built the new system.
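
In practice this kind of loop-back is done with an extension such as dblink; the sketch below is a simplified, hypothetical version of the pattern, with invented table and column names, not our production code.

```sql
-- Process a large backlog in small batches. Each dblink_exec call opens a
-- short-lived connection back to this same database, so every batch commits
-- in its own transaction instead of one giant one.
CREATE EXTENSION IF NOT EXISTS dblink;

CREATE OR REPLACE FUNCTION process_orders_in_batches(batch_size int)
RETURNS void AS $$
DECLARE
  remaining bigint;
BEGIN
  LOOP
    PERFORM dblink_exec(
      'dbname=' || current_database(),
      format('UPDATE orders SET processed = true
               WHERE id IN (SELECT id FROM orders
                             WHERE NOT processed
                             LIMIT %s)', batch_size));

    -- Under READ COMMITTED, each statement sees the batches already committed
    -- by the remote connection, so this count shrinks as the loop progresses.
    SELECT count(*) INTO remaining FROM orders WHERE NOT processed;
    EXIT WHEN remaining = 0;
  END LOOP;
END;
$$ LANGUAGE plpgsql;
```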

Unfortunately, while that solved the problem of writing these rows, we ran into another performance problem as traffic ramped up and we introduced our new Add-Ons program.

Every 2nd of the month, our Ipsters are able to add extra items to their personalized Glam Bags for just a few dollars, with free shipping. It drove the largest increase in traffic we had ever experienced. This was a great problem to have, but also something we needed to solve fast so we could capitalize on the surge in demand.

We solved the immediate problem by scaling up the database instance sizes. However, a database only scales up to a limit, and eventually we would run out of room to grow. So the next step for us was to partition our tables.

Developing Innovative Solutions to Novel Development Issues

We have a monthly subscription, and once a month ends its orders essentially stop changing. There’s little value in keeping all of that data in one big table, as most of it will never be touched again unless a customer asks for a refund at a later date. The solution to the problem was partitioning the database.

We had a natural partition key in the month the order was placed, and a table with a few million orders is still relatively easy to process compared to one with hundreds of millions. Using a trigger to create the partitions and route data into them, we were able to cut down the number of rows visible to a query by an order of magnitude.
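
On the PostgreSQL versions we were running, partitioning meant table inheritance plus a routing trigger rather than today’s declarative partitioning. Here is a minimal sketch of that pattern, with hypothetical table names and only two example months:

```sql
-- Parent table that queries and inserts target.
CREATE TABLE orders (
  id              bigserial,
  subscription_id bigint  NOT NULL,
  order_month     date    NOT NULL,   -- partition key: first day of the order's month
  total_cents     integer NOT NULL
);

-- One child per month; the CHECK constraint lets the planner skip the child
-- when a query filters on the partition key.
CREATE TABLE orders_2021_03 (
  CHECK (order_month >= DATE '2021-03-01' AND order_month < DATE '2021-04-01')
) INHERITS (orders);

CREATE TABLE orders_2021_04 (
  CHECK (order_month >= DATE '2021-04-01' AND order_month < DATE '2021-05-01')
) INHERITS (orders);

-- Trigger that routes each insert on the parent into the matching child.
CREATE OR REPLACE FUNCTION orders_insert_router() RETURNS trigger AS $$
BEGIN
  EXECUTE format('INSERT INTO %I SELECT ($1).*',
                 'orders_' || to_char(NEW.order_month, 'YYYY_MM'))
  USING NEW;
  RETURN NULL;  -- the row goes into the child, not the parent
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER orders_partition_insert
  BEFORE INSERT ON orders
  FOR EACH ROW EXECUTE PROCEDURE orders_insert_router();
```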

Our fun with PostgreSQL, regrettably, did not end there!

Between being in business for several years and having multiple subscription products, the way we partitioned our data resulted in a little over 100 partitions, and each month we added a few more. We soon learned that the version of PostgreSQL we were using was not particularly efficient in how it handled partitions.

We had to deal with developers writing queries that did not include the partition key. When you have an aging codebase on a monolithic system, there are a lot of little gremlin queries you need to suss out; otherwise they scan every partition, making them drastically slower than before partitioning. That took some time to clear up. We also learned that PostgreSQL takes a shared access lock on every single partition a query might touch, regardless of whether the partition key was used. The rows a query actually needed lived in only one partition, and the other partitions held old months that we rarely looked at, so we assumed a shared lock on them would not cause a problem. Alas, that was not the case.
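
Continuing with the hypothetical schema sketched above, the difference is easy to see in a query plan: filtering on the partition key lets the planner prune to one child table, while leaving it out touches, and locks, every one of them.

```sql
-- With the partition key, constraint exclusion prunes the plan down to the
-- matching child table.
EXPLAIN SELECT *
  FROM orders
 WHERE order_month = DATE '2021-04-01'
   AND subscription_id = 42;

-- Without the partition key, the plan contains a scan node for every child,
-- and the query must take a shared access lock on each of the 100+ partitions.
EXPLAIN SELECT *
  FROM orders
 WHERE subscription_id = 42;
```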

PostgreSQL has a limited number of shared access locks it can hand out at a time. If you have thousands of queries a second, and each needs hundreds of locks, that number adds up very quickly. Once the lock table runs out of space, every other query has to wait until locks are freed. Even after configuring the database appropriately for this, we still ran into many unexplained failures.
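
The relevant knob here is max_locks_per_transaction, and pg_locks shows how quickly shared access locks pile up; the queries below are a hypothetical way of inspecting this rather than a prescription.

```sql
-- The shared lock table holds roughly
--   max_locks_per_transaction * (max_connections + max_prepared_transactions)
-- entries in total; the default of 64 per slot assumes far fewer than
-- 100+ relations per query.
SHOW max_locks_per_transaction;
SHOW max_connections;

-- Count how many shared access locks are currently held across the cluster.
SELECT count(*)
  FROM pg_locks
 WHERE mode = 'AccessShareLock';
```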

The release of our new microservices, our proverbial land of milk and honey, was imminent, yet we kept running into more issues that were esoteric and hard to debug. IO was a constant problem in the monolith. As a last attempt to solve our problems before the new release, we migrated the monolith to Amazon’s Aurora database. This was a big win for us: the IO issues completely disappeared, as Aurora’s storage layer was seemingly able to handle everything we threw at it. It did not solve every performance issue, but between the IO problems going away and the ability to have near-instantaneously synced read instances, it gave us a little more breathing room.

Final Remarks: What We Learned from Our Migration Experience

A number of key points can be drawn from our migration project. The first is that you should plan for the amount of data you expect to have, not the amount you currently have. Had we partitioned our systems earlier, things would have run more smoothly and we would have been less hurried in solving these issues as they came up. Now, as we consider new systems, we ask ourselves:

  • How many rows are we going to add each month?
  • Is that number going to stay constant if we double or triple our user base?
  • If it grows linearly with the number of users, we now start partitioning right away instead of waiting for it to become a problem.

Second, Aurora is really helpful at our level of traffic. Having the ability to auto-scale database read instances up and down, and to have them almost instantly and always in sync, is extremely valuable. Most of our systems now use Aurora and read exclusively from read instances. We are able to scale with traffic much more reliably, even though we already cache heavily with Redis.

Those two points are not the only lessons from our migration experience. We now know that the monolith had to go; no amount of remediation was going to solve that bottleneck. Luckily, we had enough forethought to start on that project early, before we needed it, so the amount of time we had to keep the old system running was far less than it would have been had we waited until everything broke.
