How we migrated multiple legacy systems without downtime or data loss

The numbers: 57+ databases, 60+ S3 buckets, 12+ CloudFront distributions, 3 DB instances, 5 different projects, 4 different programming languages, and more than 10 frameworks.

Abhishek Nandi
Nov 25

Background

YourStory Media is a digital media publication that has championed startups and businesses in India since 2008. The technologies media houses need and use have evolved dramatically in the last 10 years. When I joined YourStory in 2017 through the acqui-hire of Odiocast, I inherited a mixed stack. It looked something like this: a customized WordPress, an in-house CMS in Ruby for UGC and regional languages, a web frontend written in NodeJS and VueJS, authentication systems in Ruby, and multiple other services backed by MySQL, MongoDB and Neo4J databases. The infra practically hung by a thread, with no auto-scaling groups and no CI/CD.

Just to make it more complicated, we had to plan everything, upskill the team, hire, build new product features, maintain the old ones, and migrate it all with a lean team.

Let’s take a look at the problem

A simplified version of the initial architecture

The above image is a simplified version and hides a ton of things: multiple points of failure, security concerns, scalability and availability issues, cost concerns, data discrepancies, and so on.

Goal

The image below is also a simplified version of the architecture we finally achieved after the rollout of all microservices and integrations in Feb 2019.

The goal: the final architecture, showing all the AWS and GCP components used

Taking stock of everything

So what do you do when you have ZERO documentation, 144 repositories and 101 things going on? You pause and take stock of everything before you screw up. Even before we started to plan the migration, there was a ton of work that needed to be done: adding performance monitoring and alerting, and improving the stability of the existing system so that we could put things on auto-pilot while we built an alternative. And in between all this, we had a ton of feature requests pouring in.

We ran a lot of client campaigns, and that is what led to the bulk of the 60+ S3 buckets. I started chipping away at what was not needed and consolidating what could be consolidated. We audited IAM users, roles, certificates and S3 bucket permissions. We enabled monitoring on public buckets to see what was going on, and monitored CloudFront distributions to remove unused ones. We started tagging every resource and would then go about examining and removing what was not needed, much like the mark-and-sweep strategy of a garbage collector.
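
To give a flavour of what that audit looked like, here is a minimal sketch of a tag-based "mark" pass over S3 buckets, assuming the AWS SDK v3 for JavaScript. The owner tag key is an illustrative convention, not necessarily the one we used.

```typescript
// audit-buckets.ts — list buckets that lack an "owner" tag so they can be reviewed
// (and eventually swept). Assumes AWS SDK v3; the tag key and region are illustrative.
import { S3Client, ListBucketsCommand, GetBucketTaggingCommand } from "@aws-sdk/client-s3";

const s3 = new S3Client({ region: "ap-south-1" });

async function findUntaggedBuckets(): Promise<string[]> {
  const { Buckets = [] } = await s3.send(new ListBucketsCommand({}));
  const sweepCandidates: string[] = [];

  for (const bucket of Buckets) {
    if (!bucket.Name) continue;
    try {
      const { TagSet = [] } = await s3.send(new GetBucketTaggingCommand({ Bucket: bucket.Name }));
      // "Marked" buckets carry an owner tag; anything else goes on the review list.
      if (!TagSet.some((tag) => tag.Key === "owner")) sweepCandidates.push(bucket.Name);
    } catch {
      // GetBucketTagging throws if the bucket has no tags at all.
      sweepCandidates.push(bucket.Name);
    }
  }
  return sweepCandidates;
}

findUntaggedBuckets().then((names) => console.log("Sweep candidates:", names));
```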

We moved our existing applications to Auto Scaling Groups and added CI/CD so we could roll out updates easily and roll back when needed without any downtime. After a year or so, when things became stable and a huge set of new features had been rolled out, we had some breathing room to get things in order, start our journey off WordPress, and attack the next set of feature requests.

To know more about the kind of pipeline we set up and how we scaled with ASG and Docker Compose, read this.

How did we decide on a database?

Ours was a data-intensive application. From the outside, it might look like a read-intensive application, but after looking at the existing workloads and database metrics, we observed that we would need to account for spikes in both reads and writes. We also needed consistency, encryption at rest, point-in-time recovery and auto-scaling, on top of being blazing fast. We evaluated multiple options. The databases we considered:

  • MongoDB
  • DynamoDB
  • RDS — MySQL
  • Aurora — PostgreSQL

We were already running workloads on MongoDB Atlas. Atlas is great: it gives you insights and also provides automated recommendations to further optimize your deployment. But when we did the capacity planning and figured out how much it would cost us, we were not very happy about it. The same applied to DynamoDB, along with more effort and limitations in terms of querying the database.

Aurora allowed us to create read replicas and associate them with an auto-scaling group, giving us enough flexibility to horizontally scale our database in terms of IO and latency. After all optimizations, our database query latencies were consistently in the single-digit milliseconds. We loved this; as with any managed service, it required very little DevOps/DBA effort. And the winner was AWS Aurora — PostgreSQL.
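
For reference, Aurora reader auto-scaling is configured through Application Auto Scaling. A minimal sketch of the registration and target-tracking policy follows; the cluster identifier, capacity range and CPU target are illustrative, not our production values.

```typescript
// aurora-replica-scaling.ts — register an Aurora cluster's reader count with
// Application Auto Scaling and attach a CPU-based target-tracking policy.
import {
  ApplicationAutoScalingClient,
  RegisterScalableTargetCommand,
  PutScalingPolicyCommand,
} from "@aws-sdk/client-application-auto-scaling";

const client = new ApplicationAutoScalingClient({ region: "ap-south-1" });
const resourceId = "cluster:my-aurora-cluster"; // hypothetical cluster identifier

async function configureReplicaAutoScaling(): Promise<void> {
  // Allow the cluster to run between 1 and 4 read replicas.
  await client.send(new RegisterScalableTargetCommand({
    ServiceNamespace: "rds",
    ResourceId: resourceId,
    ScalableDimension: "rds:cluster:ReadReplicaCount",
    MinCapacity: 1,
    MaxCapacity: 4,
  }));

  // Add replicas when the average reader CPU crosses 60%.
  await client.send(new PutScalingPolicyCommand({
    PolicyName: "reader-cpu-target-tracking",
    ServiceNamespace: "rds",
    ResourceId: resourceId,
    ScalableDimension: "rds:cluster:ReadReplicaCount",
    PolicyType: "TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration: {
      TargetValue: 60,
      PredefinedMetricSpecification: { PredefinedMetricType: "RDSReaderAverageCPUUtilization" },
    },
  }));
}

configureReplicaAutoScaling().catch(console.error);
```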

Having the cake and eating it too

So how do we execute this? We had an additional problem: none of our internal or external users were going to stop generating content and data, and we needed to ship a CMS and a PWA with a new design, and also migrate data, all while everyone kept working. It does not end there: I had to anticipate new features and product rollouts while we were working on everything else.

We started by simplifying things. We rewrote the APIs to remove the need for MongoDB. Next, we wrote a router where the v1 APIs point to the old DB and the v2 APIs point to the new DB. That way we could remove everything once the migration was complete, and we would know what to delete from our codebase without the risk of deleting working code.
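
A stripped-down sketch of that v1/v2 split, assuming an Express app and hypothetical data-access helpers in place of the real ones, looks like this:

```typescript
// router.ts — the same resource served from two backends until the cut-over is done.
import express from "express";

// Hypothetical data-access helpers; the real ones queried the old MySQL/MongoDB
// stack and the new Aurora PostgreSQL cluster respectively.
async function getFromLegacyDb(slug: string) { return { slug, source: "legacy" }; }
async function getFromAurora(slug: string) { return { slug, source: "aurora" }; }

const app = express();

// v1 keeps serving from the old database.
app.get("/v1/stories/:slug", async (req, res) => {
  res.json(await getFromLegacyDb(req.params.slug));
});

// v2 serves the same resource from the new database.
app.get("/v2/stories/:slug", async (req, res) => {
  res.json(await getFromAurora(req.params.slug));
});

// Once migration is complete, everything under /v1 can be deleted wholesale.
app.listen(3000);
```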

Data cleansing

We decided to clean things up while we were moving them; we did not want to carry the baggage. That meant we had to do more on the validation side, but it also meant we would have a predictable outcome after we had migrated. The baggage included things like unwanted data, malformed URLs and missing metadata.

Examples:

  • Change in domain: Initially, our company was on a .in domain. Content dating back to 2008 had permalinks to the .in domain instead of relative paths. We changed everything to relative paths.
  • Missing images: At some point, things were moved from Drupal to WordPress, and at some later point only the frontend of the website was changed to a VueJS-based application while the backend remained on WordPress. In both cases, a lot of images went missing or were incorrectly mapped. (How wrongly mapped? Imagine 10k images all named Untitled_Image.png.) We decided to restore as much as possible and not repeat the same mistake again.
  • Malformed URLs: A single YouTube video can be embedded in 3+ ways, while the URLs that can be used come in 4+ forms. We decided to transform all of them into a single format that our new CMS would understand, as shown in the sketch after this list. The same goes for other embeds.
  • Rethink authentication: The company had grown over the years, and somewhere around 2010 it had allowed people to sign up on the same WordPress installation to publish comments and do other things. We cleaned up users and mapped people to the relevant new accounts.
  • Duplicate data: We encountered multiple duplicate sets of data; we backed them up and removed them.
  • Issues with AMP: AMP has certain guidelines on which HTML tags can be used. We had a ton of AMP violations and decided to address them now rather than later.
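
As an example of the embed clean-up, here is a minimal sketch of how the YouTube URL variants can be collapsed into one embed form. The patterns and target format are illustrative rather than our exact mapping.

```typescript
// normalize-embeds.ts — collapse the common YouTube URL variants into a single
// embed form that a CMS can treat uniformly.
const YOUTUBE_PATTERNS = [
  /youtu\.be\/([\w-]{11})/,                // https://youtu.be/VIDEOID
  /youtube\.com\/watch\?.*v=([\w-]{11})/,  // https://www.youtube.com/watch?v=VIDEOID
  /youtube\.com\/embed\/([\w-]{11})/,      // https://www.youtube.com/embed/VIDEOID
  /youtube\.com\/v\/([\w-]{11})/,          // legacy /v/ style URLs
];

export function normalizeYouTubeUrl(url: string): string | null {
  for (const pattern of YOUTUBE_PATTERNS) {
    const match = url.match(pattern);
    if (match) return `https://www.youtube.com/embed/${match[1]}`;
  }
  return null; // not a recognizable YouTube URL; flag for manual review
}

// normalizeYouTubeUrl("https://youtu.be/dQw4w9WgXcQ")
//   -> "https://www.youtube.com/embed/dQw4w9WgXcQ"
```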

We had to adjust our CMS and migration scripts, which were built on top of QuillJS, to make sure older articles could be edited. This was the most challenging piece.

The Gamble

This is something we could have avoided, but we decided to gamble and ended up adding 6 weeks to our initial launch plan. The gamble was a calculated one and did cost us time, but it benefited us in other ways. We decided to change the UI with performance optimizations in mind and saw that it needed a significant rewrite. We also needed to accommodate the new APIs that came with the CMS move, so we decided to embark on this path as well.

The path we took

This gamble also benefited us when we decided to roll out new features like in-house comments using Coral Project Talk, a new property called YS Journal, and numerous other changes that cannot all be listed here.

The Migration

After everything was ready, we would run a migration on a sample set of content and test it. Every time we ran it, we increased the sample set; we ran close to 4 migration runs. We switched the APIs on our staging servers and even went to the extent of running Puppeteer scripts to make sure the old and new pages looked similar. Manually comparing 160k images was not humanly possible, but simply increasing the thumbnail size and viewing them side by side surfaced issues that could not have been spotted otherwise.
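
The screenshot pass was essentially a loop over a sample of story URLs; a minimal Puppeteer sketch of the idea (hostnames and slugs are illustrative) looks like this:

```typescript
// compare-pages.ts — capture the same story from the old and new stacks so the
// screenshots can be compared side by side. Hostnames and slugs are illustrative.
import { mkdirSync } from "fs";
import puppeteer from "puppeteer";

const slugs = ["2019/02/some-story-slug"]; // illustrative sample set

async function captureBoth(): Promise<void> {
  mkdirSync("shots", { recursive: true });
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  for (const slug of slugs) {
    const safeName = slug.replace(/\//g, "_");

    await page.goto(`https://old.example.com/${slug}`, { waitUntil: "networkidle0" });
    await page.screenshot({ path: `shots/old-${safeName}.png`, fullPage: true });

    await page.goto(`https://new.example.com/${slug}`, { waitUntil: "networkidle0" });
    await page.screenshot({ path: `shots/new-${safeName}.png`, fullPage: true });
  }
  await browser.close();
}

captureBoth().catch(console.error);
```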

Running the migration on 72 cores at 100% CPU utilization

We launched a 72-core machine with 144 GB of RAM on EC2 to run the final round of migration. This allowed us to complete the process in a short period, and the cost was minimal.
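
One way to keep a machine like that busy is to shard the id range across worker threads. The sketch below is an illustration of that idea rather than our exact runner; migrateArticle and the id range are hypothetical stand-ins, and it assumes a CommonJS build so __filename points at the compiled file.

```typescript
// migrate-parallel.ts — shard article ids across one worker thread per core.
import { Worker, isMainThread, parentPort, workerData } from "worker_threads";
import os from "os";

// Hypothetical per-article migration: read a legacy record, transform it, and write
// it into the new schema. The real logic is omitted here.
async function migrateArticle(id: number): Promise<void> {}

if (isMainThread) {
  const ids = Array.from({ length: 160_000 }, (_, i) => i + 1); // illustrative id range
  const cores = os.cpus().length; // 72 on the final migration box
  const chunkSize = Math.ceil(ids.length / cores);

  for (let i = 0; i < cores; i++) {
    const chunk = ids.slice(i * chunkSize, (i + 1) * chunkSize);
    const worker = new Worker(__filename, { workerData: chunk });
    worker.on("exit", () => console.log(`worker ${i} finished`));
  }
} else {
  (async () => {
    for (const id of workerData as number[]) {
      await migrateArticle(id);
    }
    parentPort?.postMessage("done");
  })();
}
```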

Before we ran the final migration, we added a feature flag to use the new DB and had already pushed the new UI. This allowed us to simply switch the flag to see the changes in action, and switch back in case we found an issue. We asked our editors and authors not to publish for 4 hours; this was more about us monitoring everything and making sure things were fine. We then took the next few hours explaining the new CMS and walking people through it. We did this every time something moved onto the new CMS.
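
Conceptually, the flag just decided which data-access path the APIs used. A minimal sketch, with placeholder stores standing in for the real MySQL/MongoDB and Aurora paths, looks like this:

```typescript
// cutover-flag.ts — pick the data-access path from an environment flag, so the
// switch (and the rollback) is a config change rather than a code change.
const USE_NEW_DB = process.env.USE_NEW_DB === "true";

interface StoryStore {
  getStory(slug: string): Promise<unknown>;
}

// Placeholder implementations; the real ones queried the old stack and Aurora.
const legacyStore: StoryStore = {
  async getStory(slug) { return { slug, source: "legacy" }; },
};

const auroraStore: StoryStore = {
  async getStory(slug) { return { slug, source: "aurora" }; },
};

// Flip USE_NEW_DB to switch traffic; flip it back to roll back if something looks off.
export const stories: StoryStore = USE_NEW_DB ? auroraStore : legacyStore;
```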

The end-result

We had moved everyone onto a new CMS that tied together the editorial process: desk workflow, automated plagiarism checks, and built-in analytics for editorial flows. This made the system more secure, centralised data and images, and improved web performance, thereby improving our SEO and allowing us to do more with less.

We achieved API latencies under 18 ms. Our website loaded blazing fast, with a score above 85 on Lighthouse audits. We were serving more than 10 TB of bandwidth, with millions of API hits. Our editors and writers were happy. We went on to add more features, like import from Google Docs and Microsoft Word, to make it even faster for them to publish an article.

Everything I discussed above was done by a team of 5 people without whom none of it would have been possible: Arnab Kundu, Utsav Bhagat, Vinit Pradhan, Mayuresh Mandan and Chetan Jain. We hired 2 more amazing engineers, Amritha and Prantik, via MountBlue. Arnab, Mayuresh and Chetan have since moved on from YS. I quit a few months back in search of a new adventure.
