Tanker: A Multi-Datacenter Customer Data Migration Framework

Sergei Kretov
Pipedrive R&D Blog
Jun 15, 2020 · 9 min read
Illustrative diagram of multi-dc data migration

Intro

Every successful or growing SaaS company usually reaches the point where serving customers from one datacenter (DC) isn’t enough, and changes to infrastructure, architecture, and software applications become necessary.

As you likely know, serving customers from multiple DCs can be quite challenging: it requires proper design and implementation from the lowest levels of networking all the way up to external requirements like customer data protection rules. All of this is done in pursuit of a faster, more secure, and more reliable service.

Pipedrive is no exception to this rule.

In this article I touch on the history and reasons behind implementing our own mechanisms for the customer data lifecycle. I will also go deeper into the data migration platform we have built and use, and lastly touch on future enhancement possibilities.

History

Let’s start with the “Megaparsec” project (as we call it), which was brought to the table in 2017 when we decided that a second DC was inevitable. From that point forward, our Infrastructure team and a few others began preparing for a giant leap.


One of the main drivers behind the need for a new DC was the introduction of GDPR, a regulation that forced us (beginning May 25th of 2018) to store European Union (EU) customer data within the EU — all while we were continuously getting more and more customers from that region.

From day one, Pipedrive has had EU clients whose data was stored in the United States-based DC. To abide by GDPR rules, their data had to somehow be sent back to and served from the EU, both to comply with the new data protection policies and to make our EU customers happier with faster response times and increased security.

Before migrating data, we needed to solve the multi-DC request routing problem. This is where Barista — our own public request gateway service — came into play. It provided us with easy request routing from DC to DC, centralized authentication mechanisms, rate limiting, service discovery and more. This service deserves an article of its own, so I won’t go into details now.

After we solved the cross-continent networking issues, it was time to move the customer data. At first, customer DBs were migrated manually by our DevOps engineers, but this wasn’t a scalable solution going forward.

The good part is that there were only a few microservices with their own DBs at the time — some already had them in addition to the main customer data that lived in a company-dedicated DB. [You can read more about the database architecture in this article by our very own Infra Architect, Vladimir Zulin.] With the number of services and dedicated DBs growing every month, manual data migration started to become a pain.

It came as no surprise when Pipedrive’s CTO, Sergei Anikin, approved the development of Tanker — a multi-DC customer data migration framework and orchestrator service.

Illustration of a tanker ship

Why Tanker? Because tankers are ships used to transfer goods, and our service is one that transfers valuable customer data.

My team started this project in Q4 of 2017 and then launched it live in Q2 of 2018.

An interesting fact is that Tanker became the first big service written in Golang at Pipedrive, and it opened a route to this stack for other services, teams and now tribes.

The problems we wanted to solve

As stated earlier, moving customer data from one DC to another became a necessity for us. This involves not only movement to a European DC, but also EU-to-US data movement. Requirements for where data is stored change due to a customer’s legal address change, for hardware performance reasons, or because a majority of users are geographically located near a specific DC. Those changes force us to do data migrations on demand and sometimes at a big scale — moving thousands of customers between DCs or between database servers in a single DC.

Additionally, our support and infrastructure engineers did a lot of manual work to migrate customer data. Automation was required to reduce the cost and save us from basic human mistakes.

What this all comes down to is that the need for a data migration service was justified, but as time passed, we also wanted to use the migration framework as the basis for company data lifecycle management.

Customers migrate, cancel accounts and often come back after some time has passed. For our systems, this translates into the following main operations:

Main multi-dc migration framework operations

I won’t cover registration or bootstrapping operations of customer data here as those don’t directly belong to our data migration framework. We have a separate set of services that take care of all customer registrations and billing.

How the migrations were done

The lifecycle of customer data starts directly after the company account registration. After the application trial period has passed, or a company has decided to leave Pipedrive and closes their account, we need to delete all their data (to follow GDPR) and end the lifecycle. Unfortunately, this can cause problems because, in the first 30 days following an account removal, users sometimes ask us to have it restored so they can resume using their previous data (data protection policies prevent us from storing backups for a longer period).

For the sake of keeping this article shorter, I will describe only part of the migration process, leaving out some details about account removal and restoration.

To be able to move/migrate any data, one needs to export it in the source location and only then transfer it to the target. It sounds simple, but there’s an issue — we have multiple services that deal with customer data, multiple storages, multiple types of storage, and data that is continuously being written from numerous integrations, sources, public API calls and so on. Basically, we can’t simply dump a single DB, create an SFTP connection, copy the data and then re-import it.

The migration process is much more complicated in the microservice world. Our multi-DC migration framework consists of an orchestrator (Tanker) that runs in an independent region, migration agents (RoRo — a type of tanker ship) that run in the specific geo-DCs where customer data is located, company data provider (CDP) services that actually access customer data, and a backup storage service (AWS S3 in our case) where we store the exports.

Simply put, Tanker as the orchestrator:

- holds a connection to its agents (RoRo) in the DCs via gRPC

- schedules sub-tasks and monitors/updates their status

- tracks overall progress of the main task and reports it to our Backoffice UI for engineers.

The orchestrator and agents use MySQL as a data store and queue for tasks, sub-tasks and related log records.
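For illustration, the kind of task hierarchy the orchestrator keeps in that store could be modeled roughly as below. This is a minimal sketch: the struct names, fields and statuses are assumptions made for the example, not Tanker’s actual schema.

```go
package tanker

import "time"

// TaskStatus is an illustrative status enum for migration tasks and sub-tasks.
type TaskStatus string

const (
	StatusPending   TaskStatus = "pending"
	StatusRunning   TaskStatus = "running"
	StatusSucceeded TaskStatus = "succeeded"
	StatusFailed    TaskStatus = "failed"
)

// MigrationTask represents one customer migration tracked by the orchestrator.
// In practice it would map to a MySQL table used both as a store and a queue.
type MigrationTask struct {
	ID        int64
	CompanyID int64
	SourceDC  string
	TargetDC  string
	Status    TaskStatus
	CreatedAt time.Time
	UpdatedAt time.Time
}

// SubTask is a per-CDP unit of work (export, import, delete, post-migration)
// scheduled by the orchestrator and executed by an agent (RoRo) in a given DC.
type SubTask struct {
	ID        int64
	TaskID    int64
	CDP       string // company data provider service name
	Operation string // e.g. "export", "import", "delete"
	Status    TaskStatus
	Log       string // related log records surfaced in the Backoffice UI
}
```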

Simplified migration process of a “Happy” flow

Multiple steps must be covered for a successful migration (a simplified sketch of the flow follows the list):

  • To ensure data integrity, we stop writing to storages and stop serving customer API requests once a backup/migration/delete/restore process has been initiated. We basically completely lock the company’s public and internal traffic. Additionally, as we use event queues (Kafka, RabbitMQ), the orchestrator waits for some time before starting exports, until pending customer events get processed. During the lockdown, no one from the customer’s company can use the application (and no integration calls will pass through), so operations can safely be done without breaking or losing any data.
  • Create exports from all customer-related storages via the related company data provider (CDP) services and save them in an Amazon Web Services S3 bucket for easy access from the target region. These different exports must not be out of sync, meaning all data should represent a slice at a specific point in time (the time the backup process was initiated).
  • We make sure that all export operations have finished successfully, then start import tasks in the target DC and stream the data from AWS S3 to the CDPs for importing.
  • Only if all import operations in all customer-related services are successful will we configure our systems to serve customer requests from the target DC, unlock the traffic and consider the migration a success. After some time, we clean up the source DC, as the account data there is no longer in use.
  • If something goes wrong, we cancel all imports, unlock the company in the source region, roll back all successful imports (delete the data) in the target region and consider the migration task failed.
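Putting the happy flow and its rollback together, here is a minimal sketch in Go of how the orchestration could look. The Agent and Router interfaces, method names and error handling are illustrative assumptions rather than Tanker’s actual API; in reality the orchestrator talks to its RoRo agents over gRPC and tracks every step as a sub-task.

```go
package tanker

import (
	"context"
	"fmt"
)

// Agent abstracts a RoRo agent in one DC. The methods are hypothetical.
type Agent interface {
	LockTraffic(ctx context.Context, companyID int64) error   // lock public and internal traffic
	UnlockTraffic(ctx context.Context, companyID int64) error
	ExportAll(ctx context.Context, companyID int64) error     // export every CDP to S3
	ImportAll(ctx context.Context, companyID int64) error     // import every CDP from S3
	DeleteAll(ctx context.Context, companyID int64) error     // delete imported data (rollback/cleanup)
}

// Router abstracts the routing layer that decides which DC serves a company.
type Router interface {
	PointTo(ctx context.Context, companyID int64, dc string) error
}

// migrateCompany runs the simplified happy flow with rollback on failure.
func migrateCompany(ctx context.Context, companyID int64, src, dst Agent, targetDC string, router Router) error {
	// 1. Lock traffic in the source DC (after queued customer events have drained).
	if err := src.LockTraffic(ctx, companyID); err != nil {
		return fmt.Errorf("lock: %w", err)
	}

	// 2. Export data from every CDP in the source DC to S3.
	if err := src.ExportAll(ctx, companyID); err != nil {
		_ = src.UnlockTraffic(ctx, companyID) // nothing imported yet, just unlock
		return fmt.Errorf("export: %w", err)
	}

	// 3. Import the exports into every CDP in the target DC.
	if err := dst.ImportAll(ctx, companyID); err != nil {
		_ = dst.DeleteAll(ctx, companyID)     // roll back partial imports in the target DC
		_ = src.UnlockTraffic(ctx, companyID) // keep serving from the source DC
		return fmt.Errorf("import: %w", err)
	}

	// 4. Switch routing to the target DC and unlock traffic there.
	if err := router.PointTo(ctx, companyID, targetDC); err != nil {
		_ = dst.DeleteAll(ctx, companyID)
		_ = src.UnlockTraffic(ctx, companyID)
		return fmt.Errorf("routing: %w", err)
	}
	if err := dst.UnlockTraffic(ctx, companyID); err != nil {
		return fmt.Errorf("unlock target: %w", err)
	}

	// 5. The source DC copy is cleaned up later, once the migration is confirmed.
	return nil
}
```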

At every step we confirm that the operation was successful, that the number of bytes received from and sent to S3 was as expected, that the number of imported rows in the tables was as planned, and so on. This ensures customer data safety and integrity.

We store exports/backups on S3 as it is relatively cheap, has a built-in TTL mechanism for automatic file cleanup after a period of time (GDPR), and is accessible from any geo location.

The format of exported files is determined by the CDP service so the teams behind each service have freedom and responsibility to come up with any format depending on the data structure they own and the storage type used on their side.

The multi-DC migration orchestrator and its agents don’t modify the contents of backups and act as a proxy between the CDP services and S3, which keeps the memory consumption of agent instances low. An agent does go through the file content received from a CDP or S3, but it only uses chunks of the streamed data to calculate checksums that ensure the integrity of file transfers; it never stores the whole file in memory.
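As a minimal illustration of that streaming approach (not Tanker’s actual code), an agent written in Go could proxy an export while hashing it chunk by chunk:

```go
package tanker

import (
	"crypto/sha256"
	"encoding/hex"
	"io"
)

// streamWithChecksum copies data from src (e.g. a CDP export stream) to dst
// (e.g. an S3 upload writer) without buffering the whole file in memory,
// and returns the number of bytes copied plus the SHA-256 checksum of
// everything that passed through.
func streamWithChecksum(dst io.Writer, src io.Reader) (int64, string, error) {
	h := sha256.New()
	// TeeReader feeds every chunk read from src into the hash as a side effect,
	// so only one chunk at a time is ever held in memory.
	n, err := io.Copy(dst, io.TeeReader(src, h))
	if err != nil {
		return n, "", err
	}
	return n, hex.EncodeToString(h.Sum(nil)), nil
}
```

The resulting byte count and checksum can then be compared with what the CDP reported and what actually landed on S3, which is the kind of integrity check described above.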

Every CDP has standardized, well-documented API endpoints, agreed on beforehand with all teams, for export, import, delete and post-migration operations. This means that every new microservice that deals with and owns customer data should have these migration API endpoints implemented before it goes to production, so that at any point we can provide our customers with reliable migration, backup, removal and restoration flows.
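To give a feel for that contract, the agreed-upon operations could be expressed as an interface like the one below. This is only a sketch: the method names and signatures are assumptions made for the example, while the real contract is a set of standardized API endpoints agreed on with the teams.

```go
package tanker

import (
	"context"
	"io"
)

// CompanyDataProvider sketches the migration contract every CDP service
// is expected to implement before going to production.
type CompanyDataProvider interface {
	// Export streams all data the CDP owns for the given company.
	Export(ctx context.Context, companyID int64) (io.ReadCloser, error)
	// Import loads a previously exported snapshot into the CDP's storage.
	Import(ctx context.Context, companyID int64, snapshot io.Reader) error
	// Delete removes all data the CDP owns for the company (rollback, cleanup, account removal).
	Delete(ctx context.Context, companyID int64) error
	// PostMigration runs CDP-specific steps after a successful import,
	// e.g. reinitializing synchronizations or clearing caches.
	PostMigration(ctx context.Context, companyID int64) error
}
```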

Thoughts for the Future

Tanker and RoRo were the first services we wrote in Golang, and at that time nobody really knew how to use the language properly. That said, there haven’t been many issues, since we do test (a dedicated testing service & unit tests), but the code is pretty complex and not easy to modify or read. The same happened before with NodeJS, which we learned by using it, collecting best practices and experimenting, making services more reliable, self-healing and less prone to mistakes. Nowadays NodeJS is a pretty mature stack at Pipedrive, and we are on the same path with Golang.

We are creating unified/reusable libraries for our Golang ecosystem, and as our expertise grows, I can see significant room for improvement in the data migration services in terms of reducing the technical debt we originally introduced when we rushed into adding more capabilities (restoration, on-demand backups, manual and automatic removal of obsolete account data).

Additionally, we could make backups without the need for a total customer lockdown, but this may be extremely hard to achieve, as data can quickly become out of sync if we do not stop incoming requests during the backup phase.

Our plan is to constantly improve Tanker, RoRo, and the other framework services, especially in terms of reliability, readability, maintainability and visibility. When a migration requires operating with tens of CDP services in parallel and involves connections to multiple DCs, monitoring is always the helping hand in understanding what happened if something goes wrong.

One future use case of the framework could be helping engineers investigate issues/bugs by restoring customer data backups into dedicated temporary environments — testboxes — basically replicating the production environment for a specific customer. This way engineers can perform deeper investigations without breaking customer data consistency or interfering with normal customer activity.

Conclusion

I have shown you only a glimpse of the main flow of the migration process, but in reality it is more advanced and includes additional steps that sometimes, for different CDPs, need to be run before and after the import process completes, e.g. reinitializing some synchronizations, reinstating integrations, clearing caches, etc.

During the years of using the migration framework, we have helped restore many different customer accounts so they could continue from the point where they left us. We’ve moved thousands of them from one DC to another, conducted numerous DB server maintenance operations (yes, we can migrate data between DB servers in the same DC as well), cleaned our infra of unused data and more.

Obviously, developing something like Tanker is a huge investment, but once you master customer data management properly, it opens up possibilities to serve users in ways that were either impossible before or done manually, raising happiness and quality of experience and saving customer support and DevOps engineers’ time through automation.

Happy migrating!

Sergei Kretov

A Principal Lead Engineer @ Pipedrive in CORE/Storage Platform tribe, MSc in Computer Science. Startup lover and technology enthusiast. Proud Bengal cat owner.