Navigating Complex Data Migrations:
Lessons from Capillary’s Multi-Tenant Journey

Bhanu Bobbali
Capillary Technologies
Sep 6, 2023 · 9 min read

Introduction

At Capillary, we’ve recently completed extensive data migrations, encompassing a wide array of products and applications, with minimal downtime. Our journey underscores the imperative for individuals handling data and databases to exercise exceptional caution, resilience, and preparedness.

Several critical factors make data management distinct. To highlight a few crucial points:

Recovery and Rollback Complexity: When dealing with data and databases, the process of reverting to a previous state in case of issues or the introduction of bugs is far from straightforward, especially when compared to the relatively simpler task of rolling out new features. This complexity demands meticulous planning and execution.

Unique Verification Processes: The verification of activities related to data doesn’t neatly fit into the typical Software Development Life Cycle (SDLC). It requires a distinct set of procedures and measures to ensure accuracy, integrity, and security.

Impact on Existing Functionalities: Data activities can have a significant impact, not only on the functionalities of a system but also on its performance. Degradation in performance during data-related operations can be a critical concern that necessitates careful consideration.

Cost Implications: Managing data comes with cost implications that are often underestimated. These include the costs associated with storage, data transfer, maintenance, and so on. A well-thought-out data migration strategy is essential to mitigate these costs.

In this article, we explore our recent experiences in large-scale data migrations, sharing valuable insights for navigating this complex landscape.

Background

Capillary CRM operates on a robust multi-tenant micro-service architecture, encompassing a vast ecosystem of over 100 micro-services that collectively manage an extensive dataset exceeding 100 terabytes. This data is spread across a diverse range of storage technologies, including relational databases (MySQL), NoSQL databases (MongoDB), file storage services (AWS S3), in-memory databases (Redis), and more. Additionally, these databases employ a multi-tenant partitioning approach, with each instance hosting a specific group of tenants. This heterogeneity presents complex challenges in data migration.

At its inception, the system operated on a single-stack infrastructure. However, the expanding clientele necessitated a transition to a multi-clustered architecture to meet evolving demands effectively.

Figure 1: Diagram depicting the multi-clustered mode of the Capillary micro-service architecture

The Capillary CRM application stack is deployed and meticulously maintained across diverse global locations. We diligently assess each tenant’s geographical context and allocate them to the most suitable cluster, customizing their setup accordingly.

Problem statement

Our challenge arises from certain tenants being deployed in application clusters that do not align with the optimal configuration for their specific needs. This mismatch creates unwanted operational situations that can have a detrimental impact on these tenants’ overall customer experience. To rectify this, we undertook the task of relocating these tenants to more suitable clusters. The crux of the challenge lies in orchestrating the seamless transfer of their data across clusters, all while minimizing downtime.

Acceptance Criteria for the data migration

At a broad level, our task involves transferring information from one cluster to another across various databases within the system. As mentioned earlier in this document, the diverse nature of data storage and management presents a considerable challenge in finding a solution for this.

To tackle these difficulties efficiently, we have established three primary acceptance criteria:

High Availability for Migrated Tenants: To minimize downtime for tenants undergoing migration.

No Impact on Non-Migrated Tenants: To ensure that tenants not involved in the migration are not adversely affected.

Cost-Effective Approach: To keep the overall cost of the migration process in check.

High Availability for Migrated Tenants

Problem Statement: Data export and import operations, especially with the massive datasets involved (several terabytes per database server), can result in extended downtime. Our challenge was to devise effective data migration strategies that could either eliminate downtime entirely or minimize it to the greatest extent possible.

Approach: To achieve high availability during migration, we propose building an incremental migration pipeline. This pipeline captures delta data continuously during the data-migration phase. By processing only the delta data, we can significantly reduce downtime. This cycle of capturing and migrating delta data ensures that tenants experience minimal disruption.
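To make this concrete, below is a minimal sketch of what a delta-capture export could look like, assuming each table carries an indexed updated_at column and that the checkpoint of the previous cycle is persisted elsewhere; the class and helper names are illustrative, not our production code.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Timestamp;
import java.time.Instant;

/**
 * Minimal sketch of a timestamp-based delta export (illustrative only).
 * Only rows touched since the previous cycle are exported, which keeps each
 * migration pass small and the final cut-over window short.
 */
public class DeltaExporter {

    public static void exportDelta(Connection source, String table,
                                   Instant lastCheckpoint, Instant currentCheckpoint) throws SQLException {
        String sql = "SELECT * FROM " + table + " WHERE updated_at > ? AND updated_at <= ?";
        try (PreparedStatement ps = source.prepareStatement(sql)) {
            ps.setTimestamp(1, Timestamp.from(lastCheckpoint));
            ps.setTimestamp(2, Timestamp.from(currentCheckpoint));
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    writeRowToDumpFile(rs); // serialize the row to the export file on disk
                }
            }
        }
    }

    private static void writeRowToDumpFile(ResultSet rs) {
        // Placeholder: append the row to the dump that is later copied to S3.
    }
}
```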

No Impact on Non-Migrated Tenants

Problem Statement: Some shared database servers have tenants with low-latency requirements. Export and import jobs on these servers can introduce additional load, potentially impacting the performance of other tenants sharing the server.

Approach: To avoid affecting non-migrated tenants, we need a pipeline capable of exporting and importing data in smaller, manageable batches. This pipeline should also be intelligent enough to pause automatically when database server CPU usage reaches a certain threshold. This way, we can ensure that the migration process does not compromise the overall performance of the database server.
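For illustration, the pause-and-resume behaviour can be as simple as a batch loop that checks a CPU metric before each batch; the metric probe and export step below are hypothetical hooks (in practice the metric could come from CloudWatch, Prometheus, or a similar monitoring source), and the threshold is just an example.

```java
import java.util.concurrent.TimeUnit;

/**
 * Sketch of a batched export loop that backs off while the shared database
 * server is busy, so co-hosted (non-migrating) tenants are not affected.
 */
public class ThrottledExporter {

    private static final double CPU_PAUSE_THRESHOLD = 0.70; // pause above 70% CPU (example value)
    private static final int BATCH_SIZE = 5_000;

    public void run() throws InterruptedException {
        boolean moreData = true;
        while (moreData) {
            // Pause automatically while the server is above the threshold.
            while (fetchDbCpuUsage() > CPU_PAUSE_THRESHOLD) {
                TimeUnit.SECONDS.sleep(30);
            }
            moreData = exportNextBatch(BATCH_SIZE);
        }
    }

    private double fetchDbCpuUsage() { return 0.0; }                 // hypothetical metrics probe
    private boolean exportNextBatch(int batchSize) { return false; } // hypothetical export step, returns true if more rows remain
}
```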

Cost-Effective Approach

Problem Statement: Data export and import activities can be resource-intensive. While completing the migration promptly is essential, we must also be mindful of the associated costs.

Approach: After a thorough analysis of the cluster’s overall traffic patterns and resource usage, we identified specific periods of low activity during the week. Leveraging these windows of opportunity, we can optimize our import and export operations. By employing small batch processing and integrating automatic pause mechanisms, we can significantly reduce cost implications. This strategic approach enables us to strike a balance between efficient migration and cost-effectiveness.
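A sketch of the scheduling guard this implies is shown below; the days and hours are purely illustrative, not our actual low-traffic windows.

```java
import java.time.DayOfWeek;
import java.time.LocalTime;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.util.Set;

/**
 * Illustrative guard that only lets export/import batches run inside
 * pre-identified low-traffic windows.
 */
public class MigrationWindow {

    private static final Set<DayOfWeek> QUIET_DAYS = Set.of(DayOfWeek.SATURDAY, DayOfWeek.SUNDAY);
    private static final LocalTime WINDOW_START = LocalTime.of(1, 0); // 01:00 UTC (example)
    private static final LocalTime WINDOW_END = LocalTime.of(5, 0);   // 05:00 UTC (example)

    public static boolean isOpen(ZonedDateTime now) {
        ZonedDateTime utc = now.withZoneSameInstant(ZoneOffset.UTC);
        return QUIET_DAYS.contains(utc.getDayOfWeek())
                && !utc.toLocalTime().isBefore(WINDOW_START)
                && utc.toLocalTime().isBefore(WINDOW_END);
    }
}
```

Each batch job simply checks MigrationWindow.isOpen(ZonedDateTime.now()) before picking up the next batch and otherwise waits for the next window.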

Architecture

Traditionally, data migrations have been manual, time-consuming endeavors often accompanied by significant downtime, sleepless nights for administrators, and potential disruptions to customer experiences. Recognizing the criticality of these activities and their potential impact on customer satisfaction, we sought to revolutionize the data migration process within Capillary by developing a comprehensive framework.

In the initial stages, we conducted a fundamental assessment of the migration process, revealing that it consisted of multiple discrete steps. Each step had clearly defined success and failure criteria, and the process could be executed iteratively.

Recognizing that this process naturally fell within the realm of workflow orchestration, we embarked on selecting the most suitable tool from a myriad of options.

Camunda emerged as the framework of choice for our needs for several compelling reasons:

The Power of Camunda

Support for Long-Running Processes: Camunda excels in managing long-running processes, even those spanning several months. This capability aligns perfectly with our data migration needs, where complex operations can’t afford to be rushed.

Elegant State Management: One of Camunda’s standout features is its elegant state management. This feature is particularly beneficial for our use case, ensuring that data migration processes can be paused, resumed, and tracked seamlessly.

Efficiency through Generic Tasks: We’ve harnessed the flexibility of Camunda to create a set of generic tasks tailored to our specific needs. These tasks encompass essential actions such as MySQL export, MySQL import, S3 data copying, and data validation. By leveraging the Camunda Modeler, we’ve designed distinct processes for each database, reducing the need for extensive manual coding. This, in turn, minimizes the potential for human error and software bugs.

Proven Within Our Organization: The selection of Camunda as our workflow orchestration engine was also driven by its proven performance in earlier projects within our organization. This prior success with Camunda made it a compelling choice for orchestrating the migration workflow, ensuring continuity and leveraging our existing expertise.

Resilient State Management: Ensuring data integrity and process continuity is paramount. Camunda excels in maintaining state resilience even in the face of application restarts (restarts of the application in which Camunda is embedded). In our context, this is not just a feature but a necessity to guarantee seamless migration execution.

Integration with CI/CD Pipelines: Camunda seamlessly integrates as an embedded engine within our Spring Boot modules, making it an integral part of our Continuous Integration and Continuous Deployment (CI/CD) pipeline (a minimal setup is sketched after this list). This cohesive integration ensures that data migration processes are not only efficient but also well-aligned with our development practices.

Retry Support with Subscription Tasks: Data migration isn’t always a straightforward journey, and sometimes, hiccups occur. Camunda’s subscription tasks empower us to handle retries efficiently, allowing us to recover gracefully from any unexpected setbacks during migration.

Visibility and Monitoring: Camunda provides a robust set of monitoring and tracking tools, enabling us to maintain full visibility into the progress of our migration activities.

Community Support: The active Camunda community meant that we had access to a wealth of knowledge, resources, and best practices, making our implementation smoother and more reliable.
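As referenced in the CI/CD point above, here is a minimal sketch of what embedding the engine in a Spring Boot module looks like, assuming the camunda-bpm-spring-boot-starter dependency is on the classpath; the class name is illustrative.

```java
import org.camunda.bpm.spring.boot.starter.annotation.EnableProcessApplication;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;

// BPMN models packaged under src/main/resources are auto-deployed with the
// application, so workflow changes ship through the same CI/CD pipeline as code.
@SpringBootApplication
@EnableProcessApplication
public class MigrationOrchestratorApplication {
    public static void main(String[] args) {
        SpringApplication.run(MigrationOrchestratorApplication.class, args);
    }
}
```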

Camunda block of execution

The provided example showcases a Camunda workflow process featuring a single block, which has been developed as an integral part of the comprehensive framework.

The “SampleDataExport” activity within the Camunda workflow is a programmatic execution task designed to facilitate the export of data from the database to disk. Upon its successful execution, the workflow progresses forward. However, in the event of a failure, the workflow is configured to backtrack to the “SampleDataExportFailed” activity.

The “SampleDataExportFailed” activity in Camunda is classified as a user-centric task, indicating that it requires manual intervention to initiate the “SampleDataExport” activity once more. Whenever the workflow reaches a user activity, a notification is automatically triggered to the relevant team responsible for taking necessary actions. Once these actions are completed, they can resume the workflow’s progress. This approach ensures efficient management and resolution of any issues encountered during the export process.
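For illustration, a service task such as “SampleDataExport” could be backed by a Camunda JavaDelegate along the lines below; the class, bean, and variable names are placeholders for the sketch, not the actual implementation.

```java
import org.camunda.bpm.engine.delegate.BpmnError;
import org.camunda.bpm.engine.delegate.DelegateExecution;
import org.camunda.bpm.engine.delegate.JavaDelegate;
import org.springframework.stereotype.Component;

/**
 * Illustrative delegate behind a "SampleDataExport" service task. On success
 * the token moves forward; on failure a BPMN error is raised, which the model
 * routes to the "SampleDataExportFailed" user task for manual intervention.
 */
@Component("sampleDataExportDelegate")
public class SampleDataExportDelegate implements JavaDelegate {

    @Override
    public void execute(DelegateExecution execution) {
        String tenantId = (String) execution.getVariable("tenantId"); // hypothetical process variable
        try {
            exportToDisk(tenantId);
            execution.setVariable("exportStatus", "SUCCESS");
        } catch (Exception e) {
            // The error code is matched by an error event in the BPMN model.
            throw new BpmnError("EXPORT_FAILED", e.getMessage());
        }
    }

    private void exportToDisk(String tenantId) {
        // Placeholder: run the actual export (e.g., dump the tenant's tables to disk).
    }
}
```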

Workflow for data migration

After breaking down each unit of the task into individual Camunda tasks, the overall migration workflow can be outlined as follows:

Start Data migration: Initiates the migration process by marking the data export timestamp on the source cluster.

Delta Dump: Exports the data from the source cluster.

S3 Copy: Copies the exported data to an Amazon S3 storage location.

S3 to Disk: Transfers the data from S3 storage to the target cluster.

Approve Delta-Restore: A manual approval task that grants control over when to proceed with the data restoration.

Delta Restore: Restores the data onto the target cluster.

Version Up: Marks the completion of data export and import, initiating the next cycle of data export and import.

Trigger Next Delta Workflow: Initiates the subsequent delta workflow (a code-level sketch of this step follows the diagram).

The Camunda BPMN diagram for this workflow is shown below.
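Beyond the diagram, the final “Trigger Next Delta Workflow” step could be implemented as a delegate that simply starts a fresh instance of the delta process, as sketched below; the process key and variable names are placeholders.

```java
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;
import org.camunda.bpm.engine.RuntimeService;
import org.camunda.bpm.engine.delegate.DelegateExecution;
import org.camunda.bpm.engine.delegate.JavaDelegate;
import org.springframework.stereotype.Component;

/**
 * Illustrative delegate behind "Trigger Next Delta Workflow": it starts a new
 * instance of the delta process so export/import cycles keep running until
 * the final cut-over.
 */
@Component("triggerNextDeltaDelegate")
public class TriggerNextDeltaDelegate implements JavaDelegate {

    private final RuntimeService runtimeService;

    public TriggerNextDeltaDelegate(RuntimeService runtimeService) {
        this.runtimeService = runtimeService;
    }

    @Override
    public void execute(DelegateExecution execution) {
        Map<String, Object> variables = new HashMap<>();
        variables.put("tenantId", execution.getVariable("tenantId"));         // carried over from the current cycle
        variables.put("exportMarkedAt", Instant.now().toString());            // next cycle's export timestamp
        runtimeService.startProcessInstanceByKey("deltaMigration", variables); // hypothetical process key
    }
}
```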

Highlights of Camunda’s role in the overall pipeline

Utilizing Camunda Service Tasks: We employ Camunda service tasks to perform programmatic activities, including MySQL export, MySQL import, S3 file copying, and more.

Asynchronous Processing: To handle potentially long-running tasks, we implement asynchronous processing for these activities. Additionally, we incorporate subscription activities that await the completion of their corresponding programmatic tasks (see the correlation sketch after this list).

Outcome-Driven Workflow: The progression to the next workflow activity depends on the successful or failed execution of the preceding activity.

Manual Approval for Failures: In case of a task failure, it is routed to a manual approval step, requiring human intervention. Upon approval, the task is automatically retried.

Failure Alerts and Auto-Retry: We have robust alerting mechanisms in place to promptly notify us of any task failures. During application restarts, in-flight tasks are automatically marked as failed and redone from the last successful batch.

Idempotent Pipeline: We ensure the idempotency of the entire pipeline by maintaining batch IDs for both export and import operations. This enables activities to restart from the last successfully completed batch. Any potential duplicates are mitigated through techniques like ‘insert ignore’ or ‘replace’ for MySQL export/import activities, as well as similar constructs in other databases.
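To illustrate the asynchronous pattern from the list above: the workflow waits at a subscription (receive) task while the long-running job executes elsewhere, and the job signals completion by correlating a message back to its process instance. The message name and variables below are placeholders.

```java
import org.camunda.bpm.engine.RuntimeService;
import org.springframework.stereotype.Component;

/**
 * Illustrative completion callback for an asynchronous export job. The
 * workflow instance is parked at a receive task subscribed to
 * "ExportCompleted"; correlating that message moves the token forward.
 */
@Component
public class ExportCompletionNotifier {

    private final RuntimeService runtimeService;

    public ExportCompletionNotifier(RuntimeService runtimeService) {
        this.runtimeService = runtimeService;
    }

    public void notifyExportDone(String processInstanceId, long lastBatchId) {
        runtimeService.createMessageCorrelation("ExportCompleted") // hypothetical message name
                .processInstanceId(processInstanceId)
                .setVariable("lastSuccessfulBatchId", lastBatchId)
                .correlate();
    }
}
```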
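And a sketch of the idempotent import side, with a hypothetical batch-tracking step and INSERT IGNORE to absorb re-runs; the table and column names are purely illustrative.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

/**
 * Sketch of an idempotent import step: each batch is identified by a batch id,
 * already-imported batches are skipped, and rows are written with INSERT IGNORE
 * so re-running a batch after a crash or retry does not create duplicates.
 */
public class IdempotentImporter {

    public void importBatch(Connection target, long batchId, List<Object[]> rows) throws SQLException {
        if (isBatchAlreadyImported(target, batchId)) {
            return; // safe to skip: this batch completed in a previous attempt
        }
        String sql = "INSERT IGNORE INTO loyalty_points (id, tenant_id, points) VALUES (?, ?, ?)"; // illustrative table
        try (PreparedStatement ps = target.prepareStatement(sql)) {
            for (Object[] row : rows) {
                ps.setLong(1, (Long) row[0]);
                ps.setLong(2, (Long) row[1]);
                ps.setInt(3, (Integer) row[2]);
                ps.addBatch();
            }
            ps.executeBatch();
        }
        markBatchImported(target, batchId);
    }

    private boolean isBatchAlreadyImported(Connection c, long batchId) { return false; } // lookup in a batch-tracking table
    private void markBatchImported(Connection c, long batchId) { }                       // record batch completion
}
```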

Key takeaways

Prioritizing Automation for Large-Scale DB Activities: In the past, we often skipped automating database activities due to limited scope and simple success/failure checks. However, for the current activity that spans all 100+ microservices, a robust data validation layer is essential. Our approach stemmed from defining the necessary validations.

Unlocking Potential with Incremental Data Migration: Incremental data migration, including delta data tracking, proves invaluable for various scenarios, especially when minimizing downtimes is crucial. We successfully employed this approach to shard a multi-terabyte database with zero downtime.

Prioritizing Automation of Retries: Failures are inherent in systems, especially those handling bulk or batch jobs. To effectively tackle the complexities of large-scale data operations, it is crucial to emphasize idempotency during retry scenarios and automate the retry processes.

Throughput adjustment controls: To optimize large-scale data operations, it’s vital to implement throughput adjustment controls. These controls serve to minimize the impact on the system. Building mechanisms for controlling throughput through efficient batching and automated pause-and-resume functionality is crucial. These mechanisms should be triggered by predefined system thresholds, ensuring smoother and more efficient data processing.

These strategies are pivotal for handling extensive data tasks effectively.

Conclusion

Capillary’s journey through multi-tenant data migrations highlights the critical importance of meticulous planning, innovative solutions, and cost-effective strategies. The adoption of a business process management tool as a comprehensive framework demonstrates the power of automation and resilience in managing complex data operations. These lessons serve as a valuable guide for organizations navigating the challenges of data management, emphasizing the need for careful execution and efficient processes in an ever-evolving digital landscape.
