12 Common Pitfalls to Avoid: Your Ultimate Guide to Successful Data Platform and Analytics API Migration

Sameeksha Bhatia
helpshift-engineering
7 min read · Jun 19, 2023

Sharing key learnings on executing the migration of analytics API services.

Introduction

Helpshift offers data insights to its customers through a variety of channels, including Dashboard Analytics, Power BI, and the Analytics API. Recently, the data team at Helpshift undertook a mammoth migration of its data platform from a Hadoop-based on-premise environment to a managed cloud environment on AWS.

Migration can be a difficult and time-consuming task that demands careful planning and execution. There is always a possibility of something going wrong or being missed during the migration process.

However, with proper planning and execution, the migration can be a success. In this post, I’ll share my journey and some of the main takeaways from my experience with the data platform and analytics API migration at Helpshift.

This was my first time being part of a migration project. Getting started was relatively tough: a lot goes through your mind, and it took me a week to figure out where to begin. One has to understand the legacy code, decide what to migrate, and estimate the work involved. It was a lot of research and learning: asking the right questions, understanding the business use case, and trying to strike the right balance between understanding the system and making tangible progress at the same time.

Plan thoroughly

Planning is a crucial stage in any project, particularly when it comes to migrations. Spend a good amount of time understanding the system and seeking answers to the right questions.

To illustrate, let's consider one of the Power BI reports we started our migration with. This report displays metrics fetched from our data warehouse. The first thing to establish is that every metric currently served from the old infrastructure (the Hadoop system) should point to the new infrastructure (AWS) after the migration. Second, the metric numbers must be consistent before and after the migration.

To proceed with these goals in mind, you should seek answers to key questions such as:

  • What does this metric signify?
  • How is this metric calculated in the existing infrastructure pipeline?
  • Which other services/dashboards/APIs use/show this metric?
  • Which job is responsible for populating this metric on the dashboard?
  • How frequently is the job run (daily, weekly, monthly)?
  • What is the query to fetch data from the source table?
  • Which table is this metric being fetched from?

The next step would be to create a data mapping table where you gather details about source and destination schemas.

At this stage, you should check the SQL query, the parameters used, and the output formats. In the process, you find out whether any columns are missing from the target table.
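For example, the mapping can start as something as simple as a dictionary (or a spreadsheet) recording, for each metric, its source and target tables and how each column maps across. The table names, column names, and types below are purely illustrative placeholders, not Helpshift's actual schema:

```python
# Hypothetical source-to-target mapping for a single metric table.
# All table/column names and types here are illustrative placeholders.
METRIC_MAPPING = {
    "metric": "daily_issue_volume",
    "source_table": "hive.analytics.issue_metrics_daily",       # old Hadoop warehouse
    "target_table": "warehouse.analytics.issue_metrics_daily",  # new AWS warehouse
    "refresh_schedule": "daily",
    "columns": {
        # source column       -> (target column, source type -> target type)
        "issue_count":        ("issue_count", "BIGINT -> BIGINT"),
        "created_dt":         ("created_ts", "DATE 'yyyy-mm-dd' -> TIMESTAMP"),
        "avg_first_response": ("avg_first_response_secs", "DOUBLE -> DOUBLE"),
    },
}
```

Anything that does not map one-to-one (a missing column, a type change, a different date format) is something you want to surface at this stage, not during the release.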

It is also crucial to gain a comprehensive understanding of the legacy codebase and its architectural design, including the reasoning behind why the metric was calculated in a particular way in the first place.

Once you have the answers to these questions, you should document everything in a development specification document and have it reviewed by your team. This ensures everyone is on the same page and helps to avoid misunderstandings during the migration process.

Project Planning and Estimation

To provide ballpark estimations for a migration project, it’s important to have a good understanding of the tasks involved. One effective way to achieve this is by reviewing the code base and identifying what changes or additions need to be made. Once you have a clear idea of the scope of work, you can determine the best approach and brainstorm with your team if necessary.

Consider whether a proof of concept (POC) needs to be completed to validate your approach. Establishing clear goals and milestones for the project will help you stay on track and ensure that everyone is aligned.

To break down the migration into manageable tasks, consider creating a Jira epic and dividing it into smaller tasks. By doing so, you can more accurately estimate the timeline for completing the project.

The goal is to set realistic timelines, prevent burnout, and maintain productivity throughout the process.

If you are unsure which of the many tasks to start with, clarify your priorities first.

Common Pitfalls

Here are some points to keep in mind when you start implementing and executing the migration.

  1. Avoid unnecessary changes: First, focus on changing only the code that needs to change, and avoid incidental modifications such as renaming columns or altering the output format. For example, if the source schema's column names, API response format, data types, and date formats already match the target (new infra), it is better to keep them as they are.
  2. Create a connection framework: To streamline the migration, create a connection framework that generalizes the connection to the data warehouse. This keeps secrets and connection logic in one place, and every service connects through this library, making the migration more efficient. We created a new library that connects to and fetches data from the data warehouse, and each API service talks to the warehouse by calling this library (see the sketch after this list).
  3. Check date formats: Always check the date formats in your queries and ensure that they match the new platform’s requirements. This common issue can cause queries to fail during the migration process. As an example, the date format in the old system could be ‘yyyy-mm-dd’ whereas in the new system it could be a timestamp of the format ‘yyyy-mm-ddThh:mm:ss’.
  4. Check data consistency and accuracy: During the migration, ensure the data remains consistent and accurate. For example, when migrating a metric served by an API, capture the API response beforehand and keep it as a reference to compare against after the migration (see the consistency-check sketch after this list).
  5. Keep stakeholders informed: Communication is key during the execution phase of a data platform migration. Keep stakeholders informed about the progress of the migration, and communicate any issues or delays as early as possible.
  6. Test: Before executing the migration, it is important to test your migration strategy thoroughly. You can create a test environment to simulate the migration process and test your pipeline, queries, and data flows to identify and resolve any issues that may arise during the actual migration.
  7. Back up your data: Always back up your data before starting the migration. This ensures you can recover the data in case of any unexpected issues or errors during the migration process.
  8. Slow rollout: Determine the traffic and usage for the metric being migrated. Prepare a list of customers and roll the migration out in batches, starting with a customer that has low usage of that metric. This is just one example; the overall idea is to release your changes in a slow and steady manner so that any unexpected errors or failures during the migration do not impact all customers.
  9. Prepare a release checklist: On the day of the production release, you don't want to be figuring out what to do next and searching for it. Be prepared with a list of tasks: the prerequisites of the release, the release steps themselves, and how you will verify that everything is working correctly once the release is done.
  10. Monitor and optimize performance: Once the migration is complete, monitor the performance of the new platform and optimize it. One important check is query performance: look at the query plans, and optimize your queries if a large amount of data is scanned on every run or a lot of data is shuffled just to sort results.
  11. Set up alerts: You want to know about pipeline failures, delays in daily pipeline runs, long-running queries, CPU usage spikes, costs incurred by your EC2 instances, and so on. Alerts should be in place for all such situations.
  12. Data quality: One of the most crucial steps is to check for data quality issues. You don't want to be the last to hear from your stakeholders about a data mismatch or correctness problem. Set up an entropy (discrepancy) detection system that compares numbers between the old and new pipelines; the consistency-check sketch after this list is a starting point for this as well.
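
To make the connection framework (pitfall 2) concrete, here is a minimal sketch of what such a shared library could look like, assuming the new warehouse speaks a standard Python DB-API driver and that credentials are injected via environment variables. The module, variable, and function names are all hypothetical:

```python
# warehouse_client.py -- minimal sketch of a shared connection library.
# Driver choice (psycopg2) and the DW_* environment variables are assumptions;
# a real setup would likely pull secrets from a secrets manager instead.
import os
from contextlib import contextmanager

import psycopg2  # any DB-API-compatible driver for the new warehouse


@contextmanager
def warehouse_connection():
    """Yield a warehouse connection, keeping secrets and connection logic in one place."""
    conn = psycopg2.connect(
        host=os.environ["DW_HOST"],
        port=int(os.environ.get("DW_PORT", "5439")),
        dbname=os.environ["DW_DATABASE"],
        user=os.environ["DW_USER"],
        password=os.environ["DW_PASSWORD"],
    )
    try:
        yield conn
    finally:
        conn.close()


def fetch_rows(query, params=()):
    """Run a read-only query and return all rows.

    API services call this instead of opening their own connections.
    """
    with warehouse_connection() as conn:
        with conn.cursor() as cur:
            cur.execute(query, params)
            return cur.fetchall()
```

Because every API service goes through this one module, swapping drivers, rotating secrets, or pointing at a different warehouse later means changing a single library rather than every service.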

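Similarly, the consistency and data-quality checks (pitfalls 4 and 12) can start as a small script that queries the same metric from both the old and the new pipeline for one customer and date range and diffs the numbers. The endpoints, field names, and tolerance below are made-up placeholders:

```python
# compare_pipelines.py -- sketch of a before/after consistency check.
# URLs, parameter names, response shape, and the tolerance are hypothetical.
import requests

OLD_API = "https://old-analytics.example.com/v1/metrics"  # Hadoop-backed service
NEW_API = "https://new-analytics.example.com/v1/metrics"  # AWS-backed service
TOLERANCE = 0.001  # allow tiny rounding differences, nothing more


def fetch_metrics(base_url, customer_id, start, end):
    # Pitfall 3 applies here: the old system may expect plain 'yyyy-mm-dd' dates
    # while the new one expects ISO timestamps, so normalise the params per system.
    resp = requests.get(
        base_url,
        params={"customer": customer_id, "from": start, "to": end},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()  # assumed shape: {"metric_name": numeric_value, ...}


def compare(customer_id, start, end):
    old = fetch_metrics(OLD_API, customer_id, start, end)
    new = fetch_metrics(NEW_API, customer_id, start, end)
    mismatches = []
    for metric, old_value in old.items():
        new_value = new.get(metric)
        if new_value is None:
            mismatches.append(f"{metric}: missing in new pipeline")
        elif abs(old_value - new_value) > TOLERANCE * max(abs(old_value), 1):
            mismatches.append(f"{metric}: old={old_value}, new={new_value}")
    return mismatches
```

Running a check like this per customer before flipping them over pairs naturally with the slow rollout (pitfall 8), and the mismatch list is an obvious thing to wire into your alerting (pitfall 11).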
As a closing thought, be prepared for unexpected challenges. Things might at times go off plan, and that's fine and understood; what is expected from you is to communicate any challenges or issues that you foresee causing delays or problems, and to keep all the stakeholders in the loop.

During the process, if you discover new information about the legacy codebase or there is a change of approach or plan, document and update it in your dev spec document. This helps you (and others) understand later why a certain decision was made.

Lastly, do not forget to take frequent water breaks, relax and chill. Things will definitely work out in the end.
