Avoiding the pitfalls of a data migration project

Krupesh Desai
Data Migration Plan
8 min read · Jan 9, 2022

Data is of significant value to the organisation that owns it, so considerable care must be taken to migrate legacy data into the new system accurately. In the previous blog, I described the following unique challenges of data migration projects:

  • Legacy data is unfamiliar terrain
  • Limited development time
  • Delays in decision-making from the business
  • Restricted execution time
  • Poor data quality
  • Lack of rigorous testing
  • Impact on data warehouse and BI

Solution

The primary objectives of a data migration project are to extract the data from the legacy system, transform it into the desired target data model, and finally load it into the new information system. However, an agile and proactive solution model is needed to overcome the unique challenges of data migration. This blog summarises the coherent data migration solution used by Venus Informatics, which mitigates the risks stemming from the problems stated above.

Venus Informatics offers a proven solution to overcome the above challenges of data migration projects. Based on lessons learned from my experience and on data migration best practices recommended by veterans and industry experts, it comprises the following key components:

  • Develop with near production data
  • Migrate along logical data partitions
  • Measure migration quality
  • Regular quality reports
  • Agile and robust processing
  • Data cleansing
  • Incremental ETL
  • Living documentation

Develop with near production data

Database models of the legacy and target systems are significantly different. The documentation (data dictionary) of the legacy system is not completely reliable, which often forces assumptions into the migration code. In such a scenario, developing a data migration system against a stale copy of production data may not cover all cases and outliers in the business data. On go-live day, the data migration code should run without any issue to ensure the integrity and consistency of the migrated data. Leaving any particular case in the legacy data untested can break the migration code or prevent some legacy data from being migrated.

Developing and testing data migration code from the beginning with a frequently refreshed (daily or weekly) clone of production data overcomes this issue by rapidly revealing the data semantics of the legacy system to data migration developers. It ensures that glitches are detected at a very early stage of development and testing. Developing with near-production data highlights any missing knowledge about individual cases in the business data, prevents wrong assumptions, provides sufficient test data, and gives a good estimate of the execution time at go-live.
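
As an illustration, the refresh itself can be a small scheduled script. The sketch below assumes a PostgreSQL legacy database and uses invented host, database and file names; it is only one way to automate the daily clone.

```python
# refresh_dev_clone.py - nightly refresh of a development clone from production.
# A sketch only: connection strings, paths and scheduling are assumptions.
import subprocess
from datetime import datetime

PROD_DSN = "host=prod-db dbname=legacy_erp user=readonly"      # hypothetical
DEV_DSN = "host=dev-db dbname=legacy_erp_dev user=migration"   # hypothetical
DUMP_FILE = f"/backups/legacy_{datetime.now():%Y%m%d}.dump"

def refresh_clone() -> None:
    # Take a compressed snapshot of production (read-only access is sufficient).
    subprocess.run(["pg_dump", f"--dbname={PROD_DSN}", "-Fc", "-f", DUMP_FILE], check=True)
    # Restore over the development copy, dropping objects that already exist.
    subprocess.run(["pg_restore", "--clean", "--if-exists", "--no-owner",
                    f"--dbname={DEV_DSN}", DUMP_FILE], check=True)

if __name__ == "__main__":
    refresh_clone()
```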

Migrate along logical data partitions

Although the data models and the capabilities of the legacy and target systems cannot be matched one-to-one, their domains always resemble each other. Therefore, data migration code should first migrate the master data (more inserts, fewer updates), followed by the transactional data. For example, when migrating legacy order system data, the migration code should first migrate the customers and products (master data), followed by the historical order data.

Migrating data along logical partitions enables rapid and continuous, yet manageable, development of data migration code. It reduces the overall testing effort while improving the quality of tests, as each step of the migration can be tested individually. It also enables early testing of migrated data as soon as the data migration code for a logical data partition is ready. For example, business system experts can start comparing legacy customer data with migrated customer data in the test target environment while the data migration team is working on migrating the products and orders data.
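
As a sketch, the partition order can be made explicit in the orchestration code. The migrate_* functions below are hypothetical placeholders for the order-system example above.

```python
# A sketch of sequencing the migration along logical partitions;
# the migrate_* functions are placeholders, not real implementations.
from typing import Callable, List, Tuple

def migrate_customers() -> None:
    pass  # master data: customers

def migrate_products() -> None:
    pass  # master data: products

def migrate_orders() -> None:
    pass  # transactional data: historical orders

# Master data first, transactional data last.
MIGRATION_STEPS: List[Tuple[str, Callable[[], None]]] = [
    ("customers", migrate_customers),
    ("products", migrate_products),
    ("orders", migrate_orders),
]

def run_migration() -> None:
    for name, step in MIGRATION_STEPS:
        print(f"Migrating partition: {name}")
        step()  # each partition can be handed to business experts for testing once done

run_migration()
```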

Measure migration quality

The transformation part of the migration code may incorrectly modify the data or a subset of the data, which can either prevent the wrongly transformed data from being migrated or, in the worst case, load corrupted data. Data corrupted by migration code may remain hidden from developers. Therefore, assessing the quality of the data migration code after each test execution is vital. This quality can be assessed by logging the necessary information at run time. For example, data migration code should capture key indicators, such as the number of records extracted, transformed and loaded, for every target table. Any difference between the number of records extracted and loaded should trigger further investigation. Migration code should also keep a log of any failures in each stage of the ETL pipeline.
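
A minimal sketch of such run-time bookkeeping is shown below; the class name and the reconciliation rule (extracted = loaded + failed) are assumptions rather than a prescribed implementation.

```python
# A sketch of per-table ETL counters so discrepancies surface immediately after a run.
from dataclasses import dataclass

@dataclass
class EtlMetrics:
    table: str
    extracted: int = 0
    transformed: int = 0
    loaded: int = 0
    failed: int = 0

    def reconciles(self) -> bool:
        # Every extracted record must end up either loaded or explicitly logged as failed.
        return self.extracted == self.loaded + self.failed

def report_run(metrics: list) -> None:
    for m in metrics:
        status = "OK" if m.reconciles() else "INVESTIGATE"
        print(f"{m.table}: extracted={m.extracted} loaded={m.loaded} "
              f"failed={m.failed} -> {status}")
```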

Assessing the quality of migration code from the beginning facilitates a quick feedback cycle, which in turn validates assumptions, reveals potential data clean-up in the legacy system, and ultimately mitigates the risk of the data migration. After only a few iterations, it builds trust in the data migration process and ensures the integrity and consistency of the migrated data.

Regular quality reports

No matter how expert a data migration developer is, she needs input from business system analysts or subject matter experts managing the legacy system. Therefore, closely involving business system experts in the data migration effort is necessary. Input, feedback and anecdotes about the legacy system from subject matter experts are critical for forming business rules and transformation logic and for validating assumptions. Providing business system experts, who usually have little time to spare, with an aggregated view of data migration results can help them highlight key issues and share valuable insights.

The data migration quality report should include key statistics about the overall status of the data migration: the number of records migrated successfully and the failures due to poor data quality or technical errors. Regularly sending a data migration quality report to key business system experts helps them share their domain expertise productively with the data migration team and evaluate any potential risk associated with the data migration.
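
As one possible approach, the run-time metrics can be flattened into a simple CSV report after every test run and shared with the business system experts; the column layout and numbers below are purely illustrative.

```python
# A sketch only: the metrics structure, file path and figures are invented.
import csv

def write_quality_report(metrics: list, path: str = "migration_quality_report.csv") -> None:
    """Write one row per target table: counts plus a success rate for quick scanning."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["table", "extracted", "loaded", "failed", "success_rate"])
        for m in metrics:
            rate = m["loaded"] / m["extracted"] if m["extracted"] else 0.0
            writer.writerow([m["table"], m["extracted"], m["loaded"], m["failed"], f"{rate:.1%}"])

# Example usage with invented numbers:
write_quality_report([
    {"table": "customer", "extracted": 12000, "loaded": 11950, "failed": 50},
    {"table": "order", "extracted": 250000, "loaded": 249100, "failed": 900},
])
```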

Agile and robust processing

More often than not, the target system is still under active development while the data migration code is being built. Although the majority of the data model is defined, development of new features and major bug fixes in the target system can alter the structure of the target data model.

Data migration code should be flexible enough to accommodate this continuous development and at the same time robust enough to protect its execution from unexpected failures. Thus, implementing exception handling to contain and log all types of problematic data is highly recommended.
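
A minimal sketch of such row-level exception handling is shown below, with hypothetical transform_row and load_row functions; the point is that a problematic record is logged and skipped rather than halting the whole run.

```python
# A sketch only: transform_row, load_row and the reject-log file name are placeholders.
import json
import logging

logging.basicConfig(filename="migration_rejects.log", level=logging.WARNING)

def process_rows(rows, transform_row, load_row):
    """Load what can be loaded; log and skip problematic records instead of aborting."""
    loaded, rejected = 0, 0
    for row in rows:
        try:
            load_row(transform_row(row))
            loaded += 1
        except Exception as exc:
            rejected += 1
            # Keep enough detail to investigate and replay the record later.
            logging.warning("Rejected row %s: %s", json.dumps(row, default=str), exc)
    return loaded, rejected
```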

Data cleansing

Stricter data quality checks in the target system prevent data of poor quality from being migrated. For example, the customer phone number field is a free-text field in the legacy system and thus contains alphanumeric values in many cases, whereas the analogous field in the target system only allows digits, brackets and hyphens. Manual data cleansing is required for such instances. However, dirty data caused by repetitive data entry glitches should be identified and fixed on the fly by blending data cleansing scripts into the data migration code.
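
For the phone number example above, a cleansing routine blended into the migration code might look like the sketch below; the allowed character set follows the target system rule, while the function name and the decision to treat an emptied value as missing are assumptions.

```python
# A minimal sketch of on-the-fly cleansing for the phone number field.
import re
from typing import Optional

DISALLOWED = re.compile(r"[^0-9()\-]")  # anything except digits, brackets and hyphens

def cleanse_phone(raw: Optional[str]) -> Optional[str]:
    if raw is None:
        return None
    cleaned = DISALLOWED.sub("", raw.strip())
    return cleaned or None  # emptied by cleansing -> treat as missing and flag for manual review

# e.g. cleanse_phone("04-1234-5678 mob.") -> "04-1234-5678"
```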

The highest possible data quality is a primary prerequisite for any data analytics task, and a data migration project brings a golden opportunity to improve the overall data quality of the organisation. Apart from the cleansing mandated by stronger data quality constraints in the target system, it is recommended to define data quality rules for the other data elements within the scope of the migration. Data that violates a pre-defined data quality rule can then be cleansed or flagged during the data migration.

Incremental ETL

Data migration execution time is restricted because it is not desirable to keep business processes offline for a long interval while the data migration is running. Therefore, an incremental approach should be considered, in which an initial heavy data migration load is performed before go-live. After the initial load, only the data that has changed is migrated before the target system goes live.

Incremental ETL can drastically reduce the downtime of the legacy system. However, in some cases the incremental approach may not be feasible due to technical limitations in the legacy system. It is also not worth the effort for a simpler migration where the data in scope is limited, relatively small, and takes only a few minutes or hours to migrate successfully.
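
A watermark-based delta extract is one common way to implement the incremental load, assuming the legacy tables carry a reliable last-modified timestamp and a DB-API driver such as psycopg2; the sketch below is illustrative rather than a description of any particular project.

```python
# A sketch of a watermark-based delta extract; table and column names are invented.
from datetime import datetime

def extract_delta(conn, table: str, watermark: datetime):
    """Return only the rows changed since the last successful migration run."""
    with conn.cursor() as cur:
        cur.execute(
            f"SELECT * FROM {table} WHERE last_modified > %s ORDER BY last_modified",
            (watermark,),
        )
        return cur.fetchall()

# Usage pattern: run the full initial load well before go-live, record the watermark,
# then run one or more short delta passes (extract -> transform -> load) during cut-over.
```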

Living documentation

Documentation is crucial to the ongoing success of any data migration project but is often overlooked, mostly due to lack of desire and resource constraints. Accurate and complete documentation of the data migration can prove to be a very useful resource for the business, in particular the BI/reporting team. During a data migration project, it is common to find business system experts and super users anxious about missing a critical field in the migration, mostly fields used for managing exceptional cases in ways completely different from what is documented in the data dictionary. Such scenarios are common when the legacy system is timeworn. On the other hand, the BI/reporting team of an organisation with an existing reporting infrastructure is always curious about the new data model of the target system, because all current SQL procedures used in building the data warehouse and operational reports must be modified with the new field and table names immediately after go-live.

Maintaining living documentation provides visibility of the process and progress of the data migration project to the entire organisation. It improves the response time of business system experts in validating field and lookup data mappings and in providing data migration developers with business rules for data quality and for migrating outliers. Initially, a living document can simply be a set of Excel files, one per target table (or upload template), which can gradually move to a collaborative platform for wider visibility and amendments. Data migration documentation should include comprehensive information about the ETL data pipelines, such as the target field name, the mapped legacy field(s) and table name(s), the extract logic, the data quality rule, the business rules for handling special cases, and a boolean field denoting whether mapping of lookup data is required. Such documentation, for every target field in a particular target table, also provides data migration developers with a code outline, which in turn optimises the development time.
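
A living document of this shape can also be kept machine-readable, so the same mapping entries drive both the documentation and the generated code outline. The structure below mirrors the columns described above; the example values are invented.

```python
# A sketch of a machine-readable mapping entry; all example values are hypothetical.
from dataclasses import dataclass

@dataclass
class FieldMapping:
    target_field: str
    legacy_source: str           # legacy table(s) and field(s)
    extract_logic: str           # SQL snippet or rule describing the extract
    data_quality_rule: str
    business_rule: str
    lookup_mapping_required: bool

example = FieldMapping(
    target_field="customer.phone",
    legacy_source="CUST_MASTER.PHONE_NO",
    extract_logic="TRIM(PHONE_NO)",
    data_quality_rule="digits, brackets and hyphens only",
    business_rule="values emptied by cleansing are flagged for manual follow-up",
    lookup_mapping_required=False,
)
```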

Conclusion

Often the data migration project, with its unique challenges and multiple stakeholders, is part of a bigger organisation-wide change. Therefore, a holistic and coherent approach is required to deliver a reliable solution that is visible to all stakeholders. Venus Informatics follows the solution approach stated above to deliver an end-to-end data migration service covering all stages of data migration projects: consulting, building data migration code, evaluating data quality against agreed measures and fixing bad data, developing new ETL processes for the existing data warehouse, and post-go-live support.

Edit Sep 2023:

Thank you for the claps. I wrote this content on data migration (two blogs) after successfully leading a major migration project in the public healthcare domain in 2018. Since then, I have worked on two more migration projects in the same sector. After my third successful migration, I prepared an outline for a white paper on the data migration plan. I will cover each component of the data migration plan in the next blogs on data migration.
