Modernize the Legacy — Software Archaeology — Data

Thilo Hermann · Published in CodeX · Aug 19, 2021 · 5 min read

As already stated in my last blog about “Software Archaeology” (see link), I’ll take a closer look at another dimension: Data!

Archaeologists typically study the material remains of the past to understand how people lived. To achieve this, archaeologists ask questions and develop hypotheses. They choose a dig site and observe, record, categorize, and interpret what they find.

For legacy data in IT it’s the same: you must analyze and interpret the data models and the implicit semantics in the actual data to fully understand the meaning of the persisted data. To do this you can conduct interviews, reverse engineer the data model, and interpret what you find on that basis. As data is typically the treasure of a business, you must be extremely careful when working with it!

In the process of replacing a legacy application with a new one, data migration (moving the data from A to B) is an essential step.
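To make the idea concrete, here is a minimal extract-transform-load sketch in Python, using two in-memory SQLite databases as stand-ins for the legacy and the new system. The customer table, its columns, and the date formats are invented for illustration and not taken from a real migration.

```python
import sqlite3

# Hypothetical example: "legacy" and "target" stand in for the real
# source and destination databases of a migration.
legacy = sqlite3.connect(":memory:")
target = sqlite3.connect(":memory:")

legacy.execute("CREATE TABLE customer (id INTEGER, name TEXT, born TEXT)")
legacy.executemany("INSERT INTO customer VALUES (?, ?, ?)",
                   [(1, "Alice", "1980-05-01"), (2, "Bob", "31.12.1975")])

target.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, birth_date TEXT)")

def transform(row):
    """Map a legacy row to the new model, normalizing the date format."""
    cid, name, born = row
    if "." in born:                      # assumed legacy format DD.MM.YYYY
        d, m, y = born.split(".")
        born = f"{y}-{m}-{d}"
    return cid, name, born

# Extract -> Transform -> Load
for row in legacy.execute("SELECT id, name, born FROM customer"):
    target.execute("INSERT INTO customer VALUES (?, ?, ?)", transform(row))
target.commit()

print(target.execute("SELECT * FROM customer").fetchall())
```

In a real migration the connections, the mapping rules, and the error handling would of course be far more involved, but the three steps stay the same.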

With all this said, we can have a look at typical challenges during data migration.

Challenges

  • Unclear scope: Once you start with the migration, it’s quite common that several additional data sources are identified over time. This leads to additional complexity and effort. On top of that, the dependencies and sovereignty of the data are often not well understood, which can lead to additional, unplanned tasks.
  • Lacking knowledge about the legacy data model: Often reverse engineering is the only way to get an understanding of the existing data model in the legacy system. Thus, the knowledge is very limited and implicit assumptions are really hard to discover. Another possibility is to ask the legacy experts, provided they have not already retired or moved out of reach.
  • Data quality is not high enough: In legacy applications the quality of the persisted data is often a challenge. Over time requirements change, and this often leads to quality issues because provisional solutions are implemented (e.g. re-use of existing data fields for different purposes, encoding structured data in text fields, …); a small profiling sketch for spotting such overloaded fields follows this list.
  • Semantic dependencies are not well understood: The existing data models often include implicit semantics. The documentation is poor or even nonexistent, and the experts might not remember all the details. Reverse engineering in this case is tricky and time consuming. Often the data evolves over time, i.e. the implicit semantics hold true only for parts of the persisted data.
  • New plausibility checks are strict: Strict validations on data are often specified and implemented in new applications. This can lead to issues during the import into the new database. The old data must comply with those rules, and an automatic enhancement/correction is not always possible. This may break the import and thus stop the migration!
  • Runtime: The migration imposes downtime and thus needs to be scheduled in advance. The ETL processes will take time, and often the given time frame is exceeded. You need to know your time frame and implement, measure, and optimize the ETL processes accordingly.
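As mentioned under data quality, a quick profiling pass can reveal re-used or overloaded fields before they derail the migration. The following sketch is only an illustration under assumed names (a contract table with a note column); the classification heuristics would have to be adapted to the actual legacy data.

```python
import re
import sqlite3
from collections import Counter

# Hypothetical legacy table with a re-used free-text field.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE contract (id INTEGER, note TEXT)")
con.executemany("INSERT INTO contract VALUES (?, ?)", [
    (1, "priority=HIGH"),                 # structured data hidden in a text field
    (2, "customer called on 2020-01-15"),
    (3, '{"discount": 0.1}'),             # JSON smuggled into the same column
])

def classify(value: str) -> str:
    """Very rough guess at what kind of content a text value holds."""
    if value.startswith("{") and value.endswith("}"):
        return "json-like"
    if re.search(r"\d{4}-\d{2}-\d{2}", value):
        return "contains-date"
    if "=" in value:
        return "key-value"
    return "free-text"

patterns = Counter(classify(v) for (v,) in con.execute("SELECT note FROM contract"))
print(patterns)
```

If one column yields more than one pattern class, that is a strong hint that the field carries implicit semantics worth discussing with the legacy experts.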

Based on those challenges we identified the following key success factors:

Key success factors

  • Early fit-gap analysis: The differences between the legacy and the new data model, including plausibility checks, must be understood. The gaps must be identified and closed with a data cleansing approach.
  • Early data cleansing to increase data quality: You cannot start early enough to improve the data quality in the legacy system. This can be done through the application or with scripts at database level. Please be aware that working directly on the database is dangerous and needs extensive testing and a working testing strategy.
  • Early changes to legacy systems: Sometimes it is necessary to fix the legacy system and fill existing data gaps. In addition, implementing delta protocols to support an incremental migration will help to execute the migration and reduce the downtime.
  • Early systematic and repeatable tests: As we’re working on the treasure of the company, it’s key to have extensive tests. You should automate as much as you can, since you typically need to repeat the tests several times; a minimal, repeatable check is sketched after this list. In addition, a focus on non-functional requirements like performance is essential.
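For the repeatable tests mentioned above, even a simple fingerprint comparison between legacy and target tables catches many migration defects. The sketch below assumes DB-API access to both databases and identical value representations on both sides, which is a simplification; the table and column names are invented.

```python
import hashlib
import sqlite3

def table_fingerprint(con, table, columns):
    """Row count plus an order-independent checksum over the selected columns."""
    count = con.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    checksum = 0
    for row in con.execute(f"SELECT {', '.join(columns)} FROM {table}"):
        digest = hashlib.sha256(repr(row).encode("utf-8")).hexdigest()
        checksum = (checksum + int(digest[:16], 16)) % (1 << 64)
    return count, checksum

def check_migration(legacy_con, target_con, table, columns):
    """Fail loudly if the migrated table differs from the legacy one."""
    src = table_fingerprint(legacy_con, table, columns)
    dst = table_fingerprint(target_con, table, columns)
    assert src == dst, f"{table}: legacy {src} != target {dst}"
    print(f"{table}: OK, {src[0]} rows match")

# Hypothetical usage with two small in-memory databases standing in for
# the legacy and the new system:
legacy, target = sqlite3.connect(":memory:"), sqlite3.connect(":memory:")
for con in (legacy, target):
    con.execute("CREATE TABLE customer (id INTEGER, name TEXT)")
    con.executemany("INSERT INTO customer VALUES (?, ?)", [(1, "Alice"), (2, "Bob")])
check_migration(legacy, target, "customer", ["id", "name"])
```

Because the check is a small script, it can run after every migration iteration without manual effort.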

As you might have noticed already, it’s key to start with all this early. I’ve seen a lot of migrations fail because they started too late, and it wasn’t possible to migrate the data properly in the remaining time. To overcome the challenges, I recommend having a closer look at the following best practices:

Best practices

  • Start as early as possible: Don’t put this topic off. It’s best to start on the first day of the engagement. It needs to be planned and staffed accordingly. The migrated data is also beneficial for the testing of the new application (if it’s available early enough).
  • Be agile and iterative: As requirements will change over time, it’s a best practice to perform the migration several times in iterations. In agile engagements this is the only way to succeed. One can learn a lot by executing the processes several times, especially in an early phase.
  • Use complete production data: For tests it’s key to use the production data as early as possible. This helps to predict the runtime and check whether the given time frames are sufficient. In addition, the data can be used for functional tests of the new application in addition to synthetic test data.
  • Reduce complexity as much as possible: Transformation of data can add a good amount of complexity to the migration. Therefore, this should be minimized. The KISS principle also holds true for data. Sometimes it’s necessary to relax the plausibility checks in the new application to safeguard the data migration; a pre-validation sketch for such checks follows this list.
  • Right environment: Use a production-like environment or special environments for migrations. This helps you to safeguard the non-functional requirements (mainly performance and security). As the time frames for downtime are typically very strict, you must plan on which environment you will perform the migration. Please note that the access and movement of data might be restricted by security requirements.
  • Plan in advance: Backward planning (starting from the actual migration date) with several buffers and iterations is key. You should link this with the project plan of the “new” application. Be aware of the close linkage of both streams: if one fails, the other cannot succeed either.
  • Onboard legacy experts: If you have the chance to onboard experts from the legacy system (e.g. Business & Technical Architects, Testers, Developers), you should not hesitate to do so. The knowledge they bring in is priceless. Don’t be too harsh with them, because it’s their “baby” you’re talking about.
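Regarding the plausibility checks, it is often safer to pre-validate the legacy data against the new application’s rules and collect a report than to let a single bad record abort the import. The rules, field names, and records below are invented examples; the real checks would come out of the fit-gap analysis.

```python
from datetime import date

# Invented plausibility rules mirroring what a new application might enforce.
RULES = [
    ("email contains @",         lambda r: "@" in (r.get("email") or "")),
    ("birth date not in future", lambda r: r.get("birth_date", date.min) <= date.today()),
    ("name not empty",           lambda r: bool((r.get("name") or "").strip())),
]

def pre_validate(records):
    """Return (record id, violated rule) pairs instead of failing the whole load."""
    findings = []
    for record in records:
        for rule_name, check in RULES:
            if not check(record):
                findings.append((record["id"], rule_name))
    return findings

# Hypothetical legacy extract:
legacy_rows = [
    {"id": 1, "name": "Alice", "email": "alice@example.com", "birth_date": date(1980, 5, 1)},
    {"id": 2, "name": "",      "email": "bob-at-example.com", "birth_date": date(2199, 1, 1)},
]

for record_id, rule in pre_validate(legacy_rows):
    print(f"record {record_id} violates: {rule}")
```

Such a report can feed the early data cleansing directly, long before the actual migration weekend.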

As already stated, there are other dimensions in the context of software archaeology. Stay tuned for further blogs around Software Archaeology!

Thilo has more than 25 years of experience in IT Architecture and worked for several clients in Germany. He’s located in Stuttgart and works at Capgemini.