Lift, shift and replatform

Andrew Flegg
Engineering at Alfa
Apr 27, 2022

Alfa’s customers manage over $250bn of assets on our platform, Alfa Systems, in a world of increasing uncertainty. They need reliability, accuracy and flexibility in everything from the infrastructure used to run the platform, through the detailed accounting underpinning their organisation, to the business processes used to deliver value to their customers.

Traditionally, our software has been deployed on-premise and managed by our customers or a third party. However, in 2017 we moved to a cloud-first strategy: our customers can benefit from the same great product in our cloud hosting service, which moves the responsibility for infrastructure automation, security, scalability, performance and support to Alfa, allowing them to focus on their business, not IT.

With that background, in 2019 we were approached by a long-standing customer who was very happy with us and our product, but their third-party data centre provider wasn’t aligned with the rest of their project team, and they had the typical problems of an on-premise deployment that had been running for many years:

  • Difficulty maintaining consistency between environments, including production and pre-production as well as test environments
  • Delays in provisioning new test environments, caused by hardware availability and manual provisioning
  • Manual deployments at risk of human error

This was an opportunity we definitely wanted to embrace: not only would it be good for us, it would allow us to improve our customer’s experience.

The challenge

  • Migrate from their third-party, on-premise, Oracle-backed installation to Alfa-managed PostgreSQL in AWS
  • …with minimal downtime (ideally limited to Saturday afternoon to Sunday night)
  • …with 100% data fidelity and verification of the move
  • …maintaining the existing Oracle-based reporting layer
  • …with a go-live in November 2020, in the middle of a pandemic.

Technologies

Alfa Systems is designed to be flexible, not just in terms of business ownership of processes (powered by our functional configuration and open API) but also the technical platform. This has allowed us to evolve the platform: changing deployment approaches (such as embracing containerisation and cloud) and supported databases.

Since launching Alfa Systems v5, we have supported the following database platforms:

Alfa Systems’ database platforms

This flexibility was a key tool in enabling the move: we could run the exact same version of the software against an Oracle (on-premise) database and a PostgreSQL (in AWS Aurora) database. Not only did this reduce the number of moving parts (and therefore the risk); it also allowed for a clear reconciliation between the two databases, validating the move.
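
For illustration, a reconciliation of this kind can be as simple as comparing row counts table by table across the two live databases. This is a minimal sketch: the drivers, connection details and table names below are assumptions rather than Alfa’s actual configuration, and a real reconciliation would also compare checksums and financial balances, not just counts.

```python
import oracledb   # python-oracledb driver for the Oracle source
import psycopg2   # PostgreSQL driver for the Aurora target

TABLES = ["contract", "asset", "payment"]  # hypothetical table list

# Hypothetical connection details for the two databases.
ora = oracledb.connect(user="alfa", password="...", dsn="onprem-db:1521/ALFA")
pg = psycopg2.connect("host=aurora-cluster dbname=alfa user=alfa password=...")

for table in TABLES:
    with ora.cursor() as oc, pg.cursor() as pc:
        oc.execute(f"SELECT COUNT(*) FROM {table}")
        pc.execute(f"SELECT COUNT(*) FROM {table}")
        source, target = oc.fetchone()[0], pc.fetchone()[0]
        status = "OK" if source == target else "MISMATCH"
        print(f"{table}: source={source} target={target} {status}")
```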

This flexibility in our software is delivered by Hibernate and our own open source library, Alfa Morf.

The other key technology we used to enable a minimal downtime migration was AWS Database Migration Service (DMS), which provides a real-time replication across database technologies and deployment architectures.

Architecture

The client’s architecture had users connecting to our platform running on a cluster of IBM WebSphere Application Server instances, over an Oracle 12c database:

Diagram showing the on-premise architecture
Previous on-premise architecture (other systems, integrations and duplication for DR purposes omitted for clarity)

The reporting solution worked well, and although improvements were identified, the project team agreed that deferring them until after the replatform would deliver faster time-to-value at lower risk. This meant we needed to find a way to maintain the reporting solution’s features:

  • Use of Oracle GoldenGate to capture additional Change Data Capture (CDC) metadata against the Alfa Systems data stream
  • Multi-system scheduling and orchestration using Oracle Data Integrator (ODI) to pull together information into their Data Warehouse
  • Hundreds of reports in Oracle Business Intelligence Enterprise Edition (OBIEE) that supported both strategic and operational reporting, possibly using Oracle-specific SQL
  • Bulk data extracts, possibly using Oracle-specific SQL

After the completion of the replatform, we have a cluster of Docker containers running our platform in Jetty, over an AWS Aurora PostgreSQL database:

Diagram showing the logical architecture post-replatform
New architecture in Alfa Hosting

AWS DMS is used as part of the ongoing operation: it replicates data in real time from the new Alfa-managed database in AWS back into an Oracle environment that GoldenGate can consume. This allows the customer to continue to use their existing reporting and data integration infrastructure, without having to rewrite it to avoid any Oracle-specific features they might have used.
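
As a sketch of how a reverse path like this can be configured, DMS table-mapping rules can select the application schema and rename it to whatever schema name the Oracle reporting stack already expects. The schema and rule names below are hypothetical:

```python
import json

# Hypothetical DMS table mappings for the PostgreSQL -> Oracle reporting
# replica: include everything in the application schema, then rename the
# schema to the name the downstream GoldenGate setup is configured to read.
table_mappings = {
    "rules": [
        {
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "include-alfa-schema",
            "object-locator": {"schema-name": "alfa", "table-name": "%"},
            "rule-action": "include",
        },
        {
            "rule-type": "transformation",
            "rule-id": "2",
            "rule-name": "rename-schema-for-oracle",
            "rule-target": "schema",
            "object-locator": {"schema-name": "alfa"},
            "rule-action": "rename",
            "value": "ALFA",
        },
    ]
}
print(json.dumps(table_mappings, indent=2))
```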

Transition

A cutover plan and runbook were prepared, and tested on various sub-production environments, to ensure the full scope of the changes at the point of go-live was understood and well practised. Activities went beyond the move of the data and included:

  • Data reconciliation
  • Integration reconfiguration
  • User verification testing
  • Stakeholder sign-off

To maximise the chance of success, AWS DMS was used in “full load + CDC” mode, ahead of the cutover, to ensure that moving ~4TB of data wasn’t on the critical path. This meant the new production database in AWS was initialised in advance, while users continued to use the on-premise installation:

AWS architecture diagram showing the users continuing to use the old system, while AWS DMS populated both an Oracle replica and a PostgreSQL replica.
Pre-cutover transitional architecture showing the use of AWS DMS
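
Creating that pre-cutover task looks roughly like the following. This is a sketch only: the ARNs, identifiers and catch-all mapping are placeholders, and in practice you’d wait for the task to reach the ready state before starting it.

```python
import json

import boto3

dms = boto3.client("dms")

# Catch-all table mapping; real mappings would exclude internal tables.
table_mappings = {
    "rules": [{
        "rule-type": "selection",
        "rule-id": "1",
        "rule-name": "include-all",
        "object-locator": {"schema-name": "%", "table-name": "%"},
        "rule-action": "include",
    }]
}

# "full-load-and-cdc" bulk-copies the existing ~4TB first, then applies
# ongoing changes so the target stays current until cutover.
task = dms.create_replication_task(
    ReplicationTaskIdentifier="oracle-to-aurora-precutover",
    SourceEndpointArn="arn:aws:dms:eu-west-2:111111111111:endpoint:SOURCE",
    TargetEndpointArn="arn:aws:dms:eu-west-2:111111111111:endpoint:TARGET",
    ReplicationInstanceArn="arn:aws:dms:eu-west-2:111111111111:rep:INSTANCE",
    MigrationType="full-load-and-cdc",
    TableMappings=json.dumps(table_mappings),
)

dms.start_replication_task(
    ReplicationTaskArn=task["ReplicationTask"]["ReplicationTaskArn"],
    StartReplicationTaskType="start-replication",
)
```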

To avoid opening a hole in the existing production firewall (which would otherwise have been needed to allow Alfa’s DMS appliance to connect to the primary database server in the data centre), the customer used AWS Transit Gateway and AWS DMS to replicate the Oracle database to the cloud, where an Alfa-managed DMS appliance could use VPC peering to securely replicate the transitional copy into our new Aurora PostgreSQL instance.
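
The peering side of that can be sketched as below. All IDs, account numbers and CIDR ranges are hypothetical, and the customer must still accept the peering request from their own account:

```python
import boto3

ec2 = boto3.client("ec2")

# Peer the Alfa-managed VPC with the customer's replication VPC so the
# Alfa-managed DMS appliance can reach the transitional Oracle copy.
peering = ec2.create_vpc_peering_connection(
    VpcId="vpc-aaaa1111",        # Alfa-managed hosting VPC
    PeerVpcId="vpc-bbbb2222",    # customer's replication VPC
    PeerOwnerId="123456789012",  # customer's AWS account
)
pcx_id = peering["VpcPeeringConnection"]["VpcPeeringConnectionId"]

# Route DMS traffic destined for the transitional database over the peering
# connection (one route per route table involved).
ec2.create_route(
    RouteTableId="rtb-cccc3333",
    DestinationCidrBlock="10.1.0.0/16",  # customer replication subnet range
    VpcPeeringConnectionId=pcx_id,
)
```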

The initial load of the cloud databases and the configuration of CDC connections to keep them up-to-date with ongoing production transactions were completed about a week before the cutover.

Cutover

There were many business activities to conduct as part of the cutover, but the database shift was straightforward:

  1. Stop any new transactions being ingested (for example, by shutting down the application servers)
  2. Wait for the DMS queues to flush the changes through to the target (a latency check for this is sketched below)
  3. Stop the CDC DMS tasks, halting both the copying and the downstream consumption
  4. Reconcile the databases, ensuring there were no DMS errors
  5. Complete any other cutover activities, including reconfiguring data sources for reporting
  6. Bring the application cluster up over the new production database

AWS architecture diagram showing the in-transition/in-cutover architecture: applications are stopped, so the data changes flush through.
Data architecture during the cutover
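
The “wait for the queues to flush” step can be automated by polling the task’s CDC latency in CloudWatch: once the application servers are stopped and target latency drops to zero, the databases are quiescent. A minimal sketch, assuming the AWS/DMS metric names and dimensions behave as documented and using hypothetical identifiers:

```python
import time
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch")

def cdc_target_latency(task_id: str, instance_id: str) -> float:
    """Return the most recent CDCLatencyTarget reading (seconds) for a task."""
    now = datetime.now(timezone.utc)
    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/DMS",
        MetricName="CDCLatencyTarget",
        Dimensions=[
            {"Name": "ReplicationTaskIdentifier", "Value": task_id},
            {"Name": "ReplicationInstanceIdentifier", "Value": instance_id},
        ],
        StartTime=now - timedelta(minutes=10),
        EndTime=now,
        Period=60,
        Statistics=["Average"],
    )
    points = sorted(stats["Datapoints"], key=lambda p: p["Timestamp"])
    return points[-1]["Average"] if points else float("inf")

# With the application servers stopped, wait for replication to catch up.
while cdc_target_latency("oracle-to-aurora-precutover", "dms-instance") > 0:
    time.sleep(30)
```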

Lessons learned

Problems

As with any complex technology change, problems were encountered during testing. The biggest was that AWS DMS is designed for data migration and data replication: its support for structural changes (Data Definition Language, DDL) is limited.

This has meant we have a process to apply the DDL to both the source and target schemas independently, and a CDC-only DMS task to replicate the Data Manipulation Language (DML) parts of our upgrade process separately (a sketch of the flow follows the list):

  1. Morf generates the upgrade script (containing both DDL and DML) for both PostgreSQL and Oracle.
  2. We stop the replication DMS tasks and the application platform.
  3. The upgrade scripts are applied to both the Alfa-managed source and the customer-managed replica.
  4. We restart the DMS appliance and replication.
  5. Finally, we restart the application. Morf detects there are no schema differences and the application starts normally.
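
Scripted, the flow looks something like this. It’s a sketch under assumptions: the task ARN and script names are placeholders, and the real runbook includes checks between each step.

```python
import subprocess

import boto3

dms = boto3.client("dms")
TASK_ARN = "arn:aws:dms:eu-west-2:111111111111:task:CDC-ONLY"  # hypothetical

# Steps 1-2: replication (and, separately, the application) are stopped first.
dms.stop_replication_task(ReplicationTaskArn=TASK_ARN)

# Step 3: apply the Morf-generated upgrade scripts to each platform
# independently, since DMS won't replicate the DDL. Names are illustrative.
subprocess.run(["psql", "-f", "upgrade-postgresql.sql"], check=True)
subprocess.run(["sqlplus", "alfa@ALFA", "@upgrade-oracle.sql"], check=True)

# Step 4: resume CDC from where it left off.
dms.start_replication_task(
    ReplicationTaskArn=TASK_ARN,
    StartReplicationTaskType="resume-processing",
)
```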

We’re extending the application layer to use Morf’s multi-platform support, so the application itself can manage multiple target schemas: a manual process like this doesn’t fit with our strategy of automation.

The other big ongoing issue is related: a lack of TRUNCATE support in DMS. We use TRUNCATE as part of our end-of-day financial snapshotting. We’ve managed this through application-level batch deletes (sketched below), and fed the gap back to the DMS team to include on their roadmap.
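
A minimal sketch of the batched-delete approach, assuming a hypothetical snapshot table on the PostgreSQL side. Each batch is ordinary DML that DMS can capture; Oracle would use ROWNUM or FETCH FIRST instead of the ctid/LIMIT idiom used here:

```python
import psycopg2

BATCH_SIZE = 10_000  # hypothetical; sized to keep transactions small

conn = psycopg2.connect("host=aurora-cluster dbname=alfa user=alfa password=...")
conn.autocommit = False

while True:
    with conn.cursor() as cur:
        # Delete one batch of rows; ctid/LIMIT is a PostgreSQL-specific idiom.
        cur.execute(
            """
            DELETE FROM eod_snapshot
            WHERE ctid IN (SELECT ctid FROM eod_snapshot LIMIT %s)
            """,
            (BATCH_SIZE,),
        )
        deleted = cur.rowcount
    conn.commit()  # commit per batch so DMS sees small, steady transactions
    if deleted < BATCH_SIZE:
        break
```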

Using any product, including DMS, at significant volume over a long period of time will also uncover edge cases. We’ve had a positive experience engaging with AWS’s support and product teams to diagnose these, work around them and test fixes.

Hints

We’ve also collected a few hints along the way, for other users of DMS:

  1. Test, test and test again. Don’t assume it’ll be alright on the night! You’ll find your own set of peculiarities and hints.
  2. Load large tables first. If you want to minimise the runtime and have some tables that are a lot larger than others, configure DMS to load them first: the sooner they start, the sooner they’ll finish. (A configuration sketch follows this list.)
  3. Use mapping rules to map schema or table names, or to exclude unnecessary or internal-use-only tables that don’t need replicating.
  4. Be aware of the unique index constraints in DMS, particularly so you can use batchApplyEnabled=true for high throughput.
  5. Understand the different recovery options available, and determine which might be appropriate for your use cases. Depending on the circumstance, we will just resume the DMS task; reload individual table data; or, in extremis, reload everything.
  6. Use CloudWatch: monitor the source, the target and the DMS appliance. Key metrics include CDC incoming changes, CDC errors, DMS task status, DMS storage, and standard RDS metrics like CPU. Also use standard database diagnosis tools, such as top/long-running SQL statements and Oracle AWR reports. In particular, the stats of both the DMS appliance and the target database can tell you if you have the DMS appliance sized correctly for the performance you want.
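
To illustrate hints 2 to 4: load order, exclusions and batch apply are all plain task configuration. This is a sketch with hypothetical schema, table and rule names, under the assumption (as documented) that explicit exclusions take precedence over the wildcard include:

```python
import json

# Hints 2 and 3: load the largest table first (higher load-order values load
# earlier) and exclude an internal-use-only table from replication.
table_mappings = {
    "rules": [
        {
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "load-payments-first",
            "object-locator": {"schema-name": "alfa", "table-name": "payment"},
            "rule-action": "include",
            "load-order": 100,
        },
        {
            "rule-type": "selection",
            "rule-id": "2",
            "rule-name": "include-the-rest",
            "object-locator": {"schema-name": "alfa", "table-name": "%"},
            "rule-action": "include",
        },
        {
            "rule-type": "selection",
            "rule-id": "3",
            "rule-name": "exclude-internal",
            "object-locator": {"schema-name": "alfa", "table-name": "internal_log"},
            "rule-action": "exclude",
        },
    ]
}

# Hint 4: batch apply requires a primary key or unique index on target tables.
task_settings = {"TargetMetadata": {"BatchApplyEnabled": True}}

print(json.dumps(table_mappings, indent=2))
print(json.dumps(task_settings, indent=2))
```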

Impact

The impact on our customer has been significant: they’ve got better value for money, with better reliability, better performance of key processes and easier upgrades.

With responsibility for the infrastructure, its performance and the software all resting with a single vendor, they’ve got a clearer escalation route if and when something goes wrong. But the number of times things do go wrong has decreased, thanks to our focus on automation and infrastructure-as-code, together with the incentive to identify potential problems with the solution before they become issues.

For Alfa, this means a happier customer, whose users no longer view “Alfa Systems” as the problem when there are third-party infrastructure issues. We have ongoing subscription revenue, as well as better metadata and the experience of eating our own dog food to improve our products and services.

It’s also given us confidence that our cloud solution can work at scale: over 3 million payments a month, plus the customer’s accounting and associated business processes, handled with (so far) 100% availability.
