Creating a Legacy Data Archive in Azure

Simon Harris
Capgemini Microsoft Blog
6 min read · Mar 9, 2021

Capgemini recently completed a customer project that was an excellent example of using Azure resources to solve a business problem quickly and cost-effectively.

Background

Our customer had 2 legacy Oracle ERP applications (PeopleSoft HCM and ELM) containing many years of important and sensitive historic data. The customer had moved away from these on-premises legacy applications to new cloud SaaS applications, but didn't want to migrate the historic data into the new SaaS applications because:

  • Historic data quality was patchy and questionable
  • High effort and cost to transform and load the data into the new applications
  • The existing data didn't conform well to the new SaaS application schemas

For regulatory and business reasons the historic data needed to be retained for either 6 or 15 years, depending on the data entity. However, the business was keen to remove the high ongoing costs of legacy hardware, software and licences.

Our solution was to build a data archive using Azure resources.

Solution Architecture

The new archive was based on Azure SQL DB with Power BI reports for users to search and view data. Azure Logic Apps were created to automatically purge data beyond its retention date. A Power App was added late in the project to allow minor data updates.

Data Migration

The diagram below shows the legacy on-premises and Azure elements used in the one-off data extract and load to Azure:

  1. Custom SQL scripts written to extract each data entity to CSV files (a sketch of this step follows the list)
  2. Extracted CSV files loaded to Azure Blob storage
  3. Azure Data Factory copy operations created to load the CSV files into the custom Azure SQL schema
  4. Azure AD groups used to secure DB, Power App & Power BI access
  5. Azure Logic App used to purge data past retention date
  6. Power BI reports allow users to search and view data
  7. Power App allows 2 fields to be updated in the archive
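
To make step 1 concrete, below is a minimal sketch of what one extract script could look like, written as a small Python job against the legacy Oracle database. The table, columns, connection details and file name are hypothetical; the real extracts were hand-written custom SQL per entity.

```python
# Sketch of step 1: extract one data entity from the legacy Oracle DB to a CSV file.
# Table, columns, connection details and file name are illustrative only.
import csv
import oracledb  # python-oracledb driver

EXTRACT_SQL = """
    SELECT emplid, name, dept_id, hire_date
    FROM ps_employees   -- hypothetical legacy entity
"""

with oracledb.connect(user="archive_reader", password="********",
                      dsn="legacy-hcm-host:1521/HCMPRD") as conn:
    with conn.cursor() as cur:
        cur.execute(EXTRACT_SQL)
        with open("ps_employees.csv", "w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            writer.writerow([col[0] for col in cur.description])  # header row
            writer.writerows(cur)                                 # data rows
```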

CSV Data Extract vs On-Premises Data Gateway

The customer didn't have a dedicated network connection from on-premises to Azure, so we needed extra steps to extract and load the data. We considered using the Azure on-premises data gateway, but this was ruled out for various reasons. Instead, we extracted data from the legacy Oracle databases to on-premises CSV files, which were uploaded to Azure Blob storage. Azure Data Factory then loaded these CSV files into Azure SQL DB.
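
As an illustration of the upload step, a script using the azure-storage-blob SDK could push each extracted CSV into Blob storage for the Data Factory copy activities to pick up. The container name, blob paths and use of a connection string are assumptions for the sketch; any upload mechanism would do.

```python
# Sketch: upload the extracted CSV files to Azure Blob storage for Data Factory to load.
# Container name, folder layout and connection string are illustrative.
import os
from azure.storage.blob import BlobServiceClient

conn_str = os.environ["ARCHIVE_STORAGE_CONNECTION_STRING"]
service = BlobServiceClient.from_connection_string(conn_str)
container = service.get_container_client("legacy-extracts")

for filename in ["ps_employees.csv", "ps_payroll.csv"]:  # one CSV per extracted entity
    with open(filename, "rb") as data:
        # Land each file where the corresponding Data Factory copy activity expects it.
        container.upload_blob(name=f"hcm/{filename}", data=data, overwrite=True)
```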

Database options

For our SQL DB we compared Azure SQL Database with Azure Synapse Analytics (formerly SQL DW). SQL Database is a relational database-as-a-service, whereas Azure Synapse Analytics is a massively parallel processing, cloud-based analytics database. Our requirements were modest: low compute, fewer than 10 users, intermittent usage with periods of inactivity, and a total size of no more than 5 GB. We therefore selected Azure SQL Database as sufficient for our needs and much more cost-effective.

For the selected Azure SQL DB, we had 2 options:

  1. Serverless — price performance optimised for single databases with intermittent, unpredictable usage patterns that can afford some delay in compute warm-up after idle usage periods.
  2. Provisioned — price performance optimised for single databases or multiple databases in elastic pools with higher average usage that cannot afford any delay in compute warm-up.

Scenarios well-suited for serverless compute:

  1. Single databases with intermittent, unpredictable usage patterns interspersed with periods of inactivity and lower average compute utilisation over time.
  2. New single databases without usage history where compute sizing is difficult or not possible to estimate prior to deployment in SQL Database.

Given our intermittent usage pattern, and to save cost, we selected serverless Azure SQL DB. We accepted the downside that the first connection after the database has auto-paused can fail with a connection error while it resumes; a simple client-side retry, sketched below, handles this.
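
For any custom client hitting the archive directly, retrying the first connection a few times is enough to ride out the resume delay. This is a sketch only; the server, database and authentication settings are placeholders.

```python
# Sketch: retry the first connection while a serverless Azure SQL DB resumes from auto-pause.
# Server, database and authentication details are placeholders.
import time
import pyodbc

CONN_STR = (
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=tcp:archive-sql.database.windows.net,1433;"
    "Database=legacy-archive;"
    "Authentication=ActiveDirectoryInteractive;"
)

def connect_with_retry(retries: int = 5, delay_seconds: int = 10) -> pyodbc.Connection:
    """Try to connect, waiting between attempts while the database wakes up."""
    for attempt in range(1, retries + 1):
        try:
            return pyodbc.connect(CONN_STR, timeout=30)
        except pyodbc.Error:
            if attempt == retries:
                raise
            time.sleep(delay_seconds)  # database is probably still resuming

conn = connect_with_retry()
```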

Based on the Microsoft Azure pricing calculator and 5 GB of storage, we estimated the serverless Azure SQL DB at around £6 per month, compared to around £227 per month for an equivalent Synapse Analytics instance.

Data Purging

The customer had specific data retention rules for each entity, defining the maximum duration those records should be kept. We created an Azure Logic App for each data entity to be purged; each app wakes once a week and deletes any records that have passed their retention period (the equivalent delete logic is sketched below). This ensured the customer was not storing data beyond its allowed and justified retention duration.
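
The core of each purge is essentially a dated DELETE against the relevant entity. The sketch below shows that logic as a small Python job; the table, date column and 6-year retention period are hypothetical, and in the real solution the equivalent delete ran from a Logic App on its weekly recurrence.

```python
# Sketch: delete records that have passed their retention period.
# Table, date column and retention length are illustrative.
from datetime import date, timedelta
import pyodbc

CONN_STR = (
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=tcp:archive-sql.database.windows.net,1433;"
    "Database=legacy-archive;"
    "Authentication=ActiveDirectoryInteractive;"
)

RETENTION_YEARS = 6  # hypothetical retention rule for this entity
cutoff = date.today() - timedelta(days=RETENTION_YEARS * 365)

with pyodbc.connect(CONN_STR) as conn:
    cursor = conn.cursor()
    cursor.execute(
        "DELETE FROM hcm.absence_history WHERE record_date < ?",  # hypothetical table
        cutoff,
    )
    conn.commit()
    print(f"Purged {cursor.rowcount} rows older than {cutoff}")
```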

Security

Some of the data from the legacy systems included employees' personal information, including names, addresses and payroll details, so security was very important to this solution. To increase security, we did the following:

  • Disabled internet access to the DB, allowing access only from Azure/O365
  • Restricted DB access to a minimal set of users
  • Split Power BI into 4 separate reports, 1 for each business area
  • Controlled access permissions in Power BI using Azure AD groups
  • Turned on Multi-Factor Authentication for our users when accessing Power BI

Team/Plan

To deliver this solution we only needed a small team of 2 developers and 1 tester for 7 two-week sprints, followed by 2 weeks of UAT. The most complex area of the project was creating the custom SQL extract scripts for the legacy HCM and ELM systems. The legacy PeopleSoft systems had been customised many years ago by another company, and we struggled to find any knowledge or documentation of them, so we had to reverse engineer the schemas and then heavily test the extracts against the source systems.

Issues

During the project we encountered the following issues:

SQL DB Roles — We planned to use SQL DB roles to further restrict which users could SELECT from which tables. However, Power BI datasets connect with their own stored credentials and seem to ignore the database's per-user security. Our workaround was to split Power BI into 4 separate reports, so we could limit access to data areas that way.
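
For context, the database-role approach we originally planned looked roughly like the sketch below (role, schema and Azure AD group names are hypothetical); it is exactly this per-user SELECT restriction that the datasets' stored credentials sidestep.

```python
# Sketch of the originally planned approach: restrict SELECT access with database roles.
# Role, schema and Azure AD group names are hypothetical.
import pyodbc

CONN_STR = (
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=tcp:archive-sql.database.windows.net,1433;"
    "Database=legacy-archive;"
    "Authentication=ActiveDirectoryInteractive;"
)

SETUP_SQL = [
    # Map an Azure AD group to a contained database user.
    "CREATE USER [HR-Archive-Readers] FROM EXTERNAL PROVIDER;",
    # Create a role that may only read the HR schema, and add the group to it.
    "CREATE ROLE hr_readers;",
    "GRANT SELECT ON SCHEMA::hr TO hr_readers;",
    "ALTER ROLE hr_readers ADD MEMBER [HR-Archive-Readers];",
]

with pyodbc.connect(CONN_STR, autocommit=True) as conn:
    cursor = conn.cursor()
    for statement in SETUP_SQL:
        cursor.execute(statement)
```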

Azure RBAC — The customer's Azure tenant had 'evolved' over time under a number of previous integrators, which meant that too many users were assigned Owner or Contributor at the subscription level. This was a problem because those users would inherit those high-privilege roles on our new database and could potentially access or change the data. Our solution was to create a new subscription in the tenant with a new resource group, and to assign users only at the resource group level. We considered using Azure Blueprints to try to deny the inherited RBAC roles, but decided the quickest and most reliable solution was a new, separate subscription.

Azure AD MFA — We wanted to enforce MFA only for our new Power BI reports, but unfortunately found that it can only be applied to Power BI as a whole application. This meant that all users in our solution's AD groups would be challenged for MFA when using any Power BI report. This wasn't a big problem, but it would have been nice to be able to target MFA at specific reports.

Summary

Overall, the customer was very happy with the new archive, and I think it is a great example of combining Azure resources to quickly and efficiently create a solution to real-world business challenges. The estimated running costs of the serverless DB and Logic Apps are tiny compared to the support, hardware and licence costs of the legacy HCM and ELM solutions. On top of that, the solution uses PaaS services that Azure will automatically update, support, maintain, back up and fail over.

Do these kinds of solutions interest you? If so, consider joining a team that implements them. The Capgemini Microsoft Apps & Cloud Engineering Team are hiring; see our open roles here.
