Business Won’t Wait — Migrating to Azure for Data & Analytics

Accelerating Business Outcomes Using Azure — Databricks, Data Factory, PowerBI, & Snowflake Cloud Data Warehouse

--

by Ed Fron, Enterprise Architect

I recently worked with an industrial company that was retiring an on-premises, traditional Big Data platform. The platform was operational, but it wasn’t keeping up with business scenarios that spanned multiple business areas or data domains, and it was struggling to handle growing data volumes and data loading requirements without significant administrative overhead and budget.

In this post, I’ll highlight the solution set they chose and discuss some of the benefits they are realizing already in moving to a new approach that takes full advantage of the cloud.

Getting to Azure for Data & Analytics

Importantly, they weren’t looking for a like-for-like replacement (moving to another traditional Big Data platform or an IaaS-based solution). Instead, they focused on reviewing the range of SaaS-based, cloud-native solutions available in Microsoft Azure (their chosen cloud partner) so that they could eliminate, or significantly reduce, the infrastructure effort of operating and maintaining existing analytics applications while delivering new apps at a more rapid pace.

It was also compelling to move to cloud solutions that would eliminate static limits on usage, configuration, and capabilities while providing a consumption-based model, paying only for the services they actually used rather than sinking capital into fixed costs. An IaaS cloud model wasn’t going to cut it.

What Was Being Moved Out

Before jumping straight into the cloud solution set they chose, here’s a quick recap of how the existing traditional solution was being used. They had centralized a significant number of structured and semi-structured datasets into their Hadoop-based enterprise data lake to preserve original data fidelity. On top of that, they used Apache Hive as a consumption zone along with a Big Data analytics workbook solution that ran on the core platform, which helped their business analysts parse through the datasets to extract value and insights.

From a data flow perspective, analysts would create workbooks, which were processed in the Hadoop environment and then exported to Hive tables for broader use by data consumers, who used the workbooks for data interaction or PowerBI for visualization. Workbooks sourced data from other workbooks or from datalinks, and a series of joins, calculations, and similar operations produced a final sheet containing all required data for export within a single workbook.

This architecture and overall process, while it worked, came with a corresponding amount of overhead and limitations.

The Azure Solution

When we got the call to assist the client with their cloud migration and deployment in Azure, they reviewed with us the core solution components they had selected (after an in-depth review and testing process) and that we would be working with during the project. I’ll quickly highlight those below to give you a sense for the architecture and how each component is being used.

  • Original Format Data Store — Azure Blob Storage — Provides scalable, cost-effective cloud storage for all data and a wide variety of services for connecting and using the data
  • Analytics and Reporting Data Store — Snowflake Cloud Data Warehouse — Delivers a fully relational cloud native SQL data warehouse in Azure providing comprehensive support for both structured and semi-structured data and full support for SQL as well as a full range of ETL and BI tools — you can read about my introduction to Snowflake about a year ago here
  • Processing and Transformations — Azure Databricks — Provides a VERY quick and easy experience to get an Apache Spark-based analytics platform up and running (and continually optimized) for the Azure cloud service platform; Spins up clusters and allows you to start building quickly with reliability and performance
  • Shared Workspace — Azure Databricks Notebooks — Delivers an interactive workspace that enables data engineers, data scientists, and business users to collaborate and comment on shared projects as a team; Supports a range of tools for Python, Scala, R, and SQL, as well as deep learning frameworks and libraries like TensorFlow, Pytorch, and Scikit-learn
  • Data Orchestration Service — Azure Data Factory — Extracts data and publishes data from/to multiple data sources (a wide range of data source and target connectors and file formats supported); Provides a graphical interface to build, monitor, and manage data pipelines
  • Data Replication, CDC, and Integration for SQL — Attunity Replicate — Provides automation of bulk data loading from multiple database sources (SQL Server, Oracle, SAP, etc.) into Snowflake’s Cloud Data Warehouse (and a range of other targets) continuously with zero downtime; Supports change data capture (CDC), data integrity checks, control and auditing, and secure data transfers
  • Interactive Data Visualization — Microsoft PowerBI — Enables visual exploration and analyzing of data all in one view; Provides a way to collaborate on and share customized dashboards and interactive reports.

Below is a high-level architectural diagram with each component slotted in.

Putting it to Work

Moving from left to right on the diagram you’ll see the common data sources: on-prem data shares, cloud storage, on-prem data warehouses, and both Oracle and SQL Server databases.

Azure Data Factory is a managed orchestration service that moves data from a source into Azure Blob Storage for original-format storage, using any of its many data source connectors. Once the data lands in Blob Storage, it’s available for processing by the Azure Databricks Spark engine. The file ingestion into Snowflake’s Cloud Data Warehouse was defined in an Azure Databricks Notebook and orchestrated using an Azure Data Factory pipeline.
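To make that pattern concrete, below is a minimal sketch of what such an ingestion notebook could look like, assuming the Spark-Snowflake connector is installed on the cluster. The storage account, secret scopes, warehouse, and table names are placeholders, not the client’s actual configuration.

```python
# Hypothetical ingestion notebook: Blob Storage -> Spark -> Snowflake.
# All names below (storage account, secret scopes, tables) are placeholders.

# Let Spark authenticate to the Blob Storage account that Data Factory
# lands files into (key pulled from a Databricks secret scope).
spark.conf.set(
    "fs.azure.account.key.mystorageacct.blob.core.windows.net",
    dbutils.secrets.get(scope="ingest", key="storage-account-key"),
)

# Read the raw files in their original format.
raw_df = (
    spark.read
    .option("header", "true")
    .csv("wasbs://landing@mystorageacct.blob.core.windows.net/sales/")
)

# Connection options for the Spark-Snowflake connector.
sf_options = {
    "sfUrl": "myaccount.snowflakecomputing.com",
    "sfUser": dbutils.secrets.get(scope="ingest", key="sf-user"),
    "sfPassword": dbutils.secrets.get(scope="ingest", key="sf-password"),
    "sfDatabase": "ANALYTICS",
    "sfSchema": "RAW",
    "sfWarehouse": "LOAD_WH",
}

# Append the raw batch into a Snowflake staging table.
(
    raw_df.write
    .format("snowflake")
    .options(**sf_options)
    .option("dbtable", "SALES_RAW")
    .mode("append")
    .save()
)
```

In a Databricks notebook, `spark` and `dbutils` are provided for you, and an Azure Data Factory pipeline can trigger the notebook on a schedule with its Databricks Notebook activity.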

Dropping down to the lower part of the diagram, for relational data the Attunity Replicate solution was used to simplify and accelerate the process of migrating and consolidating data from different internal and external database sources into Snowflake. Attunity moves data from Oracle and SQL Server databases into Snowflake without requiring complex ETL code and works for both batch and real-time replication. Once the data lands in Snowflake, it can be picked up and processed by the Azure Databricks Spark engine.

From that point, each “traditional” Big Data workbook was analyzed and refactored into one or more Azure Databricks Notebooks. Each notebook read data from Snowflake tables into a Spark dataframe, executed transforms (aggregations, grouping, joins, filters, calculations, etc.), and produced the desired final Spark dataframe, which was published back to Snowflake for highly concurrent, interactive data consumption by PowerBI users (interactive dashboards and reporting) as well as other Azure Databricks notebooks. Azure Data Factory provided the orchestration for these pipelines, enforcing the schedule and tracking runtime dependencies.
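As an illustration of that refactoring pattern (not the client’s actual code), a converted workbook might look like the sketch below; the table names, columns, and transforms are invented, and `sf_options` is the same style of connector configuration shown earlier.

```python
# Hypothetical refactor of one legacy workbook into a Databricks notebook.
# Table and column names are invented for illustration.
from pyspark.sql import functions as F

sf_options = {
    "sfUrl": "myaccount.snowflakecomputing.com",
    "sfUser": dbutils.secrets.get(scope="analytics", key="sf-user"),
    "sfPassword": dbutils.secrets.get(scope="analytics", key="sf-password"),
    "sfDatabase": "ANALYTICS",
    "sfSchema": "RAW",
    "sfWarehouse": "TRANSFORM_WH",
}

# Read source tables from Snowflake into Spark dataframes.
orders = (
    spark.read.format("snowflake")
    .options(**sf_options)
    .option("dbtable", "ORDERS")
    .load()
)
customers = (
    spark.read.format("snowflake")
    .options(**sf_options)
    .option("query", "SELECT CUSTOMER_ID, REGION FROM CUSTOMERS")
    .load()
)

# The joins and calculations that used to live in workbook sheets
# become ordinary dataframe transforms.
final_df = (
    orders.join(customers, "CUSTOMER_ID")
    .where(F.col("STATUS") == "CLOSED")
    .groupBy("REGION", "PRODUCT_LINE")
    .agg(
        F.sum("AMOUNT").alias("TOTAL_AMOUNT"),
        F.countDistinct("ORDER_ID").alias("ORDER_COUNT"),
    )
)

# Publish the final "sheet" back to Snowflake for PowerBI users
# and downstream notebooks.
(
    final_df.write.format("snowflake")
    .options(**sf_options)
    .option("dbtable", "CURATED.SALES_BY_REGION")
    .mode("overwrite")
    .save()
)
```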

That’s the basic pattern that was used to migrate well over 1,000 existing Big Data analytics workbooks to Spark jobs in Azure Databricks.

I’d say that the biggest challenge during the cloud migration was interpreting how each individual business analyst wrote their workbooks. The previous tool didn’t enforce or normalize conventions for breaking workbooks down into smaller workbooks, transforming them, and creating the final workbook, so there were a lot of one-offs that required fairly significant business analysis.

Realizing Benefits of Moving to Azure and Cloud Native SaaS Solutions

Looking back, here’s my perspective as an implementation and consulting partner on some of the high level benefits that were realized by the client:

  • Speed — Rapid migration, deployment, configuration, and build out of a cloud-native solution for data orchestration, processing, data storage, and data warehousing enabling real progress towards a vision of having all data in one place for ongoing analytics applications.
  • Outcome Realization — Acceleration of new analytics applications and business benefit realization
  • Lower Management Overhead — Ongoing solution operations require just a small fraction of the previous team size
  • Cost Savings — Significant cost savings associated with a consumption based cloud pricing model and elimination of ongoing capital sunk costs in licensing and subscriptions
  • Low Touch — Users are also benefiting from 2x, 5x, even 10x decreases in query times at higher overall data volumes; the native integration between Azure core services, Azure Databricks, and Snowflake greatly improves the experience, letting everyone get started faster with less set-up and stay automatically and immediately up-to-date with any improvements or enhancements to the cloud solution set
  • Minimal Maintenance and Support — The previous solution was challenging to keep current on maintenance patches, fixes, and upgrades; the SaaS solution set being used now provides a low/no maintenance and support profile compared to the on-prem deployment
  • Cloud Goodness — The focus is shifting to outcomes and impacting the core business versus dealing with infrastructure; as the business grows or contracts, scaling up or down is immediate and dynamic
  • Increased Competency — Business Analysts have grown their existing skill sets while still being able to leverage tried and true SQL
  • Future Applications — With Spark Structured Streaming support in Azure Databricks, the platform can be used to create streaming analytics applications and connects quickly to Azure Event Hubs, Azure IoT Hub, or other streaming data sources (see the sketch just after this list)
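To give a flavor of that last item, here’s a hypothetical Structured Streaming snippet using the open-source azure-eventhubs-spark connector (assumed to be installed on the cluster); the secret scope, paths, and the Delta sink are placeholders for whatever your application needs.

```python
# Hypothetical streaming notebook: Azure Event Hubs -> Spark Structured Streaming.
from pyspark.sql import functions as F

# The Event Hubs connector expects an encrypted connection string.
conn_str = dbutils.secrets.get(scope="streaming", key="eventhub-conn-string")
eh_conf = {
    "eventhubs.connectionString":
        sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(conn_str),
}

# Open a streaming dataframe over the event hub.
events = spark.readStream.format("eventhubs").options(**eh_conf).load()

# The payload arrives as binary in the `body` column; cast it for parsing.
parsed = events.select(
    F.col("body").cast("string").alias("payload"),
    F.col("enqueuedTime").alias("event_time"),
)

# Continuously land the parsed events (Delta shown as one possible sink).
(
    parsed.writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/checkpoints/telemetry")
    .outputMode("append")
    .start("/mnt/streams/telemetry")
)
```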

It’s a Reusable Azure Cloud Migration Template

I hope this gave you a sense for the solution used to retire a significant base of on-prem, traditional Big Data infrastructure and associated workbooks using a combination of high value, complementary Microsoft Azure services.

You should always work from your own desired business outcomes and use case requirements, keeping in mind constraints around your organization, existing technology frameworks, data access, and so on. In general, though, the approach above can be used as a template for other data and analytics cloud migrations and for the benefits that can be expected.

We continue to be asked to assist clients with this type of cloud-first solution approach while building and engineering end-to-end pipelines to derive quicker, more cost-effective value from their data. We look forward to seeing more customers succeed and doing a lot more together in the near future!

Need Cloud Migration Assistance?

If you’d like additional assistance in this area, Hashmap offers a range of enablement workshops and assessment services, cloud migration services, and consulting service packages — we would be glad to work through your specific requirements — please reach out.

Feel free to share on other channels, and be sure to keep up with all new content from Hashmap by following our Engineering and Technology Blog and subscribing to our IoT on Tap podcast.

Ed Fron is an Enterprise Architect with Hashmap providing Data, Cloud, IoT, and AI/ML solutions and consulting expertise across industries with a group of innovative technologists and domain experts accelerating high value business outcomes for our customers. Connect with Ed on LinkedIn.
