Running Hadoop Big Data workloads using Azure Data Services

Aug 13, 2020

More and more organizations are leveraging Big Data to extract value and insights from their data. As data grows exponentially, organizations find that their data warehouse and ETL systems cannot keep up. Distributed storage and compute solutions like Hadoop provide the framework for storing, processing, and analyzing Big Data. However, Hadoop systems deployed on-premise often run into multiple challenges. They require substantial upfront capital investment since they are built for peak capacity. On-premise Hadoop also provides limited elasticity and scale: adding new nodes has a long turnaround cycle of procuring, installing, and configuring the hardware before it can be put to use. Finally, on-premise Hadoop requires a large IT support team to maintain the environment.

With the rise of the cloud and the agility, flexibility, and scale it introduces, running on-premise Hadoop looks like playing an Atari video game in the age of Xbox and PlayStation. Microsoft Azure provides several options to run Big Data workloads in the cloud. The benefits vary with the option chosen, and the right option depends on the Big Data workload, whether existing Hadoop workloads are being migrated, and the Big Data team’s skill set.

Hadoop using IaaS
Cloudera offers Cloudera Enterprise Data Hub as an Azure Marketplace solution that is installed on Azure virtual machines. Cloudera Enterprise Data Hub builds a unified enterprise data platform using the Hadoop framework. The Cloudera option on Azure provides full compatibility with an on-premise deployment. Azure VMs provide leased infrastructure that can be scaled on demand without upfront Capex investment.

Pros:
A full version of Cloudera Hadoop running in the cloud
Compatibility with applications and workloads running on-premise
Converting Capex into Opex
On-demand scaling to add additional nodes or workloads

Cons:
Limited ability to scale up. An Azure VM can only be scaled up to the maximum number of cores offered by its VM type.
IaaS services only offer infrastructure benefits. A dedicated Hadoop admin team (similar to on-premise) will be required to maintain the environment.
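
The Capex-to-Opex and on-demand scaling points above can be illustrated with simple arithmetic. The sketch below is only a rough model: it assumes a single price per node-hour for both cases, and the demand profile, the rate, and the function names are all hypothetical. It compares a cluster sized for peak capacity with one that adds and removes nodes to match demand.

```python
from typing import Sequence


def fixed_cost(hourly_demand: Sequence[int], node_hour_rate: float) -> float:
    """Cost when the cluster is sized for peak demand and runs every hour."""
    return max(hourly_demand) * len(hourly_demand) * node_hour_rate


def on_demand_cost(hourly_demand: Sequence[int], node_hour_rate: float) -> float:
    """Cost when the node count follows demand hour by hour."""
    return sum(hourly_demand) * node_hour_rate


# Hypothetical day: 4 nodes for 20 hours, 20 nodes during a 4-hour batch window.
demand = [4] * 20 + [20] * 4
rate = 1.0  # assumed price per node-hour, in arbitrary units

print(fixed_cost(demand, rate))      # 480.0
print(on_demand_cost(demand, rate))  # 160.0
```

In this made-up profile the peak-sized cluster costs three times as much, because it pays for 20 nodes around the clock while needing them for only four hours. Real pricing differs between purchased hardware and leased VMs, so the ratio is illustrative only.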

Hadoop using PaaS (HDInsight)
Microsoft Azure offers a managed Hadoop service, HDInsight, on the Azure cloud platform. HDInsight is built on the Hortonworks distribution: Hortonworks Data Platform (HDP) 3.0 is available through HDInsight version 4.0. HDInsight is a first-party Azure service, and Hortonworks integrated its distribution to run natively on Azure PaaS features. HDP services such as HDFS, Tez, Hive, HBase, Spark, Storm, and Kafka are available in HDInsight.

HDInsight offers different cluster types depending on the Hadoop services and components used for the workload:

Apache Hadoop — provides HDFS, YARN, and the MapReduce framework
Apache Spark — Spark cluster for in-memory parallel processing to boost big data analysis performance
Apache HBase — NoSQL database based on Hadoop
ML Services — distributed R processes for data science workloads
Apache Storm — real-time processing of streaming data
Interactive Query — in-memory caching for Hive queries
Apache Kafka — streaming data pipelines

Pros:
Separation of storage from compute. The HDFS-compatible file system is backed by ADLS, which is separate from the HDInsight cluster
Benefits of Platform as a service (pay as you go, on-demand scale, elasticity, flexibility)
The open-source codebase

Cons:
Only available with the Hortonworks distribution (a Cloudera version is not available)
Separate clusters for Spark and HDFS
HDFS backed by ADLS runs slower than local storage
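
Because storage is separated from compute, data is addressed by a storage URI rather than by a cluster-local path. Here is a minimal sketch, assuming ADLS Gen2 and its `abfs`/`abfss` URI scheme (the article does not say which ADLS generation is meant). The helper name `abfs_uri` and the container, account, and path values are hypothetical.

```python
def abfs_uri(container: str, account: str, path: str, secure: bool = True) -> str:
    """Build a URI of the form abfs[s]://<container>@<account>.dfs.core.windows.net/<path>."""
    scheme = "abfss" if secure else "abfs"  # abfss uses TLS
    return f"{scheme}://{container}@{account}.dfs.core.windows.net/{path.lstrip('/')}"


# Hypothetical container, storage account, and folder, for illustration only.
uri = abfs_uri("raw", "examplelake", "/sales/2020/08/")
print(uri)  # abfss://raw@examplelake.dfs.core.windows.net/sales/2020/08/
```

Since the URI names the storage account and not a cluster, the same data stays reachable after a cluster is deleted and recreated.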

Big Data using Azure cloud-native services
Big Data analytics can also be achieved with Azure cloud-native services such as Azure Databricks, Azure Data Lake Storage, and Synapse Analytics. Databricks provides a unified environment for data engineers, data scientists, and data analysts. Data is stored in a data lake and then processed, analyzed, and reported on using Databricks. Databricks provides a managed Spark engine and tight integration with other Azure services.

Pros:
On-demand scale and option to go serverless
Native integration with Azure services like Synapse, ADLS, etc.
Integrated security through Azure AD and Azure Key Vault
Auto-scaling and option to pause the cluster

Cons:
Multiple Azure services must be combined and managed to run big data workloads
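
The auto-scaling and pause options listed above are usually expressed as settings in a cluster definition. The sketch below builds such a definition as a Python dictionary using field names from the Databricks Clusters API (`autoscale`, `min_workers`, `max_workers`, `autotermination_minutes`). The helper `build_cluster_spec`, the runtime version, and the node type are placeholder assumptions, and treating auto-termination as the "pause" option is an interpretation rather than something the article states.

```python
import json


def build_cluster_spec(name: str, min_workers: int, max_workers: int,
                       idle_minutes: int) -> dict:
    """Return a cluster definition that autoscales and shuts down when idle."""
    if not 0 <= min_workers <= max_workers:
        raise ValueError("require 0 <= min_workers <= max_workers")
    return {
        "cluster_name": name,
        "spark_version": "13.3.x-scala2.12",  # placeholder runtime version
        "node_type_id": "Standard_DS3_v2",    # placeholder Azure VM size
        "autoscale": {"min_workers": min_workers, "max_workers": max_workers},
        "autotermination_minutes": idle_minutes,  # stop after this much idle time
    }


# Hypothetical cluster: 2 to 8 workers, terminated after 30 idle minutes.
spec = build_cluster_spec("etl-nightly", 2, 8, 30)
print(json.dumps(spec, indent=2))
```

A definition like this would be submitted through the Databricks UI, CLI, or REST API; the submission step is omitted here because it depends on workspace credentials.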

These are the typical options for running big data workloads in Azure. The right choice depends on the maturity of the existing Hadoop applications. The IaaS option provides the most control over the platform, but it also requires the most management work from the application and admin teams. Organizations looking to migrate existing Hadoop workloads to Azure should first perform a feasibility, cost, and value analysis. Typically, Azure cloud-native big data services like Azure Databricks provide the best performance and the most optimized cost. In our experience, organizations gain more than 20% in performance by moving Hive and Spark jobs to Databricks. Re-architecting the jobs to leverage cloud-native benefits can improve performance several times over. More and more organizations find it appealing to modernize the monolithic Hadoop platform and achieve a high ROI on the migration investment.

Written by Tarun Agarwal