The Modern Cloud Data Platform war — Hadoop Data Platform (Part 1)

LAKSHMI VENKATESH
Published in Data Arena · 3 min read · Aug 8, 2021

Yes, Hadoop!

This article is part of the multi-part series Modern Cloud Data Platform War (parent article). Previous part: Modern Cloud Data Platform War — DataBricks (Part 4) — Machine Learning and Analytics.

Let us assume Company X runs on-premise Hadoop (one of several variants such as Cloudera Hadoop or Apache Hadoop), leveraging commodity hardware for a large-scale distributed system spanning several hundred nodes. For the processing layer, they have been using MapReduce for the past 8+ years, and the firm spends several million dollars on it every year.
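Since MapReduce is the incumbent processing model here, a minimal sketch of its map/shuffle/reduce stages helps frame what any migration target must replace. This is plain Python, purely illustrative; Company X's real jobs would run on YARN across those hundreds of nodes.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    """Map step: emit (word, 1) pairs, as a MapReduce mapper would."""
    for line in lines:
        for word in line.lower().split():
            yield (word, 1)

def reduce_phase(pairs):
    """Reduce step: after the shuffle/sort groups keys, sum the counts."""
    ordered = sorted(pairs, key=itemgetter(0))  # stands in for the shuffle
    for word, group in groupby(ordered, key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

lines = ["big data on hadoop", "hadoop runs map reduce", "map reduce on big data"]
counts = dict(reduce_phase(map_phase(lines)))
print(counts["hadoop"])  # 2
```

The same word-count pipeline is the canonical "hello world" of every engine in the options below, which makes it a handy yardstick during POCs.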

Unlike the previous DataBricks article, this one does not focus on a single option for the Hadoop on-premise migration, nor does it treat the move as a generic cloud migration. Instead, let us explore the various options available to Company X.

Before jumping to conclusions about the technology choice, here is our roadmap:

Get phase — Discovery, the As-Is — current technology, people, and process.

In the Get phase, before delving into the As-Is state, it is important to understand why Company X wants to migrate. The rationale reveals whether this is a cost-optimization project, a maintenance optimization, a move to enable agility and high availability, an upgrade to a modern data technology stack, or simply the organization's cloud mandate. For Company X, it is a mix of cost, maintenance, and enhanced capability with modern data technology platforms. This phase will be discussed through the options.

Technology: Understand the current technology landscape and architecture: overall complexity, data complexities and dependencies, integrations with other systems, frequent/infrequent data access patterns and types, and the volume, velocity, and veracity of the data.
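To make the volume and access-pattern part of that inventory concrete, here is a sketch of a data-profiling helper. It walks a local directory as a stand-in; against real HDFS you would use `hdfs dfs -du` or the WebHDFS API instead, and `profile_tree` is an invented name, not a product API.

```python
import os
import tempfile
from collections import defaultdict

def profile_tree(root):
    """Summarize total bytes and file counts per extension under root."""
    stats = defaultdict(lambda: [0, 0])  # extension -> [total_bytes, file_count]
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            ext = os.path.splitext(name)[1] or "<none>"
            stats[ext][0] += os.path.getsize(os.path.join(dirpath, name))
            stats[ext][1] += 1
    return dict(stats)

# Tiny demo tree standing in for a (much larger) data lake:
demo = tempfile.mkdtemp()
with open(os.path.join(demo, "events.parquet"), "wb") as f:
    f.write(b"\x00" * 1024)
with open(os.path.join(demo, "pipeline.log"), "w") as f:
    f.write("hello")
summary = profile_tree(demo)
print(summary[".parquet"])  # [1024, 1]
```

Even a rough per-format breakdown like this feeds directly into storage-tier and egress-cost estimates in the assessment phase.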

Users: What are the different types of users, from simple users to power users and super users? What is each set of users doing with the data, and how are they accessing it? The answers will reveal how many duplicate copies of the data exist, how users are permissioned to access it, and so on. The main way a Data Lake turns into a Data Swamp is through data-permission issues combined with endless slicing and dicing of the data.
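One cheap way to quantify that duplicate-copies problem during discovery is to fingerprint dataset contents. The sketch below groups byte-identical copies by SHA-256 digest; the helper name and paths are invented for illustration, and a real scan would stream HDFS files rather than hold their bytes in memory.

```python
import hashlib
from collections import defaultdict

def find_duplicates(datasets):
    """Group dataset copies whose contents are byte-identical.

    datasets maps a path to its raw bytes; any group with more than
    one path is a redundant copy someone sliced off for themselves.
    """
    by_digest = defaultdict(list)
    for path, blob in datasets.items():
        by_digest[hashlib.sha256(blob).hexdigest()].append(path)
    return [paths for paths in by_digest.values() if len(paths) > 1]

# Illustrative copies made by different user groups:
copies = {
    "/lake/raw/trades.csv": b"trade_id,qty\n1,100\n",
    "/user/power/trades_copy.csv": b"trade_id,qty\n1,100\n",
    "/user/simple/report.csv": b"trade_id,qty\n2,50\n",
}
dupes = find_duplicates(copies)
print(dupes)  # [['/lake/raw/trades.csv', '/user/power/trades_copy.csv']]
```

A duplicate map like this also doubles as evidence when arguing for centralized permissioning in the To-Be design.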

Assessment & Selection phase — Cloud-native ready, best of three, Roadmap

Assess whether the current platform, project, or solution is already cloud-ready, along with application complexity, time, money, effort, etc.

Do an on-paper analysis: talk to different teams within the organization about Big Data solutioning and to the solution architects of the different products; read widely, and attend webinars and conferences to understand what each vendor has to offer; study competitors' materials to learn the shortcomings of the others and the additional benefits each claims; and find out what competitors are using. Once the initial on-paper strategy exercise is complete, shortlist the best three and go hands-on with the team, starting with a POC/MVP on a highly representative dataset.

As part of the Assessment and Selection phase, we will discuss the following options:

  • Option 1: Lift & Shift with Roadmap to Decommission
    On-Premise Hadoop and Big Data Architectures (Lambda, Kappa, etc).
  • Option 1a: Cloud-native and Decommission Tactical
    Spark with HDFS on Kubernetes
  • Option 2: AWS
    EMR on AWS (HaaS on AWS)
    Databricks on AWS
    Cloudera on AWS
    Spark and HDFS on EKS
    DynamoDB vs Hadoop
  • Option 3: Azure
    HDInsight
    Databricks on Azure
    Cloudera on Azure
    Spark and HDFS on AKS
    Azure Synapse vs Hadoop
  • Option 4: Cloudera Data Platform (CDP on Public cloud / Private cloud)

To-Be phase — Technology, Process, and Target operating model are decided.

Once the POC and MVP are complete, decide with the Cloud approval committee, pick a single or hybrid option, and work on the Target operating model.

Approval Phase — do POCs/MVPs, present, and get approvals

Go phase — Production-ready application built through Development, UAT, and Go-Live

Summary:

Many organizations use Hadoop on-premise for their Data Lake and Big Data processing, and our Company X is no exception. The landscape has shifted considerably: Cloudera and Hortonworks have merged, Cloudera has repositioned from its original Cloudera Hadoop distribution to the current Cloudera Data Platform (CDP, available on-premise and on private/public cloud), and AWS and Azure now offer Hadoop as a service. The following parts of this series evaluate these options for Company X.



I learn by writing: Data, AI, Cloud, and Technology. All the views expressed here are my own and do not represent the views of the firm I work for.