The Modern Cloud Data Platform war — Hadoop Data Platform (Part 1)

LAKSHMI VENKATESH
Published in Data Arena · 3 min read · Aug 8, 2021

Yes, Hadoop!

This article is part of the multi-part series Modern Cloud Data Platform War (parent article). Previous part: Modern Cloud Data Platform War — DataBricks (Part 4) — Machine Learning and Analytics.

Let us assume Company X runs on-premise Hadoop (one of several variants such as Cloudera Hadoop or Apache Hadoop), leveraging commodity hardware for a large-scale distributed system spanning several hundred nodes. For the processing layer, they have been using MapReduce for the past 8+ years, and the firm spends several million dollars on it every year.
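Since MapReduce is the incumbent processing model here, a minimal sketch of its map/shuffle/reduce stages helps frame what any migration target must replace. This is plain Python, purely illustrative; Company X's real jobs would run on YARN across those hundreds of nodes.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    """Map step: emit (word, 1) pairs, as a MapReduce mapper would."""
    for line in lines:
        for word in line.lower().split():
            yield (word, 1)

def reduce_phase(pairs):
    """Reduce step: after the shuffle/sort groups keys, sum the counts."""
    ordered = sorted(pairs, key=itemgetter(0))  # stands in for the shuffle
    for word, group in groupby(ordered, key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

lines = ["big data on hadoop", "hadoop runs map reduce", "map reduce on big data"]
counts = dict(reduce_phase(map_phase(lines)))
print(counts["hadoop"])  # 2
```

The same word-count pipeline is the canonical "hello world" of every engine in the options below, which makes it a handy yardstick during POCs.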

Unlike the previous DataBricks article, this one does not focus on a single option for the Hadoop on-premise migration, nor does it treat the move as a generic cloud migration. Instead, let us explore the various options available to Company X.

Before jumping to conclusions about the technology choice, here is our roadmap:

Get phase — Discovery, the As-Is — current technology, people, and process.

In the Get phase, before delving into the As-Is state, it is important to understand why Company X wants to migrate. The rationale reveals whether this is a cost-optimization project, a maintenance optimization, a move to enable agility and high availability, an upgrade to a modern data technology stack, or simply the organization's cloud mandate. For Company X, it is a mix of cost, maintenance, and enhanced capability with modern data technology platforms. This phase will be discussed through the options.

Technology: Understand the current technology landscape and architecture: overall complexity, data complexities and dependencies, integrations with other systems, frequent/infrequent data access patterns and types, and the volume, velocity, and veracity of the data.
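To make the volume and access-pattern part of that inventory concrete, here is a sketch of a data-profiling helper. It walks a local directory as a stand-in; against real HDFS you would use `hdfs dfs -du` or the WebHDFS API instead, and `profile_tree` is an invented name, not a product API.

```python
import os
import tempfile
from collections import defaultdict

def profile_tree(root):
    """Summarize total bytes and file counts per extension under root."""
    stats = defaultdict(lambda: [0, 0])  # extension -> [total_bytes, file_count]
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            ext = os.path.splitext(name)[1] or "<none>"
            stats[ext][0] += os.path.getsize(os.path.join(dirpath, name))
            stats[ext][1] += 1
    return dict(stats)

# Tiny demo tree standing in for a (much larger) data lake:
demo = tempfile.mkdtemp()
with open(os.path.join(demo, "events.parquet"), "wb") as f:
    f.write(b"\x00" * 1024)
with open(os.path.join(demo, "pipeline.log"), "w") as f:
    f.write("hello")
summary = profile_tree(demo)
print(summary[".parquet"])  # [1024, 1]
```

Even a rough per-format breakdown like this feeds directly into storage-tier and egress-cost estimates in the assessment phase.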

Users: What are the different types of users, from simple users to power users and super users? What is each set of users doing with the data, and how are they accessing it? The answers will reveal how many duplicate copies of the data exist, how users are permissioned to access it, and so on. The main way a Data Lake turns into a Data Swamp is through data-permission issues combined with endless slicing and dicing of the data.
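One cheap way to quantify that duplicate-copies problem during discovery is to fingerprint dataset contents. The sketch below groups byte-identical copies by SHA-256 digest; the helper name and paths are invented for illustration, and a real scan would stream HDFS files rather than hold their bytes in memory.

```python
import hashlib
from collections import defaultdict

def find_duplicates(datasets):
    """Group dataset copies whose contents are byte-identical.

    datasets maps a path to its raw bytes; any group with more than
    one path is a redundant copy someone sliced off for themselves.
    """
    by_digest = defaultdict(list)
    for path, blob in datasets.items():
        by_digest[hashlib.sha256(blob).hexdigest()].append(path)
    return [paths for paths in by_digest.values() if len(paths) > 1]

# Illustrative copies made by different user groups:
copies = {
    "/lake/raw/trades.csv": b"trade_id,qty\n1,100\n",
    "/user/power/trades_copy.csv": b"trade_id,qty\n1,100\n",
    "/user/simple/report.csv": b"trade_id,qty\n2,50\n",
}
dupes = find_duplicates(copies)
print(dupes)  # [['/lake/raw/trades.csv', '/user/power/trades_copy.csv']]
```

A duplicate map like this also doubles as evidence when arguing for centralized permissioning in the To-Be design.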

Assessment & Selection phase — Cloud-native ready, best of three, Roadmap

Assess whether the current platform, project, or solution is already cloud-ready, along with application complexity, time, money, effort, etc.

Do an on-paper analysis: talk to different teams within the organization about Big Data solutioning and to the solution architects of the different products; read widely, and attend webinars and conferences to understand what each vendor has to offer; study competitors' materials to learn the shortcomings of the others and the additional benefits each claims; and find out what competitors are using. Once the initial on-paper strategy exercise is complete, shortlist the best three and go hands-on with the team, starting with a POC/MVP on a highly representative dataset.

As part of the Assessment and Selection phase, we will discuss the following options:

  • Option 1: Lift & Shift with Roadmap to Decommission
    On-Premise Hadoop and Big Data Architectures (Lambda, Kappa, etc).
  • Option 1a: Cloud-native and Decommission Tactical
    Spark with HDFS on Kubernetes
  • Option 2: AWS
    EMR on AWS (HaaS on AWS)
    Databricks on AWS
    Cloudera on AWS
    Spark and HDFS on EKS
    DynamoDB vs Hadoop
  • Option 3: Azure
    HDInsight
    Databricks on Azure
    Cloudera on Azure
    Spark and HDFS on AKS
    Azure Synapse vs Hadoop
  • Option 4: Cloudera Data Platform (CDP on Public cloud / Private cloud)

To-Be phase — Technology, Process, and Target operating model are decided.

Once the POC and MVP are complete, decide with the Cloud approval committee, pick a single or hybrid option, and work on the Target operating model.

Approval Phase — do POCs/MVPs, present, and get approvals

Go phase — Production-ready application built through Development, UAT, and Go-Live

Summary:

Many organizations use Hadoop on-premise for their Data Lake and Big Data processing, and our Company X is no exception. The landscape has shifted considerably: Cloudera and Hortonworks have merged, Cloudera has repositioned from its original Cloudera Hadoop distribution to the current Cloudera Data Platform (CDP, available on-premise and on private/public cloud), and AWS and Azure now offer Hadoop as a service. The following parts of this series evaluate these options for Company X.



I learn by writing: Data, AI, Cloud, and Technology. All the views expressed here are my own and do not represent the views of the firm I work for.