Migrating Big Data Workloads to AWS EMR

Giriraj Bagdi
8 min read · Apr 8, 2020


Overview

Maintaining an on-premises Hadoop/Spark infrastructure eventually leads to a lack of agility, excessive cost, and unnecessary administrative overhead. The infrastructure must be manually built, managed, and hand-fed.

At Intuit, the Small Business Group maintains data pipelines with thousands of ETL jobs and interactive queries. One of our engineering projects was to migrate these workloads from “Pet” servers hosted in our data centers to “Cattle” services in the cloud.

In this blog post, we share our experiences and key insights from the successful migration of workloads from on-premises Hadoop/Spark clusters to the AWS cloud with zero disruption. The result: increased availability of the jobs that produce critical dashboards and reports, along with faster business insights and better cost control.

Why Amazon EMR?

As a team, our primary considerations were reduced downtime for our jobs, faster business insights, and better cost control. We identified the following design goals to help us achieve these:

  • Decouple compute and storage.
  • Build a seamless data lake on Amazon S3.
  • Run stateless compute infrastructure using Amazon EMR.
  • Use both persistent and transient cluster capabilities in AWS EMR.
  • Segregate EMR clusters by business unit for better isolation, cost attribution, and hardware customization.

Requirement CheckList

Migrations are also opportunities to assess gaps in the existing system, understand the benefits of the new system, and plan accordingly. During the migration effort, the following requirement checklist was created:

  • High Availability: The clusters should be highly available, enabling Analysts around the globe to perform exploratory analyses, execute ETL and reports at scale.
  • Heterogeneity: The cluster should have the capability to support a wide variety of workloads across different engines (Hive, Spark etc.).
  • Query concurrency: We needed to onboard 200+ analysts across several business units who can run ad hoc queries simultaneously, so the implementation had to support this degree of concurrency with good performance while staying within cost boundaries.
  • Reliability of Jobs & Data Consistency: Jobs need to run reliably and consistently to avoid breaking SLAs.
  • Interoperability (with existing tools and IDEs): A wide variety of internal/external tools need to connect to EMR, including but not limited to Alation, Tableau, PyCharm, and QlikView. The implementation should support these existing tools and make it seamless for users to connect to EMR.
  • Cost-effective and Scalable: The shift to AWS EMR is meant to take advantage of the scalability it offers out of the box. We achieved great performance while keeping cost under budget with proactive and efficient cost monitoring.
  • Security/Compliance: Given the sensitivity of the data, the AWS EMR environment requires fine-grained access control and appropriate security implementations.

EMR Migration Use Cases

Exploratory and Scheduled were the two primary use cases implemented in the migration journey.

Scheduled use case — These are the N+ scheduled jobs that run on EMR. Many of these jobs are long-running and have strict SLAs, as they feed critical reports. A subset had a low frequency, so transient clusters were spun up to process those jobs and minimize cost. A high-level conceptual diagram is shown below.
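The transient pattern above can be sketched as a cluster request that carries its job as a step and terminates itself when done. This is a minimal, hypothetical sketch: the bucket names, job names, instance counts, and EMR release are illustrative, not Intuit's actual values; the dict mirrors the parameter shape accepted by boto3's EMR `run_job_flow` call.

```python
def transient_cluster_request(job_name, script_s3_path, log_bucket):
    """Build a run_job_flow-style request for a cluster that spins up,
    runs one Spark step, and terminates itself to minimize cost."""
    return {
        "Name": f"transient-{job_name}",
        "ReleaseLabel": "emr-5.29.0",              # illustrative release label
        "LogUri": f"s3://{log_bucket}/emr-logs/",
        "Applications": [{"Name": "Spark"}, {"Name": "Hive"}],
        "Instances": {
            "InstanceGroups": [
                {"InstanceRole": "MASTER", "InstanceType": "r5.xlarge",
                 "InstanceCount": 1},
                {"InstanceRole": "CORE", "InstanceType": "r5.2xlarge",
                 "InstanceCount": 4},
            ],
            # Key to the transient pattern: the cluster dies when steps finish.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": [{
            "Name": job_name,
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", script_s3_path],
            },
        }],
    }

request = transient_cluster_request(
    "nightly-etl", "s3://my-bucket/jobs/etl.py", "my-log-bucket")
# Actual submission (needs AWS credentials):
# boto3.client("emr").run_job_flow(**request)
```

Because the cluster is billed only while the step runs, a low-frequency job pays for minutes of compute rather than a standing cluster.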

Exploratory use case — Exploratory queries are run by different groups, including data engineers, data scientists, and data analysts. These use cases needed to connect numerous clients to the cluster via JDBC, including dashboard products. A high-level conceptual diagram is shown below.
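For the JDBC path, BI tools point at HiveServer2 on the cluster's master node. A minimal sketch of the connection string such tools would use, assuming the default HiveServer2 port (10000) and a placeholder master hostname:

```python
def hive_jdbc_url(master_dns, port=10000, database="default"):
    """Build the HiveServer2 JDBC URL a BI tool (Tableau, Alation, etc.)
    would use to reach an exploratory EMR cluster. Extra properties such
    as SSL or Kerberos principals are appended as ;key=value pairs."""
    return f"jdbc:hive2://{master_dns}:{port}/{database}"

# Placeholder hostname, not a real cluster endpoint.
url = hive_jdbc_url("ec2-xx-xx.compute.amazonaws.com")
```

In practice the hostname would come from cluster discovery or a load balancer in front of the exploratory clusters, so analysts never hard-code a specific master node.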

Work Stream

To meet the requirement checklist and implement the use cases, the following work streams were created, each with specific goals.

Platform Readiness

  • Implemented the Exploratory & Scheduled EMR concepts via CI/CD
  • Programmatic Method to evaluate Performance & Cost
  • Shadow Testing for User Experience, Performance, Scaling
  • Connectivity with various 3rd party Tools
  • Run-books for Operational Excellence

Migration of Data, Dashboards, Jobs

  • Ingestion completion, count parity, sampling parity, lag
  • Complete identification of jobs & dashboards
  • Onboarding of users & data validation

Analyst/ Data Scientist Onboarding

  • Training sessions with follow-up & follow-through
  • Fine-grained access management for users
  • Visibility of the platform via command & control

Operationalization

  • Best practices, e.g. tagging
  • AMI Rotation
  • Provisioning & Management of EMR
  • Monitoring & Alerting

Cost & Capacity Management

  • Use Spot & Reserved Instances to lower cost
  • Instance fleets for advanced Spot provisioning
  • Lower cost with autoscaling
  • Cluster by business unit & have cost visibility

Security Governance & Auditing

  • Authentication & authorization, including Kerberos
  • SQL authorization
  • Security encryption
  • Governance & auditing
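The encryption and Kerberos items above can be combined in a single EMR "security configuration" attached to clusters at creation time. The sketch below is a hypothetical example of that JSON document: the certificate location, encryption mode, and ticket lifetime are illustrative values, not Intuit's actual settings.

```python
import json

security_config = {
    "EncryptionConfiguration": {
        "EnableInTransitEncryption": True,
        "EnableAtRestEncryption": True,
        "AtRestEncryptionConfiguration": {
            # Server-side encryption for data landing in S3 via EMRFS.
            "S3EncryptionConfiguration": {"EncryptionMode": "SSE-S3"}
        },
        "InTransitEncryptionConfiguration": {
            "TLSCertificateConfiguration": {
                "CertificateProviderType": "PEM",
                # Illustrative bucket/path for the TLS certificate bundle.
                "S3Object": "s3://my-bucket/certs/emr-certs.zip",
            }
        },
    },
    "AuthenticationConfiguration": {
        "KerberosConfiguration": {
            # A KDC that lives and dies with the cluster.
            "Provider": "ClusterDedicatedKdc",
            "ClusterDedicatedKdcConfiguration": {"TicketLifetimeInHours": 24},
        }
    },
}

print(json.dumps(security_config, indent=2))
```

Defining this once and referencing it by name keeps every cluster, persistent or transient, on the same security baseline.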

Foundation Architecture

To achieve these goals, various methodologies were leveraged, and a massive effort went into an automated framework to ensure an apples-to-apples comparison between the hosted infrastructure and AWS EMR. Automation was done in the following areas, and frameworks were developed to support each initiative:

  • Workload Management
  • Shadow Testing Framework
  • Monitoring
  • Cluster Performance & Tuning
  • Data Analyst Onboarding
  • Command & Control
  • Cost & Capacity Management

Workload Management

To manage the AWS infrastructure and support operations, multiple EMR clusters (Exploratory, Scheduled, Transient) were created based on workload, and appropriate tagging was applied to effectively manage the operationalization and cost of the clusters.
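The tagging scheme can be sketched as a small helper that produces the tag list applied at cluster creation. The tag keys and values below are illustrative assumptions, not the actual tag taxonomy used at Intuit:

```python
def cluster_tags(business_unit, cluster_type, environment):
    """Tags applied at cluster creation so cost reports and operational
    dashboards can be sliced by business unit and workload type."""
    return [
        {"Key": "BusinessUnit", "Value": business_unit},
        # One of: Exploratory / Scheduled / Transient.
        {"Key": "ClusterType", "Value": cluster_type},
        {"Key": "Environment", "Value": environment},
    ]

tags = cluster_tags("small-business", "Exploratory", "prod")
# These pairs go into the "Tags" field of the cluster creation request,
# and the same keys can be activated as cost-allocation tags in billing.
```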

Shadow Testing Framework

In the hosted big data environment, Cloudera Hadoop distributions ran on physical servers, whereas in AWS the EMR (Elastic MapReduce) managed service was leveraged. One of the most critical parts of the AWS migration journey is migrating and optimizing the EMR environment to match and exceed the performance and scaling of the current data centers. The high-level differences between the two environments were:

  • Execution Engine change from MR/Spark to Tez in EMR.
  • Data Storage was replaced with S3 in EMR.
  • Multiple EMR clusters in AWS vs 1 common cluster in IHP.
  • Hive SQL authorization is enabled in EMR.

To address these unknowns for our workloads, we needed a scientific method to evaluate query performance in EMR compared with IHP, i.e., an apples-to-apples comparison. Hence the methodology of shadow testing was followed, which involved capturing big data queries from the hosted data center and re-running them in AWS EMR to evaluate production readiness. Shadow testing allows evaluating whether the product satisfies workload needs in terms of performance, latency, concurrency, and query compatibility. The goal is to simulate a production environment with a real workload to proactively detect issues.
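The core of the comparison can be sketched as a loop that replays each captured query against both environments and flags regressions. This is a minimal illustration of the idea, not the actual framework: the runner functions are stand-ins for real Hive/Spark submissions, and the 20% slowdown threshold is an assumed tolerance.

```python
def shadow_compare(queries, run_on_ihp, run_on_emr, slowdown_threshold=1.2):
    """Replay each query on both environments. Each runner returns
    (row_count, runtime_seconds). Flag queries whose results differ or
    whose EMR runtime exceeds the on-premises runtime beyond the threshold."""
    regressions = []
    for q in queries:
        ihp_rows, ihp_secs = run_on_ihp(q)
        emr_rows, emr_secs = run_on_emr(q)
        if ihp_rows != emr_rows:
            regressions.append((q, "result-mismatch"))
        elif emr_secs > ihp_secs * slowdown_threshold:
            regressions.append((q, "slowdown"))
    return regressions

# Stub runners simulating captured measurements: (row_count, seconds).
ihp = {"q1": (100, 10.0), "q2": (50, 4.0)}
emr = {"q1": (100, 9.0), "q2": (50, 9.0)}   # q2 runs slower on EMR
found = shadow_compare(["q1", "q2"], lambda q: ihp[q], lambda q: emr[q])
# found → [("q2", "slowdown")]
```

Running this over the full captured workload surfaces exactly which queries need tuning (e.g., for the Tez engine or S3 storage) before cutover.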

Cluster Performance

Staying on top of your EMR Cluster performance and availability can become challenging if you have to manage multiple clusters across different accounts. Monitoring and continuous performance tuning helped us with three main objectives:

  • Minimize: apps pending, apps failed
  • Maximize: both memory and core utilization
  • Cost optimization: minimize idle cluster resources by using proper autoscaling policies based on HDFS and task-node utilization
  • Specification: based on benchmarking, we finalized the following cluster-level specifications:
  • Instance types for Tez/Spark application types from the r5.* family
  • EMRFS config for max connections
  • YARN/Tez/HDFS config optimizations over the default values, based on benchmarks
  • Measuring cluster performance: evaluated from Ganglia reports for memory, disk I/O, and network-related metrics across the cluster
  • Controlling cluster cost: by comparing cluster utilization against the number of nodes and apps submitted
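The autoscaling objective above can be expressed as an EMR automatic-scaling policy keyed on YARN memory headroom. The sketch below is illustrative: the capacities, thresholds, and cooldowns are assumed values, not the tuned production policy.

```python
# Scale task capacity out when YARN memory headroom is low, back in when
# the cluster is mostly idle; attached per instance group at creation.
autoscaling_policy = {
    "Constraints": {"MinCapacity": 2, "MaxCapacity": 20},
    "Rules": [
        {
            "Name": "ScaleOutOnLowMemory",
            "Action": {"SimpleScalingPolicyConfiguration": {
                "AdjustmentType": "CHANGE_IN_CAPACITY",
                "ScalingAdjustment": 2,      # add 2 nodes at a time
                "CoolDown": 300,
            }},
            "Trigger": {"CloudWatchAlarmDefinition": {
                "ComparisonOperator": "LESS_THAN",
                "MetricName": "YARNMemoryAvailablePercentage",
                "Threshold": 15.0,           # under 15% memory headroom
                "Period": 300,
                "EvaluationPeriods": 1,
                "Unit": "PERCENT",
            }},
        },
        {
            "Name": "ScaleInOnIdle",
            "Action": {"SimpleScalingPolicyConfiguration": {
                "AdjustmentType": "CHANGE_IN_CAPACITY",
                "ScalingAdjustment": -2,     # remove 2 nodes at a time
                "CoolDown": 300,
            }},
            "Trigger": {"CloudWatchAlarmDefinition": {
                "ComparisonOperator": "GREATER_THAN",
                "MetricName": "YARNMemoryAvailablePercentage",
                "Threshold": 75.0,           # over 75% memory headroom
                "Period": 300,
                "EvaluationPeriods": 1,
                "Unit": "PERCENT",
            }},
        },
    ],
}
```

Asymmetric thresholds (15% out, 75% in) leave a dead band so the cluster does not flap between scale-out and scale-in on noisy utilization.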

Monitoring

Several tools were utilized to set up end-to-end monitoring of our EMR clusters:

  • Proper tagging of AWS resources enabled us to identify different types of clusters (Scheduled, Exploratory, Testing etc) under different accounts and attaching them to various cluster metrics which were monitored to maintain the availability of clusters and ensure optimal performance.
  • Visualization of Cluster Health and Performance was important for Leadership, developers and admins to gain insight into EMR clusters usage, performance and availability.
  • The Account-Wise View dashboard was developed to give a holistic view of EMR clusters in a selected account, with metrics like core node count, task node count, utilization %, and app status (pending, killed, submitted, completed, failed).
  • A Cluster-Wise View dashboard was set up to provide an in-depth view of a selected EMR cluster in an account, with statistics displayed for a single cluster only. Alerts were set up and triggered whenever any of these metrics crossed its threshold, e.g.:
  • Cluster Overview: Covers Utilization %, Number of Core and Task Nodes, Apps Status, Containers Pending, Active Users, Dead Nodes
  • Green/Red Status for Services Health: Covers HS2, Resource Manager, SHS, HCatalog, NameNode, SafeMode
  • Monitoring on Master Node Activity: CPU, Free Physical Memory, Connections
  • Monitoring on Data Nodes Activity: Average CPU, Total Free Physical Memory, Disk Capacity, FS Block Stats and Capacity
  • HS2 Activity: Execution Engine Task Count, Active Queries Count, HS2 Queries Stat (Successful, Failed, Submitted, Compiled etc), Heap Memory Usage %, GC Stats, Sessions Count (Open, Active, Abandoned)
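The alert triggers behind these dashboards reduce to threshold checks over a metric snapshot. A minimal sketch of that logic, with metric names and limits chosen as illustrative assumptions:

```python
# Illustrative alert bounds; the real limits would be tuned per cluster.
THRESHOLDS = {
    "apps_pending": 50,        # alert if more than 50 apps are queued
    "dead_nodes": 0,           # alert on any dead node
    "hs2_heap_pct": 85.0,      # alert above 85% HiveServer2 heap usage
}

def check_cluster(metrics):
    """Return (metric, value) pairs that breached their threshold;
    a non-empty result would page the on-call or post to Slack."""
    return [(name, metrics[name])
            for name, limit in THRESHOLDS.items()
            if metrics.get(name, 0) > limit]

alerts = check_cluster({"apps_pending": 120, "dead_nodes": 0,
                        "hs2_heap_pct": 60.0})
# alerts → [("apps_pending", 120)]
```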

Cost & Capacity Management

In the journey of migrating the workload from the hosted data center to AWS, right-sizing the EMR clusters (i.e., capacity planning) and managing cost were highly critical. Unlimited EC2 resources can be spun up via the AWS console, but without proper cost accounting, spending can drastically exceed the budget, and the lack of an autoscaling policy further increases cluster cost. We therefore combined sizing, tagging, and workload management into a mechanism for managing cluster cost alongside accurate capacity planning, enabling a seamless and successful migration.

Methodology followed for AWS EMR Capacity Planning

  • A Shadow Testing framework was developed for Intuit workloads, providing a scientific method to evaluate query performance in EMR compared with the Intuit DC, i.e., an apples-to-apples comparison.
  • To ensure all parameters were applied appropriately in a repeatable pattern, clusters were created via a serverless framework. Based on the workload, clusters were segregated into multiple small, medium, and large clusters, thereby providing higher reliability, availability, and user segregation while managing cost optimally.
  • Clusters were set up based on usage pattern: Exploratory — persistent EMR clusters that users query to gain insight. Scheduled — persistent EMR clusters where users run ETL jobs for processing. Transient — short-lived EMR clusters that are created, process an ETL job, and are then terminated.
  • EMR clusters for exploration — Multiple exploratory EMR clusters were created, segregated by user group, with access granted based on the role each user performs on the cluster. Sizing can therefore be done from the number of users and the kind of queries they run for insights, and cost can be aggregated across the multiple clusters.
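The small/medium/large segregation can be sketched as a t-shirt sizing heuristic driven by the number of concurrent analysts a group onboards. The cut-offs and node counts below are assumptions for illustration, not Intuit's actual capacity numbers:

```python
# Illustrative t-shirt sizes for exploratory clusters.
SIZES = {
    "small":  {"core_nodes": 5,  "max_task_nodes": 10},
    "medium": {"core_nodes": 10, "max_task_nodes": 25},
    "large":  {"core_nodes": 20, "max_task_nodes": 50},
}

def pick_cluster_size(concurrent_analysts):
    """Map a group's expected concurrent analyst count to a cluster size;
    benchmark data from shadow testing would refine these cut-offs."""
    if concurrent_analysts <= 25:
        return "small"
    if concurrent_analysts <= 75:
        return "medium"
    return "large"

size = pick_cluster_size(60)   # → "medium"
spec = SIZES[size]
```

Because each business unit gets its own cluster, the size can be revisited per group as onboarding grows, without disturbing other tenants.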

Key learning & benefits

  • Cloud Managed Infrastructure
  • Single pane of glass to manage various cluster types
  • Persistent, scheduled, transient
  • Isolation of Jobs based on Business units
  • Tag based Cost Monitoring & Visibility
  • Instance type & Cluster sizing
  • Capability to make Self service platform
  • Infrastructure as code
  • Monitoring & alerting patterns
  • On-call & instant Slack support
  • Feedback from data engineers, data analysts & stakeholders

Credits

We would like to thank the following people for contributing to this blog:

Gaurav Doon, Khilawar Verma, Navin Soni, Anil Kumar Batcho, Giriraj Bagdi, Rajesh Saluja, Veena Bitra, Sanooj Mananghat

Summary

Finally, meticulous planning and seamless user onboarding helped us achieve the milestone of provisioning and migrating scheduled jobs and exploratory queries, with a precise operationalization runbook and appropriate AWS cost.


Giriraj Bagdi

Group Engineering Manager in QuickData Platform. Small Business Group at Intuit