Advantages of Migrating to Cloud for Enterprise Analytics Environment

Sridhar Leekkala
Walmart Global Tech Blog
8 min read · Feb 5, 2021

Introduction

We are a data team. We spend the bulk of our effort building data pipelines from operational systems into our Decision Support infrastructure. We synthesize analytical data assets from the operational data flow and publish these assets for consumption across the enterprise. Our ETL pipelines are built with an in-house ETL framework, with workflows that run on MapReduce tuned with Tez parameters and some workloads running on Apache Spark. Data flows from various sources across the organization through a series of logical stages ("Raw Zone", "Cleansed", and "Transformed") to build multiple fact tables suited to the enterprise teams' use cases. The data is then flattened and loaded into the consumption layers for ease of business analysis and reporting. This kind of work is common at most companies today, and we hope that our story of overcoming a series of challenges through a cloud migration resonates with you and your teams.
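To give a flavor of what one of these logical hops looks like in code, here is a minimal Spark-with-Scala sketch (the stack our workflows now run on, as described later in this post) of a single Raw Zone to Cleansed step. The table names, columns, and cleansing rules are illustrative placeholders, not our actual schemas or framework code.

```scala
import org.apache.spark.sql.{SparkSession, functions => F}

// Illustrative sketch of one Raw -> Cleansed hop; names and rules are placeholders.
object CleanseOrders {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("raw-to-cleansed-orders")
      .enableHiveSupport()
      .getOrCreate()

    // Read the raw feed registered in the Hive metastore.
    val raw = spark.table("raw_zone.orders")

    // Basic cleansing: drop duplicate keys, standardize types, filter bad records.
    val cleansed = raw
      .dropDuplicates("order_id")
      .withColumn("order_ts", F.to_timestamp(F.col("order_ts")))
      .filter(F.col("order_id").isNotNull)

    // Land the result in the Cleansed layer as partitioned, Snappy-compressed ORC.
    cleansed.write
      .mode("overwrite")
      .format("orc")
      .option("compression", "snappy")
      .partitionBy("order_date")
      .saveAsTable("cleansed.orders")

    spark.stop()
  }
}
```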


The journey below relates to the analytics environment we use for internal reporting; it is not a customer-facing production environment.

Situation Before the Migration:

We used an on-prem, multi-tenant, static Hadoop cluster for our data loads. All tenants hit their peak resource demand at the same time, which pushed the cluster into an over-utilized state and left our jobs finishing just before our Service Level Agreement (SLA) deadlines. There was no slack for upstream delays, data validations, or any potential data issues we might have to address. The on-prem multi-tenant shared cluster design also complicated security policy implementation and made Ranger less efficient and effective. Finally, the static cluster was not well suited for streaming data, as it could not support fluctuating data demands throughout the day.

Migration to Cloud Platform

To overcome these constraints, we started evaluating cloud service providers with managed Hadoop and Spark environments well suited for migrating our Hadoop workflows and data. Below, we briefly discuss how we addressed the issues outlined above.

Cluster Elasticity:

The cloud provides the flexibility to choose the cluster type and to spin up as many clusters as required. All of these clusters come with managed Hadoop services and a Hive metastore. Depending on your use case, you can use any of the cluster types below, or a combination of them, to overcome limited cluster elasticity.

Cluster Types:

Static clusters have a fixed number of nodes and do not scale up or down automatically; nodes must be added or removed manually.

Auto-scale clusters come in two main flavors: time-based clusters (which scale out on configured time intervals) and metric-based clusters (which scale out based on metrics such as cluster utilization percentage).

Ephemeral clusters exist only for the duration of a job and are torn down once it finishes.

We choose the number and type of clusters based on our domain groups, the nature of the workload (incremental, history, and correction loads), resource needs, timing, and frequency.

Smaller workloads, or workloads that run frequently (for example, a job that runs every 30 minutes), can be scheduled on static clusters, as the cost of ramp-up/down time essentially erases any benefit of autoscaling.

Workloads that process similar data volumes every day can be scheduled on time-based clusters. For example, SLA-bound jobs may start around 1 AM and finish around 6 AM every day; outside this window, the cluster can scale down and run with a minimal core definition. It is important to include some buffer time before scaling down to ensure the jobs complete before resources are removed.

Workloads that run frequently and whose data volumes vary significantly can be scheduled on metric-based clusters. It is important to ensure that scale-up and scale-down timings do not impact job execution.

Another alternative is a hybrid approach: keep a baseline of cores on a time-based schedule, scale out on metrics for the rest of the day, and spin up ephemeral clusters if required.

Using the options above gives the flexibility to scale clusters up or down based on calendar events or automatically based on resource demand. This assortment of cluster options gave us a custom mix of static and auto-scale (time-based or metric-based) configurations when planning cluster resources.
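To make the selection criteria above concrete, here is a minimal sketch of how a workload profile might be mapped to a cluster type. The fields, thresholds, and names are illustrative assumptions rather than our actual scheduling rules; the intent is only to capture the decision logic described above.

```scala
// Illustrative only: field names and thresholds are assumptions, not our real rules.
object ClusterPlanner {

  sealed trait ClusterType
  case object Static               extends ClusterType
  case object TimeBasedAutoscale   extends ClusterType
  case object MetricBasedAutoscale extends ClusterType
  case object Ephemeral            extends ClusterType

  final case class WorkloadProfile(
    runsPerDay: Int,            // how often the workflow runs
    volumeVariesDaily: Boolean, // does the daily data volume fluctuate significantly?
    slaBound: Boolean,          // must it finish inside a fixed daily window?
    oneOffBackfill: Boolean     // history/correction load that runs once and is done
  )

  def chooseCluster(w: WorkloadProfile): ClusterType =
    if (w.oneOffBackfill) Ephemeral                    // spin up, run, tear down
    else if (w.runsPerDay >= 24) Static                // ramp-up/down cost erases autoscaling gains
    else if (w.slaBound && !w.volumeVariesDaily) TimeBasedAutoscale // predictable nightly window
    else MetricBasedAutoscale                          // scale on utilization when volume varies
}
```

Under these assumptions, a one-time correction load would map to an ephemeral cluster, a job that runs every 30 minutes to a static cluster, and the 1 AM to 6 AM SLA-bound batch to a time-based cluster (with buffer time before scale-down, as noted above).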

Cluster utilization:

Because storage and compute are separated, the cloud enables us to configure clusters better in terms of memory, cores, and availability, and to autoscale them as required. Autoscaling solves the cluster over-utilization problem, and compute needs are easily met during the holidays and when handling fluctuating data demands.

Low Maintenance DR Environment:

In the cloud, storage can be multi-regional, so even if a cluster goes down, we can reach the cloud storage through another cluster. We already have multiple clusters set up with overlapping access to a few select data assets, and it is quite easy to spin up a new cluster with the same configuration and switch over. The cloud environment provides multi-region/multi-zone buckets for high availability. Disaster recovery in the cloud requires less maintenance and comes with better monitoring tools.

Separate Dev and Prod Environments:

In the cloud, we can isolate the dev and prod clusters while sharing an underlying data source on cloud storage with proper access controls. This is cost effective, and developers can spin up clusters as needed, enabling faster development cycles.

Security and Access Setup:

On-prem, access to the data on HDFS is controlled through Ranger policies. Since multiple teams can manage these policies, the setup becomes complicated and gives us less control over access to the data assets.

In the cloud, we created multi-region cloud storage locations with the required access control using IAM (Identity and Access Management). Each cluster has a secure and a non-secure service account, along with a login AD (Active Directory) group. Depending on the use case and data sensitivity, user groups are added to the login AD group configured for that cluster. The data assets on cloud storage get read/write access through the cluster's SAs (service accounts).

How the migration was done:

Data and Jobs Migration:

Before starting the migration, we froze any new requirements from the business. We had to migrate all of our data from the Raw, Staging, and Consumption layers, around 500 terabytes in total, along with 300+ workflows. The data was partitioned and stored as ORC compressed with Snappy. We used DistCp to copy the data from on-prem to the cloud and then validated it against the on-prem data. In addition to this effort, we migrated all our jobs to Spark with Scala for better performance and reduced job runtimes.
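To give a sense of the post-copy checks, here is a minimal sketch of the kind of validation that can run after DistCp, comparing a copied partition on cloud storage against its on-prem source. The bucket, database, table, and column names are placeholders rather than our actual locations, and the checks shown (row counts plus a key-level diff) are just one reasonable approach.

```scala
import org.apache.spark.sql.SparkSession

// Illustrative only: paths, table names, and columns below are placeholders.
object ValidateCopiedPartition {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("validate-distcp-copy")
      .enableHiveSupport()
      .getOrCreate()

    // On-prem source partition registered in the Hive metastore (Snappy-compressed ORC).
    val onPrem = spark.table("consumption.sales_fact")
      .where("sales_date = '2021-01-31'")

    // The same partition after DistCp to cloud object storage (URI scheme is a placeholder).
    val copied = spark.read.orc(
      "gs://analytics-bucket/consumption/sales_fact/sales_date=2021-01-31")

    // Cheap checks first: row counts, then a key-level diff to catch missing rows.
    val srcCount    = onPrem.count()
    val dstCount    = copied.count()
    val missingKeys = onPrem.select("sale_id").except(copied.select("sale_id")).count()

    println(s"source=$srcCount copied=$dstCount missingKeys=$missingKeys")
    require(srcCount == dstCount && missingKeys == 0,
      "Copied partition does not match the on-prem source")

    spark.stop()
  }
}
```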

Engineering readiness:

Scrum teams were formed around particular data assets, and each team worked on moving its data to the cloud and migrating the workflows. The entire team was trained on Spark with Scala and the cloud technologies. We formed a small group of leads to coordinate these efforts across all teams, and we held a daily sync-up with our architects, platform teams, and the cloud provider's teams to discuss issues and clarifications.

Business Readiness:

We worked closely with the business partners to validate the data and get UAT sign-off, and with compliance teams to meet all the criteria. To ensure our migration was successful and our end-users were comfortable with the new tech stack, we delivered one key data asset two months ahead of time. Training end-users on this one key data asset let them see familiar data in the new environment, which built support and advocacy for the new technology stack and reduced anxiety about the overall change.

We communicated the migration status and cut-off dates to all stakeholders. Our business partner and communications team made sure this reached the end-users and that they were moving to the new environment, and they also published information on available trainings, webinars, and help links. We ran the jobs in parallel for a month to give end-users and application teams time to migrate their existing on-prem flows to the cloud. We also followed a train-the-trainer approach to build advocacy within the customer base for the new platform: we identified a few tech-savvy business partners, trained them on the new tech stack, and they in turn helped us prepare the end-users.

Post Migration and Key take-aways:

Access Control & Security: The cloud has enabled us to add access controls at various levels, such as schema, table, and partition, a finer grain than an entire data asset. It also provides more visibility into the applications and users accessing the datasets.

Performance: The dedicated clusters helped us improve job performance significantly and meet our SLA every day. In the three months since migrating to the cloud, we have seen a 57.24% reduction in overall core utilization, and an overall gain of 30 to 45 minutes even though we are processing, on average, 2.5x more data.

Scalability: The cloud gave us greater flexibility to scale clusters up or down based on calendar events or automatically based on resource demand, letting us plan cluster resources with a custom mix of static and auto-scale (time-based or metric-based) configurations.

Usability/Accessibility: No additional application or utility software needs to be installed for users to access the data. The cloud service provider's simple user interface makes data access hassle-free for end-users, regardless of operating system.

Data as a Service: Since this infrastructure can process massive volumes of data and derive insights with low latency, we can abstract the data and provide it as an API to other teams. This separates the use of the data from the cost of a specific software environment or platform and centralizes data quality, security, compliance, and governance.
