Cloud Migration for Enterprise Analytics Environment with On-Demand Clusters

harisai gangadhar
Walmart Global Tech Blog
4 min readJan 7, 2022

This blog was co-authored and researched by Sridhar Leekkala

We are a data team. We spend the bulk of our efforts building out data pipelines from operational systems into our Decision Support infrastructure. Our journey from an on-prem to cloud environment has been described here:

This was the first phase of our cloud journey.

Challenges with phase one:

In phase one, we primarily had time-based clusters. The clusters scaled up to the full capacity at a particular time and subsequently scaled down. There were a few disadvantages with this approach:

Flexibility:

The approach was less flexible. In case of any delays, the production support teams were required to take adhoc steps to ensure the required capacity was available for an additional time period. This added more operational challenges.

Cost:

From the cost perspective, there was an additional burden of the standby nodes, which were the minimum number of nodes that were kept active throughout the day. Thus, a certain cost was incurred throughout the day even if the cluster was idle.

Availability of edge nodes:

Although the data in the cloud environment was guaranteed to be available over 99% of the time, the edge node was not. There was always a very slim chance of the edge node going down and impacting the data availability, SLAs and eventually the business.

Resources:

Having a dedicated edge node per cluster and installing certain licensed software like scheduling tools further added to the cost since every edge node had to utilise a new license to use the product.

To overcome these challenges and make our process more robust, cost effective and flexible, we planned to utilise ephemeral clusters in the second phase of our cloud journey.

Phase two

Ephemeral clusters:

Ephemeral clusters were created on demand and were fully terminated once the applications were executed. Whenever a particular process had to run, it first created the cluster with appropriate capacity, ran the necessary jobs and eventually terminated the cluster. Further scale-up/scale-down operations could also be triggered by the application. This effectively made the cluster operations like creation/deletion/scale-up a part of the ETL pipeline designed and managed by the application teams.

Creating ephemeral clusters:

Since the cluster operations were integrated with the ETL pipelines, it became important to judiciously create the clusters. Before creating the clusters, a thorough analysis had to be carried out to identify the right capacity. Creating numerous clusters with higher capacity could have resulted in wasted resources.

If there are limited number of ETL pipelines, then a cluster can be created for each pipeline. If there is a large number of pipelines, then a cluster can be created for a group of them.

Appropriate dependencies must be established between the pipelines so the cluster is created before the first pipeline runs and is terminated after the last pipeline has finished.

The application teams can define the magnitude of the pipelines, ranging from a single job or application id to multiple jobs. The processes can be grouped based on subject areas or the time of execution.

Pipelines running at the same time can be triggered on different clusters to avoid concurrency. In scenarios where the processes are run very frequently (half hourly, quarter hourly, streaming, etc.) a persistent cluster can be considered.

Pool of edge nodes:

Another architectural change we are considering, is to avoid dedicated edge nodes for each cluster and instead have a pool of edge nodes. These edge nodes would be common, used by all teams and managed centrally.

Every node in the pool would have all the necessary applications installed and have access to the application code. There would be a load balancing mechanism which would distribute the load equitably among all the edge nodes.

A sufficient level of access controls would also be in place, so a particular code would be accessed only by the respective team.

This not only ensures the software licences are judiciously used, but also ensures the high availability of the edge nodes.

For a successful implementation, we would require the following:

In order to have an effective on-demand mechanism sufficient capacity should be planned for.

A process to ensure effective distribution of resources among the clusters should be established.

An easy and effective process to create, delete, scale-up/scale-down the cluster should be defined.

Define the an effective load balancing mechanism to support the idea of a pool of edge nodes.

Correct access protocols should be established to ensure the necessary business restrictions while the access to the code is seamless.

A seamless and a simple process to establish the environment and to submit the processes needs to be established.

Appendix:

Some statistics of autoscaling that we observed in phase one and two:

--

--