Economics of Scaling Machine Learning Workloads — Part II

Siva Anne
IBM Data Science in Practice
7 min read · May 28, 2020

Architectural Lessons from Data Science Engagements

This two-part blog series shares architectural lessons from multiple data science engagements. The first part focused on the factors that impact the scalability of ML solutions. This second part looks at solution architectures that address a critical challenge: optimizing the costs of scaling machine learning workloads.

Optimizing Compute Costs for ML at Scale

Machine learning workloads that operate on data at scale require tremendous computing resources, and organizations running high-volume ML workloads often incur correspondingly high computing costs. Under tight IT operations budgets, enterprises must optimize the compute costs of large-scale ML workloads.

Using a multi-stage, end-to-end ML pipeline as a reference, let’s look at what I consider to be best practice architectures to scale the pipeline and optimize computing costs.

Reference ML Pipeline

Here is a sample ML pipeline with end-to-end stages to train and score on terabyte-scale datasets.

· Stage 1: Create sample dataset from data sources

· Stage 2: Select features using sample dataset

· Stage 3: Create training dataset with selected features

· Stage 4: Train ML model on the training dataset

· Stage 5: Create universe dataset for scoring

· Stage 6: Score universe dataset
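
To keep the flow concrete, here is a minimal, hypothetical sketch that wires the six stages together in order. Every function is a placeholder standing in for a real Spark job; the file names and feature names are illustrative only.

```python
# A minimal, hypothetical sketch of the six stages wired together in order.
# Every function body is a placeholder standing in for a real Spark job.

def create_sample_dataset():                     # Stage 1: sample the raw data sources
    return "sample.parquet"

def select_features(sample_path):                # Stage 2: feature selection on the sample
    return ["f1", "f2", "f3"]

def create_training_dataset(features):           # Stage 3: full training set, selected features only
    return "train.parquet"

def train_model(train_path):                     # Stage 4: fit the model on the training dataset
    return "model.json"

def create_universe_dataset(features):           # Stage 5: build the scoring universe
    return "universe.parquet"

def score_universe(model_path, universe_path):   # Stage 6: batch scoring
    return "scores.parquet"

if __name__ == "__main__":
    sample = create_sample_dataset()
    features = select_features(sample)
    train = create_training_dataset(features)
    model = train_model(train)
    universe = create_universe_dataset(features)
    print(score_universe(model, universe))
```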

Scalable Implementation for the ML Pipeline

In the reference implementation, the ML pipeline scales to process large datasets using a combination of traditional data science techniques and scalable architectures. Horizontal scaling has proven preferable to vertical scaling, as it offers greater flexibility, fault tolerance, and near-unlimited scale. The pipeline uses Apache Spark as a horizontally scalable, distributed data processing engine across all ML stages. Statistically adequate sampling is a proven technique for scaling down big data, so the data preparation stages capture representative samples for feature selection. The training and scoring stages scale by using distributed XGBoost as the model.
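
As a rough illustration of this approach, the sketch below samples a large dataset for feature selection and then trains distributed XGBoost on Spark. It is a minimal sketch, not the engagement's actual code: it assumes PySpark and the xgboost.spark estimator (XGBoost 1.7+), and the paths, feature names, and worker counts are hypothetical. The original pipeline may have used a different XGBoost-on-Spark integration.

```python
# Sketch: sample big data for feature selection, then train distributed XGBoost on Spark.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from xgboost.spark import SparkXGBClassifier

spark = SparkSession.builder.appName("ml-pipeline").getOrCreate()

# Stage 1: capture a statistically adequate, representative sample (e.g. 1%).
full_df = spark.read.parquet("s3a://my-bucket/events/")     # hypothetical path
sample_df = full_df.sample(fraction=0.01, seed=42)

# Stage 2: feature selection runs on sample_df; assume it yields this list.
selected = ["f1", "f2", "f3"]                               # placeholder feature names

# Stages 3-4: build the training set with the selected features and train
# distributed XGBoost, one XGBoost worker per slice of the cluster.
assembler = VectorAssembler(inputCols=selected, outputCol="features")
train_df = assembler.transform(full_df.select(*selected, "label"))
model = SparkXGBClassifier(features_col="features", label_col="label",
                           num_workers=8).fit(train_df)

# Stage 6: batch scoring with the fitted model (the universe dataset would be
# prepared in Stage 5 the same way the training set was).
scores = model.transform(train_df).select("prediction")
```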

The end-to-end pipeline is built, run, and managed from an enterprise data science platform realized with IBM's Cloud Pak for Data, which offers flexible deployment options on premises or on any cloud platform. The pipeline executes as a series of scheduled jobs that drive processing on external Spark clusters. The architectural choice of using external compute clusters for Spark provides immense flexibility to handle scale, along with opportunities to optimize costs.

Optimizing the Economics of Spark Computing

Apache Spark scales processing horizontally by running multiple executors on a cluster of nodes. The executors drive distributed, parallel computations on the data. The type and number of executor nodes determine the compute capacity of a Spark cluster, and capacity is scaled by adding nodes. Higher volumes of ML workload demand higher compute capacity, driving up operational costs.
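
For example, the executor count and size can be set when the Spark session is created. The values below are purely illustrative; real settings would come from profiling the workload.

```python
# Illustrative only: compute capacity is determined by the number and size of executors.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("stage-4-train")
         .config("spark.executor.instances", "20")   # number of executors across the cluster
         .config("spark.executor.cores", "4")        # CPU cores per executor
         .config("spark.executor.memory", "16g")     # memory per executor
         .getOrCreate())
```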

The widespread adoption of Spark has led to multiple implementations, targeting both on-premises and cloud platforms. Each variant of Spark offers different levels of flexibility to add nodes or run numerous workloads at scale. The operating costs incurred for Spark computing will vary based on the variant of Spark adopted to drive the workloads.

Because operating costs need to stay flexible, cloud services often offer the best economics. Let's look at a few best practices for optimizing Spark computing costs using cloud services.

Dynamically Provision Spark Clusters with Auto Scale

A natural first step to contain costs is to dynamically provision Spark clusters, enable autoscaling, and terminate the cluster on job completion. Cloud platforms support programmatic provisioning and termination of Spark clusters through Python SDKs and APIs. Integrating auto-provisioning and immediate termination of the cluster into the ML pipeline jobs minimizes idling costs, and automated dynamic provisioning ensures that costs track actual usage.
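
As one hedged example, here is how a cluster might be provisioned programmatically on AWS EMR with boto3, auto-scaled by a managed scaling policy, and terminated as soon as its steps complete. The cluster name, instance types, roles, and S3 paths are placeholders; IBM Cloud and other providers expose equivalent SDKs and APIs.

```python
# Sketch: provision a transient EMR cluster, run one Spark step, auto-terminate.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="ml-pipeline-stage-4",
    ReleaseLabel="emr-6.2.0",
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {"Name": "primary", "InstanceRole": "MASTER",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "core", "InstanceRole": "CORE",
             "InstanceType": "r5.2xlarge", "InstanceCount": 2},
        ],
        # Terminate the cluster automatically once all steps finish.
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    # Let managed scaling add and remove nodes with the workload (illustrative limits).
    ManagedScalingPolicy={"ComputeLimits": {
        "UnitType": "Instances",
        "MinimumCapacityUnits": 2,
        "MaximumCapacityUnits": 10,
    }},
    Steps=[{
        "Name": "train-model",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster",
                     "s3://my-bucket/jobs/train_model.py"],
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print("Cluster:", response["JobFlowId"])
```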

Size the Cluster Nodes to Match Resource Needs of Spark Job

Compute resources are defined by CPU, memory, and I/O capacity. The Spark jobs of an ML pipeline are typically not identical in their resource needs: some are memory intensive, while others are CPU or I/O intensive. A series of test runs helps profile the resource needs of the Spark jobs in each ML stage, and computing costs are optimized by provisioning a cluster that closely matches the requirements of each stage. In the reference ML pipeline, Stages 1, 3, and 5 are bound by memory more than CPU, unlike the other stages. Automating the ML stages with right-sized clusters provides cost savings.
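
A simple way to capture the profiling results is a per-stage resource map that drives the submit-time settings. The numbers below are hypothetical; the shape mirrors the reference pipeline, where the data-preparation stages (1, 3, 5) lean on memory and the train/score stages (4, 6) lean on CPU.

```python
# Hypothetical per-stage profiles derived from test runs.
STAGE_PROFILES = {
    "stage1_sample":   {"executors": 10, "cores": 4, "memory": "24g"},  # memory-bound
    "stage2_features": {"executors": 4,  "cores": 4, "memory": "16g"},  # runs on the sample
    "stage3_trainset": {"executors": 20, "cores": 4, "memory": "32g"},  # memory-bound
    "stage4_train":    {"executors": 16, "cores": 8, "memory": "16g"},  # CPU-bound
    "stage5_universe": {"executors": 20, "cores": 4, "memory": "32g"},  # memory-bound
    "stage6_score":    {"executors": 16, "cores": 8, "memory": "16g"},  # CPU-bound
}

def spark_conf_for(stage):
    """Translate a stage profile into Spark submit-time settings."""
    p = STAGE_PROFILES[stage]
    return {
        "spark.executor.instances": str(p["executors"]),
        "spark.executor.cores": str(p["cores"]),
        "spark.executor.memory": p["memory"],
    }

print(spark_conf_for("stage4_train"))
```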

Leverage Cheap, Transient Compute Nodes for Spark Cluster

Cloud platforms offer transient compute nodes, like EC2 spot instances on AWS or transient nodes on IBM Cloud. The transient nodes are provided at a compelling price point, usually at steep discounts of 60% or more, to drive utilization of the excess capacity available in the cloud. These compute nodes offer the opportunity to drive significant savings in operational costs for Spark computing.
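
Extending the hypothetical EMR request sketched earlier, transient capacity could be added as a spot-market task instance group, keeping the master and core nodes on demand so the cluster survives reclamation of the cheap nodes. The instance type, count, and bid price are placeholders.

```python
# Illustrative spot (transient) task group to append to the "InstanceGroups"
# list in the run_job_flow request shown earlier.
spot_task_group = {
    "Name": "task-spot",
    "InstanceRole": "TASK",
    "InstanceType": "r5.2xlarge",
    "InstanceCount": 8,
    "Market": "SPOT",      # request discounted, reclaimable capacity
    "BidPrice": "0.30",    # optional price cap in USD; omit to pay the current spot price
}
```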

While transient nodes are cheap, their continuous availability is not guaranteed; they can be reclaimed at any time. Using transient compute nodes in the Spark cluster can therefore lead to job failures when nodes are reclaimed. Even with Spark's built-in fault tolerance, long-running jobs will fail to complete after a few retries if the cluster loses its compute nodes, and the ML pipeline must resubmit the failed Spark jobs. Essentially, this is a trade-off: the dollar savings come at the cost of reliable execution of Spark jobs. For workloads that are less sensitive to job completion times, transient nodes can offer remarkable cost savings.
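
A minimal resubmission wrapper, sketched below with a hypothetical submit_job callable, is often enough to absorb occasional reclamation-driven failures for time-insensitive workloads.

```python
# Hypothetical retry wrapper: when a job fails because the cluster lost its
# transient nodes, the pipeline resubmits it a bounded number of times.
import time

def run_with_resubmit(submit_job, max_attempts=3, backoff_s=60):
    """submit_job() returns True on success, False (or raises) on failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            if submit_job():
                return True
        except Exception as err:              # e.g. executors lost to reclaimed nodes
            print(f"attempt {attempt} failed: {err}")
        time.sleep(backoff_s * attempt)       # back off before resubmitting
    return False

# Usage: wrap the function that submits the Spark job for one pipeline stage.
ok = run_with_resubmit(lambda: True)
```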

Architect Spark Jobs for Higher Resiliency

When transient nodes are reclaimed, Spark reruns the tasks assigned to the lost executors. For long-running jobs, this becomes an all-or-nothing situation: each ML stage in the reference pipeline implements its processing in a single Spark session, so a failed Spark job requires the entire stage's computations to be re-executed. Breaking each stage into a chain of much smaller Spark jobs increases resiliency, and in parallel we tune the number of retry attempts Spark makes before failing a job. With highly resilient Spark jobs, we can capture the cost savings of transient nodes for most workloads.
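
The sketch below illustrates the idea with hypothetical paths and settings: a long data-preparation stage is split into two smaller jobs that checkpoint intermediate results to durable storage, and Spark's retry limits are raised so a reclaimed node costs only the failed sub-job, not the whole stage.

```python
# Sketch: split one long stage into smaller, checkpointed Spark jobs.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("stage-3-trainset")
         .config("spark.task.maxFailures", "8")               # allow more task-level retries
         .config("spark.stage.maxConsecutiveAttempts", "8")   # tolerate repeated stage failures
         .getOrCreate())

# Sub-job A: heavy join, checkpointed to durable storage (hypothetical paths/columns).
joined = (spark.read.parquet("s3a://my-bucket/events/")
          .join(spark.read.parquet("s3a://my-bucket/accounts/"), "account_id"))
joined.write.mode("overwrite").parquet("s3a://my-bucket/tmp/joined/")

# Sub-job B: feature derivation restarts from the checkpoint, not from scratch.
features = (spark.read.parquet("s3a://my-bucket/tmp/joined/")
            .selectExpr("account_id", "amount * 2 AS f1"))
features.write.mode("overwrite").parquet("s3a://my-bucket/train/")
```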

Manage Spark Jobs Efficiently on Shared Cluster

Running Spark jobs on shared clusters also helps optimize costs. Every Spark implementation uses a resource manager to manage jobs on a cluster; for example, Spark on Hadoop uses YARN, and Spark on Mesos uses Mesos. The workload management capabilities vary with each resource manager.
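
For instance, on a shared YARN-managed cluster a job can be pinned to a queue and sized elastically with dynamic allocation, so concurrent jobs share capacity fairly. The queue name and executor bounds below are illustrative.

```python
# Illustrative settings for sharing a YARN-managed cluster.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("stage-6-score")
         .config("spark.yarn.queue", "ml-batch")             # hypothetical shared-cluster queue
         .config("spark.dynamicAllocation.enabled", "true")  # grow/shrink executor count on demand
         .config("spark.dynamicAllocation.minExecutors", "2")
         .config("spark.dynamicAllocation.maxExecutors", "40")
         .config("spark.shuffle.service.enabled", "true")    # needed for dynamic allocation on YARN
         .getOrCreate())
```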

IBM Spectrum Conductor (part of WML Accelerator) uses EGO (Enterprise Grid Orchestrator) as a resource manager to manage Spark, Deep Learning, and Mongo workloads on HPC clusters. The Conductor offers the best workload management with support for sophisticated policies to share resources, manage concurrent workloads with priority groups, and burst to any cloud provider for dynamic compute capacity. Here is the reference ML pipeline with Spark jobs directed to the Conductor cluster.

Scaling compute with cloud burst from Spectrum Conductor.

Conductor hosts multiple Spark Instance Groups (SIGs). Each SIG is assigned cluster resources optimized to run a particular type of Spark job, and each stage in the ML pipeline submits its Spark jobs to the SIG optimized for that stage. A SIG can use the cloud burst capability to provision cheap, transient compute nodes from IBM Cloud, AWS, or Azure. These unique, enterprise-class capabilities in Conductor drive significant cost savings for running Spark workloads at scale in on-premises, cloud, or hybrid cloud environments. For the reference ML workload, the Conductor configuration typically realized 40% savings in computing costs at high volumes.

Use Serverless Compute for Spark Processing

The serverless computing approach runs code on request, hides infrastructure details, scales transparently with demand, and generally performs well for workloads that benefit from parallel processing. Serverless often offers the best of cloud benefits: pricing is per request, which often makes it the most cost-effective form of computing.

In the reference ML pipeline, data processing across the stages is primarily a mix of SQL and ML operations. Stages 1, 3, and 5 prepare and transform datasets with Spark SQL operations, while Stage 4 trains the model and Stage 6 scores with it, driving the ML operations. In addition, Stages 3 and 5 can be executed in parallel to reduce pipeline execution times.

IBM Cloud offers serverless SQL processing with the SQL Query service, priced at a few dollars per TB of data processed. In comparison, a moderately sized Spark cluster can cost several times more per hour, with the total cost growing with job execution times. Implementing Stages 1, 3, and 5 with serverless SQL processing, and executing Stages 3 and 5 in parallel, can yield faster execution times and cost savings compared to provisioned Spark clusters. Here is the implementation of the ML pipeline with the SQL processing stages using the SQL Query service and the ML processing stages targeted to the Spark cluster.
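
A hedged sketch of this split is shown below: Stages 3 and 5 are submitted as serverless SQL jobs in parallel, while Spark handles only the ML stages. It assumes the ibmcloudsql Python client (SQLQuery with submit_sql and wait_for_job); the API key, instance CRN, Cloud Object Storage URLs, and queries are placeholders, so consult the service documentation for the exact API.

```python
# Sketch: run the SQL-only stages serverlessly and in parallel.
from concurrent.futures import ThreadPoolExecutor
from ibmcloudsql import SQLQuery

sql_client = SQLQuery("<api_key>", "<sql_query_instance_crn>",
                      target_cos_url="cos://us-south/my-bucket/results/")

# Placeholder queries; SQL Query reads and writes datasets directly on COS.
STAGE3_SQL = "SELECT f1, f2, f3, label FROM cos://us-south/my-bucket/events/ STORED AS PARQUET"
STAGE5_SQL = "SELECT account_id, f1, f2, f3 FROM cos://us-south/my-bucket/universe/ STORED AS PARQUET"

def run_stage(sql_text):
    job_id = sql_client.submit_sql(sql_text)   # serverless: pay per TB scanned
    sql_client.wait_for_job(job_id)
    return job_id

# Stages 3 and 5 have no dependency on each other, so submit them concurrently.
with ThreadPoolExecutor(max_workers=2) as pool:
    stage3, stage5 = pool.map(run_stage, [STAGE3_SQL, STAGE5_SQL])
print("Training and universe datasets written:", stage3, stage5)
```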

For the reference ML workload, the SQL Query implementation can reduce computing costs by a factor of 8 to 20 across the three stages. As a guiding principle, architecting ML pipelines to use serverless computing for end-to-end processing helps build the most cost-effective solutions.

Summary

Apache Spark has emerged as the natural choice for scalable data processing in ML pipelines, but running ML workloads at scale incurs high computing costs, and budget constraints challenge enterprises to contain them. Dynamically provisioned, right-sized compute clusters with deeply discounted transient compute nodes on the cloud help optimize those costs. Efficient cluster resource managers such as IBM Spectrum Conductor and serverless computing services such as the IBM SQL Query service help maximize the savings.
