Deep Learning Training with Ray and Ludwig using Elotl Luna

Justin Willoughby
Published in Elotl blog
4 min read · Apr 17, 2024

Originally published at https://www.elotl.co.

In this brief summary blog, we look at GPU cost savings in the cloud through the use of Luna, an intelligent Kubernetes cluster autoscaler. If you want to harness the power of Deep Learning (DL) while keeping expenses under control, this summary is for you. We explore how Luna streamlines resource management for DL workloads, offering both technical insights and a practical solution.

Deep Learning has transformed, and continues to transform, many industries such as Healthcare, Finance, Retail, and E-commerce. Among the challenges of DL are its high cost and operational overhead:

  1. Compute Costs: Deep learning models require significant computational resources, which lead to high costs, especially for complex or large-scale projects. This is even more true when compute remains provisioned while it is not needed.
  2. Instance Management: Managing cloud instances for training, inference, and experimentation creates operational overhead. This includes provisioning and configuring virtual machines, monitoring resource usage, and optimizing instance types for performance and cost efficiency.
  3. Infrastructure Scaling: Scaling deep learning workloads in the cloud involves dynamically adjusting compute resources to meet demand. This requires optimizing resource allocation to minimize costs while ensuring sufficient capacity.

Open-source platforms like Ray and Ludwig have broadened DL accessibility, yet DL models' intensive GPU resource demands present financial hurdles. Addressing this, Elotl Luna emerges as a solution, streamlining compute for Kubernetes clusters and eliminating the manual scaling that often results in wasted spend.

Running Ray and Ludwig on cloud Kubernetes clusters using Luna, an Intelligent Kubernetes Cluster Autoscaler, is a great approach to mitigating the challenges often faced with DL and public cloud GPU resource demands. Luna dynamically adjusts GPU resources based on workload needs, resulting in substantial efficiency gains.
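To make this concrete, here is a minimal sketch of the kind of workload that drives that behavior: a Kubernetes pod requesting a GPU. When no existing node can satisfy the request, an intelligent autoscaler like Luna can provision a right-sized GPU instance just in time and reclaim it once the pod completes. The pod name, labels, and image tag below are illustrative assumptions, not taken from the original experiments.

```yaml
# Illustrative Ray worker pod requesting one GPU.
# An unschedulable pod like this is what prompts just-in-time
# provisioning of a GPU node, which can be scaled back down when idle.
apiVersion: v1
kind: Pod
metadata:
  name: ray-gpu-worker          # illustrative name
  labels:
    app: ray-worker             # illustrative label
spec:
  restartPolicy: Never
  containers:
    - name: ray-worker
      image: rayproject/ray:latest-gpu   # assumed image tag
      resources:
        requests:
          nvidia.com/gpu: 1     # triggers scheduling onto a GPU node
        limits:
          nvidia.com/gpu: 1
```

In practice the Ray worker pods are created by the Ray cluster deployment rather than by hand; the point is that GPU resource requests, not manually pre-provisioned instances, determine when GPU capacity exists.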

Luna showed significant improvements over a fixed-size Ray cluster on AWS, all while preserving AutoML performance quality:

  • Reduced elapsed time by 61%
  • Reduced compute cost by 54%
  • Reduced idle Ray cluster cost by 66%

The exploration and testing encompassed ML experiments exercising the AutoML capability newly added in Ludwig v0.4.1. Luna's resource management provided just-in-time compute for Ludwig's AutoML across various datasets, which employs Ray Tune for hyperparameter search on GPU-enabled workers. The resulting models proved competitive with manually tuned ones, showcasing Luna's adaptability and efficiency in DL workflows.

Lessons learned underscore the substantial savings achieved in workload elapsed time, execution costs, idle costs, and operational complexity. This is just a glimpse into the impact of Luna on DL training workloads in the cloud. For a comprehensive understanding, dive into the full "Managing public cloud resources for deep learning training: experiments and lessons learned" blog on the Cloud Native Computing Foundation site.

Furthermore, we encourage you to explore our subsequent research, which validates the efficacy of Ludwig v0.5.0 AutoML for text classification datasets. In this study, Luna again showed significant savings:

  • Reduced elapsed time by 7%
  • Reduced compute cost by 59%
  • Reduced idle Ray cluster cost by 66%

The full details of this experiment can be found in the slides and video recording of the "Efficient AutoML with Ludwig, Ray, and Nodeless Kubernetes" session from Kubernetes AI Day Europe.

In both cases, Luna was able to dramatically lower the cost and enhance the performance of the Deep Learning jobs.

While this summary has provided a glimpse into GPU cost savings with Luna, it merely scratches the surface of the insights offered in the original blog and subsequent presentation. We hope it has sparked your curiosity; for a more detailed understanding, we encourage you to dive into the original blog and presentations linked above.

To explore the robust features and capabilities of Luna in greater detail, visit our Luna Product page. For comprehensive guidance, refer to our documentation. Ready to experience firsthand the seamless management of compute for GPU workloads? Start testing Luna today and discover the efficiency and flexibility it offers for your cloud environments.

Author:
Justin Willoughby (Principal Solutions Architect, Elotl)

Authors/Contributors of the full blog on which this summary is based:
Anne Holler, Chi Su, Travis Addair, Henry Prêcheur, Paweł Bojanowski, Madhuri Yechuri, and Richard Liaw
