AI Booster - MosaicML (3)

Nomad Ape


Published in

IMU Framework Design

Apr 12, 2023


MosaicML Platform: the Software Infrastructure for Generative AI

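In this series I collect and share AI services, articles, and videos that I find interesting, in the hope of better understanding what is happening in the world.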

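This post introduces the MosaicML Platform, which MosaicML built to tackle the challenges of training large models such as ChatGPT, LaMDA, and Stable Diffusion. As covered in the earlier posts, MosaicML's vision is to make model training more accessible, and it places particular emphasis on large-scale models. These models are highly complex and expensive to train, so they have been monopolized by the tech giants. Most companies find it hard to customize and specialize them for their own industries, and concerns about ownership of models and data also put many people off.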

Compute is the first major hurdle in training large models. Google and Meta used 1024 TPUs and 1024 NVIDIA A100 GPUs to train LaMDA and OPT-175B, which took 57 days and 33 days respectively.
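To put those figures in perspective, here is a minimal sketch that turns them into accelerator-days (chips multiplied by wall-clock days). The chip counts and durations are the ones quoted above; the function name and the dictionary layout are my own, purely for illustration.

```python
# Accelerator-days = number of chips x wall-clock days of training.
# The figures are the ones quoted in this post; nothing else is assumed.
runs = {
    "LaMDA (TPU)": {"chips": 1024, "days": 57},
    "OPT-175B (A100)": {"chips": 1024, "days": 33},
}


def accelerator_days(chips: int, days: int) -> int:
    """Total chip-time consumed by a run, in accelerator-days."""
    return chips * days


for name, run in runs.items():
    print(f"{name}: {accelerator_days(**run):,} accelerator-days")
```

Even before any failure or misconfiguration, each run ties up tens of thousands of accelerator-days, which is why compute alone keeps most teams out.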

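A complicated machine learning training stack is the next problem. The figure below is only a simplified version of the stack as drawn by MosaicML, yet it already shows how time-consuming and costly it is to select, plan, and configure the components. A mistake in a single detail can slow training down significantly.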

Image credit: MosaicML - *A simplified view of a typical machine learning training stack.*

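Large models routinely have tens to hundreds of billions of parameters, so choosing the right GPU distribution strategy, selecting and integrating libraries, and tuning hyperparameters are all challenges.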
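A rough back-of-the-envelope sketch shows why a distribution strategy is unavoidable at this scale. It assumes 2 bytes per parameter (16-bit weights) and an 80 GB GPU; both numbers are my assumptions, not figures from MosaicML. It also counts the weights only, ignoring gradients, optimizer state, and activations, so real requirements are considerably higher.

```python
import math


def min_gpus_for_weights(n_params: float, bytes_per_param: int, gpu_mem_gb: float) -> int:
    """Lower bound on GPUs needed just to hold the model weights.

    Ignores gradients, optimizer state, and activations, so this is
    an optimistic floor rather than a sizing recommendation.
    """
    weight_gb = n_params * bytes_per_param / 1e9
    return math.ceil(weight_gb / gpu_mem_gb)


N_PARAMS = 175e9   # a 175B-parameter model, e.g. OPT-175B
GPU_MEM_GB = 80    # assumption: an 80 GB accelerator

print(min_gpus_for_weights(N_PARAMS, 2, GPU_MEM_GB))  # 16-bit weights
print(min_gpus_for_weights(N_PARAMS, 4, GPU_MEM_GB))  # 32-bit weights
```

Since the weights alone do not fit on one device, the model has to be split across GPUs, and how it is split is exactly the kind of configuration detail that is easy to get wrong.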

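Even once all of the above is in place, the actual training run remains the real test. MosaicML uses failure handling as an example and cites entries from the training logbook of Meta's OPT-175B: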

2021-5-12…It took 50 minutes to resume training from checkpoint_15_45000!
2021-11-18…Unable to train continuously for more than 1–2 days … Many failures require manual detection and remediation, wasting compute resources and researcher time…
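As a rough illustration of what such a restart costs, the sketch below multiplies the 50-minute resume from the logbook by the 1024 GPUs mentioned earlier in this post. The assumption of one interruption every 2 days over a 33-day run is mine, taken from the optimistic end of the "1–2 days" in the quote; it is not a number from the logbook.

```python
def gpu_hours_lost(num_gpus: int, resume_minutes: float) -> float:
    """GPU-hours spent idle while a job resumes from a checkpoint."""
    return num_gpus * resume_minutes / 60


NUM_GPUS = 1024        # cluster size quoted earlier in this post
RESUME_MINUTES = 50    # resume time quoted in the logbook entry
RUN_DAYS = 33          # training duration quoted earlier in this post
DAYS_BETWEEN_FAILURES = 2  # assumption for illustration only

per_restart = gpu_hours_lost(NUM_GPUS, RESUME_MINUTES)
restarts = RUN_DAYS // DAYS_BETWEEN_FAILURES
print(f"per restart: {per_restart:.0f} GPU-hours")
print(f"{restarts} restarts: {restarts * per_restart:.0f} GPU-hours")
```

This counts only the time spent resuming. Work lost since the last checkpoint and the time a person needs to notice the failure come on top of it.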

The MosaicML Platform provides the following features to address the problems above:

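1. Training stack integration: the training stack is the infrastructure for training models. MosaicML helps users choose components from a confusing training stack market and optimizes them. The MosaicML Platform ships a complete set of tools, including a distributed training framework (Composer) and a streaming data loader (StreamingDataset).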

Image credit: MosaicML - *The MosaicML platform addresses all layers of the training stack.*

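2. Multi-cloud training: users can run the MosaicML Platform on a cloud service of their choice or on their own infrastructure. The control plane acts as the coordinator in between, handling multi-node orchestration, failure detection, logic management, and so on.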

Image credit: MosaicML - *The MosaicML platform has three parts: the client interfaces, the control plane, and the compute plane.*

3. Pre-deployment preparation: for matters such as optimized parallelism configurations and how data is loaded, the MosaicML Platform lets users run small-scale trials first and then scale up.

4. Automatic Failure Detection and Fast Recovery

Image credit: MosaicML - *Automated resumption of training after hardware failures or loss spikes.*
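The idea behind automatic failure detection and fast recovery can be sketched in a few lines of plain Python. This is a toy simulation under my own assumptions, not MosaicML's implementation and not the Composer API: `SimulatedHardwareFailure`, `train`, and `run_with_auto_resume` are hypothetical names, failures are injected at fixed steps, and a "checkpoint" is just a step number. Loss spikes, which the figure above also mentions, are not modeled.

```python
class SimulatedHardwareFailure(RuntimeError):
    """Stand-in for a node crash (hypothetical, for illustration only)."""


def train(start_step, total_steps, checkpoint_every, fail_at, checkpoints):
    """Run steps after start_step, recording a checkpoint every N steps."""
    for step in range(start_step + 1, total_steps + 1):
        if step in fail_at:
            fail_at.remove(step)  # each injected failure fires only once
            raise SimulatedHardwareFailure(f"node lost at step {step}")
        if step % checkpoint_every == 0:
            checkpoints.append(step)  # a real job would write model state here
    return total_steps


def run_with_auto_resume(total_steps, checkpoint_every, fail_at, max_restarts=10):
    """Supervisor loop: on failure, restart from the latest checkpoint."""
    checkpoints = [0]      # step 0 is the initial state
    fail_at = set(fail_at)
    restarts = 0
    while True:
        try:
            final = train(checkpoints[-1], total_steps, checkpoint_every,
                          fail_at, checkpoints)
            return final, restarts
        except SimulatedHardwareFailure:
            restarts += 1
            if restarts > max_restarts:
                raise      # give up and surface the error to a human


final_step, restarts = run_with_auto_resume(100, 10, {25, 67})
print(f"finished at step {final_step} after {restarts} automatic restarts")
```

The point of the design is that no person has to notice the crash: the supervisor catches it, rolls back to the latest checkpoint, and continues, so the only cost is the steps since that checkpoint.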

Thank you and enjoy it!
