What is MLOps?

Published in Equinox, Aug 3, 2022

Machine Learning and AI have been around for a while now, and many products in Thailand are starting to build Machine Learning or AI into them. Putting a model into a product needs a correct, systematic way of implementing it, much like building software. MLOps is a set of practices that uses tools to test, deploy, manage, and monitor our Machine Learning models.

With MLOps, we strive to avoid “technical debt” in machine learning applications.
MLOps is not a role; it is simply a set of practices.

MLOps

The MLOps process consists of three broad phases:
1. Designing the ML-powered application
2. ML Experimentation and Development
3. ML Operations

Designing the ML-powered application

The first phase focuses on business understanding and data understanding, followed by designing the ML software: for example, which group of users we will deliver the feature to and which problem it solves. Most of the time, the problem falls into one of two broad categories:
1. Increasing the productivity of the user
2. Increasing the interactivity of our application

ML Experimentation and Development

The next phase focuses on ML experimentation, or building proof-of-concept (POC) ML models: is the model good enough, can its features answer the business question, and do the test results fall within an acceptable threshold? This phase is more iterative than the others, because we may need to run many ML experiments before we are satisfied.
The goal of this phase is, of course, a model of sufficient quality to use in our product's production environment.

ML Operations

The final phase focuses on delivering the developed model using DevOps practices such as testing, versioning, continuous delivery, and model monitoring.

Automation in Machine Learning

When adopting MLOps practices, automation usually starts with training and deploying machine learning models manually, then moves on to an ML pipeline and finally to a CI/CD pipeline.

Manual Process: Almost every team starts by training models manually. The process tends to be iterative, because we keep searching for the most suitable experiment. Everything, including preparing the data pipeline, data preparation, data validation, model training, and model testing, is done by hand.

ML Pipeline Automation: The next level up from manual is automating the ML pipeline. Whenever new data arrives, or on a defined schedule, the pipeline runs model retraining followed by model validation.

CI/CD Pipeline Automation: At the final level, we apply CI/CD to model deployment in production. What sets this level apart from the previous one is that we also automate building, testing, and deploying the model and the ML pipeline components.
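To make the ML Pipeline Automation level more concrete, here is a minimal Python sketch of one automated retraining run: validate the new data, train a candidate, evaluate it on held-out data, and promote it only if it beats the current model. The mean-predicting `MeanModel`, the `run_pipeline` helper, and the mean-absolute-error gate are hypothetical stand-ins chosen for illustration; a real pipeline would run these steps under an orchestrator with a real training library.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class MeanModel:
    """Hypothetical stand-in model that always predicts the training mean."""
    value: float

    def predict(self, x: float) -> float:
        return self.value


def validate_data(rows: list[tuple[float, float]]) -> None:
    # Data validation step: stop the pipeline on empty data or missing values.
    if not rows:
        raise ValueError("no training data")
    if any(x is None or y is None for x, y in rows):
        raise ValueError("missing value in training data")


def train(rows: list[tuple[float, float]]) -> MeanModel:
    return MeanModel(mean(y for _, y in rows))


def evaluate(model: MeanModel, rows: list[tuple[float, float]]) -> float:
    # Mean absolute error on held-out rows (lower is better).
    return mean(abs(model.predict(x) - y) for x, y in rows)


def run_pipeline(train_rows, holdout_rows, current_mae: float):
    """One automated retraining run; returns (candidate, mae, promoted)."""
    validate_data(train_rows)
    candidate = train(train_rows)
    mae = evaluate(candidate, holdout_rows)
    # Model validation gate: promote only if the candidate beats the current model.
    return candidate, mae, mae < current_mae
```

The scheduler or data-arrival trigger would call `run_pipeline` each time; keeping the promotion decision as an explicit gate is what lets retraining run unattended without silently shipping a worse model.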

MLOps: Continuous delivery and automation pipelines in machine learning

MLOps Stages

MLOps has roughly six stages, each with its own output:

Development & Experimentation (ML algorithms, new ML models)
Output: source code for the pipeline steps, such as data extraction, validation, preparation, model training, model evaluation, and model testing
Pipeline Continuous Integration (build source code and run tests)
Output: pipeline components to deploy, such as packages and executables
Pipeline Continuous Delivery (deploy pipelines to the target environment)
Output: a deployed pipeline with the new model
Automated Triggering (the pipeline is executed automatically in production, on a schedule or in response to a trigger)
Output: a trained model stored in the model registry
Model Continuous Delivery (model serving for prediction)
Output: a deployed model prediction service, such as a REST API
Monitoring (collecting data about model performance on live data)
Output: monitoring of the model in production to spot problems and check whether it still performs well

MLOps Setup Components

To adopt MLOps practices, you need the following components. Each one can be implemented with different tools, depending on each company's circumstances:

Source Control
For versioning source code, data, and model artifacts
Test & Build Services
CI tools for testing ML artifacts and building packages, used to test and build pipelines
Deployment Services
CD tools for deploying pipelines to the target environment
Model Registry
For storing trained ML models that are ready to use
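Feature Store
For preprocessing input data into features that both the model training pipeline and model serving can consume
ML Metadata Store
For tracking metadata about model training, such as model name, parameters, training data, test data, and metric results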
ML Pipeline Orchestrator
For automating the steps that take a model to a deployable state
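Of these components, the model registry is the easiest to show in a few lines. The sketch below is a hypothetical in-memory registry in Python, not the API of any real product: it stores numbered versions together with their metrics, marks one version as production, and rolls back simply by promoting an earlier version.

```python
from dataclasses import dataclass, field


@dataclass
class ModelVersion:
    version: int
    artifact_uri: str
    metrics: dict[str, float]


@dataclass
class ModelRegistry:
    """Hypothetical in-memory registry, not the API of any real product."""
    versions: dict[str, list[ModelVersion]] = field(default_factory=dict)
    production: dict[str, int] = field(default_factory=dict)

    def register(self, name: str, artifact_uri: str, metrics: dict[str, float]) -> ModelVersion:
        history = self.versions.setdefault(name, [])
        entry = ModelVersion(len(history) + 1, artifact_uri, dict(metrics))
        history.append(entry)
        return entry

    def promote(self, name: str, version: int) -> None:
        # Promoting an older version doubles as a rollback.
        if not 1 <= version <= len(self.versions.get(name, [])):
            raise KeyError(f"{name} v{version} is not registered")
        self.production[name] = version

    def get_production(self, name: str) -> ModelVersion:
        return self.versions[name][self.production[name] - 1]
```

Because every version stays in the registry, a rollback never requires retraining: it is just `promote(name, older_version)`.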

Continuous Practices in Machine Learning

If we plan to use the model in a software product, the best approach is to expose it as an API: in the end, the model should be ready to serve from its own microservice dedicated to that task. And if we have hundreds of models, deploying them one at a time by hand is not a good idea.
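As a sketch of what "model as an API" can look like, the Python code below wraps a placeholder linear model (the `WEIGHTS` and `BIAS` values are made up, not a trained model) in a small JSON prediction endpoint using only the standard library's `http.server`. The request format `{"features": [...]}` and the `handle`/`serve` helpers are assumptions for illustration; a production service would typically use a proper web framework and load the model from a registry.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Placeholder linear model: these numbers are made up, not a trained model.
WEIGHTS = [0.5, -1.0]
BIAS = 2.0


def predict(features: list[float]) -> float:
    if len(features) != len(WEIGHTS):
        raise ValueError(f"expected {len(WEIGHTS)} features, got {len(features)}")
    return sum(w * x for w, x in zip(WEIGHTS, features)) + BIAS


def handle(body: bytes) -> tuple[int, bytes]:
    """Map a JSON request body to (HTTP status, JSON response body)."""
    try:
        payload = json.loads(body)
        result = predict(payload["features"])
        return 200, json.dumps({"prediction": result}).encode()
    except (ValueError, KeyError, TypeError) as exc:
        # json.JSONDecodeError is a subclass of ValueError.
        return 400, json.dumps({"error": str(exc)}).encode()


class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self) -> None:
        length = int(self.headers.get("Content-Length", 0))
        status, response = handle(self.rfile.read(length))
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(response)


def serve(port: int = 8000) -> None:
    # Blocks forever; run it as its own process (one microservice per model).
    HTTPServer(("localhost", port), PredictHandler).serve_forever()
```

Keeping the request logic in `handle` means it can be unit tested without starting a server; call `serve()` to run it locally and POST `{"features": [1.0, 2.0]}` to `http://localhost:8000/`.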

MLOps is an ML engineering practice that follows these continuous practices:

Continuous Integration (CI): testing and validating code, extended to data components and models (testing and validating data and models), then building packages for deployment
Continuous Delivery (CD): delivering the ML training pipeline, which automatically deploys the ML model as a prediction service
Continuous Training (CT): unique to ML systems; automatically retrains ML models for re-deployment
Continuous Monitoring (CM): monitoring production data and model performance metrics that are tied to business metrics
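Continuous Training needs a rule for when to retrain. Below is a minimal Python sketch of one possible policy, assuming three triggers: a schedule (model age), enough new data, or a drop in live accuracy compared with the accuracy measured at deployment. The function name `retrain_reasons` and all default thresholds are hypothetical, not recommendations.

```python
from datetime import datetime, timedelta


def retrain_reasons(
    last_trained: datetime,
    now: datetime,
    new_rows: int,
    live_accuracy: float,
    deployed_accuracy: float,
    max_age: timedelta = timedelta(days=7),
    min_new_rows: int = 10_000,
    max_accuracy_drop: float = 0.05,
) -> list[str]:
    """Return why Continuous Training should fire; an empty list means keep the model."""
    reasons = []
    if now - last_trained >= max_age:
        reasons.append("schedule")          # model is older than max_age
    if new_rows >= min_new_rows:
        reasons.append("new data")          # enough fresh labeled rows arrived
    if deployed_accuracy - live_accuracy > max_accuracy_drop:
        reasons.append("performance drop")  # live accuracy fell too far
    return reasons
```

Returning the reasons rather than a bare boolean makes it easier to log why each retraining run was triggered.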

Versioning in Machine Learning

The goal of versioning is to keep track of ML training scripts, ML models, and the model's data sets, in the same way DevOps practices do for code.

The main reasons for versioning model and data changes are:

Retraining the model on new data
Changing the model's approach
The model may no longer be as accurate as before
Using the model in a new feature

So versioning in Machine Learning is similar to versioning in software, but it may need different tools depending on what is being versioned.
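One way to version scripts, data, and models together is to identify each artifact by a hash of its content, so any change to the data produces a new version id. The Python sketch below builds such a manifest with `hashlib.sha256`; the manifest fields and the 12-character id length are assumptions for illustration, and dedicated data versioning tools also handle storage and retrieval, which this sketch does not.

```python
import hashlib


def content_id(data: bytes) -> str:
    # Same bytes give the same id; any change gives a different id.
    return hashlib.sha256(data).hexdigest()[:12]


def make_manifest(script: bytes, dataset: bytes, model: bytes, reason: str) -> dict[str, str]:
    """Tie one training script, data set and model together (hypothetical format)."""
    return {
        "script": content_id(script),
        "data": content_id(dataset),
        "model": content_id(model),
        "reason": reason,  # e.g. "retrain on new data", "new approach"
    }
```

Storing such a manifest next to each model lets you answer which script and which data produced a given model, and the `reason` field records which of the causes listed above triggered the new version.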

Experiments Tracking

Machine Learning development is usually an iterative process centered on model research. As in software development, several people can work on ML development in parallel before the team decides which model to use.

Software development usually keeps source code in Git. ML development can also use Git and separate branches, but it additionally needs clearly defined metrics for comparing the model on each branch. This differs from software development, where branches are split off for work and then merged back in.

That is why ML development needs different tools to record each model experiment: which features each experiment used and how its hyperparameters differ. Popular tools include DVC, MLflow, and Kubeflow.
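The core idea behind experiment tracking, logging each run's parameters and metrics and then comparing runs on one agreed metric, can be sketched in a few lines of Python. The `ExperimentLog` class below is hypothetical and in-memory only; it is not the API of DVC, MLflow, or Kubeflow, which also persist runs and artifacts.

```python
from dataclasses import dataclass, field


@dataclass
class Run:
    name: str
    params: dict = field(default_factory=dict)
    metrics: dict = field(default_factory=dict)


class ExperimentLog:
    """Hypothetical in-memory tracker; not the API of DVC, MLflow or Kubeflow."""

    def __init__(self) -> None:
        self.runs: list[Run] = []

    def log_run(self, name: str, params: dict, metrics: dict) -> Run:
        run = Run(name, dict(params), dict(metrics))
        self.runs.append(run)
        return run

    def best(self, metric: str, higher_is_better: bool = True) -> Run:
        # Every run is compared on the same agreed metric, e.g. one run per branch.
        scored = [r for r in self.runs if metric in r.metrics]
        if not scored:
            raise ValueError(f"no run has metric {metric!r}")
        pick = max if higher_is_better else min
        return pick(scored, key=lambda r: r.metrics[metric])

    @staticmethod
    def diff_params(a: Run, b: Run) -> dict:
        """Hyperparameters that differ between two runs, as {name: (a, b)}."""
        keys = sorted(set(a.params) | set(b.params))
        return {k: (a.params.get(k), b.params.get(k))
                for k in keys if a.params.get(k) != b.params.get(k)}
```

For example, logging one run per branch and calling `best("f1")` picks the winning branch, while `diff_params` shows exactly which hyperparameters separate it from the runner-up.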

Testing in Machine Learning

Figure source: “The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction” by E. Breck et al., 2017

Features and Data Tests
Tests for data preprocessing: does the incoming data look the way we expect? For example, does a value have the wrong type, or are there outliers?
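The two example checks above (wrong types and outliers) can be written as a small Python function. This sketch assumes a single numeric column and uses the common 1.5 * IQR rule via `statistics.quantiles` to flag outliers; the rule and the `check_column` name are choices made for illustration, not taken from the paper.

```python
from statistics import quantiles


def check_column(values: list, expected_type: type, iqr_factor: float = 1.5) -> list[str]:
    """Data tests for one numeric column: wrong types and IQR-based outliers."""
    problems = []
    numeric = []
    for i, v in enumerate(values):
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(v, bool) or not isinstance(v, expected_type):
            problems.append(f"row {i}: {v!r} is not {expected_type.__name__}")
        else:
            numeric.append((i, v))
    if len(numeric) >= 4:  # need a few points for meaningful quartiles
        q1, _, q3 = quantiles([v for _, v in numeric], n=4)
        spread = iqr_factor * (q3 - q1)
        for i, v in numeric:
            if v < q1 - spread or v > q3 + spread:
                problems.append(f"row {i}: {v!r} is an outlier")
    return problems
```

For example, `check_column([25, 31, 28, 35, 29, 30, 400, "42"], int)` reports that row 7 (`"42"`) is not an `int` and that row 6 (`400`) is an outlier.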

Tests for Reliable Model Development
Tests on the model configuration: does the model we built behave the way our tests expect? This also covers testing predictions: when something changes, does the model's accuracy stay within the accepted benchmark?
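A prediction test of this kind can be as simple as computing accuracy on a fixed labeled set and failing when it drops below the benchmark. The Python sketch below assumes classification labels and a hypothetical `tolerance` of 0.01; in practice you would call it from a test runner such as pytest, with predictions coming from the real model.

```python
def accuracy(predictions: list, labels: list) -> float:
    if not labels or len(predictions) != len(labels):
        raise ValueError("need the same, non-zero number of predictions and labels")
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def check_against_benchmark(predictions: list, labels: list,
                            benchmark: float, tolerance: float = 0.01) -> float:
    """Raise AssertionError if accuracy drops more than `tolerance` below the benchmark."""
    acc = accuracy(predictions, labels)
    if acc < benchmark - tolerance:
        raise AssertionError(f"accuracy {acc:.3f} is below benchmark {benchmark:.3f}")
    return acc
```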

ML Infrastructure Tests
Tests for ML API usage and algorithmic correctness, checking that they still hold in different environments.
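One infrastructure check listed in the ML Test Score paper is that training is reproducible. The Python sketch below trains a tiny linear regression with SGD, where a seeded `random.Random` controls the shuffle order. It then checks reproducibility (same seed, identical weights) and algorithmic correctness (the model recovers the known slope and intercept of noise-free data). The model, data, and hyperparameters are made up for illustration.

```python
import random


def train_linear(xs: list[float], ys: list[float], seed: int,
                 epochs: int = 1000, lr: float = 0.02) -> tuple[float, float]:
    """Fit y = w*x + b with plain SGD; the seed fixes the shuffle order."""
    rng = random.Random(seed)  # local generator, unaffected by global random state
    w = b = 0.0
    order = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            err = (w * xs[i] + b) - ys[i]
            w -= lr * err * xs[i]
            b -= lr * err
    return w, b


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]  # noise-free data: true w = 2, b = 1
first = train_linear(xs, ys, seed=42)
second = train_linear(xs, ys, seed=42)
assert first == second                 # reproducible: same seed, same weights
w, b = first
assert abs(w - 2.0) < 1e-3 and abs(b - 1.0) < 1e-3  # algorithmically correct
```

Running the same check in each target environment (local machine, CI runner, training cluster) is a cheap way to catch library or platform differences before they reach production.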

I wrote about testing Machine Learning several years ago; you can read more here:

Unit Testing for Machine Learning (medium.com)

Monitoring in Machine Learning

Once a Machine Learning model is in use, we need to check regularly that it still works well.

The paper “The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction” (E. Breck et al., 2017) recommends the following monitoring:

Monitor dependency changes across the whole pipeline and send notifications on
- Data version changes
- Changes in source systems
- Dependency upgrades
Monitor data invariants in training and serving inputs, and alert when the data does not match the schema used in the training step
Monitor model performance to check whether the model still predicts as accurately as expected
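The data-invariant check can be sketched in Python by recording a simple schema (type and observed min/max per feature) from training data and comparing each serving request against it. This sketch assumes numeric features and treats values outside the training range as alerts. `infer_schema` and `check_serving_row` are hypothetical helpers, and a real system would forward the returned alerts to its monitoring stack.

```python
def infer_schema(training_rows: list[dict]) -> dict:
    """Record type and observed [min, max] per feature (numeric features assumed)."""
    schema = {}
    for name in training_rows[0]:
        values = [row[name] for row in training_rows]
        schema[name] = (type(values[0]), min(values), max(values))
    return schema


def check_serving_row(row: dict, schema: dict) -> list[str]:
    """Compare one serving input with the training schema; return alert messages."""
    alerts = [f"missing feature {name!r}" for name in schema.keys() - row.keys()]
    alerts += [f"unexpected feature {name!r}" for name in row.keys() - schema.keys()]
    for name, value in row.items():
        if name not in schema:
            continue
        kind, low, high = schema[name]
        if not isinstance(value, kind):
            alerts.append(f"{name!r}: expected {kind.__name__}, got {type(value).__name__}")
        elif not low <= value <= high:
            alerts.append(f"{name!r}: {value} outside training range [{low}, {high}]")
    return sorted(alerts)  # sorted for stable output; forward these to your alerting tool
```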

Monitoring tools include Kibana, Prometheus, and uptime monitoring services.

MLOps vs DevOps

DevOps is a set of practices for shortening the development life cycle and continuously delivering high-quality software.

By comparison, MLOps is the process of automating Machine Learning software and workflows in order to deliver high-quality ML software.
DevOps and MLOps share the same goal of automating repetitive processes; MLOps simply adds the components specific to Machine Learning.

Put this way, we could even sum up MLOps as a subset of DevOps!

MLOps vs DevOps Delivery Metrics

Here is how four delivery metrics compare between DevOps and MLOps when it comes to delivering high-quality software and ML software:

Deployment Frequency
DevOps: how often features are deployed or released to end users
MLOps: how often ML models are deployed to end users
Lead Time for Changes
DevOps: the time it takes for committed code to reach production
MLOps: the time it takes to go from model experiments to a production-ready model
Mean Time To Restore (MTTR)
DevOps: the time it takes to restore a failing system in production
MLOps: the time it takes to debug a failing model, including model retraining and rolling back to a previous version
Change Failure Rate
DevOps: the rate of changes that cause a production incident requiring a hotfix or rollback
MLOps: the rate of model changes that cause prediction incidents requiring a rollback, tracked per model type with measures such as precision, recall, F1, accuracy, AUC, ROC, false positives, and A/B testing
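To show how these four metrics can be computed from deployment records, here is a Python sketch. The `Deployment` record is hypothetical: `started_at` marks when work on the change (for MLOps, the experiment) began. MTTR is measured from deployment to restoration, which assumes an incident starts when the faulty model goes live.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Deployment:
    started_at: datetime                    # work on the change or experiment began
    deployed_at: datetime                   # reached production
    failed: bool = False                    # caused an incident needing hotfix/rollback
    restored_at: Optional[datetime] = None  # service restored, if it failed


def average(deltas: list[timedelta]) -> Optional[timedelta]:
    return sum(deltas, timedelta()) / len(deltas) if deltas else None


def delivery_metrics(deploys: list[Deployment], period_days: int) -> dict:
    if not deploys:
        raise ValueError("no deployments in the period")
    failures = [d for d in deploys if d.failed]
    return {
        "deployments_per_week": len(deploys) * 7 / period_days,
        "lead_time": average([d.deployed_at - d.started_at for d in deploys]),
        # Assumes an incident starts when the faulty deployment goes live.
        "mttr": average([d.restored_at - d.deployed_at for d in failures if d.restored_at]),
        "change_failure_rate": len(failures) / len(deploys),
    }
```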

Conclusion

DevOps is a practice for delivering high-quality software; MLOps is also a practice, but one aimed at delivering high-quality ML software. Neither DevOps nor MLOps is a job role; each is simply a set of practices.

#MLOps is not a sport like golf; it is more like a football team. On a team that wants to win, everyone needs to know how to train, how to kick the ball, and the techniques for receiving, passing, and scoring, though not everyone needs to be able to kick 30 m. MLOps is the same: everyone should know it so the team can deliver well, but that does not mean the team needs one top expert in this area who is given the title “MLOps”.