Published in gft-engineering

MLOps made easy using Titan

Building a production-ready ML pipeline to predict hotel cancellations.

Foreword

MLOps is gaining momentum in the IT industry. The growing number of use cases for machine learning applications is driving the need for tools to manage the whole lifecycle of AI/ML models.

Titan (https://akoios.com/titan), an MLOps suite built by Akoios, is a data-science friendly, minimalistic and easy-to-use solution that reduces the hassle of putting models into production and of dealing with the complexity of ML pipelines.

Titan offers three building blocks that allow users to build their own ML ecosystems:

  • Titan Services: Automagic transformation of Jupyter Notebooks into ready-to-use REST API endpoints.
  • Titan Jobs: Easy execution of arbitrary workloads (e.g. a computationally costly neural network training) on any public or private cloud supporting K8S.
  • CI/CD integration: Easy integration of both Titan Services and Jobs into all types of pipelines (Gitlab CI, GitHub Actions, Jenkins, CircleCI…).

Introduction

Titan offers building blocks (Services and Jobs) to allow Data Science Teams to build their own pipelines and solutions in a simple yet powerful manner.

In this tutorial, we will show how to build a complete, real ML pipeline to predict hotel cancellations based on historical data.

This tutorial will help us illustrate how to combine the different capabilities of a tool like Titan in order to deploy and maintain this prediction service.

The following figure depicts the structure of the pipeline:

  • Step 1: We will have our data stored as a table in Google BigQuery which will serve as our Data Warehouse in this example.
  • Step 2: We will create a Titan Job to execute an ETL (Extract, Transform and Load) process that regularly prepares the data for later use in prediction.
  • Step 3: Using another Titan Job, we will train several prediction models (Logistic Regression, Gradient Boosting and Random Forest) and calculate their main performance metrics.
  • Step 4: Depending on the performance metrics of the previously trained models, they may or may not be automatically (re)deployed as API services using Titan Services.

Let’s go into detail with each of the steps.

Step 0: Prerequisites

This tutorial requires several tools. The good news is that you can get a free trial account for each of them:

  • Titan License: Ask for a free trial account here
  • Google Cloud Platform Account: You can get a free trial account here
  • A GitLab account: You can access a free trial here

The dataset for the tutorial can be found here: Hotel Cancellations Dataset

Step 1: Setting up our Data Warehouse

As mentioned, we will be using Google BigQuery to store the data. First of all, the full dataset needs to be uploaded to BigQuery as explained here.

Once the data is in BigQuery, we can run some SQL queries to confirm that it has been loaded correctly. For example, this query returns the 10 most recent rows according to ReservationStatusDate.

SELECT * 
FROM `datasets.hotel_reservations`
ORDER BY ReservationStatusDate DESC
LIMIT 10;

Step 2: Processing the data

Once the data is uploaded to BigQuery, we can now create a Titan Job to process the data and prepare it to be used by our ML prediction model.

The aims of this Job are:

  1. Access Google BigQuery.
  2. Retrieve the desired information (number of rows and selected columns) and save it in CSV format.
  3. Upload the CSV file to Google Storage for later use. NOTE: You will need access to a Google Storage bucket to save the data as shown in the code.

The code of this Titan Job is quite simple and is shown below:

import google.auth
from google.cloud import bigquery
from google.cloud import bigquery_storage_v1beta1
from gcloud import storage
import pandas as pd
import os

# Access to Google Cloud Services
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/route/to/jsonauthfile"
credentials, your_project_id = google.auth.default(
    scopes=["https://www.googleapis.com/auth/cloud-platform"]
)
bqclient = bigquery.Client(
    credentials=credentials,
    project=your_project_id,
)
bqstorageclient = bigquery_storage_v1beta1.BigQueryStorageClient(
    credentials=credentials
)

# Download query results and store them in a DataFrame
query_string = """
SELECT
    Country, MarketSegment, ArrivalDateMonth, DepositType,
    CustomerType, LeadTime, ArrivalDateYear, ArrivalDateWeekNumber,
    ArrivalDateDayOfMonth, RequiredCarParkingSpaces, IsCanceled
FROM `datasets.hotel_reservations`
ORDER BY ReservationStatusDate DESC
LIMIT 10000
"""
dataframe = (
    bqclient.query(query_string)
    .result()
    .to_dataframe(bqstorage_client=bqstorageclient)
)

# Create a CSV file and upload it to Google Storage
dataframe.to_csv('hotel_reservations.csv', index=False)
client = storage.Client()
bucket = client.get_bucket('tutorial-datasets')
blob = bucket.blob('hotel_reservations.csv')
blob.upload_from_filename('hotel_reservations.csv')
blob.make_public()

The query_string in the code shows which columns we will be using for our model (IsCanceled is the variable to predict; the rest are the features):

  • Country
  • MarketSegment
  • ArrivalDateMonth
  • DepositType
  • CustomerType
  • LeadTime
  • ArrivalDateYear
  • ArrivalDateWeekNumber
  • ArrivalDateDayOfMonth
  • RequiredCarParkingSpaces
  • IsCanceled
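
Since the later steps depend on exactly these columns, it may be worth checking the exported CSV before uploading it. The following is only a minimal sketch using the Python standard library; the helper name missing_columns is hypothetical and is not part of Titan or of the Google client libraries.

```python
import csv

# Columns selected by query_string in the Job above.
EXPECTED_COLUMNS = [
    "Country", "MarketSegment", "ArrivalDateMonth", "DepositType",
    "CustomerType", "LeadTime", "ArrivalDateYear", "ArrivalDateWeekNumber",
    "ArrivalDateDayOfMonth", "RequiredCarParkingSpaces", "IsCanceled",
]


def missing_columns(csv_path, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are absent from the CSV header.

    Hypothetical helper: an empty list means every expected column is present.
    """
    with open(csv_path, newline="") as handle:
        # The first row of the file produced by DataFrame.to_csv is the header.
        header = next(csv.reader(handle), [])
    return [name for name in expected if name not in header]
```

Calling missing_columns('hotel_reservations.csv') right after dataframe.to_csv(...) would return an empty list when the export is complete.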

Step 3: Training the models

In this step of the pipeline, we will train the different prediction models that we will later transform into API services. These are the prediction models we are going to use:

  • Logistic Regression
  • Gradient Boosting
  • Random Forest

The current Job will perform the following tasks:

  1. Read the .csv file with the data
  2. Identify and convert the categorical variables
  3. Define the predicted variable (IsCanceled) and the predictors (the remaining variables)
  4. Split the dataset into training and test sets
  5. Train the 3 different models
  6. Calculate the accuracy score for each of the models
  7. Save the trained models in Google Storage for later use

The code of this Job is shown below:

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score
import pickle
from gcloud import storage

df = pd.read_csv("https://storage.googleapis.com/tutorial-datasets/hotel_reservations.csv")

# Identify the categorical features in our data
categorical_features = ['Country', 'MarketSegment', 'ArrivalDateMonth', 'DepositType', 'CustomerType', 'IsCanceled']
df[categorical_features] = df[categorical_features].astype('category')

# Define the predictors (X) and the predicted variable (y)
y = df['IsCanceled']
X = df.drop(['IsCanceled'], axis=1)

# Encode the categorical variables
X_dum = pd.get_dummies(X, prefix_sep='-', drop_first=True)

# Split the dataset
X_dum = np.array(X_dum)
y = np.array(y)
X_train, X_test, y_train, y_test = train_test_split(X_dum, y, test_size=.25, random_state=40)

# Prepare the GCloud storage
client = storage.Client()
bucket = client.get_bucket('tutorial-models')

# Train and store the Logistic Regression model
logistic = LogisticRegression()
logistic.fit(X_train, y_train)
filename = 'logistic_model.sav'
with open(filename, 'wb') as model_file:
    pickle.dump(logistic, model_file)
blob = bucket.blob(filename)
blob.upload_from_filename(filename)
blob.make_public()

# Train and store the Random Forest model
rand = RandomForestClassifier(n_jobs=10, random_state=40)
rand.fit(X_train, y_train)
filename = 'random_forest_model.sav'
with open(filename, 'wb') as model_file:
    pickle.dump(rand, model_file)  # store the Random Forest, not the logistic model
blob = bucket.blob(filename)
blob.upload_from_filename(filename)
blob.make_public()

# Train and store the Gradient Boosting model
gb = GradientBoostingClassifier(random_state=50)
gb.fit(X_train, y_train)
filename = 'gradient_boosting_model.sav'
with open(filename, 'wb') as model_file:
    pickle.dump(gb, model_file)  # store the Gradient Boosting model
blob = bucket.blob(filename)
blob.upload_from_filename(filename)
blob.make_public()

# Check the accuracy score for each model
y_pred = logistic.predict(X_test)
rand_pred = rand.predict(X_test)
gb_pred = gb.predict(X_test)
print('Logistic Regression accuracy:', accuracy_score(y_test, y_pred))
print('Random Forest accuracy:', accuracy_score(y_test, rand_pred))
print('Gradient Boosting accuracy:', accuracy_score(y_test, gb_pred))
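
To make the encoding step more concrete, here is a small sketch of what pd.get_dummies does with prefix_sep='-' and drop_first=True. The toy rows are invented for illustration and are not taken from the dataset, and dtype=int is passed only to keep the output numeric across pandas versions. Recording the resulting column order (the training_columns variable is an assumption; the Job above does not store it) is one possible way to encode a new reservation consistently at prediction time.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "Country": ["PRT", "ESP", "PRT"],
    "DepositType": ["No Deposit", "Non Refund", "No Deposit"],
    "LeadTime": [46, 120, 3],
})
categorical = ["Country", "DepositType"]
toy[categorical] = toy[categorical].astype("category")

# Numeric columns are kept as they are; each categorical column becomes one
# 0/1 column per category, minus the first category (drop_first=True).
encoded = pd.get_dummies(toy, prefix_sep="-", drop_first=True, dtype=int)
training_columns = list(encoded.columns)
print(training_columns)

# Encode a single new row and align it with the training layout: dummy
# columns absent from the new row are filled with 0, and columns unknown
# to the training data are dropped by reindex.
new_row = pd.DataFrame({
    "Country": ["PRT"],
    "DepositType": ["No Deposit"],
    "LeadTime": [10],
})
new_encoded = (
    pd.get_dummies(new_row, prefix_sep="-", dtype=int)
    .reindex(columns=training_columns, fill_value=0)
)
print(new_encoded.iloc[0].tolist())
```

The aligned row has the same length and column order as the rows the models were trained on, which is the format the prediction endpoints in the next step expect.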

Step 4: Defining the prediction endpoints

In this last step, we will use Titan Services to deploy the different prediction models that have been previously trained.

The Jupyter Notebook below does the following:

  • Loads the trained models
  • Defines a separate endpoint for each prediction model (Logistic Regression, Random Forest & Gradient Boosting)
In [ ]:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier,GradientBoostingClassifier
import pickle
import cloudpickle as cp
from urllib.request import urlopen
import json
In [ ]:
logistic = LogisticRegression()
rand = RandomForestClassifier(n_jobs=10, random_state=40)
gb = GradientBoostingClassifier(random_state=50)
In [ ]:
# Load the stored models (Logistic regression, Gradient Boosting and Random Forest)
logistic = cp.load(urlopen('https://storage.googleapis.com/tutorial-models/logistic_model.sav'))
rand = cp.load(urlopen('https://storage.googleapis.com/tutorial-models/random_forest_model.sav'))
gb = cp.load(urlopen('https://storage.googleapis.com/tutorial-models/gradient_boosting_model.sav'))
In [ ]:
# Mock request object for local API testing
headers = {
    'content-type': 'application/json'
}
body = json.dumps({
    "data": [[46,2017,32,8,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1]]
})
REQUEST = json.dumps({ 'headers': headers, 'body': body })
In [ ]:
# POST /logistic_prediction
body = json.loads(REQUEST)['body']
input_params = json.loads(body)['data']
print(logistic.predict(input_params))
In [ ]:
# POST /rf_prediction
body = json.loads(REQUEST)['body']
input_params = json.loads(body)['data']
print(rand.predict(input_params))
In [ ]:
# POST /gb_prediction
body = json.loads(REQUEST)['body']
input_params = json.loads(body)['data']
print(gb.predict(input_params))
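
Once deployed, each of these cells is meant to answer HTTP POST requests. As a rough sketch of the client side, the snippet below builds such a request with the Python standard library. The base URL is a placeholder and the helper name is hypothetical. The snippet also assumes that the deployed service expects the same {"data": [[...]]} JSON body as the mock request above, which should be checked against your own deployment.

```python
import json
from urllib.request import Request


def build_prediction_request(base_url, endpoint, rows):
    """Build (but do not send) a POST request for a prediction endpoint.

    Hypothetical helper. `rows` is a list of already encoded feature
    vectors, in the same layout as the "data" field of the mock request.
    """
    payload = json.dumps({"data": rows}).encode("utf-8")
    return Request(
        url=f"{base_url.rstrip('/')}/{endpoint.lstrip('/')}",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder URL and a shortened feature vector, for illustration only.
# A real call needs the full encoded vector used by the trained models.
request = build_prediction_request(
    "https://titan.example.com", "rf_prediction", [[46, 2017, 32, 8, 0]]
)
print(request.get_method(), request.full_url)
```

Passing the resulting object to urllib.request.urlopen would send it; that call is left out here because the URL is not a real service.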

Step 5: Putting the pieces together through CI

Now that the components of the pipeline are ready, we can put them all together in the CI/CD platform of our choice.

For this example, we will be using GitLab CI.

Run on a regular schedule (daily, weekly or monthly), this pipeline brings together the aforementioned Titan Jobs and Services. This is the code needed to build the pipeline, which is made of three stages: data, train and deploy:

stages:
  - data
  - train
  - deploy

data:
  image: python:3.8
  stage: data
  script:
    # Install Titan CLI
    - curl -sf https://install.akoios.com/beta | sh
    # Run Titan Job
    - titan jobs run data_job.py

train:
  image: python:3.8
  stage: train
  script:
    # Install Titan CLI
    - curl -sf https://install.akoios.com/beta | sh
    # Run Titan Job
    - titan jobs run training_job.py
  artifacts:
    paths:
      - train_result.txt

deploy:
  image: python:3.8
  stage: deploy
  script:
    - echo "Deploying to production!"
    # Install Titan CLI
    - curl -sf https://install.akoios.com/beta | sh
    # Deploy Notebook API service
    - titan deploy --image scipy prediction_service.ipynb

One interesting addition to this pipeline would be an Evaluation step, so that models are deployed only if their accuracy after training is above a predetermined threshold. This way, the automatic deployment of poorly performing models could be avoided.
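
A minimal sketch of such an evaluation step is shown below, under a few assumptions that are not part of the original pipeline: the accuracy scores are assumed to be available as a dictionary keyed by model name, and the 0.80 threshold, the function names and the example scores are purely illustrative (the scores are made up, not results from this tutorial).

```python
ACCURACY_THRESHOLD = 0.80  # hypothetical value; tune it for your use case


def models_to_deploy(accuracies, threshold=ACCURACY_THRESHOLD):
    """Return the names of the models whose accuracy reaches the threshold."""
    return sorted(
        name for name, score in accuracies.items() if score >= threshold
    )


def exit_code(approved):
    """Return 0 if at least one model may be deployed, 1 otherwise."""
    return 0 if approved else 1


# Made-up scores, for illustration only.
example_scores = {
    "logistic": 0.78,
    "random_forest": 0.84,
    "gradient_boosting": 0.81,
}
approved = models_to_deploy(example_scores)
print(approved, exit_code(approved))
```

In a CI job, ending the script with a non-zero exit status when no model passes would make that job fail, which by default keeps GitLab CI from running the later deploy stage.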

Wrap-up and closing comments

In this tutorial we have combined many of the features we saw in previous tutorials in order to build a more complex, production-ready ML pipeline.

Combining Titan building blocks (Titan Services and Titan Jobs) with any sort of data source makes it really easy to create and maintain robust data-based services for all types of projects.

Gonzalo Ruiz de Villa

Engineer, Google Developer Expert , co-founder of Adesis Netlife and Kenobi Ventures. CTO @ GFT Group