How to Train Custom Model and Deploy on Google Cloud Vertex AI
Google provides a number of pre-built containers for frameworks like TensorFlow, scikit-learn, and XGBoost, but there are cases where we need to train a model on Vertex AI with some other framework, such as LightGBM. In situations like this we need to build a custom container.
In this blog I will walk you through the steps to create a custom container for LightGBM and deploy it as an endpoint on Vertex AI. I will be using the Iris dataset and building a classification model with LightGBM.
The post is divided into four sections:
- Part 1 : Setting up Artifact Registry
- Part 2 : Training Container
- Part 3 : Serving Container
- Part 4 : Deployment on Vertex AI
You can find the full code here:
Part 1 : Setting up Artifact Registry
Artifact Registry enables you to centrally store artifacts and build dependencies as part of an integrated Google Cloud experience.
To create a repository:
- Navigate to the Artifact Registry page on GCP
- Click on Create Repository
- Fill in the name and select the format, mode, location, and encryption
- Click on Create
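If you prefer the command line, the same repository can be created with gcloud (a sketch assuming the region us-central1 and the repository name testrepo, which are used later in this post):

gcloud artifacts repositories create testrepo \
    --repository-format=docker \
    --location=us-central1 \
    --description="Docker repository for LightGBM containers"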
Part 2 : Training Container
Create a new folder containing two files:
- task.py : the Python file that handles data preprocessing, model training, and artifact upload.
Some environment variables are set automatically when the code runs as a Vertex AI custom training job, e.g. AIP_MODEL_DIR and AIP_STORAGE_URI; more details are available in the Vertex AI documentation.
import os
import joblib
import lightgbm as lgb
from google.cloud import storage
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

PROJECT_ID = "your-project-name"
model_file_name = "model.pkl"

# Load the Iris dataset and split it into train and test sets
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Train the LightGBM classifier and serialise it to disk
model = lgb.LGBMClassifier()
model.fit(X_train, y_train)
prediction = model.predict(X_test)
joblib.dump(model, model_file_name)

# Upload the model artifact to Cloud Storage; AIP_MODEL_DIR is set by Vertex AI
storage_client = storage.Client(project=PROJECT_ID)
model_directory = os.environ['AIP_MODEL_DIR']
storage_path = os.path.join(model_directory, model_file_name)
blob = storage.blob.Blob.from_string(storage_path, client=storage_client)
blob.upload_from_filename(model_file_name)
The above code trains a LightGBM model and saves the model pickle file to the 'AIP_MODEL_DIR' path.
Once the pickle file is available in the GCS bucket, the custom training job registers and uploads the model artifact to the Model Registry.
- Dockerfile : contains commands to install the required packages, copies the Python file created in the previous step into the container, and sets the container entrypoint.
FROM python
# Installs additional packages
RUN pip install lightgbm pandas numpy scikit-learn google-cloud-aiplatform protobuf==3.20.3 google-cloud-storage
# Copies the trainer code to the docker image.
COPY task.py ./task.py
# Sets up the entry point to invoke the trainer.
ENTRYPOINT ["python", "task.py"]
Once we have the Dockerfile and task.py in place, we use the terminal in Vertex AI Workbench to build and push the training container to the Artifact Registry repository created in the previous step.
Run the following command to configure gcloud as the credential helper for the Artifact Registry domain associated with this repository’s location:
gcloud auth configure-docker us-central1-docker.pkg.dev
Navigate to the folder containing the Dockerfile using the cd command, then run the docker build command:
docker build -f Dockerfile -t us-central1-docker.pkg.dev/your-project-name/testrepo/lightgbm_model .
Docker push command :
docker push us-central1-docker.pkg.dev/your-project-name/testrepo/lightgbm_model
In the above commands, modify the following to match your configuration:
- Region : us-central1
- PROJECT_ID : your-project-name
- Artifact Registry repository name : testrepo
- Container name : lightgbm_model
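To confirm that the push succeeded, you can list the images in the repository:

gcloud artifacts docker images list us-central1-docker.pkg.dev/your-project-name/testrepo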
Part 3 : Serving Container
As with the training container, we need two files:
Python file : it has three main functions
- Loading the model file from the path specified by 'AIP_STORAGE_URI'
- Creating a health-check endpoint which responds with status code 200 if healthy
- Creating a serving endpoint which takes JSON input and responds with model predictions
Sample Code :
import os
import joblib
from flask import Flask, request, jsonify
from google.cloud import storage

app = Flask(__name__)

# Download the model artifact from the AIP_STORAGE_URI path and load it
model_f = "model.pkl"
storage_client = storage.Client()
with open(model_f, "wb") as f:
    storage_client.download_blob_to_file(
        os.path.join(os.environ['AIP_STORAGE_URI'], model_f), f)
model = joblib.load(model_f)

# Health-check endpoint: Vertex AI probes this route
@app.route(os.environ['AIP_HEALTH_ROUTE'], methods=['GET'])
def health_check():
    return {"status": "healthy"}

# Serving endpoint: expects {"instances": [...]} and returns model predictions
@app.route(os.environ['AIP_PREDICT_ROUTE'], methods=['POST'])
def predict():
    request_json = request.json
    request_instances = request_json['instances']
    prediction = model.predict(request_instances).tolist()
    output = {'predictions': [{'result': prediction}]}
    return jsonify(output)

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=int(os.environ['AIP_HTTP_PORT']))
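Before containerising, you can sanity-check the server locally by exporting the environment variables yourself (illustrative values; on Vertex AI these are set for you, and this assumes the training job has already written model.pkl to the bucket) and sending a test request with one Iris sample:

export AIP_STORAGE_URI=gs://your-bucket-name/model
export AIP_HEALTH_ROUTE=/ping
export AIP_PREDICT_ROUTE=/predict
export AIP_HTTP_PORT=8080
python serving_model.py &

curl -X POST http://localhost:8080/predict \
  -H "Content-Type: application/json" \
  -d '{"instances": [[5.1, 3.5, 1.4, 0.2]]}'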
The Dockerfile packages the above code into the container and also sets the environment variables required by the prediction container:
FROM python
# Installs additional packages
RUN pip install lightgbm flask scikit-learn google-cloud-storage
ENV AIP_STORAGE_URI=gs://your-bucket-name/model
ENV AIP_HEALTH_ROUTE=/ping
ENV AIP_PREDICT_ROUTE=/predict
ENV AIP_HTTP_PORT=8080
# Copies the API code to the docker image.
COPY . ./
# Sets up the entry point to start the server.
ENTRYPOINT ["python", "serving_model.py"]
Navigate to the folder containing the serving Dockerfile using the cd command, then run the docker build command:
docker build -f Dockerfile -t us-central1-docker.pkg.dev/your-project-name/testrepo/lightgbm_serve .
Docker push command :
docker push us-central1-docker.pkg.dev/your-project-name/testrepo/lightgbm_serve
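You can also smoke-test the serving image locally before deploying it (a sketch; mounting the gcloud config directory is an assumption that gives the container Application Default Credentials so it can read the model from Cloud Storage):

docker run -p 8080:8080 \
  -v "$HOME/.config/gcloud:/root/.config/gcloud" \
  us-central1-docker.pkg.dev/your-project-name/testrepo/lightgbm_serve
curl http://localhost:8080/ping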
After building and pushing the above two containers, both images (lightgbm_model and lightgbm_serve) should be visible in the Artifact Registry repository.
Part 4 : Deployment on Vertex AI
Now that we have the training and serving containers ready, we will go ahead and create a CustomContainerTrainingJob.
Code :
from google.cloud import aiplatform
REGION = 'us-central1'
PROJECT_ID = 'your-project-name'
bucket = 'gs://your-bucket-name/model'  # must match the AIP_STORAGE_URI set in the serving Dockerfile
container_uri='us-central1-docker.pkg.dev/your-project-name/testrepo/lightgbm_model'
model_serving_container_image_uri='us-central1-docker.pkg.dev/your-project-name/testrepo/lightgbm_serve'
display_name='Custom Job from Code'
aiplatform.init(project=PROJECT_ID, location=REGION, staging_bucket=bucket)
# Create CustomContainerTrainingJob and start run with some service account
job = aiplatform.CustomContainerTrainingJob(
display_name=display_name,
container_uri=container_uri,
model_serving_container_image_uri=model_serving_container_image_uri,
)
model = job.run(model_display_name=display_name, service_account="")  # fill in a service account email if required
# Finally deploy the model to endpoint
endpoint = model.deploy(
deployed_model_display_name=display_name, sync=True
)
The code snippet initialises the aiplatform SDK, triggers the custom training job, and finally deploys the trained model to an endpoint. Please refer to the documentation to customise the endpoint by passing the relevant parameters to the deploy function.
Once the model is deployed to the endpoint, it can serve online requests. The deployment process usually takes 20–40 minutes; you can check the status and logs in the Vertex AI dashboard.
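With the endpoint live, you can send an online prediction request through the same SDK (a minimal sketch; endpoint is the object returned by model.deploy above, and the four values form one Iris sample):

# Send an online prediction request to the deployed endpoint
response = endpoint.predict(instances=[[5.1, 3.5, 1.4, 0.2]])
print(response.predictions)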
References:
- https://blog.ml6.eu/deploy-ml-models-on-vertex-ai-using-custom-containers-c00f57efdc3c
- https://cloud.google.com/vertex-ai/docs/predictions/use-custom-container
- https://cloud.google.com/vertex-ai/docs/training/containers-overview
- https://cloud.google.com/vertex-ai/docs/training/create-custom-job
Thanks for reading.
Your feedback and questions are highly appreciated. You can connect with me via LinkedIn.