How to Train Custom Model and Deploy on Google Cloud Vertex AI
Google provides a number of pre-built containers for frameworks like TensorFlow, scikit-learn, and XGBoost, but there are cases where we need to train a model on Vertex AI with some other framework, such as LightGBM. In situations like this we need to build a custom container.
In this blog I will walk you through the steps to create a custom container for LightGBM and deploy it as an endpoint on Vertex AI. I will be using the Iris dataset and building a classification model with LightGBM.
The post is divided into four sections:
- Part 1 : Setting up Artifact Registry
- Part 2 : Training Container
- Part 3 : Serving Container
- Part 4 : Deployment on Vertex AI
You can find the full code here:
Part 1 : Setting up Artifact Registry
Artifact Registry enables you to centrally store artifacts and build dependencies as part of an integrated Google Cloud experience.
To create a repository:
- Navigate to the Artifact Registry page on GCP
- Click on Create Repository
- Fill in the name and select the format, mode, location, and encryption
- Click on Create
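If you prefer the command line, the same repository can be created with gcloud (a sketch assuming the region us-central1 and the repository name testrepo, which are used later in this post):

gcloud artifacts repositories create testrepo \
    --repository-format=docker \
    --location=us-central1 \
    --description="Docker repository for LightGBM containers"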
Part 2 : Training Container
Create a new folder containing two files:
- task.py : the Python file that handles data preprocessing, model training, and artifact upload.
Some environment variables are set automatically when the code runs as a Vertex AI custom training job, e.g. AIP_MODEL_DIR and AIP_STORAGE_URI; more details are available in the Vertex AI documentation.
import os
import joblib
import lightgbm as lgb
from google.cloud import storage
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

PROJECT_ID = "your-project-name"
model_file_name = "model.pkl"

# Load the Iris dataset and split it into train and test sets
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Train the LightGBM classifier and serialise it to disk
model = lgb.LGBMClassifier()
model.fit(X_train, y_train)
prediction = model.predict(X_test)
joblib.dump(model, model_file_name)

# Upload the model artifact to Cloud Storage; AIP_MODEL_DIR is set by Vertex AI
storage_client = storage.Client(project=PROJECT_ID)
model_directory = os.environ['AIP_MODEL_DIR']
storage_path = os.path.join(model_directory, model_file_name)
blob = storage.blob.Blob.from_string(storage_path, client=storage_client)
blob.upload_from_filename(model_file_name)
The above code trains a LightGBM model and saves the model pickle file to the 'AIP_MODEL_DIR' path.
Once the pickle file is available in the GCS bucket, the custom training job registers and uploads the model artifact to the Model Registry.
- Dockerfile : contains commands to install the required packages, copies the Python file created in the previous step into the container, and sets the container entrypoint.
FROM python
# Installs additional packages
RUN pip install lightgbm pandas numpy scikit-learn google-cloud-aiplatform protobuf==3.20.3 google-cloud-storage
# Copies the trainer code to the docker image.
COPY task.py ./task.py
# Sets up the entry point to invoke the trainer.
ENTRYPOINT ["python", "task.py"]
Once we have the Dockerfile and task.py in place, we use the terminal in Vertex AI Workbench to build and push the training container to the Artifact Registry repository created in the previous step.
Run the following command to configure gcloud as the credential helper for the Artifact Registry domain associated with this repository’s location:
gcloud auth configure-docker us-central1-docker.pkg.dev
Navigate to the folder containing the Dockerfile using the cd command, then run the docker build command:
docker build -f Dockerfile -t us-central1-docker.pkg.dev/your-project-name/testrepo/lightgbm_model .
Docker push command :
docker push us-central1-docker.pkg.dev/your-project-name/testrepo/lightgbm_model
In the above commands, modify the following to match your configuration:
- Region : us-central1
- PROJECT_ID : your-project-name
- Artifact Registry repository name : testrepo
- Container name : lightgbm_model
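To confirm that the push succeeded, you can list the images in the repository:

gcloud artifacts docker images list us-central1-docker.pkg.dev/your-project-name/testrepo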
Part 3 : Serving Container
As with the training container, we need two files:
Python file : it has three main functions
- Loading the model file from the path specified by 'AIP_STORAGE_URI'
- Creating a health-check endpoint which responds with status code 200 if healthy
- Creating a serving endpoint which takes JSON input and responds with model predictions
Sample Code :
import os
import joblib
from flask import Flask, request, jsonify
from google.cloud import storage

app = Flask(__name__)

# Download the model artifact from the AIP_STORAGE_URI path and load it
model_f = "model.pkl"
storage_client = storage.Client()
with open(model_f, "wb") as f:
    storage_client.download_blob_to_file(
        os.path.join(os.environ['AIP_STORAGE_URI'], model_f), f)
model = joblib.load(model_f)

# Health-check endpoint: Vertex AI probes this route
@app.route(os.environ['AIP_HEALTH_ROUTE'], methods=['GET'])
def health_check():
    return {"status": "healthy"}

# Serving endpoint: expects {"instances": [...]} and returns model predictions
@app.route(os.environ['AIP_PREDICT_ROUTE'], methods=['POST'])
def predict():
    request_json = request.json
    request_instances = request_json['instances']
    prediction = model.predict(request_instances).tolist()
    output = {'predictions': [{'result': prediction}]}
    return jsonify(output)

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=int(os.environ['AIP_HTTP_PORT']))
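Before containerising, you can sanity-check the server locally by exporting the environment variables yourself (illustrative values; on Vertex AI these are set for you, and this assumes the training job has already written model.pkl to the bucket) and sending a test request with one Iris sample:

export AIP_STORAGE_URI=gs://your-bucket-name/model
export AIP_HEALTH_ROUTE=/ping
export AIP_PREDICT_ROUTE=/predict
export AIP_HTTP_PORT=8080
python serving_model.py &

curl -X POST http://localhost:8080/predict \
  -H "Content-Type: application/json" \
  -d '{"instances": [[5.1, 3.5, 1.4, 0.2]]}'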
The Dockerfile packages the above code into the container and also sets the environment variables required by the prediction container:
FROM python
# Installs additional packages
RUN pip install lightgbm flask scikit-learn google-cloud-storage
ENV AIP_STORAGE_URI=gs://your-bucket-name/model
ENV AIP_HEALTH_ROUTE=/ping
ENV AIP_PREDICT_ROUTE=/predict
ENV AIP_HTTP_PORT=8080
# Copies the API code to the docker image.
COPY . ./
# Sets up the entry point to start the server.
ENTRYPOINT ["python", "serving_model.py"]
Navigate to the folder containing the serving Dockerfile using the cd command, then run the docker build command:
docker build -f Dockerfile -t us-central1-docker.pkg.dev/your-project-name/testrepo/lightgbm_serve .
Docker push command :
docker push us-central1-docker.pkg.dev/your-project-name/testrepo/lightgbm_serve
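You can also smoke-test the serving image locally before deploying it (a sketch; mounting the gcloud config directory is an assumption that gives the container Application Default Credentials so it can read the model from Cloud Storage):

docker run -p 8080:8080 \
  -v "$HOME/.config/gcloud:/root/.config/gcloud" \
  us-central1-docker.pkg.dev/your-project-name/testrepo/lightgbm_serve
curl http://localhost:8080/ping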
After building and pushing the above two containers, both images (lightgbm_model and lightgbm_serve) should be visible in the Artifact Registry repository.
Part 4 : Deployment on Vertex AI
Now that we have the training and serving containers ready, we will go ahead and create a CustomContainerTrainingJob.
Code :
from google.cloud import aiplatform
REGION = 'us-central1'
PROJECT_ID = 'your-project-name'
bucket = 'gs://your-bucket-name/model'  # must match the AIP_STORAGE_URI set in the serving Dockerfile
container_uri='us-central1-docker.pkg.dev/your-project-name/testrepo/lightgbm_model'
model_serving_container_image_uri='us-central1-docker.pkg.dev/your-project-name/testrepo/lightgbm_serve'
display_name='Custom Job from Code'
aiplatform.init(project=PROJECT_ID, location=REGION, staging_bucket=bucket)
# Create CustomContainerTrainingJob and start run with some service account
job = aiplatform.CustomContainerTrainingJob(
display_name=display_name,
container_uri=container_uri,
model_serving_container_image_uri=model_serving_container_image_uri,
)
model = job.run(model_display_name=display_name, service_account="")  # fill in a service account email if required
# Finally deploy the model to endpoint
endpoint = model.deploy(
deployed_model_display_name=display_name, sync=True
)
The code snippet initialises the aiplatform SDK, triggers the custom training job, and finally deploys the trained model to an endpoint. Please refer to the documentation to customise the endpoint by passing the relevant parameters to the deploy function.
Once the model is deployed to the endpoint, it can serve online requests. The deployment process usually takes 20–40 minutes; you can check the status and logs in the Vertex AI dashboard.
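With the endpoint live, you can send an online prediction request through the same SDK (a minimal sketch; endpoint is the object returned by model.deploy above, and the four values form one Iris sample):

# Send an online prediction request to the deployed endpoint
response = endpoint.predict(instances=[[5.1, 3.5, 1.4, 0.2]])
print(response.predictions)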
References:
- https://blog.ml6.eu/deploy-ml-models-on-vertex-ai-using-custom-containers-c00f57efdc3c
- https://cloud.google.com/vertex-ai/docs/predictions/use-custom-container
- https://cloud.google.com/vertex-ai/docs/training/containers-overview
- https://cloud.google.com/vertex-ai/docs/training/create-custom-job
Thanks for reading.
Your feedback and questions are highly appreciated. You can connect with me via LinkedIn.