Deploying an ML Model with FastAPI on Google Kubernetes Engine (Google Cloud Platform)
Training a model with pretty high accuracy (or whatever metric you use) is great, and definitely deserves popping a bottle of champagne. But how are you going to use it now? Batch predict everything on your computer? That would be the machine learning equivalent of showing your website from your laptop at your local coffee shop.
What we are going to do is make the model reachable by everyone through an API. If you don't know what an API is, think of it as an endpoint for your users: they send the data to predict to your API in a POST request, and the API responds with the prediction.
This post consists of two parts:
1. Developing the API
We will use pre-trained yolo-v3 and yolo-v3-tiny models, since the focus here is not training the model but deploying it. We will wrap FastAPI around our model. When we finish this part, we will be able to send requests to the API on our localhost.
2. Deploying it on a Kubernetes Engine on Google Cloud
After the first part, we will be able to send requests to the API on our localhost. To open your model to the world, we need to deploy it to a public endpoint. I picked the cloud and Kubernetes to eliminate scalability problems.
Developing the API
Some parts of the code are from Andrew Ng's MLOps course, but I intend to go further than it does. The code is pretty self-explanatory if you are familiar with computer vision, but I have to explain the FastAPI parts for newcomers.
FastAPI is pretty much Flask on steroids, so we will be creating different pages like website.com/home or website.com/about. We can do that with the @app decorators; e.g. @app.get("/") defines the home page. What is more, you can test your FastAPI application via the /docs page, which is a big plus compared to Flask. This becomes possible thanks to type hinting; in other words, FastAPI documents your application automatically.
We also create a directory on the local file system to save the images clients send us and the images on which we draw boxes around detected objects. We'll stream the latter back to the client so they can see our model's predictions.
After running this code you can reach the API from localhost: if you go to http://localhost:8000 you will see the message "API is working as expected.", and http://localhost:8000/docs will show you a screen like this:
You can try the predict method to see if your model works correctly.
Deploying Application on a Kubernetes Engine on Google Cloud
We are halfway through now, but this part is a bit complicated, which is why I collaborated with my colleague on it. We'll divide this process into four parts (plus a bonus):
- Containerize the App
- Upload it to Container Registry
- Create a Kubernetes Cluster on Google Cloud
- Run Container on Cluster
- (Bonus) Set up an Auto-Scale Policy
Since you know the steps now, whenever you feel lost you can head back, take a look at these steps, and figure out where we are.
If you have set up your account, project, and billing, we can head right into action. Put your app under a folder named "app". Outside the app folder, create a Docker file named "Dockerfile", just like this:
Then we write the Dockerfile:
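A Dockerfile along these lines would do. This is a sketch matching the description below; the Python version, file names (main.py, requirements.txt), and the uvicorn entry point are my assumptions:

```dockerfile
# Base image with Python preinstalled
FROM python:3.9

# Working directory inside the container
WORKDIR /app

# Copy everything into the working directory
COPY . /app

# Install the required libraries
RUN pip install -r requirements.txt

# The API will listen on port 80
EXPOSE 80

# Command to run when the container starts
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "80"]
```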
I am not going into detail about Docker containers. If you are new to Docker you can read the Docker docs or check my Docker notes on my GitHub.
The base image is Python; then we define the working directory. We copy everything to the working directory in the container, install the required libraries, expose port 80, and tell Docker which command to run when the container starts.
We build the container image and run it on our local machine. The second command is optional, just for demonstration purposes, to see if it works. The third command pushes our image to a registry; here we use Google Container Registry (gcr.io) instead of Docker Hub.
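The three commands look roughly like this (a sketch; YOUR_PROJECT_ID and the image name fastapi-yolo are placeholders you should replace with your own):

```shell
# 1. Build the image locally, tagged for Google Container Registry
docker build -t gcr.io/YOUR_PROJECT_ID/fastapi-yolo .

# 2. Optional: run it locally to check that it works
docker run -p 80:80 gcr.io/YOUR_PROJECT_ID/fastapi-yolo

# 3. Push the image to Google Container Registry
docker push gcr.io/YOUR_PROJECT_ID/fastapi-yolo
```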
If you were not logged in before the third command, open a terminal, type gcloud init, and follow the steps. Then type gcloud auth configure-docker. This is just so you can run your gcloud commands from your local machine; I won't get into the details.
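You also need a cluster to deploy to. One way is through the Google Cloud console UI; from the command line it would look roughly like this (the cluster name, zone, and node count are my assumptions, not values from the original setup):

```shell
# Create a small GKE cluster
gcloud container clusters create fastapi-cluster \
    --zone us-central1-a --num-nodes 2

# Fetch credentials so kubectl talks to the new cluster
gcloud container clusters get-credentials fastapi-cluster \
    --zone us-central1-a
```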
If you want to see all the running nodes, you can type kubectl get nodes in the cloud console. (The -A flag is short for --all-namespaces; it matters for namespaced resources such as pods, while nodes are cluster-scoped.)
For the next step we are going to need the Cloud Shell Editor instead of the Cloud Shell terminal.
We are going to create the YAML file for the Kubernetes deployment. Here it is:
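A minimal deployment manifest would look something like this, saved as build.yaml. This is a sketch: the deployment name, labels, and image path are assumptions, and the image should match whatever you pushed to gcr.io:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: fastapi-yolo
spec:
  replicas: 1
  selector:
    matchLabels:
      app: fastapi-yolo
  template:
    metadata:
      labels:
        app: fastapi-yolo
    spec:
      containers:
      - name: fastapi-yolo
        image: gcr.io/YOUR_PROJECT_ID/fastapi-yolo
        ports:
        - containerPort: 80
```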
For more details about Kubernetes YAML files, visit this site. Then we need to apply this build.yaml file.
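Applying the manifest and checking on it takes two commands (standard kubectl usage; the file name build.yaml is the one used above):

```shell
# First command: create the deployment described in build.yaml
kubectl apply -f build.yaml

# Second command: check that the deployment is up and its pods are ready
kubectl get deployments
```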
We can also see the deployments with the second command. We are almost there; all we have to do is expose our Kubernetes workload, which can be done in the UI. Then we'll have an IP address where we can send requests to our API.
After this job completes, you can head to the Services & Ingress tab and see your endpoint.
You can click the endpoint and you will see: "API is working as expected." If you want to test your model, you can add "/docs" to the IP address and send files for prediction. All you need to do is upload a file and press Execute.
And that's it, folks. You have now served your model.
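If you prefer the command line over the UI, the equivalent is kubectl expose (a sketch; the deployment name matches the one assumed earlier):

```shell
# Expose the deployment behind a load balancer with a public IP
kubectl expose deployment fastapi-yolo --type=LoadBalancer --port=80

# Wait for an EXTERNAL-IP to appear
kubectl get services
```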
Bonus: Set Up Auto-Scale
If we type these commands into the console, we tell Kubernetes to autoscale the deployment up to 6 replicas once CPU utilization exceeds 90%.
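The commands would be along these lines (a sketch using a Horizontal Pod Autoscaler; the deployment name is the one assumed earlier, and the thresholds come from the text above):

```shell
# Scale between 1 and 6 replicas, targeting 90% CPU utilization
kubectl autoscale deployment fastapi-yolo \
    --cpu-percent=90 --min=1 --max=6

# Inspect the autoscaler's current state
kubectl get hpa
```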