Machine Learning in Production: Using Istio to Mesh Microservices in Google Kubernetes Engine

Dr Stephen Odaibo
The Blog of RETINA-AI Health, Inc.
12 min read · Mar 20, 2020

As the world grapples with Coronavirus and seeks a well-calibrated effective response to the pandemic, there is increasing recognition of the important roles Artificial Intelligence and TeleHealth will play in building resilient healthcare systems that work. To be useful, machine learning algorithms have to be deployed into the real world in a form that is accessible, reliable, secure, and that makes a positive impact in our daily lives. To accomplish these goals, a number of infrastructural components have to be firmly in place. Specifically, our machine learning models need to be deployed into an ecosystem or network of other services which depend on them and vice versa. And this network infrastructure must be readily customizable to the specific needs of our patients and their families.

In this tutorial, I describe the key components required to achieve the above described goals, and how they interact with each other. In particular, I focus on an abstraction layer called the Istio Service Mesh which facilitates network security, routing, and telemetry of the various microservices within our machine learning production architecture. In the recent past, I wrote tutorials on each of the requisite components. Here is a brief review and glossary:

  • Docker: A containerization tool. A container isolates an application and confers portability, modularity, and independence. See my previous tutorial on Docker, virtual machines, and Python virtual environments
  • TensorFlow Serving: A server that hosts a trained machine learning model and serves inference results when called upon. See my previous tutorials here: Single ML Model and Multiple ML Model
  • Kubernetes: The platform that orchestrates or manages the microservices. The containers run in abstractions known as pods, the units of deployment, which in turn run on nodes. The nodes are virtual machines (VMs) which together constitute the cluster. Kubernetes provides orchestration and robustness in the form of load balancing, ingress management, and autoscaling, among others. See here.
  • Microservices: The design pattern in which the various services in one’s application are separated into independently deployable units. It is in contrast to the traditional monolithic architecture.

All of the above functionalities are very welcome. However, there is still a problem: how do you manage the network connections and connectivity policy between the microservices in your cluster? This can become very difficult, and increasingly so as the number of microservices in your cluster grows and as their interconnections and interactions become more complex. A service mesh solves this problem for you. A service mesh can be thought of as a layer that runs on top of the orchestration layer (e.g., Kubernetes) and provides a uniform, declarative, “clean” way of implementing network, security, and monitoring policy.

Istio Functionality

  • Networking
  • Security management
  • Telemetry (monitoring)

Istio in particular is an open-source service mesh that came out of a collaboration between Lyft, Google, IBM, Red Hat, and a number of other contributors. It runs on top of Kubernetes and works mainly via a device called a “sidecar” or “proxy,” implemented by Envoy. This containerized proxy can be injected into any pod in the cluster, where it intercepts all traffic to and from that pod. Given that containers in a pod communicate via localhost, the sidecar can effectively serve as a proxy for its pod and enforce any network policy such as routing rules, security, retries, and circuit-breaking, as well as record all traffic flow and make it readily available for telemetric display. Notably, the Envoy sidecar, originally developed by Lyft, is well suited for its role: it is written in C++ for high performance; it is an L7 proxy with HTTP, HTTP/2, and gRPC support; and it is lightweight, taking up less than 200MB.

Istio Architecture

The Istio architecture is made up of two abstraction layers:

  • The Control Plane
  • The Data Plane

In Istio 1.4.x and earlier, the Control Plane API consists of three components:

  • The Pilot (For Policy injection)
  • The Mixer (For Adapter plugins and Telemetry)
  • The Citadel (For Security)

In Istio 1.5 (released March 3rd, 2020), by contrast, the architecture has been simplified and the Control Plane consolidated. More on this later in this tutorial.

Architecture for Istio 1.4.x and earlier. Picture Source

Notice in the above diagram that each pod in the cluster has both a service and a proxy. The proxy intercepts all traffic going to and from the service. Communication happens at Layer 7 (L7) via HTTP, gRPC, or TCP protocols. These may or may not have mutual TLS authentication enforced, depending on the designer’s wishes. One theme you’ll notice as you get familiar with Istio is the high degree of configurability it allows.

Looking at the Citadel module of the Control Plane, we see that security certificates (TLS certs) are generated and then injected into the proxies for use in mTLS authentication. The Citadel is the security headquarters of the Istio architecture. The Galley handles configuration ingestion and validation, and together with the Mixer it enables telemetry via third-party adapter plugins such as Prometheus, Grafana, Zipkin, and Kiali, to name a few.

The Pilot module handles service discovery and policy configuration of each service’s sidecar.

Before we can do a hands-on evaluation of Istio, we need to install it. Let us take a look at how to do this on the Google Kubernetes Engine.

Of note, some changes are being implemented to the Istio architecture starting with Istio 1.5. In particular, the Mixer’s functionality is being moved into the Envoy proxy, and the Pilot, Citadel, and sidecar-injector services are being consolidated into a single binary called istiod. Otherwise, the idea and implementation are essentially unchanged, except for third-party adapter writers, who will now interface with WebAssembly (Wasm). For those using a managed offering such as GKE that handles the details of installation, there will be minimal noticeable change to the developer experience. One welcome change is the greater ease of setting up and using telemetry. See the Istio 1.5 architecture below:

Istio 1.5 architecture. Picture Source

Installing Istio

gcloud beta container clusters create demo-istio-cluster \
--addons=Istio --istio-config=auth=MTLS_PERMISSIVE \
--cluster-version=1.15.9-gke.9 \
--machine-type=n1-standard-2 \
--num-nodes=4 \
--zone=us-west1-b

Here demo-istio-cluster is the name chosen for the cluster and 1.15.9-gke.9 is the GKE cluster version; substitute your own values as desired. The auth setting can be MTLS_STRICT, in which case all in-cluster service-to-service communication is mTLS encrypted, or MTLS_PERMISSIVE, in which case services accept both plaintext and mTLS traffic and mTLS must be explicitly configured where desired. In particular, make sure to specify the --zone or --region flag, otherwise you’ll get an error message (the docs omit this). Entering the above command on my system in Cloud Shell, I get the following:

When cluster creation completes, the console shows the following:

The following workloads have also been created for the cluster:

Workloads associated with our cluster

In the above picture, if we zoom in on the Istio-specific deployments and job, we see the following:

The other installation option applies when we already have an existing cluster. To add Istio to it, we simply run the following:

gcloud beta container clusters update CLUSTER_NAME \
--update-addons=Istio=ENABLED --istio-config=auth=MTLS_STRICT

where CLUSTER_NAME is the name of our cluster, and MTLS_STRICT is the mTLS setting, which can be replaced with MTLS_PERMISSIVE if desired. We can further inspect our newly created cluster with the following command:

gcloud container clusters list

To inspect our Istio-specific services, we can use the following command:

kubectl get services -n istio-system

yielding the following on my system,

Now that we have created our cluster, we can make it the default cluster and fetch its credentials for the Kubernetes CLI tool, kubectl, as follows:

$ gcloud config set container/cluster demo-istio-cluster
$ gcloud container clusters get-credentials demo-istio-cluster --zone us-west1-b
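
To confirm that kubectl is now pointed at the new cluster, one can, for example, run:

$ kubectl config current-context
$ kubectl get nodes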

Deploying Microservices to the Node Cluster

Now that we have created our Istio-meshed Kubernetes cluster, we can deploy a number of microservices to it. We can apply the following YAML:

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: vjan-deployment
spec:
  replicas: 2
  template:
    metadata:
      labels:
        app: vjan-server
    spec:
      containers:
      - name: vjan-container
        image: gcr.io/ml-production-257/vjan-deployment-server@sha256:*****...
        ports:
        - containerPort: 8501
---
apiVersion: v1
kind: Service
metadata:
  labels:
    run: vjan-service
  name: vjan-service
spec:
  ports:
  - port: 8501
    targetPort: 8501
  selector:
    app: vjan-server
  type: LoadBalancer

I can now create the deployments along with the services specified in the YAML files by using the commands:

$ kubectl create -f vjan-deployment.yaml
$ kubectl create -f jupyter-deployment.yaml

where jupyter-deployment.yaml defines another of our microservices in a similar format. After this, kubectl get services yields,

And kubectl get pods yields,

There are two of each pod because we specified in our YAML that we want two replicas.
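
Since vjan-service is exposed as a LoadBalancer on port 8501, the TensorFlow Serving REST port, one can sanity-check the deployment with a request like the one below. Here EXTERNAL_IP stands for the external IP reported by kubectl get services, and MODEL_NAME and the input instances are placeholders for whatever the served model actually expects:

$ curl -X POST http://EXTERNAL_IP:8501/v1/models/MODEL_NAME:predict \
    -d '{"instances": [[0.1, 0.2, 0.3]]}'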

Sidecar Injection

To inject the proxy sidecars into pods within specified namespaces, one can use the command:

kubectl label namespace NAMESPACE istio-injection=enabled

where NAMESPACE is the name of the namespace containing the pods we want to inject sidecars into. Let’s inject sidecars into the default namespace:

Running kubectl label namespace default istio-injection=enabled, we then see namespace/default labeled echoed back to us. Next, we must restart the pods in order for the sidecar injection to take effect.
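
Because the injection label only affects pods created after it is applied, the existing pods must be recreated. One way to do this, assuming the deployment names used above (and that the second deployment is named jupyter-deployment), is:

$ kubectl rollout restart deployment vjan-deployment
$ kubectl rollout restart deployment jupyter-deployment
$ kubectl get pods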

We see that the restarted pod now has two running containers, one for our app and one for the newly injected sidecar. Restarting the other pods yields the same result, as shown:

We can view the services:

Now that we have completed sidecar injection, let us demonstrate the three uses of Istio: traffic management, security, and telemetry.

Network Management: Canary Deployments

When your machine learning or other application has been deployed into production and is in continual use, great care is needed when making any changes. The last thing you want is to make a change, break something, and leave clients unable to get service. As such, it makes sense to release a new version to just a few clients or a few requests at a time while observing its performance. If issues arise, the change can be rolled back and addressed. If there are no issues and all is well, then you can gradually replace the older version with the newer one. This strategy is named after the canary bird, which was used to detect the presence of dangerous gases in coal mines. The bird, being more fragile than the coal miners, would die when carbon monoxide rose to dangerous levels, signaling the miners to escape while they were still able to do so. The measured doses of the new version of your deployment are analogous to the canary in that scenario.

Consider an application vjan-service with two versions, v1 and v2. Assume we would like 90% of traffic to that service to be routed to v1 and 10% to v2. We can accomplish that by specifying the distribution in a virtual service, which in turn consults a destination rule for the identities of the endpoints. An example of such a virtual service YAML file is as follows:

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: vjan-service
spec:
  hosts:
  - vjan-service
  http:
  - route:
    - destination:
        host: vjan-service
        subset: v1
      weight: 90
    - destination:
        host: vjan-service
        subset: v2
      weight: 10

Destination rules instruct the virtual service where to find the destinations referenced in the virtual service YAML. For example, the manifest below:

apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: vjan-service
spec:
  host: vjan-service
  subsets:
  - name: v1
    labels:
      version: v1
  - name: v2
    labels:
      version: v2
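
Assuming the two manifests above are saved as, say, vjan-virtualservice.yaml and vjan-destinationrule.yaml (hypothetical file names), they can be applied and inspected as follows. Note that the v1 and v2 subsets assume the corresponding deployments label their pods with version: v1 and version: v2 respectively.

$ kubectl apply -f vjan-destinationrule.yaml
$ kubectl apply -f vjan-virtualservice.yaml
$ kubectl get destinationrules
$ kubectl get virtualservices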

One can additionally use request headers to selectively route requests to desired destinations. This is sometimes called “smart canary,” because it routes traffic more intelligently based on the specifics of the request. Take a look at the following virtual service, for example. It routes a request to version v2 of the vjan-service if the specified header matches the regular expression .*chrome.*; otherwise, it routes the request to version v1.

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: vjan-service
spec:
  hosts:
  - vjan-service
  http:
  - match:
    - headers:
        <my-header>:
          regex: ".*chrome.*"
    route:
    - destination:
        host: vjan-service
        subset: v2
  - route:
    - destination:
        host: vjan-service
        subset: v1
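
To exercise this rule, one can send requests with and without a matching header from a client inside the mesh (for example, from a shell in one of the other pods), since the routing rules are applied by the client-side sidecar. In the sketch below, my-header stands for whatever concrete header name is substituted for <my-header> above, and MODEL_NAME is a placeholder as before:

# header matches .*chrome.*, so the request is routed to subset v2
$ curl -H "my-header: chrome" -X POST http://vjan-service:8501/v1/models/MODEL_NAME:predict \
    -d '{"instances": [[0.1, 0.2, 0.3]]}'

# no matching header, so the request falls through to the default route (subset v1)
$ curl -X POST http://vjan-service:8501/v1/models/MODEL_NAME:predict \
    -d '{"instances": [[0.1, 0.2, 0.3]]}'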

Multiple ML Models: Routing Approaches

Real world machine learning applications often involve multiple machine learning models which interact in sequence or asynchronously depending on the state and dynamics of the environment. There are multiple deployment options for multi-model applications. Two notable ones are:

  • TensorFlow Serving of multiple ML models, with signatures prescribed in a models.config file as detailed in my previous tutorial. The model names are specified in the HTTP request payload, and TensorFlow Serving handles the routing and inference accordingly. In this approach, the machine learning model aspect of the overall deployment is essentially monolithic, i.e. a single microservice in a network of microservices. One advantage is that all inference is cleanly located at one IP address and in a single pod (and its replicas). One disadvantage is that you lose modularity and likely have to carry more weight than necessary: you may need only one of several models for a given application, yet you must use the resources to host all of them.
  • Smart canarization: Each ML model can be hosted in a separate container as a separate microservice. At inference time, the model name is similarly specified, but as a header. In this case, the header information is read by the virtual service, which routes the request to the appropriate microservice. This approach has the advantage of maximal modularity: each model is housed separately, and models that are not needed do not take up any resources. The disadvantage is added complexity, because there are more microservices to deal with. A sketch of such a virtual service follows this list.
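
As a rough sketch of the second approach, a virtual service can inspect a model-name header and fan requests out to the microservice hosting the requested model. The service names (model-a-service, model-b-service), the front-end host (ml-models-service), and the header name are all hypothetical here:

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: model-router
spec:
  hosts:
  - ml-models-service          # hypothetical service the clients address
  http:
  - match:
    - headers:
        model-name:
          exact: model-a
    route:
    - destination:
        host: model-a-service  # microservice hosting model A
  - match:
    - headers:
        model-name:
          exact: model-b
    route:
    - destination:
        host: model-b-service  # microservice hosting model B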

Telemetry

If one does not need service-level telemetry and is content with node-level telemetry, that is available out of the box on GKE, as shown:

Otherwise, to see how to install Prometheus on the Google Kubernetes Engine, go here, and to install Grafana, go here. One sample panel from my resulting dashboard is shown below. Prometheus and Grafana can then be port-forwarded locally to view service-specific telemetry; a sketch follows:
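
As an illustration, the port-forwarding might look like the commands below; the exact namespace and service names depend on how Prometheus and Grafana were installed, so treat these as placeholders:

$ kubectl -n istio-system port-forward svc/prometheus 9090:9090
$ kubectl -n istio-system port-forward svc/grafana 3000:3000

The dashboards are then reachable at http://localhost:9090 and http://localhost:3000 respectively.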

Of note, installation of the telemetry dashboards has been simplified in Istio 1.5, which is not yet available on GKE.

Conclusion

Istio enables declarative configuration and management of network policy, reliability, security, telemetry, and testing options. It can be thought of as an abstraction layer running on top of, and managing, the orchestration infrastructure, Kubernetes. All traffic into, out of, and between microservices in our cluster is intercepted by sidecar proxies. These therefore hold all the data regarding our network and enable a wide panoply of management and monitoring functionality. Though we only scratched the surface in this tutorial, examples of functionality easily implemented and managed by Istio include routing rules, canary deployments, mutual TLS, ingress and egress rules, whitelisting, failure injection for testing, telemetry, retries, circuit-breaking, A/B testing, and rate limits, to name a few.

References: Hyperlinked in text above.

AUTHOR BIO: Dr. Stephen G. Odaibo is CEO & Founder of RETINA-AI Health, Inc. He is a Physician, Retina Specialist, Mathematician, Computer Scientist, and Full Stack AI Engineer. In 2021 he was issued a U.S. Patent for inventing an AI system that automatically detects diseases from ophthalmic images. In 2017 he received UAB College of Arts & Sciences’ highest honor, the Distinguished Alumni Achievement Award. And in 2005 he won the Barrie Hurwitz Award for Excellence in Neurology at Duke Univ School of Medicine where he topped the class in Neurology and in Pediatrics. He is author of the books “Quantum Mechanics & The MRI Machine” and “The Form of Finite Groups: A Course on Finite Group Theory.” Dr. Odaibo Chaired the “Artificial Intelligence & Tech in Medicine Symposium” at the 2019 National Medical Association Meeting. Through RETINA-AI, he and his exceptionally talented team are building AI solutions to address the world’s most pressing healthcare problems. He resides in Houston Texas with his family.

Twitter: @sodaibo
