Your next KServe ML service: gRPC vs JSON-REST

Andrei Potapkin
Bumble Tech
Jul 19, 2023

When Machine Learning Engineers need to deploy a newly trained model, they tend to take the shortest path, which is often to leverage one of the popular ML serving platforms.

These platforms are designed with a focus on flexibility and user-friendliness. This is why Python and Pythonic REST services were quickly adopted as the preferred options by platform developers. However, while focusing on flexibility and development speed, it's also important to keep performance and maintainability in mind, as these are particularly important for production-grade systems. It can be hard to meet such requirements without involving more complex technologies and protocols.

This article discusses the tradeoffs between flexibility and performance, and between development speed and long-term reliability, in the context of ML serving protocol design. We'll also illustrate the ideas through one of the serving options we use at Bumble Inc.

Text vs Binary protocols for serving ML models

Quick start for small jobs

The most popular protocols nowadays are JSON-based REST APIs. They make services easily accessible from a great variety of environments: a basic web browser is enough to start calling a service, and JSON content can be edited on the fly without any hassle. These features make JSON RESTful protocols very popular for public ML services as well as for quick prototypes. OpenAI provides a good example of such an API. To run inference on a text sentence, one just executes the following command:

curl https://api.openai.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -d '{
    "model": "gpt-3.5-turbo",
    "messages": [{"role": "user", "content": "Why is REST so popular?"}],
    "temperature": 0.7
  }'

In contrast, binary protocols often require specialised SDKs for making calls and inspecting traffic. This additional technical complexity can be a real blocker for users shopping around for a brand-new cool AI API. Easy-to-use APIs often win.

Serving numerical features

Things tend to go smoothly with text-based protocols, until one passes a big array of numerical data through the service API. This is typical for models consuming tabular data.

For example, consider an int32 feature with uniformly distributed values. The text representation of this feature takes roughly 2.5–3x more space than the 4-byte binary int32. To illustrate this, let's sample 1024 values for the feature and store them in a numpy array:

import numpy as np

a = np.random.randint(np.iinfo(np.int32).max, dtype=np.int32, size=1024)

The memory footprint is around 4KB (4 bytes per data point). To transfer this numpy array in textual form, the integer values are serialised into strings. The string length is a random variable with a mean of 9.48 and a standard deviation of 0.60 (statistics calculated over samples of int32 values converted to strings). The transfer size blows up to ~10KB:

(1024 values * 9.48 avg chars + 1023 separators) * 1 byte per char ~ 10.48 KB
Arrays serialised to a binary format are usually more compact and have a fixed memory footprint
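To see the difference on a concrete sample, here is a minimal sketch that reproduces the example above and serialises the same array as JSON text (which also adds separators and brackets on top of the digits):

import json

import numpy as np

# 1024 uniformly distributed int32 values, as in the example above.
a = np.random.randint(np.iinfo(np.int32).max, dtype=np.int32, size=1024)

binary_size = a.nbytes                   # 1024 values * 4 bytes = 4096 bytes
text_size = len(json.dumps(a.tolist()))  # digits plus separators and brackets, ~10.7 KB

print(f"binary: {binary_size} B, text: {text_size} B, ratio: {text_size / binary_size:.2f}x")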

Of course, traffic compression can be enabled and the difference might not be that dramatic, but compression still consumes CPU and contributes to increased latency.

Serving binaries

In general, text-based protocols become less efficient as the volume of transferred data grows. This becomes particularly problematic in computer vision and speech recognition services which transfer a lot of binary data.

If we want to include an image in a JSON REST message, the binary has to be encoded into base64 format. This in turn inflates traffic by roughly a third compared to the unencoded size of the image, since base64 encodes every 3 bytes of binary data as 4 text characters. A workaround for RESTful protocols is to transfer an image as a binary MIME type (e.g. image/png). If there are any other input parameters, they go inside POST multipart/form-data. Alternatively, they are passed as a JSON object (application/json) in a multipart HTTP request. This makes the communication protocol less intuitive and more cumbersome.
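The base64 overhead is easy to measure with a quick sketch, using a random buffer as a stand-in for an image:

import base64
import os

# A 1 MB random buffer standing in for an image payload.
image_bytes = os.urandom(1024 * 1024)
encoded = base64.b64encode(image_bytes)

print(f"raw: {len(image_bytes)} B, base64: {len(encoded)} B, "
      f"overhead: {len(encoded) / len(image_bytes):.2f}x")  # ~1.33x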

A binary protocol would be a natural alternative. For example, protobuf-based protocols simply let a JPEG image be carried as a bytes field or uint8 array without losing the flexibility of passing other data inside the same message.

Validation as part of protocol

Input validation in MLOps and wider software engineering

There is a popular phrase in computer science: "garbage in, garbage out". It expresses the idea that incorrect or poor-quality input will produce faulty output. Software engineers pay a lot of attention to checking the correctness of algorithm inputs. Invalid data causes crashes in classic software and renders applications unusable. In contrast, invalid data in ML systems often leads to a decrease in quality metrics without instantly visible faults, which makes adding input validation at earlier development stages even more important.

Validation in the context of MLOps ensures that the input distribution seen by a learning algorithm during training stays the same during inference in the production environment. There are multiple reasons why training/serving skew can unintentionally creep into the production pipeline.

Input validation stands against serving skew

In the MLOps community, the most natural and widely discussed cause of skew is a simple but fundamental change in the underlying distributions of data and labels. It comes in different flavours: data distribution drift, concept drift, or covariate shift.

However, a distribution shift may also happen because of human error or bugs in the code.

For example, consider an ML service that accepts a list of categorical features, which are encoded into integers inside the service. The mapping between string values and encoded ids is fixed, and unseen string values are encoded as "unknown". A client can easily mess things up if the service silently accepts "unknown" categories without reporting any errors or triggering internal alerts.

A mistyped input value of a categorical variable is mapped to the "unknown" category
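A minimal sketch of this failure mode (the category vocabulary and ids are made up for illustration): a lenient encoder silently maps a typo to the unknown id, while a strict one surfaces the problem immediately.

# Hypothetical category vocabulary fixed at training time.
CATEGORY_TO_ID = {"free": 0, "premium": 1, "trial": 2}
UNKNOWN_ID = 3

def encode_lenient(value: str) -> int:
    # Silently swallows typos such as "premuim".
    return CATEGORY_TO_ID.get(value, UNKNOWN_ID)

def encode_strict(value: str) -> int:
    # Fails fast (or could emit a metric/alert) instead of hiding the skew.
    if value not in CATEGORY_TO_ID:
        raise ValueError(f"Unknown category: {value!r}")
    return CATEGORY_TO_ID[value]

print(encode_lenient("premuim"))  # 3 -- the mistake goes unnoticed
print(encode_strict("premuim"))   # raises ValueError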

Ideally, a protocol should be fault-proof by design. Strong typing in protocols makes the production environment safer.

Looking at text and binary protocols from this angle, we see that protobuf-based communication is usually safer: the list of ML features can be explicitly defined in the protocol.

message UserFeatures {
  uint32 age = 1;  // protobuf has no uint8 scalar type; uint32 is the smallest unsigned integer
  int64 last_visited_timestamp = 2;
  bool is_verified = 3;
}

Clients will not be able to pass a feature unknown to the service or set a field to the wrong type. To achieve the same with a JSON-based protocol, we'd need to maintain a JSON schema and validate every incoming JSON object against it on every call. An example of such a separate validation schema is shown below:

{
  "type": "object",
  "required": [
    "age",
    "height",
    "is_verified"
  ],
  "properties": {
    "age": {
      "$id": "#root/age",
      "type": "integer"
    },
    "height": {
      "$id": "#root/height",
      "type": "number"
    },
    "is_verified": {
      "$id": "#root/is_verified",
      "type": "boolean"
    }
  }
}
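Running that validation is straightforward with the third-party jsonschema package; a minimal sketch, assuming the schema above is loaded into a schema dict:

from jsonschema import ValidationError, validate

schema = {
    "type": "object",
    "required": ["age", "height", "is_verified"],
    "properties": {
        "age": {"type": "integer"},
        "height": {"type": "number"},
        "is_verified": {"type": "boolean"},
    },
}

payload = {"age": "twenty five", "height": 1.75, "is_verified": True}

try:
    validate(instance=payload, schema=schema)
except ValidationError as err:
    print(f"Rejected request: {err.message}")  # age is not an integer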

There's always a tradeoff between safety and flexibility, and where the boundary between the two lies is not determined solely by the choice between text-based and type-safe binary protocols.

An example of the KServe GRPC tradeoff

The KServe team offers a third option. KServe GRPC 2.0 is designed to stay relatively generic while still retaining the performance advantages of GRPC.

KServe moves away from business entities and suggests treating all input data as named tensors. This makes it possible to have a single protocol implementation and reuse it across different ML services:

message ModelInferRequest
{
  // An input tensor for an inference request.
  message InferInputTensor
  {
    // The tensor name.
    string name = 1;
    // The tensor data type.
    string datatype = 2;
    // The tensor shape.
    repeated int64 shape = 3;
    ...

    // The tensor contents using a data-type format. This field must
    // not be specified if "raw" tensor contents are being used for
    // the inference request.
    InferTensorContents contents = 5;
  }
  ...

  // The input tensors for the inference.
  repeated InferInputTensor inputs = 5;
}
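On the client side, this generic protocol is usually driven through an SDK such as tritonclient rather than hand-crafted protobuf messages. A rough sketch (the endpoint, model name and tensor names are hypothetical):

import numpy as np
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="inference.example.com:9010")

# Pack a feature into a named tensor, as prescribed by the v2 protocol.
age = grpcclient.InferInput("age", [1, 1], "INT32")
age.set_data_from_numpy(np.array([[42]], dtype=np.int32))

response = client.infer(model_name="user_model", inputs=[age])
score = response.as_numpy("score")  # output tensors are addressed by name as well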

In fact, this protocol doesn't restrict clients from sending arbitrarily named tensors to the service, and it is left to the service developer to check that each tensor has the correct name and type.

In our experience, this doesn’t look like a disadvantage when the service has few inputs. However, when the number of input features grows, it becomes essential to check input tensor names and types systematically.

At Bumble Inc., KServe GRPC 2.0 is used for serving models via the Triton server. The input type specifications are defined as part of the server config. However, adding more custom validation rules can be tricky. For performance reasons, internal systems usually benefit from a client-side validation layer, which is typically integrated into the API client libraries distributed by API maintainers.
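Such a client-side layer can be as simple as a table of expected tensor names and dtypes checked before a request leaves the process; a sketch with made-up feature names:

import numpy as np

# Hypothetical contract shipped together with the client library.
EXPECTED_INPUTS = {
    "age": np.int32,
    "last_visited_timestamp": np.int64,
    "is_verified": np.bool_,
}

def validate_inputs(tensors: dict) -> None:
    unknown = set(tensors) - set(EXPECTED_INPUTS)
    missing = set(EXPECTED_INPUTS) - set(tensors)
    if unknown or missing:
        raise ValueError(f"unknown inputs: {unknown}, missing inputs: {missing}")
    for name, tensor in tensors.items():
        if tensor.dtype != EXPECTED_INPUTS[name]:
            raise ValueError(f"{name}: expected {EXPECTED_INPUTS[name]}, got {tensor.dtype}")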

Adoption of KServe GRPC protocol at Bumble Inc.

Since the Bumble Inc. team migrated many ML-related development and production processes to the Kubeflow platform, KServe has become our default platform for deploying new ML services.

KServe supports both REST and GRPC protocols. REST is the default option in KServe, for the simplicity reasons discussed at the beginning of this article. GRPC is also supported in many serving runtimes, but it is usually seen as a secondary option; the KServe Python custom model runtime only added GRPC support recently, in 0.10.0.

Our team doesn't limit itself to either REST or GRPC, but rather tries to choose the best match for a task. Generally, when a service receives high traffic or relies on many input features, we opt for GRPC. In all other cases, REST is the preferable option.

Enabling GRPC comes with a few non-obvious configuration changes, which we'll go through now, hopefully making you feel more comfortable taking on that task.

Switching GRPC on

The tricky part about GRPC in Kubeflow is networking. To give you an idea of how requests are proxied and routed in a Kubernetes cluster, let's look at an illustrative example. In the diagram shown below, a request goes through multiple hops before it reaches the KServe web server:

Proxies in a Kubernetes cluster

There are a few interesting points to note about this configuration. Let's look at them one by one.

Different ingress ports for GRPC and HTTP traffic

The ingress ports are open for application-layer protocols. If port 80 is specified as HTTP, the proxy can check HTTP response codes and act accordingly. However, any GRPC traffic going through port 80 is seen as malformed.

To receive GRPC requests, the corresponding Gateway ports need to be opened:

apiVersion: networking.istio.io/v1alpha3
kind: Gateway
metadata:
  name: knative-ingress-gateway
  namespace: knative-serving
spec:
  servers:
  - port:
      number: 80
      name: http
      protocol: HTTP
    hosts:
    - "*"
  - port:
      number: 9010
      name: grpc
      protocol: GRPC
    hosts:
    - "*"

All downstream Envoy proxy instances operate at the TCP level and don't block GRPC traffic.

One port per KServe service

KServe InferenceService is a Kubernetes custom resource which, on the one hand, makes deployments quick and easy, and on the other hand, puts strict limits on how the underlying resources are configured.

In particular, a KServe inference service can route incoming traffic to only one port. This means that GRPC and REST can't work together in one service instance on different ports. This limitation is unlikely to be fixed anytime soon for a variety of reasons related to networking and KServe's technical design. If somebody needs multiple protocols for a service, there are two workaround options:

  • multiplexing (having two protocols served through one port) or
  • maintaining separate service instances, one per protocol

Luckily, in most cases we don't need two protocols in one service, and a GRPC-only or REST-only service is enough.

Requests to ingress ports 80 and 9010 end up routed to port 8080 of the kserve-container. By default, KServe exposes 8080 as HTTP/1. This doesn't work for GRPC, and the port must be explicitly declared as HTTP/2 (GRPC runs over HTTP/2) in the manifest:

containers:
  - image: <...>
    name: kserve-container
    ports:
      - name: h2c  # "h2c" marks the port as HTTP/2; "http1" for plain HTTP/1.1
        protocol: TCP
        containerPort: 8080

KServe GRPC implementation

KServe provides us with plenty of serving runtimes. At Bumble Inc., we stick to two options which cover both quick prototypes and mature production use-cases.

When we make a quick proof of concept (PoC) that leverages Python-only components, we usually create KServe custom models. Eventually, the PoC can be replaced with a production-friendly system. To facilitate this transition, we design an implementation-agnostic protocol, which helps make the move from the PoC to a more mature implementation seamless.

With this in mind, we might need to enable GRPC for the prototype service. It is as easy as this:

import kserve

server = kserve.ModelServer(enable_grpc=True, grpc_port=8080, http_port=8085)
server.start(models=[…])

The HTTP endpoint can't be disabled, but it can be moved out of the way to another free port (8085 in the example above).

When the PoC passes its initial probation, we plan the migration of the model to the Triton Inference Server, which also supports KServe GRPC. In most cases, we just need to change the endpoint address to point to the Triton server; no other protocol changes are needed.

Sometimes, we don’t use the same protocol in all development stages. To make development iterations on our services easier, we ship them with client libraries providing business-level APIs. These APIs are abstracted from the underlying communication protocols and let us switch implementations transparently.
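As an illustration, such a client library might look roughly like this (the class and method names are invented for the example); callers work with domain objects, and the REST or GRPC transport stays an internal detail that can be swapped:

from dataclasses import dataclass

@dataclass
class UserFeatures:
    age: int
    last_visited_timestamp: int
    is_verified: bool

class ScoringClient:
    """Business-level API; the transport (REST or GRPC) is hidden behind it."""

    def __init__(self, transport):
        self._transport = transport  # e.g. a requests- or tritonclient-based backend

    def score(self, features: UserFeatures) -> float:
        # The transport turns domain objects into protocol-specific messages.
        return self._transport.infer(features)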

Wrapping up

Choosing a network protocol isn't always the first thing people think about when they deploy a new ML model. Each serving framework has a default option which works for most cases. However, it's important to understand the limitations of the chosen option and think in advance about whether a future migration to a new protocol will be needed.

Questions? Feel free to contact me on LinkedIn: https://www.linkedin.com/in/andrei-potapkin-32873b2b/

Disclaimer: None of the external links are endorsed by Bumble Inc.; they are used merely as examples of what the author found useful to share.
