Building a Custom Classification API on Google Cloud: A Technical Deep Dive

Maximilian Weiss
Google Cloud - Community
13 min read · Jun 25, 2024

Introduction

In the ever-evolving landscape of machine learning and natural language processing, the ability to classify text, images, and videos into custom categories is a game-changer. Imagine a tool that can categorize keywords for ad campaigns, annotate documents, or even organize your media library based on your own unique taxonomy. That’s precisely what a custom classification API can achieve. In this blog post, we’ll give an overview of how to build such an API on Google Cloud that can handle large volumes with minimal latency, exploring its architecture and infrastructure components. You can find the code for the API on GitHub.

Background & Business case

Originally, we were presented with the challenge of overhauling large Google Ads accounts to adhere to Google’s account structure best practices. One of the main takeaways comes down to grouping keywords with similar themes together so that Google’s bidding models can train and optimize as effectively as possible. The reality is that many businesses have grown their Google Ads accounts organically for years, neglecting proper theming of keywords. Moreover, many advertisers have millions of keywords spread across thousands of campaigns and accounts. From what we heard, advertisers don’t want a random grouping of keywords into arbitrary clusters but have their own taxonomy that aligns with business lines or internal processes. This quickly led us to the problem of classifying keywords (or any text content) based on a custom taxonomy as the foundation for account restructuring.

We were intrigued to look into building a custom classification API on Google Cloud for the following reasons:

  1. Scalability: To be applicable to small- and large-scale marketing use cases alike, potentially classifying millions of Google Ads keywords, a solution has to be able to process thousands of inputs in seconds.
  2. Pre-trained Models: The availability of pre-trained large language models (LLMs), removing the need to train our own classifier.
  3. Integration: A Cloud API can be easily integrated into various existing applications and workloads.

While we were able to solve this particular problem of classifying keywords, there are many more applications of a custom classification API, which include but are not limited to the following:

  • Marketing and Advertising: Analyze customer feedback and personalize content recommendations.
  • Content Management: Automatically tag articles, videos, and images for easier organization and search.
  • E-commerce: Classify products into relevant categories for improved browsing and filtering.
  • Social Media: Analyze social media posts to identify trends, sentiment, and brand mentions.

A custom classification API should take any text input and assign it to the most relevant categories within a user-defined taxonomy.

Taxonomy

In its simplest form, a taxonomy is a list of categories. There are various taxonomy types such as flat, hierarchical, networked and others. For our purposes we assume either a flat or hierarchical taxonomy. The example below illustrates a hierarchical taxonomy with three categories/nodes and a maximum depth of three:

  • Parent
  • Parent > Child
  • Parent > Child > Child

Having a well-defined taxonomy is crucial for managing and understanding vast amounts of data. It enables efficient search and retrieval, facilitates data analysis and reporting, and ensures consistency in how information is categorized across different departments and systems. A taxonomy can serve as a roadmap for navigating the complex landscape of business data, making it easier to extract valuable insights and make informed decisions.

Classification using embeddings

Embeddings

In essence, embeddings are dense vector representations of text or other types of data, capturing their semantic meaning in numerical form. Imagine each word or phrase as a point in a high-dimensional space. Embeddings place these points such that words with similar meanings are closer together, while those with different meanings are farther apart. This spatial arrangement allows algorithms to reason about language in a way that aligns with human intuition. For instance, the embeddings for “king” and “queen” would be closer than those for “king” and “apple.” By comparing the embedding of the input content with the embeddings of each of the categories in the taxonomy, we can determine the best match.

While alternative approaches like fine-tuning an LLM on labeled data might seem appealing, they often require significant amounts of training data and computational resources. Embeddings, on the other hand, can be extracted from a pre-trained LLM and efficiently compared. Additionally, if the taxonomy changes over time we only have to update the category embeddings database without the need for any training.

Classification

Classification is achieved by calculating the distance between the embedding vector of an input text and the embedding vectors of each of the categories in the taxonomy. The lower the distance, the higher the confidence that the passed text and category definition are in the same semantic space. Note that there are various distance measures and this article gives a good overview. Cosine or dot product distance are the recommended approaches for classification.
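To make this concrete, here is a minimal sketch of the idea using numpy. The three-dimensional vectors and category names are made up for brevity; real models return hundreds of dimensions.

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
  """Cosine similarity: 1.0 = same direction, 0.0 = orthogonal."""
  return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 3-dimensional embeddings for illustration only.
input_embedding = np.array([0.9, 0.1, 0.2])
category_embeddings = {
    'sports': np.array([0.88, 0.15, 0.25]),
    'finance': np.array([0.10, 0.90, 0.30]),
}

best_match = max(
    category_embeddings,
    key=lambda name: cosine_similarity(
        input_embedding, category_embeddings[name]
    ),
)
print(best_match)
>> sports

Note that if all vectors are L2-normalized, the dot product equals the cosine similarity, which is why we later combine DOT_PRODUCT_DISTANCE with UNIT_L2_NORM when creating the index.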

Objectives

We expect our classification API to handle the following tasks:

  1. Taxonomy Import: Import a taxonomy from a spreadsheet.
  2. Category Embeddings: The API generates embedding vectors for each category in your taxonomy. These vectors capture the semantic meaning of the categories.
  3. Storage: The embeddings, along with their corresponding category names, are stored in a database for efficient retrieval.
  4. Input Embedding: When you provide text or image input, the API generates embeddings for this content as well.
  5. Similarity Search: The API performs a similarity search, comparing the input embedding with the stored taxonomy embeddings. The categories with the highest similarity scores are considered the best matches.

Assuming that the taxonomy definition does not change over time, Steps 1 through 3 only have to happen once, while Steps 4 and 5 run for every request. This means the API needs separate endpoints for these two workflows.

Infrastructure & Justification

In this section we will outline the high-level architecture of the API and explain the infrastructure choices we made.

Custom Classification API Architecture.

Cloud Run Service

The API endpoints are exposed via a Cloud Run Service. We chose this option because it runs stateless HTTP containers without us having to worry about provisioning machines, and it scales automatically, which makes it a cost-efficient option. A viable alternative is Google App Engine, see here for a comparison. The Classify Service classifies a given text according to the taxonomy written by the Taxonomy Job.

Cloud Run Job

The Taxonomy Job is responsible for importing the taxonomy (list of categories) from a Google Spreadsheet, attaching the text embeddings vectors for each category and writing to Google Cloud Storage. A Cloud Run Job is a great option because the task of getting the embeddings for each taxonomy node can take some time. A Cloud Function could be a viable alternative but has the downside of timing out after 30 minutes. If your taxonomy is sufficiently small in size, using a Cloud Function is perfectly fine.

Vector Search

By combining the storage capabilities of GCS with the specialized indexing and search capabilities (ScaNN) of Vector Search, this approach offers a high-performance and scalable solution for custom classification tasks.

There are alternative approaches to performing similarity searches with embedding vectors. One alternative we explored that could have fit our needs with regards to scale and latency was using a Postgres database. This approach leverages PostgreSQL, a popular relational database, along with the pgvector extension for storing and querying vector data. The HNSW (Hierarchical Navigable Small World) index is used to speed up similarity searches. Check out this colab that illustrates how this could be built into an application.

In terms of infrastructure we found two options on Google Cloud: Cloud SQL and AlloyDB. Cloud SQL is a fully-managed relational database service for MySQL, PostgreSQL, and SQL Server, offering ease of use and automated backups, but with potential limitations for very large-scale applications. AlloyDB is a more powerful and scalable PostgreSQL-compatible database service optimized for high performance and demanding workloads, but may require more hands-on management and potentially incur higher costs.

There are a few caveats with this approach compared to using Vector Search:

  • Cost: Postgres databases come at a higher fixed cost compared to Vector Search, since a database requires a machine that is always on. Additionally, if you want more vCPUs you have to make that choice when setting up the instance, and more CPUs come at a higher cost. Vector Search endpoints, in contrast, can automatically scale up or down depending on the request volume.
  • Latency: In order to reduce latency, database reads have to run concurrently, which requires many connections. When load testing this approach we frequently ran into connection issues and timeouts. We were able to improve this using multiple AlloyDB Read Pool Instances, which can have up to 10k connections each to distribute the workload; however, this came at significant additional cost because every read pool instance requires its own machine to be always running.

Technical details

API Design

The API itself is built using FastAPI and exposes two endpoints.

The /generate_taxonomy_embeddings endpoint is used to generate the category embeddings, store them on GCS, create the index and endpoint, and deploy the index to the endpoint. Since this is a long-running operation, the endpoint itself takes the required information about the location of the taxonomy (spreadsheet) and kicks off a Cloud Run Job — for details on this job, see the Implementation of Vector Search section.
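For illustration, here is a hedged sketch of how the endpoint might kick off that Cloud Run Job programmatically using the google-cloud-run client library. The project, region, and job name are placeholders, not the actual deployment values.

from google.cloud import run_v2

def trigger_taxonomy_job() -> None:
  """Starts the taxonomy Cloud Run Job without waiting for completion."""
  client = run_v2.JobsClient()
  # Fully qualified job name; replace with your project, region and job.
  job_name = (
      'projects/your-cloud-project-id/locations/us-central1'
      '/jobs/taxonomy-job'
  )
  client.run_job(request=run_v2.RunJobRequest(name=job_name))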

The /classify endpoint generates the embeddings for the input text, performs an ANN search against the category embeddings, and returns the closest matches.
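A minimal FastAPI skeleton of the two endpoints could look like the sketch below. The request models and helper functions (trigger_taxonomy_job from above, get_embeddings_batch and the deployed index objects from later sections) are illustrative assumptions rather than the actual implementation.

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class TaxonomyRequest(BaseModel):
  spreadsheet_id: str
  worksheet_name: str

class ClassifyRequest(BaseModel):
  text: str
  num_categories: int = 3

@app.post('/generate_taxonomy_embeddings')
def generate_taxonomy_embeddings(request: TaxonomyRequest) -> dict:
  # Long-running work is delegated to the Cloud Run Job.
  trigger_taxonomy_job()
  return {'status': 'taxonomy job started'}

@app.post('/classify')
def classify(request: ClassifyRequest) -> dict:
  # Embed the input text and run an ANN search against the index.
  vector = get_embeddings_batch([request.text])[request.text]
  matches = deployed_index_endpoint.match(
      deployed_index_id=deployed_index_id,
      queries=[vector],
      num_neighbors=request.num_categories,
  )
  return {'categories': [(n.id, n.distance) for n in matches[0]]}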

Generating embeddings

There are numerous embedding models available, each with its strengths and weaknesses (see here for a comparison of popular models). The textembedding-gecko-multilingual model is a great choice when classifying content in multiple languages without needing to specify the language of the text. The model generates embedding vectors with a dimensionality of up to 768. LLM embedding dimensionality is a trade-off: lower dimensions mean faster computation and less storage, but potentially less nuanced representations, while higher dimensions capture richer semantic relationships at the cost of increased complexity and resources. See here for an example of how to modify the dimensionality of the generated embedding.
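As a hedged illustration: newer embedding models expose an output_dimensionality parameter in the Vertex AI SDK (support depends on the model and SDK version; the model name below is an assumption for this sketch).

from vertexai.language_models import TextEmbeddingModel

model = TextEmbeddingModel.from_pretrained('text-embedding-004')
# Request 256-dimensional vectors instead of the default 768.
embeddings = model.get_embeddings(
    ['some text'], output_dimensionality=256
)
print(len(embeddings[0].values))
>> 256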

Below is an example of how to use Vertex AI to retrieve embeddings for a specific category.

import vertexai
from vertexai.language_models import TextEmbeddingModel

_PROJECT = 'your-cloud-project-id'
_MODEL = 'textembedding-gecko-multilingual@001'

vertexai.init(project=_PROJECT, location='us-central1')
model = TextEmbeddingModel.from_pretrained(_MODEL)
embeddings_vectors = []
embeddings = model.get_embeddings([
    'agriculture & farming>farm animals & livestock>sheep'
])
for embedding in embeddings:
  vector = embedding.values
  embeddings_vectors.append(vector)

print(embeddings_vectors)
>> [[0.02517577074468136, -0.029102273285388947, 0.05243232846260071,...]]

This is pretty straightforward, and we can pass up to 250 pieces of text in a single request. It is worth noting that the more contextual information we provide when generating the embeddings (e.g. the full category path rather than just the leaf name), the more accurate the classification will be.

Implementation of Vector Search

Check out this guide on how to implement vector search. For the purposes of our custom classification API, this entails the following steps:

  1. JSON File Storage on Google Cloud Storage (GCS): Instead of a traditional database, the taxonomy data (category names and their corresponding embedding vectors) is stored in JSON files on GCS. These files need to adhere to a specific format with ‘id’ and ‘embedding’ fields.
  2. Approximate Nearest Neighbor (ANN) Index Creation: Once the taxonomy data is in GCS, an ANN index is created. This index is a data structure optimized for fast nearest neighbor searches in high-dimensional spaces, like those used for embeddings.
  3. MatchingEngineIndexEndpoint: A MatchingEngineIndexEndpoint is created to serve as the interface for interacting with the ANN index. This endpoint is where you send your queries to find the most similar categories.
  4. Index Deployment: The created ANN index is deployed to the MatchingEngineIndexEndpoint. This makes the index accessible for querying.
  5. Querying: The endpoint, using the deployed ANN index, quickly identifies the most similar category embeddings and returns them as the classification results.

Importing the taxonomy from a spreadsheet

We want to import the list of categories, generate embeddings for each of the categories and store them in JSON files on Google Cloud Storage:

First, let’s establish a data model to represent the taxonomy and a category:

import dataclasses
from typing import Any, Optional

@dataclasses.dataclass
class Category:
  """Value object of a category and its associated embeddings."""
  name: str
  id: Optional[str] = None
  embeddings: Optional[list[float]] = None

class Taxonomy:
  """Value object of a taxonomy and associated category embeddings."""

  def __init__(self, categories: Optional[list[Category]] = None) -> None:
    self.categories = categories if categories else []

Now, we can import the taxonomy from a spreadsheet and build the Taxonomy object.

import google.auth
import gspread

credentials, _ = google.auth.default(scopes=[
    'https://spreadsheets.google.com/feeds',
    'https://www.googleapis.com/auth/drive',
])

gs = gspread.authorize(credentials)

def get_taxonomy_from_spreadsheet(
    spreadsheet_id: str,
    worksheet_name: str,
    col_index: int,
    header: bool = False,
) -> Taxonomy:
  """Imports a Taxonomy from a Google Spreadsheet."""
  spreadsheet = gs.open_by_key(spreadsheet_id)
  worksheet = spreadsheet.worksheet(worksheet_name)
  values = worksheet.col_values(col_index)
  values = values[1:] if header else values
  categories = []
  for index, value in enumerate(values):
    categories.append(Category(id=str(index), name=value))
  return Taxonomy(categories=categories)

taxonomy = get_taxonomy_from_spreadsheet(
    'your-spreadsheet-id', 'your-worksheet-name', col_index=1, header=True
)

At this point the embeddings attribute of each Category object is still empty, so we need to add the embedding vectors.

category_names = [category.name for category in taxonomy.categories]
category_embeddings = get_embeddings_batch(category_names)
for category in taxonomy.categories:
  category.embeddings = category_embeddings[category.name]

Note that the method get_embeddings_batch ensures we stay within the limit of 250 texts per request by batching the categories (we use a conservative batch size of 200). We implemented this method as follows:

_MAX_BATCH_SIZE = 200

def get_embeddings_batch(text_list: list[str]) -> dict[str, list[float]]:
  """Gets the embeddings for texts and maps texts to their embeddings.

  Args:
    text_list: A list of texts.

  Returns:
    A dictionary with list elements as keys and their embedding vectors.
  """
  batch_start = 0
  embeddings_vectors = []
  while batch_start < len(text_list):
    next_batch_index = batch_start + _MAX_BATCH_SIZE
    batch = text_list[batch_start:next_batch_index]
    # Reuses the TextEmbeddingModel instance created earlier.
    embeddings = model.get_embeddings(batch)
    for embedding in embeddings:
      embeddings_vectors.append(embedding.values)
    batch_start = next_batch_index
  return dict(zip(text_list, embeddings_vectors))

Once we have the taxonomy object filled with categories and their embeddings, we can write them as JSON files to Google Cloud Storage according to the requirements. We aim for JSON file sizes of ~50MB each; however, there are no official specifications on what file size is appropriate for index generation.

First, we’ll add the to_category_embedding_list method to the Taxonomy class to return the categories and their embeddings in the required format.

class Taxonomy:
  """Value object of a taxonomy and associated category embeddings."""

  def __init__(
      self,
      categories: Optional[list[Category]] = None,
  ) -> None:
    self.categories = categories if categories else []

  def to_category_embedding_list(self) -> list[dict[str, Any]]:
    """Returns the taxonomy as a list of categories and embeddings."""
    category_embedding_list = []
    for category in self.categories:
      category_embedding_list.append({
          'id': category.name,
          'embedding': category.embeddings,
      })
    return category_embedding_list

We can now write the taxonomy embeddings to Google Cloud Storage:

import json
import math
import google.auth
from google.cloud import storage
import numpy as np

_CATEGORIES_PER_FILE = 3500
_BUCKET_NAME = 'your-gcs-bucket'

credentials, project = google.auth.default()
storage_client = storage.Client(credentials=credentials, project=project)
bucket = storage_client.bucket(_BUCKET_NAME)

def write_taxonomy_embeddings(taxonomy: Taxonomy) -> None:
  """Writes a taxonomy to a Google Cloud Storage bucket as JSONL files."""
  category_embeddings = taxonomy.to_category_embedding_list()
  file_prefix = 'embeddings'
  num_chunks = max(
      1, math.ceil(len(category_embeddings) / _CATEGORIES_PER_FILE)
  )
  chunks = np.array_split(category_embeddings, num_chunks)
  for index, chunk in enumerate(chunks):
    # One JSON record per line, as required for index creation.
    data_jsonl = '\n'.join(
        [json.dumps(record, separators=(',', ':')) for record in chunk]
    )
    file_name = f'{file_prefix}_{index}.json'
    blob = bucket.blob(file_name)
    blob.upload_from_string(
        data=data_jsonl, content_type='application/octet-stream'
    )

write_taxonomy_embeddings(taxonomy)

On Google Cloud Storage this results in a series of embeddings_<index>.json files in the bucket.

We can also visualize the category embeddings in a 3-dimensional space using UMAP dimensionality reduction.

The colors represent the different parent categories and the points represent individual categories. Logically, child categories share a color with their parent and siblings and lie closer to each other.
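A hedged sketch of how such a visualization can be produced with the umap-learn and matplotlib packages (assuming the taxonomy object built above; deriving parent categories from the first segment of each category path is our own convention):

import matplotlib.pyplot as plt
import numpy as np
import umap

vectors = np.array([c.embeddings for c in taxonomy.categories])
# The parent (first path segment) determines the color of each point.
parents = [c.name.split('>')[0] for c in taxonomy.categories]
parent_ids = {p: i for i, p in enumerate(sorted(set(parents)))}
colors = [parent_ids[p] for p in parents]

# Reduce the 768-dimensional embeddings to 3 dimensions.
coords = umap.UMAP(n_components=3).fit_transform(vectors)

ax = plt.figure().add_subplot(projection='3d')
ax.scatter(coords[:, 0], coords[:, 1], coords[:, 2], c=colors, s=5)
plt.show()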

Approximate Nearest Neighbor (ANN) Index Creation

With our files on Google Cloud Storage, we are ready to create the ANN index. Depending on the number of categories this can take some time. In our case it took ~45 minutes for ~32k categories.

from google.cloud import aiplatform
import google.auth

_PROJECT = 'your-cloud-project-id'
_BUCKET_URI = 'gs://your-gcs-bucket'

credentials, _ = google.auth.default()
aiplatform.init(
    project=_PROJECT,
    location='us-central1',
    credentials=credentials,
)

embedding_index = aiplatform.MatchingEngineIndex.create_tree_ah_index(
    display_name='my-index',
    # GCS URI of the location containing the embeddings JSON files.
    contents_delta_uri=_BUCKET_URI,
    dimensions=768,
    approximate_neighbors_count=10,
    distance_measure_type='DOT_PRODUCT_DISTANCE',
    shard_size='SHARD_SIZE_SMALL',
    feature_norm_type='UNIT_L2_NORM',
)

There are various distance measure types available. Using DOT_PRODUCT_DISTANCE in combination with UNIT_L2_NORM is the recommended approach.

MatchingEngineIndexEndpoint & Index Deployment

To create an index endpoint and deploy our index to this endpoint:

import time

_DEPLOYED_INDEX_DISPLAY_NAME = 'my-deployed-index-endpoint'
# Deployed index IDs may only contain letters, numbers and underscores.
_DEPLOYED_INDEX_ID_PREFIX = 'my_deployed_index'

embedding_index_endpoint = aiplatform.MatchingEngineIndexEndpoint.create(
    display_name='my-index-endpoint',
)

time_id = time.time_ns()
deployed_index_id = f'{_DEPLOYED_INDEX_ID_PREFIX}_{time_id}'

deployed_index_endpoint = embedding_index_endpoint.deploy_index(
    index=embedding_index,
    deployed_index_id=deployed_index_id,
    display_name=_DEPLOYED_INDEX_DISPLAY_NAME,
    min_replica_count=1,
    max_replica_count=10,
    machine_type='e2-standard-2',
)

Querying

The index endpoint now allows us to perform ANN searches with a given embedding vector of the same size.

sample_vector = [0.123, 0.321, ...]  # Length of 768.

response = deployed_index_endpoint.match(
    deployed_index_id=deployed_index_id,
    queries=[sample_vector],
    num_neighbors=2,
)

print(response)
>> [[MatchNeighbor(id='category1', distance=0.80520508289337158, feature_vector=None, crowding_tag=None, restricts=None, numeric_restricts=None), MatchNeighbor(id='category2', distance=0.72440195083618164, feature_vector=None, crowding_tag=None, restricts=None, numeric_restricts=None)]]
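Note that with DOT_PRODUCT_DISTANCE the returned distance is the dot product itself, so higher values indicate closer matches and the neighbors arrive sorted best-first. A minimal sketch (our own naming, not the actual implementation) of turning this response into a classification result:

def to_classification_result(response) -> list[dict[str, float]]:
  """Maps the MatchNeighbor response of one query to scored categories."""
  return [
      {'category': neighbor.id, 'score': neighbor.distance}
      for neighbor in response[0]
  ]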

Going beyond text classification

If we want our API to classify not just text but also images and videos, a text embedding model won’t work. The good news is that Vertex AI also offers multimodal embedding models; with those, text, image, and video embedding vectors share the same semantic space and dimensionality and can therefore be used interchangeably.

There are a few downsides:

  • Latency: Using the multimodal model, a single text has to be passed as contextual_text in each request, whereas with the text embedding model we can pass up to 250 texts in a single request. This is a bottleneck if the API has to handle high volume with low latency.
  • Quality: Text classification really should be done using a text embedding model for better results, and we expect the vast majority of classification tasks to involve text.

Instead, we stick with the text embedding model and use a two-step approach for the classification of images and videos: we first obtain a text description of the passed image or video, then perform the similarity search. We can use a Gemini model to generate the description and query our index endpoint with its embeddings. Below is an example of how to get a description for an image using Gemini.

import vertexai
from vertexai.generative_models import GenerativeModel, Part

_PROJECT = 'your-cloud-project-id'

vertexai.init(project=_PROJECT, location='us-central1')

model = GenerativeModel(model_name='gemini-1.5-flash-001')

image_file = Part.from_uri(
    'gs://path/to/kitten.jpeg', 'image/jpeg'
)

# Query the model.
response = model.generate_content([image_file, 'what is this image?'])
print(response.text)
>> 'A kitten with blue eyes looking up.'
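Putting both steps together, here is a hedged sketch of image classification, reusing the TextEmbeddingModel import from earlier and the deployed_index_endpoint and deployed_index_id objects from the previous sections (the function name and prompt are our own):

def classify_image(image_uri: str, num_neighbors: int = 3):
  """Classifies an image: describe it with Gemini, then match the text."""
  gemini = GenerativeModel(model_name='gemini-1.5-flash-001')
  embedder = TextEmbeddingModel.from_pretrained(
      'textembedding-gecko-multilingual@001'
  )
  # Step 1: obtain a short text description of the image.
  image = Part.from_uri(image_uri, 'image/jpeg')
  description = gemini.generate_content(
      [image, 'Describe this image in one sentence.']
  ).text
  # Step 2: embed the description and search the category index.
  vector = embedder.get_embeddings([description])[0].values
  return deployed_index_endpoint.match(
      deployed_index_id=deployed_index_id,
      queries=[vector],
      num_neighbors=num_neighbors,
  )

matches = classify_image('gs://path/to/kitten.jpeg')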

Conclusion

By leveraging the capabilities of Google Cloud Platform, including Cloud Run and Vector Search, we’ve demonstrated a scalable and efficient solution for building such an API. This approach not only simplifies the classification process but also opens up a world of possibilities for various industries, from marketing and advertising to content management and e-commerce.

Building a custom classification API is a powerful way to harness the capabilities of large language models for your specific needs. Whether you’re a marketer, content creator, or data analyst, a custom classification API can be a valuable asset in your toolkit.
