Stories by Greg Sommerville on Medium

The Developer’s Guide to OpenCode on Google Cloud

Greg Sommerville — Mon, 18 May 2026 14:43:06 GMT

Combine model flexibility with enterprise-grade security and performance

What is OpenCode?

OpenCode is an open-source coding platform that provides developers with a powerful AI-driven development environment through your choice of interfaces: a Terminal User Interface (TUI), a desktop application, an IDE plug-in, or a web page interface. It’s similar to other agentic coding platforms like Google’s Gemini CLI, Anthropic’s Claude Code, or OpenAI’s Codex, with the main difference being that OpenCode is model-agnostic.

Why would I use OpenCode with Google Cloud Platform (GCP)?

The main benefits of OpenCode are:

It’s open source (and therefore the source code is available for examination)
It allows easy switching between multiple models (either cloud-based or local)
It maintains data privacy, since data only flows between the OpenCode UI and the model it’s using.

The main benefits of using OpenCode with GCP are:

You can choose from all of the models available in the Model Garden, from the latest version of Gemini to open source models like Gemma, Qwen, or Kimi.
You can pick which region your model resides in, which allows you to control data locality.
GCP guarantees customer data, source code, and prompts are never used to train foundation models. This means that using models from the GCP Model Garden keeps your data private.
You can route your data through private networking via Private Service Connect, which keeps all of your data off of the public internet. If your business has data residency requirements, this can really help.
You can fine-tune your own model and host it on GCP, and use that model for coding assistance.
You can track usage and costs via Resource Labels, and billing is rolled up under your GCP billing, rather than being a separate cost.

To me, the strongest argument in OpenCode’s favor is the fact that you can easily switch between different models, often just by choosing the model from a dropdown control. Different models have different strengths and weaknesses, and being able to choose to use Gemini 3.1 Pro or Claude Opus 4.7 or any other model to suit my needs is a major strength.

While OpenCode supports local models, real-world coding requires massive context windows and KV caching that quickly overwhelm standard consumer GPUs — even modern 16GB cards like an RTX 5060 Ti. Using cloud-hosted models gives you access to enterprise-grade hardware without the severe performance degradation of local offloading. The bottom line is that unless you have a very powerful machine, a local model probably isn’t going to be good enough for real coding. Because of that, I think a cloud-based model is the way to go.

How do I install and set up OpenCode?

You can install OpenCode by downloading an installer from https://github.com/anomalyco/opencode. Note that there are several installation options, including using npm.

If you’re a Windows user like me, you may want to check the OpenCode documentation about how to set up the server component of OpenCode to run under WSL, which provides faster file access, and a unified Linux toolset. That said, I will say that I use the desktop version of OpenCode on Windows without WSL, and I have yet to see any problems. However, if you plan to let the agent run complex shell scripts or run local testing suites, using a WSL environment ensures the agent doesn’t trip over Windows-specific CLI syntax.

Activating Models

Once you have the software installed, the next step is to enable the use of different models within Model Garden. Here’s how to do that.

Log into the GCP console, choose or create your project, and navigate to the “APIs & Services” page, and click on the button labelled “+ Enable APIs and services”. Enable the “Agent Platform API”. This allows you to use the models in the model garden. The next step is to activate the models you want to use.

In the search bar at the top of the page, type in “garden”. That will give you a link that will take you to the Model Garden main page. From there, activate the models you wish to use with OpenCode.

Some models like Gemini and Claude are usage-driven, meaning that you don’t have to manually spin up a virtual machine to host them, and instead you pay only for the input and output tokens. Other models require a dedicated endpoint, which will incur costs related to having that server up and running, regardless of how much you use it.

Cost Warning for Dedicated Endpoints: Unlike Gemini’s pay-per-token API, hosting an open-source model on a dedicated endpoint means you are paying for the virtual machine (often equipped with expensive NVIDIA L4 or A100 GPUs) 24/7. Pro-tip: If you are using a dedicated endpoint for personal testing, write a quick gcloud script to spin down/pause the endpoint when your workday ends, or set up GCP budget alerts to prevent weekend cost spikes.

Configuring OpenCode

The next step is to tell OpenCode about which models are available for use, and which GCP project they are activated under. The first thing to do is to authenticate with GCP, which you accomplish using the following command:

gcloud auth application-default login

This command will open a web page to allow you to authenticate with GCP. Behind the scenes, OpenCode utilizes your local Application Default Credentials (ADC) to securely authenticate direct API requests to Vertex AI, meaning your GCP IAM permissions dictate exactly which Model Garden endpoints OpenCode is allowed to call.

Finally, use the following variable to specify which project within GCP to use (use “export” on WSL or Linux, use “set” in Windows):

export GOOGLE_CLOUD_PROJECT=

Alternate Approach: Note that if you don’t want to use ADC (application default credentials), you can set the following environment variable to point to the file that defines a service account to use:

export GOOGLE_APPLICATION_CREDENTIALS=

I recommend checking the official docs at https://opencode.ai/docs/providers/#google-vertex-ai for details about connecting to GCP-hosted models, as things change over time.

Private Service Connect

If your organization requires that traffic to Vertex AI stay off the public internet — for data residency, compliance, or general security posture — you can route OpenCode’s API calls through a Private Service Connect (PSC) endpoint instead of the default public Google API endpoints.

Setting up PSC is a non-trivial networking task that requires proper VPC configuration. At a high level, it involves:

Creating a Global PSC Endpoint: Although Vertex AI uses regional hostnames (e.g., us-central1-aiplatform.googleapis.com), standard API access requires a global PSC endpoint. You will need to reserve a global internal IP address in your VPC and create a forwarding rule that points to the global Google APIs bundle (either all-apis or vpc-sc).
Configuring Private DNS: To make the routing seamless for OpenCode without needing to override application base URLs, create a private Cloud DNS zone for googleapis.com. Within this zone, create an A record (e.g., *.googleapis.com or specifically for the Vertex AI hostname) that resolves to the internal IP address of your new PSC endpoint.
Ensuring Connectivity: Ensure whatever machine runs OpenCode can reach that endpoint. This is easiest if OpenCode runs on a Cloud Workstation or GCE VM inside the VPC. Reaching it from a local machine additionally requires Cloud VPN or Cloud Interconnect, plus a Cloud DNS inbound forwarding policy so your local host can correctly resolve the private googleapis.com hostname.

The relevant Google Cloud documentation to follow is:

Configure Private Service Connect to access Google APIs: This guide walks through the exact step-by-step setup for a global endpoint and DNS.

Because you are using Cloud DNS to seamlessly route googleapis.com traffic to your internal VPC endpoint, you do not need to manually configure custom endpoints or alter the opencode.json configuration file. OpenCode will route its API calls securely and internally by default.

Using OpenCode

Once your project is set up with the required enabled APIs and models, and (optionally) you’ve set up private networking, you’re ready to start using the tool. If you installed the CLI version, simply type “opencode” to run it. The desktop version is launched just like any other application, so once the application shows on the screen, you are ready to start typing in queries. Like many other agentic coding systems, “/init” will examine your current code base and produce a Markdown file with an overview of the code and important details.

Conclusion

There are many different tools available for AI-assisted coding these days, but OpenCode stands out for its flexibility. With a variety of interfaces and simple model-switching, it’s a tool well worth exploring.

The Developer’s Guide to OpenCode on Google Cloud was originally published in Google Cloud - Community on Medium, where people are continuing the conversation by highlighting and responding to this story.

Building Offline RAG on iOS: How to Run Gemma 3N Locally

Greg Sommerville — Wed, 03 Dec 2025 19:05:47 GMT

Image created with Gemini

Running a Large Language Model (LLM) like Gemma 3N on an iPhone requires a fundamental shift in mindset. As a cloud developer, I’m used to infinite RAM and simple API calls to models like Gemini. But for this project, those luxuries were gone.

The goal was strict: build a mobile app with a bundled LLM, an embedding model, and a vector database — all operating fast enough to be usable, and entirely without an internet connection. Here is how we squeeze that much power into a pocket-sized device.

Just Your Typical RAG Chatbot, but Not

The goal was to create an iPhone RAG (retrieval augmented generation) chatbot app capable of answering incredibly complex, technical questions about the maintenance of industrial equipment. The source of truth for those questions was a single 350 page PDF reference document that was jammed with complicated tables, images, and detailed text.

To make this work, I needed a hybrid approach: combining keyword matching with semantic (vector) search. But here is the catch: Semantic search requires an embedding model running locally. Suddenly, our limited memory budget isn’t just for the LLM; it has to be shared with the embedding model, the vector database, and the app logic itself.

So given the LLM, a hybrid RAG database, and a separate embedding model, the big question is this: how do you put all of that into a single iPhone application, given that most LLMs are very large (not just in terms of number of parameters, but also pure size as measured in gigabytes), and most phones don’t have that much memory (at least compared to desktops or cloud-based machines)?

The first step is to choose a model.

Choosing a Model

Since my target hardware was an iPhone 16, I had a hard ceiling of 8GB of RAM. But the OS and the app code take probably about 3 GB of that, leaving us with a very tight budget for the models and the database. Finding an effective LLM that can run in 3 or 4 GB of memory (a reasonable amount) can be a challenge.

To narrow down the candidates, I didn’t just look at benchmarks. I used Ollama on my desktop to host multiple quantized small-scale models, feeding them specific questions related to the industry this app is for. I was able to create a set of sample questions and compare the answers from multiple LLMs using this handy tool. This gave me a sense of which models had decent built-in knowledge that would be helpful for this use case, and which ones I should skip.

This testing process highlighted that while several models were fast, Gemma 3N offered the best reasoning capabilities for our specific technical domain. Although I saw some good results from models like Gemma 3 (not 3N) and Qwen, ultimately I got the best answers from Gemma 3N. That’s good, because the 3N models are designed to be hosted on edge devices just like the iPhone.

At the highest level, there are two versions of Gemma 3N called E2B and E4B. “E2B” stands for “effectively 2 billion parameters”, and “E4B” means “effectively 4 billion parameters”. Normally you want to use the largest model that makes sense for your use case, because typically a 4B model gives better results than a 2B model, but in this case we need to think about memory usage.

By the way, the “Effective” prefix highlights that the model can run with a reduced memory and compute footprint compared to its total number of parameters. For example, E2B actually contains over 5 billion parameters, but through some innovative optimization methods like Per-Layer Embedding (PLE) caching, conditional parameter loading, and the use of MatFormer architecture, the number of parameters loaded is actually much closer to only 2 billion.

Although both Gemma E2B and Gemma E4B work on the iPhone, the quality of answers from E2B wasn’t significantly less than those from E4B in my case, and since E2B was smaller and faster, that tipped the scales in terms of choosing the E2B variant.

How to Use a LLM on an iPhone

When writing an iPhone app in Swift, there are two obvious options for hosting an LLM: Google’s MediaPipe, and Apple’s MLX Swift.

MediaPipe is cross-platform (iOS, Android, and web) and supports TensorFlow Lite (TFLite) models, recently rebranded to LiteRT, where “RT” stands for Runtime. You can find LiteRT models on Hugging Face.

MLX on the other hand was written by Apple and only runs on Apple hardware. It supports models stored in Safetensors files, which can also be found on Hugging Face.

Based on my testing, MLX was much faster for certain operations, and the number of model variants available on Hugging Face for MLX was quite a bit larger than the number for MediaPipe. For these reasons, and because I had no need for cross-platform functionality, I went with MLX Swift.

Important note about Quantizing: When you browse the models available on Hugging Face, you’ll see many variants. Even narrowing down to Gemma 3N E2B, you’ll see several different versions of those. There are really two main things I look for in this case: instruction tuning, and the number of bits used for quantizing. (“Quantizing” is the process of taking each of the parameters in a model and shrinking them down in order to save space.)

Instruction tuning is often indicated by an “it” string in the model name. That means it was trained to follow instructions, which is a necessity when dealing with something like a RAG chatbot.

Think of Quantization as compressing a high-resolution image. We take the massive, high-precision parameters of the model (usually 16-bit floating point numbers) and shrink them down to 4-bit integers. While this sounds like a drastic loss of data, it allows us to fit a massive brain into a tiny memory budget with surprisingly little loss in intelligence.

Bottom line — look for an instruction-tuned model that is quantized to 4 bits. The model I used is called gemma-3n-E2B-it-lm-4bit.

Including the Model in your App

When you download a model from Hugging Face, it comes as a set of files. Although the majority of the model is saved in .safetensor files, other files are included to configure the model and support the associated tokenizer.

The best way to include that in your app is to create a Folder reference in Xcode that points to the folder with the model files. This way you can update the folder as you need and don’t have to worry about adding or modifying individual files.

A Note on the App Store: The Apple App Store has limits as to how big an app can be, both in terms of initial loading, and total size. You won’t be able to create a very large app like this one and offer it via the App Store. Instead, this approach is good only for situations where you are deploying to corporate devices by using a Mobile Device Management (MDM) solution or something like that. Alternatively, you could leave the model files outside of your app and download them on the first run of the app.

Loading and Calling the Model

From a code perspective, I created a single service called LocalLLMService.swift that handles loading the model and also sending back responses, either streamed or all-at-once.

Let’s start with code for loading the model from our embedded resources. For brevity, I’ll only include the most important parts. You can find the entire file in this GitHub Gist.

First, let’s include the packages we need.

import Foundation
import MLX
import MLXNN
import MLXLLM
import MLXLMCommon
import Tokenizers

Then we load the model in the loadModel() function:

// Get full path to model directory
guard let bundlePath = Bundle.main.resourcePath else {
      throw ModelError.modelNotFound("Unable to access app bundle")
}


let fullModelPath = (bundlePath as NSString).appendingPathComponent(modelPath)


// Verify model directory exists
let fileManager = FileManager.default
var isDirectory: ObjCBool = false
guard fileManager.fileExists(atPath: fullModelPath, isDirectory: &isDirectory),
        isDirectory.boolValue else {
      throw ModelError.modelNotFound(fullModelPath)
}

// Verify required model files exist
let modelFile = (fullModelPath as NSString).appendingPathComponent("model.safetensors")
let tokenizerFile = (fullModelPath as NSString).appendingPathComponent("tokenizer.json")
let configFile = (fullModelPath as NSString).appendingPathComponent("config.json")

guard fileManager.fileExists(atPath: modelFile) else {
      throw ModelError.modelNotFound("model.safetensors not found")
}
guard fileManager.fileExists(atPath: tokenizerFile) else {
      throw ModelError.tokenizerLoadingFailed("tokenizer.json not found")
}
guard fileManager.fileExists(atPath: configFile) else {
      throw ModelError.modelLoadingFailed("config.json not found")
}

// Load MLX model container with Metal acceleration
print("  Loading MLX model container...")

// Create model configuration with local directory URL
let modelURL = URL(fileURLWithPath: fullModelPath)
let modelConfig = ModelConfiguration(
      directory: modelURL,
      defaultPrompt: "You are a helpful assistant."
)

// Load the model container using LLMModelFactory
self.modelContainer = try await LLMModelFactory.shared.loadContainer(
      configuration: modelConfig
) { progress in
      print("  Loading progress: \(Int(progress.fractionCompleted * 100))%")
  }

Once that’s done, there’s a crucial step that helps with memory issues:

// Configure MLX GPU buffer cache limit to prevent memory accumulation
// MLX caches freed GPU memory for reuse, but this can cause OOM on repeated inferences
// Set limit to 50 MB to allow some caching while preventing excessive accumulation
let cacheLimit = 50 * 1024 * 1024  // 50 MB
MLX.GPU.set(cacheLimit: cacheLimit)

Memory use is the major issue when dealing with LLMs on mobile hardware. The MLX framework does have a tendency to hold on to memory, which can accumulate and cause your app to crash after just a couple of queries. The above code explicitly controls how much memory is allocated, which fixes this problem.

One other note about memory: another really important step is to add an entitlement (via Xcode) to indicate that your app needs more memory. This results in the com.apple.developer.kernel.increased-memory-limit entitlement to be added to your app.

Now that the model is loaded, let’s look at how it’s called. The code supports both streaming and non-streaming responses. Let’s look at the streaming responses:

guard let container = modelContainer else {
    print("✗ Model container not initialized")
    return
}

do {
print("  Generating streaming response with MLX...")
print("  Prompt: \"\(prompt.prefix(50))\(prompt.count > 50 ? "..." : "")\"")


// Set up generation parameters
let params = GenerateParameters(
    temperature: temperature,
    topP: topP,
    repetitionPenalty: repetitionPenalty
)

// Capture values to avoid retaining self in closure
let maxTokensLimit = self.maxTokens

// Generate with streaming callback
let result = try await container.perform { context in
    // Prepare input with user messages using context processor
    let fullPrompt = prompt
    let input = try await context.processor.prepare(input: .init(prompt: fullPrompt))

    var localTokenCount = 0
    return try MLXLMCommon.generate(
        input: input,
        parameters: params,
        context: context
    ) { tokens in
        // tokens array is cumulative (all tokens so far), not incremental
        localTokenCount = tokens.count

        // Decode new tokens to text (synchronous decode)
        let newText = context.tokenizer.decode(tokens: tokens)

        // Stop if we've hit the EOS token ID (model is done) - check first for natural completion
        if let eosTokenId = context.tokenizer.eosTokenId,
           tokens.contains(eosTokenId) {
            return .stop
        }

        // Stop if we see end-of-turn markers in the decoded text
        if newText.contains("") || newText.contains("") {
            return .stop
        }

        // Stop if we've hit max tokens (safety limit)
        if localTokenCount >= maxTokensLimit {
            print("⚠️ Max token limit reached (\(maxTokensLimit)) - appending truncation notice")
            // Send truncation notice to user
            Task { @MainActor in
                onPartialResponse("\n\n[Response truncated - maximum length reached]")
            }
            return .stop
        }

        // Clean up EOS markers before sending to callback
        var cleanedText = newText
        cleanedText = cleanedText.replacingOccurrences(of: "", with: "")
        cleanedText = cleanedText.replacingOccurrences(of: "", with: "")

        // Only send non-empty cleaned text to callback
        if !cleanedText.isEmpty {
            // Call the partial response callback on main thread
            Task { @MainActor in
                onPartialResponse(cleanedText)
            }
        }
        return .more
    }
}

let generationTime = Date().timeIntervalSince(startTime)
print("  Generation time: \(String(format: "%.3f", generationTime))s")
print("  Tokens generated: \(result.tokens.count)")

// Force MLX to evaluate computation graph and release GPU buffers
// This triggers the cache limit policy, allowing old buffers to be freed
MLX.eval()

There are a couple of key points to take into consideration. First, at the top of the function we set the LLM parameters like temperature, top-P, etc. Second, we can specify a maximum number of output tokens, and the code stops calling the LLM once that limit is reached.

Finally (and perhaps most importantly), this implementation differs from standard streaming. The callback returns the total accumulated response so far, rather than just the new tokens. Your UI code should replace the current text view entirely on every update, rather than appending to it. This differs from the normal approach of the caller keeping the current answer and appending the new tokens.

Finally, note the last step (MLX.eval()), which is used to force MLX to release some internal buffers, which is another part of the memory saving approach.

Conclusion: The Cloud in Your Pocket

A year ago, building a RAG system capable of answering complex maintenance questions required a cloud GPU cluster and an API key. Today, we have that same capability running offline on a phone.

By carefully selecting an capable, small model like Gemma 3N, utilizing the unified MLX ecosystem, and respecting the strict memory limits of iOS, we didn’t just build a chatbot — we built an entire RAG solution. We proved that the edge is no longer just for “toy” models. It is ready for real work.

The constraints of mobile development — battery, thermal, and RAM — forces us to be better engineers. And honestly? Watching those tokens stream onto an iPhone screen feels a lot more satisfying than getting a JSON response from a server.

Building Offline RAG on iOS: How to Run Gemma 3N Locally was originally published in Google Cloud - Community on Medium, where people are continuing the conversation by highlighting and responding to this story.

How Mixture-of-Experts LLMs Work

Greg Sommerville — Tue, 26 Aug 2025 21:39:49 GMT

An innovative approach to make models more efficient

Created using Imagen

The introduction of Generative AI models has fundamentally changed the landscape of what can be done with text, images, sound, and videos, offering dazzling capabilities like text summarization, customer feedback analysis, automated data entry, automated document reviews, code generation, and many, many more.

Large Language Models (LLMs) like Google’s Gemini continue to push the boundaries of what’s possible, but as these models grow in size and complexity, they bring with them significant challenges related to computational cost, training time, and efficient deployment.

This is where Mixture of Expert (MoE) LLM models provide a key architectural innovation. This article will demystify MoE architectures, explaining in plain English how they allow for the creation of incredibly powerful yet surprisingly efficient language models that are helping to shape the future of AI.

The Evolution of Large Language Models

In less than the last ten years, we’ve witnessed an explosion of AI capabilities. Back in 2018, models like BERT provided functionality that was truly impressive at the time, such as classifying text, extracting named entities (people, organizations, locations, and dates) from text, and even some simple question answering. BERT didn’t produce text itself, but the numeric answers it produced were very helpful for many types of problems.

Fast forward to 2025, and models like Gemini can not only produce long blocks of text or code, but also appear to demonstrate advanced reasoning, often working through complex problems step-by-step. How did we come so far so quickly?

One of the things driving that advancement is a massive increase in the size of these models. But what does “size” mean when talking about LLMs? Basically, it means the number of different numeric values in the model. When you ask Gemini a question, your text gets converted to numbers, and those numbers flow through what is essentially a giant mathematical formula, which uses numbers (called parameters) to change your input data over a series of steps, culminating in the final output numbers being converted back into text.

Think of the model as an intricate machine with countless adjustable knobs and levers — these are its ‘parameters.’ Each parameter is a numeric value that helps shape how the input data is transformed at every step, eventually leading to the final output.

The more parameters you have in your model, generally the more powerful the model becomes, since it can then incorporate more subtle patterns and relationships into its knowledge. Just to provide a little context for comparison, BERT had hundreds of millions of parameters, but modern LLMs often have billions or even trillions of parameters.

Something else happens when models increase in size. Besides more nuanced understanding of patterns, a surprising number of behaviors show up unexpectedly when a model’s size increases past a certain threshold. These are called emergent behaviors, and they include things like learning from examples, step-by-step reasoning, and even the ability to solve problems the models weren’t specifically designed for. No one trained the models on these behaviors — instead, these behaviors simply emerged once model sizes increased past a certain threshold.

Sounds great, doesn’t it? And it seems to imply that bigger models are always better than smaller models. So why wouldn’t we use the largest model possible every time, given that we want the best quality results?

Cost is the answer. All those mathematical calculations have to be performed on computer hardware, and the more calculations there are, the more time and computing power it takes, which results in higher electricity usage and therefore more cost.

Bigger models are also generally slower than small models, and they often require specialized computing hardware that can handle very large number of calculations done in parallel. This is true both for using a finished model (called inference), as well as creating a new model (called training).

Mixture of Experts (MoE) models attempt to solve this problem by using an innovative new architecture. Instead of using every single parameter in a model, they use only subsets of the model as needed, based on the user query. This means the model can incorporate knowledge of many different topics, and that knowledge is stored in such a way that only the most relevant sections of the models are activated as needed.

How LLMs Work

Traditional LLMs and MoE models have a lot in common. If you think of a model as a series of steps, there are a few steps that are the same for both types of models. Let’s go through each of those early steps.

Tokenization — Turning Text into Numbers

The first thing a LLM does is to convert the input text into numbers via a process called tokenization. Each model has a vocabulary, and that vocabulary consists of words with corresponding IDs. As an example, let’s say the word “dog” has a token ID of 6420. Each time the model finds the word “dog”, it will represent that by the number 6420.

Now to be really precise, I will say that some tokens are fragments of words (like “ing”) or even punctuation marks. Trust me when I say that there are benefits to tokenizing parts of words rather than whole words, but the details are not relevant for this discussion. Suffice it to say that tokens are either words, fragments of words, or things like punctuation marks.

Here’s an illustration of how the sentence “I am walking the dog” is converted into tokens:

The text “I am walking the dog” with boxes around each token, and arrows pointing to corresponding token ID numbers

Embeddings — Numbers that contain meaning

Converting text words into numeric tokens is a great first step, but the number 6420 (representing “dog”) doesn’t have much meaning by itself. The next step is to use an embedding for the token.

An embedding is simply a list of numbers (known as a vector). Embeddings can be quite large, having hundreds or even thousands of numbers per vector. Each model has an embedding for each token, as is shown here:

A set of rows, each row with a token ID as an index, with each row containing multiple floating point numbers

The idea of the embedding is that each of the numbers in the vector somehow represents some quality of the token itself. By having hundreds or thousands of numbers with different values (or magnitudes) per vector, you get a combination of qualities that can faithfully represent any concept.

In other words, it’s a way to associate meaning with each token, and it’s helpful in terms of comparing different tokens since the embedding vector for “cat” has a lot of commonality with the embedding vector for “dog”. Why? Well, they’re both animals, both common pets, both mammals, both quadrupeds, etc. Many of the similarities can be expressed by similar elements in the embeddings for both words.

An embedding vector showing possible interpretations for a few of its elements

Blocks — Modular pieces of the model

Once the initial tokenization happens and the initial set of vectors has been looked up for each token, the rest of the model processing happens. That processing is divided into transformer blocks, and a model can have hundreds of them, each feeding their output into the input of the next block until all the blocks have been used. At that point, the final output vector is transformed into a probability vector, and a single token is selected from it.

Simplified architecture diagram for an LLM, showing initial layers and a set of transformer blocks

We’ll talk about what happens in each block shortly, but the important thing to remember is that an LLM is constructed of multiple blocks, with each block taking in input vectors and outputting modified vectors. Those modified vectors are passed to the next block, and the process continues until the data reaches the output layer of the final block.

Once we reach the final layer of the final block of the model, an output vector is created, and the values in that vector are used to determine the single output token. This is done by converting the final output vector into a new vector that contains probabilities for each possible output token in the model’s vocabulary. In this case, all of the values in the probability vector add up to 1, and each value represents the probability that the corresponding token will be selected.

Using attributes like top-P, top-K, and temperature, the model then picks a token based on the contents of the final probability vector. Those attributes are different approaches used to select a final token given a selection of choices. “Top P” means choose from the highest percentage choices, “top K” means choose from a fixed-size set of the most likely options, and temperature controls how much to weigh towards unlikely choices versus higher-probability choices.

Once selected, the final resulting token is then appended to the input string, and the entire LLM process starts over again, with the new string passed in to the model, then tokenized, then converted to embeddings, then passing through the attention blocks, etc.

Attention — How do tokens influence each other?

So that’s the overall architecture. Now let’s talk about what happens within each transformer block.

The input for a block is a set of vectors, with one vector for each token in the input query. The first thing we do with those vectors is to modify them using a mechanism called self-attention, which is a process that modifies each token’s vector with information from the other vectors in the input string.

In other words, once we have initial embedding vectors for each token, we need a way for the model to understand the context of the entire sentence. A human can instantly tell that in the sentence, “The dog chased the cat, and it ran away,” the word “it” refers to “the cat.” An LLM has to learn this kind of relationship. That’s where the attention mechanism comes in.

Essentially, the model looks at every single token’s embedding vector, and for each token it asks the question: “How important are all the other tokens in the sentence to me right now?” The attention mechanism calculates a weight for every other token, a process that is essentially the model’s way of determining how relevant or related each token is to the others.

The attention mechanism allows the model to create a new, refined embedding vector for each token that is no longer just its standalone meaning but is now contextually aware. This new vector for “it” will now have a strong connection to the vector for “cat,” and a weaker one to the vector for “dog” for example.

In short, the attention mechanism is how the model builds a rich, interconnected understanding of the entire text, allowing it to make sense of things like pronoun references, word relationships, and the overall meaning of a sentence or paragraph.

Here’s an illustration of how different tokens influence each other. In this example, we’re creating a new vector based on the vector that originally came from the “dog” token, combined with every other preceding vector using a weighted sum. Each preceding token that is combined with the “dog” token has its own weight, so in the end some tokens have much more of an influence on the “dog” vector than others do.

Showing how different tokens have different weights, as they influence another token in the text

To be precise, this self-attention mechanism isn’t done just one for each token. Instead, the attention mechanism typically uses multiple sets of weights when modifying embeddings, and in the end these multiple modified vectors are mathematically combined. This process is called multihead attention, and it allows the model to combine the different token embedding vectors (with their different weights) in different ways.

The idea is that by combining multiple different takes of self-attention, a truer, deeper understanding of the relationships between the tokens will emerge.

In the example above, the tokens “walk” and “ing” influence the token “dog”, but the individual weights will vary from one attention head to another. In this simple example there probably aren’t a lot of really distinct sets of weights, but as the text gets more complex, having different attention heads provides many different perspectives on the text.

Finally, within the attention mechanism, there is also information added to each embedding that indicates the order of the vector within the overall text input. This is called positional encoding, and it’s used so the model understands the difference between “The dog is on the rug” and “The rug is on the dog”.

To sum up, the attention mechanism allows tokens to influence each other, so the model ends up with a much deeper understanding of the overall meaning of the text.

What is an “Expert” in an LLM?

At this point, we understand some basic architecture components of an LLM. Now let’s talk about what a Mixture-of-Experts model is, starting with an analogy.

Suppose we have a group of friends, and we want to route questions to different friends depending on the topic. However, at the very start of this process none of the friends really knows anything about any topic. So that means we essentially pick a random friend for a particular question. That friend doesn’t initially know anything about the topic, so they get the answer wrong. Since they did, you correct them so they learn more about the topic, and you make a mental note to route more questions about the same topic to them, since they are learning more as time goes by.

To extend the analogy, let’s say that instead of routing questions to one friend, you initially route a particular question to three friends. At the start none of them knows the answer, but again, you correct them when they are wrong and also remember to route more questions to them about this topic as they learn.

And since you’re asking three friends the same question, the odds are good that maybe one or two of the experts knew slightly more than the other experts, so you are also adjusting your thinking about which friends to trust the most for a particular question. In the end, you combine all the answers from all the friends you asked, but weigh them based on how much trust you have for each expert for this topic.

This is essentially the process of training an MoE model. The decisions about which experts to route a vector to are driven by a router, and during training both the selected (or activated) expert is trained and the router is also trained.

Now that we’ve looked at an analogy, let’s get into the technical details.

How MoE models differ from Dense models

Once the initial tokenization is done and initial embeddings are looked up, the remainder of an LLM’s processing involves passing data through a series of transformer blocks. The content of each block is where MoE models differ from traditional “dense” models.

Here’s a diagram comparing the transformer block architectures of a MoE model and a traditional dense model:

Comparing transformer blocks for dense models and MoE models

For a traditional dense model, the output vectors of the attention mechanism are then passed into a feed forward network, which is a set of layers. Each layer is essentially a mathematical formula that takes a vector (a set of numbers) and processes it in various ways using the model’s parameters in order to produce another vector, which is then passed to the next layer, etc. The vector flows through all of these layers until it reaches the bottom of the block.

In contrast, a Mixture of Experts model takes the output from the attention mechanism and then decides which experts it should route the vector to within its block. Each “expert” is a feed forward network, so the main difference between the model types is the fact that with a dense model, every vector flows through the remaining layers of the block, while with a MoE model, the routing mechanism sends the vector to one or more experts, ignoring the others.

The final step for an MoE model is to take the results of all of the expert networks and combine them into a single vector using a weighted sum. At this point this output vector is passed to the next block, and processing continues until we reach the final block, where the final output vector is handled just as the final vector is for a traditional model (as described above).

How the Router Works

The router (or gate network) can be thought of as a small self-contained model within each transformer block. It’s designed to take in a vector and produce another vector that indicates which experts should be engaged. The number of experts activated is different for each input vector, which means sometimes a single expert will be activated, while other times more than one will be activated (each with different weights for their importance for the topic.)

The routing model that controls which experts should be engaged for a particular vector is trained, just like every other part of the overall LLM. That means as the LLM model training happens, the router model (one per block) is adjusted along with all of the values in the block’s feed forward networks that were activated for a particular vector.

That means that initially, the router network is essentially completely random. It’s only after training that the router network develops preferences for certain vectors to be routed to specific experts (“expert” in this case being a set of feed forward layers.)

While the maximum number of experts is controlled by the people who designed the model, the decision about which vectors end up going to which experts is completely determined by the training process. And that process means that both the routing network and the feed forward networks belonging to the activated experts are adjusted during training, while the inactive expert networks are left untouched.

And speaking of training, we definitely gain performance improvements during both training and inference by using a MoE, since the embedding vectors only flow through the active experts for a particular vector. That means that many feed forward layers are unused during training and inference, which means less processing time, less electricity used, and much more efficient model operation. This is the secret to a MoE model.

A Word About Training LLMs

Although we’ve mentioned training a few times, we haven’t gone into depth about it. Let’s address that now.

As I mentioned earlier, during the training process many elements of the overall model are adjusted, from the feed forward layers to the router component to the attention mechanism in each block. All start out essentially random, and then are gradually adjusted over the course of the model training.

So how do you train an LLM? Well, unlike many other types of machine learning, you don’t need labelled data. That is, you don’t have to supply a correct answer for each item of input data, as you would for a classification model or something like that. For example, if you were training a model to identify images as either a dog or a cat, you’d typically have to provide hundreds of examples, each with the proper label.

With an LLM, the training method is completely different. Training an LLM uses text pulled from whatever sources were used, and the process is as simple as removing the last word of the text. Once that is done, the initial part of the text is used as input, and the output is then compared to the chopped-off last word.

Consider the example “Mary had a little lamb”, a line from a well-known nursery rhyme. If we pass in the string “Mary had a little” (i.e., without “lamb”) into a model, many possibilities could be returned for the next token, including “lamb”, “problem”, “dog”, etc. All of those answers are completely valid, but in this case we only want to match against “lamb”, since that’s the word our training text uses.

That means that if the model returns anything other than “lamb”, then the training process will determine that the model needs to be adjusted to make it more likely that it returns “lamb” for that input.

This is done via a process called back propagation, which essentially adjusts the parameters of a model from the bottom of the model up to the top, reducing the magnitude of the adjustments as it goes along. Each adjustment is small, but over the entire course of training a model, billions or trillions of adjustments are made, which is how the model learns.

But wait — since “Mary had a little problem” is a valid English sentence, why do we treat this as an incorrect answer? The answer is that over the entire set of training text, there will indeed be many examples that have “problem” as the next word to “Mary had a little”, and the training process will gradually adjust the model to handle that. Since there are many other options for the next word in that sequence, by the time training is complete, the weight (or prevalence) for one result over another will reflect the prevalence found in the training text itself.

Mixture of Experts Pros and Cons

Now that we’ve run through the difference between a traditional LLM and a MoE LLM, let’s summarize what an MoE model brings to the table.

MoE models have more parameters than dense models, but only some are used during training and inference
Since only some parameters are used, this makes training and inference faster and much more efficient, while still producing high quality results
MoE models increase the complexity of a transformer block by adding a router and multiple separate feed forward networks
The routing mechanism can be thought of as a separate model, although it’s also trained at the same time as the rest of the feed forward layers and attention mechanisms.
It is theorized that the separation of topics into distinct experts can improve the overall quality of the model, since each expert subnetwork is more focused on the topic than a general dense model.

Due to their efficiency, MoE models are becoming increasingly common. Although they have generally larger numbers of parameters (and thus require more storage), only some of those parameters are used at any given time, which saves time and money.

Although many companies that provide commercial models (like Google or OpenAI) typically do not reveal details of their model’s architecture or training methods, there are a number of well-known models that do use the MoE approach, including Grok-1, DeepSeek, Mixtral, and Qwen1.5-MoE. Additionally, it is believed that models like GPT-4, PaLM2 and Claude use an approach similar to MoE.

Conclusion and Additional Resources

In summary, we’ve explored how Mixture of Experts (MoE) models represent a significant advancement in large language model architecture, offering a solution to the challenges of ever-growing model sizes. By selectively activating subsets of their parameters based on the input query, MoE models enable faster training and inference while maintaining high-quality results.

This article demystified the core components of LLMs — tokenization, embeddings, attention mechanisms, and transformer blocks — and highlighted how MoE models diverge from traditional dense models through their routing mechanism and specialized “experts.” Although the architecture of a LLM is inherently complex, breaking it into smaller pieces makes it much more understandable.

The efficiency and potential for improved topic-focused understanding offered by MoE architectures suggest they will continue to be a crucial area of research and development. To see how researchers are already improving on this foundation, explore the concept of Mixture-of-Experts with Expert Choice Routing, a variation that focuses on improving the performance of the router component. This work highlights just one of the many exciting avenues for innovation in this field.

How Mixture-of-Experts LLMs Work was originally published in Google Cloud - Community on Medium, where people are continuing the conversation by highlighting and responding to this story.

Finding Groundwater Using Google Earth Engine and Gemini

Greg Sommerville — Wed, 16 Jul 2025 17:47:22 GMT

Image of satellite over the Earth generated by Gemini

It’s safe to say that Google Earth Engine (GEE) has changed the world for geospatial analysis. By providing access to a massive catalog of satellite imagery, it allows us to analyze our planet in ways that were previously unthinkable. Despite this, it’s often difficult to translate that raw data into a solution for a complex, real-world problem like locating viable groundwater sources.

This article presents a novel technique for using GEE’s infrared (IR) imagery to identify indicators of groundwater. By analyzing specific spectral patterns in the satellite data, we can significantly narrow down areas of interest, making groundwater exploration more targeted and efficient.

This article will demonstrate how to use Google Earth Engine to find indicators of groundwater. By the end, you’ll understand:

Why Near-Infrared (NIR) imagery is a key tool for this task.
How remote sensing indices like NDVI and NDMI work.
How to use Gemini to analyze the resulting images to pinpoint promising locations.
The high-level steps to implement this on Google Cloud.

Now that we understand the goal, let’s look at the remote sensing techniques we can use.

Groundwater Detection

To understand the solution, we first need to define some basic concepts. Groundwater is water held underground in soil or in rock crevices and cavities. It’s different from surface water, which includes things like lakes, rivers, and streams.

Why is groundwater important? For one, it’s a major source of drinking water. Beyond that, it’s used for agriculture and in various industries including manufacturing, mining, and energy production. Additionally, groundwater can help resupply low levels of surface water during times of drought.

The Importance of Efficient Groundwater Detection

So why is finding groundwater efficiently such a big deal? It comes down to a few key reasons. Traditional methods (often involving manual inspections) are slow, expensive, and hit-or-miss. A better approach using remote sensing offers clear advantages:

Addresses Water Scarcity: Quickly identifies new, sustainable water sources to support communities and agriculture, especially in drought-prone regions.
Lowers Costs and Reduces Risk: Saves significant time and money by pinpointing the most promising locations for drilling, reducing the need for costly and often unsuccessful exploratory work.
Minimizes Environmental Impact: Prevents unnecessary land disruption and protects existing ecosystems by making exploration targeted and precise.
Supports Infrastructure Repair: Has the potential to identify large slow-water leaks that may go unnoticed for long periods of time.

How Remote Sensing Finds Groundwater

The key to using remote sensing for this task lies in observing what we can see — vegetation and soil moisture — to infer what we can’t see underground. Healthy, well-hydrated vegetation in an otherwise arid area is often a strong indicator of a shallow groundwater source. To detect these patterns, we use specific wavelengths of light captured by satellites.

Satellite Images and Near Infrared Wavelengths

Near-Infrared (NIR) is a part of the electromagnetic spectrum that’s invisible to the human eye, with wavelengths slightly longer than visible light. It’s crucial in remote sensing because healthy vegetation strongly reflects NIR light while absorbing visible red light. This distinct reflection pattern allows scientists to assess vegetation health and density.

Google Earth Engine (GEE) provides extensive access to satellite imagery that includes both NIR and visible color bands. Satellites like Landsat and Sentinel-2 are key sources for this data. Within GEE, users can select specific satellite image collections and then choose the desired spectral bands (like Red, Green, Blue for visible color, and the NIR band) to analyze or visualize. This enables various applications, such as creating natural color images or false-color composites that highlight vegetation using the NIR band.

Here’s an example image that shows visible, NIR, and shortwave infrared (SWIR) images for the same person:

Using NDVI and NDMI to Find Groundwater

To turn raw satellite data into actionable insights, we use remote sensing indices — mathematical combinations of different spectral bands. For groundwater detection, two of the most effective are the Normalized Difference Vegetation Index (NDVI) and the Normalized Difference Moisture Index (NDMI).

Normalized Difference Vegetation Index (NDVI)

NDVI is a widely used indicator of healthy, green vegetation. It’s calculated based on the difference between near-infrared (NIR) and red light reflected by plants. Healthy vegetation absorbs most visible red light while reflecting a large portion of NIR light.

How it’s calculated: NDVI=(NIR−Red)/(NIR+Red)
Relevance to groundwater: Areas with higher groundwater availability often support more vigorous vegetation, especially in arid regions. Identifying pockets of high NDVI can help us infer subsurface water sources that are sustaining plant growth.

The same area shown in True Color and NDVI. Image created by author.

Normalized Difference Moisture Index (NDMI)

Also known as the Normalized Difference Water Index (NDWI), NDMI is used to assess the water content in vegetation. It utilizes the near-infrared (NIR) and shortwave-infrared (SWIR) bands, as water in plants absorbs SWIR light.

How it’s calculated: NDMI=(NIR−SWIR)/(NIR+SWIR)
Relevance to groundwater: Elevated NDMI values indicate well-hydrated plants, which could be a direct result of access to shallow groundwater. This is especially useful for detecting subtle moisture differences that aren’t visible to the naked eye.

The same area shown in True Color and NDMI. Image created by author.

How NDVI and NDMI help in groundwater exploration:

By combining the insights from both NDVI and NDMI, we can develop a more comprehensive understanding of potential groundwater locations:

Identifying Water-Stressed Areas: Low NDVI and NDMI could indicate areas where vegetation is stressed due to lack of surface or groundwater, thus directing exploration away from such regions.
Locating Phreatophytes: Certain plant species, known as phreatophytes, have roots that extend deep enough to reach the water table. These plants often exhibit high NDVI and NDMI values. Mapping clusters of these vigorous, well-hydrated plants can point to shallow groundwater reserves.
Detecting Anomalies: Unexpectedly high NDVI or NDMI in an otherwise arid landscape can signal a hidden groundwater source. These anomalies might be indicative of springs, seeps, or areas where the water table is close to the surface.
Monitoring Seasonal Changes: Analyzing how these indices change over different seasons can provide insights into the dynamics of groundwater. For example, if vegetation remains green and moist during dry seasons, it strongly suggests a reliable groundwater source.
Complementary Data: These indices are most effective when used in conjunction with other geospatial data, such as geological maps, topographic data, and soil moisture information, to provide a more robust assessment of groundwater potential.

Interpretation of Remote Sensing Images

As you can see above, both NDVI and NDMI produce images that are color-coded for the data they are displaying. We can look at a set of images of the same area over time to detect changes in groundwater indicators like these, but that’s a fairly manual process. Instead, the simplest approach is to provide the images to Gemini and ask it to examine them.

Using Gemini for the analysis of NDVI and NDMI images is an efficient approach to identifying groundwater indicators. Instead of manual visual inspection (which can be time-consuming and prone to human error), Gemini can be leveraged to process and interpret these remote sensing outputs.

By feeding the generated NDVI and NDMI images (or the underlying spectral data) into Gemini, the AI can be prompted to identify specific patterns, anomalies, and relationships indicative of groundwater presence. For instance, Gemini can be instructed to highlight areas with consistently high NDVI values in arid regions, especially during dry seasons, as this strongly suggests subsurface water sustaining the vegetation. Similarly, it can pinpoint regions with elevated NDMI values, indicating high moisture content in vegetation, which might be linked to shallow groundwater tables.

Additionally, Gemini can go beyond simple value thresholds by integrating temporal data. It can analyze sequences of NDVI and NDMI images collected over various seasons or years to detect subtle changes that reveal groundwater dynamics. For example, if an area shows a sustained high NDMI despite prolonged drought conditions, Gemini can identify this as a significant anomaly pointing to a resilient groundwater source. The AI’s ability to process vast amounts of imagery and identify complex, multi-variable correlations makes it an invaluable tool for groundwater exploration, allowing for more precise targeting of areas for further investigation and significantly reducing the time and resources traditionally required for such endeavors.

Implementation Workflow in GEE

At this point, we understand the theory. Now let’s walk through the high-level steps to implement this on Google Cloud. The basic process involves selecting a satellite data source, retrieving images for a specific location and time, and then generating our NDVI and NDMI indices.

1. Accessing Google Earth Engine

First, you’ll need to be able to make calls to the GEE platform. If you haven’t already, you can sign up on the GEE Developer page. From there, you can work in the web-based Code Editor or use the Python API in a Jupyter notebook environment, which is what we’ll be doing here.

2. Retrieving and Processing Imagery

The core of the workflow is to query an image collection, like COPERNICUS/S2_SR_HARMONIZED from the Sentinel-2 satellite, which contains the spectral bands we need. We filter this collection by our region of interest and a specific date range.

The following Python code demonstrates this entire process. It defines functions to:

Query GEE for relevant Sentinel-2 imagery for a given location and date.
Download the separate Red, Near-Infrared (NIR), and Shortwave-Infrared (SWIR) bands.
Calculate the NDVI and NDMI values from those bands.
Visualize the final indices as color-coded images ready for analysis.

import requests
import ee
import numpy as np
import datetime
import matplotlib.pyplot as plt
from PIL import Image

MY_PROJECT_ID = "something"
ee.Initialize(project=MY_PROJECT_ID)


def load_image(image_path):
    """
    Loads an image and converts it to a floating point numpy array.
    """
    img = Image.open(image_path).convert('RGB')
    img_array = np.array(img, dtype=np.float32) / 255.0  # Normalize to 0-1
    return img_array


def calculate_ndvi(nir, red):
    """
    Calculates the Normalized Difference Vegetation Index (NDVI).
    """
    numerator = nir - red
    denominator = nir + red
    ndvi = np.where(denominator != 0, numerator / denominator, 0)
    return ndvi


def calculate_ndmi(nir, swir):
    """
    Calculates the Normalized Difference Moisture Index (NDMI).
    """
    numerator = nir - swir
    denominator = nir + swir
    ndmi = np.where(denominator != 0, numerator / denominator, 0)
    return ndmi


def create_index_images(true_color_image, nir_image, swir_image, output_suffix, formatted_datetime):
    # Extract bands from the true color image
    red_band = true_color_image[:, :, 0]
    green_band = true_color_image[:, :, 1]
    # blue_band = true_color_image[:, :, 2]

    # Ensure NIR and SWIR are single-band images; if RGB, take one channel
    if nir_image.ndim == 3:
        nir_band = nir_image[:, :, 0]  # Take the first channel
    else:
        nir_band = nir_image  # assume already single band

    if swir_image.ndim == 3:
        swir_band = swir_image[:, :, 0]  # Take the first channel
    else:
        swir_band = swir_image  # assume already single band

    # Calculate indices
    ndvi_image = calculate_ndvi(nir_band, red_band)
    ndmi_image = calculate_ndmi(nir_band, swir_band)
    visualize_index(ndvi_image, f'NDVI: {formatted_datetime}', f'ndvi_{output_suffix}.png')
    visualize_index(ndmi_image, f'NDMI: {formatted_datetime}', f'ndmi_{output_suffix}.png')


def visualize_index(index_array, title, output_path, cmap='RdYlGn'):
    """
    Visualizes the index array using a colormap and saves the image.
    """
    plt.ioff()  # Turn interactive mode off
    plt.figure(figsize=(10, 8))
    plt.imshow(index_array, cmap=cmap)
    plt.colorbar(label=title)
    plt.title(title)
    plt.savefig(output_path)
    plt.close()


def get_satellite_imagery(longitude, latitude, start_datetime_str, end_datetime_str, buffer_distance):
    # Convert datetime string to ee.Date
    start_date = ee.Date(start_datetime_str)
    end_date = ee.Date(end_datetime_str)

    # Create point geometry. Note that "buffer_distance" is in meters, unless a specific projection is specified
    point = ee.Geometry.Point([longitude, latitude])
    region = point.buffer(buffer_distance)

    # Get Sentinel-2 collection
    s2_collection = (
        ee.ImageCollection('COPERNICUS/S2_SR_HARMONIZED')
        .filterBounds(region)
        .filterDate(start_date, end_date)
        .sort('system:time_start')
    )
    return s2_collection, region


def save_multiband_imagery(image, region, base_filename, pixel_width):
    """
    Save true color, NIR, and SWIR versions of the satellite imagery.

    Args:
        image (ee.Image): Earth Engine image object
        region (ee.Geometry): Region of interest
        base_filename (str): Base filename without extension
        image_scale (int): Image dimensions in pixels
    """
    # Define visualization parameters for different band combinations
    vis_params = {
        'true_color': {
            'bands': ['B4', 'B3', 'B2'],
            'min': 0,
            'max': 3000,
            'filename': f"true_color_{base_filename}.png",
            'scale': 10
        },
        'nir': {
            'bands': ['B8'],
            'min': 0,
            'max': 3000,
            'filename': f"nir_{base_filename}.png",
            'scale': 10
        },
        'swir': {
            'bands': ['B11'],
            'min': 0,
            'max': 3000,
            'filename': f"swir_{base_filename}.png",
            'scale': 20
        }
    }

    # Save each band combination
    fnames = []
    for band_type, params in vis_params.items():

        thumb_params = {
            'region': region,
            'format': 'png',
            'bands': params['bands'],
            'min': params['min'],
            'max': params['max'],
            'dimensions': pixel_width
        }

        url = image.getThumbURL(thumb_params)
        response = requests.get(url)
        if response.status_code == 200:
            with open(params['filename'], 'wb') as f:
                f.write(response.content)
            fnames.append(params['filename'])
        else:
            print(f"Failed to download {band_type} image. Status code: {response.status_code}")

    return fnames


def main():
    # define our date range and the location to examine
    start_datetime_str = '2023-12-25'
    end_datetime_str = '2024-01-15'
    latitude, longitude = 35.089248, -106.637810

    # center on the coord, within a box (buffer_distance is in meters, and talks about space around center point)
    collection, region = get_satellite_imagery(
        longitude, latitude,
        start_datetime_str,
        end_datetime_str,
        buffer_distance=500,  # this gives us 1 KM square bitmap
    )

    collection_info = collection.getInfo()
    num_images = len(collection_info['features'])
    print(f"Number of images in the collection: {num_images}")

    for index, feature in enumerate(collection_info['features']):
        image_id = feature['id']           # Get the image ID
        image = ee.Image(image_id)
        info = image.getInfo()

        # Convert milliseconds to seconds and create a datetime object
        image_datetime = info['properties']['GENERATION_TIME']
        datetime_object = datetime.datetime.fromtimestamp(image_datetime / 1000)
        formatted_datetime = datetime_object.strftime("%Y-%m-%d %H:%M:%S")

        truecolor_fname, nir_fname, swir_fname = save_multiband_imagery(image, region, f"{index}", pixel_width=100)

        truecolor = load_image(truecolor_fname)
        nir = load_image(nir_fname)
        swir = load_image(swir_fname)

        print(f'Creating index images for image {index}, taken on {formatted_datetime}')
        create_index_images(truecolor, nir, swir, f"{index:03d}", formatted_datetime)


if __name__ == "__main__":
    main()

Conclusion

This article has demonstrated the power of Google Earth Engine (GEE) and remote sensing indices in the important task of groundwater exploration. We’ve seen how leveraging GEE’s satellite imagery datasets, particularly those including Near-Infrared (NIR) and Shortwave Infrared (SWIR) bands, allows for the calculation of indices like NDVI and NDMI. These indices provide invaluable insights into vegetation health and moisture content, which serve as strong indicators of underlying groundwater reserves. By moving beyond traditional, labor-intensive methods, this approach offers a more efficient, cost-effective, and environmentally friendly way to pinpoint areas with high groundwater potential, ultimately contributing to better water resource management and addressing the challenges of water scarcity.

The integration of advanced AI tools like Gemini further amplifies the capabilities of this remote sensing methodology. Gemini’s ability to analyze complex patterns, anomalies, and temporal changes within NDVI and NDMI imagery transforms a manual interpretation process into an automated, data-driven assessment. This not only accelerates the identification of promising groundwater locations but also enables a more nuanced understanding of groundwater dynamics over time. By combining the rich data available through GEE with intelligent analytical platforms, we can significantly enhance our capacity to discover, monitor, and sustainably manage this indispensable natural resource for the benefit of communities and ecosystems worldwide.

Finding Groundwater Using Google Earth Engine and Gemini was originally published in Google Cloud - Community on Medium, where people are continuing the conversation by highlighting and responding to this story.

How to Implement Hybrid Search for RAG with BigQuery

Greg Sommerville — Tue, 18 Mar 2025 21:34:53 GMT

How to implement both keyword and vector search with BigQuery

Generated by Imagen 3

At this point, it seems like we’re all familiar with the idea of using RAG (Retrieval Augmented Generation) as part of a LLM-based chatbot.

With a RAG approach, we retrieve data from a datastore based on a user query, add that data to our prompt, and then pass that prompt to a model like Gemini. Since we supply (hopefully) relevant information in the prompt itself, the RAG approach drastically reduces the likelihood of hallucinations that can result in incorrect answers.

As great as that is, there are certain use cases where a typical RAG approach falls short. For example, creating a chatbot that can help users browse and search a product catalog is one use case where a traditional RAG approach won’t work well. Let’s dive in to understand why.

Shortcomings of Traditional RAG

A traditional RAG approach typically relies on using a vector (also known as semantic) search only, which searches based on similar meaning, rather than exact words.

Vector search works by converting a search query into a vector (a list of numbers) that encodes the meaning of the query. Mathematically, vectors with similar meanings should be numerically close to each other, so we can simply look for vectors that are numerically as close to the query vector as possible, which should (theoretically) give us results that are most similar in meaning to the search vector, and therefore most relevant.

That’s great as long as you want to search by meaning, but it falls short when trying to search for words like “Google” or “HP” or “Ricoh” or something like that. Those brand names don’t have meanings per se, so searching for something of similar meaning doesn’t generally work well. In this case, a vector search alone isn’t enough — we need a hybrid search that combines keyword and vector searches.

This article will demonstrate how to use Google Cloud BigQuery as a datastore for RAG. By the end of the article, you’ll understand:

When to use a hybrid search instead of a vector-only search
How to implement vector and keyword search using BigQuery
How to call Vertex AI to obtain embeddings for vector search
The main ideas involved in creating a chatbot that allows searching through a large product catalog (as an example)

Why BigQuery?

One of the choices that must be made when designing a RAG solution is choosing which datastore to use. For a catalog solution I recently created, I chose Google Cloud’s BigQuery (BQ) as the datastore for product information

Although BQ is a data warehouse and data warehouses are typically used for analytics, in this case you can think of it as a regular SQL database optimized for queries, which makes it very effective as a datastore for a RAG hybrid search.

As a matter of fact, BigQuery works well as a vector database, a keyword-based search database, or a hybrid combination of both (which is what we will discuss in this article.)

Here are a number of reasons for choosing BigQuery for RAG:

BigQuery is designed to handle petabyte-scale datasets. This is crucial for RAG applications that might need to index and search through vast amounts of text and vector embeddings (like a large product catalog.)
BigQuery includes built-in support for vector search. It also supports keyword search through standard SQL queries (using the LIKE keyword and wildcard expressions.)
BigQuery’s columnar storage and distributed query processing enable extremely fast query execution, even on large datasets. This translates to low latency in retrieving relevant information for your RAG model.
BigQuery has excellent integration with other Google Cloud Services, like Cloud Storage, Vertex AI, and Cloud Functions. This simplifies the development and deployment of your RAG application.
BigQuery provides robust data governance and security features, ensuring that your data is protected and managed effectively.
BigQuery is ultimately a data warehouse, which means if your RAG application does require data analysis and reporting, BigQuery’s capabilities can be leveraged to gain valuable insights.
BigQuery uses a Serverless Architecture. BigQuery’s serverless nature eliminates the need for infrastructure management, allowing you to focus on building your RAG application.

For those reasons, BigQuery was an easy choice for the RAG datastore.

Storing Products in BigQuery

A first step for any RAG solution is to collect and index data. Since we’re creating a catalog chatbot, in this case we need to store information about the available products and services. Here’s what we need to include:

Product Name
Product Description
Product Category
Filename that the product information came from
Keywords for the product
Embeddings for the product (calculated from extracted information)

A few notes about these fields:

The product name and description are simple text descriptions of a product.
The product category is important since we can filter product searches based on our current category. Each product should have a single product category associated with it.
The filename of the source PDF is important since it allows us to remove products in the future. This would be done by using a Cloud Run Function to detect the removal of a file from a storage bucket, which would then remove all of the associated products and services. This allows our users to change and update the list of products simply by adding or removing files from a storage bucket.
Keywords are quite important when dealing with products and services. This is the field that will allow us to do keyword matching, as with brand names.
Finally, embeddings are calculated by combining the product name and description into a string, and then create an embedding for that string. This allows our users to search based on the meaning of the words in the product description.

As we mentioned earlier, we can use BigQuery to store all product information. Here’s a schema for what we’re going to store:

Generating Embeddings

There are a variety of ways to create embeddings, including using the built-in BigQuery function GENERATE_EMBEDDING(). However, in this case we’ll call Vertex AI to generate them, using some Python code.

That code would normally be part of a larger overall RAG solution, which handles both ingesting and storing product information into BigQuery, and also the actual chatbot operation when you retrieve data from BigQuery in order to respond to the user.

Note that there are two functions to generate embeddings: one for embedding the text that describes a product, and another for generating embeddings to be used during retrievals.

from vertexai.preview.language_models import TextEmbeddingModel, TextEmbeddingInput

EMBEDDING_MODEL_ID = "text-embedding-005"

def get_embeddings_for_storage(title: str, text: str) -> List[float]:
    model = TextEmbeddingModel.from_pretrained(EMBEDDING_MODEL_ID)
    text_embedding_input = TextEmbeddingInput(
        task_type='RETRIEVAL_DOCUMENT',
        title=title,
        text=text)
     embeddings = model.get_embeddings([text_embedding_input])
     return embeddings[0].values

def get_embeddings_for_retrieval(text: str) -> List[float]:
     model = TextEmbeddingModel.from_pretrained(EMBEDDING_MODEL_ID)
     text_embedding_input = TextEmbeddingInput(
        task_type='RETRIEVAL_QUERY',
        text=text
     )
     embeddings = model.get_embeddings([text_embedding_input])
     return embeddings[0].values

It’s important to notice that there are differences in the value of the task_type parameter, which indicates why we are generating the embeddings. RETRIEVAL_DOCUMENT is for the initial embedding operation, and the code takes both a title and body text. RETRIEVAL_QUERY, on the other hand, is used when getting embeddings for retrieval.

Product Categories

One of the challenges that RAG systems have is ensuring that they retrieve the right information. Even when you use embeddings (and potentially keywords), you can end up retrieving too much information, or the wrong information. This is especially true when dealing with very large amounts of data. Imagine that our catalog contains hundreds of thousands of products — we want to ensure that we are retrieving only the most relevant.

By assigning a category to each product, and then also understanding which product category the user is asking about, we can filter the product data to be highly relevant.

In this example, we’re putting together a chatbot for IT products and services, so we’ll start with the following categories:

Hardware / Networking
Hardware / Servers
Hardware / Storage
Hardware / Printers
Hardware / Telecommunications
Hardware / Desktop computers
Hardware / Laptops and Tablets
Hardware / Computer Accessories
Software / Virtualization and Operating Systems
Software / Management Software
Software / Database Systems
Software / Telecommunication Systems
Software / Productivity Software
Software / Security Software
Software / Other Software
Services / Cloud Computing
Services / Network Services
Services / End-User Support
Services / Staffing
Services / Telecommunication

Each category is simply a string. Although this example uses two levels in its taxonomy, you can create your own list of categories in any way you wish, with any number of layers. Use whatever structure makes the most sense for your use case. Just make sure that categories don’t overlap, or else the model will have a hard time determining which category is the current one.

Understanding the Current Category

During the conversation with the user, we can have Gemini look at the list of categories and the conversation history in order to determine which category is the current topic of conversation. We do this using the following prompt:

You are an expert in the field of IT provisioning and supplies.

Each product or service that can be found in the catalog has a category. 
Those categories are listed below.

**Categories:**
{categories}

Look at the following conversation between a potential buyer and an AI guide, 
and pay close attention to the latest part of the conversation. 
You should return the category of the product or service that the buyer is 
looking for.

**Conversation:**
{conversation}

It's possible that the buyer may be asking questions that are very generic 
and not directly related to a particular category. 
For example, if they ask about "Software",but don't specify what kind of 
software they are looking for, you should return "Software".

However, if they are asking about a specific category, you should return 
the category. For example, if they are asking about hardware servers, 
you should return "Hardware / Servers".

If you can't figure out what category the buyer is looking for, 
you should return "Unknown".

Notice how we check if the conversation is still somewhat generic (meaning the user hasn’t been specific enough about what they are looking for), we get a category of “Unknown”, which we can interpret as a trigger to ask Gemini to provide general information about the categories and ask the user for more detail.

However, once we know which category the user is interested in, we can ask Gemini to extract likely search keywords based on their query. Along with that, we can also ask Gemini to give us a short string that describes what the user is searching for. We take that string and turn it into a vector embedding.

Then, using the current category, the current product keywords, and the embedding for the user query, we can perform our hybrid search.

Retrieving Products From BigQuery

At this point, we have the user query, its corresponding embeddings (calculated separately), a list of keywords that should help with searching, and a current category name. Here’s how we do the retrieval of the relevant products:

def get_products(current_category: str, 
                 keywords: str, 
                 embedding: List[float]) -> str:

     # using a combination of keywords and the embeddings, search
     # through the relevant category and return product information as a
     # string that can be included in the prompt
     # set up our keyword query so it's like:
     # SELECT…WHERE LOWER(keywords) LIKE ANY ('%hp%', '%color%')
     # keywords come in looking like: '"HP", "color printer", "etc"'
     keywords = [f"'%{k.lower().strip()}%'" for k in keywords.split(',')]
     keyword_match_string = ", ".join(keywords)

     if keyword_match_string:
           QUERY = (
              'SELECT product_name, product_description '
              'FROM `dataset.products` '
              f"WHERE category = '{current_category}' AND "
              "LOWER(keywords) LIKE ANY ({keyword_match_string}) "
              'LIMIT 10')
           query_job = bqclient.query(QUERY)
           keyword_rows = query_job.result()
     else:
           keyword_rows = []

     # notice the different SELECT when dealing with a vector search.
     # Also, need to select "base.product_name" (etc.)
     # because that's how it's returned (there are other vector-related 
     # fields also returned, which we ignore)
     QUERY = (
           'SELECT base.product_name, base.product_description '
           "FROM VECTOR_SEARCH("
           f"(SELECT * from dataset.products where category = '{current_category}'), "
           "'embeddings', "
           f'(SELECT {embedding} as embed), '
           'top_k => 10, '
           "distance_type => 'COSINE');"
     )
     query_job = bqclient.query(QUERY)
     vector_rows = query_job.result()
  
     # now combine the results from the keyword search and the vector search
     product_rows = list(vector_rows) + list(keyword_rows)
     if len(product_rows) == 0:
         return "No products found"

     final_prods = [f"**Product: {prod.product_name}**: {prod.product_description}" for prod in product_rows]
     return "\n".join(final_prods)

The product retrieval is done in two steps: retrieving products by keyword, then by embeddings. Once both sets are retrieved, we combine the two and create a string that lists out the products.

Note that we pass in the keywords to this function as a single string separated by a comma. This is how we asked Gemini to return it, so we need to modify it a bit to make it work with our SQL SELECT statement.

First, we split and clean each keyword by turning it into lower case and stripping surrounding blanks. Then we surround each keyword with percentage symbols, since that signifies a wildcard match. Finally, we combine everything into a single SQL statement that may look something like this:

SELECT product_name, product_description
FROM `dataset.products`
WHERE category = 'Hardware / Printers'
AND LOWER(keywords) LIKE ANY ('%hp%', '%ricoh%')
LIMIT 10

Notice how we first select only products that match our current category, which drastically reduces the possible matches and makes the selection more accurate. Then we use the LIKE ANY condition to match the keywords against a list of wildcard patterns, and limit our results to the first ten rows returned.

The second query uses BigQuery’s built-in vector search. Here’s what the vector selection statement looks like:

SELECT base.product_name, base.product_description
FROM VECTOR_SEARCH(
  (SELECT * from dataset.products where category = 'Hardware / Printers'),
  'embeddings',
  (SELECT {embedding} as embed),
  top_k => 10,
  distance_type => 'COSINE'
)

You can see that the structure of the SQL SELECT statement is quite different from a standard SQL query. In this case, we use the VECTOR_SEARCH() function, which takes a subquery (for the data to search), the field name of the embeddings column (“embeddings”, in our case), the embeddings to match on, and parameters like top_k and distance_type (which is how embedding vectors are compared to each other.)

Finally, notice that we select “base.product_name” and “base.product_description” rather than just “product_name” and “product_description”. This is because the vector search returns a number of fields, and the “base.” prefix refers to the fields from the products table, rather than other data that is related to the vector search.

One last thing to talk about is using an index for performance reasons. Although indexing is typically used to improve retrieval speed, it’s mostly automatic in BigQuery for traditional queries. However, BigQuery does support indexes for vector search, and this is recommended once your set of potential matches (i.e., the number of products in the table) gets very large. See this web page for more information: https://cloud.google.com/bigquery/docs/vector-index

Conclusion

When you need a RAG data source (hybrid or otherwise), BigQuery is a very effective solution. With its built-in vector search capabilities, practically unlimited scalability, and fast querying capabilities, it can be a powerful tool when building a RAG LLM solution.

Although keyword search and vector search are two distinct approaches, BigQuery can support both as part of a hybrid solution by using the techniques described in this article.

For more information about using vector search with BigQuery, please see this web page.

How to Implement Hybrid Search for RAG with BigQuery was originally published in Google Cloud - Community on Medium, where people are continuing the conversation by highlighting and responding to this story.

Unlocking PDFs for RAG with Markdown and Gemini

Greg Sommerville — Wed, 18 Dec 2024 16:56:13 GMT

Unlock PDFs for RAG with Markdown and Gemini

Created by Imagen 3

It’s safe to say that Retrieval Augmented Generation (RAG) has changed the world for many businesses and organizations. By supplementing the built-in capabilities that an LLM like Gemini has with your own information, you can create extremely powerful experiences that are truly transformative.

Despite this, it’s often difficult to create a RAG application that works well with complex unstructured documents like PDFs.

This article presents a novel technique for extracting text from PDFs in Markdown format, leading to improved accuracy and richer context in Retrieval Augmented Generation (RAG) applications.

Markdown isn’t just for output. Using Markdown in your prompts can dramatically improve the quality of the model’s responses due to the added nuance it provides compared to plain text.

The problem with PDFs

PDFs are notoriously difficult to work with. Each document can have a wide variety of layouts, including multiple columns of text, or even text that seems to be randomly distributed on a page. Since PDFs support not only text but also images, some pages may look like text but are actually represented as images. Additionally, PDFs often contain tabular data which can be quite challenging to parse. Finally, it’s quite difficult to extract text from a PDF that also retains formatting information like bold, italics, and bullet points. By extracting only the text, you lose meaning and nuance that were in the original document.

Each of these situations makes it difficult to use PDFs in a RAG application. There are of course a number of Python libraries available that are designed to work with PDF documents like PyPDF, PDFPlumber, or PDFMiner, but almost none of them handle all of the complex situations described above. Depending on the source document, all of these libraries can produce text that’s incomplete or even completely incorrect.

Recently some new approaches have been introduced that use ML models (like Docling) to parse PDFs, but they can be extraordinarily slow, and aren’t usable for PDFs beyond just a few pages. (In one test I recently ran on my laptop, it took Docling 18 minutes to parse a 12 page document.)

This blog post describes a new technique to read in PDF files and quickly and efficiently generate accurate corresponding Markdown using Gemini and Google Cloud. The resulting Markdown is well-suited for indexing into a RAG datastore.

A word about Markdown

Markdown is a simple and compact markup language. Markdown employs a simpler syntax than HTML and CSS, focusing on a limited set of stylistic elements: headings, bold text, italic text, hyperlinks, bullet points, and simple tables.

Most LLMs such as Gemini create output that uses Markdown, and the styling that it provides is extremely helpful in reader comprehension. Having actual bullet points is vastly superior to a plain-text alternative like using a hyphen at the start of a line, and bold and italicized text can make important information really stand out. Beyond that, Markdown’s ability to organize information into a table can be quite helpful.

Perhaps less intuitively, Markdown is also extremely useful when creating prompts. By selectively highlighting key phrases in your prompt, or organizing information into bulleted lists, we provide the model with more information than just the text content, which improves the model’s understanding and helps it focus on the task at hand.

Even so, it’s important to remember that Markdown is a simple language, and may not support everything you can store in a PDF. For example, Markdown tables do not support spanned rows or columns, which are often found in table headers. That’s important to keep in mind as you test this new approach, since it will affect the accuracy of your extraction for certain PDF files.

Regardless of these limitations, having the ability to extract a PDF’s content as Markdown can be extremely helpful when working on a RAG application. During the chunking and indexing process, you can use the headers to understand sections and subsections, which allows chunking documents into discrete topics. Similarly, tabular data arranged in a Markdown table can help the model understand the content much more easily than using plain text.

To sum up, it’s clear that using Markdown extracted from a PDF can dramatically improve the quality of Gemini’s responses due to the added nuance it provides compared to plain text. Beyond that, it also helps in terms of chunking documents during the RAG ingestion process, since you can use cues like headers to detect logical sections within the document.

Now that we understand how Markdown can help, let’s look at the process to extract it from PDF documents.

How To Extract Markdown From a PDF

In simple terms, here’s the process for extracting Markdown from a PDF document:

For each page in the PDF:
- Create an image of the page
- Pass that image to Gemini, with a prompt asking it to extract the content of the page as Markdown
Once all of the individual pages have been processed, combine the markdown from all of the pages into a single Markdown string.

This approach works quite well. Here’s an example, using a page from the instructions for the state of Illinois tax form 1040. Notice that the page is split into two columns, and the top half of the page is completely separate from the bottom half:

A page from IL Form 1040, showing multiple columns and sections

And here’s the corresponding Markdown generated by Gemini, rendered so you can see the use of bullet points, headers, and the like:

Markdown extracted from the IL 1040 page

As you can see, the quality of the extracted markdown is very good, as it generally reflects how a human being would read the page. Notice that “Step 2” (the top half of the page) is described fully before “Step 3” (the bottom half.)

Additionally, markdown is produced that designates bullet point lists, bolded text, headings, and more. All of this adds meaning to the raw text that is extracted, which will typically produce better results when passing this markdown to Gemini. And, as stated earlier, having headings and subheadings helps us chunk a document into logical groupings, which will help with the RAG retrieval process.

Implementation Details

Depending on your use case, you could simply loop through each page within a PDF, extract a page image, and then pass it to Gemini in order to obtain the markdown. However, when approaching this problem, it’s good to think about scaling.

On my laptop, extracting an image for the above example page took 0.140 seconds, so that part of the algorithm is extremely quick. However, calling Gemini 1.5 Flash to extract the Markdown took 23.857 seconds, which can quickly add up for longer PDF documents.

Luckily, this problem fits very well with a map-reduce approach. This approach first splits work into multiple parts, each of which runs in parallel. That part is called the map step. Then, when all of the parallel parts are complete, the results are combined or aggregated, which is called the reduce step.

In our case, we can process each page separately and then combine the markdown for all of the pages once all pages are processed. By leveraging Google Cloud, we can distribute the work using a PubSub topic, and process each page using a Cloud Run Function. Here’s a diagram that illustrates this approach:

Architecture Diagram showing the PDF processing approach

Reading from left to right, these steps are taken:

When a PDF file is placed in a Google Cloud Storage bucket, it causes a Cloud Function to be run.
That function copies the PDF from the bucket to the function’s local storage, then opens it simply to determine how many pages it contains. Then, for each page, the function writes a small JSON item to the PubSub topic, which contains the name of the PDF, the page number to process (from 0 to N — 1, where there are N pages), and the total number of pages found in the PDF.
The Page Handler cloud function is triggered when a new item shows up in the PubSub topic. Note that several invocations of this function can be run at the same time through the parallel processing facilitated by Cloud Run. You can specify the maximum concurrency when configuring the function.
The function copies the PDF from the bucket to the function’s local storage, opens the PDF, renders an image for the page in question (that is, the page number in the JSON data retrieved from the topic), and then calls Gemini via the Vertex AI API to get the Markdown.
Once the page Markdown is obtained from Gemini, it is stored in a BigQuery table, which has fields for the file name, the page number, and the extracted markdown string.

These steps extract markdown for each page (the map part of map-reduce), but we still need to address the reduce step where all of the individual page markdown is combined into a single string.

In this case, the simplest approach is to have the page handler function check if it is the last page in the document. By counting the number of pages in the BigQuery table for the given document, we can determine if all processing is complete (which is why we passed the total number of pages in as part of the data on the PubSub topic.)

In short, after the page handler function finishes processing the page, it counts the number of completed pages from the BigQuery table for the document in question, and if it matches the total number of pages, then all of the individual page markdown strings are retrieved (ordered by page number) and combined into a single string. At that point we can store the document Markdown in a file, or perform more processing (such as using the extracted Markdown as part of another prompt sent to Gemini) if desired.

Implementation Code

First, let’s look at the code for the PDF file handler — the function that is invoked when a PDF file is placed in a bucket. We use the PDF library PyPdfium to count the number of pages.

from google.cloud import storage, pubsub_v1
import os
from typing import Callable
from concurrent import futures
import pypdfium2 as pdfium
import json

# project ID
project_id = os.getenv("PROJECTID")
# the pubsub topic we're writing to
pubsub_topicname = os.getenv("TOPICNAME")
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(project_id, pubsub_topicname)


def handle_new_file(event, context):
    # copy file from cloud storage into local storage
    bucketname = event['bucket']
    filename = event['name']
    if filename.lower().endswith('.pdf') is False:
        print(f"File {filename} is not a PDF file, skipping")
        return
    localname = '/tmp/test.pdf'
    download_to_local(bucketname, filename, localname)

    # Determine how many pages there are
    num_pages = len(pdfium.PdfDocument(localname))

    # For each page, post a message
    publish_futures = []
    for page_num in range(num_pages):
        # Create a JSON object with the file name, page number to process, and total number of pages
        data = json.dumps({"filename": filename, "pagenum": page_num, "totalpages": num_pages}).encode('utf-8')

        # Non-blocking. Publish failures are handled in the callback function.
        future = publisher.publish(topic_path, data)
        future.add_done_callback(get_callback(future, data))
        publish_futures.append(future)

    # Wait for all the publish futures to resolve before exiting.
    futures.wait(publish_futures, return_when=futures.ALL_COMPLETED)

    # then delete the local file and exit
    os.remove(localname)


def download_to_local(bucketname, filename, localname):
    bucket = storage_client.bucket(bucketname)
    blob = bucket.blob(filename)
    blob.download_to_filename(localname)


def get_callback(publish_future: pubsub_v1.publisher.futures.Future, data: str) -> Callable[[pubsub_v1.publisher.futures.Future], None]:
    def callback(publish_future: pubsub_v1.publisher.futures.Future) -> None:
        try:
            # Wait 60 seconds for the publish call to succeed.
            publish_future.result(timeout=60)
        except futures.TimeoutError:
            print(f"Publishing {data} timed out.")
    return callback

Now let’s look at the function that processes an individual page.

import base64
from google.cloud import storage
import os
import json
from read_pdf import get_markdown_for_page
from bigquery import save_page_info, get_num_pages_for_filename, get_markdown_for_filename


BUCKET = os.getenv("BUCKET")
storage_client = storage.Client()


def handle_pubsub_message(event, context):
    # Decode the message data
    message_bytes = base64.b64decode(event['data'])
    message_str = message_bytes.decode('utf-8')
    message_json = json.loads(message_str)

    # Get information about the page we should process
    filename = message_json.get("filename")
    pagenum = message_json.get("pagenum")
    totalpages = message_json.get("totalpages")

    # retrieve the file, extract the page in question, convert it to an image,
    # and use Gemini to get the markdown for it
    download_to_local(BUCKET, filename, "temp.pdf")
    markdown = get_markdown_for_page("temp.pdf", pagenum)
    save_page_info(filename, pagenum, markdown)

    # now check if all of the pages have been processed
    num_pages_for_filename = get_num_pages_for_filename(filename)
    if num_pages_for_filename == totalpages:
        # retrieve the markdown for all pages, combine, and then store as a file
        # in the future, we will now pass this string to Gemini to get the product info
        all_markdown = get_markdown_for_filename(filename)
        save_text_to_bucket(BUCKET, f'markdown\{filename}.md', all_markdown)


def download_to_local(bucketname, filename, localname):
    bucket = storage_client.bucket(bucketname)
    blob = bucket.blob(filename)
    blob.download_to_filename(localname)


def save_text_to_bucket(bucketname, filename, text):
    bucket = storage_client.bucket(bucketname)
    blob = bucket.blob(filename)
    blob.upload_from_string(text)

As you can see, this function calls a couple of additional modules. First, here’s the read_pdf.py module for extracting the image and then calling Gemini for the markdown:

import vertexai
from vertexai.generative_models import (
    Part,
    Image,
    GenerativeModel,
    HarmBlockThreshold,
    HarmCategory,
)
import pypdfium2 as pdfium
import os


PROJECT_ID = os.getenv("PROJECTID")
REGION = os.getenv("REGION")
LOCAL_IMAGE_FILE = "/tmp/page.png"
vertexai.init(project=PROJECT_ID, location=REGION)
model = GenerativeModel("gemini-1.5-flash-002")


def get_markdown_for_page(fname, pagenum):
    imgname = get_image_for_page(fname, pagenum)
    markdown = call_gemini_for_markdown(imgname)
    return markdown


def get_image_for_page(fname, pagenum):
    doc = pdfium.PdfDocument(fname)
    page = doc.get_page(pagenum)
    bitmap = page.render(scale=2)    # 72dpi resolution x 2
    bitmap = bitmap.to_pil()
    bitmap.save(LOCAL_IMAGE_FILE)
    return LOCAL_IMAGE_FILE


def call_gemini_for_markdown(img_filename):
    image1 = Part.from_image(Image.load_from_file(img_filename))
    generation_config = {
        "max_output_tokens": 8192,
        "temperature": 1,
        "top_p": 0.95,
    }

    safety_settings = {
        HarmCategory.HARM_CATEGORY_HARASSMENT: HarmBlockThreshold.BLOCK_NONE,
        HarmCategory.HARM_CATEGORY_HATE_SPEECH: HarmBlockThreshold.BLOCK_NONE,
        HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT: HarmBlockThreshold.BLOCK_NONE,
        HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT: HarmBlockThreshold.BLOCK_NONE,
    }

    responses = model.generate_content(
        [image1, "Examine the image and return all of the text within it, converted to Markdown. Make sure the text reflects how a human being would read this, following columns and understanding formatting. Ignore footnotes and page numbers - they should not be returned as part of the Markdown. Only generate markdown for the text found on the page."],
        generation_config=generation_config,
        safety_settings=safety_settings,
        stream=True,
    )

    response_text = []
    for response in responses:
        response_text.append(response.text)
    return "".join(response_text)

As you can see, the prompt we use to extract the Markdown is the following:

Examine the image and return all of the text within it, converted to 
Markdown. Make sure the text reflects how a human being would read this, 
following columns and understanding formatting. Ignore footnotes and 
page numbers - they should not be returned as part of the Markdown. 
Only generate markdown for the text found on the page.

Finally, there are a couple of functions we use when interacting with BigQuery, which are located in the bigquery.py module:

from google.cloud import logging, bigquery
import os
import time


BQ_DATASET = os.getenv("BQ_DATASET")
BQ_TABLE = "pdf2markdown"
bq_client = bigquery.Client()
logging_client = logging.Client()
log_name = "debug-log"
logger = logging_client.logger(log_name)


def save_page_info(filename, pagenum, markdown):
    table_id = f'{BQ_DATASET}.{BQ_TABLE}'
    table_ref = bq_client.dataset(BQ_DATASET).table(BQ_TABLE)

    # Insert the extracted fields as a new row
    try:
        errors = bq_client.insert_rows_json(
            table_ref,
            [{
                "filename": filename,
                "pagenum": pagenum,
                "markdown": markdown
            }])

        if errors == []:
            logger.log_text("Data inserted into table")
        else:
            logger.log_text(f"Errors encountered while inserting data: {errors}", severity="ERROR")
    except Exception as e:
        logger.log_text(f"Error inserting data into BQ: {e}", severity="ERROR")


def get_num_pages_for_filename(filename):
    query = f"SELECT COUNT(*) as numpages FROM `{BQ_DATASET}.{BQ_TABLE}` WHERE filename = '{filename}'"
    query_job = bq_client.query(query)
    results = list(query_job.result())
    count = results[0].numpages
    return count


def get_markdown_for_filename(filename):
    query = f"SELECT markdown FROM `{BQ_DATASET}.{BQ_TABLE}` WHERE filename = '{filename}' ORDER BY pagenum"
    query_job = bq_client.query(query)
    results = list(query_job.result())
    # combine into one string
    parts = [row.markdown for row in results]
    return "\n".join(parts)

Note that this code assumes that the BigQuery table pdf2markdown has already been created. Although you can create the table via code if it doesn’t exist, there is often a slight delay before you can insert data into that table, which can result in errors. Best practice is to create the empty table outside of your code first by using Terraform or some other Infrastructure As Code (IAC) approach.

Conclusion

This article talks about the challenges that come with working with PDF documents, specifically for a RAG application. Since PDF files were designed primarily to support almost any imaginable layout, they are very often quite difficult to work with when attempting to extract the text and related contextual information like headings, tables, etc.

Markdown, on the other hand, is very well-suited for use with a LLM like Gemini, both in terms of adding readability and context to the output, but also for constructing prompts, and when chunking and indexing documents for a RAG solution. The challenge is to extract content from a PDF in Markdown format.

By turning each page of a PDF into an image, and then asking Gemini to extract the page content as Markdown, we can quickly and easily extract both the text and the context of the text from the document. And by leveraging the power of Google Cloud, we can make that process extremely efficient by processing many pages in parallel, only to combine the results once all pages have been processed.

Finally, another option to explore is Google Cloud’s DocumentAI, which uses Google Foundation models to parse and chunk documents. It also has built-in OCR support, which allows parsing of image-based pages. You may wish to compare that approach with the approach described here, in order to determine the best approach for your documents. Keep in mind that DocumentAI does not return Markdown, so you should take that into account when deciding which approach to take.

Unlocking PDFs for RAG with Markdown and Gemini was originally published in Google Cloud - Community on Medium, where people are continuing the conversation by highlighting and responding to this story.

Extract JSON Data from Text using Gemini

Greg Sommerville — Mon, 04 Nov 2024 16:43:45 GMT

AI generated image

It’s no exaggeration to say that the introduction of Generative AI models has changed the landscape of what can be done with text. Large Language Models (LLMs) like Google’s Gemini provide a host of capabilities that can be used in diverse applications, from analyzing customer feedback to extracting and summarizing insights from research papers, to automating data entry from invoices, and much more.

Although we often think of models in terms of generating text or code or images, they also excel at extracting information from text, offering unparalleled accuracy and efficiency compared to traditional methods. This article demonstrates how to use Gemini to extract information from unstructured text and then package that information into a JSON object.

The Use Case

We will explore a real-world scenario where precise data extraction is critical: analyzing reports of drug overdose deaths. These reports often contain a wealth of information, including details about the deceased, the circumstances surrounding the overdose, and potential contributing factors. Our goal is to extract over 200 specific data points from these reports, such as the presence of witnesses, history of substance abuse, mental health indicators, and more.

By automating this process with Gemini, we can efficiently transform these unstructured narratives into structured data, enabling researchers and public health officials to identify trends, patterns, and potential risk factors with greater accuracy. This structured data, formatted as a JSON object, can then be easily integrated into databases and analysis tools, ultimately contributing to more effective prevention and intervention strategies.

Now that we understand the overall situation, let’s look at an example. Here’s an example narrative:

THE VICTIM IS A 42 YEAR OLD WHITE FEMALE, NOT HISPANIC, WHO DIED AT HOME. 
CAUSE OF DEATH IS ACUTE COMBINED DRUG INTOXICATION INCLUDING HEROIN 
AND METHAMPHETAMINE. THE MANNER OF DEATH IS ACCIDENT. THE V WAS LAST 
SEEN ALIVE BY HER BOYFRIEND LAST NIGHT.  THE BOYFRIEND FOUND HER 
UNRESPONSIVE THIS MORNING AND CALLED 911.  EMS ARRIVED AND PRONOUNCED 
DEATH. DRUG PARAPHERNALIA WAS FOUND NEAR THE BODY. THE V HAD A HISTORY 
OF DRUG ABUSE. TOXICOLOGY IS POSITIVE FOR AMPHETAMINE, METHAMPHETAMINE, 
CODEINE FREE, MORPHINE FREE AND 6 MONOACETYLMORPHINE.

The following is a short subset of some of the fields extracted from the narrative. (Note that BystanderPartner is true because the narrative mentions that the victim was found by their boyfriend, while the other values default to false.)

{
  "BystanderBreathing": false,
  "BystanderCPR": false,
  "BystanderFamily": false,
  "BystanderFriend": false,
  "BystanderIntOther": false,
  "BystanderIntOther_specify": "",
  "BystanderMedical": false,
  "BystanderNoOD": false,
  "BystanderNotRecognize": false,
  "BystanderOther": false,
  "BystanderOther_specify": "",
  "BystanderPartner": true,
  "BystanderPublic": false,
  "CME_AlcoholProblem": false,
  "CME_CircumstancesKnown": false,
  "CME_CircumstancesOtherTex": "",
  // etc - more fields included
}

The Approach

To successfully extract and structure this information from overdose reports, we must first craft a precise and effective prompt that guides Gemini towards identifying and extracting the 200+ specific data points. These fields encompass a wide range of data types, from booleans and integers to dates and free-form text, demanding a prompt that can handle this diversity. This involves carefully defining the target information for each field, providing clear instructions, and potentially incorporating examples to illustrate the desired output format.

Equally important is a robust mechanism to extract the data from Gemini’s response and validate its structure and content, ensuring each field adheres to its expected data type. Although there are a number of Python libraries that are designed to validate data (like Pydantic), in this case we’ll demonstrate how to validate the data without using an external library.

With a clear understanding of our objectives, let’s dive into the crucial first step: prompt engineering.

Prompt Engineering

The prompt for this task will have several different sections. To start with, we need to set the persona and specify the overall task. We use the following text to accomplish this:

You are a highly skilled information extraction AI specializing in 
medical narratives. Your task is to do the following two steps:

Step 1. Extract fields from the narrative and briefly explain your thinking. 
Keep your explanation concise.

Step 2. Generate a JSON object that contains the extracted fields

Please analyze the following text and extract the requested fields into 
a JSON object.

If a field is of type "option", you MUST use one of the listed choices. 
Your answer should be the number associated with the choice. 
Example: if "1: No Pulse detected" is a choice, your answer should be "1".

If a field is of type "boolean", you MUST use either true or false.

If a field is of type "integer", you MUST answer with a positive integer 
constant.

If a field is of type "string", you MUST answer with a string constant.

This initial portion of the prompt sets the stage for accurate and structured data extraction by establishing Gemini as a ‘highly skilled information extraction AI specializing in medical narratives.’ This focuses its attention on the relevant domain and expertise, priming it to interpret the text effectively.

The two-step task definition further enhances clarity. Step 1 encourages transparency by having Gemini explain its reasoning, which is invaluable for debugging, and also for providing transparency for public health officials on how Gemini made its determination of the extracted value. Step 2 ensures a structured, machine-readable JSON output, facilitating easy integration with other systems and analysis tools.

Additionally, clearly specifying the expected data types for different fields (“option”, “boolean”, “integer”, “string”) is crucial for maintaining data integrity and consistency. This reduces ambiguity and guides Gemini towards producing output that adheres to your predefined schema. Notice that we include a concrete example of how to handle “option” type fields that helps eliminate any confusion and ensures the model outputs the desired numerical representation.

Now that the overall task and persona have been established, we include the text of the narrative in the prompt, as well as descriptions of each field. Gemini consumes and produces markdown text, so we’ll use that to distinguish these important elements:

**Narrative:**
{narrative}

**Fields to extract:**
{field_definitions}

For those unfamiliar with markdown, text enclosed in double asterisks (**) renders as bold. The fields contained in braces will be substituted with the actual value of the narrative, and a description of all of the fields to be extracted.

The field definitions list out each of the 200+ fields we’d like to extract. For example, here’s the definition of a field meant to determine if any bystanders were present at the scene when the overdose occurred:

**BystandersPresent**: (option) Bystanders present at time of overdose
  * 1: No bystanders present
  * 2: 1 bystander present
  * 3: Multiple bystanders present
  * 4: Bystanders present, unknown number
  * 5: Unknown if bystander present

This markdown format, with single asterisks indicating bullet points, effectively structures the options for Gemini. Other field types (like Boolean or integer) describe the field name, type, and description without a list of options.

To maintain flexibility, it’s often desirable to have a dynamic set of fields, so we can update them without rewriting and redeploying our code. In this case we use a Google Sheets document that has a row for each field, with information about field name, types, description, options, and more. By storing this information in a spreadsheet, the definitions can easily be updated or tweaked, and the data can be exported as a CSV file, which our code can then read in.

Once we have all of the fields defined in our prompt, we finish it by added important clarifications and a reiteration of the overall goal:

**Additional instructions:**

* Focus on accuracy. Do not generate information that is not explicitly 
supported by the narrative.

* "CME" refers to the coroner or medical examiner.

* "LE" refers to law enforcement.

* Some fields are meant to be considered a pair, with the first field 
(like "BystanderIntOther") being a boolean indicating that there is 
additional information, and the second field (like 
"BystanderIntOtherSpecify") gets filled in with a string only if the 
first field is True.

Please ensure the JSON output is well-formatted and adheres to standard 
conventions. All fields in the JSON object must have a valid value.

Finally, one last thing to be considered when calling Gemini is that you can specify exactly the data type you’d like returned (such as JSON), as described in this article. However, although we can ask for JSON only as a response, there are reasons not to do so. For example, in this case we ask the model to explain its thinking, which is helpful for debugging and also generally improves the quality of the response. Using this approach requires that we extract the JSON object from a larger string, rather than asking for only JSON.

Now that we’ve constructed our prompt, let’s assume we’ve called Gemini, received a response, and now need to parse the result.

Extracting and Validating JSON Data

Despite their advancements, modern large language models are not without limitations. Potential problems include missing fields in the output JSON object, as well as fields that do exist but have invalid values, like a Boolean field with a value of the string “True”, rather than the Boolean value true.

Python libraries like Pydantic or JSON Schema are designed to parse data and ensure that it follows a particular structure or schema, and if you’re comfortable with using such an approach, it will work well. Explicitly defining a schema for the JSON object helps with validation at the time of parsing, can improve code clarity, and can reduce the amount of code you need to write.

That said, it can be challenging to use Pydantic with a dynamic schema like the one we’re using here, where the field definitions are read in from a CSV file. Although there are ways to use Pydantic with dynamic schemas, for clarity we will demonstrate how to validate and correct data using plain Python, which will iterate through all of the fields in our schema and check for various problems.

As stated earlier, the first potential issue with extracted JSON is with missing values. The solution to this is to have default values for each field. For some fields (like booleans) this can be simple. For example, the data to be extracted may be worded in such a way that we only have a True boolean value when we are positive that certain conditions are true. In this case, a default value of False is probably best.

Other field types like strings can be handled in a similar manner. For example, if a string value is missing, perhaps the default value would be an empty string. Likewise, if a field is of type integer, it may have a specific value like 99 that is always used to indicate an unknown value. Choice values (where valid values must come from a list of possible values) can be more challenging, but if there is a choice that indicates that the value is unknown, that would be the default value to use.

Besides missing values, another potential concern is when the value for a field does not match the expected type, like in the example above of the string “True” being returned instead of a boolean constant. In this case, it’s best to attempt to convert values to their correct types through code, as is shown here:

# assume that field_type and field_name were read in from our CSV
if field_type == 'Bool':
    extracted_value = result[field_name]

    # first, handle None, changing it to False
    if extracted_value is None:
        extracted_value = False

    if type(extracted_value) is str:
        print(f'Converting field {field_name} from string to bool')
        extracted_value = extracted_value.lower() == 'true'
    elif type(extracted_value) is int:
        print(f'Converting field {field_name} from int ({extracted_value}) to bool')
        extracted_value = extracted_value == 1

    result[field_name] = extracted_value

Notice that when converting a string value to a boolean, we assume the value will be False unless the string is either “True” or “true”. Likewise, in many computer languages, a value of 1 is considered to be equivalent to True, while 0 indicates a False value. Therefore, to convert from an integer to a boolean, we check if the integer value is equal to 1.

Conclusion

When using a LLM like Gemini to extract structured data from unstructured text, there are two main areas to focus on: prompt construction and data validation.

This article has provided guidance about how to dynamically create a prompt using a set of overall instructions, a list of fields to extract that are described using markdown, and a set of final clarifications and details that the model needs to consider when answering.

Upon receiving the model’s response, we must iterate over each field to be extracted, providing default values for missing fields, and also handling the situation where the data type returned does not match the actual field.

By using these approaches, you will be able to leverage the power and efficiency of models like Gemini. The best way to learn is to experiment, so your next step will be to experiment with your own data. Starting out by using a Colab Enterprise notebook is a quick and easy way to begin

Extract JSON Data from Text using Gemini was originally published in Google Cloud - Community on Medium, where people are continuing the conversation by highlighting and responding to this story.

Winning Blackjack using Machine Learning

Greg Sommerville — Tue, 12 Feb 2019 20:54:10 GMT

A Practical Example of a Genetic Algorithm

One of the great things about machine learning is that there are so many different approaches to solving problems. Neural networks are great for finding patterns in data, resulting in predictive capabilities that are truly impressive. Reinforcement learning uses rewards-based concepts, improving over time. And then there’s the approach called a genetic algorithm.

A genetic algorithm (GA) uses principles from evolution to solve problems. It works by using a population of potential solutions to a problem, repeatedly selecting and breeding the most successful candidates until the ultimate solution emerges after a number of generations.

To demonstrate how effective this approach is, we will use it to solve a complex problem — the creation of a strategy for playing the casino game Blackjack (also known as “21”).

The term “strategy” in this case means a guide for player actions that covers all situations. The goal is to find a strategy that is the very best possible, resulting in maximized winnings over time.

About this “Winning” Strategy

Of course, in reality there is no winning strategy for Blackjack — the rules are set up so the house always has an edge. If you play long enough, you will lose money.

Knowing that, the best possible strategy is the one that minimizes losses. Using such a strategy allows a player to stretch a bankroll as far as possible while hoping for a run of short-term good luck. That’s really the only way to profit at Blackjack.

As you might imagine, Blackjack has been studied by mathematicians and computer scientists for a long, long time. Back in the 1960s, a mathematician named Edward O. Thorp authored a book called Beat the Dealer, which included charts showing the optimal “Basic” strategy.

That optimal strategy looks something like this:

Optimal Strategy for Blackjack

The three tables represent a complete strategy for playing Blackjack.

The tall table on the left is for hard hands, the table in the upper right is for soft hands, and the table in the lower right is for pairs.

If you aren’t familiar with Blackjack, a soft hand is a hand with an Ace that can count as 1 or 11, without the total hand value exceeding 21. A pair is self-explanatory, and a hard hand is basically everything else, reduced to a total hand value.

The columns along the tops of the three tables are for the dealer upcard, which influences strategy. Notice that the upcard ranks don’t include Jack, Queen or King. That’s because those cards all count as 10, so they are all grouped together with the Ten (“T”) to simplify the tables.

To use the tables, a player would first determine if they have a pair, soft hand or hard hand, then look in the appropriate table using the row corresponding to their hand holding, and the column corresponding to the dealer upcard.

The cell in the table will be “H” when the correct strategy is to hit, “S” when the correct strategy is to stand, “D” for double-down, and (in the pairs table only) “P” for split.

Knowing the optimal solution to a problem like this is actually very helpful. Comparing the results from a GA to the known solution will demonstrate how effective the technique is.

Finally, there’s one other thing to get out of the way before we go any further, and that’s the idea of nondeterminism. That means that if the same GA code is run twice in a row, two different results will be returned. That’s something that happens with genetic algorithms due to their inherent randomness. It’s unusual for software to act this way, but in this case it’s just part of the approach.

How a Genetic Algorithm Works

Genetic algorithms are fun to use because they’re so easy to understand: you start with a population of (initially, completely random) potential solutions, and then let evolution do its thing to find a solution.

That evolutionary process is driven by comparing candidate solutions. Each candidate has a fitness score that indicates how good it is. That score is calculated once per generation for all candidates, and can be used to compare them to each other.

In the case of a Blackjack strategy, the fitness score is pretty straightforward: if you play N hands of Blackjack using the strategy, how much money do you have when done? (Due to the house edge, all strategies will lose money, which means all fitness scores will be negative. A higher fitness score for a strategy merely means it lost less money than others might have.)

Once an effective fitness function is created, the next decision when using a GA is how to do selection.

There are a number of different selection techniques to control how much a selection is driven by fitness score vs. randomness. One simple approach is called Tournament Selection, and it works by picking N random candidates from the population and using the one with the best fitness score. It’s simple and effective.

Once two parents are selected, they are crossed over to form a child. This works just like regular sexual reproduction — genetic material from both parents are combined. Since the parents were selected with an eye to fitness, the goal is to pass on the successful elements from both parents.

Naturally, in this case the “genetic material” is simply 340 cells from the three tables that each strategy has. A cell in the child is populated by choosing the corresponding cell from one of the two parents. Oftentimes, crossover is done proportional to the relative fitness scores, so one parent could end up contributing many more table cells than the other if they had a significantly better fitness score.

Finally, just like in nature, it’s important to have diversity in a population. Populations that are too small or too homogenous always perform worse than bigger and more diverse populations.

Genetic diversity is important, because if you don’t have enough, it’s easy to get stuck in something called a local minimum, which is basically a solution that performs better than any similar alternatives, but is inferior to other solutions that are significantly dissimilar to it.

To avoid that problem, genetic algorithms sometimes use mutation (the introduction of completely new genetic material) to boost genetic diversity, although larger initial populations also help.

Results Using a GA

One of the cool things about GAs is simply watching them evolve a solution. The first generation is populated with completely random solutions. This is the very best solution (based on fitness score) from 750 candidates in generation 0 (the first, random generation):

Randomly generated candidate from Gen 0

As you can see, it’s completely random. By generation 12, some things are starting to take shape:

With only 12 generations experience, the most successful strategies are those that Stand with a hard 20, 19, 18, and possibly 17. That part of the strategy develops first because it happens so often and it has a fairly unambiguous result. Basic concepts get developed first with GAs, with the details coming in later generations.

The other hints of quality in the strategy are the hard 11 and hard 10 holdings. According to the optimal strategy those should be mostly Double-Down, so it’s encouraging to see so much yellow there.

The pairs and soft hand tables develop last because those hands happen so infrequently. A player is dealt a pair only 6% of the time, for example.

By generation 33, things are starting to become clear:

By generation 100, the hard hand table on the left is completely stabilized — it doesn’t change from generation to generation. The soft hand and pairs tables are getting more refined:

And then the final generations are used to refine the strategies. The changes from generation to generation are much smaller at this stage, since it’s really just the process of working out the smallest details.

Finally, the best solution found over 237 generations:

As you can see, the final result is not exactly the same as the optimal solution, but it’s very, very close. The hard hands in particular (the table on the left) are almost exactly correct. The soft hands and pairs tables have a few more cells that don’t match, but that’s likely because those hand types occur far less than hard hands.

In terms of outcome, playing the optimal strategy for 500,000 hands at $5 per hand would result in a loss of $176,040. Using the computer-generated strategy would result in a loss of $176,538, a difference of $498 over half a million hands.

There’s an animated GIF that shows the evolution of this strategy over 237 generations, but be aware that it’s 19 MB in size, so you may not wish to view it over a phone.

The source code for the software that produced these images is open source. It’s a desktop application for Windows written in C# with WPF.

Combinatorial Implications

As impressive as the resulting strategy is, we need to put it into context by thinking about the scope of the problem. An optimal strategy for Blackjack is expressed by filling each of the 340 table cells (spread across the three tables) with the best choice for each holding/dealer upcard combination — either stand, hit, double-down, or split.

In terms of combinations, there are 4¹⁰⁰ possible pair strategies, 3⁸⁰ possible soft hand strategies, and 3¹⁶⁰ possible hard hand strategies, for a grand total of 5 x 10¹⁷⁴ possible strategies for Blackjack:

4¹⁰⁰ x 3⁸⁰ x 3¹⁶⁰ = 5 x 10¹⁷⁴ possible Blackjack strategies

In this case the genetic algorithm found a close-to-optimal solution in a solution space of 5 x 10¹⁷⁴ possible answers. Running on a standard desktop computer, it took about 75 minutes. During that run, about 178,000 strategies were evaluated.

Testing Fitness

Genetic algorithms are essentially driven by fitness functions. Without a good way to compare candidates to each other, there’s no way the evolutionary process can work.

The idea of a fitness function is simple. Even though we may not know the optimal solution to a problem, we do have a way to measure potential solutions against each other. The fitness function reflects the relative fitness levels of the candidates passed to it, so the scores can effectively be used for selection.

For the purposes of finding a Blackjack strategy, a fitness function is straightforward — it’s a function that returns the expected final earnings after using the strategy over a certain number of hands.

But how many hands is enough?

As it turns out, you need to play a lot of hands with a strategy to determine its quality. Because of the innate randomness of a deck of cards, many hands need to be played so the randomness evens out across the candidates.

That’s especially important when our GA gets close to a final solution. In early generations, it’s not a problem if the fitness scores are not exact, because the difference between a bad candidate and a good candidate is usually quite large and the convergence to the final solution continues without a problem.

However, once the GA gets into the later generations, the candidate strategies being compared will have only minor differences, so it’s important to get accurate expected winnings from a fitness function.

Luckily, it’s pretty straightforward to find the right number of hands needed. Using a single strategy, multiple tests are run, resulting in a set of fitness scores. The variations from run to run for the same strategy will reveal how much variability there is, which is driven in part by the number of hands tested. The more hands played, the smaller the variations will be.

By measuring the standard deviation of the set of scores we get a sense of how much variability we have across the set for a test of N hands. But as we experiment with different numbers of hands played per test, we can’t compare standard deviations, for the following reason:

Standard deviation is scaled to the underlying data. We can’t compare fitness scores (or standard deviations thereof) from tests using different numbers of hands because a higher number of hands played results in a corresponding increase in the fitness score.

Put it another way: say a strategy wins 34% of the time. If you run it for 25,000 hands versus 50,000 hands, you’ll have different totals at the end. That’s why you can’t simply compare fitness scores that result from different test conditions. And if you can’t compare the raw values, you can’t compare the standard deviations.

We solve this by dividing the standard deviation by the average fitness score for each of the test values (the number of hands played, that is). That gives us something called the coefficient of variation, which can be compared to other test values, regardless of the number of hands played.

The chart here that demonstrates how the variability shrinks as we play more hands:

There are a couple of observations from the chart. First, testing with only 5,000 or 10,000 hands is not sufficient. There will be large swings in fitness scores reported for the same strategy at these levels. In fact, it looks like a minimum of 100,000 hands is probably reasonable, because that is the point at which the variability starts to flatten out.

Could we run with 500,000 or more hands per test? Of course. It reduces variability and increases the accuracy of the fitness function. In fact, the coefficient of variation for 500,000 hands is 0.0229, which is significantly better than 0.0494 for 100,000 hands. But that improvement is definitely a case of diminishing returns: the number of tests had to be increased 5x just to get half the variability.

Given those findings, the fitness function for a strategy will need to play at least 100,000 hands of Blackjack, using the following rules (common in real-world casinos):

Using 4 decks of cards shuffled together
Dealer is required to hit until they have 17 (soft or hard)
You can double down on a hand that you split
There is no insurance
Blackjack pays 3:2

Genetic Algorithm Configurations

One of the unusual aspects to working with a GA is that it has so many settings that need to be configured. The following items can be configured for a run:

Population Size
Selection Method
Mutation Rate and Impact
Termination Conditions

Varying each of these gives different results. The best way to settle on values for these settings is simply to experiment.

Population Size

Here’s a chart of the average candidate fitness per generation for the different population sizes:

The X axis of this chart is the generation number (with a maximum of 200), and the Y axis is the average fitness score per generation. The first few generations aren’t shown to emphasize the differences as we reach the later generations.

The flat white line along the top of the chart is the fitness score for the known, optimal baseline strategy.

The first thing to notice is that the two smallest populations (having only 100 and 250 candidates respectively, shown in blue and orange) performed the worst of all sizes.

The lack of genetic diversity in those small populations results in poor final fitness scores, along with a slower process of finding a solution. Clearly, having a large enough population to ensure genetic diversity is important.

On the other hand, there aren’t too many differences between populations of 400, 550, 700, 850 and 1000.

This is a similar situation to choosing the number of hands to test with — if you pick a value that’s too small, the test isn’t accurate, but once you exceed a certain level, the differences are minor.

Selection Methods

The process of finding good candidates for crossover is called selection, and there are a number of ways to do it. Tournament selection has already been covered. Here are two other approaches:

Roulette Wheel Selection selects candidates proportionate to their fitness scores. Imagine a pie chart with three wedges of size 1, 2, and 5. The wedge with the value 5 will be selected 5/8 of the time, the wedge with value 2 will be selected 2/8 of the time, and the wedge with value 1 will be selected 1/8 of the time. That’s the basic idea behind Roulette Wheel selection. The size of each candidate’s wedge is proportional to their fitness score compared to the total score of all candidates.

One of the problems with that selection method is that sometimes certain candidates will have such a small fitness score that they never get selected. If, by luck, there are a couple of candidates that have fitness scores far higher than the others, they may be disproportionately selected, which reduces genetic diversity.

The solution is to use Ranked Selection, which works by sorting the candidates by fitness, then giving the worst candidate a score of 1, the next worse a score of 2, and so forth, all the way up to the best candidate, which receives a score equal to the population size. Once this fitness score adjustment is complete, Roulette Wheel selection is used.

Here’s a graph that compares the average fitness per generation using a variety of selection methods:

As you can see, tourney selection converges on an optimal solution very quickly — in fact, the bigger the tourney size, the faster the average fitness score improves. That makes sense, because if you’re choosing 7 random candidates and using the best, the quality is going to be much higher than doing the same while choosing only 2.

Even though it had the fastest initial improvement, Tourney 7 ends up producing the worst results. That makes sense, because although a big tourney size results in rapid improvement, it also limits the genetic pool to only the best. Needed genetic diversity is lost, and in the long run it doesn’t perform as well.

The best performers look to be Tourney 2, Tourney 3, and Tourney 4. Given a population of 700, these numbers provide good long-term results.

Elitism

There’s another concept in genetic algorithms called elitism. It’s the idea that when building a new generation, first sort the population by fitness, and then pass in a certain percentage of the best candidates directly into the next generation without alteration. After that is done, normal crossover begins.

This chart shows the effects of four different elitism rates (later generations only, to show the details). Clearly no elitism or 15% are reasonable, although 0% looks a bit better.

There’s one thing that’s surprising about this chart — the higher the elitism was, the slower the convergence was to solution. You might think that deliberately including the best from each generation would speed things up, but in fact it looks like using only crossed-over candidates gives the best results, and is also the fastest.

Mutations

Keeping genetic diversity high is important, and mutation is an easy way to introduce that.

There are two factors relating to mutation: how often does it happen, and how much of an impact does it have when it does happen?

A mutation rate controls how often a newly created candidate will be mutated. The mutation is done immediately after creation via crossover.

The mutation impact controls how much a candidate is mutated, in terms of percentage of its cells that will be randomly changed. All three tables (hard hands, soft hands and pairs) are mutated the same percentage.

Starting with a fixed impact rate of 10%, here are the effects of different mutation rates:

It is clear that mutation does not help for this problem — the more candidates are affected by mutation, the worse the results. It follows that trying different mutation impact values is not required — 0% mutation rate is clearly the best for this problem.

Termination Conditions

Knowing when to quit a genetic algorithm can be tricky. Some situations call for a fixed number of generations, but for this problem the solution was to look for stagnation — in other words, the genetic algorithm stops when it detects that the candidates are no longer improving.

The condition used for this test was that if there was no improvement in the overall best strategy (or a generation’s average score) for 25 generations in a row, then the process terminates and the best result found to that point is used as the final solution.

Wrapping Up

Genetic algorithms are a powerful technique for solving complex problems, and they have the benefit of being easy to understand. For problems with huge solution spaces due to combinatorial factors, they are extremely effective.

For more information about GA, please start with this Wikipedia article or the PluralSight (paid) course I wrote that covers the topic in far greater detail.

Winning Blackjack using Machine Learning was originally published in TDS Archive on Medium, where people are continuing the conversation by highlighting and responding to this story.