Create a Document Embedding Workflow with OCI Object Storage, OCI GenAI, and Oracle Database 23ai

Anders Swanson
4 min read · Jul 10, 2024

Vector databases like Oracle Database 23ai can quickly find related vectors using similarity search, an essential tool for building recommendation systems or improving LLM fidelity with Retrieval Augmented Generation (RAG).

To effectively use similarity search, you’ll need a database populated with embedding vectors. In this article, we’ll walk through a Java workflow that embeds documents from Oracle Cloud Infrastructure (OCI) Object Storage using OCI GenAI, storing the resulting embeddings in Oracle Database 23ai.

We’ll split this workflow into four classes, each acting as a distinct step in our pipeline:

  1. A document loader that streams documents from OCI Object Storage.
  2. A splitter that chunks documents into parts, preparing them for embedding.
  3. An embedding service that embeds document chunks using OCI GenAI.
  4. A vector store implementation that stores embeddings in Oracle Database.

The full sample of this workflow can be found on GitHub here.

Implementing a Document Loader

The OCIDocumentLoader takes a bucket and an object prefix, and returns a stream of the documents whose object names begin with that prefix.

First, let’s write a method to list objects from a bucket, returning a list of object names. This method uses pagination to list every object starting with a given prefix, and keeps only the object names from each result. We’ll use listObjects to find all of the objects that we need to download for splitting and embedding.
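As a rough illustration of the pagination loop, here is a minimal sketch that uses only the Java standard library. The Page class and the fetchPage function are hypothetical stand-ins for a call to the OCI Object Storage client; with the real SDK, the next-page token would likely come from the list response (nextStartWith) and be passed back as the start of the following request.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/** Sketch of paginated listing. Page and fetchPage are hypothetical stand-ins for the OCI SDK. */
final class ObjectLister {

    /** One page of results: object names plus the token for the next page (null when done). */
    static final class Page {
        final List<String> names;
        final String nextStart;

        Page(List<String> names, String nextStart) {
            this.names = names;
            this.nextStart = nextStart;
        }
    }

    /** Follows page tokens until none remain, collecting every object name. */
    static List<String> listObjects(Function<String, Page> fetchPage) {
        List<String> names = new ArrayList<>();
        String start = null; // a null token requests the first page
        do {
            Page page = fetchPage.apply(start);
            names.addAll(page.names);
            start = page.nextStart;
        } while (start != null);
        return names;
    }
}
```

The do/while shape guarantees that the first page is always requested, and the loop ends as soon as a page arrives without a continuation token.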

Using the listObjects method, let’s add two new methods: one that extracts the text from a GetObjectResponse, and one that returns each listed object’s content as a stream. Later on, we’ll map the new streamDocuments method to our text splitter.
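The two methods might look roughly like the sketch below. It assumes UTF-8 text documents and replaces the Object Storage call with a hypothetical openObject function that returns an InputStream for an object name; in the real loader, that stream would come from the GetObjectResponse.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Stream;

/** Sketch of document streaming. openObject is a hypothetical stand-in for an Object Storage download. */
final class DocumentStreamer {

    /** Reads the whole stream as UTF-8 text and closes it afterwards. */
    static String readText(InputStream in) {
        try (InputStream stream = in) {
            return new String(stream.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            // Stream pipelines cannot throw checked exceptions, so wrap it
            throw new UncheckedIOException(e);
        }
    }

    /** Lazily downloads and decodes each named object as the stream is consumed. */
    static Stream<String> streamDocuments(List<String> objectNames,
                                          Function<String, InputStream> openObject) {
        return objectNames.stream().map(openObject).map(DocumentStreamer::readText);
    }
}
```

Because the stream is lazy, each object is only fetched when a downstream step asks for it, which keeps memory use bounded to roughly one document at a time.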

For the full document loader implementation, take a look at the OCIDocumentLoader class.

The Text Splitter

For our sample code, we’ll use a simple text splitter that chunks text line-by-line. If you’re parsing more complicated document types, such as HTML or PDF, you’ll need a more specific splitter.
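A line-by-line splitter can be very small. The sketch below is one possible version, not necessarily the one in the sample repository; the class name LineSplitter is an assumption, as is the choice to trim whitespace and skip blank lines.

```java
import java.util.List;
import java.util.stream.Collectors;

/** Sketch of a line-by-line splitter. The class name and blank-line handling are assumptions. */
final class LineSplitter {

    /** Splits a document on line terminators, trimming whitespace and dropping empty lines. */
    static List<String> split(String document) {
        return document.lines()
                .map(String::strip)
                .filter(line -> !line.isEmpty())
                .collect(Collectors.toList());
    }
}
```

Dropping blank lines avoids sending empty inputs to the embedding service.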

Implementing an Embedding service

The embedding service will take input from the Splitter, transforming a list of document chunks into a list of batches, where each batch is sent to the OCI GenAI service for embedding. We’ll map the GenAI service responses into an embedding list that we’ll insert into the database later on.

Note that the OCI GenAI service accepts at most 96 inputs per embedding request, so larger lists of chunks must be split into batches.
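To make the batching step concrete, here is a minimal sketch using only the standard library. The embedBatch function is a hypothetical stand-in for the call to OCI GenAI, and it is assumed to return one vector per input, in input order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/** Sketch of batching before embedding. embedBatch is a hypothetical stand-in for the GenAI call. */
final class EmbeddingBatcher {

    /** Batch size limit taken from the article: 96 inputs. */
    static final int MAX_BATCH_SIZE = 96;

    /** Partitions items into consecutive batches of at most batchSize elements. */
    static <T> List<List<T>> toBatches(List<T> items, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < items.size(); i += batchSize) {
            int end = Math.min(i + batchSize, items.size());
            batches.add(new ArrayList<>(items.subList(i, end)));
        }
        return batches;
    }

    /** Embeds every chunk, one batch per call, and flattens the results in order. */
    static List<float[]> embedAll(List<String> chunks,
                                  Function<List<String>, List<float[]>> embedBatch) {
        List<float[]> embeddings = new ArrayList<>();
        for (List<String> batch : toBatches(chunks, MAX_BATCH_SIZE)) {
            embeddings.addAll(embedBatch.apply(batch));
        }
        return embeddings;
    }
}
```

With 200 chunks, this produces three requests of 96, 96, and 8 inputs.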

For the full embedding service implementation, look at the OCIEmbeddingModel class.

Oracle Database as a vector store

We’ll create a vector store service backed by Oracle Database, using the new vector type to store our embeddings. If you’re looking for a more detailed look at vector stores, check out my prior article describing how to use similarity search with Oracle Database.

For the vector store, we’ll define a method that creates both a table with a vector column, and an Inverted File (IVF) vector index on this table using cosine distance to facilitate fast search.
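The table and index creation could be sketched as follows. The table layout, column names, index name, FLOAT32 element type, and the 95 percent target accuracy are all assumptions for illustration; the SQL follows the Oracle Database 23ai vector syntax as I understand it, where "organization neighbor partitions" selects an IVF index, so check it against the documentation for your database version.

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

/** Sketch of vector table setup. Table layout, names, and accuracy are illustrative assumptions. */
final class VectorTableSetup {

    /** Builds DDL for a table holding the chunk text and its embedding. */
    static String createTableSql(String table, int dimensions) {
        return String.format(
                "create table if not exists %s ("
                + "id raw(16) default sys_guid() primary key, "
                + "content clob, "
                + "embedding vector(%d, FLOAT32))",
                table, dimensions);
    }

    /** Builds DDL for an IVF (neighbor partitions) index using cosine distance. */
    static String createIndexSql(String table) {
        return String.format(
                "create vector index %s_vector_idx on %s (embedding) "
                + "organization neighbor partitions "
                + "distance COSINE with target accuracy 95",
                table, table);
    }

    /** Runs both statements. The table name is interpolated, so it must come from trusted input. */
    static void createTableAndIndex(Connection connection, String table, int dimensions)
            throws SQLException {
        try (Statement stmt = connection.createStatement()) {
            stmt.execute(createTableSql(table, dimensions));
            stmt.execute(createIndexSql(table));
        }
    }
}
```

The dimension passed in must match the output size of the embedding model you use. The index statement here has no existence check, so a second run may fail if the index was already created.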

Next, we’ll implement a method to add embedding vectors to the database. Because we have a vector column, we’ll use the OracleType.VECTOR value to insert the vector into our prepared statement.
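A batched insert might look like the sketch below. To keep the example compilable with only the JDK, the vector type is passed in as a java.sql.SQLType; with the Oracle JDBC driver on the classpath, the caller would pass OracleType.VECTOR. The Embedding holder class, the table columns, and the method name addAll are assumptions for illustration.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.SQLType;
import java.util.List;

/** Sketch of a batched vector insert. Embedding, addAll, and the column names are assumptions. */
final class VectorInserter {

    /** Pairs a chunk of text with its embedding vector. */
    static final class Embedding {
        final String content;
        final float[] vector;

        Embedding(String content, float[] vector) {
            this.content = content;
            this.vector = vector;
        }
    }

    /**
     * Inserts every embedding in a single JDBC batch and returns the number of statements run.
     * With the Oracle driver, pass OracleType.VECTOR as vectorType.
     * The table name is interpolated, so it must come from trusted input.
     */
    static int addAll(Connection connection, String table,
                      List<Embedding> embeddings, SQLType vectorType) throws SQLException {
        String sql = "insert into " + table + " (content, embedding) values (?, ?)";
        try (PreparedStatement stmt = connection.prepareStatement(sql)) {
            for (Embedding embedding : embeddings) {
                stmt.setString(1, embedding.content);
                // Bind the float[] using the vector SQL type
                stmt.setObject(2, embedding.vector, vectorType);
                stmt.addBatch();
            }
            return stmt.executeBatch().length;
        }
    }
}
```

Batching the inserts keeps the number of database round trips low when a document produces many chunks.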

For the full vector store implementation, see the OracleVectorStore class.

With the vector store implementation, we have all four components needed for our workflow. What remains is to wire them together.

Wiring up the example workflow

The EmbeddingWorkflowIT class ties together our workflow components: streaming documents from a bucket, splitting the documents into chunks, embedding each chunk, and finally storing the embeddings in Oracle Database.

The code for this is relatively simple, instantiating each component and chaining the resulting stream operations together.
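The shape of that chain can be sketched with plain functional interfaces. Here each stage (split, embed, store) is a hypothetical stand-in for the corresponding component, and the sketch stores only vectors for brevity, whereas a real workflow would keep each chunk’s text alongside its embedding.

```java
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;
import java.util.stream.Stream;

/** Sketch of the workflow wiring. Each stage is a hypothetical stand-in for a real component. */
final class EmbeddingWorkflowSketch {

    /** Streams documents through the split, embed, and store stages, one document at a time. */
    static void run(Stream<String> documents,
                    Function<String, List<String>> split,
                    Function<List<String>, List<float[]>> embed,
                    Consumer<List<float[]>> store) {
        documents
                .map(split)                             // document -> chunks
                .filter(chunks -> !chunks.isEmpty())    // skip documents with no content
                .map(embed)                             // chunks -> embeddings
                .forEach(store);                        // embeddings -> database
    }
}
```

Because every stage is a stream operation, nothing runs until forEach pulls documents through the pipeline, and each document is fully processed before the next one is loaded.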

Questions or comments? Let me know! If you’re considering building out AI solutions using Oracle Database, we’re here to help.
