Asset Orchestration architecture brings the “RAG experience” to image generation

Team Octo · OctoAI · Dec 18, 2023

The rise of RAG for LLMs

Foundation models are powerful. Early adoption has shown that the best way to truly unlock value from these models is by customizing their outputs, like including product-specific info for a support ChatBot, or location-specific details for a travel app. For text generation, the Retrieval Augmented Generation (RAG) approach, and the tooling around it, simplify how developers can incorporate application-specific context into models for many common customization needs. RAG has quickly become the norm for adopting Large Language Models (LLMs), but it does not work for media generation, including images, video, or audio. The lack of a coherent architecture has limited the extent to which developers can use GenAI to create customized visuals in applications.

A new approach to this problem is called Asset Orchestration, which offers a RAG-like experience and novel architecture to customize media generation models. With Asset Orchestration, developers can create fine-tuning assets from images, import model assets from the thousands already created by the community, and apply them to power their image generation at inference time.

This article is a brief overview of why we designed Asset Orchestration as an alternative to RAG for media. We’ll dive into:

  • How RAG works
  • Why RAG does not apply to image generation
  • How Asset Orchestration enables use-case and context-aware image generation

Retrieval Augmented Generation (RAG) and LLM Context Windows

RAG has now emerged as the norm for how builders and businesses integrate LLMs into applications. In this approach, application and business specific context information is provided to the model at inference time, allowing for generation to incorporate these additional facts as needed in the output. This information could be datasheets and guides related to specific products, as in the case of the customer service ChatBot we discussed; or detailed and current information about events, locations and prices, as in the case of the travel app.

Key to this approach is the LLM context window. The additional information is provided to the LLM as part of the input tokens for the generation request. The LLM’s context window, which is the total number of tokens that can be passed to the LLM as input, determines the amount of information that can be included. In the RAG architecture, developers use a predefined set of reference data sources, often pre-processed and curated as a data store for optimal retrieval based on the input prompt for the query, from which additional information can be retrieved.

The reference data sources here could be as simple as additional text that captures context information, or more elaborate mechanisms like a pre-optimized search corpus (using TF-IDF, BM25, etc.) or a vector database with vector embeddings representing the context info (using Instructor, BERT, etc.). The text-to-text generation itself is broken into a multi-step process. The first step queries the data sources and retrieves the relevant information associated with the input text (Retrieval), the second step inserts the retrieved information into the input prompt as part of the inference (Augmented), and the LLM then uses the information in the prompt to generate a result (Generation). Together, these steps deliver LLM outputs that are highly relevant to the specific application and use case.
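To make this retrieve, augment, generate split concrete, here is a minimal, framework-free Python sketch. The embed, vector_store.search, and llm_generate calls are hypothetical placeholders standing in for whatever embedding model, data store, and LLM an application actually uses.

```python
# Minimal RAG sketch: retrieve relevant context, augment the prompt, then generate.
# `embed`, `vector_store.search`, and `llm_generate` are hypothetical placeholders.

def rag_answer(question: str, vector_store, llm_generate, embed, k: int = 3) -> str:
    # 1. Retrieval: find the k reference chunks closest to the question embedding
    query_vec = embed(question)
    context_chunks = vector_store.search(query_vec, top_k=k)

    # 2. Augmentation: insert the retrieved chunks into the prompt,
    #    staying within the LLM's context window
    context = "\n\n".join(chunk.text for chunk in context_chunks)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
    )

    # 3. Generation: the LLM produces an answer grounded in the retrieved context
    return llm_generate(prompt)
```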

Retrieval Augmented Generation architecture. Source: LangChain Blog, https://blog.langchain.dev/espilla-x-langchain-retrieval-augmented-generation-rag-in-llm-powered-question-answering-pipelines/

This has been further simplified by tools like LangChain and LlamaIndex, which abstract the model-specific operations, the retrieval, and the prompt-inclusion steps into easy pre-built function calls. These enable simple operations like adding documents with LangChain's Stuff Documents Chain, or more complex retrievals using pre-built integrations with vector databases like Pinecone or Elastic.
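As a second illustration, the same pattern with the classic LangChain 0.0.x-era APIs looked roughly like the sketch below (using FAISS in place of a managed vector database for brevity; exact imports and class names vary across LangChain versions, so treat this as indicative rather than exact):

```python
# Rough sketch of a "stuff"-style retrieval chain with classic LangChain (~0.0.x APIs).
# Assumes the langchain, openai, and faiss-cpu packages and an OPENAI_API_KEY.
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.llms import OpenAI
from langchain.chains import RetrievalQA

docs = [
    "The OctoTravel app covers 120 destinations.",          # toy reference data
    "Support hours are 9am-5pm PT, Monday to Friday.",
]

# Build a small vector store over the reference documents
db = FAISS.from_texts(docs, OpenAIEmbeddings())

# "stuff" chain: retrieved documents are stuffed directly into the prompt
qa = RetrievalQA.from_chain_type(
    llm=OpenAI(),
    chain_type="stuff",
    retriever=db.as_retriever(),
)

print(qa.run("When is support available?"))
```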

RAG and the tooling around it enable a simple path to creating context aware LLM outputs:

  1. Prepare data sources that represent the context information needed for the application
  2. Retrieve the appropriate references at generation time
  3. Incorporate information from the retrieved reference sources for generation

While RAG is not sufficient for all types of customization (for example, text generation in a specific tone, or writing in a new style, cannot be accomplished with RAG alone), it does address the most frequently seen customization need in text generation: incorporating context-specific information in the outputs.

Why RAG cannot work for image generation

Media generation must support larger data sizes

At this point, one aspect becomes clear. Everything in RAG, be it the corpus retrieval, the vector stores, or the prompt-inclusion approach to providing context to the model, is built for text. Two fundamental aspects make RAG a valuable architecture for text generation: (1) the context windows in LLMs provide an effective way to pass this information to the LLM so it can use the additional data during generation, and (2) most of the customization needs commonly seen in text generation involve incorporating use-case-specific data. Neither applies directly to image generation.

Let’s start by briefly looking at GenAI image generation. At the core of GenAI image generation is the diffusion model (specifically, the latent diffusion model). Diffusion models work by injecting noise and then iteratively denoising to generate the final output. The text prompt guides the denoising process through text conditioning, and the output iteratively gets closer to the desired end image based on the images used in the model’s training data. The inputs to this process are the text prompt and sometimes (as in SDXL 1.0) additional images that provide a starting base and mask for use cases like inpainting.

Even if there were a way for the model to use context-specific images to influence generation, the size of images makes this challenging. The default 1024x1024-pixel image from Stable Diffusion XL is approximately 1 MB, so providing a set of images to influence image creation would add megabytes of data and increase latency on every inference call. Contrast this with text context, where a 4,000-token context window for Llama 2 is on the order of 0.01 MB, making retrieval an efficient and scalable process. While broad customization is not possible, models like SDXL 1.0 do allow the use of a base image in the generation, for use cases like cropping and inpainting.
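A back-of-envelope comparison of the payload sizes makes the gap obvious (the byte counts below are rough assumptions, not measurements):

```python
# Back-of-envelope payload comparison: text context vs. image context.
# Assumptions (rough, not measured): ~4 bytes per token of plain text,
# ~1 MB per 1024x1024 PNG from SDXL.

TOKENS_IN_CONTEXT = 4_000          # e.g. a Llama 2 context window
BYTES_PER_TOKEN = 4                # ~4 characters per token, ~1 byte per character
text_context_mb = TOKENS_IN_CONTEXT * BYTES_PER_TOKEN / 1_000_000

REFERENCE_IMAGES = 20              # a modest set of style-reference images
MB_PER_IMAGE = 1.0                 # ~1 MB per 1024x1024 SDXL image
image_context_mb = REFERENCE_IMAGES * MB_PER_IMAGE

print(f"text context : ~{text_context_mb:.3f} MB")   # ~0.016 MB per request
print(f"image context: ~{image_context_mb:.1f} MB")  # ~20 MB per request
```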

Media generation calls for “form” not facts

Image generation customization is often about style. The most common need we have heard from customers is applying a specific set of styles and “look and feel” to generated images in a predictable and consistent manner. This could be for applying a company’s thematic visuals in images, creating images in a specific universe for a game, or ensuring images stay within a specific customized style for consistency. As briefly discussed earlier, a model can only create images based on the images used in its training.

While providing an image as context can be used for specific inpainting or cropping use cases, it cannot generate new imagery based on that style, like placing your custom character in different settings and actions, or making sure images apply a specific unique color scheme. Creating new visuals in a specific style actually requires updates to the weights of the model; i.e., it is not enough to add additional images (facts), the model has to be taught how to create new types of images (form). RAG was not designed to address this.

With the momentum of innovation in the space, like the application of LoRA to text-to-image models, and communities and repositories with thousands of pre-created fine-tuning LoRAs and checkpoints, there is today a growing pool of fine-tuning resources for image generation. But using these involves either manually loading and unloading them, or running multiple dedicated endpoints, all of which adds cost, latency, and complexity. The RAG-like framework and tooling to bring these together for an application developer simply did not exist. And other approaches to customizing image generation, like using CLIP or Deepbooru for image-to-image use cases, or using the init_image in SDXL for inpainting, do not offer the ease and flexibility of RAG. We heard this loud and clear from customers, and it is why we designed the Asset Orchestration architecture.

Asset Orchestration brings the ease of the RAG experience to Image Generation

The kind of image generation customization we discussed here requires the model to learn to create new imagery. Common approaches to fine-tuning Stable Diffusion, like LoRAs, checkpoints, and textual inversions, are popular and available in the community, but using these at scale was not possible for app developers. To make it easy and efficient to do so, these new customizations need to be applied at inference time, just like RAG enables with LLMs.

Asset Orchestration addresses this need by applying fine-tuning to the base model at inference time, effectively changing the weights, and therefore the form, of the model on the fly. To do this, Asset Orchestration relies on assets: fine-tuned artifacts like LoRAs and checkpoints. In OctoAI, they’re hosted in a customer’s Asset Library.

The inference API specifies one or more assets to apply for the specific generation. Based on the selected assets and their ratios, the model weights are dynamically updated at inference time. This is enabled by specific capabilities built into OctoAI, including the accelerated base model, intelligent caching, and separation of the weights from the model, which together allow fine-tuning assets ranging from LoRAs of tens of MB to multi-GB checkpoints to be loaded efficiently and quickly.
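Conceptually, applying a LoRA asset at inference time amounts to adding a scaled low-rank update on top of the frozen base weights. The PyTorch sketch below illustrates the idea only; it is not OctoAI’s actual implementation.

```python
# Conceptual sketch of applying a LoRA at inference time (not OctoAI internals):
# the effective weight is the frozen base weight plus a scaled low-rank update.
import torch

def apply_lora(base_weight: torch.Tensor,
               lora_down: torch.Tensor,   # shape (rank, in_features)
               lora_up: torch.Tensor,     # shape (out_features, rank)
               scale: float) -> torch.Tensor:
    """Return base_weight + scale * (lora_up @ lora_down)."""
    return base_weight + scale * (lora_up @ lora_down)

# Toy example: a 16x16 layer with a rank-4 LoRA applied at 60% strength
base = torch.randn(16, 16)
down = torch.randn(4, 16)
up = torch.randn(16, 4)

effective = apply_lora(base, down, up, scale=0.6)
print(effective.shape)  # torch.Size([16, 16])
```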

Asset Orchestration architecture in action; Source: OctoML

Asset Orchestration is an architecture built from the ground up to enable customization of media generation models. It enables the live endpoint and the running model to take on new form and fine-tuned styles at runtime, and it has been proven to work at scale with the OctoAI Image Gen Solution. With this, Asset Orchestration enables the same simple path that RAG offers for LLMs to customize image generation (a request sketch follows the list below):

  1. Prepare the Asset Library with your set of imported or created fine-tuning assets
  2. Retrieve the desired fine-tuning assets at inference time, and
  3. Incorporate the fine-tuning assets for the image generation
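To show what this looks like from the application side, here is a hedged sketch of a generation request that references assets. The endpoint URL, field names, and asset names are hypothetical placeholders, not the documented OctoAI API; check the OctoAI docs for the actual request schema.

```python
# Hypothetical sketch of an asset-aware image generation request.
# The endpoint, field names, and asset names below are illustrative placeholders,
# not the documented OctoAI API.
import os
import requests

payload = {
    "prompt": "a hero character exploring a neon-lit market at night",
    "width": 1024,
    "height": 1024,
    # Fine-tuning assets from the Asset Library, applied at inference time
    "checkpoint": "my-brand-style-checkpoint",           # hypothetical asset name
    "loras": {"my-character-lora": 0.8,                  # asset name -> strength
              "watercolor-style-lora": 0.4},
}

resp = requests.post(
    "https://image.example.octo.ai/generate/sdxl",       # placeholder URL
    json=payload,
    headers={"Authorization": f"Bearer {os.environ['OCTOAI_TOKEN']}"},
    timeout=120,
)
resp.raise_for_status()
image_b64 = resp.json()["images"][0]                     # assumed response shape
```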

Asset Orchestration is available today in the OctoAI Image Gen Solution. Get started today to create your own fine-tuning assets, import your choice of LoRAs, checkpoints or textual inversions from the thousands available in popular repos, and mix and match these to create new innovative styles.

We’re working on expanding the application of Asset Orchestration to other domains, and are looking for design partners to help us shape this capability and roadmap. If you have an interesting model customization use case and would like to work with us, reach out!
