Foundational Model Orchestration (FOMO) — A Primer
VC Astasia Myers’ perspectives on machine learning, cloud infrastructure, developer tools, open source, and security.
There are two approaches to building Machine Learning (ML) applications. Traditionally, data scientists and ML engineers trained or fine-tuned ML models that they deployed to production as a notebook or API. Recently, a new approach has emerged in which ML practitioners, who can include software developers, use prompt engineering to get results from a hosted foundational model via an API. Prompt engineering is a Natural Language Processing (NLP) concept that involves discovering inputs that yield desirable or useful results from an ML model. With the rise of foundational models, we’ve seen the emergence of new tooling called Foundational Model Orchestration (FOMO) that coordinates tasks within a foundational-model-driven workflow.
According to Stanford, foundational models are “models trained on broad data (generally using self-supervision at scale) that can be adapted to a wide range of downstream tasks.” Today foundational models support a single modality like text (e.g., OpenAI GPT-3, Cohere, and Anthropic); image (Stability, DALL-E); video (Microsoft X-CLIP); and audio (OpenAI Whisper). We expect to see multi-modal foundational models in the future.
We spoke to 20 ML practitioners building applications that leverage foundational models. Interestingly, 50% were software engineers, an emerging persona within ML. Because foundational models lower the barrier to building ML models and are a higher level of abstraction, for the first time we are seeing software developers become part of the ML development process. For those working with hosted Large Language Models (LLMs), OpenAI GPT-3 was used 80% of the time, Cohere 15%, and Anthropic 5%. We found the most common use cases were search, summarization, Q&A, and chatbots.
One reason FOMO solutions have emerged is that the foundational model API exists within the context of a pipeline that includes data computation and knowledge systems. Most of the time, getting value from the model isn’t just a matter of querying the API and receiving a result. Instead, it’s a multi-step process. Currently, hosted APIs don’t offer pre- and post-processing as part of the platform experience. A catalyst for pre-processing is that foundational models have a ~4K token limit, so users need to perform string splitting. A second reason FOMO solutions have appeared is that foundational models currently do not allow teams to integrate directly with external resources like databases and SaaS products. This means users can’t add their own data to the model directly to enhance performance for certain tasks or domains.
Let’s say you want to summarize transcripts using a foundational model like GPT-3. A transcript of an hour-long meeting can be ~7.5K words. 1,500 words equate to about 2,048 tokens, so the total transcript is about 10K tokens. That’s beyond the 4K token limit and too much for one prompt. As a result, you need to start with string splitting to break the text into three groups. Then it is a map/reduce effort in which you use GPT-3 to summarize each chunk of the document (with some degree of overlap between chunks). Last, you query the model to summarize the summaries for a final output.
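To make the arithmetic and the map/reduce flow concrete, here is a minimal Python sketch of the transcript example above. It is an illustration under stated assumptions, not how any particular FOMO product works: the function names, the 2,700-word chunk size, and the 100-word overlap are our own hypothetical choices, and `fake_summarize` is a placeholder standing in for a real call to a hosted model.

```python
from typing import Callable, List


def estimate_tokens(word_count: int) -> int:
    """Rough token estimate using the ratio quoted above:
    about 1,500 words per 2,048 tokens."""
    return round(word_count * 2048 / 1500)


def split_words(words: List[str], chunk_size: int, overlap: int) -> List[List[str]]:
    """Split a word list into chunks of at most `chunk_size` words,
    repeating `overlap` words between consecutive chunks."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks: List[List[str]] = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


def map_reduce_summarize(
    text: str,
    summarize: Callable[[str], str],
    chunk_size: int,
    overlap: int,
) -> str:
    """Map: summarize each chunk. Reduce: summarize the joined summaries."""
    chunks = split_words(text.split(), chunk_size, overlap)
    partial_summaries = [summarize(" ".join(chunk)) for chunk in chunks]
    return summarize("\n".join(partial_summaries))


def fake_summarize(text: str) -> str:
    """Hypothetical placeholder for a hosted model call: keeps the first five words."""
    return " ".join(text.split()[:5])


if __name__ == "__main__":
    print(estimate_tokens(7500))  # 7.5K words is roughly 10K tokens
    transcript = " ".join(f"w{i}" for i in range(7500))
    # 2,700 words per chunk leaves headroom under a ~4K token limit
    # for the instruction text and the model's response (our assumption).
    print(len(split_words(transcript.split(), 2700, 100)))
    print(map_reduce_summarize(transcript, fake_summarize, 2700, 100))
```

In practice you would swap `fake_summarize` for a function that sends a summarization prompt to your model provider, and you would likely count tokens with the provider's tokenizer rather than a word ratio.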
Another example would be creating a Q&A service for a textbook. This is a large amount of information, so you start by string splitting the material into chunks. You can create embeddings of the chunks and add them to a vector store like Pinecone or Weaviate. When a user asks a question, you embed the query using the same embeddings model and do a cosine similarity search against your vector database. This enables you to build a prompt that can be sent to a foundational model API to get a response. This demonstrates the multi-step process: pre-processing, embedding, a “k nearest neighbors” search, and finally querying the LLM with a prompt to get an answer.
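The retrieval steps in the textbook example can be sketched in a few lines of Python. This is a toy illustration only: `embed` is a bag-of-words stand-in for a real embeddings model, `InMemoryVectorStore` is a hypothetical substitute for a hosted vector database such as Pinecone or Weaviate (it does not use either product's API), and the prompt template is our own assumption.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def embed(text: str) -> Dict[str, float]:
    """Toy stand-in for an embeddings model: a bag-of-words count vector."""
    tokens = [t.strip(".,?!").lower() for t in text.split()]
    return {tok: float(n) for tok, n in Counter(t for t in tokens if t).items()}


def cosine_similarity(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class InMemoryVectorStore:
    """Hypothetical in-memory substitute for a vector database."""

    def __init__(self) -> None:
        self._items: List[Tuple[str, Dict[str, float]]] = []

    def add(self, chunk: str) -> None:
        self._items.append((chunk, embed(chunk)))

    def top_k(self, query: str, k: int) -> List[str]:
        """Return the k chunks most similar to the query (k nearest neighbors)."""
        query_vec = embed(query)  # same embedding function as for the chunks
        ranked = sorted(
            self._items,
            key=lambda item: cosine_similarity(query_vec, item[1]),
            reverse=True,
        )
        return [chunk for chunk, _ in ranked[:k]]


def build_prompt(question: str, context_chunks: List[str]) -> str:
    """Assemble retrieved chunks and the question into a single prompt."""
    context = "\n".join(f"- {chunk}" for chunk in context_chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


if __name__ == "__main__":
    store = InMemoryVectorStore()
    store.add("Photosynthesis converts light into chemical energy.")
    store.add("Mitochondria produce energy for the cell.")
    store.add("The French Revolution began in 1789.")
    question = "When did the French Revolution begin?"
    print(build_prompt(question, store.top_k(question, k=1)))
```

The resulting prompt string is what you would then send to the foundational model API. A real system would replace the brute-force sort with the vector database's own nearest-neighbor query.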
FOMO solutions are valuable for tying LLMs to internal data systems; prompt engineering, such as A/B testing prompts; chaining models together; switching foundational models; and A/B testing foundational models. We expect foundational models will add the ability to tie into third-party services for retrieval augmentation, so the value of this functionality will go down over time. While ML practitioners will still need to manage the individual prompts that ping the API, we believe the value of prompt engineering will decrease over time with zero-shot learning and increased token limits. We believe in a future where multiple foundational models, each tuned for a particular task, must be chained together. In this world FOMO’s value goes up because it facilitates A/B testing foundational models, chaining services, and switching out models easily. When speaking with users, we heard that the decision criteria for picking a model vendor are a mix of performance, cost, result filters, and existing relationships. We consistently heard that pinging model APIs can be expensive, so companies may use different models based on customer tier. FOMO solutions can enable cost savings and vendor flexibility by acting as a unified multi-provider interface.
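As a rough illustration of what a unified multi-provider interface with tier-based routing might look like, consider the Python sketch below. Everything in it is hypothetical: the `Provider` and `ModelRouter` classes, the provider names, the prices, and the tier names are invented for the example, and the `complete` callables are stubs rather than real vendor SDK calls.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class Provider:
    """One model vendor behind a common `complete(prompt)` interface."""
    name: str
    cost_per_1k_tokens: float  # hypothetical price, for illustration only
    complete: Callable[[str], str]


class ModelRouter:
    """Routes each request to a provider based on the customer's tier."""

    def __init__(self, providers: Dict[str, Provider], tier_to_provider: Dict[str, str]) -> None:
        unknown = set(tier_to_provider.values()) - set(providers)
        if unknown:
            raise ValueError(f"tiers reference unknown providers: {sorted(unknown)}")
        self._providers = providers
        self._tier_to_provider = tier_to_provider

    def complete(self, prompt: str, customer_tier: str) -> Tuple[str, str]:
        """Return (provider name, model response) for the given tier."""
        provider = self._providers[self._tier_to_provider[customer_tier]]
        return provider.name, provider.complete(prompt)


def make_stub(name: str) -> Callable[[str], str]:
    """Build a stub completion function; a real one would call a vendor API."""
    return lambda prompt: f"[{name}] {prompt}"


if __name__ == "__main__":
    providers = {
        "small-model": Provider("small-model", 0.002, make_stub("small-model")),
        "large-model": Provider("large-model", 0.02, make_stub("large-model")),
    }
    # Swapping vendors for a tier is a one-line change to this mapping.
    router = ModelRouter(providers, {"free": "small-model", "enterprise": "large-model"})
    print(router.complete("Summarize this ticket.", "free"))
    print(router.complete("Summarize this ticket.", "enterprise"))
```

Because application code only talks to the router, the same mapping could also split traffic between two providers to A/B test models.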
Additionally, we do not believe in a world where one multi-modal model performs all tasks and chains become unnecessary. We believe there will be more specialized foundational models that excel at particular tasks over time. Moreover, we can imagine chains will include a mix of hosted ML APIs and self-hosted fine-tuned models. While FOMO products have started with LLMs, we anticipate that they will move beyond LLMs to other foundational model types like image, video, and audio, and become multi-modal.
Some may be wondering, why can’t you do foundational model orchestration with data or ML orchestration solutions? Today these solutions don’t have purpose-built features and adapters for the foundational model ecosystem. As described above, foundational model pipelines are different from data or ML pipelines. ML orchestration solutions focus on coordinating the ML lifecycle from training to experiment tracking to deployment and serving. FOMO solutions have topic understanding. Alternatives would have to be significantly augmented. While not impossible, it’s not the most natural extension.
There are now a handful of offerings in the foundational model orchestration space, including LangChain, Dust, GPT Index, Fixie.ai, and Cognosis. We heard from users that these offerings accelerate foundational model application development and are being used in production only a few months after their release.
We believe FOMO will be a key piece of the foundational model infrastructure stack. While algorithms are advancing quickly, with people eagerly anticipating GPT-4, tools are emerging just as fast. From evaluation optimization to debugging to observability to low-code platforms, we anticipate a new foundational model toolchain will emerge. If you or someone you know is working on an ML tooling startup or adjacent offering, it would be great to hear from you. Comment below or email me at firstname.lastname@example.org to let us know.