Decoding foundation models: A guide to deploying generative AI in the enterprise.

Tarun Chopra
10 min read · Apr 1, 2024

--

Written by Tarun Chopra and Kate Soule

Granite, Llama, GPT-4, Mixtral, CLIP, DALL-E, Claude, Stable Diffusion, DBRX, Midjourney, Orca, Gemini… the number of new and emerging AI models continues to grow at an exponential pace. I don’t know about you, but even I am finding it hard to keep up with them all.

These models, known as foundation models (large language models, or LLMs, being the most prominent type), represent a paradigm shift in the AI landscape. Rather than being trained for specific tasks, these models provide a versatile base that can be adapted for various applications. That’s because they can apply information from one situation to another without having to be trained on an entirely new dataset.

As I mentioned in my previous blog, this new evolution of AI has led to a dramatic increase in enterprises looking to adopt AI. With more and more companies looking to deploy foundation models across a wider range of mission-critical situations, I feel it’s important for business leaders to grasp the fundamentals of foundation models in order to ensure they’re setting themselves up for success.

Meet one of our AI model experts, Kate Soule

That’s why I’m thrilled to sit down with Kate Soule, Program Director, Generative AI at IBM Research, who is here to help us gain a better understanding of the world of foundation models.

Kate is the Program Director for Generative AI Research at IBM where she helps manage the development of IBM’s Large Language Models and other Generative AI technologies. Prior to her current role at IBM Research, Kate was a leader at the MIT-IBM Watson AI Lab, a joint research partnership between MIT and IBM Research; Kate ran the Lab’s corporate membership program, supporting industry members as they co-invested in the Lab’s AI research technologies. Kate earned her MBA at MIT Sloan and previously worked at Deloitte Consulting as a Senior Consultant within the Strategy and Operations practice. Kate also holds a B.S. in Statistics from Cornell University.

Part 1: Let’s establish some foundation model fundamentals

To start, I’m going to ask Kate some elementary questions that will establish a basic overview of foundation models and their value to the enterprise. From there, we’ll dive deeper into how enterprises can apply foundation models in the real world, before discussing where this new, emerging market is going next.

Tarun: Kate, to kick us off, can you explain what a foundation model is? How is it different from a traditional AI model?

Kate: The term “foundation model” was actually first coined by the Stanford Institute for Human-Centered Artificial Intelligence when they saw that the field of AI was converging on this new operating mode that we know today as generative AI. Unlike foundation models, traditional AI models are trained on very task-specific data and designed to perform a very specific task. Foundation models, in comparison, are trained on a huge amount of unstructured data. We’re talking terabytes to petabytes of data. This gives foundation models the ability to transfer to multiple different tasks and perform multiple different functions, with minimal task-specific, labeled data required. For more on foundation models, you can check out this explainer video.

Tarun: What is the business value of foundation models? What is the ROI?

Kate: There is immense business value to be found with foundation models. One of the biggest advantages of foundation models for enterprises is performance. These models have seen so much data that they are equipped to perform far more complicated tasks than traditional, narrow AI models. The second advantage is productivity gains. Most people using foundation models are using a generic model created by some other model provider, who trained it on lots of unstructured, generic data. This generic starting point can be adapted by the model consumer for new tasks with little to no additional labeled data needed. As anyone who has built and deployed traditional AI applications can attest, cleaning and curating high-quality labeled data is one of the most painful and costly steps in the process. Because foundation models reduce your dependency on labeled data, they free you up to unlock new use cases for automation and value creation that were previously out of reach because the data costs were prohibitive. This is where the true business value and return on investment lies.

Tarun: When it comes to generative AI, are bigger models better than smaller models?

Kate: Despite the common belief that “bigger is always better,” that’s not always the case when it comes to generative AI. There are a number of exciting innovations happening in the AI space that are enabling smaller, yet still powerful, LLMs. So much so, in fact, that a new term, “Small Language Models,” has emerged. These models are much cheaper to run, are faster, and are sometimes specialized to excel in a specific domain. For example, in IBM’s watsonx Code Assistant for Z product, we leverage a COBOL-specialized code model to enable COBOL-to-Java translation tasks. These models are called Small Language Models because they have billions to tens of billions of parameters, which is still quite large but is a fraction of the size of the hundred-billion-plus-parameter models available today. Small Language Models are still a relatively new frontier, but they prove that bigger isn’t always better; the best model selection depends on the task at hand.

Tarun: How do you ensure data lineage and transparency?

Kate: At IBM, we have a streamlined process to help ensure data lineage and transparency. Let me break it down…

1. Dataset Acquisition: We start with data acquisition, which is guided by IBM’s governance experts who help us ensure we only use datasets for training that meet IBM’s standards. We then work with domain experts to find high-quality datasets that are relevant for the tasks our customers care about.

2. Dataset Pre-Processing: After we acquire the data, there is a full team dedicated just to the pre-processing of it. Not only does this include all of the data wrangling and engineering that goes into making this data suitable for training, but it also involves working to identify and filter out toxic information like HAP (hate, abuse, profanity), biased content, and potentially pirated sources.

3. Data Selection: Once all the data is processed, our research team selects an optimal mix of data sources, balancing general knowledge with enterprise-specialized knowledge. For example, 10% of the training data for our granite-13b model was selected from finance and legal domains.
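To make steps 2 and 3 a bit more concrete, here is a minimal, illustrative sketch in Python. The keyword list, domain labels, and target mix are hypothetical placeholders standing in for real HAP classifiers and data-mixing decisions, not IBM’s actual pipeline.

```python
import random

# Hypothetical keyword list standing in for a real HAP (hate, abuse, profanity) classifier.
HAP_KEYWORDS = {"offensive_term_1", "offensive_term_2"}

def passes_hap_filter(text: str) -> bool:
    """Return True if the document contains none of the flagged keywords."""
    tokens = set(text.lower().split())
    return tokens.isdisjoint(HAP_KEYWORDS)

def select_training_mix(documents, target_mix, seed=0):
    """Sample filtered documents so each domain matches its target share of the final mix.

    documents  -- list of dicts like {"text": ..., "domain": "finance_legal"}
    target_mix -- dict like {"general": 0.9, "finance_legal": 0.1}
    """
    rng = random.Random(seed)
    clean = [d for d in documents if passes_hap_filter(d["text"])]
    by_domain = {}
    for doc in clean:
        by_domain.setdefault(doc["domain"], []).append(doc)

    total = len(clean)
    selected = []
    for domain, share in target_mix.items():
        pool = by_domain.get(domain, [])
        n = min(len(pool), int(total * share))
        selected.extend(rng.sample(pool, n))
    return selected
```

In practice, each of these steps is far more involved, but the shape is the same: filter first, then deliberately balance the domains that end up in the training mix.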

From a transparency perspective, we then publish all the details of our data acquisition pipeline and governance process in our Granite technical report, including the fine-grained details of what we trained our models on. If you’re interested, you can find that report here.

Tarun: Can you talk about the difference between experimentation and deployment?

Kate: There is definitely a big difference between AI experimentation and AI deployment. The first difference is the approach. When you’re in experimentation mode, I’d recommend starting with a bigger LLM. It will be more expensive to run, but it will also require minimal prompt engineering, allowing you to move quickly and try a lot of use cases. But when it comes time to actually deploy AI in production, you should be trying to see just how small of a model you can get away with in order to reduce your cost and improve your latency. The second difference is related to governance. With experimentation, you are working within a sandbox with guardrails in place, so you’re operating in a safe environment. However, come deployment, it is more important than ever to have a governed process in place so you can ensure a trusted deployment.

Part 2: Applying foundation models in the real world

We work with thousands of different companies around the world on their generative AI initiatives, and a lot of clients are asking me some tougher questions about how they can move from experimentation to deployment. So, I want to get into the “meat and potatoes” of how business leaders can actually apply foundation models in the real world.

Tarun: What is needed for deployment? Is there a deployment checklist that business leaders should reference?

Kate: To move from experimentation to deployment, you first need to prove you have a viable, cost-effective use case that meets your company’s AI Ethics guidelines and can be deployed in a governed, monitorable fashion. Here are some questions you may want to keep in mind for a potential deployment. This list is far from exhaustive, but if you can’t answer “yes” to each one of these, then chances are you aren’t ready for deployment yet:

1. Do you have a firm understanding of model performance, with clear success metrics and acceptance criteria identified?

2. Have you run small scale, controlled pilots to validate that the model performs as expected when used by real users?

3. Have you studied the impact of model hallucinations, model bias, and other safety issues, and implemented safety measures as necessary to mitigate risk?

4. Does this Generative AI use case comply with your company’s AI Ethics and AI Safety standards and requirements?

5. Have you confirmed that the use case is an acceptable use of the model given the license terms of the underlying model and of any data used to train that model?

6. Do the use case and the selected underlying models and model technology comply with applicable AI regulations?

7. Are there user guardrails in place with this deployment to ensure that the underlying models will only be used in the manner the use case envisions?

8. Have you modeled the costs the deployment is expected to incur, looking at factors like inference demand, expected model input and output lengths, and the resulting inference costs, to confirm that any forecasted savings justify the projected costs?
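On that last point, a back-of-the-envelope cost model can be as simple as the sketch below. The request volumes and per-token prices are made-up placeholders you would replace with your own traffic forecasts and your provider’s actual rates.

```python
def estimate_monthly_inference_cost(
    requests_per_day: int,
    avg_input_tokens: int,
    avg_output_tokens: int,
    price_per_1k_input_tokens: float,
    price_per_1k_output_tokens: float,
) -> float:
    """Rough monthly inference cost estimate for one generative AI use case."""
    daily_cost = requests_per_day * (
        (avg_input_tokens / 1000) * price_per_1k_input_tokens
        + (avg_output_tokens / 1000) * price_per_1k_output_tokens
    )
    return daily_cost * 30  # approximate month

# Example with hypothetical numbers: 50k requests/day, 800 input / 200 output tokens per request.
monthly = estimate_monthly_inference_cost(50_000, 800, 200, 0.0006, 0.0018)
print(f"Estimated monthly inference cost: ${monthly:,.2f}")
```

Even a rough estimate like this makes it easier to compare the projected spend against the savings the use case is supposed to deliver.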

Tarun: How do companies evaluate which models they should use? And, with the space changing so rapidly, how can companies deploy in a way that still offers them the agility to pivot?

Kate: Generative AI models are evolving rapidly. Today, with a “small” 13B-parameter LLM (I use the term “small” lightly here; 13B parameters is still huge), we can perform tasks that were only achievable last year with 100B+ parameters. My advice is to focus on innovating with the data needed to test and customize the model, not the underlying base LLM itself, which will continue to change. Invest in creating a robust test bed of data used to test LLM performance on a suite of use cases. That way, as new models are released, you can quickly evaluate their performance and adapt your deployments as needed. Similarly, models can be customized through a process called “tuning” (also called model alignment), where you use high-quality, labeled data to tailor any model for a given use case. If you focus and invest in data used for tuning, that data can be applied to any base model, smoothing out performance and creating a uniform, tailored experience: models tuned with the same data will behave similarly. As the underlying LLM technology evolves and new models are released, align the new model using the same tuning data so that your customizations, prompts, and general experience using the model carry forward to the new technology with minimal switching costs.
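One lightweight way to act on this advice is to keep a model-agnostic evaluation harness around your own test bed, so any newly released model can be scored the same way as the one you have in production. The `generate` callables and the exact-match metric below are illustrative stand-ins for whatever inference API and evaluation metrics you actually use.

```python
from typing import Callable, Dict, List

def evaluate_models(
    models: Dict[str, Callable[[str], str]],
    test_cases: List[dict],
) -> Dict[str, float]:
    """Score each candidate model against the same enterprise test bed.

    models     -- mapping of model name to a generate(prompt) -> completion callable
    test_cases -- list of dicts like {"prompt": ..., "expected": ...}
    """
    scores = {}
    for name, generate in models.items():
        correct = sum(
            1 for case in test_cases
            if generate(case["prompt"]).strip() == case["expected"].strip()
        )
        scores[name] = correct / len(test_cases)
    return scores

# Example: compare the current deployment against a newly released model.
# Both "models" here are trivial placeholders for real inference calls.
test_bed = [{"prompt": "2 + 2 =", "expected": "4"}]
results = evaluate_models(
    {"current-13b": lambda p: "4", "new-8b": lambda p: "5"},
    test_bed,
)
print(results)
```

Because the harness only depends on your data, swapping in a new base model is a one-line change rather than a re-evaluation project.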

Tarun: What considerations need to be made with foundation models in order to mitigate security risks?

Kate: Even with traditional AI there were a number of risks to consider from both an ethical and societal perspective — such as model bias. With foundation models, these risks still exist, but are now intensified due to the sheer volume of data used in these systems compared to traditional AI models. There is more data used to train these models than any human could sit down and read in a lifetime. That means we need to rely on imperfect, automated measures to clean and sanitize data. In addition to the amplified pre-existing risks, foundation models also present net-new risks. Some of the new risks are:

· Hallucinations. A hallucination is a phenomenon where an LLM perceives patterns or objects that are nonexistent, resulting in an output that is incorrect. In other words, it “hallucinates” the response.

· Personal privacy. Personal or private information can be revealed as part of the prompts that are sent to the model. If foundation models are not properly developed to secure confidential data, they might expose that confidential information in the generated output (a simple mitigation sketch follows this list).

· Copyright infringement. Companies face a similar risk with copyright infringement, which can happen if a foundation model generates content that is too similar to existing work protected by copyright or licensing agreements.
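As one small example of a guardrail for the privacy risk above, here is a minimal, illustrative sketch that scrubs obvious personal identifiers from prompts before they are sent to a model. The regular expressions are simplistic placeholders, not a complete PII-detection solution.

```python
import re

# Simplistic, illustrative patterns; real PII detection needs far broader coverage.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "phone": re.compile(r"\b(?:\+?\d{1,2}[\s.-]?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}\b"),
}

def scrub_prompt(prompt: str) -> str:
    """Replace obvious personal identifiers with placeholder tags before inference."""
    for label, pattern in PII_PATTERNS.items():
        prompt = pattern.sub(f"[{label.upper()}_REDACTED]", prompt)
    return prompt

print(scrub_prompt("Contact Jane at jane.doe@example.com or 555-123-4567 about the claim."))
# -> "Contact Jane at [EMAIL_REDACTED] or [PHONE_REDACTED] about the claim."
```

Guardrails like this sit alongside, not in place of, the governance process Kate described earlier.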

For a deeper dive into the new and amplified risks of foundation models, check out this ebook here.

Part 3: Exploring the future of foundation models

As with any new, emerging technology, we’re seeing the foundation models space change at a rapid pace. So, I want to give some insight into what we’re seeing on the horizon so business leaders can get ahead of the next wave of innovation.

Tarun: Where do you see LLMs going next?

Kate: As I mentioned earlier, we’re already seeing new innovations in the AI space like the emergence of Small Language Models (SLMs). The next wave of innovation we are seeing is small models being combined and orchestrated to work together to deliver the performance of a much larger model. This can be done with a collection of independent models where requests are routed to the optimal model to perform a task. Alternatively, multiple smaller models can be trained to work in concert, fully integrated as a new type of LLM called a “Mixture of Experts.” Mixtral-Instruct is an example of such a model, available today in watsonx.ai. An advantage of this style of model is that it delivers the performance of a larger model but, at inference time, has the latency of a much smaller one. Moving forward, we’re going to see more ecosystems of SLMs and Mixture of Experts models, which can allow for improved performance, greater efficiency, and more customizability for enterprise tasks.
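As a toy illustration of the first pattern Kate describes, routing requests across a collection of independent models, the sketch below picks a model per request based on a simple task label. The model names and routing rule are hypothetical; real routers typically use a classifier or the prompt itself to decide where to send each request.

```python
from typing import Callable, Dict

# Hypothetical registry: task label -> a generate(prompt) -> completion callable.
MODEL_REGISTRY: Dict[str, Callable[[str], str]] = {
    "code": lambda prompt: f"[code-specialist SLM answer to: {prompt}]",
    "summarization": lambda prompt: f"[summarization SLM answer to: {prompt}]",
    "general": lambda prompt: f"[general-purpose LLM answer to: {prompt}]",
}

def route(prompt: str, task: str) -> str:
    """Send the request to a task-specialized small model, falling back to a general model."""
    model = MODEL_REGISTRY.get(task, MODEL_REGISTRY["general"])
    return model(prompt)

print(route("Translate this COBOL paragraph to Java.", "code"))
print(route("What is a foundation model?", "unknown-task"))  # falls back to "general"
```

The appeal of this pattern is that each request only pays the inference cost of the (usually small) model best suited to it.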

Tarun: What role does open source play in the future of Generative AI?

Kate: There are two camps emerging in Generative AI. One camp claims that “one model will rule them all,” meaning that, at the end of the day, all you need is one very large, very performant LLM to run every single task. In contrast is the philosophy that you may not want a sledgehammer to be your tool of choice for every task; instead, you may be better served by having a toolkit of models of different sizes, different specializations, and different functions. I personally think this is the future we should be driving towards, because it gives users ultimate flexibility, including the opportunity to customize their cost-performance tradeoff for the task at hand. The open source community is key to making this toolkit approach succeed. It is only by fostering an active ecosystem of open innovation that Generative AI will be able to reach its full, unbridled potential.

How to get started?

To elaborate on what Kate stated above, we have been working hard to help companies deploy foundation models at scale with trust and transparency. I encourage you to check out our new AI and data platform, watsonx, which is designed to help companies build, scale, and govern AI all within one platform. You should also explore our AI Assistants, like watsonx Orchestrate, which are specifically designed to help companies deploy foundation models and unlock productivity across key workflows.
