A No-Nonsense Approach to Large Language Models for the Enterprise pt. 2

Tools, Methodologies and Use Cases

Bálint Kovács
Starschema Blog
Aug 2 · 13 min read


For businesses, large language models (LLMs) — so-called due to the high number of parameters and the petabytes of online data that they were trained on — are one of the most potent and exciting technologies to emerge in recent memory. However, there is considerable confusion and hype around these solutions, which makes it hard to gauge just what you can reasonably expect from them in actual enterprise-grade use cases — and how to ensure that those expectations are actually met.

This blog post series provides insights on the functionality, applicability and security aspects of LLMs from the perspective of data scientists who work on enterprise-grade AI and ML solutions every day. In the previous post, we established a working understanding of LLMs and looked at how they became “large,” as well as the differences between models, applications and chaining — if you’re new to this series, we strongly suggest that you take the time to read the earlier piece. We also introduced a trio of experiments that we’ve conducted to be able to define some cornerstones and best practices for solving a business problem using an LLM.

In this second part, we’ll discuss in detail the toolset and methodologies that we auditioned for our enterprise-grade LLM experiments to help you better understand how such a selection process works and enable you to better drive a conversation about the potential and pitfalls of LLM implementations.

Photo by Dollar Gill on Unsplash

The Straightforward Choice: OpenAI Services

When setting out to conduct any kind of serious experimentation with LLMs, the most obvious choice is to start with a tool from the developers of ChatGPT. As we concluded in the opening part of this series, OpenAI have the most mature product currently available in terms of response quality and hallucination reduction and have generally established themselves as the primary innovators in the field. So, naturally, we wanted to see if these isolated characteristics and the reputation built on them survive impact with reality and translate to a generally superior suitability for business use cases.

First, a look at the offerings. ChatGPT is already a complete chatbot application, so it’s the model behind it that we need to work with and build on top of. This can be done through the OpenAI API, where different versions of the models are available, including GPT-4, the most recent and powerful version, and GPT-3.5, the model behind the first iteration of ChatGPT. One benefit of using the API is that it provides access to model configuration parameters — when using ChatGPT, we only pass dialogue text, but with the API, we can play with parameters like temperature, which controls the randomness of the model output, and the system role, which sets the context in which the agent will operate. These provide configuration options matching the needs of the enterprise environments and use cases in which the models will be deployed.
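
To make the parameters concrete, here is a minimal sketch of assembling a request in the chat completions format. The payload shape follows OpenAI's documented message structure, but exact client syntax varies between library versions, and the system-message text is our own made-up example — treat this as illustrative, not as a drop-in implementation.

```python
# Sketch of a chat completion request exposing the parameters discussed above.

def build_chat_request(question: str,
                       system_context: str,
                       temperature: float = 0.2,
                       model: str = "gpt-4") -> dict:
    """Assemble a chat completion payload with a system role and temperature."""
    return {
        "model": model,
        "temperature": temperature,  # 0 = near-deterministic, higher = more random
        "messages": [
            # The system message sets the context the agent operates in.
            {"role": "system", "content": system_context},
            {"role": "user", "content": question},
        ],
    }

payload = build_chat_request(
    "Summarize our Q2 sales figures.",
    system_context="You are a helpful analytics assistant for an enterprise BI team.",
)
# The payload would then be sent via the OpenAI client library.
```

Keeping the payload construction separate like this also makes it easy to audit exactly what context and configuration each enterprise use case sends to the model.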


Since the real-world viability of solutions greatly depends on their cost-effectiveness, we need to consider the financial aspects of using OpenAI tools. When looking at pricing options, there are two main factors: context window size and the underlying model. The context window defines the maximum number of tokens that can be passed to the models at once. Since these models have no built-in memory, this also defines the maximum amount of information that they can keep track of. The largest context window size currently available for GPT-4 is 32K tokens, which is roughly equal to the content of 40 A4-size pages. Using the API with this configuration costs $0.12 per 1,000 tokens, which is around the content of a single A4 page. This may sound inexpensive at first but can easily add up when dealing with a large user base typical of an enterprise. Older models like GPT-3.5, with its 16K context window size, cost only $0.004 per 1,000 tokens — a significant cost decrease that goes hand-in-hand with sacrificing some power: the earlier models are more prone to errors in reasoning and are also more biased, which might still make them suitable for applications with more limited scope and/or functionality.
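
A back-of-the-envelope calculation shows how these rates add up. The per-1,000-token figures below are the ones quoted above; note that actual OpenAI pricing also distinguishes prompt tokens from completion tokens, so this is a simplification for rough sizing only, and the usage numbers are hypothetical.

```python
# Rough monthly cost estimate for an enterprise user base.

RATES_PER_1K = {
    "gpt-4-32k": 0.12,    # $ per 1,000 tokens (figure quoted in this post)
    "gpt-3.5-16k": 0.004,
}

def monthly_cost(model: str, tokens_per_query: int,
                 queries_per_user_per_day: int, users: int,
                 workdays: int = 22) -> float:
    """Estimate monthly spend given average usage patterns."""
    total_tokens = tokens_per_query * queries_per_user_per_day * users * workdays
    return total_tokens / 1000 * RATES_PER_1K[model]

# 500 employees, 10 queries a day, ~2,000 tokens per query:
print(round(monthly_cost("gpt-4-32k", 2000, 10, 500), 2))    # 26400.0
print(round(monthly_cost("gpt-3.5-16k", 2000, 10, 500), 2))  # 880.0
```

Even with modest per-query token counts, the 30x price gap between the two models translates into a five-figure monthly difference at enterprise scale.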


When you go with OpenAI, all your data, including business-critical elements, goes through OpenAI’s servers. The company recently enabled users to opt out of having their data used for training future models, although there’s still a 30-day retention period. The sole exception is using OpenAI services within the Azure cloud, but that introduces additional costs in the form of cloud infrastructure usage fees — and these services are only available for corporate accounts approved by OpenAI.

The Wild Card: Open-Source Models

LLaMa, Alpaca, Vicuna, Falcon, Dolly — there’s an abundance of open-source LLMs available, among which these are just the most popular ones. They primarily differ from ChatGPT in terms of model size, as these models have fewer parameters than the approximately 1 trillion that GPT-4 boasts. The largest Falcon model, which currently leads the HuggingFace LLM benchmark, has 40 billion parameters. Luckily, model performance doesn’t scale linearly, and these models are also capable of humanlike conversations, though they’ll be less refined than the ones possible with ChatGPT. However, the open-source alternatives are considerably limited by a smaller context window. This means they can handle less information at once than OpenAI’s models: for reference, GPT-3.5 Turbo has a 16K window size in stark contrast to the new LLaMa 2’s 4K. Other differences in performance come from the dataset used for training as well as modifications to the architecture or training process.
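
The context-window gap matters in practice whenever you try to stuff a document into a prompt. A quick sanity check like the sketch below is a useful first gate; the four-characters-per-token figure is only a common rule of thumb for English text, not an exact tokenizer count, and the window sizes are the ones quoted above.

```python
# Rough check of whether a prompt fits a given model's context window.

WINDOW_SIZES = {"llama-2": 4_000, "gpt-3.5-turbo-16k": 16_000}

def rough_token_count(text: str) -> int:
    """Heuristic: ~4 characters per token for English text."""
    return max(1, len(text) // 4)

def fits_context(text: str, model: str, reserved_for_answer: int = 500) -> bool:
    """Check whether the text, plus room for an answer, fits the window."""
    return rough_token_count(text) + reserved_for_answer <= WINDOW_SIZES[model]

doc = "word " * 8000  # ~40,000 characters, ~10,000 tokens
print(fits_context(doc, "llama-2"))            # False
print(fits_context(doc, "gpt-3.5-turbo-16k"))  # True
```

A document that fits comfortably into GPT-3.5 Turbo's window would need to be chunked or summarized before an open-source 4K-window model could process it.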

A major advantage of using open-source models is the ability to host them locally. This means that your data does not need to go through third-party servers, and the solution will not depend on a third-party API either. However, the task of managing the infrastructure behind these models falls on you, the user, which will inevitably lead to additional setup costs. You can mitigate this to a degree by leveraging a cloud service, but defining the architecture will be your responsibility.

Open-source models also empower you to train them on a custom dataset. OpenAI, by contrast, only provides options for fine-tuning, though since their models are highly generic, fine-tuning is only needed for specific tasks anyway. The downside of training your own LLMs is that it requires significantly more resources and know-how, as training a model is much more challenging than using it for inference. We suggest staying away from this option for a business use case and leaving it to research teams, unless the data you’re working with is very specific and contains sensitive information.

There is also a potential dealbreaker with open-source models: most of them are only available with a non-commercial license. When considering these models, it’s essential to look into the license: for example, the LLaMa model, published by Facebook Research, prohibits commercial use, and therefore all models that are based on it are also limited to non-commercial use. However, the new LLaMa 2 model is under a different license that’s intended for commercial use. Other prominent fully open models like Falcon, Dolly and OpenAssistant can all be used for commercial purposes.

The Gateway: Model-Chaining Tools

Language models are, to a degree, single-task-specific. This task can have a narrow focus, such as code generation, or broader utility, such as maintaining a humanlike dialogue with the user. To enable models to work together in solving complex tasks, we use so-called “model chaining” frameworks. In these frameworks, one language model acts as an agent: it breaks down a complex task into simple subtasks, delegates them to tools (which may also be language models) and continues this in an iterative loop until the base task is solved. This helps us create agents that have humanlike problem-solving capabilities — extended or limited by the tools we equip them with. Such tools typically include web search, SQL database connections, code generators, etc.
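
The agent loop can be sketched in a few lines. In a real framework the next action is chosen by an LLM; here a trivial hand-written planner and stubbed tools stand in for the model so the control flow stays visible — every tool, name and return value below is made up for illustration.

```python
# Toy agent loop: plan a step, call a tool, observe, repeat until solved.

def web_search(query: str) -> str:
    return "Budapest"  # stub: a real tool would call a search API

def sql_query(query: str) -> str:
    return "42"        # stub: a real tool would run SQL against a database

TOOLS = {"search": web_search, "sql": sql_query}

def plan_next_step(task: str, observations: list) -> tuple:
    """Stand-in for the agent LLM: decide the next (tool, input) pair."""
    if not observations:
        return ("search", task)          # first iteration: gather information
    return ("finish", observations[-1])  # enough gathered: produce the answer

def run_agent(task: str, max_steps: int = 5) -> str:
    observations = []
    for _ in range(max_steps):  # iterate until the base task is solved
        tool, tool_input = plan_next_step(task, observations)
        if tool == "finish":
            return tool_input
        observations.append(TOOLS[tool](tool_input))
    return "gave up"

print(run_agent("What is the capital of Hungary?"))  # Budapest
```

Swapping the hand-written planner for an LLM call, and the stubs for real integrations, is essentially what the frameworks below package up for you.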


LangChain is an LLM chaining tool that supports several open-source models besides OpenAI’s GPT models. It already has many tools integrated, such as Google API for web search, SQL wrapper for database queries and a variety of classes for easier API calls that can also handle certain errors. It even enables creating custom tools and agents and features extensive documentation and tutorials to provide a low barrier of entry, as well as intensive community support and advanced guides for further development.


Haystack specializes in creating both general conversational agents and agents that specialize in answering questions about specific documents. Chatbots with memory can be created with only a few lines of code and without having to define custom classes. One major downside of Haystack is that it offers less flexible tooling compared to LangChain and currently only supports models from OpenAI.


JARVIS is a model-chaining agent that integrates language models with HuggingFace machine learning models. It enables the user to simply specify the task they want performed and the input they have, and it will automatically call the appropriate model. For example, if you want to recognize what type of dog is in an image, JARVIS will find the right image classifier and call it without you having to explicitly code it. Unfortunately, JARVIS currently has no integration with other tools such as web search.

The Experiments

These days, people and organizations seem inclined to throw LLMs at any use case, and the question whether we really need LLM integration for a specific application often gets lost in the noise. Still, it remains crucial to consider this at the outset to avoid getting led astray by hype and ending up with ill-suited solutions.

There are cases which have become the “Hello World” of using LLMs, such as creating a custom Q&A assistant on top of your own data using OpenAI embeddings. Companies usually have digital textual documents of their policies, internal know-how, trainings and so on, and OpenAI embeddings can help index such disparate documents in a way that machines can understand and search. And when you put a GPT model on top of such a system, the model can translate your question into this embedding space and search for the most similar element. After reading that element, it produces relevant output in a human-understandable form. This in itself could provide a strong business justification for the existence of LLMs — but we wanted to explore some other use cases to showcase lesser-known aspects of the technology.
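
The retrieval step of that "Hello World" setup boils down to a nearest-neighbor search in embedding space. The sketch below uses tiny hand-made vectors and invented document names to keep the example self-contained; a real system would get its vectors from an embedding model and store far higher-dimensional ones.

```python
# Toy embedding search: find the stored document closest to the question.

import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

# Pretend each policy document has already been embedded.
documents = {
    "vacation policy": [0.9, 0.1, 0.0],
    "expense policy": [0.1, 0.9, 0.2],
    "security training": [0.0, 0.2, 0.9],
}

def most_similar(question_embedding):
    """Return the name of the document nearest to the question embedding."""
    return max(documents,
               key=lambda name: cosine_similarity(documents[name],
                                                  question_embedding))

# A question embedded near the "vacation" direction retrieves that document:
print(most_similar([0.8, 0.2, 0.1]))  # vacation policy
```

The retrieved document's text is then handed to the GPT model alongside the question, which is what lets the model answer from your data instead of its training corpus.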

Database Assistants–Open or Closed

For our first experiment, we settled on an information retrieval use case that, instead of focusing on embeddings, highlights the capabilities of LLMs for searching directly inside databases. For this experiment, we ended up relying on OpenAI models but had also auditioned open-source alternatives. On top of examining the querying capabilities of the models, the experiment demonstrated the main differences between them.

Our database was a PostgreSQL-based source with multiple tables. SQL is a widely used language, and most companies have these types of databases, so we thought it would be interesting to see how an LLM could handle such a data source to answer questions without the need to write SQL queries. It shouldn’t come as a surprise, based on what we’ve discussed in this series so far, that OpenAI models ended up outperforming the free alternatives — although some instances were close.
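
In a setup like this, the schema reaches the model simply by being placed in the prompt ahead of the user's question. The schema and wording below are invented for illustration, not our actual experiment's prompt, but they show the basic shape of a text-to-SQL request.

```python
# Building a text-to-SQL prompt that carries the database schema.

SCHEMA = """
CREATE TABLE employees (id INT, name TEXT, department_id INT);
CREATE TABLE departments (id INT, name TEXT);
"""

def build_sql_prompt(question: str) -> str:
    """Embed the schema and the user's question into a single prompt."""
    return (
        "You are an assistant that writes PostgreSQL queries.\n"
        "Use only the tables below and return a single SQL statement.\n"
        f"Schema:\n{SCHEMA}\n"
        f"Question: {question}\nSQL:"
    )

prompt = build_sql_prompt("How many people work in the Sales department?")
# The model's completion of this prompt is the SQL statement to execute.
```

Because the whole schema must travel with every prompt, context window size directly limits how complex a database this approach can cover — a point that resurfaces in the results below.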

Preview of Results

All models were fed the database schema in the prompt so they would know what fields are in the database. Next, we asked them questions of varying difficulty. The OpenAI models — even the no-longer-cutting-edge GPT-3.5 version — successfully solved every task and handled potentially problematic elements like table joins or synonyms in the initial prompt. Better yet, the results were stable and reproducible. Meanwhile, the open-source models failed to translate more complex requests. Context length was a limiting factor for fitting the composite schema, and they tended to leave out important parts of the prompt and give answers that merely suited the question linguistically but were otherwise completely random.

Knowledge Graphs with LLMs

The second use case combines knowledge graphs with LLMs to unlock multiple benefits. LLMs’ knowledge is represented implicitly in their parameters, which makes it inaccessible and unexplainable to humans. They are trained on a general corpus, so they cannot fully adapt to specific domains by themselves, and they have a knowledge cutoff, which means they cannot answer questions that fall outside their training data. Their output can also contain hallucinations and other kinds of factually incorrect answers. In all of these aspects, knowledge graphs can improve the effectiveness of LLMs. However, building an adequate knowledge graph is not a straightforward process and, currently, there aren’t really any best practices for it.

Therefore, our research involved creating a summary of how to build a knowledge graph from raw textual data with the help of LLMs. After that, with the newly established graph, we took a deep dive into information retrieval methodology and corner cases. Since embeddings are the current way of solving most of the above-mentioned problems, we compared graph-based and embedding-based solutions for a few cases to see how they stack up and whether a graph can really solve the limited information spanning that’s typical of embeddings.
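
The contrast with implicit parametric knowledge is easiest to see in miniature: in a knowledge graph, facts live as inspectable (subject, relation, object) triples that can be queried exactly. The triples and matching logic below are a made-up, minimal stand-in for a real graph database such as Neo4j.

```python
# A minimal in-memory triple store illustrating explicit, queryable knowledge.

triples = [
    ("Starschema", "headquartered_in", "Budapest"),
    ("Budapest", "capital_of", "Hungary"),
    ("Starschema", "industry", "data consulting"),
]

def query(subject=None, relation=None, obj=None):
    """Return all triples matching the given (possibly partial) pattern."""
    return [t for t in triples
            if (subject is None or t[0] == subject)
            and (relation is None or t[1] == relation)
            and (obj is None or t[2] == obj)]

# A graph query an LLM generates ultimately compiles down to lookups like:
print(query(subject="Starschema", relation="headquartered_in"))
# [('Starschema', 'headquartered_in', 'Budapest')]
```

Unlike an embedding lookup, every answer here traces back to a specific stored fact, which is exactly the explainability benefit the experiment set out to measure.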

Preview of Results

Knowledge graphs seem to be a potent extension to LLMs — but they’re just not “there” yet. Even state-of-the-art models struggle to logically understand the schema behind a graph and, sometimes, have a hard time translating a not-so-straightforward question into a graph query. This might be because they haven’t seen many graph databases during training and are unfamiliar with the basic concepts. However, they could write and execute queries — e.g. Cypher queries — and even find reasons behind the returned information in some cases. This suggests that the potential of knowledge graphs we’ve mentioned is realistically attainable, but we still have to teach LLMs how to handle graphs effectively.

LLMs with Tools

The third use case is a smart gardener assistant built in a way that showcases the power of chaining using LangChain. For this experiment, we used OpenAI’s GPT-3.5 model as the brain and equipped it with tools like image recognition, internet search and database querying. The goal of the model is to decide if a plant is happy and healthy and make suggestions for adjustments if something appears to be less than ideal. This use case specifically focused on developing a complex workflow with chaining and using so-called “chain-of-thought” thought-action-observation cycles.

Preview of Results

Creating a workflow using chain-of-thought cycles can be positively goosebump-inducing. During our experiment, our goal was to create a demo that could reach out into the physical world. Defining tools in this ecosystem usually takes just a few lines of code, and implementing the core mechanism behind the chaining using frameworks like LangChain is also pretty straightforward.

The results were nothing short of amazing: the LLM could easily decide which tool it had to use, what steps it would take to solve our requests, how to interpret the results of the tool usage, and how to extract relevant information. Impressively, it could even figure out how to handle certain setbacks: it could refine its own internal prompts when it produced an error in the expected syntax, kept searching for specific information on the internet when initial attempts didn’t produce satisfactory results and knew to stop when information was unavailable before creating hallucinations.

A Clearer View of the LLM Landscape

Building on the early conclusions we shared in the previous post, we can now share some more mature findings that have come out of our experimentation with enterprise-grade applications of LLMs:

  • OpenAI models maintain their advantage at a wide variety of tasks — be it SQL querying, graph querying, tool usage or complex problem-solving — with open-source models consistently lagging behind.
  • There’s no hard-and-fast rule for when to use a self-hosted open-source model over the cloud-based OpenAI models, as it’s heavily dependent on the domain. However, if you wish to apply LLMs in a general use case, it’s a safe bet to go with one of OpenAI’s models as of now, and going open-source only becomes a real alternative when you have a more specific task in mind.
  • Leveraging models in overall less resource-heavy systems is on the horizon, with intensive research efforts directed at fitting the inference of the big models into a simple environment using quantization techniques to control costs and other resource demands. However, there are already great hosting options, notably from Azure and HuggingFace services.
  • LLMs’ general knowledge has been extended to where there are barely any topics about which they don’t have at least some usable information — and the remaining gaps should also get covered soon, thanks to developments in new ways to improve LLMs’ knowledge, like the plugin system or defining built-in functions.
  • LLMs can be exceptionally good at sequential thinking and creating chain-of-thought cycles, despite lacking real reasoning capabilities; the multitude of problems and solutions they juggle also makes them hard for humans to interpret. This gives the illusion that they have some sort of reasoning and should continue to drive conversations about artificial general intelligence.
  • Context window size is an important lingering limitation, especially for open-source models. Precise prompts are usually long, with loads of context information, and applying few-shot prompting examples also requires a high number of tokens. However, there are promising developments towards increasing context window size, such as the Claude model, which has achieved a 100K-token context window.
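
The quantization research mentioned above can be illustrated in miniature: store weights as 8-bit integers plus a single float scale, cutting memory roughly 4x versus float32. Real schemes used for LLM inference (per-channel, 4-bit, etc.) are considerably more involved; this is only the core idea.

```python
# Symmetric int8 quantization sketch: w ≈ q * scale, with q in [-127, 127].

def quantize_int8(weights):
    """Map floats to int8 values plus one shared scale factor."""
    scale = max(abs(w) for w in weights) / 127
    return [round(w / scale) for w in weights], scale

def dequantize(q, scale):
    """Recover approximate float weights from the int8 representation."""
    return [x * scale for x in q]

weights = [0.02, -1.27, 0.64, 0.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
# Restored values are close to, though generally not identical to, the originals:
print(max(abs(a - b) for a, b in zip(weights, restored)) < 0.01)  # True
```

The accuracy cost of this rounding is exactly what the research efforts aim to keep negligible while the memory and compute savings make big models fit into simpler environments.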

In our next and final post in this series, we’ll show you the exact results of our experiments and discuss the most notable experiences we had while working on them. It’s still a work in progress, so if there’s anything you’d like to see addressed next time, or if you have questions about any general or specific LLM-related issues, reach out — we’d love to talk.

About the Authors

Bálint Kovács is a data scientist at Starschema with a background in software development. He has worked in diverse roles and projects, including as a research fellow and assistant lecturer at a top Hungarian university, a deep learning developer at a big multinational company and, currently, as a consultant data scientist. He enjoys diving deep into user data to uncover hidden insights and leverage them to create effective prototypes. Connect with Bálint on LinkedIn.

Szilvia Hodvogner is a data scientist at Starschema with a degree in computer science, specializing in artificial intelligence and computer vision. She has extensive experience working for research-oriented companies, where she worked with predictive models and natural language processing. At Starschema, Szilvia currently works on GIS and NLP projects. Connect with Szilvia on LinkedIn.

Balázs Zempléni is a data scientist at Starschema. He holds a degree in Engineering and specializes in digital image and signal processing. He has worked for multiple banks in various data engineering and business intelligence roles. In recent years, he has focused on developing a natural-language processing solution to improve internal business processes based on textual data. In addition to his work, Balázs is an avid presenter at meetups and conferences. Connect with Balázs on LinkedIn.