A No-Nonsense Approach to Large Language Models for the Enterprise pt. 1

Crucial Context, Experiments and Early Findings

Balázs Zempléni
HCLTech-Starschema Blog
10 min read · Jul 18, 2023


Generative AI seems to be taking over the world, and many executives and the enterprises they steward have a hard time keeping up and understanding where to place their attention. Transformers, diffusion models, large language models (LLMs), generative pre-trained transformers (GPTs) like ChatGPT, Alpaca, LLaMA, PaLM and Bard, and artificial general intelligence are just a few of the terms now routinely tossed around in discussions of the future of business — and beyond.
We’ve entered an era where we can create virtual agents with superhuman capabilities, albeit in an environment that’s filled with hype and noise.

Photo by Jan Huber on Unsplash

Keeping up with such intense development in AI-based solutions is difficult but crucial for organizations that want to leverage these advances to optimize operations, improve resilience and build competitive advantage. Since the most advanced applications haven’t been around for long, they’re generating just as much confusion and concern as excitement — which has especially major implications at enterprise scale. So, in this new series, we’ll focus on LLMs to provide insights into their functionality, applicability and security from the perspective of data scientists who work on enterprise-grade AI and ML solutions every day.

This first post in the series will provide the context necessary for developing a more mature understanding of LLMs and share early lessons from in-house experiments with enterprise-grade models that our data science team has conducted. The second post will introduce in greater depth the tools, techniques and methodologies which we have found useful for our experiments, while the last piece will evaluate the results and draw longer-term conclusions from them.

Historical Context

To understand the current AI boom, we need to go back to 2017 and the debut of transformers by Google researchers. Transformers were a breakthrough because they could process a sentence as a whole, unlike earlier models, which read text strictly sequentially. Whole-sequence processing makes training parallelizable, which is not only highly efficient but also enables these models to “remember” and take into account earlier sections of the text when generating output. Transformers achieve this through their attention mechanism, which captures the relationships between the words of a text.
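To make the attention idea concrete, here’s a minimal sketch of scaled dot-product attention, the transformer’s core operation, in plain NumPy. The function and toy inputs are ours, purely for illustration.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Minimal scaled dot-product attention (Vaswani et al., 2017).

    Q, K, V: (seq_len, d_k) arrays with one row per token.
    """
    d_k = Q.shape[-1]
    # How strongly each query attends to each key, scaled for stability.
    scores = Q @ K.T / np.sqrt(d_k)
    # Row-wise softmax: each row becomes a set of weights summing to 1.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output is a weighted mix of all value vectors, so every token
    # "sees" the whole sequence at once rather than only its predecessors.
    return weights @ V

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                   # toy input: 4 tokens, 8-dim embeddings
out = scaled_dot_product_attention(x, x, x)   # self-attention
print(out.shape)                              # (4, 8)
```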

After the spread of transformers and attention mechanisms, OpenAI began releasing their GPT series. GPT models are trained on a deceptively simple objective: predicting the next token in a sequence as accurately as possible. The newest model in the series is GPT-4, released in 2023, which has been described as displaying “sparks” of artificial general intelligence (AGI) thanks to its general knowledge across an extensive range of topics, human-like cognitive capabilities and ability to handle not only text but visual input as well.

Language Models Go Large

The models mentioned above are referred to as large language models (LLMs) because they have a remarkably high number of parameters and were trained on enormous volumes of text from the internet. Parameters are loosely analogous to neurons in the brain: in essence, they are what enables GPT models to “think.” For reference, an average human brain has approximately 80–100 billion neurons, while ChatGPT — which is built on GPT-3 — has around 175 billion parameters, and GPT-4 reportedly has on the order of 1 trillion. Meanwhile, LLaMA has a version with only 7 billion parameters that can approach GPT-3’s performance on many benchmarks. This helps explain why training, inference and maintenance for such models carry exceptionally high resource costs, notably in terms of computational power, which only a few big companies have the means to provide.

LLMs are outstanding at generative use cases like summarizing documents, writing code and synthesizing knowledge as increasingly adaptive assistants that leverage conversational interfaces.

The leading LLM service is currently offered by OpenAI, whose models return responses in a matter of seconds thanks to the company’s server infrastructure. OpenAI not only hosts ChatGPT but also exposes an API with a wide range of capabilities — embeddings, completions, moderations, etc. — and makes the same models available through the Azure OpenAI Service.
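To illustrate, here’s a minimal sketch of calling three of those endpoint families with the openai Python package as it looked in mid-2023 (the 0.27-era API). The model names are examples, and the key placeholder is obviously not real.

```python
import openai  # pip install openai (the 0.27-era API is shown here)

openai.api_key = "sk-..."  # your API key

# Chat completion: the endpoint behind ChatGPT-style conversations.
chat = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Explain transformers in one sentence."}],
    temperature=0.2,  # lower temperature -> more focused, deterministic output
)
print(chat.choices[0].message.content)

# Embeddings: dense vectors for semantic search and retrieval.
emb = openai.Embedding.create(
    model="text-embedding-ada-002",
    input="enterprise large language models",
)
print(len(emb.data[0].embedding))  # a 1536-dimensional vector

# Moderation: flag unsafe content before acting on it.
mod = openai.Moderation.create(input="Some user-generated text")
print(mod.results[0].flagged)
```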

Promising alternatives to OpenAI are constantly being introduced, and most of them are open-source. Using open-source models greatly expands your opportunities for customization. Hosting a custom LLM yourself is also possible, but it opens up a different can of worms that we’ll not be getting into here. Several prominent open-source models follow closely in the footsteps of OpenAI’s GPTs, and they offer a notable advantage in being much lighter on resources — e.g. training effort and GPU clusters — since they have fewer parameters than the OpenAI models. However, this also makes them comparatively limited in the amount of context they’re able to leverage.
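For a sense of what running an open-source model looks like, here’s a minimal sketch using the Hugging Face transformers library. The checkpoint name is just one example of a roughly 7-billion-parameter open model; substitute whichever you have access to. Note that device_map="auto" also requires the accelerate package.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "tiiuae/falcon-7b-instruct"  # example checkpoint, ~7B parameters

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",       # spread layers across available GPUs
    torch_dtype="auto",      # load in half precision where supported
    trust_remote_code=True,  # some checkpoints ship custom modeling code
)

inputs = tokenizer("What is a large language model?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```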

So far, we’ve talked about models, and it’s important to differentiate a model from an application. GPT-3.5, GPT-4 and LLaMA are models, while ChatGPT is a chatbot application built on top of GPT-3.5 and GPT-4. Or take GitHub Copilot, a code assistant that is also based on GPT models, and Microsoft Bing, which now uses some form of GPT-4. Other prominent examples include BloombergGPT for finance, Jasper for copywriting and Scholarcy for research, with the list continuously growing.

However, new approaches are also emerging where LLMs are used as the communication interface between various systems. These are the so-called chaining and ReAct methods. The idea behind them is to combine different models into a system that can autonomously solve a wide range of tasks. LLMs can logically divide a problem into steps and create connections between the goals of individual substeps and the requirements of subsequent steps. For example, Microsoft’s JARVIS preprocesses a user request with ChatGPT, disassembles it into subtasks and selects an expert model for every task from Hugging Face based on the appropriateness suggested by the models’ descriptions.

Once every task has a model dedicated to solving it, the application executes the plans. The results then once again go to ChatGPT, which integrates the predictions and generates the response for the initial query. Other example frameworks are Haystack and LangChain, both of which are open-source Python packages with expansive integration possibilities. They’re useful for streamlining the creation of complex chaining workflows, incorporating other neural networks, web search, database querying, APIs, etc.
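As a taste of what such a framework handles for you, here’s a minimal sketch using LangChain’s 2023-era agent API: the LLM plans in a ReAct-style loop and decides when to call a tool. The tool selection (web search via SerpAPI plus a calculator) is our example, and the search tool needs its own API key.

```python
from langchain.agents import AgentType, initialize_agent, load_tools
from langchain.llms import OpenAI

llm = OpenAI(temperature=0)  # a deterministic planner works best for agents
tools = load_tools(["serpapi", "llm-math"], llm=llm)  # web search + calculator

# The ReAct-style agent interleaves reasoning steps with tool calls.
agent = initialize_agent(
    tools, llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION, verbose=True
)
agent.run("Which company develops GPT-4, and what is 17% of 2300?")
```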

Putting the Models to Work

Our data science team put together a series of experiments to identify various models’ key strengths, limitations and main resource demands and to gain a better perspective on which model is best suited to a typical enterprise use case. For such experiments, there are a handful of things to consider, such as which model or service to use, whether to build in a locally hosted or cloud-based environment, what the costs and resource needs are, and how to ensure data privacy and security. To give a full picture, we constructed our use cases to highlight different considerations and show the capabilities and limitations of LLMs across different domains, while still enabling comparisons between them.

The main advantages and drawbacks of cloud-based and locally hosted LLMs

For our first experiment, we chose an information retrieval use case. The goal was to examine how effectively different LLMs can perform a direct search on a structured database. SQL databases can be found at most companies, so we wanted to see if these models could answer questions by writing the right queries without any help. We created an SQL database with multiple tables and asked various models questions about the dataset to analyze the differences between OpenAI’s models and the open-source competition.
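To illustrate the pattern (though not our exact setup), here’s a simplified sketch: hand the model the schema and a question, let it write SQL, then execute the query ourselves. The schema, data and prompt wording are invented for this example.

```python
import sqlite3
import openai

# A tiny in-memory database standing in for a real enterprise schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, country TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
    INSERT INTO customers VALUES (1, 'Acme', 'HU'), (2, 'Globex', 'DE');
    INSERT INTO orders VALUES (1, 1, 120.0), (2, 1, 80.0), (3, 2, 200.0);
""")

schema = "customers(id, name, country); orders(id, customer_id, total)"
question = "Which customer has the highest total order value?"

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{
        "role": "user",
        "content": f"Schema: {schema}\nWrite one SQLite query that answers: "
                   f"{question}\nReply with SQL only, no explanation.",
    }],
    temperature=0,
)
sql = response.choices[0].message.content.strip()
print(sql)
print(conn.execute(sql).fetchall())  # always inspect generated SQL before trusting it
```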

The second experiment involved a different kind of structured data: knowledge graphs, an increasingly popular way of storing information. Combining an LLM with a knowledge graph offers many potential benefits, including the ability to greatly mitigate hallucinations and factual errors. We devised this experiment in no small part to find out if this is indeed the case, and also to show how such knowledge graphs can be built for effective use by LLMs.
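As a toy illustration of the grounding idea, the sketch below retrieves the triples relevant to a question from a tiny in-memory “graph” and passes only those facts to the model. Our experiment uses a real graph store and real retrieval; everything here is deliberately simplified.

```python
# Facts stored as (subject, predicate, object) triples.
triples = [
    ("GPT-4", "developed_by", "OpenAI"),
    ("LLaMA", "developed_by", "Meta"),
    ("GPT-4", "released_in", "2023"),
]

def retrieve(question: str):
    """Naive retrieval: keep triples whose subject appears in the question."""
    return [t for t in triples if t[0].lower() in question.lower()]

question = "Who developed GPT-4, and when was it released?"
facts = "\n".join(f"{s} {p} {o}" for s, p, o in retrieve(question))

# Constraining the model to the retrieved facts is what curbs hallucination.
prompt = (f"Answer using ONLY these facts:\n{facts}\n\n"
          f"Question: {question}\n"
          "If the facts are insufficient, say you don't know.")
print(prompt)  # send this prompt to the LLM of your choice
```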

The third experiment is a demonstration of the power of the chaining method introduced earlier. We created a smart gardener application which is able to answer questions about… a plant in our office! OpenAI’s model serves as the application’s core, complemented by tools such as object recognition, web search and database querying. It can decide from an image whether our plant is healthy and suggest adjustments to its environment if needed.

A Few Spoilers

In the subsequent entries of this series, we’ll provide crucial context for understanding the methodology, relevance and outcomes of our experiments, but we’ll close this introductory post by sharing some key early findings.

OpenAI’s models are the ones to beat.

OpenAI’s models are the reigning LLM champions. When used with optimal settings (a well-defined prompt, parameters tuned to your needs), they produce the best results, without too much unnecessary babbling. These models seem to understand user intent better than the open-source competition, are constantly getting better at avoiding hallucinations and, crucially, are getting better at saying “I don’t know” or “I’m not sure” when the retrieved information is formally correct but dubious in its factuality.

On a more critical note, while we found OpenAI’s offerings to be a great starting point for experimenting with models and the technology at their heart, it also became apparent that maintaining an end-to-end solution may prove prohibitively expensive, and their fine-tuning system for customization also has considerable limitations.

The cutting edge will cost you.

As befits their name, large language models are extremely resource-heavy. Even running inference alone requires tens of gigabytes of GPU RAM, and for the largest models this can easily exceed 100 GB. Training such models is impossible without GPU clusters providing hundreds of gigabytes of GPU RAM. This makes the integration of open-source models an expensive and complex undertaking: even if the model itself is free, hosting the architecture locally or using a cloud service will generate significant costs.
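A quick back-of-the-envelope calculation shows where those numbers come from: the weights alone of an N-billion-parameter model occupy roughly 2N GB in 16-bit precision, before activations, the KV cache and framework overhead are counted.

```python
def weights_gib(params_billion: float, bytes_per_param: float) -> float:
    """GiB needed just to hold the model weights."""
    return params_billion * 1e9 * bytes_per_param / 2**30

for params in (7, 70, 175):
    fp16 = weights_gib(params, 2)    # 16-bit floats
    int4 = weights_gib(params, 0.5)  # aggressive 4-bit quantization
    print(f"{params:>4}B params: ~{fp16:.0f} GiB in fp16, ~{int4:.0f} GiB at 4-bit")

# Output:
#    7B params: ~13 GiB in fp16, ~3 GiB at 4-bit
#   70B params: ~130 GiB in fp16, ~33 GiB at 4-bit
#  175B params: ~326 GiB in fp16, ~81 GiB at 4-bit
```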

The cutting edge isn’t the safest place to be.

Privacy and security are critical challenges when adapting LLMs.

If you go with OpenAI, your data will pass through their servers, regardless of its sensitivity. Their general policy is to retain your data for 30 days, with the main exception being Azure OpenAI, where you can opt out. It may sound trivial at this point, but sharing sensitive business data can lead to serious problems, as it’s very difficult to prevent the model from learning from and reusing it. One notable incident involved Samsung employees uploading source code to ChatGPT, potentially adding it to training data that other users could benefit from. The very same could happen with proprietary code, meeting details, emails and other confidential company data. Meanwhile, such privacy threats become a non-issue if you deploy an open-source model within your organization.

On the security side, open-source models make it your responsibility to build the right protections or use customizable cloud solutions, whereas OpenAI enables you to rely on their servers and API services. These have so far proven reassuringly reliable, with only one minor reported incident, in which some users’ chat titles were briefly visible to others.

Stay Tuned

This post has hopefully helped you build a strong foundation to get the most out of what’s to come in the remainder of this series. In our next post, we’ll dive into the experiments laid out above and build on the knowledge presented here to develop a better understanding of the tools and data science methodologies that go into creating viable enterprise-grade LLM solutions.

In the meantime, feel free to reach out if you have questions about any general or specific LLM-related issues — we’d love to talk.

About the Authors

Balázs Zempléni is a data scientist at Starschema. He holds a degree in Engineering and specializes in digital image and signal processing. He has worked for multiple banks in various data engineering and business intelligence roles. In recent years, he has focused on developing a natural-language processing solution to improve internal business processes based on textual data. In addition to his work, Balázs is an avid presenter at meetups and conferences. Connect with Balázs on LinkedIn.

Szilvia Hodvogner is a data scientist at Starschema with a degree in computer science, specializing in artificial intelligence and computer vision. She has extensive experience working for research-oriented companies, where she worked with predictive models and natural language processing. At Starschema, Szilvia currently works on GIS and NLP projects. Connect with Szilvia on LinkedIn.

Bálint Kovács is a data scientist at Starschema with a background in software development. He has worked in diverse roles and projects, including as a research fellow and assistant lecturer at a top Hungarian university, a deep learning developer at a big multinational company and, currently, a consultant data scientist. He enjoys diving deep into user data to uncover hidden insights and leverage them to create effective prototypes. Connect with Bálint on LinkedIn.
