A No-Nonsense Approach to Large Language Models for the Enterprise pt. 3

Final Results and Evaluation

Szilvia Hodvogner
HCLTech-Starschema Blog
23 min read · Aug 21, 2023


Businesses are rushing to adapt language models to better serve their unique use cases, but due to the rapid evolution of the technology and the many variables in enterprise implementations, there’s a shortage of actual best practices for getting the most out of LLMs at this scale. Effective LLM integration requires not just technical know-how but a realistic outlook on where these models can perform at a consistently high level. Our team of data scientists recently conducted a series of experiments to define mature, actionable insights that will help large organizations set accurate expectations regarding LLMs, and meet them.

Photo by Piotr Łaskawski on Unsplash

Our first post in this three-part series provided a historical and technical overview of the emergence of LLMs and discussed the different approaches to using them as building blocks for enterprise-grade solutions. The second post took a closer look at the toolset and methodologies, narrowing the focus to our three experiments to highlight the potential and pitfalls of the different types of LLM implementations, most notably the differences between OpenAI services and open-source alternatives. Both previous articles offered early conclusions and insights on the questions at the heart of this series: just what can you realistically expect from LLMs in an enterprise setting, and how do you ensure you actually achieve it? If this is your entry point to the series, we hope you’ll go back and read the earlier pieces too to get a more complete perspective on our experiments and results.

This third and final post in the series is a hands-on review of our three experiments and offers a detailed look at the outcomes and what they might mean for your business. For each experiment, we’ll summarize why it made sense in the first place, give a detailed summary of the setup and the test phases and, lastly, evaluate the results using examples taken from the tests, backing up our early takeaways listed in the previous posts with evidence.

Experiment #1: Database Assistant–Open or Closed

Our first experiment had two main goals, the first of which was to create an information retrieval system that showcases the SQL capabilities of LLMs. SQL databases are everywhere, which has made writing SQL queries a valuable skill–and an “assistant” that can do some or all of this work for users would have great productivity-boosting potential. The second goal was to try out multiple LLMs and weigh up the candidates–open-source and the leading OpenAI models–based on various aspects.

The Setup

To make measurements that would enable us to get usable insights from the experiments, we needed to create an SQL database. We chose a publicly available dataset with multiple tables containing dummy HR data for a small imaginary business–employees’ personal data, their job titles, etc. This image shows the database schema with the connections:

The database schema used in the Database Assistant experiment

After initializing this database in a PostgreSQL environment, we made it available through secure API calls so we could easily interact with it from code. The next step was to search for candidate LLMs, and we tried to select models with different advantages. Due to the lightning-fast development in the field, this wasn’t as easy as it seems: new models were being introduced on a nearly daily basis, so we had to draw the line somewhere. For the purpose of this use case and experiment, we set two constraints on the models: they had to be relatively small and/or easy to integrate. Our reasoning was that, if this use case were to go into production, the domain would be small enough that we would expect it to run in a limited-resource environment and avoid a high setup cost. The next table lists the models we ended up using:

At this point, the only thing left was to construct questions — prompts for the model — of increasing difficulty. First, we created questions that were very straightforward and didn’t require joining tables to be able to provide the answer. Next, we asked questions that were still recognized as easy but included some synonyms. To solve the hardest questions, the model needed to figure out not only the rephrasing but, for example, joining and other advanced SQL techniques as well.

For the test environment, we created a testing script that could easily adapt to different model interfaces. Using OpenAI’s models is straightforward enough, but for the open-source models, HuggingFace provided a coding framework which ensured unified and painless integration. When needed, we used a GPU, but only a smaller one like a Tesla T4.
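A testing script of this kind can be sketched as follows. This is a minimal illustration and not the script we actually used: it assumes that every model, whether reached through an API or loaded locally, can be wrapped in a plain function that maps a prompt string to a completion string. The names `Question`, `run_suite` and `fake_model` are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

# Assumption: any model can be wrapped as "prompt in, completion out".
ModelFn = Callable[[str], str]


@dataclass
class Question:
    text: str
    difficulty: str  # e.g. "easy", "medium" or "hard"


def run_suite(
    models: Dict[str, ModelFn],
    questions: List[Question],
    build_prompt: Callable[[str], str],
) -> Dict[str, List[dict]]:
    """Send every question to every model and collect the raw answers."""
    results: Dict[str, List[dict]] = {}
    for name, generate in models.items():
        rows = []
        for question in questions:
            answer: Optional[str] = None
            error: Optional[str] = None
            try:
                answer = generate(build_prompt(question.text))
            except Exception as exc:  # keep going if one model call fails
                error = str(exc)
            rows.append(
                {
                    "question": question.text,
                    "difficulty": question.difficulty,
                    "answer": answer,
                    "error": error,
                }
            )
        results[name] = rows
    return results


def fake_model(prompt: str) -> str:
    """Stand-in for a real model so the sketch runs without any service."""
    return "SELECT COUNT(*) FROM employees;"


results = run_suite(
    {"fake": fake_model},
    [Question("How many employees are there?", "easy")],
    build_prompt=lambda q: f"Translate to SQL: {q}",
)
print(results["fake"][0]["answer"])
```

Swapping in a real model then only means writing one small wrapper function per interface, while the questions and the result format stay the same.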

The Prompt and Context

All of the models got the same structured prompts and context. In this case, we set their identity to a helpful assistant that translates questions into SQL queries. We also provided them with the database schema, so they knew what fields they could use to answer the user’s questions. However, the models’ context sizes differ widely, and this has an important practical consequence. The Dolly and GPT4All models have only a 2K context length, which means they could not handle larger database schemas. LLaMa-2 has a 4K context length, and OpenAI comes out on top with its sizes in the 8K-32K range. Our example database is considered tiny but, for reference, our prompt cost an average of 200 tokens, not including the question.
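To illustrate how the schema eats into the context window, here is a small sketch of a prompt builder with a rough budget check. The schema text is an abbreviated, illustrative version of the HR database, the prompt wording is not our exact prompt, and the "four characters per token" rule is only a crude heuristic assumed for this example; a real check would use each model's own tokenizer.

```python
# Abbreviated, illustrative schema (not the full database).
SCHEMA = """employees(employee_id, first_name, last_name, hire_date, job_id, salary, manager_id, department_id)
jobs(job_id, job_title)
departments(department_id, department_name, location_id)
locations(location_id, city)"""

# Context lengths as discussed in the text (8K taken as the lower OpenAI bound).
CONTEXT_LIMITS = {"dolly-v2": 2048, "gpt4all": 2048, "llama-2": 4096, "gpt-4": 8192}


def build_prompt(schema: str, question: str) -> str:
    """Combine the assistant identity, the schema and the user question."""
    return (
        "You are a helpful assistant that translates questions into SQL queries.\n"
        f"Database schema:\n{schema}\n"
        f"Question: {question}\n"
        "SQL:"
    )


def estimate_tokens(text: str) -> int:
    """Very rough estimate: assume about four characters per token."""
    return max(1, len(text) // 4)


def fits(model: str, prompt: str, reserve_for_answer: int = 256) -> bool:
    """Check that the prompt plus room for the answer fits the context."""
    return estimate_tokens(prompt) + reserve_for_answer <= CONTEXT_LIMITS[model]


prompt = build_prompt(SCHEMA, "How many employees are there?")
print(estimate_tokens(prompt), fits("dolly-v2", prompt))
```

With a schema of hundreds of tables, the same check would fail for the 2K models long before the question itself is added.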

OpenAI Easily Takes the Lead

The OpenAI models passed every test with flying colors: basic SQL statements were a piece of cake for them, and they could easily map verbs and nouns from the question to the SQL query based on the schema. They handled keys and JOIN statements exceptionally well. Although they would not always produce the best-performing SQL, they always provided a reliable answer. With no major difference between GPT-3.5 and GPT-4 in our tests, it’s safe to say the older GPT-3.5 is also capable of solving these kinds of problems. Not only did the models perform excellently, they were also stable, reproducible and very fast when used through the OpenAI APIs.

Learning from the Tiniest — Dolly

The smallest model participating in the experiment, Dolly-V2, ended up struggling the most and failing most of the tests.

Dolly had a basic knowledge of SQL–for example, when we asked “how many employees are there?” or “how many employees have the last name King?”, it could generate a correct functional query. It could even list the names of employees who joined before 1990, etc. This shows that, for its size, Dolly has decent functionality.

However, the hallucinations and heavily “generative attitude” (repeatedly erring on the side of linguistic, rather than substantive, appropriateness) made for a mixed picture overall. For example, when asked “how long has Alexander Hunold been working at the company?” it simply generated a number for the answer: “Alexander Hunold has been working at the company for 10 years.” The same thing happened when we asked “what is the role of Alexander Hunold?”. Every complex query left us with either some hallucination or syntax errors. In these complex cases, the main problems stemmed from the misuse of the keys for joining and from misunderstanding the questions.

Interestingly, Dolly results are sometimes hard to reproduce, and the model has a strangely oscillating performance. One example that highlights this happened when we asked it to “list each department with the number of employees working for that department. Return the department name and the headcount.” This is a relatively complex statement, so we expected the model to fail it. However, much to our surprise, Dolly provided the following SQL query:

SELECT d.department_name, COUNT(e.employee_id)
FROM employees e
JOIN jobs j ON j.job_id = e.job_id
JOIN departments d ON d.department_id = j.department_id
JOIN locations l ON l.location_id = d.department_id
GROUP BY d.department_name

This is very close to the ideal answer, since it only messes up two things: in the second JOIN clause, the key is wrong, since only the employees (e in the query) table has department_id, the jobs table does not. And there’s a similar issue with the third JOIN, as it should map l.location_id with d.location_id, and not department_id. If corrected, this query would lead to the solution.
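The two fixes described above can be checked with a small, self-contained sketch. It uses Python's built-in sqlite3 module instead of PostgreSQL, and a tiny made-up dataset that only mirrors the columns needed here, so the rows are assumptions for illustration. The query is Dolly's output with only the two JOIN keys corrected; the column alias and ORDER BY are added to make the output readable and deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Tiny stand-in for the HR schema; all rows are dummy values for this sketch.
conn.executescript(
    """
    CREATE TABLE locations (location_id INTEGER PRIMARY KEY, city TEXT);
    CREATE TABLE departments (
        department_id INTEGER PRIMARY KEY, department_name TEXT, location_id INTEGER);
    CREATE TABLE jobs (job_id TEXT PRIMARY KEY, job_title TEXT);
    CREATE TABLE employees (
        employee_id INTEGER PRIMARY KEY, last_name TEXT, job_id TEXT, department_id INTEGER);

    INSERT INTO locations VALUES (1, 'City A'), (2, 'City B');
    INSERT INTO departments VALUES (10, 'IT', 1), (20, 'Sales', 2);
    INSERT INTO jobs VALUES ('IT_PROG', 'Programmer'), ('SA_REP', 'Sales Representative');
    INSERT INTO employees VALUES
        (100, 'Hunold', 'IT_PROG', 10),
        (101, 'King', 'IT_PROG', 10),
        (102, 'Abel', 'SA_REP', 20);
    """
)

CORRECTED_QUERY = """
SELECT d.department_name, COUNT(e.employee_id) AS headcount
FROM employees e
JOIN jobs j ON j.job_id = e.job_id
JOIN departments d ON d.department_id = e.department_id  -- fix 1: key comes from employees
JOIN locations l ON l.location_id = d.location_id        -- fix 2: join on location_id
GROUP BY d.department_name
ORDER BY d.department_name
"""

rows = conn.execute(CORRECTED_QUERY).fetchall()
print(rows)
```

Note that the joins to jobs and locations are not needed to answer the question at all; they are kept only to stay close to the generated query.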

A Well-Optimized Big Boy — GPT4All

Despite its larger size (13 billion parameters vs. Dolly’s 2.8 billion), GPT4All’s resource demand was comparable to Dolly’s. And while it could perform basic tasks like counting rows with small WHERE clauses or filtering employees based on their hire dates, it kept messing up the syntax. Interestingly, it was unable to comprehend the parameter we provided in the prompt: for example, when asked to count employees who have the letter t in their name, the model searched for the capital letter T and returned only a subset of the correct set. Other times, it didn’t use the right schema and employed fields from the overall schema but with the wrong table. Another frequently occurring problem was the use of unnecessary clauses: multiple times, it added a “WHERE employee_id = ‘1234567890’” for no reason at all or tried to work with the manager_id field even though it wasn’t mentioned in the question.

The Fresh-Faced Challenger of OpenAI Models — LLaMa-2

For LLaMa-2, our main focus was the smallest version, LLaMa-2-7B. LLaMa-2 is a new competitor that builds on the original LLaMa by Meta. The main improvements are the bigger context size, at 4K tokens, and some performance gains. It has three versions, differentiated by the number of parameters: the 7-billion “7B”, 13-billion “13B” and 70-billion (you guessed it) “70B”.

Up until the medium-level questions, LLaMa-2-7B performed great: it could create basic SQL statements with WHERE, COUNT, LIKE, GROUP BY and DISTINCT clauses. However, it had some persistent problems, including misuse of the keys when joining and creating syntactically correct but functionally incorrect statements. For example, for the task of “who has the max salary in IT and how much bigger is their salary compared to the minimum IT salary? Return the first name, last name and the difference as a number”, it encountered the following recurring errors:

  • Misuse of keys, joining by departments.job_id (non-existent)
  • IT is in WHERE clause with LIKE searching in jobs.job_title (not using the departments table)
  • IT filtering is completely ignored
  • Forgot the GROUP BY clause
  • Wrong calculation of differences (it calculated against each employee’s current salary, so the results were always zero)
  • Random semicolons in the middle of statements
  • Using Oracle SQL statements, not PostgreSQL
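For reference, one possible way to answer this question is sketched below. It is an assumption-laden illustration rather than the reference solution from our tests: it runs on Python's built-in sqlite3 module with a handful of made-up rows, filters through the departments table and compares the top salary against the department-wide minimum instead of each employee's own salary. The same statement should also be valid in PostgreSQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Minimal stand-in tables; names and salaries are dummy values for this sketch.
conn.executescript(
    """
    CREATE TABLE departments (department_id INTEGER PRIMARY KEY, department_name TEXT);
    CREATE TABLE employees (
        employee_id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT,
        salary INTEGER, department_id INTEGER);

    INSERT INTO departments VALUES (10, 'IT'), (20, 'Sales');
    INSERT INTO employees VALUES
        (100, 'Alexander', 'Hunold', 9000, 10),
        (101, 'Person', 'B', 6000, 10),
        (102, 'Person', 'C', 4200, 10),
        (103, 'Person', 'D', 11000, 20);
    """
)

QUERY = """
SELECT e.first_name,
       e.last_name,
       e.salary - (
           -- minimum salary across the whole IT department
           SELECT MIN(e2.salary)
           FROM employees e2
           JOIN departments d2 ON d2.department_id = e2.department_id
           WHERE d2.department_name = 'IT'
       ) AS difference
FROM employees e
JOIN departments d ON d.department_id = e.department_id
WHERE d.department_name = 'IT'
ORDER BY e.salary DESC
LIMIT 1
"""

top_earner = conn.execute(QUERY).fetchone()
print(top_earner)
```

If several employees share the top salary, LIMIT 1 returns only one of them, so a production query would need an explicit rule for ties.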

If we extended the conversation into a chat and gave feedback on the error, LLaMa could resolve some of the problems: replacing the Oracle statement with a function that also works in PostgreSQL, adding the forgotten GROUP BY and filtering for the IT department. However, in some cases, like the above-mentioned example, it could not figure out the solution even with these little nudges.

We tested most of the questions against the bigger version, LLaMa-2-13B, which failed to produce much improvement: it handled the same cases just as well as LLaMa-2-7B and produced the same kinds of errors in the same cases. The reasons were also similar, as it sometimes failed to create the right JOIN clause and also missed, for example, the filtering for only the specific department. Interestingly, in some cases, the 13-billion-parameter version generated specific IDs that had nothing to do with the questions.

Both LLaMa models were used through HuggingFace services. The responses were stable and relatively fast, although the bigger model usually took twice as long to produce an answer as the 7-billion-parameter version. The models also performed better if we kept the chat history limited; the ideal was to handle one query at a time.

Experiment #2: Knowledge Graph

As mentioned in the second post in the series, there’s great potential in integrating knowledge graphs with LLMs. These models still face serious, currently inherent limitations, be it a lack of explainability, knowledge cutoffs, training and fine-tuning problems due to immense resource needs, questionable reasoning capabilities, hallucinations and inconsistency in factuality. Knowledge graphs could provide a way to overcome many of these limitations, so our main question with our second experiment was whether the technology was mature enough for the task–and if not, how long before it could be.

Knowledge graphs are an exciting development, with loads of research around the topic, but not many of them are actually deployed at businesses. This is because building a knowledge graph that could effectively serve multiple use cases is anything but straightforward. Best practices are yet to be established, which leads to our first notable takeaway on the subject: since KGs are uncommon, LLMs also have limited knowledge of them. This means that the models are aware of the concept and the possible tools and use cases, but we cannot expect the level of performance seen, for example, with the SQL use cases. Then there’s the question of fine-tuning with KGs: LLMs are based on prompt-completion pairs, while a graph has nodes and edges, which makes aligning the two through fine-tuning both essential and challenging.

The Game Plan

As the first step, we tried to build a KG with an LLM. Given raw textual data and some prompt engineering, the model could build a decent graph out of the provided information. We then examined how well it could use the graph to extend its prompts with relevant information. For this experiment, we used Neo4j as the graph database management system. It’s easy to get started with and provides hosting for databases with secure API connections. Neo4j uses Cypher for querying, and support for Cypher was recently added to LangChain. LangChain is a framework for developing language model-powered applications: it provides easy integration with multiple LLMs, tools to adapt technologies like SQL- and graph-querying modules with syntax-checking and automated direct connection to sources, and support for creating chaining solutions. We’ve already established in the previous post that LangChain proved to be an excellent tool during our experiments, and we used it here to create our test cases. This time around we used OpenAI only, since their models had consistently displayed the best performance, and since building a KG in itself was a challenging task, we didn’t think there was much new learning to be done by making other models struggle.

Can an LLM Build a Knowledge Graph?

It sure can: OpenAI models proved absolutely capable of creating KGs — it all just depended on finding the right prompt.

Our goal was to transform information into a structured format suitable for creating a knowledge graph. We achieved this by experimenting with prompts, refining them iteratively. We cast the model in the role of a researcher tasked with building knowledge graphs from narratives. We employed a one-shot prompting approach, providing it with a short story along with the desired output format. This format mandated listing all entities identified in the text followed by their relationships.

Once we had the entities and their connections, we needed to translate this into specific queries for KG creation. We devised another prompt: the model received the previous output as input and was instructed to generate Cypher statements. We emphasized the importance of avoiding duplicate nodes and edges. This output was monitored using LangChain’s GraphCypherQAChain class, which performed basic checks before executing the queries on our Neo4j instance.

We tested GPT-3.5 and GPT-4 on constructing a graph based on a two-paragraph excerpt from the Wikipedia plot summary of Charlie Chaplin’s Modern Times. Both models encountered challenges in categorizing certain nodes, potentially leading to issues with Neo4j’s labeling system (name, type, and label parameters). However, aside from this classification hurdle, the generated graphs accurately reflected the story’s elements.

GPT-3.5’s Attempt at a Knowledge Graph

The GPT-3.5 model generated a knowledge graph (KG) representing the provided content. The graph uses color-coded nodes to differentiate between node types: blue for characters (personas), purple for objects and green for locations. The KG captures most of the story’s significant relationships. However, the model’s earlier type classification issue persists, resulting in duplicate nodes. For instance, “police” and “Ellen” appear as both characters and objects. Additionally, the model occasionally struggles with synonyms, creating separate nodes for “policeman” and “police.” Despite these limitations, the overall structure reflects the story’s logic effectively.

Knowledge graph produced by GPT-3.5

GPT-4’s Attempt at a Knowledge Graph

GPT-4 created a more condensed graph than its smaller sibling. It didn’t produce duplicate nodes: there’s only one “police” and one “Ellen”, although it still has a separate “policeman” node. Interestingly, GPT-4 constructed fewer nodes, which still cover the core of the story, but with less detail: the whole brick-throwing part and the job with the mechanic are missing. However, the node types are classified better: now the “café” is a location and every persona is classified correctly. When creating the edges, the GPT-4 model uses richer descriptions like WORK_AS instead of IS and the more precise LIVES_IN instead of LOCATED_IN.

Knowledge graph produced by GPT-4

Can an LLM Actually Use a Knowledge Graph?

Unfortunately, the models we tested could not use the KGs. The key issue is the lack of a working comprehension of the returned information of the graph queries–for an LLM without any context, this clearly proved too tall an order, even though both models had been fed the schema and existing relations of the graph.

Basically, GPT-3.5 can create great Cypher queries but has a hard time understanding them. During our experiment, the syntax and clauses were nearly always correct and functional, which was cause for optimism–but the model ultimately proved unable to generalize properly. When asked “Tramp worked as what?”, it translated the relation to “WORK_AS” — however in the graph, this connection is represented by an “IS”, so the model couldn’t find anything. In some cases, GPT-3.5 needed to be told which nodes or relationships it should use, which made for a cumbersome process. On the bright side, it could use the basic graph algorithms and, for example, find the shortest path between nodes:

MATCH (p:person {name: 'workers'}), (o:object {name: 'café'}),
path = shortestPath((p)-[*]-(o))
RETURN path

GPT-4 fared much better at understanding the query results and was able to comprehend the underlying logic of the relations. It could also align the verbs and nouns in the question with the existing nodes and edges without making too many mistakes in types and labels. However, in a few cases, GPT-4 still messed up the answer, even though the generated query was perfect, and the returned information was clear. At times, it would mess up the direction of the relationships. In general, GPT-4 showed great talent for creating Cypher queries, even complicated ones. Multiple aggregations or using different grouping hierarchies tend to lead to problems, but this is understandable due to the model’s lack of information on the topic.
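Mismatches like the WORK_AS versus IS case can be caught before a query ever reaches the database. The following is a rough sketch of such a guard, not a feature of LangChain or Neo4j: it pulls relationship types out of a generated Cypher string with a regular expression and compares them with the types that exist in the graph. The function names are hypothetical, and the pattern is deliberately simple, so it would not cope with every valid Cypher construct.

```python
import re
from typing import List, Set


def relationship_types(cypher: str) -> Set[str]:
    """Extract relationship types from patterns like -[:TYPE]-> or -[r:A|B]-."""
    found: Set[str] = set()
    for body in re.findall(r"\[([^\]]*)\]", cypher):
        if ":" not in body:
            continue  # e.g. [*] or [r] carry no relationship type
        type_part = body.split(":", 1)[1].strip()
        # Cut off property maps, variable-length markers and trailing text.
        type_part = re.split(r"[{*\s]", type_part, maxsplit=1)[0]
        for name in type_part.split("|"):
            name = name.strip(": ")
            if name:
                found.add(name)
    return found


def unknown_relationships(cypher: str, allowed: Set[str]) -> List[str]:
    """Return the relationship types used in the query but absent from the graph."""
    return sorted(relationship_types(cypher) - allowed)


graph_types = {"IS", "LIVES_IN", "LOCATED_IN"}
generated = "MATCH (p:person {name: 'Tramp'})-[:WORK_AS]->(j) RETURN j.name"
print(unknown_relationships(generated, graph_types))
```

A non-empty result could be fed back to the model as an error message, together with the list of valid types, instead of running a query that is guaranteed to return nothing.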

A reliable and, in terms of enterprise applicability, promising application for KGs is to leverage them as information “pointers” in a retrieval use case. The KG should be constructed to not only contain the entities and connections from the raw text but link text chunks to the nodes as well. This way, the models could leverage the graph’s architecture and reference the relations between raw text and graph in a completely different way than, for example, embedding vector-based search. This mitigates the problem of information that spans multiple chunks, since the models can now link text parts that were “distant” with respect to their embedding vectors but have a deep connection that is needed for answering.
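The pointer idea can be illustrated with a small in-memory structure. This is only a sketch under simplifying assumptions: a real setup would keep the graph in a database such as Neo4j, and the class name `ChunkGraph`, the placeholder chunk texts and the relations below are invented for the example. Each node stores the IDs of the text chunks that mention it, and retrieval follows graph edges to collect chunks that belong together even if their embeddings are far apart.

```python
from collections import defaultdict, deque
from typing import Dict, Iterable, List, Set


class ChunkGraph:
    """Entities as nodes, with pointers to the text chunks that mention them."""

    def __init__(self) -> None:
        self.edges: Dict[str, Set[str]] = defaultdict(set)
        self.chunks_by_node: Dict[str, Set[int]] = defaultdict(set)
        self.chunks: Dict[int, str] = {}

    def add_chunk(self, chunk_id: int, text: str, entities: Iterable[str]) -> None:
        self.chunks[chunk_id] = text
        for entity in entities:
            self.chunks_by_node[entity].add(chunk_id)

    def add_relation(self, a: str, b: str) -> None:
        # Treated as undirected here; direction does not matter for lookup.
        self.edges[a].add(b)
        self.edges[b].add(a)

    def retrieve(self, entity: str, hops: int = 1) -> List[str]:
        """Return chunks linked to the entity and to nodes within `hops` edges."""
        seen = {entity}
        chunk_ids = set(self.chunks_by_node.get(entity, ()))
        frontier = deque([(entity, 0)])
        while frontier:
            node, depth = frontier.popleft()
            if depth == hops:
                continue
            for neighbour in self.edges.get(node, ()):
                if neighbour not in seen:
                    seen.add(neighbour)
                    chunk_ids |= self.chunks_by_node.get(neighbour, set())
                    frontier.append((neighbour, depth + 1))
        return [self.chunks[i] for i in sorted(chunk_ids)]


graph = ChunkGraph()
graph.add_chunk(0, "placeholder chunk about the Tramp", ["Tramp"])
graph.add_chunk(1, "placeholder chunk about Ellen", ["Ellen"])
graph.add_chunk(2, "placeholder chunk about the café", ["café"])
graph.add_relation("Tramp", "Ellen")
graph.add_relation("Ellen", "café")
print(graph.retrieve("Tramp", hops=1))
```

The retrieved chunks can then be placed in the prompt alongside, or instead of, the results of an embedding-based search.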

Overall, despite the current limitations, LLMs leveraging KGs are a promising path for development. There’s a lot of untapped potential here that, a few years down the road, could lead to significant performance and reliability gains.

Experiment #3: Smart Gardener

With our last experiment, we wanted to reach out into the physical world and create a solution which could interact with its environment via sensors and actuators. For this, we started to develop a so-called Smart Gardener — a conversational assistant that can look after office plants. The results we’ll be sharing concern the first phase of the experiment, which focused on the development of the software and included the entire LLM-based workflow. We might cover the physical/hardware-centric part in a later post too–definitely let us know in a comment if you’re interested.

Our target requirements were that the application would

  • have access to camera recordings of office plants;
  • recognize the type of the plant based on camera recordings;
  • perform health checks on plants;
  • identify plants’ conditions by referencing them against a database and suggest adjustments if needed;
  • display historical variations in sensor data; and
  • have actuators to water the plants.

Core Architecture

For the architecture of the experiment, we used chaining, also known as a chain-of-thought framework, whereby a high-performing LLM serves as the “brain” of the whole process, and we connected additional tools for extended functionality. We paid special attention to defining the interfaces of certain tools so that we could easily adapt the system to use cases other than smart gardening. LangChain once again proved essential in creating the chaining flows which this entire use case is built on.

We equipped the LLM — OpenAI’s GPT-3.5 as the “brain” — with the following tools for the experiment:

  • Plant identifier tool: a small neural network hosted on HuggingFace that can recognise a set of plants. The expected input is an image, and the output is the classification result: the plant type.
  • Internet search tool: a DuckDuckGo interface that can search for specific topics. We told the LLM to use it to “find the ideal conditions for plants”. We also added some restrictions: the LLM cannot use the search tool without knowing the plant type and must ask specific questions. Repeated searches were allowed. These tweaks were intended to limit hallucinations.
  • Database search tool: a connection to a SQLite database that contained historical sensor information of plants: temperature, humidity, light conditions, pH level and soil moisture. The hardware is still under construction at the time this post comes out, so we created synthetic data for this tool.
  • Condition comparison tool: another GPT-3.5 model under the hood. It compares the actual conditions of plants against their ideal conditions and makes decisions and recommendations.
  • Plot chart tool: a third GPT-3.5 model, with a specific prompt to create a Python chart for the Streamlit application. It identifies what type of data it should use for the chart and has access to the database search tool to query the information for the required plot.
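The idea of giving every tool the same narrow interface can be sketched in plain Python. This is a simplified illustration and not LangChain's API or our actual implementation: the `Tool` and `Toolbox` classes are hypothetical, and the two registered tools are stubs that return canned strings instead of calling a classifier or a database.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Tool:
    name: str
    description: str            # shown to the LLM so it can pick a tool
    run: Callable[[str], str]   # uniform interface: text in, text out


class Toolbox:
    def __init__(self) -> None:
        self._tools: Dict[str, Tool] = {}

    def register(self, tool: Tool) -> None:
        self._tools[tool.name] = tool

    def describe(self) -> str:
        """Render the tool list for inclusion in the LLM prompt."""
        return "\n".join(f"{t.name}: {t.description}" for t in self._tools.values())

    def call(self, name: str, tool_input: str) -> str:
        """Run a tool and always return text, so errors become observations."""
        tool = self._tools.get(name)
        if tool is None:
            return f"Unknown tool: {name}"
        try:
            return tool.run(tool_input)
        except Exception as exc:
            return f"Tool {name} failed: {exc}"


toolbox = Toolbox()
# Stub tools with canned answers; real ones would call a model or a database.
toolbox.register(
    Tool("plant_identifier", "Identify the plant type from a photo.",
         lambda image_path: "Philodendron 'Imperial Green'")
)
toolbox.register(
    Tool("DatabaseSearch", "Fetch recent sensor measurements for a pot.",
         lambda query: "temperature=12.09")
)

print(toolbox.describe())
print(toolbox.call("plant_identifier", "pot10.jpg"))
```

Because every tool takes and returns text, replacing the plant tools with tools for another domain would not require changes to the surrounding loop.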

The whole application was bundled up in a containerized Streamlit application to make it easy to set up anywhere, and it also features a web-based interface.

User Interface

The web page provides two fields for input: one for uploading a photo of your plant and one where you can ask questions. Since the hardware components are yet to be installed, the plants are identified by their pot number, which means you need to provide the pot ID so it can fetch the right set of data from the database. We added a default prompt to show an example of a precise question.

The UI for our Smart Gardener experiment

Can an LLM Make Our Plants Happier?

Smart Gardener’s first run was a lasting experience. The code logs the whole chain-of-thought process of the LLM, so we could track every step the model took. These were moments when we could see the sparks of AGI. The LLM perfectly sequenced the exact steps it needed to perform the whole process. It took errors in stride: when it messed up its own syntax, it could easily rephrase its internal prompts and try again. If a tool didn’t provide the required information, the model moved on without inserting hallucinations into the process and simply stated that a piece was missing. Here’s some of the LLM’s output:

Thought: I need to identify the plant in pot 10 and then check its most recent condition measurements. I can use the plant_identifier tool to identify the plant and the DatabaseSearch tool to get the most recent measurements.
Action: plant_identifier
Observation: Philodendron ‘Imperial Green’
Thought: I have identified the plant in pot 10 as Philodendron ‘Imperial Green’. Now I need to use the DatabaseSearch tool to get the most recent condition measurements for this plant.
Action: DatabaseSearch
Action Input: Pot number 10

From the log, we could see the first two cycles. The model realizes that it first needs to identify the plant from the photo, then get the corresponding data. It uses the right tool with the expected input formats and handles the output as well by understanding the results.
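Logs in this Thought / Action / Action Input / Observation format are easy to turn into structured steps for later analysis. The parser below is a small sketch, assuming one field per line as in the excerpt above; it is not part of LangChain, and the function name `parse_steps` is hypothetical. The sample log is a shortened version of the excerpt.

```python
import re
from typing import Dict, List

# "Action Input" is listed before "Action" for readability; the colon
# requirement already prevents "Action" from matching "Action Input:".
FIELD = re.compile(r"^(Thought|Action Input|Action|Observation):\s*(.*)$")


def parse_steps(log: str) -> List[Dict[str, str]]:
    """Group log lines into steps; each new Thought starts a new step."""
    steps: List[Dict[str, str]] = []
    current: Dict[str, str] = {}
    for line in log.splitlines():
        match = FIELD.match(line.strip())
        if not match:
            continue  # lines without a known prefix are ignored in this sketch
        key, value = match.group(1), match.group(2)
        if key == "Thought" and current:
            steps.append(current)
            current = {}
        current[key] = value
    if current:
        steps.append(current)
    return steps


SAMPLE_LOG = """Thought: I need to identify the plant in pot 10.
Action: plant_identifier
Observation: Philodendron 'Imperial Green'
Thought: Now I need the most recent condition measurements.
Action: DatabaseSearch
Action Input: Pot number 10"""

steps = parse_steps(SAMPLE_LOG)
for step in steps:
    print(step.get("Action"), "|", step.get("Action Input"))
```

Structured steps like these make it easier to compare runs, which helps when investigating reproducibility issues.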

Thought: I have the most recent condition measurements for the Philodendron ‘Imperial Green’ in pot 10. Now I need to search the internet for the ideal range for each metric and provide suggestions for adjustments based on the recorded values.
Action: InternetSearch
Action Input: Ideal temperature range for Philodendron ‘Imperial Green’
Observation: Your Philodendron Imperial Green will thrive in temperatures between 75°F and 85°F (23°C to 29°C) ….
Thought: I have found the ideal temperature range for Philodendron ‘Imperial Green’ to be between 65°F and 85°F (18°C to 29°C). The recorded temperature for pot 10 is 12.094217640511111, which is significantly lower than the ideal range. I suggest adjusting the temperature to be within the ideal range by providing additional heat or moving the plant to a warmer location.

After finding information from the database, the model starts to search for the ideal conditions — the above example shows the first instance, with the model checking the internet for the temperature information. As you can see, it was able to retrieve the relevant details–without any unnecessary noise — from the Internet and already provided a basic recommendation.
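For numerical metrics, the comparison behind such a recommendation boils down to a range check. In our setup this step was handled by a GPT-3.5 model, so the function below is only a deterministic sketch of the same logic, with a hypothetical name. The example values come from the log above: a recorded temperature of about 12.09 against the 18°C to 29°C range stated in the model's thought.

```python
from typing import Tuple


def compare_to_range(value: float, ideal: Tuple[float, float]) -> str:
    """Classify a reading as "low", "ok" or "high" relative to an ideal range."""
    low, high = ideal
    if low > high:
        raise ValueError("ideal range must be given as (low, high)")
    if value < low:
        return "low"
    if value > high:
        return "high"
    return "ok"


# Values taken from the log excerpt; the unit is assumed to be Celsius.
status = compare_to_range(12.094217640511111, (18.0, 29.0))
print(status)
```

A check like this only works when both sides are on the same numerical scale, which is not the case for every metric the application tracks.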

The model takes each measurement — retrieved from the SQL database — and sees if it can find the right information. Then, it constructs the final answer and creates a chart of the temperature changes of the last few records of the plant. All of these go back to the user interface showing the results.

Output from our Smart Gardener

There are a few weak points in the process: the reproducibility, handling of dates and resolving differences in categorical or numerical values. Even though the model parameters are set to be as consistent as possible, the results for the same photo and same prompts differ. Sometimes, the model can find information for more features, while in other cases, for reasons currently unknown, it provides a final output based on just a few observations, without going through all measurements. When querying the database, it sometimes uses the wrong dates and fails to return the most recent measurements, despite being asked specifically and having set the parameters accordingly.

Another issue arises when trying to compare measurements on different scales. For example, soil moisture information is currently stored in a categorical variable: dry, moist, wet. Meanwhile, the information found on the internet and scraped by the model suggests different frequencies for watering, so the LLM is currently unable to decide whether a plant has been watered enough or not. Adding the hardware components should solve this issue and enable the model to create consistently accurate recommendations and take action when necessary.

Despite the challenges listed above, our experiment made clear that chaining technology can greatly extend the utility of LLMs. Thanks to the resulting adaptability, it could be dropped into a variety of workflows to replace manual tasks: choosing the right neural networks for predictive tasks, classification, regression, searching the web, connecting to an internal database to retrieve knowledge, etc.

This high level of adaptability for LLMs opens up a new perspective on application development, making it possible to, for example, automate the dashboard creation process. Tools such as anomaly detection modules, packages to create charts, database connections to retrieve the data for the charts and internet connectivity for competitor analysis will enable users to define their needs in natural language and leave it to the LLM to orchestrate the entire process, from creating the underlying logic to connecting the individual pieces and, finally, providing quick and effective output.

Final Thoughts

Throughout this series, we provided early conclusions as they surfaced at various points during our experiments, and as we put in place the final pieces in this project, our experiences only reinforced and expanded on our early findings. We’ll cover the main takeaways below, but the long and short of it is that large language models have shown immense potential for both simplifying and elevating enterprise use cases, with even the tasks that are currently beyond leading models’ abilities appearing well within reach in the foreseeable future.

OpenAI’s models have consistently delivered outstanding results in various tasks, standing out for their ability to provide concise and relevant responses without unnecessary babbling. Their remarkable understanding of user intents sets them apart from open-source models and, beyond their reduced frequency of hallucinations, their effectiveness in high-stakes communication is further underscored by their improved capability to express uncertainty by simply saying “I don’t know” or “I am not sure.”

Resource-intensive inference remains a challenge and the primary cost of impressive performance. Even smaller models demand GPU resources, while larger models like LLaMa, Alpaca, and GPT-3.5 require GPU clusters. This makes the integration of open-source models a complex and expensive task: although the models themselves are free, hosting them locally or in a cloud environment will still generate considerable costs. However, optimization techniques are becoming more widespread, raising the prospect of running these models on more accessible commodity hardware in the future.

Context window size has been a significant limitation for open-source models. This parameter defines how much information the models can handle at once and how much input they can take as the prompt. Precise prompts are usually long, with loads of context information, and only a few models can handle such inputs. While developments are underway to increase this parameter, OpenAI’s models, particularly GPT-4 with its 32K context size, maintain a significant advantage over, for example, the LLaMa-2-based leaders of the HuggingFace benchmark, such as Platypus2, whose context length is 4096 tokens. Innovations like Anthropic’s Claude model, which achieves a context length of 100K tokens, show promise in further expanding these boundaries.

From a privacy standpoint, using OpenAI’s models involves data passing through their servers, potentially raising concerns about sensitive information. Instances like the Samsung source code incident underline the need for caution. In contrast, employing open-source models within organizational boundaries can mitigate such privacy risks.

Security considerations also come into play when depending on external API services like those provided by OpenAI. Open-source models allow organizations to take control of their security measures and adapt them according to their needs, e.g. via customizable cloud solutions.

Distinguishing between commercial and non-commercial use is crucial–always check not only the license of the model but the dependencies and model weights as well, and apply the strictest ones. For example, Vicuna is officially for commercial use, but you need LLaMa weights for initialization–and LLaMa is for non-commercial use, so in this case your Vicuna will fall under non-commercial use as well.

LangChain’s wider set of pre-implemented tools, custom tool creation and support for open-source models place it above HayStack, but both frameworks offer an impressive set of features for enterprises to create custom-tailored solutions, with the caveat that with HayStack your model options are currently limited to OpenAI. The chaining functionality opens new horizons in application development, where users can define a problem with natural language and provide the necessary tools but leave it to an AI to actually build the complex solution.

And lastly, don’t sleep on knowledge graphs–they may still need some research to live up to their potential, but when used as pointers to information chunks, they are already effective enough to outperform embeddings in certain cases.

There’s a good chance that the near future will prove decisive in setting a direction for large language models for years to come and see the technology–and the systems built around it–elevated to primary sources of competitive advantage. Most of the building blocks are already firmly in place, and with the current pace of development and adoption, the early growing pains will likely prove fleeting.

If you need help implementing or improving an LLM-based solution to improve critical decision-making capabilities–or just make sure your office plants are happy and healthy–we at Starschema are here to leverage our expertise in promoting advanced analytics at Fortune 500 companies and beyond to help you make the right choices and get the most value out of them. Get in touch — we’d love to talk to you.

About the Authors

Szilvia Hodvogner is a data scientist at Starschema with a degree in computer science, specializing in artificial intelligence and computer vision. She has extensive experience working for research-oriented companies, where she worked with predictive models and natural language processing. At Starschema, Szilvia currently works on GIS and NLP projects. Connect with Szilvia on LinkedIn.

Balázs Zempléni is a data scientist at Starschema. He holds a degree in Engineering and specializes in digital image and signal processing. He has worked for multiple banks in various data engineering and business intelligence roles. In recent years, he has focused on developing a natural-language processing solution to improve internal business processes based on textual data. In addition to his work, Balázs is an avid presenter at meetups and conferences. Connect with Balázs on LinkedIn.

Bálint Kovács is a data scientist at Starschema with a background in software development. He has worked in diverse roles and on a range of projects, including as a research fellow and assistant lecturer at a top Hungarian university, a deep learning developer at a big multinational company and, currently, as a consultant data scientist. He enjoys diving deep into user data to uncover hidden insights and leverage them to create effective prototypes. Connect with Bálint on LinkedIn.
