Large Language Models and where to apply them

Corpy&Co. Tech Blog
11 min read · Feb 28, 2023


Author: Iordan Iordanov, CTO at Corpy&Co., Inc.

I remember vividly when I discovered Akinator during my university days. A friend showed me this surprisingly simple website that would always manage to somehow guess the character you were thinking about. Just by asking a few questions, it would identify your pick, whether a real person or an imaginary character. At the time, we were all baffled by the “intelligence” of Akinator, and we kept speculating about how it could guess our thoughts so accurately.

To this day, I am still baffled. Of course, I have my speculations, but I will keep my thoughts to myself in this post. Just for the fun of it, I pitted Akinator against ChatGPT. I thought of the character of Dr. John A. Zoidberg from Futurama (because, well, why not Zoidberg?). Akinator guessed my pick in 22 questions. When I tried asking ChatGPT, it guessed that my character was Superman in exactly 0 questions. I must say, that was… unexpected, and a little bit overwhelming.

I tried giving ChatGPT another chance, but I gave up at Question 28. Frankly, it didn’t look like we were getting anywhere. In the end, it looks like Akinator has nothing to fear! It is still the top-ranking character-guessing game!

Of course, we shouldn’t judge the utility of ChatGPT by its ability to beat Akinator, right? After all, ChatGPT is such a complicated model, and it can generate so many things, that it would be unfair to judge it by its ability to guess characters. Of course.

I am joking, of course, but only partly. Let’s talk about it a little bit more seriously. What are Large Language Models (LLMs), and why is there so much buzz about ChatGPT lately?

The introduction of Transformers by Google in 2017 brought about a completely new wave of research in Natural Language Processing (NLP). The Transformer became the de facto architecture for virtually all state-of-the-art language models. Through successive refinements, it was only a matter of time before the third version of the Generative Pre-trained Transformer (GPT-3) was announced in 2020 by OpenAI, the company that released ChatGPT a couple of months ago. In fact, ChatGPT is based on a fine-tuned version of GPT-3.5.

The release of GPT-3 itself was a milestone — the model can generate sequences of text virtually indistinguishable from human-written text. Many were worried about the potential implications of such a model: is it going to make writers redundant? Is it going to replace software developers? How about fraud? How about plagiarism?

The same thing is happening nowadays with ChatGPT. Concerns are being raised again about the same points, and even some new ones. The opinions of specialists range from “it’s an apocalypse” to “it’s the most significant breakthrough of the last 10 years”, to “it’s complete nonsense”. So, who is right?

This article is yet another opinion on the matter, but I will try to make it as solidly rooted in reality as I can. Let me start by stating clearly that all the opinions I express here are my own, so if you have any issue with what I say, please take it up with me, and do feel free to contact me directly; I will be happy to discuss further.

Before we go into the details, let me state clearly my opinion on ChatGPT and all similar models that are sprouting all over the place. ChatGPT and other LLMs are very useful and impressive tools, but they should be seen for what they are — Large Language Models, the emphasis being on the Language part. I hope that this statement will become clearer by the end of the article.

So let’s start by understanding how an LLM works. Independently of the underlying technology or training regime, LLMs all have the same purpose: to generate text that “looks natural”. In other words, the basic purpose of an LLM is to learn how to generate sequences of words that can be understood by a human and that carry a specific meaning. We will need to delve deeper into the concept of “meaning”, but for now let’s stay at this level of detail.

The way that an LLM is trained is the following: you get an enormous amount of data, convert it to tokens, and show it to the model over and over again until it figures out the relationships between various token combinations in different contexts. I am grossly oversimplifying the process, of course, but details are available in the GPT-3 paper linked above. This is essentially how any AI model is trained nowadays, really, not only LLMs. The data must be processed properly, features must be engineered carefully, and the training hyperparameters must be studied meticulously, but essentially the process is the same. The biggest difference between an LLM and “standard” models is that the LLM is trained on a shamelessly exorbitant amount of data so that it has a rich enough corpus from which to learn. Such models are trained with the mindset that “more is better”, although I must admit that there are efforts being made to also keep the quality of the data at reasonable levels.
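To make the idea of next-token prediction a little more concrete, here is a deliberately tiny, dependency-free sketch. It is not how GPT-3 is actually implemented (that takes a Transformer with billions of parameters and a massive training pipeline); a simple bigram count over an invented toy corpus stands in for the whole process, and the whitespace “tokenizer” is a crude stand-in as well.

```python
# A toy sketch of the core idea behind LLM training: learn to predict
# the next token given the previous context. Here, counting which token
# follows which stands in for the real optimization process.
from collections import Counter, defaultdict
import random

corpus = "today is sunny . today is cloudy . today is sunny and warm ."
tokens = corpus.split()  # crude whitespace "tokenizer" for illustration

# "Training": count how often each token follows each other token.
next_counts = defaultdict(Counter)
for prev, nxt in zip(tokens, tokens[1:]):
    next_counts[prev][nxt] += 1

def generate(start: str, length: int = 6) -> str:
    """Sample a continuation, token by token, from the learned statistics."""
    out = [start]
    for _ in range(length):
        candidates = next_counts.get(out[-1])
        if not candidates:
            break
        toks, counts = zip(*candidates.items())
        out.append(random.choices(toks, weights=counts)[0])
    return " ".join(out)

print(generate("today"))  # e.g. "today is sunny . today is"
```

The output “looks natural” only because it recombines patterns seen during training, which is, at a vastly smaller scale, the same reason an LLM’s output does.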

In other words, LLMs are so good at generating natural-looking text simply because they have seen so much of it. These models have learned the expected structure of a sentence and can reproduce it with extremely high fidelity because they have been trained on billions of tokens over an enormous number of training steps. Let’s keep this as the first important point of our discussion.

Next, let’s focus on the evaluation of an LLM: how do we judge that an LLM “performs well”? In the case of GPT-3 in particular (which, as you may have noticed, is a favorite in this article), one of the evaluations is done by humans. The evaluators are presented with several news articles; some are written by humans, and some are generated by GPT-3. The evaluators’ task is to state which articles were written by a human and which by an AI. If the AI manages to convince the evaluators, then it’s a success. GPT-3 managed to convince the human evaluators about half of the time, on average. Not bad, if you ask me.
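As a side note, “half of the time” is the interesting number here: with two choices, roughly 50% accuracy is exactly what you would get by guessing at random, which is what “indistinguishable” means in practice. The quick simulation below uses made-up numbers, not data from the paper, just to show what chance-level detection looks like.

```python
# Simulate evaluators who genuinely cannot tell human text from AI text:
# their labels are effectively random, so accuracy hovers around 50%.
# All numbers here are invented for illustration.
import random

random.seed(0)
n_articles = 1000
truth = [random.choice(["human", "ai"]) for _ in range(n_articles)]
guess = [random.choice(["human", "ai"]) for _ in range(n_articles)]

accuracy = sum(t == g for t, g in zip(truth, guess)) / n_articles
print(f"detection accuracy: {accuracy:.2%}")  # close to 50%, i.e. a coin flip
```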

So, the takeaway here is that GPT-3 can generate natural-looking text that can convince people that it was written by other people. GPT-3.5 and ChatGPT, of course, take this ability to a whole new level. It has gotten to the point where abstracts generated by ChatGPT can even fool scientists. But let’s not get ahead of ourselves — for now, let’s focus on the fact that ChatGPT can generate text that looks as if it was written by a human.

Now, let’s go a little bit deeper. This is the point where we talk about the concept of “meaning” that I mentioned above. This might get a little bit confusing, but let’s consider what precisely the meaning of a simple phrase is, for instance “today is sunny”. If this phrase is presented to a human, what would they understand? For example, a human might infer that: right now it is daytime, as opposed to night; there are no clouds in the sky; it is almost certainly not raining or snowing right now; the temperature could be high enough to go out without a jacket, depending on the season and location; and so on, with many other factual conclusions or speculations that may be drawn from such a simple sentence.

On the other hand, what could a language model infer from this simple information? Of course, the model can verify that the structure of the sentence is correct, and the semantics of the phrase can be matched to patterns that the model has seen during its training. The model can infer certain correlations between “sunny” and other phrases it has seen that contain synonyms or antonyms. However, no matter how many phrases the model sees, it will never be able to comprehend the meaning of this simple phrase in the same sense that a human would. The model will be able to capture the expected context, and it will be able to place a generated phrase within the same context, but there is no deeper meaning behind what the model produces.
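To illustrate what such a “correlation” might look like, here is a toy sketch of the kind of vector similarity a model can capture between words that appear in similar contexts. The vectors below are invented for the example; real models learn representations like these from billions of tokens, and nothing in them says whether today is actually sunny.

```python
# Words used in similar contexts end up with similar vectors, so the model
# can tell that "sunny" is related to "bright" and not to "banana".
# The embeddings here are made up; real ones are learned from data.
import math

toy_embeddings = {
    "sunny":  [0.9, 0.1, 0.0],
    "bright": [0.8, 0.2, 0.1],
    "rainy":  [0.1, 0.9, 0.0],
    "banana": [0.0, 0.1, 0.9],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

for word in ("bright", "rainy", "banana"):
    print(word, round(cosine(toy_embeddings["sunny"], toy_embeddings[word]), 2))
# "bright" scores highest: the model "knows" the words are related,
# but similarity says nothing about whether the statement is true.
```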

In other words, the model has no way to verify whether “today is sunny” or not, in reality. It could be possible to execute a web query to a weather platform or something similar to verify the validity of this statement, of course; however, the isolated model itself has no way to check whether this simple statement is true or false without external input. In the same way, the model cannot infer a new, unknown situation without such external input. The GPT family in particular relies on a technique called zero-shot learning, which lets the model handle tasks and notions it was never explicitly trained on, but the manner in which the model picks up new notions this way must be extremely well controlled. Otherwise, it is possible that the model learns factually incorrect information by assimilating conflicting statements from different sources. The language model does not have built-in fact-checking, you see.
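As a minimal sketch of that “external input” idea, the snippet below checks a generated claim against an outside source before treating it as fact. The weather lookup is a hypothetical placeholder, not a real API; the point is only that the verification happens outside the language model.

```python
# A minimal sketch of verifying a generated claim against an external source.
# fetch_current_weather is a hypothetical stub standing in for a real service.
def fetch_current_weather(city: str) -> str:
    """Hypothetical call to a weather service; replace with a real client."""
    return "sunny"  # stubbed response for the sake of the example

def verify_claim(claim: str, city: str) -> bool:
    """Naive check: does the claimed condition appear in the observed one?"""
    observed = fetch_current_weather(city)
    return observed in claim.lower()

print(verify_claim("Today is sunny in Tokyo.", "Tokyo"))  # True with the stub
```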

This consideration points to the conclusion that the information generated by ChatGPT should not be treated as fact by default. The model generates sentences that are stated in an assertive and factual way, which can be very misleading. However, in the same way that ChatGPT needs external input to verify the statements that are given to it, humans also need external input to verify the validity of the statements produced by ChatGPT. Simply said, ChatGPT must be fact-checked, because its notion of “meaning” does not necessarily include “truthfulness”. In the context of LLMs, a “meaningful” output is one that appears natural, as we have reiterated earlier. The consideration about truthfulness is somewhat secondary, and it will perhaps improve over the next releases, but currently there are still improvements to be made.

So, we have stated three points until now:
1. ChatGPT is trained on an overwhelmingly large amount of text written by humans;
2. ChatGPT is able to generate text that looks convincing to humans: they can believe the text was written by another human;
3. ChatGPT produces assertive, factual-sounding statements without fact-checking; therefore, ChatGPT needs to be fact-checked, because its output is sometimes blatantly false, and sometimes even contradicts its own previous outputs.

That looks like a recipe, doesn’t it? You take a huge amount of data, you throw it at a model that has 175 billion parameters, and you get a chatbot that has a general idea of how it is expected to reply in a manner that will sound natural to you. However, there is no way to control how the chatbot learns the information, so you hope that it will be able to figure it out by itself; just in case, you need to make sure it’s not spouting nonsense by fact-checking every generated sentence.

I fully realize that my comments and analysis until now have a condescending tone, and believe me, that is deliberate. Why, you ask? Because ChatGPT, like any other LLM, is not the solution to all the problems that we have; in fact, it introduces new problems. One of the extremely interesting points that I find very telling about ChatGPT is that it is a verification of the Law of Large Numbers (funnily enough, LLN): “the average of the results obtained from a large number of trials should be close to the expected value, and tends to become closer to the expected value as more trials are performed”. If you feed a huge amount of data to a huge model over a very long period of time, the model is bound to remember something, right? Well, it works, apparently! How does it work? Unfortunately, with a model that has 175 BILLION parameters, we cannot really know.
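For what it’s worth, the LLN itself is easy to see numerically. Here is a quick sketch with die rolls, where the sample mean drifts toward the expected value of 3.5 as the number of trials grows; the analogy to training on ever more data is the author’s joke, not a formal claim.

```python
# Law of Large Numbers, numerically: the mean of n fair die rolls
# approaches the expected value 3.5 as n grows.
import random

random.seed(42)
expected = 3.5
for n in (10, 1_000, 100_000):
    rolls = [random.randint(1, 6) for _ in range(n)]
    print(f"n={n:>7}: mean={sum(rolls) / n:.3f} (expected {expected})")
```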

As I stated in the beginning, I do believe that ChatGPT is a very useful and impressive tool. It certainly has its utility in performing menial tasks that might be non-trivial, but its abilities cover only the domain of text, at least for now. It has no knowledge of other topics, and its understanding is limited by the corpus on which it has been trained. Even within that enormous corpus, there is still the limitation of how the model parses, processes, and “understands” the information. If humans implement the learning rules, they will either implement them in an incomplete manner (because we do not have a full understanding of “understanding”), or they will over-complicate them, which will lead to inevitable mistakes and conflicting definitions; if humans do not implement the rules, then we have no way to verify that the rules the model learns are compatible with our own. We can only rely on proxy methods, which have their own shortcomings, and that is a topic for discussion in itself.

I am excited about ChatGPT, I really am. I do not believe it is going to change the fabric of reality, but I do believe it is an interesting and exciting technological achievement. The public release of ChatGPT has triggered an avalanche of new AI-powered solutions by Microsoft, Google, Meta, and who knows who else will follow soon. Of course, all these large companies had been working on their own LLMs in the background, but the release of ChatGPT seems to have forced them to unleash their chatbots in an unprecedented war of the assistants, if you wish. Is it a good thing? It is certainly an interesting thing, and I am happy to have a new chatbot to keep me company during my morning commute. On a serious note, we can only speculate for now; however, the fact that the direction of mainstream research and technological advancement nowadays is to build increasingly larger models trained on increasingly larger datasets is, for me, alarming.

I fervently support Andrew Ng’s stance on the data-centric approach to AI. Instead of pumping heaps of new data into increasingly larger models, it is much more meaningful to curate a small, carefully selected dataset that represents the reality that the model is expected to learn. Iteratively improve the quality of the data, and learn its impact on the model’s performance: gain control over the model through the data it is fed.
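Schematically, that iteration loop might look like the sketch below. Every function here is a hypothetical placeholder standing in for your own training, evaluation, and labeling tooling; the point is the shape of the loop, not any particular implementation.

```python
# A schematic data-centric loop: keep the model architecture fixed and
# repeatedly improve the dataset instead. All functions are placeholders.
from typing import List, Tuple

Example = Tuple[str, str]  # (input text, label)

def train(dataset: List[Example]):
    """Placeholder: train a fixed model architecture on the current data."""
    return {"num_examples": len(dataset)}  # stand-in for a real model

def evaluate(model, dataset: List[Example]) -> float:
    """Placeholder: return a validation metric for the trained model."""
    return 0.0  # plug in real validation accuracy here

def find_suspect_examples(model, dataset: List[Example]) -> List[Example]:
    """Placeholder: surface likely label errors or ambiguous inputs."""
    return dataset[:1]  # pretend the first example looks suspicious

def relabel(examples: List[Example]) -> List[Example]:
    """Placeholder: send suspect examples back to human annotators."""
    return list(examples)

dataset: List[Example] = [("today is sunny", "weather"), ("buy now!", "spam")]
for iteration in range(3):
    model = train(dataset)
    score = evaluate(model, dataset)
    suspects = find_suspect_examples(model, dataset)
    dataset = [ex for ex in dataset if ex not in suspects] + relabel(suspects)
    print(f"iteration {iteration}: metric={score:.2f}, dataset size={len(dataset)}")
```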

Large language models are fun to play with, but let’s be honest: they are not ready to be used in mission-critical applications, not by a long shot. Because of this simple truth, there is no danger of human experts being replaced by ChatGPT in the near future, at least in my humble opinion. The real test for a “human-level AI” is whether the model can convince stakeholders and experts that it is reliable, robust, consistent, and responsible enough to handle mission-critical tasks. This criterion applies not only to ChatGPT, but to any large model. As long as the answer to that question is “no”, we must continue on the journey to Quality-Assurable AI. From the experience I have accumulated while working on CONFIDE with my team, I can confidently say that convincing stakeholders of the validity of an AI solution is the most important and significant test for the model. One of the most crucial questions that we face at Corpy is: why does the model give wrong answers? If we cannot answer this question clearly, our clients will never be able to take us seriously, of course.

With the release of ChatGPT, multiple questions are being raised by experts in the field of AI and beyond. Fairness, ethics, legal matters: there are many ramifications of the appearance of widely used LLMs. As always, the inevitable question looms: should we build something just because we can? Or should we deepen our understanding incrementally and carefully build a meaningful and traceable pipeline that will be useful in practice?

With the vision to enable Mission-Critical AI, we at Corpy have made our choice. There isn’t really a choice to be made. The only way to integrate AI-based solutions into tasks that require a high level of responsiveness and reliability is to follow robust development methods grounded in sound theoretical and experimental results. We cannot do guesswork. This is our responsibility and commitment.
