Tackling Language Model Limitations: A Dual Approach to Compositionality and Modular Reasoning (part 1)

David Gutsch
11 min read · Oct 4, 2023


pondering AI agent

Welcome back to a new series where we will introduce the inherent shortcomings of language models and explore some of the methods by which we can augment models to overcome them. Today you might hear from your colleagues, or on the interwebs, that Large Language Models (LLMs) are the new hammer you need for every nail, or that they are the slippery slope that will bring about the singularity and enslave us to our computer overlords. I don’t have any interest in toying with either of these ideas, but I do want to talk about some deficiencies of current language models, as well as ways we can build systems that work around those inherent problems. I will be breaking down two insightful and pragmatic articles that you can apply to your LLM systems to make them more reliable and less likely to hallucinate, without needing to dive into the scientific literature yourself [1] [2].

Overview

The problems

  • Lack of access to current and proprietary information
  • Lack of reasoning ability
  • Failure to compose final answers when the relevant facts do not appear together in the training data
  • Data missing from the training set confuses the LM, often leading to hallucinations

The Solutions

  • Modular Extensibility: giving our language models tools so they can work with data on which they have not been trained
  • Elicitive Prompting: giving a language model a set of steps that improves its ability to compose a correct answer to the input question

Constraint of Machine Learning

The fundamental constraint in machine learning, one that language models are not immune to, is how models behave on inputs that fall outside of their training data. In the general case, if you give any model an input it was not trained on, it will not be able to give you a sufficient answer. Language models do something worse: they hallucinate an answer that does not exist. You only need to think for a few seconds about the fields where these tools are being applied before you can come up with a use case where this would produce catastrophic results. All of the problems we will address today are merely different manifestations of this fundamental constraint of machine learning. What is fascinating is the new ways it rears its ugly head as our models get larger and more useful… maybe I should take those singularity doomers more seriously 😅.

Information Access

One reason language models struggle to be useful for some practical use cases is their relatively limited access to information. I’m sure you’ve gotten the message in ChatGPT or other LLMs that says something to the effect of: “I have no information after some date in 2021 and thus cannot tell you what the weather is today in Hyderabad.” These multi-billion parameter models have been trained on what is effectively the entirety of the indexable internet, which is hella expensive and consequently is not redone every couple of weeks so you can find out who won the Champions League this year. Another information access problem is that of proprietary data that you or your clients need in order to accomplish their goals with these tools. Whether it is an internal knowledge base the LLM needs in order to fill a customer service agent’s knowledge gap in your business, or your calendar that the chatbot booking your appointments has to read, the lack of access to proprietary, real-time data is a major limitation to language model adoption in industry.

Reasoning Ability

A separate issue that also stems from the general constraint of machine learning is the inability of language models to reason. There are myriad manifestations of this problem, one of which was the focus of the empirical study in the MRKL Systems article [1]. The authors had large language models (GPT-3 and Jurassic-1) perform basic arithmetic; they did okay on 2-digit arithmetic but performed quite poorly on 4-digit numbers. We know computers do math well, so why can’t LLMs? This stems from the fact that they are trained to “understand” the written word, which amounts to applying statistical heuristics over language. When we ask them to do math, they guess based on their training data, and because of the statistical nature of those guesses, the results are wildly inconsistent.
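To make the workaround concrete before we get to MRKL proper, here is a minimal sketch (my own illustration, not code from the paper) of the kind of symbolic “expert” you would route arithmetic to instead of letting the model guess:

import operator

def calculator_expert(a: float, op: str, b: float) -> float:
    """Exact arithmetic the model can delegate to instead of guessing from token statistics."""
    ops = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": operator.truediv}
    return ops[op](a, b)

# A 4-digit multiplication of the kind the models in [1] routinely get wrong when asked directly:
print(calculator_expert(1234, "*", 5678))  # 7006652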

Failure to Compose Final Answers

Arithmetic is not the only domain where language models fail to reason well; they also fail to combine simple, well-known, yet disparate facts into a single answer. This is a horribly vague way to introduce the problem, so let’s talk about it in terms of the research performed in the article “Measuring and Narrowing the Compositionality Gap in Language Models” [2]. The authors ran an experiment on a dataset they call Compositional Celebrities (CC), a set of 8,700 2-hop questions that combine “frequently stated facts in improbable ways (e.g., “Who won the Masters Tournament the year Justin Bieber was born?”), allowing us to disentangle memorization and reasoning”. Another example of one of these questions: “What is the calling code of the birthplace of Frida Kahlo?” In their words, this is how they define the gap:

We introduce the term Compositionality Gap to describe the fraction of compositional questions that the model answers incorrectly out of all the compositional questions for which the model answers the sub-questions correctly…

So they measured the rate at which a language model answered the full question incorrectly even when it got both sub-problems right. In the first example, they want to know how often the final answer was wrong when the language model could figure out that Justin Bieber was born in 1994 and that José María Olazábal won the Masters in 1994. The results of their study are staggering. Using GPT-3 (davinci-002), on the hardest category (Birth Year / Literature Nobel Prize Winner) the model answers only 1.2% of the composed questions correctly while answering 80% of the sub-questions correctly. From this study they concluded: “as the GPT-3 model size grows, it knows more about the world, but its ability to compose this knowledge increases at a slower rate.” At the time this study was performed, GPT-3 was the largest LLM available to test, and it fundamentally failed to compose correct answers to questions that required combining two simple facts. Since it could determine the individual facts, the real fault was its inability to reason over and compose them. The takeaway here is not that we’ve come a long way since GPT-3; rather, models will continue to memorize better and retain more knowledge without improving their reasoning skills at a comparable rate.
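To pin the definition down, here is a small sketch, my own illustration rather than code from the paper, that computes the compositionality gap from per-question correctness flags:

def compositionality_gap(results):
    """results: list of dicts with booleans 'sub1_correct', 'sub2_correct', 'final_correct'.
    The gap is the fraction of questions answered incorrectly among those
    where the model got BOTH sub-questions right."""
    both_subs_right = [r for r in results if r["sub1_correct"] and r["sub2_correct"]]
    if not both_subs_right:
        return 0.0
    missed = [r for r in both_subs_right if not r["final_correct"]]
    return len(missed) / len(both_subs_right)

# Toy example: the model knows both facts for 4 of 5 questions,
# but composes a correct final answer for only 1 of those 4.
toy = [
    {"sub1_correct": True,  "sub2_correct": True,  "final_correct": False},
    {"sub1_correct": True,  "sub2_correct": True,  "final_correct": False},
    {"sub1_correct": True,  "sub2_correct": True,  "final_correct": True},
    {"sub1_correct": True,  "sub2_correct": True,  "final_correct": False},
    {"sub1_correct": False, "sub2_correct": True,  "final_correct": False},
]
print(compositionality_gap(toy))  # 0.75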

Now that we have enumerated some of the problems with LLMs, whether in information access or in their relative inability to compose facts, let’s take a look at the solutions these two papers propose to improve the efficacy of these models in the wild.

MRKL Systems, Pronounced “Miracle”

Allow me to introduce the “Miracle” system in the words of the authors and then we’ll break down the solution in terms of the problems addressed above:

A MRKL system consists of an extendable set of modules, which we term ‘experts’, and a router that routes every incoming natural language input to a module that can best respond to the input (the output of that module can be the output of the MRKL system, or be routed to another module). These modules can be:

Neural, including the general-purpose huge language model as well as other smaller, specialized LMs.

Symbolic, for example a math calculator, a currency converter or an API call to a database.

The MRKL system provides a systematic solution to both the information retrieval and the reasoning problems. By introducing a set of modules that our LLM can route the input question, or its sub-questions, to, it gives the language model the capacity to shore up its weaknesses with external tools. The MRKL paper describes the router as a simple neural net that is easy to retrain on a new set of tools; however, if you are using LangChain to build your agents, it is even simpler than that. You simply define each tool with a name, a description, and the function to run when the language model decides that tool is the best fit for the question at hand, as seen below.

# LangChain imports, as of the package versions current when this was written
from langchain.agents import Tool
from langchain.utilities import SerpAPIWrapper

# One "expert": a web search tool the agent can route questions to
search = SerpAPIWrapper()
tools = [
    Tool(
        name="Search",
        func=search.run,
        description="useful for when you need to answer questions about current events",
    )
]
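For completeness, here is a minimal sketch of wiring that tool list into an agent. It assumes an OpenAI API key, a SerpAPI key, and the LangChain APIs as they existed when this article was written; the exact import paths have shifted in later releases.

from langchain.agents import AgentType, initialize_agent
from langchain.llms import OpenAI

llm = OpenAI(temperature=0)  # the general-purpose LM the router falls back to

# A ReAct-style agent reads each tool's description and decides when to
# route a question to Search versus answering from the model's own weights.
agent = initialize_agent(
    tools, llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION, verbose=True
)
agent.run("Who won the Champions League this year?")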

The authors suggest a substantial list of benefits to using said methodology:

Safe fallback: In case the input doesn’t match any existing expert module, the router sends the input directly to the general-purpose huge LM.

Robust extensibility: Since each expert is trained independently we are able to cheaply add new capabilities while guaranteeing that they do not compromise the performance of existing ones. The only component that requires retraining is the router which is a relatively lightweight task.

Interpretability: When the router invokes a specific module, that often has the side benefit of providing a rationale for the MRKL system’s output (“1 + 1 = 2 because the calculator said so”); such explanations are crucially lacking in existing language models.

Up-to-date information: The integration of external APIs allows the MRKL system to hook into dynamic knowledge bases, and correctly answer inputs that static models cannot.

Proprietary knowledge: Access to proprietary databases and other information sources.

Compositionality: By routing compounded multi-hop inputs to different experts we are able to naturally integrate their responses and correctly address complex inputs.

Let’s unpack a few of these so we can better understand the power MRKL gives us. A counterpoint to Safe fallback is that we can prompt our LLM agent NOT to provide an answer at all, rather than feeding the question directly to the LLM, when none of the MRKL experts are helpful. This is an important strategy when the risk of giving a wrong answer outweighs the benefit of giving an answer that is not derived from the experts. Robust extensibility is one of the strongest arguments for this method, and with LangChain it does not even require us to retrain the router. Up-to-date information and Proprietary knowledge are pretty self-explanatory, but they are likely the most profound benefits when it comes to using LLMs in industry. Finally, the Interpretability and Compositionality arguments are further supported by the compositionality gap paper, which suggests that “the probability of an LM being able to compose two facts grows as its confidence about each fact rises” [2]. So giving the language model reasoning it can interpret increases its ability to compose a correct final answer. The MRKL system is a powerful idea that researchers and practitioners alike will be building on for a long time to come. Now, shifting gears, we will build on this final idea of compositionality in the next section.

MRKL system design, with some example experts
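To make the Safe fallback trade-off concrete, here is a toy router sketch of my own; the real MRKL router is a trained neural net, not keyword matching, and every name below is made up for illustration. It routes to an expert when one matches, and otherwise either falls back to the bare LLM or refuses, depending on how costly a wrong answer is in your setting.

def route(question: str, experts: dict, llm, refuse_on_fallback: bool = False) -> str:
    """experts maps a routing keyword to a callable expert; llm is the general-purpose fallback."""
    for keyword, expert in experts.items():
        if keyword in question.lower():
            return expert(question)   # interpretable: we know exactly which expert answered
    if refuse_on_fallback:
        # The counterpoint to Safe fallback: decline rather than let the bare LLM guess
        return "I don't have a reliable way to answer that."
    return llm(question)              # MRKL's default safe fallback: the general-purpose LM

experts = {
    "weather": lambda q: "call a weather API here",     # stand-in for a weather expert
    "calculate": lambda q: "call a calculator here",    # stand-in for a calculator expert
}
print(route("calculate 1234 * 5678", experts, llm=lambda q: "LLM guess"))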

Elicitive Prompting and Self-Ask

Elicitive prompting is the practice of crafting prompts that extract more informative, accurate, or nuanced responses by posing questions or statements in creative ways that guide the language model toward the desired answer. The first and canonical example of this strategy is chain of thought prompting, where you tell the model, “Let’s think step by step,” so that it works through a step-by-step process to generate an answer [3]. This method is useful, but it can be improved. The Self-Ask prompting strategy is defined as follows:

Our method builds on chain of thought prompting, but, instead of outputting a continuous undemarcated chain-of-thought, our prompt has the model explicitly state the next follow-up question it wants to ask before answering it…

Self-ask (depicted in Figure 3) requires a one or few-shot prompt that demonstrates how to answer the questions… We then insert a phrase “Are follow up questions needed here:” at the end of the prompt since we found that doing so slightly improves results.

Direct, Chain of Thought, and Self-Ask prompting, illustrated side by side

Self-Ask extends chain of thought by forcing the model to be more explicit in breaking down the question and composing the sub-answers before returning a final answer. This aims to solve some of chain of thought’s failure modes. Chain of thought can lose track of the original question while building up a final answer, because its reasoning is one long, undemarcated stream inside the model’s context window. Self-ask, by contrast, carries the full structure (question, sub-questions, and intermediate answers) through the reasoning process all the way to the final answer. Furthermore, when chain of thought produces bad logic in an intermediate step, it will likely go off the rails and give a final answer that has strayed from the original question. This is still possible with self-ask; however, by breaking out sub-questions and intermediate answers explicitly, it is far less likely for this mistake in reasoning to occur.
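To see what that scaffolding looks like in practice, here is a one-shot self-ask prompt sketch. The scaffold markers (“Are follow up questions needed here:”, “Follow up:”, “Intermediate answer:”, “So the final answer is:”) follow the format described in the paper; the surrounding Python is my own illustration.

SELF_ASK_PROMPT = """\
Question: Who was president of the U.S. when superconductivity was discovered?
Are follow up questions needed here: Yes.
Follow up: When was superconductivity discovered?
Intermediate answer: Superconductivity was discovered in 1911.
Follow up: Who was president of the U.S. in 1911?
Intermediate answer: William Howard Taft was president of the U.S. in 1911.
So the final answer is: William Howard Taft

Question: {question}
Are follow up questions needed here:"""

# The model continues the pattern, emitting its own follow-ups and
# intermediate answers before a single "So the final answer is:" line.
prompt = SELF_ASK_PROMPT.format(
    question="Who won the Masters Tournament the year Justin Bieber was born?"
)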

The authors of the compositionality gap paper put it more concisely than I can:

We hypothesize that the advantage of self-ask over chain of thought is that it disentangles the decomposition of the full question (by formulating sub-questions) from the actual answers to those sub-questions. In addition, the rigid scaffolding self-ask provides makes it easier for the model to state the final answer in a concise, parseable way.

As it pertains to both reasoning and the composition of final answers, the self-ask method clearly provides a more robust prompting strategy: it increases the language model’s confidence in the sub-answers that are used to compose the final answer. This is supported by experimental evidence as well. On the Compositional Celebrities dataset introduced earlier, chain of thought reasoning obtained 45.7% accuracy while Self-Ask obtained 79.6% accuracy, which is no small improvement [2].
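Because self-ask ends every transcript with the same marker, pulling the final answer out programmatically is trivial. A small sketch of my own, not code from the paper:

from typing import Optional

def extract_final_answer(transcript: str) -> Optional[str]:
    """Return whatever follows the last 'So the final answer is:' marker, if present."""
    marker = "So the final answer is:"
    if marker not in transcript:
        return None
    return transcript.rsplit(marker, 1)[1].strip()

print(extract_final_answer(
    "Follow up: Who won the Masters in 1994?\n"
    "Intermediate answer: José María Olazábal won the 1994 Masters.\n"
    "So the final answer is: José María Olazábal"
))  # José María Olazábal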

Where do we go from here?

So the question remains: what can we take away from these two papers and the methodologies they introduce? The MRKL Systems article gives us two takeaways: 1) external tools let us compensate for the language model’s reasoning deficiencies in domains poorly suited to LMs, and 2) external tools also provide information retrieval, letting the model reason over information it otherwise has no access to. From the Compositionality Gap article we learned that LLMs are getting better at memorizing the world much faster than they are improving at reasoning about it, and consequently we need creative prompting strategies that give them adequate context and structure to address the queries they’re given.

If we treat language models as the everything tool, we will inevitably end up trying to hammer a screw. But if we use language models for what they are good at, and provide tools and guard rails to augment them where they are weak, we can build systems and agents that can solve increasingly complex problems with greater levels of accuracy.

Thank you for reading this article; your attention is your greatest commodity, and for that I am grateful! If you enjoyed this article, check out part 2 in the series: Evolving Language Model Prompting: Prompting Strategies for Enhanced Language Model Performance. If you haven’t seen it yet, you might also enjoy my three-part series on vector databases: Vector DBs the secret Sauce, the art of embeddings, and Understanding the Algorithm.

Sources

[1] E. Karpas et al., “MRKL Systems: A Modular, Neuro-Symbolic Architecture That Combines Large Language Models, External Knowledge Sources and Discrete Reasoning” (2022). https://arxiv.org/abs/2205.00445

[2] O. Press et al., “Measuring and Narrowing the Compositionality Gap in Language Models” (2022). https://arxiv.org/abs/2210.03350

[3] J. Wei et al., “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models” (2022). https://arxiv.org/abs/2201.11903
