Time Impact of Logical Complexity in AI Prompt Processing: Gaining Intuition
Ever noticed that “complicated” prompts take longer to run than simple ones?
Why is that?
Answers From the Internet
I did a little digging across the internet; here are a couple of the more interesting answers I found.
“The complexity of a question has an impact on the inference time of large language models (LLMs). In general, as the complexity of a question increases, the inference time of LLMs also tends to increase. This is because complex questions require more processing and reasoning, which takes more time for the LLM to generate an answer.”
Source: https://typeset.io/questions/how-does-the-complexity-of-a-question-affect-the-inference-2nuwy0y1lt
“Complexity of Processing: More complex questions often require deeper understanding and more extensive processing. The AI has to consider various aspects of the question, search through a larger volume of its training data, and sometimes perform logical reasoning or synthesis of information from different sources. This process is computationally intensive and takes more time.
Longer Responses: Complex questions often require longer, more detailed responses. Generating these responses involves constructing more sentences, ensuring coherence, and checking for accuracy, all of which take additional time.
Additional Tools Usage: For some questions, ChatGPT might need to use additional tools like a Python environment or a browser tool to fetch or calculate the required information. Interacting with these tools and integrating their outputs into the response further adds to the response time.
Ambiguity and Clarifications: Complex questions sometimes contain ambiguities or require clarifications. The model might need to spend extra time understanding the context or asking follow-up questions to ensure an accurate response.
Resource Allocation: Depending on the server load and resource allocation, response times can vary. More complex queries might require more computational resources, which can be affected by the current demand on the system.
In summary, the increased time for responding to complex questions is mainly due to the higher computational demands for processing, generating longer responses, using additional tools, handling ambiguities, and the overall resource allocation of the system.”
Source: GPT-4
Models Need Time to Think
The consensus is generally that models need more time to think when answering difficult questions, but what does this actually mean?
There are two ways this idea is presented:
1. Time to think == More output tokens
- Equating “time to think” with more output tokens is logical: extra tokens give the model room to “reason,” at the cost of additional next-token predictions and, therefore, longer inference times.
2. Time to think == More Processing Time
- This one doesn’t make sense to me. LLMs use the transformer architecture to predict the next token in a sequence from left to right. Why would computing the most probable next token in a sequence depend on whether the preceding tokens were “complicated” and needed more reasoning?
- Inference times increasing with question complexity seems like a natural conclusion to draw if an LLM were human, but as we all know (or at least I hope), that’s not the case!
- The core question to ask here is: “Does question complexity explain the variance among LLM inference times?”
Experiments
This experiment will not be a proper benchmark or academic experiment; it is a practical exploration to help form some intuition on the topic.
I want to understand two things:
- How the number of tokens affects inference time
- How question complexity affects inference time
To do this, I’ll create a set of prompts of varying lengths and complexities and measure the time the LLM takes to respond.
You can check out the code and prompts here:
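The full setup lives in the linked code, but for intuition, here is a minimal sketch of the kind of timing loop involved, assuming the OpenAI Python client; the model name and prompt list are placeholders rather than the actual experiment set:

```python
# Minimal sketch of the measurement loop (assumes the OpenAI Python client;
# MODEL and PROMPTS are placeholders, not the exact values from the experiment).
import time
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
MODEL = "gpt-3.5-turbo"
PROMPTS = [
    {"text": "What is 2 + 2?", "complexity": 1},
    {"text": "Explain how a transformer generates text, step by step.", "complexity": 3},
]

rows = []
for prompt in PROMPTS:
    start = time.perf_counter()
    response = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt["text"]}],
    )
    elapsed = time.perf_counter() - start
    rows.append({
        "input_tokens": response.usage.prompt_tokens,
        "output_tokens": response.usage.completion_tokens,
        "complexity": prompt["complexity"],
        "response_time": elapsed,
    })
```

Each row records the token counts reported by the API alongside the wall-clock response time, which is all the regression below needs.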
Results
Looking at the Plots
- The number of output tokens influences inference time far more than input tokens or complexity do.
- There are not many outliers from the surface predicted by the token counts; the fit looks quite good. To confirm this, I ran some statistical tests.
Linear Regression
One obvious confounding factor here is network latency. To control for this, all prompts were run sequentially over the course of a minute or so.
R-squared: 0.988
Adj. R-squared: 0.986
F-statistic: 555.8
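These figures read like the header of an ordinary least-squares summary. A sketch of such a fit with statsmodels, reusing the rows collected by the timing sketch above (column names assumed):

```python
# Regress response time on the two token counts (a sketch; column names assumed).
import pandas as pd
import statsmodels.api as sm

df = pd.DataFrame(rows)  # rows collected by the timing loop above
X = sm.add_constant(df[["input_tokens", "output_tokens"]])
model = sm.OLS(df["response_time"], X).fit()
print(model.summary())  # reports R-squared, adjusted R-squared, F-statistic, coefficients
```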
Including Complexity in the Prediction
Notice the negative coefficient on complexity. That is the direct opposite of what we might expect.
R-squared: 0.988
Adj. R-squared: 0.985
F-statistic: 360.7
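Adding complexity to the regression is a one-column change to the same sketch, and the p-value and confidence interval discussed below come out of the same summary:

```python
# Continues from the statsmodels sketch above: include the complexity rating as a regressor.
X2 = sm.add_constant(df[["input_tokens", "output_tokens", "complexity"]])
model2 = sm.OLS(df["response_time"], X2).fit()
print(model2.summary())
print(model2.conf_int())  # 95% confidence intervals, including the one for complexity
```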
Observations
- The number of output tokens disproportionately influences computation time. Response times grow ~177 times faster per additional output token than per additional input token. This difference makes sense: transformers do not need to process the input embeddings in order, so the whole input can be handled in parallel, whereas each output token is generated sequentially and requires its own forward pass (see the sketch after this list).
- Question complexity doesn’t seem to have much of an impact. Its p-value of 0.083 is close to the traditional academic bar of 0.05, but its confidence interval of [-0.363, 0.025] spans 0. More data would be required to draw strong conclusions, but even at the high end of that interval, complexity would be ten times less impactful on inference time than output tokens.
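The parallel-versus-sequential point in the first observation is easier to see as pseudocode: the entire input is processed in one forward pass, while every output token needs its own pass over everything generated so far. This is a conceptual sketch, not a real inference implementation; the model methods are made up for illustration:

```python
# Conceptual sketch of why output tokens dominate latency (not a real implementation).
def generate(model, input_tokens, max_new_tokens):
    # Prefill: the full input is processed in a single, parallelizable forward pass.
    kv_cache = model.forward_all(input_tokens)

    output_tokens = []
    for _ in range(max_new_tokens):  # Decode: one forward pass per output token.
        next_token = model.forward_one(kv_cache)
        if next_token == model.eos_token:
            break
        output_tokens.append(next_token)
    return output_tokens
```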
Intuition Gained
- Optimizing speed in LLM applications comes down largely to minimizing the number of output tokens the model generates.
- Input-augmenting techniques like few-shot prompting should be considered before output-augmenting techniques like Chain of Thought.
- Suppose we assume that an LLM call’s fixed cost is captured in equation 1 by the constant term, with a value of ~0.3 seconds. We can then express the time cost of LLM calls, input tokens, and output tokens in terms of one another (the assumed form of equation 1 is sketched after this list). Remember, these numbers carry error bands, so treat them as indicators of magnitude, not precise values.
- One LLM call has the same time cost as ~20 additional output tokens.
- One LLM call has the same time cost as ~3300 additional input tokens.
- ~177 additional input tokens have the same time cost as 1 additional output token.
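Equation 1 itself isn’t reproduced in this post, but given the regression above it has the general form below. The per-token coefficients are back-derived from the rounded equivalences in this list, so they are assumptions about magnitude rather than the exact fitted values:

```latex
% Assumed general form of equation 1; c and the betas are approximate,
% back-derived from the ~0.3 s constant and the ~20 / ~3300 token equivalences.
t \approx c + \beta_{\text{in}} \, n_{\text{in}} + \beta_{\text{out}} \, n_{\text{out}},
\qquad c \approx 0.3\ \text{s},
\qquad \beta_{\text{out}} \approx \frac{c}{20} \approx 0.015\ \text{s/token},
\qquad \beta_{\text{in}} \approx \frac{c}{3300} \approx 9 \times 10^{-5}\ \text{s/token}.
```

From these rounded figures, the ratio of the output to input coefficients works out to roughly 165, consistent in magnitude with the ~177x figure from the fit.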
Follow-Up Thoughts on Agent Design
1. Performance Monitoring
Tracking the total number of output tokens used across chain runs in production is a network-latency-agnostic way of measuring agent performance.
2. Optimizing for speed (and cost)
Can we speed up complex agents by leveraging additional LLM calls early in the answer generation process to reduce total output tokens later in the answer generation process? It’s like sorting an array before performing operations on it. Pay a small upfront cost for massive gains later in the process.
Perhaps more simply, can we trade off network latency to reduce output tokens?