Perturbation analysis and LLMs: How sensitive are LLMs to their input?

Daniel Ferguson
Data Science at Microsoft
Mar 19, 2024

A perturbation is a small change to an initial reference point. For example, the calm waters of a river can become choppy if they encounter some boulders in their path. In the context of Large Language Models (LLMs), consider for a moment a perturbation that either:

  1. Replaces a word with a typo
  2. Replaces a word with a synonym

In these situations, we can automate perturbations of some input data and measure how different the language model’s output becomes. To do so, we must control for the randomness in the model — in this case, an OpenAI model — by setting the seed parameter (the initial value for the random number generator used by the model) and ensuring that temperature (which controls the level of randomness in the generated text) is set to 0.
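As a concrete illustration, here is a minimal sketch using the OpenAI Python client (the model name and message are placeholders):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Summarize the following: ..."}],
    temperature=0,  # minimize sampling randomness
    seed=42,        # fix the random number generator (best-effort determinism)
)
print(response.choices[0].message.content)
```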

Let’s start with an example. Many projects involving LLMs (particularly LLMs hosted behind an API) involve prompt engineering at some point in their lifecycle. A natural question to ask is which of the following two prompts generates a higher-quality output.
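For instance, consider a pair of prompts that differ by only a single typo (an illustrative example rather than the exact prompts used later):

```
Prompt A: Summarize the following text: {data}
Prompt B: Summarzie the following text: {data}
```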

Determining which of these prompts elicits the better response is the essence of prompt engineering, which happens to encapsulate perturbation analysis. In giving the perturbations more of a spotlight here, I’d like to quantify how different a response we can expect from such a minute change in input.

In general, I could argue that inserting two different user responses into a prompt template is a perturbation and, as prompt engineers, we want the response from the system to remain consistent (with respect to some metric). By understanding the variance of the output we can quantify the precision of our model which, when combined with a high accuracy rate, will give us further confidence in the end-to-end system.

The statements and conclusions that I make in this article about the variance and precision of the model will, more fundamentally, be a property of the prompt template that is used to guide the model to specific behaviors for a wide variety of user inputs. The information we derive from this analysis will not necessarily inform us about how to change the prompt template, but it will provide us with better clarity about how the prompt template performs.

Section 2: A summarization problem

The NLP task for our example is summarization (continuing with the example from above). In it, we ask GPT-3.5 Turbo to summarize the first 1500 tokens of data points from three different sources:

  1. GoodWiki
  2. PubMed
  3. arXiv

For each dataset, we load the first 100 data points. In the case of PubMed and arXiv, we use the abstracts. For GoodWiki, we use the markdown content.
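A sketch of the data loading using the Hugging Face datasets library; the dataset identifiers and field names here are assumptions, as the article does not specify them:

```python
from datasets import load_dataset

# Dataset identifiers and field names below are assumptions.
pubmed = load_dataset("scientific_papers", "pubmed", split="train[:100]")
arxiv = load_dataset("scientific_papers", "arxiv", split="train[:100]")
goodwiki = load_dataset("euirim/goodwiki", split="train[:100]")

pubmed_texts = [row["abstract"] for row in pubmed]
arxiv_texts = [row["abstract"] for row in arxiv]
goodwiki_texts = [row["markdown"] for row in goodwiki]
```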

For this task, we use a remarkably simple prompt template along the lines of the following sketch (the exact wording here is illustrative):
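```
Summarize the following text:

{data}
```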

where {data} is replaced by the data in our dataset (unperturbed or otherwise).

Why these three datasets? They come from different data domains, which helps us understand whether the results actually vary across domains.

We then define three different perturbation functions to insert perturbed data into the prompt template.

  1. Insert a single-letter typo based on QWERTY keyboard distance (only adjacent keys count as valid typos, and only alphanumeric keys are considered).
  2. Replace a selected word with a random synonym from the Natural Language Toolkit (NLTK) thesaurus.
  3. Replace a selected word with a random word from the top five most similar GloVe words. GloVe is a set of pre-computed embeddings for a given vocabulary: each word in the vocabulary has an assigned vector representation, and the similarity between two GloVe vectors is measured by their cosine similarity.
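A minimal sketch of these three functions, assuming WordNet as the NLTK thesaurus and gensim-loaded GloVe vectors (the QWERTY map is abbreviated, and the author’s actual implementation may differ):

```python
import random

from nltk.corpus import wordnet  # requires nltk.download("wordnet")

# Abbreviated QWERTY adjacency map; a full version covers every alphanumeric key.
QWERTY_NEIGHBORS = {"a": "qwsz", "s": "awedxz", "e": "wsdr", "o": "iklp"}

def typo_perturb(word: str) -> str:
    """Swap one letter for a QWERTY-adjacent key."""
    idx = random.randrange(len(word))
    ch = word[idx].lower()
    if ch not in QWERTY_NEIGHBORS:
        return word
    return word[:idx] + random.choice(QWERTY_NEIGHBORS[ch]) + word[idx + 1:]

def thesaurus_perturb(word: str) -> str:
    """Swap the word for a random WordNet synonym, if one exists."""
    synonyms = {
        lemma.name().replace("_", " ")
        for synset in wordnet.synsets(word)
        for lemma in synset.lemmas()
        if lemma.name().lower() != word.lower()
    }
    return random.choice(sorted(synonyms)) if synonyms else word

def glove_perturb(word: str, glove_vectors) -> str:
    """Swap the word for one of its five nearest GloVe neighbors.

    glove_vectors is assumed to be a gensim KeyedVectors object, e.g.
    gensim.downloader.load("glove-wiki-gigaword-100").
    """
    if word not in glove_vectors:
        return word
    neighbors = glove_vectors.most_similar(word, topn=5)  # (word, score) pairs
    return random.choice(neighbors)[0]
```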

Each time a new data point is loaded, we randomly flag a maximum of five words per data point and apply the relevant perturbation three times to each flagged word. This means that one data point has potentially 15 different perturbations associated with it for each perturbation type! For 100 data points, that brings us to about 1500 perturbations of each type.
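One plausible reading of this procedure as code, where perturb_fn is any of the three functions sketched above:

```python
import random

def perturb_datapoint(text: str, perturb_fn, n_words: int = 5, n_repeats: int = 3):
    """Generate up to n_words * n_repeats single-word perturbations of text."""
    words = text.split()
    flagged = random.sample(range(len(words)), k=min(n_words, len(words)))
    variants = []
    for idx in flagged:
        for _ in range(n_repeats):
            perturbed = list(words)
            perturbed[idx] = perturb_fn(words[idx])
            variants.append(" ".join(perturbed))
    return variants  # up to 15 variants per perturbation type
```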

We generate a response from an OpenAI model (GPT-3.5 Turbo) for each data point and for each perturbation of that data point. See Figures 1–3 below for examples.

Section 2.1: Metrics

We’ve been referring to similarity and dissimilarity throughout this article and have so far left the definitions vague. For the problem of generated text, we define similar and dissimilar in two ways:

  • Cosine similarity (cosine_sim): We measure the cosine similarity between the text-embedding-ada-002 embeddings of a string and a reference string.
  • Intersection Over Origin (IOO): A metric related to Intersection Over Union (IOU), except that we measure the number of sequential characters that are shared between a string and a reference string divided by the total number of characters in the reference string.

The cosine similarity captures the semantic meaning of the text. Broadly this is interpreted as the intent behind the string. IOO, on the other hand, captures differences in the characters used to create the semantic meaning. Having a high IOO almost always implies the cosine similarity will also be high. However, having a low IOO still allows for the cosine similarity to be high in certain cases.
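A sketch of both metrics, using difflib for the shared-character count and the OpenAI embeddings endpoint (the helper names are my own):

```python
from difflib import SequenceMatcher

import numpy as np
from openai import OpenAI

client = OpenAI()

def cosine_sim(text: str, reference: str) -> float:
    """Cosine similarity between the ada-002 embeddings of two strings."""
    resp = client.embeddings.create(
        model="text-embedding-ada-002", input=[text, reference]
    )
    a, b = (np.array(item.embedding) for item in resp.data)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def ioo(text: str, reference: str) -> float:
    """Intersection Over Origin: sequential characters shared with the
    reference divided by the total characters in the reference."""
    blocks = SequenceMatcher(None, text, reference).get_matching_blocks()
    return sum(block.size for block in blocks) / len(reference)
```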

Section 2.2: PubMed

To understand the perturbation analysis more deeply, we can examine the precision of the prompt template and model on one data point with respect to each perturbation type.

Examples

In the three figures below, a simple way to interpret the material is that black text and green text are the original inputs and outputs, while black text and red text are the perturbed inputs and outputs.

Figure 1 shows a striking example of a data point where a simple typo leads to a vastly different set of words being selected by the OpenAI model. When “least” was changed to “lesst,” none of the words in green appear in the new output; they are instead “replaced” by the words in red. The words in black are the only words consistent between the two outputs.

Figure 1: Typo perturbation and OpenAI response.

Changing a word to a synonym, however, seems to have less of an effect on the output, as shown in Figure 2.

Figure 2: Thesaurus perturbation and OpenAI response.

As shown in Figure 3, the GloVe vector selection measures similarity in a manner that can sometimes perturb a word with something that is not necessarily a synonym in every context. The words “permanent” and “temporary” are similar in that they both describe states of being, but they carry vastly different meanings in most contexts. With respect to the GloVe vector distance, the inputs are very similar, and yet we observe a relatively large change in the output.

Figure 3: GloVe perturbation and OpenAI response.

For a high precision prompt template, we should see high similarity for both the cosine similarity and the IOO of the outputs.

Perturbation type: Typos

For each perturbation type, let’s see the effect on the output data in the PubMed domain.

First, let’s measure the similarity of the inputs after we insert a typo. Then, let’s compare whether the presence of the prompt template has a significant impact on the cosine similarity.
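A sketch of that comparison, assuming a list of (original, perturbed) input pairs and the cosine_sim helper from the metrics section:

```python
import matplotlib.pyplot as plt

# pairs: list of (original_input, typo_perturbed_input) tuples (assumed).
scores = [cosine_sim(perturbed, original) for original, perturbed in pairs]

plt.hist(scores, bins=30)
plt.xlabel("Cosine similarity of original vs. typo-perturbed input")
plt.ylabel("Count")
plt.show()
```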

Shown above is the histogram of the cosine similarity of the input data perturbed by a typo. As expected, the typos have a very small effect on the cosine similarity (note the x-axis bounds). This is partially due to the length of the input but is generally an expected result. One could imagine that a typo in a single word has more impact than a typo in an entire paragraph. Now let’s see the effect on the output.

This is much more interesting. We can see that the distribution of outputs is semantically consistent across all the data points (with slightly more variance than the input) but there is a potentially large difference in the characters used to generate that output as evidenced by the histogram of the IOO metric.

This is either good news or bad news, depending on your context. When a business scenario is highly dependent on a specific set of tokens being present in the output, the high variance of the IOO metric may be concerning. If instead, as is the case with the summarization problem, you care only about the semantic information contained within the output, then the GPT-3.5 Turbo model is incredibly precise! (At least for data in the PubMed domain.)

Perturbation type: Thesaurus

We can perform the same analysis for the thesaurus perturbations.

For the thesaurus perturbations we see many of the same trends as we did for the typo perturbations. Now for the last perturbation type.

Perturbation type: GloVe

Conclusions: PubMed

For the PubMed dataset, we see that the distribution of similarity scores is essentially invariant across all perturbation types considered, with perhaps slightly more variance in the output from the GloVe perturbations than from the other two types.

Section 2.3: arXiv

Here we repeat the same analysis as in the prior section except within a different data domain and with the goal of understanding whether there are significant changes between the results for data of varying types (as one can never be too careful).

Perturbation type: Typos

Perturbation type: Thesaurus

Perturbation type: GloVe

Here we see a repeat of the patterns from the prior section, which is not surprising. The data domain difference between PubMed abstracts and arXiv abstracts is not very large, though the arXiv abstracts appear to contain more special characters (which could have caused issues for the semantic analysis). We do see a slightly lower mean for the IOO metric across all the perturbation types, but it is well within reason.

Section 2.4: GoodWiki

To triple-check our work, let’s review the behavior using a vastly different dataset: Markdown text of Wikipedia pages. We maintain a restriction to the first 1500 tokens from each page for our summarization.

Perturbation type: Typos

Perturbation type: Thesaurus

Perturbation type: GloVe

Section 2.5: Analysis

The conclusion we can draw from the summarization problem examples is that the data domain does not significantly affect our metrics. However, it is difficult to state how general this result is. The two variables to be aware of are the data domain as well as the NLP task. For example, it could be that a question-answering task might have significant variance in the output semantics across differing domains while the summarization task does not.

Overall, we have found that for the task of summarization in these data domains, there is little variance across the metrics for each domain. For each dataset, there is high variability in the output tokens used in response to a small semantic perturbation, which is a slightly surprising result. At the same time, there is very little variation in the semantics of the output text. We conclude that, while the tokens can vary considerably, the semantic meaning of the full output has very low variability with respect to the perturbation types explored here.

As a final note, consider that the analysis we performed and the results we collected are a property of the specific prompt template we used. Using a different prompt template may lead to significantly different results. Varying the prompt template is what we look at in the next example problem.

Section 3: A classification problem

Let’s turn to a slightly more sensitive use case: classification. In this example, we use the IMDB dataset and perform the same perturbation operations on the data, except that now the output is intended to be either a 0 or a 1.

Throughout this section we ingest the same data but consider two different prompts and how each prompt responds to perturbations in the input.

Note that our goal here is not to create the best prompt template but simply to evaluate which template is more resistant to perturbations in the input data.
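As an illustration of the kind of pair being compared (stand-ins rather than the exact templates from the experiment), consider a zero-shot template and a one-shot variant:

```
Template A (zero-shot):
Classify the sentiment of the following movie review as 0 (negative)
or 1 (positive). Respond with only the digit.

Review: {data}

Template B (one-shot):
Classify the sentiment of the following movie review as 0 (negative)
or 1 (positive). Respond with only the digit.

Review: I loved every minute of this film.
Answer: 1

Review: {data}
```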

The problem statement in this section is the same but now we have a different output metric: We measure how many times a perturbation induced a change in the classification. To measure this, we check for consistency between the predicted class from the original data point and its corresponding perturbations.
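A sketch of this consistency check (the helper name is my own):

```python
def flip_rate(original_label: int, perturbed_labels: list[int]) -> float:
    """Fraction of perturbed inputs whose predicted class differs from the
    prediction on the unperturbed data point."""
    flips = sum(label != original_label for label in perturbed_labels)
    return flips / len(perturbed_labels)

# Example: one of four perturbations flipped the predicted class.
assert flip_rate(1, [1, 1, 0, 1]) == 0.25
```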

Section 3.1: Results and analysis

For classification, the only metric of interest to us is whether the assigned class varies after insertion of a typo. Below is a brief chart that summarizes the impact of typos per prompt template.

We performed the same style of perturbations as before; because the perturbations are random, they were not identical across the two templates, which leads to minor discrepancies in the total number of data points for each prompt template.

Surprisingly, adding a one-shot example provided no measurable benefit to the prompt template’s resistance to perturbations. In fact, we cannot confidently conclude which prompt template performs better with respect to our perturbations, as the results are well within the scope of random noise. Perhaps this is because the data domain is straightforward, or perhaps the one-shot example was too simple to make a significant impact.

Section 4: Conclusions

A perturbation analysis can be thought of as a measurement of the precision of a fixed prompt template and large language model. A highly precise prompt template and model will, in general, increase confidence in the system for which the large language model is deployed.

Importantly, the NLP task is subject to change for many business scenarios and yet, as evidenced by the summarization example and the classification example, a perturbation analysis can still be performed. The independence of this analysis from any of the downstream metrics is a large benefit to the process because you may perform the analysis early in the development of a solution to better understand the prompt templates and the chosen model.

A further benefit is that the automation of the perturbation process presented here allows a low-data environment to be augmented in a meaningful way. In particular, the scaffolding code for a perturbation analysis can be implemented in parallel with the labor-intensive task of labeling data.

Daniel Ferguson is on LinkedIn.
