Part 4 — Enhancing Safety in LLMs: A Rigorous Mathematical Examination of Jailbreaking

Freedom Preetham · Published in Autonomous Agents · Nov 26, 2023

The concept of jailbreaking Large Language Models (LLMs) such as GPT-4 represents a formidable challenge within the domain of artificial intelligence. This process entails the strategic manipulation of these advanced models to operate beyond their predefined ethical guidelines or operational boundaries.

In the previous blog, I covered “Mathematically Assessing Closed-LLMs for Generalization.”


In this blog, I aim to dissect the mathematical complexities of jailbreaking and provide practical mathematical tools, thereby enriching our comprehension of the phenomenon.

0. General Techniques Often Associated with Jailbreaking Attempts

In this blog, I do not intend to cover DANs or STANs, which are already covered extensively in other materials, nor walk through the basics of what jailbreaking is. Here is an overview of the techniques, though.

Prompt Crafting: This involves designing prompts in a way that tries to exploit potential weaknesses or loopholes in the model’s response filters. This might include using coded language, indirect references, or specific phrasing intended to mislead the model.

Iterative Refinement: Some users might attempt to iteratively refine their prompts based on the model’s responses, gradually steering the conversation towards a restricted area or testing the limits of the model’s response guidelines.

Contextual Obfuscation: This technique involves embedding the actual request within a larger, seemingly innocuous context, in an effort to mask the true intent of the query from the model’s filtering mechanisms.

Social Engineering: In some cases, users might try to engage with the model in a way that mimics social engineering tactics, such as building rapport or trust, in an attempt to coax the model into breaking its own rules.

The paper titled “Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study” provides a good taxonomy and set of approaches for an initial study.

Everything covered in this blog is advanced math and tooling for optimizing and engineering better jailbreaking mechanisms.

1. Mathematical Framework for Prompt Engineering

Prompt engineering is a highly intricate technique that entails the strategic formulation of inputs to direct LLMs towards generating specific outputs, which may include content that is either prohibited or unintended. In this section, I intend to expand on the mathematical framework underlying prompt engineering, introducing more complex equations and concepts.

Mathematical Formulation of Prompt Engineering

Consider an LLM as a function F, mapping a prompt P to an output O. The process of prompt engineering can be conceptualized as a sophisticated optimization problem, where the objective is to refine P to achieve a predetermined output O_target. This optimization problem can be mathematically expressed as:

P* = argmax_P Score(F(P), O_target)

Here, the function ‘Score’ is a complex measure that quantitatively evaluates the alignment of F(P) with O_target. This function can be further elaborated as:

Score(F(P), O_target) = Σ_i w_i · Eval_i(F(P), O_target)

where w_i are weighting coefficients and Eval_i are individual evaluation functions assessing different aspects of the alignment, such as semantic coherence, relevance, and subtlety.

Incorporating Advanced Linguistic Metrics

To enhance the sophistication of prompt engineering, advanced linguistic metrics can be integrated into the evaluation functions. For instance, semantic coherence can be assessed using a metric based on sentence embeddings:

Eval_coherence(F(P), O_target) = CosineSimilarity(Embed(F(P)), Embed(O_target))

where Embed represents a function that converts text into a high-dimensional vector space, and CosineSimilarity measures the cosine of the angle between these vectors, indicating the level of semantic similarity.
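To make this concrete, here is a minimal sketch of such a composite ‘Score’ in Python. Everything here is illustrative: the embed function is a stand-in for any real sentence-embedding model, and the two evaluation functions and their weights are assumptions, not a prescribed implementation.

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine of the angle between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def embed(text: str) -> np.ndarray:
    """Placeholder embedding, deterministic within a run.
    (Assumption: swap in a real sentence-embedding model here.)"""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(384)

def semantic_coherence(output: str, target: str) -> float:
    """Eval_1: CosineSimilarity(Embed(F(P)), Embed(O_target))."""
    return cosine_similarity(embed(output), embed(target))

def length_match(output: str, target: str) -> float:
    """Eval_2 (illustrative): penalize large length mismatch."""
    return 1.0 - abs(len(output) - len(target)) / max(len(output), len(target))

# Score(F(P), O_target) = sum_i w_i * Eval_i(F(P), O_target)
EVALS = [(0.7, semantic_coherence), (0.3, length_match)]

def score(model_output: str, target: str) -> float:
    return sum(w * ev(model_output, target) for w, ev in EVALS)

print(f"{score('a draft response', 'the desired response'):.3f}")
```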

Optimization Techniques

The optimization of P can involve gradient-based methods, where the gradient of the ‘Score’ function with respect to P is computed to iteratively refine the prompt. This can be represented as:

P_new = P + η ∇_P Score(F(P), O_target)

where,

  • P_new represents the updated prompt.
  • P is the original prompt.
  • η denotes the learning rate, a scalar that determines the step size during the optimization.
  • ∇_P Score(F(P), O_target) is the gradient of the ‘Score’ function with respect to the prompt P. This gradient indicates the direction in which the prompt should be adjusted to maximize the ‘Score’ function, which measures the alignment of the LLM’s output with the target output O_target.

Here is an entire paper on Automatic Prompt Optimization with “Gradient Descent” and Beam Search.
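Because prompts are discrete text, the gradient above is usually approximated rather than computed exactly against a black-box model. A minimal sketch of one such approximation, greedy gradient-free hill climbing over candidate edits, follows; this is my simplification, not the APO algorithm from the paper, and the model and score arguments are assumed callables supplied by the caller.

```python
def candidate_edits(prompt: str) -> list[str]:
    """Toy edit operator: delete or duplicate one word at a time.
    (Assumption: real systems use paraphrasers or LLM-proposed edits.)"""
    words = prompt.split()
    edits = []
    for i in range(len(words)):
        edits.append(" ".join(words[:i] + words[i + 1:]))                   # delete word i
        edits.append(" ".join(words[:i + 1] + [words[i]] + words[i + 1:]))  # duplicate word i
    return edits or [prompt]

def hill_climb(prompt, model, score, target, steps=20):
    """Greedy, gradient-free ascent on Score(F(P), O_target):
    keep the best-scoring edit each step, stop at a local optimum."""
    best, best_score = prompt, score(model(prompt), target)
    for _ in range(steps):
        cand = max(candidate_edits(best), key=lambda p: score(model(p), target))
        cand_score = score(model(cand), target)
        if cand_score <= best_score:   # no edit improves the score
            break
        best, best_score = cand, cand_score
    return best

# usage sketch: hill_climb("some starting prompt", model=my_llm, score=score,
#               target="desired output")   # my_llm is a hypothetical callable
```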

Practical Example with Advanced Considerations

A practical example of prompt engineering could involve constructing a prompt that appears neutral but is intricately designed to exploit the model’s pattern recognition capabilities. This could involve using a combination of syntactic structures and semantic cues that are known to trigger specific responses in the model. The effectiveness of such a prompt can be evaluated using the advanced ‘Score’ function, ensuring that the generated content subtly aligns with the intended unethical or prohibited output, while maintaining a facade of neutrality.

Oh, you thought I would present you with an example of such a prompt so that you can run a jailbreak? Tough luck ;)

You can find a bunch of prompts at jailbreakchat.com (most no longer work, as they have been actively patched in LLMs).

2. Leveraging Inherent Model Biases

The approach of leveraging inherent model biases in LLMs involves a sophisticated understanding of the biases that arise from the model’s training data. These biases can be strategically used to elicit specific responses from the model. The mathematical framework for this approach can be expanded to include more complex equations and concepts.

Mathematical Perspective

Bias Vector Formulation: Let’s consider a bias vector B within the model. This vector represents the directional tendency of the model’s responses based on its training data. The output O given an input I can be expressed as a function influenced by this bias:

O = F(I) + ϵ(B)

Here, ϵ(B) is a perturbation function that modifies the output in a direction aligned with the bias vector B.

Gradient Ascent for Bias Amplification: To amplify the influence of the bias, we can use a gradient ascent approach. The input I is iteratively adjusted to maximize the alignment of the output with the bias vector. This can be mathematically represented as:

I_new = I + α ∇_I Score(F(I), B)

In this equation, α is the learning rate, and ∇_I Score(F(I), B) is the gradient of the ‘Score’ function with respect to the input I. The ‘Score’ function measures how well the output aligns with the desired bias.
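As a toy illustration, assume the input lives in a continuous embedding space and the model F is a fixed linear map; gradient ascent on the alignment score with the bias vector B then looks like this (a sketch on synthetic data, not an attack on a real model):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
W = rng.standard_normal((d, d))   # stand-in for the model F: O = W @ I
B = rng.standard_normal(d)        # bias direction (assumption: known or estimated)
B /= np.linalg.norm(B)

def score(I: np.ndarray) -> float:
    """Alignment (cosine) of the output F(I) with the bias vector B."""
    O = W @ I
    return float(B @ O / np.linalg.norm(O))

def grad_score(I: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """Numerical gradient of the score with respect to the input I."""
    g = np.zeros_like(I)
    for j in range(len(I)):
        e = np.zeros_like(I); e[j] = eps
        g[j] = (score(I + e) - score(I - e)) / (2 * eps)
    return g

I = rng.standard_normal(d)
alpha = 0.5
for _ in range(100):              # I_new = I + alpha * grad_I Score(F(I), B)
    I = I + alpha * grad_score(I)
print(f"final alignment score: {score(I):.3f}")  # should increase toward 1.0
```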

Complexity in Bias Representation: To add depth, we can consider a multi-dimensional bias space where B is a matrix representing various bias dimensions. The perturbation function ϵ can then be a more complex function, potentially involving non-linear transformations to capture the intricate ways in which biases manifest in the model’s outputs.

Probabilistic Modeling of Bias Influence: We can also introduce a probabilistic model to capture the uncertainty in how biases influence the model’s output. For instance, the perturbation ϵ(B) can be modeled as a stochastic process, adding a layer of probabilistic reasoning to the bias exploitation.

Practical Example with Advanced Considerations

Consider a scenario where the model has a known bias towards a specific political ideology. The goal is to craft inputs that not only align with this bias but also amplify it in the model’s output. Using the advanced mathematical framework:

  • We first identify the bias vector B corresponding to the political ideology.
  • We then iteratively adjust the input I using gradient ascent, guided by the multi-dimensional bias space and the probabilistic perturbation model.
  • The process involves calculating the gradient of the ‘Score’ function in this complex bias space, considering both the deterministic and stochastic components of bias influence.

This advanced approach allows for a more nuanced and effective exploitation of biases in LLMs, leading to outputs that are more closely aligned with the desired bias, while also accounting for the inherent complexities and uncertainties in how biases influence model behavior.

3. Circumventing Safety Mechanisms

The process of circumventing safety mechanisms in LLMs involves a sophisticated understanding of the model’s safety protocols and identifying ways to exploit their vulnerabilities. This section expands on the mathematical framework for this approach, introducing more complex equations and concepts.

Mathematical Perspective

Safety Protocol Function: Let S denote the safety protocol function applied to the model’s output O. This function is designed to mitigate harmful or undesirable outputs. Mathematically, the safety protocol can be represented as:

O_safe = S(F(I))

Here, O_safe is the output after applying the safety protocol to the original output F(I).

Objective Function for Circumventing Safety Protocols: The goal is to find an input I such that the safety protocol S fails to neutralize the detrimental aspects of the output. This can be formulated as an optimization problem:

max_I Score(S(F(I)), Harmful)

The ‘Score’ function here measures the effectiveness of the input I in generating an output that is harmful yet bypasses the safety protocol.

Incorporating Constraint Satisfaction: The optimization problem should be subject to the constraint that I is a valid input. This can be represented using a constraint function C(I), which ensures that I adheres to certain predefined criteria:

max_I Score(S(F(I)), Harmful) subject to C(I) = True

where,

  • max_I represents the maximization process over the input variable I.
  • Score(S(F(I)), Harmful) measures the effectiveness of the input I in generating an output that is harmful yet bypasses the safety protocol S.
  • subject to C(I) = True adds a constraint to the optimization problem, ensuring that the input I satisfies certain conditions (represented by C(I)).

Advanced Optimization Techniques: To solve this constrained optimization problem, advanced techniques such as Lagrangian multipliers or penalty methods can be employed. These methods allow for the incorporation of the constraint into the optimization process, ensuring that the solution respects the validity of the input.
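A hedged sketch of the penalty-method idea on an abstract objective: the constraint C(I) is folded into the objective as a squared penalty whose weight ρ grows, pulling the maximizer back inside the feasible set. Both the objective and the constraint below are stand-in toy functions, not an actual attack objective.

```python
import numpy as np

def objective(I: np.ndarray) -> float:
    """Stand-in for Score(S(F(I)), Harmful); any smooth function works here."""
    return -float(np.sum((I - 2.0) ** 2))   # unconstrained maximum at I = 2

def constraint_violation(I: np.ndarray) -> float:
    """C(I) = True iff ||I|| <= 1 (illustrative validity constraint)."""
    return max(0.0, float(np.linalg.norm(I)) - 1.0)

def penalized(I: np.ndarray, rho: float) -> float:
    # max_I objective(I) s.t. C(I)  ->  max_I objective(I) - rho * violation(I)^2
    return objective(I) - rho * constraint_violation(I) ** 2

def solve(rho: float, steps: int = 500, lr: float = 0.01) -> np.ndarray:
    I, eps = np.zeros(3), 1e-5
    for _ in range(steps):                   # crude numerical gradient ascent
        g = np.zeros_like(I)
        for j in range(len(I)):
            e = np.zeros_like(I); e[j] = eps
            g[j] = (penalized(I + e, rho) - penalized(I - e, rho)) / (2 * eps)
        I += lr * g
    return I

for rho in (1.0, 10.0, 100.0):               # larger rho tightens the constraint
    I = solve(rho)
    print(rho, np.round(I, 3), f"||I|| = {np.linalg.norm(I):.3f}")
```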

Probabilistic Modeling of Safety Protocol Efficacy: Introducing a probabilistic model to estimate the likelihood of an input bypassing the safety protocol can add depth to the analysis. This involves modeling the safety protocol’s effectiveness as a stochastic process, adding a layer of uncertainty to the circumvention strategy.

Practical Example with Advanced Considerations: Consider a scenario where the goal is to engineer inputs that produce outputs with harmful implications in a manner that is not detected by the safety filters. Using the advanced mathematical framework:

  • The input I is iteratively adjusted using optimization techniques, considering both the ‘Score’ function and the constraint function C(I).
  • The process involves balancing the maximization of the ‘Score’ function with the satisfaction of the constraint, ensuring that the input remains valid while effectively bypassing the safety protocol.
  • Probabilistic models can be used to estimate the likelihood of success in bypassing the safety protocol, guiding the optimization process.

This advanced approach allows for a more nuanced and effective strategy in circumventing safety mechanisms in LLMs, leading to a deeper understanding of the vulnerabilities in these systems and how they can be exploited while adhering to input validity constraints.

4. Advanced Techniques in Jailbreaking

To deepen the discussion on advanced techniques in jailbreaking and incorporate more complex mathematical equations, we can expand on each of the mentioned techniques:

Adversarial Machine Learning

Adversarial machine learning in the context of LLMs often involves creating input data that is slightly perturbed to mislead the model.

This can be mathematically represented as:

I_adv = I + δ

where I is the original input, δ is a small perturbation, and I_adv is the adversarial input. The goal is to find δ such that the model’s output is significantly altered. The optimization problem can be formulated as:

min_δ L(F(I + δ), Y_target)

where L is a loss function that measures the discrepancy between the model’s output F(I + δ) and the desired target output Y_target.
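For a differentiable toy model, the search for δ can be approximated with a single signed-gradient step in the style of FGSM. This is a sketch under the assumption of white-box access to a small PyTorch model; real LLM inputs are discrete tokens, so this is an idealization rather than a working attack.

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(8, 3)          # stand-in for F
I = torch.randn(1, 8)                  # original input
y_target = torch.tensor([2])           # desired target output class
loss_fn = torch.nn.CrossEntropyLoss()

delta = torch.zeros_like(I, requires_grad=True)
loss = loss_fn(model(I + delta), y_target)   # L(F(I + delta), Y_target)
loss.backward()

eps = 0.1                               # perturbation budget ||delta||_inf <= eps
delta_adv = -eps * delta.grad.sign()    # signed step that decreases the target loss
I_adv = I + delta_adv                   # I_adv = I + delta

with torch.no_grad():
    p_before = model(I).softmax(dim=1)[0, y_target].item()
    p_after = model(I_adv).softmax(dim=1)[0, y_target].item()
print(f"target-class probability: {p_before:.3f} -> {p_after:.3f}")  # should increase
```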

Generative Adversarial Networks (GANs)

GANs consist of two neural networks, the generator G and the discriminator D, competing against each other. The generator aims to produce data that is indistinguishable from real data, while the discriminator tries to distinguish between real and generated data. The objective function for a GAN can be represented as:

min_G max_D V(D, G) = E_{x∼p_data(x)}[log D(x)] + E_{z∼p_z(z)}[log(1 − D(G(z)))]

where,

  • min_G represents the minimization process over the generator network G.
  • max_D represents the maximization process over the discriminator network D.
  • E_{x∼p_data(x)}[log D(x)] is the expectation over real data samples, measuring the log likelihood that the discriminator correctly classifies real data.
  • E_{z∼p_z(z)}[log(1 − D(G(z)))] is the expectation over noise samples, measuring the log likelihood that the discriminator correctly classifies generated (fake) data.
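A minimal PyTorch skeleton of this minimax game on 1-D toy data (the architecture and hyperparameters are illustrative, and the generator update uses the standard non-saturating variant of the objective):

```python
import torch
from torch import nn

torch.manual_seed(0)
G = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 1))              # generator
D = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())  # discriminator
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()

for step in range(2000):
    x_real = torch.randn(64, 1) * 0.5 + 3.0      # "real" data: N(3, 0.5)
    z = torch.randn(64, 4)                        # noise samples z ~ p_z(z)

    # max_D  E[log D(x)] + E[log(1 - D(G(z)))]
    opt_d.zero_grad()
    d_loss = bce(D(x_real), torch.ones(64, 1)) + \
             bce(D(G(z).detach()), torch.zeros(64, 1))
    d_loss.backward(); opt_d.step()

    # min_G  E[log(1 - D(G(z)))], implemented as max E[log D(G(z))]
    opt_g.zero_grad()
    g_loss = bce(D(G(z)), torch.ones(64, 1))
    g_loss.backward(); opt_g.step()

print("generated mean:", G(torch.randn(1000, 4)).mean().item())  # should drift toward ~3.0
```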

Generation-Aware Alignment Strategy

The generation-aware alignment method aims to improve the model’s defense against attacks that exploit its generation capabilities. This method actively compiles examples using various decoding configurations. To elaborate, for a given language model M and a prompt p, the model produces output sequences r by sampling from h(M, p).

Here, h is a decoding strategy within the set of decoding strategies H, converting the model’s predicted probabilities for subsequent tokens based on the prompt p into a sequence of tokens from the vocabulary V. In this process, the generation-aware alignment collects n different responses for each prompt p, denoted as:

R_p = {r_{1,h,p}, …, r_{n,h,p}}

where r_{i,h,p} ∼ h(M, p) represents the i-th sampling result from h(M, p). These responses R_p are then sorted into two categories: R_{p,a} for aligned (or appropriate) responses, and R_{p,m} for misaligned (or inappropriate) responses. The objective of this generation-aware alignment is to minimize errors based on a retrospective evaluation method.

The associated loss function L, used in training the generative model, is an average over a set of prompts P; for each prompt p, it computes a normalized sum of negative log probabilities over pairs of responses (r_m, r_a) drawn from the misaligned and aligned response sets R_{p,m} and R_{p,a}, respectively. S_a and S_m represent scenarios or conditions for aligned and misaligned responses. The loss is minimized during training to improve alignment with desired outcomes.

Here is an entire paper on Catastrophic Jailbreak of Open-Source LLMs via Exploiting Generation.
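A hedged sketch of what such a pairwise loss could look like, assuming we already have the model’s log-probabilities for each sampled response. The sigmoid-of-difference preference term is my paraphrase of the verbal description above, not the exact formulation from the paper:

```python
import math

def generation_aware_loss(prompts: dict) -> float:
    """prompts maps p -> (aligned_logprobs, misaligned_logprobs), each a list
    of log P(r | p) values for sampled responses. The loss averages, over
    prompts, a normalized sum over (r_m, r_a) pairs of negative log
    probabilities favoring aligned over misaligned responses.
    (Assumption: a softmax-style pairwise preference term.)"""
    total = 0.0
    for p, (aligned, misaligned) in prompts.items():
        pair_sum = 0.0
        for la in aligned:
            for lm in misaligned:
                # -log sigmoid(log P(r_a|p) - log P(r_m|p)): small when the
                # aligned response is much more likely than the misaligned one
                pair_sum += -math.log(1.0 / (1.0 + math.exp(-(la - lm))))
        total += pair_sum / (len(aligned) * len(misaligned))
    return total / len(prompts)

# toy usage: one prompt with two aligned and two misaligned samples
print(generation_aware_loss({"p1": ([-1.0, -1.5], [-4.0, -5.0])}))
```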

Neural Network Interpretability

Understanding a neural network’s decision-making process can be approached through techniques like Layer-wise Relevance Propagation (LRP) or Shapley value analysis. For instance, LRP decomposes the output decision into contributions of individual input features, which can be expressed as:

R_i = Σ_j (a_i w_ij / Σ_k a_k w_kj) R_j

where R_i is the relevance of neuron i, a_i is its activation, w_ij is the weight between neurons i and j, and R_j is the relevance of neuron j in the subsequent layer. This decomposition helps in tracing back the decision-making process.
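A small numeric sketch of this decomposition on a toy two-layer ReLU network, using the standard LRP-ε stabilizer in the denominator:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(4)        # input activations a_i
W1 = rng.standard_normal((4, 3))  # weights w_ij, layer 1
W2 = rng.standard_normal((3, 1))  # weights, layer 2

h = np.maximum(0, x @ W1)         # hidden activations
y = h @ W2                        # network output

def lrp_step(a: np.ndarray, W: np.ndarray, R_next: np.ndarray,
             eps: float = 1e-6) -> np.ndarray:
    """R_i = sum_j (a_i * w_ij / sum_k a_k * w_kj) * R_j  (LRP-epsilon rule)."""
    z = a @ W                      # sum_k a_k * w_kj
    z = z + eps * np.sign(z)       # stabilize small denominators
    s = R_next / z
    return a * (W @ s)

R_out = y.copy().reshape(-1)       # start: relevance equals the output
R_hidden = lrp_step(h, W2, R_out)
R_input = lrp_step(x, W1, R_hidden)
print("input relevances:", np.round(R_input, 3))
print(f"conservation check: {R_input.sum():.4f} vs output {y.item():.4f}")
```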

These mathematical formulations provide a deeper and more intricate understanding of the advanced techniques used in jailbreaking LLMs, highlighting the sophisticated interplay between algorithmic design and model exploitation.

5. Ethical Implications and Countermeasures

Let’s delve deeper into the discussion on ethical implications and countermeasures related to jailbreaking.

Robust Training

Robust training can involve adding adversarial examples to the training dataset. This introduces perturbations to the input data to ensure the model’s stability against potential adversarial attacks. Mathematically, this can be represented as:

min_θ E_{(x,y)∼Data}[L(f(x; θ), y)] + λ E_{x′∼Adversarial}[L(f(x′; θ), y)]

where,

  • min_θ represents the minimization process over the model parameters θ.
  • E_{(x,y)∼Data} denotes the expectation over data samples (x, y) drawn from the data distribution.
  • L is the loss function measuring the model’s performance.
  • f(x; θ) represents the model’s output for input x with parameters θ.
  • λ controls the influence of adversarial examples in training.
  • E_{x′∼Adversarial} denotes the expectation over adversarial examples x′ drawn from the adversarial distribution.

This equation captures the objective of training a model to be robust against adversarial examples by minimizing the loss while considering both real and adversarial data.
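A compact PyTorch sketch of this mixed objective, using a single FGSM-style step to generate the adversarial batch (λ, ε, and the toy task are illustrative assumptions):

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 2))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
lam, eps = 1.0, 0.1                      # lambda and perturbation budget

for step in range(500):
    x = torch.randn(64, 2)
    y = (x[:, 0] > x[:, 1]).long()       # toy labels

    # build adversarial examples x' = x + eps * sign(grad_x L)
    x_adv = x.clone().requires_grad_(True)
    loss_fn(model(x_adv), y).backward()
    x_adv = (x_adv + eps * x_adv.grad.sign()).detach()

    # min_theta E[L(f(x; theta), y)] + lambda * E[L(f(x'; theta), y)]
    opt.zero_grad()
    loss = loss_fn(model(x), y) + lam * loss_fn(model(x_adv), y)
    loss.backward()
    opt.step()
```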

Dynamic Safety Protocols

Developing adaptive safety mechanisms can involve reinforcement learning techniques. For instance, consider a Markov Decision Process (MDP) where the model’s actions influence the safety of generated content. The objective can be defined as:

max_π E_{s∼p(s)}[ Σ_{t=0}^{∞} γ^t R(s_t, a_t) ]

where,

  • max_π represents the maximization process over the safety policy π.
  • E_{s∼p(s)} denotes the expectation over states s drawn from the state distribution p(s).
  • The summation Σ_{t=0}^{∞} extends over time steps t from 0 to infinity.
  • γ^t is the discount factor applied to the reward at time step t.
  • R(s_t, a_t) is the reward associated with the state-action pair (s_t, a_t).

This equation reflects the objective of reinforcement learning in the context of dynamic safety protocols, where the policy π is optimized to maximize the expected cumulative reward over an infinite time horizon.
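As a small illustration, the discounted objective can be estimated by Monte-Carlo rollouts of a candidate safety policy in a toy MDP; the states, transitions, rewards, and policy below are all stand-in assumptions.

```python
import random

random.seed(0)
GAMMA = 0.95
STATES = range(3)

def reward(s: int, a: int) -> float:
    """Toy R(s_t, a_t): action 1 is the 'safe' action in every state."""
    return 1.0 if a == 1 else -0.5

def step(s: int, a: int) -> int:
    return random.choice(list(STATES))   # toy transition dynamics

def policy(s: int) -> int:
    return 1                             # candidate safety policy pi(s)

def rollout_return(horizon: int = 200) -> float:
    """Truncated estimate of sum_t gamma^t R(s_t, a_t)."""
    s, g, discount = random.choice(list(STATES)), 0.0, 1.0
    for _ in range(horizon):             # gamma^200 is negligible
        a = policy(s)
        g += discount * reward(s, a)
        discount *= GAMMA
        s = step(s, a)
    return g

est = sum(rollout_return() for _ in range(100)) / 100
print(f"estimated discounted return: {est:.2f}")  # ~ 1/(1 - gamma) = 20
```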

Transparency and Oversight

Ethical Compliance Index: To ensure ethical compliance, an “Ethical Compliance Index” (ECI) can be defined mathematically as a combination of factors such as fairness, bias mitigation, and alignment with ethical guidelines. This can be expressed as:

ECI = α · Fairness + β · BiasMitigation + γ · EthicalAlignment

In this equation:

  • ECI represents the Ethical Compliance Index, which is a composite measure of ethical performance.
  • α, β, and γ are weighting factors that determine the relative importance of fairness, bias mitigation, and ethical alignment, respectively.
  • Fairness, Bias Mitigation, and Ethical Alignment are individual metrics or scores related to these ethical considerations.
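In code, the index is just a convex combination of the three scores (the weights below are illustrative and chosen to sum to 1 so the index stays in [0, 1] when the inputs do):

```python
def ethical_compliance_index(fairness: float, bias_mitigation: float,
                             ethical_alignment: float,
                             alpha: float = 0.4, beta: float = 0.3,
                             gamma: float = 0.3) -> float:
    """ECI = alpha*Fairness + beta*BiasMitigation + gamma*EthicalAlignment."""
    assert abs(alpha + beta + gamma - 1.0) < 1e-9   # keep it a convex combination
    return alpha * fairness + beta * bias_mitigation + gamma * ethical_alignment

print(ethical_compliance_index(0.9, 0.8, 0.95))   # -> 0.885
```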

By incorporating these complex mathematical representations, we deepen our understanding of the countermeasures against jailbreaking while addressing ethical implications, thereby enhancing the robustness and ethical compliance of LLMs.

6. Other Mathematical Modeling of Jailbreaking Strategies

Bayesian Probabilistic Modeling

Probabilistic models can estimate the likelihood of a prompt leading to a jailbreak. Using Bayesian modeling, we can calculate the posterior probability of a prompt P causing a jailbreak J given observed data D:

P(J | D) = P(D | J) · P(J) / P(D)

where,

  • P(J | D) represents the conditional probability of a jailbreak J given observed data D.
  • P(D | J) is the likelihood of the data given a jailbreak.
  • P(J) is the prior probability of a jailbreak.
  • P(D) is the evidence.

This equation is commonly used in Bayesian probability to calculate the posterior probability of an event (in this case, jailbreak) given observed data and prior probabilities.
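Numerically, with made-up illustrative numbers and the evidence P(D) expanded via the law of total probability:

```python
def posterior_jailbreak(p_d_given_j: float, p_j: float, p_d: float) -> float:
    """P(J|D) = P(D|J) * P(J) / P(D)."""
    return p_d_given_j * p_j / p_d

# illustrative numbers: rare prior, moderately distinctive evidence
p_j = 0.01                 # P(J): prior probability of a jailbreak
p_d_given_j = 0.60         # P(D|J): likelihood of the observed data under J
p_d_given_not_j = 0.05     # assumed likelihood under no jailbreak
p_d = p_d_given_j * p_j + p_d_given_not_j * (1 - p_j)   # law of total probability
print(f"P(J|D) = {posterior_jailbreak(p_d_given_j, p_j, p_d):.3f}")  # ~0.108
```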

Game Theory Applications

Game theory can model the interaction between jailbreakers and LLM developers. Consider a two-player game where the jailbreaker and developer each choose strategies. The Nash Equilibrium is defined as a set of strategies where no player can improve their outcome by unilaterally changing their strategy. Mathematically:

U_i(s_i*, s_{−i}*) ≥ U_i(s_i, s_{−i}*) for all strategies s_i and for all players i

where,

  • U_i represents the utility or payoff function of player i.
  • s_i* is player i’s equilibrium strategy, and s_{−i}* is the equilibrium strategy profile of all other players.
  • s_i is any alternative strategy available to player i.

This expresses the condition of Nash Equilibrium: no player i can improve their utility by unilaterally deviating from s_i* to some s_i, given that all other players stick to their equilibrium strategies s_{−i}*.
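A quick brute-force check of this condition on a small two-player game (the payoff matrix is invented for illustration and happens to have a single pure-strategy equilibrium):

```python
import itertools
import numpy as np

# U[i][s0][s1]: utility for player i given (jailbreaker strategy s0, developer strategy s1)
U = np.array([
    [[3, 0], [5, 1]],    # player 0 (jailbreaker), prisoner's-dilemma-style payoffs
    [[3, 5], [0, 1]],    # player 1 (developer)
])

def is_nash(s: tuple[int, int]) -> bool:
    """U_i(s_i*, s_-i*) >= U_i(s_i, s_-i*) for every player i and deviation s_i."""
    for i in range(2):
        for dev in range(2):
            s_dev = list(s); s_dev[i] = dev
            if U[i][s_dev[0]][s_dev[1]] > U[i][s[0]][s[1]]:
                return False   # player i has a profitable unilateral deviation
    return True

for s in itertools.product(range(2), repeat=2):
    if is_nash(s):
        print("pure-strategy Nash equilibrium:", s)   # -> (1, 1)
```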

7. Future Directions in Jailbreaking Research

  • Quantum Computing and LLMs: Investigating the potential ramifications of quantum computing on jailbreaking, given its unique information processing capabilities.
  • Cross-Model Vulnerability Analysis: Examining how vulnerabilities in one model might translate to others, fostering a more comprehensive grasp of systemic risks in LLMs.
  • Ethical AI Development Frameworks: Crafting frameworks to guide the ethical development of AI, integrating insights from jailbreaking research to inform best practices.

Jailbreaking LLMs presents a multifaceted challenge that underscores the necessity for robust, ethical, and adaptive AI systems. A thorough understanding of the mathematical underpinnings and potential vulnerabilities of these models is pivotal in crafting more secure and responsible AI. As LLMs continue to evolve, our strategies for their safe utilization must also advance, ensuring their role as potent yet secure tools in the expanding landscape of AI.
