Revolutionizing Risk Assessment — Part 3: Tracing and initial prompt engineering

Tomasz Salacinski
11 min read · Mar 29, 2024


In the previous episode we created an interface and the context for interacting with the Claude 2.x LLM via AWS Bedrock. We performed an initial interaction with the model in order to create a joke about Julius Caesar. In this episode we will create and refine the prompt for building the actual risk register.

You can find the code discussed in this episode here: https://github.com/ishish222/llm-risk-assessment (v.0.1.0)

We will start working with minimal versions of the Asset Inventory and Risk Scenario Register shown below.

Go ahead and create these examples in Google Sheets or a similar application and export them as CSV, or use the CSV content below:

Asset Inventory:

asset_name,business_impact,description
Web Server Infrastructure,3,"Hosts the company’s web application, serving as the primary point of interaction with users. Critical for service availability and directly impacts revenue and customer trust."
Customer Database,3,"Contains sensitive customer information including personal data, payment details, and transaction history. Its integrity, confidentiality, and availability are paramount for compliance, trust, and operational continuity."
Logging and Monitoring Systems,2,"Collects and analyzes logs from various systems for monitoring performance, detecting anomalies, and facilitating incident response. Critical for operational awareness and security."
Development and Testing Environments,1,"Used by the development team for building and testing new features and updates before they are deployed to the production environment. Important for maintaining the pace of innovation while ensuring application stability and security."

Risk Scenario Register:

risk_scenario_name,likelihood,description
DDoS Attack on Service Infrastructure,2,"An attacker targets the company's service infrastructure with a Distributed Denial of Service (DDoS) attack, overwhelming the servers with traffic and making the service unavailable to legitimate users."
Data Breach Through Phishing Attack,3,"An attacker successfully deceives an employee into revealing their credentials through a phishing email. The attacker gains unauthorized access to sensitive data stored on the company’s network."
Ransomware Infection,2,"Malicious software encrypts critical data and systems, rendering them unusable. The attacker demands a ransom payment for the decryption key. This scenario can disrupt operations and lead to data loss if backups are also compromised."
Third-Party Service Failure,3,"A critical third-party service provider experiences a failure or security breach, impacting the company’s ability to offer its digital service, either through direct service disruption or through a breach of data shared with the third party."
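Then we can load both files into pandas DataFrames. A minimal sketch, assuming hypothetical file names (use whatever you exported):

import pandas as pd

# File names are illustrative; point these at your exported CSVs
assets = pd.read_csv('asset_inventory.csv')
scenarios = pd.read_csv('risk_scenario_register.csv')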

Now, we want Claude to connect assets to scenarios and evaluate the risk according to a quantitative formula:

Risk = Business Impact + Likelihood

The resulting table should contain a list of risks. For example, the Customer Database (business impact 3) connected to the phishing scenario (likelihood 3) should yield a risk score of 6. Let’s start with the following prompt:

human_str = """
Please connect the following assets:
{assets}
to the following scenarios:
{scenarios}
Please only connect scenarios that are relevant to the assets.
Please calculate the risk score for each connection based on the formula:
Risk = Business Impact + Likelihood
Please output the result in the following format:
<risks>
<risk num="1">
<asset>Sample asset 1</asset>
<scenario>Sample scenario 1</scenario>
<risk_score>3</risk_score>
</risk>
<risk num="2">
<asset>Sample asset 2</asset>
<scenario>Sample scenario 2</scenario>
<risk_score>4</risk_score>
</risk>
<risk num="3">
<asset>Sample asset 3</asset>
<scenario>Sample scenario 3</scenario>
<risk_score>1</risk_score>
</risk>
... // more risks
</risks>
"""

from langchain_core.prompts import ChatPromptTemplate

# llm (the Bedrock Claude model) and parser (the XMLOutputParser) come from the previous episode
prompt = ChatPromptTemplate.from_messages([
    ("human", human_str)
])

chain = (prompt | llm | parser)
output = chain.invoke({
    'assets': assets.to_xml(root_name='Assets', row_name='Asset'),
    'scenarios': scenarios.to_xml(root_name='Scenarios', row_name='Scenario')
})

When we execute the prompt by pressing the “Create Risk Register” button, we get the following error:

In order to understand what went wrong, it would be helpful to examine the interaction in LangSmith. Let’s create a project, generate an API key and add the necessary environment variables to the .env file (and export them properly, of course).
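For reference, a minimal sketch of the LangSmith tracing variables, set here directly in Python (the same entries can go into the .env file; the project name is just an example):

import os

# Typical LangSmith tracing setup; equivalent entries can live in the .env file
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "<your-langsmith-api-key>"
os.environ["LANGCHAIN_PROJECT"] = "llm-risk-assessment"   # example project name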

After we execute the chain again, we can access and explore the recorded traces coming into the LangSmith project:

We can see that there is a problem with parsing the output by the XMLOutputParser. Let’s examine the Chat interaction to inspect the verbatim representation of the inputs and outputs.

As you can see, Claude is returning the requested output, but my guess is that the <?xml ...?> declaration at the top of the model’s response is stopping the XML parser from properly parsing it back into a Python object. Why is Claude adding this even though we didn’t ask for it? In the absence of more specific examples, it’s following the pattern established by the provided inputs, which also include the XML declaration.

Let’s adjust:

chain = (prompt | llm | parser)
output = chain.invoke({
    # xml_declaration=False drops the <?xml ...?> header from the generated XML
    'assets': assets.to_xml(root_name='Assets', row_name='Asset', xml_declaration=False),
    'scenarios': scenarios.to_xml(root_name='Scenarios', row_name='Scenario', xml_declaration=False)
})

Then we’ll flatten the output to be able to use it for output DataFrame construction:

# Flatten the parsed structure into a list of {asset, scenario, risk_score} dicts
flattened_data = []
for item in output['risks']:
    flattened_dict = {}
    for entry in item['risk']:
        flattened_dict.update(entry)
    flattened_data.append(flattened_dict)

return pd.DataFrame(flattened_data)
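For context, the XMLOutputParser returns nested dictionaries roughly shaped as below (values are illustrative), which is what the flattening loop walks over:

# Approximate shape of the parsed output (illustrative values)
output = {
    'risks': [
        {'risk': [{'asset': 'Web Server Infrastructure'},
                  {'scenario': 'DDoS Attack on Service Infrastructure'},
                  {'risk_score': '5'}]},
        # ... one entry per connected asset-scenario pair
    ]
}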

Now with the sample inputs we get an example output:

Let’s verify that the values follow our requested formula:

Risk 1: 3 + 2 = 5 (correct)
Risk 2: 3 + 3 = 6 (correct)
Risk 3: 3 + 2 = 5 (correct)
Risk 4: 3 + 2 = 5 (correct)
Risk 5: 3 + 3 = 5 (incorrect)
Risk 6: 3 + 3 = 6 (correct)

So, what went wrong?

The answer is that, while adding two natural numbers is easy for any traditional algorithm, it’s difficult for an LLM. Let’s remember what an LLM is: a huge neural network whose primary task at every step of text generation is to calculate the most probable next token. This is far from an efficient method for mathematical calculations.

We can solve this problem in a number of ways, e.g.:

  1. have the LLM only connect the assets with scenarios, and calculate the score separately with a traditional algorithm (a sketch follows this list)
  2. use the LLM function calling technique for the calculations
  3. apply prompt engineering techniques to break the complex task down into a number of smaller, easier ones
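As a quick illustration of the first option, the score could be re-derived deterministically once the LLM has proposed the asset-scenario pairs. This is a hypothetical sketch, assuming the column names from our sample CSVs:

import pandas as pd

def recompute_scores(pairs: pd.DataFrame,
                     assets: pd.DataFrame,
                     scenarios: pd.DataFrame) -> pd.DataFrame:
    # pairs: DataFrame with 'asset' and 'scenario' columns proposed by the LLM
    merged = (pairs
              .merge(assets[['asset_name', 'business_impact']],
                     left_on='asset', right_on='asset_name')
              .merge(scenarios[['risk_scenario_name', 'likelihood']],
                     left_on='scenario', right_on='risk_scenario_name'))
    # Deterministic arithmetic instead of asking the model to add numbers
    merged['risk_score'] = merged['business_impact'] + merged['likelihood']
    return merged[['asset', 'scenario', 'risk_score']]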

We will focus on the last option for now. It might not provide an immediate solution to the issue, but it will allow us to exercise some techniques for improving the prompt.

Improving the prompt by applying prompt engineering techniques

Let’s go back to the general recommended prompt structure for Claude 2.x that was already mentioned in part 2:

This is part of a wider prompt engineering guide for Claude 2.0. As I mentioned, system messages were introduced in version 2.1. Quote: “A system prompt is a way to provide context, instructions, and guidelines to Claude before presenting it with a question or task”. This means that we should move everything from the Human message into the System message, except the conversation history and the immediate task description or request.

Let’s start with task context, tone, background data, detailed task description and rules.

As you can see, I’ve left a placeholder for any additional rules that I might want to adjust later at the application level, by passing one additional prompt parameter.
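The first part of the System message might look something like this (a sketch, not the verbatim text from the repository; {additional_rules} is the placeholder mentioned above):

# Illustrative sketch of the opening of the System message
system_str = """
You are an experienced information security risk analyst performing a risk assessment.
You will be given an Asset Inventory and a Risk Scenario Register as background data.
Your task is to connect relevant risk scenarios to assets and to calculate a risk score
for each connection using the formula: Risk = Business Impact + Likelihood.

Rules:
- Only connect scenarios that are relevant to a given asset.
- Follow the requested output schema exactly.
{additional_rules}
"""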

Next, we’ll move on to describing the general context of the task at hand.

As you can see, I mentioned examples but I left them out for now. This part requires a brief comment.

In-context examples and their significance

LLMs come in two general flavours: pre-trained and fine-tuned. Pre-trained models are trained on a large body of available text in order to arrive at weights that allow them to “understand” the meaning of natural language input and to perform general language-related tasks such as:

  • classifying text
  • generating new text that corresponds to the input
  • filling out missing (masked) parts of the text
  • etc.

Up to a certain point in the history of LLMs (I’d say up to the publication of the GPT-2 paper), these pre-trained models needed to be fine-tuned for specific tasks. That meant we needed to prepare training data (example inputs and outputs for a specialised task) and then fine-tune the model, adjusting its weights so that it performs well in that specific field.

But then:

“It wasn’t until the GPT-2 and GPT-3 papers that we realized a GPT model pre-trained on enough data with enough parameters was capable of performing any arbitrary task by itself, no fine-tuning needed.

Just prompt the model, perform autoregressive language modeling, and like voila, the model magically gives us an appropriate response. This is referred to as in-context learning, because the model is using just the context of the prompt to perform the task. In-context learning can be zero shot, one shot, or few shot.” (source)

This means that if we give the model enough to go on in terms of in-context examples, we can achieve high-quality output in specialised tasks without fine-tuning.

The question now is how many examples we need and how we should organise them. In my experience, one or two in-context examples can significantly improve output quality in complex tasks.

In terms of organisation, here’s my approach. I achieved the best results by including examples of Human-Assistant message pairs in the System message. Unfortunately, this means the prompt becomes somewhat convoluted. First we provide the in-context examples (Human-Assistant pairs), then information about the expected output schema, then the immediate input in the Human message… That’s a lot, especially if the examples themselves contain a lot of text (e.g., 50+ page SOC 2 reports). That’s why it’s good to stick to the high-level prompt structure suggested for the specific model and to use XML tagging.

Let’s leave the examples field empty for now, until we have the general structure of the Human-Assistant interaction figured out; then we’ll “flesh it out”.

Thinking steps and expected output schema

I usually include the thinking steps and the definition of the expected output schema in the System message as well.

In various prompt engineering tutorials you will often encounter the recommendation to “give the model space to think”. It might be strange for an engineer to hear this kind of advice from another engineer but, similarly to “tipping” the model, this is one of those natural language processing patterns that initially seem out of place in the engineering field but actually work. “Giving space” is realised by instructing the model to produce intermediate results: breaking the complex task down into steps and, for each step, checking the results against a set of criteria before proceeding to the next one. When designing the thinking steps in a prompt, it’s usually helpful to consult a human operator who has expertise in the field.

Let’s define an initial set of thinking steps for the task at hand.
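Something along these lines (a sketch; the exact wording in the repository may differ):

# Illustrative thinking steps added to the System message
thinking_steps_str = """
Before producing the final answer, think through the task step by step:
1. List the assets together with their business impact values.
2. List the risk scenarios together with their likelihood values.
3. For each asset, decide which scenarios are relevant and briefly note why.
4. For each relevant pair, write out the addition explicitly,
   e.g. business impact 3 + likelihood 2 = risk score 5.
5. Check each sum before moving on to the next pair.
"""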

Now let’s define the output schema.
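Roughly like this (a sketch; the exact element names in the repository may differ):

# Illustrative output schema; <notes> gives the model room for intermediate reasoning
output_schema_str = """
Please output the result in the following format:
<response>
<notes>
... intermediate reasoning: relevance decisions and explicit additions ...
</notes>
<risks>
<risk num="1">
<asset>Sample asset 1</asset>
<scenario>Sample scenario 1</scenario>
<risk_score>5</risk_score>
</risk>
... // more risks
</risks>
</response>
"""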

As you can see, there is space in the output schema for the model to put its notes about the intermediate steps. Besides improving the quality of the output by decomposing the problem, we achieve one more thing: more control over the model’s chatter.

Some models (I assume the ones not fine-tuned well enough to follow instructions) have a tendency to add chatter before or after the schema no matter how much you beg or threaten them, e.g.:

Assistant:

Of course, please find the requested schema below:
<result>
[…]
</result>

or

Assistant:

<result>
[…]
</result>
With the following results you will be able to proceed with the risk treatment procedure.

This is rather frustrating when it comes to parsing the natural language output back into structured data that can be handled by a traditional algorithm.

I noticed that when I add space in the output schema for the model to think, chatter outside the schema becomes very rare. And of course we can simply discard that chatter if we maintain control over the schema.
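A minimal sketch of that discard step, assuming the <response> envelope from our schema:

import re

def extract_response(raw: str) -> str:
    # Keep only the <response>...</response> envelope; drop any chatter around it
    match = re.search(r"<response>.*</response>", raw, re.DOTALL)
    return match.group(0) if match else raw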

Human message

All of the components of the System message except the in-context examples have been sorted out. Now it’s time for the Human message, in which we include the information on the specific task along with the task parameters set at the application level. We’ll skip the chat history for now.
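The Human message might look something like this (a sketch; {assets} and {scenarios} are the application-level parameters):

# Illustrative Human message carrying the immediate task and its parameters
human_str = """
Here is the current Asset Inventory:
{assets}

Here is the current Risk Scenario Register:
{scenarios}

Please connect the relevant scenarios to the assets and produce the risk register
in the requested output format.
"""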

Aaaand that’s it :)

Let’s wrap it in LangChain’s ChatPromptTemplate and update our chain:
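Putting it together, the updated prompt and chain might look roughly like this (a sketch; the exact variable names and parameters in the repository may differ):

# Assumes the System message contains the {additional_rules} and {examples}
# placeholders; both are left empty for now.
prompt = ChatPromptTemplate.from_messages([
    ("system", system_str),   # context, rules, examples, thinking steps, output schema
    ("human", human_str),     # immediate task and its parameters
])

chain = prompt | llm | parser
output = chain.invoke({
    'assets': assets.to_xml(root_name='Assets', row_name='Asset', xml_declaration=False),
    'scenarios': scenarios.to_xml(root_name='Scenarios', row_name='Scenario', xml_declaration=False),
    'additional_rules': '',
    'examples': '',
})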

In-context examples based on Human-Assistant pairs

Now we need the in-context examples. But we don’t need to create them ourselves. Let’s have the model create them: we simply pass a prompt with the in-context examples temporarily empty and see what the model comes up with on its own. Let’s load the sample Asset Inventory and Risk Scenario Register, run the chain and see what results we get.

Haha, OK, this is exactly what I had in mind when I talked about issues with chatter: here it appears both before and after the schema :). But we can still extract valuable examples; let’s check out the trace in LangSmith:

This output needs a little adjustment but we can reuse it. First, let’s extract the schema without the chatter. Next, let’s wrap everything in <response> tags.

We get the following result:

And the trace:

The calculations are now correct.

Summary

In this episode, we created and executed an initial prompt for risk calculation according to a defined formula, based on test data. We encountered a problem with the output parser and diagnosed it using LangSmith’s tracing function. Then we encountered another problem: the model was not able to execute a simple mathematical operation.

We improved our prompt using prompt engineering techniques for the Claude 2.x model, including:

  • using XML tags
  • following recommended high-level prompt structure
  • using System, Human and Assistant messages
  • breaking down the complex task into simpler ones and “giving the model space to think”
  • requesting the model to “think aloud”

In the following episode, we’ll further improve the application by breaking down the single prompt into several smaller, parallel prompts, and we’ll turn the model into a more autonomous LLM agent that will be able to call functions. Stay tuned!
