Revolutionizing Risk Assessment with LLMs— Part 2: Boilerplate application
In this episode I will be implementing an LLM-based assistant that supports the Risk Manager in performing ISO 27005-based Risk Assessment by automating some of their tasks. The automation will involve natural language processing.
You can find the code discussed in this episode here: https://github.com/ishish222/llm-risk-assessment (v.0.1.0)
The wider context and rationale behind this implementation were explained in the previous episode, Revolutionizing Risk Assessment — Part 1: The Context, so feel free to check it out. In this episode we’ll be getting our hands dirty and start working with the code.
The initial version of the application will be dealing with the task of selecting a subset of matching scenarios for each asset to combine them into risks. We will need the operator in the loop to supervise the execution of the task and make necessary adjustments, so we also need to account for that.
Technology stack
I obviously selected the components of the technology stack that I am most familiar with, but they’re exchangeable. In this episode I will deploy a web application in Python built on top of the Gradio and Langchain libraries. We’ll be using AWS as the provider of Anthropic’s Claude 2.x model hosted via Bedrock (and as an IaaS provider).
The Gradio package is a very useful interface builder that allows for reusing implementations of the most popular interface components used when interacting with LLMs. Gradio itself is built on top of FastAPI, a popular Python package for implementing API interfaces.
Gradio lives in both worlds: backend and frontend. In the backend it allows us to abstract away the input-output interface components and just implement our own methods that execute the transformation of data. In the frontend it generates the actual HTML content that will be served to the users.
Langchain is a set of packages that implements classes and functions for decoupling the interfacing with specific LLM APIs from popular components of processing logic, such as building prompts out of templates, providing RAG capabilities, parsing and structuring LLM output, etc.
Langchain’s close relative is Langsmith, an LLM tracing, diagnostics and evaluation service. Langsmith deals with a variety of tasks related to analysing the interactions between the application and the LLM on the message level. It does so by recording selected interactions and their metadata and adding them to a separately maintained dataset.
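For reference, Langsmith tracing is typically switched on through environment variables rather than code changes. A minimal sketch could look as follows; the API key and project name are placeholders:

import os

# Enable Langsmith tracing for all chain invocations in this process
os.environ['LANGCHAIN_TRACING_V2'] = 'true'
os.environ['LANGCHAIN_API_KEY'] = '<your-langsmith-api-key>'
# Optional: group traces under a named project (placeholder name)
os.environ['LANGCHAIN_PROJECT'] = 'llm-risk-assessment'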
Procedure for automation
As the first step of risk assessment automation we will be combining asset information with risk scenario information to form risks. As input we will have a list of assets (Asset Inventory) and a list of scenarios (Risk Scenario Register), and as output we’ll have the Risk Register.
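To make the inputs more concrete, here is a hypothetical illustration of what single rows of these registers could look like; the column names are my assumption for this article, not a requirement of the standard or of the application:

import pandas as pd

# Hypothetical single-row examples; column names are for illustration only
assets = pd.DataFrame([
    {'name': 'Customer database', 'description': 'Stores customer PII', 'sensitivity': 'high'},
])
scenarios = pd.DataFrame([
    {'name': 'Ransomware attack', 'description': 'Malware encrypts production data', 'likelihood': 'medium'},
])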
To reiterate, we want to follow the ISO 27005 procedure in which the organisation decides on a specific formula for the estimation of the final risk. The formula can be quantitative (e.g., risk estimation (3) = asset sensitivity (2) + scenario likelihood (1)) or qualitative (e.g., risk estimation (medium) = asset sensitivity (medium) + scenario likelihood (low)).
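In the qualitative variant the estimation boils down to a simple lookup. Here is a minimal sketch; the matrix values are illustrative, not taken from the standard:

# Illustrative qualitative risk matrix: (asset sensitivity, scenario likelihood) -> risk estimation
RISK_MATRIX = {
    ('low', 'low'): 'low',        ('low', 'medium'): 'low',        ('low', 'high'): 'medium',
    ('medium', 'low'): 'medium',  ('medium', 'medium'): 'medium',  ('medium', 'high'): 'high',
    ('high', 'low'): 'medium',    ('high', 'medium'): 'high',      ('high', 'high'): 'high',
}

def estimate_risk(sensitivity: str, likelihood: str) -> str:
    return RISK_MATRIX[(sensitivity, likelihood)]

print(estimate_risk('medium', 'low'))  # -> 'medium', matching the qualitative example above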
Where does the LLM part fit into this simple procedure? Why can’t we just do this manually or with a traditional algorithm?
We could do this manually, but it’s time consuming. First, we need to decide which combinations even make sense and discard all the remaining ones (e.g. the risk “database” + “ransomware attack” makes sense while the risk “service continuity” + “phishing targeting customers” does not). Then, we need to calculate the estimation. If we have 20 items in the Asset Inventory and 20 items in the Risk Scenario Register, there are potentially 400 combinations that the operator needs to go through. And if risk re-assessment should be done, e.g., every time a new function is being considered for a product / service, it can become quite a daunting task that takes the operator’s time away from actual security operations.
We could try to automate it with a traditional algorithm, up to a point. The estimation is a simple task of looking up the appropriate value in the matrix, but combining the risk components and filtering out those which don’t make sense strongly relies on:
- IT security expertise
- Descriptions of assets and risk scenarios expressed in natural language
The interface
Based on the procedure we defined, we know what inputs and outputs we will be dealing with, so we can define the initial version of the Gradio interface:
import gradio as gr

with gr.Blocks() as app:
    with gr.Row():
        with gr.Accordion(open=False, label='Assets'):
            with gr.Row():
                rr_input_r_assets_f = gr.File(file_types=['.csv', '.xlsx', '.xls'])
            with gr.Row():
                rr_input_r_assets_load_btn = gr.Button('Load assets (CSV)')
            with gr.Row():
                rr_input_r_assets_inv = gr.Dataframe(label='Asset Inventory')
    with gr.Row():
        with gr.Accordion(open=False, label='Risk Scenarios'):
            with gr.Row():
                rr_input_r_scenarios_f = gr.File(file_types=['.csv', '.xlsx', '.xls'])
            with gr.Row():
                rr_input_r_scenarios_load_btn = gr.Button('Load scenarios (CSV)')
            with gr.Row():
                rr_input_r_scenarios_reg = gr.Dataframe(label='Risk Scenarios Register')
    with gr.Row():
        rr_create_btn = gr.Button('Create Risk Register')
    with gr.Row():
        rr_create_log = gr.Text(label='Creation log')
    with gr.Row():
        rr_output = gr.Dataframe(label='Generated Risk Register')

app.launch(server_name='0.0.0.0', server_port=8080)
After running the application and navigating to 127.0.0.1:8080 with the web browser we should see the following interface:
The Assets and Risk Scenarios inner content:
The DataFrame objects will be populated with the content of the CSV documents that the user chooses to load. Now we need to service the loading buttons:
import pandas as pd

def load_csv_data(
    input_file: str
) -> gr.Dataframe:
    # gr.File passes a temporary file object; we only need its path
    input_file = input_file.name
    print(f'Opening CSV: {input_file}')
    df = pd.read_csv(input_file)
    return df
with gr.Blocks() as app:
    # interface definition
    rr_input_r_assets_load_btn.click(fn=load_csv_data, inputs=[rr_input_r_assets_f], outputs=[rr_input_r_assets_inv])
We can reuse the load_csv_data function for loading the risk scenarios as well. The output gr.Dataframe object will assume a shape dictated by the contents of the input file. The only requirement is that it’s of rectangular shape, i.e. all of the rows have the same number of elements.
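For completeness, wiring the scenarios button to the same function would look along these lines (inside the same gr.Blocks context):

rr_input_r_scenarios_load_btn.click(
    fn=load_csv_data,
    inputs=[rr_input_r_scenarios_f],
    outputs=[rr_input_r_scenarios_reg]
)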
To service the generation button clicks we’ll add the following:
import pandas as pd

def generate_rr(
    assets: pd.DataFrame,
    scenarios: pd.DataFrame
) -> gr.Dataframe:
    print('Generating Risk Register')
    # interacting with Claude LLM
    return pd.DataFrame()
with gr.Blocks() as app:
    # interface definition
    rr_create_btn.click(fn=generate_rr, inputs=[rr_input_r_assets_inv, rr_input_r_scenarios_reg], outputs=[rr_output])
This will serve as a context in which we will be interacting with the LLM to produce the actual risks.
LLM interaction
Since I decided to use AWS as the IaaS provider to simplify the deployment, I’ll be using the Claude LLM served via Bedrock. Of course, Claude is only one of dozens of available models, so we can exchange it for any other in order to test the produced output and maybe evaluate and compare it.
In order to be able to re-use a wide variety of classes for building LLM applications, we’ll interact with the model through a Langchain chain. A chain in Langchain refers to any number of special objects called runnables that are chained together so that the output of one runnable is passed on to the next runnable.
In Langchain we have ready-to-use runnables for creating LLM prompts, wrapping the LLM endpoints and parsing the LLM outputs. For our purpose we’ll be using the chain:
chain = (prompt | llm | parser)
Where:
- prompt represents the request that we’ll be sending out to Claude
- llm represents the Claude’s endpoint
- parser represents the output parser that structures the natural-language content generated by Claude back into a form that can be processed with a traditional algorithm
Preparing the llm and the parser is quite straightforward:
import boto3
from langchain_community.chat_models import BedrockChat

LLM_TEMPERATURE = 0.0
LLM_MODEL_ID = 'anthropic.claude-v2'

session = boto3.Session()
bedrock = session.client('bedrock-runtime')

llm = BedrockChat(
    model_id=LLM_MODEL_ID,
    client=bedrock,
    model_kwargs={
        "temperature": LLM_TEMPERATURE
    }
)
We need to establish a session with the AWS bedrock-runtime service using a pre-defined user or role, and then point the BedrockChat wrapper at a specific model, which in our case is Anthropic’s Claude 2.x. The temperature parameter is one of the parameters used to adjust the generation process and control the “creativeness” of the model. To simplify, it allows us to decide how diverse the generated output tokens are in the context of their probabilities. A higher temperature allows for more “creative” outputs with unlikely tokens being included, but if we are after precision we set the temperature to 0.0, which means that we want only the most probable tokens to be included in the output.
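If we don’t want to rely on the default credentials chain, the session can also be pointed at a specific profile; the profile and region names below are placeholders:

import boto3

# Placeholder profile/region; in practice this could also be an assumed IAM role
session = boto3.Session(profile_name='risk-assessment', region_name='us-east-1')
bedrock = session.client('bedrock-runtime')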
For the output parser we’ll use XMLOutputParser():
from langchain_core.output_parsers.xml import XMLOutputParser
parser = XMLOutputParser()
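To give an intuition of what this parser does: it takes the XML text produced by the model and converts it into nested Python dictionaries. A rough sketch, with tag names that are just an example:

xml_text = '<risks><risk><asset>Customer database</asset><scenario>Ransomware attack</scenario></risk></risks>'

# Roughly: {'risks': [{'risk': [{'asset': 'Customer database'}, {'scenario': 'Ransomware attack'}]}]}
print(parser.parse(xml_text))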
The choice of the parser requires a bit of explanation which warrants a separate section.
Significance of model training for prompt engineering
How an LLM has been trained matters a lot if we want to improve its generated output through prompt engineering. But what is prompt engineering? Prompt engineering is the process of refining the prompt that is fed to the LLM as an input token sequence in order for it to produce an improved output sequence. It is an iterative process that sometimes involves techniques from a weird area that borders engineering and psychology.
A trivial example of iterative improvement of the prompt can be progressing from the prompt:
“Please tell me a joke about Julius Caesar.”
to:
“Please tell me a joke about Julius Caesar. If I laugh, I’ll give you a 10% tip.”
Believe it or not, there is research suggesting that you can improve the quality of the output by including this sequence of tokens. This is due to the fact that the training material used for building the model was originally produced by humans, and humans often perform better when offered rewards. The output produced by the model is determined by the training material. That’s why it’s important to make sure that:
- The training cutoff of the model is later than the publication of the knowledge required for executing the task
- There is a high likelihood that the material used for training included the body of knowledge required to properly execute the task
So, for example, in our case it might be sufficient that the training took place after the 2022 version of the ISO 27005 standard was published and that there is enough training material related to the standard that it is very likely to have been included in the training. But this might not be the case with topics related to emerging technologies which have witnessed dynamic evolution in recent years, such as Zero Knowledge Proofs or Large Language Models themselves.
Another example of the evolution in the training process might be breaking down the continuous text into quasi-chat messages such as:
Human: Please tell me a joke about Julius Caesar.
Assistant: Why did Julius Caesar buy crayons? Because he wanted to Mark Antony!
This reflects an evolution of the Claude model, which has offered the System role in chat-like prompts since version 2.1.
There is a number of other examples of training decisions that affect output, such as constitutional training, languages used, etc., but for the purpose of this article I only need to mention two:
- the overall prompt structure
- the use of XML
Prompt structure for Claude 2.x
Claude performs better when the input prompt is formulated in XML and it follows an overall structure presented in this image:
These structure requirements might soon become obsolete (they should already be adjusted after the Claude 2.1 premiere) as Claude itself evolves very fast. I present them here as an example that sometimes it’s necessary to understand the training process to improve prompt engineering for a specific model, and that two different models might have two different “optimal” prompt organisations.
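As a rough illustration of these conventions, a prompt for our task could be organised along the following lines; the tag names are my own, only the Human:/Assistant: turns and the XML wrapping of the inputs are the Claude-specific part:

Human: You are an IT security expert supporting an ISO 27005 risk assessment.

<instructions>
Combine the assets and risk scenarios below into risks. Discard combinations that do not make sense.
</instructions>

<assets>
[Asset Inventory goes here]
</assets>

<scenarios>
[Risk Scenario Register goes here]
</scenarios>

Please return the result inside <risks></risks> tags.

Assistant: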
Now we’re ready to deal with the prompt engineering, but I’ll delve into this process in the next episode. For now, let’s start with a simple prompt:
from langchain_core.prompts import ChatPromptTemplate

def generate_rr(
    assets: pd.DataFrame,
    scenarios: pd.DataFrame
) -> gr.Dataframe:

    parser = XMLOutputParser()

    session = boto3.Session()
    bedrock = session.client('bedrock-runtime')

    llm = BedrockChat(
        model_id=LLM_MODEL_ID,
        client=bedrock,
        model_kwargs={
            "temperature": LLM_TEMPERATURE
        }
    )

    prompt = ChatPromptTemplate.from_messages([
        ("human", 'Please tell me a joke about {person}')
    ])

    chain = (prompt | llm)

    output = chain.invoke({'person': 'Julius Caesar'})

    return pd.DataFrame({'joke': [output.content]})
That will give us the following output in the interface:
Summary
In this episode we created the boilerplate code for a Gradio-based application, defined the inputs and outputs of the LLM interaction and introduced some context that will be useful in our prompt engineering work.
In the next episode we will start working with Anthropic’s Claude 2.x models and lay the foundation for prompt engineering for these models.