LLM Criteria Evaluation with AWS Bedrock and Step Functions

A proof of concept

Pieterjan Criel @pjcr
Product & Engineering at Showpad

--

Introduction

Whether you’re experimenting with various prompts, deciding on model upgrades, balancing cost and performance, or ensuring unbiased outputs, working with LLMs requires an effective evaluation framework.

When you’re dealing with tasks that lack predefined reference labels, such as evaluating production data or non-fact-based queries, the criteria evaluator proves to be a handy tool. It allows you to verify an LLM or Chain’s output against a bespoke set of criteria, focusing on semantic aspects like creativity or relevance, beyond mere factual accuracy.

This article looks at implementing such a criteria evaluation using AWS Bedrock (Claude) and AWS Step Functions.

All code is available on GitHub:
https://github.com/PieterjanCriel/bedrock-stepfunction-criteria-evaluator
Remark: deploying this code on an AWS account is not covered by the free tier.

LangChain’s CriteriaEvalChain Class

In LangChain, the CriteriaEvalChain class is designed to facilitate evaluation by allowing you to define and apply your specific criteria to an LLM’s outputs. It returns a dictionary with reasoning, alongside a binary “Y/N” value, providing a comprehensive basis for assessment. LangChain includes several predefined evaluators catering to various criteria such as conciseness, relevance, and correctness. You can also define custom criteria to suit your specific evaluation needs.
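
As a point of reference, this is roughly what using it looks like; a minimal sketch, assuming a recent langchain version with the evaluation module (by default the evaluator runs on an OpenAI chat model unless you pass your own llm):

from langchain.evaluation import load_evaluator

# Load the predefined criteria evaluator; "conciseness" is one of the
# built-in criteria. Pass llm=... to evaluate with a different model.
evaluator = load_evaluator("criteria", criteria="conciseness")

result = evaluator.evaluate_strings(
    input="What is the capital of Belgium?",
    prediction="Brussels, the vibrant capital city of Belgium, ...",
)

# result is a dictionary with the reasoning, a binary "Y"/"N" value,
# and a numeric score, e.g. {"reasoning": "...", "value": "N", "score": 0}
print(result)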

The associated prompt is as follows:

You are assessing a submitted answer on a given task or input based on a set of criteria. Here is the data:
[BEGIN DATA]
***
[Input]: {}
***
[Submission]: {}
***
[Criteria]: {}
***
[END DATA]
Does the submission meet the Criteria? First, write out in a step by step manner your reasoning about each criterion to be sure that your conclusion is correct. Avoid simply stating the correct answers at the outset. Then print only the single character "Y" or "N" (without quotes or punctuation) on its own line corresponding to the correct answer of whether the submission meets all criteria. At the end, repeat just the letter again by itself on a new line.

This prompt needs to be interpolated with an input, a submission, and the criteria. For example:

Input:
What is the capital of Belgium?

Submission:
Brussels, the vibrant capital city of Belgium, is not only the administrative heart of the country but also a central hub for the European Union, hosting major institutions such as the European Commission and the European Council. This distinction lends Brussels an international flair, with a diverse population and a cosmopolitan atmosphere that is palpable as you walk its streets.

Situated in the central part of Belgium, Brussels is unique in that it is…

Criteria: conciseness

With these values interpolated into the base prompt, the output will contain reasoning that the submission includes far more information than was required to address the input. As such, the evaluation on conciseness will result in N (not concise, though factually correct).

Leveraging AWS Bedrock and Step Functions to implement a Criteria Evaluator

In this section we’re going to explore a way of implementing a criteria evaluation chain without depending on LangChain. For each new dataset (hosted on AWS S3), we want to trigger an AWS Step Function that processes the dataset (file) and stores the results of a criteria evaluation chain in a new file. As such, the data never has to leave the production account, and the results are immediately available for further analysis.

AWS CDK

CDK allows you to write your infrastructure as code in a supported programming language, such as TypeScript. We will use CDK to manage our deployment.

Processing Files on S3

The Step Functions distributed map offers a robust tool for constructing parallel, serverless workflows for data processing. It integrates effectively with S3, facilitating the efficient handling of millions of objects.

To set up a large-scale parallel workload in your workflows, include a Map state in Distributed mode. This mode processes the items in the dataset in iterations called child workflow executions. You can specify the number of child workflow executions that can run in parallel. Each child workflow execution has its own execution history, separate from that of the parent workflow. If you don't specify a limit, Step Functions runs up to 10,000 child workflow executions in parallel. Crazy, right?

AWS Bedrock integrations

The new Step Functions integration with Amazon Bedrock allows you to orchestrate tasks for building generative AI applications. The InvokeModel API action invokes a model and runs inference with the input provided in the parameters. Use this action to run inference for text, image, and embedding models. It supports large payloads through Amazon S3, enhancing data handling capabilities.
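
To get a feel for what the state machine will be doing, this is roughly the equivalent InvokeModel call made directly with boto3; a minimal sketch, assuming Bedrock is activated and model access to Claude Instant has been granted in the account and region:

import json

import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="eu-central-1")

# Claude (text completions) requires the Human:/Assistant: wrapping.
body = json.dumps({
    "prompt": "\n\nHuman: What is the capital of Belgium?\n\nAssistant:",
    "max_tokens_to_sample": 1000,
    "temperature": 1,
})

response = bedrock_runtime.invoke_model(
    modelId="anthropic.claude-instant-v1",
    body=body,
)

completion = json.loads(response["body"].read())["completion"]
print(completion)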

Pre-requisites

  • You’ll need an AWS Account where you have the permissions to create AWS resources. In addition, you need to activate AWS Bedrock for that account (setting up AWS Bedrock).
  • Some TypeScript experience
  • AWS CDK (CLI) if you want to deploy from your local system

Implementing the Criteria Evaluation chain

Diagram of the criteria evaluator with AWS Step Functions and Bedrock

Several resources need to be created: a Bedrock model (Claude Instant is chosen here); an S3 bucket that will store our datasets and the results; a distributed map state, which maps the evaluation as separate child workflow executions; and a BedrockInvokeModel task that is mapped over the input file. The payload of this task depends on the chosen Bedrock model! For the Claude Instant model, the body requires a prompt and values for temperature and max_tokens_to_sample:

body: TaskInput.fromObject(
  {
    "prompt.$": prompt,
    max_tokens_to_sample: 1000,
    temperature: 1,
  },
)

with prompt.$ defined as follows (the $ sign indicates that the value for this key needs to be resolved at runtime):

const prompt = "States.Format('Human: " + EVALUATION_CRITERIA_PROMPT + ". Assistant:', $.input, $.submission, $.criteria)";

The States.Format() intrinsic function allows us to interpolate values from the mapped items into the prompt: each {} placeholder in the template is replaced, in order, by the runtime values of $.input, $.submission, and $.criteria. For example, States.Format('Hello, {}!', $.name) yields "Hello, world!" when $.name resolves to "world".

The EVALUATION_CRITERIA_PROMPT is defined as:

export const EVALUATION_CRITERIA_PROMPT = `You are assessing a submitted answer on a given task or input based on a set of criteria. Here is the data:
[BEGIN DATA]
***
[Input]: {}
***
[Submission]: {}
***
[Criteria]: {}
***
[END DATA]
Does the submission meet the Criteria? First, write out in a step by step manner your reasoning about each criterion to be sure that your conclusion is correct. Avoid simply stating the correct answers at the outset. Then print only the single character "Y" or "N" (without quotes or punctuation) on its own line corresponding to the correct answer of whether the submission meets all criteria. At the end, repeat just the letter again by itself on a new line.`;

Also note that the prompt is wrapped as a Human: <prompt> ... Assistant: string, as this is a requirement for LLMs in the Claude model family.

Hereafter, the code for the BedrockStepFunctionStack is given. This stack integrates a model from Amazon Bedrock and orchestrates the evaluation process, using a distributed map to process multiple items and a task to invoke the model as discussed above. This setup runs the evaluation criteria prompt for each item in the JSON array uploaded to S3 and saves the result in a new file.

import * as cdk from 'aws-cdk-lib';
import { Construct } from 'constructs';
import {
  JsonPath, DistributedMap,
  StateMachine, TaskInput, ResultWriter, IItemReader
} from 'aws-cdk-lib/aws-stepfunctions';
import * as bedrock from 'aws-cdk-lib/aws-bedrock';
import { BedrockInvokeModel } from 'aws-cdk-lib/aws-stepfunctions-tasks';

import { EVALUATION_CRITERIA_PROMPT } from './prompts';
import { Bucket, EventType } from 'aws-cdk-lib/aws-s3';
import { PythonFunction } from '@aws-cdk/aws-lambda-python-alpha';
import { Runtime } from 'aws-cdk-lib/aws-lambda';
import { PolicyStatement, Role, ServicePrincipal } from 'aws-cdk-lib/aws-iam';
import { S3EventSource } from 'aws-cdk-lib/aws-lambda-event-sources';

export class BedrockStepFunctionStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    // Foundation model used for the evaluation (Claude Instant).
    const model = bedrock.FoundationModel.fromFoundationModelId(
      this,
      'Model',
      bedrock.FoundationModelIdentifier.ANTHROPIC_CLAUDE_INSTANT_V1,
    );

    // Buckets for the input datasets and the evaluation results.
    const inputBucket = new Bucket(this, 'InputBucket', {});
    const outputBucket = new Bucket(this, 'OutputBucket', {});

    // Distributed map state that fans out one child workflow per item.
    const distributedMap = new DistributedMap(this, 'Distributed Map State', {
      itemReader: new CustomItemReader(),
      resultWriter: new ResultWriter({
        bucket: outputBucket,
        prefix: 'output',
      }),
    });

    // Interpolate input, submission and criteria into the evaluation prompt.
    const prompt = "States.Format('Human: " + EVALUATION_CRITERIA_PROMPT + ". Assistant:', $.input, $.submission, $.criteria)";

    const task = new BedrockInvokeModel(this, 'Prompt Model', {
      model,
      stateName: 'Bedrock criteria prompt',
      body: TaskInput.fromObject(
        {
          "prompt.$": prompt,
          max_tokens_to_sample: 1000,
          temperature: 1,
        },
      ),
      resultSelector: {
        names: JsonPath.stringAt('$.Body.completion'),
      },
    });

    // Step Function wiring the distributed map to the Bedrock task.
    const stateMachine = new StateMachine(this, 'StateMachine', {
      stateMachineName: 'bedrock-step-function',
      definition: distributedMap.itemProcessor(task),
    });

    const lambdaRole = new Role(this, 'LambdaRole', {
      assumedBy: new ServicePrincipal('lambda.amazonaws.com'),
    });

    stateMachine.grantStartExecution(lambdaRole);

    // Lambda that starts the state machine for every uploaded file.
    const triggerFunction = new PythonFunction(this, 'TriggerFunction', {
      entry: 'lambda/trigger',
      runtime: Runtime.PYTHON_3_12,
      handler: 'handler',
      index: 'index.py',
      role: lambdaRole,
      timeout: cdk.Duration.seconds(5),
      memorySize: 128,
      environment: {
        STATE_MACHINE_ARN: stateMachine.stateMachineArn,
      },
    });

    triggerFunction.addEventSource(new S3EventSource(inputBucket, {
      events: [EventType.OBJECT_CREATED],
    }));
  }
}

Notice that a CustomItemReader is used in the distributedMap. The reason is that the built-in ItemReader requires you to set an S3 reference using the IBucket interface. We actually have this bucket in the code, but we want the solution to potentially work with many buckets; hence we want to pass them from the previous state, or in this case, from the Step Function input. AWS CDK does not provide that functionality, so I extended the ItemReader to render the JSON output we need for that behavior.

interface ICustomItemReader extends IItemReader {
  // Optionally modify the method signature if needed
  render(): any;
}

class CustomItemReader implements ICustomItemReader {
  constructor() {
  }

  render(): any {
    // Could be extended with a value for max items; not implemented here.
    return {
      "Resource": "arn:aws:states:::s3:getObject",
      "ReaderConfig": {
        "InputType": "JSON"
      },
      "Parameters": {
        // Bucket and key are resolved at runtime from the state machine input.
        "Bucket.$": "$.bucket",
        "Key.$": "$.key",
      }
    };
  }

  providePolicyStatements(): PolicyStatement[] {
    // Implementation for providing policy statements
    return [];
  }
}

Finally, to automatically invoke the Step Function when a new file is uploaded to the S3 bucket, a Lambda function is required to start the state machine, as sketched below.
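
The trigger code itself is not shown above; here is a minimal sketch of what lambda/trigger/index.py could look like, assuming the state machine input only needs the bucket and key of the uploaded file (as expected by the CustomItemReader):

import json
import os
from urllib.parse import unquote_plus

import boto3

sfn = boto3.client("stepfunctions")

def handler(event, context):
    # An S3 event can contain multiple records; start one execution per file.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])

        # The CustomItemReader resolves $.bucket and $.key from this input.
        sfn.start_execution(
            stateMachineArn=os.environ["STATE_MACHINE_ARN"],
            input=json.dumps({"bucket": bucket, "key": key}),
        )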

Results

Let’s upload a dataset (input.json) to S3. It looks like this for two items:

[
  {
    "input": "what is 2+2",
    "submission": "this is a mathematics question, I love that; let me check; 2+2 is four; the answer is four",
    "criteria": "conciseness"
  },
  {
    "input": "what is 2+2",
    "submission": "4",
    "criteria": "conciseness"
  }
]
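
Uploading the file is what kicks off the pipeline; for example with boto3 (the bucket name is a placeholder for the generated InputBucket name):

import boto3

s3 = boto3.client("s3")

# Hypothetical bucket name; use the InputBucket created by the CDK stack.
s3.upload_file("input.json", "my-input-bucket", "input.json")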

The Lambda function will execute the state machine. Via the Step Functions page in the AWS Console, you can follow the progress of this state machine.

Shows the progress of the Mapped task on the dataset

The distributed map state’s iterations run in parallel. Each iteration creates a child workflow that invokes the criteria evaluator with the item’s values interpolated into the prompt.

Child workflow details

When all child workflow executions have completed, the results are exported to the output S3 bucket. The output file will look like this:

[
  {
    "ExecutionArn": "arn:aws:states:eu-central-1:851725585527:execution:bedrock-step-function-criteria-evaluator/1c3dec56-fd96-34be-acdd-25458d2dcbf6:b9549a1d-4425-3833-9f0c-6621ec6809d0",
    "Input": "{\"input\":\"what is 2+2\",\"submission\":\"this is a mathematics question, I love that; let me check; 2+2 is four; the answer is four\",\"criteria\":\"conciseness\"}",
    "InputDetails": {
      "Included": true
    },
    "Name": "b9549a1d-4425-3833-9f0c-6621ec6809d0",
    "Output": "{\"names\":\" Okay, let me walk through this step-by-step:\\n\\nThe criterion given is \\\"conciseness\\\". This refers to the conciseness or brevity of the submission. \\n\\nThe input asked solely for the answer to the mathematical question \\\"what is 2+2\\\". \\n\\nThe submission provided more than just the direct answer. It included introductory and concluding sentences about checking the answer and descriptions of the mathematics involved. \\n\\nWhile these introductory and concluding parts don't necessarily detract from the correctness of the answer, they do make the submission less concise than solely providing the direct answer of \\\"four\\\" alone.\\n\\nBy including extra, non-essential sentences, the submission is less concise than simply supplying only the answer directly. Therefore, I would say it does not fully meet the criterion of \\\"conciseness\\\" or brevity.\\n\\nN\\n\\nN\"}",
    "OutputDetails": {
      "Included": true
    },
    "RedriveCount": 0,
    "RedriveStatus": "NOT_REDRIVABLE",
    "RedriveStatusReason": "Execution is SUCCEEDED and cannot be redriven",
    "StartDate": "2024-04-08T20:09:42.188Z",
    "StateMachineArn": "arn:aws:states:eu-central-1:851725585527:stateMachine:bedrock-step-function-criteria-evaluator/1c3dec56-fd96-34be-acdd-25458d2dcbf6",
    "Status": "SUCCEEDED",
    "StopDate": "2024-04-08T20:09:45.343Z"
  },
  ...
]

For each item, the input and output, together with additional metadata about the Step Functions execution, are saved in the output S3 bucket.
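
Because the prompt instructs the model to repeat the verdict on its own final line, extracting a clean Y/N per item is straightforward; a minimal sketch, assuming the result file has been downloaded locally as results.json:

import json

with open("results.json") as f:
    results = json.load(f)

for result in results:
    item = json.loads(result["Input"])
    completion = json.loads(result["Output"])["names"]

    # The prompt asks the model to repeat the verdict on its own final
    # line, so take the last non-empty line as the Y/N value.
    verdict = completion.strip().splitlines()[-1].strip()
    print(item["input"], "->", verdict)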

Wrapping up

In this article we implemented a criteria evaluation solution using AWS Step Functions in combination with AWS Bedrock. Specifically, the distributed map state in Step Functions was used to iterate over JSON files on S3, which makes it possible to process large data sources. This solution can easily be extended with additional Step Functions components (there are 220 of them): e.g. a Lambda function to send out notifications or store the results in a database.

This implementation does not rely on any LLM evaluation frameworks. Instead, it uses a simple prompt to evaluate the output of an LLM running on AWS Bedrock without the data leaving your account.

The only costs associated with this implementation are the Bedrock model tokens (input and output), the S3 storage costs, and the Step Functions costs (i.e. state transitions).
