Solving Math Problems with LLMs: Part 4
Mastering Error Handling and Retries
Introduction
Welcome to the final part of our series on solving math problems with Large Language Models (LLMs). Throughout the series, we’ve been building a system that leverages LLMs to generate, validate, and execute solutions to mathematical problems.
In Part 1, we explored structured outputs and effective prompting techniques, laying the foundation for consistent and reliable LLM responses. Part 2 focused on executing Python code safely, allowing us to run LLM-generated solutions while mitigating potential risks. Part 3 introduced multithreading, enabling us to process multiple problems concurrently and significantly improve the efficiency of our system.
Now, in this final part, we address the challenges and potential points of failure that arise when working with LLMs and executing generated code. We’ll present strategies to make our system more resilient and capable of recovering from errors with the help of retry mechanisms.
Why Advanced Error Handling and Retries?
When working with LLMs and executing generated code, we encounter two main types of errors:
- Validation errors (outputs not matching our expected structure)
- Execution errors (in generated Python code)
Implementing advanced error handling and retries allows us to:
- Improve system reliability by automatically recovering from transient errors
- Maximize the value from API calls by retrying failed requests without stopping the process
- Feed captured error details back to the LLM to generate corrected solutions
Leveraging Instructor's Automatic Retry Mechanism
The Instructor library provides a powerful automatic retry mechanism when the returned output doesn't follow the Pydantic structure we require. This feature significantly simplifies our error handling process for validation errors.
import instructor
from pydantic import BaseModel, Field
from openai import OpenAI
from dotenv import load_dotenv, find_dotenv

load_dotenv(find_dotenv(usecwd=True, raise_error_if_not_found=True))

class MathSolution(BaseModel):
    answer: str = Field(min_length=1)
    python_code: str = Field(min_length=10)

problem_text = "What is 2 + 2?"

client = instructor.patch(OpenAI())

solution = client.chat.completions.create(
    messages=[
        {"role": "system", "content": "You are an expert mathematics tutor."},
        {"role": "user", "content": problem_text},
    ],
    response_model=MathSolution,
    model="gpt-4o-mini",
    max_retries=3,
)
In this example, Instructor will automatically retry up to 3 times if the LLM’s response doesn’t meet the Pydantic model’s requirements. This built-in feature handles validation errors without requiring explicit try-except blocks in our code, streamlining our error handling process.
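On each failed attempt, Instructor sends the validation error back to the model along with a request to correct its output, so retries are informed rather than blind. To make the mechanism easier to observe, here is a minimal sketch, assuming Pydantic v2; the StrictMathSolution model and its validation rule are illustrative additions, not part of the original pipeline. It rejects generated code that never mentions a variable named 'answer', which forces a retry:

from pydantic import field_validator

class StrictMathSolution(MathSolution):
    # Illustrative rule (our own assumption): reject code that never
    # references a variable named 'answer'
    @field_validator("python_code")
    @classmethod
    def code_must_mention_answer(cls, v: str) -> str:
        if "answer" not in v:
            raise ValueError("python_code must store its result in a variable named 'answer'")
        return v

solution = client.chat.completions.create(
    messages=[
        {"role": "system", "content": "You are an expert mathematics tutor."},
        {"role": "user", "content": problem_text},
    ],
    response_model=StrictMathSolution,
    model="gpt-4o-mini",
    max_retries=3,  # each validation failure is fed back to the model before the next attempt
)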
Handling Execution Errors in Generated Python Code
While Instructor handles validation errors, we still need to manage execution errors that may occur when running the LLM-generated Python code. Here’s how we can handle these errors:
class MathSolution(BaseModel):
    answer: str = Field(min_length=1)
    python_code: str = Field(min_length=10)

def execute_program(code: str):
    try:
        local_vars = {}
        exec(code, {}, local_vars)
        result = local_vars.get('answer', None)
        return result, None
    except Exception as e:
        return None, str(e)

def get_solution(client, problem_text, max_attempts=2):
    for attempt in range(max_attempts):
        try:
            solution = client.chat.completions.create(
                messages=[
                    {"role": "system", "content": "You are an expert mathematics tutor. Provide Python code to solve the problem and store the result in a variable named 'answer'."},
                    {"role": "user", "content": problem_text},
                ],
                response_model=MathSolution,
                model="gpt-4o-mini",
                max_retries=2
            )
            executed_answer, execution_error = execute_program(solution.python_code)
            if execution_error:
                problem_text = f"The previous code failed to execute with the error: {execution_error}. Please provide a corrected version that solves this problem: {problem_text}"
                continue
            return solution, executed_answer
        except Exception as e:
            print(f"An error occurred on attempt {attempt + 1}: {str(e)}")
            if attempt == max_attempts - 1:
                raise
    raise Exception("Max attempts reached without a valid solution")
This approach allows us to catch and handle errors that occur during the execution of the generated Python code, providing an opportunity to feed this information back to the LLM for correction. This creates a feedback loop that can improve the quality of generated solutions over time.
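As a quick sanity check, here is a hypothetical call to get_solution, reusing the patched client from the first example; the sample problem and printed values are illustrative:

client = instructor.patch(OpenAI())
solution, executed_answer = get_solution(client, "What is the sum of the first 10 positive integers?")
print(solution.python_code)  # the code the model generated
print(executed_answer)       # the value produced by executing that code, e.g. 55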
Logging and Monitoring
Even with Instructor’s automatic retries and our execution error handling, it’s crucial to implement logging and monitoring. This allows us to track the performance of our system and identify areas for improvement. Storing error messages alongside successful results in a structured format (e.g., in a DataFrame or database), as sketched after the list below, allows us to:
- Track the frequency and types of errors occurring
- Analyze patterns in LLM outputs that lead to execution errors
- Identify problems that consistently fail, even after multiple retries
- Maintain a record for auditing and debugging purposes
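A minimal sketch of such a log, assuming pandas is available; the helper name and record fields are our own choices rather than a fixed schema:

import logging
import pandas as pd

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("math_solver")

records = []

def record_result(problem_text, solution=None, executed_answer=None, error=None):
    # Keep the error message (if any) next to the answers so failures stay queryable
    records.append({
        "problem": problem_text,
        "answer": solution.answer if solution else None,
        "executed_answer": executed_answer,
        "error": error,
    })
    if error:
        logger.warning("Problem failed: %s | error: %s", problem_text, error)

# After processing all problems, load the records into a DataFrame for analysis
results_df = pd.DataFrame(records)
error_rate = results_df["error"].notna().mean()  # fraction of problems that ended in an error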
Conclusion
By leveraging Instructor’s automatic retry mechanism and implementing our own error handling for code execution, we’ve significantly improved the robustness and reliability of our LLM-based math problem solver.
The techniques covered in this series have broader applications beyond math problem-solving. Many LLM-based applications rely on structured output, and by using libraries like Instructor and Pydantic to implement comprehensive validation, error handling, and retry mechanisms, you can create LLM-based applications that are not just powerful, but also dependable and maintainable.
We’ve also demonstrated how to use LLMs to generate code that can be executed safely, a technique that can serve as one step in larger workflows across domains such as data analysis, automation, or even software development assistance.