Solving Math Problems with LLMs: Part 2

Executing Python Code Safely

Sep 2, 2024

Introduction

In Part 1: Structured Outputs and Effective Prompting, we explored how to use the Instructor library to obtain structured outputs from LLMs and craft effective prompts for math problem-solving. We set up a system that could generate not only answers and explanations but also Python code to solve given math problems.

Why Generate Code for Math Problems?
While recent LLMs have become better at solving math problems out of the box, they can still hallucinate arithmetic results or solution approaches. Generating Python code gives us something we can actually run, so the answer can be checked by execution rather than taken on trust.

Executing Generated Code Safely

When executing LLM-generated code, we need to implement safety measures to mitigate risks such as infinite loops or resource-intensive operations. Let’s focus on two key aspects: implementing a timeout mechanism and handling execution errors.

Timeout Mechanism

To prevent infinite loops or excessively long computations, we use a timeout mechanism. Here’s a simplified version of how it works:

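import threading

def execute_with_timeout(code: str, timeout: float = 10):
    """Run `code` in a worker thread and return (answer, error_message)."""
    result = [None, None]  # [execution_result, error_message]

    def run_code():
        # An explicit namespace makes the lookup of `answer` reliable;
        # exec() combined with locals() inside a function is not guaranteed
        # to expose variables that the executed code assigns.
        namespace = {}
        try:
            exec(code, namespace)
            # Convention: the generated code stores its result in `answer`
            result[0] = namespace.get('answer', None)
        except Exception as e:
            result[1] = str(e)

    # daemon=True so a runaway worker cannot keep the interpreter alive at exit
    thread = threading.Thread(target=run_code, daemon=True)
    thread.start()
    thread.join(timeout)

    if thread.is_alive():
        # join() only stops waiting; Python cannot forcibly stop the thread
        return None, "Execution timed out"

    return result[0], result[1]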

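This function runs the code in a separate thread and waits up to the specified timeout. If execution has not finished by then, the call returns a timeout error instead of blocking the main program. Note that the worker thread itself is not stopped, because Python threads cannot be killed from the outside: the timeout keeps the caller from hanging on an infinite loop that LLM-generated code might contain, but the loop may keep consuming CPU in the background.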
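To make the three possible outcomes concrete, here is a small usage sketch. It restates the function compactly so the snippet runs on its own; the copy below keeps variables in an explicit dict and marks the thread as a daemon, which are choices made for this sketch, and the three code strings are made-up examples.

```python
import threading

def execute_with_timeout(code: str, timeout: float = 10):
    # Compact restatement of the timeout function for this demo
    result = [None, None]  # [execution_result, error_message]

    def run_code():
        namespace = {}  # explicit namespace for the executed code
        try:
            exec(code, namespace)
            result[0] = namespace.get("answer")
        except Exception as e:
            result[1] = str(e)

    thread = threading.Thread(target=run_code, daemon=True)
    thread.start()
    thread.join(timeout)
    if thread.is_alive():
        return None, "Execution timed out"
    return result[0], result[1]

# 1. Code that finishes normally and sets `answer`
ok = execute_with_timeout("answer = sum(range(1, 101))")

# 2. Code that raises an exception
failed = execute_with_timeout("answer = 1 / 0")

# 3. Code that takes longer than the allowed time
slow = execute_with_timeout("import time\ntime.sleep(5)\nanswer = 1", timeout=0.5)

print(ok)      # (5050, None)
print(failed)  # (None, 'division by zero')
print(slow)    # (None, 'Execution timed out')
```

In the third case the caller gets control back after half a second, while the worker thread is simply abandoned and keeps sleeping until it finishes or the program exits.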

Error Handling

We catch and store any execution errors that occur:

namespace = {}
try:
    exec(code, namespace)
    # Store the result (assumes the code assigns it to a variable named `answer`)
    result[0] = namespace.get('answer', None)
except Exception as e:
    result[1] = str(e)

As shown above, any exception raised by the generated code is captured as an error message. This gives us a convenient way to feed the error back to the LLM inside a retry mechanism, so that it can fix the generated code accordingly.
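The retry idea can be sketched roughly as follows. This is an illustration under assumptions, not the implementation from the repository: `generate_code` is a hypothetical stand-in for the LLM call (here replaced by a hard-coded stub so the snippet runs offline), and `run_code` and `solve_with_retries` are names invented for this example. The timeout is left out to keep the sketch short.

```python
from typing import Callable, Optional, Tuple

def run_code(code: str) -> Tuple[object, Optional[str]]:
    """Execute code and return (answer, error_message)."""
    namespace = {}
    try:
        exec(code, namespace)
    except Exception as e:
        return None, f"{type(e).__name__}: {e}"
    if "answer" not in namespace:
        return None, "Code did not define a variable named `answer`"
    return namespace["answer"], None

def solve_with_retries(
    problem: str,
    generate_code: Callable[[str, Optional[str]], str],  # hypothetical LLM wrapper
    max_attempts: int = 3,
):
    """Ask for code, run it, and pass any error back on the next attempt."""
    error = None
    for attempt in range(1, max_attempts + 1):
        code = generate_code(problem, error)  # error is None on the first try
        answer, error = run_code(code)
        if error is None:
            return answer, attempt
    return None, max_attempts

# Stub standing in for the LLM: the first attempt is buggy, the second is fixed
def fake_generate_code(problem: str, previous_error: Optional[str]) -> str:
    if previous_error is None:
        return "answer = total / count"  # NameError: variables are undefined
    return "total, count = 10, 4\nanswer = total / count"

answer, attempts = solve_with_retries("What is 10 divided by 4?", fake_generate_code)
print(answer, attempts)  # 2.5 2
```

In a real pipeline, the error string would be appended to the prompt for the next attempt, and `max_attempts` bounds how many LLM calls a single problem can consume.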

System Execution and Limitations

Our current approach executes the generated Python code using Python’s built-in exec() function. While this is straightforward, it has some limitations:

1. Security: The executed code has access to the entire Python environment, which could be a security risk.
2. Resource Control: There’s limited control over memory usage or other system resources.
3. Isolation: The code execution is not fully isolated from the main program.

To address these limitations, we could wrap this execution in a Docker container. However, this is beyond the scope of our current implementation.
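Short of a container, one lighter-weight step is to run the generated code in a separate Python process. The sketch below is an assumption-laden illustration rather than part of the current implementation: `execute_in_subprocess` is a name invented here, and it assumes the answer is JSON-serializable so it can be passed back over stdout. A separate process keeps the code out of the main program's memory and can actually be killed on timeout, but it is not a security sandbox, since the child still has the same file and network access as the parent.

```python
import json
import subprocess
import sys

def execute_in_subprocess(code: str, timeout: float = 10.0):
    """Run code in a fresh Python interpreter; return (answer, error_message)."""
    # Append a final line that prints the `answer` variable as JSON,
    # so the parent can read it from the last line of stdout.
    wrapped = (
        code
        + "\nimport json as _json"
        + "\nprint(_json.dumps(globals().get('answer')))\n"
    )
    try:
        proc = subprocess.run(
            [sys.executable, "-c", wrapped],
            capture_output=True,
            text=True,
            timeout=timeout,  # the child is killed if this expires
        )
    except subprocess.TimeoutExpired:
        return None, "Execution timed out"

    if proc.returncode != 0:
        # The last stderr line of a traceback names the exception
        lines = proc.stderr.strip().splitlines()
        return None, lines[-1] if lines else "Unknown error"

    try:
        return json.loads(proc.stdout.strip().splitlines()[-1]), None
    except (IndexError, json.JSONDecodeError) as e:
        return None, f"Could not parse answer: {e}"

print(execute_in_subprocess("answer = 2 + 3"))                # (5, None)
print(execute_in_subprocess("while True: pass", timeout=1))   # (None, 'Execution timed out')
```

Unlike the thread-based version, an infinite loop here stops consuming CPU once the timeout fires, because the child process is terminated.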

Conclusion

In this article, we’ve explored the benefits of generating Python code for math problem-solving and implemented basic safety measures for code execution. By adding timeout mechanisms and error handling, we’ve reduced the risks associated with running LLM-generated code.

In the next part of our series, we’ll dive into multithreading techniques to handle multiple math problems concurrently, further improving the efficiency of our math problem solver.

Check out Part 3: “Multithreading for Robust Response Generation in LLM Applications”!

[GitHub link to reproducible example]