Large Language Models

Can LLMs Reason?

Attempting the ARC-AGI challenge with gpt-4o (10% score)

Benedict Neo
bitgrit Data Science Publication


Image by author (made with Excalidraw)

Can LLMs reason? Or are they just next word predictors?

The ARC-AGI is the ultimate test.

What is ARC-AGI?

ARC-AGI stands for the Abstraction and Reasoning Corpus for Artificial General Intelligence benchmark.

It was introduced in François Chollet’s 2019 paper “On the Measure of Intelligence” to measure the efficiency of AI skill-acquisition on unknown tasks.

“ARC can be seen as a general artificial intelligence benchmark, as a program synthesis benchmark, or as a psychometric intelligence test. It is targeted at both humans and artificially intelligent systems that aim at emulating a human-like form of general fluid intelligence.”

It’s designed to measure an AI system’s ability to efficiently learn new skills and generalize to novel problems outside of its training data. Surprisingly, while humans easily score 85% on this benchmark, the best AI systems currently only manage 34%.

How is it designed?

ARC-AGI consits of unique tasks, each task contains input and output examples like below.

Each square in a puzzle can be one of 10 colors, and a grid can have any height and width between 1x1 and 30x30.

Source: arcprize.org

To successfully solve a task, the output grid must be pixel-perfect. This includes picking the correct dimensions of the output grid.
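Because scoring is exact match, there is no partial credit: a single wrong cell, or a wrong grid size, fails the task. A minimal sketch of what that check amounts to (the function and names here are mine, just for illustration):

def is_correct(predicted: list[list[int]], expected: list[list[int]]) -> bool:
    # The grid must have the right number of rows, and every row must match exactly.
    if len(predicted) != len(expected):
        return False
    return all(p_row == e_row for p_row, e_row in zip(predicted, expected))

# A prediction that differs in even one cell fails the task.
print(is_correct([[1, 0], [0, 1]], [[1, 0], [0, 1]]))  # True
print(is_correct([[1, 0], [0, 2]], [[1, 0], [0, 1]]))  # False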

To learn more about the reasoning behind the challenge, watch this podcast.

The challenge

The ARC Prize competition, hosted on Kaggle, challenges researchers and data scientists to explore ideas beyond traditional LLMs.

With a prize pool of $100,000 and an additional $500,000 for any team that can beat the 85% human-level score, this competition is not just about the money — it’s about pushing the boundaries of what’s possible in AI.

Let’s dive right into the challenge.

We’ll walk through my baseline approach and some of the code used to achieve it.

The GitHub code is here.

Shout out to HP for providing the HP Z6 G5 Tower Workstation which I used to run the experiments.

The Data 💾

The ARC-AGI dataset is unlike typical machine learning datasets. Instead of a large collection of similar examples, it provides a series of unique, abstract reasoning tasks.

Each task consists of a few input-output training examples and a test input for which your AI needs to predict the correct output.

Here’s what the data folder looks like:

├── data
│   ├── arc-agi_evaluation_challenges.json
│   ├── arc-agi_evaluation_solutions.json
│   ├── arc-agi_test_challenges.json
│   ├── arc-agi_training_challenges.json
│   ├── sample_submission.json
│   └── arc-agi_training_solutions.json

Here’s what a single challenge looks like from the training dataset.

{"007bbfb7":
{
"test": [
{"input":[
[7,0,7],
[7,0,7],
[7,7,0]]
}],
"train":[
{"input":[
[0,7,7],
[7,7,7],
[0,7,7]],
"output":[
[0,0,0,0,7,7,0,7,7],
[0,0,0,7,7,7,7,7,7],
[0,0,0,0,7,7,0,7,7],
[0,7,7,0,7,7,0,7,7],
[7,7,7,7,7,7,7,7,7],
[0,7,7,0,7,7,0,7,7],
[0,0,0,0,7,7,0,7,7],
[0,0,0,7,7,7,7,7,7],
[0,0,0,0,7,7,0,7,7]]
},
{"input":[
[4,0,4],
[0,0,0],
[0,4,0]],
"output":[
[4,0,4,0,0,0,4,0,4],
[0,0,0,0,0,0,0,0,0],
[0,4,0,0,0,0,0,4,0],
[0,0,0,0,0,0,0,0,0],
[0,0,0,0,0,0,0,0,0],
[0,0,0,0,0,0,0,0,0],
[0,0,0,4,0,4,0,0,0],
[0,0,0,0,0,0,0,0,0],
[0,0,0,0,4,0,0,0,0]]},
...

which corresponds to this puzzle

1. Data Load

def load_data(dataset: str):
    if dataset not in DATA_PATHS_MAP:
        raise ValueError("Invalid dataset. Choose from 'train', 'eval', or 'test'.")

    data = {}
    with open(DATA_PATHS_MAP[dataset], "r") as f:
        text = f.read()
    data[dataset] = json.loads(text)
    logging.info(f"Loaded {len(data[dataset])} lines of {dataset} data")

    if dataset in ["train", "eval"]:
        solutions_key = f"{dataset}_solutions"
        with open(DATA_PATHS_MAP[solutions_key], "r") as f:
            text = f.read()
        data[solutions_key] = json.loads(text)
        logging.info(f"Loaded {len(data[solutions_key])} lines of {solutions_key} data")

    return data

This function loads the specified dataset (train, eval, or test) and its corresponding solutions if available.

Here’s how you can load the train dataset.

from arcprize.helpers import load_data
data = load_data('train')
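load_data relies on a module-level DATA_PATHS_MAP constant that isn’t shown above. Based on the data folder listing, it presumably maps dataset names (and their _solutions counterparts) to the JSON files, roughly like this (the exact paths are an assumption):

DATA_PATHS_MAP = {
    "train": "data/arc-agi_training_challenges.json",
    "train_solutions": "data/arc-agi_training_solutions.json",
    "eval": "data/arc-agi_evaluation_challenges.json",
    "eval_solutions": "data/arc-agi_evaluation_solutions.json",
    "test": "data/arc-agi_test_challenges.json",
}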

2. Task Representation

def json_task_to_string(task: dict) -> str:
    train_tasks = task["train"]
    test_task = task["test"]

    final_output = "Training Examples\n"

    for i, task in enumerate(train_tasks):
        final_output += f"Example {i + 1}: Input\n["
        for row in task["input"]:
            final_output += f"\n{str(row)},"
        final_output += "]\n\n"
        final_output += f"Example {i + 1}: Output\n["
        for row in task["output"]:
            final_output += f"\n{str(row)},"
        final_output += "]\n\n"

    final_output += "Test\n["
    for row in test_task[0]["input"]:
        final_output += f"\n{str(row)}"
    final_output += "]"

    return final_output

This function builds a plain string representation of the task, including all the training examples and the test input; this string serves as the prompt for our LLM.
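For example, using the data loaded earlier, you can turn any training task into a prompt string (next(iter(...)) just grabs the first task id; a specific id like "007bbfb7" from above works too):

task_id = next(iter(data["train"]))  # or any specific id, e.g. "007bbfb7"
prompt_body = json_task_to_string(data["train"][task_id])
print(prompt_body)

For one of the training tasks, the resulting string looks like this: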

Training Examples
Example 1: Input
[
[0, 0, 0, 8, 0, 8, 0, 0, 0],
[0, 0, 0, 0, 8, 8, 0, 0, 0],
[0, 0, 0, 0, 0, 8, 0, 0, 0],
[0, 0, 0, 4, 0, 0, 0, 0, 0],
[0, 0, 0, 4, 4, 4, 0, 0, 0],
[0, 0, 0, 0, 4, 0, 0, 0, 0],]

Example 1: Output
[
[8, 0, 8, 8, 0, 8, 0, 0, 0],
[8, 8, 0, 0, 8, 8, 0, 0, 0],
[8, 0, 0, 0, 0, 8, 0, 0, 0],
[0, 0, 0, 4, 0, 0, 0, 0, 0],
[0, 0, 0, 4, 4, 4, 0, 0, 0],
[0, 0, 0, 0, 4, 0, 0, 0, 0],]

Example 2: Input
[
[0, 0, 0, 8, 0, 8, 0, 0, 0],
[0, 0, 0, 8, 8, 8, 0, 0, 0],
[0, 0, 0, 8, 8, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 4, 0, 0, 0],
...

Now for the fun part.

3. LLM Interaction and Code Execution

I experimented with over a dozen prompts; this is the one I ended up going with.

System prompt

You are an intelligent agent and a skilled Python programmer. Your task is to analyze and reason about complex pattern transformations where an input matrix (grid) is transformed into an output matrix based on a few examples. You need to identify the underlying transformation rule and implement it in Python.

The inputs and outputs are represented as grids — a rectangular matrix of integers between 0 and 9 (inclusive). Each integer corresponds to a specific color.

You need to reason deductively to understand the transformation rule and demonstrate your reasoning in detail. Your response should include a clear and thorough reasoning section enclosed in <reasoning></reasoning> tags, followed by the implementation of the transformation in Python within triple backticks (```python```).

User Prompt

Here are some examples of a transformation pattern from `input` to `output` along with the reasoning behind the transformation:

<example_1>
{example_1_reasoning}
</example_1>

<example_2>
{example_2_reasoning}
</example_2>

<example_3>
{example_3_reasoning}
</example_3>

Now it’s your turn to solve a new problem. Here’s a new input: {{input_string}}

Follow the same reasoning steps as the examples, it is important that you infer the correct output dimension from the given input.

Use the following template for your algorithm:

```python
import numpy as np

# Your thought process
def apply_transformation(input_matrix):
    # perform transformation

    return output_matrix
```

The example reasonings are taken from this GitHub repo.
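One helper worth mentioning: the get_task_prediction function below builds its prompt with generate_user_prompt, which isn’t shown here. Conceptually it just fills the remaining placeholder of the template above with the string representation of the current task (in this sketch, USER_PROMPT_1 is assumed to already contain the three worked example reasonings):

def generate_user_prompt(sample: dict, template: str) -> str:
    # Fill the {input_string} placeholder with the task's training examples and test input.
    return template.format(input_string=json_task_to_string(sample))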

The LLM call
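The function relies on a few module-level settings and an async OpenAI client. A minimal setup might look like the following; the sampling values, retry budget, and fallback grid are placeholders, not necessarily the ones used in my runs:

import os
from openai import AsyncOpenAI

client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])

MODEL = "gpt-4o"        # the model used for this baseline
TEMPERATURE = 0.0       # placeholder sampling settings
TOP_P = 1.0
RETRY_ATTEMPTS = 3      # placeholder retry budget
BASE_RESPONSE = [[0]]   # placeholder fallback grid returned when every attempt fails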

async def get_task_prediction(sample, retry_attempts=RETRY_ATTEMPTS):
    user_prompt = generate_user_prompt(sample, USER_PROMPT_1)

    for attempt in range(retry_attempts):
        try:
            response = await client.chat.completions.create(
                model=MODEL,
                messages=[
                    {"role": "system", "content": SYSTEM_PROMPT},
                    {"role": "user", "content": user_prompt},
                ],
                temperature=TEMPERATURE,
                top_p=TOP_P,
            )
            resp = response.choices[0].message.content
            reasoning = extract_reasoning_from_response(resp)
            logging.info(f"Reasoning: {reasoning}")

            output, code, error = await handle_code_execution(resp, sample)

            if error:
                logging.error(f"Attempt {attempt + 1} failed with error: {error}")
                continue

            task_out_dim = (
                len(sample["train"][0]["output"]),
                len(sample["train"][0]["output"][0]),
            )
            pred_out_dim = (len(output), len(output[0]))

            if task_out_dim == pred_out_dim:
                return output
            else:
                fixed_code = await fix_code_with_llm(
                    resp,
                    f"Output dimension mismatch: Expected {task_out_dim}, Got {pred_out_dim}",
                    sample,
                )
                output, code, error = execute_code(fixed_code, sample["test"][0]["input"])

                if not error and task_out_dim == (len(output), len(output[0])):
                    return output

        except Exception as e:
            logging.error(f"Attempt {attempt + 1} failed with error: {e}")

    logging.error("Failed to get correct output dimensions after multiple attempts")
    return BASE_RESPONSE

This function does several important things:

  1. Generates a user prompt with the specific task
  2. Sends the prompt to the LLM
  3. Extracts reasoning and code from the LLM’s response
  4. Attempts to execute the code
  5. Checks if the output dimensions match the expected dimensions
  6. If there’s an error or dimension mismatch, it tries to fix the code using the LLM again

Code Fixing

To fix the code, we have the function below, with its own custom prompt.

async def fix_code_with_llm(broken_code, error_message, sample):
    fix_user_prompt = f"""
The following code has an error:
```python
{broken_code}
```
The error message is:
{error_message}

Reason carefully and fix your algorithm so it matches the expected output.

Here is the input again:
{sample['test'][0]['input']}

Use the following template for your algorithm:

```python
import numpy as np

# Your thought process
def apply_transformation(input_matrix):
    # perform transformation
    ...
    return output_matrix
```

respond with only the reasoning and the fixed code.
"""

    response = await client.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": fix_user_prompt},
        ],
        temperature=TEMPERATURE,
        top_p=TOP_P,
    )
    fixed_code = response.choices[0].message.content
    reasoning = extract_reasoning_from_response(fixed_code)
    logging.info(f"Reasoning: {reasoning}")
    return fixed_code

This function sends the broken code and the error message back to the LLM, asks it to reason about the failure, and returns the corrected code, logging the model's reasoning along the way.
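Both functions log the model's reasoning via extract_reasoning_from_response, which isn't shown above. Since the system prompt asks for the reasoning inside <reasoning></reasoning> tags, a plausible implementation is a small regex helper like this (a sketch, not necessarily the repo's exact code):

import re

def extract_reasoning_from_response(response: str) -> str:
    # Pull the text between <reasoning> and </reasoning> tags, if present.
    match = re.search(r"<reasoning>(.*?)</reasoning>", response, re.DOTALL)
    return match.group(1).strip() if match else ""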

If the code has no issues, we run it.

4. Code Execution

def exec_response(response, input_matrix):
    pattern = r"```python\n(.*?)```"
    matches = re.findall(pattern, response, re.DOTALL)
    if not matches:
        raise ValueError("No Python code found in response.")

    code = matches[0] + "\nresult = apply_transformation(input_matrix)"

    global_scope = {
        name: dynamic_import(module) for name, module in required_modules.items()
    }
    local_scope = {"input_matrix": input_matrix}

    try:
        exec(code, global_scope, local_scope)
        result = local_scope.get("result", None)
        return result, code
    except Exception as e:
        raise ValueError(f"Failed to run the code: {e}")

This function extracts the Python code block from the LLM's response, appends a call to apply_transformation, and executes it on the test input matrix.
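The handle_code_execution call inside get_task_prediction isn't shown either; presumably it is a thin async wrapper that runs exec_response on the task's test input and turns exceptions into an error value so the retry loop can react. A rough sketch under that assumption:

async def handle_code_execution(response: str, sample: dict):
    # Run the code from the LLM response on the task's test input.
    # Returns (output, code, error); error is None when execution succeeds.
    try:
        output, code = exec_response(response, sample["test"][0]["input"])
        return output, code, None
    except Exception as e:
        return None, None, str(e)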

So how does it do?

Does it work?

Here’s a summary of the results on the training set.

  • Score: (38/400) 9.5%
  • Time: 75m
  • Cost: $200

There is a lot of room for improvement. Below are some links that I used as references.

One thing that could reduce the cost is prefix caching. I found a vLLM implementation, but it doesn’t support cloud providers.

At the time of writing, the o1 model had just been released, so trying it out would be a fun experiment.

References

Good luck!

Thanks for reading

Be sure to follow the bitgrit Data Science Publication to keep updated!

Want to discuss the latest developments in Data Science and AI with other data scientists? Join our discord server!

Follow Bitgrit below to stay updated on workshops and upcoming competitions!

Discord | Website | Twitter | LinkedIn | Instagram | Facebook | YouTube
