CriticGPT: Enhancing Code Review with AI!!

Nipunika Jain
Published in Nimbus Niche
7 min read · Jun 28, 2024

So your Lead doesn’t have to review your dirty code!! 😂

Warning: This is going to be a long but helluva read!!

(Cover image generated by ChatGPT)

Blimey! What a bloody age to live in, eh? Alright, I’ll get out of my Billy Butcher mode, but with the way things are moving in the field of AI, I’m getting old just trying to catch up! Here’s some of what has happened recently:

  1. Claude 3.5 Sonnet:
    A new model by Anthropic that sets new industry benchmarks for graduate-level reasoning (GPQA), undergraduate-level knowledge (MMLU), and coding proficiency (HumanEval). It excels in grasping nuances, humour, and complex instructions, and is exceptional at writing high-quality content with a natural, relatable tone. Claude 3.5 Sonnet operates at twice the speed of Claude 3 Opus. This performance boost, combined with cost-effective pricing, makes Claude 3.5 Sonnet ideal for complex tasks such as context-sensitive customer support and orchestrating multi-step workflows.
  2. PlanRAG:
    A novel retrieval-augmented generation technique designed to improve decision-making using large language models (LLMs). It defines Decision QA, a task requiring LLMs to make decisions based on complex data and business rules. PlanRAG first generates a plan for data analysis, retrieves necessary data, and iteratively refines the plan to make the best decision. Evaluated using scenarios from video games, PlanRAG outperforms existing methods, demonstrating its potential for practical applications in business decision-making.
  3. Tree Search for Language Model Agents:
    Introduces a tree search algorithm to enhance decision-making capabilities of language model agents in web automation tasks. This approach improves multi-step reasoning and planning by enabling agents to explore and evaluate multiple action trajectories. Tested on VisualWebArena and WebArena benchmarks, the algorithm significantly boosts success rates, demonstrating its effectiveness in realistic web environments.

These are just a few of the many recent advancements. Notice how much of this work focuses on improving decision-making capabilities by introducing new algorithms and techniques. One such enhancer is CriticGPT, which can help you review your code and, in some ways, improve your decision-making too.

Let’s talk about CriticGPT further. The paper: LLM Critics Help Catch LLM Bugs (https://cdn.openai.com/llm-critics-help-catch-llm-bugs-paper.pdf) 🔗

CriticGPT addresses a critical challenge in the realm of AI-assisted code generation. Here’s an overview of the problem it solves, how it solves it, its practical applications, and what’s next for this groundbreaking technology.

What Problem Does It Solve?

CriticGPT tackles the inherent limitations of reinforcement learning from human feedback (RLHF). While RLHF has significantly advanced the capabilities of large language models (LLMs), it is fundamentally constrained by human evaluators’ ability to accurately assess the outputs of these models. This limitation becomes particularly evident in complex tasks like code generation, where even experienced programmers can miss subtle bugs. The field of “scalable oversight” aims to tackle this problem by training models that help humans to correctly evaluate model output.

How Does It Solve the Problem?

Alright folks, let’s dive into how CriticGPT works its magic in catching those sneaky bugs in your code. Here’s the inside scoop:

1. Critique Generation and Evaluation

  • The Basics: The critic models take a question and answer pair, and then whip up a critique that points out what’s wrong with the answer.

What to Look For:

  • Comprehensiveness: Does the critique catch all the big issues?
  • Critique-Bug Inclusion (CBI): Does it spot specific pre-planned bugs?
  • Hallucinated Bugs or Nitpicks: Is it making up problems or being overly picky?
  • Helpfulness: How useful is the critique overall?
  • How We Rate It: Contractors rate these aspects on a scale from 1 to 7, and an Elo score (like in chess!) is used to compare different models’ overall performance. A rough sketch of what one rated critique could look like as a data record follows below.
(Figure source: CriticGPT paper)
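To make that concrete, here is a minimal sketch of what a single rated critique could look like as a data record, with the four attributes above on the 1-to-7 scale the contractors use. This is my own illustration, not a data format from the paper.

```python
from dataclasses import dataclass

@dataclass
class RatedCritique:
    """One contractor rating of one critique, using the paper's 1-7 scales."""
    question: str                 # the original coding task / prompt
    answer: str                   # the model-written answer being critiqued
    critique: str                 # the critic model's critique of that answer
    comprehensiveness: int        # 1-7: does it catch all the important issues?
    critique_bug_inclusion: int   # 1-7: does it mention the specific planted bug?
    hallucinations_nitpicks: int  # 1-7: does it invent problems or nitpick?
    helpfulness: int              # 1-7: how useful is the critique overall?

    def __post_init__(self) -> None:
        # Guard against ratings outside the 1-7 scale described above.
        for name in ("comprehensiveness", "critique_bug_inclusion",
                     "hallucinations_nitpicks", "helpfulness"):
            value = getattr(self, name)
            if not 1 <= value <= 7:
                raise ValueError(f"{name} must be between 1 and 7, got {value}")
```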

2. Human Critiques and LLM Assistance

  • Expert Reviewers: Contractors with an average of 5 years of Python experience write their own critiques.
  • With a Little Help from AI: Sometimes, they get to see what CriticGPT suggests to see if it helps them spot more issues.

3. Evaluation Data Distribution

Where the Bugs Come From (a toy sketch of a task record follows this list):

  • Human Inserted Bugs: Contractors sneak bugs into the answers on purpose.
  • Real-World Bugs: Tasks where real users found problems.
(Figure source: CriticGPT paper)
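As a rough mental model (again my own sketch, not the paper’s data format), you can think of each evaluation task as a record tagged with where its bug came from:

```python
from dataclasses import dataclass
from enum import Enum

class BugSource(Enum):
    HUMAN_INSERTED = "human_inserted"  # a contractor tampered with the answer on purpose
    REAL_WORLD = "real_world"          # a problem that real users actually hit

@dataclass
class EvalTask:
    question: str         # the coding question given to the model
    answer: str           # the (buggy) model answer to be critiqued
    bug_description: str  # the contractor's description of the known bug
    source: BugSource     # which of the two distributions the task comes from
```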

4. Training Critics with RLHF

Training Steps (a rough sketch of the whole loop follows the list):

  • Sampling: Generate multiple critiques for each question-answer pair.
  • Rating: Contractors rate these critiques on the key attributes.
  • Reward Model: Trained to predict which critiques humans will like best.
  • Optimization: Use Proximal Policy Optimization (PPO) to make the model better.
  • Inference Time: Apply Force Sampling Beam Search (FSBS) to pick the best critiques.
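Putting those steps together, here is a highly simplified sketch of the overall loop. The helper callables (`sample_critiques`, `collect_ratings`, `train_reward_model`, `ppo_update`, `fsbs_select`) are placeholders of my own, standing in for what are in reality large training pipelines, not functions from the paper or any library.

```python
def train_critic(base_model, qa_pairs, *,
                 sample_critiques, collect_ratings, train_reward_model,
                 ppo_update, fsbs_select, n_samples=4):
    """Toy outline of the RLHF recipe described above (illustration only)."""
    # 1. Sampling: draw several candidate critiques per (question, answer) pair.
    candidates = {qa: sample_critiques(base_model, qa, n=n_samples) for qa in qa_pairs}

    # 2. Rating: contractors score each critique on the key attributes (1-7 scales).
    ratings = collect_ratings(candidates)

    # 3. Reward model: learn to predict which critiques humans prefer.
    reward_model = train_reward_model(ratings)

    # 4. Optimization: improve the critic against the reward model with PPO.
    critic = ppo_update(base_model, reward_model, prompts=qa_pairs)

    # 5. Inference time: FSBS picks the best critique from constrained samples.
    def critique(question, answer):
        return fsbs_select(critic, reward_model, question, answer)

    return critique
```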

5. Tampering Process

  • Making it Tough: Contractors introduce subtle bugs into the answers. These bugs have to be missed by CriticGPT at least once in three tries, ensuring they’re challenging enough. (A sketch of this acceptance rule follows below.)
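Here’s one way to picture that difficulty bar. It’s my own sketch of the rule as described above, not code from the paper; `critic` and `mentions_bug` stand in for the model call and the bug-matching check.

```python
def bug_is_hard_enough(critic, mentions_bug, question, tampered_answer,
                       bug_description, attempts=3):
    """Accept a planted bug only if the critic misses it at least once in `attempts` tries."""
    misses = 0
    for _ in range(attempts):
        critique = critic(question, tampered_answer)     # ask the critic for a critique
        if not mentions_bug(critique, bug_description):  # did it catch the planted bug?
            misses += 1
    return misses >= 1
```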

6. Critique Comparisons

  • The Showdown: Contractors compare critiques from three different models with a “gold standard” critique based on their own bug descriptions. These pairwise preferences feed the Elo comparisons mentioned earlier (see the sketch below).
  • Continuous Improvement: This data collection has been running for months, across several iterations of models.
(Figure source: CriticGPT paper)
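The Elo scores work just like chess ratings: every time one source’s critique is preferred over another’s, their ratings shift. Here is a minimal sketch of a standard Elo update; it illustrates the general mechanism rather than the paper’s exact setup.

```python
def elo_update(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Standard Elo update after one pairwise comparison between critique sources A and B."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    score_a = 1.0 if a_won else 0.0
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

# Example: the critic model's critique is preferred over the human-written one this round.
critic_elo, human_elo = elo_update(1000.0, 1000.0, a_won=True)
```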

7. RLHF Details

  • Model Setup: CriticGPT starts from the same place as ChatGPT.
  • Training Mix: Uses data from both ChatGPT and CriticGPT.
  • Efficient Training: CriticGPT is fine-tuned with less compute power.
  • Focused Training: PPO prompt distribution focuses on critique requests only.

8. Force Sampling Beam Search (FSBS)

  • Making It Better: FSBS helps generate longer, more detailed critiques while reducing the chances of the model hallucinating bugs. It’s all about constrained sampling and picking the best critiques based on a mix of scores (a toy sketch of the selection step follows below).
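As I understand it, FSBS forces the critic to keep quoting (highlighting) sections of the code, samples several constrained continuations, and then ranks the candidates by a combination of the reward model’s score and a bonus for how much they cover. Here is a toy sketch of that final selection step; the `length_bonus` weight and the input format are placeholders of mine, not the paper’s exact formula.

```python
def fsbs_select(candidates, rm_score, length_bonus=0.5):
    """Pick the critique with the best mix of reward-model score and coverage.

    `candidates` is a list of (critique_text, num_highlighted_sections) pairs;
    `rm_score` is a callable returning the reward model's score for a critique.
    """
    def score(candidate):
        text, num_highlights = candidate
        # A bonus for more thorough critiques, traded off against the reward
        # model's judgement, which penalises hallucinated bugs and nitpicks.
        return rm_score(text) + length_bonus * num_highlights

    return max(candidates, key=score)
```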

Key Techniques and Innovations

  1. RLHF: The use of reinforcement learning from human feedback to train critic models, ensuring the feedback is relevant and accurate.
  2. FSBS: The application of Force Sampling Beam Search to balance the tradeoff between comprehensiveness and precision in critiques.
  3. Elo Scores: Adaptation of Elo scoring to rate critic models, providing a quantitative measure of their performance compared to human reviewers and other automated tools.
  4. Comprehensive Evaluation Metrics: Developing specific metrics such as comprehensiveness, CBI, and the rate of hallucinated bugs and nitpicks to rigorously assess the performance of the critic models.

Practical Applications:

  1. Code Review Assistance: CriticGPT can be integrated into the code review process to assist human reviewers, making it more thorough and reliable. This can significantly reduce the time and effort required to identify and fix bugs.
  2. Improving Training Data Quality: By identifying errors in the training data of LLMs, CriticGPT ensures higher quality and more reliable data, leading to better-performing models.
  3. Enhanced Human-Machine Collaboration: Teams combining human reviewers with CriticGPT can produce more comprehensive and accurate critiques, enhancing the overall quality of code and reducing the likelihood of errors.

Results:

Now, let’s talk about the results that make CriticGPT stand out:

  • Performance Boost: CriticGPT outperformed human reviewers, with its critiques preferred in 63% of cases.
(Figure source: CriticGPT paper)
  • Elo Scores: CriticGPT scored higher than human reviewers, showing its superior bug detection capability.
(Figure source: CriticGPT paper)
  • Real-World Effectiveness: Tested on real-world assistant tasks, CriticGPT demonstrated its ability to handle complex scenarios and provide valuable feedback.
(Figure source: CriticGPT paper)

What’s Next?

The potential of CriticGPT extends beyond code review. Future directions for this technology include:

  1. Scaling and Generalization: Expanding the application of CriticGPT to more complex and varied tasks, ensuring its effectiveness in a broader range of scenarios.
  2. Reducing Hallucinations: Further refining the model to decrease the incidence of hallucinated bugs, enhancing the reliability of its feedback.
  3. Longitudinal Studies: Conducting long-term studies to evaluate the real-world impact of CriticGPT in production environments, with a focus on improving its integration and effectiveness.
  4. Recursive Reward Modeling (RRM): Exploring the use of CriticGPT in RRM to further enhance the evaluation and training of LLMs, aiming for even higher standards of accuracy and safety in AI systems.

CriticGPT represents a significant step forward in AI-assisted code generation and review. By leveraging advanced AI to enhance human capabilities, it ensures the production of higher-quality, more reliable software, ultimately driving innovation and efficiency in the tech industry.

As I write this post, CriticGPT is not yet available to users; hopefully it will be soon!

Note: This post is based on my understanding of the LLM Critics Help Catch LLM Bugs paper. Please read the paper for further clarity!

References:
https://cdn.openai.com/llm-critics-help-catch-llm-bugs-paper.pdf
