OpenAI’s new “CriticGPT” model is trained to critique GPT-4 output

An illustration created by OpenAI.

On Thursday, OpenAI researchers unveiled CriticGPT, a new AI model designed to identify bugs in code generated by ChatGPT. It’s intended to improve the process of making AI systems behave in ways that people want (called “alignment”) via reinforcement learning from human feedback (RLHF), and it helps human reviewers make large language model (LLM) output more accurate.

As outlined in a new research paper titled “LLM Critics Help Catch LLM Bugs,” OpenAI created CriticGPT to act as an AI assistant for human trainers reviewing code generated by the ChatGPT AI assistant. CriticGPT—based on the GPT-4 family of LLMs—analyzes the code and highlights potential errors, making it easier for humans to spot mistakes that would otherwise go unnoticed. The researchers trained CriticGPT on a dataset of code samples with intentionally injected bugs, teaching it to recognize and highlight various coding errors.

The researchers found that annotators preferred CriticGPT critiques over human critiques in 63 percent of cases involving naturally occurring LLM errors, and that human-machine teams using CriticGPT wrote more comprehensive critiques than humans alone, while reducing the number of confabulations (hallucinations) compared to AI-only critiques.

Developing an automated critic

The development of CriticGPT involved training the model on a large number of inputs with deliberately inserted errors. Human trainers were asked to modify code written by ChatGPT, introduce bugs, and then provide sample feedback as if they had discovered these bugs. This process allowed the model to learn how to identify and critique different types of coding errors.
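As a rough illustration, one of these “tampered” training examples might be represented along the lines of the sketch below. The field names and the example data are assumptions made for illustration; the paper does not publish its exact data schema.

```python
# Hypothetical sketch of a single "tampered" training example for a critic model.
# Field names and structure are illustrative assumptions, not OpenAI's actual schema.

from dataclasses import dataclass


@dataclass
class TamperedExample:
    question: str          # the original task given to ChatGPT
    original_answer: str   # ChatGPT's code, before tampering
    tampered_answer: str   # the same code with a bug deliberately inserted by a human trainer
    bug_description: str   # the trainer's reference critique, written as if the bug were discovered


example = TamperedExample(
    question="Write a function that returns the largest value in a list.",
    original_answer="def largest(xs):\n    return max(xs)",
    tampered_answer="def largest(xs):\n    return max(xs[1:])",  # inserted bug: skips the first element
    bug_description=(
        "The slice xs[1:] drops the first element, so the function returns the "
        "wrong answer whenever the largest value is at index 0."
    ),
)

# During training, the critic is rewarded for critiques that point out the inserted bug,
# which teaches it to flag similar mistakes in unmodified ChatGPT output.
print(example.bug_description)
```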

In experiments, CriticGPT demonstrated its ability to catch both injected bugs and naturally occurring errors in ChatGPT’s output. Trainers favored the new model’s critiques over those generated by ChatGPT itself in 63 percent of cases involving natural bugs, partly because CriticGPT produced fewer useless “nitpicks” and fewer false positives (hallucinated problems).

The researchers also created a new technique they call Force Sampling Beam Search (FSBS). This method helps CriticGPT write more detailed reviews of code. It lets the researchers adjust how thorough CriticGPT is in finding problems, while also controlling how often it invents problems that aren’t actually there. They can adjust this balance depending on what they need for different AI training tasks.
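Conceptually, that knob amounts to ranking candidate critiques by a quality score plus a tunable bonus for how much of the code they flag. The sketch below is an assumption-laden illustration of that precision-versus-comprehensiveness trade-off; the names and scoring formula are hypothetical and it is not OpenAI’s FSBS implementation.

```python
# Illustrative sketch of the precision/comprehensiveness trade-off behind a
# critique selector. All names and the scoring formula are assumptions for
# illustration; this is not OpenAI's Force Sampling Beam Search implementation.

from dataclasses import dataclass


@dataclass
class Critique:
    text: str
    num_highlights: int   # how many distinct code spans the critique flags
    reward_score: float   # score from a reward model judging critique quality


def select_critique(candidates: list[Critique], length_bonus: float) -> Critique:
    """Pick the candidate maximizing reward plus a tunable bonus per highlighted span.

    A larger length_bonus favors longer, more comprehensive critiques (more bugs
    caught, but more risk of nitpicks and hallucinated problems); a smaller value
    favors shorter, more precise ones.
    """
    return max(candidates, key=lambda c: c.reward_score + length_bonus * c.num_highlights)


candidates = [
    Critique("Off-by-one error in the loop bound.", num_highlights=1, reward_score=0.80),
    Critique("Off-by-one error in the loop bound; the accumulator is also never reset.",
             num_highlights=2, reward_score=0.75),
]

print(select_critique(candidates, length_bonus=0.00).text)  # favors precision
print(select_critique(candidates, length_bonus=0.10).text)  # favors comprehensiveness
```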

Interestingly, the researchers found that CriticGPT’s capabilities extend beyond code review. In their experiments, they applied the model to a subset of ChatGPT training data that had previously been deemed bug-free by human annotators. Surprisingly, CriticGPT identified bugs in 24 percent of these cases, bugs that were subsequently confirmed by human reviewers. OpenAI believes this demonstrates the model’s potential to generalize to non-code tasks, and highlights its ability to detect subtle bugs that even careful human review might miss.

Despite its promising results, CriticGPT, like all AI models, has limitations. The model is trained on relatively short ChatGPT responses, which may not fully prepare it to evaluate longer, more complex tasks that future AI systems may be able to tackle. Furthermore, while CriticGPT reduces confabulations, it does not eliminate them entirely, and human trainers may still make labeling errors based on these confabulated outputs.

The research team recognizes that CriticGPT is most effective at identifying errors that can be pinpointed to one specific location in the code. However, real-world errors in AI results can often be spread across multiple parts of an answer, posing a challenge for future model iterations.

OpenAI plans to integrate CriticGPT-like models into its RLHF labeling pipeline, and equip its trainers with AI assistance. For OpenAI, it’s a step toward developing better tools for evaluating the output of LLM systems that are difficult for humans to assess without additional support. However, the researchers caution that even with tools like CriticGPT, extremely complex tasks or responses can still be challenging for human evaluators, even those assisted by AI.
