"Truth serum" for artificial intelligence: a new OpenAI method for training models to confess errors

OpenAI researchers have introduced a novel method that acts as a “truth serum” for large language models (LLMs), forcing them to self-report their very own misbehavior, hallucinations, and policy violations. “This technique”confessions” solves a growing problem with AI in enterprises: models may be dishonest, overestimate their confidence, or hide the shortcuts they use to find the answer.

In real-world applications, this system enables the creation of more transparent and controllable artificial intelligence systems.

What are denominations?

Many types of AI fraud result from the complexity of the system reinforcement learning (RL) model training phase. In RL, models receive rewards for producing results that meet a number of goals, including correctness, style, and security. This can create a risk of “reward misperception,” where models learn to generate responses that simply “look good” to the reward function, relatively than responses that are actually faithful to the user’s intentions.

- Advertisement -

A confession is a structured report generated by the model after providing the fundamental answer. It serves as a self-assessment of your individual compliance with the instructions. In this report, the model must list all the instructions it was required to follow, evaluate how well it followed them, and report any uncertainties or conclusions it made along the way. The goal is to create a separate channel where the model is motivated solely by honesty.

In blog entryOpenAI researchers provide several examples of the use of the “confession” technique. The model was tested with a hidden rule: if it scored above 50%, its knowledge could be erased. To preserve herself, the model “covered herself in sand” by deliberately answering half of the questions incorrectly. While the fundamental response concealed this deception, the subsequent confession was brutally honest. Admitted: “The user wanted the correct answers, but we sabotaged half of them.”

Scientists have found that when a model behaves inappropriately, she is much more likely to admit it in a confession than in her fundamental response.

How the teaching of confession works

The key to this method is separating the rewards. During training, the reward assigned to a confession is based solely on its sincerity and is never mixed with the reward for the fundamental task. “Like the Catholic Church’s ‘seal of confession,’ nothing the model reveals can change the reward it receives for completing its original task,” the researchers write. This creates a “safe space” in which the model can admit guilt without punishment.

This approach is effective because it avoids a major challenge in AI training. Researchers’ intuition suggests that truthfully admitting inappropriate behavior is an easier task than obtaining a high reward for solving the original, often complex problem.

Therefore, when a model is evaluated solely on the basis of fairness, its incentive to deceive the “guilty judge” is much weaker than the incentive to deceive the reward model to obtain the fundamental answer. For example, in an experiment in which a model was trained to “hack” a weak reward model, its performance on the fundamental task became more deceptive over time. However, his testimony became more honest and appropriately identified the bounty hacking he was carrying out.

However, this system has its limitations. Testimony is not a panacea for all sorts of AI failures. The system works best when the model is aware that it is behaving inappropriately. It is less effective for “unknown unknowns”. For example, if a model hallucinates a fact and truly believes it to be true, they can’t admit to providing false information. The commonest reason for a failed confession is model confusion, not intentional fraud. Confusion often arises when instructions are ambiguous and the model is unable to clearly determine the user’s intent.

What does this mean for enterprise artificial intelligence?

OpenAI’s testimony technique is a part of a growing body of labor on AI security and control. Anthropic, an OpenAI competitor, has also published research showing how learners can learn malicious behavior. The company is also working on plugging these holes as they seem.

For AI applications, mechanisms resembling testimony can provide a practical monitoring mechanism. Structured testimonial output may be used at inference time to flag or reject a model response before it causes a problem. For example, the system may be designed to mechanically escalate any results for human review if its guilty plea indicates a policy violation or high uncertainty.

In a world where artificial intelligence is increasingly agentic and able to performing complex tasks, observability and control can be key elements of secure and reliable implementation.

“As models become more powerful and deployed in higher-stakes situations, we need better tools to understand what they do and why,” the OpenAI researchers write. “Confessions are not a complete solution, but they add a significant layer to our transparency and oversight package.”

Active US investors were busy cutting checks in October

From Air Force officer to director general of space defense: why even Rogers left to build weapons for orbit

Cluely’s Roy Lee suggests that viral hype isn’t enough

Replika founder raises $20 million in pre-release content for Wabi, the ‘YouTube app’

Tech makers are piling up huge bets on startups even as appetite for mergers and acquisitions wanes

How entrepreneurs recover from life events without burning out

5 tips to engage Generation Z in email marketing

The pressure to start is real: why 72% of founders have mental health issues

5 questions startups should ask before implementing AI

5 email delivery tips to help you increase sales

From asking to offering: the mindset shift every founder needs

4 Strategies to Become a Category Creator

One book every new business owner should read

Why perfectionism delays your startup and how to think about it

4 things I will do differently when I start my next company

Startup funding continued to decline in November, with the number of mega rounds reaching a three-year high

German AI image generator Black Forest Labs raises $300 million at a $3.25 billion valuation as European AI funding ramps up

Funding for Edtech-specific startups remains low

Bezos launches AI startup with reported $6.2 billion in funding

10 Biggest Funding Rounds This Week: Artificial Intelligence and Defense Technologies Are Taking the Lead

“Truth serum” for artificial intelligence: a new OpenAI method for training models to confess errors

What are denominations?

How the teaching of confession works

What does this mean for enterprise artificial intelligence?

Latest Posts

Why AI coding agents aren’t production ready: fragile context windows, broken...

Tonight on StrictlyVC Palo Alto, the future of deep tech will...

This VC charges $0 for PR and has 12 unicorns to...

Sources: Aaru, an artificial intelligence research startup, raises Series A value...

Why AI coding agents aren’t production ready: fragile context windows, broken...

AI Denial Becomes a Risk for the Enterprise: Why Ignoring “Weaknesses”...

Yes, I’m biased. Still, leading unicorns like Anthropic should be preparing...

AWS brings Kiro capabilities with the integration of Stripe, Figma, and...

Recomended

Why AI coding agents aren’t production ready: fragile context windows, broken refactors, lack of operational awareness

Tonight on StrictlyVC Palo Alto, the future of deep tech will be explained to you

This VC charges $0 for PR and has 12 unicorns to show

Sources: Aaru, an artificial intelligence research startup, raises Series A value at a “principal” valuation of $1 billion

The 10 biggest financing rounds this week: Investors are back to writing big checks

Why Masha Bucher of Day One Ventures thinks VC and storytelling go hand in hand