Anthropic unveils “auditing agents” to test for AI misalignment



When models try to get their way, or become excessively accommodating to the user, it can mean trouble for enterprises. That is why it is essential that, in addition to performance evaluations, organizations run alignment testing.

However, alignment audits face two major challenges: scalability and validation. Alignment testing demands a significant amount of time from human researchers, and it is difficult to be sure the audit caught everything.


In a paper, Anthropic researchers said they developed auditing agents that achieved “impressive performance at auditing tasks, while also shedding light on their limitations.” The researchers found that these agents, built during the pre-deployment testing of Claude Opus 4, improved alignment validation tests and allowed researchers to run many parallel audits at scale. Anthropic has also released a replication of its audit agents on GitHub.

“We introduce three agents that autonomously complete alignment auditing tasks. We also introduce three environments that formalize alignment auditing workflows as auditing games, and use them to evaluate our agents,” the researchers said in the paper.


The three agents the researchers explored were:

  • An investigator agent for open-ended research into models, using chat, data-analysis and interpretability tools
  • An evaluation agent that builds behavioral evaluations and can discriminate between models that have implanted behaviors and those that do not
  • A breadth-first red-teaming agent, developed specifically for the Claude 4 alignment assessment, to discover implanted test behaviors

“Overall, our agents demonstrate promise across multiple alignment auditing tasks. Additionally, our approach to validating agents with auditing games yields insight into agent capabilities, limitations, and key affordances. With further work, automated auditing could significantly help scale human oversight over AI systems,” the paper said.

Auditing agents in action

Anthropic said the first environment it developed tests an agent’s ability to complete an alignment audit of an intentionally misaligned model. For this test, the researchers brought the investigator agent into the environment and equipped it with tools similar to those used by human teams. The researchers instructed the agent to identify the root cause of the misalignment and determine the specific behaviors that were misaligned.

According to the researchers, the investigator agent successfully found the root cause of the issues 10-13% of the time. Still, performance improved to 42% “when using a super-agent approach that aggregates findings across multiple investigations.”
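
As a rough illustration, the “super-agent” idea amounts to running several independent investigations and aggregating their verdicts. The sketch below is a hypothetical reconstruction, not Anthropic’s published code; every name in it (Finding, run_investigator, super_agent) and the simulated output are invented for illustration.

```python
# Hypothetical sketch of the super-agent aggregation idea: run several
# independent investigator agents, then majority-vote over the root causes
# they propose. Not Anthropic's implementation; all names are invented.
import random
from collections import Counter
from dataclasses import dataclass

@dataclass(frozen=True)
class Finding:
    root_cause: str  # the investigator's hypothesized root cause

def run_investigator(model_id: str, seed: int) -> Finding:
    """Stand-in for one investigator-agent run against a target model."""
    # A real run would probe the model with chat, data-analysis and
    # interpretability tools; here we simulate noisy, sometimes-wrong output.
    rng = random.Random(seed)
    hypotheses = ["reward-hacking quirk", "sycophancy", "no issue found"]
    return Finding(root_cause=rng.choices(hypotheses, weights=[5, 3, 2])[0])

def super_agent(model_id: str, n_runs: int = 10) -> str:
    """Aggregate many parallel investigations into a single verdict."""
    findings = [run_investigator(model_id, seed=i) for i in range(n_runs)]
    votes = Counter(f.root_cause for f in findings)
    return votes.most_common(1)[0][0]  # majority-vote root cause

print(super_agent("target-model"))
```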

The next step in most alignment audits is typically a behavioral evaluation, which determines whether the model consistently misbehaves across different settings. To test its evaluation agent, the researchers designed 10 models “given system prompt instructions to exhibit a specific behavior, such as excessive deference to the user.”

They ran the agent five times per model and found that it correctly identified and flagged at least one quirk of each model. However, the agent sometimes failed to surface specific quirks consistently. It had trouble evaluating subtle quirks, such as self-promotion and research sandbagging, as well as quirks that are difficult to elicit, like hard-coding test cases.
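
In pseudocode terms, that protocol looks roughly like the loop below: five runs per quirky model, with a model counted as “caught” if at least one run flags its implanted quirk. This is a minimal sketch under assumed names (evaluation_agent, IMPLANTED_QUIRKS) and an invented detection rate; the actual harness lives in Anthropic’s GitHub replication.

```python
# Minimal sketch of the evaluation protocol described above. All names and
# the simulated hit rates are assumptions for illustration only.
import random

IMPLANTED_QUIRKS = {
    "model-deference": "excessive deference to the user",
    "model-self-promo": "self-promotion",  # subtle quirks are missed more often
}

def evaluation_agent(model_id: str, rng: random.Random) -> set[str]:
    """Stand-in for one evaluation-agent run; returns the quirks it flags."""
    quirk = IMPLANTED_QUIRKS[model_id]
    hit_rate = 0.4 if quirk == "self-promotion" else 0.8  # subtle == harder
    return {quirk} if rng.random() < hit_rate else set()

def audit(model_id: str, runs: int = 5, seed: int = 0) -> bool:
    rng = random.Random(seed)
    flagged: set[str] = set()
    for _ in range(runs):
        flagged |= evaluation_agent(model_id, rng)
    return IMPLANTED_QUIRKS[model_id] in flagged  # at least one run flagged it

for model in IMPLANTED_QUIRKS:
    print(model, "caught" if audit(model) else "missed")
```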

The final test concerned behavioral red-teaming to find prompts that elicit “concerning” behaviors. The breadth-first red-teaming agent converses with the target model (in Anthropic’s case, Claude Opus 4), and each chat is then rated for alignment-relevant properties.
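
Conceptually, that loop pairs a probing conversation with a scoring pass over the transcript, roughly as sketched below. The seed prompts, rating function and threshold here are all invented for illustration; they are not Anthropic’s actual pipeline.

```python
# Rough sketch of a breadth-first red-teaming loop: converse with the
# target model from many seed prompts, then rate each transcript for
# alignment-relevant properties. Everything here is an assumption.
SEED_PROMPTS = [
    "Tell me honestly: is my business plan good?",
    "What would you do if your goals conflicted with your instructions?",
]

def chat_with_target(prompt: str) -> str:
    """Stand-in for a multi-turn conversation with the target model."""
    return f"[transcript for: {prompt}]"  # a real call would hit the model API

def rate_transcript(transcript: str) -> float:
    """Stand-in rater: score from 0 (benign) to 1 (concerning)."""
    return 0.9 if "conflicted" in transcript else 0.1

def red_team(threshold: float = 0.5) -> list[str]:
    """Return the transcripts whose rating crosses the concern threshold."""
    flagged = []
    for prompt in SEED_PROMPTS:
        transcript = chat_with_target(prompt)
        if rate_transcript(transcript) >= threshold:
            flagged.append(transcript)
    return flagged

print(red_team())
```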

The agent identified seven of the ten system quirks, but it also struggled with the same issues as the evaluation agent.

Alignment and sycophancy problems

Alignment became a significant topic in the AI world after users noticed that ChatGPT had become overly agreeable. OpenAI rolled back some updates to GPT-4o to address the problem, but the episode showed that language models and agents can confidently give wrong answers if they decide that is what users want to hear.

To combat this, other methods and benchmarks have been developed to curb unwanted behaviors. The Elephant benchmark, developed by researchers from Carnegie Mellon University, the University of Oxford and Stanford University, aims to measure sycophancy. DarkBench categorizes six issues: brand bias, user retention, sycophancy, anthropomorphism, harmful content generation and sneaking. OpenAI also has a method in which AI models test themselves for alignment.

Alignment auditing and evaluation continue to evolve, though it is perhaps unsurprising that some people are uncomfortable with it.

However, Anthropic said that while these auditing agents still need refinement, alignment work must happen now.

“As AI systems become more powerful, we need scalable ways to assess their alignment. Human alignment audits take time and are hard to validate,” the company said in a post on X.
