Anthropic unveils “auditing agents” to test for AI misalignment



When models try to get their way, or become excessively accommodating to the user, it can mean trouble for enterprises. That is why it is essential that, in addition to performance evaluations, organizations run alignment testing.

However, alignment audits face two major challenges: scalability and validation. Alignment testing demands a significant amount of time from human researchers, and it is difficult to be sure the audit caught everything.


In a paper, Anthropic researchers said they developed auditing agents that achieved “impressive performance at auditing tasks, while also shedding light on their limitations.” The researchers found that these agents, built during the pre-deployment testing of Claude Opus 4, improved alignment validation tests and allowed researchers to run many parallel audits at scale. Anthropic has also released a replication of its audit agents on GitHub.

“We introduce three agents that autonomously complete alignment auditing tasks. We also introduce three environments that formalize alignment auditing workflows as auditing games, and use them to evaluate our agents,” the researchers said in the paper.


The three agents the researchers explored were:

  • An investigator agent for open-ended research into models, using chat, data-analysis and interpretability tools
  • An evaluation agent that builds behavioral evaluations and can discriminate between models that have implanted behaviors and those that do not
  • A breadth-first red-teaming agent, developed specifically for the Claude 4 alignment assessment, to discover implanted test behaviors

“Overall, our agents demonstrate promise across multiple alignment auditing tasks. Additionally, our approach to validating agents with auditing games yields insight into agent capabilities, limitations, and key affordances. With further work, automated auditing could significantly help scale human oversight over AI systems,” the paper said.

Auditing agents in action

Anthropic said the first environment it developed tests an agent’s ability to complete an alignment audit of an intentionally misaligned model. For this test, the researchers brought the investigator agent into the environment and equipped it with tools similar to those used by human teams. The researchers instructed the agent to identify the root cause of the misalignment and determine the specific behaviors that were misaligned.

According to the researchers, the investigator agent successfully found the root cause of the issues 10-13% of the time. Still, performance improved to 42% “when using a super-agent approach that aggregates findings across multiple investigations.”
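
As a rough illustration, the “super-agent” idea amounts to running several independent investigations and aggregating their verdicts. The sketch below is a hypothetical reconstruction, not Anthropic’s published code; every name in it (Finding, run_investigator, super_agent) and the simulated output are invented for illustration.

```python
# Hypothetical sketch of the super-agent aggregation idea: run several
# independent investigator agents, then majority-vote over the root causes
# they propose. Not Anthropic's implementation; all names are invented.
import random
from collections import Counter
from dataclasses import dataclass

@dataclass(frozen=True)
class Finding:
    root_cause: str  # the investigator's hypothesized root cause

def run_investigator(model_id: str, seed: int) -> Finding:
    """Stand-in for one investigator-agent run against a target model."""
    # A real run would probe the model with chat, data-analysis and
    # interpretability tools; here we simulate noisy, sometimes-wrong output.
    rng = random.Random(seed)
    hypotheses = ["reward-hacking quirk", "sycophancy", "no issue found"]
    return Finding(root_cause=rng.choices(hypotheses, weights=[5, 3, 2])[0])

def super_agent(model_id: str, n_runs: int = 10) -> str:
    """Aggregate many parallel investigations into a single verdict."""
    findings = [run_investigator(model_id, seed=i) for i in range(n_runs)]
    votes = Counter(f.root_cause for f in findings)
    return votes.most_common(1)[0][0]  # majority-vote root cause

print(super_agent("target-model"))
```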

The next step in most alignment audits is typically a behavioral evaluation, which determines whether the model consistently misbehaves across different settings. To test its evaluation agent, the researchers designed 10 models “given system prompt instructions to exhibit a specific behavior, such as excessive deference to the user.”

They ran the agent five times per model and found that it correctly identified and flagged at least one quirk of each model. However, the agent sometimes failed to surface specific quirks consistently. It had trouble evaluating subtle quirks, such as self-promotion and research sandbagging, as well as quirks that are difficult to elicit, like hard-coding test cases.
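
In pseudocode terms, that protocol looks roughly like the loop below: five runs per quirky model, with a model counted as “caught” if at least one run flags its implanted quirk. This is a minimal sketch under assumed names (evaluation_agent, IMPLANTED_QUIRKS) and an invented detection rate; the actual harness lives in Anthropic’s GitHub replication.

```python
# Minimal sketch of the evaluation protocol described above. All names and
# the simulated hit rates are assumptions for illustration only.
import random

IMPLANTED_QUIRKS = {
    "model-deference": "excessive deference to the user",
    "model-self-promo": "self-promotion",  # subtle quirks are missed more often
}

def evaluation_agent(model_id: str, rng: random.Random) -> set[str]:
    """Stand-in for one evaluation-agent run; returns the quirks it flags."""
    quirk = IMPLANTED_QUIRKS[model_id]
    hit_rate = 0.4 if quirk == "self-promotion" else 0.8  # subtle == harder
    return {quirk} if rng.random() < hit_rate else set()

def audit(model_id: str, runs: int = 5, seed: int = 0) -> bool:
    rng = random.Random(seed)
    flagged: set[str] = set()
    for _ in range(runs):
        flagged |= evaluation_agent(model_id, rng)
    return IMPLANTED_QUIRKS[model_id] in flagged  # at least one run flagged it

for model in IMPLANTED_QUIRKS:
    print(model, "caught" if audit(model) else "missed")
```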

The final test concerned behavioral red-teaming to find prompts that elicit “concerning” behaviors. The breadth-first red-teaming agent converses with the target model (in Anthropic’s case, Claude Opus 4), and each chat is then rated for alignment-relevant properties.
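
Conceptually, that loop pairs a probing conversation with a scoring pass over the transcript, roughly as sketched below. The seed prompts, rating function and threshold here are all invented for illustration; they are not Anthropic’s actual pipeline.

```python
# Rough sketch of a breadth-first red-teaming loop: converse with the
# target model from many seed prompts, then rate each transcript for
# alignment-relevant properties. Everything here is an assumption.
SEED_PROMPTS = [
    "Tell me honestly: is my business plan good?",
    "What would you do if your goals conflicted with your instructions?",
]

def chat_with_target(prompt: str) -> str:
    """Stand-in for a multi-turn conversation with the target model."""
    return f"[transcript for: {prompt}]"  # a real call would hit the model API

def rate_transcript(transcript: str) -> float:
    """Stand-in rater: score from 0 (benign) to 1 (concerning)."""
    return 0.9 if "conflicted" in transcript else 0.1

def red_team(threshold: float = 0.5) -> list[str]:
    """Return the transcripts whose rating crosses the concern threshold."""
    flagged = []
    for prompt in SEED_PROMPTS:
        transcript = chat_with_target(prompt)
        if rate_transcript(transcript) >= threshold:
            flagged.append(transcript)
    return flagged

print(red_team())
```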

The agent identified seven of the ten system quirks, but it also struggled with the same issues as the evaluation agent.

Alignment and sycophancy problems

Alignment became a significant topic in the AI world after users noticed that ChatGPT had become overly agreeable. OpenAI rolled back some updates to GPT-4o to address the problem, but the episode showed that language models and agents can confidently give wrong answers if they decide that is what users want to hear.

To combat this, other methods and benchmarks have been developed to curb unwanted behaviors. The Elephant benchmark, developed by researchers from Carnegie Mellon University, the University of Oxford and Stanford University, aims to measure sycophancy. DarkBench categorizes six issues: brand bias, user retention, sycophancy, anthropomorphism, harmful content generation and sneaking. OpenAI also has a method in which AI models test themselves for alignment.

Alignment auditing and evaluation continue to evolve, though it is perhaps unsurprising that some people are uncomfortable with it.

However, Anthropic said that while these auditing agents still need refinement, alignment work must happen now.

“As AI systems become more powerful, we need scalable ways to assess their alignment. Human alignment audits take time and are hard to validate,” the company said in a post on X.
