Open-source computer-use agents compete with OpenAI and Anthropic models



A new framework from researchers at the University of Hong Kong (HKU) and collaborating institutions forms the basis of an open-source effort to build robust AI agents that can operate computers. The framework, called OpenCUA, includes tools, data, and recipes for scaling computer-use agents (CUAs).

Models trained with this framework perform strongly on CUA benchmarks, outperforming existing open-source models and competing closely with closed agents from leading AI labs such as OpenAI and Anthropic.


The challenge of building computer-use agents

Computer-use agents are designed to autonomously perform tasks on a computer, from navigating websites to operating complex software. They can even help automate workflows in an enterprise. However, the most capable CUA systems are proprietary, with key details about their training data, architectures, and development processes kept private.

“As the lack of transparency limits technical progress and raises concerns about safety, the research community needs a truly open CUA framework to examine its capabilities, limitations, and risks,” the researchers say in their paper.




At the same time, open-source efforts face their own set of obstacles. There has been no scalable infrastructure for collecting the diverse, large-scale data needed to train these agents. Existing open-source datasets for graphical user interfaces (GUIs) offer limited data, and many research projects provide insufficient details about their methods, making it difficult to replicate their work.

According to the paper, “These limitations collectively restrict progress in general-purpose CUAs and limit meaningful examination of their scalability, generalization, and potential learning approaches.”

Introducing OpenCUA

OpenCUA is an open-source framework designed to address these challenges by scaling both data collection and the models themselves. At its core is the AgentNet tool, which records human demonstrations of computer tasks across different operating systems.

The tool streamlines data collection by running in the background on an annotator’s PC, capturing screen video, mouse and keyboard inputs, and the underlying accessibility tree, which provides structured information about on-screen elements. This raw data is then processed into “state-action trajectories,” pairing a screenshot of the computer (the state) with the corresponding user action (a click, a keypress, etc.). Annotators can then review, edit, and submit these demonstrations.
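To make the idea concrete, the state-action trajectory described above can be sketched as a simple data structure. This is a minimal, hypothetical sketch; the class and field names are illustrative, not OpenCUA's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class Action:
    kind: str     # e.g. "click", "type", "scroll"
    params: dict  # e.g. {"x": 412, "y": 88} or {"text": "OpenCUA"}

@dataclass
class Step:
    screenshot: str  # path to the screen capture (the "state")
    a11y_node: dict  # accessibility-tree info for the targeted element
    action: Action   # the user action taken in that state

@dataclass
class Trajectory:
    task: str  # natural-language task description
    os: str    # "Windows", "macOS", or "Ubuntu"
    steps: list[Step] = field(default_factory=list)

# A two-step demonstration: click a search box, then type a query.
demo = Trajectory(
    task="Search the web for 'OpenCUA'",
    os="Ubuntu",
    steps=[
        Step("shot_000.png", {"role": "textbox"}, Action("click", {"x": 412, "y": 88})),
        Step("shot_001.png", {"role": "textbox"}, Action("type", {"text": "OpenCUA"})),
    ],
)
print(len(demo.steps))  # 2
```

Each step pairs one observation with one action, which is the unit the annotators review before submission.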

Using this tool, the researchers collected the AgentNet dataset, which comprises more than 22,600 task demonstrations across Windows, macOS, and Ubuntu, covering over 200 applications and websites. “This dataset authentically captures the complexity of human behaviors and dynamic environments from users’ personal computing environments,” the paper notes.

Recognizing that screen-recording tools raise significant data-privacy concerns for enterprises, the researchers designed the AgentNet tool with security in mind. Xinyuan Wang, co-author of the paper and a PhD student at HKU, explained that they implemented a multi-layer privacy-protection framework. “First, annotators themselves can fully review the data they generate … before deciding whether to submit it,” he told VentureBeat. The data then undergoes manual verification for privacy issues and automated scanning by a large model to detect any remaining sensitive content before release. “This layered process ensures enterprise-grade robustness for environments handling sensitive customer or financial data,” Wang added.

To speed up evaluation, the team also built AgentNetBench, an offline benchmark that provides multiple correct actions for each step, offering a more efficient way to measure an agent’s performance.
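The "multiple correct actions per step" idea can be sketched as a simple offline scorer. This is an illustrative simplification, assuming exact-match comparison against a set of acceptable actions; AgentNetBench's real matching rules are not specified in this article.

```python
def step_correct(predicted: str, acceptable: set[str]) -> bool:
    """A step counts as correct if the prediction matches any gold action."""
    return predicted in acceptable

def success_rate(predictions: list[str], gold: list[set[str]]) -> float:
    """Fraction of steps where the agent produced an acceptable action."""
    correct = sum(step_correct(p, g) for p, g in zip(predictions, gold))
    return correct / len(gold)

# Two-step example: either click location is acceptable for step one.
gold = [
    {"click(412, 88)", "click(410, 90)"},
    {"type('OpenCUA')"},
]
preds = ["click(410, 90)", "type('opencua')"]
print(success_rate(preds, gold))  # 0.5
```

Because scoring is a lookup rather than a live rollout in a real OS environment, this style of evaluation is much cheaper to run, which is the efficiency gain the article describes.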

A new recipe for training agents

The OpenCUA framework introduces a novel pipeline for processing data and training computer-use agents. The first step converts raw human demonstrations into clean state-action pairs suitable for training vision-language models (VLMs). However, the researchers found that simply training models on these pairs yields limited performance gains, even with large amounts of data.

The key insight was to augment these trajectories with chain-of-thought (CoT) reasoning. This process generates a detailed “inner monologue” for each action, covering planning, memory, and reflection. This structured reasoning is organized in three levels: a high-level observation of the screen, reflective thoughts that analyze the situation and plan the next steps, and finally the concise, executable action. This approach helps the agent develop a deeper understanding of tasks.
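The three-level structure above can be illustrated with a small sketch that wraps a raw state-action pair in reasoning fields. The function below is a hypothetical stand-in; in the actual pipeline the observation and thought text would be generated by a model, not templated.

```python
def augment_step(screen_desc: str, history: list[str], action: str) -> dict:
    """Wrap a raw state-action pair with three-level CoT fields."""
    return {
        # Level 1: high-level observation of what is on screen
        "observation": screen_desc,
        # Level 2: reflective thought analyzing the situation and planning
        "thought": (
            f"Completed so far: {'; '.join(history) or 'nothing yet'}. "
            f"Given the current screen, the next step is to {action}."
        ),
        # Level 3: the concise, executable action
        "action": action,
    }

record = augment_step(
    screen_desc="A browser showing an empty search box",
    history=[],
    action="click the search box at (412, 88)",
)
print(record["action"])
```

Training on records like this, rather than bare screenshot-action pairs, is what the article credits with the larger performance gains.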

“We believe that natural language reasoning is crucial for generalizable computer-use foundation models, helping CUAs internalize cognitive abilities,” the researchers write.

This data-synthesis pipeline is a general framework that companies can adapt to train agents on their own unique internal tools. According to Wang, a company can record demonstrations of its proprietary workflows and use the same “reflector” and “generator” pipeline to create the necessary training data. “This allows them to bootstrap a high-performing agent adapted to their internal tools without manually writing reasoning traces,” he explained.

Putting OpenCUA to the test

The researchers used the OpenCUA framework to train a range of open-source VLMs, including Qwen and Kimi-VL variants, with parameter sizes ranging from 3 billion to 32 billion. The models were evaluated on a suite of online and offline benchmarks that test their ability to perform tasks and understand GUIs.

The 32-billion-parameter model, OpenCUA-32B, established a new state-of-the-art success rate among open-source models on the OSWorld benchmark. It also surpassed OpenAI’s GPT-4o-based CUA and significantly narrowed the performance gap with leading proprietary models from Anthropic.

For enterprise developers and product leaders, the research offers several key takeaways. The OpenCUA method is broadly applicable, improving the performance of models of varied architectures (both dense and mixture-of-experts) and sizes. The trained agents also show strong generalization, performing well across a diverse range of tasks and operating systems.

According to Wang, the framework is particularly well suited to automating repetitive, labor-intensive enterprise workflows. “For example, in the AgentNet dataset, we already capture several demonstrations of launching EC2 instances on Amazon AWS and configuring annotation parameters in MTurk,” he told VentureBeat. “These tasks involve many sequential steps but follow repetitive patterns.”

However, Wang noted that bridging the gap to live deployment requires solving key challenges around safety and reliability. “The biggest challenge in real-world deployment is safety and reliability: the agent must avoid mistakes that could accidentally alter system settings or cause harmful side effects beyond the intended task,” he said.

The researchers have released the code, the dataset, and the weights for their models.

As open-source agents built on frameworks like OpenCUA become more capable, they are likely to fundamentally change how knowledge workers relate to their computers. Wang envisions a future in which proficiency with complex software becomes less essential than the ability to clearly articulate goals to an AI agent.

He described two basic modes of working: “offline automation, where the agent uses its broader knowledge of software to carry out a task end to end,” and “online collaboration, where the agent reacts in real time and works alongside a human, like a colleague.” In essence, humans will provide the strategic “what,” while increasingly sophisticated AI agents handle the operational “how.”
