EAGLET boosts the efficiency of AI agents on long-horizon tasks by generating custom plans

2025 was supposed to be the year of "AI agents," according to Nvidia CEO Jensen Huang and others in the AI industry. And in many respects it has been: many of the leading AI model providers such as OpenAI, Google, and even Chinese competitors such as Alibaba have released refined AI models or applications designed to focus on a narrow set of tasks, such as web searches and report writing.

However, one big obstacle remains on the path to highly efficient and reliable AI agents: ensuring that they continue performing a task when it involves several steps. Third-party benchmarks show that even the strongest AI models experience more failures the more steps they take to finish a task and the longer they spend on it (stretching into hours).

Now a recent academic framework called EAGLET proposes a practical and effective method to improve long-horizon task performance in LLM-based agents – without the need for manual data labeling or retraining.


Developed by researchers from Tsinghua University, Peking University, DeepLang AI, and the University of Illinois Urbana-Champaign, EAGLET offers a "global planner" that can be integrated into existing agent workflows to reduce hallucinations and improve task efficiency.

EAGLET is a fine-tuned language model that interprets task instructions – typically provided as prompts by the user or the agent's operating environment – and generates a high-level plan for the executor agent (which runs on its own LLM). The planner does not interfere during execution, but its up-front guidance helps reduce planning errors and improve task completion rates.

Solving the planning problem for long-horizon agents

Many LLM-based agents struggle with long-horizon tasks because they rely on step-by-step reactive reasoning. This approach often results in trial-and-error behavior, hallucinated plans, and inefficient trajectories.

EAGLET addresses this limitation by introducing a global planning module that works alongside the executor agent.

Instead of combining planning and action generation in a single model, EAGLET separates them, enabling more consistent strategies at the task level.

Two-step training process without human annotations

The EAGLET planner is trained in a two-step process that requires no human-written plans or annotations.

The first step involves generating synthetic plans using high-performing LLMs such as GPT-5 and DeepSeek-V3.1-Think.

These plans are then filtered using a novel strategy called homologous consensus filtering, which keeps only those plans that improve task performance for both expert and novice executor agents.
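The paper's exact filtering procedure isn't spelled out in this article, but the core idea (keep a plan only if it helps both an expert and a novice executor) can be sketched in a few lines. Everything below, including the `consensus_filter` function and the `run_task` hook, is a hypothetical illustration, not the authors' code.

```python
def consensus_filter(candidates, expert, novice, run_task):
    """Keep (task, plan) pairs that flip BOTH executors from failure
    to success, as a rough stand-in for "improves task performance
    for expert and novice agents alike".

    run_task(agent, task, plan) -> bool is a hypothetical hook that
    executes a task with an optional plan and reports success.
    """
    kept = []
    for task, plan in candidates:
        if all(
            run_task(agent, task, plan) and not run_task(agent, task, None)
            for agent in (expert, novice)
        ):
            kept.append((task, plan))
    return kept
```

In practice "improves performance" could also be measured over many rollouts per agent rather than a single success bit; the consensus requirement across agent tiers is the part the article emphasizes.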

In the second stage, a rule-based reinforcement learning process further refines the planner, using a specially designed reward function to gauge how well each plan helps multiple agents succeed.

Introducing the Executor Capability Gain Reward (ECGR)

One of EAGLET's key innovations is the Executor Capability Gain Reward (ECGR).

This reward scores a generated plan by whether it helps both high- and low-capability agents complete tasks more efficiently and in fewer steps.

It also includes a decay factor that favors shorter, more efficient task trajectories. This approach avoids overfitting plans to already-competent agents and promotes more generalizable planning guidance.
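A reward with these two properties (capability gain across agent tiers, discounted by trajectory length) might look like the following. The exact formula, field names, and the default decay value are assumptions for illustration; the paper's actual ECGR definition may differ.

```python
def ecgr(plan_results, gamma=0.9):
    """Sketch of a capability-gain reward with a length decay factor.

    plan_results: one dict per executor that tried the plan, with keys
      'success_with_plan', 'success_without_plan' (bools), 'steps' (int).
    gamma: decay factor in (0, 1); longer trajectories earn less,
    which rewards plans that lead to shorter, more efficient runs.
    """
    reward = 0.0
    for r in plan_results:
        # Gain is +1 when the plan turns a failure into a success,
        # 0 when it changes nothing, -1 when it hurts.
        gain = int(r["success_with_plan"]) - int(r["success_without_plan"])
        reward += gain * (gamma ** r["steps"])
    return reward / len(plan_results)
```

Averaging over multiple executors is what pushes the planner toward plans that generalize rather than ones tuned to a single strong agent.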

Compatible with existing agents and models

The EAGLET planner is designed to be modular and plug-and-play, meaning it can be incorporated into existing agent pipelines without the need to retrain executors.
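Structurally, "plug-and-play" here means the planner runs once up front and its output is injected into the executor's context. A minimal sketch of that pattern, assuming hypothetical `planner` and `executor` callables that wrap LLM calls (prompt in, text out):

```python
def plan_then_execute(task_instruction, planner, executor):
    """Generate a global plan once, then hand it to the executor as
    extra context. The planner never intervenes during execution,
    matching the separation the article describes."""
    plan = planner(
        f"Devise a high-level plan for this task:\n{task_instruction}"
    )
    prompt = (
        f"Task: {task_instruction}\n"
        f"Global plan (adapt as needed during execution):\n{plan}\n"
        "Begin executing step by step."
    )
    return executor(prompt)
```

Because the executor only sees an enriched prompt, any existing agent loop (ReAct, Reflexion, etc.) can sit behind `executor` unchanged, which is what makes retraining unnecessary.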

During evaluations, the planner improved performance across various base models, including GPT-4.1, GPT-5, Llama-3.1, and Qwen2.5.

It also proved effective regardless of prompting strategy, working well with standard ReAct-style prompts as well as approaches like Reflexion.

State-of-the-art benchmark performance

EAGLET was tested on three commonly used benchmarks for long-horizon agent tasks: ScienceWorld, which simulates science experiments in a text-based laboratory environment; ALFWorld, which tasks agents with performing household chores using natural language in a simulated home setting; and WebShop, which assesses goal-oriented behavior in a realistic online shopping interface.

In all three cases, EAGLET-equipped executor agents outperformed their planner-free counterparts and other planning baselines, including MPO and KnowAgent.

In experiments with the open-source Llama-3.1-8B-Instruct model, EAGLET improved average performance from 39.5 to 59.4, an increase of +19.9 points across all tasks.

On unseen ScienceWorld scenarios, performance rose from 42.2 to 61.6.

In ALFWorld scenarios, EAGLET improved results from 22.9 to 54.3, more than a 2.3-fold gain.

Gains also held for more capable models.

For example, GPT-4.1 improved from 75.5 to 82.2 with EAGLET, and GPT-5 rose from 84.5 to 88.1 despite already performing well.

In some tests, the performance gain was as much as +11.8 points, such as when combining EAGLET with the ETO training method on unseen ALFWorld tasks.

Compared to other planning baselines such as MPO, EAGLET consistently delivered higher task completion rates. For example, on ALFWorld's unseen tasks with GPT-4.1, MPO achieved 79.1 while EAGLET achieved 83.6, an advantage of +4.5 points.

Additionally, the paper reports that agents using EAGLET complete tasks in fewer steps on average. With GPT-4.1 as executor, the average number of steps dropped from 13.0 (without planning) to 11.1 (with EAGLET). For GPT-5, it fell from 11.4 to 9.4, supporting the claim of more efficient execution.

Increased efficiency in training and execution

Compared to RL-based methods such as GiGPO, which can require hundreds of training iterations, EAGLET achieved better or comparable results with roughly one-eighth of the training effort.

This efficiency carries over to execution: agents using EAGLET typically needed fewer steps to finish tasks, which translates into lower inference time and compute costs in production scenarios.

No public code – yet

As of the version submitted to arXiv, the authors have not yet published an open-source implementation of EAGLET. It is unclear if or when the code will be released, under what license, or how it will be maintained, which could limit the framework's near-term usefulness for enterprise deployment.

VentureBeat has reached out to the authors to clarify these points and will update this article when we hear back.

Questions remain about enterprise implementation

Although the planner is described as plug-and-play, it is unclear whether EAGLET can be easily integrated with popular enterprise agent platforms such as LangChain or AutoGen, or whether it requires a custom stack to support plan-execution separation.

Similarly, the training setup uses multiple executor agents, which may be difficult to reproduce in enterprise environments with limited model access. VentureBeat asked the researchers whether the homologous consensus filtering method could be adapted for teams that only have access to a single executor model or limited computing resources.

The EAGLET authors report success with various model types and sizes, but the minimum model scale at which the planner is practical remains unknown. For example, can enterprise teams effectively run the planner on open, sub-10B models in latency-sensitive environments? The framework may also offer industry-specific value in areas such as customer support or IT automation, but it remains to be seen how easily the planner can be adapted to such domains.

Real-time vs. pre-generated planning

Another open question is how best to deploy EAGLET in practice. Should the planner run in real time alongside executors in the loop, or is it better to use it offline to pre-generate global plans for known task types? Each approach has implications for latency, cost, and operational complexity. VentureBeat posed this question to the authors and will report any insights that emerge.
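The offline option can be sketched as a plan cache: pre-generate global plans for known task types, serve them at run time, and fall back to a live planner call only on a cache miss. The class and method names below are illustrative, not part of EAGLET.

```python
class PlanCache:
    """Serve pre-generated global plans by task type, calling the
    planner live only for task types not seen during precompute."""

    def __init__(self, planner):
        self.planner = planner  # callable: task_type -> plan text
        self._cache = {}

    def precompute(self, task_types):
        # Offline batch step: pay planner latency and cost up front.
        for t in task_types:
            self._cache[t] = self.planner(t)

    def get(self, task_type):
        # Real-time fallback only on a miss, then memoize.
        if task_type not in self._cache:
            self._cache[task_type] = self.planner(task_type)
        return self._cache[task_type]
```

The trade-off the article raises shows up directly here: precomputing removes planner latency from the request path, but only helps when incoming tasks map cleanly onto known task types.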

Strategic trade-offs for enterprise teams

For technical leaders in medium and large enterprises, EAGLET provides a compelling proof of concept for improving the reliability and performance of LLM agents. However, without public tooling and implementation guidelines, the framework still represents a "build it yourself or wait" decision. Companies must weigh the potential gains in task efficiency and performance against the costs of replicating or approximating the training process in-house.

Potential use cases in enterprise settings

For enterprises developing agent-based AI systems – especially in environments requiring staged planning, such as IT automation, customer support, or online interactions – EAGLET offers a template for adding planning without retraining. Its ability to work with both open and closed-source models, along with its efficient training method, could make it an attractive starting point for teams looking to improve agent performance with minimal overhead.
