One of the biggest challenges of deploying autonomous agents is building systems that can adapt to changes in their environment without retraining the underlying large language models (LLMs).
Memento-Skills, a new framework developed by researchers from several universities, eliminates this bottleneck by enabling agents to develop their skills on their own. “It adds continuous learning capabilities to existing offerings in the current market, such as OpenClaw and Claude Code,” Jun Wang, co-author of the paper, told VentureBeat.
Memento-Skills acts as an evolving external memory, allowing the system to continually expand its capabilities without modifying the base model. The framework maintains a set of skills that can be updated and extended as the agent receives feedback from its environment.
This is vital for enterprise teams running agents in production. The alternatives, fine-tuning model weights or hand-building skills, carry significant operational and data requirements. Memento-Skills bypasses both.
The challenges of building self-evolving agents
Self-evolving capabilities are crucial because they overcome a core limitation of frozen language models. Once a model is deployed, its parameters remain fixed, limiting it to the knowledge encoded during training and whatever fits inside its immediate context window.
Giving the model external memory scaffolding allows it to improve without a costly and slow retraining process. However, current approaches to agent adaptation rely largely on hand-engineered skills to deal with new tasks. While some methods learn skills automatically, most of them produce text guidelines that amount to little more than prompt optimization. Other approaches simply record single-task trajectories that do not transfer between different tasks.
Moreover, when agents try to retrieve knowledge relevant to a new task, they typically rely on semantic similarity routers such as standard dense embeddings; high semantic overlap does not guarantee behavioral utility. An agent using standard RAG may retrieve a “password reset” script to resolve a “refund processing” query simply because the documents share the same corporate terminology.
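That failure mode is easy to reproduce with a toy bag-of-words retriever, a deliberately simplified stand-in for a dense-embedding router; the skill descriptions and query below are invented for illustration:

```python
from collections import Counter
from math import sqrt

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity over raw token counts.
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def tokens(text: str) -> Counter:
    return Counter(text.lower().split())

# Two stored skills; the first is padded with shared corporate boilerplate.
skills = {
    "password_reset": "acme portal account support ticket workflow reset user password",
    "refund_processing": "issue refund payment order",
}

# A refund query that happens to share the boilerplate vocabulary.
query = "acme portal account support ticket workflow process a refund"

scores = {name: cosine(tokens(query), tokens(desc)) for name, desc in skills.items()}
best = max(scores, key=scores.get)
print(best)  # prints "password_reset": lexical overlap wins over behavioral fit
```

The lexically overlapping but behaviorally wrong skill scores highest, which is exactly the trap the paper's router is designed to avoid.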
“Most retrieval-augmented generation (RAG) systems rely on similarity-based search. However, when skills are represented as executable artifacts, such as markdown documents or code snippets, similarity alone may not select the most effective skill,” Wang said.
How Memento-Skills stores and updates skills
To address the limitations of current agent systems, the researchers created Memento-Skills. The paper describes the system as “a generic, continuously learning LLM agent system that acts as an agent design agent.” Instead of keeping a passive diary of past conversations, Memento-Skills builds a set of skills that serves as a persistent, evolving external memory.
These skills are stored in structured markdown files and function as the agent’s evolving knowledge base. Each reusable skill artifact consists of three basic elements. The first is a declarative specification that outlines what the skill is and how it should be used. The second comprises specialized instructions and prompts that guide the language model’s reasoning. The third contains executable code and supporting scripts that the agent runs to actually solve the task.
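The article does not give the exact file schema, but the three-part artifact might be sketched along these lines; all field and section names here are illustrative, not the framework’s actual format:

```python
from dataclasses import dataclass

@dataclass
class Skill:
    spec: str    # declarative: what the skill is and when to use it
    prompt: str  # instructions that steer the model's reasoning
    code: str    # executable script the agent runs to solve the task

def to_markdown(skill: Skill) -> str:
    # Serialize the three parts into one structured markdown artifact.
    return (
        f"## Specification\n{skill.spec}\n\n"
        f"## Instructions\n{skill.prompt}\n\n"
        f"## Code\n{skill.code}\n"
    )

skill = Skill(
    spec="Search the web and summarize the top results for a query.",
    prompt="Call the search tool, then condense the findings into bullets.",
    code="def run(query):\n    return web_search(query)",
)
md = to_markdown(skill)
```

Keeping the specification, prompts, and code in one file is what lets the coordinator later rewrite any of the three parts independently when a skill fails.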
Memento-Skills enables continuous learning through a “Reflective Read-Write Learning” mechanism that treats memory updates as active policy iteration rather than passive data logging. When faced with a new task, the agent queries a specialized skill router for the most behaviorally appropriate skill, not merely the most semantically similar one, and executes it.
Once the agent runs the skill and receives feedback, the system analyzes the outcome to close the learning loop. Instead of simply appending a log of what happened, the system actively mutates its memory. If execution fails, the coordinator evaluates the trace and rewrites the skill artifacts, directly updating the code or prompts to patch the specific failure mode. If necessary, it creates an entirely new skill.
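The read-act-reflect-write loop described above can be sketched as a minimal runnable example; every class and method name here is an illustrative stand-in, not the framework’s actual API:

```python
from dataclasses import dataclass

@dataclass
class Skill:
    name: str
    code: str

@dataclass
class Result:
    success: bool
    trace: str = ""

class Library:
    def __init__(self, skills):
        self.skills = {s.name: s for s in skills}
    def replace(self, old, new):
        # Mutate memory in place rather than appending a log entry.
        self.skills[old.name] = new

def execute(skill, task):
    # Stub executor: a skill succeeds only if its code covers the task.
    ok = task in skill.code
    return Result(ok, trace="" if ok else f"{skill.name} failed on {task}")

def rewrite(skill, trace, task):
    # Stub "coordinator": patch the artifact to fix the observed failure mode.
    return Skill(skill.name, skill.code + f"\n# handles {task}\n{task}")

def handle(task, lib, skill_name):
    skill = lib.skills[skill_name]      # read: route to a skill
    result = execute(skill, task)       # act: run it
    if not result.success:              # reflect: inspect the failure trace
        patched = rewrite(skill, result.trace, task)
        lib.replace(skill, patched)     # write: mutate the stored artifact
        result = execute(patched, task)
    return result

lib = Library([Skill("search", "web_search()")])
out = handle("summarize", lib, "search")
print(out.success)  # prints True: the patched skill now covers the task
```

The point of the sketch is the final write step: after a failure, the library entry itself is rewritten, so the next task of the same kind hits the patched skill directly.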
Memento-Skills also updates the skill router through a one-step offline reinforcement learning process that learns from performance feedback, not just text overlap. “The real value of a skill is how it contributes to the agent’s overall workflow and subsequent execution,” Wang said. “Therefore, reinforcement learning provides a more appropriate framework because it allows the agent to evaluate and select skills based on long-term utility.”
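One simple way to realize such a feedback-driven router is a bandit-style running-average update over logged execution rewards. This is a hedged illustration of the idea, not the paper’s exact algorithm, and the task and skill names are invented:

```python
from collections import defaultdict

class SkillRouter:
    def __init__(self, lr=0.3):
        # Learned utility per (task_type, skill), starting at zero.
        self.value = defaultdict(float)
        self.lr = lr

    def select(self, task_type, skills):
        # Pick the skill with the highest learned utility, not text overlap.
        return max(skills, key=lambda s: self.value[(task_type, s)])

    def update(self, task_type, skill, reward):
        # One-step offline update: nudge the estimate toward observed reward.
        key = (task_type, skill)
        self.value[key] += self.lr * (reward - self.value[key])

router = SkillRouter()
# Replay logged feedback: refund tasks succeeded with refund_flow, not pw_reset.
for skill, reward in [("pw_reset", 0.0), ("refund_flow", 1.0), ("refund_flow", 1.0)]:
    router.update("refund", skill, reward)

best = router.select("refund", ["pw_reset", "refund_flow"])
print(best)  # prints "refund_flow": routing now follows behavioral utility
```

Because the scores come from execution outcomes rather than embeddings, the semantically plausible but behaviorally useless skill loses the routing decision.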
To prevent regressions in production, skill mutations are guarded by an automated unit-test gate. The system generates a synthetic test case, runs it against the updated skill, and checks the result before committing the change to the global library.
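A minimal sketch of such a gate follows; the commit helper and test case here are hypothetical, and in the real system the synthetic test case is itself generated rather than hand-written:

```python
def gate(candidate_fn, test_input, expected):
    # Run the synthetic test against the mutated skill; any exception fails.
    try:
        return candidate_fn(test_input) == expected
    except Exception:
        return False

# Global skill library: the known-good "double" skill.
library = {"double": lambda x: x * 2}

def commit(name, candidate, test_input, expected):
    # Only overwrite the global library if the synthetic test passes.
    if gate(candidate, test_input, expected):
        library[name] = candidate
        return True
    return False

# A buggy mutation is rejected and the working skill survives.
ok = commit("double", lambda x: x + 2, 3, 6)
print(ok, library["double"](3))  # prints: False 6
```

The gate turns self-modification from a silent overwrite into a guarded transaction: a regression is caught before it can pollute the shared library.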
By continually rewriting and refining its own executable artifacts, Memento-Skills lets a frozen language model build a kind of muscle memory and steadily expand its capabilities over time.
Putting a self-evolving agent to the test
The researchers evaluated Memento-Skills on two rigorous benchmarks. The first is General AI Assistants (GAIA), which requires complex multi-step reasoning, multiple modalities, web browsing, and tool use. The second is Humanity’s Last Exam, or HLE, an expert-level benchmark covering eight academic subjects such as mathematics and biology. The entire system was powered by Gemini-3.1-Flash acting as the frozen base language model.
The system was compared against a baseline read-write system that retrieves skills and collects feedback but lacks the self-evolving functions. The researchers also tested their custom skill router against standard retrieval baselines, including BM25 and Qwen3 Embedding.
The results showed that an actively evolving memory significantly outperforms a static skill library. On the heterogeneous GAIA benchmark, Memento-Skills improved test-set accuracy by 13.7 percentage points over the static baseline, achieving 66.0% versus 52.3%. On the HLE benchmark, where the domain structure allowed heavy reuse of skills across tasks, performance more than doubled compared to the baseline, jumping from 17.9% to 38.7%.
Furthermore, Memento-Skills’ specialized skill router avoids the classic retrieval trap where an irrelevant skill is chosen simply because of semantic similarity. Experiments show that Memento-Skills raises the success rate on complex tasks to 80%, compared to only 50% for standard BM25 retrieval.
The researchers observed that Memento-Skills achieves these results through organic, structured skill growth. Both experiments began with just five seed skills, such as basic web search and terminal use. On the GAIA test, the agent independently expanded this seed set into a compact library of 41 skills to handle a wide variety of tasks. On the expert-level HLE benchmark, the system dynamically scaled its library to 235 distinct skills.
Finding the enterprise sweet spot
The researchers have published the code for Memento-Skills on GitHub, where it is freely available for use.
For enterprise architects, the effectiveness of this method depends on domain fit. Beyond the benchmark results, the main business trade-off is whether your agents handle isolated tasks or structured workflows.
“Skill transfer depends on the degree of task similarity,” Wang said. “First, when tasks are isolated or weakly related to each other, the agent cannot rely on prior experience and must learn through interaction.” In such distributed environments, transfer between tasks is limited. “Second, when tasks have a common structure, previously acquired skills can be directly reused. In this case, learning becomes more effective because knowledge is transferred between tasks, allowing the agent to perform well on new problems with little or no additional interaction.”
Given that the system requires repeatable task patterns to consolidate knowledge, enterprise leaders need to know exactly where to implement it today and where to hold off.
“Workflows are probably the most appropriate place for this approach because they provide a structured environment in which to compose, evaluate and refine skills,” Wang said.
However, he cautioned against deploying the framework in areas it does not yet cover. “Physical agents remain largely unexplored in this context and require further research. Additionally, tasks with longer horizons may require more advanced approaches, such as multi-agent LLM systems, to enable the coordination, planning and continuous execution of longer decision sequences.”
As the industry moves toward agents that write their own production code, governance and security remain paramount. While Memento-Skills includes basic safeguards such as the automated unit-test gate, a broader governance framework will likely be needed for enterprise deployment.
“To enable reliable self-improvement, we need a well-designed assessment or evaluation system that can assess performance and provide consistent guidance,” Wang said. “Rather than allowing for unlimited self-modification, the process should be structured as a directed form of self-development in which feedback guides the agent toward better designs.”
