AI Learns New Skills Continuously, Without Needing Human Instruction or Pre-Set Goals

Researchers are tackling the significant challenge of creating artificial agents that can autonomously discover and learn an open-ended range of skills. Richard Bornemann and Antoine Cully of Imperial College London, together with Pierluigi Vito Amadori of Sony Interactive Entertainment and colleagues, present a novel framework, Continuous Open-ended Discovery and Evolution of Skills as Hierarchical Reward Programs (CODE-SHARP), which moves beyond reliance on manually designed reward systems. The work is particularly noteworthy for using Foundation Models to dynamically build and improve a library of skills, represented as executable code, enabling agents to solve increasingly complex, long-horizon goals within the Craftax environment and ultimately to surpass both pre-trained agents and expert policies by over 134% on average.

This work addresses a critical limitation in reinforcement learning, where agents typically require hand-designed reward functions, a process that becomes infeasible for open-ended skill discovery, where the desired skills are not known in advance.

CODE-SHARP leverages the power of Foundation Models to both discover and refine a hierarchical archive of skills, represented as executable reward functions written in code. The framework runs two iterative processes driven by Foundation Models: one that discovers new SHARP skills and another that refines existing ones.

Novel skills are proposed, implemented, and selected, while existing skills undergo mutation and evaluation, creating an emergent hierarchy of increasing complexity. This approach allows the system to move beyond simply improving predefined skills and towards genuinely discovering entirely new capabilities.
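To make the idea of skills-as-code concrete, here is a minimal, hypothetical sketch of what such executable reward functions could look like. The function names, the dictionary-based environment state, and the reward values are illustrative assumptions, not the paper's actual API; note how the second skill composes the first, which is how the hierarchy emerges.

```python
# Hypothetical sketch of skills expressed as executable reward
# functions, in the spirit of CODE-SHARP's SHARPs. State fields
# and reward magnitudes are illustrative assumptions.

def collect_wood_reward(prev_state: dict, state: dict) -> float:
    """Reward the agent for increasing its wood inventory."""
    gained = state.get("wood", 0) - prev_state.get("wood", 0)
    return float(max(gained, 0))

def craft_table_reward(prev_state: dict, state: dict) -> float:
    """Composite skill: reuses the wood skill, then rewards crafting."""
    # Hierarchy emerges by composing previously discovered skills.
    sub = collect_wood_reward(prev_state, state)
    crafted = 0.0
    if state.get("crafting_table", False) and not prev_state.get("crafting_table", False):
        crafted = 10.0
    return sub + crafted
```

In a full system, each such function would be generated and rewritten by a Foundation Model rather than written by hand.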

Researchers demonstrated the effectiveness of CODE-SHARP within the Craftax environment, successfully discovering an average of 90 diverse SHARP skills. A goal-conditioned agent, trained solely on the rewards generated by these discovered skills, learned to solve complex, long-horizon goals previously unattainable.

Furthermore, when these skills were integrated into high-level policies by a Foundation Model-based planner, the agent outperformed both pre-trained agents and task-specific expert policies by over 134% on average. This significant performance increase highlights the potential of CODE-SHARP to create truly generalist agents capable of adapting to unseen environments and acquiring new skills autonomously.

The framework consists of a directed graph where each node represents a SHARP skill, implemented in Python and generated by a series of Foundation Models. These models function as a skill proposal generator, implementor, and judge, filtering and evaluating potential skills before environmental testing.
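The directed-graph archive can be pictured with a short data-structure sketch. This is an assumption-laden illustration (the class and field names are invented, and the real system stores much more per node), but it shows how edges to parent skills induce a hierarchy depth.

```python
# Minimal sketch of a skill archive as a directed graph: each node
# stores a skill's source code, and edges point to the parent skills
# it composes. Names are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class SkillNode:
    name: str
    source: str                                   # executable reward-function code
    parents: list = field(default_factory=list)   # skills this one builds on

class SkillArchive:
    def __init__(self):
        self.nodes: dict[str, SkillNode] = {}

    def add(self, name: str, source: str, parents=()):
        self.nodes[name] = SkillNode(name, source, list(parents))

    def depth(self, name: str) -> int:
        """Hierarchy level: primitives are 0, composites are 1 + deepest parent."""
        node = self.nodes[name]
        if not node.parents:
            return 0
        return 1 + max(self.depth(p) for p in node.parents)
```

Depth here is just one convenient way to read "increasing complexity" off the graph.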

Skill refinement is achieved through mutation proposals, also generated by Foundation Models, and directly evaluated within the environment. Simultaneously, a goal-conditioned agent is trained on the rewards from the growing skill archive, continuously expanding its learned capabilities and tackling increasingly complex goals. This archive is structured as a directed graph of executable reward functions written in code, termed SHARPs, and facilitates open-ended skill discovery.

The system employs two iterative processes: the discovery of novel SHARP skills and the refinement of existing skills within the archive, both guided by a Foundation Model. To generate new skills, CODE-SHARP utilises a pipeline comprising a Foundation Model-based skill proposal generator, implementor, and judge.

The proposal generator creates candidate skills, which are then assessed for code syntax by the implementor before being evaluated within the environment. The judge, also powered by a Foundation Model, determines whether a proposed skill is acceptable for inclusion in the archive, filtering out failed proposals and mutations.
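The proposal-to-judge pipeline described above can be sketched as a single discovery step. The `fm_*` callables below are stand-ins for Foundation Model queries; their names and return shapes are assumptions for illustration, not the paper's interfaces.

```python
# Hedged sketch of the proposer -> implementor -> judge discovery loop.
# fm_propose, fm_implement, fm_judge stand in for Foundation Model
# calls; env_eval stands in for running the skill in the environment.

def discover_skill(fm_propose, fm_implement, fm_judge, env_eval, archive):
    proposal = fm_propose(archive)        # natural-language skill idea
    code = fm_implement(proposal)         # candidate Python reward function
    try:
        compile(code, "<skill>", "exec")  # cheap syntax gate before env testing
    except SyntaxError:
        return None                       # filtered out, never reaches the env
    stats = env_eval(code)                # evaluate in the environment
    if fm_judge(proposal, code, stats):   # accept into the archive?
        archive[proposal] = code
        return proposal
    return None
```

The syntax check before environment rollout mirrors the idea of filtering failed proposals cheaply, since environment evaluation is the expensive step.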

Skill refinement involves mutating existing SHARP skills using a Foundation Model-based skill mutation generator and implementor, followed by environmental evaluation. Each novel SHARP skill is constructed by composing previously added SHARP skills from the archive, resulting in an emergent hierarchy of increasing complexity.
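A refinement step might look like the following sketch, where a mutated version of a skill replaces the original only if it still compiles and scores better in the environment. The acceptance rule and all names here are assumptions for illustration.

```python
# Illustrative sketch of skill refinement by mutation: an existing
# skill's code is rewritten by a Foundation Model stub and kept only
# if it still compiles and improves the measured score.

def refine_skill(name, archive, fm_mutate, env_eval):
    original = archive[name]
    mutated = fm_mutate(original)              # FM proposes an edited version
    try:
        compile(mutated, "<skill>", "exec")
    except SyntaxError:
        return False                           # mutation rejected
    if env_eval(mutated) > env_eval(original): # direct environmental evaluation
        archive[name] = mutated
        return True
    return False
```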

The system simultaneously trains a single, goal-conditioned agent exclusively on the rewards and goals derived from the expanding skill archive, continuously broadening the agent’s capabilities. This agent was tested in the Craftax environment, demonstrating an ability to solve increasingly long-horizon goals. The study focused on autonomously expanding and refining a hierarchical skill archive, structured as a directed graph of executable reward functions coded in Python.
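The training-side pattern, sampling a goal and its reward function from the growing archive, can be sketched briefly. The agent-update and rollout logic are placeholders (real training would involve an RL update); only the sampling structure mirrors the description above.

```python
# Rough sketch of one training step for a single goal-conditioned
# agent driven by the skill archive. agent_update and env_rollout
# are hypothetical stand-ins for the learner and the environment.
import random

def training_step(archive: dict, agent_update, env_rollout):
    goal = random.choice(list(archive))      # sample a discovered skill as the goal
    reward_fn = archive[goal]                # its executable reward function
    trajectory = env_rollout(goal)           # act conditioned on that goal
    rewards = [reward_fn(prev, cur)
               for prev, cur in zip(trajectory, trajectory[1:])]
    agent_update(goal, trajectory, rewards)  # e.g. one policy-gradient step
    return sum(rewards)
```

Because new skills keep entering the archive, the goal distribution the agent trains on expands over time.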

Taken together, these components set CODE-SHARP apart from existing automated reward-design methods, which can only refine pre-defined skills rather than discover entirely new ones. The two Foundation Model-driven loops, discovery (proposal, implementation, and judging) and refinement (mutation with direct environmental evaluation), grow the hierarchical archive of executable Python reward functions from which the goal-conditioned agent learns.

A high-level policy planner, also based on a Foundation Model, then composes the discovered skills into policies capable of solving complex tasks, exceeding the performance of both pre-trained agents and task-specific expert policies by over 134% on average.
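A minimal sketch of such an FM-based planner loop might look like the following; `fm_plan` and `run_skill` are hypothetical stand-ins for the Foundation Model call and the goal-conditioned agent's execution, not the paper's actual interfaces.

```python
# Hypothetical sketch of a high-level planner that chains discovered
# skills toward a long-horizon goal. fm_plan proposes an ordered list
# of skill names; run_skill hands control to the goal-conditioned agent.

def execute_plan(goal, archive, fm_plan, run_skill):
    plan = fm_plan(goal, list(archive))   # FM proposes a skill sequence
    achieved = []
    for skill in plan:
        if skill not in archive:
            continue                      # ignore skills the FM hallucinated
        if run_skill(skill):              # agent attempts the sub-goal
            achieved.append(skill)
        else:
            break                         # a full system would replan here
    return achieved
```

Filtering the plan against the archive guards against the FM naming skills that were never discovered.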

The system also exhibits a capacity for continuous learning, generating increasingly complex skills and integrating them into improved policies over time. A primary limitation of the current implementation is its dependence on environments defined in code, restricting its immediate application to real-world scenarios like robotics.

Future research will focus on extending CODE-SHARP to environments not defined in code, potentially through learned reward models or the incorporation of natural-language feedback. Despite this constraint, the development of CODE-SHARP represents a substantial advancement towards creating autonomous, open-ended agents capable of tackling increasingly complex goals without requiring human-defined rewards, thereby contributing to the broader pursuit of general artificial intelligence.

👉 More information
🗞 CODE-SHARP: Continuous Open-ended Discovery and Evolution of Skills as Hierarchical Reward Programs
🧠 ArXiv: https://arxiv.org/abs/2602.10085

Rohail T.

I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.

Latest Posts by Rohail T.:

AI Swiftly Answers Questions by Focusing on Key Areas (February 27, 2026)

Machine Learning Sorts Quantum States with High Accuracy (February 27, 2026)

Framework Improves Code Testing with Scenario Planning (February 27, 2026)