Automated prompt optimisation for large language models typically depends on labelled data, which limits where it can be applied. Siran Peng from MAIS, CASIA and UCAS, Weisong Zhao from IIE, CAS and UCAS, and Tianyu Fu et al. introduce UPA, an Unsupervised Prompt Agent that navigates and selects prompts without any supervised feedback. The approach uses a tree-based search guided by pairwise comparisons from language models, and decouples exploration from final selection through a two-stage framework based on the Bradley-Terry-Luce model. With consistent improvements over existing techniques across multiple tasks, the work shows that agent-style optimisation can remain effective even without human guidance.
This research addresses the challenge of refining prompts without relying on supervised reward signals, which are often unavailable in practical applications.
The team frames prompt refinement as a sequential decision-making process over a structured prompt space, enabling the use of planning algorithms. UPA realises structured search and selection entirely without supervised feedback, relying instead on fine-grained, order-invariant pairwise comparisons obtained directly from LLMs.
Specifically, UPA iteratively constructs an evolving tree to navigate the prompt space, with each node representing a candidate prompt and each edge a refinement step. Instead of computing absolute rewards, the agent employs a judge LLM to make relative preference assessments between prompts, based on their performance across sampled inputs.
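To make this concrete, here is a minimal Python sketch of an order-invariant pairwise comparison; `call_judge_llm` is a hypothetical placeholder for the actual judge-model query, which the article does not specify. Querying the judge in both presentation orders, as the loop does, cancels position bias:

```python
import random

def call_judge_llm(prompt_a: str, prompt_b: str, task_input: str) -> str:
    """Hypothetical judge query: returns 'A' or 'B' for the preferred prompt.
    A real implementation would send both prompts' outputs on task_input
    to the judge LLM; here a coin flip stands in so the sketch runs."""
    return random.choice(["A", "B"])

def compare_order_invariant(child: str, parent: str, inputs: list[str]) -> tuple[int, int]:
    """Count (wins, losses) for the child prompt against its parent,
    judging each sampled input in both presentation orders."""
    wins = losses = 0
    for x in inputs:
        # Pass 1: child presented as option A.
        if call_judge_llm(child, parent, x) == "A":
            wins += 1
        else:
            losses += 1
        # Pass 2: order swapped; child is now option B.
        if call_judge_llm(parent, child, x) == "B":
            wins += 1
        else:
            losses += 1
    return wins, losses
```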
Recognising that local comparisons lack a consistent global scale, the researchers decoupled systematic prompt exploration from final selection through a two-stage framework grounded in the Bradley-Terry-Luce (BTL) model: the first stage aggregates local comparisons with path-wise Bayesian estimates to filter candidates under uncertainty, and the second runs global tournament-style comparisons to infer latent prompt quality and identify the best prompt.
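For reference, the BTL model assigns each prompt v a latent quality γ_v > 0 and models the probability that v is preferred over u in a single comparison as P(v ≻ u) = γ_v / (γ_v + γ_u); aggregating many such noisy, local preferences is what lets the framework recover a consistent global ranking.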
Experiments across multiple tasks show that UPA consistently outperforms existing prompt optimisation methods, indicating that agent-style optimisation remains effective even in fully unsupervised settings and offering a practical solution where labelled data or task-specific metrics are unavailable. The two-stage framework, combining Bayesian aggregation with tournament selection, addresses the inconsistent global ranking inherent in raw pairwise comparisons, enabling structured exploration precisely where ground-truth labels are missing, a significant advance over previous methods reliant on supervised signals or single-path refinement.
During search, the refinement step along each tree edge is performed by a dedicated optimisation LLM, and the branching structure allows multiple refinement trajectories to be explored in parallel rather than committing to a single path.
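As an illustration only, the sketch below shows one way such a search tree could be represented in Python; `refine_prompt` is a hypothetical stand-in for the optimiser-LLM call, which the article does not specify:

```python
from __future__ import annotations
from dataclasses import dataclass, field

@dataclass
class PromptNode:
    """One node in the search tree: a candidate prompt plus its refinements."""
    prompt: str
    parent: PromptNode | None = None
    children: list[PromptNode] = field(default_factory=list)
    wins: int = 0    # judge preferences won against the parent
    losses: int = 0  # judge preferences lost against the parent

def refine_prompt(prompt: str) -> str:
    """Hypothetical optimiser-LLM call: a real implementation would ask an
    LLM to propose an improved rewrite of the prompt."""
    return prompt + " (refined)"  # placeholder behaviour

def expand(node: PromptNode, branching: int = 3) -> list[PromptNode]:
    """One refinement step: each new edge is a rewrite by the optimiser LLM,
    so sibling children represent parallel refinement trajectories."""
    for _ in range(branching):
        node.children.append(PromptNode(prompt=refine_prompt(node.prompt), parent=node))
    return node.children
```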
UPA performs strongly across diverse reasoning and factual verification benchmarks
UPA reached 52.1% accuracy on the GPQA benchmark, a significant gain delivered entirely without supervision. The team also measured 45.5% accuracy on AGIEval-MATH and 68.2% on LIAR.
Results demonstrate that UPA’s two-stage framework effectively decouples systematic prompt exploration from final selection. Stage I employs path-wise Bayesian filtering, modelling the win probability of a child prompt v over its parent u with a Beta distribution whose parameters are α_{v,u} = α₀ + w_{v,u} and β_{v,u} = β₀ + l_{v,u}, where w_{v,u} and l_{v,u} count wins and losses and the priors α₀ and β₀ are set to 1.
The mean of the quality increment is computed with the digamma function ψ, specifically μ^Δ_{v,u} = ψ(α_{v,u}) − ψ(β_{v,u}), with the corresponding uncertainty obtained via the polygamma function. Path-wise posterior means are aggregated linearly, and uncertainty is accounted for with a Lower Confidence Bound (LCB) controlled by a risk-aversion parameter λ_unc. Analysis shows that the path-wise variance σ²_v diminishes as the sampling budget increases, concentrating the ranking around the true global quality.
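A minimal sketch of this Stage I arithmetic, using SciPy, is shown below. Note one assumption: the variance of the quality increment is taken from the trigamma (first-order polygamma) terms of the Beta posterior, which matches the text’s mention of the polygamma function but is not stated there explicitly:

```python
import numpy as np
from scipy.special import digamma, polygamma

def increment_posterior(wins: int, losses: int, alpha0: float = 1.0, beta0: float = 1.0):
    """Posterior over one parent-to-child quality increment from judge outcomes."""
    alpha = alpha0 + wins
    beta = beta0 + losses
    mean = digamma(alpha) - digamma(beta)            # mu^Delta_{v,u}
    var = polygamma(1, alpha) + polygamma(1, beta)   # trigamma terms (assumption)
    return mean, var

def path_lcb(increments: list[tuple[float, float]], lam_unc: float = 1.0) -> float:
    """Aggregate increments along a root-to-node path linearly, then apply a
    Lower Confidence Bound so uncertain (under-sampled) paths are penalised."""
    means, variances = zip(*increments)
    mu_v = sum(means)          # path-wise posterior mean
    sigma2_v = sum(variances)  # path-wise variance, shrinks with more samples
    return mu_v - lam_unc * np.sqrt(sigma2_v)
```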
Stage II utilises a global tournament-style comparison based on the Bradley-Terry-Luce (BTL) model, maximising the BTL log-likelihood with a Minorization-Maximization (MM) algorithm. The update rule for the quality parameter γ_i at iteration t is

γ_i^{(t+1)} = ( Σ_{j≠i} W_{i,j} ) / ( Σ_{j≠i} N_{i,j} / (γ_i^{(t)} + γ_j^{(t)}) ),

where W_{i,j} counts wins of prompt i over prompt j and N_{i,j} is the total number of comparisons between them. On the WSC dataset UPA attained 82.7% accuracy, while on the BBH-Navigate benchmark it reached 98.0%.
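This is the standard MM iteration for BTL models (Hunter’s algorithm), and a compact NumPy sketch follows; the prompt with the largest fitted γ would then be returned as the final selection:

```python
import numpy as np

def btl_mm(W: np.ndarray, n_iter: int = 100) -> np.ndarray:
    """Fit BTL quality parameters gamma by Minorization-Maximization.
    W[i, j] = number of times prompt i beat prompt j in the tournament."""
    n = W.shape[0]
    N = W + W.T                      # total comparisons per pair
    gamma = np.ones(n)
    for _ in range(n_iter):
        new_gamma = np.empty(n)
        for i in range(n):
            num = W[i].sum() - W[i, i]   # total wins of prompt i
            den = sum(N[i, j] / (gamma[i] + gamma[j]) for j in range(n) if j != i)
            new_gamma[i] = num / den if den > 0 else gamma[i]
        gamma = new_gamma / new_gamma.sum()  # normalise: BTL is scale-invariant
    return gamma

# Hypothetical usage: pick the best of four prompts from a win matrix.
W = np.array([[0, 6, 5, 7], [2, 0, 4, 5], [3, 4, 0, 6], [1, 3, 2, 0]], dtype=float)
best = int(np.argmax(btl_mm(W)))
```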
Measurements confirm that UPA achieved 69.3% accuracy overall, surpassing the 66.3% of the SPO baseline and the 65.0% attained by PromptAgent. Final selection was carried out on an independent selection set Q_sel, kept separate from the inputs Q_sim sampled during the search, to ensure robust quality estimation.
Taken together, the results indicate that agent-based optimisation strategies can remain effective even when supervised feedback is unavailable, offering a valuable approach where obtaining such feedback is impractical or costly. The authors acknowledge that UPA’s performance is influenced by its hyperparameters, which they categorise to balance performance against computational efficiency. Future research could apply UPA to more complex tasks and investigate adapting the framework to different language model architectures.
👉 More information
🗞 UPA: Unsupervised Prompt Agent via Tree-Based Search and Selection
🧠 ArXiv: https://arxiv.org/abs/2601.23273
