Vision-language models increasingly tackle complex reasoning tasks, but their improvement often depends on expensive human-created labels or task-specific rules. Yicheng He, Chengsong Huang, Zongxia Li, and Jiaxin Huang from Washington University in St. Louis, along with Yonghui Yang from the National University of Singapore, present VisPlay, a new framework that allows these models to evolve their reasoning abilities independently, using only unlabeled image data. VisPlay sets up a loop in which one part of the model generates challenging visual questions while another part attempts to answer them, both learning and improving through this interaction. This self-evolving approach consistently enhances visual reasoning, improves compositional generalization, and reduces hallucinations across multiple benchmarks, offering a scalable path toward more capable and autonomous multimodal systems.
VisPlay Training Data and Benchmarks
This document details the training datasets, model configurations, and prompt templates used in the VisPlay research, which explores a self-evolving vision-language model. The primary image dataset, Vision-47K, comprises 47,000 web-sourced images spanning charts, medical images, educational content, driving scenes, and miscellaneous images, all standardized to 224×224 resolution. The model's performance was evaluated on established benchmarks: MM-Vet (general visual understanding), MMMU (cross-modal reasoning), RealWorldQA (spatial reasoning), VisNumBench (visual number sense), MathVerse and MATH-Vision (diagram-centric mathematical problems), and HallusionBench (visual and language hallucinations).
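The source does not spell out the preprocessing pipeline; the following is a minimal sketch of the 224×224 standardization step, assuming Pillow is available. The `standardize` helper and the resampling filter are illustrative assumptions, not details from the paper.

```python
from pathlib import Path

from PIL import Image

TARGET_SIZE = (224, 224)  # resolution Vision-47K images are standardized to

def standardize(src: Path, dst_dir: Path) -> Path:
    """Resize one web-sourced image to the shared 224x224 training resolution."""
    img = Image.open(src).convert("RGB")  # normalize mode across web images
    img = img.resize(TARGET_SIZE, Image.Resampling.BICUBIC)  # filter is assumed
    dst = dst_dir / src.name
    img.save(dst)
    return dst
```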
Self-Evolving Agents Enhance Vision-Language Reasoning
The study introduces VisPlay, a self-evolving reinforcement learning framework that enhances the reasoning capabilities of vision-language models without relying on human-annotated data. The system operates as a closed loop between two agents, an Image-Conditioned Questioner and a Multimodal Reasoner, both initialized from a shared pretrained base model. The process begins with the Questioner receiving an image and generating a visual query, which is then presented to the Reasoner alongside the original image to elicit a response. Both agents are iteratively refined through this continuous interaction: the Questioner learns to formulate increasingly challenging questions, and the Reasoner strives to solve them.
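A minimal sketch of this closed loop, assuming hypothetical `questioner` and `reasoner` objects with a `generate` method, plus two update callbacks standing in for the reinforcement learning steps described below; none of these interfaces come from the paper itself.

```python
def self_play_iteration(images, questioner, reasoner,
                        update_questioner, update_reasoner, num_samples=8):
    """One pass over unlabeled images: ask, answer, and refine both roles."""
    for image in images:
        # The Questioner conditions only on the image; no human labels are used.
        question = questioner.generate(image)

        # The Reasoner samples several candidate answers to the same query.
        answers = [reasoner.generate(image, question) for _ in range(num_samples)]

        # Both roles are refined from the answer distribution alone
        # (majority-vote pseudo-label and confidence; see the next sketch).
        update_questioner(image, question, answers)
        update_reasoner(image, question, answers)
```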
To drive this self-improvement, the researchers developed a reward system for the Questioner built on pseudo-label generation. Since ground-truth answers are unavailable, the Reasoner samples multiple responses to each question, and majority voting over those responses determines the most likely answer, which becomes the pseudo-label. A confidence score quantifies the Reasoner's certainty in that label, serving as a proxy for question difficulty and encouraging the Questioner to generate probing questions. The Questioner's training also incorporates a diversity regularization term that prevents convergence on a narrow set of questions. Both roles are optimized with Group Relative Policy Optimization (GRPO), a reinforcement learning algorithm that normalizes rewards within each group of sampled responses to compute response-level advantages and updates the policy through a clipped surrogate objective, regularized to constrain policy drift.
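A minimal sketch of these mechanics, assuming string-valued answers and scalar per-response rewards. The diversity regularizer and the drift-constraining regularization are omitted for brevity, `eps` is a conventional PPO-style default rather than a value from the paper, and all function names are illustrative.

```python
from collections import Counter

def pseudo_label(answers: list[str]) -> tuple[str, float]:
    """Majority vote over sampled answers; the vote share serves as confidence.

    Low confidence marks a hard question, the signal that rewards the
    Questioner for probing queries (the paper's exact reward shaping may differ).
    """
    counts = Counter(answers)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(answers)

def grpo_advantages(rewards: list[float]) -> list[float]:
    """Group-relative advantages: z-score rewards within one response group."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std or 1.0) for r in rewards]  # guard zero variance

def clipped_surrogate(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """PPO-style clipped objective term applied per response."""
    clipped = max(min(ratio, 1.0 + eps), 1.0 - eps)
    return min(ratio * advantage, clipped * advantage)
```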
VisPlay Evolves Reasoning in Vision-Language Models
The research introduces VisPlay, a self-evolving reinforcement learning framework that enhances the reasoning abilities of Vision-Language Models (VLMs) using unlabeled image data. The system operates by establishing two interacting roles: an Image-Conditioned Questioner, which generates visual questions, and a Multimodal Reasoner, which provides answers. These roles are jointly trained using Group Relative Policy Optimization, a method that balances question complexity with answer quality. Experiments demonstrate that VisPlay consistently improves performance across various benchmarks, including MM-Vet, MMMU, and HallusionBench.
Researchers evaluated VisPlay with three VLM backbones, Qwen2.5-VL-3B-Instruct, Qwen2.5-VL-7B-Instruct, and MiMo-VL-7B-SFT, and observed consistent gains across multiple iterations. For Qwen2.5-VL-3B, average accuracy increased from 30.61 to 44.16 after the first iteration, reaching 47.27 after the third. Similar improvements were seen with Qwen2.5-VL-7B, progressing from 40.41 to 48.61, and MiMo-VL-7B, increasing from 43.56 to 45.69. These results demonstrate the scalability and robustness of the framework across different model sizes and architectures.
Further analysis reveals that VisPlay enhances performance across diverse task types, including general visual understanding, multimodal mathematical reasoning, and visual hallucination detection. Notably, the HallusionBench score for Qwen2.5-VL-3B increased from 32.81 to 94.95 after the second iteration, indicating a substantial improvement in factual grounding.
The research also highlights the iterative co-evolution between the Questioner and Reasoner: increasingly challenging questions drive the Reasoner to learn more effectively, resulting in sustained performance gains. The team measured pseudo-label accuracy at 72.0, 65.0, and 61.0 for the first, second, and third iterations, respectively; the gradual decline tracks the rising difficulty of the generated questions, while the absolute values support the quality of the self-generated training data.
Self-Evolving Reasoning in Vision-Language Models
VisPlay represents a significant advance in vision-language model (VLM) development, demonstrating a self-evolving reinforcement learning framework capable of autonomously improving reasoning abilities from unlabeled image data. The team decomposed a VLM into two interacting roles, an Image-Conditioned Questioner and a Multimodal Reasoner, and jointly trained them with Group Relative Policy Optimization (GRPO). This approach balances the complexity of generated questions with the quality of the resulting answers, yielding improvements in visual reasoning, compositional generalization, and hallucination reduction across multiple benchmarks. Experiments consistently demonstrate gains across eight benchmarks, including MM-Vet and MMMU, showcasing the scalability of VisPlay when applied to models such as Qwen2.5-VL and MiMo-VL. The researchers acknowledge limitations, including the current focus on specific VLM families and the absence of a definitive verification method for the self-generated training data. Future work will extend the framework to significantly larger models and develop more robust automated methods to ensure data faithfulness and prevent the accumulation of errors. This research suggests a promising path toward truly autonomous vision-language systems capable of continual self-improvement and adaptation.
👉 More information
🗞 VisPlay: Self-Evolving Vision-Language Models from Images
🧠 ArXiv: https://arxiv.org/abs/2511.15661
