CapRL: Reinforcement Learning Stimulates Dense Image Caption Capabilities, Overcoming Limitations of Supervised Fine-Tuning

Image captioning, a crucial task connecting vision and language, currently relies heavily on supervised learning methods that demand large amounts of costly, human-annotated data, often limiting the creativity and adaptability of the resulting models. Long Xing, Xiaoyi Dong, and Yuhang Zang, alongside colleagues, address this challenge by pioneering a reinforcement learning approach to training image captioning models. Their work introduces CapRL, a framework that defines caption quality not by similarity to existing descriptions but by its usefulness: specifically, the ability of a separate language model to accurately answer questions about an image based solely on the generated caption. This method significantly improves performance across multiple benchmarks, achieving results comparable to state-of-the-art models such as Qwen2.5-VL-72B and marking a substantial advancement in open-ended image description.

The fundamental task of bridging the visual and linguistic domains is crucial for pre-training Large Vision-Language Models (LVLMs). Current captioning models often rely on Supervised Fine-Tuning (SFT), a method dependent on limited human-annotated data, which can result in models that memorise specific answers and lack creative description abilities. To address this, researchers apply the Reinforcement Learning with Verifiable Rewards (RLVR) paradigm to open-ended image captioning, overcoming challenges in implementing this approach for complex generation tasks.

Systematic VLM Caption Evaluation Through Constrained Prompts

This collection of prompts and instructions provides a systematic method for evaluating and interacting with large vision-language models (VLMs) such as Qwen2.5, rigorously assessing the quality of generated captions with a focus on accuracy and comprehensive reflection of image content. The prompts are carefully designed to constrain the model’s behaviour, isolating its ability to describe images and minimising external influences. The approach assesses the accuracy of object identification, the comprehensiveness of detail coverage, and the level of descriptive specificity, often assigning a specific role to the model, such as a judge or reward model, to focus its responses and ensure consistency.
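As a concrete illustration of what such a constrained prompt might look like, the sketch below assembles a chat-style request that assigns the VLM the role of a strict caption judge with a fixed rubric. The wording, the 1–5 rubric axes, and the chat-message format are illustrative assumptions, not the authors’ released prompts.

```python
# A minimal sketch of a constrained "caption judge" prompt in chat format.
# The role text and rubric axes are illustrative assumptions, not the
# authors' released prompts.
JUDGE_SYSTEM_PROMPT = (
    "You are a strict judge of image captions. Rate ONLY the caption you are "
    "given; do not describe the image yourself or add new details."
)

JUDGE_USER_TEMPLATE = (
    "Caption to evaluate:\n{caption}\n\n"
    "Score the caption from 1 to 5 on each axis and reply with JSON only:\n"
    '{{"object_accuracy": int, "detail_coverage": int, "specificity": int}}'
)


def build_judge_messages(caption: str) -> list[dict]:
    """Assemble a chat-format request for a VLM acting as a caption judge."""
    return [
        {"role": "system", "content": JUDGE_SYSTEM_PROMPT},
        {"role": "user", "content": JUDGE_USER_TEMPLATE.format(caption=caption)},
    ]
```

Constraining the judge to reply with JSON only is one way to obtain the quantifiable, machine-parseable scores the pipeline depends on.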

These prompts enable the model to evaluate captions generated by another system using a strict scoring system, and to create multiple-choice questions about the image, testing understanding of visual details. This setup could be used to build automated evaluation pipelines, compare the performance of different VLMs, and analyse errors in captioning models, offering potential for reward modelling and dataset creation. The strength of this approach lies in its control, quantifiable metrics, and potential for automation, with future improvements potentially including more granular evaluation criteria and human validation of results.
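The question-generation half of such an automated pipeline could look roughly like the following sketch, where `ask_vlm` is a hypothetical caller-supplied function that sends an image and a prompt to a VLM and returns its text response; the prompt wording and the JSON schema are assumptions for illustration.

```python
import json

# Illustrative sketch of question generation for an automated evaluation
# pipeline. `ask_vlm(image, prompt)` is a hypothetical caller-supplied
# function returning the model's text response.
MCQ_PROMPT = (
    "Write five multiple-choice questions that test fine-grained understanding "
    "of this image. Each question must have exactly one correct option. Reply "
    "with a JSON list of objects: "
    '{"question": str, "options": [str, ...], "answer": str}.'
)


def generate_mcqs(image, ask_vlm) -> list[dict]:
    """Ask a VLM to produce verifiable multiple-choice questions for an image."""
    raw = ask_vlm(image, MCQ_PROMPT)
    try:
        questions = json.loads(raw)
    except json.JSONDecodeError:
        return []  # skip malformed generations rather than crashing the pipeline
    if not isinstance(questions, list):
        return []
    return [q for q in questions if isinstance(q, dict) and "answer" in q]
```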

Caption Quality Measured by Question Answering

Researchers developed CapRL, a new training framework that significantly enhances image captioning performance by redefining caption quality as its utility: a high-quality caption should enable an independent system to accurately answer questions about the corresponding image. This work addresses the limitations of traditional Supervised Fine-Tuning (SFT) methods, which can lead to models that memorise specific answers rather than understanding underlying concepts. CapRL employs a decoupled two-stage pipeline: first, a Large Vision-Language Model (LVLM) generates a caption; then the caption’s quality is scored by how accurately a separate, vision-free Large Language Model (LLM) answers multiple-choice questions about the image using only that caption. Experiments demonstrate that pretraining on the CapRL-5M caption dataset, annotated by CapRL-3B, results in substantial gains across twelve benchmarks.
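A minimal sketch of the second, vision-free stage of that pipeline is shown below: the reward is simply the fraction of multiple-choice questions that a text-only LLM answers correctly when given nothing but the generated caption. The helper `answer_with_llm` and the question format are hypothetical placeholders rather than the paper’s actual implementation.

```python
# Minimal sketch of the vision-free QA reward described above. The helper
# `answer_with_llm(caption, question, options)` is a hypothetical placeholder
# that queries a text-only LLM and returns its chosen option string.
def caption_reward(caption: str, mcqs: list[dict], answer_with_llm) -> float:
    """Fraction of MCQs a text-only LLM answers correctly from the caption alone."""
    if not mcqs:
        return 0.0
    correct = 0
    for q in mcqs:
        predicted = answer_with_llm(caption, q["question"], q["options"])
        correct += int(predicted.strip() == q["answer"].strip())
    return correct / len(mcqs)
```

Because the answering model never sees the image, a caption only earns a high reward by actually carrying the visual information the questions probe, which is what makes the signal verifiable rather than stylistic.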

Within the Prism Framework for caption quality evaluation, CapRL achieves performance comparable to the Qwen2.5-VL-72B model, exceeding the baseline by an average margin of 8.4%. This demonstrates CapRL’s ability to train models to produce more general and accurate image descriptions. The team also designed an objective reward function for image captioning, overcoming challenges associated with reward hacking and unstable training curves, and validating that the approach yields more comprehensive and precise captions.

CapRL Improves Vision-Language Alignment with Rewards

This work introduces CapRL, a new framework that successfully applies reinforcement learning with verifiable rewards to the challenging task of image captioning. By redefining caption quality not as aesthetic appeal but as its usefulness in enabling a vision-free language model to accurately answer questions about an image, the researchers created a robust and objective reward signal for training. Results demonstrate that CapRL encourages models to generate detailed and precise image descriptions, significantly improving alignment between visual and linguistic information during pre-training of large vision-language models. The team demonstrated substantial gains across twelve benchmarks using a dataset annotated with CapRL, achieving performance comparable to a state-of-the-art model while exceeding baseline performance by an average of 8.4%.

This represents a significant step away from traditional supervised fine-tuning, which relies on large amounts of human-annotated data and can lead to models that simply memorise specific answers. Performance improves with additional sampling rounds, while a limited number of rounds can introduce bias into the reward signal; future work will likely focus on refining the reward mechanism and exploring the generalizability of this approach to other open-ended tasks requiring subjective evaluation.
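One simple way to reduce the bias introduced by a small number of answering rounds is to average the QA-based reward over several independent rounds, as in the hedged sketch below; the round count and the `reward_fn` callable are illustrative assumptions, not the paper’s exact procedure.

```python
# Hedged sketch: average a stochastic QA-based reward over several independent
# answering rounds to reduce the bias of any single round. The round count and
# the `reward_fn` callable are illustrative assumptions.
def averaged_reward(caption: str, mcqs: list[dict], reward_fn, rounds: int = 4) -> float:
    """Average reward_fn(caption, mcqs) over multiple stochastic answering rounds."""
    scores = [reward_fn(caption, mcqs) for _ in range(rounds)]
    return sum(scores) / len(scores)
```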

👉 More information
🗞 CapRL: Stimulating Dense Image Caption Capabilities via Reinforcement Learning
🧠 ArXiv: https://arxiv.org/abs/2509.22647

Rohail T.

I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.
