Multimodal large language models increasingly rely on visual tools to solve complex problems, but often invoke these tools unnecessarily, creating inefficiencies and hindering performance. To address this, Chaoyang Wang, Kaituo Feng, and Dongyang Chen, along with colleagues, developed AdaTooler-V, a model that intelligently decides when visual tools genuinely improve reasoning. The team introduces a reinforcement learning algorithm that dynamically adjusts rewards based on the benefit each tool provides, encouraging selective tool use, and supports this approach with two new datasets spanning both images and videos. Experiments across a range of challenging benchmarks demonstrate AdaTooler-V's strong reasoning capabilities, notably achieving higher accuracy than commercial models such as GPT-4o and Gemini 1.5 Pro on a high-resolution visual reasoning task, representing a significant advance in multimodal AI.
Existing open-source models, however, often invoke vision tools even when they are not required, which significantly increases computational overhead and degrades performance. To overcome this, the researchers propose AdaTooler-V, a multimodal large language model that adaptively determines when visual problems truly require external tools. They introduce AT-GRPO, a reinforcement learning algorithm that adjusts reward scales according to the benefit provided by each tool, promoting efficient and selective tool invocation. To support training and evaluation, the team constructed two datasets, AdaTooler-V-CoT-100k and AdaTooler-V-300k, which include single-image, multi-image, and video samples with verifiable rewards.
Training LLMs to Use Visual Tools
Recent work demonstrates that multimodal large language models benefit from using vision tools, but they often invoke these tools unnecessarily, increasing processing demands and reducing performance. The core challenge lies in teaching the model when to use these tools, not simply how to use them. AdaTooler-V addresses this with a reinforcement learning process in which the model is rewarded for tool calls that lead to correct answers. The reward is determined by the change in the model's confidence after using a tool; a positive reward is given when the tool helps the model arrive at the correct solution.
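As a rough sketch of how such a reward could be shaped, the Python snippet below combines a verifiable answer reward with a bonus proportional to the confidence gain produced by a tool call. The function name, the scaling factor, and the exact formula are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch (not the authors' code): a tool-use reward that combines
# a verifiable answer reward with the change in model confidence caused by the
# tool call, so unnecessary tool calls earn less than helpful ones.

def tool_use_reward(answer_correct: bool,
                    used_tool: bool,
                    conf_before: float,
                    conf_after: float,
                    scale: float = 0.5) -> float:
    """Hypothetical reward shaping for adaptive tool use.

    conf_before / conf_after: model confidence in the correct answer before
    and after the tool call (0 to 1). The scaling rule is an assumption."""
    reward = 1.0 if answer_correct else 0.0        # verifiable answer reward
    if used_tool:
        benefit = conf_after - conf_before         # how much the tool actually helped
        reward += scale * benefit                  # bonus for helpful calls, penalty for useless ones
    return reward

# A helpful crop (confidence 0.3 -> 0.9) beats answering without a tool,
# while a redundant call (0.9 -> 0.6) is worth less than no call at all.
print(tool_use_reward(True, True, 0.3, 0.9))   # 1.3
print(tool_use_reward(True, False, 0.0, 0.0))  # 1.0
print(tool_use_reward(True, True, 0.9, 0.6))   # 0.85
```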
The team created AdaTooler-V-300k, a new dataset for training and evaluating this approach, covering a wide range of visual reasoning tasks, including general images, videos, reasoning, counting, and high-resolution images. A specific prompt structure guides the model during training and inference, encouraging it to think step by step and consider tool usage. AdaTooler-V equips the model with external tools, such as image cropping, video frame extraction, and path tracing, to strengthen its reasoning. The focus is on training the model to decide when to use these tools and on rewarding appropriate tool use.
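The paper's exact prompt and tool interface are not reproduced here; the sketch below simply illustrates what a think-then-act prompt plus a small tool registry for cropping and frame selection might look like. The tag names, template wording, and function signatures are hypothetical.

```python
# Hypothetical prompt template and tool registry for a tool-augmented visual
# reasoner; tags, wording, and signatures are illustrative, not the paper's.

from PIL import Image

PROMPT_TEMPLATE = (
    "Reason step by step inside <think>...</think>. Call a tool with "
    "<tool>{{\"name\": ..., \"args\": ...}}</tool> ONLY if the needed detail "
    "cannot be resolved from the current view. Put the final answer inside "
    "<answer>...</answer>.\nQuestion: {question}"
)

def crop_image(image: Image.Image, box: tuple[int, int, int, int]) -> Image.Image:
    """Zoom into a region of interest: box is (left, upper, right, lower) in pixels."""
    return image.crop(box)

def extract_frames(frames: list[Image.Image], indices: list[int]) -> list[Image.Image]:
    """Select a subset of already-decoded video frames by index."""
    return [frames[i] for i in indices]

# Tool calls emitted by the model would be routed through a registry like this.
TOOLS = {"crop_image": crop_image, "extract_frames": extract_frames}

print(PROMPT_TEMPLATE.format(question="How many birds are perched on the left branch?"))
```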
Adaptive Tool Use Boosts Reasoning Performance
Recent work demonstrates that multimodal large language models benefit from using vision tools, but these tools are often invoked unnecessarily, increasing computational demands and reducing overall performance. To address this, researchers developed AdaTooler-V, a model that adaptively determines when visual problems genuinely require tool assistance. The team introduced AT-GRPO, a reinforcement learning algorithm that adjusts reward scales based on the benefit each tool provides, encouraging efficient tool use. To support training, they constructed two datasets, AdaTooler-V-CoT-100k and AdaTooler-V-300k, which include single-image, multi-image, and video data with verifiable rewards.
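One plausible reading of "adjusting reward scales by tool benefit" is sketched below as a GRPO-style, group-relative advantage that is amplified when the tool measurably helps. This is an assumption about the general idea, not the published AT-GRPO update.

```python
# Minimal sketch, under stated assumptions, of a GRPO-style advantage whose
# scale grows with the measured benefit of the tool; not the exact AT-GRPO rule.
import numpy as np

def scaled_group_advantages(rewards: np.ndarray,
                            tool_benefit: float,
                            alpha: float = 1.0) -> np.ndarray:
    """Group-relative advantages for one prompt's sampled responses.

    tool_benefit: e.g. accuracy with the tool minus accuracy without it
    (illustrative); only a positive benefit enlarges the learning signal."""
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)   # standard GRPO normalization
    return adv * (1.0 + alpha * max(tool_benefit, 0.0))         # amplify when the tool helps

# Four rollouts for one question, where the tool improved accuracy by 0.25.
print(scaled_group_advantages(np.array([1.0, 0.0, 1.0, 0.0]), tool_benefit=0.25))
```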
Experiments across twelve benchmarks demonstrate AdaTooler-V's strong reasoning capabilities. Notably, the AdaTooler-V-7B model achieved 89.8% accuracy on a high-resolution benchmark, surpassing the commercial models GPT-4o and Gemini 1.5 Pro. This represents a significant advance in high-resolution visual reasoning.
The model also achieved substantial gains on other benchmarks, including a 6 percentage point improvement over a baseline model on MathVista, reaching 74.5%. These results show that controlling the frequency of tool invocation allows the model to focus on essential visual cues. Further analysis confirms the effectiveness of the approach: the AT-GRPO algorithm, combined with supervised fine-tuning, led to substantial performance improvements across multiple benchmarks.
AdaTooler-V achieved 46.7% on VSI-Bench, 54.6% on VideoMMMU, and 68.4% on MVBench using only 32 frames, outperforming existing video-reasoning models. On the challenging Video-Holmes benchmark, the model reached 55.6%, more than doubling the performance of the baseline. These gains suggest that richer contextual cues and temporal information further enhance reasoning capabilities.
Ablation studies confirm the importance of each component of the approach. Removing the supervised fine-tuning stage or disabling tool use consistently led to performance drops across all benchmarks. For example, disabling tool use reduced accuracy on the high-resolution benchmark from 89.8% to 84.4%, and on VSI-Bench from 46.7% to 39.9%. These findings demonstrate that vision tools provide complementary evidence beyond text-based reasoning and are essential for accurate multimodal understanding.

Overall, this work delivers a significant advancement in adaptive tool use for multimodal large language models, enabling more efficient and accurate visual reasoning.
Adaptive Tool Use Improves Visual Reasoning
Researchers have developed AdaTooler-V, a new multimodal large language model that demonstrates improved reasoning capabilities through adaptive tool use. The team addressed a common problem in these models: the tendency to invoke visual tools unnecessarily, which increases processing time and reduces performance. AdaTooler-V incorporates a reinforcement learning algorithm, AT-GRPO, which dynamically adjusts rewards based on the actual benefit a tool provides for each specific problem. This encourages the model to use tools only when they genuinely enhance its ability to solve visual reasoning tasks. To facilitate training and evaluation, the researchers curated two datasets, AdaTooler-V-CoT-100k and AdaTooler-V-300k, encompassing a range of image and video data.
Experiments across twelve benchmarks confirm the effectiveness of this approach, with AdaTooler-V-7B achieving a leading accuracy of 89.8% on a high-resolution benchmark, surpassing the performance of established commercial models. The authors acknowledge that the model’s initial tool-use patterns stem from supervised learning, where it learns to invoke tools based on instructions, and that simpler images may not always require such tools. Future work, they suggest, will build upon this foundation to further refine tool-augmented multimodal large language models.
👉 More information
🗞 AdaTooler-V: Adaptive Tool-Use for Images and Videos
🧠 ArXiv: https://arxiv.org/abs/2512.16918
