Researchers are tackling the persistent problem of long-form video understanding, a key hurdle for current Video Large Language Models (VideoLLMs). Chenglin Li and Qianglong Chen (Zhejiang University) and Feng Han (Fudan University), alongside Yin Xingxi, Yan Gong, and colleagues, present VideoThinker, an agentic VideoLLM that overcomes the limitations of static frame analysis by adaptively exploring crucial video moments. The work is significant because it breaks the circular dependency of needing pre-existing video comprehension to create agentic training data: instead, VideoThinker learns from synthetic tool-interaction trajectories generated in caption space and then grounded back in video. Trained on this uniquely constructed dataset, the model demonstrates substantially improved dynamic reasoning, temporal awareness, and multi-step tool use, ultimately outperforming existing methods on long-video benchmarks.
The research team achieved this breakthrough by training VideoThinker entirely on synthetic tool-interaction trajectories, a unique approach that bypasses the need for pre-existing strong long-form video comprehension capabilities. This innovative method converts videos into rich captions and employs a powerful agentic language model to generate multi-step tool-use sequences within that caption space, effectively creating a large-scale dataset for interleaved video and tool reasoning.
The core innovation lies in grounding these trajectories back to video by replacing captions with corresponding frames, yielding a dataset that doesn’t require the underlying model to initially possess strong long-form video understanding. This synthetic data equips VideoThinker with dynamic reasoning capabilities, allowing it to adaptively explore key moments in videos and utilise multi-step tools effectively. Experiments show that VideoThinker significantly outperforms both caption-only language model agents and strong video model baselines across established long-video benchmarks, demonstrating the effectiveness of this tool-augmented synthetic data and adaptive reasoning. Specifically, the model achieves a +6.8% improvement on MLVU and a +10.6% improvement on LVBench compared to standard VideoLLMs.
VideoThinker leverages two key agentic tools: Temporal Retrieval, which identifies potentially relevant temporal intervals using audio transcripts, scene descriptions, and summaries, and Temporal Zoom, which allows detailed inspection of those intervals through subtitles or frames. Combining these tools with the LLM’s tool-augmented reasoning, the researchers constructed multi-turn tool-interaction trajectories, generated in caption space and then grounded in video as described above, that teach the VideoLLM to actively retrieve and perceive key frames while it reasons. A confidence-gated tool controller was also incorporated, yielding gains of +3.9% and +3.5% over caption-only LLM agents equipped with the same tools.
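The article names the confidence-gated controller but does not spell out its mechanics, so the sketch below is only a rough illustration of the idea: the model keeps calling tools until its answer confidence clears a threshold. The `model.step` interface, the `ToolCall` structure, and the threshold value are assumptions for illustration, not the authors' implementation.

```python
# Illustrative sketch of a confidence-gated tool loop (an interpretation of the
# paper's description, not the authors' code). `model` is assumed to expose a
# step() method returning (answer, confidence, tool_call_or_None).
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class ToolCall:
    name: str              # e.g. "ClipRetrieval", "FrameZoom" (assumed names)
    args: dict[str, Any]   # arguments for the tool

def answer_with_tools(model, question: str, frames, tools: dict[str, Callable],
                      conf_threshold: float = 0.8, max_turns: int = 5):
    """Keep invoking tools until the model is confident enough to answer."""
    context = [("frames", frames), ("question", question)]
    answer = None
    for _ in range(max_turns):
        answer, confidence, tool_call = model.step(context)
        if confidence >= conf_threshold or tool_call is None:
            return answer                          # confident enough: stop exploring
        result = tools[tool_call.name](**tool_call.args)
        context.append((tool_call.name, result))   # feed tool output back into context
    return answer
```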
Experiments employed two complementary families of agentic tools, Temporal Retrieval and Temporal Zoom, to enable adaptive exploration of key video moments. ClipRetrieval, a Temporal Retrieval tool, segments videos into 10-second clips, encodes each with LanguageBind-Video to obtain clip-level embeddings, and retrieves the top-ranked clips by semantic similarity to the input query, returning their temporal intervals. SubtitleRetrieval uses Whisper to transcribe the video’s audio, enabling fine-grained text-level retrieval over automatically generated subtitles and providing a detailed textual proxy for multimodal tool outputs. The agent can also iteratively refine its queries to ClipRetrieval, progressively zooming in on relevant video segments and efficiently narrowing the search space for analysis.
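To make the retrieval step concrete, here is a minimal sketch of a ClipRetrieval-style tool under the details given above (10-second clips, clip-level embeddings from a video encoder such as LanguageBind-Video, ranking by semantic similarity). The embedding pipeline is abstracted away; precomputed vectors and a query embedding in the same space are assumed.

```python
# Hypothetical sketch of a ClipRetrieval-style tool (not the authors' code).
# Assumes clip embeddings were precomputed offline with a video encoder such as
# LanguageBind-Video, and that the query was embedded into the same space.
import numpy as np

CLIP_SECONDS = 10  # the paper segments videos into 10-second clips

def clip_retrieval(query_embedding: np.ndarray,
                   clip_embeddings: np.ndarray,
                   top_k: int = 5) -> list[tuple[float, float, float]]:
    """Return (start_s, end_s, score) for the top-k clips by cosine similarity."""
    q = query_embedding / np.linalg.norm(query_embedding)
    c = clip_embeddings / np.linalg.norm(clip_embeddings, axis=1, keepdims=True)
    scores = c @ q                       # cosine similarity per 10-second clip
    best = np.argsort(-scores)[:top_k]   # highest-scoring clip indices
    return [(float(i * CLIP_SECONDS), float((i + 1) * CLIP_SECONDS), float(scores[i]))
            for i in best]
```

Iterative refinement then amounts to calling this function repeatedly with reformulated query embeddings and narrowing to the returned intervals.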
This modular design, combining coarse-grained clip retrieval with fine-grained subtitle access, allows the agent to adaptively focus on key segments while maintaining global context. Existing models often struggle with temporal localization and information loss in extended videos because they rely on static reasoning over uniformly sampled frames. Training VideoThinker entirely on synthetically generated tool-interaction trajectories sidesteps this limitation: rich captions stand in for the video while a powerful agentic language model executes multi-step tool use in caption space, and the resulting trajectories are then grounded back in video by replacing the captions with the corresponding frames.
Through this process the team constructed a large-scale dataset of interleaved video and tool reasoning, comprising diverse, interpretable, temporally grounded reasoning traces. Data synthesis involved generating video captions with a VideoLLM, then combining them with queries and the tool system prompt to elicit reasoning from the LLM. For each query, five distinct reasoning trajectories were sampled at temperature 0.7, and only those whose predicted answer matched the ground truth were retained. The resulting dataset, D_tool = {(v_i, x_i, r_i, y_i)}_{i=1}^{M}, pairs each video v_i and question x_i with a reasoning trajectory r_i and the ground-truth answer y_i.
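The rejection-sampling loop can be sketched in a few lines, assuming a hypothetical `agent.generate_trajectory` helper that runs the caption-space LLM agent with tool access and returns both the trajectory and its final answer:

```python
# Sketch of the rejection-sampling step described above. The agent interface is
# an assumption for illustration; the numbers (5 samples, temperature 0.7) and
# the keep-if-correct rule come from the paper's data-synthesis description.
def synthesize_trajectories(agent, captions, question, ground_truth,
                            n_samples: int = 5, temperature: float = 0.7):
    """Keep only trajectories whose predicted answer matches the ground truth."""
    kept = []
    for _ in range(n_samples):
        trajectory, predicted = agent.generate_trajectory(
            captions, question, temperature=temperature)
        if predicted == ground_truth:
            kept.append(trajectory)    # contributes a (v, x, r, y) sample to D_tool
    return kept
```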
During data synthesis, CaptionZoom played a crucial role as the sole visual access point, converting frames into temporally grounded captions. To ground the trajectories, the textual outputs of CaptionZoom were then replaced with the corresponding video segments, represented as special tokens, enabling the model to internalize structured reasoning patterns grounded directly in visual representations. In experiments, VideoThinker's adaptive reasoning significantly outperforms both caption-only language model agents and strong video model baselines across long-video benchmarks. The approach delivers dynamic reasoning and adaptive temporal exploration through tools such as ClipRetrieval, which accesses semantically relevant temporal regions by segmenting videos into 10-second clips and encoding them with LanguageBind-Video.
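The grounding step can be pictured as a substitution pass over each accepted trajectory. The special-token names and the trajectory representation below are illustrative assumptions; only the swap-captions-for-frames idea comes from the paper.

```python
# Sketch of the grounding step: CaptionZoom's textual outputs are replaced with
# the corresponding video segments, delimited by special tokens. Token names and
# the (tool_name, output) trajectory format are assumptions for illustration.
VIDEO_START, VIDEO_END = "<|video_start|>", "<|video_end|>"

def ground_trajectory(trajectory_steps, frames_by_step):
    """trajectory_steps: list of (tool_name, text_output) pairs from caption space.
    frames_by_step: maps a step index to the frames for that temporal interval."""
    grounded = []
    for idx, (tool, text) in enumerate(trajectory_steps):
        if tool == "CaptionZoom" and idx in frames_by_step:
            # Swap the caption text for the visual segment; during training the
            # placeholder is expanded into frame tokens for that interval.
            grounded.append((tool, (VIDEO_START, frames_by_step[idx], VIDEO_END)))
        else:
            grounded.append((tool, text))
    return grounded
```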
The team also evaluated SubtitleRetrieval, which employs Whisper to transcribe video audio and retrieve relevant subtitle segments with timestamps, and SubtitleSummary, built on Qwen3-30B, which generates concise, query-focused summaries of the complete subtitle transcript. FrameZoom extracts raw frames within a specified temporal interval, resampling to increase visual density, for example retrieving 8 frames from a 10-second interval within a 32-frame video. Because VideoThinker is trained entirely on synthetic data produced by converting videos into rich captions and letting a powerful language model simulate multi-step tool use in caption space, the approach bypasses the need for pre-existing long-form video comprehension, resolving a common circular dependency in agentic video understanding research.
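A minimal sketch of the FrameZoom resampling idea follows; it only computes evenly spaced timestamps inside the requested interval, which a frame extractor (e.g. ffmpeg or decord) would then turn into raw frames. The function name and defaults mirror the example above but are otherwise illustrative.

```python
# Hypothetical FrameZoom-style resampler: densify coverage of one interval
# relative to the uniform global sampling (e.g. 8 extra frames for a 10-second
# window of a video otherwise represented by 32 frames).
import numpy as np

def frame_zoom(start_s: float, end_s: float, num_frames: int = 8) -> np.ndarray:
    """Return `num_frames` evenly spaced timestamps (seconds) in [start_s, end_s]."""
    return np.linspace(start_s, end_s, num=num_frames)

# Example: frame_zoom(40.0, 50.0) yields 8 timestamps inside a 10-second window.
```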
The resulting large-scale dataset interweaves video and tool reasoning, enabling VideoThinker to demonstrate dynamic reasoning, adaptive temporal exploration, and proficient multi-step tool use, significantly outperforming both caption-only language models and established video model baselines on long-video benchmarks. The authors acknowledge that the model’s performance is dependent on the quality of the synthetic data generation process and the capabilities of the underlying language model. Future research could explore more sophisticated synthetic data generation techniques and investigate the transferability of these agentic capabilities to other video understanding tasks.
👉 More information
🗞 VideoThinker: Building Agentic VideoLLMs with LLM-Guided Tool Reasoning
🧠 ArXiv: https://arxiv.org/abs/2601.15724
