Researchers are tackling the challenge of efficiently extracting information from complex documents with a new model called Youtu-Parsing. Kun Yin, Yunfei Wu, and Bing Liu from TencentCloudADP, alongside colleagues Cai, Li, and Chen, present a high-performance system capable of understanding and structuring diverse document layouts. This innovative approach utilises a dynamic visual encoder and a prompt-guided language model, achieving a remarkable 5-11x speedup over conventional methods through a novel high-parallelism decoding strategy. The result is a significant leap forward for applications requiring large-scale document intelligence, backed by state-of-the-art results on OmniDocBench and olmOCR-bench.
ViT and LLM for Document Understanding are showing promise
Scientists have unveiled Youtu-Parsing, a groundbreaking document parsing model engineered for high-performance content extraction from complex documents. The research team achieved a significant leap forward by developing an efficient and versatile architecture that combines a dynamic-resolution visual encoder, a native Vision Transformer (ViT), with a prompt-guided Youtu-LLM-2B language model for both layout analysis and region-prompted decoding. This innovative approach leverages a decoupled and feature-reusable framework, enabling the concurrent processing of document elements and dramatically accelerating the parsing process. Experiments demonstrate that Youtu-Parsing can accurately identify and segment structural components such as text blocks and hierarchical structures within diverse document types.
The core innovation lies in a high-parallelism decoding strategy comprising two key components: token parallelism and query parallelism. Token parallelism generates up to 64 candidate tokens per inference step, validated by a verification mechanism, resulting in a 5-11x speedup compared to traditional autoregressive decoding, a particularly advantageous feature for structured data like tables. Complementing this, the query parallelism strategy simultaneously predicts content for up to five bounding boxes, providing an additional 2x acceleration without compromising output quality. This dual-track parallelization paradigm exploits the deterministic nature of document parsing, where tokens are visually grounded and spatial dependencies are limited, to maximize computational efficiency.
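To make the token-parallelism idea concrete, the following is a minimal sketch of a draft-and-verify loop in the spirit of speculative decoding; the `ParallelDecoder` interface with its `draft` and `verify` methods is a stand-in of our own, not the authors' released API.

```python
from typing import List, Protocol

class ParallelDecoder(Protocol):
    # Hypothetical interface; the paper does not publish this API.
    def draft(self, prefix: List[int], k: int) -> List[int]: ...
    def verify(self, prefix: List[int], candidates: List[int]) -> List[int]: ...

K = 64  # candidate tokens drafted per inference step, per the paper

def decode_region(model: ParallelDecoder, prompt_ids: List[int],
                  max_len: int = 1024, eos_id: int = 2) -> List[int]:
    """Draft-and-verify loop whose output matches plain greedy decoding."""
    out, done = list(prompt_ids), False
    while not done and len(out) < max_len:
        cands = model.draft(out, k=K)       # up to K speculative tokens at once
        truths = model.verify(out, cands)   # greedy tokens, one verification pass
        for cand, truth in zip(cands, truths):
            out.append(truth)               # every kept token is verified
            if truth == eos_id:
                done = True
                break
            if cand != truth:               # first mismatch ends this step
                break
    return out
```

Because only verified tokens are ever emitted, each step advances by however many drafts the verifier agrees with, which is why uniform, highly structured output such as table markup accepts long runs and benefits most.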
Youtu-Parsing’s robust performance extends to challenging document characteristics, including rare characters, multilingual text, and even handwritten content. Extensive evaluations on both the OmniDocBench and olmOCR-bench benchmarks confirm that the model achieves state-of-the-art (SOTA) results, surpassing both general-purpose vision-language models and specialized domain models in key areas like text, formula, table, and reading order recognition. Figure 1 visually demonstrates Youtu-Parsing’s superior performance, establishing new benchmarks across multiple evaluation tasks. The model’s ability to handle a diverse range of document elements, as detailed in Table 1, underscores its versatility and broad applicability.
This breakthrough establishes a new standard for document intelligence applications, offering significant experimental value and practical utility for large-scale data processing. By decoupling feature extraction from layout analysis and decoding, the researchers have created a modular system that facilitates targeted optimization and reduces error propagation. The combination of ViT for visual feature extraction and Youtu-LLM-2B for language processing provides a powerful synergy, enabling accurate and efficient parsing of complex documents. The publicly available code, model, and demo, accessible via GitHub and Hugging Face, further promote accessibility and encourage wider adoption of this transformative technology.
Visual Feature Extraction and Region Decoding are key
Scientists developed Youtu-Parsing, a novel document parsing framework engineered for high-performance content extraction from complex documents. The research team decoupled the parsing task into three synergistic stages: shared visual feature extraction, layout analysis, and region-prompted decoding, enabling modular training and targeted optimisation. Leveraging NaViT, the shared visual feature extraction module generates a unified feature map serving as input for subsequent operations. This innovative approach bypasses error accumulation common in traditional pipelines while facilitating efficient processing of diverse document types.
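A minimal sketch of how such a decoupled three-stage pipeline could be wired together, with trivial stubs standing in for the real NaViT encoder and Youtu-LLM-2B modules; every name and return type here is an illustrative assumption.

```python
from typing import Any, List, Tuple

BBox = Tuple[int, int, int, int]  # x0, y0, x1, y1

def encode_page(page_image: Any) -> Any:
    # Stage 1 stub: dynamic-resolution (NaViT-style) feature extraction,
    # run exactly once per page.
    return {"features": page_image}

def analyze_layout(feature_map: Any) -> List[Tuple[str, BBox]]:
    # Stage 2 stub: predict element types and bounding boxes.
    return [("text", (0, 0, 100, 20)), ("table", (0, 30, 100, 90))]

def read_region(feature_map: Any, bbox: BBox) -> str:
    # Stage 3 stub: region-prompted decoding of one element's content.
    return f"<content of region {bbox}>"

def parse_document(page_image: Any) -> List[Tuple[str, str]]:
    feature_map = encode_page(page_image)   # shared features, computed once
    regions = analyze_layout(feature_map)
    # Decoding reuses the shared feature map rather than re-encoding each
    # crop, which is what avoids pipeline-style error accumulation.
    return [(kind, read_region(feature_map, bbox)) for kind, bbox in regions]

print(parse_document("page.png"))
```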
Experiments employed a dynamic-resolution visual encoder to extract shared document features, coupled with a prompt-guided Youtu-LLM-2B language model for layout analysis and region-prompted decoding. The study pioneered a high-parallelism decoding strategy comprising token parallelism and query parallelism to accelerate processing. Token parallelism concurrently generates up to 64 candidate tokens per inference step, which are then validated by a verification mechanism ensuring zero degradation in recognition accuracy. This method achieves a 5-11x speedup over traditional autoregressive decoding, particularly benefiting highly structured documents like tables.
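To see why the verification mechanism guarantees zero degradation, here is a toy check reusing the `decode_region` loop sketched earlier: because every emitted token comes from the verifier, the output equals plain greedy decoding no matter how unreliable the drafts are. The `ToyModel` and its fixed target sequence are purely illustrative.

```python
import random
# Assumes decode_region (and K) from the earlier token-parallel sketch.

TARGET = [5, 9, 9, 3, 9, 9, 9, 7, 2]  # pretend greedy continuation; 2 = EOS

class ToyModel:
    def draft(self, prefix, k):
        i = len(prefix)
        # Mostly-correct guesses with occasional random mistakes.
        return [t if random.random() < 0.8 else random.randint(0, 9)
                for t in TARGET[i:i + k]]
    def verify(self, prefix, candidates):
        i = len(prefix)
        return TARGET[i:i + len(candidates)]  # the true greedy tokens

assert decode_region(ToyModel(), [], eos_id=2) == TARGET  # always holds
```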
To further enhance efficiency, the team implemented query parallelism, enabling simultaneous content prediction for up to five bounding boxes. This technique exploits the shared visual features, delivering an additional 2x acceleration while maintaining equivalent output quality to standard decoding methods. The system handles diverse document elements and hierarchical structures, and demonstrates robustness with rare characters, multilingual text, and handwriting. Researchers harnessed a Hybrid Masked Training (HMT) strategy during fine-tuning, augmenting 80% of training samples with random masks to encourage multi-token look-ahead dependencies. The remaining 20% remained unmasked to preserve standard autoregressive performance, resulting in an empirical speedup consistent with the theoretical acceleration of S ≈ k/2, where k represents the average number of accepted tokens per iteration. Extensive evaluations on OmniDocBench and olmOCR-bench benchmarks demonstrate that Youtu-Parsing achieves state-of-the-art performance, showcasing its significant experimental value and practical utility for large-scale document intelligence applications.
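A plausible sketch of the HMT augmentation under stated assumptions: the 80/20 sample split and its motivation come from the text, while the per-token mask rate and the mask-token id are placeholders of our own.

```python
import random

MASK_ID = -1  # placeholder mask-token id; the real id is tokenizer-specific

def hmt_augment(token_ids, sample_mask_frac=0.8, token_mask_rate=0.15):
    """Hybrid Masked Training: randomly mask tokens in 80% of samples to
    encourage multi-token look-ahead; leave 20% untouched to preserve
    standard autoregressive quality. token_mask_rate is assumed."""
    if random.random() >= sample_mask_frac:
        return list(token_ids)            # the unmasked 20%
    return [MASK_ID if random.random() < token_mask_rate else t
            for t in token_ids]

# The paper's expected speedup: S ≈ k / 2, with k the average number of
# accepted tokens per iteration, e.g. k = 10 gives roughly 5x.
k = 10
print(f"S ≈ {k / 2:.1f}x")
```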
Youtu-Parsing delivers rapid, parallel document content extraction
Scientists have developed Youtu-Parsing, a novel document parsing model achieving high-performance content extraction from complex documents. The architecture utilizes a dynamic-resolution visual encoder, a native ViT, to extract shared document features, coupled with a prompt-guided Youtu-LLM-2B language model for both layout analysis and region-prompted decoding. This decoupled framework introduces a high-parallelism decoding strategy, significantly accelerating processing speeds and improving accuracy. Experiments revealed a 5-11x speedup over traditional autoregressive decoding, achieved through a technique called Token Parallelism, which concurrently generates up to 64 candidate tokens per inference step.
A verification mechanism then validates these candidates, ensuring mathematical equivalence to standard decoding methods, a crucial achievement for maintaining data integrity. Furthermore, the Query Parallelism strategy enables simultaneous content prediction for up to five bounding boxes, delivering an additional 2x acceleration without compromising output quality. These parallel processing capabilities demonstrate a substantial enhancement in overall throughput compared to conventional approaches. Youtu-Parsing supports a comprehensive array of document elements, including text and hierarchical structures, enabling the interpretation of diverse real-world documents such as academic publications and legal filings.
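As a rough illustration, query parallelism can be pictured as chunking the detected boxes into groups of five and decoding each group in one batched pass over the shared features; `decode_batch` is a hypothetical stand-in, not the model's actual interface.

```python
from typing import Any, Callable, List, Tuple

MAX_QUERIES = 5  # simultaneous bounding-box queries, per the paper

def decode_all_regions(feature_map: Any,
                       boxes: List[Tuple[int, int, int, int]],
                       decode_batch: Callable) -> List[str]:
    results = []
    for i in range(0, len(boxes), MAX_QUERIES):
        batch = boxes[i:i + MAX_QUERIES]
        # One batched pass predicts the content of every box in `batch`,
        # which is where the reported ~2x extra acceleration comes from.
        results.extend(decode_batch(feature_map, batch))
    return results

# Usage with a trivial stub:
stub = lambda fm, bs: [f"<content of {b}>" for b in bs]
print(decode_all_regions("features", [(0, 0, 10, 10)] * 12, stub))
```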
The team observed strong robustness in handling rare characters, multilingual text, and even handwritten content, expanding the model’s applicability to a wider range of document types. Extensive evaluations on the OmniDocBench and olmOCR-bench benchmarks demonstrate that Youtu-Parsing achieves state-of-the-art (SOTA) performance in both recognition accuracy and structural integrity. The model’s architecture consists of a three-stage cascaded pipeline: shared visual feature extraction using a NaViT encoder, layout analysis with cross-modal fusion, and region-prompted decoding via customized block queries within the LLM module. A pre-trained 0.4B-parameter vision encoder, equipped with a dynamic resolution preprocessor, extracts high-level visual features from document images, while a two-layer MLP projector aligns visual representations with the LLM’s input space. This innovative design mitigates error propagation and hallucinations common in other multimodal models, ensuring high structural and semantic fidelity, a critical advancement for reliable document intelligence applications.
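The two-layer MLP projector might look like the following PyTorch sketch; the hidden sizes and activation are assumptions, since the text specifies only a two-layer MLP mapping visual features into the LLM's input space.

```python
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    """Two-layer MLP aligning encoder features with the LLM embedding
    space. Dimensions and GELU are assumed, not taken from the paper."""
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 2048):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, visual_tokens: torch.Tensor) -> torch.Tensor:
        # (batch, n_patches, vision_dim) -> (batch, n_patches, llm_dim)
        return self.proj(visual_tokens)

# Example: project a dummy feature map from the vision encoder.
feats = torch.randn(1, 256, 1024)
print(VisionProjector()(feats).shape)  # torch.Size([1, 256, 2048])
```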
Youtu-Parsing delivers speed and accuracy gains
Scientists have developed Youtu-Parsing, a new 2.5-billion-parameter vision-language model for advanced document parsing. The architecture utilises a dynamic-resolution visual encoder and a prompt-guided language model to efficiently extract content and analyse document layouts. A key innovation is a high-parallelism decoding strategy, incorporating both token and query parallelism, which significantly accelerates processing speed. This research demonstrates a substantial 10-20x throughput enhancement while maintaining accuracy equivalent to traditional methods, effectively addressing the trade-off between speed and fidelity.
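As a back-of-envelope check (our arithmetic, not a figure from the paper), the combined throughput gain is roughly the product of the two parallelism tracks reported earlier:

```python
token_gain = (5, 11)   # token-parallelism speedup range
query_gain = 2         # query-parallelism speedup
print([t * query_gain for t in token_gain])  # [10, 22], in line with ~10-20x
```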
Extensive evaluations on benchmarks like OmniDocBench and olmOCR-bench confirm Youtu-Parsing achieves state-of-the-art performance, surpassing existing multimodal models and specialised parsing systems. The model also exhibits robustness in handling diverse document elements, including text, formulas, tables, and seals, as well as challenging conditions like handwritten text and rare characters. The authors acknowledge that while Youtu-Parsing offers a compelling balance between performance and throughput, further research could explore scaling the model to even larger datasets and more complex document structures. Future work might also investigate adapting the parallel decoding strategy to other vision-language tasks beyond document parsing. These findings establish Youtu-Parsing as a robust and scalable foundation for large-scale information extraction and knowledge management applications.
👉 More information
🗞 Youtu-Parsing: Perception, Structuring and Recognition via High-Parallelism Decoding
🧠 ArXiv: https://arxiv.org/abs/2601.20430
