Document parsing, the extraction of structured information from complex scientific and technical documents, remains a significant challenge for automated systems. Xi Fang, Haoyi Tao, and Shuwen Yang, alongside colleagues, have developed Uni-Parser, a high-throughput engine designed to accurately process scientific literature and patents. Unlike traditional pipeline-based systems, Uni-Parser uses a modular, multi-expert architecture that preserves relationships between text, equations, figures, and other elements within a document. The system handles up to twenty PDF pages per second, which makes large-scale data extraction and curation practical for applications in artificial intelligence and scientific discovery.
Multimodal Science Analysis with Large Language Models
This technical report introduces Uni-Smart, a unified platform developed by DP Technology for comprehensive analysis of scientific documents and data. Designed as a universal research assistant, Uni-Smart integrates multimodal analysis with large language models (LLMs) to support a wide range of scientific workflows.
The system processes diverse data modalities, including text, tables, chemical structure images, graphs, and micrographs. Uni-Smart incorporates advanced LLMs such as GPT-5, Gemini 2.5 Pro, and Qwen2.5-VL to enable deep understanding, reasoning, and cross-modal inference. Its modular architecture includes MolParser for molecular structure recognition from images, table extraction tools, micrograph segmentation and analysis, patent analysis for infringement assessment, structure–activity relationship (SAR) extraction, scientific literature comprehension, poster automation, and end-to-end orchestration of data and paper workflows.
Built on the Dataflow framework, Uni-Smart emphasizes data-centric AI and scalable workflow management. The system is trained and validated on large, high-quality datasets, including ChemPile, UniEM-3M, and FinePDFs, ensuring robust performance across scientific domains. By unifying multimodal analysis, LLM reasoning, and automated workflows within a single platform, Uni-Smart represents a significant advance toward accelerating scientific discovery and improving the efficiency of complex research processes.
Uni-Parser, a Fast Modular Document Engine
The team developed Uni-Parser, a document parsing engine designed for the demands of scientific literature and patents, targeting both high throughput and robust accuracy. Unlike traditional pipeline-based systems, it employs a modular, loosely coupled multi-expert architecture that preserves fine-grained alignments between document elements, including text, equations, tables, figures, and chemical structures. The design can also be extended to accommodate emerging data modalities. The system reaches up to 20 PDF pages per second when deployed on a cluster of eight NVIDIA RTX 4090D GPUs, a notable gain in parsing speed and cost-efficiency for large-scale document collections.
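To put the reported figure in perspective, the short calculation below converts the peak rate into per-GPU and per-day numbers. It is only back-of-the-envelope arithmetic: it assumes the peak rate is sustained and that throughput divides evenly across the eight GPUs, neither of which the report summary states. The function names are illustrative.

```python
# Reported peak figures from the article; everything derived from them is an estimate.
PAGES_PER_SECOND = 20          # peak rate reported for the cluster
GPUS = 8                       # NVIDIA RTX 4090D cards in the reported deployment
SECONDS_PER_DAY = 24 * 60 * 60


def pages_per_gpu_second(pages_per_second: float = PAGES_PER_SECOND, gpus: int = GPUS) -> float:
    """Average pages per second per GPU, assuming an even split (an assumption)."""
    return pages_per_second / gpus


def pages_per_day(pages_per_second: float = PAGES_PER_SECOND) -> int:
    """Pages parsed in 24 hours if the peak rate were sustained (an assumption)."""
    return int(pages_per_second * SECONDS_PER_DAY)


def days_for_corpus(total_pages: int, pages_per_second: float = PAGES_PER_SECOND) -> float:
    """Days one cluster would need for a corpus of the given size at the peak rate."""
    return total_pages / (pages_per_second * SECONDS_PER_DAY)


if __name__ == "__main__":
    print(pages_per_gpu_second())            # 2.5
    print(pages_per_day())                   # 1728000
    print(round(days_for_corpus(10**9), 1))  # 578.7
```

Under these assumptions a single such cluster would need roughly 579 days for one billion pages, so corpora in the billions would presumably require many clusters running in parallel; the summary does not give those deployment details.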
The research team engineered a distributed microservice design with dynamic GPU load balancing to maximize parsing throughput and enable real-time processing at the scale of billions of document pages. They also developed a suite of domain-specialized, lightweight expert models, each optimized for a specific modality. To handle the complex layouts common in scientific publications, they introduced a new layout analysis and reading order algorithm tailored to dense, irregular, and domain-specific page structures. The team validated the system on extensive datasets of scientific literature and patents and reports state-of-the-art accuracy across all supported modalities. By transforming unstructured PDFs into clean, machine-actionable representations, the work provides a scalable and extensible foundation for structured document understanding, knowledge extraction, and AI4Science applications.
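The idea of dynamic load balancing can be illustrated with a least-loaded dispatcher: each incoming job goes to the worker with the fewest pending pages, and finished work is subtracted again. This is a minimal sketch of that general technique, not Uni-Parser's actual scheduler, which the summary does not describe; the class name, worker names, and page-count load metric are all hypothetical.

```python
class LeastLoadedBalancer:
    """Toy dispatcher that sends each job to the worker with the least pending work.

    Hypothetical illustration only; the real system's scheduling policy is not
    described in the article.
    """

    def __init__(self, worker_names):
        # Pending page count per worker; dict order is used to break ties.
        self.pending = {name: 0 for name in worker_names}

    def assign(self, job_pages: int) -> str:
        """Pick the least-loaded worker and add the job's pages to its queue."""
        name = min(self.pending, key=self.pending.get)
        self.pending[name] += job_pages
        return name

    def complete(self, name: str, job_pages: int) -> None:
        """Record that a worker finished some pages, freeing capacity."""
        self.pending[name] = max(0, self.pending[name] - job_pages)


if __name__ == "__main__":
    balancer = LeastLoadedBalancer(["gpu0", "gpu1", "gpu2"])
    for pages in [30, 10, 10, 5, 12]:
        print(pages, "->", balancer.assign(pages))
    print(balancer.pending)
```

A production scheduler would also have to account for per-modality model placement, GPU memory, and failures, none of which are modeled here.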
Uni-Parser Achieves Rapid Scientific Document Processing
Uni-Parser is designed for the complexities of scientific literature and patents. It achieves high throughput and accuracy through a modular architecture that preserves relationships between document elements, including chemical structures, while remaining adaptable to new data types. This enables efficient processing at scale, up to 20 PDF pages per second on readily available GPU hardware, and supports downstream applications such as data extraction, corpus creation for AI models, and scientific database construction. The system incorporates a novel group-based layout analysis that represents each page as a hierarchy of semantic elements, and it processes ten distinct content types, including text, tables, chemical structures, and charts, in parallel.
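The multi-expert idea, where each detected element is routed to a parser specialized for its content type and the experts run in parallel, can be sketched as follows. This is an assumption-laden toy: the three stand-in "experts" are trivial string functions rather than models, the `EXPERTS` registry and `parse_page` function are hypothetical names, and a thread pool stands in for what the article describes as distributed microservices.

```python
from concurrent.futures import ThreadPoolExecutor


# Stand-in "experts": in a real system each would be a specialized model or service.
def parse_text(payload: str) -> str:
    return payload.strip()


def parse_table(payload: str) -> list:
    return [row.split("|") for row in payload.splitlines()]


def parse_formula(payload: str) -> str:
    return f"${payload}$"


# Hypothetical registry mapping a content type to its expert.
EXPERTS = {"text": parse_text, "table": parse_table, "formula": parse_formula}


def parse_page(elements, max_workers: int = 4) -> list:
    """Route (kind, payload) pairs to their experts concurrently.

    Results come back in the same order as the input, so the reading order
    produced by layout analysis is preserved.
    """

    def run(element):
        kind, payload = element
        expert = EXPERTS.get(kind, parse_text)  # unknown kinds fall back to plain text
        return {"kind": kind, "content": expert(payload)}

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run, elements))


if __name__ == "__main__":
    page = [("text", "  Abstract  "), ("table", "a|b\n1|2"), ("formula", "E=mc^2")]
    for item in parse_page(page):
        print(item)
```

Keeping the experts behind a simple registry is what makes such a design easy to extend: adding a new modality means registering one more function, without touching the dispatch logic.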
Experiments demonstrate the effectiveness of Uni-Parser’s layout detection model, which was trained on a large-scale in-house dataset of 500,000 pages, 220,000 of them carefully annotated by humans. The dataset covers scientific journals, patents, preprints, and books in 85 languages, and the team found that training on this high-quality real data outperformed approaches based on synthetic datasets. The system’s ability to identify and group semantic elements, such as pairing figures with captions and tables with titles, is critical to its performance. Uni-Parser’s modular architecture also allows flexible output formatting, including plain text, Markdown, and HTML, and supports semantic chunking to improve coherence for downstream tasks like retrieval-augmented generation. Layout analysis uses a two-layer hierarchical structure that aggregates related elements into coherent groups while still allowing nested semantic content to be recovered through post-processing.
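A two-layer structure of this kind, with elements on the lower layer and groups on the upper one, lends itself to group-aware chunking: if chunks are built from whole groups, a figure is never separated from its caption. The sketch below illustrates that idea only. The pairing rules, element kinds, and character-based size limit are assumptions for the example; the article does not describe how Uni-Parser forms groups or chunks.

```python
from dataclasses import dataclass, field


@dataclass
class Element:
    kind: str  # e.g. "paragraph", "figure", "caption", "table_title", "table"
    text: str


@dataclass
class Group:
    kind: str
    members: list = field(default_factory=list)


# Hypothetical pairing rules: (first kind, next kind) -> group kind.
PAIR_RULES = {
    ("figure", "caption"): "figure_group",
    ("table_title", "table"): "table_group",
}


def group_elements(elements):
    """Merge adjacent elements that match a pairing rule; others become singleton groups."""
    groups, i = [], 0
    while i < len(elements):
        current = elements[i]
        nxt = elements[i + 1] if i + 1 < len(elements) else None
        group_kind = PAIR_RULES.get((current.kind, nxt.kind)) if nxt else None
        if group_kind:
            groups.append(Group(group_kind, [current, nxt]))
            i += 2
        else:
            groups.append(Group(current.kind, [current]))
            i += 1
    return groups


def render(group):
    """Render a group as text, one member per line."""
    return "\n".join(member.text for member in group.members)


def chunk_groups(groups, max_chars: int = 200):
    """Pack whole groups into chunks; a group is never split across chunks.

    A single group longer than max_chars becomes its own oversized chunk.
    """
    chunks, current = [], ""
    for group in groups:
        block = render(group)
        if current and len(current) + len(block) + 2 > max_chars:
            chunks.append(current)
            current = block
        else:
            current = f"{current}\n\n{block}" if current else block
    if current:
        chunks.append(current)
    return chunks
```

For example, passing a paragraph, a figure, its caption, a table title, and a table through `group_elements` yields three groups, and `chunk_groups` then keeps each figure or table together with its label regardless of the size limit chosen.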
Limitations and Future Work
The researchers acknowledge limitations in current layout detection models, particularly when applied to document types beyond scientific papers and patents, as well as ongoing challenges in recognizing complex chemical structures and parsing charts within scientific literature. Future work focuses on enhancing individual components of the system, improving generalization across diverse layouts, and developing more robust benchmarks for evaluating parsing performance. The team also intends to release a fully open-source toolkit, Uni-Parser-Tools, to provide wider access to the system and facilitate its application in various scientific domains.
👉 More information
🗞 Uni-Parser Technical Report
🧠 ArXiv: https://arxiv.org/abs/2512.15098
