Scientists are tackling the longstanding challenge of Optical Character Recognition (OCR) with a novel approach, as demonstrated by Said Taghadouini, Adrien Cavaillès, and Baptiste Aubertin, all from LightOn. Their research introduces LightOnOCR-2-1B, a 1-billion-parameter end-to-end multilingual model capable of converting document images directly into ordered text, bypassing traditional and often unreliable OCR pipelines. This represents a significant leap forward, achieving state-of-the-art performance on the OlmOCR-Bench benchmark whilst being nine times smaller and considerably faster than previous leading models. Furthermore, the team extends the model’s capabilities to locate embedded images within documents, enhancing its utility and paving the way for more robust document understanding. The team is also releasing both the model checkpoints and a new evaluation dataset to the public.
LightOnOCR-2-1B streamlines document image text conversion with impressive efficiency
Scientists have unveiled LightOnOCR-2-1B, a groundbreaking 1 billion-parameter end-to-end multilingual vision-language model capable of converting document images, such as PDFs, directly into clean, naturally ordered text without relying on traditional, often brittle, OCR pipelines. This innovative approach bypasses the need for multi-stage processing, streamlining document understanding and offering significant advantages in adaptability and efficiency. The team achieved state-of-the-art results on the OlmOCR-Bench benchmark while simultaneously being nine times smaller and substantially faster than previously leading models. Trained on a large-scale, high-quality distillation mix, LightOnOCR-2 demonstrates strong performance across a diverse range of document types, including scans, French documents, and complex scientific PDFs.
Researchers meticulously curated this training data, increasing its size by 2.5x and enhancing coverage of critical document categories to improve model robustness and accuracy. Furthermore, the study introduces a novel approach to image localization within documents, extending the model’s output to predict normalized bounding boxes for embedded images. This capability was achieved through a resume strategy during pretraining and refined using Reinforcement Learning with Verifiable Rewards (RLVR) employing IoU-based rewards, effectively teaching the model to accurately identify and delineate images within the document layout. Experiments show that LightOnOCR-2-1B not only excels in text recognition but also demonstrates a significant advancement in document image localization.
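To make the coordinate format concrete, the following minimal Python sketch maps one of the model’s normalized bounding boxes back to pixel coordinates on a rendered page. The (x0, y0, x1, y1) ordering and the 0-to-1 value range are illustrative assumptions; the study only specifies that the predicted boxes are normalized.

```python
# Minimal sketch: converting a normalized bounding box back to pixel
# coordinates. The (x0, y0, x1, y1) layout and 0-1 normalization are
# assumptions for illustration, not details confirmed by the paper.

def denormalize_bbox(bbox, page_width, page_height):
    """Map a normalized (x0, y0, x1, y1) box onto a page of given pixel size."""
    x0, y0, x1, y1 = bbox
    return (
        round(x0 * page_width),
        round(y0 * page_height),
        round(x1 * page_width),
        round(y1 * page_height),
    )

# Example: a box covering the centre of an A4 page rendered at 1240x1754 px.
print(denormalize_bbox((0.25, 0.4, 0.75, 0.6), 1240, 1754))
# -> (310, 702, 930, 1052)
```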
The researchers introduced LightOnOCR-bbox-bench, a new benchmark specifically designed to evaluate image localization performance in document understanding tasks, providing a rigorous testing ground for this new capability. To further enhance the model’s performance, the team implemented lightweight weight-space techniques, including checkpoint averaging and task-arithmetic merging, allowing for the combination of complementary strengths and a controlled trade-off between OCR quality and bounding box accuracy. This work establishes a new standard in document understanding, offering a compact, efficient, and highly accurate solution for converting document images into usable text and identifying embedded images. The release of model checkpoints under the Apache 2.0 license, alongside the publicly available dataset and LightOnOCR-bbox-bench evaluation, will undoubtedly accelerate further research and development in this field. The potential applications are vast, ranging from automated document processing and archival to improved accessibility for visually impaired individuals and enhanced data extraction from scientific literature.
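For readers curious how the weight-space techniques mentioned above operate, the sketch below illustrates uniform checkpoint averaging and task-arithmetic merging over PyTorch state dictionaries. The function names, the choice of two task vectors (an OCR-tuned and a bounding-box-tuned checkpoint), and the alpha/beta coefficients are illustrative assumptions rather than the authors’ exact recipe; the paper only states that these techniques combine complementary gains and control the OCR-versus-localization trade-off.

```python
import torch

def average_checkpoints(state_dicts):
    """Uniform checkpoint averaging: element-wise mean of matching tensors."""
    avg = {}
    for key in state_dicts[0]:
        avg[key] = torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
    return avg

def task_arithmetic_merge(base, ocr_tuned, bbox_tuned, alpha=1.0, beta=0.5):
    """Task arithmetic: add scaled task vectors (fine-tuned weights minus base
    weights) to the base model. Varying alpha and beta trades off OCR quality
    against bounding-box accuracy."""
    merged = {}
    for key in base:
        ocr_vec = ocr_tuned[key].float() - base[key].float()
        bbox_vec = bbox_tuned[key].float() - base[key].float()
        merged[key] = base[key].float() + alpha * ocr_vec + beta * bbox_vec
    return merged
```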
LightOnOCR-2-1B training and coordinate-based localization offer promising results
Scientists developed LightOnOCR-2-1B, a 1B-parameter end-to-end multilingual vision-language model designed to convert document images into clean, naturally ordered text, bypassing traditional, brittle OCR pipelines. The research team trained this model on a large-scale, high-quality distillation mix, prioritizing scans, French documents, and scientific PDFs, achieving state-of-the-art results on the OlmOCR-Bench while remaining nine times smaller and significantly faster than previous leading systems. Furthermore, the study extended the output format to predict normalized bounding boxes for embedded images within documents, introducing localization capabilities during pretraining. Researchers pioneered a resume strategy to incorporate coordinate supervision during pretraining, enabling the model to learn image locations, and subsequently refined this localization using Reinforcement Learning with Verifiable Rewards (RLVR) employing IoU-based rewards.
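As an illustration of how an IoU-based verifiable reward can be computed, the sketch below scores a set of predicted boxes against references. The greedy matching scheme and the handling of unmatched boxes are assumptions for illustration, since the paper only states that the rewards are IoU-based.

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two (x0, y0, x1, y1) boxes."""
    x0 = max(box_a[0], box_b[0])
    y0 = max(box_a[1], box_b[1])
    x1 = min(box_a[2], box_b[2])
    y1 = min(box_a[3], box_b[3])
    inter = max(0.0, x1 - x0) * max(0.0, y1 - y0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def bbox_reward(predicted, reference):
    """Verifiable reward sketch: greedily match predicted boxes to references
    and average the IoUs; unmatched boxes on either side score zero."""
    if not reference:
        return 1.0 if not predicted else 0.0
    scores = []
    remaining = list(predicted)
    for ref in reference:
        if not remaining:
            scores.append(0.0)
            continue
        best = max(remaining, key=lambda p: iou(p, ref))
        scores.append(iou(best, ref))
        remaining.remove(best)
    scores.extend(0.0 for _ in remaining)  # penalize spurious predictions
    return sum(scores) / len(scores)
```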
The team engineered a novel data curation pipeline utilizing nvpdftex to obtain pixel-aligned supervision from TeX sources, strengthening scientific OCR and generating an automatic subset for their new localization benchmark, LightOnOCR-bbox-bench. Experiments employed a higher resolution training regime, reaching a maximum longest edge of 1540px, alongside data augmentation techniques and the inclusion of empty pages to mitigate looping behaviours and enhance full-page fidelity. The study details the architecture of LightOnOCR, a compact model comprising a vision encoder, a multimodal projector, and a language model decoder. Scientists initialized the vision encoder from pretrained Mistral-Small-3.1 weights, allowing it to handle variable image sizes and preserve crucial spatial structure for documents with diverse layouts.
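The resolution cap mentioned above is straightforward to reproduce in preprocessing; the Pillow-based sketch below rescales a page so its longest edge does not exceed 1540 pixels. The choice of LANCZOS resampling and the decision not to upscale smaller pages are assumptions, not details from the paper.

```python
from PIL import Image

MAX_LONGEST_EDGE = 1540  # pretraining resolution cap reported in the paper

def resize_longest_edge(image, max_edge=MAX_LONGEST_EDGE):
    """Downscale so the longest side is at most max_edge, preserving aspect
    ratio. Images already within the limit are returned unchanged."""
    w, h = image.size
    longest = max(w, h)
    if longest <= max_edge:
        return image
    scale = max_edge / longest
    return image.resize((round(w * scale), round(h * scale)), Image.LANCZOS)
```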
A two-layer MLP with GELU activation served as the multimodal projector, mapping vision features into the language model’s embedding space after applying spatial merging with a factor of 2, reducing visual tokens by 4x while maintaining granularity. To further enhance performance, the team leveraged lightweight weight-space techniques, specifically checkpoint averaging and task-arithmetic merging, to combine complementary gains and precisely control the trade-off between OCR quality and bounding box accuracy. This approach enables the model to handle complex layouts, including tables, forms, receipts, and scientific notation, without the fragility associated with multi-stage pipelines, a significant advancement in document understanding technology. The released checkpoints are available under the Apache 2.0 license, alongside the dataset and LightOnOCR-bbox-bench evaluation under their respective licenses.
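To ground the projector description above, here is a rough PyTorch sketch: a 2x2 spatial merge concatenates neighbouring patch features (cutting visual tokens by 4x) before a two-layer MLP with GELU projects them into the language model’s embedding space. The hidden dimensions and the concatenation-based merge are illustrative assumptions, not the authors’ exact implementation.

```python
import torch
import torch.nn as nn

class MultimodalProjector(nn.Module):
    """Sketch of the described projector: 2x2 spatial merging (4x fewer visual
    tokens) followed by a two-layer GELU MLP into the LM embedding space.
    Dimensions and the concatenation-based merge are assumptions."""

    def __init__(self, vision_dim=1024, lm_dim=2048, merge=2):
        super().__init__()
        self.merge = merge
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim * merge * merge, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim),
        )

    def forward(self, features, grid_h, grid_w):
        # features: (batch, grid_h * grid_w, vision_dim)
        b, _, d = features.shape
        x = features.view(b, grid_h, grid_w, d)
        # Group each 2x2 neighbourhood of patch features into one token.
        x = x.view(b, grid_h // self.merge, self.merge,
                   grid_w // self.merge, self.merge, d)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(
            b, (grid_h // self.merge) * (grid_w // self.merge),
            self.merge * self.merge * d)
        return self.mlp(x)  # (batch, tokens / 4, lm_dim)
```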
LightOnOCR-2-1B excels on complex document recognition
Scientists have developed LightOnOCR-2-1B, a 1B-parameter end-to-end multilingual vision-language model capable of converting document images, such as PDFs, directly into clean, naturally ordered text. This innovative system bypasses traditional, brittle OCR pipelines, offering a streamlined approach to document understanding. Experiments demonstrate that LightOnOCR-2 achieves state-of-the-art results on the OlmOCR-Bench benchmark, while simultaneously being nine times smaller and substantially faster than previously leading models. The team trained the model on a large-scale, high-quality distillation mix, specifically designed with strong coverage of scanned documents, French language texts, and complex scientific PDFs.
Measurements confirm a 2.5x increase in the size of the training mixture compared to its predecessor, LightOnOCR-1B, enhancing its ability to handle diverse document types. During pretraining, images were processed at a higher resolution, with a maximum longest edge of 1540 pixels, and data augmentation techniques were applied to improve robustness and fidelity. Researchers also incorporated empty pages into the training data to mitigate looping behaviours and ensure full-page accuracy. Furthermore, the study extends the model’s capabilities to predict normalized bounding boxes for embedded images within documents.
Localization was initially introduced during pretraining using a resume strategy, then refined with Reinforcement Learning with Verifiable Rewards (RLVR) employing IoU-based rewards, a technique that yielded significant improvements in bounding box accuracy. Tests show that this approach effectively addresses persistent failure modes, including repetition loops and errors in mathematical rendering and formatting. To further enhance performance, scientists leveraged checkpoint averaging and task-arithmetic merging, lightweight weight-space techniques that combine complementary gains and allow for controlled trade-offs between OCR quality and bounding box precision. The research culminates in the release of model checkpoints under the Apache 2.0 license, alongside the dataset and a new benchmark, LightOnOCR-bbox-bench, for evaluating image localization in documents, all publicly available for further research and development.
LightOnOCR-2-1B delivers fast, accurate document AI
Scientists have developed LightOnOCR-2-1B, a 1-billion-parameter end-to-end multilingual vision-language model capable of converting document images, such as PDFs, into accurately ordered text, bypassing traditional, often unreliable OCR pipelines. This new model achieves state-of-the-art results on the OlmOCR-Bench benchmark while being significantly smaller and faster than previous leading systems. Researchers further enhanced the model’s capabilities by extending its output to predict normalized bounding boxes for images embedded within documents, a feature achieved through a novel pretraining strategy and refined using reinforcement learning with IoU-based rewards. Lightweight techniques like checkpoint averaging and task-arithmetic merging were also implemented, offering practical improvements and control over the trade-off between OCR accuracy and bounding box prediction.
The team publicly released the model weights, datasets, and a new benchmark, LightOnOCR-bbox-bench, to facilitate further research in high-fidelity document extraction and localization. The authors acknowledge limitations regarding handwritten text transcription, noting that the model’s training primarily focused on printed sources and struggles with cursive or unconstrained handwriting. They also observed performance variations across languages, with expected lower results for pruned models, a point for consideration in future applications. Future work, as suggested by the researchers, will concentrate on targeted data collection and evaluation to improve performance on handwritten text and expand language coverage, ultimately aiming for more robust and versatile document understanding systems.
👉 More information
🗞 LightOnOCR: A 1B End-to-End Multilingual Vision-Language Model for State-of-the-Art OCR
🧠 ArXiv: https://arxiv.org/abs/2601.14251
