Rectifying distorted book images is a significant problem in document image processing because the left and right pages curve differently under binding constraints. Shaokai Liu of Hefei University of Technology, together with Hao Feng, Bozhi Luan, and colleagues at the University of Science and Technology of China, addresses this challenge with BookNet, a deep learning framework designed specifically for dual-page book image rectification. BookNet captures the geometric relationships between adjacent pages using a dual-branch architecture with cross-page attention, modelling the influence of each page on the other. The researchers also introduce Book3D, a large synthetic dataset, and Book100, a real-world benchmark, to facilitate further research in this area, and they report substantial performance gains over existing methods.
The Book3D dataset pairs realistic 3D deformation patterns with page content drawn from academic papers, enabling the training of robust rectification models. Together, the two datasets address a critical gap in resources for book-specific rectification research and establish a standardised benchmark for comparing different models. The framework itself predicts three complementary flow fields (a left flow, a right flow, and a full flow), capturing both page-specific deformations and their holistic interactions.
This multi-flow approach effectively rectifies books by modelling separate flows for each page and the complete spread, a departure from conventional single-flow methods that struggle with asymmetric distortions. This research establishes a new paradigm for book image rectification, moving beyond single-page techniques to address the unique challenges posed by bound documents. The work opens possibilities for improved digitisation of cultural heritage materials, enhanced knowledge management systems, and more effective multimodal understanding of book content. By providing both a novel framework and dedicated datasets, the study paves the way for further advancements in this crucial area of document image processing.
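To make the idea of a warping flow concrete, the following is a minimal sketch of how a flow field can be applied to an image. It assumes a backward-mapping convention in which the flow stores, for every output pixel, the absolute (x, y) source coordinates in the distorted image; the article does not state which convention BookNet uses, and the function names `bilinear_sample` and `warp_with_flow` are illustrative rather than taken from the released code.

```python
import math


def bilinear_sample(image, x, y):
    """Sample a 2D grayscale image (list of rows) at a fractional (x, y)."""
    h, w = len(image), len(image[0])
    # Clamp to the image bounds so out-of-range flow values stay valid.
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0
    # Interpolate along x on the two neighbouring rows, then along y.
    top = image[y0][x0] * (1 - fx) + image[y0][x1] * fx
    bottom = image[y1][x0] * (1 - fx) + image[y1][x1] * fx
    return top * (1 - fy) + bottom * fy


def warp_with_flow(image, flow):
    """Backward warping: flow[r][c] = (src_x, src_y) for output pixel (r, c)."""
    return [[bilinear_sample(image, sx, sy) for (sx, sy) in row] for row in flow]


# Usage: an identity flow reproduces the input image unchanged.
image = [[0, 1, 2], [3, 4, 5]]
identity_flow = [[(c, r) for c in range(3)] for r in range(2)]
rectified = warp_with_flow(image, identity_flow)
```

Under this assumption, the same routine could be applied with a left flow, a right flow, or a full flow; only the flow field and the region it covers would change.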
Dual-page rectification via cross-page attention and datasets
BookNet's dual-branch design with cross-page attention explicitly models the influence of each page on the other, capturing asymmetric curvature patterns caused by binding constraints. The Book3D dataset generation pipeline employed Blender for 3D book modelling, utilising parameterised deformation controls to simulate realistic geometric distortions. Synthetic book images were rendered from diverse arXiv academic papers, with illumination conditions and viewing angles varied to enhance dataset realism. The pipeline generates paired synthetic book images alongside corresponding ground truth arXiv paper images, providing labelled data for supervised learning.
Researchers then constructed Book100, a benchmark comprising 100 real-world book images, to rigorously evaluate the performance of BookNet against state-of-the-art methods. BookNet predicts three complementary flow fields: a left flow for the left page, a right flow for the right page, and a full flow representing the complete book spread. This multi-flow approach overcomes the limitations of single-flow methods, which struggle to capture asymmetric deformations in bound books. The cross-page attention mechanisms within the dual-branch architecture allow information exchange between the two branches, refining the estimated warping flows and improving rectification accuracy. Experiments demonstrate that BookNet outperforms existing methods, achieving superior performance in book image rectification and enabling more effective downstream applications such as cultural heritage digitisation and knowledge management. The team will publicly release both the code and datasets to promote further research in this area.
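The information exchange between the two branches can be illustrated with a small sketch of cross-attention, in which tokens from one page act as queries over tokens from the other page. This is a simplified illustration under stated assumptions, not BookNet's actual layer: it uses plain scaled dot-product attention with no learned projections, a single head, and a residual update, and the names `cross_attention` and `cross_page_update` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, context):
    """Each query token attends over all context tokens (keys = values = context)."""
    dim = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in context]
        weights = softmax(scores)
        # Weighted sum of the context tokens, one output dimension at a time.
        out.append([sum(w * v[d] for w, v in zip(weights, context)) for d in range(dim)])
    return out


def cross_page_update(left_tokens, right_tokens):
    """Residual update in both directions: left attends to right and vice versa."""
    left_ctx = cross_attention(left_tokens, right_tokens)
    right_ctx = cross_attention(right_tokens, left_tokens)
    new_left = [[a + b for a, b in zip(t, c)] for t, c in zip(left_tokens, left_ctx)]
    new_right = [[a + b for a, b in zip(t, c)] for t, c in zip(right_tokens, right_ctx)]
    return new_left, new_right


# Usage: two left-page tokens and one right-page token, each of dimension 2.
left = [[1.0, 0.0], [0.0, 1.0]]
right = [[0.5, 0.5]]
new_left, new_right = cross_page_update(left, right)
```

Both directions read from the original tokens before either is updated, so neither page's features are privileged in the exchange.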
BookNet rectifies dual pages with warping flows
Extensive experiments demonstrate BookNet’s superior performance in book image rectification compared to existing state-of-the-art methods. The team measured performance using five key metrics on the Book100 benchmark: Multi-Scale Structural Similarity (MSSIM), Local Distortion (LD), Aligned Distortion (AD), Edit Distance (ED), and Character Error Rate (CER). BookNet achieved an MSSIM score of 0.48, matching the highest value among compared methods, and a best-in-class Aligned Distortion of 0.53. Crucially, BookNet reduced Local Distortion to 12.42, a 16.9% improvement over the second-best method, and lowered Edit Distance to 948.63, a 4.9% reduction.
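Two of these metrics, Edit Distance and Character Error Rate, are computed on text recognised from the rectified image. The sketch below shows the standard definitions: Levenshtein distance between a reference string and a recognised string, and CER as that distance divided by the reference length. The OCR engine and any text normalisation used in the evaluation are not described here, so the inputs are simply assumed to be two plain strings.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(hypothesis) + 1))
    for i, r in enumerate(reference, start=1):
        curr = [i]
        for j, h in enumerate(hypothesis, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(reference, hypothesis):
    """CER = edit distance normalised by the reference length."""
    if not reference:
        raise ValueError("reference text must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Usage: one substituted character in a four-character reference.
cer = character_error_rate("book", "boak")
```

Lower values are better for both metrics, which is why reductions are reported as improvements.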
These quantitative results demonstrate BookNet’s ability to accurately rectify book pages, minimising geometric inaccuracies and preserving content integrity. The network was trained for 65 epochs with a batch size of 4 per GPU on 4 NVIDIA RTX 3090 GPUs, employing the AdamW optimiser with a maximum learning rate of 1 × 10⁻⁴ and weight decay of 1 × 10⁻⁵. Input images were resized to 288 × 288 pixels, and HSV colour jittering was applied to enhance robustness to varying illumination conditions. Ablation studies confirmed the importance of each architectural component, revealing that joint supervision of all three warping flows (left page, right page, and full spread) yielded the best results.
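As an illustration of the HSV colour jittering mentioned above, the following sketch perturbs hue, saturation, and value by one random offset per image using Python's standard `colorsys` module. The jitter ranges (`max_hue`, `max_sat`, `max_val`) and the function name `hsv_jitter` are assumptions chosen for the example; the article does not report the values used in training.

```python
import colorsys
import random


def hsv_jitter(pixels, max_hue=0.02, max_sat=0.2, max_val=0.2, rng=None):
    """Apply one random HSV offset to a list of (r, g, b) floats in [0, 1]."""
    rng = rng or random.Random()
    # Draw a single offset per channel so the whole image shifts consistently.
    dh = rng.uniform(-max_hue, max_hue)
    ds = rng.uniform(-max_sat, max_sat)
    dv = rng.uniform(-max_val, max_val)
    out = []
    for r, g, b in pixels:
        h, s, v = colorsys.rgb_to_hsv(r, g, b)
        h = (h + dh) % 1.0               # hue is circular, so wrap around
        s = min(max(s + ds, 0.0), 1.0)   # saturation and value are clamped
        v = min(max(v + dv, 0.0), 1.0)
        out.append(colorsys.hsv_to_rgb(h, s, v))
    return out


# Usage: jitter three pixels with a seeded generator for reproducibility.
pixels = [(0.9, 0.8, 0.7), (0.1, 0.2, 0.3), (0.5, 0.5, 0.5)]
jittered = hsv_jitter(pixels, rng=random.Random(0))
```

Because only colour is perturbed, the geometry of the page and its ground-truth flow stay unchanged, which is what makes this kind of augmentation suitable for rectification training.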
This approach achieved a 14.0% reduction in Local Distortion and a 33.3% reduction in Edit Distance compared to page-only supervision. Qualitative comparisons demonstrate BookNet’s ability to maintain geometric consistency across the entire book spread, particularly in the challenging gutter region, while existing methods often produced misalignments or residual curvature. BookNet, comprising 30.1 million parameters, achieves 24.39 FPS on a single NVIDIA RTX 3090 GPU, demonstrating efficient inference speed for practical applications.
BookNet rectifies pages via cross-page attention and layout
The dual-branch architecture with cross-page attention allows the system to process both pages of a book spread simultaneously and to model the geometric relationships between them, improving the accuracy of image rectification. Extensive experimentation demonstrated that BookNet outperforms current state-of-the-art methods on book image rectification across multiple metrics. Furthermore, the rectified images significantly improved the performance of a multimodal model, Qwen2.5-VL-7B, in document and visual question answering tasks, highlighting the practical benefits of accurate rectification for downstream applications. The authors acknowledge that their method currently focuses on dual-page book rectification and does not extend to more complex book structures or severely damaged pages. Future research could explore extending BookNet to handle multi-page volumes and developing techniques to address more significant distortions or missing content. Nevertheless, this work establishes a strong baseline for the field and offers valuable resources (datasets and code) to facilitate further advancements in document image processing and multimodal understanding.
👉 More information
🗞 BookNet: Book Image Rectification via Cross-Page Attention Network
🧠 ArXiv: https://arxiv.org/abs/2601.21938
