Although Bengali is the sixth most spoken language in the world, handwritten Bengali text remains difficult for automated recognition systems because the script is complex and training data is limited. Md. Mahmudul Hasan, Ahmed Nesar Tahsin Choudhury, and colleagues at the University of Dhaka, alongside Md. Mosaddek Khan, present GraDeT-HTR, a resource-efficient system that tackles this problem. It is built on a decoder-only Transformer architecture and uses a novel grapheme-based tokenizer that recognizes the fundamental building blocks of the Bengali script. The authors report that the tokenizer significantly improves recognition accuracy over standard methods, and that the system achieves state-of-the-art performance on multiple benchmark datasets after pre-training on synthetic data and fine-tuning on real handwritten samples.

Synthetic Data and Two-Stage Bengali OCR

The system converts handwritten Bengali documents into text, a task that remains a significant challenge in optical character recognition. Because real handwritten samples are scarce, the team first generated large amounts of synthetic training data, adding realistic distortions such as wavy lines, blur, and fragments so that the synthetic images better resemble real handwriting. The model is then pre-trained on this data and fine-tuned on real samples. Pre-training itself runs in two stages, first on line-level images and then on word-level images, so the model learns features at different levels of detail. Performance is measured with character-level and word-level error metrics, and large language models are additionally used to assess transcription quality. A web interface lets users upload documents and extract the text.
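The character-level and word-level error metrics mentioned above are usually computed as an edit distance divided by the length of the reference transcription. The following is a minimal sketch of that standard calculation, not the authors' evaluation code; the function names `edit_distance`, `cer`, and `wer` are illustrative, and "character" here means a Unicode code point, which for Bengali is not the same thing as a grapheme.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences.

    Insertions, deletions, and substitutions each cost 1.
    """
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete r from the reference
                            curr[j - 1] + 1,      # insert h into the reference
                            prev[j - 1] + cost))  # substitute, or match at no cost
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: code-point edits divided by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


def wer(reference, hypothesis):
    """Word error rate: whitespace-separated word edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)
```

Both rates can exceed 1.0 when the hypothesis is much longer than the reference, which is why they are reported as error rates rather than accuracies.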

Grapheme Tokenizer Boosts Bengali Script Recognition

GraDeT-HTR is designed to cope with both the complexity of the Bengali script and the shortage of available data. Its decoder-only Transformer is paired with a grapheme-based tokenizer, which is designed to handle the approximately 13,000 graphemes present in the Bengali script and so represents characters more faithfully than a tokenizer that is unaware of the script's structure. The system operates as an end-to-end pipeline for full-page images, integrating text detection and recognition: a detection module first segments the page into individual words, which are then passed to the recognizer. Benchmark results support this design, with the system achieving state-of-the-art performance on multiple datasets.
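To give a rough idea of what grapheme-level tokenization means for Bengali, the sketch below groups Unicode code points into clusters and assigns each cluster an id. It is a simplified illustration and not the tokenizer from the paper: the clustering rule (attach every combining mark to the preceding cluster, and join a consonant that follows a hasanta into the same conjunct), the special tokens, and the names `split_graphemes`, `build_vocab`, and `encode` are all assumptions made for this example.

```python
import unicodedata

HASANTA = "\u09CD"  # Bengali virama; joins consonants into a conjunct


def split_graphemes(text):
    """Split text into simplified grapheme clusters.

    A code point is appended to the current cluster when it is a combining
    mark (vowel sign, hasanta, anusvara, ...) or when it is a letter that
    directly follows a hasanta. Everything else starts a new cluster.
    """
    clusters = []
    for ch in text:
        category = unicodedata.category(ch)
        is_mark = category in ("Mn", "Mc")
        joins_conjunct = (bool(clusters)
                          and clusters[-1].endswith(HASANTA)
                          and category == "Lo")
        if clusters and (is_mark or joins_conjunct):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


def build_vocab(words):
    """Assign an integer id to every grapheme cluster seen in the words."""
    vocab = {"<pad>": 0, "<unk>": 1, "<bos>": 2, "<eos>": 3}  # hypothetical special tokens
    for word in words:
        for grapheme in split_graphemes(word):
            vocab.setdefault(grapheme, len(vocab))
    return vocab


def encode(word, vocab):
    """Turn a word into ids, one per grapheme cluster, wrapped in <bos>/<eos>."""
    ids = [vocab.get(g, vocab["<unk>"]) for g in split_graphemes(word)]
    return [vocab["<bos>"]] + ids + [vocab["<eos>"]]
```

Under this rule the word বাংলা, which is five code points long, becomes two clusters, and a conjunct such as স্ত্রী stays in one piece, so the model predicts visually meaningful units rather than isolated diacritics.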

Bengali Handwriting Recognition with Grapheme-Based Transformers

The work presents GraDeT-HTR as a resource-efficient answer to a task made difficult by the script's complexity and the limited data available. Its decoder-only Transformer, combined with a grapheme-based tokenizer designed for the nuances of the Bengali script, significantly improves recognition accuracy over existing methods. The system recognizes text at the word level and returns it in an editable form so that users can correct mistakes; it also supports review of multi-page documents and export to plain text or Microsoft Word. The complete pipeline has been released publicly under an open-source license. Future work will focus on expanding the training data to cover more diverse backgrounds and noise, pre-training larger language models for Bengali, exploring alternative auto-regressive language models, and refining text detection to reduce segmentation errors.
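A decoder-only, auto-regressive recognizer produces a word one token at a time, each time conditioning on the image and on the tokens it has already emitted. The sketch below shows that loop with greedy decoding. It is a generic illustration under stated assumptions: the source does not describe the decoding strategy, the idea of placing image patches before the text tokens is assumed here, and `greedy_decode`, `toy_scorer`, and the token ids are hypothetical stand-ins for a trained model.

```python
def greedy_decode(score_next, image_tokens, bos_id, eos_id, max_new_tokens=32):
    """Generate token ids one at a time, always taking the highest-scoring id.

    score_next(sequence) must return a list of scores, one per vocabulary id,
    for the token that should follow the given sequence.
    """
    sequence = list(image_tokens) + [bos_id]  # assumed layout: image prefix, then text
    output = []
    for _ in range(max_new_tokens):
        scores = score_next(sequence)
        next_id = max(range(len(scores)), key=lambda i: scores[i])
        if next_id == eos_id:
            break
        sequence.append(next_id)
        output.append(next_id)
    return output


def toy_scorer(sequence):
    """Stand-in for a trained model: replays a fixed script of ids 4, 5, then <eos>=3."""
    script = [4, 5, 3]
    # Integer entries are text tokens; the first one is <bos>.
    step = sum(isinstance(token, int) for token in sequence) - 1
    step = min(step, len(script) - 1)
    scores = [0.0] * 6
    scores[script[step]] = 1.0
    return scores


# Image patches are represented by placeholder strings in this toy example.
decoded = greedy_decode(toy_scorer, ["patch0", "patch1"], bos_id=2, eos_id=3)
```

In a real system the returned ids would be mapped back to graphemes and joined into the recognized word; the `max_new_tokens` cap guards against a model that never emits the end-of-sequence token.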

👉 More information
🗞 GraDeT-HTR: A Resource-Efficient Bengali Handwritten Text Recognition System utilizing Grapheme-based Tokenizer and Decoder-only Transformer
🧠 ArXiv: https://arxiv.org/abs/2509.18081

GraDeT-HTR: Resource-Efficient Bengali Handwritten Text Recognition System Achieves Improved Accuracy with Grapheme-Based Tokenizer

Rohail T.
