Researchers are tackling a significant challenge in artificial intelligence: the ability of machines to understand artistic representations of language. Shubham Patle, Sara Ghaboura, and Hania Tariq of Mohamed bin Zayed University of AI (MBZUAI) and NUCES, together with Mohammad Usman Khan, Omkar Thawakar, and colleagues from NUST and MBZUAI, have introduced DuwatBench, a new benchmark designed to assess multimodal understanding of Arabic calligraphy. This dataset of 1,272 curated samples, spanning six calligraphic styles and approximately 1,475 unique words, highlights the limitations of current models when faced with the complexities of Arabic script, such as intricate strokes and stylistic nuances. DuwatBench is therefore crucial for fostering culturally sensitive AI, ensuring the Arabic language and its rich visual heritage are fairly represented and accurately processed by future technologies.
DuwatBench benchmark for Arabic calligraphic script analysis
Scientists have unveiled DuwatBench, a novel benchmark comprising 1,272 meticulously curated samples designed to assess multimodal models’ ability to process Arabic calligraphy. This work addresses a significant gap in current research, where models struggle with the artistic and stylistic nuances of Arabic script, particularly in calligraphic forms. The research team constructed DuwatBench with approximately 1,475 unique words across six classical and modern calligraphic styles (including Diwani, Kufic, and Nasta’liq), each paired with precise sentence-level detection annotations. These annotations facilitate both recognition and localization analysis, moving beyond simple OCR-like tasks and enabling a more comprehensive evaluation of model performance.
The dataset deliberately reflects the real-world complexities of Arabic writing, incorporating challenging features such as intricate stroke patterns, dense ligatures, and stylistic variations that commonly confound standard text recognition systems. Researchers evaluated 13 leading Arabic and multilingual models using DuwatBench, demonstrating that while these models excel with clean text, they consistently falter when confronted with calligraphic variation, artistic distortions, and the precise alignment of visual text. This highlights a critical need for models to move beyond surface-level pattern recognition and achieve genuine understanding of culturally grounded visual-textual data. This innovative benchmark incorporates a diverse range of content, including Quranic verses, devotional phrases, greetings, and poetic expressions, ensuring a broad representation of Arabic calligraphy’s semantic depth.
Beyond stylistic diversity, the taxonomy includes samples categorized by content: Quranic, devotional, non-religious, names of Allah, names of the Prophet, and personal/place names, providing a structured basis for evaluating both visual variation and semantic understanding. The team meticulously annotated each image with bounding boxes for word detection, a feature absent in existing datasets, allowing for detailed analysis of localization accuracy. Experiments show that current state-of-the-art Arabic and multilingual large multimodal models (LMMs) struggle with style misinterpretation, sensitivity to diacritics and hamza, curved text alignment, and interference from background clutter. By publicly releasing DuwatBench, alongside its comprehensive annotations and evaluation suite, the researchers aim to foster culturally grounded multimodal research, promote fair inclusion of the Arabic language and visual heritage in AI systems, and accelerate progress in this challenging area. The dataset and code are freely available, paving the way for future innovations in Arabic calligraphy processing and multimodal understanding.
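To make the annotation design concrete, here is a minimal sketch of how one such sample, with its content category and word-level bounding boxes, might be represented. The field names and values are hypothetical illustrations, not the released DuwatBench schema:

```python
# Hypothetical representation of one DuwatBench-style sample.
# Field names and values are illustrative, not the released schema.
sample = {
    "image": "samples/0001.png",
    "style": "Diwani",                  # one of the six calligraphic styles
    "content_category": "devotional",   # e.g. Quranic, devotional, non-religious
    "text": "بسم الله",                 # ground-truth transcription
    "words": [                          # word-level detection annotations
        {"word": "بسم",  "bbox": [12, 30, 88, 95]},   # [x_min, y_min, x_max, y_max]
        {"word": "الله", "bbox": [95, 28, 170, 96]},
    ],
}

def box_is_valid(bbox, width, height):
    """Sanity check: the box must be non-degenerate and inside the image."""
    x0, y0, x1, y1 = bbox
    return 0 <= x0 < x1 <= width and 0 <= y0 < y1 <= height

assert all(box_is_valid(w["bbox"], 200, 120) for w in sample["words"])
```

Storing boxes per word, rather than only a page-level transcription, is what makes the localization analysis described above possible.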
DuwatBench Dataset and Calligraphy Recognition Evaluation
Scientists developed DuwatBench, a new benchmark comprising 1,272 curated samples and approximately 1,475 unique words to rigorously evaluate Arabic calligraphy recognition. The study meticulously paired each sample with sentence-level detection annotations, reflecting the complexities of real-world Arabic writing, including intricate stroke patterns and stylistic variations. Researchers constructed the dataset to challenge standard text recognition systems, specifically addressing issues with dense ligatures and artistic distortions commonly found in calligraphy. To assess model performance, the team evaluated 13 leading Arabic and multilingual models using DuwatBench, employing a controlled experimental setup to compare their capabilities.
The evaluation protocol involved presenting models with both clean text and calligraphic variations, allowing for a direct assessment of their robustness to stylistic changes. Quantitative analysis, summarized in Table 7, revealed a strong correlation between character and word errors (ρ(CER, WER) = 0.98), demonstrating that errors at the character level directly impact overall word recognition accuracy. Furthermore, the study established strong inverse correlations between normalized edit distance and semantic overlap (ρ(NLD, chrF) = −0.99, ρ(NLD, ExactMatch) = −0.95), indicating that lexical deviations closely correspond to losses in both visual and semantic fidelity. Experiments employed bounding box localization to reduce background clutter and improve visual grounding, consistently leading to performance gains across most models.
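The edit-distance metrics behind these correlations can be sketched with a plain dynamic-programming implementation. This is a generic formulation of CER, WER, and NLD, not the authors’ actual evaluation code:

```python
def levenshtein(a, b):
    """Edit distance between two sequences (characters or word lists)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]

def cer(hyp, ref):
    """Character Error Rate: character edits / reference length."""
    return levenshtein(hyp, ref) / max(len(ref), 1)

def wer(hyp, ref):
    """Word Error Rate: word edits / reference word count."""
    h, r = hyp.split(), ref.split()
    return levenshtein(h, r) / max(len(r), 1)

def nld(hyp, ref):
    """Normalized Levenshtein Distance, in [0, 1]."""
    return levenshtein(hyp, ref) / max(len(hyp), len(ref), 1)
```

Because a single character substitution usually corrupts an entire word, per-sample CER and WER computed this way tend to move together, which is what the reported 0.98 correlation between character and word errors quantifies.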
Open-source models like Gemma-3-27B-IT, Qwen2.5-VL-72B-Instruct, and MBZUAI/AIN demonstrated consistent improvements with localized input, suggesting that focused visual grounding enhances fine-grained character recognition. Closed-source systems, notably Gemini-2.5-Flash, maintained robust accuracy with and without bounding boxes, establishing it as the strongest performer in the evaluation. Researchers also conducted a detailed qualitative error analysis, visualizing representative outputs across six calligraphic styles to highlight correct predictions and identify failure modes. The work pioneered a comprehensive metric suite, validated through quantitative and qualitative analysis, providing a rigorous basis for cross-model performance comparison on DuwatBench. Analysis of the 1,272 samples revealed that average ExactMatch remained below 0.18, underscoring the persistent challenge of accurately transcribing stylized Arabic calligraphy even with advanced multimodal LLMs. This innovative methodology enables researchers to address the limitations of existing systems and foster culturally grounded research in Arabic language processing and visual heritage preservation.
DuwatBench benchmark reveals Arabic script recognition challenges
Scientists have unveiled DuwatBench, a new benchmark comprising 1,272 curated samples containing approximately 1,475 unique words across six classical and modern calligraphic styles, each meticulously paired with sentence-level detection annotations. This dataset addresses a significant gap in current research by focusing on the challenges of processing artistic and stylized Arabic script, a domain largely unexplored by existing models. The team measured performance across these diverse calligraphic styles, revealing the complexities inherent in recognizing intricate stroke patterns, dense ligatures, and stylistic variations common in Arabic writing. Experiments revealed substantial performance differences among the 13 leading Arabic and multilingual models tested, using five complementary metrics: Character Error Rate (CER), Word Error Rate (WER), chrF, ExactMatch, and Normalized Levenshtein Distance (NLD).
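Of the five metrics, ExactMatch and chrF are the least self-explanatory. The following is a simplified sketch of both, assuming uniform weighting over character n-grams up to order 6 and β = 2; production evaluations typically use a reference implementation such as sacreBLEU’s chrF, which may differ in details:

```python
from collections import Counter

def exact_match(hyp, ref):
    """1.0 if the transcription matches the reference exactly, else 0.0."""
    return float(hyp == ref)

def char_ngrams(text, n):
    """Character n-gram counts; chrF conventionally ignores whitespace."""
    text = text.replace(" ", "")
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def chrf(hyp, ref, max_n=6, beta=2.0):
    """Simplified chrF: averaged char n-gram precision/recall, F_beta * 100."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        h, r = char_ngrams(hyp, n), char_ngrams(ref, n)
        overlap = sum((h & r).values())          # clipped n-gram matches
        if sum(h.values()) == 0 or sum(r.values()) == 0:
            continue
        precisions.append(overlap / sum(h.values()))
        recalls.append(overlap / sum(r.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return 100 * (1 + beta**2) * p * r / (beta**2 * p + r)
```

chrF gives partial credit for near-misses at the character level, which is why it remains informative on calligraphy even when ExactMatch collapses toward zero.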
Results demonstrate that while models perform adequately on clean text, they struggle significantly with the nuances of calligraphic variation and artistic distortions. Specifically, the best performing open-source model, Gemma-3-27B-IT, achieved a CER of 0.5494 and a WER of 0.6572, alongside a chrF score of 48.9358, an ExactMatch of 0.2646, and an NLD error of 0.4707. These measurements confirm the difficulty of accurately transcribing stylized Arabic calligraphy even with advanced language models. Further analysis, detailed in Table 3, shows Word Error Rates (WER) across individual script styles, highlighting the varying difficulty posed by each.
The lowest WER score achieved across all styles was 0.3527 for the Gemini-2.5-Flash model on the Diwani style, while the highest was 1.0000 for several models on the Kufic style. Data shows that Gemini-2.5-Flash attained the highest chrF score of 71.8174, the best ExactMatch of 0.4167, and the lowest NLD error of 0.3166, demonstrating its superior robustness in handling complex calligraphic text. Tests show that the use of bounding boxes improves performance for most models, as shown in Table 4, with higher chrF scores and lower error rates when models are provided with localized text regions. The team recorded that the best performing model with bounding boxes, Gemini-2.5-Flash, achieved a chrF score of 66.5662, an ExactMatch of 0.4488, and an NLD error of 0.3212. This research provides a valuable resource for advancing the field of Arabic calligraphy recognition and paves the way for developing more accurate and robust optical character recognition systems for historical texts and artistic creations.
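The mechanism behind the bounding-box gains is simply that the model transcribes a cropped region instead of the full artwork, so decorative background is excluded. A toy sketch of the cropping step follows; a real pipeline would crop the actual image with an imaging library, which is an implementation choice, not something stated in the text:

```python
def crop(image, bbox):
    """Crop [x_min, y_min, x_max, y_max] from a row-major 2-D pixel grid."""
    x0, y0, x1, y1 = bbox
    return [row[x0:x1] for row in image[y0:y1]]

# Toy 4x6 image: nonzero pixels mark the "calligraphy", zeros the clutter.
image = [
    [0, 0, 0, 0, 0, 0],
    [0, 9, 9, 9, 0, 0],
    [0, 9, 0, 9, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
patch = crop(image, [1, 1, 4, 3])   # tight box around the strokes
assert patch == [[9, 9, 9], [9, 0, 9]]
```

Feeding `patch` rather than `image` to a recognizer is the localized-input condition compared in Table 4.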
DuwatBench reveals calligraphy’s challenge for current models
Scientists have introduced DuwatBench, a new benchmark dataset comprising 1,272 curated samples of Arabic calligraphy spanning six classical and modern styles. This resource includes sentence-level detection annotations and approximately 1,475 unique words, addressing a significant gap in evaluating models’ ability to process artistic Arabic script. The dataset authentically reflects real-world challenges presented by complex stroke patterns, dense ligatures, and stylistic variations commonly found in Arabic calligraphy. Researchers evaluated thirteen leading Arabic and multilingual models using DuwatBench, revealing that while these models perform adequately on clean text, they struggle considerably with the nuances of calligraphic variation, artistic distortions, and accurate visual-text alignment.
The findings demonstrate that visual-textual grounding, rather than simply increasing model size, is crucial for improving calligraphy understanding, and also highlight linguistic and cultural biases present in current multimodal models, for example, a tendency to overpredict the word “Allah” due to strong cultural associations learned during training. The authors acknowledge that DuwatBench, while diverse in style and text categories, is smaller in scale than some general-purpose Arabic corpora. They emphasize the dataset’s focused approach as a strength, enabling meaningful evaluation scenarios for Arabic calligraphy specifically. Future research should concentrate on script-aware modelling and culturally grounded AI, with potential applications in cultural preservation, education, and digital humanities. This work promotes responsible and inclusive research practices, respecting the cultural significance of Arabic script while advancing the development of robust multimodal systems.
👉 More information
🗞 DuwatBench: Bridging Language and Visual Heritage through an Arabic Calligraphy Benchmark for Multimodal Understanding
🧠 ArXiv: https://arxiv.org/abs/2601.19898
