MultiLexNorm++ Establishes a Five-Language Asian Lexical Normalization Benchmark for Improved NLP

Lexical normalization, the process of converting non-standard language into a standard form, remains a crucial challenge in Natural Language Processing, particularly when dealing with the informal and diverse nature of social media text. Weerayut Buaphet (School of Information Science and Technology, VISTEC), Thanh-Nhi Nguyen (University of Information Technology, Ho Chi Minh City & Vietnam National University), Risa Kondo (Ehime University), and colleagues introduce MultiLexNorm++, a significant extension of existing benchmarks that incorporates five Asian languages, representing diverse families and scripts previously underrepresented in this field. This research addresses a critical gap, demonstrating the limitations of current state-of-the-art models when applied to these new languages and presenting a novel Large Language Model-based approach that achieves more robust performance. By analysing the remaining errors, the team highlights key areas for future development and paves the way for more inclusive and effective NLP systems across a wider range of languages and contexts.

The study introduces a unified benchmark and a novel generative model specifically designed to improve the processing of Asian languages, which have been historically underrepresented in existing NLP datasets. Researchers tackled the challenge of informal and spontaneous language use common in social media, which often degrades the performance of NLP models, by focusing on lexical normalization, the process of transforming text into a standard form. This work extends the existing MultiLexNorm benchmark to include five Asian languages (Indonesian, Japanese, Korean, Thai, and Vietnamese) representing diverse language families and scripts, thereby broadening the scope of evaluation beyond Indo-European languages and the Latin script.

The team achieved this by adapting existing datasets and creating new, manually annotated resources, resulting in a comprehensive benchmark for assessing lexical normalization performance across a wider range of languages. Alongside the benchmark, they developed an LLM-based approach that combines traditional heuristics with the power of modern language models, yielding more robust and reliable results. The research establishes that the proposed LLM-based model outperforms existing methods on the extended benchmark, showcasing its adaptability and effectiveness in handling the complexities of Asian languages.
Furthermore, the study provides a detailed analysis of remaining errors, revealing key areas for future research and development in lexical normalization. By releasing MultiLexNorm++, the researchers offer the NLP community a valuable resource for evaluating and comparing different models, fostering further innovation in this crucial area.

The researchers meticulously curated MultiLexNorm++ as a manually annotated extension covering five Asian languages that span diverse language families and four distinct scripts, broadening the scope of evaluation beyond traditionally represented linguistic groups. The study employed convenience sampling to select languages for inclusion, prioritising those with pre-existing datasets or access to native-speaker annotators, ensuring practical feasibility and data availability. This approach directly addresses the gap in comprehensive multilingual benchmarks, enabling more robust assessment of normalization models across a wider range of linguistic structures.

To construct MultiLexNorm++, the team adapted existing datasets and supplemented them with new annotations from native speakers, maintaining consistency with the original MultiLexNorm task definition. This involved careful conversion of existing data into a standardised format suitable for the benchmark, followed by rigorous annotation to ensure accuracy and quality. Experiments employed a combination of automated data processing and manual verification to guarantee the reliability of the expanded dataset, resulting in a resource that accurately reflects the complexities of the target languages.

The proposed normalization technique harnesses the generative capabilities of LLMs to perform word-by-word transformations, aiming to map informal social media text to its standard form. The approach enables the model to leverage both explicit linguistic rules and implicit knowledge encoded within the LLM’s parameters, achieving a balance between precision and adaptability. This method was designed to be robust across a variety of languages, addressing the limitations of previous models that struggled with diverse scripts and morphologies.
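The exact prompts and decoding settings are not given in this summary, so the following is only a minimal Python sketch, assuming an OpenAI-style chat API, of how a word-by-word normalization step with a lookup-list fallback (built from word-to-normalization pairs seen in the training data) might be wired together. The function name `normalize_word`, the prompt wording, and the model choice are illustrative assumptions, not the authors' exact implementation.

```python
# Minimal sketch of word-by-word normalization with an LLM.
# Prompt wording, model name, and lookup handling are hypothetical.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def normalize_word(word: str, left_ctx: str, right_ctx: str,
                   lookup: dict[str, str], model: str = "gpt-4o") -> str:
    """Return the standard form of `word`, or the word itself if unchanged."""
    # 1) Fast path: a lookup list derived from the training data.
    if word in lookup:
        return lookup[word]
    # 2) Otherwise ask the LLM for a word-level transformation in context.
    prompt = (
        "Normalize the marked word to its standard written form. "
        "Reply with the normalized word only.\n"
        f"Sentence: {left_ctx} <<{word}>> {right_ctx}\n"
        "Normalized word:"
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()
```

In the pipeline described later, a call like this would only be made for words that a separate detection model has flagged as needing normalization, which keeps the LLM from touching words that are already standard.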
Evaluation compared the state-of-the-art UFAL system, a fine-tuned encoder-decoder model, against the newly developed LLM-based approach on the MultiLexNorm++ benchmark. UFAL, which had previously demonstrated a substantial 14-percentage-point performance margin on the original MultiLexNorm, served as a strong baseline for assessing the effectiveness of the LLM-based method. Scientists rigorously measured performance across all languages in the extended benchmark, identifying areas where the LLM-based model outperformed UFAL and highlighting remaining challenges in lexical normalization, particularly for the newly included Asian languages. This detailed analysis provides valuable insights into the strengths and weaknesses of different approaches, guiding future research directions in the field.

MultiLexNorm++ benchmark boosts Asian language normalization performance significantly

Scientists achieved a significant breakthrough in lexical normalization for Asian languages, extending the MultiLexNorm benchmark with data covering five languages (Japanese, Korean, Thai, Vietnamese, and Indonesian) from diverse language families and scripts. The research team meticulously constructed a new dataset, termed MultiLexNorm++, to address the limitations of existing benchmarks, which primarily focused on Indo-European languages. The team measured performance using Error Reduction Rate (ERR), a metric ranging from 0 to 100 where higher scores indicate better performance, and also reported F1 scores to assess precision and recall.
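For context, ERR in MultiLexNorm-style evaluation is usually computed as word-level accuracy normalized against a baseline that leaves every word unchanged. The short Python sketch below assumes that standard definition; the authors' exact scoring script may differ.

```python
def error_reduction_rate(system_correct: int, baseline_correct: int,
                         total_words: int) -> float:
    """ERR scaled to 0-100, relative to the leave-as-is baseline.

    ERR = (acc_system - acc_baseline) / (1 - acc_baseline) * 100,
    where the baseline simply copies every input word unchanged.
    """
    acc_system = system_correct / total_words
    acc_baseline = baseline_correct / total_words
    return 100.0 * (acc_system - acc_baseline) / (1.0 - acc_baseline)

# Hypothetical example: 9,500 of 10,000 words are already standard
# (baseline correct); the system gets 9,878 words right -> ERR = 75.6.
print(round(error_reduction_rate(9878, 9500, 10000), 2))
```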

Results demonstrate that languages utilising the Latin script consistently achieved higher ERR and F1 scores compared to those employing non-Latin scripts. Specifically, Vietnamese achieved an ERR of 75.58 with an F1 score of 77.35, while Indonesian scored 61.17 ERR and 65.46 F1. The LLM-based pipeline delivered promising results without requiring fine-tuning, although open-source models generally underperformed compared to UFAL, the previous state-of-the-art baseline. Notably, GPT-4o matched UFAL’s overall performance and surpassed it on Thai (ERR of 41.87), Vietnamese, and Indonesian, suggesting a strong capability in handling these complex linguistic structures.

UFAL underperformed particularly on Thai and Korean, likely due to inefficiencies in its byte-level representations for these scripts, confirming prior observations of low performance on other NLP tasks. Performance on Japanese and Korean remained low overall, with ERR scores of 5.77 and 6.35 respectively, prompting further investigation into the remaining challenges.

The developed pipeline incorporates an encoder-based detection model, trained as a binary sequence labeling task using XLM-R, to identify words needing normalization. This detection step enhances efficiency by focusing LLM processing on relevant words and mitigating overnormalization. The team constructed in-context prompts using the detected words and their surrounding context, employing 8-shot learning with randomly sampled examples from the training data. Measurements confirm that this approach significantly improves normalization accuracy across the tested languages, paving the way for more robust and adaptable NLP systems.
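The summary does not specify the XLM-R variant, label scheme, or training setup for the detection model, so the Python sketch below is only an illustration of how such a binary sequence-labeling detector might be set up with Hugging Face Transformers; the classification head would still need to be fine-tuned on the benchmark's training data before it is useful.

```python
# Illustrative detection step: flag words that need normalization with an
# XLM-R token classifier (binary sequence labeling). Model size and label
# assignment are assumptions; fine-tuning on MultiLexNorm++ data is required.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

MODEL_NAME = "xlm-roberta-base"  # the paper may use a different variant
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForTokenClassification.from_pretrained(MODEL_NAME, num_labels=2)
model.eval()  # label 0 = keep as-is, label 1 = needs normalization

def detect(words: list[str]) -> list[bool]:
    """Return one flag per input word: True if the word should be normalized."""
    enc = tokenizer(words, is_split_into_words=True,
                    return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**enc).logits.squeeze(0)  # (num_subwords, 2)
    preds = logits.argmax(dim=-1).tolist()
    word_ids = enc.word_ids()  # maps each subword back to its word index
    # Use the prediction on the first subword of each word.
    return [preds[word_ids.index(i)] == 1 for i in range(len(words))]

# Only the flagged words are then placed, with their surrounding context and
# eight randomly sampled training examples, into the LLM prompt.
```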

MultiLexNorm expands to Asian languages successfully, demonstrating broad applicability

Scientists have extended the MultiLexNorm benchmark for lexical normalization to include five Asian languages, addressing a gap in existing resources which largely focus on Indo-European languages. This expansion incorporates languages from diverse families and scripts, providing a more comprehensive evaluation platform for normalization models. The LLM-based model demonstrated more robust performance across the extended benchmark, although it relies on pre-detection and a lookup list derived from training data.

Analysis of remaining errors highlighted ongoing challenges in normalization detection, spelling errors, and slang, all of which require further investigation. The study also noted that open-source LLMs did not consistently surpass the fine-tuned UFAL baseline, revealing a performance disparity between open and closed-source models in text normalization tasks. These findings underscore the importance of considering linguistic diversity when evaluating and developing text normalization techniques. The research suggests that mapping decomposed character representations to Latin characters can improve performance for languages with non-Latin orthographies.
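The summary does not describe the exact character-mapping scheme, but the idea of working with decomposed representations can be illustrated with Unicode NFD decomposition in Python; the transliteration table below is a tiny hypothetical example, not the authors' mapping.

```python
# Illustration: decompose non-Latin characters before mapping them to Latin.
import unicodedata

# Hypothetical jamo-to-Latin table covering just one syllable's components.
JAMO_TO_LATIN = {"\u1112": "h", "\u1161": "a", "\u11ab": "n"}  # ᄒ, ᅡ, ᆫ

def decompose_to_latin(text: str) -> str:
    """NFD-decompose text (e.g. Hangul syllables into jamo), then map known
    code points to Latin sequences, leaving everything else untouched."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(JAMO_TO_LATIN.get(ch, ch) for ch in decomposed)

print(decompose_to_latin("한"))  # '한' -> jamo ᄒ + ᅡ + ᆫ -> 'han'
```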

The authors acknowledge limitations related to the reliance on pre-detection and lookup lists, as well as the persistent challenges posed by complex linguistic phenomena. Future work should focus on addressing these bottlenecks and bridging the performance gap between open and closed-source LLMs to facilitate the creation of more effective and broadly applicable language normalization methods. Lexical normalization is a crucial step in many NLP tasks, and this work represents a significant advancement in the field.

👉 More information
🗞 MultiLexNorm++: A Unified Benchmark and a Generative Model for Lexical Normalization for Asian Languages
🧠 ArXiv: https://arxiv.org/abs/2601.16623

Rohail T.

I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.
