Tokenization presents a critical challenge for neural language modelling, particularly in morphologically rich languages like Turkish, where complex word formation affects both vocabulary size and the accuracy of morphological analysis. Duygu Altinok, an independent researcher based in Berlin, Germany, alongside co-authors, systematically investigates the interplay between training data, vocabulary size, and tokenizer design to address this issue. The research delivers the first comprehensive evaluation of Turkish subword tokenization strategies, moving beyond isolated vocabulary adjustments to a coupled approach backed by rigorous intrinsic diagnostics and a broad range of downstream tasks. By introducing a novel morphology-aware toolkit, the study provides actionable insights into building effective tokenizers for morphologically complex languages and establishes a reproducible foundation for future work in the field.
Turkish, with its productive agglutination, presents unique difficulties for both vocabulary efficiency and maintaining morphological fidelity during text processing.
This work introduces a “subwords manifest”: a systematic approach that simultaneously varies vocabulary size and the size of the training corpus used to build the tokenizer, a coupling previously unexplored in detail. The researchers compared multiple tokenizer families, including WordPiece, morphology-level, and character-based methods, under carefully controlled conditions to determine optimal configurations.
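A minimal sketch of what such a coupled sweep could look like, assuming the Hugging Face `tokenizers` library; the corpus file names and vocabulary grid are placeholders, not the paper's actual configuration:

```python
from pathlib import Path
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

def train_wordpiece(corpus_files, vocab_size):
    """Train a WordPiece tokenizer on the given corpus files."""
    tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))
    tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
    trainer = trainers.WordPieceTrainer(
        vocab_size=vocab_size,
        special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
    )
    tokenizer.train(corpus_files, trainer)
    return tokenizer

# Hypothetical sweep coupling tokenizer-training corpus size with vocabulary size
# (file names and grid are illustrative only).
for corpus in ["tr_corpus_small.txt", "tr_corpus_large.txt"]:
    for vocab_size in [1000, 4000, 8000, 16000, 32000]:
        tok = train_wordpiece([corpus], vocab_size)
        tok.save(f"wordpiece_{Path(corpus).stem}_{vocab_size}.json")
```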
The study moves beyond simple performance metrics by introducing a novel morphology-aware diagnostic toolkit. This toolkit provides detailed insights into segmentation quality, going beyond overall accuracy to assess boundary-level precision and recall over morpheme boundaries, lemma atomicity, and the degree of over- or under-segmentation.
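To make the boundary-level diagnostic concrete, here is a minimal sketch of precision, recall, and F1 over morpheme boundaries for a single word, using an invented Turkish example; the paper's exact scoring conventions may differ:

```python
def boundary_prf(gold_bounds, pred_bounds):
    """Precision/recall/F1 over internal morpheme-boundary positions for one word.

    Boundaries are character offsets inside the word, e.g. the gold
    segmentation ev|ler|de ("in the houses") yields {2, 5}.
    """
    tp = len(gold_bounds & pred_bounds)
    precision = tp / len(pred_bounds) if pred_bounds else 1.0
    recall = tp / len(gold_bounds) if gold_bounds else 1.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Example: gold ev|ler|de vs. a tokenizer that produced evler|de.
print(boundary_prf({2, 5}, {5}))  # -> (1.0, 0.5, 0.666...)
```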
Crucially, the research links these intrinsic diagnostics to performance on a broad range of downstream tasks, encompassing semantic understanding, syntactic analysis, and morphology-sensitive probes like part-of-speech tagging and dependency parsing. Through controlled comparisons, the work identifies specific scenarios where character-level and morphology-level tokenization offer distinct advantages.
Findings demonstrate that character-level approaches can be surprisingly competitive in tasks such as named entity recognition, while morphology-level tokenization excels when preserving linguistic structure is paramount. The research delivers actionable guidance for building effective tokenizers for Turkish and other morphologically rich languages, establishing a reproducible foundation for future advancements in the field.
All evaluation code, tokenizer training pipelines, and pre-trained Transformer models have been released openly, facilitating further research and practical deployment. This “subwords manifest” represents a significant step forward, transforming fragmented observations into prescriptive rules grounded in evidence and designed to improve the performance of neural language models on complex linguistic data. The systematic investigation included larger data regimes than previous studies, providing a more robust and reliable basis for informed decision-making.
Evaluating Turkish subword tokenisation with morphology-aware diagnostics
A systematic investigation of vocabulary size and tokenizer training corpus size forms the foundation of this research into Turkish subword tokenization. The study couples the vocabulary and corpus dimensions, comparing WordPiece, morphology-level, and character-based tokenizers under matched parameter budgets to assess their performance across diverse linguistic tasks.
Researchers evaluated these tokenizers on semantic tasks including Natural Language Inference, Semantic Textual Similarity, sentiment analysis, and Named Entity Recognition, alongside syntactic analyses of Part-of-Speech tagging and dependency parsing, and morphology-sensitive probes. To comprehensively understand tokenizer behaviour, a morphology-aware diagnostic toolkit was developed, extending beyond standard aggregate metrics.
This toolkit calculates boundary-level micro and macro F1 scores, decouples lemma atomicity from boundary hits, and quantifies over- and under-segmentation with dedicated indices. Character and word edit distances, alongside continuation rates and affix-type coverage, were also measured at the token level to provide detailed insights into tokenizer performance.
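One plausible corpus-level aggregation of those boundary scores, illustrating the micro versus macro distinction (pooled counts versus per-word averaging); this is a sketch of the idea, not the released toolkit implementation:

```python
def boundary_f1_corpus(gold, pred):
    """Micro and macro boundary F1 over a corpus.

    `gold` and `pred` are parallel lists of per-word boundary-offset sets.
    Micro-F1 pools true/false positives across all words; macro-F1
    averages the per-word F1 scores, so every word counts equally.
    """
    tp = fp = fn = 0
    per_word_f1 = []
    for g, p in zip(gold, pred):
        word_tp = len(g & p)
        tp, fp, fn = tp + word_tp, fp + len(p - g), fn + len(g - p)
        prec = word_tp / len(p) if p else 1.0
        rec = word_tp / len(g) if g else 1.0
        per_word_f1.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    micro_p = tp / (tp + fp) if tp + fp else 1.0
    micro_r = tp / (tp + fn) if tp + fn else 1.0
    micro_f1 = 2 * micro_p * micro_r / (micro_p + micro_r) if micro_p + micro_r else 0.0
    macro_f1 = sum(per_word_f1) / len(per_word_f1) if per_word_f1 else 0.0
    return micro_f1, macro_f1
```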
The methodology also included a detailed analysis of word-level tokenization as an extreme baseline. For each task, the research team extracted a vocabulary from the training data, sorted by frequency, and then varied the size of the retained prefix, denoted as K. Training and test coverage were calculated using these top-K vocabularies, quantifying the fraction of tokens accounted for by the retained vocabulary in both datasets.
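A rough sketch of that coverage computation, assuming whitespace-tokenized word lists and a placeholder grid of K values:

```python
from collections import Counter

def coverage_curve(train_words, test_words, k_values):
    """Fraction of running tokens covered by the top-K most frequent
    training words, computed for both the training and test sets."""
    ranked = [w for w, _ in Counter(train_words).most_common()]
    curve = {}
    for k in k_values:
        vocab = set(ranked[:k])
        train_cov = sum(w in vocab for w in train_words) / len(train_words)
        test_cov = sum(w in vocab for w in test_words) / len(test_words)
        curve[k] = (train_cov, test_cov)
    return curve

# Placeholder grid; the paper's exact K values are not reproduced here.
# coverage_curve(train_words, test_words, [1_000, 10_000, 50_000, 100_000])
```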
This protocol allowed a controlled comparison of word-level tokenization against character and subword baselines, revealing the impact of vocabulary coverage on task performance across TrGLUE, NER, POS tagging, dependency parsing, and morphological analyses. Specifically, CoLA performance was reported as Matthews correlation, while SST-2 results were assessed with accuracy, both as functions of vocabulary size and coverage.
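For reference, both reported metrics can be computed directly from label predictions, for instance with scikit-learn; this is a generic sketch, not the paper's evaluation harness:

```python
from sklearn.metrics import accuracy_score, matthews_corrcoef

def cola_score(y_true, y_pred):
    """CoLA-style acceptability score: Matthews correlation coefficient."""
    return matthews_corrcoef(y_true, y_pred)

def sst2_score(y_true, y_pred):
    """SST-2-style sentiment score: plain accuracy."""
    return accuracy_score(y_true, y_pred)
```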
Vocabulary size and corpus scale impact Turkish subword model performance and morphological analysis
Researchers conducted a comprehensive study of Turkish subword tokenization, systematically varying vocabulary size and tokenizer training corpus size. The work explored the interplay between these factors and downstream task performance across semantic, syntactic, and morphology-sensitive probes. This investigation utilized data regimes reaching approximately 80 gigabytes, incorporating pre-transformer analyses and interpretability diagnostics for a broader and more explanatory treatment of the subject.
Evaluations encompassed natural language inference, sentence similarity, sentiment analysis, and named entity recognition, alongside parts-of-speech tagging and dependency parsing. Morphology-aware diagnostics were introduced, extending beyond coarse aggregates to include boundary-level micro and macro F1 scores, decoupled lemma atomicity versus boundary hits, and over/under-segmentation indices.
Character and word edit distances, alongside continuation rates and affix-type coverage, were also measured to provide a detailed analysis of tokenizer performance. The study revealed that character-level tokenization could be competitive on named entity recognition under certain settings. Vocabulary sizes were swept extensively, including regimes from 1,000 to 8,000 tokens, revealing segmentation behaviour and sequence-length pressures inherent in agglutinative languages.
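One way to make that sequence-length pressure visible is to track fertility, the average number of subword pieces per word, across the vocabulary sweep. Below is a rough sketch using the Hugging Face `tokenizers` API and the hypothetical tokenizer files from the earlier sketch:

```python
from tokenizers import Tokenizer

def fertility(tokenizer, words):
    """Average number of subword pieces emitted per whitespace word --
    higher fertility means longer input sequences for the model."""
    pieces = sum(len(tokenizer.encode(w).tokens) for w in words)
    return pieces / len(words)

# Hypothetical sweep over the tokenizers trained earlier.
# for vocab_size in [1000, 2000, 4000, 8000]:
#     tok = Tokenizer.from_file(f"wordpiece_tr_corpus_large_{vocab_size}.json")
#     print(vocab_size, fertility(tok, validation_words))
```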
Controlled comparisons were performed across multiple tokenizer families (WordPiece, byte-pair encoding, morphology-level, and character/byte baselines) under matched parameter budgets. The researchers explicitly varied tokenizer training corpus size and domain to study data-vocabulary coupling, a factor not addressed in prior work.
The TrGLUE benchmark was adopted, aggregating multiple semantic tasks to probe representation quality, and included TrCoLA, TrMNLI, TrMRPC, TrSST-2, and STS-B for comprehensive evaluation. These datasets facilitated a nuanced understanding of how vocabulary size and tokenizer choice impact performance across diverse linguistic tasks.
Morphological diagnostics reveal nuanced performance differences in Turkish subword tokenisation
Researchers have undertaken a comprehensive investigation into subword tokenization for the Turkish language, a morphologically rich language where words are formed by combining multiple morphemes. This study systematically varied both vocabulary size and the amount of training data used to create the tokenizers, comparing different tokenization methods, including WordPiece, morphology-level, and character-based approaches, under consistent computational constraints.
Evaluation encompassed semantic tasks like natural language inference and sentiment analysis, syntactic tasks such as part-of-speech tagging and dependency parsing, and crucially, morphology-sensitive probes to assess how well tokenizers handle the building blocks of Turkish words. The work introduces a detailed diagnostic toolkit focused on morphological aspects, moving beyond simple accuracy scores to examine factors like boundary-level precision and recall, the correct segmentation of word stems, and the coverage of different types of affixes.
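As an illustration of the stem-segmentation check, a minimal sketch of a lemma-atomicity test is shown below; the word, lemma, and decision rule are illustrative assumptions rather than the toolkit's exact definition:

```python
def lemma_is_atomic(tokenizer, word, lemma):
    """Check whether the word's stem/lemma survives as a single leading piece.

    E.g. for the word "kitaplarda" ("in the books") with lemma "kitap",
    the segmentation ["kitap", "##lar", "##da"] keeps the stem atomic,
    while ["kita", "##plarda"] does not.
    """
    pieces = tokenizer.encode(word).tokens
    return bool(pieces) and pieces[0] == lemma
```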
Results demonstrate the importance of considering the interplay between vocabulary size, training data, and the specific task at hand. Character-level tokenization proved competitive in certain scenarios, notably named entity recognition, while morphology-level tokenization offered benefits when appropriately scaled with training data.
The study acknowledges limitations in the range of evaluated model sizes and computational resources, suggesting that further research could explore the impact of even larger models and datasets. Future work should investigate whether these findings transfer to other morphologically rich languages and explore adaptive tokenization strategies that dynamically adjust vocabulary size and segmentation to the input text. The research establishes a reproducible foundation for building effective tokenizers for Turkish and offers actionable guidance for researchers working with other agglutinative languages, providing a nuanced understanding of the vocabulary-corpus-success relationship and the trade-offs among tokenization approaches.
👉 More information
🗞 Optimal Turkish Subword Strategies at Scale: Systematic Evaluation of Data, Vocabulary, Morphology Interplay
🧠 ArXiv: https://arxiv.org/abs/2602.06942
