Researchers are tackling the challenge of accurately pinpointing specific text segments, known as span labeling, when using powerful Large Language Models (LLMs) for tasks like named entity recognition and error detection. Danil Semin, Ondřej Dušek, and Zdeněk Kasner, all from the Institute of Formal and Applied Linguistics at Charles University, investigate the inconsistent results arising from current ad-hoc prompting strategies used with these generative models, which, unlike older systems, do not inherently highlight sections of the input text. Their work categorises existing approaches and introduces LogitMatch, a novel decoding method designed to ensure LLM outputs align precisely with spans of the input text, a significant step towards more reliable and consistent performance across diverse analytical applications. The research demonstrates that LogitMatch outperforms existing methods by resolving span-matching issues and establishes a robust baseline for future development.
LogitMatch constrains decoding to improve LLM span labeling accuracy
Unlike encoder-based models, which can explicitly point to parts of their input, generative LLMs lack this inherent capability, leading to inconsistent results from various ad-hoc prompting strategies. The study reveals that existing span labeling approaches often fail in characteristic ways, producing incorrect span copies, ambiguous multiple matches, or inaccurate indices, because the LLM cannot directly ground its outputs in the input text. To overcome these challenges, the team developed LogitMatch, a method that modifies the raw model logits during decoding so that only valid spans from the input text can be generated. The approach is applicable to any locally-deployed LLM without costly finetuning or architectural modifications, making it a versatile solution for a range of text analysis applications.
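To illustrate the idea, here is a minimal sketch of logit masking in PyTorch. The helper names, the END_OF_SPAN token, and the exact matching rule are illustrative assumptions, not the paper's implementation, which may handle tokenizer quirks and multi-span output differently.

```python
import torch

END_OF_SPAN = 2  # hypothetical token id used to terminate a span

def valid_next_tokens(source_ids, generated_ids):
    """Token ids that keep the generated sequence a contiguous
    subsequence of the source token sequence."""
    k = len(generated_ids)
    if k == 0:
        # Any source token may start a span.
        return set(source_ids) | {END_OF_SPAN}
    allowed = {END_OF_SPAN}  # the span may always be closed
    for start in range(len(source_ids) - k + 1):
        if source_ids[start:start + k] == generated_ids:
            if start + k < len(source_ids):
                allowed.add(source_ids[start + k])
    return allowed

def mask_logits(logits, allowed_ids):
    """Set every logit outside allowed_ids to -inf, so greedy or
    sampled decoding can only extend a valid input span."""
    masked = torch.full_like(logits, float("-inf"))
    idx = torch.tensor(sorted(allowed_ids), dtype=torch.long)
    masked[idx] = logits[idx]
    return masked

# Usage: at each decoding step, mask before picking the next token.
source = [17, 42, 42, 99, 7]  # token ids of the input text
generated = [42, 99]          # span generated so far
allowed = valid_next_tokens(source, generated)  # {7, END_OF_SPAN}
next_token = int(torch.argmax(mask_logits(torch.randn(128), allowed)))
```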
The findings indicate that while tagging strategies provide a robust baseline, LogitMatch offers a highly competitive alternative that addresses the shortcomings of standard matching strategies. The team formally defined the span labeling problem as identifying contiguous text sequences and assigning them to predefined categories through textual output generation. The research thus establishes a clear framework for robustly applying LLMs to span labeling tasks, offering practical solutions for extracting structured information from text and enabling more accurate and reliable text analysis with LLMs.
The work opens new avenues for applications in areas such as information extraction, error detection, and low-resource language processing, where finetuned encoder models may not be readily available. Experiments revealed that existing ad-hoc prompting strategies for span labeling often yield inconsistent results, prompting the team to categorise them into three families: tagging, indexing, and matching. The researchers evaluated these strategies across four diverse tasks, finding that tagging provides a robust baseline, while LogitMatch delivers significant improvements by eliminating issues inherent in matching-based methods.
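To make the three families concrete, the snippet below shows what each output format might look like for one sentence. The tag names and delimiters are illustrative assumptions, not the paper's verbatim prompt formats.

```python
text = "Charles University is in Prague."

# Tagging: re-emit the full input with XML-like tags around each span.
tagging_output = "<ORG>Charles University</ORG> is in <LOC>Prague</LOC>."

# Indexing: emit (start, end, category) offsets into the input.
indexing_output = "(0, 18, ORG); (25, 31, LOC)"

# Matching: copy each span verbatim; offsets are recovered afterwards
# by locating the copied string in the input text.
matching_output = "Charles University :: ORG; Prague :: LOC"
```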
The team measured performance gains by comparing LogitMatch against competitive matching strategies, demonstrating its superior ability to identify and categorise text spans correctly. LogitMatch achieves this by modifying the raw model logits so that only valid spans from the input text can be generated. Formally, the study defines span labeling over an input text of n tokens as producing a set of spans S = {(s_i, e_i, c_i)}, i = 1, …, m, where each span has a start index s_i, an end index e_i, and a category c_i, with the indices satisfying 1 ≤ s_i ≤ e_i ≤ n. Results demonstrate that while LLMs can match finetuned encoder models on span labeling tasks without task-specific training, a consistent approach has been lacking: the survey of existing strategies reveals a wide range of ad-hoc methods with unsatisfactory performance. This opens avenues for applying LLMs to low-resource tasks, such as evaluating factual accuracy and identifying rhetorical structures.
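A minimal Python rendering of this definition makes the index constraint explicit; this is a sketch with names of my own choosing, not code from the paper.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Span:
    start: int     # s_i, 1-based index of the first token
    end: int       # e_i, 1-based index of the last token
    category: str  # c_i, one of the predefined categories

def validate_spans(spans, n):
    """Check that every span satisfies 1 <= s_i <= e_i <= n
    for an input text of n tokens."""
    return all(1 <= sp.start <= sp.end <= n for sp in spans)

# Example: two spans over a 6-token sentence.
spans = [Span(1, 2, "ORG"), Span(5, 5, "LOC")]
assert validate_spans(spans, n=6)
```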
Span labeling strategies compared, with LogitMatch decoding offering improved reliability
The research demonstrates that while tagging with XML-like tags provides consistently robust results, matching strategies can be more token-efficient and often achieve competitive performance. To address the limitations of matching methods, the researchers introduced LogitMatch, a novel constrained decoding technique designed to enforce alignment between the output and valid input spans. The study acknowledges limitations, including its focus on parsing output text and a current inability to disambiguate multiple identical spans perfectly, though combining LogitMatch with indexed input offers partial mitigation. Future research could combine the strengths of each method and investigate how varied prompting examples and instructions affect efficiency. The findings suggest that Qwen3-8B is competitive with the larger Llama-3.3-70B, though factors like quantization and model release dates may influence these comparisons.
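For the tagging family, recovering spans means parsing the model's tagged output back into offsets. Below is a small sketch of such a parser; the format, the 0-based end-exclusive offsets, and the assumption of well-formed tags are my own choices for exposition, not the authors' actual parsing code.

```python
import re

def parse_tagged_output(tagged):
    """Convert XML-like tagged output into (start, end, category)
    spans over the untagged text, using 0-based end-exclusive
    offsets. Assumes the tags are well formed."""
    spans, stack, plain, pos = [], [], [], 0
    for m in re.finditer(r"<(/?)(\w+)>|([^<]+)", tagged):
        closing, cat, text = m.group(1), m.group(2), m.group(3)
        if text is not None:   # untagged text: advance the cursor
            plain.append(text)
            pos += len(text)
        elif closing:          # </CAT>: close the most recent open span
            start, open_cat = stack.pop()
            spans.append((start, pos, open_cat))
        else:                  # <CAT>: remember where the span starts
            stack.append((pos, cat))
    return "".join(plain), spans

plain, spans = parse_tagged_output(
    "<ORG>Charles University</ORG> is in <LOC>Prague</LOC>.")
assert spans == [(0, 18, "ORG"), (25, 31, "LOC")]
```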
👉 More information
🗞 Strategies for Span Labeling with Large Language Models
🧠 ArXiv: https://arxiv.org/abs/2601.16946
