Researchers are increasingly focused on how to steer Large Language Models (LLMs) beyond simple text generation, demanding explicit control over nuanced textual qualities like humor and persuasiveness. Arya Labroo from the University of Cambridge, Ivaxi Sheth and Mario Fritz from the CISPA Helmholtz Center for Information Security, alongside Vyas Raina from Apta and Amaani Ahmed from Royal Holloway, University of London, present a new evaluation framework to assess this ‘fine-grained’ control: specifically, how well LLMs manage multiple, distinct concepts simultaneously. Their work reveals a surprising and fundamental limitation: while LLMs excel at generating text with single attributes, performance often drops when they are asked to combine even intuitively independent concepts, exposing a struggle with compositional understanding. The research offers clear evidence of this gap and establishes a principled method for benchmarking future advances in multi-concept control for LLMs.
LLMs struggle with combined stylistic control
Researchers developed a framework to evaluate both single- and dual-concept control, deliberately pairing concepts such as clarity and humor that should, in principle, be separable. Experiments used medium-sized instruction-tuned models, ranging from 7 billion to 14 billion parameters, prompted across five discrete levels (0 to 4) representing varying degrees of each concept. Outputs were then judged by a stronger LLM via pairwise comparisons, providing a robust measure of controllability. The study shows that prompting, while effective for calibrating individual concepts, often fails when tasked with managing multiple stylistic attributes concurrently.
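To make the leveled-prompting setup concrete, here is a minimal Python sketch of how a context might be paired with a concept and a discrete level from 0 to 4. The prompt wording and the `build_prompt` helper are illustrative assumptions, not the paper's actual prompts.

```python
# A minimal sketch of leveled single-concept prompting, assuming a generic
# text-generation workflow. The prompt wording and the `build_prompt` helper
# are illustrative assumptions, not taken from the paper.

LEVELS = range(5)  # five discrete levels: 0 (absent) through 4 (maximal)

def build_prompt(context: str, concept: str, level: int) -> str:
    """Instruct the model to express `concept` at `level` in its output."""
    return (
        f"Rewrite the following text so that its {concept} is at level {level} "
        f"on a scale from 0 (none) to 4 (maximal), changing nothing else.\n\n"
        f"Text: {context}"
    )

# One prompt per level; each would be sent to the instruction-tuned model.
prompts = [build_prompt("Our product saves you time.", "humor", l) for l in LEVELS]
```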
The research team found that performance frequently drops sharply in dual-concept settings, even for intuitively orthogonal pairs, indicating an entanglement of concept dimensions that resists simple composition. The evaluation framework is model- and method-agnostic, offering a standardized way to measure controllability across future techniques and encouraging the development of more robust multi-dimensional stylistic control in language models. By identifying common failure modes, the work opens avenues for creating LLMs capable of generating text with nuanced and interpretable stylistic characteristics. Experiments used a judge-based evaluation, in which pairwise comparisons between generated outputs were assessed for adherence to the intended stylistic levels, providing a quantitative measure of fine-grained control and enabling systematic comparison across concepts and settings. The findings highlight a critical gap in current LLM capabilities and underscore the need for new approaches to truly composable, interpretable stylistic control, a crucial step towards more versatile and user-friendly language generation systems.
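The pairwise judging step can be sketched as follows: a stronger judge LLM compares every pair of generations for a concept, and win counts induce an empirical ranking. The `judge_prefers` stub and the Borda-style win count below are assumptions for illustration; the paper's exact judging prompt and aggregation are not reproduced here.

```python
# Sketch of converting pairwise judge verdicts into an empirical ranking.
# `judge_prefers` is a hypothetical stand-in for a call to the stronger
# judge LLM; the Borda-style win count is one simple aggregation choice,
# not necessarily the paper's exact procedure.
from itertools import combinations

def judge_prefers(a: str, b: str, concept: str) -> bool:
    """Return True if output `a` shows more of `concept` than `b` (stub)."""
    raise NotImplementedError("call the judge LLM here")

def empirical_ranks(outputs: list[str], concept: str) -> list[int]:
    """Rank outputs by pairwise wins: rank 0 = least of the concept."""
    wins = [0] * len(outputs)
    for i, j in combinations(range(len(outputs)), 2):
        if judge_prefers(outputs[i], outputs[j], concept):
            wins[i] += 1
        else:
            wins[j] += 1
    order = sorted(range(len(outputs)), key=lambda k: wins[k])
    ranks = [0] * len(outputs)
    for rank, k in enumerate(order):
        ranks[k] = rank
    return ranks
```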
LLM Controllability via Pairwise Concept Modulation Enables Fine-Grained Evaluation
Scientists developed a novel evaluation framework to assess fine-grained controllability of large language models (LLMs) across both single- and dual-concept scenarios. The research team focused on six linguistically distinct concepts: humor, persuasiveness, clarity, politeness, assertiveness, and formality, deliberately pairing concepts presumed to be independent, such as clarity versus humor. Experiments employed medium-sized instruction-tuned models in the 7B to 14B parameter range, prompted across five discrete levels, from 0 to 4, to modulate the presence of target concepts. The study measured controllability using rank correlations between intended and judged levels, enabling systematic comparison of performance under both single- and dual-concept conditions.
Researchers opted to evaluate prompting techniques, recognizing their accessibility and their demonstrated superiority over many representation-engineering methods for single-concept control, as evidenced by prior literature. The experimental setup provides a textual context x to the LLM alongside a target concept C_a and a desired level l, yielding an output y_l = G(x, C_a, l). For dual-concept control, the model receives instructions for two concepts, C_a and C_b, with desired levels l_a and l_b, generating output y = G(x, C_a, l_a, C_b, l_b). To isolate the controllability of C_a, C_b is held constant at a fixed level j while generations are obtained for all levels l_a, allowing evaluation of the model's ability to modulate C_a while maintaining C_b (see the sketch below). This approach precisely quantifies compositional limitations in LLMs, revealing instances where performance degrades when controlling multiple concepts simultaneously, even when those concepts are logically independent. Results demonstrate a consistent pattern across multiple LLMs (Llama-11B, Gemma-12B, and Qwen-14B) and various generative tasks, indicating a fundamental challenge in achieving nuanced stylistic control.
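A minimal sketch of that dual-concept protocol, assuming a hypothetical `generate` wrapper for the generation function G described above:

```python
# Sketch of the dual-concept protocol: hold C_b at a fixed level j and sweep
# the target concept C_a across all levels. `generate` is a hypothetical
# wrapper for the generation function G; it is not the paper's code.

def generate(context: str, concept_a: str, level_a: int,
             concept_b: str, level_b: int) -> str:
    """Stand-in for y = G(x, C_a, l_a, C_b, l_b)."""
    raise NotImplementedError("prompt the LLM with both concept instructions")

def sweep_concept_a(context: str, concept_a: str, concept_b: str,
                    fixed_level_j: int, n_levels: int = 5) -> list[str]:
    """Generations for l_a = 0..n_levels-1 while C_b stays at level j."""
    return [generate(context, concept_a, la, concept_b, fixed_level_j)
            for la in range(n_levels)]
```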
The team measured Spearman correlations (ρ) between intended concept levels and the empirical ranks of generated responses, providing a principled approach for quantifying multi-concept control. For the humor-persuasiveness pairing, the average Spearman correlation for Llama-11B in structured text generation dropped from 0.76±0.23 (single concept, humor) to 0.51±0.41 (dual-concept, humor and fixed persuasiveness) and 0.54±0.35 (dual-concept, humor and random persuasiveness). Similarly, for clarity-politeness, Llama-11B exhibited near-zero correlation (0.02±0.52) for single-concept clarity in argument generation, but this decreased further when combined with politeness. Data shows that Qwen-14B and Gemma-12B consistently outperformed Llama-11B across all settings, suggesting that model size and instruction tuning play a crucial role in maintaining disentanglement between stylistic dimensions.
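As a worked toy example (not the paper's data), the controllability score for one sweep can be computed as Spearman's ρ between the intended levels and the judged ranks:

```python
# Toy example (not the paper's data): controllability for one sweep as
# Spearman's rho between intended levels and judged ranks.
from scipy.stats import spearmanr

intended_levels = [0, 1, 2, 3, 4]  # levels requested in the prompts
judged_ranks = [0, 2, 1, 3, 4]     # empirical ranks from pairwise judging

rho, _ = spearmanr(intended_levels, judged_ranks)
print(f"Spearman rho = {rho:.2f}")  # 0.90 here: near-monotone control
```

A ρ near 1 means the judged ordering tracks the requested levels almost monotonically; the dual-concept drops reported above correspond to this correlation falling when a second concept is introduced.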
The experiments show that introducing a secondary concept consistently reduces alignment, as evidenced by decreased Spearman correlations. The study delivers systematic evidence of a gap in compositional control, even when concepts are intuitively independent, highlighting a critical area for future research in LLM development and multi-concept control. These findings have implications for applications requiring precise stylistic control, such as automated content creation and personalized communication.
Dual-Concept Control Reveals LLM Limitations
Scientists have developed a new evaluation framework to assess fine-grained controllability in large language models (LLMs), focusing on how well these models can adhere to multiple, distinct textual attributes simultaneously. The research systematically examines both single-concept and dual-concept control, utilizing linguistically distinct concepts such as persuasiveness and humor to gauge performance. Surprisingly, the findings reveal that LLM performance frequently diminishes when the model is tasked with controlling two concepts at once, despite the intuitive independence of those concepts. The evaluation framework establishes clear evidence of this gap, offering a principled method for measuring how effectively future models manage multi-concept control.
Results, obtained across several LLMs and generation tasks, show that while single-concept control is generally strong, introducing a secondary concept often leads to significant degradation, particularly in structured text generation. The authors note that their evaluation does not require internal disentanglement of concepts within the models; it only assesses whether models can track user-specified levels for each concept without interference. Future research could explore methods to improve compositional control, potentially through architectural innovations or training techniques that help models integrate multiple constraints; the authors suggest this direction is crucial for advancing the field of controllable text generation.
👉 More information
🗞 Funny or Persuasive, but Not Both: Evaluating Fine-Grained Multi-Concept Control in LLMs
🧠 ArXiv: https://arxiv.org/abs/2601.18483
