How well large language models generalize across tasks of varying difficulty remains a central question in artificial intelligence, with implications for both the curation of effective training data and the design of reliable evaluation benchmarks. Yeganeh Kordi, Nihal V. Nayak, and Max Zuo, alongside colleagues at Brown University, investigate this issue with a comprehensive analysis of generalization across difficulty levels. The team developed a novel method for ranking task difficulty that draws on the collective outputs of thousands of language models and the established Item Response Theory (IRT) framework from educational testing, thereby removing subjective human assessment. Their results demonstrate that consistent improvements across the full spectrum of difficulties are often unattainable, whether training focuses on easier or harder examples, highlighting the critical need for diverse difficulty levels in both training and evaluation datasets for large language models.
LLM Difficulty, IRT, and Human Metrics
This research first investigates how well different ways of assessing task difficulty for large language models (LLMs) agree with one another, including labels assigned by dataset creators, surface features of the questions, and IRT-based estimates. The goal was to determine how reliably these measures correlate and whether IRT offers a more objective assessment than dataset labels alone. The study reveals generally weak correlations between the different difficulty measures, suggesting that the difficulty assigned by dataset creators does not always reflect how LLMs actually perform or the objective characteristics of the questions.
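Agreement between two difficulty measures can be pictured with a rank correlation. The snippet below is a minimal, illustrative sketch on synthetic data; the names `human_level` and `irt_difficulty` and the use of Spearman's rho are assumptions for illustration, not the paper's exact statistical setup.

```python
# A minimal, illustrative comparison of two difficulty measures with a rank
# correlation on synthetic data. `human_level` and `irt_difficulty` are assumed
# names; the paper's exact analysis may differ.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
human_level = rng.integers(1, 6, size=200)                 # dataset-assigned levels 1-5
irt_difficulty = 0.1 * human_level + rng.normal(size=200)  # weakly related, by construction

rho, p_value = spearmanr(human_level, irt_difficulty)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")     # a weak positive correlation
```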
While IRT aims for objectivity, it does not perfectly capture difficulty either. Some moderate correlations with question features were observed: the number of reasoning steps in one dataset and question length in another correlated positively with IRT difficulty, while answer length consistently showed a negative correlation. To obtain the IRT estimates, the researchers leveraged existing evaluation results from thousands of LLMs collected from the Open LLM Leaderboard, avoiding the expense of running new inference. By treating LLMs as ‘students’ and benchmark problems as ‘questions’, the team jointly estimated item difficulty and model ability, providing a more nuanced picture of performance than metrics based solely on problem features or individual model accuracy.
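To make the ‘students and questions’ framing concrete, here is a minimal sketch of fitting a one-parameter (Rasch) IRT model to a binary correctness matrix by gradient ascent. It assumes a `responses` array of shape (n_models, n_items); the paper's exact IRT variant and fitting procedure may differ.

```python
# A minimal sketch of estimating item difficulty with a 1PL (Rasch) IRT model,
# assuming `responses` is a binary matrix with responses[i, j] = 1 when model i
# answers question j correctly. Illustrative only, not the authors' code.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fit_rasch(responses, lr=0.1, epochs=1000):
    n_models, n_items = responses.shape
    ability = np.zeros(n_models)    # latent ability theta_i for each LLM
    difficulty = np.zeros(n_items)  # latent difficulty b_j for each question

    for _ in range(epochs):
        # P(correct) = sigmoid(theta_i - b_j); `error` is the gradient of the
        # Bernoulli log-likelihood with respect to the logits.
        p = sigmoid(ability[:, None] - difficulty[None, :])
        error = responses - p
        ability += lr * error.mean(axis=1)
        difficulty -= lr * error.mean(axis=0)

    # The model is shift-invariant, so anchor the mean difficulty at zero.
    shift = difficulty.mean()
    return ability - shift, difficulty - shift

# Toy usage: 8 "models" answering 5 "questions".
rng = np.random.default_rng(0)
toy_responses = (rng.random((8, 5)) > 0.4).astype(float)
theta, b = fit_rasch(toy_responses)
print("estimated difficulties:", np.round(b, 2))
```

Questions that many low-ability models answer correctly end up with low estimated difficulty, while questions that only high-ability models solve are rated hard, which is what lets the difficulty score depend on model behavior rather than human judgment.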
The core of the research involved calculating IRT-based difficulty scores for examples across six datasets, revealing substantial divergence from human-assigned difficulty metrics. This finding underscores the limitations of relying on human judgment to gauge how difficult a task is for an LLM and highlights the value of model-centric difficulty estimation. To enable systematic analysis, each dataset was divided into ten equal-sized bins ordered by increasing difficulty, allowing the researchers to isolate and study generalization patterns. The team then trained LLMs on data from individual difficulty bins and evaluated their performance across all other bins, characterizing the extent of cross-difficulty generalization. Experiments demonstrate that while some generalization between adjacent difficulty levels is possible, consistent performance across the full range of difficulties remains elusive.
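The binning step itself is simple once per-example difficulty scores exist. The sketch below assumes a `difficulty` vector like the one estimated above, aligned with a list of `examples`; the function and variable names are illustrative, not the authors' implementation.

```python
# A minimal sketch of splitting a dataset into ten equal-sized difficulty bins,
# assuming `difficulty` is a per-example IRT difficulty vector aligned with
# `examples`. Names are illustrative.
import numpy as np

def split_into_difficulty_bins(examples, difficulty, n_bins=10):
    order = np.argsort(difficulty)              # indices from easiest to hardest
    index_bins = np.array_split(order, n_bins)  # ten roughly equal-sized bins
    return [[examples[i] for i in idx] for idx in index_bins]

# Toy usage: each bin can serve as a training set, with the remaining bins
# held out to measure cross-difficulty generalization.
examples = [f"question_{i}" for i in range(1000)]
difficulty = np.random.default_rng(1).normal(size=1000)
bins = split_into_difficulty_bins(examples, difficulty)
print([len(b) for b in bins])  # ten bins of 100 examples each
```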
Model Difficulty Ratings Reveal Limited Generalization
This research delivers a novel understanding of how large language models (LLMs) generalize across tasks of varying difficulty, a crucial factor in both data curation and model evaluation. Scientists systematically evaluated LLMs’ performance on six datasets, moving beyond previous work that relied on human assessments of task difficulty. This approach establishes a difficulty rating determined solely by model abilities, removing subjective human judgment.
Experiments reveal limited cross-difficulty generalization; training LLMs exclusively on easy data does not consistently improve performance on hard tasks, and vice versa. Scientists found that performance plateaus when models are trained on a single difficulty level, highlighting the importance of a balanced training dataset. This work contrasts with prior studies, which often reported either easy-to-hard or hard-to-easy generalization, and demonstrates that these findings are often dependent on the method used to assess task difficulty.
Difficulty Mismatch Limits Language Model Generalization
This research presents a detailed analysis of how well large language models (LLMs) generalize across tasks of varying difficulty. Results demonstrate that generalization is often limited when there is a significant gap in difficulty between the training data and the evaluation data, regardless of whether models are trained on easier or harder examples. This challenges the assumption that focusing training on a single difficulty level will reliably improve performance across the board. The findings highlight the importance of incorporating a range of difficulties in both the training and evaluation of LLMs, suggesting that shortcuts on difficulty coverage can be detrimental to overall performance.
👉 More information
🗞 Revisiting Generalization Across Difficulty Levels: It’s Not So Easy
🧠 ArXiv: https://arxiv.org/abs/2511.21692
