AI Models Now Assessed on Ability to Solve Problems, Not Just Recall Facts

Researchers are increasingly focused on evaluating the true intelligence of generative artificial intelligence models, moving beyond simple recall of learned information. Ruichuan An and Sihan Yang from Peking University, and Ziyu Guo from CUHK, along with colleagues, address a critical gap: assessing fluid intelligence, the ability to reason, adapt, and induce patterns in novel situations. They introduce GENIUS (Generative Fluid Intelligence Evaluation Suite), a benchmark spanning visual preference inference, abstract metaphor visualisation, and counter-intuitive physics simulation, to rigorously test this capability. The work, a collaboration between Peking University, CUHK, and PolyU with contributions from Wei Dai, Zijun Shen, Haodong Li, Renrui Zhang, Xinyu Wei, Guopeng Li, Wenshan Wu, and Wentao Zhang, reveals significant performance deficits in current models, pinpoints limitations in contextual understanding rather than generative capacity, and proposes a novel attention intervention strategy. Ultimately, GENIUS establishes a robust standard for evaluating fluid intelligence, pushing the field towards more dynamic and general-purpose reasoning in generative models.

Scientists have developed GENIUS, a new evaluation suite of 510 expert-designed samples that rigorously assesses "generative fluid intelligence": the capacity to discern patterns, adhere to constraints, and respond dynamically to unforeseen circumstances. This work addresses a critical limitation in current AI benchmarks, which largely focus on recalling pre-existing knowledge rather than on the ability to adapt and reason in novel situations. The suite's design incorporates three core primitives, Implicit Pattern Induction, Ad-hoc Constraint Execution, and Contextual Knowledge Adaptation, each designed to isolate a specific reasoning capability. Implicit Pattern Induction tasks, for example, require models to infer unstated visual preferences from a series of images and apply those preferences during image generation. Ad-hoc Constraint Execution challenges models with abstract, dynamically defined constraints, demanding logical reasoning within a visual or symbolic framework. To ensure task complexity, each instance within GENIUS incorporates multi-modal interleaved context, meaning the task cannot be solved by removing any single modality from the input. This design choice forces models to integrate information across different data types, mirroring the complexities of real-world reasoning.
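The three-primitive structure and the "no modality can be dropped" design constraint can be sketched in code. The field names and schema below are hypothetical illustrations, not the suite's actual data format; only the primitive names and sample counts come from the article.

```python
from dataclasses import dataclass

# Illustrative sample counts reported for GENIUS (510 total).
IMPLICIT_PATTERN = "implicit_pattern_induction"      # 86 samples
ADHOC_CONSTRAINT = "adhoc_constraint_execution"      # 213 samples
CONTEXT_ADAPT = "contextual_knowledge_adaptation"    # 211 samples

@dataclass
class Sample:
    """Hypothetical sketch of a GENIUS-style sample: a task category,
    an interleaved multi-modal context, and a generation prompt."""
    task: str
    context: list       # interleaved (modality, payload) pairs
    target_prompt: str

def is_interleaved(sample: Sample) -> bool:
    """A sample counts as multi-modal interleaved when its context mixes
    at least two modalities, so no single modality can be removed
    without losing information the task depends on."""
    modalities = {modality for modality, _ in sample.context}
    return len(modalities) >= 2

demo = Sample(
    task=IMPLICIT_PATTERN,
    context=[("image", "img_01.png"),
             ("text", "prefer warm colour palettes"),
             ("image", "img_02.png")],
    target_prompt="generate the next image in the series",
)
print(is_interleaved(demo))  # True: both text and image are required
```

A check like `is_interleaved` is the kind of filter a benchmark builder might apply during curation to reject samples solvable from one modality alone.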
Contextual Knowledge Adaptation assesses a model's ability to adjust its behaviour based on contextual cues, even when those cues contradict established common sense or pre-trained priors. The split of 86 samples for Implicit Pattern Induction, 213 for Ad-hoc Constraint Execution, and 211 for Contextual Knowledge Adaptation was not arbitrary: each sample was carefully designed to present a unique challenge demanding a synthesis of inductive inference, abstract reasoning, and adaptive inhibition. This systematic approach to benchmark construction allows a granular analysis of model strengths and weaknesses, pinpointing the specific areas where improvement is needed to achieve true generative fluid intelligence. Initial evaluations using GENIUS reveal significant performance deficits across twelve representative models, highlighting a crucial gap in their ability to reason and adapt dynamically. Diagnostic analysis shows that these deficits originate from limited context comprehension rather than from a lack of inherent generative capacity: models struggle to interpret and apply information presented within the immediate context of each sample, indicating a weakness in fluid intelligence rather than in image generation per se. The research also introduces a training-free attention intervention strategy aimed at mitigating these deficits; while not the primary focus of the work, it suggests a potential avenue for improving context comprehension in multimodal models. The 510 samples within GENIUS thus form a carefully constructed testbed for probing AI's ability to reason and adapt in real time without relying on pre-existing knowledge, a rigorous standard intended to guide future research beyond simple knowledge utilisation towards more dynamic, general-purpose reasoning capabilities.
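The article does not detail the training-free attention intervention, but the general idea of steering a model toward its in-context evidence at inference time can be illustrated. The sketch below adds a constant bonus to the attention logits of context-token positions before the softmax; the additive form and the `boost` value are assumptions for illustration, and the paper's actual strategy may differ.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over attention logits.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def boosted_attention(scores, context_mask, boost=1.0):
    """Training-free intervention sketch (hypothetical): add a constant
    bonus to the logits of context tokens before the softmax, nudging
    the model to weight in-context evidence more heavily without any
    retraining."""
    adjusted = scores + boost * context_mask  # mask is 1 at context tokens
    return softmax(adjusted, axis=-1)

# One query attending over five key positions; positions 1 and 3
# hold context tokens (e.g. the in-context preference examples).
scores = np.array([0.2, 0.5, 0.1, 0.4, 0.3])
mask = np.array([0.0, 1.0, 0.0, 1.0, 0.0])

baseline = softmax(scores)
boosted = boosted_attention(scores, mask, boost=1.0)
# The context positions receive strictly more attention mass.
print(boosted[[1, 3]].sum() > baseline[[1, 3]].sum())  # True
```

Because the intervention only rescales where existing attention flows, it leaves model weights untouched, which is what makes it training-free.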
For years, progress in AI has been largely defined by scaling up datasets and model sizes, effectively creating sophisticated pattern-matching machines. However, true intelligence demands more than recognising familiar shapes; it requires the capacity to generate novel solutions, to understand underlying principles, and to apply them flexibly. This is particularly crucial as we move towards deploying AI in unpredictable, real-world scenarios where rote learning will inevitably fail. The significance of this work lies in its diagnostic approach: it isolates the specific cognitive deficit in current models, a lack of contextual understanding rather than a lack of generative power, and provides a clear roadmap for future development. Ultimately, this isn't just about achieving higher scores on a benchmark; it's about building AI systems that can truly think, not just mimic thought. The next step will be to see whether these insights translate into more robust and adaptable AI across a wider range of applications, from robotics and autonomous systems to creative problem-solving and scientific discovery.

👉 More information
🗞 GENIUS: Generative Fluid Intelligence Evaluation Suite
🧠 ArXiv: https://arxiv.org/abs/2602.11144

Rohail T.

I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.

Latest Posts by Rohail T.:

Repulsive Interactions Between Electrons Enable Superconductivity in Two-Dimensional Systems

February 13, 2026
Atomic Interactions Boost Signal Strength for Future Quantum Technologies

February 13, 2026
Researchers Characterise Defects in Flexible Electronics, Improving Converter Reliability by 92 Percent

February 13, 2026