AI Models Now Assessed on Ability to Solve Problems, Not Just Recall Facts

Researchers are increasingly focused on evaluating the true intelligence of generative artificial intelligence models, moving beyond simple recall of learned information. Ruichuan An and Sihan Yang from Peking University, and Ziyu Guo from CUHK, along with colleagues, address a critical gap: assessing fluid intelligence, the ability to reason, adapt, and induce patterns in novel situations. They introduce GENIUS (Generative Fluid Intelligence Evaluation Suite), a benchmark spanning visual preference inference, abstract metaphor visualisation, and counter-intuitive physics simulation, to rigorously test this capability. The work, a collaboration between Peking University, CUHK, and PolyU with contributions from Wei Dai, Zijun Shen, Haodong Li, Renrui Zhang, Xinyu Wei, Guopeng Li, Wenshan Wu, and Wentao Zhang, reveals significant performance deficits in current models, pinpoints limitations in contextual understanding rather than generative capacity, and proposes a novel attention intervention strategy. Ultimately, GENIUS establishes a robust standard for evaluating fluid intelligence, pushing the field towards more dynamic and general-purpose reasoning in generative models.

Scientists have developed GENIUS, a new evaluation suite of 510 expert-designed samples that rigorously assesses "generative fluid intelligence": the capacity to discern patterns, adhere to constraints, and respond dynamically to unforeseen circumstances. This work addresses a critical limitation in current AI benchmarks, which largely focus on recalling pre-existing knowledge rather than on the ability to adapt and reason in novel situations. The suite's design incorporates three core primitives, Implicit Pattern Induction, Ad-hoc Constraint Execution, and Contextual Knowledge Adaptation, each designed to isolate a specific reasoning capability. Implicit Pattern Induction tasks, for example, require models to infer unstated visual preferences from a series of images and apply those preferences during image generation. Ad-hoc Constraint Execution challenges models with abstract, dynamically defined constraints, demanding logical reasoning within a visual or symbolic framework. To ensure task complexity, each instance within GENIUS incorporates multi-modal interleaved context, meaning the task cannot be solved by removing any single modality from the input. This design choice forces models to integrate information across different data types, mirroring the complexities of real-world reasoning.
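The three-primitive structure and the "no modality can be dropped" design constraint can be sketched in code. The field names and schema below are hypothetical illustrations, not the suite's actual data format; only the primitive names and sample counts come from the article.

```python
from dataclasses import dataclass

# Illustrative sample counts reported for GENIUS (510 total).
IMPLICIT_PATTERN = "implicit_pattern_induction"      # 86 samples
ADHOC_CONSTRAINT = "adhoc_constraint_execution"      # 213 samples
CONTEXT_ADAPT = "contextual_knowledge_adaptation"    # 211 samples

@dataclass
class Sample:
    """Hypothetical sketch of a GENIUS-style sample: a task category,
    an interleaved multi-modal context, and a generation prompt."""
    task: str
    context: list       # interleaved (modality, payload) pairs
    target_prompt: str

def is_interleaved(sample: Sample) -> bool:
    """A sample counts as multi-modal interleaved when its context mixes
    at least two modalities, so no single modality can be removed
    without losing information the task depends on."""
    modalities = {modality for modality, _ in sample.context}
    return len(modalities) >= 2

demo = Sample(
    task=IMPLICIT_PATTERN,
    context=[("image", "img_01.png"),
             ("text", "prefer warm colour palettes"),
             ("image", "img_02.png")],
    target_prompt="generate the next image in the series",
)
print(is_interleaved(demo))  # True: both text and image are required
```

A check like `is_interleaved` is the kind of filter a benchmark builder might apply during curation to reject samples solvable from one modality alone.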
Contextual Knowledge Adaptation assesses a model's ability to adjust its behaviour based on contextual cues, even when those cues contradict established common sense or pre-trained priors. The split of 86 samples for Implicit Pattern Induction, 213 for Ad-hoc Constraint Execution, and 211 for Contextual Knowledge Adaptation was not arbitrary: each sample was carefully designed to present a unique challenge demanding a synthesis of inductive inference, abstract reasoning, and adaptive inhibition. This systematic approach to benchmark construction allows a granular analysis of model strengths and weaknesses, pinpointing the specific areas where improvement is needed to achieve true generative fluid intelligence. Initial evaluations using GENIUS reveal significant performance deficits across twelve representative models, highlighting a crucial gap in their ability to reason and adapt dynamically. Diagnostic analysis shows that these deficits originate from limited context comprehension rather than from a lack of inherent generative capacity: models struggle to interpret and apply information presented within the immediate context of each sample, indicating a weakness in fluid intelligence rather than in image generation per se. The research also introduces a training-free attention intervention strategy aimed at mitigating these deficits; while not the primary focus of the work, it suggests a potential avenue for improving context comprehension in multimodal models. The 510 samples within GENIUS thus form a carefully constructed testbed for probing AI's ability to reason and adapt in real time without relying on pre-existing knowledge, a rigorous standard intended to guide future research beyond simple knowledge utilisation towards more dynamic, general-purpose reasoning capabilities.
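The article does not detail the training-free attention intervention, but the general idea of steering a model toward its in-context evidence at inference time can be illustrated. The sketch below adds a constant bonus to the attention logits of context-token positions before the softmax; the additive form and the `boost` value are assumptions for illustration, and the paper's actual strategy may differ.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over attention logits.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def boosted_attention(scores, context_mask, boost=1.0):
    """Training-free intervention sketch (hypothetical): add a constant
    bonus to the logits of context tokens before the softmax, nudging
    the model to weight in-context evidence more heavily without any
    retraining."""
    adjusted = scores + boost * context_mask  # mask is 1 at context tokens
    return softmax(adjusted, axis=-1)

# One query attending over five key positions; positions 1 and 3
# hold context tokens (e.g. the in-context preference examples).
scores = np.array([0.2, 0.5, 0.1, 0.4, 0.3])
mask = np.array([0.0, 1.0, 0.0, 1.0, 0.0])

baseline = softmax(scores)
boosted = boosted_attention(scores, mask, boost=1.0)
# The context positions receive strictly more attention mass.
print(boosted[[1, 3]].sum() > baseline[[1, 3]].sum())  # True
```

Because the intervention only rescales where existing attention flows, it leaves model weights untouched, which is what makes it training-free.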
For years, progress in AI has been largely defined by scaling up datasets and model sizes, effectively creating sophisticated pattern-matching machines. However, true intelligence demands more than recognising familiar shapes; it requires the capacity to generate novel solutions, to understand underlying principles, and to apply them flexibly. This is particularly crucial as we move towards deploying AI in unpredictable, real-world scenarios where rote learning will inevitably fail. The significance of this work lies in its diagnostic approach: it isolates the specific cognitive deficit in current models, a lack of contextual understanding rather than a lack of generative power, and provides a clear roadmap for future development. Ultimately, this isn't just about achieving higher scores on a benchmark; it's about building AI systems that can truly think, not just mimic thought. The next step will be to see whether these insights translate into more robust and adaptable AI across a wider range of applications, from robotics and autonomous systems to creative problem-solving and scientific discovery.

👉 More information
🗞 GENIUS: Generative Fluid Intelligence Evaluation Suite
🧠 ArXiv: https://arxiv.org/abs/2602.11144

Rohail T.

I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.

Latest Posts by Rohail T.:

Repulsive Interactions Between Electrons Enable Superconductivity in Two-Dimensional Systems

February 13, 2026
Atomic Interactions Boost Signal Strength for Future Quantum Technologies

February 13, 2026
Researchers Characterise Defects in Flexible Electronics, Improving Converter Reliability by 92 Percent

February 13, 2026