A new benchmark, CHANCERY, assesses artificial intelligence reasoning about corporate governance, utilising 79 real-world charters and 24 established principles. Evaluations of current models, including Claude 3.7 Sonnet and GPT-4o, reveal limited performance; reasoning agents fare better, though their capabilities remain imperfect.
The capacity of artificial intelligence to navigate complex regulatory frameworks represents a significant challenge in natural language processing. Assessing whether proposed actions align with established governing principles demands more than simple information retrieval; it requires nuanced reasoning and an understanding of precedent. Researchers at the Sentient Foundation – Lucas Irwin, Peiyao Sheng, Arda Kaz, and Pramod Viswanath – address this need with their new benchmark, CHANCERY: Evaluating corporate governance reasoning capabilities in language models. The work introduces a method for testing a model’s ability to determine whether proposed executive actions are consistent with the stipulations of corporate charters, utilising a dataset of 10,000 real-world charters.
New Benchmark Assesses AI Reasoning in Corporate Governance
Researchers have introduced CHANCERY, a new benchmark designed to rigorously evaluate reasoning capabilities within the complex domain of corporate governance. Existing legal datasets often lack the emphasis on intricate reasoning required to accurately assess artificial intelligence (AI) in this field.
CHANCERY challenges models by presenting them with a corporate charter – a foundational document outlining the rules governing an organisation – alongside a proposed action initiated by executives, boards, or shareholders. The AI must then classify whether the proposed action aligns with the established rules detailed within the charter. This approach moves beyond simple pattern recognition, demanding a deeper understanding of legal principles and their application to specific scenarios.
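To make the task format concrete, here is a minimal sketch of how a charter-and-action pair might be posed to a model as a binary classification problem. It is purely illustrative: the record fields, prompt wording, and the `ask_model` callable are hypothetical stand-ins, not the benchmark’s actual schema or evaluation harness.

```python
from dataclasses import dataclass

# Hypothetical record layout; CHANCERY's real schema may differ.
@dataclass
class CharterExample:
    charter_text: str      # full text of the corporate charter
    proposed_action: str   # action proposed by executives, the board, or shareholders
    consistent: bool       # ground-truth label: does the action comply with the charter?

def build_prompt(ex: CharterExample) -> str:
    """Frame the task as a binary judgement over charter plus proposed action."""
    return (
        "You are given a corporate charter and a proposed action.\n"
        "Answer CONSISTENT if the action complies with the charter, "
        "otherwise answer INCONSISTENT.\n\n"
        f"Charter:\n{ex.charter_text}\n\n"
        f"Proposed action:\n{ex.proposed_action}\n\n"
        "Answer:"
    )

def classify(ex: CharterExample, ask_model) -> bool:
    """ask_model is any callable that sends a prompt to an LLM and returns its text reply."""
    reply = ask_model(build_prompt(ex))
    return reply.strip().upper().startswith("CONSISTENT")

def accuracy(examples, ask_model) -> float:
    """Share of examples on which the model's verdict matches the label."""
    correct = sum(classify(ex, ask_model) == ex.consistent for ex in examples)
    return correct / len(examples)
```

Reported accuracies such as those discussed below correspond to the fraction of such verdicts that match the ground-truth labels.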
The creation of CHANCERY leveraged 24 established corporate governance principles and drew upon a dataset of 10,000 real-world corporate charters. Researchers carefully curated 79 diverse examples, representing a broad spectrum of industries, to ensure the benchmark accurately reflects the complexities of real-world legal scenarios.
Evaluations reveal the significant challenge posed by CHANCERY. State-of-the-art language models Claude 3.7 Sonnet and GPT-4o achieved accuracies of only 64.5% and 75.2% respectively, highlighting current limitations in AI’s ability to comprehend and apply complex legal principles.
Reasoning agents, employing frameworks such as ReAct and CodeAct, outperformed standard models, attaining accuracies of 76.1% and 78.1% respectively. ReAct is an iterative framework in which a model interleaves explicit reasoning steps with actions and the observations those actions return, while CodeAct lets the model write and execute code as part of its reasoning (see the sketch below). The gap suggests that simply increasing model parameters is insufficient for legal reasoning tasks; developing more effective reasoning frameworks is crucial.
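The sketch below illustrates the general ReAct pattern rather than the specific agents evaluated in the paper; `llm` and `search_charter` are hypothetical stand-ins for a model call and a retrieval tool over the charter text.

```python
def react_agent(question: str, llm, search_charter, max_steps: int = 5) -> str:
    """Minimal ReAct-style loop: the model interleaves thoughts, actions, and
    observations until it commits to a final answer or the step budget runs out."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(transcript + "Thought:")           # model continues the transcript
        transcript += f"Thought:{step}\n"
        if "Final Answer:" in step:                   # model has decided
            return step.split("Final Answer:")[-1].strip()
        if "Action:" in step:                         # model requested a lookup
            query = step.split("Action:")[-1].strip()
            observation = search_charter(query)       # e.g. retrieve a relevant charter clause
            transcript += f"Observation: {observation}\n"
    return "INCONSISTENT"  # conservative fallback if no answer emerged in time
```

CodeAct follows the same loop but has the model emit executable code as its action, with the execution result fed back as the observation.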
Detailed analysis identifies the question types that current models find hardest: those requiring nuanced interpretation of charter language and those applying governance principles to novel scenarios. Models often stumble on ambiguous wording, which demands an understanding of legal context and intent, and on extending established principles to situations a charter does not address directly.
The research contributes to a better understanding of the capabilities and shortcomings of current AI in legal reasoning, providing a valuable resource for researchers and practitioners. The authors promote responsible AI research practices by releasing both the code and data used to create CHANCERY, alongside comprehensive documentation, fostering transparency and reproducibility.
Researchers addressed key considerations regarding responsible research practices, acknowledging potential risks, bias mitigation, fairness evaluation, privacy protection, and dual-use concerns. They emphasise the importance of proactively addressing these concerns throughout the AI development lifecycle, ensuring that AI systems are used responsibly and ethically.
The benchmark is released under the MIT license, while the pre-existing charter dataset it builds on carries a CC-BY 4.0 license, ensuring appropriate attribution and usage rights and promoting accessibility and collaboration. The authors also declare the use of GPT-4o during benchmark creation, adding transparency to the research methodology.
Future work will focus on expanding the benchmark with a greater diversity of corporate charters and more complex reasoning scenarios. Researchers plan to incorporate a wider range of legal documents, explore different AI architectures and training techniques, and investigate explainable AI (XAI) methods that could expose the reasoning behind a model’s judgements and so enhance trust and transparency.
👉 More information
🗞 CHANCERY: Evaluating corporate governance reasoning capabilities in language models
🧠 DOI: https://doi.org/10.48550/arXiv.2506.04636
