Researchers Evaluate Language Models’ Reasoning with 115 German Tax Law Examination Questions

Researchers are addressing the limitations of large language models in highly structured domains such as German tax law, where precision and legal accuracy are paramount. Sebastian Wind of Friedrich-Alexander-Universität Erlangen-Nürnberg, Jeta Sopa of DATEV eG, and Laurin Schmid, working with colleagues at the Bavarian AI Taxation Laboratory, University of Technology Nuremberg, together with Quirin Jackl, Sebastian Kiefer, Fei Wu, and Martin Mayr of Friedrich-Alexander-Universität Erlangen-Nürnberg, Harald Köstler of RWTH Aachen University, Gerhard Wellein and Andreas Maier of Friedrich-Alexander-Universität Erlangen-Nürnberg, and Soroosh Tayebi Arasteh, present SteuerLLM, a specialised language model for German tax law analysis. This work is significant because it introduces SteuerEx, a novel benchmark created from authentic university tax examinations, and demonstrates that a domain-adapted model with 28 billion parameters can outperform larger, general-purpose models on realistic legal reasoning tasks, a crucial step towards reliable AI assistance in complex legal fields.

Scientists have developed a new benchmark and language model designed to excel in the complex domain of German tax law, addressing a critical limitation of current large language models (LLMs), which struggle in fields demanding strict adherence to formal rules, precise terminology, and legally binding structures. Tax law, with its need for exact statutory citation and rigorous numerical accuracy, presents a particularly challenging test case for artificial intelligence. SteuerEx employs an evaluation framework that mirrors real examination practice: answers are assessed at the statement level, with partial credit awarded for incremental correctness, giving a more realistic measure of legal reasoning ability than simple pass/fail metrics. Scores are reported as normalized total exam scores, calculated as a percentage of the maximum achievable 1,035.5 points, closely replicating standard academic assessment procedures. To support robust model training, a large-scale synthetic dataset was generated from this authentic examination material using a controlled retrieval-augmented pipeline, expanding the training corpus while maintaining fidelity to real-world legal challenges. Complementing the benchmark is SteuerLLM, a domain-adapted LLM with 28 billion parameters, trained on this synthetic dataset derived from genuine examination questions and grounded in authoritative legal sources. The model’s architecture uses a block expansion strategy, adding layers to increase domain-specific capacity without sacrificing general language understanding. Evaluations show that SteuerLLM consistently outperforms general-purpose LLMs of comparable and even significantly larger sizes, underscoring the importance of targeted data and architectural adaptation over sheer parameter count.
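The statement-level scoring scheme described above can be sketched in a few lines of Python. This is a minimal illustration of the idea, not the authors’ actual evaluation code: the class and function names are invented for this sketch, and only the 1,035.5-point maximum is taken from the article.

```python
from dataclasses import dataclass

@dataclass
class GradedStatement:
    """One discrete legal statement from an expert solution (illustrative)."""
    text: str
    points: float   # point value assigned to this statement
    awarded: float  # points the model's answer earned (0 <= awarded <= points)

def normalized_exam_score(statements, max_points=1035.5):
    """Sum the partial credit over all graded statements and normalize
    to a percentage of the maximum achievable exam score."""
    earned = sum(s.awarded for s in statements)
    return 100.0 * earned / max_points
```

Partial credit falls out naturally: an answer that gets some statements fully right and others partially right accumulates a fractional total rather than failing outright.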
All locally deployed language models were accessed and deployed between January and April 2025, with evaluations spanning April 2025 to January 2026 to ensure consistency and account for potential model updates. This rigorous temporal control, combined with the authentic examination-based benchmark, establishes a reproducible framework for evaluating domain-specific legal reasoning in language models. Across the SteuerEx benchmark, the 28-billion-parameter SteuerLLM consistently demonstrated strong performance, achieving scores competitive with, and often exceeding, substantially larger general-purpose language models of up to 671 billion parameters. This indicates that domain-specific training and architectural adaptation matter more than sheer model scale for this complex legal reasoning task. The researchers employed a block expansion strategy, adding trainable Transformer layers to a pretrained base model while preserving its existing parameters, yielding the 28-billion-parameter model with specialised depth and without complete fine-tuning. Smaller SteuerLLM variants, at 10 billion parameters, also remained competitive with mid-sized baseline models, further highlighting the efficiency of the domain-adaptation strategy. Analysis of performance across six core tax law domains revealed nuanced strengths and weaknesses, with comparisons to anonymised student examination results providing a concrete benchmark for assessing model competence relative to human academic performance. The evaluation workflow decomposed expert solutions into discrete, graded legal statements, each assigned a point value, enabling fine-grained assessment of partially correct reasoning and precise quantification of model performance.
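The block expansion strategy can be illustrated schematically. The sketch below is a simplified, framework-free rendering of the general idea (reminiscent of LLaMA Pro-style block expansion): freeze the pretrained layers and interleave new trainable layers between them. The `group_size`, class names, and placement rule are assumptions for illustration, not details taken from the paper.

```python
class Block:
    """Stand-in for one Transformer layer; `trainable` marks whether
    its parameters receive gradient updates."""
    def __init__(self, name, trainable):
        self.name = name
        self.trainable = trainable

def block_expand(base_blocks, group_size=4):
    """Freeze every pretrained block and insert one new trainable block
    after each group of `group_size` pretrained blocks. Only the new
    blocks are updated during domain training, so the base model's
    general language ability is preserved."""
    expanded = []
    for i, blk in enumerate(base_blocks, start=1):
        blk.trainable = False            # freeze pretrained parameters
        expanded.append(blk)
        if i % group_size == 0:          # interleave a fresh, trainable block
            expanded.append(Block(f"new_{i}", trainable=True))
    return expanded
```

In real implementations the new blocks are typically initialised to act as identities (e.g. zero-initialised output projections) so the expanded model starts out computing exactly what the base model did.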
The relentless pursuit of artificial general intelligence often overlooks the stubborn realities of specialised expertise: LLMs routinely stumble when confronted with domains demanding precision, codified knowledge, and unambiguous interpretation. This work, detailing the creation of SteuerLLM and the SteuerEx benchmark, tackles that challenge head-on in the notoriously complex field of German tax law. What distinguishes the research is its emphasis on domain adaptation rather than model size: by meticulously crafting a benchmark from genuine tax law exams and training a model on synthetically generated, legally grounded data, the researchers achieved superior performance compared with significantly larger general-purpose LLMs. This suggests that focused, high-quality training data, coupled with architectural adjustments, can yield more substantial gains than brute-force scaling. The open release of both the benchmark and the model weights is particularly commendable, fostering reproducibility and further innovation, although the synthetic data generation process may introduce biases mirroring those present in the original exam questions. The model’s performance also remains confined to German tax law, and its generalisability to other legal domains or tax regimes is unproven; future work should therefore explore methods for mitigating synthetic data bias and assess the transferability of this domain-adaptation approach. This success encourages a shift in focus away from simply building bigger models and towards intelligently curating and leveraging domain-specific knowledge to unlock genuine reasoning capability in artificial intelligence.

👉 More information
🗞 SteuerLLM: Local specialized large language model for German tax law analysis
🧠 ArXiv: https://arxiv.org/abs/2602.11081

Rohail T.

I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.
