Can Large Language Models Subtract Numbers? Study Reveals Subtraction Accuracy Lags Behind Addition

Large language models excel at many tasks, but fundamental arithmetic remains a challenge, particularly when it comes to subtraction. Mayank Jobanputra, Nils Philipp Walter, and Maitrey Mehta, along with their colleagues, investigated this weakness by systematically testing eight pretrained language models on addition and subtraction problems. Their work reveals a significant gap in performance, with subtraction accuracy consistently lagging behind addition, and highlights a tendency for models to correctly calculate the magnitude of the answer but frequently omit the crucial negative sign. The team’s probing analyses demonstrate that these models do internally register the need for a negative result, suggesting a disconnect between internal processing and final output, and their experiments show that targeted instruction-tuning can dramatically improve performance, bringing accuracy close to perfection. This research provides valuable insight into the limitations and potential for improvement in the arithmetic capabilities of these increasingly powerful artificial intelligence systems.

Subtraction, despite being structurally distinct from addition, has received comparatively little attention in the evaluation of large language models. This research evaluates eight pretrained models, spanning four families, on both addition and subtraction problems. Experiments reveal that subtraction accuracy consistently lags behind addition accuracy. The errors in subtraction are concentrated in cases where the first number is smaller than the second, and models frequently produce the correct magnitude but omit the negative sign in these instances. Probing analyses demonstrate that models internally encode whether results should be negative, yet this information is often not reflected in the generated outputs. The team further tests well-known remedies, including few-shot prompting and instruction tuning.

LLM Subtraction Performance Across Model Families

The study systematically investigates subtraction capabilities in large language models (LLMs), revealing a significant performance gap compared to addition, its structurally distinct counterpart. Researchers generated synthetic datasets to evaluate eight pretrained LLMs, spanning four families, focusing on single-token numbers to ensure consistent evaluation across models. The team carefully controlled the range of numeric values, aligning it with each LLM’s tokenizer, and created balanced datasets with an equal number of problems where the first number was greater than the second and where it was smaller. This design ensured that any observed performance differences stemmed from the models’ arithmetic reasoning rather than from variations in input representation.
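For concreteness, here is a minimal Python sketch of how such a balanced dataset could be constructed. The function name, the value range, and the record fields are illustrative assumptions; the paper aligns its number ranges with each model’s tokenizer, which is not reproduced here.

```python
import random

def generate_subtraction_dataset(n_samples: int, max_value: int = 999, seed: int = 0):
    """Build a balanced synthetic subtraction dataset (illustrative sketch).

    Half of the problems have a > b (non-negative result) and half have a < b
    (negative result), so sign handling can be analyzed separately from
    magnitude. The value range is an assumption chosen for illustration.
    """
    rng = random.Random(seed)
    samples = []
    for i in range(n_samples):
        a, b = rng.randint(0, max_value), rng.randint(0, max_value)
        if a == b:
            b = (b + 1) % (max_value + 1)  # avoid ties so every case is clearly a > b or a < b
        if (i % 2 == 0) != (a > b):        # alternate cases to keep the split exactly 50/50
            a, b = b, a
        samples.append({"a": a, "b": b, "answer": a - b,
                        "case": "a>b" if a > b else "a<b"})
    return samples

# Example: ten problems, half of them with negative answers.
for s in generate_subtraction_dataset(10):
    print(f"{s['a']} - {s['b']} = {s['answer']}  ({s['case']})")
```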

To probe the LLMs’ abilities, the researchers employed five distinct prompt formats, ranging from minimal equation style to verbose templates, to minimize the risk of spurious correlations and ensure generalizability of the results. Both zero-shot and n-shot prompting were used, with up to ten solved examples provided before each query to assess in-context learning capabilities. For instruction-tuned LLMs, the default system prompt was utilized, and up to 500 new tokens were sampled during inference. All experiments were conducted on a cluster of four H100 GPUs using the vLLM framework, without quantization, to maintain computational precision and reproducibility. The team extracted the final numerical answer from the LLMs’ generated text using a robust parsing mechanism, enabling quantitative comparison of performance across models and conditions. Dataset statistics show that the total number of samples ranged from 100 to 1,000,000, with a consistent balance between cases where a > b and cases where a < b.
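As a rough illustration of the prompting and answer-extraction steps described above, the sketch below builds an n-shot prompt in a minimal equation style and pulls a possibly negative integer out of the generated text. The template and the regular expression are assumptions; the paper’s five prompt formats and its parsing rules are not reproduced verbatim.

```python
import re

def build_prompt(a: int, b: int, shots: list[tuple[int, int]] = ()) -> str:
    """Assemble an n-shot prompt in a minimal equation style (assumed template)."""
    lines = [f"{x} - {y} = {x - y}" for x, y in shots]  # solved examples first
    lines.append(f"{a} - {b} =")                        # the query itself
    return "\n".join(lines)

def extract_answer(generated: str) -> int | None:
    """Return the first (possibly negative) integer in the model output.

    A simplified stand-in for the paper's parsing mechanism: keeping the
    optional leading minus sign lets omitted signs surface as wrong answers
    rather than as parsing failures.
    """
    match = re.search(r"-?\d+", generated)
    return int(match.group()) if match else None

print(build_prompt(17, 42, shots=[(9, 4), (3, 8)]))
print(extract_answer("The answer is -25."))  # -> -25
```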

LLMs Struggle with Subtraction, Especially When a < b

This work presents a systematic evaluation of subtraction capabilities in large language models (LLMs), revealing a significant performance gap compared to addition. Researchers evaluated eight pretrained LLMs spanning four families (Gemma-2, Qwen3, OLMo-2, and Llama-3) on both addition and subtraction problems. The results demonstrate that subtraction accuracy consistently lags behind addition, with some models achieving perfect accuracy on addition while only reaching approximately half that level on subtraction. This disparity highlights a fundamental challenge for LLMs when performing this non-commutative operation.

A key finding is that errors in subtraction are disproportionately concentrated in cases where a larger number is subtracted from a smaller one (a < b). In these cases, LLMs frequently produce the correct magnitude of the answer but omit the negative sign, indicating a difficulty in correctly representing the sign of the result. Probing analyses reveal that LLMs internally encode information about whether the result should be negative, yet this information is often not reflected in the generated output, suggesting a decoding-time failure. To address this limitation, researchers tested both few-shot learning and instruction-tuning techniques. While few-shot prompting yielded only modest and inconsistent improvements, instruction-tuned models achieved near-perfect accuracy in generating the correct negative sign, demonstrating a substantial recovery of performance. These findings suggest that subtraction is an inherently harder task for LLMs than addition, and that instruction tuning is crucial for achieving reliable results.
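A probing analysis of this kind typically fits a lightweight classifier on hidden states to predict whether the true result is negative. The sketch below shows that step with scikit-learn; the layer choice, the feature extraction, and the random stand-in arrays are assumptions made purely to keep the example runnable, not details taken from the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def train_sign_probe(hidden_states: np.ndarray, is_negative: np.ndarray) -> float:
    """Fit a linear probe predicting whether the result should be negative.

    `hidden_states` has shape (n_samples, hidden_dim) and would in practice be
    taken from a chosen layer of the LLM at the last prompt token; only the
    probing step itself is sketched here. Returns held-out accuracy.
    """
    X_train, X_test, y_train, y_test = train_test_split(
        hidden_states, is_negative, test_size=0.2, random_state=0, stratify=is_negative
    )
    probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    return probe.score(X_test, y_test)

# Random stand-ins for real model activations, purely to show the call signature.
rng = np.random.default_rng(0)
fake_states = rng.normal(size=(200, 64))
fake_labels = rng.integers(0, 2, size=200)
print(f"probe accuracy: {train_sign_probe(fake_states, fake_labels):.2f}")  # ~0.5 on random inputs
```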

Subtraction Errors in Large Language Models

This research presents a systematic investigation into the arithmetic capabilities of large language models, focusing specifically on subtraction. The study reveals that while these models often perform well on addition, their accuracy significantly decreases when performing subtraction, particularly when the result is a negative number. The team found that models frequently calculate the correct magnitude but fail to include the negative sign in their output, despite internal representations suggesting the model knows the result should be negative. The researchers tested various techniques to improve performance, finding that few-shot prompting offered modest gains, but instruction tuning proved substantially more effective, often achieving near-perfect accuracy in generating the correct negative sign. This suggests that targeted training can significantly address the identified weakness in subtraction. The work highlights subtraction as a valuable diagnostic tool for assessing numerical reasoning in large language models, advocating for its inclusion as a standard benchmark alongside addition in future research.
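One simple way to surface the sign-omission pattern described here is to bucket each model answer by whether it matches the target exactly, matches only in magnitude, or misses entirely. The helper below is a hypothetical illustration of that bookkeeping, not the paper’s evaluation code.

```python
from collections import Counter

def categorize(prediction: int | None, target: int) -> str:
    """Bucket a parsed model answer for sign-aware error analysis."""
    if prediction is None:
        return "unparsable"          # nothing numeric could be extracted
    if prediction == target:
        return "correct"
    if abs(prediction) == abs(target):
        return "sign_error"          # right magnitude, missing or flipped sign
    return "other"

# Hypothetical (prediction, target) pairs to show the buckets.
pairs = [(4, -4), (9, 9), (6, -6), (None, -6)]
print(Counter(categorize(p, t) for p, t in pairs))
# -> Counter({'sign_error': 2, 'correct': 1, 'unparsable': 1})
```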

👉 More information
🗞 Can LLMs subtract numbers?
🧠 ArXiv: https://arxiv.org/abs/2511.02795

Rohail T.

I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.
