The ability of large language models to reason through complex problems benefits greatly from chain-of-thought generation at inference time, but training models to reason this way typically relies on computationally expensive reinforcement learning. Purbesh Mitra and Sennur Ulukus, from the University of Maryland, address this challenge with a novel approach called Semantic Soft Bootstrapping (SSB). Their method sidesteps reinforcement learning through a self-distillation technique in which the model learns from subtly different contextual cues about the correctness of its own responses. The process automatically generates training data from raw problem-answer pairs, letting the model refine its reasoning and achieve significant accuracy gains on challenging mathematical benchmarks. The result is a substantial step forward in long-context reasoning without the limitations of traditional reinforcement learning.
Supervising Logits Improves Reasoning in LLMs
The researchers developed Semantic Soft Bootstrapping (SSB) to enhance the reasoning abilities of large language models after their initial training. SSB offers a simpler, more efficient alternative to techniques such as reinforcement learning by directly supervising the model’s internal output scores, known as logits, so that they align with those of a carefully designed “teacher” model. The teacher is prompted with both correct and incorrect solutions, and its outputs encode valuable reasoning information even from flawed attempts. The core of SSB is an offline distillation process: it requires no ongoing interaction or human feedback during training.
The researchers prompt a teacher model with both correct and incorrect answers, and the student model learns to match the teacher’s logits, focusing on the tokens within the answer sequence. Results show that SSB significantly improves performance on challenging reasoning benchmarks, including GSM8K, MATH500, and AIME2024, outperforming existing methods, and it achieves these gains without requiring longer generated responses. The method begins by prompting a base model to generate multiple solution attempts, or “rollouts”, for a given problem, then automatically categorizing them as correct or incorrect to create a curated dataset for subsequent training. SSB then constructs specialized prompts that combine the original problem statement with a representative correct solution and a contrasting incorrect solution. Given this context, the base model, functioning as a “teacher”, generates a single, detailed, verified solution that refines and explains the reasoning process.
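A minimal sketch of how this data-construction stage might look, assuming a Hugging Face-style causal language model and a crude exact-match answer checker; the prompt templates and helper names (`extract_final_answer`, `build_teacher_prompt`) are illustrative assumptions rather than the paper’s exact recipe:

```python
# Illustrative sketch of SSB-style data construction (prompt wording and helpers are assumptions).
import re
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen2.5-3B-Instruct"  # base model used in the paper's experiments
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16, device_map="auto")

def extract_final_answer(text: str) -> str:
    """Pull the last number out of a generated solution (crude GSM8K-style checker)."""
    numbers = re.findall(r"-?\d+\.?\d*", text.replace(",", ""))
    return numbers[-1] if numbers else ""

def generate_rollouts(question: str, n: int = 8, max_new_tokens: int = 512) -> list[str]:
    """Sample several solution attempts ('rollouts') from the base model."""
    prompt = f"Question: {question}\nSolve step by step, then state the final answer.\n"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        do_sample=True,
        temperature=0.8,
        num_return_sequences=n,
        max_new_tokens=max_new_tokens,
    )
    prompt_len = inputs["input_ids"].shape[1]
    return [tokenizer.decode(o[prompt_len:], skip_special_tokens=True) for o in outputs]

def build_teacher_prompt(question: str, correct: str, incorrect: str) -> str:
    """Combine the problem with one correct and one contrasting incorrect attempt,
    then ask the same base model (acting as teacher) for a single refined solution."""
    return (
        f"Problem:\n{question}\n\n"
        f"Here is a solution attempt that reaches the correct answer:\n{correct}\n\n"
        f"Here is a solution attempt that reaches an incorrect answer:\n{incorrect}\n\n"
        "Explain the correct reasoning carefully and give one verified, detailed solution."
    )

def make_training_pair(question: str, gold_answer: str):
    """Return a (teacher_prompt, correct_solution) pair, or None if all rollouts fall on one side."""
    rollouts = generate_rollouts(question)
    correct = [r for r in rollouts if extract_final_answer(r) == gold_answer]
    incorrect = [r for r in rollouts if extract_final_answer(r) != gold_answer]
    if not correct or not incorrect:
        return None  # need at least one of each to build the contrastive context
    return build_teacher_prompt(question, correct[0], incorrect[0]), correct[0]
```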
The researchers extract token-level logits from the teacher model’s answer and store them as “soft labels” representing the probability distribution over possible answer tokens. During training, a student model learns to match the teacher’s token distribution using a KL-based distillation loss, which avoids reward hacking while nudging the model’s output towards correct responses. This setup overcomes limitations of traditional reinforcement learning methods by training a model on its own generated reasoning, effectively using the same base model as both teacher and student. The team curated a dataset of paired examples by processing a large number of questions, enabling efficient offline distillation without human intervention or online reinforcement learning loops.
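To make the distillation objective concrete, here is a minimal sketch of a KL-based logit-matching loss restricted to answer tokens, assuming the teacher’s logits for the verified solution have been precomputed and cached; the temperature and masking details are assumptions rather than the paper’s exact formulation:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      answer_mask: torch.Tensor,
                      temperature: float = 1.0) -> torch.Tensor:
    """KL(teacher || student), averaged over answer-token positions.

    student_logits, teacher_logits: [batch, seq_len, vocab]
    answer_mask: [batch, seq_len], 1.0 on answer tokens and 0.0 elsewhere
    """
    t_logprobs = F.log_softmax(teacher_logits / temperature, dim=-1)
    s_logprobs = F.log_softmax(student_logits / temperature, dim=-1)
    # Token-level KL divergence between the teacher's soft labels and the student's distribution.
    kl = F.kl_div(s_logprobs, t_logprobs, log_target=True, reduction="none").sum(-1)
    # Supervise only the answer tokens, not the prompt/hint tokens.
    return (kl * answer_mask).sum() / answer_mask.sum().clamp(min=1.0)
```

Because the teacher logits are precomputed, training reduces to a standard supervised fine-tuning loop over the curated pairs rather than an online reinforcement learning loop.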
Experiments with the Qwen2.5-3B-Instruct model on the GSM8K dataset demonstrate substantial accuracy gains on challenging benchmarks, including a significant increase on MATH500 and a notable improvement on AIME2024. Detailed analysis of the training process revealed stable dynamics, with loss and gradient norm decreasing gradually over time. Notably, completion length did not increase significantly during training, suggesting that stronger reasoning does not necessarily require longer chains of thought or more tokens. In short, the method lets a model learn from its own hinted reasoning process, using the same model as both teacher and student and building paired training examples from existing problem-answer data, so the entire distillation runs offline with no human annotation or reinforcement learning. The authors note that further research is needed on the sample efficiency and scaling behavior of the method with larger models and more extensive datasets, and they suggest extending the technique to a wider range of domains.
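As a rough illustration of how one might verify that accuracy improves without longer completions, a small evaluation loop can track both exact-match accuracy and mean completion length; the prompt format and answer checker below are assumptions made for the sketch:

```python
import torch

def evaluate(model, tokenizer, problems, extract_final_answer, max_new_tokens: int = 1024):
    """Greedy-decode each problem, recording accuracy and mean completion length in tokens."""
    correct, lengths = 0, []
    for question, gold in problems:  # problems: list of (question, gold_answer) pairs
        prompt = f"Question: {question}\nSolve step by step, then state the final answer.\n"
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        with torch.no_grad():
            output = model.generate(**inputs, do_sample=False, max_new_tokens=max_new_tokens)
        completion_ids = output[0][inputs["input_ids"].shape[1]:]
        lengths.append(len(completion_ids))
        completion = tokenizer.decode(completion_ids, skip_special_tokens=True)
        correct += int(extract_final_answer(completion) == gold)
    return correct / max(len(problems), 1), sum(lengths) / max(len(lengths), 1)
```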
👉 More information
🗞 Semantic Soft Bootstrapping: Long Context Reasoning in LLMs without Reinforcement Learning
🧠 ArXiv: https://arxiv.org/abs/2512.05105
