Semantic Soft Bootstrapping Enables Long Context Reasoning in LLMs Without Reinforcement Learning, Achieving Gains of over 10.6% and 10%

The ability of large language models to reason through complex problems benefits greatly from chain-of-thought inference, but training these models typically relies on computationally expensive reinforcement learning techniques. Purbesh Mitra and Sennur Ulukus, from the University of Maryland, address this challenge with a novel approach called Semantic Soft Bootstrapping. Their method circumvents the need for reinforcement learning by employing a self-distillation technique, where the model learns from subtly different contextual cues regarding the correctness of its own responses. This process automatically generates training data from raw problem-answer pairs, enabling the model to refine its reasoning process and achieve significant improvements in accuracy on challenging mathematical benchmarks, demonstrating a substantial leap forward in long-context reasoning capabilities without the limitations of traditional reinforcement learning methods.

Supervising Logits Improves Reasoning in LLMs

Scientists have developed a new method, Semantic Soft Bootstrapping (SSB), to enhance the reasoning abilities of large language models after their initial training. SSB offers a simpler, more efficient alternative to techniques like reinforcement learning by directly supervising the model’s internal output scores, known as logits, to align with those of a carefully designed “teacher” model. This teacher model generates outputs for both correct and incorrect solutions, encoding valuable reasoning information even in flawed attempts. The core of SSB is an offline distillation process, meaning it doesn’t require ongoing interaction or human feedback during training.
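To make logit supervision concrete, here is a minimal sketch, assuming the Hugging Face transformers API, of how token-level logits are read out of a causal language model; the model name matches the experiments reported below, and the prompt string is purely illustrative.

```python
# Minimal sketch: reading token-level logits from a causal LM with
# Hugging Face transformers. The prompt string is illustrative only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-3B-Instruct"  # model used in the paper's experiments
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

inputs = tokenizer("Question: What is 12 * 7? Answer: 84", return_tensors="pt")
with torch.no_grad():
    # logits has shape (batch, seq_len, vocab_size); the scores at position t
    # define the model's distribution over the token at position t + 1
    logits = model(**inputs).logits
```

Matching these scores, rather than only the sampled tokens, is what lets a student inherit the teacher’s full distribution over plausible continuations.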

The method begins by prompting a base model to generate multiple solution attempts, or “rollouts”, for a given problem, then automatically categorizing them as correct or incorrect, creating a curated dataset for subsequent training. SSB then constructs specialized prompts that combine the original problem statement with a representative correct solution and a contrasting incorrect solution. Prompted with this contrastive context, the base model, functioning as a “teacher”, generates a single, detailed, and verified solution, refining and explaining the reasoning process, and the student model learns to match the teacher’s logits, focusing on the tokens within the answer sequence. Results demonstrate that SSB significantly improves performance on challenging reasoning benchmarks, including GSM8K, MATH500, and AIME2024, outperforming existing methods. Importantly, SSB achieves these gains without requiring longer generated responses.
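A hypothetical sketch of this curation step follows, assuming GSM8K-style “#### answer” markers; the `extract_answer` heuristic and the prompt template are illustrative stand-ins, not the paper’s exact implementation.

```python
# Hypothetical SSB-style data curation: split sampled rollouts by answer
# correctness and build a contrastive teacher prompt from one correct and
# one incorrect attempt. Wording and helpers are illustrative assumptions.
import re

def extract_answer(rollout: str):
    """Pull a final numeric answer from a rollout (toy GSM8K-style heuristic)."""
    m = re.search(r"####\s*(-?\d+(?:\.\d+)?)", rollout)
    return m.group(1) if m else None

def build_teacher_prompt(problem: str, gold: str, rollouts: list):
    correct = [r for r in rollouts if extract_answer(r) == gold]
    incorrect = [r for r in rollouts if extract_answer(r) != gold]
    if not correct or not incorrect:
        return None  # skip problems that lack a contrastive pair
    return (
        f"Problem: {problem}\n\n"
        f"A correct solution:\n{correct[0]}\n\n"
        f"An incorrect solution:\n{incorrect[0]}\n\n"
        "Write one detailed, verified solution, explaining the reasoning."
    )
```

The teacher’s response to such a prompt, together with its token-level logits, becomes one training example for the offline distillation stage described next.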

Researchers extract token-level logits from the teacher model’s answer, storing them as “soft labels” that represent the probability distribution over possible answer tokens. During training, a student model learns to match the teacher’s token distribution using a KL-based distillation loss, which avoids reward hacking while nudging the model’s output towards correct responses. This setup overcomes limitations of traditional reinforcement learning methods by training a model on its own generated reasoning, effectively using the same base model as both teacher and student. The team curated a dataset of paired examples by processing a large number of questions, enabling efficient offline distillation without human intervention or online reinforcement learning loops.
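The distillation objective can be sketched as a masked, token-level KL divergence between the teacher and student distributions; the temperature parameter and the masking convention below are assumptions for illustration, not details taken from the paper.

```python
# Minimal sketch of a KL-based distillation loss restricted to answer tokens.
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, answer_mask, tau=1.0):
    """KL(teacher || student), averaged over answer tokens only.

    student_logits, teacher_logits: (batch, seq_len, vocab_size)
    answer_mask: (batch, seq_len), 1.0 on answer tokens, 0.0 elsewhere
    """
    log_p_student = F.log_softmax(student_logits / tau, dim=-1)
    log_p_teacher = F.log_softmax(teacher_logits / tau, dim=-1)
    # per-token KL divergence between teacher and student distributions
    kl = (log_p_teacher.exp() * (log_p_teacher - log_p_student)).sum(dim=-1)
    mask = answer_mask.float()
    return (kl * mask).sum() / mask.sum().clamp_min(1.0)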

Experiments with the Qwen2.5-3B-Instruct model on the GSM8K dataset demonstrate substantial improvements in accuracy on challenging benchmarks, including a significant increase on MATH500 and a notable improvement on AIME2024. Detailed analysis of the training process revealed stable dynamics, with loss and gradient norm decreasing gradually over time. Notably, completion length did not increase significantly during training, suggesting that stronger reasoning does not necessarily require longer chains of thought or increased token usage. The authors acknowledge that further research is needed to explore the sample efficiency and scaling laws of the method with larger models and more extensive datasets, and they suggest extending the technique to a wider range of domains.

👉 More information
🗞 Semantic Soft Bootstrapping: Long Context Reasoning in LLMs without Reinforcement Learning
🧠 ArXiv: https://arxiv.org/abs/2512.05105

Rohail T.

I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.

Latest Posts by Rohail T.:

Quantum Rydberg RF Receiver Enhanced with Metamaterial Lens Achieves Improved 2.2 GHz and 3.6 GHz Sensitivity
December 6, 2025

GPU-Portable Density Functional Theory Achieves 2.0-2.8x Speedups on AMD MI300A and Intel GH200 Architectures
December 6, 2025

Hybrid Quantum-Classical Autoencoders Match Classical Performance in Network Intrusion Detection, Enabling Stronger Zero-Day Generalization
December 6, 2025