Instella: Open 3 Billion Parameter Language Models Achieve State-of-the-Art Performance

The pursuit of powerful artificial intelligence increasingly relies on large language models, but access to these technologies remains limited by closed-source designs, hindering wider research and development. Jiang Liu, Jialian Wu, and Xiaodong Yu, along with their colleagues, address this challenge by introducing Instella, a new family of fully open language models. Trained entirely on publicly available data using AMD Instinct MI300X GPUs, Instella achieves state-of-the-art performance among openly accessible models, rivaling even leading models of similar size despite requiring fewer training resources. The team further expands the capabilities of Instella with two specialized variants, Instella-Long for processing exceptionally long texts and Instella-Math for advanced mathematical reasoning, establishing a transparent and versatile platform that significantly advances open language modeling research.

LLM Evaluation Benchmarks And Datasets

This overview details a comprehensive collection of research papers and resources focused on Large Language Models (LLMs), highlighting a growing trend towards open-source development and robust assessment. Numerous benchmarks are designed to test general reasoning, knowledge, mathematical abilities, and long-context processing, utilizing datasets like BIG-Bench, HellaSwag, and MAmmoTH2 to challenge models with complex tasks. Others, such as DeepSeekMath and OpenMathInstruct-2, specifically target mathematical reasoning, while studies involving HELMET and ∞Bench address the increasing importance of evaluating models with very long context windows. The compilation also includes details on specific LLM models, such as Llama, Qwen2.5, Qwen3, and Gemma 2, alongside technical reports outlining their architectures and training procedures. Researchers are actively developing new training frameworks, like HybridFlow, and refining existing techniques, such as attention mechanisms, to improve model performance. The inclusion of resources like the OpenCode LLM Dataset and synthetic data generation methods demonstrates a commitment to expanding the availability of training data, fostering more community-driven research and collaboration.

Transparent Language Models with Synthetic Reasoning Data

Scientists engineered Instella, a family of three billion parameter language models, prioritizing transparency and reproducibility through openly available data and code. The development began with a two-stage pre-training process, initially using general-domain data containing four trillion tokens, followed by a second stage emphasizing reasoning with 57 billion tokens. To enhance reasoning capabilities, the team introduced a novel in-house synthetic dataset for mathematics, generated by converting problems into symbolic Python programs and creating diverse, solvable variations, ensuring both coverage and correctness. Weight ensembling further improved model performance.
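
A minimal sketch may help make the symbolic-program idea concrete. The example below is illustrative only and is not the authors' actual pipeline: a word problem is expressed as a parameterized Python template, parameters are resampled, and only variants whose computed answers pass a validity check are kept, so coverage grows without sacrificing correctness.

```python
import random

def apple_problem(apples_per_box: int, boxes: int, eaten: int) -> tuple[str, int]:
    """Symbolic form of a simple word problem: returns (question, answer)."""
    question = (
        f"A crate holds {boxes} boxes with {apples_per_box} apples each. "
        f"If {eaten} apples are eaten, how many apples remain?"
    )
    answer = apples_per_box * boxes - eaten
    return question, answer

def generate_variants(n: int, seed: int = 0) -> list[dict]:
    """Resample parameters and keep only variants whose answers are valid."""
    rng = random.Random(seed)
    variants = []
    while len(variants) < n:
        apples, boxes = rng.randint(2, 12), rng.randint(2, 10)
        eaten = rng.randint(0, 150)
        question, answer = apple_problem(apples, boxes, eaten)
        if answer >= 0:  # solvability check: discard impossible setups
            variants.append({"question": question, "answer": answer})
    return variants

if __name__ == "__main__":
    for v in generate_variants(3):
        print(v["question"], "->", v["answer"])
```

Because the answer is computed by the same program that generates the question, every retained variant is solvable and labeled correctly by construction.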

Following pre-training, Instella underwent supervised fine-tuning on 2.3 million high-quality instruction-response pairs spanning mathematics, coding, commonsense reasoning, and dialogue, enabling it to follow complex prompts and generalize across tasks. This was refined through direct preference optimization, aligning model outputs with human expectations for helpfulness, safety, and accuracy. To extend capabilities into long-context processing, the team developed Instella-Long, capable of handling sequences up to 128,000 tokens, achieved through continued pre-training and long-context supervised fine-tuning.
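
Direct preference optimization itself follows a well-known objective; the sketch below is a generic illustration rather than Instella's training code, assuming summed per-sequence log-probabilities from the policy being trained and from a frozen reference model over (chosen, rejected) response pairs.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO objective: push the policy to prefer chosen over rejected
    responses relative to a frozen reference model."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage with summed log-probabilities for a batch of four preference pairs.
policy_c = torch.tensor([-12.0, -9.5, -11.0, -8.0])
policy_r = torch.tensor([-14.0, -10.0, -13.5, -9.0])
ref_c = torch.tensor([-12.5, -9.8, -11.2, -8.3])
ref_r = torch.tensor([-13.0, -9.9, -12.8, -8.9])
print(dpo_loss(policy_c, policy_r, ref_c, ref_r))
```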

Recognizing the limited availability of long-context data, scientists synthesized instruction-following examples directly from pre-training documents. Further specialization led to Instella-Math, a reasoning-focused model leveraging reinforcement learning, representing the first fully open three billion parameter model to apply multi-stage group relative policy optimization entirely on open datasets. Training involved gradually increasing rollout lengths and incorporating Olympiad-level problems, resulting in substantial improvements in mathematical and logical reasoning, and demonstrating the potential of reinforcement learning to enhance reasoning even in compact models.
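
The defining step of group relative policy optimization is that each rollout's reward is normalized against the other rollouts sampled for the same prompt, removing the need for a learned value function. The snippet below sketches only that advantage computation under generic assumptions; the reward signal, rollout generation, and clipped policy update around it are omitted and not specific to Instella-Math.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """GRPO-style advantages. `rewards` has shape (num_prompts, rollouts_per_prompt);
    each rollout is normalized against the mean and std of its own group."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Toy example: 2 prompts, 4 sampled rollouts each, binary rewards from a verifier.
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                        [0.0, 0.0, 0.0, 1.0]])
print(group_relative_advantages(rewards))
```

Rollouts that beat their group's average receive positive advantages, which is what lets a compact model improve on verifiable math problems without a separate critic network.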

Open Language Models Excel at Reasoning Tasks

Scientists have developed Instella, a new family of fully open three billion parameter language models, offering complete transparency in both model weights and training procedures. The research team trained Instella using openly available data and code, achieving state-of-the-art results among fully open models and competitive performance with leading open-weight models of comparable size. Initial pre-training involved a four trillion token general-domain stage, followed by a second stage utilizing 57 billion tokens with an emphasis on reasoning-heavy domains. To further enhance reasoning capabilities, the team introduced a novel in-house synthetic dataset for mathematics, generated by converting problems into symbolic Python programs and creating diverse, solvable variations, expanding mathematical coverage while ensuring data correctness.

Weight ensembling across multiple pre-training runs further improved model performance. Following pre-training, Instella underwent supervised fine-tuning on 2.3 million high-quality instruction-response pairs, equipping it with the ability to follow complex prompts and generalize across diverse task formats, while direct preference optimization refined outputs to align with human expectations for helpfulness and factuality. Researchers extended Instella's capabilities into the long-context domain with Instella-Long, capable of processing sequences up to 128,000 tokens and trained with 40 billion tokens of continued pre-training data, delivering competitive performance on the challenging HELMET benchmark. Furthermore, Instella-Math, a reasoning-focused model, applied multi-stage group relative policy optimization entirely on open datasets, achieving substantial improvements on mathematical and strategic reasoning benchmarks and establishing Instella as a versatile and transparent alternative for language modeling research.
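
The exact ensembling recipe is not reproduced in this summary; one common way to realize weight ensembling across runs that share an architecture is uniform checkpoint averaging, sketched below with hypothetical checkpoint paths.

```python
import torch

def average_checkpoints(paths: list[str]) -> dict:
    """Uniformly average parameter tensors from several checkpoints of the
    same architecture, producing a single merged state dict."""
    avg_state = None
    for path in paths:
        state = torch.load(path, map_location="cpu")
        if avg_state is None:
            avg_state = {k: v.clone().float() for k, v in state.items()}
        else:
            for k, v in state.items():
                avg_state[k] += v.float()
    return {k: v / len(paths) for k, v in avg_state.items()}

# Hypothetical usage: merge three pre-training runs into one checkpoint.
# merged = average_checkpoints(["run_a.pt", "run_b.pt", "run_c.pt"])
# torch.save(merged, "merged.pt")
```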

Open Language Models with Strong Performance

The research team presents Instella, a family of fully open three billion parameter language models trained exclusively on openly available data and code. This work establishes a strong base pre-trained model, a supervised fine-tuned instruct model, a long-context model capable of processing 128,000 tokens, and a specialized model focused on reasoning capabilities, achieving state-of-the-art results among fully open models and remaining competitive with leading open-weight alternatives. Instella-Long demonstrates robust long-context handling, while Instella-Math achieves impressive gains on mathematical and strategic reasoning benchmarks. To promote reproducibility and further innovation, the team releases not only the model weights but also the training code, data recipes, and evaluation protocols, providing a transparent, performant, and extensible foundation for researchers and developers. The authors acknowledge that, like all language models, these models may exhibit limitations and biases inherent in the training data, and that ongoing research is needed to address these challenges. Future work will likely focus on scaling the models, exploring novel architectures, and developing more effective methods for alignment and control.

👉 More information
🗞 Instella: Fully Open Language Models with Stellar Performance
🧠 ArXiv: https://arxiv.org/abs/2511.10628

The Neuron

With a keen intuition for emerging technologies, The Neuron brings over 5 years of deep expertise to the AI conversation. Coming from roots in software engineering, they've witnessed firsthand the transformation from traditional computing paradigms to today's ML-powered landscape. Their hands-on experience implementing neural networks and deep learning systems for Fortune 500 companies has provided unique insights that few tech writers possess. From developing recommendation engines that drive billions in revenue to optimizing computer vision systems for manufacturing giants, The Neuron doesn't just write about machine learning; they've shaped its real-world applications across industries. Having built real systems used across the globe by millions of users, that deep technological base informs their writing on current and future technologies, whether AI or quantum computing.

Latest Posts by The Neuron:

UPenn Launches Observer Dataset for Real-Time Healthcare AI Training
December 16, 2025

Researchers Target AI Efficiency Gains with Stochastic Hardware
December 16, 2025

Study Links Genetic Variants to Specific Disease Phenotypes
December 15, 2025