MossNet: Mixture of State-Space Experts Emulates Multi-Head Attention for Advanced Language Modeling

Recent advances in artificial intelligence rely heavily on large language models, but building increasingly powerful models demands innovative architectural designs. Shikhar Tuli, James Smith, and Haris Jeelani of Samsung Research America, together with Chi-Heng Lin, Abhishek Patel, Vasili Ramanishka, and colleagues, present MossNet, a new architecture that improves on existing recurrent language models. MossNet does so by emulating multi-head attention, a key component of many successful language models, with a novel mixture-of-state-space-experts design. Extensive testing shows that MossNet not only outperforms transformer and state-space models of comparable size but also scales effectively to large training datasets, and it delivers favourable performance on devices such as the Samsung Galaxy S24 Ultra, offering a promising new direction for efficient, high-performing recurrent language models.

Recurrent architectures such as state-space models (SSMs) and gated recurrent models (GRMs) have emerged as efficient alternatives to transformers. However, prevailing SSM/GRM-based methods often emulate only a single attention head, potentially limiting their expressiveness. In this work, scientists propose MossNet, a novel mixture-of-state-space-experts architecture that emulates a linear multi-head attention (MHA). MossNet leverages a mixture-of-experts (MoE) implementation not only in the channel-mixing multi-layer perceptron (MLP) blocks but also in the time-mixing SSM kernels to realise multiple “attention heads”. Extensive experiments on language modelling and downstream evaluations show that MossNet outperforms both transformer- and SSM-based architectures of similar model size and data budgets. Larger variants of MossNet, trained on trillions of tokens, further confirm its scalability and strong performance.
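To make the idea concrete, here is a minimal sketch, in PyTorch, of what such a block could look like: several small state-space “experts” act like attention heads, and a learned gate mixes their outputs at every token. The module names, the diagonal recurrence, and the sizes are illustrative assumptions rather than the authors' implementation.

```python
# Illustrative sketch only: a time-mixing block in which several small diagonal
# state-space "experts" play the role of attention heads and a learned gate mixes
# their outputs per token. Names, the diagonal recurrence, and all sizes are
# assumptions for exposition, not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DiagonalSSMExpert(nn.Module):
    """One 'head': per-channel recurrence h_t = a * h_{t-1} + b * x_t, y_t = c * h_t."""

    def __init__(self, dim):
        super().__init__()
        self.a_logit = nn.Parameter(torch.zeros(dim))  # decay kept in (0, 1) via sigmoid
        self.b = nn.Parameter(torch.ones(dim))
        self.c = nn.Parameter(torch.ones(dim))

    def forward(self, x):                              # x: (batch, seq, dim)
        a = torch.sigmoid(self.a_logit)
        h = torch.zeros_like(x[:, 0])
        outs = []
        for t in range(x.size(1)):                     # sequential scan, written for clarity
            h = a * h + self.b * x[:, t]
            outs.append(self.c * h)
        return torch.stack(outs, dim=1)


class MixtureOfSSMExperts(nn.Module):
    """Mixes the outputs of several SSM experts with per-token gating weights."""

    def __init__(self, dim, num_experts=8):
        super().__init__()
        self.experts = nn.ModuleList(DiagonalSSMExpert(dim) for _ in range(num_experts))
        self.gate = nn.Linear(dim, num_experts)

    def forward(self, x):                              # x: (batch, seq, dim)
        weights = F.softmax(self.gate(x), dim=-1)      # (batch, seq, num_experts)
        expert_out = torch.stack([e(x) for e in self.experts], dim=-1)  # (batch, seq, dim, E)
        return (expert_out * weights.unsqueeze(2)).sum(dim=-1)


block = MixtureOfSSMExperts(dim=64, num_experts=8)
print(block(torch.randn(2, 16, 64)).shape)             # torch.Size([2, 16, 64])
```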

Mobile and GPU Language Model Benchmarks

A comprehensive evaluation of several language models, including variants of Mamba, Llama, and MossNet, reveals significant performance differences across metrics and hardware configurations. The results show that the transformer-based models generally require the most memory, particularly as the context length increases, while Mamba consistently uses the least. On long contexts, MossNet maintains a stable perplexity, a measure of prediction accuracy, even at 8192 tokens, benefiting from the use of sliding-window attention (SWA). Models such as Phi-1.5 and SmolLM2-1.7B struggle significantly with long contexts, exhibiting high perplexity and running out of memory.

Mamba-1.4B shows some improvement with SWA, but still faces challenges. These findings indicate that MossNet excels at handling long sequences, while Mamba prioritizes memory efficiency, and Llama3 struggles with both long contexts and memory usage.
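For readers who want to reproduce this kind of comparison, the sketch below shows the usual recipe for long-context perplexity: run a causal language model over an 8192-token window and exponentiate the mean next-token cross-entropy. The stand-in model here is a placeholder, not one of the checkpoints benchmarked in the paper.

```python
# Sketch of how a long-context perplexity number is typically computed for a causal
# language model: feed an 8192-token window and exponentiate the mean next-token
# cross-entropy. The "model" below is a random stand-in, not a paper checkpoint.
import torch
import torch.nn.functional as F


@torch.no_grad()
def perplexity(model, input_ids):
    """input_ids: (1, seq_len) token ids; model(input_ids) -> (1, seq_len, vocab) logits."""
    logits = model(input_ids)
    loss = F.cross_entropy(                      # predict token t+1 from positions <= t
        logits[:, :-1].reshape(-1, logits.size(-1)),
        input_ids[:, 1:].reshape(-1),
    )
    return torch.exp(loss).item()


vocab, context_len = 32000, 8192
stand_in = lambda ids: torch.randn(ids.size(0), ids.size(1), vocab)  # replace with a real LM
tokens = torch.randint(0, vocab, (1, context_len))
print(perplexity(stand_in, tokens))  # roughly vocab-sized for random logits; far lower for a trained model
```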

MossNet Emulates Attention Within State-Space Models

Scientists have developed MossNet, a new language model architecture that advances the field of natural language processing. This work introduces a novel approach to building efficient and high-performing recurrent models, addressing limitations found in existing transformer and state-space models. The core innovation lies in MossNet’s ability to emulate a multi-head attention mechanism, similar to those used in leading transformer models, but within a state-space model framework. The team mathematically demonstrates how a mixture of state-space experts can effectively model a linear mixture-of-experts multi-head attention, allowing MossNet to capture more complex relationships within text.
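The flavour of that argument can be checked numerically with the well-known linear-attention identity: causal linear attention can be computed as a recurrence over a running state, so each attention head corresponds to one recurrent state, that is, one state-space expert. The snippet below verifies the single-head identity; it illustrates the general idea rather than the paper's exact derivation.

```python
# Numerical check of the identity behind this claim, in its simplest (single-head,
# unnormalised) form: causal linear attention o_t = (sum_{s<=t} v_s k_s^T) q_t can be
# computed recurrently with the state update S_t = S_{t-1} + v_t k_t^T, so each
# "attention head" corresponds to one recurrent state -- one state-space expert.
# This is the standard linear-attention recurrence, not the paper's exact derivation.
import torch

T, d = 6, 4
q, k, v = (torch.randn(T, d) for _ in range(3))

# (1) attention form: causal linear attention, one head, no softmax
scores = (q @ k.T).tril()                 # score[t, s] = q_t . k_s, masked to s <= t
o_attention = scores @ v                  # (T, d)

# (2) recurrent form: a running state accumulates the outer products v_s k_s^T
S = torch.zeros(d, d)
o_recurrent = []
for t in range(T):
    S = S + torch.outer(v[t], k[t])
    o_recurrent.append(S @ q[t])
o_recurrent = torch.stack(o_recurrent)

print(torch.allclose(o_attention, o_recurrent, atol=1e-5))  # True: identical outputs
# Running several such states in parallel gives several heads -- the mixture-of-state-space-experts view.
```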

Extensive experiments comparing MossNet to other recent language models reveal superior performance on both language modeling tasks and downstream benchmarks. Specifically, the larger MossNet-8x200M+ model, trained with trillions of tokens, substantially outperforms the Qwen2.5-0.5B model, despite being trained on fewer pre-training tokens. The MossNet-8x200M+ model exhibits significantly faster prefill and generation speeds, along with reduced memory consumption, compared to transformer and state-space models with similar parameter counts. These results confirm that MossNet represents a compelling new direction for building efficient, high-performing recurrent language models, offering substantial improvements in both speed and resource usage.

MossNet Surpasses Transformers in Language Modelling

Researchers have developed MossNet, a novel recurrent language model architecture that effectively emulates multi-head attention, a key component of transformer models, within a state-space model framework. By integrating a mixture-of-experts approach into both the channel-mixing and time-mixing components of the model, MossNet captures a richer representation of temporal context than traditional state-space models. Empirical results demonstrate that MossNet outperforms both transformer-based and other state-space model architectures on language modeling and downstream tasks, even with comparable model sizes and training data. Larger versions of MossNet, trained on extensive datasets, further confirm its scalability and superior performance characteristics.

The team also investigated the impact of activating different numbers of experts within the mixture-of-experts framework, finding that a balance between activating sufficient experts and managing model capacity is crucial for optimal performance. While MossNet demonstrates significant advantages, the researchers acknowledge its architectural complexity may pose challenges for replication and broader adoption. Future work will focus on evaluating MossNet on a wider range of downstream tasks, including multi-modal understanding and real-time applications, and exploring adaptive optimizations for diverse hardware platforms.
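As a rough illustration of that trade-off, the snippet below uses standard top-k softmax gating, a common way to control how many experts are active per token; whether MossNet's router takes exactly this form is an assumption here.

```python
# Rough sketch of the expert-activation knob discussed above, using standard top-k
# softmax gating: only the k highest-scoring experts receive non-zero weight per
# token, so k trades active capacity against compute. Whether MossNet's router
# works exactly this way is an assumption.
import torch
import torch.nn.functional as F


def top_k_gate(router_logits, k):
    """router_logits: (tokens, num_experts) -> sparse per-token mixture weights."""
    top_vals, top_idx = router_logits.topk(k, dim=-1)
    weights = torch.zeros_like(router_logits)
    weights.scatter_(-1, top_idx, F.softmax(top_vals, dim=-1))
    return weights                                   # zero for inactive experts


router_logits = torch.randn(5, 8)                    # 5 tokens routed over 8 experts
for k in (1, 2, 4, 8):
    w = top_k_gate(router_logits, k)
    print(k, (w > 0).sum(dim=-1).tolist())           # k active experts per token
```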

👉 More information
🗞 MossNet: Mixture of State-Space Experts is a Multi-Head Attention
🧠 ArXiv: https://arxiv.org/abs/2510.26182

Rohail T.

I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.
