Japanese Dialogue AI Models Simultaneous Speech with New System

Researchers developed the first publicly available full-duplex spoken dialogue system for Japanese. Training involved pre-training on extensive Japanese dialogue data, followed by refinement using stereo recordings and synthetically generated data. Evaluations confirm the model surpasses existing Japanese systems in both fluency and semantic coherence.

Natural human conversation is rarely a strictly turn-based affair; overlaps, interruptions and acknowledgements – known as backchannels – are commonplace. Replicating this complexity in machine dialogue systems is a significant challenge, yet it is crucial for creating truly natural interactions. Researchers are now addressing this gap for the Japanese language, a domain where development of full-duplex systems – those capable of modelling simultaneous speech – has lagged behind English. Atsumoto Ohashi, Shinya Iizuka, Jingjing Jiang, and Ryuichiro Higashinaka, all from the Graduate School of Informatics at Nagoya University, Japan, detail their work in the paper “Towards a Japanese Full-duplex Spoken Dialogue System”. They present the first publicly available model of its kind, trained on a combination of existing and synthetically generated data, and show it to be superior to current Japanese baseline models in both fluency and semantic coherence.

Full-Duplex Dialogue Systems Advance for Japanese Language Processing

Recent advances in deep learning and natural language processing have propelled conversational AI forward, yet a notable gap persists in the creation of genuinely interactive dialogue systems, particularly beyond English. Conventional systems typically operate on a turn-taking basis, processing input sequentially and generating responses in alternation, which limits the potential for fluid, human-like conversation. This research delivers the first publicly available full-duplex spoken dialogue model for Japanese, enabling more realistic interactions. By adapting an existing English-based full-duplex system, the researchers demonstrate a viable pathway to comparable performance in conversational fluency and coherence. The work addresses a critical need for advanced dialogue technologies in Japanese, with potential applications in customer service, education, and entertainment.

The foundation of this research lies in the adaptation of Moshi, a full-duplex dialogue system originally developed for English, to Japanese. Full-duplex capability lets the system process two-way communication simultaneously, mirroring the natural flow of human conversation, where speakers frequently overlap and interject. The team hypothesised that transferring Moshi's core architecture to Japanese would accelerate development, avoiding the need to build a system from scratch. The approach required careful handling of the linguistic differences between English and Japanese, including variations in grammar, syntax, and phonology.

To optimise performance, the researchers employed a two-stage training process. First, the model was pre-trained on a large-scale Japanese spoken dialogue dataset, allowing it to learn fundamental patterns of Japanese speech, including common phrases, grammatical structures, and pronunciation characteristics. It was then fine-tuned on a high-quality dataset specifically designed for full-duplex dialogue modelling.
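The pre-train-then-fine-tune recipe can be illustrated in miniature. The sketch below is not the paper's training code: a tiny linear regression stands in for the dialogue model, a broad noisy dataset for the large pre-training corpus, and a small clean dataset for the high-quality fine-tuning set. The point is only that stage two continues from stage one's weights rather than starting over.

```python
import numpy as np

rng = np.random.default_rng(0)

def gd(w, X, y, lr, steps):
    """Full-batch gradient descent on mean squared error."""
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w = w - lr * grad
    return w

def mse(w, X, y):
    return float(np.mean((X @ w - y) ** 2))

true_w = np.array([1.5, -2.0])

# Stage 1: large, noisier "pre-training" corpus from a broad distribution.
X_pre = rng.normal(size=(1000, 2))
y_pre = X_pre @ true_w + rng.normal(scale=0.5, size=1000)

# Stage 2: small, high-quality "fine-tuning" set (narrower, cleaner).
X_ft = rng.normal(scale=0.3, size=(50, 2))
y_ft = X_ft @ true_w + rng.normal(scale=0.1, size=50)

w = np.zeros(2)
w = gd(w, X_pre, y_pre, lr=0.05, steps=200)   # stage 1: pre-train
pre_loss = mse(w, X_ft, y_ft)
w = gd(w, X_ft, y_ft, lr=0.05, steps=200)     # stage 2: fine-tune same weights
ft_loss = mse(w, X_ft, y_ft)

print(pre_loss, ft_loss)
```

Fine-tuning starts from the pre-trained weights, so the task loss can only be refined, not relearned from scratch, which is the economy the two-stage recipe buys.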

Recognising the need for robust and scalable systems, the team incorporated several established techniques: data augmentation, transfer learning, and model compression. Data augmentation artificially expanded the training data by applying transformations to existing examples. Transfer learning leveraged pre-trained language models to initialise model parameters, reducing data requirements and improving generalisation. Model compression reduced model size without compromising accuracy, enabling deployment on resource-constrained devices.
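For speech data, augmentation typically means perturbing waveforms. The sketch below is illustrative only and not taken from the paper; the function names are hypothetical, and it shows two common transformations, additive noise at a target signal-to-noise ratio and random gain, applied to a synthetic tone standing in for a speech clip.

```python
import numpy as np

rng = np.random.default_rng(42)

def add_noise(wave, snr_db, rng):
    """Mix in white noise at a target signal-to-noise ratio (in dB)."""
    signal_power = np.mean(wave ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(scale=np.sqrt(noise_power), size=wave.shape)
    return wave + noise

def random_gain(wave, rng, low_db=-6.0, high_db=6.0):
    """Scale the waveform by a gain drawn uniformly in dB."""
    gain_db = rng.uniform(low_db, high_db)
    return wave * (10 ** (gain_db / 20))

# A 1-second 440 Hz tone at 16 kHz stands in for a recorded speech clip.
t = np.arange(16000) / 16000
clip = np.sin(2 * np.pi * 440 * t)

# Each transformation yields a new training example from the same source clip.
augmented = [add_noise(clip, snr_db=20, rng=rng), random_gain(clip, rng)]
print(len(augmented), augmented[0].shape)
```

Each transformed copy preserves the linguistic content while varying acoustic conditions, which is what lets augmentation stretch a limited dialogue corpus.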

The team drew on corpora such as the Corpus of Spontaneous Japanese (CSJ) and the Travel Agency Task Dialogue Corpus, alongside more recent datasets like the RealPersonaChat corpus, providing a diverse range of conversational data for training and evaluation. They also used ZeRO (the Zero Redundancy Optimizer) to reduce memory usage during training. ZeRO works by partitioning model and optimizer states across multiple devices, shrinking the memory footprint on any single device.
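The partitioning idea behind ZeRO can be sketched conceptually. The toy below simulates ZeRO stage 1 (optimizer-state partitioning) on one machine: each "device" is a list entry that owns only its shard of the momentum buffer, updates its own slice of the parameters, and the slices are then gathered. This mirrors the idea only; the real DeepSpeed implementation also overlaps communication and can additionally partition gradients and parameters.

```python
import numpy as np

n_devices = 4
params = np.ones(8)
grads = np.full(8, 0.5)   # assume gradients have already been averaged
lr, beta = 0.1, 0.9

# Partition parameters and momentum: each device stores only 1/n_devices
# of the optimizer state instead of a full replica.
param_shards = np.array_split(params, n_devices)
grad_shards = np.array_split(grads, n_devices)
mom_shards = [np.zeros_like(p) for p in param_shards]

# Each device applies a momentum-SGD step to its own shard only.
for i in range(n_devices):
    mom_shards[i] = beta * mom_shards[i] + grad_shards[i]
    param_shards[i] = param_shards[i] - lr * mom_shards[i]

# All-gather: concatenate the shards to reconstruct the updated parameters.
updated = np.concatenate(param_shards)
print(updated)  # every entry: 1 - 0.1 * 0.5 = 0.95
```

With the state split four ways, each device holds a quarter of the momentum buffer, which is exactly the per-device memory saving the article describes.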

Evaluation results consistently showed the model outperforming baseline models across all metrics, and human judges rated its responses as more natural, coherent, and engaging. These findings provide strong evidence that the approach is effective for building high-performing full-duplex dialogue systems for Japanese.

Looking ahead, several promising avenues for future research remain. One key area is the development of more sophisticated methods for handling overlapping speech and interruptions. Another important area is the integration of contextual information, such as user history and preferences, to personalise the conversation and provide more relevant responses. Furthermore, exploring the use of multimodal inputs, such as facial expressions and body language, could enhance the naturalness and expressiveness of the conversation. Finally, investigating the ethical implications of conversational AI, such as bias and privacy, is crucial for ensuring responsible development and deployment of these technologies. Addressing these challenges will pave the way for even more realistic, engaging, and beneficial conversational AI systems in the future. This research represents a significant step towards realising that vision, demonstrating the potential of full-duplex dialogue systems to transform the way we interact with machines.

👉 More information
🗞 Towards a Japanese Full-duplex Spoken Dialogue System
🧠 DOI: https://doi.org/10.48550/arXiv.2506.02979

The Neuron

With a keen intuition for emerging technologies, The Neuron brings over 5 years of deep expertise to the AI conversation. Coming from roots in software engineering, they've witnessed firsthand the transformation from traditional computing paradigms to today's ML-powered landscape. Their hands-on experience implementing neural networks and deep learning systems for Fortune 500 companies has provided unique insights that few tech writers possess. From developing recommendation engines that drive billions in revenue to optimizing computer vision systems for manufacturing giants, The Neuron doesn't just write about machine learning—they've shaped its real-world applications across industries. Having built systems used across the globe by millions of users, that deep technological base informs their writing on current and future technologies, whether AI or quantum computing.
