The ability to precisely control the characteristics of synthesised speech remains a significant challenge for open-source text-to-speech systems. Jingbin Hu, Huakang Chen, Linhan Ma, Dake Guo, and colleagues from Northwestern Polytechnical University address this problem with VoiceSculptor, a framework that lets users design voices through natural language instructions. The system combines instruction-based voice design with high-fidelity voice cloning, enabling fine-grained control over attributes such as pitch, speaking rate, and emotion. By achieving state-of-the-art results on the InstructTTSEval-Zh benchmark and releasing all code and models publicly, the researchers aim to accelerate progress in reproducible, instruction-controlled speech synthesis.
VoiceSculptor comprises two primary components, voice design and voice cloning, which work in tandem to generate natural, nuanced speech. It builds on large-scale multimodal foundation models, drawing on advances in audio-centric models such as MiMo-Audio and Step-Audio2 that enhance speech representation and enable direct textual control.
The voice design component lets users refine speech characteristics through natural-language descriptions and iterative refinement with Retrieval-Augmented Generation (RAG), enabling attribute-level edits across multiple dimensions of speech. Once the desired voice is designed, it is rendered into a prompt waveform, which then serves as input to the voice cloning model for high-fidelity timbre transfer. This architecture allows detailed control over speech parameters, moving beyond the predefined prompts or embeddings used in earlier systems, and leverages the semantic modelling capabilities of large language models to overcome the scalability and expressiveness limitations of previous instruction-driven TTS approaches. Experiments demonstrate that VoiceSculptor achieves state-of-the-art performance on the InstructTTSEval-Zh benchmark, and the developers have fully open-sourced the system, including all code and pre-trained models, to promote reproducibility and further research. This commitment to open science lets other researchers build on the work and explore new avenues in instruction-controlled TTS, a significant step towards holistic, natural-language-driven content creation.
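The design-then-clone flow described above can be pictured as a two-stage pipeline. The sketch below is purely illustrative: every function name, the `DesignedVoice` structure, and the string stand-in for a waveform are hypothetical, not VoiceSculptor's actual API.

```python
from dataclasses import dataclass

# Illustrative two-stage flow: design a voice from a description plus
# attribute-level edits, render it to a prompt "waveform", then hand that
# prompt to a cloning stage. All names here are hypothetical.

@dataclass(frozen=True)
class DesignedVoice:
    description: str
    edits: tuple  # attribute-level refinements, e.g. "slower speaking rate"

def design_voice(description, edits=()):
    """Stage 1 (voice design): accumulate iterative refinements."""
    return DesignedVoice(description, tuple(edits))

def render_prompt_waveform(voice):
    """Render the designed voice into a prompt; a real system would emit
    audio, here a string token stands in for the waveform."""
    spec = " | ".join((voice.description,) + voice.edits)
    return f"<waveform:{spec}>"

def clone_speech(text, prompt_waveform):
    """Stage 2 (voice cloning): synthesise `text` in the timbre carried by
    the prompt waveform (placeholder concatenation)."""
    return f"{prompt_waveform} -> {text}"

voice = design_voice("warm, low-pitched narrator", ["slower speaking rate"])
print(clone_speech("Hello there.", render_prompt_waveform(voice)))
# <waveform:warm, low-pitched narrator | slower speaking rate> -> Hello there.
```

The point of the separation is that stage 2 only ever sees the rendered prompt waveform, which is why the designed voice can be handed off to an independent cloning model.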
Instruction-Based Voice Design and High-Fidelity Cloning
Researchers have developed VoiceSculptor, an open-source text-to-speech (TTS) system capable of fine-grained control over speech attributes such as pitch, speaking rate, age, emotion, and style directly from natural language descriptions. The team integrated instruction-based voice design with high-fidelity voice cloning in a unified framework, with iterative refinement through retrieval-augmented generation (RAG). A designed voice is rendered into a prompt waveform and then transferred to a cloning model for high-fidelity speech synthesis. On the InstructTTSEval-Zh benchmark, VoiceSculptor achieves state-of-the-art performance among open-source instruction-following TTS systems, scoring 77.2% on APS, 65.1% on DSD, and 59.6% on RP, for an average score of 67.3%.
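As a quick arithmetic check, the reported overall score is simply the unweighted mean of the three per-task scores:

```python
# The quoted 67.3% average is the unweighted mean of the three task scores
# reported on InstructTTSEval-Zh.
aps, dsd, rp = 77.2, 65.1, 59.6
average = round((aps + dsd + rp) / 3, 1)
print(f"{average}%")  # 67.3%
```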
VoiceSculptor excels at perceiving and rendering attributes, as evidenced by its leading APS and RP scores. Further analysis shows that VoiceSculptor-VD, combined with RAG, consistently outperforms other open-source models on the majority of metrics, reaching 75.7% on APS and 61.5% on RP. The designed prompt waveforms also preserve stylistic attributes when transferred to a downstream synthesis model such as CosyVoice2, maintaining strong consistency with the intended vocal characteristics. The result is a unified, instruction-driven voice design framework that achieves state-of-the-art performance while remaining fully open-source, including code and pre-trained models. The evaluations indicate that VoiceSculptor prioritises precise attribute control over stylistic similarity, yielding more faithful and reliable instruction-following behaviour during voice design and establishing a solid foundation for reproducible research in instruction-controlled TTS.
Instruction Following Improves Voice Cloning Fidelity
VoiceSculptor represents a significant advance in open-source text-to-speech technology, offering a unified framework for instruction-based voice design and high-fidelity voice cloning. The system allows for fine-grained control over speech attributes, including pitch, rate, age, emotion, and style, directly from natural language descriptions, and supports iterative refinement through retrieval-augmented generation. Evaluations on the InstructTTSEval-Zh benchmark demonstrate state-of-the-art performance compared to other open-source instruction-following TTS systems, consistently achieving higher scores across multiple metrics. The research confirms that increased model capacity, richer training data, and staged training strategies contribute to improved instruction-following capabilities.
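The retrieval-augmented refinement loop can be pictured, very loosely, as a nearest-description lookup over a reference corpus. The corpus entries, tokenisation, and word-overlap scoring below are toy placeholders, not the system's actual retriever.

```python
import re

# Toy RAG-style lookup: pick the reference voice description with the most
# word overlap with the user's instruction. Corpus and scoring are
# illustrative placeholders only.

CORPUS = [
    "deep male voice, slow speaking rate, calm emotion",
    "bright female voice, fast speaking rate, cheerful emotion",
    "elderly narrator, moderate pitch, storytelling style",
]

def tokens(s):
    """Lowercased word set, ignoring punctuation."""
    return set(re.findall(r"\w+", s.lower()))

def retrieve(query, corpus=CORPUS):
    """Return the corpus entry sharing the most words with the query."""
    return max(corpus, key=lambda doc: len(tokens(query) & tokens(doc)))

print(retrieve("fast cheerful female voice"))
# bright female voice, fast speaking rate, cheerful emotion
```

A real retriever would use learned embeddings rather than word overlap, but the role is the same: the retrieved reference grounds the next refinement step of the voice design.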
Ablation studies validate the effectiveness of key design choices, such as CoT-based attribute tokens and text-side cross-entropy supervision, which enhance instruction understanding, controllability, and robustness. Furthermore, the ability to reliably transfer designed voice characteristics to downstream speech synthesis models supports practical applications where voice design and speech generation are separate processes. The authors acknowledge limitations related to linguistic representation and reliance on external retrieval, and suggest future work focused on incorporating larger text datasets, instruction data augmentation, and more expressive audio codecs to address these areas.
👉 More information
🗞 VoiceSculptor: Your Voice, Designed By You
🧠 ArXiv: https://arxiv.org/abs/2601.10629
