On April 23, 2025, researchers Nicolas Jonason, Luca Casini, and Bob L. T. Sturm published “SMART: Tuning a symbolic music generation system with an audio domain aesthetic reward.” The paper explores how reinforcement learning, guided by ratings from Meta’s Audiobox Aesthetics model, can refine a piano MIDI model so that it produces more appealing compositions, and what that refinement costs in output diversity.
The study investigates whether aesthetic rating models can be used to fine-tune a symbolic music generation system via reinforcement learning. Using group relative policy optimization, the researchers fine-tuned a piano MIDI model with Meta Audiobox Aesthetics ratings as the reward. The optimization changed low-level features of the generated outputs and increased average subjective ratings in a listening test. However, over-optimization significantly reduced the diversity of the model’s outputs.
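The core idea of group relative policy optimization can be sketched in a few lines: several outputs are sampled for the same prompt, each is scored, and each score is normalized against its own group. The sketch below is only an illustration under stated assumptions, not the authors’ implementation: the function name, the example scores, and the use of the population standard deviation are all hypothetical choices, and the policy-gradient update itself is omitted.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its own group.

    Assumption: population standard deviation plus a small epsilon;
    real implementations may differ in these details.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical aesthetic scores for four samples generated from one prompt.
scores = [5.0, 6.0, 7.0, 6.0]
advantages = group_relative_advantages(scores)

# Samples scoring above the group mean get a positive advantage,
# samples scoring below it get a negative one.
for s, a in zip(scores, advantages):
    print(f"score={s:.1f} advantage={a:+.3f}")
```

Because the advantage is relative to the group, a sample is reinforced only when it outscores its siblings, which is one plausible reason a strong reward signal can pull outputs toward a narrow set of high-scoring patterns.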
Recent advances in machine learning have significantly improved models’ ability to generate high-quality symbolic music, such as MIDI files or sheet music. A notable development is the use of large language models (LLMs) trained with specialized musical knowledge, exemplified by NotaGen. This approach has been reported to outperform existing methods in terms of musicality and creativity.
In the experiments, several soundfonts were used, including MuseScore, FluidR3, Grandeur, and Yamaha. A soundfont is a collection of instrument samples used to render MIDI as audio, so it shapes how the musical nuances in a file are reproduced and, in turn, the perceived quality of the generated music. The choice of soundfont can significantly affect how realistic and expressive the output sounds.
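As a rough illustration of rendering one MIDI file with several soundfonts, the sketch below builds command lines for the FluidSynth command-line synthesizer. It is not taken from the paper: the file names, the output directory, and the helper name are hypothetical, and it assumes FluidSynth is installed and that its file-rendering options (-ni, -F, -r) behave as in its documentation.

```python
from pathlib import Path

def render_commands(midi_path, soundfonts, out_dir="renders", sample_rate=44100):
    """Build one FluidSynth command per soundfont for the same MIDI file."""
    commands = []
    for sf in soundfonts:
        # Output name combines the MIDI stem and the soundfont stem.
        out_wav = Path(out_dir) / f"{Path(midi_path).stem}_{Path(sf).stem}.wav"
        commands.append([
            "fluidsynth", "-ni",      # no MIDI input driver, no interactive shell
            "-F", str(out_wav),       # render to an audio file
            "-r", str(sample_rate),   # output sample rate in Hz
            str(sf), str(midi_path),
        ])
    return commands

# Hypothetical file names, used only for illustration.
cmds = render_commands("sample.mid", ["FluidR3_GM.sf2", "MuseScore_General.sf2"])
for cmd in cmds:
    print(" ".join(cmd))
```

Each command could then be passed to subprocess.run(cmd, check=True) after creating the output directory; comparing the resulting audio files is a simple way to hear how much the soundfont alone changes the result.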
NotaGen was trained on a diverse dataset featuring works by composers such as Chopin, Mozart, and Philip Glass, which supports varied and nuanced music generation. Evaluations using a linear mixed-effects model showed that NotaGen’s generated music received higher ratings than other systems, and the difference was statistically significant (p < 0.001). This suggests that listeners perceive NotaGen’s output as more appealing or of higher quality.
Looking ahead, integrating real-time feedback and multi-modal approaches could enhance interactivity and creativity. Techniques that preserve distinct musical styles while still allowing for innovation are essential, so that the model doesn’t blend genres into an indistinct mix. Additionally, methods like CLaMP 3 may help maintain coherence across different aspects of music generation.
NotaGen represents a significant advance in symbolic music generation by leveraging LLMs with specialized training. While the approach is promising, further details on its technical aspects and creativity metrics would give deeper insight into the model’s capabilities. The work opens possibilities for future developments in AI-generated music, with potential for both artistic expression and practical applications.
👉 More information
🗞 SMART: Tuning a symbolic music generation system with an audio domain aesthetic reward
🧠 DOI: https://doi.org/10.48550/arXiv.2504.16839
