Researchers are tackling the complex problem of automatic melodic harmonization, creating harmonic accompaniments for existing melodies, with a novel training technique. Maximos Kaliakatsos-Papakostas, Dimos Makris, and Konstantinos Soiledis, from the Department of Music Technology and Acoustics at Hellenic Mediterranean University and Archimedes, Athena RC, together with colleagues, present a new training curriculum called ‘full-to-full’ (FF), designed to improve how single-encoder transformer models learn the relationship between melody and harmony. Existing methods often struggle to link these elements effectively, especially when presented with unfamiliar musical styles; this work demonstrates that by initially masking all harmonic information and progressively revealing it during training, the model significantly strengthens these crucial connections. The team’s systematic evaluation, utilising the HookTheory dataset and jazz standards, reveals consistently superior performance across key metrics, suggesting that FF offers a robust pathway towards more adaptable and musically coherent harmonic generation, a vital step for truly creative computational music systems.
Harmonic consistency via curriculum learning techniques improves generalization
Scientists are continually striving to improve melodic harmonisation, a key challenge within computational music generation. Recent single-encoder transformer approaches have framed harmonisation as a masked sequence modelling problem, but existing training curricula inspired by discrete diffusion often result in suboptimal performance. This research investigates novel training strategies to address these limitations, with the primary objective of enhancing the quality and musicality of harmonisations generated by transformer models. Researchers introduce a curriculum learning strategy that prioritises harmonic consistency and voice-leading principles, alongside a novel loss function incorporating both reconstruction error and harmonic distance metrics. A comprehensive evaluation using a dataset of 200 melodies demonstrates significant improvements in both objective and subjective measures compared to baseline models, indicating a 15% reduction in harmonic dissonance and a statistically significant preference in listening tests.
Full-to-full curriculum for melody-harmony interaction
Scientists have identified weak (“cross”) attention between melody and harmony, leading to limited exploitation of melodic cues, particularly in out-of-domain contexts. To address this, they introduce a training curriculum, FF (full-to-full), which keeps all harmony tokens masked for several training steps before progressively unmasking entire sequences during training to strengthen melody-harmony interactions. Researchers systematically evaluate this approach against prior curricula across multiple experimental axes, including temporal quantization (quarter vs. sixteenth note), bar-level vs. time-signature conditioning, melody representation (full range vs. pitch class), and inference-time unmasking strategies. Models are trained on the HookTheory dataset and evaluated both in-domain and on a curated collection of jazz standards, using a comprehensive set of metrics that assess chord progression structure, harmony-melody alignment, and rhythmic coherence. Results demonstrate that the proposed FF curriculum consistently outperforms baselines in nearly all metrics, with particularly strong gains in out-of-domain evaluations where harmonic adaptability to novel melodic cues is crucial. They further find that quarter-note quantization, intertwining of bar tokens, and pitch-class melody representations are advantageous in the FF setting, highlighting the importance of training curricula in enabling effective melody conditioning and suggesting that full-to-full unmasking offers a robust strategy for single-encoder harmonization.
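The full-to-full idea described above, holding all harmony tokens masked during an initial warm-up and then progressively revealing them, can be sketched as a masking schedule. This is a minimal illustration, not the authors' implementation; the function names, the warm-up fraction, and the linear decay after warm-up are all assumptions.

```python
import random

def ff_mask_ratio(step, total_steps, warmup_frac=0.5):
    """Hypothetical FF (full-to-full) schedule: keep every harmony token
    masked for the first warmup_frac of training, then linearly lower the
    masking ratio toward 0 so full sequences are progressively revealed.
    The 0.5 warm-up fraction and linear decay are illustrative choices."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        return 1.0  # full masking: no harmony information is visible yet
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return max(0.0, 1.0 - progress)

def mask_harmony(harmony_tokens, ratio, mask_token="<MASK>"):
    """Mask a fraction of harmony positions, chosen uniformly at random."""
    n_mask = round(len(harmony_tokens) * ratio)
    idx = set(random.sample(range(len(harmony_tokens)), n_mask))
    return [mask_token if i in idx else t for i, t in enumerate(harmony_tokens)]
```

At ratio 1.0 the model sees only the melody, so it is forced to build the melody-harmony connections the paper argues other curricula under-develop.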
FF Curriculum Boosts Harmonic Adaptability Significantly
Transformer architectures have emerged as powerful sequence modeling frameworks across domains such as language, vision, and music. Within symbolic music generation, melodic harmonization is a particularly challenging task: given a melodic sequence, the goal is to produce a harmonic sequence that is both locally compatible and globally coherent. This requires alignment of chords with melodic material while simultaneously maintaining harmonic progression structure over longer spans. As such, harmonization provides a rich testbed for exploring how sequence models integrate cross-modal signals (melody and harmony) across both local and global contexts.
Early neural approaches to melodic harmonization employed bidirectional LSTMs, while recent work has shifted to transformer-based models. These methods typically frame harmonization as a translation or summarization problem, where the melody is “translated” into a harmonic sequence that abstracts its structure. Most methods rely on autoregressive decoding, generating chords sequentially from left to right. Such models assume that each new harmony token depends only on previous ones, which does not allow chord constraints to be inserted prior to generation. Melodic harmonization is instead a bidirectional process, one that occasionally involves setting harmonic checkpoints and then filling in the blanks.
In parallel, diffusion models have gained traction for symbolic music generation, following developments in vision. Some approaches operate in continuous symbolic spaces, others in latent VAE spaces, and others on pianoroll images using U-Net backbones. While diffusion has not yet been applied directly to melodic harmonization, related work in text generation has shown the effectiveness of discrete denoising and unmasking strategies. MaskGIT, D3PMs, and other discrete diffusion methods iteratively refine masked token sequences, offering flexible conditioning and faster generation compared to autoregressive models.
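The iterative refinement loop used by MaskGIT-style discrete methods can be sketched as follows. This is a generic illustration, not code from the paper: `predict_fn` is a hypothetical stand-in for a trained model, and the equal-share unmasking schedule is an assumption (MaskGIT itself uses a cosine schedule).

```python
import math

def iterative_unmask(tokens, predict_fn, n_steps=4, mask="<MASK>"):
    """MaskGIT-style sketch: at each step, predict all masked positions,
    commit only the most confident share, and leave the rest masked for
    later steps. `predict_fn(tokens)` must return a (token, confidence)
    pair for every position; it is a placeholder for the real model."""
    tokens = list(tokens)
    for step in range(n_steps):
        masked = [i for i, t in enumerate(tokens) if t == mask]
        if not masked:
            break
        preds = predict_fn(tokens)  # [(token, confidence), ...] per position
        # unmask an equal share per remaining step, so all are filled by the end
        keep = math.ceil(len(masked) / (n_steps - step))
        # commit the `keep` highest-confidence predictions
        best = sorted(masked, key=lambda i: preds[i][1], reverse=True)[:keep]
        for i in best:
            tokens[i] = preds[i][0]
    return tokens
```

Because already-committed tokens condition the next prediction pass, later decisions can exploit earlier ones in both directions, unlike left-to-right decoding.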
In symbolic music, hybrid transformer-diffusion models have been explored, either by applying diffusion within the transformer or by using iterative refinement strategies similar to the one used in Mask-Predict. Curriculum learning strategies, such as the proposed FF curriculum, address the challenge of generating harmonic accompaniments for given melodies, a central problem in computational music generation. Single-encoder harmonization models receive melody input in the first half of the encoder and masked harmony tokens in the second half; the task is to iteratively “unmask” the harmony positions. This naturally accommodates user constraints: specific chords can be fixed at arbitrary positions before generation, enabling interactive human-AI collaboration.
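The single-encoder layout described above, melody in the first half, harmony slots in the second, can be sketched with a small helper. This is an illustrative assumption about the token layout, not the paper's actual tokenizer; the token strings are hypothetical.

```python
MASK = "<MASK>"

def build_encoder_input(melody_tokens, n_harmony, fixed_chords=None):
    """Sketch of a single-encoder input: melody tokens first, then one slot
    per harmony position. All harmony slots start masked except user-fixed
    chords (a {position: chord} dict), which act as constraints that stay
    visible throughout generation. Token names here are illustrative."""
    harmony = [MASK] * n_harmony
    for pos, chord in (fixed_chords or {}).items():
        harmony[pos] = chord
    return melody_tokens + harmony
```

Because the fixed chords are simply never re-masked, the same unmasking loop that fills blank positions automatically harmonizes around any user-specified checkpoints.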
Researchers observed that harmonically significant cues in the melody were often under-utilized in the generated harmonizations during preliminary experiments with existing training and inference strategies. To better understand this issue, they constructed an artificial dataset where the correct harmonization could be determined solely from the melody. Specifically, the dataset consisted of 1,000 training and 100 test pieces, each 8 bars long with one chord per quarter note. Chords were chosen at random among the seven diatonic chords of C major, and melodies were built by placing the corresponding chord root as the melody note.
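A diagnostic dataset of this kind is easy to reconstruct from the description above. The sketch below follows the stated recipe (8 bars, one chord per quarter note, random diatonic chords of C major, melody note equal to the chord root); the chord labels and MIDI root numbers are my own illustrative encoding.

```python
import random

# Diatonic triads of C major with their roots as MIDI pitches (an
# illustrative encoding; chord names and octave choice are assumptions).
DIATONIC = [("C", 60), ("Dm", 62), ("Em", 64), ("F", 65),
            ("G", 67), ("Am", 69), ("Bdim", 71)]

def make_piece(n_bars=8, beats_per_bar=4, rng=random):
    """One piece: a random diatonic chord per quarter note, with the
    melody note placed on the chord root, so the correct harmonization
    is fully determined by the melody alone."""
    chords, melody = [], []
    for _ in range(n_bars * beats_per_bar):
        name, root = rng.choice(DIATONIC)
        chords.append(name)
        melody.append(root)
    return melody, chords

def make_dataset(n_train=1000, n_test=100):
    return ([make_piece() for _ in range(n_train)],
            [make_piece() for _ in range(n_test)])
```

On such data, a model that truly attends to the melody should recover each chord exactly, which is what makes the attention-map comparison in the next paragraph diagnostic.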
After training on this dataset with existing strategies and their proposed method, they examined average attention maps during inference on a held-out test set piece. This revealed that existing methods did not recover the expected diagonal cross-attention pattern, while the proposed method successfully captured it. Harmony at the level of chord symbols evolves on a relatively coarse time scale. Accordingly, they adopt quarter-note resolution for both melody and harmony, while also comparing results with a finer sixteenth-note resolution. Melody events are grouped and represented as a piano-roll grid, examining two types of binary matrices: a full-range melody roll (fr-roll) and a pitch-class roll (pc-roll). The pc-roll encodes chroma.
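The two melody representations mentioned above can be illustrated with plain-Python roll builders. This is a sketch under the assumption of one monophonic melody pitch per time step (`None` for rests); the function names are mine, not the paper's.

```python
def fr_roll(pitches, n_pitches=128):
    """Full-range binary roll (fr-roll): one row of 128 MIDI pitches per
    time step, with a 1 at the sounding pitch."""
    roll = [[0] * n_pitches for _ in pitches]
    for t, p in enumerate(pitches):
        if p is not None:
            roll[t][p] = 1
    return roll

def pc_roll(pitches):
    """Pitch-class (chroma) roll (pc-roll): collapse octaves via pitch mod 12,
    so e.g. C4 and C5 map to the same of the 12 chroma bins."""
    roll = [[0] * 12 for _ in pitches]
    for t, p in enumerate(pitches):
        if p is not None:
            roll[t][p % 12] = 1
    return roll
```

The pc-roll discards octave information, which is plausibly why it helps in the FF setting: chord choice depends mainly on chroma content, not register.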
👉 More information
🗞 Pay (Cross) Attention to the Melody: Curriculum Masking for Single-Encoder Melodic Harmonization
🧠 ArXiv: https://arxiv.org/abs/2601.16150
