Researchers are tackling the challenge of text style transfer, a crucial task for applications ranging from automated content adaptation to personalised communication. Ruoxi Liu and Philipp Koehn, both from the Department of Computer Science at Johns Hopkins University, present a method that combines parameter-efficient fine-tuning of large language models (LLMs) with round-trip translation. Their work addresses the limited availability of parallel data needed to train style transfer models by synthesising datasets from monolingual sources, effectively establishing a common stylistic base. The approach consistently outperforms zero-shot prompting and few-shot in-context learning, as measured by BLEU and style accuracy scores across multiple domains, and is further strengthened by retrieval-augmented generation, which helps maintain terminology and stylistic coherence.
Scientists have devised a clever technique to reshape writing styles using large language models. Text style transfer, modifying the way text is written while preserving its meaning, has long been held back by a lack of training data; the new method overcomes that limitation by generating its own examples through translation. The result promises more adaptable and nuanced AI writing tools for a variety of applications.
This work introduces a technique for synthesising parallel datasets from readily available monolingual text, effectively creating a shared stylistic baseline for both training and application of large language models (LLMs). The core innovation lies in employing roundtrip translation, a process where text is translated to a pivot language and back again, to ‘neutralise’ stylistic attributes and generate synthetic parallel data.
Initial experiments demonstrate that this approach consistently outperforms both zero-shot prompting and few-shot in-context learning techniques across four distinct domains. This study addresses a longstanding challenge in text style transfer (TST), a task concerned with altering characteristics like formality or tone without changing the underlying content.
To circumvent the scarcity of annotated parallel corpora, the research team devised a workflow leveraging roundtrip translation to create a pseudo-parallel corpus. By translating text into an intermediate language and then back to the original, they effectively stripped away the original stylistic markers, generating a ‘neutralized’ version paired with the original.
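To make the neutralisation step concrete, the sketch below roundtrip translates an English sentence through a pivot language using off-the-shelf MarianMT checkpoints from Hugging Face. The specific checkpoints and the German pivot are illustrative assumptions; the paper trains its own NMT models on general-domain bilingual corpora.

```python
# Minimal sketch of style neutralisation via roundtrip translation.
# Model names and the German pivot are assumptions for illustration.
from transformers import MarianMTModel, MarianTokenizer

def load_mt(model_name):
    tok = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)
    return tok, model

def translate(sentences, tok, model):
    batch = tok(sentences, return_tensors="pt", padding=True, truncation=True)
    generated = model.generate(**batch, max_new_tokens=256)
    return tok.batch_decode(generated, skip_special_tokens=True)

# English -> pivot -> English strips most surface stylistic markers.
en_de = load_mt("Helsinki-NLP/opus-mt-en-de")
de_en = load_mt("Helsinki-NLP/opus-mt-de-en")

styled = ["Hark! What light through yonder window breaks?"]
pivot = translate(styled, *en_de)
neutral = translate(pivot, *de_en)
# `neutral` is paired with `styled` to form one pseudo-parallel example.
```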
This synthetic dataset then enables supervised finetuning of LLMs for style transfer tasks. The team’s method begins with training neural machine translation models on large, general-domain bilingual corpora. These models form the basis of the roundtrip translation pipeline, which is then applied to a monolingual corpus in the target style. The resulting dataset, pairing style-neutral inputs with target-style outputs, is used to finetune LLMs, allowing them to learn the nuances of style transfer.
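A minimal sketch of this finetuning stage is given below, using LoRA adapters from the peft library so that only a small fraction of the LLM’s parameters are updated. The base checkpoint, prompt template, hyperparameters, and the `pseudo_parallel_pairs` variable (a list of neutral/styled sentence pairs) are assumptions for illustration, not the authors’ exact recipe.

```python
# Parameter-efficient finetuning on the pseudo-parallel corpus (hedged sketch).
from datasets import Dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base = "meta-llama/Llama-2-7b-hf"          # hypothetical choice of base LLM
tok = AutoTokenizer.from_pretrained(base)
tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained(base)

# Wrap the LLM with low-rank adapters: only the adapter weights are trained.
model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"))

def to_features(pair):
    # Style-neutral roundtrip output as input, the original styled sentence as target.
    text = (f"Rewrite in the target style:\n{pair['neutral']}\n"
            f"### Styled:\n{pair['styled']}{tok.eos_token}")
    return tok(text, truncation=True, max_length=512)

# pseudo_parallel_pairs: list of {"neutral": ..., "styled": ...} dicts (assumed).
train_set = Dataset.from_list(pseudo_parallel_pairs).map(to_features)

Trainer(
    model=model,
    args=TrainingArguments(output_dir="style-lora", per_device_train_batch_size=4,
                           num_train_epochs=3, learning_rate=2e-4),
    train_dataset=train_set,
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
).train()
```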
Furthermore, the researchers integrated retrieval-augmented generation (RAG) to improve the model’s ability to handle terminology and maintain stylistic consistency, particularly in complex domains. Evaluation using BLEU scores and style accuracy classifiers confirms the effectiveness of this approach, demonstrating a clear advantage over existing state-of-the-art methods.
Roundtrip translation creates synthetic data for language model style control
A central innovation of this work lies in the application of roundtrip translation to generate synthetic parallel corpora for training large language models (LLMs) in text style transfer. Initially, two neural machine translation (NMT) models were trained utilising a large-scale, general-domain bilingual dataset to facilitate translation between English and a pivot language.
This established a robust translation pipeline essential for the subsequent style neutralisation process. Monolingual corpora, exhibiting a consistent stylistic characteristic, were then processed through this NMT pipeline via roundtrip translation, translating from the source style into the pivot language and back into English. The resulting text, devoid of the original stylistic attributes, formed the basis of a pseudo-parallel corpus, pairing style-neutral sentences with their target-domain counterparts.
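The assembly of that corpus amounts to pairing each original, stylistically marked sentence with its roundtrip-translated counterpart, roughly as sketched below. The `roundtrip` helper stands in for the en → pivot → en pipeline above, and the JSONL record format is an assumption for illustration.

```python
# Sketch of pseudo-parallel corpus construction: one record per sentence pair.
import json

def build_pseudo_parallel(styled_sentences, roundtrip, out_path="pseudo_parallel.jsonl"):
    """Pair each styled sentence with its style-neutral roundtrip translation."""
    with open(out_path, "w", encoding="utf-8") as f:
        for styled in styled_sentences:
            neutral = roundtrip(styled)  # en -> pivot -> en, stripping stylistic markers
            f.write(json.dumps({"neutral": neutral, "styled": styled},
                               ensure_ascii=False) + "\n")
```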
This synthetic dataset enabled supervised fine-tuning of LLMs specifically for text style transfer, addressing the common limitation of scarce parallel corpora in most style domains. By creating a shared, neutral input style during both training and inference, the method aims to improve the consistency and controllability of style transformations. To further enhance robustness, particularly when encountering complex or unseen stylistic nuances, a retrieval-augmented generation (RAG) system was integrated.
This system leverages terminology and name knowledge during inference, ensuring stylistic consistency and mitigating potential errors. Queries are roundtrip translated prior to inference, aligning the input with the training data’s neutral style and promoting coherence between training and deployment phases. This approach differs from direct fine-tuning or prompting methods by actively manipulating the input to match the model’s learned stylistic baseline.
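A hedged sketch of that inference path follows: the query is roundtrip translated into the neutral style, retrieved terminology is prepended to the prompt, and the finetuned model generates the styled output. The retrieval function and prompt layout are illustrative assumptions rather than the authors’ exact implementation.

```python
# Inference with RAG over a roundtrip-translated query (assumed prompt format).
def transfer_style(query, roundtrip, retrieve_terms, model, tok, k=5):
    # 1) Align the input with the style-neutral training distribution.
    neutral = roundtrip(query)

    # 2) Retrieve terminology / proper-name entries relevant to the query
    #    (retrieve_terms is an assumed backend returning (source, target) pairs).
    terms = retrieve_terms(neutral, top_k=k)
    glossary = "\n".join(f"- {source}: {target}" for source, target in terms)

    # 3) Prompt the finetuned LLM with the glossary and the neutral text.
    prompt = ("Use the following terminology when rewriting:\n"
              f"{glossary}\n\n"
              f"Rewrite in the target style:\n{neutral}\n### Styled:\n")
    inputs = tok(prompt, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=256, do_sample=False)
    return tok.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
```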
Roundtrip translation enhances parameter-efficient style transfer and text generation quality
Employing roundtrip translation to synthesise parallel datasets, this work achieves substantial gains in text style transfer through parameter-efficient fine-tuning of large language models. At its core, the method generates ‘neutralised’ text, establishing a shared input style for both training and inference. The proposed technique consistently outperforms zero-shot prompting and few-shot in-context learning approaches across four distinct domains.
Style accuracy scores demonstrate a clear advantage, indicating a heightened ability to reflect the target style after transformation. BLEU, a standard metric for evaluating text generation quality, further quantifies the improvement achieved by this method. Performance gains are consistently observed relative to the baseline techniques, demonstrating the effectiveness of the synthesised parallel corpora in guiding the fine-tuning process.
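The two reported metrics can be computed along the following lines, using sacrebleu for corpus-level BLEU and a Hugging Face text-classification pipeline as the style accuracy classifier. The classifier checkpoint and target label are placeholders; the paper’s own evaluation setup may differ.

```python
# Sketch of the evaluation: BLEU for generation quality, classifier accuracy for style.
import sacrebleu
from transformers import pipeline

def evaluate_transfer(hypotheses, references, classifier_name, target_label):
    # Content / generation quality: corpus-level BLEU against reference outputs.
    bleu = sacrebleu.corpus_bleu(hypotheses, [references]).score

    # Style accuracy: fraction of outputs the classifier assigns to the target style.
    classify = pipeline("text-classification", model=classifier_name)
    predictions = classify(hypotheses, truncation=True)
    style_acc = sum(p["label"] == target_label for p in predictions) / len(predictions)

    return {"bleu": bleu, "style_accuracy": style_acc}
```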
The integration of retrieval-augmented generation (RAG) further refines the model’s capabilities, enhancing both robustness and stylistic consistency. RAG specifically addresses challenges related to terminology and proper name handling, ensuring accurate and contextually appropriate transformations. This study leverages bilingual general-domain parallel corpora to train neural machine translation models, which are then used to create the style-neutral texts.
The resulting pseudo-parallel corpus enables supervised finetuning of LLMs for text style transfer, teaching the models to map neutral, machine-translation-like input onto the desired target style. Systematic evaluation across several text styles with distinctive features confirms the efficacy of the finetuned LLMs, as well as of the prompts, RAG configurations, and inference techniques employed throughout the research.
Roundtrip translation unlocks stylistic text manipulation without paired data
Scientists have devised a surprisingly effective method for altering the style of text using large language models, sidestepping a longstanding problem in the field. The difficulty has always been the need for extensive paired datasets, examples of the same content written in different styles, which are expensive and time-consuming to create. This new approach cleverly synthesises such data using roundtrip translation, effectively stripping stylistic elements from text to create a neutral base for learning.
The implications extend beyond simply mimicking writing styles. Imagine automatically adapting legal documents for public consumption, transforming technical manuals into accessible guides, or personalising communication to suit different audiences. The integration of retrieval-augmented generation further refines the process, ensuring terminology and proper names are handled consistently, a crucial detail often overlooked.
However, the reliance on neutralisation, while ingenious, introduces a potential bottleneck. Over-simplification during this process could lead to a loss of nuance or subtle meaning. Furthermore, the demonstrated improvements, while significant, are currently limited to a handful of domains. Scaling this technique to encompass a wider range of styles and subject matter will be a considerable challenge.
Looking ahead, the real innovation may lie in combining this synthetic data approach with other techniques, such as adversarial training, to further refine stylistic control. The broader effort to build truly adaptable language models is gaining momentum, and this work represents a valuable step towards bridging the gap between theoretical possibility and practical application.
👉 More information
🗞 Text Style Transfer with Parameter-efficient LLM Finetuning and Round-trip Translation
🧠 ArXiv: https://arxiv.org/abs/2602.15013
