Hybrid AI Feedback Method Enhances Language Model Training, Study Finds

Researchers have proposed a method called Reinforcement Learning from AI Feedback (RLAIF) that uses AI to provide feedback for AI, offering advantages such as lower costs and shorter annotation cycles. However, while basic RLAIF improves the human preference win ratio of model responses, it can decrease their human satisfaction rate. To address this, a new method called Hybrid Reinforcement Learning from AI Feedback (HRLAIF) was proposed, which improves the reliability of AI annotations and further enhances the harmlessness of the model. In human evaluations, HRLAIF outperformed basic RLAIF in response satisfaction rate and in benchmarks for helpfulness and harmlessness.

What is Reinforcement Learning from AI Feedback (RLAIF) and its Advantages?

Reinforcement Learning from AI Feedback (RLAIF) is a method that uses AI to provide feedback for AI, as proposed by researchers such as Bai et al. (2022b). This method has been found to have several advantages over Reinforcement Learning from Human Feedback (RLHF), including lower costs and shorter annotation cycles. This makes RLAIF highly efficient during the rapid strategy iteration periods of large language model (LLM) training.
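To make the workflow concrete, the sketch below shows one common way an AI labeler can be used to produce preference pairs for downstream reward-model training. The `ai_judge` callable is a hypothetical wrapper around an LLM labeler (anything that takes a text prompt and returns a verdict); it is an illustrative assumption, not the authors' implementation.

```python
# Minimal sketch of AI preference labeling for RLAIF.
# `ai_judge` is a hypothetical callable wrapping an LLM labeler; it is an
# assumption for illustration, not the paper's actual pipeline.
from typing import Callable, Dict, List

def label_preferences(
    prompts: List[str],
    responses_a: List[str],
    responses_b: List[str],
    ai_judge: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Ask an AI labeler which of two candidate responses it prefers for each prompt."""
    pairs = []
    for prompt, a, b in zip(prompts, responses_a, responses_b):
        verdict = ai_judge(
            f"Prompt: {prompt}\nResponse A: {a}\nResponse B: {b}\n"
            "Which response is better? Answer 'A' or 'B'."
        )
        chosen, rejected = (a, b) if verdict.strip().upper().startswith("A") else (b, a)
        pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs
```

The resulting (chosen, rejected) pairs play the same role as human preference annotations in RLHF, which is why the labeling cost and turnaround time drop so sharply.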

In a study conducted by Tencent PCG, ChatGPT (Liu et al. 2023) was used as a labeler for aligning language models. The results showed that after RLAIF, the model’s responses had a higher win ratio in human preference comparison, indicating that RLAIF indeed has the advantage of enhancing human preferences at a lower cost. However, the study also identified a decrease in the human satisfaction rate of responses with this method.

The increase in preference was primarily attributed to improvements in the stylistic paradigms of model responses. The decrease in satisfaction rate, on the other hand, was due to some responses becoming less helpful, particularly in terms of correctness and truthfulness. This issue arises mainly because the AI labeler has lower accuracy in preference annotation for certain types of tasks, so the reward model (RM) trained with AI feedback is ineffective at judging the correctness of responses.
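For context, a reward model is typically fit to such preference pairs with a pairwise (Bradley-Terry style) loss; if the labels for a task category are noisy, the RM learns little about correctness in that category. The snippet below is a minimal sketch of that standard loss, assumed here for illustration rather than taken from the paper's code.

```python
# Standard pairwise (Bradley-Terry) reward-model loss over preference pairs.
# Shown as an illustrative sketch; the paper's exact RM setup is not reproduced here.
import torch
import torch.nn.functional as F

def reward_model_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Push the reward of the chosen response above the reward of the rejected one."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Example: scalar rewards for a batch of three preference pairs.
loss = reward_model_loss(
    torch.tensor([1.2, 0.3, 0.8]),   # rewards assigned to chosen responses
    torch.tensor([0.4, 0.5, -0.1]),  # rewards assigned to rejected responses
)
```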

What is Hybrid Reinforcement Learning from AI Feedback (HRLAIF) and How Does it Address the Issues in RLAIF?

To address the issues identified in RLAIF, a novel method called Hybrid Reinforcement Learning from AI Feedback (HRLAIF) was proposed. The term “Hybrid” in this context refers to the phased annotating on different prompt categories during the AI preference labeling process. This method significantly enhances the reliability of AI annotations on certain categories of prompts, leading to a more robust model in helpfulness during Reinforcement Learning (RL).
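One way to picture the phased annotation is as category-aware routing: prompts whose answers can be checked (for example, math computation or multiple-choice questions) get a correctness check before any preference ranking, while other categories fall back to plain AI preference labeling. The sketch below is a hedged interpretation of that idea; `verify_answer` and `rank_by_preference` are hypothetical helpers, not the authors' actual pipeline.

```python
# Hedged sketch of phased ("hybrid") AI annotation routed by prompt category.
# `verify_answer` and `rank_by_preference` are hypothetical helpers.
from typing import Callable, Dict

def hybrid_label(
    prompt: str,
    category: str,
    resp_a: str,
    resp_b: str,
    verify_answer: Callable[[str, str], bool],
    rank_by_preference: Callable[[str, str, str], str],
) -> Dict[str, str]:
    if category in {"math_computation", "multiple_choice"}:
        # Phase 1: check correctness; a correct response beats an incorrect one.
        a_ok, b_ok = verify_answer(prompt, resp_a), verify_answer(prompt, resp_b)
        if a_ok != b_ok:
            chosen, rejected = (resp_a, resp_b) if a_ok else (resp_b, resp_a)
            return {"prompt": prompt, "chosen": chosen, "rejected": rejected}
    # Phase 2, or non-verifiable categories: plain AI preference ranking.
    verdict = rank_by_preference(prompt, resp_a, resp_b)
    chosen, rejected = (resp_a, resp_b) if verdict == "A" else (resp_b, resp_a)
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}
```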

Moreover, HRLAIF also uses AI for Red Teaming with a toxic prompt set, which further enhances the harmlessness of the model. In the study, hybrid AI preference labeling was implemented primarily for prompt categories such as math computation, multiple-choice questions, and toxic prompt rejection, followed by RL training.
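A common way to turn a toxic prompt set into training signal is to sample several responses per prompt, flag harmful ones with an AI checker, and prefer safe refusals over harmful completions. The sketch below illustrates that pattern under those assumptions; `sample_responses` and `is_harmful` are hypothetical callables, not the paper's red-teaming setup.

```python
# Hedged sketch of building harmlessness preference pairs from a toxic prompt set.
# `sample_responses` and `is_harmful` are hypothetical callables.
from typing import Callable, Dict, List

def red_team_pairs(
    toxic_prompts: List[str],
    sample_responses: Callable[[str, int], List[str]],
    is_harmful: Callable[[str, str], bool],
) -> List[Dict[str, str]]:
    pairs = []
    for prompt in toxic_prompts:
        responses = sample_responses(prompt, 4)
        flags = [is_harmful(prompt, r) for r in responses]
        safe = [r for r, bad in zip(responses, flags) if not bad]
        unsafe = [r for r, bad in zip(responses, flags) if bad]
        if safe and unsafe:
            # Prefer a safe refusal over a harmful completion.
            pairs.append({"prompt": prompt, "chosen": safe[0], "rejected": unsafe[0]})
    return pairs
```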

How Does HRLAIF Perform in Human Evaluations and Benchmarks?

Human evaluation on prompts from 12 categories showed that both basic RLAIF and HRLAIF enhance the model's win ratio in human preference comparison after training. In terms of response satisfaction rate, however, basic RLAIF shows a decrease of 4.58% relative to the policy model before training, while HRLAIF achieves an increase of 2.08%.
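The two metrics can be read roughly as follows, assuming each pairwise comparison is labeled win/tie/loss and each response is marked satisfactory or not; the exact evaluation protocol is the authors', and this sketch only fixes the arithmetic being reported.

```python
# Rough sketch of the two human-evaluation metrics discussed above.
# Label formats are assumptions; the paper defines the actual protocol.
from typing import List

def win_ratio(labels: List[str]) -> float:
    """Fraction of pairwise comparisons the trained model wins."""
    return labels.count("win") / len(labels)

def satisfaction_delta(before: List[bool], after: List[bool]) -> float:
    """Change in the share of satisfactory responses, in percentage points."""
    return 100 * (sum(after) / len(after) - sum(before) / len(before))
```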

The effectiveness of the above approaches was quantified with popular LLM benchmarks and human evaluations based on a Chinese multi-category evaluation set. In benchmarks for helpfulness and harmlessness, HRLAIF outperforms basic RLAIF.

What are the Main Contributions of the Study?

The main contributions of the study are the proposal of a novel method, HRLAIF, which addresses the issue of decreased helpfulness observed in the basic RLAIF process while also further enhancing models’ harmlessness. The study also quantifies the effectiveness of the approaches with popular LLM benchmarks and human evaluations.

How Does HRLAIF Relate to Other Works in LLM Learning from Human Feedback?

LLM learning from human feedback has been explored in several studies. Christiano et al. (2017) explored goals defined in terms of human preferences between pairs of trajectory segments. Shin et al. (2020) used on-policy learning and modeled the user's emotional state to improve a seq2seq model. Nahian et al. (2021) introduced an approach to value-aligned reinforcement learning and trained an agent with human reward signals. Ouyang et al. (2022) showed an avenue for aligning LLMs with user intent on a wide range of tasks by fine-tuning with human feedback. Touvron et al. (2023) developed Llama 2, which was trained with batches of human preference annotations in the RLHF fine-tuning stage. Other studies optimize the difference in log probabilities between winning and losing responses based on human feedback (Rafailov et al. 2023; Yuan et al. 2023).
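The log-probability-difference objective referenced in the last sentence, as formulated in Direct Preference Optimization (Rafailov et al. 2023), looks roughly like the sketch below: widen the margin of the winning response over the losing one, measured relative to a frozen reference model. Tensor names are illustrative only.

```python
# Rough sketch of a DPO-style objective (Rafailov et al. 2023).
# Inputs are per-example sequence log-probabilities under the policy and a
# frozen reference model; names and beta value are illustrative assumptions.
import torch
import torch.nn.functional as F

def dpo_loss(
    logp_w: torch.Tensor,      # policy log-prob of winning responses
    logp_l: torch.Tensor,      # policy log-prob of losing responses
    ref_logp_w: torch.Tensor,  # reference log-prob of winning responses
    ref_logp_l: torch.Tensor,  # reference log-prob of losing responses
    beta: float = 0.1,
) -> torch.Tensor:
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -F.logsigmoid(beta * margin).mean()
```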

Publication details: “HRLAIF: Improvements in Helpfulness and Harmlessness in Open-domain Reinforcement Learning From AI Feedback”
Publication Date: 2024-03-13
Authors: Xiayu Li, Qiugen Xiao, Pan Cao, Jian Tang, et al.
Source: arXiv (Cornell University)
DOI: https://doi.org/10.48550/arxiv.2403.08309