Draft:Direct Preference Optimization

Direct Preference Optimization (DPO) is an algorithm for aligning large language models (LLMs) with human preferences that eliminates the need for reward-model training and reinforcement learning.[1] It addresses the challenge of precisely controlling the behavior of language models, which, despite their broad knowledge and reasoning skills, are difficult to steer due to the unsupervised nature of their training. Traditional approaches collect human labels on model generations and fine-tune the model to align with these preferences using Reinforcement Learning from Human Feedback (RLHF). However, RLHF can be complex and unstable.[2]

DPO removes one of the approximations made in RLHF by eliminating the separate reward-modeling stage. Instead of first learning a reward model from human preferences and then training a policy to maximize these inferred rewards, DPO optimizes the policy directly on the collected preference data. This is achieved by reparameterizing the optimization objective so that it is expressed directly in terms of the preference data, allowing the policy to be trained to align with human preferences without the intermediate step of reward modeling.[3]
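In outline (a sketch using the notation of the DPO paper, with $\pi_{\text{ref}}$ the fixed reference model and $\beta$ a scaling hyperparameter), the reward under the KL-constrained RLHF objective can be written in terms of its own optimal policy $\pi_r$:

\[
r(x, y) = \beta \log \frac{\pi_r(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \beta \log Z(x)
\]

Here $Z(x)$ is a partition function that depends only on the prompt $x$. When two completions of the same prompt are compared under the Bradley–Terry preference model, $Z(x)$ cancels, so the preference probability, and therefore the training objective, can be expressed purely in terms of the policy and the reference model rather than an explicitly learned reward.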

While DPO sidesteps the challenges associated with the generalization of reward models, it still relies on another approximation: the translation of pairwise preferences into a form that can be optimized directly. DPO therefore simplifies the training process and can improve the stability and performance of the model, but it does not overcome all of the theoretical challenges of learning from human preferences. The method still assumes that pairwise preferences can meaningfully guide the optimization of the policy, an assumption that carries its own limitations in capturing the nuances of human judgment.[3]

DPO requires less data and less computing power than Proximal Policy Optimization (PPO)-based RLHF to reach comparable results.[4]

One shortcoming of DPO is that it tends to overfit quickly to the preference dataset.[5]

The loss function

A central part of DPO is its loss function.

\[
\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right) \right]
\]
[2]

On the left-hand side of the equals sign is the loss of the policy model $\pi_\theta$, computed with respect to the reference model $\pi_{\text{ref}}$.

On the right-hand side, an expectation is taken over samples from the preference dataset $\mathcal{D}$: each sample consists of a prompt $x$, a chosen output ($y_w$) and a rejected output ($y_l$). For each sample, the logarithm of the sigmoid function $\sigma$ is applied to the term in parentheses, and the negated average gives the loss.

Inside the parentheses, two log-probability ratios are subtracted: on the left the ratio for the chosen output, on the right the ratio for the rejected output. Both ratios are scaled by $\beta$, which is a hyperparameter.[6][7]
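A minimal sketch of this loss in Python with PyTorch, assuming the summed per-sequence log probabilities of the chosen and rejected outputs under both the policy and the reference model have already been computed (function and variable names here are illustrative, not taken from the cited sources):

    import torch
    import torch.nn.functional as F

    def dpo_loss(policy_chosen_logps, policy_rejected_logps,
                 ref_chosen_logps, ref_rejected_logps, beta=0.1):
        """DPO loss from per-sequence log probabilities.

        Each argument is a tensor of shape (batch,) holding log pi(y | x)
        summed over the tokens of the chosen or rejected completion.
        """
        # beta-scaled log-probability ratios of the policy vs. the reference model
        chosen_logratio = beta * (policy_chosen_logps - ref_chosen_logps)
        rejected_logratio = beta * (policy_rejected_logps - ref_rejected_logps)

        # negative log-sigmoid of the difference, averaged over the batch
        return -F.logsigmoid(chosen_logratio - rejected_logratio).mean()

In practice, the per-sequence log probabilities are obtained by summing the token-level log probabilities of each completion under the corresponding model.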

See also

References

  1. Amini, Afra; Vieira, Tim; Cotterell, Ryan (2024-02-16). Direct Preference Optimization with an Offset. arXiv:2402.10571.
  2. Rafailov, Rafael; Sharma, Archit; Mitchell, Eric; Ermon, Stefano; Manning, Christopher D.; Finn, Chelsea (2023-12-13). Direct Preference Optimization: Your Language Model is Secretly a Reward Model. arXiv:2305.18290.
  3. Azar, Mohammad Gheshlaghi; Rowland, Mark; Piot, Bilal; Guo, Daniel; Calandriello, Daniele; Valko, Michal; Munos, Rémi (2023-11-21). A General Theoretical Paradigm to Understand Learning from Human Preferences. arXiv:2310.12036.
  4. Gunton, Matthew (2024-02-23). "Understanding Direct Preference Optimization". Medium. Retrieved 2024-02-28.
  5. "Preference Tuning LLMs with Direct Preference Optimization Methods". huggingface.co. Retrieved 2024-02-28.
  6. "Direct Preference Optimization (DPO) Explained: A Comprehensive Tutorial". Retrieved 2024-02-28.
  7. Lages, João (2023-11-05). "Direct Preference Optimization (DPO)". Medium. Retrieved 2024-02-28.