Draft:Direct Preference Optimization

Direct Preference Optimization (DPO) is an algorithm for aligning large language models (LLMs) with human preferences that eliminates the need for reward-model training and reinforcement learning.[1] It addresses the challenge of precisely controlling the behavior of language models, which, despite their broad knowledge and reasoning skills, are difficult to steer due to the unsupervised nature of their training. Traditional approaches collect human labels on model generations and fine-tune the model to align with these preferences using Reinforcement Learning from Human Feedback (RLHF). However, RLHF can be complex and unstable.[2]

DPO removes one of the approximations made in RLHF by eliminating the separate reward-modeling stage. Instead of first learning a reward model from human preferences and then training a policy to maximize these inferred rewards, DPO optimizes the policy directly on the collected preference data. This is achieved by reparameterizing the optimization objective so that it is expressed directly in terms of the preference data, allowing the policy to be trained to align with human preferences without the intermediate step of reward modeling.[3]
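In outline (a sketch using the notation of the DPO paper, with $\pi_{\text{ref}}$ the fixed reference model and $\beta$ a scaling hyperparameter), the reward under the KL-constrained RLHF objective can be written in terms of its own optimal policy $\pi_r$:

\[
r(x, y) = \beta \log \frac{\pi_r(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \beta \log Z(x)
\]

Here $Z(x)$ is a partition function that depends only on the prompt $x$. When two completions of the same prompt are compared under the Bradley–Terry preference model, $Z(x)$ cancels, so the preference probability, and therefore the training objective, can be expressed purely in terms of the policy and the reference model rather than an explicitly learned reward.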

While DPO sidesteps the challenges associated with the generalization of reward models, it still relies on another approximation: the translation of pairwise preferences into a form that can be optimized directly. DPO therefore simplifies the training process and can improve the stability and performance of the model, but it does not overcome all of the theoretical challenges of learning from human preferences. The method still assumes that pairwise preferences can meaningfully guide the optimization of the policy, an assumption that carries its own limitations in capturing the nuances of human judgment.[3]

DPO requires less data and less computing power than Proximal Policy Optimization (PPO)-based RLHF to reach comparable results.[4]

One shortcoming of DPO is that it tends to overfit quickly to the preference dataset.[5]

The loss function

A central part of DPO is its loss function.

\[
\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right) \right]
\]
[2]

On the left-hand side of the equals sign is the loss of the policy model $\pi_\theta$, computed with respect to the reference model $\pi_{\text{ref}}$.

On the right-hand side, an expectation is taken over samples from the preference dataset $\mathcal{D}$: each sample consists of a prompt $x$, a chosen output ($y_w$) and a rejected output ($y_l$). For each sample, the logarithm of the sigmoid function $\sigma$ is applied to the term in parentheses, and the negated average gives the loss.

Inside the parentheses, two log-probability ratios are subtracted: on the left the ratio for the chosen output, on the right the ratio for the rejected output. Both ratios are scaled by $\beta$, which is a hyperparameter.[6][7]
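A minimal sketch of this loss in Python with PyTorch, assuming the summed per-sequence log probabilities of the chosen and rejected outputs under both the policy and the reference model have already been computed (function and variable names here are illustrative, not taken from the cited sources):

    import torch
    import torch.nn.functional as F

    def dpo_loss(policy_chosen_logps, policy_rejected_logps,
                 ref_chosen_logps, ref_rejected_logps, beta=0.1):
        """DPO loss from per-sequence log probabilities.

        Each argument is a tensor of shape (batch,) holding log pi(y | x)
        summed over the tokens of the chosen or rejected completion.
        """
        # beta-scaled log-probability ratios of the policy vs. the reference model
        chosen_logratio = beta * (policy_chosen_logps - ref_chosen_logps)
        rejected_logratio = beta * (policy_rejected_logps - ref_rejected_logps)

        # negative log-sigmoid of the difference, averaged over the batch
        return -F.logsigmoid(chosen_logratio - rejected_logratio).mean()

In practice, the per-sequence log probabilities are obtained by summing the token-level log probabilities of each completion under the corresponding model.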

See also

References

  1. Amini, Afra; Vieira, Tim; Cotterell, Ryan (2024-02-16). Direct Preference Optimization with an Offset. arXiv:2402.10571.
  2. Rafailov, Rafael; Sharma, Archit; Mitchell, Eric; Ermon, Stefano; Manning, Christopher D.; Finn, Chelsea (2023-12-13). Direct Preference Optimization: Your Language Model is Secretly a Reward Model. arXiv:2305.18290.
  3. Azar, Mohammad Gheshlaghi; Rowland, Mark; Piot, Bilal; Guo, Daniel; Calandriello, Daniele; Valko, Michal; Munos, Rémi (2023-11-21). A General Theoretical Paradigm to Understand Learning from Human Preferences. arXiv:2310.12036.
  4. Gunton, Matthew (2024-02-23). "Understanding Direct Preference Optimization". Medium. Retrieved 2024-02-28.
  5. "Preference Tuning LLMs with Direct Preference Optimization Methods". huggingface.co. Retrieved 2024-02-28.
  6. "Direct Preference Optimization (DPO) Explained: A Comprehensive Tutorial". Retrieved 2024-02-28.
  7. Lages, João (2023-11-05). "Direct Preference Optimization (DPO)". Medium. Retrieved 2024-02-28.