Draft:Policy-Space Response Oracles

Draft article not currently submitted for review.

This is a draft Articles for creation (AfC) submission. It is not currently pending review. While there are no deadlines, abandoned drafts may be deleted after six months. To edit the draft click on the "Edit" tab at the top of the window.

To be accepted, a draft should:

Show the subject qualifies for a Wikipedia article by using multiple sources that meet four criteria. The sources should be (1) reliable (2) secondary (3) independent of the subject (4) talk about the subject in some depth. For some topics, there are alternative criteria.
Be written from a neutral point of view
Respect copyright and do not plagiarize. Do not copy-paste.

It is strongly discouraged to write about yourself, your business or employer. If you do so, you must declare it.

Where to get help

If you need help editing or submitting your draft, please ask us a question at the AfC Help Desk or get live help from experienced editors. These venues are only for help with editing and the submission process, not to get reviews.
If you need feedback on your draft, or if the review is taking a lot of time, you can try asking for help on the talk page of a relevant WikiProject. Some WikiProjects are more active than others so a speedy reply is not guaranteed.

How to improve a draft

Wikipedia:Contributing to Wikipedia – a basic overview on how to edit Wikipedia.
Help:Wikitext – how to use the markup
Help:Referencing for beginners – how to include references
Wikipedia:Article development – how to develop your article
Wikipedia:Writing better articles – how to improve your article
Wikipedia:Verifiability – make sure your article includes reliable third-party sources

You can also browse Wikipedia:Featured articles and Wikipedia:Good articles to find examples of Wikipedia's best writing on topics similar to your proposed article.

Improving your odds of a speedy review

To improve your odds of a faster review, tag your draft with relevant WikiProject tags using the button below. This will let reviewers know a new draft has been submitted in their area of interest. For instance, if you wrote about a female astronomer, you would want to add the Biography, Astronomy, and Women scientists tags.

Add tags to your draft

Editor resources

Easy tools: Citation bot (help) | Advanced: Fix bare URLs

Last edited by 2A00:79E0:63:208:5D06:1942:18C7:AE0F (talk | contribs) 59 days ago. (Update)

Submit the draft for review!

In Multi-Agent Learning, Reinforcement Learning, and Game Theory, Policy-Space Response Oracles^[1] (PSRO) is a collection of multi-agent learning algorithms for training agents in two-player, zero-sum, imperfect information (partially observable), extensive form (stochastic form) games using deep reinforcement learning as an approximate best response operator. Knowledge of all player's payoffs is required (complete information). PSRO unifies, and is heavily influenced by other algorithms such as Double Oracle ^[2] (DO), fictitious play (FP). It can be considered a framework of algorithms which under certain parameterizations are equivalent to algorithms that have come before it. PSRO is closely related to Empirical Game Theoretic Analysis (EGTA).

The multi-agent problem setting involves agents learning to interact with others in a shared environment. PSRO works by iteratively training new policies against past opponents policies (so called "self-play"). A key property of PSRO is that the resulting distribution over policies it finds provably converges to a normal-form Nash Equilibrium (NE) under certain parameterizations. For two-player, zero-sum games an NE cannot be exploited by any other policy, which makes it a particularly suitable solution concept in this setting. Many interesting games are two-player, zero-sum ("purely competitive"). Notable projects such as AlphaZero^[3], and AlphaStar^[4] make use of this family algorithms.

Other classes of games such as those with more than two players, or general payoffs are not usually solved using PSRO because of difficulties selecting a Nash equilibrium in these game classes (equilibrium selection problem). Extensions (such as JPSRO) are are more suitable, but use different solution concepts.

History

(TODO) There is a long list of breakthroughs / other algorithms that PRSO is based on. Credit them here.

Double Oracle^[2] (DO).

Empirical game-theoretic analysis (EGTA)

Algorithm

PSRO works by iteratively training a policy against a distribution over all previous opponent policies found so far. This step of the algorithm is called the best response (BR) and is commonly estimated using reinforcement learning (RL) and function approximation (typically a neural network).

The distribution over opponent policies is determined by a meta-solver (MS) - which in turn determines many of the properties of PSRO. For example, if one were to use a uniform distribution, PSRO would be similar to FSP, and if the Nash distribution were used, PSRO would be similar to Double Oracle.

The meta-solver determines a distribution from an a meta-game.

(Placeholder update)

function expected_return(policy policy₁, policy policy₂) → float is
    return payoff₁, payoff₂
    
function meta_solver(matrix payoff₁, matrix payoff₂) → (dist, dist) is
    return payoff₁, payoff₂

function PSRO(game g) → (dist, dist), (list[policy], list[policy]) is
    // Initialize.
    Π₁ := {π_random}
    Π₂ := {π_random}
    // Iterate until convergence.
    for i in 1,... do
    
        if gap == 0 then
            break
    return (σ₁,  σ₂), (Π₁, Π₂)

Convergence

PSRO is known to converge to an equilibrium. Without loss of generality, consider only pure best-responses. At some point in the algorithm's execution we have a set of pure policies for each player, and an equilibrium over the induced restricted normal-form game. After computing a best-response for each player there are two possibilities: 1) these best-responses improve the performance and are necessarily novel policies, or 2) no best-responses improve the performance.

In the second case, by definition an equilibrium has ben found and the algorithm can terminate. In the first case, note that the new best-response is necessarily novel. Therefore, because there are a finite number of pure policies to search through, the algorithm must eventually terminate.

Although the convergence proof for PSRO is vacuous, it performs much better in practice.

Other extensions

Performance Extensions

Pipeline PSRO^[5]

TODO

Double Oracle citation:

McMahan, H. Brendan, Geoffrey J. Gordon, and Avrim Blum. "Planning in the presence of cost functions controlled by an adversary." Proceedings of the 20th International Conference on Machine Learning (ICML-03). 2003

(Include more citations)

Draft edit update.

References

^ Lanctot, Marc; Zambaldi, Vinicius; Gruslys, Audrunas; Lazaridou, Angeliki; Tuyls, Karl; Perolat, Julien; Silver, David; Graepel, Thore (2017). "A Unified Game-Theoretic Approach to Multiagent Reinforcement Learning". arXiv:1711.00832 [cs.AI].
^ ^a ^b Fawcett, Tom; Mishra, Nina (21 August 2003). Planning in the presence of cost functions controlled by an adversary. Icml'03. pp. 536–543. ISBN 9781577351894.
^ Silver, David; Hubert, Thomas; Schrittwieser, Julian; Antonoglou, Ioannis; Lai, Matthew; Guez, Arthur; Lanctot, Marc; Sifre, Laurent; Kumaran, Dharshan; Graepel, Thore; Lillicrap, Timothy; Simonyan, Karen; Hassabis, Demis (7 December 2018). "A general reinforcement learning algorithm that masters chess, shogi, and go through self-play". Science. 362 (6419): 1140–1144. Bibcode:2018Sci...362.1140S. doi:10.1126/science.aar6404. PMID 30523106.
^ Vinyals, Oriol; Babuschkin, Igor; Czarnecki, Wojciech M.; et al. (30 October 2019). "Grandmaster level in StarCraft II using multi-agent reinforcement learning". Nature. 575 (7782): 350–354. Bibcode:2019Natur.575..350V. doi:10.1038/s41586-019-1724-z. PMID 31666705. S2CID 204972004.
^ McAleer, Stephen; Lanier, John; Fox, Roy; Baldi, Pierre (2020). "Pipeline PSRO: A Scalable Approach for Finding Approximate Nash Equilibria in Large Games". arXiv:2006.08555 [cs.GT].
^ McAleer, Stephen; Lanier, John; Baldi, Pierre; Fox, Roy (2021). "XDO: A Double Oracle Algorithm for Extensive-Form Games". arXiv:2103.06426 [cs.GT].
^ Marris, Luke; Muller, Paul; Lanctot, Marc; Tuyls, Karl; Graepel, Thore (2021). "Multi-Agent Training beyond Zero-Sum with Correlated Equilibrium Meta-Solvers". arXiv:2106.09435 [cs.MA].
^ Muller, Paul; Omidshafiei, Shayegan; Rowland, Mark; Tuyls, Karl; Perolat, Julien; Liu, Siqi; Hennes, Daniel; Marris, Luke; Lanctot, Marc; Hughes, Edward; Wang, Zhe; Lever, Guy; Heess, Nicolas; Graepel, Thore; Munos, Remi (2019). "A Generalized Training Approach for Multiagent Learning". arXiv:1909.12823 [cs.MA].

[1] Lanctot, Marc; Zambaldi, Vinicius; Gruslys, Audrunas; Lazaridou, Angeliki; Tuyls, Karl; Perolat, Julien; Silver, David; Graepel, Thore (2017). "A Unified Game-Theoretic Approach to Multiagent Reinforcement Learning". arXiv:1711.00832 [cs.AI].

[do-2] Fawcett, Tom; Mishra, Nina (21 August 2003). Planning in the presence of cost functions controlled by an adversary. Icml'03. pp. 536–543. ISBN 9781577351894.

[alphazero-3] Silver, David; Hubert, Thomas; Schrittwieser, Julian; Antonoglou, Ioannis; Lai, Matthew; Guez, Arthur; Lanctot, Marc; Sifre, Laurent; Kumaran, Dharshan; Graepel, Thore; Lillicrap, Timothy; Simonyan, Karen; Hassabis, Demis (7 December 2018). "A general reinforcement learning algorithm that masters chess, shogi, and go through self-play". Science. 362 (6419): 1140–1144. Bibcode:2018Sci...362.1140S. doi:10.1126/science.aar6404. PMID 30523106.

[4] Vinyals, Oriol; Babuschkin, Igor; Czarnecki, Wojciech M.; et al. (30 October 2019). "Grandmaster level in StarCraft II using multi-agent reinforcement learning". Nature. 575 (7782): 350–354. Bibcode:2019Natur.575..350V. doi:10.1038/s41586-019-1724-z. PMID 31666705. S2CID 204972004.

[5] McAleer, Stephen; Lanier, John; Fox, Roy; Baldi, Pierre (2020). "Pipeline PSRO: A Scalable Approach for Finding Approximate Nash Equilibria in Large Games". arXiv:2006.08555 [cs.GT].

[6] McAleer, Stephen; Lanier, John; Baldi, Pierre; Fox, Roy (2021). "XDO: A Double Oracle Algorithm for Extensive-Form Games". arXiv:2103.06426 [cs.GT].

[7] Marris, Luke; Muller, Paul; Lanctot, Marc; Tuyls, Karl; Graepel, Thore (2021). "Multi-Agent Training beyond Zero-Sum with Correlated Equilibrium Meta-Solvers". arXiv:2106.09435 [cs.MA].

[8] Muller, Paul; Omidshafiei, Shayegan; Rowland, Mark; Tuyls, Karl; Perolat, Julien; Liu, Siqi; Hennes, Daniel; Marris, Luke; Lanctot, Marc; Hughes, Edward; Wang, Zhe; Lever, Guy; Heess, Nicolas; Graepel, Thore; Munos, Remi (2019). "A Generalized Training Approach for Multiagent Learning". arXiv:1909.12823 [cs.MA].

[1]

[2]

[3]

[4]

[5]

[6]

[7]

[8]