
User:DltLuis/draft article on POMDP


A Partially Observable Markov Decision Process (POMDP) is a generalization of a Markov Decision Process (MDP). A POMDP models an agent's decision process in which the system dynamics are assumed to be determined by an MDP, but the agent cannot directly observe the underlying state. Instead, it must maintain a probability distribution over the set of possible states, based on a set of observations and observation probabilities and on the underlying MDP.

A solution to a POMDP must yield an action for each possible belief over the unobservable states. A solution is optimal if it maximizes (or minimizes) the expected reward (or cost) of the agent over a possibly infinite horizon. A sequence of optimal actions is known as an optimal policy of the agent interacting with its environment.

The POMDP framework is general enough to model a variety of real-world sequential decision processes. Applications include robot navigation problems, machine maintenance, and planning under uncertainty in general.[1] The framework originated in the Operations Research community; much subsequent work on POMDPs has been done by the Artificial Intelligence and Automated Planning communities.


Definition

Formal Definition

A discrete-time POMDP models the relationship between an agent and its environment. Formally, a POMDP is a tuple $(S, A, \Omega, T, O, R)$, where

  • $S$ is a set of (unobservable) states,
  • $A$ is a set of actions,
  • $\Omega$ is a set of observations,
  • $T(s' \mid s, a)$ is the set of conditional probabilities of transitioning between states,
  • $O(o \mid s', a)$ is the set of conditional probabilities of observations,
  • $R : S \times A \to \mathbb{R}$ is the reward function.
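For concreteness, the tuple $(S, A, \Omega, T, O, R)$ can be stored as a small set of arrays when $S$, $A$ and $\Omega$ are finite. The following is a minimal sketch in Python with NumPy; the class name and the array-shape conventions ($T[s, a, s']$, $O[a, s', o]$, $R[s, a]$) are this sketch's own assumptions, not any standard API.

    import numpy as np

    # Minimal illustrative container for a finite POMDP (S, A, Omega, T, O, R).
    # Shape conventions chosen for this sketch:
    #   T[s, a, s2] = Pr(s2 | s, a)   transition probabilities
    #   O[a, s2, o] = Pr(o | s2, a)   observation probabilities
    #   R[s, a]     = expected immediate reward
    class POMDP:
        def __init__(self, T, O, R, discount=0.95):
            self.T = np.asarray(T, dtype=float)
            self.O = np.asarray(O, dtype=float)
            self.R = np.asarray(R, dtype=float)
            self.discount = discount
            self.n_states, self.n_actions, _ = self.T.shape
            self.n_obs = self.O.shape[2]
            # Every conditional distribution must sum to one.
            assert np.allclose(self.T.sum(axis=2), 1.0)
            assert np.allclose(self.O.sum(axis=2), 1.0)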

Decision Making Sequence

At each time period, the environment is in some state $s \in S$. The agent does not know the current state with certainty; instead, it maintains a probability $b(s)$ for each state, the probability that the environment is in state $s$.

The agent takes an action $a \in A$, which causes the environment to transition to state $s'$ with probability $T(s' \mid s, a)$. The agent then observes some $o \in \Omega$ with probability $O(o \mid s', a)$ and updates the probabilities $b(s)$ based on the current observation. The agent receives a reward with expected value $\sum_{s \in S} b(s)\, R(s, a)$, and the process repeats.

It is instructive to compare the above definition with the definition of a Markov Decision Process. In an MDP, the state is observed directly. Thus, an MDP is a special case of a POMDP in which $\Omega = S$, the observation is $o = s'$, and the belief $b(s)$ and the observation probability $O(o \mid s', a)$ equal 1 for the states $s$ and $s'$, respectively, and zero for all other states.

Belief update

An agent updates its belief upon taking action $a$ and observing $o$. Since the state is Markovian, maintaining a belief over the states requires only knowledge of the previous belief state, the action taken, and the current observation. Below we describe how this belief update is computed.

The belief in each state $s' \in S$ is updated using Bayes' rule. Given action $a$ and observation $o$, $b'(s')$ and $b(s)$ are the posterior and prior probabilities of being in states $s'$ and $s$, respectively, and observation $o$ is observed with probability $O(o \mid s', a)$. Applying Bayes' rule gives

$$b'(s') = \eta\, O(o \mid s', a) \sum_{s \in S} T(s' \mid s, a)\, b(s),$$

where $\eta = 1 / \Pr(o \mid b, a)$ is a normalizing constant with $\Pr(o \mid b, a) = \sum_{s' \in S} O(o \mid s', a) \sum_{s \in S} T(s' \mid s, a)\, b(s)$.
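This update translates almost directly into code. The sketch below continues the illustrative array conventions of the earlier POMDP container; belief_update and its shape assumptions are this sketch's own, not a library interface.

    def belief_update(pomdp, b, a, o):
        """Bayes update of belief b after taking action a and observing o.

        Returns the posterior belief b' and the normalizer Pr(o | b, a)."""
        # Predicted next-state distribution: sum_s T(s' | s, a) b(s)
        predicted = pomdp.T[:, a, :].T @ b
        # Weight each s' by the observation likelihood O(o | s', a)
        unnormalized = pomdp.O[a, :, o] * predicted
        prob_o = unnormalized.sum()          # Pr(o | b, a)
        if prob_o == 0.0:
            raise ValueError("observation has zero probability under this belief")
        return unnormalized / prob_o, prob_o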

Reward Function

Since the state is not directly observable, the agent cannot evaluate the reward function $R(s, a)$ exactly. Instead, the reward is computed as an expectation over the possible states, weighted by the belief probabilities:

$$r(b, a) = \sum_{s \in S} b(s)\, R(s, a).$$
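Under the array conventions of the earlier sketches, this expectation is a single dot product (belief_reward is again an illustrative name, not an established interface).

    def belief_reward(pomdp, b, a):
        # r(b, a) = sum_s b(s) R(s, a)
        return float(b @ pomdp.R[:, a])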


Representing POMDPs as Continuous State MDPs

In an MDP, an optimal policy is Markovian with respect to the core state. For a POMDP, a policy that is Markovian on the core states would be unusable, since the core state is not known to the agent. A policy that is Markovian with respect to observations is well-defined, but can be arbitrarily suboptimal.[2] Instead of keeping track of the entire history of actions and observations, the agent can maintain the belief state, which is a sufficient statistic for that history.[3]

In a POMDP, a policy maps a belief state into the action space. The optimal policy can be understood as the solution of a continuous-state, so-called belief Markov Decision Process (belief MDP).[4] It is defined as a tuple $(B, A, \tau, r)$, where

  • $B$ is the set of belief states over the POMDP states,
  • $A$ is the same finite set of actions as for the original POMDP,
  • $\tau$ is the belief state transition function (an explicit expression is given below),
  • $r : B \times A \to \mathbb{R}$ is the reward function on belief states defined in the previous section.

This MDP is defined over a continuous state space: the space of all probability mass (or density) functions over $S$. When $S$ is a finite set with $n$ states, the belief space is the $(n-1)$-dimensional standard simplex

$$\Delta^{n-1} = \Big\{\, b \in \mathbb{R}^{n} : b(s) \ge 0 \text{ for all } s \in S,\ \textstyle\sum_{s \in S} b(s) = 1 \,\Big\}.$$
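The belief state transition function $\tau$ is not written out above. In the standard formulation (following Kaelbling et al.[4]) it can be expressed as follows; this is a reconstruction consistent with the belief update defined earlier:

$$\tau(b, a, b') = \sum_{o \in \Omega} \Pr(b' \mid b, a, o)\, \Pr(o \mid b, a), \qquad
\Pr(b' \mid b, a, o) =
\begin{cases}
1 & \text{if the belief update of } b \text{ with } (a, o) \text{ yields } b',\\
0 & \text{otherwise,}
\end{cases}$$

where $\Pr(o \mid b, a) = \sum_{s' \in S} O(o \mid s', a) \sum_{s \in S} T(s' \mid s, a)\, b(s)$.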

Policy and Value Function

The agent's policy $\pi$ specifies an action $a = \pi(b)$ for any belief $b$. Here it is assumed the objective is to maximize the expected total discounted reward over an infinite horizon. When $R$ defines a cost, the objective becomes the minimization of the expected cost.

The expected reward for policy $\pi$ starting from belief $b_0$ is defined as

$$V^{\pi}(b_0) = E\!\left[\, \sum_{t=0}^{\infty} \gamma^{t}\, r(b_t, a_t) \,\middle|\, b_0, \pi \right], \qquad a_t = \pi(b_t),$$

where $\gamma < 1$ is the discount factor. The optimal policy $\pi^{*}$ is obtained by optimizing the long-term reward,

$$\pi^{*} = \underset{\pi}{\operatorname{argmax}}\; V^{\pi}(b_0),$$

where $b_0$ is the initial belief.

The optimal policy, denoted $\pi^{*}$, yields the highest expected reward value for each belief state, compactly represented by the optimal value function, denoted $V^{*}$. This value function is a solution to the Bellman optimality equation:

$$V^{*}(b) = \max_{a \in A} \Big[ r(b, a) + \gamma \sum_{o \in \Omega} \Pr(o \mid b, a)\, V^{*}\big(\tau(b, a, o)\big) \Big],$$

where $\tau(b, a, o)$ denotes the belief obtained from $b$ by the Bayes update for action $a$ and observation $o$.
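As an illustration of how this equation is used, the sketch below performs a greedy one-step lookahead at a belief $b$, given any approximation value_fn of $V^{*}$. It reuses the hypothetical belief_update and belief_reward helpers from the earlier sketches.

    def greedy_action(pomdp, b, value_fn):
        """argmax_a [ r(b,a) + gamma * sum_o Pr(o|b,a) * value_fn(tau(b,a,o)) ]."""
        best_a, best_val = None, float("-inf")
        for a in range(pomdp.n_actions):
            total = belief_reward(pomdp, b, a)
            for o in range(pomdp.n_obs):
                try:
                    b_next, prob_o = belief_update(pomdp, b, a, o)
                except ValueError:
                    continue  # observation impossible under this belief and action
                total += pomdp.discount * prob_o * value_fn(b_next)
            if total > best_val:
                best_a, best_val = a, total
        return best_a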

Solving POMDPs

Unlike for a discrete-state MDP, the value function cannot be stored as a look-up table with one entry per state, because the belief space is continuous. In general, the value function of a continuous-state MDP need not have any specific structure; however, for finite-horizon POMDPs the optimal value function is piecewise-linear and convex[5] and can be represented as a finite set of vectors. In the infinite-horizon formulation, a finite vector set can approximate $V^{*}$ arbitrarily closely, and the value function remains convex. Value iteration applies a dynamic programming update to gradually improve the value until convergence to an $\epsilon$-optimal value function, and it preserves piecewise linearity and convexity.[6] By improving the value, the policy is implicitly improved. Policy iteration instead explicitly represents and improves the policy.[7][8]
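The piecewise-linear and convex structure means the value function can be written as the upper envelope of a finite set $\Gamma$ of so-called alpha vectors, $V(b) = \max_{\alpha \in \Gamma} b \cdot \alpha$. Evaluating such a representation is then just a maximum over dot products, as in this minimal sketch (names are illustrative):

    def value_of_belief(b, alpha_vectors):
        # V(b) = max over alpha in Gamma of the inner product b . alpha
        return max(float(b @ alpha) for alpha in alpha_vectors)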

Approximate POMDP solutions

In practice, POMDPs are often computationally intractable to solve exactly, so researchers have developed methods that approximate solutions for POMDPs.

Grid-based algorithms[9] comprise one approximate solution technique. In this approach, the value function is computed for a finite set of points in the belief space, and interpolation is used to determine the optimal action for other belief states that are encountered but are not in the set of grid points. Other recent work makes use of sampling techniques, generalization techniques and exploitation of problem structure. For example, the Symbolic Perseus solver has been used to approximately solve an application with 207,360 states, 198 observations and 25 actions.[10][11] Approximate methods often take the problem structure into account when sampling from the belief space: in a given problem, some belief states may be unlikely ever to be reached, and an accurate estimate of the value function at such states contributes little to defining an optimal policy. Methods such as point-based value iteration therefore sample a smaller set of likely reachable belief points and compute accurate value estimates for those relevant belief states.[12] Dimensionality reduction of the belief space using PCA has also been explored.[13]
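As a rough illustration of the point-based idea, the sketch below performs one point-based value-iteration backup of a set of alpha vectors at a fixed set of belief points, in the spirit of Pineau et al.[12] The array conventions and names continue the earlier sketches and are assumptions of this sketch, not the published algorithm verbatim.

    def point_based_backup(pomdp, beliefs, alpha_vectors):
        """One backup of the alpha-vector set, restricted to the given belief points.

        Returns one new alpha vector per belief point; alpha_vectors is assumed
        non-empty (e.g. initialized with a single lower-bound vector)."""
        gamma = pomdp.discount
        # Back-project every alpha vector through each (action, observation) pair:
        #   g_{a,o,alpha}(s) = sum_{s'} O(o | s', a) T(s' | s, a) alpha(s')
        g = {}
        for a in range(pomdp.n_actions):
            for o in range(pomdp.n_obs):
                w = pomdp.T[:, a, :] * pomdp.O[a, :, o]   # w[s, s'] = T(s'|s,a) O(o|s',a)
                g[(a, o)] = [w @ alpha for alpha in alpha_vectors]
        new_alphas = []
        for b in beliefs:
            best_vec, best_val = None, float("-inf")
            for a in range(pomdp.n_actions):
                # Assemble the alpha vector for action a that is best at this belief point
                vec = pomdp.R[:, a].astype(float)
                for o in range(pomdp.n_obs):
                    vec = vec + gamma * max(g[(a, o)], key=lambda v: float(b @ v))
                val = float(b @ vec)
                if val > best_val:
                    best_vec, best_val = vec, val
            new_alphas.append(best_vec)
        return new_alphas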

POMDP Applications

POMDPs model many kinds of real-world problems. Below is a selection of applications and references.

  • Assisting persons with dementia.[10][11]
  • Cancer screening.[14]
  • Conservation of the critically endangered and difficult-to-detect Sumatran tigers.[15]
  • Inventory management.[16]
  • Manipulating multi-fingered robotic hands for grasping.[17]
  • Robot navigation.[18]
  • Selecting relays to transmit wireless data.[19]
  • Spoken dialog systems, including speech recognition and response.[20][21]

POMDP Software

Below is a listing of some POMDP software. All are free to download, although the licenses governing their use vary. Many implement multiple types of algorithms, and the methods implemented differ significantly from one package to another.

Software Name | Language | Type of Methods Implemented | Author(s)
APPL | C++ | Approximate | H. Kurniawati, D. Hsu, W.S. Lee
Perseus | MATLAB | Approximate | Matthijs Spaan
POMDP-Solve | C | Exact and Approximate | Anthony Cassandra
SPUDD | C | Approximate | Jesse Hoey, Robert St-Aubin, Alan Hu, Craig Boutilier
Symbolic Perseus | MATLAB/Java | Approximate | Pascal Poupart
ZMDP | C++ | Approximate | Trey Smith

References

  1. ^ Cassandra, A.R. (1998). "A Survey of POMDP Applications". AAAI Fall Symposium.
  2. ^ Cassandra, A.R. (1998). Exact and approximate algorithms for partially observable Markov decision processes (PhD thesis). Brown University.
  3. ^ Smallwood, R.D.; Sondik, E.J. (1973). "The Optimal Control of Partially Observable Markov Processes over a Finite Horizon". Operations Research. 21: 1071–1088.
  4. ^ Kaelbling, L.P.; Littman, M.L.; Cassandra, A.R. (1998). "Planning and acting in partially observable stochastic domains". Artificial Intelligence Journal. 101: 99–134. doi:10.1016/S0004-3702(98)00023-X.
  5. ^ Sondik, E.J. (1971). The optimal control of partially observable Markov processes (PhD thesis). Stanford University.
  6. ^ Smallwood, R.D.; Sondik, E.J. (1973). "The optimal control of partially observable Markov decision processes over a finite horizon". Operations Research. 21 (5): 1071–88. doi:10.1287/opre.21.5.1071.
  7. ^ Sondik, E.J. (1978). "The optimal control of partially observable Markov processes over the infinite horizon: discounted cost". Operations Research. 26 (2): 282–304. doi:10.1287/opre.26.2.282.
  8. ^ Hansen, E. (1998). "Solving POMDPs by searching in policy space". Proceedings of the Fourteenth International Conference on Uncertainty in Artificial Intelligence (UAI-98).
  9. ^ Lovejoy, W. (1991). "Computationally feasible bounds for partially observed Markov decision processes". Operations Research. 39: 162–175. doi:10.1287/opre.39.1.162.
  10. ^ a b Jesse Hoey; Axel von Bertoldi; Pascal Poupart; Alex Mihailidis (2007). "Assisting Persons with Dementia during Handwashing Using a Partially Observable Markov Decision Process". Proc. International Conference on Computer Vision Systems (ICVS). doi:10.2390/biecoll-icvs2007-89.
  11. ^ a b Jesse Hoey; Pascal Poupart; Axel von Bertoldi; Tammy Craig; Craig Boutilier; Alex Mihailidis (2010). "Automated Handwashing Assistance For Persons With Dementia Using Video and a Partially Observable Markov Decision Process". Computer Vision and Image Understanding (CVIU). 114 (5). doi:10.1016/j.cviu.2009.06.008.
  12. ^ Pineau, J.; Gordon, G.; Thrun, S. (August 2003). "Point-based value iteration: An anytime algorithm for POMDPs". International Joint Conference on Artificial Intelligence (IJCAI). Acapulco, Mexico. pp. 1025–32.
  13. ^ Roy, Nicholas; Gordon, Geoffrey (2003). "Exponential Family PCA for Belief Compression in POMDPs". Advances in Neural Information Processing Systems.
  14. ^ Oguzhan Alagoz (2011). "Optimizing Cancer Screening Using Partially Observable Markov Decision Processes". INFORMS TutORials.
  15. ^ Chadès, I.; McDonald-Madden, E.; McCarthy, M.A.; Wintle, B.; Linkie, M.; Possingham, H.P. (16 September 2008). "When to stop managing or surveying cryptic threatened species". Proc. Natl. Acad. Sci. U.S.A. 105 (37): 13936–40. Bibcode:2008PNAS..10513936C. doi:10.1073/pnas.0805265105. PMC 2544557. PMID 18779594.
  16. ^ Kenan Arifoglu; Suleyman Ozekici (2011). "Inventory management with random supply and imperfect information: A hidden Markov model". Int. J. Production Economics. 134: 123–137.
  17. ^ Kaijen Hsiao; Leslie Pack Kaelbling; Tomas Lozano-Perez (2007). "Grasping POMDPs". Proc. IEEE Int. Conf. on Robotics and Automation (ICRA). pp. 4685–4692.
  18. ^ Salvatore Candido; Seth Hutchinson (2011). "Minimum Uncertainty Robot Navigation using Information-guided POMDP Planning". Proc. IEEE Int. Conf. on Robotics and Automation (ICRA).
  19. ^ Yong Li; Yibin Hou; Zhangqin Huang; Yifei Wei (2011). "Cooperative Relay Selection Policy Using Partially Observable Markov Decision Process". 2011 Seventh International Conference on Natural Computation.
  20. ^ Williams, Jason D. (2010). "Spoken dialog systems as an application of POMDPs". POMDP Practitioners Workshop: Solving Real-World POMDP Problems. ICAPS Workshop, Toronto, Canada.
  21. ^ Williams, Jason D. (1998). Partially Observable Markov Decision Processes for Spoken Dialogue Management (PhD thesis). University of Cambridge.

External links

[[Category:Dynamic programming]] [[Category:Machine learning]] [[Category:Markov processes]] [[Category:Stochastic control]] [[fr:Processus de décision markovien partiellement observable]] [[zh:部分可觀察馬可夫決策過程]]