Thursday, March 22, 2012

Feedback 22-03-2012

Algorithm for POMDPs with continuous observations

  • Steps colored in blue are different from standard TLS
  • Observations need to be obtained from the environment simulator and stored while traversing the tree, because they are needed to update the regression trees in the backpropagation step (a sketch of the resulting node structure follows this list)
    • Using the observation received in the final state t=max_roll_out_depth of the roll-out is not correct, because the observation at time t is influenced by the action and the state at time t, not by the action and the state at time t=max_roll_out_depth
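
A minimal Python sketch of the node structure this implies. The class, its attribute names, and the env.step() interface used in the later sketches are illustrative assumptions, not part of the original description:

class Node:
    """One node of the search tree sketched in these notes.

    Internal nodes of a regression tree split on one dimension of the
    continuous observation; leaves carry the usual MCTS statistics and,
    once expanded, one child regression tree per action. The root of the
    whole search tree is simply a leaf that already has action children.
    """

    def __init__(self, actions=()):
        # regression-tree part (used by internal nodes only)
        self.split_dim = None       # observation dimension tested at this node
        self.split_value = None     # threshold of the test
        self.left = None            # child for obs[split_dim] <= split_value
        self.right = None           # child for obs[split_dim] >  split_value
        # decision part (used by leaves only)
        self.untried_actions = list(actions)
        self.children = {}          # action -> root of that action's regression tree
        # statistics updated in the backpropagation step
        self.samples = []           # (observation, return) pairs seen at this leaf
        self.visits = 0
        self.value = 0.0            # running mean of returns

    def is_leaf(self):
        return self.split_dim is None
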
Selection
  1. Start at root.
  2. If there are children below this node and the max. search depth is not reached, select one according to some selection strategy, e.g., UCT. Otherwise, continue with Step (6).
  3. Apply the action corresponding to the selected child to receive an observation o_t from the environment simulator and store o_t in a list O = (o_1, o_2, ..., o_t).
  4. Use o_t to traverse the regression tree until you reach a leaf.
  5. Repeat Steps (2) - (4).
  6. Return that leaf l. (This selection phase is sketched in code below.)
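
Continuing the hypothetical Node sketch from above, the selection phase might look as follows; env.step(action) stands in for the environment simulator and returns the observation o_t, and UCT is used as the selection strategy:

import math

def uct_action(node, c=1.4):
    # Step 2: pick an action at a decision point with the UCB1/UCT rule.
    def score(action):
        child = node.children[action]
        return child.value + c * math.sqrt(math.log(node.visits + 1) / (child.visits + 1))
    return max(node.children, key=score)

def descend(tree, obs):
    # Step 4: follow the observation down the regression tree to a leaf.
    node = tree
    while not node.is_leaf():
        node = node.left if obs[node.split_dim] <= node.split_value else node.right
    return node

def select(root, env, max_search_depth):
    # Steps 1-6: walk down the tree, storing every observation on the way.
    node, observations, path = root, [], []
    for _ in range(max_search_depth):                # step 2: respect the max. search depth
        if not node.children:                        # step 2: no children -> go to step 6
            break
        action = uct_action(node)                    # step 2: selection strategy, e.g. UCT
        obs = env.step(action)                       # step 3: observation o_t from the simulator
        observations.append(obs)                     # step 3: store o_t in O = (o_1, ..., o_t)
        leaf = descend(node.children[action], obs)   # step 4: traverse the regression tree
        path.append((node, action, leaf))            # remember what was visited for backprop
        node = leaf                                  # step 5: repeat from this leaf
    return node, observations, path                  # step 6: return leaf l
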
Expansion
  1. Choose an action a* at leaf l.
  2. Connect l to a new regression tree with only one node (the root), as sketched below.
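
The expansion step, continuing the same sketch; picking an untried action first (and an arbitrary one otherwise) and the env.actions attribute are assumptions of the sketch:

import random

def expand(leaf, env):
    # Step 1: choose an action a* at leaf l.
    if leaf.untried_actions:
        a_star = leaf.untried_actions.pop()
    else:
        a_star = random.choice(env.actions)
    # Step 2: connect l to a new regression tree with only one node (the root).
    leaf.children[a_star] = Node(env.actions)
    return a_star
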
Roll-outs
  1. Apply the selected action a* to receive an observation o* from the environment simulator and store o* in O = (o_1, o_2, ..., o*).
  2. Play randomly or according to some other roll-out strategy until the max. roll-out depth or a terminal state is reached.
  3. Receive a reward r (see the sketch after this list).
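
The roll-out phase in the same sketch; env.is_terminal() and env.reward() (returning the reward r for the episode) are assumed interfaces of the environment simulator:

import random

def rollout(a_star, env, max_roll_out_depth):
    # Step 1: apply a* and store the resulting observation o*.
    o_star = env.step(a_star)
    # Step 2: play randomly until the max. roll-out depth or a terminal state.
    for _ in range(max_roll_out_depth):
        if env.is_terminal():
            break
        env.step(random.choice(env.actions))
    # Step 3: receive a reward r.
    r = env.reward()
    return o_star, r
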
Backpropagation
  1. Use r to update all encountered nodes up to the root of the tree.
  2. Use (o*, r) to update the statistics of the leaves of the regression tree corresponding to action a*.
  3. Use (o_t, r) to update the statistics of the leaves of the regression tree encountered at time step t on the way back up to the root (for t = 1, 2, ...).
  4. Split leaves of a regression tree if necessary, as in the sketch below.
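
Backpropagation, again only as a sketch on top of the pieces above. The updates follow steps 1-3; the splitting rule in maybe_split (split a leaf at the median of its first observation dimension once enough samples with sufficiently different returns have accumulated) is only a placeholder for whatever incremental regression-tree criterion is actually used, it assumes indexable observation vectors, and it ignores what should happen to action children already attached below a split leaf. env.reset() is likewise an assumption. tls_iteration shows how one full iteration would tie the four phases together:

def update(node, r):
    # Incrementally update a node's visit count and mean return.
    node.visits += 1
    node.value += (r - node.value) / node.visits

def maybe_split(leaf, min_samples=10, min_spread=1e-3):
    # Step 4 (placeholder criterion): split at the median of observation dimension 0.
    if len(leaf.samples) < min_samples:
        return
    returns = [ret for _, ret in leaf.samples]
    if max(returns) - min(returns) < min_spread:
        return
    values = sorted(obs[0] for obs, _ in leaf.samples)
    leaf.split_dim, leaf.split_value = 0, values[len(values) // 2]
    leaf.left, leaf.right = Node(), Node()
    for obs, ret in leaf.samples:
        child = leaf.left if obs[0] <= leaf.split_value else leaf.right
        child.samples.append((obs, ret))
        update(child, ret)
    leaf.samples = []                       # the leaf has become an internal node

def backpropagate(path, observations, leaf, a_star, o_star, r):
    # Step 1: r updates all encountered nodes up to the root (internal
    # regression-tree nodes passed on the way down are omitted here).
    for node, action, _ in path:
        update(node, r)
        update(node.children[action], r)    # root of the chosen action's tree
    update(leaf, r)
    # Step 2: (o*, r) updates the leaf of the new regression tree of a*.
    new_leaf = descend(leaf.children[a_star], o_star)
    update(new_leaf, r)
    new_leaf.samples.append((o_star, r))
    maybe_split(new_leaf)
    # Step 3: (o_t, r) updates the regression-tree leaf met at time step t.
    for (_, _, reg_leaf), obs in zip(path, observations):
        reg_leaf.samples.append((obs, r))
        maybe_split(reg_leaf)               # step 4: split if necessary

def tls_iteration(root, env, max_search_depth, max_roll_out_depth):
    # One full iteration: selection, expansion, roll-out, backpropagation.
    env.reset()                             # assumed way to restart the simulator
    leaf, observations, path = select(root, env, max_search_depth)
    a_star = expand(leaf, env)
    o_star, r = rollout(a_star, env, max_roll_out_depth)
    backpropagate(path, observations, leaf, a_star, o_star, r)
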

Example of first five steps

[Figure omitted]

(Remark: a* is always equal to the action at the node "above" it.)
