Algorithm for POMDPs with continuous observations
- Steps colored in blue are different from standard TLS
- Observations need to be obtained from the environment simulator and stored while traversing the tree because they are needed to update the regression trees in the backpropagation step
- Using the observation received in the final state t = max_roll_out_depth of the roll-out is not correct, because the observation at time t is influenced by the action and the state at time t, not by the action and the state at time t = max_roll_out_depth (a short bookkeeping sketch follows below)
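As a small illustration of this bookkeeping, the following sketch stores one observation per tree depth and pairs it with that depth during backpropagation. `env_step` and the action values are hypothetical placeholders for the environment simulator, not part of the original post:

```python
import random

# Hypothetical stand-in for a call to the environment simulator; only the
# interface (action in, continuous observation out) matters here.
def env_step(action):
    return random.gauss(action, 1.0)

selected_actions = [0.2, 1.5, -0.7]   # actions chosen while descending the tree

# Store every observation as it is received: O = (o_1, o_2, ..., o_t).
observations = [env_step(a) for a in selected_actions]

# In backpropagation, the regression tree traversed at depth t is updated with
# the pair (observations[t-1], r); the observation from the final roll-out step
# must not be reused for earlier depths.
```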
1. Start at the root.
2. If there are children below this node and the max. search depth is not reached, select one according to some selection strategy, e.g., UCT. Otherwise, continue with Step (6).
3. Apply the action corresponding to the selected child to receive an observation o_t from the environment simulator and store o_t in a list O = (o_1, o_2, ..., o_t).
4. Use o_t to traverse the regression tree until you reach a leaf.
5. Repeat Steps (2)-(4).
6. Return that leaf l.
7. Choose an action a* at leaf l.
8. Connect l to a new regression tree with only one node (the root).
9. Apply the selected action a* to receive an observation o* from the environment simulator and store o* in O = (o_1, o_2, ..., o*).
10. Play randomly or according to some other roll-out strategy until the max. roll-out depth or a terminal state is reached.
11. Receive a reward r.
12. Use r to update all encountered nodes up to the root of the tree.
13. Use (o*, r) to update the statistics of the leaves of the regression tree corresponding to action a*.
14. Use (o_t, r) to update the statistics of the leaves of the regression tree encountered at time step t, on the way back up to the root (for t = 1, 2, ...).
15. Split the leaves of a regression tree if necessary. (A Python sketch of one complete simulation, covering all of these steps, follows below.)
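The following is a compact, self-contained Python sketch of one simulation following Steps (1)-(15). It assumes a generative environment simulator and uses a deliberately crude regression tree (a stump that splits once at the median observation); all names (EnvSimulator, RegressionTree, DecisionNode, simulate) and the splitting rule are illustrative assumptions, not taken from the original post:

```python
import math
import random

class EnvSimulator:
    """Hypothetical generative simulator: step(action) -> (observation, reward, done)."""
    def step(self, action):
        obs = random.gauss(action, 1.0)          # continuous observation
        return obs, -abs(obs), False

class Leaf:
    """Regression-tree leaf: (observation, reward) samples plus a child decision node."""
    def __init__(self):
        self.samples = []
        self.child = DecisionNode()

class RegressionTree:
    """Minimal regression tree over observations (a stump that splits at most once)."""
    def __init__(self):
        self.n, self.value = 0, 0.0              # aggregate statistics used by UCT
        self.threshold = None
        self.left, self.right = Leaf(), None

    def leaf(self, obs):
        # Step (4): descend using the observation until a leaf is reached.
        if self.threshold is None or obs < self.threshold:
            return self.left
        return self.right

    def update(self, obs, reward):
        # Steps (13)/(14): update leaf statistics with the pair (o_t, r).
        self.n += 1
        self.value += (reward - self.value) / self.n
        self.leaf(obs).samples.append((obs, reward))

    def maybe_split(self):
        # Step (15): illustrative rule -- split at the median observation after 20 samples.
        if self.threshold is None and len(self.left.samples) >= 20:
            obs_values = sorted(o for o, _ in self.left.samples)
            self.threshold = obs_values[len(obs_values) // 2]
            self.right = Leaf()

class DecisionNode:
    """Search-tree node; one regression tree per tried action."""
    def __init__(self):
        self.n, self.value = 0, 0.0
        self.children = {}                       # action -> RegressionTree

def uct_select(node, c=1.4):
    # Step (2): UCT over the actions already expanded at this node.
    def score(a):
        t = node.children[a]
        if t.n == 0:
            return float("inf")
        return t.value + c * math.sqrt(math.log(node.n + 1) / t.n)
    return max(node.children, key=score)

def simulate(root, env, action_space, max_depth=4, max_rollout=10):
    node, visited, obs_pairs = root, [root], []

    # Steps (1)-(6): selection; store every observation o_t for backpropagation.
    while node.children and len(visited) <= max_depth:
        a = uct_select(node)
        tree = node.children[a]
        o_t, _, _ = env.step(a)                  # Step (3)
        obs_pairs.append((tree, o_t))
        node = tree.leaf(o_t).child              # Step (4)
        visited.append(node)

    # Steps (7)-(9): expand with an action a* and a new single-node regression tree.
    a_star = random.choice(action_space)
    new_tree = node.children.setdefault(a_star, RegressionTree())
    o_star, r, done = env.step(a_star)
    obs_pairs.append((new_tree, o_star))

    # Steps (10)-(11): roll out randomly and accumulate the reward r.
    for _ in range(max_rollout):
        if done:
            break
        _, step_r, done = env.step(random.choice(action_space))
        r += step_r

    # Steps (12)-(15): backpropagate r through decision nodes and regression trees.
    for visited_node in visited:
        visited_node.n += 1
        visited_node.value += (r - visited_node.value) / visited_node.n
    for tree, o_t in obs_pairs:
        tree.update(o_t, r)
        tree.maybe_split()

# Example usage: grow the search tree with a few simulations.
root, env, actions = DecisionNode(), EnvSimulator(), [-1.0, 0.0, 1.0]
for _ in range(200):
    simulate(root, env, actions)
```

In a real implementation the stump would be replaced by a proper incremental regression-tree learner, and the roll-out reward would typically be discounted.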
Example of the first five steps
(Remark: a* is always equal to the action at the node "above" it.)