Algorithm for POMDPs with continuous observations
- Steps colored in blue are different from standard TLS
- Observations need to be obtained from the environment simulator and stored while traversing the tree because they are needed to update the regression trees in the backpropagation step
- Using the observation received in the final state t = max_roll_out_depth of the roll-out is not correct, because the observation at time t is influenced by the action and the state at time t, not by the action and the state at time t = max_roll_out_depth (a short bookkeeping sketch follows below)
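As a small illustration of this bookkeeping, the following sketch stores one observation per tree depth and pairs it with that depth during backpropagation. `env_step` and the action values are hypothetical placeholders for the environment simulator, not part of the original post:

```python
import random

# Hypothetical stand-in for a call to the environment simulator; only the
# interface (action in, continuous observation out) matters here.
def env_step(action):
    return random.gauss(action, 1.0)

selected_actions = [0.2, 1.5, -0.7]   # actions chosen while descending the tree

# Store every observation as it is received: O = (o_1, o_2, ..., o_t).
observations = [env_step(a) for a in selected_actions]

# In backpropagation, the regression tree traversed at depth t is updated with
# the pair (observations[t-1], r); the observation from the final roll-out step
# must not be reused for earlier depths.
```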
1. Start at the root.
2. If there are children below this node and the max. search depth is not reached, select one according to some selection strategy, e.g., UCT. Otherwise, continue with Step (6).
3. Apply the action corresponding to the selected child to receive an observation o_t from the environment simulator and store o_t in a list O = (o_1, o_2, ..., o_t).
4. Use o_t to traverse the regression tree until you reach a leaf.
5. Repeat Steps (2)-(4).
6. Return that leaf l.
7. Choose an action a* at leaf l.
8. Connect l to a new regression tree with only one node (the root).
9. Apply the selected action a* to receive an observation o* from the environment simulator and store o* in O = (o_1, o_2, ..., o*).
10. Play randomly or according to some other roll-out strategy until the max. roll-out depth or a terminal state is reached.
11. Receive a reward r.
12. Use r to update all encountered nodes up to the root of the tree.
13. Use (o*, r) to update the statistics of the leaves of the regression tree corresponding to action a*.
14. Use (o_t, r) to update the statistics of the leaves of the regression tree encountered at time step t, on the way back up to the root (for t = 1, 2, ...).
15. Split the leaves of a regression tree if necessary. (A Python sketch of one complete simulation, covering all of these steps, follows below.)
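The following is a compact, self-contained Python sketch of one simulation following Steps (1)-(15). It assumes a generative environment simulator and uses a deliberately crude regression tree (a stump that splits once at the median observation); all names (EnvSimulator, RegressionTree, DecisionNode, simulate) and the splitting rule are illustrative assumptions, not taken from the original post:

```python
import math
import random

class EnvSimulator:
    """Hypothetical generative simulator: step(action) -> (observation, reward, done)."""
    def step(self, action):
        obs = random.gauss(action, 1.0)          # continuous observation
        return obs, -abs(obs), False

class Leaf:
    """Regression-tree leaf: (observation, reward) samples plus a child decision node."""
    def __init__(self):
        self.samples = []
        self.child = DecisionNode()

class RegressionTree:
    """Minimal regression tree over observations (a stump that splits at most once)."""
    def __init__(self):
        self.n, self.value = 0, 0.0              # aggregate statistics used by UCT
        self.threshold = None
        self.left, self.right = Leaf(), None

    def leaf(self, obs):
        # Step (4): descend using the observation until a leaf is reached.
        if self.threshold is None or obs < self.threshold:
            return self.left
        return self.right

    def update(self, obs, reward):
        # Steps (13)/(14): update leaf statistics with the pair (o_t, r).
        self.n += 1
        self.value += (reward - self.value) / self.n
        self.leaf(obs).samples.append((obs, reward))

    def maybe_split(self):
        # Step (15): illustrative rule -- split at the median observation after 20 samples.
        if self.threshold is None and len(self.left.samples) >= 20:
            obs_values = sorted(o for o, _ in self.left.samples)
            self.threshold = obs_values[len(obs_values) // 2]
            self.right = Leaf()

class DecisionNode:
    """Search-tree node; one regression tree per tried action."""
    def __init__(self):
        self.n, self.value = 0, 0.0
        self.children = {}                       # action -> RegressionTree

def uct_select(node, c=1.4):
    # Step (2): UCT over the actions already expanded at this node.
    def score(a):
        t = node.children[a]
        if t.n == 0:
            return float("inf")
        return t.value + c * math.sqrt(math.log(node.n + 1) / t.n)
    return max(node.children, key=score)

def simulate(root, env, action_space, max_depth=4, max_rollout=10):
    node, visited, obs_pairs = root, [root], []

    # Steps (1)-(6): selection; store every observation o_t for backpropagation.
    while node.children and len(visited) <= max_depth:
        a = uct_select(node)
        tree = node.children[a]
        o_t, _, _ = env.step(a)                  # Step (3)
        obs_pairs.append((tree, o_t))
        node = tree.leaf(o_t).child              # Step (4)
        visited.append(node)

    # Steps (7)-(9): expand with an action a* and a new single-node regression tree.
    a_star = random.choice(action_space)
    new_tree = node.children.setdefault(a_star, RegressionTree())
    o_star, r, done = env.step(a_star)
    obs_pairs.append((new_tree, o_star))

    # Steps (10)-(11): roll out randomly and accumulate the reward r.
    for _ in range(max_rollout):
        if done:
            break
        _, step_r, done = env.step(random.choice(action_space))
        r += step_r

    # Steps (12)-(15): backpropagate r through decision nodes and regression trees.
    for visited_node in visited:
        visited_node.n += 1
        visited_node.value += (r - visited_node.value) / visited_node.n
    for tree, o_t in obs_pairs:
        tree.update(o_t, r)
        tree.maybe_split()

# Example usage: grow the search tree with a few simulations.
root, env, actions = DecisionNode(), EnvSimulator(), [-1.0, 0.0, 1.0]
for _ in range(200):
    simulate(root, env, actions)
```

In a real implementation the stump would be replaced by a proper incremental regression-tree learner, and the roll-out reward would typically be discounted.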
Example of the first five steps
(Remark: a* is always equal to the action at the node "above" it.)