- Planning
- We reviewed our planning based on the Gantt chart made for the thesis plan
- Update chart if necessary (like in my case)
- We discussed the graduation date and the reviewing and hand-in of the thesis (and of draft versions of it)
- Hand-in of thesis draft in three weeks
- Results
- We elaborated on our current results
- I should make regret plots (optimal - mean) to show differences between the algorithms
- Try to add optimal lines to the grid world plots
- Michael wants to think about a simple environment to show the difference between MCTS and MC
- Other
- We discussed the following "transposition tree":
- Assume the state / observation space to be Markovian
- The algorithm builds a tree consisting of two parts:
- An "observation tree" which provides a discretization of the observation space (or the state space). This tree is conceptually the same as the observation tree in COMCTS
- For each leaf in the observation tree, there is an action node which keeps the UCB knowledge about the actions for the observation range that is represented by the leaf (in case of continuous actions, this action node is replaced by an action tree which provides a discretization of the action space)
- The algorithm starts with one leaf in the observation tree connected to one action node.
- Splits can be decided on a simple criterion such as a count [1] or a more informative one such as the statistics used in the TG algorithm [2]
- Because of the assumption made above, this tree can be re-used after a step in the real world is taken. However, for POMDPs the observation might be dependent on the action taken and this is not considered by this algorithm.
- Complete recall: reconstruct the tree only on the first level
- Keep one list of samples, pointers to the correct depth and to the correct examples
- Backpropagation:
- first update all UCB knowledge (bottom to top)
- after that check for a split (top to bottom)
- this avoids splits at high depths of the tree that would become redundant if there also occurs a split at a lower depth
- Update Gantt chart
- Plotting:
- Regret plots
- Optimal curves for grid world
- Think further about "transposition trees" for POMDPs
- Think further about re-using and sharing trees based on belief state distance
- Adapt Wumpus world to continuous observations
[1] Ali Nouri, Michael L. Littman. Multi-resolution Exploration in Continuous Spaces. Advances in Neural Information Processing Systems 21, Proceedings of the Twenty-Second Annual Conference on Neural Information Processing Systems, Vancouver, British Columbia, Canada, December 8-11, 2008 (December 2008), pp. 1209-1216.
[2] K Driessens, J Ramon, H Blockeel. Speeding up relational reinforcement learning through the use of an incremental first order decision tree learner. Machine Learning: ECML 2001, 97-108.
No comments:
Post a Comment