Saturday, May 26, 2012

Results: Tiger Problem

Experimental Set-up
  • Regret = optimal value - mean value
  • Optimal value = sum of the maximum positive reward achievable in each step (20 for Horizon 2, 100 for Horizon 10)
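
A small Python sketch of how the regret values in the plots below are computed; the per-step maximum reward of 10 is inferred from the stated optima (20/2 = 100/10 = 10) and is an assumption, not a quoted number:

def regret(mean_value, horizon, max_step_reward=10.0):
    # Optimal value = sum of the maximum positive per-step reward over the horizon
    # (20 for Horizon 2, 100 for Horizon 10, assuming a per-step maximum of 10).
    optimal_value = horizon * max_step_reward
    return optimal_value - mean_value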

Results: Regret Plots
Horizon 2

Horizon 2 (zoomed in)

Horizon 10

Horizon 10 (zoomed in)

Friday, May 25, 2012

Minutes: Joint Meeting

Discussion 
  • Planning
    • We reviewed our planning based on the Gantt chart made for the thesis plan
      • Update chart if necessary (like in my case)
    • We discussed the graduation date as well as the reviewing and hand-in of the thesis (and of its draft versions)
    • Hand-in of thesis draft in three weeks
  • Results
    • We elaborated on our current results
    • I should make regret plots (optimal - mean) to show differences between the algorithms
    • Try to add optimal lines to the grid world plots
    • Michael wants to think about a simple environment to show the difference between MCTS and MC
  • Other
    • We discussed the following "transposition tree" (a rough sketch follows after this list):
      • Assume the state / observation space to be Markovian
      • The algorithm builds a tree consisting of two parts:
        • An "observation tree" which provides a discretization of the observation space (or the state space). This tree is conceptually the same as the observation tree in COMCTS
        • For each leaf in the observation tree, there is an action node which keeps the UCB knowledge about the actions for the observation range that is represented by the leaf (in case of continuous actions, this action node is replaced by an action tree which provides a discretization of the action space)
      • The algorithm starts with one leaf in the observation tree connected to one action node.
      • Splits can be decided on a simple criterion such as a count [1] or a more informative one such as the statistics used in the TG algorithm [2]
      • Because of the assumption made above, this tree can be re-used after a step in the real world is taken. However, for POMDPs the observation may depend on the action taken, which this algorithm does not take into account.
    • Complete recall: reconstruct the tree only at the first level
      • Keep one list of samples, with pointers to the correct depth and to the correct examples
    • Backpropagation: 
      • first update all UCB knowledge (bottom to top)
      • after that check for a split (top to bottom)
      • this avoids splits at high depths of the tree that would become redundant if a split also occurs at a lower depth
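
To make the idea concrete, here is a rough, hypothetical Python sketch of such a transposition tree for a one-dimensional continuous observation space and discrete actions. The class and parameter names, the midpoint split rule, and the count threshold are my own illustrative choices, not the implementation discussed in the meeting; a TG-style statistical test [2] could replace the simple count criterion [1]. The split check is kept separate from the statistics update so that, as in the backpropagation order above, all value updates can be backed up first and splits checked top-down afterwards.

import math

class ActionNode:
    """UCB knowledge about the actions for one observation range (leaf)."""
    def __init__(self, n_actions):
        self.counts = [0] * n_actions
        self.means = [0.0] * n_actions

    def select(self, c=1.0):
        for a, n in enumerate(self.counts):
            if n == 0:
                return a  # try every action once before applying UCB
        total = sum(self.counts)
        return max(range(len(self.counts)),
                   key=lambda a: self.means[a] + c * math.sqrt(math.log(total) / self.counts[a]))

    def update(self, action, value):
        self.counts[action] += 1
        self.means[action] += (value - self.means[action]) / self.counts[action]

class ObsNode:
    """Observation-tree node covering [low, high); leaves own an ActionNode."""
    def __init__(self, low, high, n_actions):
        self.low, self.high, self.n_actions = low, high, n_actions
        self.visits = 0
        self.action_node = ActionNode(n_actions)  # only used while this node is a leaf
        self.left = self.right = None

    def leaf_for(self, obs):
        """Descend to the leaf whose range contains obs, counting visits on the way down."""
        node = self
        node.visits += 1
        while node.left is not None:
            node = node.left if obs < (node.low + node.high) / 2.0 else node.right
            node.visits += 1
        return node

    def check_splits(self, min_visits=50):
        """Top-down split check, run only after all UCB updates of a simulation
        have been backed up (mirrors the backpropagation order described above)."""
        if self.left is None:
            if self.visits >= min_visits:  # simple count-based criterion [1]
                mid = (self.low + self.high) / 2.0
                self.left = ObsNode(self.low, mid, self.n_actions)
                self.right = ObsNode(mid, self.high, self.n_actions)
            return
        self.left.check_splits(min_visits)
        self.right.check_splits(min_visits)

# The tree starts as a single leaf connected to one action node and, because the
# observation space is assumed Markovian, can be kept across real-world steps:
# tree = ObsNode(0.0, 1.0, n_actions=3)
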
Planning
  1. Update Gantt chart
  2. Plotting: 
    • Regret plots
    • Optimal curves for grid world
  3. Think further about "transposition trees" for POMDPs
  4. Think further about re-using and sharing trees based on belief state distance
  5. Adapt Wumpus world to continuous observations


[1] Ali Nouri, Michael L. Littman. Multi-resolution Exploration in Continuous Spaces. Advances in Neural Information Processing Systems 21 (NIPS 2008), pp. 1209-1216.

[2] K. Driessens, J. Ramon, H. Blockeel. Speeding up Relational Reinforcement Learning through the Use of an Incremental First Order Decision Tree Learner. Machine Learning: ECML 2001, pp. 97-108.

Monday, May 21, 2012

Updated Results

Progress
  • Implemented complete recall for COMCTS (see the sketch after this list):
    • Stores all examples (observation(t), reward(t)) together with their "future" {<action(t+1), observation(t+1), reward(t+1)>, <action(t+2), observation(t+2), reward(t+2)>, ...}
    • Uses this knowledge to reconstruct the tree after a split
  • Implemented a 4x3 grid world:

    • Actions: up, down, left, right
    • Observations: the number of adjacent walls, corrupted by normally distributed noise with standard deviation 0.965 (a sketch of this observation model follows after this list)
    • Rewards: +1 for goal state, -1 for penalty state, movement costs -0.04
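
As a purely illustrative picture of the data kept for complete recall, a small Python sketch of one stored example; the class name and the discounted-return helper are my own, not part of the implementation:

from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Example:
    """One stored example: (observation(t), reward(t)) plus its full future trajectory."""
    observation: float
    reward: float
    # "future": [(action(t+1), observation(t+1), reward(t+1)), ...]
    future: List[Tuple[int, float, float]] = field(default_factory=list)

    def discounted_return(self, gamma=0.5):
        """Discounted return of this example, usable when rebuilding node statistics after a split."""
        ret = self.reward
        for k, (_, _, r) in enumerate(self.future, start=1):
            ret += (gamma ** k) * r
        return ret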
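
And a tiny sketch of the grid-world observation model as described above. Details not stated here are assumptions: only the grid border is counted as a wall (whether internal obstacles also count is not specified), and the coordinate convention is my own:

import random

def observe(x, y, width=4, height=3, noise_std=0.965):
    """Noisy observation: the number of adjacent walls, plus zero-mean Gaussian noise."""
    walls = 0
    for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nx, ny = x + dx, y + dy
        if not (0 <= nx < width and 0 <= ny < height):
            walls += 1  # stepping off the 4x3 grid counts as an adjacent wall
    return walls + random.gauss(0.0, noise_std)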
 

Tiger Problem: Horizon 2

Tiger Problem: Horizon 10

Grid World: max. 22 steps


Wednesday, May 16, 2012

Feedback 16-05-12

Progress
  • Implemented a heuristic and a magic agent for the Tiger Problem (see below)
  • Changed the action choice in COMCTS to a more randomized selection, both in the final action selection for the step in the "real world" and in the selection step of the simulation
    • This made the algorithm more stable and removed many of the dips in the plots (see below)
Experimental Set-up
  • Mean value is taken over:
    • 10,000 runs for Horizon 2
    • 5,000 runs for Horizon 10
    • 1,000 runs for Horizon 100
  • Error bars indicate standard error
  • Environment is continuing
  • Execution is stopped after Horizon=k actions
  • Discount factor = 0.5, discount horizon = 0.000001
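
A small sketch, under the assumption that the "discount horizon" above is a threshold on the accumulated discount γ^t, so roll-outs are cut off once γ^t falls below it (in the spirit of the ε-accurate horizon time of Kearns et al. cited below):

import math

def cutoff_horizon(gamma, threshold):
    """Smallest H such that gamma**H < threshold (assumed reading of "discount horizon")."""
    return math.ceil(math.log(threshold) / math.log(gamma))

print(cutoff_horizon(0.5, 0.000001))  # -> 20 steps for the set-up above
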
Algorithms
  • Random: computed by hand for Horizon 2; computed with an agent that selects actions randomly for Horizon 10 and 100
  • MC: Monte Carlo with uniformly sampled actions
  • COMCTS with deletion: deletes everything below a node which is split in the regression tree
  • COMCTS with local recall (sketched after this list)
    • stores all examples
    • when a split occurs at a node w and two new children l and r are created, the values stored at the winning test for the split point at w are given to l and r, so that l and r start with some mean and count
    • the stored examples are used to create new tests (for split points) at l and r
  • MC and COMCTS both use ε-accurate horizon time [Michael J. Kearns, Yishay Mansour, Andrew Y. Ng: A Sparse Sampling Algorithm for Near-Optimal Planning in Large Markov Decision Processes. Machine Learning 49(2-3): 193-208 (2002)]
  • Heuristic: listens until its belief is sufficiently certain (b(s) < 0.1 or b(s) > 0.9) and then opens one of the doors
  • Magic: can magically observe the real state of the environment but waits until its belief is sufficiently certain (b(s) < 0.1 or b(s) > 0.9) and then opens one of the doors
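
A rough Python sketch of the local-recall split handling described above, for a one-dimensional observation range; the class and method names and the incremental-mean bookkeeping are my own illustrative choices rather than the actual implementation:

class LocalRecallNode:
    def __init__(self, low, high):
        self.low, self.high = low, high
        self.count, self.mean = 0, 0.0   # value statistics of this node
        self.examples = []               # stored (observation, discounted return) pairs
        self.tests = {}                  # split point -> ([count, mean] left of it, [count, mean] right of it)
        self.left = self.right = None

    def _update_test(self, point, obs, ret):
        left, right = self.tests.setdefault(point, ([0, 0.0], [0, 0.0]))
        stats = left if obs < point else right
        stats[0] += 1
        stats[1] += (ret - stats[1]) / stats[0]

    def add_example(self, obs, ret, candidate_points):
        self.examples.append((obs, ret))
        self.count += 1
        self.mean += (ret - self.mean) / self.count
        for p in candidate_points:
            self._update_test(p, obs, ret)

    def split(self, winning_point, candidate_points):
        left_stats, right_stats = self.tests[winning_point]
        self.left = LocalRecallNode(self.low, winning_point)
        self.right = LocalRecallNode(winning_point, self.high)
        # l and r start with the mean and count of the winning test at w ...
        self.left.count, self.left.mean = left_stats
        self.right.count, self.right.mean = right_stats
        # ... and the stored examples are used (only) to create new tests at l and r.
        for obs, ret in self.examples:
            child = self.left if obs < winning_point else self.right
            child.examples.append((obs, ret))
            for p in candidate_points:
                child._update_test(p, obs, ret)
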
Results
  • Horizon 2:

  • Horizon 10:

  • Horizon 100:

Wednesday, May 9, 2012

Updated Results: Time vs Roll-outs

Experimental Set-up
  • Mean is taken over 100 episodes
  • Number of roll-outs is measured for the first time step
  • Discount horizon is 0.000001
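
A tiny sketch of how the roll-out counts in these plots can be obtained: give the planner a wall-clock budget for the first time step and count how many roll-outs complete within it. The do_rollout callback stands in for one simulation of whichever planner is being measured and is an assumption for illustration:

import time

def rollouts_within_budget(do_rollout, budget_ms):
    """Count how many roll-outs finish within a wall-clock budget (in milliseconds)."""
    deadline = time.perf_counter() + budget_ms / 1000.0
    n = 0
    while time.perf_counter() < deadline:
        do_rollout()
        n += 1
    return n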

Results

1 millisecond to 1000 milliseconds

1 millisecond to 1000 milliseconds (x-axis: log scale)

Tuesday, May 8, 2012

Updated Results: Number of Roll-outs vs Mean Value (cont'd)

Experimental Set-up
  • Mean value is taken over 5,000 runs
  • Everything else follows the same set-up as the previous results
Results
  • Horizon 100:
    First 200 samples



First 200 samples (x-axis: log scale)

Updated Results: Number of Roll-outs vs Mean Value

Experimental Set-up
  • Mean value is taken over:
    • 10,000 runs for Horizon 2
    • 5,000 runs for Horizon 10
  • Error bars indicate standard error
  • Environment is continuing
  • Execution is stopped after Horizon=k actions
  • Discount factor = 0.5, discount horizon = 0.000001
Algorithms
  • Random: computed by hand for Horizon 2; computed with an agent that selects actions randomly for Horizon 10
  • Optimal: represents the MDP solution (assuming that the correct action is always taken and assuming a perfect observation model)
  • COMCTS with local recall: stores all examples and uses them to rebuild the regression tree
  • MC and COMCTS both use ε-accurate horizon time [Michael J. Kearns, Yishay Mansour, Andrew Y. Ng: A Sparse Sampling Algorithm for Near-Optimal Planning in Large Markov Decision Processes. Machine Learning 49(2-3): 193-208 (2002)]
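
For reference, a minimal Python sketch of what the MC baseline ("Monte Carlo with uniformly sampled actions") does at each step: candidate first actions share the roll-out budget evenly, later actions are sampled uniformly, returns are discounted, and roll-outs are cut off at the ε-accurate horizon. The simulator callback (state, action -> next state, observation, reward) and all names are assumptions for illustration, not the thesis implementation:

import random

def mc_plan(simulator, state, actions, n_rollouts, gamma=0.5, h_eps=20):
    """Pick the first action with the highest mean discounted return over uniform roll-outs."""
    means = {a: 0.0 for a in actions}
    counts = {a: 0 for a in actions}
    for i in range(n_rollouts):
        first = actions[i % len(actions)]   # spread the roll-out budget evenly over first actions
        s, a = state, first
        ret, discount = 0.0, 1.0
        for _ in range(h_eps):              # epsilon-accurate horizon cut-off
            s, _, r = simulator(s, a)
            ret += discount * r
            discount *= gamma
            a = random.choice(actions)      # uniformly sampled actions thereafter
        counts[first] += 1
        means[first] += (ret - means[first]) / counts[first]
    return max(actions, key=lambda a: means[a])
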
Results
  • Horizon 2:
First 1000 samples

First 1000 samples (x-axis in log scale)

First 200 samples

First 200 samples (x-axis in log scale)

  • Horizon 10:
First 200 Samples

First 200 Samples (x-axis in log scale)

First 500 Samples

First 500 Samples (x-axis in log scale)

Wednesday, May 2, 2012

Results for Joint Meeting

Experimental Set-up
  • Mean value is taken over 1,000 episodes
  • Error bars indicate standard error
  • Environment is continuing
  • Execution is stopped after Horizon=k actions
  • Discount factor = 0.5, discount horizon = 0.01
Results
Horizon 10 without error bars

Horizon 10 with error bars

Horizon 2 without error bars

Horizon 2 with error bars