Saturday, May 26, 2012

Results: Tiger Problem

Experimental Set-up
  • Regret = optimal value - mean value
  • Optimal value = sum of the maximum positive reward achievable in each step (20 for Horizon 2, 100 for Horizon 10)
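
A small Python sketch of how the regret values in the plots below are computed; the per-step maximum reward of 10 is inferred from the stated optima (20/2 = 100/10 = 10) and is an assumption, not a quoted number:

def regret(mean_value, horizon, max_step_reward=10.0):
    # Optimal value = sum of the maximum positive per-step reward over the horizon
    # (20 for Horizon 2, 100 for Horizon 10, assuming a per-step maximum of 10).
    optimal_value = horizon * max_step_reward
    return optimal_value - mean_value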

Results: Regret Plots
Horizon 2

Horizon 2 (zoomed in)

Horizon 10

Horizon 10 (zoomed in)

Friday, May 25, 2012

Minutes: Joint Meeting

Discussion 
  • Planning
    • We reviewed our planning based on the Gantt chart made for the thesis plan
      • Update chart if necessary (like in my case)
    • We discussed the graduation date as well as the reviewing and hand-in of the thesis (and of its draft versions)
    • Hand-in of thesis draft in three weeks
  • Results
    • We elaborated on our current results
    • I should make regret plots (optimal - mean) to show differences between the algorithms
    • Try to add optimal lines to the grid world plots
    • Michael wants to think about a simple environment to show the difference between MCTS and MC
  • Other
    • We discussed the following "transposition tree" (a rough sketch follows after this list):
      • Assume the state / observation space to be Markovian
      • The algorithm builds a tree consisting of two parts:
        • An "observation tree" which provides a discretization of the observation space (or the state space). This tree is conceptually the same as the observation tree in COMCTS
        • For each leaf in the observation tree, there is an action node which keeps the UCB knowledge about the actions for the observation range that is represented by the leaf (in case of continuous actions, this action node is replaced by an action tree which provides a discretization of the action space)
      • The algorithm starts with one leaf in the observation tree connected to one action node.
      • Splits can be decided on a simple criterion such as a count [1] or a more informative one such as the statistics used in the TG algorithm [2]
      • Because of the assumption made above, this tree can be re-used after a step in the real world is taken. However, for POMDPs the observation may depend on the action taken, which this algorithm does not take into account.
    • Complete recall: reconstruct the tree only at the first level
      • Keep one list of samples, with pointers to the correct depth and to the correct examples
    • Backpropagation: 
      • first update all UCB knowledge (bottom to top)
      • after that check for a split (top to bottom)
      • this avoids splits at high depths of the tree that would become redundant if a split also occurs at a lower depth
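
To make the idea concrete, here is a rough, hypothetical Python sketch of such a transposition tree for a one-dimensional continuous observation space and discrete actions. The class and parameter names, the midpoint split rule, and the count threshold are my own illustrative choices, not the implementation discussed in the meeting; a TG-style statistical test [2] could replace the simple count criterion [1]. The split check is kept separate from the statistics update so that, as in the backpropagation order above, all value updates can be backed up first and splits checked top-down afterwards.

import math

class ActionNode:
    """UCB knowledge about the actions for one observation range (leaf)."""
    def __init__(self, n_actions):
        self.counts = [0] * n_actions
        self.means = [0.0] * n_actions

    def select(self, c=1.0):
        for a, n in enumerate(self.counts):
            if n == 0:
                return a  # try every action once before applying UCB
        total = sum(self.counts)
        return max(range(len(self.counts)),
                   key=lambda a: self.means[a] + c * math.sqrt(math.log(total) / self.counts[a]))

    def update(self, action, value):
        self.counts[action] += 1
        self.means[action] += (value - self.means[action]) / self.counts[action]

class ObsNode:
    """Observation-tree node covering [low, high); leaves own an ActionNode."""
    def __init__(self, low, high, n_actions):
        self.low, self.high, self.n_actions = low, high, n_actions
        self.visits = 0
        self.action_node = ActionNode(n_actions)  # only used while this node is a leaf
        self.left = self.right = None

    def leaf_for(self, obs):
        """Descend to the leaf whose range contains obs, counting visits on the way down."""
        node = self
        node.visits += 1
        while node.left is not None:
            node = node.left if obs < (node.low + node.high) / 2.0 else node.right
            node.visits += 1
        return node

    def check_splits(self, min_visits=50):
        """Top-down split check, run only after all UCB updates of a simulation
        have been backed up (mirrors the backpropagation order described above)."""
        if self.left is None:
            if self.visits >= min_visits:  # simple count-based criterion [1]
                mid = (self.low + self.high) / 2.0
                self.left = ObsNode(self.low, mid, self.n_actions)
                self.right = ObsNode(mid, self.high, self.n_actions)
            return
        self.left.check_splits(min_visits)
        self.right.check_splits(min_visits)

# The tree starts as a single leaf connected to one action node and, because the
# observation space is assumed Markovian, can be kept across real-world steps:
# tree = ObsNode(0.0, 1.0, n_actions=3)
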
Planning
  1. Update Gantt chart
  2. Plotting: 
    • Regret plots
    • Optimal curves for grid world
  3. Think further about "transposition trees" for POMDPs
  4. Think further about re-using and sharing trees based on belief state distance
  5. Adapt Wumpus world to continuous observations


[1] Ali Nouri, Michael L. Littman. Multi-resolution Exploration in Continuous Spaces. Advances in Neural Information Processing Systems 21 (NIPS 2008), pp. 1209-1216.

[2] K. Driessens, J. Ramon, H. Blockeel. Speeding up Relational Reinforcement Learning through the Use of an Incremental First Order Decision Tree Learner. Machine Learning: ECML 2001, pp. 97-108.

Monday, May 21, 2012

Updated Results

Progress
  • Implemented complete recall for COMCTS (see the sketch after this list):
    • Stores all examples (observation(t), reward(t)) together with their "future" {<action(t+1), observation(t+1), reward(t+1)>, <action(t+2), observation(t+2), reward(t+2)>, ...}
    • Uses this knowledge to reconstruct the tree after a split
  • Implemented a 4x3 grid world:

    • Actions: up, down, left, right
    • Observations: the number of adjacent walls, corrupted by normally distributed noise with standard deviation 0.965 (a sketch of this observation model follows after this list)
    • Rewards: +1 for goal state, -1 for penalty state, movement costs -0.04
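
As a purely illustrative picture of the data kept for complete recall, a small Python sketch of one stored example; the class name and the discounted-return helper are my own, not part of the implementation:

from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Example:
    """One stored example: (observation(t), reward(t)) plus its full future trajectory."""
    observation: float
    reward: float
    # "future": [(action(t+1), observation(t+1), reward(t+1)), ...]
    future: List[Tuple[int, float, float]] = field(default_factory=list)

    def discounted_return(self, gamma=0.5):
        """Discounted return of this example, usable when rebuilding node statistics after a split."""
        ret = self.reward
        for k, (_, _, r) in enumerate(self.future, start=1):
            ret += (gamma ** k) * r
        return ret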
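
And a tiny sketch of the grid-world observation model as described above. Details not stated here are assumptions: only the grid border is counted as a wall (whether internal obstacles also count is not specified), and the coordinate convention is my own:

import random

def observe(x, y, width=4, height=3, noise_std=0.965):
    """Noisy observation: the number of adjacent walls, plus zero-mean Gaussian noise."""
    walls = 0
    for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nx, ny = x + dx, y + dy
        if not (0 <= nx < width and 0 <= ny < height):
            walls += 1  # stepping off the 4x3 grid counts as an adjacent wall
    return walls + random.gauss(0.0, noise_std)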
 

Tiger Problem: Horizon 2

Tiger Problem: Horizon 10

Grid World: max. 22 steps


Wednesday, May 16, 2012

Feedback 16-05-12

Progress
  • Implemented a heuristic and a magic agent for the Tiger Problem (see below)
  • Changed the action choice in COMCTS to a more randomized selection, both in the final action selection for the step in the "real world" and in the selection step of the simulation
    • This made the algorithm more stable and removed many of the dips in the plots (see below)
Experimental Set-up
  • Mean value is taken over:
    • 10,000 runs for Horizon 2
    • 5,000 runs for Horizon 10
    • 1,000 runs for Horizon 100
  • Error bars indicate standard error
  • Environment is continuing
  • Execution is stopped after Horizon=k actions
  • Discount factor = 0.5, discount horizon = 0.000001
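
A small sketch, under the assumption that the "discount horizon" above is a threshold on the accumulated discount γ^t, so roll-outs are cut off once γ^t falls below it (in the spirit of the ε-accurate horizon time of Kearns et al. cited below):

import math

def cutoff_horizon(gamma, threshold):
    """Smallest H such that gamma**H < threshold (assumed reading of "discount horizon")."""
    return math.ceil(math.log(threshold) / math.log(gamma))

print(cutoff_horizon(0.5, 0.000001))  # -> 20 steps for the set-up above
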
Algorithms
  • Random: computed by hand for Horizon 2; computed with an agent that selects actions randomly for Horizon 10 and 100
  • MC: Monte Carlo with uniformly sampled actions
  • COMCTS with deletion: deletes everything below a node which is split in the regression tree
  • COMCTS with local recall (sketched after this list)
    • stores all examples
    • when a split occurs at a node w and two new children l and r are created, the values stored at the winning test for the split point at w are given to l and r, so that l and r start with some mean and count
    • the stored examples are used to create new tests (for split points) at l and r
  • MC and COMCTS both use ε-accurate horizon time [Michael J. Kearns, Yishay Mansour, Andrew Y. Ng: A Sparse Sampling Algorithm for Near-Optimal Planning in Large Markov Decision Processes. Machine Learning 49(2-3): 193-208 (2002)]
  • Heuristic: listens until its belief is sufficiently certain (b(s) < 0.1 or b(s) > 0.9) and then opens one of the doors
  • Magic: can magically observe the real state of the environment but waits until its belief is sufficiently certain (b(s) < 0.1 or b(s) > 0.9) and then opens one of the doors
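
A rough Python sketch of the local-recall split handling described above, for a one-dimensional observation range; the class and method names and the incremental-mean bookkeeping are my own illustrative choices rather than the actual implementation:

class LocalRecallNode:
    def __init__(self, low, high):
        self.low, self.high = low, high
        self.count, self.mean = 0, 0.0   # value statistics of this node
        self.examples = []               # stored (observation, discounted return) pairs
        self.tests = {}                  # split point -> ([count, mean] left of it, [count, mean] right of it)
        self.left = self.right = None

    def _update_test(self, point, obs, ret):
        left, right = self.tests.setdefault(point, ([0, 0.0], [0, 0.0]))
        stats = left if obs < point else right
        stats[0] += 1
        stats[1] += (ret - stats[1]) / stats[0]

    def add_example(self, obs, ret, candidate_points):
        self.examples.append((obs, ret))
        self.count += 1
        self.mean += (ret - self.mean) / self.count
        for p in candidate_points:
            self._update_test(p, obs, ret)

    def split(self, winning_point, candidate_points):
        left_stats, right_stats = self.tests[winning_point]
        self.left = LocalRecallNode(self.low, winning_point)
        self.right = LocalRecallNode(winning_point, self.high)
        # l and r start with the mean and count of the winning test at w ...
        self.left.count, self.left.mean = left_stats
        self.right.count, self.right.mean = right_stats
        # ... and the stored examples are used (only) to create new tests at l and r.
        for obs, ret in self.examples:
            child = self.left if obs < winning_point else self.right
            child.examples.append((obs, ret))
            for p in candidate_points:
                child._update_test(p, obs, ret)
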
Results
  • Horizon 2:

  • Horizon 10:

  • Horizon 100:

Wednesday, May 9, 2012

Updated Results: Time vs Roll-outs

Experimental Set-up
  • Mean is taken over 100 episodes
  • Number of roll-outs is measured for the first time step
  • Discount horizon is 0.000001
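
A tiny sketch of how the roll-out counts in these plots can be obtained: give the planner a wall-clock budget for the first time step and count how many roll-outs complete within it. The do_rollout callback stands in for one simulation of whichever planner is being measured and is an assumption for illustration:

import time

def rollouts_within_budget(do_rollout, budget_ms):
    """Count how many roll-outs finish within a wall-clock budget (in milliseconds)."""
    deadline = time.perf_counter() + budget_ms / 1000.0
    n = 0
    while time.perf_counter() < deadline:
        do_rollout()
        n += 1
    return n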

Results

1 millisecond to 1000 milliseconds

1 millisecond to 1000 milliseconds (x-axis: log scale)

Tuesday, May 8, 2012

Updated Results: Number of Roll-outs vs Mean Value (cont'd)

Experimental Set-up
  • Mean value is taken over 5,000 runs
  • Everything else follows the same set-up as the previous results
Results
  • Horizon 100:
    First 200 samples



First 200 samples (x-axis: log scale)

Updated Results: Number of Roll-outs vs Mean Value

Experimental Set-up
  • Mean value is taken over:
    • 10,000 runs for Horizon 2
    • 5,000 runs for Horizon 10
  • Error bars indicate standard error
  • Environment is continuing
  • Execution is stopped after Horizon=k actions
  • Discount factor = 0.5, discount horizon = 0.000001
Algorithms
  • Random: computed by hand for Horizon 2; computed with an agent that selects actions randomly for Horizon 10
  • Optimal: represents the MDP solution (assuming that the correct action is always taken and assuming a perfect observation model)
  • COMCTS with local recall: stores all examples and uses them to rebuild the regression tree
  • MC and COMCTS both use ε-accurate horizon time [Michael J. Kearns, Yishay Mansour, Andrew Y. Ng: A Sparse Sampling Algorithm for Near-Optimal Planning in Large Markov Decision Processes. Machine Learning 49(2-3): 193-208 (2002)]
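
For reference, a minimal Python sketch of what the MC baseline ("Monte Carlo with uniformly sampled actions") does at each step: candidate first actions share the roll-out budget evenly, later actions are sampled uniformly, returns are discounted, and roll-outs are cut off at the ε-accurate horizon. The simulator callback (state, action -> next state, observation, reward) and all names are assumptions for illustration, not the thesis implementation:

import random

def mc_plan(simulator, state, actions, n_rollouts, gamma=0.5, h_eps=20):
    """Pick the first action with the highest mean discounted return over uniform roll-outs."""
    means = {a: 0.0 for a in actions}
    counts = {a: 0 for a in actions}
    for i in range(n_rollouts):
        first = actions[i % len(actions)]   # spread the roll-out budget evenly over first actions
        s, a = state, first
        ret, discount = 0.0, 1.0
        for _ in range(h_eps):              # epsilon-accurate horizon cut-off
            s, _, r = simulator(s, a)
            ret += discount * r
            discount *= gamma
            a = random.choice(actions)      # uniformly sampled actions thereafter
        counts[first] += 1
        means[first] += (ret - means[first]) / counts[first]
    return max(actions, key=lambda a: means[a])
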
Results
  • Horizon 2:
First 1000 samples

First 1000 samples (x-axis in log scale)

First 200 samples

First 200 samples (x-axis in log scale)

  • Horizon 10:
First 200 Samples

First 200 Samples (x-axis in log scale)

First 500 Samples

First 500 Samples (x-axis in log scale)

Wednesday, May 2, 2012

Results for Joint Meeting

Experimental Set-up
  • Mean value is taken over 1,000 episodes
  • Error bars indicate standard error
  • Environment is continuing
  • Execution is stopped after Horizon=k actions
  • Discount factor = 0.5, discount horizon = 0.01
Results
Horizon 10 without error bars

Horizon 10 with error bars

Horizon 2 without error bars

Horizon 2 with error bars