Wednesday, July 11, 2012

Feedback 11-07-2012

Experimental Set-up
  • Environment: Infinite Horizon Tiger Problem
  • Discount factor: 0.5
  • Discount horizon: 0.00001
  • Number of episodes: 5000
  • Number of steps: 2
  • Algorithms:
    • Both Keep 1st Node variants use perfect recall as the strategy for re-using knowledge after a split
    • Keep 1st Node + Update also updates the "first node" during backpropagation
  • Regret plots: the mean value achieved by the magic guesser is taken as the optimum (sketched below)
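A minimal sketch of how such a regret curve can be computed (names are assumptions; magic_mean stands for the magic guesser's mean value, taken as the optimum):

    import numpy as np

    def regret_curve(mean_values, magic_mean):
        # Regret per number of roll-outs: the optimum (the magic guesser's
        # mean value) minus the mean value achieved by the algorithm.
        return magic_mean - np.asarray(mean_values)

    # Plotted with a log-scaled y-axis, as agreed in the 10-07-2012 meeting.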
Results

Regret: Variations of COMCTS

Regret: Variations of TT

Mean: Variations of COMCTS

Mean: Variations of TT

Tuesday, July 10, 2012

Meeting 10-07-2012

Discussion
  • Thesis:
    • Structure stays as it is (some sections may still be moved later on)
    • Add research question about computational complexity of both algorithms
  • Transposition tree: recreate split tests from stored samples if perfect recall is used (but otherwise do not copy them)
  • Plots: 
      • Include data for up to 10^3 roll-outs
    • Regret: change y-axis to log scale
    • Thesis: 
      • Separate plots for the variations of each algorithm
      • Plots to compare best variants
      • Plots for variations with similar curves
  • Transposition tree: related work is function approximation / generalization with decision and regression trees in reinforcement learning
  • Transposition tree for light-dark domain:
    • Changes to light-dark domain:
      • Continuous state space with a square region for the goal
      • Discrete actions, corrupted by some Gaussian noise
    • Belief space represented as a multivariate Gaussian centered around the agent's actual location (x,y) with three parameters: mean(x), mean(y), standard deviation
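A minimal sketch of this belief representation (class and method names are assumptions, not the thesis code):

    import numpy as np

    class GaussianBelief:
        """Belief over the agent's (x, y) location: an isotropic bivariate
        Gaussian with three parameters (mean(x), mean(y), standard deviation)."""
        def __init__(self, mean_x, mean_y, std):
            self.mean_x = mean_x
            self.mean_y = mean_y
            self.std = std

        def pdf(self, x, y):
            # Isotropic bivariate normal density (zero correlation,
            # identical standard deviation in x and y).
            d2 = (x - self.mean_x) ** 2 + (y - self.mean_y) ** 2
            return np.exp(-d2 / (2 * self.std ** 2)) / (2 * np.pi * self.std ** 2)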
Planning
  • Hand-in of chapters 2 and 3 this Friday July 13
  • Next meeting: Monday July 15

Monday, July 9, 2012

Feedback 09-07-2012

Progress
  • Transposition Tree: Implemented not splitting the first node 
    • The agent's initial belief is represented by an additional node
    • This "first node" is updated separately from all other nodes
    • A parameter allows the first node to also be updated during backpropagation when the belief falls within its range (see the sketch below)
  • Transposition Tree: Corrected the way an action is selected at the end of a simulation
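A minimal sketch of the first-node handling (all names are assumptions, not the thesis code):

    class Node:
        """Node of the transposition tree covering a 1-D belief range."""
        def __init__(self, lo, hi):
            self.lo, self.hi = lo, hi
            self.mean = {}   # per-action mean discounted reward
            self.count = {}  # per-action sample count

        def contains(self, belief):
            return self.lo <= belief <= self.hi

        def update(self, action, reward):
            n = self.count.get(action, 0)
            m = self.mean.get(action, 0.0)
            self.count[action] = n + 1
            self.mean[action] = m + (reward - m) / (n + 1)

    class FirstNodeTree:
        def __init__(self, update_first_node=False):
            # Additional node for the agent's initial belief; it is never
            # split and is updated separately from all other nodes.
            self.first_node = Node(0.0, 1.0)
            self.update_first_node = update_first_node

        def backpropagate(self, leaf, belief, action, reward):
            leaf.update(action, reward)
            # Optional: also update the first node during backpropagation
            # if the belief falls in its range.
            if self.update_first_node and self.first_node.contains(belief):
                self.first_node.update(action, reward)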
Experimental Set-up
  • Environment: Infinite Horizon Tiger Problem
  • Discount factor: 0.5
  • Discount horizon: 0.00001 
    • (with this setting, one roll-out means 25 updates to the tree)
  • Number of episodes: 5000
  • Number of steps: 2
  • Algorithms:
    • Both Keep 1st Node variants use Perfect Recall as the strategy for re-using knowledge after a split
    • Keep 1st Node + Update also updates the "first node" during backpropagation 
Results


Saturday, July 7, 2012

Feedback 07-07-2012

Progress
  • Wrote some new sections for the thesis
  • Changed the structure:

Friday, July 6, 2012

Feedback 06-07-2012

Progress
  • Implemented perfect recall for the transposition tree
    • Remembers all samples (belief, action, discounted reward)
    • In case of a split, uses the stored samples to recreate the mean and the count of the action nodes below the new leaves (see the sketch after this list)
    • Space for the samples is pre-allocated: 
      • #samples = #roll-outs * epsilon-horizon
  • All three variants of the transposition tree (deletion, insertion, perfect recall) drop in performance after the first split(s); in the plot below, there is a recurring dip at the beginning of each curve
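A minimal sketch of the perfect-recall bookkeeping (array layout and names are assumptions):

    import numpy as np

    class PerfectRecall:
        """Remembers every (belief, action, discounted reward) sample so the
        statistics of the action nodes below new leaves can be rebuilt
        exactly after a split."""
        def __init__(self, n_rollouts, epsilon_horizon):
            # Pre-allocate: #samples = #roll-outs * epsilon-horizon.
            n = n_rollouts * epsilon_horizon
            self.beliefs = np.empty(n)
            self.actions = np.empty(n, dtype=int)
            self.rewards = np.empty(n)
            self.size = 0

        def add(self, belief, action, reward):
            i = self.size
            self.beliefs[i] = belief
            self.actions[i] = action
            self.rewards[i] = reward
            self.size += 1

        def recreate_stats(self, lo, hi, action):
            # Mean and count for one action node of a new leaf covering
            # the belief range [lo, hi).
            n = self.size
            mask = ((self.beliefs[:n] >= lo) & (self.beliefs[:n] < hi)
                    & (self.actions[:n] == action))
            rewards = self.rewards[:n][mask]
            return (rewards.mean() if rewards.size else 0.0), int(rewards.size)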
Experimental Set-up
  • Environment: Infinite Horizon Tiger Problem
  • Discount factor: 0.5
  • Discount horizon: 0.00001
  • Number of episodes: 5000
  • Number of steps: 2

Preliminary Results
First 100 samples

Thursday, July 5, 2012

Feedback 05-07-2012

Progress
  • Implemented the transposition tree
    • Starts from a single leaf node which represents the complete belief space ( [0,1] in the Tiger problem )
    • Uses an F-test for splitting in the belief space
    • Collects test statistics for each action for the two "sides" of each test
    • Splits if any action shows a significant difference between the two sides (e.g., between action 0 on the "left" side and action 0 on the "right" side); see the sketch after this list
  • Re-use of knowledge after a split:
    • Deletion (implemented)
    • Insertion of information from winning test into new action nodes (implemented)
    • Insert split tests into new leaves (not implemented)
    • Other strategies (like perfect recall)?
  • Problems:
    • All of the following problems relate to the agent's true belief in the real environment at the current time step; they occur for both deletion and insertion
    • If a split occurs near the end of an episode at a leaf whose range includes the agent's true belief, the algorithm cannot gather enough new data in that range to estimate the expected value of each action reliably
    • Splitting sometimes creates a leaf with a large range, e.g., [0.5000, 0.9170]. If the agent's true belief is 0.91, the expected values of the actions are again unreliable, because the leaf's statistics average over its whole range
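A minimal sketch of the split test described above, using SciPy's one-way F-test (the thesis implementation may differ in details such as the candidate split points and the significance level):

    from scipy.stats import f_oneway

    def should_split(samples, split_point, alpha=0.05):
        """samples: iterable of (belief, action, discounted_reward) collected
        at a leaf. Partitions them at split_point and tests, per action,
        whether the returns on the 'left' side differ significantly from
        those on the 'right' side."""
        for a in {action for _, action, _ in samples}:
            left = [r for b, act, r in samples if act == a and b < split_point]
            right = [r for b, act, r in samples if act == a and b >= split_point]
            if len(left) < 2 or len(right) < 2:
                continue  # too little data to test this action
            _, p = f_oneway(left, right)
            if p < alpha:
                return True  # significant difference for some action
        return False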
Planning
  • Implement other strategies
  • Implement not splitting the first node

Tuesday, July 3, 2012

Feedback 03-07-2012

Progress
  • Corrected and improved the implementation of the formulas for the univariate and bivariate normal distribution (see the sketch after this list)
    • Univariate:

      f(x) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)

    • Bivariate:

      f(x, y) = \frac{1}{2\pi \sigma_x \sigma_y \sqrt{1 - \rho^2}} \exp\left( -\frac{1}{2(1 - \rho^2)} \left[ \frac{(x - \mu_x)^2}{\sigma_x^2} - \frac{2\rho (x - \mu_x)(y - \mu_y)}{\sigma_x \sigma_y} + \frac{(y - \mu_y)^2}{\sigma_y^2} \right] \right)
  • Effects:
    • Simulations are more stable
    • Simulations are much faster
  • Added visualization of the belief space for grid worlds
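For reference, straightforward NumPy implementations of the two densities above (a sketch, not the thesis code):

    import numpy as np

    def normal_pdf(x, mu, sigma):
        # Univariate normal density.
        return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

    def bivariate_normal_pdf(x, y, mu_x, mu_y, sigma_x, sigma_y, rho):
        # Bivariate normal density with correlation coefficient rho.
        zx = (x - mu_x) / sigma_x
        zy = (y - mu_y) / sigma_y
        q = (zx ** 2 - 2 * rho * zx * zy + zy ** 2) / (1 - rho ** 2)
        return np.exp(-q / 2) / (2 * np.pi * sigma_x * sigma_y * np.sqrt(1 - rho ** 2))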



Results (Preview)
  • Experimental Set-up:
    • Light Dark domain with the following layout (A = agent start, G = goal, L = light region):
********L*
******A*L*
********L*
********L*
********L*
***G****L*
********L*
********L*
********L*
********L* 
    • Initial belief: uniform over all possible states, except the goal state
    • # Episodes: 5000
    • Optimal is taken to be: two steps right, four steps down, five steps left (probably not the truly optimal path; see the check below)
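A quick sanity check that the hand-picked route ends on the goal cell of the layout above (zero-based (row, column) coordinates; a sketch):

    start = (1, 6)  # 'A' in the layout
    goal = (5, 3)   # 'G' in the layout
    # Two steps right, four steps down, five steps left.
    path = [(0, 1)] * 2 + [(1, 0)] * 4 + [(0, -1)] * 5
    pos = start
    for dr, dc in path:
        pos = (pos[0] + dr, pos[1] + dc)
    assert pos == goal  # the route does reach the goal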