Sunday, July 29, 2012

Feedback 29-07-2012

Progress
  • Implemented visualization for the light-dark domain in RL-Viz
  • Finished time-based performance experiment for light-dark domain (see below)
  • Thesis: 
    • Included the experiment mentioned before
    • Some small language / content corrections
Experimental Set-up
  • Environment: 10x10 discrete state space light-dark domain
  • Max. steps: 20
  • Discount factor: 0.95
  • Number of episodes: 2,000
Results
time vs roll-outs

time vs mean

Thursday, July 26, 2012

Feedback 26-07-12

Progress
  • Thesis: 
    • Wrote a section about related work
    • Started to write on conclusion chapter
    • Added experiments about performance in infinite horizon tiger problem
Overview of experiments
  • Performance
    • Sample based (# roll-outs vs mean)
      1. Variations of COMCTS
      2. Variations of TT
      3. Best COMCTS variation against best TT variation
      4. Variations with similar performance
    • Time based: same four experiments (time vs mean)
  • Practical time and space usage (time vs roll-outs)
  • Varying Noise (Standard deviation of noise vs mean)

Experiments: Time vs roll-outs, time vs mean


Set-up
  • Environment: infinite horizon Tiger problem
  • Discount factor: 0.5
  • Discount horizon: 0.00001
  • Number of episodes: 2,000
  • Algorithms:
    • BFAIRMC (pronounced "be fair MC"): Belief-based Function Approximation using Incremental Regression trees and Monte Carlo, i.e., the transposition tree
Results
time vs roll-outs

time vs mean

time vs mean (zoomed in)

time vs regret (optimum - mean)

Saturday, July 21, 2012

Feedback 21-07-2012

Results
  • All TT variations use bootstrapping on the standard transposition tree (without keeping and updating the 1st node separately)

Friday, July 20, 2012

Feedback 20-07-2012

Progress
  • Implemented bootstrapping for the transposition tree (a small sketch follows after this list):
    • Updates the tree from bottom to top (same way as before)
    • Formula: R_{k..d-1} + γ^{d-k} * V(b_d), where
      • k is the depth of the current update,
      • d is the horizon depth given by the ε-horizon time, and
      • V(b_d) = max_a Q(b_d, a)
    • Q(b_d, a) refers to the average outcome of action a in belief state b_d (given by the corresponding UCT value)
    • R_{k..d-1} is the return, i.e., the cumulative discounted reward of the step rewards from step k to step d - 1
  • Now, the deletion and insertion strategies perform as well as the perfect recall strategy (see plots below)
  • Remark: the results below all keep the first node separate from the other nodes and also update this node during backpropagation; the remaining nodes form the transposition tree
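A minimal sketch of this bootstrapped backup in Python; the node class, the path representation, and the method names are illustrative assumptions, not the thesis code:

class BeliefNode:
    # Toy stand-in for a node of the transposition tree: per-action mean and count.
    def __init__(self, n_actions):
        self.mean = [0.0] * n_actions
        self.count = [0] * n_actions

    def update(self, action, target):
        # Incremental mean update towards the bootstrapped target.
        self.count[action] += 1
        self.mean[action] += (target - self.mean[action]) / self.count[action]


def bootstrapped_backup(path, leaf, gamma):
    # path : list of (node, action, reward) for depths k = 0 .. d-1
    # leaf : node holding the deepest belief b_d
    # gamma: discount factor
    d = len(path)
    v_leaf = max(leaf.mean)                        # V(b_d) = max_a Q(b_d, a)
    ret = 0.0                                      # running R_{k..d-1}
    for k in range(d - 1, -1, -1):                 # update the tree from bottom to top
        node, action, reward = path[k]
        ret = reward + gamma * ret                 # R_{k..d-1} = r_k + gamma * R_{k+1..d-1}
        target = ret + gamma ** (d - k) * v_leaf   # R_{k..d-1} + gamma^(d-k) * V(b_d)
        node.update(action, target)
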
Results


Monday, July 16, 2012

Feedback 16-07-2012

Progress
  • Implemented a continuous state space variation of the light-dark domain:
    • State space is a subspace [minX, maxX] × [minY, maxY] of R^2
      • Goal is a square of size 1
    • Actions: discrete, with some additive Gaussian noise
    • Wrap-around effect: if agent leaves on one side, it enters on the opposite side
    • Observations: same as in discrete state space variation
  • Implemented a belief approximation (a small sketch of the update follows after this list):
    • Bivariate Gaussian with three parameters: meanX, meanY, variance
    • Updates are done with a Kalman filter (simplified to the specific Gaussian and the environment)
    • Matrix computations are performed with Apache Commons Math
    • Bivariate Gaussian is realized with a subset of classes from jahmm
  • Implemented a transposition tree for this Gaussian:
    • Instead of maintaining and updating a 1D belief during a simulation, the algorithm maintains and updates the three parameters of the Gaussian
    • Searches for split points along the Gaussian's three dimensions
    • Selects the best F-test among all dimensions for a split
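A minimal sketch of how such a simplified Kalman-style update of the three belief parameters (meanX, meanY, variance) could look. The isotropic handling of the variance and the light-dark noise model in obs_noise_var are illustrative assumptions, not the actual implementation (which uses Apache Commons Math and jahmm):

def predict(mean_x, mean_y, var, dx, dy, action_noise_var):
    # Prediction step: shift the mean by the intended move and grow the
    # variance by the variance of the additive Gaussian action noise.
    return mean_x + dx, mean_y + dy, var + action_noise_var


def correct(mean_x, mean_y, var, obs_x, obs_y, obs_noise):
    # Correction step: scalar Kalman gain, applied to both coordinates because
    # the belief keeps a single variance parameter (isotropic Gaussian).
    k = var / (var + obs_noise)
    mean_x += k * (obs_x - mean_x)
    mean_y += k * (obs_y - mean_y)
    var *= (1.0 - k)
    return mean_x, mean_y, var


def obs_noise_var(x, light_x=8.0, base=0.01, scale=0.5):
    # Illustrative light-dark noise model (an assumption): observations get
    # noisier the farther the agent is from the light column at x = light_x.
    return base + scale * (x - light_x) ** 2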

Saturday, July 14, 2012

Feedback 14-07-12

Progress
  • Changed the light-dark domain so that the agent starts each episode at a randomly selected cell
  • Implemented a heuristic for the light-dark domain (a small sketch follows below):
    • Assumes the agent is in the grid cell with the highest belief
    • Takes the action that minimizes the Euclidean distance from that cell to the goal cell
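A minimal sketch of the heuristic, interpreted here as moving to the neighbouring cell closest to the goal; the dictionary-based belief and the action encoding are illustrative assumptions:

import math

def heuristic_action(belief, goal, actions):
    # belief : dict mapping (x, y) grid cells to probabilities
    # goal   : (x, y) goal cell
    # actions: dict mapping action names to (dx, dy) moves
    x, y = max(belief, key=belief.get)       # cell with the highest belief
    # Action whose resulting cell is closest (Euclidean) to the goal.
    return min(actions, key=lambda a: math.hypot(x + actions[a][0] - goal[0],
                                                 y + actions[a][1] - goal[1]))
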
Experimental Set-up
  • Light Dark domain with the following layout:
********L*
********L*
********L*
********L*
********L*
***G****L*
********L*
********L*
********L*
********L* 
  • At the beginning of each episode, the agent is randomly placed at any cell except the goal G
  • Initial belief: uniform over all possible states, except the goal state (a small sketch follows below)
  • Number of episodes: 1000
  • Discount factor: 0.95
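For illustration, the initial belief from this set-up could be constructed as follows; the function name and cell encoding are assumptions:

def uniform_initial_belief(width, height, goal):
    # Uniform belief over all grid cells except the goal cell.
    cells = [(x, y) for x in range(width) for y in range(height) if (x, y) != goal]
    p = 1.0 / len(cells)
    return {cell: p for cell in cells}

# 10x10 light-dark grid with the goal G at column 3, row 5 (0-indexed)
belief = uniform_initial_belief(10, 10, goal=(3, 5))
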
Results


Wednesday, July 11, 2012

Feedback 11-07-2012

Experimental Set-up
  • Environment: Infinite Horizon Tiger Problem
  • Discount factor: 0.5
  • Discount horizon: 0.00001
  • Number of episodes: 5000
  • Number of steps: 2
  • Algorithms:
    • Both Keep 1st Node variants use perfect recall as the splitting strategy
    • Keep 1st Node + Update also updates the "first node" during backpropagation
  • Regret plots: the mean value achieved by the magic guesser is taken as the optimum
Results

Regret: Variations of COMCTS

Regret: Variations of TT

Mean: Variations of COMCTS

Mean: Variations of TT

Tuesday, July 10, 2012

Meeting 10-07-2012

Discussion
  • Thesis:
    • Structure stays as it is (maybe some sections will still be moved later on)
    • Add research question about computational complexity of both algorithms
  • Transposition tree: recreate split tests from stored samples if perfect recall is used (but otherwise do not copy them)
  • Plots: 
    • Include data of up to 10^3 roll-outs
    • Regret: change y-axis to log scale
    • Thesis: 
      • Separate plots for the variations of each algorithm
      • Plots to compare best variants
      • Plots for variations with similar curves
  • Transposition tree: related work is function approximation / generalization with decision / regression trees in reinforcement learning
  • Transposition tree for light-dark domain:
    • Changes to light-dark domain:
      • Continuous state space with a square region for the goal
      • Discrete actions, corrupted by some Gaussian noise
    • Belief space represented as a multivariate Gaussian centered around the agent's actual location (x,y) with three parameters: mean(x), mean(y), standard deviation
Planning
  • Hand-in of chapters 2 and 3 this Friday July 13
  • Next meeting: Monday July 15

Monday, July 9, 2012

Feedback 09-07-2012

Progress
  • Transposition Tree: Implemented the option of not splitting the first node (a small sketch follows after this list)
    • The agent's initial belief is represented by an additional node
    • This "first node" is updated separately from all other nodes
    • A parameter allows this node to also be updated during backpropagation if the belief falls within the first node's range
  • Transposition Tree: Corrected the way an action is selected at the end of a simulation
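A minimal sketch of the "keep 1st node" idea; the leaf interface with find_leaf, contains, and update is assumed for illustration, not the thesis code:

class KeepFirstNode:
    # The agent's initial belief gets its own statistics node that is never
    # split; all other beliefs are handled by the regular transposition tree.
    def __init__(self, first_node, tree_root, update_during_backprop=False):
        self.first_node = first_node
        self.tree = tree_root
        self.update_during_backprop = update_during_backprop

    def backup(self, belief, action, target):
        self.tree.find_leaf(belief).update(action, target)    # normal tree update
        if self.update_during_backprop and self.first_node.contains(belief):
            self.first_node.update(action, target)             # optional extra update
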
Experimental Set-up
  • Environment: Infinite Horizon Tiger Problem
  • Discount factor: 0.5
  • Discount horizon: 0.00001 
    • (with this setting, one roll-out means 25 updates to the tree)
  • Number of episodes: 5000
  • Number of steps: 2
  • Algorithms:
    • Both Keep 1st Node variants use Perfect Recall as the splitting strategy
    • Keep 1st Node + Update also updates the "first node" during backpropagation 
Results


Saturday, July 7, 2012

Feedback 07-07-2012

Progress
  • Wrote some new sections for the thesis
  • Changed the structure

Friday, July 6, 2012

Feedback 06-07-2012

Progress
  • Implemented perfect recall for the transposition tree (a small sketch follows after this list)
    • Remembers all samples (belief, action, discounted reward)
    • In case of a split, uses the stored samples to recreate the mean and the count of the action nodes which are below the new leaves
    • Space for the samples is pre-allocated: 
      • #samples = #roll-outs * ε-horizon
  • All three variants (deletion, insertion, transposition tree) decrease in performance after the first split(s) (in the plot below, there is a recurring dip at the beginning of each curve)
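A minimal sketch of perfect recall for a 1D belief. This is illustrative only; the original pre-allocates the sample array with #roll-outs * ε-horizon entries, whereas this sketch simply appends:

class PerfectRecallLeaf:
    # Keeps every (belief, action, discounted reward) sample so that, after a
    # split, the statistics of the two new leaves can be rebuilt exactly.
    def __init__(self, n_actions):
        self.n_actions = n_actions
        self.samples = []                      # (belief, action, discounted reward)
        self.mean = [0.0] * n_actions
        self.count = [0] * n_actions

    def update(self, belief, action, reward):
        self.samples.append((belief, action, reward))
        self.count[action] += 1
        self.mean[action] += (reward - self.mean[action]) / self.count[action]

    def split(self, split_point):
        # Rebuild both children from the stored samples instead of starting
        # from empty statistics.
        left = PerfectRecallLeaf(self.n_actions)
        right = PerfectRecallLeaf(self.n_actions)
        for belief, action, reward in self.samples:
            (left if belief < split_point else right).update(belief, action, reward)
        return left, right
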
Experimental Set-up
  • Environment: Infinite Horizon Tiger Problem
  • Discount factor: 0.5
  • Discount horizon: 0.00001
  • Number of episodes: 5000
  • Number of steps: 2

Preliminary Results
First 100 samples

Thursday, July 5, 2012

Feedback 05-07-2012

Progress
  • Implemented the transposition tree
    • Starts from a single leaf node which represents the complete belief space ( [0,1] in the Tiger problem )
    • Uses an F-test for splitting in the belief space
    • Collects test statistics for each action for the two "sides" of each test
    • Splits if, for any action, there is a significant difference between the two sides (e.g., between action 0 on the "left" side and action 0 on the "right" side); a small sketch of this test follows after this list
  • Re-use of knowledge after a split:
    • Deletion (implemented)
    • Insertion of information from winning test into new action nodes (implemented)
    • Insert split tests into new leaves (not implemented)
    • Other strategies (like perfect recall)?
  • Problems:
    • All following problems relate to the agent's true belief in the real environment at the current time step and occur for both deletion and insertion
    • If there is a split near the end of the episode at a leaf whose range includes the agent's true belief, the algorithm cannot gather enough new data about the expected value of each action at that range to give reliable estimates.
    • Splitting sometimes creates a leaf with a large range, e.g., [0.5000, 0.9170]. If the agent's true belief is 0.91, the estimated action values for that leaf are again unreliable.
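A minimal sketch of the kind of two-group F-test that could drive such a split; the per-side statistics object, the candidate-test bookkeeping, and the significance level are illustrative assumptions:

from scipy import stats

class SideStats:
    # Running statistics of the discounted returns seen for one action on one
    # side of a candidate split point.
    def __init__(self):
        self.n, self.s, self.ss = 0, 0.0, 0.0
    def add(self, x):
        self.n += 1
        self.s += x
        self.ss += x * x
    @property
    def mean(self):
        return self.s / self.n if self.n else 0.0

def f_test_p_value(left, right):
    # One-way ANOVA with two groups: is the mean return of an action
    # significantly different on the two sides of the candidate split point?
    n1, n2 = left.n, right.n
    if n1 < 2 or n2 < 2:
        return 1.0                                      # not enough data yet
    grand = (left.s + right.s) / (n1 + n2)
    ss_between = n1 * (left.mean - grand) ** 2 + n2 * (right.mean - grand) ** 2
    ss_within = (left.ss - n1 * left.mean ** 2) + (right.ss - n2 * right.mean ** 2)
    if ss_within <= 0.0:
        return 0.0
    f = ss_between / (ss_within / (n1 + n2 - 2))        # df = (1, n1 + n2 - 2)
    return stats.f.sf(f, 1, n1 + n2 - 2)

def should_split(candidate_tests, alpha=0.05):
    # candidate_tests: for each candidate split point, a dict action -> (left, right)
    return any(f_test_p_value(left, right) < alpha
               for per_action in candidate_tests
               for left, right in per_action.values())
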
Planning
  • Implement other strategies
  • Implement not splitting the first node

Tuesday, July 3, 2012

Feedback 03-07-2012

Progress
  • Corrected and improved the implementation of the formulas for the univariate and bivariate normal distribution (a sketch of the densities follows after this list):
    • Univariate
    • Bivariate
  • Effects:
    • Simulations are more stable
    • Simulations are much faster
  • Added visualization of the belief space for grid worlds
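For reference, a sketch of the textbook normal densities; the exact (possibly simplified) forms used in the thesis implementation may differ:

import math

def univariate_normal_pdf(x, mu, sigma):
    # Standard univariate normal density.
    return math.exp(-(x - mu) ** 2 / (2.0 * sigma ** 2)) / (sigma * math.sqrt(2.0 * math.pi))

def bivariate_normal_pdf(x, y, mu_x, mu_y, sigma_x, sigma_y, rho=0.0):
    # Standard bivariate normal density with correlation rho.
    zx = (x - mu_x) / sigma_x
    zy = (y - mu_y) / sigma_y
    denom = 2.0 * math.pi * sigma_x * sigma_y * math.sqrt(1.0 - rho ** 2)
    expo = -(zx * zx - 2.0 * rho * zx * zy + zy * zy) / (2.0 * (1.0 - rho * rho))
    return math.exp(expo) / denom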



Results (Preview)
  • Experimental Set-up:
    • Light Dark domain with the following layout:
********L*
******A*L*
********L*
********L*
********L*
***G****L*
********L*
********L*
********L*
********L* 
  • Initial belief: uniform over all possible states, except the goal state
  • # Episodes: 5000
  • The reference optimum is: two steps right, four steps down, five steps left (probably not the truly optimal path)