Wednesday, December 19, 2012

Progress 19-12-2012

Discretization for POMCP
  • Equal width binning
    • Predefined number of cut points per dimension (m)
    • Number of dimensions (n)
    • For each sequence node, there are (m+1)^n history nodes => lots of nodes!
    • Used both for the search tree and to update the particle filter (see the binning sketch below)
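A minimal sketch of the equal width binning in Java (the class name, the fixed observation bounds per dimension, and the encoding of the bin indices into a single history-node key are illustrative assumptions, not taken from the actual implementation):

/**
 * Equal-width binning of a continuous observation vector.
 * Each of the n dimensions is cut into m+1 bins by m equally spaced
 * cut points, so a sequence node can have up to (m+1)^n history nodes.
 * (Illustrative sketch; bounds and key encoding are assumptions.)
 */
public class EqualWidthBinning {

    private final int m;            // number of cut points per dimension
    private final double[] min;     // lower bound per dimension (assumed known)
    private final double[] max;     // upper bound per dimension (assumed known)

    public EqualWidthBinning(int m, double[] min, double[] max) {
        this.m = m;
        this.min = min;
        this.max = max;
    }

    /** Maps one dimension's value to a bin index in [0, m]. */
    private int binIndex(double value, int dim) {
        double width = (max[dim] - min[dim]) / (m + 1);
        int bin = (int) ((value - min[dim]) / width);
        return Math.min(Math.max(bin, 0), m); // clamp values on or outside the bounds
    }

    /** Encodes the whole observation as a single discrete key for a history node. */
    public long observationKey(double[] observation) {
        long key = 0;
        for (int dim = 0; dim < observation.length; dim++) {
            key = key * (m + 1) + binIndex(observation[dim], dim);
        }
        return key;
    }
}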
Hallway environment
  • Based on [1]
  • Actions: move forward, turn left, turn right, turn around
  • Reward: -1 per action
  • Observations: 4 wall-detection sensors (with Gaussian noise) and 1 landmark/goal detection sensor (see the sensor sketch after this list)
  • Belief: location of agent (agent's orientation is known)
  • Initial belief distribution: uniform over all free cells
  • Currently available maps (0 = free, 1 = wall, G = goal, L = landmark):
    • 0000G
    • 00000000000
      11L1L1L1G11
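A minimal sketch of the noisy wall sensors (the grid encoding, the noise level, and the fixed world-frame sensor directions are assumptions for this example; in the environment the four sensors would be relative to the agent's known orientation):

import java.util.Random;

/**
 * Illustrative sketch of the four noisy wall-detection sensors.
 * The grid encoding (true = wall), the noise level and the fixed
 * sensor directions are assumptions made for this example only.
 */
public class WallSensors {

    private static final double NOISE_STD_DEV = 0.1; // assumed sensor noise
    // Assumed sensor order: north, east, south, west (world frame, for simplicity).
    private static final int[][] DIRECTIONS = { { 0, -1 }, { 1, 0 }, { 0, 1 }, { -1, 0 } };

    private final boolean[][] walls; // walls[y][x] == true means the cell is a wall
    private final Random rng = new Random();

    public WallSensors(boolean[][] walls) {
        this.walls = walls;
    }

    /** Returns four noisy readings (~1.0 if the neighbouring cell is a wall, ~0.0 otherwise). */
    public double[] sense(int x, int y) {
        double[] readings = new double[DIRECTIONS.length];
        for (int i = 0; i < DIRECTIONS.length; i++) {
            int nx = x + DIRECTIONS[i][0];
            int ny = y + DIRECTIONS[i][1];
            boolean wall = ny < 0 || ny >= walls.length || nx < 0 || nx >= walls[ny].length
                    || walls[ny][nx];                     // outside the map counts as wall
            readings[i] = (wall ? 1.0 : 0.0) + rng.nextGaussian() * NOISE_STD_DEV;
        }
        return readings;
    }
}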

[1] Michael Littman, Anthony Cassandra, and Leslie Kaelbling. Learning policies for partially observable environments: Scaling up. In Armand Prieditis and Stuart Russell, editors, Proceedings of the Twelfth International Conference on Machine Learning, pages 362--370, San Francisco, CA, 1995. Morgan Kaufmann.

Tree Visualization


Sunday, July 29, 2012

Feedback 29-07-2012

Progress
  • Implemented visualization for light-dark domain in RL-Viz:
    [screenshots of the RL-Viz visualization]
  • Finished time-based performance experiment for light-dark domain (see below)
  • Thesis: 
    • Included the experiment mentioned before
    • Some small language / content corrections
Experimental Set-up
  • Environment: 10x10 discrete state space light-dark domain
  • Max. steps: 20
  • Discount factor: 0.95
  • Number of episodes: 2,000
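The exact evaluation protocol of the time-based experiment is not spelled out here, so the sketch below only illustrates one plausible reading: each planning step gets a fixed wall-clock budget, and the discounted return is averaged over the episodes. The Environment/Planner interfaces and the budget parameter are assumptions; only the parameters listed above come from this entry.

/**
 * Illustrative time-budgeted evaluation loop for the 10x10 light-dark domain.
 * Environment/Planner are assumed interfaces; only max. 20 steps, discount 0.95
 * and 2,000 episodes come from the set-up above.
 */
public class TimeBasedExperiment {

    interface Environment {
        void reset();
        boolean isTerminal();
        double act(int action);           // executes the action, returns the immediate reward
        int observation();                // last observation made by the agent
    }

    interface Planner {
        void reset();
        int planAction(long timeBudgetMillis);     // run roll-outs until the budget expires
        void update(int action, int observation);  // advance the planner's belief
    }

    public static double meanReturn(Environment env, Planner planner, long budgetMillis) {
        final int episodes = 2000;
        final int maxSteps = 20;
        final double gamma = 0.95;
        double total = 0.0;
        for (int e = 0; e < episodes; e++) {
            env.reset();
            planner.reset();
            double episodeReturn = 0.0;
            double discount = 1.0;
            for (int t = 0; t < maxSteps && !env.isTerminal(); t++) {
                int action = planner.planAction(budgetMillis);
                double reward = env.act(action);
                planner.update(action, env.observation());
                episodeReturn += discount * reward;
                discount *= gamma;
            }
            total += episodeReturn;
        }
        return total / episodes;
    }
}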
Results
time vs roll-outs

time vs mean

Thursday, July 26, 2012

Feedback 26-07-12

Progress
  • Thesis: 
    • Wrote a section about related work
    • Started to write on conclusion chapter
    • Added experiments about performance in the infinite horizon Tiger problem
Overview of experiments
  • Performance
    • Sample based (# roll-outs vs mean)
      1. Variations of COMCTS
      2. Variations of TT
      3. Best COMCTS variation against best TT variation
      4. Variations with similar performance
    • Time based: same four experiments (time vs mean)
  • Practical time and space usage (time vs roll-outs)
  • Varying Noise (Standard deviation of noise vs mean)

Experiments: Time vs roll-outs, time vs mean


Set-up
  • Environment: infinite horizon Tiger problem
  • Discount factor: 0.5
  • Discount horizon (ε for the ε-horizon time): 0.00001 (see the note below)
  • Number of episodes: 2,000
  • Algorithms:
    • BFAIRMC (pronounced "be fair MC") = Belief-based Function Approximation using Incremental Regression trees and Monte Carlo = the transposition tree
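Note (assuming the discount horizon above is the ε of the ε-horizon time used in the bootstrapping formula, entry of 20-07-2012): the effective search depth d is the smallest depth with

γ^d < ε  =>  d > ln(ε) / ln(γ) = ln(10^-5) / ln(0.5) ≈ 16.6,

so the search is cut off after d = 17 steps.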
Results
time vs rollouts

time vs mean

time vs mean (zoomed in)

time vs regret (optimum - mean)

Saturday, July 21, 2012

Feedback 21-07-2012

Results
  • All variations of TT, with bootstrapping on the standard transposition tree (without keeping and updating the first node separately)

Friday, July 20, 2012

Feedback 20-07-2012

Progress
  • Implemented bootstrapping for the transposition tree:
    • Updates the tree from bottom to top (same way as before)
    • Formula: R_{k...d-1} + γ^{d-k} * V(b_d), where
      • k is the depth of the current update,
      • d is the horizon depth given by the ε-horizon time, and
      • V(b_d) = max_a Q(b_d, a)
    • Q(b_d, a) refers to the average outcome of action a in belief state b_d (given by the corresponding UCT value)
    • R_{k...d-1} is the return or cumulative discounted reward of the step rewards from step k to step d-1
  • Now, the deletion and insertion strategies perform as well as the perfect recall strategy (see plots below; a sketch of the bootstrapped backup follows after this list)
  • Remark: all results below keep the first node separate from the other nodes and also update this node during backpropagation; the remaining nodes form the transposition tree
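A minimal sketch of such a bootstrapped backup. The node/tree structures, method names, and the way the visited path is passed in are assumptions for illustration; only the formula R_{k...d-1} + γ^{d-k} * V(b_d) itself comes from the description above.

/**
 * Illustrative bootstrapped backup for a transposition tree.
 * For each visited depth k, the node is updated with the discounted sum of
 * the rewards from step k to d-1 plus the bootstrap term γ^(d-k) * V(b_d),
 * where V(b_d) = max_a Q(b_d, a). Data structures are assumptions.
 */
public class BootstrappedBackup {

    /** Hypothetical node interface: Q-estimate per action plus an update method. */
    interface BeliefNode {
        double maxQ();                       // V(b) = max_a Q(b, a)
        void update(int action, double target);
    }

    /**
     * @param nodes   nodes visited at depths 0..d (nodes[d] is the ε-horizon node)
     * @param actions actions taken at depths 0..d-1
     * @param rewards rewards received at depths 0..d-1
     * @param gamma   discount factor
     */
    public static void backup(BeliefNode[] nodes, int[] actions, double[] rewards, double gamma) {
        int d = rewards.length;              // horizon depth given by the ε-horizon time
        double bootstrap = nodes[d].maxQ();  // V(b_d)
        // Walk from the bottom (k = d-1) to the top (k = 0), as described above.
        for (int k = d - 1; k >= 0; k--) {
            double target = 0.0;
            double discount = 1.0;
            for (int i = k; i < d; i++) {    // R_{k...d-1}
                target += discount * rewards[i];
                discount *= gamma;
            }
            target += discount * bootstrap;  // + γ^(d-k) * V(b_d)
            nodes[k].update(actions[k], target);
        }
    }
}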
Results


Monday, July 16, 2012

Feedback 16-07-2012

Progress
  • Implemented a continuous state space variation of the light-dark domain:
    • State space is a subspace [minX, maxX] × [minY, maxY] of R^2
      • Goal is a square of size 1
    • Actions: discrete, with some additive Gaussian noise
    • Wrap-around effect: if the agent leaves the area on one side, it re-enters on the opposite side
    • Observations: same as in the discrete state space variation
  • Implemented a belief approximation (see the sketch after this list):
    • Bivariate Gaussian with three parameters: meanX, meanY, variance
    • Updates are done with a Kalman filter (simplified to the specific Gaussian and the environment)
    • Matrix computations are performed with apache commons math
    • Bivariate Gaussian is realized with a subset of classes from jahmm
  • Implemented a transposition tree for this Gaussian:
    • Instead of maintaining and updating a 1D belief during a simulation, the algorithm maintains and updates the three parameters of the Gaussian
    • Searches for split points along the Gaussian's three dimensions
    • Selects the best F-test among all dimensions for a split
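A minimal sketch of a simplified Kalman-filter update for the three-parameter Gaussian belief (meanX, meanY, shared variance). The noise model (a noisy position measurement whose variance depends on how light the region is), the method names and the parameter values are assumptions; the actual implementation relies on apache commons math and jahmm instead.

/**
 * Illustrative simplified Kalman-filter update for an isotropic bivariate
 * Gaussian belief (meanX, meanY, shared variance). Noise parameters and the
 * observation model (noisy position measurement) are assumptions.
 */
public class GaussianBelief {

    private double meanX;
    private double meanY;
    private double variance; // shared variance for both dimensions

    public GaussianBelief(double meanX, double meanY, double variance) {
        this.meanX = meanX;
        this.meanY = meanY;
        this.variance = variance;
    }

    /** Prediction step: shift the mean by the intended action and add motion noise. */
    public void predict(double dx, double dy, double motionNoiseVariance) {
        meanX += dx;
        meanY += dy;
        variance += motionNoiseVariance;
    }

    /**
     * Correction step: fuse a noisy position observation (obsX, obsY) whose
     * noise variance depends on the state (small in the light, large in the dark).
     */
    public void correct(double obsX, double obsY, double observationNoiseVariance) {
        double gain = variance / (variance + observationNoiseVariance); // Kalman gain
        meanX += gain * (obsX - meanX);
        meanY += gain * (obsY - meanY);
        variance = (1.0 - gain) * variance;
    }
}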

Saturday, July 14, 2012

Feedback 14-07-12

Progress
  • Changed the light-dark domain such that it places the agent at a randomly selected starting cell at the beginning of an episode
  • Implemented a heuristic for the light-dark domain (see the sketch after this list):
    • Assumes the agent is on the grid cell for which the belief is highest
    • Takes the action that moves from that cell closest (in Euclidean distance) to the goal cell
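A minimal sketch of this most-likely-state heuristic. The grid representation of the belief, the action encoding (four grid moves), and the method names are assumptions made for the example only.

/**
 * Illustrative most-likely-state heuristic for the light-dark domain:
 * pick the cell with the highest belief and move greedily towards the goal.
 * Grid/action encodings are assumptions; wall and bounds checks are omitted.
 */
public class MostLikelyStateHeuristic {

    // Assumed action encoding: up, down, left, right.
    private static final int[][] MOVES = { { 0, -1 }, { 0, 1 }, { -1, 0 }, { 1, 0 } };

    public static int selectAction(double[][] belief, int goalX, int goalY) {
        // 1. Find the grid cell with the highest belief.
        int bestX = 0, bestY = 0;
        double bestProb = Double.NEGATIVE_INFINITY;
        for (int x = 0; x < belief.length; x++) {
            for (int y = 0; y < belief[x].length; y++) {
                if (belief[x][y] > bestProb) {
                    bestProb = belief[x][y];
                    bestX = x;
                    bestY = y;
                }
            }
        }
        // 2. Take the action whose successor cell is closest to the goal (Euclidean distance).
        int bestAction = 0;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (int a = 0; a < MOVES.length; a++) {
            double dx = (bestX + MOVES[a][0]) - goalX;
            double dy = (bestY + MOVES[a][1]) - goalY;
            double distance = Math.hypot(dx, dy);
            if (distance < bestDistance) {
                bestDistance = distance;
                bestAction = a;
            }
        }
        return bestAction;
    }
}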
Experimental Set-up
  • Light Dark domain with the following layout:
********L*
********L*
********L*
********L*
********L*
***G****L*
********L*
********L*
********L*
********L* 
  • At the beginning of each episode, the agent is randomly placed at any cell except the goal G
  • Initial belief: uniform over all possible states, except the goal state
  • Number of episodes: 1000
  • Discount factor: 0.95
Results