Wednesday, December 19, 2012

Progress 19-12-2012

Discretization for POMCP
  • Equal width binning
    • Predefined number of cut points per dimension (m)
    • Number of dimensions (n)
    • For each sequence node, there are (m+1)^n history nodes => lots of nodes!
    • Used for the tree and to update the particle filter 
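As a sketch, the equal-width binning could look like this (the per-dimension bounds and the clamping at the upper edge are illustrative assumptions):

```python
def bin_index(obs, lows, highs, m):
    """Map a continuous observation vector to a tuple of bin indices.

    m cut points per dimension give m + 1 equal-width bins, so an
    n-dimensional observation maps to one of (m + 1)**n history nodes.
    """
    idx = []
    for x, lo, hi in zip(obs, lows, highs):
        width = (hi - lo) / (m + 1)
        # clamp so values on/above the upper bound fall in the last bin
        i = min(int((x - lo) / width), m)
        idx.append(max(i, 0))
    return tuple(idx)

# Example: 2 cut points (3 bins) per dimension over [0, 1]^2
# bin_index([0.1, 0.9], [0.0, 0.0], [1.0, 1.0], 2) gives (0, 2)
```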
Hallway environment
  • Based on [1]
  • Actions: move forward, turn left, turn right, turn around
  • Reward: -1 per action
  • Observations: 4 wall detection sensors (with Gaussian noise), 1 landmark / goal detection sensor
  • Belief: location of agent (agent's orientation is known)
  • Initial belief distribution: uniform over all free cells
  • Currently available maps (0 = free, 1 = wall, G = goal, L = landmark):
    • 0000G
    • 00000000000
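A minimal sketch of the noisy wall-sensor observation described above (the 0/1 indicator encoding and the `sigma` value are illustrative assumptions, not the actual implementation):

```python
import random

def observe(walls, sigma=0.1, rng=random):
    """Noisy readings from the four wall detection sensors.

    `walls` is a tuple of four booleans (wall present or not in each
    sensed direction); each reading is the true indicator (1.0 for
    wall, 0.0 for free) plus zero-mean Gaussian noise with standard
    deviation `sigma`.
    """
    return [float(w) + rng.gauss(0.0, sigma) for w in walls]

# With sigma = 0 the readings equal the true indicators:
# observe((True, False, False, True), sigma=0.0) -> [1.0, 0.0, 0.0, 1.0]
```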

[1] Michael Littman, Anthony Cassandra, and Leslie Kaelbling. Learning policies for partially observable environments: Scaling up. In Armand Prieditis and Stuart Russell, editors, Proceedings of the Twelfth International Conference on Machine Learning, pages 362--370, San Francisco, CA, 1995. Morgan Kaufmann.

Tree Visualization

Sunday, July 29, 2012

Feedback 29-07-2012

  • Implemented visualization for light-dark domain in RL-Viz:


  • Finished time-based performance experiment for light-dark domain (see below)
  • Thesis: 
    • Included the experiment mentioned before
    • Some small language / content corrections
Experimental Set-up
  • Environment: 10x10 discrete state space light-dark domain
  • Max. steps: 20
  • Discount factor: 0.95
  • Number of episodes: 2,000
time vs roll-outs

time vs mean

Thursday, July 26, 2012

Feedback 26-07-12

  • Thesis: 
    • Wrote a section about related work
    • Started to write on conclusion chapter
    • Added experiments about performance in infinite horizon tiger problem
Overview of experiments
  • Performance
    • Sample based (# roll-outs vs mean)
      1. Variations of COMCTS
      2. Variations of TT
      3. Best COMCTS variation against best TT variation
      4. Variations with similar performance
    • Time based: same four experiments (time vs mean)
  • Practical time and space usage (time vs roll-outs)
  • Varying Noise (Standard deviation of noise vs mean)

Experiments: Time vs roll-outs, time vs mean

  • Environment: infinite horizon Tiger problem
  • Discount factor: 0.5
  • Discount horizon: 0.00001
  • Number of episodes: 2,000
  • Algorithms:
    • BFAIRMC (pronounced: "be fair mc") = Belief based Function Approximation using Incremental Regression trees and Monte Carlo = transposition tree
time vs rollouts

time vs mean

time vs mean (zoomed in)

time vs regret (optimum - mean)

Saturday, July 21, 2012

Feedback 21-07-2012

  • All variations of TT: bootstrap on the standard transposition tree (without keeping and updating the 1st node separately)

Friday, July 20, 2012

Feedback 20-07-2012

  • Implemented bootstrapping for the transposition tree:
    • Updates the tree from bottom to top (same way as before)
    • Formula: R_{k..d-1} + γ^(d-k) * V(b_d), where
      • k is the depth of the current update,
      • d is the horizon depth given by the ε-horizon time, and
      • V(b_d) = max_a Q(b_d, a)
    • Q(b_d, a) refers to the average outcome of action a in belief state b_d (given by the corresponding UCT value)
    • R_{k..d-1} is the return, i.e., the cumulative discounted reward of the step rewards from step k to step d - 1
  • Now, the deletion and insertion strategies perform as well as the perfect recall strategy (see plots below)
  • Remark: the results below all keep the first node separate from the other nodes and also update this node during backpropagation; the remaining nodes form the transposition tree
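The bootstrapped update target above can be written out as follows (discounting the step rewards relative to depth k is an assumption about the exact indexing):

```python
def bootstrapped_target(rewards, k, d, gamma, q_leaf):
    """Bootstrapped update target R_{k..d-1} + gamma^(d-k) * V(b_d).

    rewards[t] is the step reward at depth t, and q_leaf maps actions
    to Q(b_d, a) at the horizon node; V(b_d) is the best action value.
    """
    ret = sum(gamma ** (t - k) * rewards[t] for t in range(k, d))
    v_bd = max(q_leaf.values())
    return ret + gamma ** (d - k) * v_bd

# Example: two remaining steps before the horizon, gamma = 0.5
# 1.0 + 0.5 * 2.0 + 0.25 * 3.0 = 2.75
target = bootstrapped_target([0, 0, 1.0, 2.0], k=2, d=4, gamma=0.5,
                             q_leaf={'listen': -1.0, 'open': 3.0})
```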

Monday, July 16, 2012

Feedback 16-07-2012

  • Implemented a continuous state space variation of the light-dark domain:
    • State space is a subspace [minX, maxX] × [minY, maxY] of R^2
      • Goal is a square of size 1
    • Actions: discrete, with some additive Gaussian noise
    • Wrap-around effect: if the agent leaves on one side, it enters on the opposite side
    • Observations: same as in discrete state space variation
  • Implemented a belief approximation:
    • Bivariate Gaussian with three parameters: meanX, meanY, variance
    • Updates are done with a Kalman filter (simplified to the specific Gaussian and the environment)
    • Matrix computations are performed with apache commons math
    • Bivariate Gaussian is realized with a subset of classes from jahmm
  • Implemented a transposition tree for this Gaussian:
    • Instead of maintaining and updating a 1D belief during a simulation, the algorithm maintains and updates the three parameters of the Gaussian
    • Searches for split points along the Gaussian's three dimensions
    • Selects the best F-test among all dimensions for a split
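A minimal sketch of the simplified Kalman-style update for the three-parameter Gaussian belief (the isotropic scalar-variance form and the known noise variances are assumptions about the actual implementation):

```python
def belief_update(mean_x, mean_y, var, action, obs, action_var, obs_var):
    """One predict/correct cycle for the isotropic Gaussian belief
    (mean_x, mean_y, var), a scalar-variance Kalman filter.

    `action` is the intended (dx, dy) displacement, `obs` a noisy
    (x, y) position reading; action_var and obs_var are the noise
    variances, assumed known.
    """
    # Predict: shift the mean by the action, inflate the variance
    mx, my = mean_x + action[0], mean_y + action[1]
    var = var + action_var
    # Correct: Kalman gain for the scalar-variance case
    gain = var / (var + obs_var)
    mx += gain * (obs[0] - mx)
    my += gain * (obs[1] - my)
    var = (1.0 - gain) * var
    return mx, my, var
```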

Saturday, July 14, 2012

Feedback 14-07-12

  • Changed the light-dark domain such that it locates the agent at a randomly selected starting cell at the beginning of an episode
  • Implemented a heuristic for the light-dark domain:
    • Assumes the agent is on the grid cell for which the belief is the highest
    • Takes the action that has the smallest Euclidean distance from that cell to the goal cell
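The heuristic can be sketched as follows (the dict-based belief and move encoding are illustrative):

```python
def heuristic_action(belief, goal, actions):
    """Pick the action that moves the most likely cell closest to the goal.

    belief: dict mapping (x, y) cells to probabilities;
    actions: dict mapping action names to (dx, dy) moves.
    """
    # Assume the agent sits on the cell with the highest belief
    x, y = max(belief, key=belief.get)

    def dist_after(move):
        nx, ny = x + move[0], y + move[1]
        return ((nx - goal[0]) ** 2 + (ny - goal[1]) ** 2) ** 0.5

    return min(actions, key=lambda a: dist_after(actions[a]))

moves = {'up': (0, -1), 'down': (0, 1), 'left': (-1, 0), 'right': (1, 0)}
# Most likely cell is (0, 0); moving right gets closest to the goal (3, 0)
a = heuristic_action({(0, 0): 0.7, (4, 4): 0.3}, goal=(3, 0), actions=moves)
```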
Experimental Set-up
  • Light Dark domain with the following layout:
  • At the beginning of each episode, the agent is randomly placed at any cell except the goal G
  • Initial belief: uniform over all possible states, except the goal state
  • Number of episodes: 1000
  • Discount factor: 0.95

Wednesday, July 11, 2012

Feedback 11-07-2012

Experimental Set-up
  • Environment: Infinite Horizon Tiger Problem
  • Discount factor: 0.5
  • Discount horizon: 0.00001
  • Number of episodes: 5000
  • Number of steps: 2
  • Algorithms:
    • Both Keep 1st Node variants use perfect recall as the splitting strategy
    • Keep 1st Node + Update also updates the "first node" during backpropagation
  • Regret plots: the mean value achieved by magic guesser is taken as the optimum

Regret: Variations of COMCTS

Regret: Variations of TT

Mean: Variations of COMCTS

Mean: Variations of TT

Tuesday, July 10, 2012

Meeting 10-07-2012

  • Thesis:
    • Structure stays as it is (maybe some sections will still be moved later on)
    • Add research question about computational complexity of both algorithms
  • Transposition tree: recreate split tests from stored samples if perfect recall is used (but otherwise do not copy them)
  • Plots: 
    • Include data of up to 10^3 roll-outs
    • Regret: change y-axis to log scale
    • Thesis: 
      • Separate plots for the variations of each algorithm
      • Plots to compare best variants
      • Plots for variations with similar curves
  • Transposition tree: related work is function approximation / generalization with decision / regression trees in reinforcement learning
  • Transposition tree for light-dark domain:
    • Changes to light-dark domain:
      • Continuous state space with a squared region for the goal
      • Discrete actions, corrupted by some Gaussian noise
    • Belief space represented as a multivariate Gaussian centered around the agent's actual location (x,y) with three parameters: mean(x), mean(y), standard deviation
  • Hand-in of chapters 2 and 3 this Friday July 13
  • Next meeting: Monday July 16

Monday, July 9, 2012

Feedback 09-07-2012

  • Transposition Tree: Implemented not splitting the first node 
    • The agent's initial belief is represented by an additional node
    • This "first node" is updated separately from all other nodes
    • There is a parameter that also allows updating it during backpropagation if the belief falls in the first node's range
  • Transposition Tree: Corrected the way an action is selected at the end of a simulation
Experimental Set-up
  • Environment: Infinite Horizon Tiger Problem
  • Discount factor: 0.5
  • Discount horizon: 0.00001 
    • (with this setting, one roll-out means 25 updates to the tree)
  • Number of episodes: 5000
  • Number of steps: 2
  • Algorithms:
    • Both Keep 1st Node variants use Perfect Recall as the splitting strategy
    • Keep 1st Node + Update also updates the "first node" during backpropagation 

Saturday, July 7, 2012

Feedback 07-07-2012

  • Wrote some new sections for the thesis
  • Changed the structure:

Friday, July 6, 2012

Feedback 06-07-2012

  • Implemented perfect recall for the transposition tree
    • Remembers all samples (belief, action, discounted reward)
    • In case of a split, uses the stored samples to recreate the mean and the count of the action nodes which are below the new leaves
    • Space for the samples is pre-allocated: 
      • #samples = #roll-outs * epsilon-horizon
  • All three variants (deletion, insertion, transposition tree) decrease in performance after the first split(s) (in the plot below, there is a reoccurring dip at the beginning of each curve)
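A sketch of the pre-allocated perfect-recall sample store (the 1D belief intervals and the rebuild API are illustrative assumptions):

```python
class SampleStore:
    """Pre-allocated store for perfect recall: every simulated
    (belief, action, discounted reward) triple is kept so that leaf
    statistics can be rebuilt after a split.
    """

    def __init__(self, num_rollouts, horizon):
        # #samples = #roll-outs * epsilon-horizon, allocated up front
        self.samples = [None] * (num_rollouts * horizon)
        self.count = 0

    def add(self, belief, action, reward):
        self.samples[self.count] = (belief, action, reward)
        self.count += 1

    def rebuild(self, low, high, action):
        """Mean and count for `action` over beliefs in [low, high)."""
        hits = [r for (b, a, r) in self.samples[:self.count]
                if a == action and low <= b < high]
        return (sum(hits) / len(hits), len(hits)) if hits else (0.0, 0)
```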
Experimental Set-up
  • Environment: Infinite Horizon Tiger Problem
  • Discount factor: 0.5
  • Discount horizon: 0.00001
  • Number of episodes: 5000
  • Number of steps: 2

Preliminary Results
First 100 samples

Thursday, July 5, 2012

Feedback 05-07-2012

  • Implemented the transposition tree
    • Starts from a single leaf node which represents the complete belief space ( [0,1] in the Tiger problem )
    • Uses an F-test for splitting in the belief space
    • Collects test statistics for each action for the two "sides" of each test
    • Splits if there is a significant difference between any action (e.g., between action 0 on the "left" side and action 0 on the "right" side)
  • Re-use of knowledge after a split:
    • Deletion (implemented)
    • Insertion of information from winning test into new action nodes (implemented)
    • Insert split tests into new leaves (not implemented)
    • Other strategies (like perfect recall)?
  • Problems:
    • All following problems relate to the agent's true belief in the real environment at the current time step and occur for both deletion and insertion
    • If there is a split near the end of the episode at a leaf whose range includes the agent's true belief, the algorithm cannot gather enough new data about the expected value of each action at that range to give reliable estimates.
    • Splitting sometimes creates a leaf with a large range, e.g., [0.5000, 0.9170]. Now, if the agent's true belief is 0.91, the expected value of the actions is again not correct.
  • Implement other strategies
  • Implement not splitting the first node
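A rough sketch of the split decision (the statistic below is a simplified two-sample stand-in for the actual F-test, and the threshold is illustrative):

```python
def split_score(left, right):
    """Simplified two-sample statistic for one action's returns on the
    two sides of a candidate split (a stand-in for the F-test used in
    the actual implementation).
    """
    nl, nr = len(left), len(right)
    if nl < 2 or nr < 2:
        return 0.0
    ml, mr = sum(left) / nl, sum(right) / nr
    vl = sum((x - ml) ** 2 for x in left) / (nl - 1)
    vr = sum((x - mr) ** 2 for x in right) / (nr - 1)
    pooled = (vl / nl + vr / nr) or 1e-12
    return (ml - mr) ** 2 / pooled  # large => sides differ

def should_split(sides_by_action, threshold):
    """Split if any action differs significantly between the sides."""
    return any(split_score(l, r) > threshold
               for l, r in sides_by_action.values())
```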

Tuesday, July 3, 2012

Feedback 03-07-2012

  • Corrected and improved the implementation of the formulas for the univariate and bivariate normal distribution
    • Univariate: 

    • Bivariate:

  • Effects:
    • Simulations are more stable
    • Simulations are much faster
  • Added visualization of the belief space for grid worlds

Results (Preview)
  • Experimental Set-up:
    • Light Dark domain with the following layout:
  • Initial belief: uniform over all possible states, except the goal state
  • # Episodes: 5000
  • Optimal is: two steps right, four steps down, five steps left (probably not the true optimal line)

Wednesday, June 27, 2012

Feedback 27-06-2012

  • Environment overview:
    • Light dark domain with a discrete state space (grid world)
    • If a move would end outside of the grid, the agent enters the grid again on the opposite side (e.g., agent leaves on the left side and enters on the right side)
    • Initial belief: uniform over all possible states (except for goal state)
    • Current set-up (A=agent, G=goal, L=light):

  • Behavior of the agent:
    • No tendency to go in a particular direction in the first step
  • Belief update: There seems to be something incorrect with either the way the probability density function is used to update the belief or with the probability density function itself (its values are used as probabilities, but they can exceed 1, which is not correct)
  • Implemented a visualization tool for the tree:

Monday, June 18, 2012

Feedback 18-06-12

  • Extended the incremental regression tree learner to multidimensional, continuous observations
  • Implemented a discrete state space version of the light-dark environment:
    • Agent A is placed in a grid world and has to reach a goal location G
    • It is very dark in this grid world but there is a light source in one column, so the idea is that the agent has to move away from its goal to localize itself
    • Actions: move up, down, left, right
    • Observations: location (x,y) corrupted by some zero-mean Gaussian with a standard deviation given by the following quadratic equation:
                STD = sqrt( 0.5 * (x_best - x)^2 + K )
    • Rewards: -1 for moving, +10 for reaching the goal
  • Finished the background chapter
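For concreteness, the noise model above can be evaluated as follows (x_best = 9 matches the light column used below, but K = 0.1 is an illustrative value, not the one used in the experiments):

```python
import math

def obs_noise_std(x, x_best=9.0, k=0.1):
    """Standard deviation of the zero-mean Gaussian observation noise,
    quadratic in the distance from the light column:
        STD = sqrt( 0.5 * (x_best - x)^2 + K )
    x_best is the light column; K keeps the noise nonzero at the light.
    """
    return math.sqrt(0.5 * (x_best - x) ** 2 + k)

# Noise is smallest at the light column and grows with distance from it:
# obs_noise_std(9.0) ~ 0.32, obs_noise_std(0.0) ~ 6.37
```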
Experimental set-up
  • Environment: 5x10 light-dark domain, light source in column 9:
  • Number of episodes: 1,000
  • Max. steps: 25
  • UCT-C: 10 
  • Discount factor: 0.95
  • Optimal is the MDP solution based on the shortest distance from the agent's starting location to the goal location
Normal Scaling

X-Axis scaled to Log

  • Try a larger grid world (the agent does not yet show the expected behavior of going right for the first few steps and then left)
  • Continue writing
  • Meeting tomorrow at 13:00

Friday, June 15, 2012

Results: Deletion vs Re-use

Experimental Set-up
  • One action performed in the real world, from different initial beliefs
  • Probability(state = tiger left) = initial belief
  • Number of episodes: 5,000
  • Algorithms:
    • Re-use computes the belief range for each leaf of the observation tree and only splits a leaf if the distance between the belief ranges that the two new children would have is below some (very low) threshold
    • Deletion is the standard COMCTS algorithm

  • Re-use is straight line
  • Deletion is dotted line
  • The color of each line segment (p1,p2) is the RGB-mixture of the average percentages of the selected actions at the points p1 and p2
  • Black markers indicate the data points (i.e., the initial beliefs)

Results: Without Re-Use

Experimental Set-up
  • Same as in previous post

Results: Re-Using

Experimental Set-up
  • One step in the real world from different initial beliefs
  • Color is the average RGB-mixture of the percentages of the selected actions at the two end points of each line segment
    • red: listen
    • green: open left
    • blue: open right

Wednesday, June 13, 2012

Results: Tiger Problem - Increasing Noise

Experimental Set-up
  • Environment: Infinite Horizon Tiger Problem (stopped after 2 steps)
  • Roll-outs: 500
  • The standard deviation is that of the Gaussian noise added to the observation signal

Tuesday, June 12, 2012

Results: Belief-Based Re-Use in Tiger Problem


  • Computes for each leaf of the regression tree the corresponding range in the belief space
  • Re-uses leaves with a similar range
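A sketch of the range-based re-use test (the endpoint-distance measure and tolerance are illustrative assumptions):

```python
def similar_range(range_a, range_b, tol=0.05):
    """True if two leaves' belief ranges are close enough to share
    knowledge; `tol` is an illustrative threshold.

    A range is a (low, high) interval in the 1D belief space [0, 1];
    the distance here is the larger of the two endpoint gaps.
    """
    return max(abs(range_a[0] - range_b[0]),
               abs(range_a[1] - range_b[1])) <= tol

def reusable_leaves(new_range, old_leaves):
    """Leaves of the previous tree whose range matches the new leaf."""
    return [leaf for leaf, rng in old_leaves.items()
            if similar_range(new_range, rng)]
```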

First 1,000 roll-outs

Saturday, May 26, 2012

Results: Tiger Problem

Experimental Set-up
  • Regret = optimal value - mean value
  • Optimal value = sum of the max. positive reward achievable in each step (20 for Horizon 2, 100 for Horizon 10)

Results: Regret Plots
Horizon 2

Horizon 2 (zoomed in)

Horizon 10

Horizon 10 (zoomed in)

Friday, May 25, 2012

Minutes: Joint Meeting

  • Planning
    • We reviewed our planning based on the Gantt chart made for the thesis plan
      • Update chart if necessary (like in my case)
    • We discussed the graduation date and the reviewing and hand-in of the thesis (and of draft versions of it)
    • Hand-in of thesis draft in three weeks
  • Results
    • We elaborated on our current results
    • I should make regret plots (optimal - mean) to show differences between the algorithms
    • Try to add optimal lines to the grid world plots
    • Michael wants to think about a simple environment to show the difference between MCTS and MC
  • Other
    • We discussed the following "transposition tree":
      • Assume the state / observation space to be Markovian
      • The algorithm builds a tree consisting of two parts:
        • An "observation tree" which provides a discretization of the observation space (or the state space). This tree is conceptually the same as the observation tree in COMCTS
        • For each leaf in the observation tree, there is an action node which keeps the UCB knowledge about the actions for the observation range that is represented by the leaf (in case of continuous actions, this action node is replaced by an action tree which provides a discretization of the action space)
      • The algorithm starts with one leaf in the observation tree connected to one action node.
      • Splits can be decided on a simple criterion such as a count [1] or a more informative one such as the statistics used in the TG algorithm [2]
      • Because of the assumption made above, this tree can be re-used after a step in the real world is taken. However, for POMDPs the observation might be dependent on the action taken and this is not considered by this algorithm.
    • Complete recall: reconstruct the tree only on the first level
      • Keep one list of samples, pointers to the correct depth and to the correct examples
    • Backpropagation: 
      • first update all UCB knowledge (bottom to top)
      • after that check for a split (top to bottom)
      • this avoids splits at high depths of the tree that would become redundant if there also occurs a split at a lower depth
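The two-pass backpropagation order can be sketched as follows (the node API with `update` and `maybe_split` is hypothetical):

```python
def backpropagate(path, ret):
    """Backpropagation order discussed above: first update the UCB
    statistics from the leaf up, then check the split criterion from
    the root down, so a split low in the tree is found before any
    deeper split that it would make redundant.

    `path` is the root-to-leaf list of visited nodes; each node is
    assumed to have update(ret) and maybe_split() methods.
    """
    for node in reversed(path):   # bottom to top: statistics
        node.update(ret)
    for node in path:             # top to bottom: split checks
        if node.maybe_split():
            break                 # deeper checks are now redundant
```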
  1. Update Gantt chart
  2. Plotting: 
    • Regret plots
    • Optimal curves for grid world
  3. Think further about  "transposition trees" for POMDPs
  4. Think further about re-using and sharing trees based on belief state distance
  5. Adapt Wumpus world to continuous observations

[1] Ali Nouri and Michael L. Littman. Multi-resolution exploration in continuous spaces. In Advances in Neural Information Processing Systems 21 (NIPS 2008), pages 1209-1216, Vancouver, British Columbia, Canada, December 2008.

[2] Kurt Driessens, Jan Ramon, and Hendrik Blockeel. Speeding up relational reinforcement learning through the use of an incremental first order decision tree learner. In Machine Learning: ECML 2001, pages 97-108, 2001.

Monday, May 21, 2012

Updated Results

  • Implemented complete recall for COMCTS:
    • Stores all examples (observation(t), reward(t)) together with their "future" {<action(t+1), observation(t+1), reward(t+1)>, <action(t+2), observation(t+2), reward(t+2)>, ...}
    • Uses this knowledge to reconstruct the tree after a split
  • Implemented a 4x3 grid world:

    • Actions: up, down, left, right
    • Observations: number of adjacent walls corrupted by some normally distributed noise with standard deviation = 0.965
    • Rewards: +1 for goal state, -1 for penalty state, movement costs -0.04
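A sketch of this observation model (the grid-as-strings encoding and the border-counts-as-wall handling are illustrative assumptions):

```python
import random

def wall_count_observation(grid, x, y, sigma=0.965, rng=random):
    """Observation for the grid world: the number of adjacent walls,
    corrupted by zero-mean Gaussian noise with the standard deviation
    given above.

    `grid` is a list of strings with '1' marking wall cells; cells
    beyond the outer border count as walls.
    """
    walls = 0
    for dx, dy in ((0, -1), (0, 1), (-1, 0), (1, 0)):
        nx, ny = x + dx, y + dy
        if not (0 <= ny < len(grid) and 0 <= nx < len(grid[0])):
            walls += 1                      # outer border counts as wall
        elif grid[ny][nx] == '1':
            walls += 1
    return walls + rng.gauss(0.0, sigma)
```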

Tiger Problem: Horizon 2

 Tiger Problem: Horizon 10

Grid World: max. 22 steps