Sunday, July 29, 2012

Feedback 29-07-2012

Progress
  • Implemented visualization for the light-dark domain in RL-Viz
  • Finished time-based performance experiment for light-dark domain (see below)
  • Thesis: 
    • Included the experiment mentioned before
    • Some small language / content corrections
Experimental Set-up
  • Environment: 10x10 discrete state space light-dark domain
  • Max. steps: 20
  • Discount factor: 0.95
  • Number of episodes: 2,000
Results
time vs roll-outs

time vs mean

Thursday, July 26, 2012

Feedback 26-07-12

Progress
  • Thesis: 
    • Wrote a section about related work
    • Started to write on conclusion chapter
    • Added experiments about performance in infinite horizon tiger problem
Overview of experiments
  • Performance
    • Sample based (# roll-outs vs mean)
      1. Variations of COMCTS
      2. Variations of TT
      3. Best COMCTS variation against best TT variation
      4. Variations with similar performance
    • Time based: same four experiments (time vs mean)
  • Practical time and space usage (time vs roll-outs)
  • Varying Noise (Standard deviation of noise vs mean)

Experiments: Time vs roll-outs, time vs mean


Set-up
  • Environment: infinite horizon Tiger problem
  • Discount factor: 0.5
  • Discount horizon: 0.00001
  • Number of episodes: 2,000
  • Algorithms:
    • BFAIRMC (pronounced "be fair MC"): Belief-based Function Approximation using Incremental Regression trees and Monte Carlo, i.e., the transposition tree
Results
time vs roll-outs

time vs mean

time vs mean (zoomed in)

time vs regret (optimum - mean)

Saturday, July 21, 2012

Feedback 21-07-2012

Results
  • All TT variations use bootstrapping on the standard transposition tree (without keeping and updating the 1st node separately)

Friday, July 20, 2012

Feedback 20-07-2012

Progress
  • Implemented bootstrapping for the transposition tree (a small sketch follows after this list):
    • Updates the tree from bottom to top (same way as before)
    • Formula: R_{k..d-1} + γ^{d-k} * V(b_d), where
      • k is the depth of the current update,
      • d is the horizon depth given by the ε-horizon time, and
      • V(b_d) = max_a Q(b_d, a)
    • Q(b_d, a) refers to the average outcome of action a in belief state b_d (given by the corresponding UCT value)
    • R_{k..d-1} is the return, i.e., the cumulative discounted reward of the step rewards from step k to step d - 1
  • Now, the deletion and insertion strategies perform as well as the perfect recall strategy (see plots below)
  • Remark: the results below all keep the first node separate from the other nodes and also update this node during backpropagation; the remaining nodes form the transposition tree
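A minimal sketch of this bootstrapped backup in Python; the node class, the path representation, and the method names are illustrative assumptions, not the thesis code:

class BeliefNode:
    # Toy stand-in for a node of the transposition tree: per-action mean and count.
    def __init__(self, n_actions):
        self.mean = [0.0] * n_actions
        self.count = [0] * n_actions

    def update(self, action, target):
        # Incremental mean update towards the bootstrapped target.
        self.count[action] += 1
        self.mean[action] += (target - self.mean[action]) / self.count[action]


def bootstrapped_backup(path, leaf, gamma):
    # path : list of (node, action, reward) for depths k = 0 .. d-1
    # leaf : node holding the deepest belief b_d
    # gamma: discount factor
    d = len(path)
    v_leaf = max(leaf.mean)                        # V(b_d) = max_a Q(b_d, a)
    ret = 0.0                                      # running R_{k..d-1}
    for k in range(d - 1, -1, -1):                 # update the tree from bottom to top
        node, action, reward = path[k]
        ret = reward + gamma * ret                 # R_{k..d-1} = r_k + gamma * R_{k+1..d-1}
        target = ret + gamma ** (d - k) * v_leaf   # R_{k..d-1} + gamma^(d-k) * V(b_d)
        node.update(action, target)
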
Results


Monday, July 16, 2012

Feedback 16-07-2012

Progress
  • Implemented a continuous state space variation of the light-dark domain:
    • State space is a subspace [minX, maxX] × [minY, maxY] of R^2
      • Goal is a square of size 1
    • Actions: discrete, with some additive Gaussian noise
    • Wrap-around effect: if agent leaves on one side, it enters on the opposite side
    • Observations: same as in discrete state space variation
  • Implemented a belief approximation (a small sketch of the update follows after this list):
    • Bivariate Gaussian with three parameters: meanX, meanY, variance
    • Updates are done with a Kalman filter (simplified to the specific Gaussian and the environment)
    • Matrix computations are performed with Apache Commons Math
    • Bivariate Gaussian is realized with a subset of classes from jahmm
  • Implemented a transposition tree for this Gaussian:
    • Instead of maintaining and updating a 1D belief during a simulation, the algorithm maintains and updates the three parameters of the Gaussian
    • Searches for split points along the Gaussian's three dimensions
    • Selects the best F-test among all dimensions for a split
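A minimal sketch of how such a simplified Kalman-style update of the three belief parameters (meanX, meanY, variance) could look. The isotropic handling of the variance and the light-dark noise model in obs_noise_var are illustrative assumptions, not the actual implementation (which uses Apache Commons Math and jahmm):

def predict(mean_x, mean_y, var, dx, dy, action_noise_var):
    # Prediction step: shift the mean by the intended move and grow the
    # variance by the variance of the additive Gaussian action noise.
    return mean_x + dx, mean_y + dy, var + action_noise_var


def correct(mean_x, mean_y, var, obs_x, obs_y, obs_noise):
    # Correction step: scalar Kalman gain, applied to both coordinates because
    # the belief keeps a single variance parameter (isotropic Gaussian).
    k = var / (var + obs_noise)
    mean_x += k * (obs_x - mean_x)
    mean_y += k * (obs_y - mean_y)
    var *= (1.0 - k)
    return mean_x, mean_y, var


def obs_noise_var(x, light_x=8.0, base=0.01, scale=0.5):
    # Illustrative light-dark noise model (an assumption): observations get
    # noisier the farther the agent is from the light column at x = light_x.
    return base + scale * (x - light_x) ** 2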

Saturday, July 14, 2012

Feedback 14-07-12

Progress
  • Changed the light-dark domain so that the agent starts each episode at a randomly selected cell
  • Implemented a heuristic for the light-dark domain (a small sketch follows below):
    • Assumes the agent is in the grid cell with the highest belief
    • Takes the action that minimizes the Euclidean distance from that cell to the goal cell
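A minimal sketch of the heuristic, interpreted here as moving to the neighbouring cell closest to the goal; the dictionary-based belief and the action encoding are illustrative assumptions:

import math

def heuristic_action(belief, goal, actions):
    # belief : dict mapping (x, y) grid cells to probabilities
    # goal   : (x, y) goal cell
    # actions: dict mapping action names to (dx, dy) moves
    x, y = max(belief, key=belief.get)       # cell with the highest belief
    # Action whose resulting cell is closest (Euclidean) to the goal.
    return min(actions, key=lambda a: math.hypot(x + actions[a][0] - goal[0],
                                                 y + actions[a][1] - goal[1]))
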
Experimental Set-up
  • Light Dark domain with the following layout:
********L*
********L*
********L*
********L*
********L*
***G****L*
********L*
********L*
********L*
********L* 
  • At the beginning of each episode, the agent is randomly placed at any cell except the goal G
  • Initial belief: uniform over all possible states, except the goal state (a small sketch follows below)
  • Number of episodes: 1000
  • Discount factor: 0.95
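For illustration, the initial belief from this set-up could be constructed as follows; the function name and cell encoding are assumptions:

def uniform_initial_belief(width, height, goal):
    # Uniform belief over all grid cells except the goal cell.
    cells = [(x, y) for x in range(width) for y in range(height) if (x, y) != goal]
    p = 1.0 / len(cells)
    return {cell: p for cell in cells}

# 10x10 light-dark grid with the goal G at column 3, row 5 (0-indexed)
belief = uniform_initial_belief(10, 10, goal=(3, 5))
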
Results


Wednesday, July 11, 2012

Feedback 11-07-2012

Experimental Set-up
  • Environment: Infinite Horizon Tiger Problem
  • Discount factor: 0.5
  • Discount horizon: 0.00001
  • Number of episodes: 5000
  • Number of steps: 2
  • Algorithms:
    • Both Keep 1st Node variants use perfect recall as the splitting strategy
    • Keep 1st Node + Update also updates the "first node" during backpropagation
  • Regret plots: the mean value achieved by the magic guesser is taken as the optimum
Results

Regret: Variations of COMCTS

Regret: Variations of TT

Mean: Variations of COMCTS

Mean: Variations of TT

Tuesday, July 10, 2012

Meeting 10-07-2012

Discussion
  • Thesis:
    • Structure stays as it is (maybe some sections will still be moved later on)
    • Add research question about computational complexity of both algorithms
  • Transposition tree: recreate split tests from stored samples if perfect recall is used (but otherwise do not copy them)
  • Plots: 
    • Include data of up to 10^3 roll-outs
    • Regret: change y-axis to log scale
    • Thesis: 
      • Separate plots for the variations of each algorithm
      • Plots to compare best variants
      • Plots for variations with similar curves
  • Transposition tree: related work is function approximation / generalization with decision / regression trees in reinforcement learning
  • Transposition tree for light-dark domain:
    • Changes to light-dark domain:
      • Continuous state space with a square region for the goal
      • Discrete actions, corrupted by some Gaussian noise
    • Belief space represented as a multivariate Gaussian centered around the agent's actual location (x,y) with three parameters: mean(x), mean(y), standard deviation
Planning
  • Hand-in of chapters 2 and 3 this Friday July 13
  • Next meeting: Monday July 15

Monday, July 9, 2012

Feedback 09-07-2012

Progress
  • Transposition Tree: Implemented the option of not splitting the first node (a small sketch follows after this list)
    • The agent's initial belief is represented by an additional node
    • This "first node" is updated separately from all other nodes
    • A parameter allows this node to also be updated during backpropagation if the belief falls within the first node's range
  • Transposition Tree: Corrected the way an action is selected at the end of a simulation
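A minimal sketch of the "keep 1st node" idea; the leaf interface with find_leaf, contains, and update is assumed for illustration, not the thesis code:

class KeepFirstNode:
    # The agent's initial belief gets its own statistics node that is never
    # split; all other beliefs are handled by the regular transposition tree.
    def __init__(self, first_node, tree_root, update_during_backprop=False):
        self.first_node = first_node
        self.tree = tree_root
        self.update_during_backprop = update_during_backprop

    def backup(self, belief, action, target):
        self.tree.find_leaf(belief).update(action, target)    # normal tree update
        if self.update_during_backprop and self.first_node.contains(belief):
            self.first_node.update(action, target)             # optional extra update
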
Experimental Set-up
  • Environment: Infinite Horizon Tiger Problem
  • Discount factor: 0.5
  • Discount horizon: 0.00001 
    • (with this setting, one roll-out means 25 updates to the tree)
  • Number of episodes: 5000
  • Number of steps: 2
  • Algorithms:
    • Both Keep 1st Node variants use Perfect Recall as the splitting strategy
    • Keep 1st Node + Update also updates the "first node" during backpropagation 
Results


Saturday, July 7, 2012

Feedback 07-07-2012

Progress
  • Wrote some new sections for the thesis
  • Changed the structure

Friday, July 6, 2012

Feedback 06-07-2012

Progress
  • Implemented perfect recall for the transposition tree (a small sketch follows after this list)
    • Remembers all samples (belief, action, discounted reward)
    • In case of a split, uses the stored samples to recreate the mean and the count of the action nodes which are below the new leaves
    • Space for the samples is pre-allocated: 
      • #samples = #roll-outs * ε-horizon
  • All three variants (deletion, insertion, transposition tree) decrease in performance after the first split(s) (in the plot below, there is a recurring dip at the beginning of each curve)
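A minimal sketch of perfect recall for a 1D belief. This is illustrative only; the original pre-allocates the sample array with #roll-outs * ε-horizon entries, whereas this sketch simply appends:

class PerfectRecallLeaf:
    # Keeps every (belief, action, discounted reward) sample so that, after a
    # split, the statistics of the two new leaves can be rebuilt exactly.
    def __init__(self, n_actions):
        self.n_actions = n_actions
        self.samples = []                      # (belief, action, discounted reward)
        self.mean = [0.0] * n_actions
        self.count = [0] * n_actions

    def update(self, belief, action, reward):
        self.samples.append((belief, action, reward))
        self.count[action] += 1
        self.mean[action] += (reward - self.mean[action]) / self.count[action]

    def split(self, split_point):
        # Rebuild both children from the stored samples instead of starting
        # from empty statistics.
        left = PerfectRecallLeaf(self.n_actions)
        right = PerfectRecallLeaf(self.n_actions)
        for belief, action, reward in self.samples:
            (left if belief < split_point else right).update(belief, action, reward)
        return left, right
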
Experimental Set-up
  • Environment: Infinite Horizon Tiger Problem
  • Discount factor: 0.5
  • Discount horizon: 0.00001
  • Number of episodes: 5000
  • Number of steps: 2

Preliminary Results
First 100 samples

Thursday, July 5, 2012

Feedback 05-07-2012

Progress
  • Implemented the transposition tree
    • Starts from a single leaf node which represents the complete belief space ( [0,1] in the Tiger problem )
    • Uses an F-test for splitting in the belief space
    • Collects test statistics for each action for the two "sides" of each test
    • Splits if, for any action, there is a significant difference between the two sides (e.g., between action 0 on the "left" side and action 0 on the "right" side); a small sketch of this test follows after this list
  • Re-use of knowledge after a split:
    • Deletion (implemented)
    • Insertion of information from winning test into new action nodes (implemented)
    • Insert split tests into new leaves (not implemented)
    • Other strategies (like perfect recall)?
  • Problems:
    • All following problems relate to the agent's true belief in the real environment at the current time step and occur for both deletion and insertion
    • If there is a split near the end of the episode at a leaf whose range includes the agent's true belief, the algorithm cannot gather enough new data about the expected value of each action at that range to give reliable estimates.
    • Splitting sometimes creates a leaf with a large range, e.g., [0.5000, 0.9170]. If the agent's true belief is 0.91, the estimated action values for that leaf are again unreliable.
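A minimal sketch of the kind of two-group F-test that could drive such a split; the per-side statistics object, the candidate-test bookkeeping, and the significance level are illustrative assumptions:

from scipy import stats

class SideStats:
    # Running statistics of the discounted returns seen for one action on one
    # side of a candidate split point.
    def __init__(self):
        self.n, self.s, self.ss = 0, 0.0, 0.0
    def add(self, x):
        self.n += 1
        self.s += x
        self.ss += x * x
    @property
    def mean(self):
        return self.s / self.n if self.n else 0.0

def f_test_p_value(left, right):
    # One-way ANOVA with two groups: is the mean return of an action
    # significantly different on the two sides of the candidate split point?
    n1, n2 = left.n, right.n
    if n1 < 2 or n2 < 2:
        return 1.0                                      # not enough data yet
    grand = (left.s + right.s) / (n1 + n2)
    ss_between = n1 * (left.mean - grand) ** 2 + n2 * (right.mean - grand) ** 2
    ss_within = (left.ss - n1 * left.mean ** 2) + (right.ss - n2 * right.mean ** 2)
    if ss_within <= 0.0:
        return 0.0
    f = ss_between / (ss_within / (n1 + n2 - 2))        # df = (1, n1 + n2 - 2)
    return stats.f.sf(f, 1, n1 + n2 - 2)

def should_split(candidate_tests, alpha=0.05):
    # candidate_tests: for each candidate split point, a dict action -> (left, right)
    return any(f_test_p_value(left, right) < alpha
               for per_action in candidate_tests
               for left, right in per_action.values())
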
Planning
  • Implement other strategies
  • Implement not splitting the first node

Tuesday, July 3, 2012

Feedback 03-07-2012

Progress
  • Corrected and improved the implementation of the formulas for the univariate and bivariate normal distribution (a sketch of the densities follows after this list):
    • Univariate
    • Bivariate
  • Effects:
    • Simulations are more stable
    • Simulations are much faster
  • Added visualization of the belief space for grid worlds
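For reference, a sketch of the textbook normal densities; the exact (possibly simplified) forms used in the thesis implementation may differ:

import math

def univariate_normal_pdf(x, mu, sigma):
    # Standard univariate normal density.
    return math.exp(-(x - mu) ** 2 / (2.0 * sigma ** 2)) / (sigma * math.sqrt(2.0 * math.pi))

def bivariate_normal_pdf(x, y, mu_x, mu_y, sigma_x, sigma_y, rho=0.0):
    # Standard bivariate normal density with correlation rho.
    zx = (x - mu_x) / sigma_x
    zy = (y - mu_y) / sigma_y
    denom = 2.0 * math.pi * sigma_x * sigma_y * math.sqrt(1.0 - rho ** 2)
    expo = -(zx * zx - 2.0 * rho * zx * zy + zy * zy) / (2.0 * (1.0 - rho * rho))
    return math.exp(expo) / denom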



Results (Preview)
  • Experimental Set-up:
    • Light Dark domain with the following layout:
********L*
******A*L*
********L*
********L*
********L*
***G****L*
********L*
********L*
********L*
********L* 
  • Initial belief: uniform over all possible states, except the goal state
  • # Episodes: 5000
  • The reference optimum is: two steps right, four steps down, five steps left (probably not the truly optimal path)