Wednesday, December 19, 2012

Progress 19-12-2012

Discretization for POMCP
  • Equal width binning
    • Predefined number of cut points per dimension (m)
    • Number of dimensions (n)
    • For each sequence node, there are (m+1)^n history nodes => lots of nodes!
    • Used for the tree and to update the particle filter 
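
A minimal Python sketch of the binning step (the helper below and its names are my own, not the actual implementation):

  import numpy as np

  def equal_width_key(obs, low, high, m):
      """Map a continuous observation vector to a tuple of bin indices.

      obs, low, high: length-n arrays (observation and per-dimension bounds).
      m: number of cut points per dimension, i.e. m+1 bins per dimension.
      The returned tuple keys a history node, so a sequence node can have
      up to (m+1)^n distinct children.
      """
      obs, low, high = (np.asarray(a, dtype=float) for a in (obs, low, high))
      width = (high - low) / (m + 1)                    # width of one bin
      idx = np.floor((obs - low) / width).astype(int)
      return tuple(int(i) for i in np.clip(idx, 0, m))  # clamp values on the bounds

  # Example: a 2-dimensional observation in [0,1]^2 with m = 3 cut points per
  # dimension gives (3+1)^2 = 16 possible history nodes per sequence node.
  print(equal_width_key([0.2, 0.9], [0.0, 0.0], [1.0, 1.0], m=3))   # -> (0, 3)

Presumably the same mapping is applied to simulated observations when the particle filter is updated, so that particles are matched against the binned real observation.
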
Hallway environment
  • Based on [1]
  • Actions: move forward, turn left, turn right, turn around
  • Reward: -1 per action
  • Observations: 4 wall detection sensors (with Gaussian noise), 1 landmark / goal detection sensor (a rough sketch follows the maps below)
  • Belief: location of agent (agent's orientation is known)
  • Initial belief distribution: uniform over all free cells
  • Currently available maps (0 = free, 1 = wall, G = goal, L = landmark):
    • 0000G
    • 00000000000
      11L1L1L1G11
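
A rough Python sketch of how an observation could be generated in this environment (the sensor order, the noise level, and detecting the landmark/goal only at the agent's own cell are assumptions; the actual model may differ):

  import numpy as np

  DELTAS = [(-1, 0), (0, 1), (1, 0), (0, -1)]           # north, east, south, west

  def observe(grid, row, col, heading, sigma=0.3, rng=np.random):
      """Return 4 noisy wall readings (front, right, back, left) plus a landmark/goal bit."""
      def wall(r, c):
          if r < 0 or r >= len(grid) or c < 0 or c >= len(grid[0]):
              return 1.0                                # outside the map counts as wall
          return 1.0 if grid[r][c] == '1' else 0.0

      readings = []
      for turn in range(4):                             # rotate the sensors around the agent
          dr, dc = DELTAS[(heading + turn) % 4]
          readings.append(wall(row + dr, col + dc) + rng.normal(0.0, sigma))
      at_landmark = 1.0 if grid[row][col] in ('L', 'G') else 0.0
      return np.array(readings + [at_landmark])

  # Example on the second map above (0 = free, 1 = wall, L = landmark, G = goal):
  grid = ["00000000000",
          "11L1L1L1G11"]
  print(observe(grid, row=0, col=2, heading=1))         # agent in the top row, facing east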

[1] Michael Littman, Anthony Cassandra, and Leslie Kaelbling. Learning policies for partially observable environments: Scaling up. In Armand Prieditis and Stuart Russell, editors, Proceedings of the Twelfth International Conference on Machine Learning, pages 362--370, San Francisco, CA, 1995. Morgan Kaufmann.

Tree Visualization


Sunday, July 29, 2012

Feedback 29-07-2012

Progress
  • Implemented a visualization for the light-dark domain in RL-Viz
  • Finished time-based performance experiment for light-dark domain (see below)
  • Thesis: 
    • Included the experiment mentioned before
    • Some small language / content corrections
Experimental Set-up
  • Environment: 10x10 discrete state space light-dark domain
  • Max. steps: 20
  • Discount factor: 0.95
  • Number of episodes: 2,000
Results
time vs roll-outs

time vs mean

Thursday, July 26, 2012

Feedback 26-07-12

Progress
  • Thesis: 
    • Wrote a section about related work
    • Started to write on conclusion chapter
    • Added experiments about performance in infinite horizon tiger problem
Overview of experiments
  • Performance
    • Sample based (# roll-outs vs mean)
      1. Variations of COMCTS
      2. Variations of TT
      3. Best COMCTS variation against best TT variation
      4. Variations with similar performance
    • Time based: same four experiments (time vs mean)
  • Practical time and space usage (time vs roll-outs)
  • Varying Noise (Standard deviation of noise vs mean)

Experiments: Time vs roll-outs, time vs mean


Set-up
  • Environment: infinite horizon Tiger problem
  • Discount factor: 0.5
  • Discount horizon: 0.00001
  • Number of episodes: 2,000
  • Algorithms:
    • BFAIRMC (pronounced: "be fair mc") = Belief based Function Approximation using Incremental Regression trees and Monte Carlo = transposition tree
Results
time vs rollouts

time vs mean

time vs mean (zoomed in)

time vs regret (optimum - mean)

Saturday, July 21, 2012

Feedback 21-07-2012

Results
  • All variations of TT: bootstrap on the standard transposition tree (without keeping and updating the 1st node separately)

Friday, July 20, 2012

Feedback 20-07-2012

Progress
  • Implemented bootstrapping for the transposition tree:
    • Updates the tree from bottom to top (same way as before)
    • Formula: R_{k..d-1} + γ^(d-k) * V(b_d), where
      • k is the depth of the current update,
      • d is the horizon depth given by the ε-horizon time, and
      • V(b_d) = max_a Q(b_d, a)
    • Q(b_d, a) refers to the average outcome of action a in belief state b_d (given by the corresponding UCT value)
    • R_{k..d-1} is the return, i.e. the cumulative discounted reward of the step rewards from step k to step d-1 (see the sketch below)
  • Now, the deletion and insertion strategies perform as well as the perfect recall strategy (see plots below)
  • Remark: the results below all keep the first node separate from the other nodes and also update this node during backpropagation; the remaining nodes form the transposition tree
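
A small Python sketch of this bootstrapped backup (the node structure and names are placeholders, not the actual classes):

  from collections import defaultdict

  class BeliefNode:
      def __init__(self):
          self.count = defaultdict(int)      # visits per action
          self.mean = defaultdict(float)     # Q(b, a): average outcome per action

  def backpropagate(nodes, actions, rewards, gamma):
      """Bootstrapped backup from bottom to top.

      nodes:   belief nodes b_0 .. b_d visited during the simulation
      actions: actions a_0 .. a_{d-1} taken at those nodes
      rewards: step rewards r_0 .. r_{d-1}
      """
      d = len(rewards)                                   # horizon depth (ε-horizon time)
      v_bd = max(nodes[d].mean.values(), default=0.0)    # V(b_d) = max_a Q(b_d, a)
      for k in range(d - 1, -1, -1):
          # R_{k..d-1}: cumulative discounted reward of steps k .. d-1
          ret = sum(gamma ** (i - k) * rewards[i] for i in range(k, d))
          target = ret + gamma ** (d - k) * v_bd         # bootstrapped return
          node, a = nodes[k], actions[k]
          node.count[a] += 1
          node.mean[a] += (target - node.mean[a]) / node.count[a]   # running average
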
Results


Monday, July 16, 2012

Feedback 16-07-2012

Progress
  • Implemented a continuous state space variation of the light-dark domain:
    • State space is a subspace [minX, maxX] × [minY, maxY] of R^2
      • Goal is a square of size 1
    • Actions: discrete, with some additive Gaussian noise
    • Wrap-around effect: if the agent leaves on one side, it enters on the opposite side
    • Observations: same as in discrete state space variation
  • Implemented a belief approximation:
    • Bivariate Gaussian with three parameters: meanX, meanY, variance
    • Updates are done with a Kalman filter (simplified to this specific Gaussian and the environment; a sketch follows the list)
    • Matrix computations are performed with apache commons math
    • Bivariate Gaussian is realized with a subset of classes from jahmm
  • Implemented a transposition tree for this Gaussian:
    • Instead of maintaining and updating a 1D belief during a simulation, the algorithm maintains and updates the three parameters of the Gaussian
    • Searches for split points along the Gaussian's three dimensions
    • Selects the best F-test among all dimensions for a split
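
A simplified Python sketch of one belief update for this Gaussian (numpy stands in for apache commons math / jahmm; the isotropic, additive-noise form is an assumption based on the description above):

  import numpy as np

  def belief_update(mean, var, action_delta, action_noise_var, z, obs_noise_var):
      """Kalman-style update of an isotropic Gaussian belief N(mean, var * I).

      mean: np.array([meanX, meanY]); var: the shared scalar variance.
      action_delta: intended displacement of the chosen (discrete) action.
      z: the observed position (x, y).
      """
      # Prediction: shift the mean by the action, grow the uncertainty.
      mean_pred = mean + np.asarray(action_delta, dtype=float)
      var_pred = var + action_noise_var
      # Correction: a scalar Kalman gain, since all covariances are isotropic.
      gain = var_pred / (var_pred + obs_noise_var)
      mean_new = mean_pred + gain * (np.asarray(z, dtype=float) - mean_pred)
      var_new = (1.0 - gain) * var_pred
      return mean_new, var_new

  # Example: belief centred at (2, 3), action "right" = (1, 0), observation (3.4, 2.8)
  m, v = belief_update(np.array([2.0, 3.0]), 1.0, (1.0, 0.0), 0.1, (3.4, 2.8), 0.5)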

Saturday, July 14, 2012

Feedback 14-07-12

Progress
  • Changed the light-dark domain such that the agent is placed at a randomly selected starting cell at the beginning of each episode
  • Implemented a heuristic for the light-dark domain:
    • Assumes the agent is on the grid cell for which the belief is the highest
    • Takes the action whose resulting cell has the smallest Euclidean distance to the goal cell (see the sketch below)
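
A small Python sketch of this heuristic (the action encoding is an assumption):

  import numpy as np

  ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

  def heuristic_action(belief, goal):
      """Assume the most likely cell and move towards the goal.

      belief: 2D array with belief[r][c] = probability that the agent is at (r, c)
      goal:   (row, col) of the goal cell
      """
      belief = np.asarray(belief)
      r, c = np.unravel_index(np.argmax(belief), belief.shape)     # most likely cell
      def dist_to_goal(delta):
          return np.hypot(r + delta[0] - goal[0], c + delta[1] - goal[1])
      return min(ACTIONS, key=lambda a: dist_to_goal(ACTIONS[a]))
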
Experimental Set-up
  • Light Dark domain with the following layout:
********L*
********L*
********L*
********L*
********L*
***G****L*
********L*
********L*
********L*
********L* 
  • At the beginning of each episode, the agent is randomly placed at any cell except the goal G
  • Initial belief: uniform over all possible states, except the goal state
  • Number of episodes: 1000
  • Discount factor: 0.95
Results


Wednesday, July 11, 2012

Feedback 11-07-2012

Experimental Set-up
  • Environment: Infinite Horizon Tiger Problem
  • Discount factor: 0.5
  • Discount horizon: 0.00001
  • Number of episodes: 5000
  • Number of steps: 2
  • Algorithms:
    • Both Keep 1st Node variants use perfect recall as the splitting strategy
    • Keep 1st Node + Update also updates the "first node" during backpropagation
  • Regret plots: the mean value achieved by magic guesser is taken as the optimum
Results

Regret: Variations of COMCTS

Regret: Variations of TT

Mean: Variations of COMCTS

Mean: Variations of TT

Tuesday, July 10, 2012

Meeting 10-07-2012

Discussion
  • Thesis:
    • Structure stays as it is (maybe some sections will still be moved later on)
    • Add research question about computational complexity of both algorithms
  • Transposition tree: recreate split tests from stored samples if perfect recall is used (but otherwise do not copy them)
  • Plots: 
    • Include data of up to 10^3 roll-outs
    • Regret: change y-axis to log scale
    • Thesis: 
      • Separate plots for the variations of each algorithm
      • Plots to compare best variants
      • Plots for variations with similar curves
  • Transposition tree: related work is function approximation / generalization with decision / regression trees in reinforcement learning
  • Transposition tree for light-dark domain:
    • Changes to light-dark domain:
      • Continuous state space with a squared region for the goal
      • Discrete actions, corrupted by some Gaussian noise
    • Belief space represented as a multivariate Gaussian centered around the agent's actual location (x,y) with three parameters: mean(x), mean(y), standard deviation
Planning
  • Hand-in of chapters 2 and 3 this Friday July 13
  • Next meeting: Monday July 15

Monday, July 9, 2012

Feedback 09-07-2012

Progress
  • Transposition Tree: Implemented not splitting the first node 
    • The agent's initial belief is represented by an additional node
    • This "first node" is updated separately from all other nodes
    • There is a parameter that allows this node to also be updated during backpropagation if the belief falls within its range
  • Transposition Tree: Corrected the way an action is selected at the end of a simulation
Experimental Set-up
  • Environment: Infinite Horizon Tiger Problem
  • Discount factor: 0.5
  • Discount horizon: 0.00001 
    • (with this setting, one roll-out means 25 updates to the tree)
  • Number of episodes: 5000
  • Number of steps: 2
  • Algorithms:
    • Both Keep 1st Node variants use Perfect Recall as the splitting strategy
    • Keep 1st Node + Update also updates the "first node" during backpropagation 
Results


Saturday, July 7, 2012

Feedback 07-07-2012

Progress
  • Wrote some new sections for the thesis
  • Changed the thesis structure

Friday, July 6, 2012

Feedback 06-07-2012

Progress
  • Implemented perfect recall for the transposition tree
    • Remembers all samples (belief, action, discounted reward)
    • In case of a split, uses the stored samples to recreate the mean and the count of the action nodes below the new leaves (sketched below)
    • Space for the samples is pre-allocated: 
      • #samples = #roll-outs * epsilon-horizon
  • All three variants (deletion, insertion, transposition tree) decrease in performance after the first split(s) (in the plot below, there is a recurring dip at the beginning of each curve)
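
A Python sketch of the rebuilding step after a split (for a one-dimensional belief as in the Tiger problem; the structure and names are my own):

  from collections import defaultdict

  def rebuild_after_split(samples, split_point):
      """Recreate mean and count of the action nodes below the two new leaves.

      samples: stored (belief, action, discounted_reward) tuples of the old leaf.
      Returns two dicts {action: (count, mean)}, one per new leaf.
      """
      def stats(group):
          total, count = defaultdict(float), defaultdict(int)
          for _, a, r in group:
              total[a] += r
              count[a] += 1
          return {a: (count[a], total[a] / count[a]) for a in count}

      left = [s for s in samples if s[0] < split_point]
      right = [s for s in samples if s[0] >= split_point]
      return stats(left), stats(right)
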
Experimental Set-up
  • Environment: Infinite Horizon Tiger Problem
  • Discount factor: 0.5
  • Discount horizon: 0.00001
  • Number of episodes: 5000
  • Number of steps: 2

Preliminary Results
First 100 samples

Thursday, July 5, 2012

Feedback 05-07-2012

Progress
  • Implemented the transposition tree
    • Starts from a single leaf node which represents the complete belief space ( [0,1] in the Tiger problem )
    • Uses an F-test for splitting in the belief space (see the sketch after this list)
    • Collects test statistics for each action for the two "sides" of each test
    • Splits if there is a significant difference between any action (e.g., between action 0 on the "left" side and action 0 on the "right" side)
  • Re-use of knowledge after a split:
    • Deletion (implemented)
    • Insertion of information from winning test into new action nodes (implemented)
    • Insert split tests into new leaves (not implemented)
    • Other strategies (like perfect recall)?
  • Problems:
    • Both of the following problems relate to the agent's true belief in the real environment at the current time step, and both occur for deletion as well as insertion
    • If there is a split near the end of the episode at a leaf whose range includes the agent's true belief, the algorithm cannot gather enough new data about the expected value of each action at that range to give reliable estimates.
    • Splitting sometimes creates a leaf with a large range, e.g., [0.5000, 0.9170]. Now, if the agent's true belief is 0.91, the expected value of the actions is again not correct.
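
A Python sketch of the split test (scipy's two-group one-way F-test is used as a stand-in here; whether the implementation computes the statistic in exactly this way is an assumption):

  from scipy import stats

  def should_split(samples, split_point, alpha=0.05):
      """Check one candidate split point of a leaf's belief range.

      samples: (belief, action, discounted_reward) tuples collected in the leaf.
      Splits if, for any action, the rewards on the two sides of split_point
      differ significantly.
      """
      for a in {a for _, a, _ in samples}:
          left  = [r for b, act, r in samples if act == a and b <  split_point]
          right = [r for b, act, r in samples if act == a and b >= split_point]
          if len(left) < 2 or len(right) < 2:
              continue                                  # too little data on one side
          f_stat, p_value = stats.f_oneway(left, right)
          if p_value < alpha:
              return True
      return False
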
Planning
  • Implement other strategies
  • Implement not splitting the first node

Tuesday, July 3, 2012

Feedback 03-07-2012

Progress
  • Corrected and improved the implementation of the formulas for the univariate and bivariate normal distribution
    • Univariate:
      f(x) = 1 / (σ √(2π)) · exp( -(x - μ)² / (2σ²) )
    • Bivariate:
      f(x, y) = 1 / (2π σx σy √(1 - ρ²)) · exp( -z / (2 (1 - ρ²)) ),
      where z = (x - μx)²/σx² - 2ρ (x - μx)(y - μy)/(σx σy) + (y - μy)²/σy²
  • Effects:
    • Simulations are more stable
    • Simulations are much faster
  • Added visualization of the belief space for grid worlds



Results (Preview)
  • Experimental Set-up:
    • Light Dark domain with the following layout:
********L*
******A*L*
********L*
********L*
********L*
***G****L*
********L*
********L*
********L*
********L* 
  • Initial belief: uniform over all possible states, except the goal state
  • # Episodes: 5000
  • Optimal is: two steps right, four steps down, five steps left (probably not the true optimal line)

Wednesday, June 27, 2012

Feedback 27-06-2012

Progress
  • Environment overview:
    • Light dark domain with a discrete state space (grid world)
    • If a move would end outside of the grid, the agent enters the grid again on the opposite side (e.g., agent leaves on the left side and enters on the right side)
    • Initial belief: uniform over all possible states (except for goal state)
    • Current set-up (A=agent, G=goal, L=light):
         ***L*
         **AL*
         *G*L*
         ***L*
         ***L*

  • Behavior of the agent:
    • No tendency to go in a particular direction in the first step
  • Belief update: something seems to be incorrect, either in how the probability density function is used to update the belief or in the density function itself (it returns probabilities > 1, which is not correct)
  • Implemented a visualization tool for the tree:

Monday, June 18, 2012

Feedback 18-06-12

Progress
  • Extended the incremental regression tree learner to multidimensional, continuous observations
  • Implemented a discrete state space version of the light-dark environment:
    • Agent A is placed in a grid world and has to reach a goal location G
    • It is very dark in this grid world but there is a light source in one column, so the idea is that the agent has to move away from its goal to localize itself
    • Actions: move up, down, left, right
    • Observations: the location (x, y) corrupted by zero-mean Gaussian noise whose standard deviation is given by the following quadratic expression (sketched after this list):
                STD = sqrt [ 0.5 * (xbest - x)^2 + K ]
    • Rewards: -1 for moving, +10 for reaching the goal
  • Finished the background chapter
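
A Python sketch of this observation model, assuming xbest is the x-coordinate of the light column and that both coordinates are corrupted with the same standard deviation (K is left as a parameter since its value is not stated):

  import numpy as np

  def observe(x, y, x_light, K=1.0, rng=np.random):
      """Noisy location observation in the discrete light-dark grid world."""
      std = np.sqrt(0.5 * (x_light - x) ** 2 + K)   # STD = sqrt[ 0.5 * (xbest - x)^2 + K ]
      return x + rng.normal(0.0, std), y + rng.normal(0.0, std)

  # Example: two columns away from the light, std = sqrt(0.5 * 2**2 + K).
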
Experimental set-up
  • Environment: 5x10 light-dark domain, light source in column 9:
G*******L*
********L*
******A*L*
********L*
********L*
  • Number of episodes: 1,000
  • Max. steps: 25
  • UCT-C: 10 
  • Discount factor: 0.95
  • Optimal is the MDP-solution based on shortest distance from the agent's starting location to the goal location
Results
Normal Scaling

X-Axis scaled to Log

Planning
  • Try a larger grid world (the agent does not yet show the expected behavior of first going right for a few steps and then left)
  • Continue writing
  • Meeting tomorrow at 13:00

Friday, June 15, 2012

Results: Deletion vs Re-use

Experimental Set-up
  • One action performed in the real world, from different initial beliefs
  • Probability(state = tiger left) = initial belief
  • Number of episodes: 5,000
  • Algorithms:
    • Re-use computes the belief range for each leaf of the observation tree and only splits a leaf in the observation tree if the distance between the belief ranges that the two new children would have is below some (very low) threshold
    • Deletion is the standard COMCTS algorithm

Legend
  • Re-use is the solid line
  • Deletion is the dotted line
  • The color of each line segment (p1,p2) is the RGB-mixture of the average percentages of the selected actions at the points p1 and p2
  • Black markers indicate the data points (i.e., the initial beliefs)
Results




Results: Without Re-Use

Experimental Set-up
  • Same as in previous post
Results

Results: Re-Using

Experimental Set-up
  • One step in the real world from different initial beliefs
  • Color is the average RGB-mixture of the percentages of the selected actions at the two end points of each line segment
    • red: listen
    • green: open left
    • blue: open right
Results

Wednesday, June 13, 2012

Results: Tiger Problem - Increasing Noise

Experimental Set-up
  • Environment: Infinite Horizon Tiger Problem (stopped after 2 steps)
  • Roll-outs: 500
  • The standard deviation refers to the Gaussian noise added to the observation signal
Results

Tuesday, June 12, 2012

Results: Belief-Based Re-Use in Tiger Problem

Algorithm:

  • computes for each leaf of the regression tree the corresponding range in the belief space
  • re-uses leaves with a similar range
Results:

First 1,000 roll-outs



Saturday, May 26, 2012

Results: Tiger Problem

Experimental Set-up
  • Regret = optimal value - mean value
  • Optimal value = sum over max. positive reward achievable in each step (20 for Horizon 2, 100 for Horizon 10)

Results: Regret Plots
Horizon 2

Horizon 2 (zoomed in)

Horizon 10

Horizon 10 (zoomed in)

Friday, May 25, 2012

Minutes: Joint Meeting

Discussion 
  • Planning
    • We reviewed our planning based on the Gantt chart  made for the thesis plan
      • Update chart if necessary (like in my case)
    • We discussed the graduation date and the reviewing and hand-in of the thesis (and of draft versions of it)
    • Hand-in of thesis draft in three weeks
  • Results
    • We elaborated on our current results
    • I should make regret plots (optimal - mean) to show differences between the algorithms
    • Try to add optimal lines to the grid world plots
    • Michael wants to think about a simple environment  to show the difference between MCTS and MC
  • Other
    • We discussed the following "transposition tree":
      • Assume the state / observation space to be Markovian
      • The algorithm builds a tree consisting of two parts:
        • An "observation tree" which provides a discretization of the observation space (or the state space). This tree is conceptually the same as the observation tree in COMCTS
        • For each leaf in the observation tree, there is an action node which keeps the UCB knowledge about the actions for the observation range that is represented by the leaf (in case of continuous actions, this action node is replaced by an action tree which provides a discretization of the action space)
      • The algorithm starts with one leaf in the observation tree connected to one action node.
      • Splits can be decided on a simple criterion such as a count [1] or a more informative one such as the statistics used in the TG algorithm [2]
      • Because of the assumption made above, this tree can be re-used after a step in the real world is taken. However, for POMDPs the observation might be dependent on the action taken and this is not considered by this algorithm.
    • Complete recall: reconstruct the tree only on the first level
      • Keep one list of samples, pointers to the correct depth and to the correct examples
    • Backpropagation: 
      • first, update all UCB knowledge (bottom to top)
      • after that, check for a split (top to bottom)
      • this avoids splits deep in the tree that would become redundant if a split also occurs closer to the root (see the sketch below)
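
A Python sketch of this ordering (the node fields and the simple count-based split criterion are placeholders):

  class Node:
      """Minimal stand-in for a tree node."""
      def __init__(self, split_threshold=50):
          self.count, self.mean = 0, 0.0
          self.split_threshold = split_threshold

      def update(self, ret):
          self.count += 1
          self.mean += (ret - self.mean) / self.count   # running average (UCB knowledge)

      def split_is_due(self):
          return self.count >= self.split_threshold     # e.g. a simple count criterion [1]

  def backpropagate_then_split(path, ret):
      """path[0] is the root; ret is the discounted return of the simulation."""
      for node in reversed(path):       # 1) update all UCB knowledge, bottom to top
          node.update(ret)
      for node in path:                 # 2) then check for a split, top to bottom
          if node.split_is_due():
              return node               #    a split here makes deeper candidates redundant
      return None
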
Planning
  1. Update Gantt chart
  2. Plotting: 
    • Regret plots
    • Optimal curves for grid world
  3. Think further about  "transposition trees" for POMDPs
  4. Think further about re-using and sharing trees based on belief state distance
  5. Adapt Wumpus world to continuous observations


[1] Ali Nouri and Michael L. Littman. Multi-resolution Exploration in Continuous Spaces. In Advances in Neural Information Processing Systems 21 (NIPS 2008), pages 1209-1216, Vancouver, Canada, 2008.

[2] Kurt Driessens, Jan Ramon, and Hendrik Blockeel. Speeding up relational reinforcement learning through the use of an incremental first order decision tree learner. In Machine Learning: ECML 2001, pages 97-108. Springer, 2001.

Monday, May 21, 2012

Updated Results

Progress
  • Implemented complete recall for COMCTS:
    • Stores all examples (observation(t), reward(t)) together with their "future" {<action(t+1), observation(t+1), reward(t+1)>, <action(t+2), observation(t+2), reward(t+2)>, ...}
    • Uses this knowledge to reconstruct the tree after a split
  • Implemented a 4x3 grid world:

    • Actions: up, down, left, right
    • Observations: number of adjacent walls corrupted by some normally distributed noise with standard deviation = 0.965
    • Rewards: +1 for goal state, -1 for penalty state, movement costs -0.04
 

Tiger Problem: Horizon 2

 
Tiger Problem: Horizon 10

Grid World: max. 22 steps