Wednesday, December 19, 2012

Progress 19-12-2012

Discretization for POMCP
  • Equal width binning
    • Predefined number of cut points per dimension (m)
    • Number of dimensions (n)
    • For each sequence node, there are (m+1)^n history nodes => lots of nodes!
    • Used both for the search tree and to update the particle filter (see the binning sketch below)
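A minimal sketch of the equal width binning in Java (the class name, the fixed observation bounds per dimension, and the encoding of the bin indices into a single history-node key are illustrative assumptions, not taken from the actual implementation):

/**
 * Equal-width binning of a continuous observation vector.
 * Each of the n dimensions is cut into m+1 bins by m equally spaced
 * cut points, so a sequence node can have up to (m+1)^n history nodes.
 * (Illustrative sketch; bounds and key encoding are assumptions.)
 */
public class EqualWidthBinning {

    private final int m;            // number of cut points per dimension
    private final double[] min;     // lower bound per dimension (assumed known)
    private final double[] max;     // upper bound per dimension (assumed known)

    public EqualWidthBinning(int m, double[] min, double[] max) {
        this.m = m;
        this.min = min;
        this.max = max;
    }

    /** Maps one dimension's value to a bin index in [0, m]. */
    private int binIndex(double value, int dim) {
        double width = (max[dim] - min[dim]) / (m + 1);
        int bin = (int) ((value - min[dim]) / width);
        return Math.min(Math.max(bin, 0), m); // clamp values on or outside the bounds
    }

    /** Encodes the whole observation as a single discrete key for a history node. */
    public long observationKey(double[] observation) {
        long key = 0;
        for (int dim = 0; dim < observation.length; dim++) {
            key = key * (m + 1) + binIndex(observation[dim], dim);
        }
        return key;
    }
}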
Hallway environment
  • Based on [1]
  • Actions: move forward, turn left, turn right, turn around
  • Reward: -1 per action
  • Observations: 4 wall-detection sensors (with Gaussian noise) and 1 landmark/goal detection sensor (see the sensor sketch after this list)
  • Belief: location of agent (agent's orientation is known)
  • Initial belief distribution: uniform over all free cells
  • Currently available maps (0 = free, 1 = wall, G = goal, L = landmark):
    • 0000G
    • 00000000000
      11L1L1L1G11
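A minimal sketch of the noisy wall sensors (the grid encoding, the noise level, and the fixed world-frame sensor directions are assumptions for this example; in the environment the four sensors would be relative to the agent's known orientation):

import java.util.Random;

/**
 * Illustrative sketch of the four noisy wall-detection sensors.
 * The grid encoding (true = wall), the noise level and the fixed
 * sensor directions are assumptions made for this example only.
 */
public class WallSensors {

    private static final double NOISE_STD_DEV = 0.1; // assumed sensor noise
    // Assumed sensor order: north, east, south, west (world frame, for simplicity).
    private static final int[][] DIRECTIONS = { { 0, -1 }, { 1, 0 }, { 0, 1 }, { -1, 0 } };

    private final boolean[][] walls; // walls[y][x] == true means the cell is a wall
    private final Random rng = new Random();

    public WallSensors(boolean[][] walls) {
        this.walls = walls;
    }

    /** Returns four noisy readings (~1.0 if the neighbouring cell is a wall, ~0.0 otherwise). */
    public double[] sense(int x, int y) {
        double[] readings = new double[DIRECTIONS.length];
        for (int i = 0; i < DIRECTIONS.length; i++) {
            int nx = x + DIRECTIONS[i][0];
            int ny = y + DIRECTIONS[i][1];
            boolean wall = ny < 0 || ny >= walls.length || nx < 0 || nx >= walls[ny].length
                    || walls[ny][nx];                     // outside the map counts as wall
            readings[i] = (wall ? 1.0 : 0.0) + rng.nextGaussian() * NOISE_STD_DEV;
        }
        return readings;
    }
}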

[1] Michael Littman, Anthony Cassandra, and Leslie Kaelbling. Learning policies for partially observable environments: Scaling up. In Armand Prieditis and Stuart Russell, editors, Proceedings of the Twelfth International Conference on Machine Learning, pages 362--370, San Francisco, CA, 1995. Morgan Kaufmann.

Tree Visualization


Sunday, July 29, 2012

Feedback 29-07-2012

Progress
  • Implemented visualization for light-dark domain in RL-Viz:
    [screenshots of the RL-Viz visualization]
  • Finished time-based performance experiment for light-dark domain (see below)
  • Thesis: 
    • Included the experiment mentioned before
    • Some small language / content corrections
Experimental Set-up
  • Environment: 10x10 discrete state space light-dark domain
  • Max. steps: 20
  • Discount factor: 0.95
  • Number of episodes: 2,000
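The exact evaluation protocol of the time-based experiment is not spelled out here, so the sketch below only illustrates one plausible reading: each planning step gets a fixed wall-clock budget, and the discounted return is averaged over the episodes. The Environment/Planner interfaces and the budget parameter are assumptions; only the parameters listed above come from this entry.

/**
 * Illustrative time-budgeted evaluation loop for the 10x10 light-dark domain.
 * Environment/Planner are assumed interfaces; only max. 20 steps, discount 0.95
 * and 2,000 episodes come from the set-up above.
 */
public class TimeBasedExperiment {

    interface Environment {
        void reset();
        boolean isTerminal();
        double act(int action);           // executes the action, returns the immediate reward
        int observation();                // last observation made by the agent
    }

    interface Planner {
        void reset();
        int planAction(long timeBudgetMillis);     // run roll-outs until the budget expires
        void update(int action, int observation);  // advance the planner's belief
    }

    public static double meanReturn(Environment env, Planner planner, long budgetMillis) {
        final int episodes = 2000;
        final int maxSteps = 20;
        final double gamma = 0.95;
        double total = 0.0;
        for (int e = 0; e < episodes; e++) {
            env.reset();
            planner.reset();
            double episodeReturn = 0.0;
            double discount = 1.0;
            for (int t = 0; t < maxSteps && !env.isTerminal(); t++) {
                int action = planner.planAction(budgetMillis);
                double reward = env.act(action);
                planner.update(action, env.observation());
                episodeReturn += discount * reward;
                discount *= gamma;
            }
            total += episodeReturn;
        }
        return total / episodes;
    }
}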
Results
time vs roll-outs

time vs mean

Thursday, July 26, 2012

Feedback 26-07-12

Progress
  • Thesis: 
    • Wrote a section about related work
    • Started to write on conclusion chapter
    • Added experiments about performance in the infinite horizon Tiger problem
Overview of experiments
  • Performance
    • Sample based (# roll-outs vs mean)
      1. Variations of COMCTS
      2. Variations of TT
      3. Best COMCTS variation against best TT variation
      4. Variations with similar performance
    • Time based: same four experiments (time vs mean)
  • Practical time and space usage (time vs roll-outs)
  • Varying Noise (Standard deviation of noise vs mean)

Experiments: Time vs roll-outs, time vs mean


Set-up
  • Environment: infinite horizon Tiger problem
  • Discount factor: 0.5
  • Discount horizon (ε for the ε-horizon time): 0.00001 (see the note below)
  • Number of episodes: 2,000
  • Algorithms:
    • BFAIRMC (pronounced "be fair MC") = Belief-based Function Approximation using Incremental Regression trees and Monte Carlo = the transposition tree
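Note (assuming the discount horizon above is the ε of the ε-horizon time used in the bootstrapping formula, entry of 20-07-2012): the effective search depth d is the smallest depth with

γ^d < ε  =>  d > ln(ε) / ln(γ) = ln(10^-5) / ln(0.5) ≈ 16.6,

so the search is cut off after d = 17 steps.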
Results
time vs rollouts

time vs mean

time vs mean (zoomed in)

time vs regret (optimum - mean)

Saturday, July 21, 2012

Feedback 21-07-2012

Results
  • All variations of TT, with bootstrapping on the standard transposition tree (without keeping and updating the first node separately)

Friday, July 20, 2012

Feedback 20-07-2012

Progress
  • Implemented bootstrapping for the transposition tree:
    • Updates the tree from bottom to top (same way as before)
    • Formula: R_{k...d-1} + γ^{d-k} * V(b_d), where
      • k is the depth of the current update,
      • d is the horizon depth given by the ε-horizon time, and
      • V(b_d) = max_a Q(b_d, a)
    • Q(b_d, a) refers to the average outcome of action a in belief state b_d (given by the corresponding UCT value)
    • R_{k...d-1} is the return or cumulative discounted reward of the step rewards from step k to step d-1
  • Now, the deletion and insertion strategies perform as well as the perfect recall strategy (see plots below; a sketch of the bootstrapped backup follows after this list)
  • Remark: all results below keep the first node separate from the other nodes and also update this node during backpropagation; the remaining nodes form the transposition tree
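A minimal sketch of such a bootstrapped backup. The node/tree structures, method names, and the way the visited path is passed in are assumptions for illustration; only the formula R_{k...d-1} + γ^{d-k} * V(b_d) itself comes from the description above.

/**
 * Illustrative bootstrapped backup for a transposition tree.
 * For each visited depth k, the node is updated with the discounted sum of
 * the rewards from step k to d-1 plus the bootstrap term γ^(d-k) * V(b_d),
 * where V(b_d) = max_a Q(b_d, a). Data structures are assumptions.
 */
public class BootstrappedBackup {

    /** Hypothetical node interface: Q-estimate per action plus an update method. */
    interface BeliefNode {
        double maxQ();                       // V(b) = max_a Q(b, a)
        void update(int action, double target);
    }

    /**
     * @param nodes   nodes visited at depths 0..d (nodes[d] is the ε-horizon node)
     * @param actions actions taken at depths 0..d-1
     * @param rewards rewards received at depths 0..d-1
     * @param gamma   discount factor
     */
    public static void backup(BeliefNode[] nodes, int[] actions, double[] rewards, double gamma) {
        int d = rewards.length;              // horizon depth given by the ε-horizon time
        double bootstrap = nodes[d].maxQ();  // V(b_d)
        // Walk from the bottom (k = d-1) to the top (k = 0), as described above.
        for (int k = d - 1; k >= 0; k--) {
            double target = 0.0;
            double discount = 1.0;
            for (int i = k; i < d; i++) {    // R_{k...d-1}
                target += discount * rewards[i];
                discount *= gamma;
            }
            target += discount * bootstrap;  // + γ^(d-k) * V(b_d)
            nodes[k].update(actions[k], target);
        }
    }
}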
Results


Monday, July 16, 2012

Feedback 16-07-2012

Progress
  • Implemented a continuous state space variation of the light-dark domain:
    • State space is a subspace [minX, maxX] × [minY, maxY] of R^2
      • Goal is a square of size 1
    • Actions: discrete, with some additive Gaussian noise
    • Wrap-around effect: if the agent leaves the area on one side, it re-enters on the opposite side
    • Observations: same as in the discrete state space variation
  • Implemented a belief approximation (see the sketch after this list):
    • Bivariate Gaussian with three parameters: meanX, meanY, variance
    • Updates are done with a Kalman filter (simplified to the specific Gaussian and the environment)
    • Matrix computations are performed with apache commons math
    • Bivariate Gaussian is realized with a subset of classes from jahmm
  • Implemented a transposition tree for this Gaussian:
    • Instead of maintaining and updating a 1D belief during a simulation, the algorithm maintains and updates the three parameters of the Gaussian
    • Searches for split points along the Gaussian's three dimensions
    • Selects the best F-test among all dimensions for a split
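A minimal sketch of a simplified Kalman-filter update for the three-parameter Gaussian belief (meanX, meanY, shared variance). The noise model (a noisy position measurement whose variance depends on how light the region is), the method names and the parameter values are assumptions; the actual implementation relies on apache commons math and jahmm instead.

/**
 * Illustrative simplified Kalman-filter update for an isotropic bivariate
 * Gaussian belief (meanX, meanY, shared variance). Noise parameters and the
 * observation model (noisy position measurement) are assumptions.
 */
public class GaussianBelief {

    private double meanX;
    private double meanY;
    private double variance; // shared variance for both dimensions

    public GaussianBelief(double meanX, double meanY, double variance) {
        this.meanX = meanX;
        this.meanY = meanY;
        this.variance = variance;
    }

    /** Prediction step: shift the mean by the intended action and add motion noise. */
    public void predict(double dx, double dy, double motionNoiseVariance) {
        meanX += dx;
        meanY += dy;
        variance += motionNoiseVariance;
    }

    /**
     * Correction step: fuse a noisy position observation (obsX, obsY) whose
     * noise variance depends on the state (small in the light, large in the dark).
     */
    public void correct(double obsX, double obsY, double observationNoiseVariance) {
        double gain = variance / (variance + observationNoiseVariance); // Kalman gain
        meanX += gain * (obsX - meanX);
        meanY += gain * (obsY - meanY);
        variance = (1.0 - gain) * variance;
    }
}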

Saturday, July 14, 2012

Feedback 14-07-12

Progress
  • Changed the light-dark domain such that it places the agent at a randomly selected starting cell at the beginning of an episode
  • Implemented a heuristic for the light-dark domain (see the sketch after this list):
    • Assumes the agent is on the grid cell for which the belief is highest
    • Takes the action that moves from that cell closest (in Euclidean distance) to the goal cell
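A minimal sketch of this most-likely-state heuristic. The grid representation of the belief, the action encoding (four grid moves), and the method names are assumptions made for the example only.

/**
 * Illustrative most-likely-state heuristic for the light-dark domain:
 * pick the cell with the highest belief and move greedily towards the goal.
 * Grid/action encodings are assumptions; wall and bounds checks are omitted.
 */
public class MostLikelyStateHeuristic {

    // Assumed action encoding: up, down, left, right.
    private static final int[][] MOVES = { { 0, -1 }, { 0, 1 }, { -1, 0 }, { 1, 0 } };

    public static int selectAction(double[][] belief, int goalX, int goalY) {
        // 1. Find the grid cell with the highest belief.
        int bestX = 0, bestY = 0;
        double bestProb = Double.NEGATIVE_INFINITY;
        for (int x = 0; x < belief.length; x++) {
            for (int y = 0; y < belief[x].length; y++) {
                if (belief[x][y] > bestProb) {
                    bestProb = belief[x][y];
                    bestX = x;
                    bestY = y;
                }
            }
        }
        // 2. Take the action whose successor cell is closest to the goal (Euclidean distance).
        int bestAction = 0;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (int a = 0; a < MOVES.length; a++) {
            double dx = (bestX + MOVES[a][0]) - goalX;
            double dy = (bestY + MOVES[a][1]) - goalY;
            double distance = Math.hypot(dx, dy);
            if (distance < bestDistance) {
                bestDistance = distance;
                bestAction = a;
            }
        }
        return bestAction;
    }
}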
Experimental Set-up
  • Light Dark domain with the following layout:
********L*
********L*
********L*
********L*
********L*
***G****L*
********L*
********L*
********L*
********L* 
  • At the beginning of each episode, the agent is randomly placed at any cell except the goal G
  • Initial belief: uniform over all possible states, except the goal state
  • Number of episodes: 1000
  • Discount factor: 0.95
Results