Wednesday, July 11, 2012

Feedback 11-07-2012

Experimental Set-up
  • Environment: Infinite Horizon Tiger Problem
  • Discount factor: 0.5
  • Discount horizon: 0.00001
  • Number of episodes: 5000
  • Number of steps: 2
  • Algorithms:
    • Both Keep 1st Node variants use perfect recall as the strategy for re-using knowledge after a split
    • Keep 1st Node + Update also updates the "first node" during backpropagation
  • Regret plots: the mean value achieved by the magic guesser is taken as the optimum (sketched below)
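A minimal sketch of how such a regret curve can be computed (names are assumptions; magic_mean stands for the magic guesser's mean value, taken as the optimum):

    import numpy as np

    def regret_curve(mean_values, magic_mean):
        # Regret per number of roll-outs: the optimum (the magic guesser's
        # mean value) minus the mean value achieved by the algorithm.
        return magic_mean - np.asarray(mean_values)

    # Plotted with a log-scaled y-axis, as agreed in the 10-07-2012 meeting.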
Results

Regret: Variations of COMCTS

Regret: Variations of TT

Mean: Variations of COMCTS

Mean: Variations of TT

Tuesday, July 10, 2012

Meeting 10-07-2012

Discussion
  • Thesis:
    • Structure stays as it is (some sections may still be moved later on)
    • Add research question about computational complexity of both algorithms
  • Transposition tree: recreate split tests from stored samples if perfect recall is used (but otherwise do not copy them)
  • Plots: 
      • Include data for up to 10^3 roll-outs
    • Regret: change y-axis to log scale
    • Thesis: 
      • Separate plots for the variations of each algorithm
      • Plots to compare best variants
      • Plots for variations with similar curves
  • Transposition tree: related work is function approximation / generalization with decision and regression trees in reinforcement learning
  • Transposition tree for light-dark domain:
    • Changes to light-dark domain:
      • Continuous state space with a square region for the goal
      • Discrete actions, corrupted by some Gaussian noise
    • Belief space represented as a multivariate Gaussian centered around the agent's actual location (x,y) with three parameters: mean(x), mean(y), standard deviation
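A minimal sketch of this belief representation (class and method names are assumptions, not the thesis code):

    import numpy as np

    class GaussianBelief:
        """Belief over the agent's (x, y) location: an isotropic bivariate
        Gaussian with three parameters (mean(x), mean(y), standard deviation)."""
        def __init__(self, mean_x, mean_y, std):
            self.mean_x = mean_x
            self.mean_y = mean_y
            self.std = std

        def pdf(self, x, y):
            # Isotropic bivariate normal density (zero correlation,
            # identical standard deviation in x and y).
            d2 = (x - self.mean_x) ** 2 + (y - self.mean_y) ** 2
            return np.exp(-d2 / (2 * self.std ** 2)) / (2 * np.pi * self.std ** 2)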
Planning
  • Hand-in of chapters 2 and 3 this Friday July 13
  • Next meeting: Monday July 15

Monday, July 9, 2012

Feedback 09-07-2012

Progress
  • Transposition Tree: Implemented not splitting the first node 
    • The agent's initial belief is represented by an additional node
    • This "first node" is updated separately from all other nodes
    • A parameter allows the first node to also be updated during backpropagation when the belief falls within its range (see the sketch below)
  • Transposition Tree: Corrected the way an action is selected at the end of a simulation
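A minimal sketch of the first-node handling (all names are assumptions, not the thesis code):

    class Node:
        """Node of the transposition tree covering a 1-D belief range."""
        def __init__(self, lo, hi):
            self.lo, self.hi = lo, hi
            self.mean = {}   # per-action mean discounted reward
            self.count = {}  # per-action sample count

        def contains(self, belief):
            return self.lo <= belief <= self.hi

        def update(self, action, reward):
            n = self.count.get(action, 0)
            m = self.mean.get(action, 0.0)
            self.count[action] = n + 1
            self.mean[action] = m + (reward - m) / (n + 1)

    class FirstNodeTree:
        def __init__(self, update_first_node=False):
            # Additional node for the agent's initial belief; it is never
            # split and is updated separately from all other nodes.
            self.first_node = Node(0.0, 1.0)
            self.update_first_node = update_first_node

        def backpropagate(self, leaf, belief, action, reward):
            leaf.update(action, reward)
            # Optional: also update the first node during backpropagation
            # if the belief falls in its range.
            if self.update_first_node and self.first_node.contains(belief):
                self.first_node.update(action, reward)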
Experimental Set-up
  • Environment: Infinite Horizon Tiger Problem
  • Discount factor: 0.5
  • Discount horizon: 0.00001 
    • (with this setting, one roll-out means 25 updates to the tree)
  • Number of episodes: 5000
  • Number of steps: 2
  • Algorithms:
    • Both Keep 1st Node variants use Perfect Recall as the strategy for re-using knowledge after a split
    • Keep 1st Node + Update also updates the "first node" during backpropagation 
Results


Saturday, July 7, 2012

Feedback 07-07-2012

Progress
  • Wrote some new sections for the thesis
  • Changed the structure:

Friday, July 6, 2012

Feedback 06-07-2012

Progress
  • Implemented perfect recall for the transposition tree
    • Remembers all samples (belief, action, discounted reward)
    • In case of a split, uses the stored samples to recreate the mean and the count of the action nodes below the new leaves (see the sketch after this list)
    • Space for the samples is pre-allocated: 
      • #samples = #roll-outs * epsilon-horizon
  • All three variants of the transposition tree (deletion, insertion, perfect recall) drop in performance after the first split(s); in the plot below, there is a recurring dip at the beginning of each curve
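A minimal sketch of the perfect-recall bookkeeping (array layout and names are assumptions):

    import numpy as np

    class PerfectRecall:
        """Remembers every (belief, action, discounted reward) sample so the
        statistics of the action nodes below new leaves can be rebuilt
        exactly after a split."""
        def __init__(self, n_rollouts, epsilon_horizon):
            # Pre-allocate: #samples = #roll-outs * epsilon-horizon.
            n = n_rollouts * epsilon_horizon
            self.beliefs = np.empty(n)
            self.actions = np.empty(n, dtype=int)
            self.rewards = np.empty(n)
            self.size = 0

        def add(self, belief, action, reward):
            i = self.size
            self.beliefs[i] = belief
            self.actions[i] = action
            self.rewards[i] = reward
            self.size += 1

        def recreate_stats(self, lo, hi, action):
            # Mean and count for one action node of a new leaf covering
            # the belief range [lo, hi).
            n = self.size
            mask = ((self.beliefs[:n] >= lo) & (self.beliefs[:n] < hi)
                    & (self.actions[:n] == action))
            rewards = self.rewards[:n][mask]
            return (rewards.mean() if rewards.size else 0.0), int(rewards.size)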
Experimental Set-up
  • Environment: Infinite Horizon Tiger Problem
  • Discount factor: 0.5
  • Discount horizon: 0.00001
  • Number of episodes: 5000
  • Number of steps: 2

Preliminary Results
First 100 samples

Thursday, July 5, 2012

Feedback 05-07-2012

Progress
  • Implemented the transposition tree
    • Starts from a single leaf node which represents the complete belief space ( [0,1] in the Tiger problem )
    • Uses an F-test for splitting in the belief space
    • Collects test statistics for each action for the two "sides" of each test
    • Splits if any action shows a significant difference between the two sides (e.g., between action 0 on the "left" side and action 0 on the "right" side); see the sketch after this list
  • Re-use of knowledge after a split:
    • Deletion (implemented)
    • Insertion of information from winning test into new action nodes (implemented)
    • Insert split tests into new leaves (not implemented)
    • Other strategies (like perfect recall)?
  • Problems:
    • All of the following problems relate to the agent's true belief in the real environment at the current time step; they occur for both deletion and insertion
    • If a split occurs near the end of an episode at a leaf whose range includes the agent's true belief, the algorithm cannot gather enough new data in that range to estimate the expected value of each action reliably
    • Splitting sometimes creates a leaf with a large range, e.g., [0.5000, 0.9170]. If the agent's true belief is 0.91, the expected values of the actions are again unreliable, because the leaf's statistics average over its whole range
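A minimal sketch of the split test described above, using SciPy's one-way F-test (the thesis implementation may differ in details such as the candidate split points and the significance level):

    from scipy.stats import f_oneway

    def should_split(samples, split_point, alpha=0.05):
        """samples: iterable of (belief, action, discounted_reward) collected
        at a leaf. Partitions them at split_point and tests, per action,
        whether the returns on the 'left' side differ significantly from
        those on the 'right' side."""
        for a in {action for _, action, _ in samples}:
            left = [r for b, act, r in samples if act == a and b < split_point]
            right = [r for b, act, r in samples if act == a and b >= split_point]
            if len(left) < 2 or len(right) < 2:
                continue  # too little data to test this action
            _, p = f_oneway(left, right)
            if p < alpha:
                return True  # significant difference for some action
        return False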
Planning
  • Implement other strategies
  • Implement not splitting the first node

Tuesday, July 3, 2012

Feedback 03-07-2012

Progress
  • Corrected and improved the implementation of the formulas for the univariate and bivariate normal distribution (see the sketch after this list)
    • Univariate:

      f(x) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)

    • Bivariate:

      f(x, y) = \frac{1}{2\pi \sigma_x \sigma_y \sqrt{1 - \rho^2}} \exp\left( -\frac{1}{2(1 - \rho^2)} \left[ \frac{(x - \mu_x)^2}{\sigma_x^2} - \frac{2\rho (x - \mu_x)(y - \mu_y)}{\sigma_x \sigma_y} + \frac{(y - \mu_y)^2}{\sigma_y^2} \right] \right)
  • Effects:
    • Simulations are more stable
    • Simulations are much faster
  • Added visualization of the belief space for grid worlds
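For reference, straightforward NumPy implementations of the two densities above (a sketch, not the thesis code):

    import numpy as np

    def normal_pdf(x, mu, sigma):
        # Univariate normal density.
        return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

    def bivariate_normal_pdf(x, y, mu_x, mu_y, sigma_x, sigma_y, rho):
        # Bivariate normal density with correlation coefficient rho.
        zx = (x - mu_x) / sigma_x
        zy = (y - mu_y) / sigma_y
        q = (zx ** 2 - 2 * rho * zx * zy + zy ** 2) / (1 - rho ** 2)
        return np.exp(-q / 2) / (2 * np.pi * sigma_x * sigma_y * np.sqrt(1 - rho ** 2))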



Results (Preview)
  • Experimental Set-up:
    • Light Dark domain with the following layout (A = agent start, G = goal, L = light region):
********L*
******A*L*
********L*
********L*
********L*
***G****L*
********L*
********L*
********L*
********L* 
    • Initial belief: uniform over all possible states, except the goal state
    • # Episodes: 5000
    • Optimal is taken to be: two steps right, four steps down, five steps left (probably not the truly optimal path; see the check below)
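A quick sanity check that the hand-picked route ends on the goal cell of the layout above (zero-based (row, column) coordinates; a sketch):

    start = (1, 6)  # 'A' in the layout
    goal = (5, 3)   # 'G' in the layout
    # Two steps right, four steps down, five steps left.
    path = [(0, 1)] * 2 + [(1, 0)] * 4 + [(0, -1)] * 5
    pos = start
    for dr, dc in path:
        pos = (pos[0] + dr, pos[1] + dc)
    assert pos == goal  # the route does reach the goal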