**Progress**

- Implemented exact belief state update
- Implemented .properties-file loader to load parameter settings for algorithms
- Implemented *algorithm for POMDPs with continuous observations* (see prev. post)
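For reference, the exact belief state update is the standard Bayesian filter for POMDPs (*T* is the transition model, *Ω* the observation model; with continuous observations, *Ω(o | s', a)* is a density rather than a probability):

```latex
b'(s') \;=\; \frac{\Omega(o \mid s', a) \sum_{s \in S} T(s' \mid s, a)\, b(s)}
                  {\sum_{s'' \in S} \Omega(o \mid s'', a) \sum_{s \in S} T(s'' \mid s, a)\, b(s)}
```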

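A minimal sketch of such a loader, using only the standard `java.util.Properties` API (the file name and keys below are made up for illustration):

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.util.Properties;

// Minimal sketch of a .properties-file loader for algorithm parameters.
// The file name and keys are illustrative, not the actual ones used here.
public final class AlgorithmConfig {

    public static Properties load(String path) throws IOException {
        Properties props = new Properties();
        try (FileInputStream in = new FileInputStream(path)) {
            props.load(in);   // parses simple "key=value" lines
        }
        return props;
    }

    public static void main(String[] args) throws IOException {
        Properties p = load("uct.properties");                             // hypothetical file
        double c = Double.parseDouble(p.getProperty("uct.explorationFactor", "1.0"));
        int depth = Integer.parseInt(p.getProperty("uct.maxSearchDepth", "10"));
        System.out.println("C = " + c + ", maxSearchDepth = " + depth);
    }
}
```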

- Steps colored in blue
- Observations need to be obtained from the environment simulator and stored while traversing the tree because they are needed to update the regression trees in the backpropagation step
- Using the observation received in the final state *t = max_roll_out_depth* of the roll-out is __not__ correct, because the observation at time *t* is influenced by the action and the state at time *t*, and not by the action and the state at time *t = max_roll_out_depth*

1. Start at root.
2. If there are children below this node and the max. search depth is not reached, select one according to some selection strategy, e.g., UCT. Otherwise, continue with Step (6).
3. Apply the action corresponding to the selected child to receive an observation *o^t* from the environment simulator and store *o^t* in a list *O = (o^1, o^2, ..., o^t)*.
4. Use *o^t* to traverse the regression tree until you reach a leaf.
5. Repeat Steps (2)-(4).
6. Return that leaf *l*.

- Choose an action *a** at leaf *l*.
- Connect *l* to a new regression tree with only one node (the root).

- Apply the selected action *a** to receive an observation *o** from the environment simulator and store *o** in *O = (o^1, o^2, ..., o*)*.
- Play randomly or according to some other roll-out strategy until the max. roll-out depth or a terminal state is reached.
- Receive a reward *r*.

- Use *r* to update all encountered nodes up to the root of the tree.
- Use *(o*, r)* to update the statistics of the leaves of the regression tree corresponding to action *a**.
- Use *(o^t, r)* to update the statistics of the leaves of the regression tree encountered at time step *t* on the way back up to the root (for *t = 1, 2, ...*).
- Split leaves of a regression tree if necessary.
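To make the loop above concrete, here is a compact sketch of one simulation in Java. All interfaces and names (`Simulator`, `SearchNode`, `RegressionTree`, ...) are illustrative placeholders for the structures described in this post, not my actual classes:

```java
import java.util.List;

// Illustrative interfaces for the structures described above.
interface Simulator {
    double applyAction(int action);        // execute action, return observation o^t
    double rollout(int maxDepth);          // roll-out strategy, return reward r
}

interface RegressionTree {
    SearchNode traverseToLeaf(double o);   // Step (4): follow o^t to a leaf
    void update(double o, double r);       // update leaf statistics, split if needed
}

interface SearchNode {
    boolean hasChildren();
    SearchNode selectChildUct(double c);   // Step (2): e.g. UCT
    int action();                          // action on the edge into this child
    RegressionTree observationTree();      // regression tree below this action
    int chooseUntriedAction();             // expansion: choose a*
    SearchNode expand(int action);         // connect a one-node regression tree
    void updateValue(double r);            // backpropagation of r
}

final class MctsSketch {
    private final double c;                // UCT exploration factor
    private final int maxSearchDepth;
    private final int maxRollOutDepth;

    MctsSketch(double c, int maxSearchDepth, int maxRollOutDepth) {
        this.c = c;
        this.maxSearchDepth = maxSearchDepth;
        this.maxRollOutDepth = maxRollOutDepth;
    }

    // One simulation: selection, expansion, roll-out, backpropagation.
    double simulate(SearchNode node, Simulator sim, List<Double> obs, int depth) {
        if (node.hasChildren() && depth < maxSearchDepth) {
            SearchNode child = node.selectChildUct(c);                    // Step (2)
            double o = sim.applyAction(child.action());                   // Step (3)
            obs.add(o);                                                   // store o^t in O
            SearchNode leaf = child.observationTree().traverseToLeaf(o);  // Step (4)
            double r = simulate(leaf, sim, obs, depth + 1);               // Step (5)
            child.observationTree().update(o, r);                         // backpropagation
            node.updateValue(r);
            return r;
        }
        int aStar = node.chooseUntriedAction();                           // expansion: a*
        double oStar = sim.applyAction(aStar);
        obs.add(oStar);
        SearchNode fresh = node.expand(aStar);                            // one-node tree
        double r = sim.rollout(maxRollOutDepth);                          // roll-out, reward r
        fresh.observationTree().update(oStar, r);                         // (o*, r) statistics
        node.updateValue(r);
        return r;
    }
}
```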

(Remark: I've implemented some general structures for nodes and trees; these structures are based on …)

- I've implemented the regression tree (a minimal sketch follows below)
- It's based on the *FIMT* (see references from previous post) and *TG* [1,2] algorithms
- Input examples are instances of type *Observation* from RL-Glue
- Handles only one-dimensional observations at the moment
- Uses the F-Test for finding splits
- Critical values of the F-distribution for 1 to 1000 degrees of freedom are stored in an array
- There are three parameters that influence the splitting:
- The minimum number of examples which need to be seen before the algorithm tests for possible splits
- The maximum number of split points that can be stored in a node
- The minimum number of examples which need to be seen for each "side" of a split point before the algorithm considers this split point
- As split points, the first *n* examples are taken (where *n* equals the second parameter described above)
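A minimal sketch of how such a leaf could maintain incremental statistics for the candidate split points and test them. All names are placeholders, and the concrete F-statistic (a one-way ANOVA comparison of between- vs. within-split variance with degrees of freedom 1 and n − 2) is my assumption about the test's form:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of a leaf in the incremental regression tree described above
// (one-dimensional observations, F-test splitting in the spirit of FIMT/TG).
final class RegressionLeaf {
    private final int minExamplesBeforeTest;  // parameter 1: examples before testing
    private final int maxSplitPoints;         // parameter 2: candidate split points
    private final int minExamplesPerSide;     // parameter 3: examples per side
    private final double[] fCritical;         // assumed table of critical values,
                                              // indexed by denominator d.o.f. 1..1000

    private final List<Double> candidates = new ArrayList<>();
    private long n;
    private double sum, sumSq;                // running reward statistics
    private final long[] nLeft;               // per-candidate "left side" statistics
    private final double[] sumLeft, sumSqLeft;

    RegressionLeaf(int minBefore, int maxPoints, int minPerSide, double[] fCritical) {
        this.minExamplesBeforeTest = minBefore;
        this.maxSplitPoints = maxPoints;
        this.minExamplesPerSide = minPerSide;
        this.fCritical = fCritical;
        this.nLeft = new long[maxPoints];
        this.sumLeft = new double[maxPoints];
        this.sumSqLeft = new double[maxPoints];
    }

    void update(double o, double r) {
        if (candidates.size() < maxSplitPoints) {
            candidates.add(o);                // first n examples become split points
        }
        n++; sum += r; sumSq += r * r;
        for (int i = 0; i < candidates.size(); i++) {
            if (o <= candidates.get(i)) {
                nLeft[i]++; sumLeft[i] += r; sumSqLeft[i] += r * r;
            }
        }
    }

    // Returns the index of the best significant split point, or -1 if none.
    int bestSplit() {
        if (n < minExamplesBeforeTest) return -1;
        int best = -1;
        double bestF = 0.0;
        for (int i = 0; i < candidates.size(); i++) {
            long nl = nLeft[i], nr = n - nl;
            if (nl < minExamplesPerSide || nr < minExamplesPerSide) continue;
            double ssTotal = sumSq - sum * sum / n;
            double ssLeft  = sumSqLeft[i] - sumLeft[i] * sumLeft[i] / nl;
            double sumR = sum - sumLeft[i], sumSqR = sumSq - sumSqLeft[i];
            double ssRight = sumSqR - sumR * sumR / nr;
            double ssWithin = ssLeft + ssRight;
            long dfWithin = n - 2;                    // df_between = 1 for a binary split
            if (dfWithin < 1 || dfWithin > fCritical.length || ssWithin <= 0.0) continue;
            double f = (ssTotal - ssWithin) / (ssWithin / dfWithin);
            if (f > fCritical[(int) dfWithin - 1] && f > bestF) {
                best = i; bestF = f;
            }
        }
        return best;
    }
}
```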

- Implement simulator for the Tiger Problem
- Combine regression tree with MCTS

[2] Cristina Hohan Yu Chang. Relational Reinforcement Learning with the Algorithm TG. Politecnico di Milano. 2010.

- I've changed my existing Tiger Problem environment to continuous observations
- One-dimensional observation: -1.0 (tiger is left) to 1.0 (tiger is right)
- Observation is the position of the tiger corrupted by some zero-mean, normally distributed noise with a low standard deviation σ
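A minimal sketch of this observation model (class and method names are illustrative):

```java
import java.util.Random;

// Sketch of the continuous-observation Tiger variant described above: the
// tiger sits at -1.0 (left) or +1.0 (right), and the observation is that
// position plus zero-mean Gaussian noise with a small standard deviation.
final class ContinuousTigerObservations {
    private final Random rng = new Random();
    private final double sigma;          // low standard deviation, e.g. 0.1
    private double tigerPosition;        // -1.0 (tiger left) or +1.0 (tiger right)

    ContinuousTigerObservations(double sigma) {
        this.sigma = sigma;
        reset();
    }

    void reset() {
        tigerPosition = rng.nextBoolean() ? -1.0 : 1.0;
    }

    // One-dimensional observation after a "listen" action.
    double listen() {
        return tigerPosition + sigma * rng.nextGaussian();
    }
}
```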

- General structures for nodes and trees (incorporating the shared interfaces)
- Tree for observations [1,2]
- Simulation to generate examples (for tree for observations)
- Combine tree for observations with remainder of algorithm

[2] Geoff Hulten, Pedro Domingos, Laurie Spencer. Mining Massive Data Streams. Journal of Machine Learning Research (2005).

[3] P. Domingos and G. Hulten. Mining high-speed data streams. Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’00), pages 71–80, 2000.

- Discretizing the observations should give an approximation of the value function
- Keep advanced questions for later meetings
- Algorithm for horizon 2
- MCTS + regression tree
- Maintain exact belief (in execution step)
- Environment as simulator (2nd instance)
- Always splitting (MCTS, HOO) vs not always splitting (regression tree)
- Transposition table -> transposition tree

- Extend implementation of Tiger problem with continuous observations
- Write implementation plan for algorithm
- Implement the algorithm

- Get to point earlier
- The existing technique already works on continuous states
- Belief space is already a continuous space

- RL-Glue (framework)
- Difference between rollouts (planning) and runs (execution)
- Adaptive range exploration factor C (UCT)
- Evaluation methods
- #Samples/max time vs performance
- Performance over time
- Performance over horizon
- Difference between approximate and exact results
- Error
- Continuous (multi-dimensional) action tree interface
- Aim: Common interface for all our different trees
- Methods: update, getGreedyAction, getUCTAction (see the sketch after this list)
- Meta tree interface
- Applicable for my tree?
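A sketch of what this common interface could look like; the method names come from the notes above, while the signatures are my guesses for the continuous, multi-dimensional action case:

```java
// Sketch of a common interface for the different trees. Signatures are
// illustrative assumptions, not the agreed-upon interface.
interface ActionTree {
    void update(double[] action, double reward);   // incorporate one sample
    double[] getGreedyAction();                    // exploitation: best known action
    double[] getUCTAction(double explorationC);    // exploration via UCT
}
```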

- Model-based vs model-free learning
- Cloning of state information from the environment for the rollouts might be expensive
- Is there an analytical way to calculate C in the UCT-formula?
- Expected mean value convergence vs behavioral convergence (variance of greedy actions)
- Candidate split generation:
- First n samples
- Grid (equally spaced sample ranges)
- Splitting: first optimality condition (SDR, F-test), then sufficiency condition (F-test)
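For reference, the exploration factor *C* discussed above appears in the standard UCT selection rule, with $\bar{X}_a$ the mean reward of action *a*, $n_a$ its visit count, and *N* the visit count of the parent node; an adaptive *C* would, e.g., scale with the observed reward range:

```latex
a^{*} \;=\; \operatorname*{arg\,max}_{a} \left( \bar{X}_a + C \sqrt{\frac{\ln N}{n_a}} \right)
```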

- Write about adaptive C in LaTeX
- (Prepare for) meeting with Frans and Michael tomorrow (08-03-2011, 14:00) (*done*)

- Discussed belief state approximation with particles
- Discussed pseudo code in detail
- Worked out tree construction

- Continuous observations
- Store observation and corresponding reward for all roll-outs
- Continuous actions
- How to transform a large number of discrete actions to one (or a few) continuous action(s)?
- Combinations of both

- Particle filter could be replaced by (extended) Kalman filter
- Expected reward computation (*R_Diff*)

- Compare approximate results to exact results (e.g. from J. Hoey [1])

[1] Jesse Hoey and Pascal Poupart. Solving POMDPs with continuous or large discrete observation spaces. In Proceedings of the International Joint Conference on Artificial Intelligence, pages 1332–1338, 2005.

- Since the problems do not contain continuous actions / observations, I think it's redundant to implement a parser for such POMDP problems for RL-Glue

- Combination of:
- MCTS (UCT) for optimal action selection
- Particle filter for belief state approximation
- Same simulations used for both techniques
- Search tree is based on:
- Histories in the nodes
- Actions and observations in the edges (i.e., "action" edges *followed* by "observation" edges)
- Each node is coupled to a set of particles which approximate the belief state
- The algorithm (called POMCP) can make use of domain knowledge
- POMCP performs at a high level in discrete state spaces with up to 10^56 states and beats other online and offline planning algorithms
- __Possibility 1:__ Extend to continuous observations
- __Possibility 2:__ Build the tree on action-observation pairs
- __Possibility 3:__ Use a Bayes or a Kalman filter for the belief state approximation (although I think a particle filter is already the best choice)
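Relatedly, a minimal sketch of a particle-based belief update for continuous observations, where POMCP's rejection of mismatching particles is replaced by weighting with the observation density; the `GenerativeModel` interface and all names are my assumptions:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Assumed black-box simulator interface for sampling transitions and
// evaluating the observation density p(o | s', a).
interface GenerativeModel {
    double[] step(double[] state, int action);               // sample successor state
    double obsDensity(double[] next, int action, double o);  // p(o | s', a)
}

final class ParticleBelief {
    private final Random rng = new Random();

    // Propagate each particle through the model, weight it by the density of
    // the real observation o, and resample proportionally to the weights.
    List<double[]> update(List<double[]> particles, int action, double o,
                          GenerativeModel model) {
        double[] weights = new double[particles.size()];
        List<double[]> successors = new ArrayList<>();
        double total = 0.0;
        for (int i = 0; i < particles.size(); i++) {
            double[] next = model.step(particles.get(i), action);
            successors.add(next);
            weights[i] = model.obsDensity(next, action, o);
            total += weights[i];
        }
        List<double[]> resampled = new ArrayList<>(particles.size());
        for (int i = 0; i < particles.size(); i++) {
            double u = rng.nextDouble() * total;   // multinomial resampling
            int j = 0;
            while (j < weights.length - 1 && u > weights[j]) {
                u -= weights[j];
                j++;
            }
            resampled.add(successors.get(j));
        }
        return resampled;
    }
}
```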

- Preparation for meeting on Wednesday (*March 7, 2012, 11:00*)
