**Progress**

- Implemented exact belief state update
- Implemented .properties-file loader to load parameter settings for algorithms
- Implemented *algorithm for POMDPs with continuous observations* (see prev. post)
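For reference, the exact belief state update is the standard Bayesian filter for POMDPs (*T* is the transition model, *Ω* the observation model; with continuous observations, *Ω(o | s', a)* is a density rather than a probability):

```latex
b'(s') \;=\; \frac{\Omega(o \mid s', a) \sum_{s \in S} T(s' \mid s, a)\, b(s)}
                  {\sum_{s'' \in S} \Omega(o \mid s'', a) \sum_{s \in S} T(s'' \mid s, a)\, b(s)}
```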

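A minimal sketch of such a loader, using only the standard `java.util.Properties` API (the file name and keys below are made up for illustration):

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.util.Properties;

// Minimal sketch of a .properties-file loader for algorithm parameters.
// The file name and keys are illustrative, not the actual ones used here.
public final class AlgorithmConfig {

    public static Properties load(String path) throws IOException {
        Properties props = new Properties();
        try (FileInputStream in = new FileInputStream(path)) {
            props.load(in);   // parses simple "key=value" lines
        }
        return props;
    }

    public static void main(String[] args) throws IOException {
        Properties p = load("uct.properties");                             // hypothetical file
        double c = Double.parseDouble(p.getProperty("uct.explorationFactor", "1.0"));
        int depth = Integer.parseInt(p.getProperty("uct.maxSearchDepth", "10"));
        System.out.println("C = " + c + ", maxSearchDepth = " + depth);
    }
}
```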

- Steps colored in blue
- Observations need to be obtained from the environment simulator and stored while traversing the tree because they are needed to update the regression trees in the backpropagation step
- Using the observation received in the final state *t = max_roll_out_depth* of the roll-out is __not__ correct, because the observation at time *t* is influenced by the action and the state at time *t*, and not by the action and the state at time *t = max_roll_out_depth*

1. Start at root.
2. If there are children below this node and the max. search depth is not reached, select one according to some selection strategy, e.g., UCT. Otherwise, continue with Step (6).
3. Apply the action corresponding to the selected child to receive an observation *o^t* from the environment simulator and store *o^t* in a list *O = (o^1, o^2, ..., o^t)*.
4. Use *o^t* to traverse the regression tree until you reach a leaf.
5. Repeat Steps (2)-(4).
6. Return that leaf *l*.

- Choose an action *a** at leaf *l*.
- Connect *l* to a new regression tree with only one node (the root).

- Apply the selected action *a** to receive an observation *o** from the environment simulator and store *o** in *O = (o^1, o^2, ..., o*)*.
- Play randomly or according to some other roll-out strategy until the max. roll-out depth or a terminal state is reached.
- Receive a reward *r*.

- Use *r* to update all encountered nodes up to the root of the tree.
- Use *(o*, r)* to update the statistics of the leaves of the regression tree corresponding to action *a**.
- Use *(o^t, r)* to update the statistics of the leaves of the regression tree encountered at time step *t* on the way back up to the root (for *t = 1, 2, ...*).
- Split leaves of a regression tree if necessary.
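To make the loop above concrete, here is a compact sketch of one simulation in Java. All interfaces and names (`Simulator`, `SearchNode`, `RegressionTree`, ...) are illustrative placeholders for the structures described in this post, not my actual classes:

```java
import java.util.List;

// Illustrative interfaces for the structures described above.
interface Simulator {
    double applyAction(int action);        // execute action, return observation o^t
    double rollout(int maxDepth);          // roll-out strategy, return reward r
}

interface RegressionTree {
    SearchNode traverseToLeaf(double o);   // Step (4): follow o^t to a leaf
    void update(double o, double r);       // update leaf statistics, split if needed
}

interface SearchNode {
    boolean hasChildren();
    SearchNode selectChildUct(double c);   // Step (2): e.g. UCT
    int action();                          // action on the edge into this child
    RegressionTree observationTree();      // regression tree below this action
    int chooseUntriedAction();             // expansion: choose a*
    SearchNode expand(int action);         // connect a one-node regression tree
    void updateValue(double r);            // backpropagation of r
}

final class MctsSketch {
    private final double c;                // UCT exploration factor
    private final int maxSearchDepth;
    private final int maxRollOutDepth;

    MctsSketch(double c, int maxSearchDepth, int maxRollOutDepth) {
        this.c = c;
        this.maxSearchDepth = maxSearchDepth;
        this.maxRollOutDepth = maxRollOutDepth;
    }

    // One simulation: selection, expansion, roll-out, backpropagation.
    double simulate(SearchNode node, Simulator sim, List<Double> obs, int depth) {
        if (node.hasChildren() && depth < maxSearchDepth) {
            SearchNode child = node.selectChildUct(c);                    // Step (2)
            double o = sim.applyAction(child.action());                   // Step (3)
            obs.add(o);                                                   // store o^t in O
            SearchNode leaf = child.observationTree().traverseToLeaf(o);  // Step (4)
            double r = simulate(leaf, sim, obs, depth + 1);               // Step (5)
            child.observationTree().update(o, r);                         // backpropagation
            node.updateValue(r);
            return r;
        }
        int aStar = node.chooseUntriedAction();                           // expansion: a*
        double oStar = sim.applyAction(aStar);
        obs.add(oStar);
        SearchNode fresh = node.expand(aStar);                            // one-node tree
        double r = sim.rollout(maxRollOutDepth);                          // roll-out, reward r
        fresh.observationTree().update(oStar, r);                         // (o*, r) statistics
        node.updateValue(r);
        return r;
    }
}
```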

(Remark: I've implemented some general structures for nodes and trees; these structures are based on …)

- I've implemented the regression tree (a minimal sketch follows below)
- It's based on the *FIMT* (see references from previous post) and *TG* [1,2] algorithms
- Input examples are instances of type *Observation* from RL-Glue
- Handles only one-dimensional observations at the moment
- Uses the F-Test for finding splits
- Critical values of the F-distribution for 1 to 1000 degrees of freedom are stored in an array
- There are three parameters that influence the splitting:
- The minimum number of examples which need to be seen before the algorithm tests for possible splits
- The maximum number of split points that can be stored in a node
- The minimum number of examples which need to be seen for each "side" of a split point before the algorithm considers this split point
- As split points, the first *n* examples are taken (where *n* equals the second parameter described above)
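A minimal sketch of how such a leaf could maintain incremental statistics for the candidate split points and test them. All names are placeholders, and the concrete F-statistic (a one-way ANOVA comparison of between- vs. within-split variance with degrees of freedom 1 and n − 2) is my assumption about the test's form:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of a leaf in the incremental regression tree described above
// (one-dimensional observations, F-test splitting in the spirit of FIMT/TG).
final class RegressionLeaf {
    private final int minExamplesBeforeTest;  // parameter 1: examples before testing
    private final int maxSplitPoints;         // parameter 2: candidate split points
    private final int minExamplesPerSide;     // parameter 3: examples per side
    private final double[] fCritical;         // assumed table of critical values,
                                              // indexed by denominator d.o.f. 1..1000

    private final List<Double> candidates = new ArrayList<>();
    private long n;
    private double sum, sumSq;                // running reward statistics
    private final long[] nLeft;               // per-candidate "left side" statistics
    private final double[] sumLeft, sumSqLeft;

    RegressionLeaf(int minBefore, int maxPoints, int minPerSide, double[] fCritical) {
        this.minExamplesBeforeTest = minBefore;
        this.maxSplitPoints = maxPoints;
        this.minExamplesPerSide = minPerSide;
        this.fCritical = fCritical;
        this.nLeft = new long[maxPoints];
        this.sumLeft = new double[maxPoints];
        this.sumSqLeft = new double[maxPoints];
    }

    void update(double o, double r) {
        if (candidates.size() < maxSplitPoints) {
            candidates.add(o);                // first n examples become split points
        }
        n++; sum += r; sumSq += r * r;
        for (int i = 0; i < candidates.size(); i++) {
            if (o <= candidates.get(i)) {
                nLeft[i]++; sumLeft[i] += r; sumSqLeft[i] += r * r;
            }
        }
    }

    // Returns the index of the best significant split point, or -1 if none.
    int bestSplit() {
        if (n < minExamplesBeforeTest) return -1;
        int best = -1;
        double bestF = 0.0;
        for (int i = 0; i < candidates.size(); i++) {
            long nl = nLeft[i], nr = n - nl;
            if (nl < minExamplesPerSide || nr < minExamplesPerSide) continue;
            double ssTotal = sumSq - sum * sum / n;
            double ssLeft  = sumSqLeft[i] - sumLeft[i] * sumLeft[i] / nl;
            double sumR = sum - sumLeft[i], sumSqR = sumSq - sumSqLeft[i];
            double ssRight = sumSqR - sumR * sumR / nr;
            double ssWithin = ssLeft + ssRight;
            long dfWithin = n - 2;                    // df_between = 1 for a binary split
            if (dfWithin < 1 || dfWithin > fCritical.length || ssWithin <= 0.0) continue;
            double f = (ssTotal - ssWithin) / (ssWithin / dfWithin);
            if (f > fCritical[(int) dfWithin - 1] && f > bestF) {
                best = i; bestF = f;
            }
        }
        return best;
    }
}
```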

- Implement simulator for the Tiger Problem
- Combine regression tree with MCTS

[2] Cristina Hohan Yu Chang. Relational Reinforcement Learning with the Algorithm TG. Politecnico di Milano. 2010.

- I've changed my existing Tiger Problem environment to continuous observations
- One-dimensional observation: -1.0 (tiger is left) to 1.0 (tiger is right)
- Observation is the position of the tiger corrupted by some zero-mean, normally distributed noise with a low standard deviation σ
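A minimal sketch of this observation model (class and method names are illustrative):

```java
import java.util.Random;

// Sketch of the continuous-observation Tiger variant described above: the
// tiger sits at -1.0 (left) or +1.0 (right), and the observation is that
// position plus zero-mean Gaussian noise with a small standard deviation.
final class ContinuousTigerObservations {
    private final Random rng = new Random();
    private final double sigma;          // low standard deviation, e.g. 0.1
    private double tigerPosition;        // -1.0 (tiger left) or +1.0 (tiger right)

    ContinuousTigerObservations(double sigma) {
        this.sigma = sigma;
        reset();
    }

    void reset() {
        tigerPosition = rng.nextBoolean() ? -1.0 : 1.0;
    }

    // One-dimensional observation after a "listen" action.
    double listen() {
        return tigerPosition + sigma * rng.nextGaussian();
    }
}
```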

- General structures for nodes and trees (incorporating the shared interfaces)
- Tree for observations [1,2]
- Simulation to generate examples (for tree for observations)
- Combine tree for observations with remainder of algorithm

[2] Geoff Hulten, Pedro Domingos, Laurie Spencer. Mining Massive Data Streams. Journal of Machine Learning Research (2005).

[3] P. Domingos and G. Hulten. Mining high-speed data streams. Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’00), pages 71–80, 2000.

- Discretizing the observations should give an approximation of the value function
- Keep advanced questions for later meetings
- Algorithm for horizon 2
- MCTS + regression tree
- Maintain exact belief (in execution step)
- Environment as simulator (2nd instance)
- Always splitting (MCTS, HOO) vs not always splitting (regression tree)
- Transposition table -> transposition tree

- Extend implementation of Tiger problem with continuous observations
- Write implementation plan for algorithm
- Implement the algorithm

- Get to point earlier
- The existing technique already works on continuous states
- Belief space is already a continuous space

- RL-Glue (framework)
- Difference between rollouts (planning) and runs (execution)
- Adaptive range exploration factor C (UCT)
- Evaluation methods
- #Samples/max time vs performance
- Performance over time
- Performance over horizon
- Difference between approximate and exact results
- Error
- Continuous (multi-dimensional) action tree interface
- Aim: Common interface for all our different trees
- Methods: update, getGreedyAction, getUCTAction (see the sketch after this list)
- Meta tree interface
- Applicable for my tree?
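A sketch of what this common interface could look like; the method names come from the notes above, while the signatures are my guesses for the continuous, multi-dimensional action case:

```java
// Sketch of a common interface for the different trees. Signatures are
// illustrative assumptions, not the agreed-upon interface.
interface ActionTree {
    void update(double[] action, double reward);   // incorporate one sample
    double[] getGreedyAction();                    // exploitation: best known action
    double[] getUCTAction(double explorationC);    // exploration via UCT
}
```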

- Model-based vs model-free learning
- Cloning of state information from the environment for the rollouts might be expensive
- Is there an analytical way to calculate C in the UCT-formula?
- Expected mean value convergence vs behavioral convergence (variance of greedy actions)
- Candidate split generation:
- First n samples
- Grid (equally spaced sample ranges)
- Splitting: first optimality condition (SDR, F-test), then sufficiency condition (F-test)
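For reference, the exploration factor *C* discussed above appears in the standard UCT selection rule, with $\bar{X}_a$ the mean reward of action *a*, $n_a$ its visit count, and *N* the visit count of the parent node; an adaptive *C* would, e.g., scale with the observed reward range:

```latex
a^{*} \;=\; \operatorname*{arg\,max}_{a} \left( \bar{X}_a + C \sqrt{\frac{\ln N}{n_a}} \right)
```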

- Write about adaptive C in LaTeX
- (Prepare for) meeting with Frans and Michael tomorrow (08-03-2011, 14:00) (*done*)

- Discussed belief state approximation with particles
- Discussed pseudo code in detail
- Worked out tree construction

- Continuous observations
- Store observation and corresponding reward for all roll-outs
- Continuous actions
- How to transform a large number of discrete actions to one (or a few) continuous action(s)?
- Combinations of both

- Particle filter could be replaced by (extended) Kalman filter
- Expected reward computation (*R_Diff*)

- Compare approximate results to exact results (e.g. from J. Hoey [1])

[1] Jesse Hoey and Pascal Poupart. Solving POMDPs with continuous or large discrete observation spaces. In Proceedings of the International Joint Conference on Artificial Intelligence, pages 1332–1338, 2005.

- Since the problems do not contain continuous actions / observations, I think it's redundant to implement a parser for such POMDP problems for RL-Glue

- Combination of:
- MCTS (UCT) for optimal action selection
- Particle filter for belief state approximation
- Same simulations used for both techniques
- Search tree is based on:
- Histories in the nodes
- Actions and observations in the edges (i.e., "action" edges *followed* by "observation" edges)
- Each node is coupled to a set of particles which approximate the belief state
- The algorithm (called POMCP) can make use of domain knowledge
- POMCP performs at a high level in discrete state spaces with up to 10^56 states and beats other online and offline planning algorithms
- __Possibility 1:__ Extend to continuous observations
- __Possibility 2:__ Build the tree on action-observation pairs
- __Possibility 3:__ Use a Bayes or a Kalman filter for the belief state approximation (although I think a particle filter is already the best choice)
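Relatedly, a minimal sketch of a particle-based belief update for continuous observations, where POMCP's rejection of mismatching particles is replaced by weighting with the observation density; the `GenerativeModel` interface and all names are my assumptions:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Assumed black-box simulator interface for sampling transitions and
// evaluating the observation density p(o | s', a).
interface GenerativeModel {
    double[] step(double[] state, int action);               // sample successor state
    double obsDensity(double[] next, int action, double o);  // p(o | s', a)
}

final class ParticleBelief {
    private final Random rng = new Random();

    // Propagate each particle through the model, weight it by the density of
    // the real observation o, and resample proportionally to the weights.
    List<double[]> update(List<double[]> particles, int action, double o,
                          GenerativeModel model) {
        double[] weights = new double[particles.size()];
        List<double[]> successors = new ArrayList<>();
        double total = 0.0;
        for (int i = 0; i < particles.size(); i++) {
            double[] next = model.step(particles.get(i), action);
            successors.add(next);
            weights[i] = model.obsDensity(next, action, o);
            total += weights[i];
        }
        List<double[]> resampled = new ArrayList<>(particles.size());
        for (int i = 0; i < particles.size(); i++) {
            double u = rng.nextDouble() * total;   // multinomial resampling
            int j = 0;
            while (j < weights.length - 1 && u > weights[j]) {
                u -= weights[j];
                j++;
            }
            resampled.add(successors.get(j));
        }
        return resampled;
    }
}
```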

- Preparation for meeting on Wednesday (*March 7, 2012, 11:00*)
