- Implemented exact belief state update
- Implemented .properties-file loader to load parameter settings for algorithms
- Implemented algorithm for POMDPs with continuous observations (see prev. post)
Tuesday, March 27, 2012
Feedback 27-03-2012
Progress
Thursday, March 22, 2012
Feedback 22-03-2012
Algorithm for POMDPs with continuous observations
- Steps colored in blue are different from standard TLS
- Observations need to be obtained from the environment simulator and stored while traversing the tree because they are needed to update the regression trees in the backpropagation step
- Using the observation receveived in the final state t=max_roll_out_depth of the roll-out is not correct because the action and the state at time t influence the observation and not the action and the state at time t=max_roll_out_depth
- Start at root.
- If there are children below this node and the max. search depth is not reached, select one according to some selection strategy, e.g., UCT. Otherwise, continue with Step (6).
- Apply the action corresponding to the selected child to receive an observation ot from the environment simulator and store ot in a list O=(o1, o2, ..., ot).
- Use ot to traverse the regression tree until your reach a leaf.
- Repeat Steps (2) - (4).
- Return that leaf l.
- Choose an action a* at leaf l.
- Connect l to a new regression tree with only one node (the root).
- Apply the selected action a* to receive an observation o* from the environment simulator and store o* in O=(o1, o2, ..., o*).
- Play randomly or according to some other roll-out strategy until the max. roll-out depth or a terminal state is reached.
- Receive a reward r.
- Use r to to update all encountered nodes up to the root of the tree.
- Use (o*, r) to update the statistics of the leafs of the regression tree corresponding to action a*.
- Use (ot, r) to update the statistics of the leafs of the regression tree encountered at time step t back up on the way to the root (for t = 1, 2, ...).
- Split leafs of a regression tree if necessary.
Example of first five steps
(remark: a* is always equal to the action at the node "above" it)
Thursday, March 15, 2012
Feedback 15-03-12
General Structures for Nodes and Trees
[2] Cristina Hohan Yu Chang. Relational Reinforcement Learning with the Algorithm TG. Politecnico di Milano. 2010.
- I've implemented some general structures for nodes and trees
- These structures are based on the shared interfaces
- I've implemented (and tested) an incremental regression tree learner
- It's based on the FIMT (see references from previous post) and TG [1,2] algorithms
- Input examples are instances of type Observation from RL-Glue
- Handles only one-dimensional observations at the moment
- Uses the F-Test for finding splits
- Critical values for the F-distribution from 1 to 1000 are stored in an array
- There are three parameters to influence the splitting:
- The minimum number of examples which need to be seen before the algorithm tests for possible splits
- The maximum number of split points that can be stored in a node
- The minimum number of examples which need to be seen for each "side" of a split point before the algorithm considers this split point
- For split points the first n examples are taken (where n is equal to the 2nd parameter described above)
- Implement simulator for the Tiger Problem
- Combine regression tree with MCTS
[2] Cristina Hohan Yu Chang. Relational Reinforcement Learning with the Algorithm TG. Politecnico di Milano. 2010.
Monday, March 12, 2012
Feedback 12-03-2012
Tiger Problem
[2] Geoff Hulten, Pedro Domingos, Laurie Spencer. Mining Massive Data Streams. Journal of Machine Learning Research (2005).
[3] P. Domingos and G. Hulten. Mining high-speed data streams. Proceedings of the sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD ’00 (2000) 71–80.3.
- I've changed my existing Tiger Problem environment to continuous observations
- One-dimensional observation: -1.0 (tiger is left) to 1.0 (tiger is right)
- Observation is the position of the tiger corrupted by some zero-mean, normally distributed noise with a low standard deviation σ
- General structures for nodes and trees (incorporating the shared interfaces)
- Tree for observations [1,2]
- Simulation to generate examples (for tree for observations)
- Combine tree for observations with remainder of algorithm
[2] Geoff Hulten, Pedro Domingos, Laurie Spencer. Mining Massive Data Streams. Journal of Machine Learning Research (2005).
[3] P. Domingos and G. Hulten. Mining high-speed data streams. Proceedings of the sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD ’00 (2000) 71–80.3.
Thursday, March 8, 2012
Minutes 08-03-2012
Discussion
- Discretizing the observations should give an approximation of the value function
- Keep advanced questions for later meetings
- Algorithm for horizon 2
- MCTS + regression tree
- Maintain exact belief (in execution step)
- Environment as simulator (2nd instance)
- Always splitting (MCTS, HOO) vs not always splitting (regression tree)
- Transposition table -> transposition tree
- Extend implementation of Tiger problem with continuous observations
- Write implementation plan for algorithm
- Implement the algorithm
Minutes 07-03-2012
Assignment Pitch
- Get to point earlier
- The existing technique already works on continuous states
- Belief space is already a continuous space
- RL-Glue (framework)
- Difference between rollouts (planning) and runs (execution)
- Adaptive range exploration factor C (UCT)
- Evaluation methods
- #Samples/max time vs performance
- Performance over time
- Performance over horizon
- Difference between approximate and exact results
- Error
- Continuous (multi-dimensional) action tree interface
- Aim: Common interface for all our different trees
- Methods: update, getGreedyAction, getUCTAction
- Meta tree interface
- Applicable for my tree?
- Model-based vs model-free learning
- Cloning of state information from the environment for the rollouts might be expensive
- Is there an analytical way to calculate C in the UCT-formula?
- Expected mean value convergence vs behavioral convergence (variance of greedy actions)
- Candidate split generation:
- First n samples
- Grid (equally spaced sample ranges)
- Splitting: first optimality condition (SDR, f-test), then sufficiency condition (f-test)
- Write about adaptive C in LaTeX
- (Prepare for) meeting with Frans and Michael tomorrow (08-03-2011, 14:00) (done)
Tuesday, March 6, 2012
Minutes 06-03-2012
D. Silver's paper
- Discussed belief state approximation with particles
- Discussed pseudo code in detail
- Worked out tree construction
- Continuous observations
- Store observation and corresponding reward for all roll-outs
- Continuous actions
- How to transform a large number of discrete actions to one (or a few) continuous action(s)?
- Combinations of both
- Particle filter could be replaced by (extended) Kalman filter
- Expected reward computation (RDiff)
- Compare approximate results to exact results (e.g. from J. Hoey [1])
[1] Jesse Hoey and Pascal Poupart. Solving POMDPs with continuous or large discrete observation spaces. In Proceedings on the International Joint Conference on Artificial Intelligence. pages 1332 - 1338. 2005.
Monday, March 5, 2012
Feedback 05-03-2012
Parser
- Since the problems do not contain continuous actions / observations, I think it's redundant to implement a parser for such POMDP problems for RL-Glue
- Combination of:
- MCTS (UCT) for optimal action selection
- Particle filter for belief state approximation
- Same simulations used for both techniques
- Search tree is based on:
- Histories in the nodes
- Actions and observations in the edges (i.e. "action" edges followed by "observation" edges)
- Each node is coupled to a set of particles which approximate the belief state
- The algorithm (called POMCP) can make use of domain knowledge
- POMCP performs on a high level in discrete state spaces with up to 10^56 states and beats other online and offline planning algorithms
- Possibility 1: Extend to continuous observations
- Possibility 2: Build the tree on action-observation pairs
- Possiblity 3: Use a Bayes or a Kalman filter for the belief state approximation (although I think a particle filter is already the best choice)
- Preparation for meeting on Wednesday (March 7, 2012, 11:00)
Subscribe to:
Posts (Atom)