**Progress**

- Implemented exact belief state update
- Implemented .properties-file loader to load parameter settings for algorithms
- Implemented *algorithm for POMDPs with continuous observations* (see prev. post)
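For reference, the exact belief state update is the standard Bayesian filter for POMDPs (*T* is the transition model, *Ω* the observation model; with continuous observations, *Ω(o | s', a)* is a density rather than a probability):

```latex
b'(s') \;=\; \frac{\Omega(o \mid s', a) \sum_{s \in S} T(s' \mid s, a)\, b(s)}
                  {\sum_{s'' \in S} \Omega(o \mid s'', a) \sum_{s \in S} T(s'' \mid s, a)\, b(s)}
```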

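A minimal sketch of such a loader, using only the standard `java.util.Properties` API (the file name and keys below are made up for illustration):

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.util.Properties;

// Minimal sketch of a .properties-file loader for algorithm parameters.
// The file name and keys are illustrative, not the actual ones used here.
public final class AlgorithmConfig {

    public static Properties load(String path) throws IOException {
        Properties props = new Properties();
        try (FileInputStream in = new FileInputStream(path)) {
            props.load(in);   // parses simple "key=value" lines
        }
        return props;
    }

    public static void main(String[] args) throws IOException {
        Properties p = load("uct.properties");                             // hypothetical file
        double c = Double.parseDouble(p.getProperty("uct.explorationFactor", "1.0"));
        int depth = Integer.parseInt(p.getProperty("uct.maxSearchDepth", "10"));
        System.out.println("C = " + c + ", maxSearchDepth = " + depth);
    }
}
```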

- Steps colored in blue
- Observations need to be obtained from the environment simulator and stored while traversing the tree because they are needed to update the regression trees in the backpropagation step
- Using the observation received in the final state *t = max_roll_out_depth* of the roll-out is __not__ correct, because the observation at time *t* is influenced by the action and the state at time *t*, and not by the action and the state at time *t = max_roll_out_depth*

1. Start at root.
2. If there are children below this node and the max. search depth is not reached, select one according to some selection strategy, e.g., UCT. Otherwise, continue with Step (6).
3. Apply the action corresponding to the selected child to receive an observation *o^t* from the environment simulator and store *o^t* in a list *O = (o^1, o^2, ..., o^t)*.
4. Use *o^t* to traverse the regression tree until you reach a leaf.
5. Repeat Steps (2)-(4).
6. Return that leaf *l*.

- Choose an action *a** at leaf *l*.
- Connect *l* to a new regression tree with only one node (the root).

- Apply the selected action *a** to receive an observation *o** from the environment simulator and store *o** in *O = (o^1, o^2, ..., o*)*.
- Play randomly or according to some other roll-out strategy until the max. roll-out depth or a terminal state is reached.
- Receive a reward *r*.

- Use *r* to update all encountered nodes up to the root of the tree.
- Use *(o*, r)* to update the statistics of the leaves of the regression tree corresponding to action *a**.
- Use *(o^t, r)* to update the statistics of the leaves of the regression tree encountered at time step *t* on the way back up to the root (for *t = 1, 2, ...*).
- Split leaves of a regression tree if necessary.
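To make the loop above concrete, here is a compact sketch of one simulation in Java. All interfaces and names (`Simulator`, `SearchNode`, `RegressionTree`, ...) are illustrative placeholders for the structures described in this post, not my actual classes:

```java
import java.util.List;

// Illustrative interfaces for the structures described above.
interface Simulator {
    double applyAction(int action);        // execute action, return observation o^t
    double rollout(int maxDepth);          // roll-out strategy, return reward r
}

interface RegressionTree {
    SearchNode traverseToLeaf(double o);   // Step (4): follow o^t to a leaf
    void update(double o, double r);       // update leaf statistics, split if needed
}

interface SearchNode {
    boolean hasChildren();
    SearchNode selectChildUct(double c);   // Step (2): e.g. UCT
    int action();                          // action on the edge into this child
    RegressionTree observationTree();      // regression tree below this action
    int chooseUntriedAction();             // expansion: choose a*
    SearchNode expand(int action);         // connect a one-node regression tree
    void updateValue(double r);            // backpropagation of r
}

final class MctsSketch {
    private final double c;                // UCT exploration factor
    private final int maxSearchDepth;
    private final int maxRollOutDepth;

    MctsSketch(double c, int maxSearchDepth, int maxRollOutDepth) {
        this.c = c;
        this.maxSearchDepth = maxSearchDepth;
        this.maxRollOutDepth = maxRollOutDepth;
    }

    // One simulation: selection, expansion, roll-out, backpropagation.
    double simulate(SearchNode node, Simulator sim, List<Double> obs, int depth) {
        if (node.hasChildren() && depth < maxSearchDepth) {
            SearchNode child = node.selectChildUct(c);                    // Step (2)
            double o = sim.applyAction(child.action());                   // Step (3)
            obs.add(o);                                                   // store o^t in O
            SearchNode leaf = child.observationTree().traverseToLeaf(o);  // Step (4)
            double r = simulate(leaf, sim, obs, depth + 1);               // Step (5)
            child.observationTree().update(o, r);                         // backpropagation
            node.updateValue(r);
            return r;
        }
        int aStar = node.chooseUntriedAction();                           // expansion: a*
        double oStar = sim.applyAction(aStar);
        obs.add(oStar);
        SearchNode fresh = node.expand(aStar);                            // one-node tree
        double r = sim.rollout(maxRollOutDepth);                          // roll-out, reward r
        fresh.observationTree().update(oStar, r);                         // (o*, r) statistics
        node.updateValue(r);
        return r;
    }
}
```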

(Remark: I've implemented some general structures for nodes and trees; these structures are based on …)

- I've implemented the regression tree (a minimal sketch follows below)
- It's based on the *FIMT* (see references from previous post) and *TG* [1,2] algorithms
- Input examples are instances of type *Observation* from RL-Glue
- Handles only one-dimensional observations at the moment
- Uses the F-Test for finding splits
- Critical values of the F-distribution for 1 to 1000 degrees of freedom are stored in an array
- There are three parameters that influence the splitting:
- The minimum number of examples which need to be seen before the algorithm tests for possible splits
- The maximum number of split points that can be stored in a node
- The minimum number of examples which need to be seen for each "side" of a split point before the algorithm considers this split point
- As split points, the first *n* examples are taken (where *n* equals the second parameter described above)
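A minimal sketch of how such a leaf could maintain incremental statistics for the candidate split points and test them. All names are placeholders, and the concrete F-statistic (a one-way ANOVA comparison of between- vs. within-split variance with degrees of freedom 1 and n − 2) is my assumption about the test's form:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of a leaf in the incremental regression tree described above
// (one-dimensional observations, F-test splitting in the spirit of FIMT/TG).
final class RegressionLeaf {
    private final int minExamplesBeforeTest;  // parameter 1: examples before testing
    private final int maxSplitPoints;         // parameter 2: candidate split points
    private final int minExamplesPerSide;     // parameter 3: examples per side
    private final double[] fCritical;         // assumed table of critical values,
                                              // indexed by denominator d.o.f. 1..1000

    private final List<Double> candidates = new ArrayList<>();
    private long n;
    private double sum, sumSq;                // running reward statistics
    private final long[] nLeft;               // per-candidate "left side" statistics
    private final double[] sumLeft, sumSqLeft;

    RegressionLeaf(int minBefore, int maxPoints, int minPerSide, double[] fCritical) {
        this.minExamplesBeforeTest = minBefore;
        this.maxSplitPoints = maxPoints;
        this.minExamplesPerSide = minPerSide;
        this.fCritical = fCritical;
        this.nLeft = new long[maxPoints];
        this.sumLeft = new double[maxPoints];
        this.sumSqLeft = new double[maxPoints];
    }

    void update(double o, double r) {
        if (candidates.size() < maxSplitPoints) {
            candidates.add(o);                // first n examples become split points
        }
        n++; sum += r; sumSq += r * r;
        for (int i = 0; i < candidates.size(); i++) {
            if (o <= candidates.get(i)) {
                nLeft[i]++; sumLeft[i] += r; sumSqLeft[i] += r * r;
            }
        }
    }

    // Returns the index of the best significant split point, or -1 if none.
    int bestSplit() {
        if (n < minExamplesBeforeTest) return -1;
        int best = -1;
        double bestF = 0.0;
        for (int i = 0; i < candidates.size(); i++) {
            long nl = nLeft[i], nr = n - nl;
            if (nl < minExamplesPerSide || nr < minExamplesPerSide) continue;
            double ssTotal = sumSq - sum * sum / n;
            double ssLeft  = sumSqLeft[i] - sumLeft[i] * sumLeft[i] / nl;
            double sumR = sum - sumLeft[i], sumSqR = sumSq - sumSqLeft[i];
            double ssRight = sumSqR - sumR * sumR / nr;
            double ssWithin = ssLeft + ssRight;
            long dfWithin = n - 2;                    // df_between = 1 for a binary split
            if (dfWithin < 1 || dfWithin > fCritical.length || ssWithin <= 0.0) continue;
            double f = (ssTotal - ssWithin) / (ssWithin / dfWithin);
            if (f > fCritical[(int) dfWithin - 1] && f > bestF) {
                best = i; bestF = f;
            }
        }
        return best;
    }
}
```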

- Implement simulator for the Tiger Problem
- Combine regression tree with MCTS

[2] Cristina Hohan Yu Chang. Relational Reinforcement Learning with the Algorithm TG. Politecnico di Milano. 2010.

- I've changed my existing Tiger Problem environment to continuous observations
- One-dimensional observation: -1.0 (tiger is left) to 1.0 (tiger is right)
- Observation is the position of the tiger corrupted by some zero-mean, normally distributed noise with a low standard deviation σ
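A minimal sketch of this observation model (class and method names are illustrative):

```java
import java.util.Random;

// Sketch of the continuous-observation Tiger variant described above: the
// tiger sits at -1.0 (left) or +1.0 (right), and the observation is that
// position plus zero-mean Gaussian noise with a small standard deviation.
final class ContinuousTigerObservations {
    private final Random rng = new Random();
    private final double sigma;          // low standard deviation, e.g. 0.1
    private double tigerPosition;        // -1.0 (tiger left) or +1.0 (tiger right)

    ContinuousTigerObservations(double sigma) {
        this.sigma = sigma;
        reset();
    }

    void reset() {
        tigerPosition = rng.nextBoolean() ? -1.0 : 1.0;
    }

    // One-dimensional observation after a "listen" action.
    double listen() {
        return tigerPosition + sigma * rng.nextGaussian();
    }
}
```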

- General structures for nodes and trees (incorporating the shared interfaces)
- Tree for observations [1,2]
- Simulation to generate examples (for tree for observations)
- Combine tree for observations with remainder of algorithm

[2] Geoff Hulten, Pedro Domingos, Laurie Spencer. Mining Massive Data Streams. Journal of Machine Learning Research (2005).

[3] P. Domingos and G. Hulten. Mining high-speed data streams. Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’00), pages 71–80, 2000.

- Discretizing the observations should give an approximation of the value function
- Keep advanced questions for later meetings
- Algorithm for horizon 2
- MCTS + regression tree
- Maintain exact belief (in execution step)
- Environment as simulator (2nd instance)
- Always splitting (MCTS, HOO) vs not always splitting (regression tree)
- Transposition table -> transposition tree

- Extend implementation of Tiger problem with continuous observations
- Write implementation plan for algorithm
- Implement the algorithm

- Get to point earlier
- The existing technique already works on continuous states
- Belief space is already a continuous space

- RL-Glue (framework)
- Difference between rollouts (planning) and runs (execution)
- Adaptive range exploration factor C (UCT)
- Evaluation methods
- #Samples/max time vs performance
- Performance over time
- Performance over horizon
- Difference between approximate and exact results
- Error
- Continuous (multi-dimensional) action tree interface
- Aim: Common interface for all our different trees
- Methods: update, getGreedyAction, getUCTAction (see the sketch after this list)
- Meta tree interface
- Applicable for my tree?
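A sketch of what this common interface could look like; the method names come from the notes above, while the signatures are my guesses for the continuous, multi-dimensional action case:

```java
// Sketch of a common interface for the different trees. Signatures are
// illustrative assumptions, not the agreed-upon interface.
interface ActionTree {
    void update(double[] action, double reward);   // incorporate one sample
    double[] getGreedyAction();                    // exploitation: best known action
    double[] getUCTAction(double explorationC);    // exploration via UCT
}
```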

- Model-based vs model-free learning
- Cloning of state information from the environment for the rollouts might be expensive
- Is there an analytical way to calculate C in the UCT-formula?
- Expected mean value convergence vs behavioral convergence (variance of greedy actions)
- Candidate split generation:
- First n samples
- Grid (equally spaced sample ranges)
- Splitting: first optimality condition (SDR, F-test), then sufficiency condition (F-test)
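For reference, the exploration factor *C* discussed above appears in the standard UCT selection rule, with $\bar{X}_a$ the mean reward of action *a*, $n_a$ its visit count, and *N* the visit count of the parent node; an adaptive *C* would, e.g., scale with the observed reward range:

```latex
a^{*} \;=\; \operatorname*{arg\,max}_{a} \left( \bar{X}_a + C \sqrt{\frac{\ln N}{n_a}} \right)
```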

- Write about adaptive C in LaTeX
- (Prepare for) meeting with Frans and Michael tomorrow (08-03-2011, 14:00) (*done*)

- Discussed belief state approximation with particles
- Discussed pseudo code in detail
- Worked out tree construction

- Continuous observations
- Store observation and corresponding reward for all roll-outs
- Continuous actions
- How to transform a large number of discrete actions to one (or a few) continuous action(s)?
- Combinations of both

- Particle filter could be replaced by (extended) Kalman filter
- Expected reward computation (*R_Diff*)

- Compare approximate results to exact results (e.g. from J. Hoey [1])

[1] Jesse Hoey and Pascal Poupart. Solving POMDPs with continuous or large discrete observation spaces. In Proceedings of the International Joint Conference on Artificial Intelligence, pages 1332–1338, 2005.

- Since the problems do not contain continuous actions / observations, I think it's redundant to implement a parser for such POMDP problems for RL-Glue

- Combination of:
- MCTS (UCT) for optimal action selection
- Particle filter for belief state approximation
- Same simulations used for both techniques
- Search tree is based on:
- Histories in the nodes
- Actions and observations in the edges (i.e., "action" edges *followed* by "observation" edges)
- Each node is coupled to a set of particles which approximate the belief state
- The algorithm (called POMCP) can make use of domain knowledge
- POMCP performs at a high level in discrete state spaces with up to 10^56 states and beats other online and offline planning algorithms
- __Possibility 1:__ Extend to continuous observations
- __Possibility 2:__ Build the tree on action-observation pairs
- __Possibility 3:__ Use a Bayes or a Kalman filter for the belief state approximation (although I think a particle filter is already the best choice)
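Relatedly, a minimal sketch of a particle-based belief update for continuous observations, where POMCP's rejection of mismatching particles is replaced by weighting with the observation density; the `GenerativeModel` interface and all names are my assumptions:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Assumed black-box simulator interface for sampling transitions and
// evaluating the observation density p(o | s', a).
interface GenerativeModel {
    double[] step(double[] state, int action);               // sample successor state
    double obsDensity(double[] next, int action, double o);  // p(o | s', a)
}

final class ParticleBelief {
    private final Random rng = new Random();

    // Propagate each particle through the model, weight it by the density of
    // the real observation o, and resample proportionally to the weights.
    List<double[]> update(List<double[]> particles, int action, double o,
                          GenerativeModel model) {
        double[] weights = new double[particles.size()];
        List<double[]> successors = new ArrayList<>();
        double total = 0.0;
        for (int i = 0; i < particles.size(); i++) {
            double[] next = model.step(particles.get(i), action);
            successors.add(next);
            weights[i] = model.obsDensity(next, action, o);
            total += weights[i];
        }
        List<double[]> resampled = new ArrayList<>(particles.size());
        for (int i = 0; i < particles.size(); i++) {
            double u = rng.nextDouble() * total;   // multinomial resampling
            int j = 0;
            while (j < weights.length - 1 && u > weights[j]) {
                u -= weights[j];
                j++;
            }
            resampled.add(successors.get(j));
        }
        return resampled;
    }
}
```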

- Preparation for meeting on Wednesday (*March 7, 2012, 11:00*)
