- Implemented complete recall for COMCTS:
- Stores all examples (observation(t), reward(t)) together with their "future" {<action(t+1), observation(t+1), reward(t+1)>, <action(t+2), observation(t+2), reward(t+2)>, ...}
- Uses this knowledge to reconstruct the tree after a split
- Implemented a 4x3 grid world:
- Actions: up, down, left, right
- Observations: number of adjacent walls, corrupted by normally distributed noise with standard deviation 0.965
- Rewards: +1 for the goal state, -1 for the penalty state, -0.04 per movement step
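
The complete-recall idea from the first item can be sketched roughly as follows. This is only an illustration under assumptions, not the actual COMCTS code: the names `Example`, `RecallStore` and `rebuild_action_stats` are hypothetical, the split is assumed to be a single threshold on a scalar observation, and the rebuilt statistics use the undiscounted sum of future rewards.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# One future step: (action, observation, reward). Hypothetical layout.
Step = Tuple[str, float, float]


@dataclass
class Example:
    observation: float
    reward: float
    future: List[Step] = field(default_factory=list)


class RecallStore:
    """Keeps every example together with everything that followed it."""

    def __init__(self) -> None:
        self.examples: List[Example] = []

    def add_episode(self, first_obs: float, first_reward: float,
                    steps: List[Step]) -> None:
        # Every time step t becomes one example whose future is the
        # remainder of the episode after t.
        sequence = [(None, first_obs, first_reward)] + list(steps)
        for t, (_, obs, reward) in enumerate(sequence):
            self.examples.append(Example(obs, reward, list(sequence[t + 1:])))

    def split(self, threshold: float):
        # Assumed split rule: partition the examples by observation value.
        low = [e for e in self.examples if e.observation < threshold]
        high = [e for e in self.examples if e.observation >= threshold]
        return low, high


def rebuild_action_stats(examples: List[Example]) -> Dict[str, Tuple[int, float]]:
    """Recompute (visit count, mean return) per first future action."""
    returns = defaultdict(list)
    for e in examples:
        if e.future:  # examples at the end of an episode have no future
            first_action = e.future[0][0]
            returns[first_action].append(sum(r for _, _, r in e.future))
    return {a: (len(rs), sum(rs) / len(rs)) for a, rs in returns.items()}
```

After a split, each new child node would call `rebuild_action_stats` on its own half of the stored examples, so its statistics are reconstructed from past experience instead of starting from zero.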
Tiger Problem: Horizon 2
Grid World: max. 22 steps
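
A minimal sketch of the grid world with the 22-step cap is given below. Several details are assumptions not stated above: the layout (obstacle at (1, 1), goal at (3, 2), penalty at (3, 1), start at (0, 0), following the textbook 4x3 world), deterministic movement, grid borders counting as walls, and the terminal reward replacing the step cost on the final move. The class name `GridWorld` and its methods are hypothetical.

```python
import random


class GridWorld:
    WIDTH, HEIGHT = 4, 3
    OBSTACLES = {(1, 1)}              # assumed layout
    GOAL, PENALTY = (3, 2), (3, 1)    # assumed layout
    MOVES = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}
    NOISE_STD = 0.965
    STEP_COST = -0.04
    MAX_STEPS = 22

    def __init__(self, start=(0, 0), seed=None):
        self.rng = random.Random(seed)
        self.start = start
        self.reset()

    def reset(self):
        self.pos = self.start
        self.steps = 0
        return self.observe()

    def blocked(self, cell):
        # Cells outside the grid and obstacle cells both count as walls.
        x, y = cell
        outside = not (0 <= x < self.WIDTH and 0 <= y < self.HEIGHT)
        return outside or cell in self.OBSTACLES

    def adjacent_walls(self, cell):
        x, y = cell
        return sum(self.blocked((x + dx, y + dy)) for dx, dy in self.MOVES.values())

    def observe(self):
        # Noisy wall count: true count plus Gaussian noise.
        return self.adjacent_walls(self.pos) + self.rng.gauss(0.0, self.NOISE_STD)

    def step(self, action):
        dx, dy = self.MOVES[action]
        target = (self.pos[0] + dx, self.pos[1] + dy)
        if not self.blocked(target):  # bumping into a wall leaves the agent in place
            self.pos = target
        self.steps += 1
        if self.pos == self.GOAL:
            reward = 1.0
        elif self.pos == self.PENALTY:
            reward = -1.0
        else:
            reward = self.STEP_COST
        terminal = self.pos in (self.GOAL, self.PENALTY)
        done = terminal or self.steps >= self.MAX_STEPS
        return self.observe(), reward, done
```

With a noise standard deviation close to 1, neighbouring wall counts overlap heavily, so a single observation says little about the true cell.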
Very good, can you also show the regret plots to zoom in on the behavior at the end?