Implemented heuristic and magic agents for the Tiger Problem (see below)
Changed the choice of actions in COMCTS to a more random choice, both in the final action selection for the step in the "real world" and in the selection step of the simulation (sketched below)
This made the algorithm more stable and removed a lot of dips in the plots (see below)
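One plausible reading of this change, as a minimal sketch rather than the project's actual code (`children` and `node.value` are hypothetical names): instead of always taking the first maximising action, ties are broken uniformly at random.

```python
import random

def select_action(children):
    """Pick an action with maximal estimated value, breaking ties
    uniformly at random instead of always returning the first maximum.
    `children` maps actions to nodes carrying a `value` estimate
    (hypothetical names; the project code may differ)."""
    best = max(node.value for node in children.values())
    candidates = [a for a, node in children.items() if node.value == best]
    return random.choice(candidates)
```

A deterministic argmax systematically favours one action among equally valued ones, which can lock the search into a bad branch; randomising the choice averages this out, which fits the more stable plots.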
Random: computed by hand for Horizon 2; estimated with an agent that selects actions randomly for Horizon 10 and 100
MC: Monte Carlo with uniformly sampled actions
COMCTS with deletion: deletes everything below a node which is split in the regression tree
COMCTS with local recall:
stores all examples
when a split occurs at a node w and two new children l and r are created, the value statistics stored at the winning test for the split point at w are handed down to l and r, so that l and r start with a mean and a count
the stored examples are used to create new tests (for split points) at l and r (a sketch follows below this list)
MC and COMCTS both use the ε-accurate horizon time from [Michael J. Kearns, Yishay Mansour, Andrew Y. Ng: A Sparse Sampling Algorithm for Near-Optimal Planning in Large Markov Decision Processes. Machine Learning 49(2-3): 193-208 (2002)] (a sketch of the horizon computation follows below this list)
Heuristic: listens until its belief is sufficiently certain (b(s) < 0.1 or b(s) > 0.9) and then opens one of the doors (see the sketch below)
Magic: can magically observe the real state of the environment, but still waits until its belief is sufficiently certain (b(s) < 0.1 or b(s) > 0.9) before opening one of the doors
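To make the local-recall bookkeeping concrete, here is a minimal sketch under assumed names (`Stats`, `Node` and the threshold tests are illustrative, not the actual project classes), for one-dimensional observations:

```python
class Stats:
    """Incrementally maintained count and mean of returns."""
    def __init__(self, count=0, mean=0.0):
        self.count, self.mean = count, mean

    def add(self, value):
        self.count += 1
        self.mean += (value - self.mean) / self.count

class Node:
    def __init__(self, stats=None):
        self.stats = stats or Stats()
        self.examples = []    # local recall: all (observation, return) pairs seen here
        self.tests = {}       # candidate split point -> (Stats left, Stats right)
        self.children = None  # (split point, left child, right child) after a split

    def add_example(self, obs, ret):
        self.examples.append((obs, ret))
        self.stats.add(ret)
        for point, (left, right) in self.tests.items():
            (left if obs <= point else right).add(ret)

    def split(self, point):
        """Split at the winning test: the statistics it accumulated seed
        the children, so l and r start with a mean and a count, and the
        stored examples are replayed into l and r, where they would be
        used to create the children's new candidate tests (omitted here)."""
        left_stats, right_stats = self.tests[point]
        l, r = Node(left_stats), Node(right_stats)
        for obs, ret in self.examples:
            (l if obs <= point else r).examples.append((obs, ret))
        self.children = (point, l, r)
```

Compared to COMCTS with deletion, the children inherit everything the winning test already knew, so no simulation results are thrown away at a split.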
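The ε-accurate horizon time from the Kearns, Mansour and Ng reference cuts lookahead off at the depth beyond which the discounted tail reward cannot change the value estimate by more than ε. A minimal sketch of that computation (γ is the discount, R_max the maximal absolute reward; the parameter names are mine):

```python
from math import ceil, log

def epsilon_horizon(gamma, r_max, epsilon):
    """Smallest H with gamma**H * r_max / (1 - gamma) <= epsilon, i.e.
    truncating rollouts after H steps perturbs the value estimate by
    at most epsilon (Kearns, Mansour & Ng 2002)."""
    return ceil(log(epsilon * (1 - gamma) / r_max, gamma))

# For example, gamma = 0.95, r_max = 10, epsilon = 0.1 gives H = 149.
```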
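For the two threshold agents, a minimal sketch of the belief update and the decision rule, assuming the standard Tiger parameters (listening reports the correct side with probability 0.85; this value is my assumption, not stated in the post):

```python
def update_belief(b_left, heard_left, accuracy=0.85):
    """Bayes update of the belief that the tiger is behind the left
    door after a listen action (standard Tiger parameters assumed)."""
    likelihood = accuracy if heard_left else 1 - accuracy
    numerator = likelihood * b_left
    return numerator / (numerator + (1 - likelihood) * (1 - b_left))

def heuristic_action(b_left):
    """Listen until the belief is sufficiently certain, then open the
    door that is believed to be safe."""
    if b_left < 0.1:
        return "open-left"    # tiger almost surely behind the right door
    if b_left > 0.9:
        return "open-right"   # tiger almost surely behind the left door
    return "listen"
```

The magic agent would use the same threshold rule, but can condition on the real state rather than the noisy listen observation.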
Dear Andreas,
Good progress, I now have the impression I can judge the performance of your algorithm. Unfortunately, it does not do much better than the heuristic. Maybe you can elicit some advantage in the long run by looking at the regret plot (see the last plot of Colin's blog entry from 27 April).
You may also want to try varying the tiger reset probabilities. Now comes the exciting experimentation phase. Good luck :)
Best regards, Michael