Wednesday, May 16, 2012

Feedback 16-05-12

Progress
  • Implemented a heuristic agent and a magic agent for the Tiger Problem (see below)
  • Changed the action choice in COMCTS to a more randomized one, both for the final action selection (the step executed in the "real world") and for the selection step inside the simulations (see the sketch after this list)
    • This made the algorithm more stable and removed many of the dips in the plots (see below)
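The post does not spell out how the "more randomized" choice is implemented; one plausible reading is uniform tie-breaking among children whose value estimates are (near-)equal, instead of always picking the first maximal child. A minimal Python sketch under that assumption (the children objects with mean and action fields are hypothetical):

    import random

    def select_action(children, tol=1e-9):
        # Value of the best child according to the current estimates.
        best = max(child.mean for child in children)
        # All children whose estimate is within `tol` of the best one.
        candidates = [child for child in children if child.mean >= best - tol]
        # Break ties uniformly at random instead of always taking the first one.
        return random.choice(candidates).action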
Experimental Set-up
  • Mean value is taken over:
    • 10,000 runs for Horizon 2
    • 5,000 runs for Horizon 10
    • 1,000 runs for Horizon 100
  • Error bars indicate standard error
  • Environment is continuing
  • Execution is stopped after Horizon=k actions
  • Discount factor = 0.5, discount horizon = 0.000001 (see the sketch below for the resulting planning depth)
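A side note on the last point: combined with the ε-accurate horizon time of Kearns, Mansour and Ng (cited under Algorithms below), a discount factor of 0.5 and a cut-off of 0.000001 fix the planning depth. A minimal sketch, assuming "discount horizon" simply means the threshold below which γ^H is treated as negligible (the Rmax/(1-γ) scaling from the paper is ignored here):

    import math

    def epsilon_accurate_horizon(gamma, eps):
        # Smallest H such that gamma**H < eps, i.e. rewards beyond
        # depth H contribute (almost) nothing to the discounted return.
        return math.ceil(math.log(eps) / math.log(gamma))

    print(epsilon_accurate_horizon(0.5, 1e-6))  # -> 20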
Algorithms
  • Random: computed by hand for Horizon 2; for Horizons 10 and 100, using an agent that selects actions randomly
  • MC: Monte Carlo with uniformly sampled actions
  • COMCTS with deletion: deletes everything below a node which is split in the regression tree
  • COMCTS with local recall
    • stores all examples
    • when a split occurs at a node w and two new children l and r are created, the values stored at the winning test for the split point at w are handed down to l and r, so that l and r start with an initial mean and count (see the first sketch after this list)
    • the stored examples are used to create new tests (for split points) at l and r
  • MC and COMCTS both use ε-accurate horizon time [Michael J. Kearns, Yishay Mansour, Andrew Y. Ng: A Sparse Sampling Algorithm for Near-Optimal Planning in Large Markov Decision Processes. Machine Learning 49(2-3): 193-208 (2002)]
  • Heuristic: listens until its belief is sufficiently certain (b(s) < 0.1 or b(s) > 0.9) and then opens one of the doors (see the second sketch after this list)
  • Magic: can magically observe the real state of the environment, but still waits until its belief is sufficiently certain (b(s) < 0.1 or b(s) > 0.9) before opening one of the doors
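To make the local-recall bookkeeping concrete, here is a minimal Python sketch of the split step; the class and field names (Node, Test, mean, count, examples) are hypothetical and only illustrate how the winning test's per-side statistics seed the new children:

    class Test:
        # One candidate split point on a single state feature.
        def __init__(self, feature, threshold):
            self.feature, self.threshold = feature, threshold
            self.left_mean = self.right_mean = 0.0
            self.left_count = self.right_count = 0

        def goes_left(self, state):
            return state[self.feature] <= self.threshold

    class Node:
        def __init__(self, mean=0.0, count=0):
            self.mean, self.count = mean, count   # seeded from the winning test
            self.examples = []                    # all (state, return) pairs kept locally
            self.tests = []                       # candidate split points
            self.left = self.right = self.split_test = None

        def add_candidate_tests(self, state):
            # Hypothetical rule: propose one split point per feature of the example.
            for feature, value in enumerate(state):
                self.tests.append(Test(feature, value))

    def split(node, winning_test):
        # The children l and r start with the mean and count that the winning
        # test accumulated on its two sides, and the stored examples are
        # replayed to create new candidate tests at l and r.
        node.split_test = winning_test
        node.left = Node(winning_test.left_mean, winning_test.left_count)
        node.right = Node(winning_test.right_mean, winning_test.right_count)
        for state, ret in node.examples:
            child = node.left if winning_test.goes_left(state) else node.right
            child.examples.append((state, ret))
            child.add_candidate_tests(state)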
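The heuristic agent only needs the standard Tiger-problem belief update. A minimal sketch, assuming the classic listening accuracy of 0.85 (the post does not state the value actually used); b_left is the belief that the tiger is behind the left door:

    P_CORRECT = 0.85  # assumed probability that a listen observation is correct

    def update_belief(b_left, heard_left):
        # Bayes update of the belief that the tiger is behind the left door.
        if heard_left:
            num = P_CORRECT * b_left
            den = P_CORRECT * b_left + (1 - P_CORRECT) * (1 - b_left)
        else:
            num = (1 - P_CORRECT) * b_left
            den = (1 - P_CORRECT) * b_left + P_CORRECT * (1 - b_left)
        return num / den

    def heuristic_action(b_left):
        # Keep listening until the belief is sufficiently certain, then open a door.
        if b_left < 0.1:
            return "open-left"   # tiger is almost surely behind the right door
        if b_left > 0.9:
            return "open-right"  # tiger is almost surely behind the left door
        return "listen"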
Results
  • Horizon 2:
    (plot: Horizon 2 results)
  • Horizon 10:
    (plot: Horizon 10 results)
  • Horizon 100:
    (plot: Horizon 100 results)

1 comment:

  1. Dear Andreas,

    good progress, I now have the impression I can judge the performance of your algorithm. Unfortunately, it does not do much better than the heuristic. Maybe you can elicit some advantage in the long run by looking at the regret plot (see the last plot of Colin's blog entry from 27 April).

    You may also want to try varying the tiger reset probabilities. Now comes the exciting experimentation phase. Good luck :)

    Best regards, Michael
