- Mean value is taken over 10,000 runs for Horizon 2 and 5,000 runs for Horizon 10
- Error bars indicate standard error
- Environment is continuing
- Execution is stopped after Horizon=k actions
- Discount factor = 0.5, discount horizon = 0.000001
- Random: computed by hand for Horizon 2; for Horizon 10, obtained by running an agent that selects actions at random
- Optimal: represents the MDP solution (assuming that the correct action is always taken and assuming a perfect observation model)
- COMCTS with local recall: stores all examples and uses them to rebuild the regression tree
- MC and COMCTS both use the ε-accurate horizon time [Michael J. Kearns, Yishay Mansour, Andrew Y. Ng: A Sparse Sampling Algorithm for Near-Optimal Planning in Large Markov Decision Processes. Machine Learning 49(2-3): 193-208 (2002)]
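As a rough illustration of what an ε-accurate horizon means, the sketch below computes the smallest lookahead depth H at which the discounted reward still to come can no longer exceed ε. It rests on assumptions that are not stated in the setup above: that the "discount horizon" value 0.000001 plays the role of ε, that rewards are bounded by a hypothetical `r_max = 1.0`, and that the simple tail bound γ^H · r_max / (1 − γ) ≤ ε is used. The exact expression in the cited paper differs in its constants, so treat this as a sketch rather than the formula used in these experiments.

```python
import math


def epsilon_horizon(gamma: float, epsilon: float, r_max: float = 1.0) -> int:
    """Smallest H such that gamma**H * r_max / (1 - gamma) <= epsilon.

    r_max is a hypothetical bound on the per-step reward (an assumption).
    """
    if not 0.0 < gamma < 1.0:
        raise ValueError("gamma must lie strictly between 0 and 1")
    if epsilon <= 0.0 or r_max <= 0.0:
        raise ValueError("epsilon and r_max must be positive")
    # Solve gamma**H <= epsilon * (1 - gamma) / r_max for H.
    # log(gamma) is negative, so dividing flips the inequality.
    h = math.log(epsilon * (1.0 - gamma) / r_max) / math.log(gamma)
    return max(0, math.ceil(h))


# Assumed reading of the setup: gamma = 0.5, epsilon = 0.000001, r_max = 1.
print(epsilon_horizon(0.5, 1e-6))  # 21 under these assumptions
```

Under these assumptions, rewards more than 21 steps ahead contribute less than 0.000001 to the discounted return, which is why a finite lookahead can suffice in a continuing environment.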
- Horizon 2:
  [Figures: first 1000 samples; first 1000 samples (x-axis in log scale); first 200 samples; first 200 samples (x-axis in log scale)]
- Horizon 10:
  [Figures: first 200 samples; first 200 samples (x-axis in log scale); first 500 samples; first 500 samples (x-axis in log scale)]
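The mean values and standard-error error bars described in the setup can be computed along the following lines. This is a minimal sketch: the function name `mean_and_standard_error` and the placeholder data are hypothetical, and it assumes the standard error is the sample standard deviation divided by the square root of the number of runs.

```python
import math
import random
import statistics


def mean_and_standard_error(returns):
    """Return (mean, standard error of the mean) for a list of per-run returns."""
    n = len(returns)
    if n < 2:
        raise ValueError("need at least two runs to estimate a standard error")
    mean = statistics.fmean(returns)
    # Sample standard deviation (n - 1 in the denominator) over sqrt(n).
    sem = statistics.stdev(returns) / math.sqrt(n)
    return mean, sem


# Placeholder data standing in for 10,000 per-run returns (not real results).
rng = random.Random(0)
placeholder_returns = [rng.random() for _ in range(10_000)]
mean, sem = mean_and_standard_error(placeholder_returns)
print(f"mean = {mean:.4f}, standard error = {sem:.4f}")
```

Because the standard error shrinks with the square root of the number of runs, the 5,000-run Horizon 10 curves would be expected to show error bars about 1.4 times wider than they would with 10,000 runs, all else being equal.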