reinforcement learning, reward, and architecture
Optimal control approaches to understanding behavior start with some measure of reward or utility that the agent is hypothesized to maximize. But where do rewards come from? Satinder Singh and I, along with Andy Barto at UMass, have been developing a computational theory of reward that seeks to answer this question in a way that has implications for both cognitive science and artificial agent design. The answer is consistent with the observation that for biological agents, reward is a function that must be computed internal to the agent.
The key idea is the formulation of an optimal reward problem. Rather than starting with a reward function that provides a signal to the agent to maximize, we start with an objective fitness function and ask the following question: given this fitness function and some computationally limited agent, what is the best reward function to give this agent so that fitness is maximized over some environments of interest? Surprisingly, the answer for limited agents can be a reward function that differs markedly from the objective function. This optimal reward function is "adapted" to the agent architecture in ways that mitigate its bounds.
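The formulation can be made concrete in a minimal sketch: an outer search over candidate internal reward functions, each scored by the objective fitness a bounded agent accumulates when trained on it. Everything below (the toy chain environment, the small Q-learner, the two hand-coded candidate rewards) is an illustrative assumption, not our actual experimental setup:

```python
import random

def evaluate_fitness(reward_fn, episodes=30, steps=20, seed=0):
    """Train a small epsilon-greedy Q-learner on a 5-state chain whose
    rightmost state yields objective fitness. The agent learns from
    reward_fn (its internal reward), but we score it by fitness."""
    rng = random.Random(seed)
    n, goal = 5, 4
    Q = {(s, a): 0.0 for s in range(n) for a in (-1, 1)}
    visits = {s: 0 for s in range(n)}
    fitness = 0
    for _ in range(episodes):
        s = 0
        for _ in range(steps):
            if rng.random() < 0.2:                      # limited agent: noisy,
                a = rng.choice((-1, 1))                 # short-horizon learner
            else:
                a = max((-1, 1), key=lambda b: Q[(s, b)])
            s2 = min(max(s + a, 0), n - 1)
            visits[s2] += 1
            f = 1 if s2 == goal else 0                  # objective fitness signal
            fitness += f
            r = reward_fn(s2, f, visits)                # internal reward the agent maximizes
            Q[(s, a)] += 0.5 * (r + 0.9 * max(Q[(s2, b)] for b in (-1, 1)) - Q[(s, a)])
            s = s2
    return fitness

# Candidate internal reward functions: fitness-as-reward versus a
# novelty bonus that can mitigate the agent's limited experience.
candidates = {
    "fitness_only": lambda s, f, v: f,
    "fitness_plus_novelty": lambda s, f, v: f + 1.0 / v[s],
}

# The optimal reward problem: choose the reward function that maximizes
# objective fitness for this particular bounded agent and environment.
best = max(candidates, key=lambda k: evaluate_fitness(candidates[k]))
print(best)
```

The point of the sketch is the separation of roles: fitness is the designer's objective, while the reward function is a free design parameter, and for a bounded agent the best choice of that parameter need not be fitness itself.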
At the University of Michigan, a group of us (including Satinder Singh, John Laird, and Thad Polk; see left panel below) is also working on a related project to develop computational agents that operate for extended periods of time in rich and dynamic environments, and that achieve mastery of many aspects of those environments without task-specific programming. To accomplish these goals, our research is exploring a space of cognitive architectures that incorporate four fundamental features of real neural circuitry: (1) reinforcing behaviors that lead to intrinsic (and possibly optimal) rewards, (2) executing and learning over mental as well as motor actions, (3) extracting regularities in mental representations, whether derived from perception or cognitive operations, and (4) continuously encoding and retrieving episodic memories of past events.
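The four features above can be illustrated in a skeleton decision cycle. Everything here (the class name, the action set, novelty as the intrinsic reward) is a hypothetical sketch for exposition, not the project's actual architecture:

```python
import random
from collections import Counter

class SketchAgent:
    """One decision cycle touching the four architectural features."""
    def __init__(self):
        self.episodic = []         # (4) episodic memory of past events
        self.counts = Counter()    # (3) tallied regularities over representations
        self.Q = Counter()         # (1) values learned from intrinsic reward

    def step(self, percept):
        # (2) mental actions (recall, rehearse) compete with motor actions
        actions = ["recall", "rehearse", "move", "grasp"]
        action = max(actions, key=lambda a: (self.Q[(percept, a)], random.random()))
        # a mental action retrieves an episode; in a full architecture the
        # retrieved content would feed further cognitive processing
        recalled = self.episodic[-1] if (action == "recall" and self.episodic) else None
        self.counts[percept] += 1                  # extract regularities from perception
        intrinsic = 1.0 / self.counts[percept]     # novelty-based intrinsic reward
        self.Q[(percept, action)] += 0.1 * (intrinsic - self.Q[(percept, action)])
        self.episodic.append((percept, action, intrinsic))   # encode the event
        return action
```

The design choice the skeleton highlights is that reinforcement learning operates over a single selection mechanism for both mental and motor actions, with intrinsic reward rather than any task-specific signal driving learning.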
To learn more about reinforcement learning, check out Satinder Singh's website here at Michigan.
Please note the copyright notices on the publications listed below.
Guo, X., Singh, S., Lewis, R. L., and Lee, H. (2016). Deep learning for reward design to improve Monte Carlo tree search in ATARI games. In Proceedings of the 25th International Joint Conference on Artificial Intelligence (IJCAI).
Jiang, N., Kulesza, A., Singh, S., and Lewis, R. L. (2015). The dependence of effective planning horizon on model accuracy. In Proceedings of the 14th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2015). Best Paper Award.
Guo, X., Singh, S., Lee, H., Lewis, R. L., and Wang, X. (2014). Deep learning for real-time Atari game play using offline Monte Carlo tree search planning. In Advances in Neural Information Processing Systems (NIPS).
Jiang, N., Singh, S., and Lewis, R. L. (2014). Improving UCT planning via approximate homomorphisms. In Lomuscio, A., Scerri, P., Bazzan, A., and Huhns, M., editors, Proceedings of the 13th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2014), Paris, France. International Foundation for Autonomous Agents and Multiagent Systems.
Shvartsman, M., Lewis, R. L., and Singh, S. (2014). Computationally rational saccadic control: An explanation of spillover effects based on sampling from noisy perception and memory. In Demberg, V. and O'Donnell, T. J., editors, Proceedings of the 5th Workshop on Cognitive Modeling and Computational Linguistics (CMCL 2014), Baltimore, MD. Association for Computational Linguistics. Best Student Paper Award.
Lewis, R. L., Shvartsman, M., and Singh, S. (2013). The adaptive nature of eye-movements in linguistic tasks: How payoff and architecture shape speed-accuracy tradeoffs. Topics in Cognitive Science, 5(3):583-610.
Bratman, J., Singh, S., Lewis, R. L., and Sorg, J. (2012). Strong mitigation: Nesting search for good policies within search for good reward. In Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2012).
Sorg, J., Singh, S., and Lewis, R. L. (2011). Optimal rewards versus leaf-evaluation heuristics in planning agents. In Proceedings of AAAI-2011 (Conference of the Association for the Advancement of Artificial Intelligence).
Bratman, J., Shvartsman, M., Lewis, R. L., and Singh, S. (2010). A new approach to exploring language emergence as boundedly optimal control in the face of environmental and cognitive constraints. In Salvucci, D. and Gunzelmann, G., editors, Proceedings of the 10th International Conference on Cognitive Modeling. To appear.
Sorg, J., Singh, S., and Lewis, R. L. (2010). Variance-based rewards for approximate Bayesian reinforcement learning. In Proceedings of the 26th Conference on Uncertainty in Artificial Intelligence. Also available at http://event.cwi.nl/uai2010/.
Pearson, D., Gorski, N. A., Lewis, R. L., and Laird, J. E. (2007). Storm: A framework for biologically-inspired cognitive architecture research. In Lewis, R., Polk, T., and Laird, J., editors, Proceedings of the 8th International Conference on Cognitive Modeling. Psychology Press/Taylor & Francis.