reinforcement learning, reward, and architecture

Optimal control approaches to understanding behavior start with some measure of reward or utility that the agent is hypothesized to maximize. But where do rewards come from? Satinder Singh and I, along with Andy Barto at UMass, have been developing a computational theory of reward that seeks to answer this question in a way that has implications for both cognitive science and artificial agent design. The answer is consistent with the observation that for biological agents, reward is a function that must be computed internal to the agent.

The key idea is the formulation of an optimal reward problem. Rather than starting with a reward function that provides a signal to the agent to maximize, we start with an objective fitness function and ask the following question: given this fitness function and some computationally limited agent, what is the best reward function to give this agent so that fitness is maximized over some environments of interest? Surprisingly, the answer for limited agents can be a reward function that differs markedly from the objective function. This optimal reward function is "adapted" to the agent architecture in ways that mitigate its bounds.
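To make the idea concrete, here is a minimal sketch of an optimal reward problem in a toy setting. Everything in it is an illustrative assumption rather than the formulation used in the papers listed below: the six-state chain environment, the depth-limited planner standing in for a "computationally limited agent", and the one-parameter reward family `fitness + w * position`. The outer search simply tries each candidate reward and keeps the one whose agent accumulates the most objective fitness.

```python
# Toy optimal reward problem (all names and settings are hypothetical).
N_STATES = 6            # chain of positions 0..5; food sits at the last one
GOAL = N_STATES - 1
ACTIONS = (-1, +1)      # step left or step right
HORIZON = 20            # length of the agent's lifetime, in steps


def step(state, action):
    """Deterministic chain dynamics; walls clamp the position."""
    return min(max(state + action, 0), N_STATES - 1)


def fitness(state):
    """Objective fitness: 1 for every step spent at the food, else 0."""
    return 1.0 if state == GOAL else 0.0


def make_reward(w):
    """Assumed reward family: fitness plus a bonus proportional to position."""
    def reward(state):
        return fitness(state) + w * state
    return reward


def plan(state, reward, depth):
    """Exhaustive lookahead over `depth` steps.

    Returns (best summed reward, first action). Ties keep the earlier
    action (left), which is what makes a shallow planner get stuck.
    """
    best_value, best_action = float("-inf"), ACTIONS[0]
    for action in ACTIONS:
        nxt = step(state, action)
        value = reward(nxt)
        if depth > 1:
            value += plan(nxt, reward, depth - 1)[0]
        if value > best_value:
            best_value, best_action = value, action
    return best_value, best_action


def lifetime_fitness(reward, depth):
    """Run the bounded agent for one lifetime and score it by fitness."""
    state, total = 0, 0.0
    for _ in range(HORIZON):
        _, action = plan(state, reward, depth)
        state = step(state, action)
        total += fitness(state)      # scored by fitness, not by reward
    return total


def optimal_reward_weight(candidates, depth):
    """Outer search: the candidate reward that maximizes fitness."""
    return max(candidates,
               key=lambda w: lifetime_fitness(make_reward(w), depth))


if __name__ == "__main__":
    for depth in (1, 5):
        best = optimal_reward_weight([-0.5, 0.0, 0.5], depth)
        print(depth, best, lifetime_fitness(make_reward(best), depth))
```

In this toy case a planner that looks only one step ahead never finds the food when its reward is the fitness function itself (`w = 0`), but it does when the reward includes the position bonus (`w = 0.5`). A planner that looks five steps ahead does just as well with the unmodified fitness function. The best reward thus depends on the agent's bounds, which is the qualitative point of the paragraph above.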

At the University of Michigan, a group of us (including Satinder Singh, John Laird, and Thad Polk; see left panel below) is also working on a related project to develop computational agents that operate for extended periods of time in rich and dynamic environments and achieve mastery of many aspects of their environments without task-specific programming. To accomplish these goals, our research is exploring a space of cognitive architectures that incorporate four fundamental features of real neural circuitry: (1) reinforcing behaviors that lead to intrinsic (and possibly optimal) rewards, (2) executing and learning over mental as well as motor actions, (3) extracting regularities in mental representations, whether derived from perception or cognitive operations, and (4) continuously encoding and retrieving episodic memories of past events.

To learn more about reinforcement learning, check out Satinder Singh's website here at Michigan.

relevant publications

Attend to the copyright notice.

Guo, X., Singh, S., Lewis, R. L., and Lee, H. (2016). Deep learning for reward design to improve Monte Carlo tree search in ATARI games. In 25th International Joint Conference on Artificial Intelligence (IJCAI). [ PDF ]

Jiang, N., Kulesza, A., Singh, S., and Lewis, R. L. (2015). The dependence of effective planning horizon on model accuracy. In 14th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2015). Best Paper Award. [ PDF ]

Oh, J., Guo, X., Lee, H., Lewis, R. L., and Singh, S. (2015). Action-conditional video prediction using deep networks in Atari games. In Advances in Neural Information Processing Systems (NIPS). [ PDF ]

Guo, X., Singh, S., Lee, H., Lewis, R. L., and Wang, X. (2014). Deep learning for real-time Atari game play using offline Monte-Carlo tree search planning. In Advances in Neural Information Processing Systems (NIPS). [ PDF ]

Jiang, N., Singh, S., and Lewis, R. L. (2014). Improving UCT planning via approximate homomorphisms. In Lomuscio, A., Scerri, P., Bazzan, A., and Huhns, M., editors, Proceedings of the 13th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2014), Paris, France. International Foundation for Autonomous Agents and Multiagent Systems. [ PDF ]

Liu, B., Singh, S., Lewis, R. L., and Qin, S. (2014). Optimal rewards for cooperative agents. IEEE Transactions on Autonomous Mental Development, 6(4):286-297.

Shvartsman, M., Lewis, R. L., and Singh, S. (2014). Computationally rational saccadic control: An explanation of spillover effects based on sampling from noisy perception and memory. In Demberg, V. and O'Donnell, T. J., editors, Proceedings of the 5th Workshop on Cognitive Modeling and Computational Linguistics (CMCL 2014), Baltimore, MD. Association for Computational Linguistics. Best Student Paper Award. [ PDF ]

Guo, X., Singh, S., and Lewis, R. L. (2013). Reward mapping for transfer in long-lived agents. In Advances in Neural Information Processing Systems 26 (NIPS). [ PDF ]

Lewis, R. L., Shvartsman, M., and Singh, S. (2013). The adaptive nature of eye-movements in linguistic tasks: How payoff and architecture shape speed-accuracy tradeoffs. Topics in Cognitive Science, 5(3):583-610. [ PDF ]

Bratman, J., Singh, S., Lewis, R. L., and Sorg, J. (2012). Strong mitigation: Nesting search for good policies within search for good reward. In Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2012). [ PDF ]

Liu, B., Singh, S., Lewis, R. L., and Qin, S. (2012). Optimal rewards in multiagent teams. In Proceedings of the IEEE Conference on Development and Learning. Paper of Excellence Award. [ PDF ]

Sorg, J., Singh, S., and Lewis, R. L. (2011). Optimal rewards versus leaf-evaluation heuristics in planning agents. In Proceedings of AAAI-2011 (Conference of the Association for the Advancement of Artificial Intelligence). [ PDF ]

Bratman, J., Shvartsman, M., Lewis, R. L., and Singh, S. (2010). A new approach to exploring language emergence as boundedly optimal control in the face of environmental and cognitive constraints. In Salvucci, D. and Gunzelmann, G., editors, Proceedings of the 10th International Conference on Cognitive Modeling. To appear. [ PDF ]

Singh, S., Lewis, R. L., Barto, A. G., and Sorg, J. (2010). Intrinsically motivated reinforcement learning: An evolutionary perspective. IEEE Transactions on Autonomous Mental Development. [ PDF ]

Sorg, J., Singh, S., and Lewis, R. L. (2010b). Reward design via online gradient ascent. In Advances in Neural Information Processing Systems, volume 23. [ PDF ]

Sorg, J., Singh, S., and Lewis, R. L. (2010c). Variance-based rewards for approximate Bayesian reinforcement learning. In Proceedings of the 26th Conference on Uncertainty in Artificial Intelligence. Also available at http://event.cwi.nl/uai2010/. [ PDF ]

Sorg, J., Singh, S., and Lewis, R. L. (2010a). Internal rewards mitigate agent boundedness. In International Conference on Machine Learning, Haifa, Israel. [ PDF ]

Singh, S., Lewis, R. L., and Barto, A. G. (2009). Where do rewards come from? In Proceedings of the Annual Conference of the Cognitive Science Society, pages 2601-2606, Amsterdam. [ PDF ]

Pearson, D., Gorski, N. A., Lewis, R. L., and Laird, J. E. (2007). Storm: A framework for biologically-inspired cognitive architecture research. In Lewis, R., Polk, T., and Laird, J., editors, The Proceedings of the 8th International Conference on Cognitive Modeling. Psychology Press/Taylor & Francis. [ PDF ]

Lewis, R. L. (2001). Cognitive theory, Soar. In International Encyclopedia of the Social and Behavioral Sciences, pages 2178-2183. Pergamon (Elsevier Science), Amsterdam. [ PDF ]

Richard Lewis	professor
Michael Shvartsman	phd student
Monjira Biswas, Hajira Choudry, Yasaman Kazerooni, Sooin Lee, Mehgha Shyam	research assistants

Matthias Schlesewsky	universität mainz
Andrew Howes	university of birmingham
Andrew Barto	university of massachusetts
Ina Bornkessel-Schlesewsky	universität marburg
Alonso Vera, Michael Feary, Collin Green	nasa ames research center
William Badecker	national science foundation
Shravan Vasishth	universität potsdam
Xiaoxiao Guo, Nan Jiang, John Laird, Satinder Singh	michigan computer science
Samuel Epstein	michigan linguistics
Marc Berman	university of toronto
Guadalupe De Los Santos, Julie Boland, John Jonides, David Meyer, Thad Polk	michigan psychology