| Speaker | Remi Munos |
|---|---|
| Affiliation | Google DeepMind |
| Date | Friday, 04 November 2016 |
| Time | 13:00-14:00 |
| Location | Roberts Building 508 |
| Event series | Jump Trading/ELLIS CSML Seminar Series |
| Abstract | In this work, we take a fresh look at some old and new algorithms for off-policy, return-based reinforcement learning. Expressing these in a common form, we derive a novel algorithm, Retrace(λ), with three desired properties: (1) low variance; (2) safety, as it safely uses samples collected from any behaviour policy, whatever its degree of "off-policyness"; and (3) efficiency, as it makes the best use of samples collected from near on-policy behaviour policies. We analyse the contractive nature of the related operator in both the off-policy policy evaluation and control settings and derive online sample-based algorithms. To our knowledge, this is the first return-based off-policy control algorithm converging a.s. to Q* without the GLIE assumption (Greedy in the Limit with Infinite Exploration). As a corollary, we prove the convergence of Watkins' Q(λ), which was still an open problem. We illustrate the benefits of Retrace(λ) on a standard suite of Atari 2600 games. |
| Biography | Remi Munos is currently a research scientist at Google DeepMind, on leave from Inria. He has worked on topics related to reinforcement learning, bandit theory, optimisation, and statistical learning. |
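
For readers new to the algorithm named in the abstract, the sketch below illustrates how Retrace(λ) targets can be computed along a trajectory collected under a behaviour policy μ, using the truncated importance weights c_s = λ min(1, π(a_s|x_s)/μ(a_s|x_s)) that define Retrace(λ). This is a minimal illustrative sketch, not material from the talk: the function name `retrace_targets`, the array layout, and the episode-boundary handling are assumptions made here for clarity.

```python
import numpy as np


def retrace_targets(q_values, target_probs, behaviour_probs, actions, rewards,
                    dones, gamma=0.99, lam=1.0):
    """Compute Retrace(lambda) targets along one sampled trajectory.

    Shapes (T transitions, T + 1 states, A actions):
      q_values:        (T + 1, A)  current estimate Q(x_t, .)
      target_probs:    (T + 1, A)  target policy pi(.|x_t)
      behaviour_probs: (T,)        mu(a_t|x_t) for the actions actually taken
      actions:         (T,)        actions a_t taken by the behaviour policy
      rewards:         (T,)        rewards r_t
      dones:           (T,)        True if x_{t+1} is terminal
    Returns:
      (T,) array of Retrace targets y_t for Q(x_t, a_t).
    """
    T = len(rewards)
    targets = np.zeros(T)
    # Backward recursion:
    #   y_t = r_t + gamma * E_pi[Q(x_{t+1}, .)]
    #             + gamma * c_{t+1} * (y_{t+1} - Q(x_{t+1}, a_{t+1}))
    # with truncated importance weights c_s = lam * min(1, pi(a_s|x_s) / mu(a_s|x_s)).
    correction = 0.0  # carries c_{t+1} * (y_{t+1} - Q(x_{t+1}, a_{t+1}))
    for t in reversed(range(T)):
        if dones[t]:
            expected_q_next = 0.0  # no bootstrap past a terminal state
            correction = 0.0       # the trace does not cross episode boundaries
        else:
            expected_q_next = np.dot(target_probs[t + 1], q_values[t + 1])
        targets[t] = rewards[t] + gamma * (expected_q_next + correction)
        # Trace coefficient for this step, consumed by the preceding step.
        c_t = lam * min(1.0, target_probs[t, actions[t]] / behaviour_probs[t])
        correction = c_t * (targets[t] - q_values[t, actions[t]])
    return targets


if __name__ == "__main__":
    # Tiny synthetic trajectory with a uniform behaviour policy, for illustration only.
    rng = np.random.default_rng(0)
    T, A = 5, 3
    q = rng.normal(size=(T + 1, A))
    pi = rng.dirichlet(np.ones(A), size=T + 1)
    acts = rng.integers(A, size=T)
    mu = np.full(T, 1.0 / A)
    r = rng.normal(size=T)
    d = np.zeros(T, dtype=bool)
    print(retrace_targets(q, pi, mu, acts, r, d))
```

Because each c_s is capped at λ ≤ 1, the sketch reflects the properties highlighted in the abstract: the product of traces cannot blow up (low variance, safety for arbitrarily off-policy data), while near on-policy data keeps c_s close to λ and so retains most of the multi-step return (efficiency).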