Foreign-Language Translation Material: Maze Treasure-Hunting Strategy Based on Reinforcement Learning and APP Design and Implementation


CONTINUOUS CONTROL WITH DEEP REINFORCEMENT LEARNING

arXiv:1509.02971v5 [cs.LG] 29 Feb 2016

Timothy P. Lillicrap∗, Jonathan J. Hunt∗, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver & Daan Wierstra

Google Deepmind, London, UK

{countzero, jjhunt, apritzel, heess, etom, tassa, davidsilver, wierstra}@google.com

∗These authors contributed equally.

ABSTRACT

We adapt the ideas underlying the success of Deep Q-Learning to the continuous action domain. We present an actor-critic, model-free algorithm based on the deterministic policy gradient that can operate over continuous action spaces. Using the same learning algorithm, network architecture and hyper-parameters, our algorithm robustly solves more than 20 simulated physics tasks, including classic problems such as cartpole swing-up, dexterous manipulation, legged locomotion and car driving. Our algorithm is able to find policies whose performance is competitive with those found by a planning algorithm with full access to the dynamics of the domain and its derivatives. We further demonstrate that for many of the tasks the algorithm can learn policies “end-to-end”: directly from raw pixel inputs.

  1. INTRODUCTION

One of the primary goals of the field of artificial intelligence is to solve complex tasks from unprocessed, high-dimensional, sensory input. Recently, significant progress has been made by combining advances in deep learning for sensory processing (Krizhevsky et al., 2012) with reinforcement learning, resulting in the “Deep Q Network” (DQN) algorithm (Mnih et al., 2015) that is capable of human level performance on many Atari video games using unprocessed pixels for input. To do so, deep neural network function approximators were used to estimate the action-value function.

However, while DQN solves problems with high-dimensional observation spaces, it can only handle discrete and low-dimensional action spaces. Many tasks of interest, most notably physical control tasks, have continuous (real valued) and high dimensional action spaces. DQN cannot be straightforwardly applied to continuous domains since it relies on finding the action that maximizes the action-value function, which in the continuous valued case requires an iterative optimization process at every step.
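As an illustrative sketch (not taken from the paper; the quadratic q_value function and the numbers are hypothetical), the toy example below contrasts greedy action selection over a discrete action set, which is a single argmax, with a continuous action space, where finding the maximizing action becomes an inner iterative optimization that would have to run at every step:

```python
import numpy as np

# Hypothetical smooth action-value function; in DQN this would be a deep network.
def q_value(state, action):
    return -np.sum((action - 0.3 * state) ** 2)  # maximized at action = 0.3 * state

state = np.array([1.0, -2.0])

# Discrete case (DQN): greedy action selection is one argmax over a finite set.
discrete_actions = [np.array([a1, a2]) for a1 in (-1.0, 0.0, 1.0) for a2 in (-1.0, 0.0, 1.0)]
best_discrete = max(discrete_actions, key=lambda a: q_value(state, a))

# Continuous case: there is no finite set to enumerate, so argmax_a Q(s, a) becomes
# an optimization problem of its own (here: crude gradient ascent with central
# finite differences) that would have to be solved at every environment step.
action = np.zeros(2)
steps = [np.array([1e-3, 0.0]), np.array([0.0, 1e-3])]
for _ in range(100):
    grad = np.array([(q_value(state, action + h) - q_value(state, action - h)) / 2e-3
                     for h in steps])
    action += 0.1 * grad

print("discrete greedy action:", best_discrete)
print("continuous maximizer (approx.):", action)  # roughly [0.3, -0.6]
```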

An obvious approach to adapting deep reinforcement learning methods such as DQN to continuous domains is to simply discretize the action space. However, this has many limitations, most notably the curse of dimensionality: the number of actions increases exponentially with the number of degrees of freedom. For example, a 7 degree of freedom system (as in the human arm) with the coarsest discretization a_i ∈ {−k, 0, k} for each joint leads to an action space with dimensionality: 3^7 = 2187. The situation is even worse for tasks that require fine control of actions as they require a correspondingly finer grained discretization, leading to an explosion of the number of discrete actions. Such large action spaces are difficult to explore efficiently, and thus successfully training DQN-like networks in this context is likely intractable. Additionally, naive discretization of action spaces needlessly throws away information about the structure of the action domain, which may be essential for solving many problems.
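The arithmetic behind this explosion is easy to reproduce; the snippet below is only a worked example of the bins-per-joint count, not anything from the paper:

```python
# Each of `dof` joints is discretized into `bins` values, so the joint action
# set is the Cartesian product with bins ** dof elements.
def num_discrete_actions(dof: int, bins: int) -> int:
    return bins ** dof

print(num_discrete_actions(7, 3))   # 2187      -- the 7-DoF arm with {-k, 0, k} per joint
print(num_discrete_actions(7, 11))  # 19487171  -- a modestly finer 11-bin discretization
```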

In this work we present a model-free, off-policy actor-critic algorithm using deep function approximators that can learn policies in high-dimensional, continuous action spaces. Our work is based on the deterministic policy gradient (DPG) algorithm (Silver et al., 2014) (itself similar to NFQCA (Hafner & Riedmiller, 2011), and similar ideas can be found in (Prokhorov et al., 1997)). However, as we show below, a naive application of this actor-critic method with neural function approximators is unstable for challenging problems.
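To fix ideas, here is a minimal sketch (not from the paper; written in PyTorch, with placeholder layer sizes rather than the architecture and hyper-parameters reported later) of the two deep function approximators such an actor-critic method uses: a deterministic actor μ(s) and a critic Q(s, a), with the actor improved by ascending the critic's estimate of Q(s, μ(s)):

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Deterministic policy mu(s): maps a state directly to one continuous action."""
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),  # keeps actions in [-1, 1]
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

class Critic(nn.Module):
    """Action-value function Q(s, a): scores a state-action pair."""
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([state, action], dim=-1))

# The deterministic policy gradient idea in one line: improve the actor by
# ascending the critic's estimate of Q(s, mu(s)), differentiating through the critic.
actor = Actor(state_dim=3, action_dim=1)
critic = Critic(state_dim=3, action_dim=1)
states = torch.randn(32, 3)                          # hypothetical batch of states
actor_loss = -critic(states, actor(states)).mean()   # maximizing Q == minimizing -Q
actor_loss.backward()                                 # gradients flow into the actor's weights
```

Because the actor outputs an action directly, no maximization over the action space is needed at decision time; the instability mentioned above is what the stabilizers discussed next are meant to address.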

Here we combine the actor-critic approach with insights from the recent success of Deep Q Network (DQN) (Mnih et al., 2013; 2015). Prior to DQN, it was generally believed that learning value functions using large, non-linear function approximators was difficult and unstable. DQN is able to learn value functions using such function approximators in a stable and robust way due to two innovations: 1. the network is trained off-policy with samples from a replay buffer to minimize correlations between samples; 2. the network is trained with a target Q network to give consistent targets during temporal difference backups. In this work we make use of the same ideas, along with batch normalization (Ioffe & Szegedy, 2015), a recent advance in deep learning.
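A minimal sketch of these two stabilizers, a uniform-sampling replay buffer and a target network, follows (not from the paper; the slowly tracking "soft" update shown for the target is the variant adopted later in this paper, and tau = 0.001 is only a placeholder value):

```python
import copy
import random
from collections import deque

import torch
import torch.nn as nn

class ReplayBuffer:
    """Stores transitions; uniform random sampling breaks up temporal correlations."""
    def __init__(self, capacity: int = 1_000_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        # States and actions are assumed to be torch tensors in this sketch.
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size: int):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return (torch.stack(states), torch.stack(actions),
                torch.tensor(rewards, dtype=torch.float32),
                torch.stack(next_states),
                torch.tensor(dones, dtype=torch.float32))

def make_target(network: nn.Module) -> nn.Module:
    """Detached copy of a network, used to compute consistent TD targets."""
    target = copy.deepcopy(network)
    for p in target.parameters():
        p.requires_grad_(False)
    return target

def soft_update(target: nn.Module, source: nn.Module, tau: float = 0.001) -> None:
    """Slowly track the learned weights: theta' <- tau * theta + (1 - tau) * theta'."""
    with torch.no_grad():
        for tp, sp in zip(target.parameters(), source.parameters()):
            tp.mul_(1.0 - tau).add_(tau * sp)
```

Batch normalization, the third ingredient mentioned above, would correspond to nn.BatchNorm1d layers inserted into the networks; it is omitted here for brevity.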

In order to evaluate our method we constructed a variety of challenging physical control problems that involve complex multi-joint movements, unstable and rich contact dynamics, and gait behavior. Among these are classic problems such as the cartpole swing-up problem, as well as many new domains. A long-standing challenge of robotic control is to learn an action policy directly from raw sensory input such as video. Accordingly, we placed a fixed viewpoint camera in the simulator and attempted all tasks using both low-dimensional observations (e.g. joint angles) and directly from pixels.

Our model-free approach which we call Deep DPG (DDPG) can learn competitive policies for all of our tasks using low-dimensional observations (e.g. cartesian coordinates or joint angles) using the same hyper-parameters and network structure. In many cases, we are also able to learn good policies directly from pixels, again keeping hyperparameters and network structure constant.

The English original is 10 pages in total.

Material No.: [4672]
