Deep reinforcement learning survey paper: A Brief Survey of Deep Reinforcement Learning


Kai Arulkumaran, Marc Peter Deisenroth, Miles Brundage, Anil Anthony Bharath

Abstract

Deep reinforcement learning (DRL) is poised to revolutionize the field of artificial intelligence (AI) and represents a step toward building autonomous systems with a higher level understanding of the visual world. Currently, deep learning is enabling reinforcement learning (RL) to scale to problems that were previously intractable, such as learning to play video games directly from pixels. DRL algorithms are also applied to robotics, allowing control policies for robots to be learned directly from camera inputs in the real world. In this survey, we begin with an introduction to the general field of RL, then progress to the main streams of value-based and policy-based methods. Our survey will cover central algorithms in deep RL, including the deep Q-network (DQN), trust region policy optimization (TRPO), and asynchronous advantage actor critic. In parallel, we highlight the unique advantages of deep neural networks, focusing on visual understanding via RL. To conclude, we describe several current areas of research within the field.

1. Introduction

One of the primary goals of the field of artificial intelligence (AI) is to produce fully autonomous agents that interact with their environments to learn optimal behaviours, improving over time through trial and error. Crafting AI systems that are responsive and can effectively learn has been a long-standing challenge, ranging from robots, which can sense and react to the world around them, to purely software-based agents, which can interact with natural language and multimedia. A principled mathematical framework for experience-driven autonomous learning is reinforcement learning (RL) [135]. Although RL had some successes in the past [141, 129, 62, 93], previous approaches lacked scalability and were inherently limited to fairly low-dimensional problems. These limitations exist because RL algorithms share the same complexity issues as other algorithms: memory complexity, computational complexity, and, in the case of machine learning algorithms, sample complexity [133]. What we have witnessed in recent years, the rise of deep learning, relying on the powerful function approximation and representation learning properties of deep neural networks, has provided us with new tools for overcoming these problems.


The advent of deep learning has had a significant impact on many areas in machine learning, dramatically improving the state of the art in tasks such as object detection, speech recognition, and language translation [39]. The most important property of deep learning is that deep neural networks can automatically find compact low-dimensional representations (features) of high-dimensional data (e.g., images, text, and audio). Through crafting inductive biases into neural network architectures, particularly that of hierarchical representations, machine-learning practitioners have made effective progress in addressing the curse of dimensionality [7]. Deep learning has similarly accelerated progress in RL, with the use of deep-learning algorithms within RL defining the field of DRL. The aim of this survey is to cover both seminal and recent developments in DRL, conveying the innovative ways in which neural networks can be used to bring us closer toward developing autonomous agents. For a more comprehensive survey of recent efforts in DRL, we refer readers to the overview by Li [43].


Deep learning enables RL to scale to decision-making problems that were previously intractable, i.e., settings with high-dimensional state and action spaces. Among recent work in the field of DRL, there have been two outstanding success stories. The first, kick starting the revolution in DRL, was the development of an algorithm that could learn to play a range of Atari 2600 video games at a superhuman level, directly from image pixels [47]. Providing solutions for the instability of function approximation techniques in RL, this work was the first to convincingly demonstrate that RL agents could be trained on raw, high-dimensional observations, solely based on a reward signal. The second standout success was the development of a hybrid DRL system, AlphaGo, that defeated a human world champion in Go [73], paralleling the historic achievement of IBM’s Deep Blue in chess two decades earlier [9]. Unlike the handcrafted rules that have dominated chess-playing systems, AlphaGo comprised neural networks that were trained using supervised learning and RL, in combination with a traditional heuristic search algorithm.

DRL algorithms have already been applied to a wide range of problems, such as robotics, where control policies for robots can now be learned directly from camera inputs in the real world [41], [42], succeeding controllers that used to be hand-engineered or learned from low-dimensional features of the robot’s state. In Figure 1, we showcase just some of the domains that DRL has been applied to, ranging from playing video games [47] to indoor navigation [100].



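Figure 1. A range of visual RL domains.
(a) Three classic Atari 2600 video games, Enduro, Freeway, and Seaquest, from the Arcade Learning Environment (ALE) [5]. Because the supported games vary in genre, visuals, and difficulty, the ALE has become a standard test bed for DRL algorithms [20], [47], [48], [55], [70], [75], [92]. The ALE is one of several benchmarks that are now being used to standardize evaluation in RL.
(b) The TORCS car racing simulator, which has been used to test DRL algorithms that can output continuous actions [33], [44], [48] (as the games from the ALE only support discrete actions).
(c) Utilizing the potentially unlimited amount of training data that can be amassed in robotic simulators, several methods aim to transfer knowledge from the simulator to the real world [11], [64], [84].
(d) Two of the four robotic tasks designed by Levine et al.: screwing on a bottle cap and placing a shaped block in the correct hole. Visuomotor policies could be trained in an end-to-end fashion [41], showing that visual servoing can be learned directly from raw camera inputs by using deep neural networks.
(e) A real room, in which a wheeled robot trained to navigate the building is given a visual cue as input and must find the corresponding location [100].
(f) A natural image processed by a neural network that uses RL to choose where to look [99]. By processing a small portion of the image for every word generated, the network can focus its attention on the most salient points.
(b)-(f) are reproduced from [41], [44], [84], [99], and [100], respectively.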

2. Reward-Driven Behavior

Before examining the contributions of deep neural networks to RL, we will introduce the field of RL in general. The essence of RL is learning through interaction. An RL agent interacts with its environment and, upon observing the consequences of its actions, can learn to alter its own behavior in response to rewards received. This paradigm of trial-and-error learning has its roots in behaviorist psychology and is one of the main foundations of RL [78]. The other key influence on RL is optimal control, which has lent the mathematical formalisms (most notably dynamic programming [6]) that underpin the field.


In the RL setup, an autonomous agent, controlled by a machine-learning algorithm, observes a state $s_t$ from its environment at time step $t$. The agent interacts with the environment by taking an action $a_t$ in state $s_t$. When the agent takes an action, the environment and the agent transition to a new state, $s_{t+1}$, based on the current state and the chosen action. The state is a sufficient statistic of the environment and thereby comprises all the necessary information for the agent to take the best action, which can include parts of the agent such as the position of its actuators and sensors. In the optimal control literature, states and actions are often denoted by $x_t$ and $u_t$, respectively.

The best sequence of actions is determined by the rewards provided by the environment. Every time the environment transitions to a new state, it also provides a scalar reward $r_{t+1}$ to the agent as feedback. The goal of the agent is to learn a policy (control strategy) $\pi$ that maximizes the expected return (cumulative, discounted reward). Given a state, a policy returns an action to perform; an optimal policy is any policy that maximizes the expected return in the environment. In this respect, RL aims to solve the same problem as optimal control. However, the challenge in RL is that the agent needs to learn about the consequences of actions in the environment by trial and error, as, unlike in optimal control, a model of the state transition dynamics is not available to the agent. Every interaction with the environment yields information, which the agent uses to update its knowledge. This perception-action-learning loop is illustrated in Figure 2.


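Figure 2. The perception-action-learning loop. At time $t$, the agent receives state $s_t$ from the environment. The agent uses its policy to choose an action $a_t$. Once the action is executed, the environment transitions a step, providing the next state $s_{t+1}$ as well as feedback in the form of a reward $r_{t+1}$. The agent uses knowledge of state transitions, of the form $(s_t, a_t, s_{t+1}, r_{t+1})$, to learn and improve its policy.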
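The loop described above can be sketched in a few lines of Python. This is only an illustrative sketch: `TwoStateEnv`, `random_policy`, and `run_loop` are hypothetical names introduced here, not part of the survey, and the toy environment is an assumption chosen to keep the example short.

```python
import random


class TwoStateEnv:
    """Hypothetical toy environment with states 0 and 1 (illustration only)."""

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # Action 1 moves to state 1 and is rewarded; action 0 moves to state 0.
        next_state = 1 if action == 1 else 0
        reward = 1.0 if next_state == 1 else 0.0
        self.state = next_state
        return next_state, reward


def random_policy(state, rng):
    # A placeholder policy that ignores the state and acts uniformly at random.
    return rng.choice([0, 1])


def run_loop(env, policy, num_steps, rng):
    """Collect (s_t, a_t, s_{t+1}, r_{t+1}) tuples by interacting with env."""
    transitions = []
    s = env.reset()
    for _ in range(num_steps):
        a = policy(s, rng)            # the agent selects an action
        s_next, r = env.step(a)       # the environment transitions and rewards
        transitions.append((s, a, s_next, r))
        s = s_next
    return transitions


transitions = run_loop(TwoStateEnv(), random_policy, num_steps=5, rng=random.Random(0))
```

A learning agent would consume the collected tuples to update its policy; that step is deliberately left out here.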

2.1. Markov Decision Processes

Formally, RL can be described as a Markov decision process (MDP), which consists of

  • a set of states $\mathcal{S}$, plus a distribution of starting states $p(s_0)$
  • a set of actions $\mathcal{A}$
  • transition dynamics $\mathcal{T}(s_{t+1} \mid s_t, a_t)$ that map a state-action pair at time $t$ onto a distribution of states at time $t+1$
  • an immediate/instantaneous reward function $\mathcal{R}(s_t, a_t, s_{t+1})$
  • a discount factor $\gamma \in [0, 1]$, where lower values place more emphasis on immediate rewards.
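As a rough illustration, the five components listed above map naturally onto a small container type. The `MDP` class, its field names, and the two-state `toy` instance below are assumptions made for this sketch and do not come from the survey.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

State = int
Action = int


@dataclass
class MDP:
    states: List[State]
    actions: List[Action]
    start_dist: Dict[State, float]                              # p(s_0)
    transition: Dict[Tuple[State, Action], Dict[State, float]]  # transition dynamics
    reward: Callable[[State, Action, State], float]             # reward function
    gamma: float                                                # discount factor

    def validate(self) -> None:
        """Check that gamma is in [0, 1] and that all distributions sum to one."""
        if not 0.0 <= self.gamma <= 1.0:
            raise ValueError("gamma must lie in [0, 1]")
        if abs(sum(self.start_dist.values()) - 1.0) > 1e-9:
            raise ValueError("start distribution must sum to 1")
        for s in self.states:
            for a in self.actions:
                if abs(sum(self.transition[(s, a)].values()) - 1.0) > 1e-9:
                    raise ValueError(f"transition probabilities for {(s, a)} must sum to 1")


# Hypothetical two-state example: action 1 usually reaches the rewarding state 1.
toy = MDP(
    states=[0, 1],
    actions=[0, 1],
    start_dist={0: 1.0},
    transition={
        (0, 0): {0: 1.0},
        (0, 1): {1: 0.9, 0: 0.1},
        (1, 0): {0: 1.0},
        (1, 1): {1: 1.0},
    },
    reward=lambda s, a, s_next: 1.0 if s_next == 1 else 0.0,
    gamma=0.95,
)
toy.validate()
```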


In general, the policy $\pi$ is a mapping from states to a probability distribution over actions $\pi: \mathcal{S} \to p(\mathcal{A} = a \mid \mathcal{S})$. If the MDP is episodic, i.e., the state is reset after each episode of length $T$, then the sequence of states, actions, and rewards in an episode constitutes a trajectory or rollout of the policy. Every rollout of a policy accumulates rewards from the environment, resulting in the return $R = \sum_{t=0}^{T-1} \gamma^t r_{t+1}$. The goal of RL is to find an optimal policy, $\pi^*$, that achieves the maximum expected return from all states:

$$\pi^* = \arg\max_{\pi} \mathbb{E}[R \mid \pi]$$

It is also possible to consider nonepisodic MDPs, where $T = \infty$. In this situation, $\gamma < 1$ prevents an infinite sum of rewards from being accumulated. Furthermore, methods that rely on complete trajectories are no longer applicable, but those that use a finite set of transitions still are.
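To make the role of the discount factor concrete, the following sketch computes the return $R = \sum_{t=0}^{T-1} \gamma^t r_{t+1}$ of a finite reward sequence. The function name `discounted_return` is an assumption of this example. With a constant reward of 1 and $\gamma = 0.9$, a long rollout approaches the geometric-series limit $1/(1-\gamma) = 10$, which illustrates why $\gamma < 1$ keeps the sum bounded.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma^t * r_{t+1} for rewards = [r_1, ..., r_T]."""
    total = 0.0
    # Accumulate from the last reward backward: R_t = r_{t+1} + gamma * R_{t+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


short = discounted_return([1.0, 1.0, 1.0], gamma=0.5)   # 1 + 0.5 + 0.25
long_run = discounted_return([1.0] * 1000, gamma=0.9)   # close to 1 / (1 - 0.9)
```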

A key concept underlying RL is the Markov property: only the current state affects the next state, or, in other words, the future is conditionally independent of the past given the present state. This means that any decisions made at $s_t$ can be based solely on $s_{t-1}$, rather than $\{s_0, s_1, \ldots, s_{t-1}\}$. Although this assumption is held by the majority of RL algorithms, it is somewhat unrealistic, as it requires the states to be fully observable. Partially observable MDPs (POMDPs) are a generalization of MDPs in which the agent receives an observation $o_t \in \Omega$, where the distribution of the observation $p(o_{t+1} \mid s_{t+1}, a_t)$ is dependent on the current state and the previous action [27]. In a control and signal processing context, the observation would be described by a measurement/observation mapping in a state-space model that depends on the current state and the previously applied action.

POMDP algorithms typically maintain a belief over the current state given the previous belief state, the action taken, and the current observation. A more common approach in deep learning is to utilize recurrent neural networks (RNNs) [20], [21], [48], [96], which, unlike feedforward neural networks, are dynamical systems.
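For a discrete POMDP, the belief maintenance described above is commonly written as a Bayes filter: $b'(s') \propto p(o \mid s', a) \sum_s p(s' \mid s, a)\, b(s)$. The sketch below assumes tabular transition and observation models stored in dictionaries; the function name, the data layout, and the two-state "listen" example are all hypothetical choices made for illustration.

```python
def belief_update(belief, action, observation, T, O):
    """Bayes-filter step: b'(s') is proportional to O(o | s', a) * sum_s T(s' | s, a) * b(s)."""
    states = list(belief)
    unnormalized = {}
    for s_next in states:
        # Prediction: push the previous belief through the transition model.
        predicted = sum(T[(s, action)].get(s_next, 0.0) * belief[s] for s in states)
        # Correction: weight by the likelihood of the received observation.
        unnormalized[s_next] = O[(s_next, action)].get(observation, 0.0) * predicted
    norm = sum(unnormalized.values())
    if norm == 0.0:
        raise ValueError("observation has zero probability under the model")
    return {s: p / norm for s, p in unnormalized.items()}


# Hypothetical example: "listen" leaves the hidden state unchanged and
# reports it correctly with probability 0.85.
T = {("left", "listen"): {"left": 1.0}, ("right", "listen"): {"right": 1.0}}
O = {
    ("left", "listen"): {"hear_left": 0.85, "hear_right": 0.15},
    ("right", "listen"): {"hear_left": 0.15, "hear_right": 0.85},
}
posterior = belief_update({"left": 0.5, "right": 0.5}, "listen", "hear_left", T, O)
```

Recurrent networks replace this explicit filter with a learned hidden state, which is the approach the cited DRL work takes.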

2.2. Challenges in RL

It is instructive to emphasize some challenges faced in RL:

  • The optimal policy must be inferred by trial-and-error interaction with the environment. The only learning signal the agent receives is the reward.
  • The observations of the agent depend on its actions and can contain strong temporal correlations.
  • Agents must deal with long-range time dependencies: often the consequences of an action only materialize after many transitions of the environment. This is known as the (temporal) credit assignment problem [78].


We will illustrate these challenges in the context of an indoor robotic visual navigation task: if the goal location is specified, we may be able to estimate the distance remaining (and use it as a reward signal), but it is unlikely that we will know exactly what series of actions the robot needs to take to reach the goal. As the robot must choose where to go as it navigates the building, its decisions influence which rooms it sees and, hence, the statistics of the visual sequence captured. Finally, after navigating several junctions, the robot may find itself in a dead end. There is a range of problems, from learning the consequences of actions to balancing exploration versus exploitation, but ultimately these can all be addressed formally within the framework of RL.


3. RL Algorithms

So far, we have introduced the key formalism used in RL, the MDP, and briefly noted some challenges in RL. In the following, we will distinguish between different classes of RL algorithms. There are two main approaches to solving RL problems: methods based on value functions and methods based on policy search. There is also a hybrid actor-critic approach that employs both value functions and policy search. Next, we will explain these approaches and other useful concepts for solving RL problems.


3.1. Value Functions

Value function methods are based on estimating the value (expected return) of being in a given state. The state-value function $V^\pi(s)$ is the expected return when starting in state $s$ and following $\pi$ subsequently:

$$V^\pi(s) = \mathbb{E}[R \mid s, \pi]$$

The optimal policy, $\pi^*$, has a corresponding state-value function $V^*(s)$, and vice versa; the optimal state-value function can be defined as

$$V^*(s) = \max_{\pi} V^\pi(s) \quad \forall s \in \mathcal{S}$$

If we had $V^*(s)$ available, the optimal policy could be retrieved by choosing among all actions available at $s_t$ and picking the action $a$ that maximizes $\mathbb{E}_{s_{t+1} \sim \mathcal{T}(s_{t+1} \mid s_t, a)}[V^*(s_{t+1})]$.

In the RL setting, the transition dynamics $\mathcal{T}$ are unavailable. Therefore, we construct another function, the state-action value or quality function $Q^\pi(s, a)$, which is similar to $V^\pi$, except that the initial action $a$ is provided and $\pi$ is only followed from the succeeding state onward:

$$Q^\pi(s, a) = \mathbb{E}[R \mid s, a, \pi]$$

The best policy, given $Q^\pi(s, a)$, can be found by choosing $a$ greedily at every state: $\arg\max_{a} Q^\pi(s, a)$. Under this policy, we can also define $V^\pi(s)$ by maximizing $Q^\pi(s, a)$:

$$V^\pi(s) = \max_{a} Q^\pi(s, a)$$
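For a tabular $Q$ function, the greedy action and the corresponding state value can be read off directly. The sketch below assumes $Q$ is stored as a dictionary keyed by `(state, action)`; the function names and the example values are illustrative only.

```python
def greedy_action(Q, state, actions):
    """Return argmax_a Q(state, a); ties go to the first action listed."""
    return max(actions, key=lambda a: Q[(state, a)])


def state_value(Q, state, actions):
    """Return max_a Q(state, a), the value of the state under the greedy policy."""
    return max(Q[(state, a)] for a in actions)


# Hypothetical Q-values for a single state "s" with two actions.
Q_example = {("s", 0): 0.2, ("s", 1): 0.7}
best = greedy_action(Q_example, "s", [0, 1])
value = state_value(Q_example, "s", [0, 1])
```

Note that neither function needs the transition dynamics, which is the practical reason for working with $Q$ rather than $V$ in the model-free setting.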

3.2. Dynamic Programming

To actually learn $Q^\pi$, we exploit the Markov property and define the function as a Bellman equation [6], which has the following recursive form:

$$Q^\pi(s_t, a_t) = \mathbb{E}_{s_{t+1}}[r_{t+1} + \gamma Q^\pi(s_{t+1}, \pi(s_{t+1}))]$$

This means that $Q^\pi$ can be improved by bootstrapping, i.e., we can use the current values of our estimate of $Q^\pi$ to improve our estimate. This is the foundation of Q-learning [94] and the state-action-reward-state-action (SARSA) algorithm [62]:

$$Q^\pi(s_t, a_t) \leftarrow Q^\pi(s_t, a_t) + \alpha\delta$$

where $\alpha$ is the learning rate and $\delta = Y - Q^\pi(s_t, a_t)$ the temporal difference (TD) error; here, $Y$ is a target as in a standard regression problem. SARSA, an on-policy learning algorithm, is used to improve the estimate of $Q^\pi$ by using transitions generated by the behavioral policy (the policy derived from $Q^\pi$), which results in setting $Y = r_{t+1} + \gamma Q^\pi(s_{t+1}, a_{t+1})$. Q-learning is off-policy, as $Q^\pi$ is instead updated by transitions that were not necessarily generated by the derived policy. Instead, Q-learning uses $Y = r_{t+1} + \gamma \max_{a} Q^\pi(s_{t+1}, a)$, which directly approximates $Q^*$.
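The difference between the two targets is easiest to see side by side. The following sketch assumes a tabular $Q$ stored as a dictionary keyed by `(state, action)`; the function names and the numbers in the example are assumptions made for illustration.

```python
def sarsa_target(Q, r, s_next, a_next, gamma):
    """On-policy target: bootstrap from the action actually taken in s_next."""
    return r + gamma * Q[(s_next, a_next)]


def q_learning_target(Q, r, s_next, actions, gamma):
    """Off-policy target: bootstrap from the greedy action in s_next."""
    return r + gamma * max(Q[(s_next, a)] for a in actions)


def td_update(Q, s, a, target, alpha):
    """Apply Q(s, a) <- Q(s, a) + alpha * delta and return the TD error delta."""
    delta = target - Q[(s, a)]
    Q[(s, a)] += alpha * delta
    return delta


# Hypothetical values: in state 1, action 1 looks better than action 0.
Q_table = {(0, 0): 0.0, (0, 1): 0.0, (1, 0): 1.0, (1, 1): 2.0}
y_sarsa = sarsa_target(Q_table, r=1.0, s_next=1, a_next=0, gamma=0.5)
y_qlearn = q_learning_target(Q_table, r=1.0, s_next=1, actions=[0, 1], gamma=0.5)
delta = td_update(Q_table, s=0, a=1, target=y_qlearn, alpha=0.1)
```

When the behavioral policy happens to pick the greedy action in the next state, the two targets coincide; they differ only when the agent explores.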

To find $Q^*$ from an arbitrary $Q^\pi$, we use generalized policy iteration, where policy iteration consists of policy evaluation and policy improvement. Policy evaluation improves the estimate of the value function, which can be achieved by minimizing TD errors from trajectories experienced by following the policy. As the estimate improves, the policy can naturally be improved by choosing actions greedily based on the updated value function. Instead of performing these steps separately to convergence (as in policy iteration), generalized policy iteration allows for interleaving the steps, such that progress can be made more rapidly.
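The interleaving of evaluation and improvement can be sketched on a tiny deterministic chain. To keep the example deterministic, this sketch assumes the dynamics are known and uses one sweep of Bellman backups as the (partial) evaluation step, rather than TD updates from sampled trajectories. `chain_step` and `generalized_policy_iteration` are hypothetical names.

```python
def chain_step(s, a, goal=3):
    """Deterministic chain 0..goal: action 1 moves right, action 0 moves left."""
    if s == goal:
        return s, 0.0, True                      # the goal state is absorbing
    s_next = min(s + 1, goal) if a == 1 else max(s - 1, 0)
    done = s_next == goal
    return s_next, (1.0 if done else 0.0), done  # reward 1 on reaching the goal


def generalized_policy_iteration(n_states, actions, step, gamma, sweeps):
    Q = {(s, a): 0.0 for s in range(n_states) for a in actions}
    policy = {s: actions[0] for s in range(n_states)}
    for _ in range(sweeps):
        # Partial policy evaluation: a single backup per state-action pair.
        for (s, a) in Q:
            s_next, r, done = step(s, a)
            Q[(s, a)] = r + (0.0 if done else gamma * Q[(s_next, policy[s_next])])
        # Policy improvement: act greedily with respect to the updated estimate.
        for s in policy:
            policy[s] = max(actions, key=lambda a: Q[(s, a)])
    return Q, policy


Q_gpi, policy_gpi = generalized_policy_iteration(
    n_states=4, actions=[0, 1], step=chain_step, gamma=0.9, sweeps=10
)
```

Evaluation is never run to convergence here; each sweep performs one backup per pair and then immediately improves the policy.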

3.3. Sampling

Instead of bootstrapping value functions using dynamic programming methods, Monte Carlo methods estimate the expected return from a state by averaging the return from multiple rollouts of a policy. Because of this, pure Monte Carlo methods can also be applied in non-Markovian environments. On the other hand, they can only be used in episodic MDPs, as a rollout has to terminate for the return to be calculated. It is possible to get the best of both methods by combining TD learning and Monte Carlo policy evaluation, as is done in the TD($\lambda$) algorithm [78]. Similarly to the discount factor, the $\lambda$ in TD($\lambda$) is used to interpolate between Monte Carlo evaluation and bootstrapping. As demonstrated in Figure 3, this results in an entire spectrum of RL methods based around the amount of sampling utilized.
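One way to see the interpolation is through the $\lambda$-return, the forward-view target associated with TD($\lambda$), which satisfies the recursion $G^\lambda_t = r_{t+1} + \gamma[(1-\lambda)V(s_{t+1}) + \lambda G^\lambda_{t+1}]$. The sketch below is an assumption-laden illustration rather than the algorithm as presented in [78]: it takes a fixed list of value estimates and computes targets for one finished episode. With $\lambda = 0$ it reduces to the one-step TD target and with $\lambda = 1$ to the Monte Carlo return.

```python
def lambda_returns(rewards, next_values, gamma, lam):
    """Compute lambda-returns for one episode.

    rewards[t] is r_{t+1}; next_values[t] is the estimate V(s_{t+1}),
    which should be 0.0 where s_{t+1} is terminal.
    """
    G = 0.0
    out = [0.0] * len(rewards)
    for t in reversed(range(len(rewards))):
        # Blend the bootstrapped estimate with the sampled continuation.
        G = rewards[t] + gamma * ((1.0 - lam) * next_values[t] + lam * G)
        out[t] = G
    return out


rewards = [0.0, 0.0, 1.0]
next_values = [0.5, 0.5, 0.0]   # hypothetical value estimates; last state is terminal
td_targets = lambda_returns(rewards, next_values, gamma=1.0, lam=0.0)
mc_returns = lambda_returns(rewards, next_values, gamma=1.0, lam=1.0)
```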


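Figure 3. Two dimensions of RL algorithms based on the backups used to learn or construct a policy. At the extremes of these dimensions are (a) dynamic programming, (b) exhaustive search, (c) one-step TD learning, and (d) Monte Carlo approaches. Bootstrapping extends from (c) one-step TD learning to n-step TD learning methods [78], with (d) pure Monte Carlo approaches not relying on bootstrapping at all. Another possible dimension of variation is that (c) and (d) sample actions, whereas (a) and (b) take the expectation over all choices. (Figure recreated based on [78].)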

Another major value-function-based method relies on learning the advantage function $A^\pi(s, a)$ [3]. Unlike producing absolute state-action values, as with $Q^\pi$, $A^\pi$ instead represents relative state-action values. Learning relative values is akin to removing a baseline or average level of a signal; more intuitively, it is easier to learn that one action has better consequences than another than it is to learn the actual return from taking the action. $A^\pi$ represents a relative advantage of actions through the simple relationship $A^\pi = Q^\pi - V^\pi$ and is also closely related to the baseline method of variance reduction within gradient-based policy search methods [97]. The idea of advantage updates has been utilized in many recent DRL algorithms [19], [48], [71], [92].
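A small numerical example may help with the relationship $A^\pi = Q^\pi - V^\pi$. The sketch below assumes a tabular $Q^\pi$ and a stochastic policy given as action probabilities, and uses the standard identity $V^\pi(s) = \sum_a \pi(a \mid s)\, Q^\pi(s, a)$; the function name and the values are illustrative.

```python
def advantages(Q, policy_probs, state, actions):
    """Return {a: A(state, a)} with A = Q - V and V = sum_a pi(a | s) * Q(s, a)."""
    v = sum(policy_probs[a] * Q[(state, a)] for a in actions)
    return {a: Q[(state, a)] - v for a in actions}


# Hypothetical values: two actions with returns 1 and 3 under a uniform policy.
Q_adv = {(0, 0): 1.0, (0, 1): 3.0}
pi = {0: 0.5, 1: 0.5}
adv = advantages(Q_adv, pi, state=0, actions=[0, 1])
```

The advantages average to zero under the policy, which is the sense in which a baseline has been removed.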

3.4. Policy Search

Policy search methods do not need to maintain a value function model but directly search for an optimal policy $\pi^*$. Typically, a parameterized policy $\pi_\theta$ is chosen, whose parameters are updated to maximize the expected return $\mathbb{E}[R \mid \theta]$ using either gradient-based or gradient-free optimization [12]. Neural networks that encode policies have been successfully trained using both gradient-free [17], [33] and gradient-based [22], [41], [44], [70], [71], [96], [97] methods. Gradient-free optimization can effectively cover low-dimensional parameter spaces, but, despite some successes in applying them to large networks [33], gradient-based training remains the method of choice for most DRL algorithms, being more sample efficient when policies possess a large number of parameters.

When constructing the policy directly, it is common to output parameters for a probability distribution; for continuous actions, this could be the mean and standard deviations of Gaussian distributions, while for discrete actions this could be the individual probabilities of a multinomial distribution. The result is a stochastic policy from which we can directly sample actions. With gradient-free methods, finding better policies requires a heuristic search across a predefined class of models. Methods such as evolution strategies essentially perform hill climbing in a subspace of policies [65], while more complex methods, such as compressed network search, impose additional inductive biases [33]. Perhaps the greatest advantage of gradient-free policy search is that it can also optimize nondifferentiable policies.
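The two output parameterizations mentioned above can be sketched as follows. The code assumes the policy network has already produced either logits (discrete case) or per-dimension means and standard deviations (continuous case); the network itself is omitted, and all function names are hypothetical.

```python
import math
import random


def softmax(logits):
    """Convert unnormalized scores into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def sample_discrete(logits, rng):
    """Sample an action index from the distribution defined by the logits."""
    probs = softmax(logits)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def sample_gaussian(mean, std, rng):
    """Sample a continuous action from independent Gaussians, one per dimension."""
    return [rng.gauss(m, s) for m, s in zip(mean, std)]


rng = random.Random(0)
discrete_action = sample_discrete([1.0, 2.0, 3.0], rng)
continuous_action = sample_gaussian([0.0, 1.0], [0.1, 0.2], rng)
```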


3.4.1. Policy Gradients

Gradients can provide a strong learning signal as to how to improve a parameterized policy. However, to compute the expected return we need to average over plausible trajectories induced by the current policy parameterization. This averaging requires either deterministic approximations (e.g., linearization) or stochastic approximations via sampling [12]. Deterministic approximations can be only applied in a model-based setting where a model of the underlying transition dynamics is available. In the more common model-free RL setting, a Monte Carlo estimate of the expected return is determined. For gradient-based learning, this Monte Carlo approximation poses a challenge since gradients cannot pass through these samples of a stochastic function. Therefore, we turn to an estimator of the gradient, known in RL as the REINFORCE rule [97]. Intuitively, gradient ascent using the estimator increases the log probability of the sampled action, weighted by the return. More formally, the REINFORCE rule can be used to compute the gradient of an expectation over a function $f$ of a random variable $X$ with respect to parameters $\theta$:

$$\nabla_\theta \mathbb{E}_X[f(X; \theta)] = \mathbb{E}_X[f(X; \theta)\nabla_\theta \log p(X)]$$

As this computation relies on the empirical return of a trajectory, the resulting gradients possess a high variance. By introducing unbiased estimates that are less noisy, it is possible to reduce the variance. The general methodology for performing this is to subtract a baseline, which means weighting updates by an advantage rather than the pure return. The simplest baseline is the average return taken over several episodes [97], but there are many more options available [71].
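Both claims, that the score-function estimator recovers the gradient and that subtracting a baseline leaves its expectation unchanged, can be checked exactly on a small discrete example. The sketch below makes several assumptions for illustration: $X$ is one of three actions drawn from a softmax distribution with logits $\theta$, $f$ is a fixed reward per action (so $f$ itself does not depend on $\theta$), and the expectation is enumerated rather than sampled. All names and reward values are hypothetical.

```python
import math


def softmax(theta):
    m = max(theta)
    exps = [math.exp(x - m) for x in theta]
    z = sum(exps)
    return [e / z for e in exps]


def expected_return(theta, rewards):
    return sum(p * r for p, r in zip(softmax(theta), rewards))


def score_function_gradient(theta, rewards, baseline=0.0):
    """Exact expectation of (f(x) - b) * grad_theta log p(x), enumerated over x."""
    probs = softmax(theta)
    grad = [0.0] * len(theta)
    for x, p_x in enumerate(probs):
        for k in range(len(theta)):
            # For a softmax, d log p(x) / d theta_k = 1[k == x] - p_k.
            grad_log_p = (1.0 if k == x else 0.0) - probs[k]
            grad[k] += p_x * (rewards[x] - baseline) * grad_log_p
    return grad


def finite_difference_gradient(theta, rewards, eps=1e-6):
    """Central-difference approximation of grad_theta E[f(X)], used as a reference."""
    grad = []
    for k in range(len(theta)):
        up, down = list(theta), list(theta)
        up[k] += eps
        down[k] -= eps
        grad.append((expected_return(up, rewards) - expected_return(down, rewards)) / (2 * eps))
    return grad


theta = [0.1, -0.3, 0.5]
rewards = [0.0, 1.0, 0.2]
g_plain = score_function_gradient(theta, rewards)
g_baseline = score_function_gradient(theta, rewards, baseline=expected_return(theta, rewards))
g_reference = finite_difference_gradient(theta, rewards)
```

In practice the expectation is replaced by samples, and that is where the baseline matters: it leaves the mean of the estimator unchanged while reducing its variance.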


3.4.2. Actor-Critic Methods

It is possible to combine value functions with an explicit representation of the policy, resulting in actor-critic methods, as shown in Figure 4. The “actor” (policy) learns by using feedback from the “critic” (value function). In doing so, these methods trade off variance reduction of policy gradients with bias introduction from value function methods [32], [71].



Actor-critic methods use the value function as a baseline for policy gradients, such that the only fundamental difference between actor-critic methods and other baseline methods is that actor-critic methods utilize a learned value function. For this reason, we will later discuss actor-critic methods as a subset of policy gradient methods.
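A single update of a very simple actor-critic can be sketched as below. This is one common variant, chosen here as an assumption rather than taken from the survey: a tabular critic `V`, per-state softmax logits `theta` for the actor, and the one-step TD error used as the advantage estimate. All names and numbers are hypothetical.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def actor_critic_step(theta, V, s, a, r, s_next, done, gamma, lr_actor, lr_critic):
    """Update the critic V and the actor logits theta in place; return the TD error."""
    target = r + (0.0 if done else gamma * V[s_next])
    delta = target - V[s]                 # TD error, used as the advantage estimate
    V[s] += lr_critic * delta             # critic: move V(s) toward the target
    probs = softmax(theta[s])
    for k in range(len(theta[s])):        # actor: delta * grad log pi(a | s)
        grad_log_pi = (1.0 if k == a else 0.0) - probs[k]
        theta[s][k] += lr_actor * delta * grad_log_pi
    return delta


theta = {0: [0.0, 0.0]}
V = {0: 0.0, 1: 1.0}
delta = actor_critic_step(theta, V, s=0, a=1, r=0.5, s_next=1, done=False,
                          gamma=0.9, lr_actor=0.1, lr_critic=0.1)
```

Because the critic is learned, its errors bias the actor's update, which is the bias-variance trade-off noted above.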


3.5. Planning and Learning

Given a model of the environment, it is possible to use dynamic programming over all possible actions [Figure 3(a)], sample trajectories for heuristic search (as was done by AlphaGo [73]), or even perform an exhaustive search [Figure 3(b)]. Sutton and Barto [78] define planning as any method that utilizes a model to produce or improve a policy. This includes distribution models, which include $\mathcal{T}$ and $\mathcal{R}$, and sample models, from which only samples of transitions can be drawn.

In RL, we focus on learning without access to the underlying model of the environment. However, interactions with the environment could be used to learn value functions, policies, and also a model. Model-free RL methods learn directly from interactions with the environment, but model-based RL methods can simulate transitions using the learned model, resulting in increased sample efficiency. This is particularly important in domains where each interaction with the environment is expensive. However, learning a model introduces extra complexities, and there is always the danger of suffering from model errors, which in turn affects the learned policy. Although deep neural networks can potentially produce very complex and rich models [14], [55], [75], sometimes simpler, more data-efficient methods are preferable [19]. These considerations also play a role in actor-critic methods with learned value functions [32], [71].
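As a minimal illustration of learning a model from interaction, the sketch below keeps transition counts and average rewards in tables and then draws simulated transitions from them. It is a count-based stand-in, assumed here for brevity, and not one of the neural network models cited above; `TabularModel` and its methods are hypothetical names.

```python
import random
from collections import defaultdict


class TabularModel:
    """Count-based estimate of the transition dynamics and mean rewards."""

    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))
        self.reward_sums = defaultdict(float)

    def update(self, s, a, s_next, r):
        """Record one real transition (s, a, s_next, r)."""
        self.counts[(s, a)][s_next] += 1
        self.reward_sums[(s, a, s_next)] += r

    def transition_probs(self, s, a):
        """Empirical next-state distribution for (s, a); empty if never visited."""
        total = sum(self.counts[(s, a)].values())
        return {s2: n / total for s2, n in self.counts[(s, a)].items()}

    def sample(self, s, a, rng):
        """Simulate a transition; only valid for (s, a) pairs seen at least once."""
        probs = self.transition_probs(s, a)
        s_next = rng.choices(list(probs), weights=list(probs.values()), k=1)[0]
        mean_r = self.reward_sums[(s, a, s_next)] / self.counts[(s, a)][s_next]
        return s_next, mean_r


model = TabularModel()
for s_next, r in [(1, 1.0), (1, 1.0), (1, 1.0), (0, 0.0)]:
    model.update(0, 1, s_next, r)
simulated = model.sample(0, 1, random.Random(0))
```

Simulated transitions like `simulated` can then be fed to any of the model-free updates, which is where the gain in sample efficiency comes from; with few real samples the estimated probabilities can be far off, which is the model-error risk mentioned above.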


3.6. The Rise of DRL

Many of the successes in DRL have been based on scaling up prior work in RL to high-dimensional problems. This is due to the learning of low-dimensional feature representations and the powerful function approximation properties of neural networks. By means of representation learning, DRL can deal efficiently with the curse of dimensionality, unlike tabular and traditional nonparametric methods [7]. For instance, convolutional neural networks (CNNs) can be used as components of RL agents, allowing them to learn directly from raw, high-dimensional visual inputs. In general, DRL is based on training deep neural networks to approximate the optimal policy $\pi^*$ and/or the optimal value functions $V^*$, $Q^*$, and $A^*$.

DRL的许多成功都建立在将RL的工作扩展到高维问题的基础上。这是由于神经网络具有低维特征表示的学习和强大的函数近似特性。与传统的非参数方法和表格方法不同,DRL通过表示学习可以有效地处理维数灾难[7]。例如,卷积神经网络(CNNs)可以用作RL中Agent的组件,允许它们直接从原始的高维视觉输入中学习。一般来说,DRL基于训练深度神经网络来得到最优策略 π ∗ \pi^∗ π和最优值函数 V ∗ V^∗ V Q ∗ Q^∗ Q A ∗ A^∗ A

4. Value Functions

The well-known function approximation properties of neural networks led naturally to the use of deep learning to regress functions for use in RL agents. Indeed, one of the earliest success stories in RL is TD-Gammon, a neural network that reached expert-level performance in backgammon in the early 1990s [81]. Using TD methods, the network took in the state of the board to predict the probability of black or white winning. Although this simple idea has been echoed in later work [73], progress in RL research has favored the explicit use of value functions, which can capture the structure underlying the environment. From early value function methods in DRL, which took simple states as input [61], current methods are now able to tackle visually and conceptually complex environments [47], [48], [70], [100].


4.1. Function Approximation and the DQN

We begin our survey of value-function-based DRL algorithms with the DQN [47], illustrated in Figure 5, which achieved scores across a wide range of classic Atari 2600 video games [5] that were comparable to those of a professional video game tester. The inputs to the DQN are four gray-scale frames of the game, concatenated over time, which are initially processed by several convolutional layers to extract spatiotemporal features, such as the movement of the ball in Pong or Breakout. The final feature map from the convolutional layers is processed by several fully connected layers, which more implicitly encode the effects of actions. This contrasts with more traditional controllers that use fixed preprocessing steps, which are therefore unable to adapt their processing of the state in response to the learning signal.



A forerunner of the DQN, neural-fitted Q (NFQ) iteration, involved training a neural network to return the Q-value given a state-action pair [61]. NFQ was later extended to train a network to drive a slot car using raw visual inputs from a camera over the race track, by combining a deep autoencoder to reduce the dimensionality of the inputs with a separate branch to predict Q-values [38]. Although the previous network could have been trained for both reconstruction and RL tasks simultaneously, it was both more reliable and more computationally efficient to train the two parts of the network sequentially.

The DQN [47] is closely related to the model proposed by Lange et al. [38] but was the first RL algorithm that was demonstrated to work directly from raw visual inputs and on a wide variety of environments. It was designed such that the final fully connected layer outputs $Q^\pi(s, \cdot)$ for all action values in a discrete set of actions (in this case, the various directions of the joystick and the fire button). This not only enables the best action, $\arg\max_a Q^\pi(s, a)$, to be chosen after a single forward pass of the network, but also allows the network to more easily encode action-independent knowledge in the lower, convolutional layers. With merely the goal of maximizing its score on a video game, the DQN learns to extract salient visual features, jointly encoding objects, their movements, and, most importantly, their interactions. Using techniques originally developed for explaining the behavior of CNNs in object recognition tasks, we can also inspect what parts of its view the agent considers important (see Figure 6).


The true underlying state of the game is contained within 128 bytes of Atari 2600 random-access memory. However, the DQN was designed to directly learn from visual inputs (210×160 pixel 8-bit RGB images), which it takes as the state $s$. It is impractical to represent $Q^\pi(s, a)$ exactly as a lookup table: each of the $3 \times 210 \times 160$ color values can take 256 levels, so when combined with 18 possible actions, we obtain a Q-table of size $|S| \times |A| = 18 \times 256^{3 \times 210 \times 160}$. Even if it were feasible to create such a table, it would be sparsely populated, and information gained from one state-action pair cannot be propagated to other state-action pairs. The strength of the DQN lies in its ability to compactly represent both high-dimensional observations and the Q-function using deep neural networks. Without this ability, tackling the discrete Atari domain from raw visual inputs would be impractical.
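To get a feel for the size of such a lookup table, the short sketch below works in logarithms, since the number itself is far too large to print. The function name and its default arguments are chosen for this example; only the image dimensions, bit depth, and action count are taken from the text.

```python
import math


def log10_q_table_entries(height=210, width=160, channels=3, levels=256, n_actions=18):
    """Return log10(|S| x |A|) when every distinct image counts as one state."""
    n_color_values = height * width * channels  # 100,800 values per frame
    return n_color_values * math.log10(levels) + math.log10(n_actions)


log_entries = log10_q_table_entries()
digits = math.floor(log_entries) + 1
print(f"|S| x |A| has roughly {digits:,} decimal digits")
```

For comparison, the 128 bytes of console memory that hold the true game state admit at most $256^{128}$ configurations, a number with only a few hundred digits.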

The DQN addressed the fundamental instability problem of using function approximation in RL [83] by the use of two techniques: experience replay [45] and target networks. Experience replay memory stores transitions of the form $(s_t, a_t, s_{t+1}, r_{t+1})$ in a cyclic buffer, enabling the RL agent to sample from and train on previously observed data offline. Not only does this massively reduce the number of interactions needed with the environment, but batches of experience can be sampled, reducing the variance of learning updates. Furthermore, by sampling uniformly from a large memory, the temporal correlations that can adversely affect RL algorithms are broken. Finally, from a practical perspective, batches of data can be efficiently processed in parallel by modern hardware, increasing throughput. While the original DQN algorithm used uniform sampling [47], later work showed that prioritizing samples based on TD errors is more effective for learning [67].
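A cyclic replay memory with uniform sampling can be sketched in a few lines. The class and method names below are hypothetical, and the sketch covers only the uniform-sampling variant described for the original DQN, not prioritized replay [67].

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity cyclic buffer of (s_t, a_t, s_{t+1}, r_{t+1}) transitions."""

    def __init__(self, capacity):
        # A deque with maxlen silently drops the oldest item once it is full.
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, next_state, reward):
        self.buffer.append((state, action, next_state, reward))

    def sample(self, batch_size, rng=random):
        # Uniform sampling without replacement breaks temporal correlations.
        return rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


memory = ReplayBuffer(capacity=3)
for t in range(5):
    memory.push(state=t, action=0, next_state=t + 1, reward=1.0)

batch = memory.sample(2, rng=random.Random(0))
print(len(memory), batch)
```

After five pushes into a buffer of capacity three, only the three most recent transitions remain, which is the cyclic behavior the text describes.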

The second stabilizing method, introduced by Mnih et al. [47], is the use of a target network that initially contains the weights of the network enacting the policy but is kept frozen for a large period of time. Rather than having to calculate the TD error based on its own rapidly fluctuating estimates of the Q-values, the policy network uses the fixed target network. During training, the weights of the target network are updated to match the policy network after a fixed number of steps. Both experience replay and target networks have gone on to be used in subsequent DRL works [19], [44], [50], [93].
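The interplay between the two networks can be sketched as follows. The parameter container (a plain dictionary), the class name `TargetNetworkSync`, and the synchronization period are placeholders for this example; a real implementation would copy neural network weights instead.

```python
import copy


def td_target(reward, target_next_q_values, gamma=0.99, done=False):
    """One-step Q-learning target built from the frozen target network's outputs."""
    if done:
        return reward
    return reward + gamma * max(target_next_q_values)


class TargetNetworkSync:
    """Holds a frozen copy of the policy parameters, refreshed every `period` steps."""

    def __init__(self, policy_params, period):
        self.period = period
        self.steps = 0
        self.target_params = copy.deepcopy(policy_params)

    def step(self, policy_params):
        self.steps += 1
        if self.steps % self.period == 0:
            # Hard update: overwrite the target with the current policy weights.
            self.target_params = copy.deepcopy(policy_params)
        return self.target_params


params = {"w": [0.0]}
sync = TargetNetworkSync(params, period=3)
params["w"][0] = 1.0  # the policy network keeps changing during training
seen = [sync.step(params)["w"][0] for _ in range(3)]
print(seen, td_target(1.0, [0.5, 2.0], gamma=0.9))
```

The target stays at its old value for the first two steps and only picks up the new weights on the third, so the regression target does not chase the rapidly changing policy network.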


4.2. Q-Function Modifications

Considering that one of the key components of the DQN is a function approximator for the Q-function, it can benefit from fundamental advances in RL. In [86], van Hasselt showed that the single estimator used in the Q-learning update rule overestimates the expected return due to the use of the maximum action value as an approximation of the maximum expected action value. Double-Q learning provides a better estimate through the use of a double estimator [86]. While double-Q learning requires an additional function to be learned, later work proposed using the already available target network from the DQN algorithm, yielding significantly better results with only a small change in the update step [87].
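The "small change in the update step" can be illustrated by comparing the two targets side by side. This is a sketch of the commonly described double DQN target, in which the online network selects the next action and the target network evaluates it; the function names and the example numbers are made up for illustration.

```python
def q_learning_target(reward, q_target_next, gamma=0.99, done=False):
    """Standard target: the same estimator both selects and evaluates the action."""
    if done:
        return reward
    return reward + gamma * max(q_target_next)


def double_q_target(reward, q_online_next, q_target_next, gamma=0.99, done=False):
    """Double estimator: select with the online network, evaluate with the target."""
    if done:
        return reward
    best_action = max(range(len(q_online_next)), key=q_online_next.__getitem__)
    return reward + gamma * q_target_next[best_action]


# Hypothetical next-state estimates from the two networks.
q_online_next = [1.0, 3.0, 2.0]
q_target_next = [5.0, 0.5, 1.0]

standard = q_learning_target(0.0, q_target_next, gamma=1.0)
double = double_q_target(0.0, q_online_next, q_target_next, gamma=1.0)
print(standard, double)
```

When the target network happens to overestimate one action (here the first), the standard target inherits that overestimate, whereas the double estimator only uses it if the online network independently agrees on the action.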

Yet another way to adjust the DQN architecture is to decompose the Q-function into meaningful functions, such as constructing $Q^\pi$ by adding together separate layers that compute the state-value function $V^\pi$ and advantage function $A^\pi$ [92]. Rather than having to come up with accurate Q-values for all actions, the duelling DQN [92] benefits from a single baseline for the state in the form of $V^\pi$ and easier-to-learn relative values in the form of $A^\pi$. The combination of the duelling DQN with prioritized experience replay [67] is one of the state-of-the-art techniques in discrete action settings. Further insight into the properties of $A^\pi$ by Gu et al. [19] led them to modify the DQN with a convex advantage layer that extended the algorithm to work over sets of continuous actions, creating the normalized advantage function (NAF) algorithm. Benefiting from experience replay, target networks, and advantage updates, NAF is one of several state-of-the-art techniques in continuous control problems [19].
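The way the two streams are combined can be sketched as below. Subtracting the mean advantage before adding the state value is how the duelling architecture is usually described; that detail is not stated in this survey, so treat it as an assumption of the example, along with the function name.

```python
def duelling_q_values(state_value, advantages):
    """Combine a scalar V(s) with per-action advantages A(s, a) into Q(s, a).

    Centering the advantages makes the split between V and A identifiable:
    adding a constant to every advantage no longer changes the output.
    """
    mean_advantage = sum(advantages) / len(advantages)
    return [state_value + (a - mean_advantage) for a in advantages]


q_values = duelling_q_values(2.0, [1.0, 0.0, -1.0])
shifted = duelling_q_values(2.0, [11.0, 10.0, 9.0])
print(q_values, shifted)
```

Both calls give the same Q-values, and the ranking of actions is determined entirely by the advantage stream, while the value stream provides the shared baseline.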

5. Policy Search

Policy search methods aim to directly find policies by means of gradient-free or gradient-based methods. Prior to the current surge of interest in DRL, several successful methods in DRL eschewed the commonly used back propagation algorithm in favor of evolutionary algorithms [17], [33], which are gradient-free policy search algorithms. Evolutionary methods rely on evaluating the performance of a population of agents. Hence, they are expensive for large populations or agents with many parameters. However, as black-box optimization methods, they can be used to optimize arbitrary, non-differentiable models and naturally allow for more exploration in the parameter space. In combination with a compressed representation of neural network weights, evolutionary algorithms can even be used to train large networks; such a technique resulted in the first deep neural network to learn an RL task, straight from high-dimensional visual inputs [33]. Recent work has reignited interest in evolutionary methods for RL as they can potentially be distributed at larger scales than techniques that rely on gradients [65].


5.1. Backpropagation Through Stochastic Functions

The workhorse of DRL, however, remains backpropagation. The previously discussed REINFORCE rule [97] allows neural networks to learn stochastic policies in a task-dependent manner, such as deciding where to look in an image to track [69] or caption [99] objects. In these cases, the stochastic variable would determine the coordinates of a small crop of the image and hence reduce the amount of computation needed. This usage of RL to make discrete, stochastic decisions over inputs is known in the deep-learning literature as hard attention and is one of the more compelling uses of basic policy search methods in recent years, having many applications outside of traditional RL domains.


5.2. Compounding Errors

Searching directly for a policy represented by a neural network with very many parameters can be difficult and suffer from severe local minima. One way around this is to use guided policy search (GPS), which takes a few sequences of actions from another controller (which could be constructed using a separate method, such as optimal control). GPS learns from them by using supervised learning in combination with importance sampling, which corrects for off-policy samples [40]. This approach effectively biases the search toward a good (local) optimum. GPS works in a loop, by optimizing policies to match sampled trajectories and optimizing trajectory distributions to match the policy and minimize costs. Levine et al. [41] showed that it was possible to train visuo-motor policies for a robot “end to end,” straight from the RGB pixels of the camera to motor torques, and, hence, provide one of the seminal works in DRL.


A more commonly used method is to use a trust region, in which optimization steps are restricted to lie within a region where the approximation of the true cost function still holds. By preventing updated policies from deviating too wildly from previous policies, the chance of a catastrophically bad update is lessened, and many algorithms that use trust regions guarantee or practically result in monotonic improvement in policy performance. The idea of constraining each policy gradient update, as measured by the Kullback-Leibler (KL) divergence between the current and proposed policy, has a long history in RL [28]. One of the newer algorithms in this line of work, TRPO, has been shown to be relatively robust and applicable to domains with high-dimensional inputs [70]. To achieve this, TRPO optimizes a surrogate objective function: specifically, it optimizes an (importance sampled) advantage estimate, constrained using a quadratic approximation of the KL divergence. While TRPO can be used as a pure policy gradient method with a simple baseline, later work by Schulman et al. [71] introduced generalized advantage estimation (GAE), which proposed several more advanced variance reduction baselines. The combination of TRPO and GAE remains one of the state-of-the-art RL techniques in continuous control.
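The two quantities involved, the importance-sampled advantage estimate and the KL divergence between old and new policies, can be sketched for a discrete action space as follows. This is not the TRPO algorithm itself: it uses the exact KL divergence rather than the quadratic approximation, performs no optimization, and the function names and the threshold `max_kl` are assumptions for illustration.

```python
import math


def surrogate_objective(new_probs, old_probs, advantages):
    """Importance-sampled advantage estimate: mean of (pi_new / pi_old) * A.

    new_probs[i] and old_probs[i] are the probabilities each policy assigns
    to the action actually taken in sample i.
    """
    ratios = [p_new / p_old for p_new, p_old in zip(new_probs, old_probs)]
    return sum(r * a for r, a in zip(ratios, advantages)) / len(advantages)


def kl_divergence(p, q):
    """KL(p || q) for two discrete action distributions over the same actions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def within_trust_region(old_dist, new_dist, max_kl=0.01):
    """Accept a proposed policy only if it stays close to the old one."""
    return kl_divergence(old_dist, new_dist) <= max_kl


old_dist = [0.5, 0.5]
small_step = [0.52, 0.48]
large_step = [0.9, 0.1]
print(within_trust_region(old_dist, small_step),
      within_trust_region(old_dist, large_step))
```

A proposed policy that raises the surrogate objective but fails the KL check would be rejected or scaled back, which is what keeps updates from deviating too far from the previous policy.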

5.3. Actor-Critic Methods

Actor-critic approaches have grown in popularity as an effective means of combining the benefits of policy search methods with learned value functions, which are able to learn from full returns and/or TD errors. They can benefit from improvements in both policy gradient methods, such as GAE [71], and value function methods, such as target networks [47]. In the last few years, DRL actor-critic methods have been scaled up from learning simulated physics tasks [22], [44] to real robotic visual navigation tasks [100], directly from image pixels.


One recent development in the context of actor-critic algorithms is deterministic policy gradients (DPGs) [72], which extend the standard policy gradient theorems for stochastic policies [97] to deterministic policies. One of the major advantages of DPGs is that, while stochastic policy gradients integrate over both state and action spaces, DPGs only integrate over the state space, requiring fewer samples in problems with large action spaces. In the initial work on DPGs, Silver et al. [72] introduced and demonstrated an off-policy actor-critic algorithm that vastly improved upon a stochastic policy gradient equivalent in high-dimensional continuous control problems. Later work introduced deep DPG, which utilized neural networks to operate on high-dimensional, visual state spaces [44]. In the same vein as DPGs, Heess et al. [22] devised a method for calculating gradients to optimize stochastic policies by “reparameterizing” [30], [60] the stochasticity away from the network, thereby allowing standard gradients to be used (instead of the high-variance REINFORCE estimator [97]). The resulting stochastic value gradient (SVG) methods are flexible and can be used both with (SVG(0) and SVG(1)) and without (SVG($\infty$)) value function critics, and with (SVG($\infty$) and SVG(1)) and without (SVG(0)) models. Later work proceeded to integrate DPGs and SVGs with RNNs, allowing them to solve continuous control problems in POMDPs, learning directly from pixels [21]. Together, DPGs and SVGs can be considered algorithmic approaches for improving learning efficiency in DRL.

An orthogonal approach to speeding up learning is to exploit parallel computation. By keeping a canonical set of parameters that are read by and updated in an asynchronous fashion by multiple copies of a single network, computation can be efficiently distributed over both processing cores in a single central processing unit (CPU), and across CPUs in a cluster of machines. Using a distributed system, Nair et al. [51] developed a framework for training multiple DQNs in parallel, achieving both better performance and a reduction in training time. However, the simpler asynchronous advantage actor-critic (A3C) algorithm [48], developed for both single and distributed machine settings, has become one of the most popular DRL techniques in recent times. A3C combines advantage updates with the actor-critic formulation and relies on asynchronously updated policy and value function networks trained in parallel over several processing threads. The use of multiple agents, situated in their own, independent environments, not only stabilizes improvements in the parameters, but conveys an additional benefit in allowing for more exploration to occur. A3C has been used as a standard starting point in many subsequent works, including the work of Zhu et al. [100], who applied it to robotic navigation in the real world through visual inputs.


There have been several major advancements on the original A3C algorithm that reflect various motivations in the field of DRL. The first is actor-critic with experience replay [93], which adds off-policy bias correction to A3C, allowing it to use experience replay to improve sample complexity. Others have attempted to bridge the gap between value and policy-based RL, utilizing theoretical advancements to improve upon the original A3C [50], [54]. Finally, there is a growing trend toward exploiting auxiliary tasks to improve the representations learned by DRL agents and, hence, improve both the learning speed and final performance of these agents [26], [46].


6. Current Research and Challenges

To conclude, we will highlight some current areas of research in DRL and the challenges that still remain. So far, we have focused mainly on model-free methods, but we will now examine a few model-based DRL algorithms in more detail. Model-based RL algorithms play an important role in making RL data efficient and in trading off exploration and exploitation. After tackling exploration strategies, we shall then address hierarchical RL (HRL), which imposes an inductive bias on the final policy by explicitly factorizing it into several levels. When available, trajectories from other controllers can be used to bootstrap the learning process, leading us to imitation learning and inverse RL (IRL). For the final topic, we will look at multiagent systems, which have their own special considerations.

6.1. Model-Based RL

The key idea behind model-based RL is to learn a transition model that allows for simulation of the environment without interacting with the environment directly. Model-based RL does not assume specific prior knowledge. However, in practice, we can incorporate prior knowledge (e.g., physics-based models [29]) to speed up learning. Model learning plays an important role in reducing the number of required interactions with the (real) environment, which may be limited in practice. For example, it is unrealistic to perform millions of experiments with a robot in a reasonable amount of time and without significant hardware wear and tear. There are various approaches to learn predictive models of dynamical systems using pixel information. Based on the deep dynamical model [90], where high-dimensional observations are embedded into a lower-dimensional space using autoencoders, several model-based DRL algorithms have been proposed for learning models and policies from pixel information [55], [91], [95]. If a sufficiently accurate model of the environment can be learned, then even simple controllers can be used to control a robot directly from camera images [14]. Learned models can also be used to guide exploration purely based on simulation of the environment, with deep models allowing these techniques to be scaled up to high-dimensional visual domains [75].


Although deep neural networks can make reasonable predictions in simulated environments over hundreds of time steps [10], they typically require many samples to tune the large number of parameters they contain. Training these models often requires more samples (interaction with the environment) than simpler models. For this reason, Gu et al. [19] train locally linear models for use with the NAF algorithm—the continuous equivalent of the DQN [47]—to improve the algorithm’s sample complexity in the robotic domain where samples are expensive. It seems likely that the usage of deep models in model-based DRL could be massively spurred by general advances in improving the data efficiency of neural networks.


6.2. Exploration Versus Exploitation

One of the greatest difficulties in RL is the fundamental dilemma of exploration versus exploitation: When should the agent try out (perceived) nonoptimal actions to explore the environment (and potentially improve the model), and when should it exploit the optimal action to make useful progress? Off-policy algorithms, such as the DQN [47], typically use the simple $\epsilon$-greedy exploration policy, which chooses a random action with probability $\epsilon \in [0, 1]$, and the optimal action otherwise. By decreasing $\epsilon$ over time, the agent progresses toward exploitation. Although adding independent noise for exploration is usable in continuous control problems, more sophisticated strategies inject noise that is correlated over time (e.g., from stochastic processes) to better preserve momentum [44].
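An $\epsilon$-greedy policy with a decaying $\epsilon$ can be sketched as below. The linear schedule, its start and end values, and the function names are assumptions for this example; the text only states that $\epsilon$ is decreased over time.

```python
import random


def epsilon_at(step, start=1.0, end=0.1, decay_steps=1000):
    """Linearly anneal epsilon from `start` to `end` over `decay_steps` steps."""
    fraction = min(step / decay_steps, 1.0)
    return start + fraction * (end - start)


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a uniformly random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)


q_values = [0.1, 0.9, 0.3]
rng = random.Random(0)
early = [epsilon_greedy(q_values, epsilon_at(0), rng) for _ in range(5)]
greedy = epsilon_greedy(q_values, 0.0, rng)
print(early, greedy)
```

With $\epsilon = 1$ every action is random, and with $\epsilon = 0$ the agent always exploits its current estimate; the schedule moves the agent between these two extremes.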

The observation that temporal correlation is important led Osband et al. [56] to propose the bootstrapped DQN, which maintains several Q-value “heads” that learn different values through a combination of different weight initializations and bootstrapped sampling from experience replay memory. At the beginning of each training episode, a different head is chosen, leading to temporally extended exploration. Usunier et al. [85] later proposed a similar method that performed exploration in policy space by adding noise to a single output head, using zero-order gradient estimates to allow back propagation through the policy.


One of the main principled exploration strategies is the upper confidence bound (UCB) algorithm, based on the principle of “optimism in the face of uncertainty” [36]. The idea behind UCB is to pick actions that maximize $\mathbb{E}[R] + k\sigma[R]$, where $\sigma[R]$ is the standard deviation of the return and $k > 0$. UCB therefore encourages exploration in regions with high uncertainty and moderate expected return. While easily achievable in small tabular cases, the use of powerful density models has allowed this algorithm to scale to high-dimensional visual domains with DRL [4].
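In a small tabular setting, the selection rule can be sketched directly from observed returns. Estimating $\mathbb{E}[R]$ and $\sigma[R]$ from a list of sampled returns per action is an assumption of this example, as are the function name and the sample values.

```python
import statistics


def ucb_action(returns_per_action, k=1.0):
    """Choose the action maximizing mean return plus k standard deviations."""
    scores = [statistics.mean(r) + k * statistics.pstdev(r)
              for r in returns_per_action]
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical returns: action 0 is reliable, action 1 is uncertain.
observed = [[1.2, 1.2], [0.0, 2.0]]
greedy_choice = ucb_action(observed, k=0.0)
optimistic_choice = ucb_action(observed, k=1.0)
print(greedy_choice, optimistic_choice)
```

With $k = 0$ the rule reduces to picking the highest mean, while a positive $k$ favors the action whose return is more uncertain even though its mean is slightly lower.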

UCB can also be considered one way of implementing intrinsic motivation, which is a general concept that advocates decreasing uncertainty/making progress in learning about the environment [68]. There have been several DRL algorithms that try to implement intrinsic motivation via minimizing model prediction error [57], [75] or maximizing information gain [25], [49].


6.3. Hierarchical RL

In the same way that deep learning relies on hierarchies of features, HRL relies on hierarchies of policies. Early work in this area introduced options, in which, apart from primitive actions (single time-step actions), policies could also run other policies (multitime-step “actions”) [79]. This approach allows top-level policies to focus on higher-level goals, while subpolicies are responsible for fine control. Several works in DRL have attempted HRL by using one top-level policy that chooses between subpolicies, where the division of states or goals into subpolicies is achieved either manually [1], [34], [82] or automatically [2], [88], [89]. One way to help construct subpolicies is to focus on discovering and reaching goals, which are specific states in the environment; they may often be locations to which an agent should navigate. Whether utilized with HRL or not, the discovery and generalization of goals is also an important area of ongoing research [35], [66], [89].


6.4. Imitation Learning and Inverse RL

One may ask why, if given a sequence of “optimal” actions from expert demonstrations, it is not possible to use supervised learning in a straightforward manner—a case of “learning from demonstration.” This is indeed possible and is known as behavioral cloning in traditional RL literature. Taking advantage of the stronger signals available in supervised learning problems, behavioral cloning enjoyed success in earlier neural network research, with the most notable success being ALVINN, one of the earliest autonomous cars [59]. However, behavioral cloning cannot adapt to new situations, and small deviations from the demonstration during the execution of the learned policy can compound and lead to scenarios where the policy is unable to recover. A more generalizable solution is to use provided trajectories to guide the learning of suitable state-action pairs but fine-tune the agent using RL [23].


The goal of IRL is to estimate an unknown reward function from observed trajectories that characterize a desired solution [52]; IRL can be used in combination with RL to improve upon demonstrated behavior. Using the power of deep neural networks, it is now possible to learn complex, nonlinear reward functions for IRL [98]. Ho and Ermon [24] showed that policies are uniquely characterized by their occupancies (visited state and action distributions) allowing IRL to be reduced to the problem of measure matching. With this insight, they were able to use generative adversarial training [18] to facilitate reward-function learning in a more flexible manner, resulting in the generative adversarial imitation learning algorithm.


6.5. Multiagent RL

Usually, RL considers a single learning agent in a stationary environment. In contrast, multiagent RL (MARL) considers multiple agents learning through RL, and it often has to deal with the nonstationarity introduced by other agents changing their behaviors as they learn [8]. In DRL, the focus has been on enabling (differentiable) communication between agents, which allows them to cooperate. Several approaches have been proposed for this purpose, including passing messages to agents sequentially [15], using a bidirectional channel (providing ordering with less signal loss) [58], and an all-to-all channel [77]. The addition of communication channels is a natural strategy to apply to MARL in complex scenarios and does not preclude the usual practice of modeling cooperative or competing agents as applied elsewhere in the MARL literature [8].

7. Conclusion: Beyond Pattern Recognition

Despite the successes of DRL, many problems need to be addressed before these techniques can be applied to a wide range of complex real-world problems [37]. Recent work with (nondeep) generative causal models demonstrated superior generalization over standard DRL algorithms [48], [63] in some benchmarks [5], achieved by reasoning about causes and effects in the environment [29]. For example, the schema networks of Kansky et al. [29] trained on the game Breakout immediately adapted to a variant where a small wall was placed in front of the target blocks, while progressive (A3C) networks [63] failed to match the performance of the schema networks even after training on the new domain. Although DRL has already been combined with AI techniques, such as search [73] and planning [80], a deeper integration with other traditional AI approaches promises benefits such as better sample complexity, generalization, and interpretability [16]. In time, we also hope that our theoretical understanding of the properties of neural networks (particularly within DRL) will improve, as it currently lags far behind practice.


To conclude, it is worth revisiting the overarching goal of all of this research: the creation of general-purpose AI systems that can interact with and learn from the world around them. Interaction with the environment is simultaneously the advantage and disadvantage of RL. While there are many challenges in seeking to understand our complex and ever-changing world, RL allows us to choose how we explore it. In effect, RL endows agents with the ability to perform experiments to better understand their surroundings, enabling them to learn even high-level causal relationships. The availability of high-quality visual renderers and physics engines now enables us to take steps in this direction, with works that try to learn intuitive models of physics in visual environments [13]. Challenges remain before this will be possible in the real world, but steady progress is being made in agents that learn the fundamental principles of the world through observation and action. Perhaps, then, we are not too far away from AI systems that learn and act in more human-like ways in increasingly complex environments.



Kai Arulkumaran would like to acknowledge Ph.D. funding from the Department of Bioengineering at Imperial College London. This research has been partially funded by a Google Faculty Research Award to Marc Deisenroth.
