深度强化学习综述论文 A Brief Survey of Deep Reinforcement Learning

A Brief Survey of Deep Reinforcement Learning

深度强化学习的简要概述

作者:
Kai Arulkumaran, Marc Peter Deisenroth, Miles Brundage, Anil Anthony Bharath

摘要 Abstract

Deep reinforcement learning (DRL) is poised to revolutionize the field of artificial intelligence (AI) and represents a step toward building autonomous systems with a higher level understanding of the visual world. Currently, deep learning is enabling reinforcement learning (RL) to scale to problems that were previously intractable, such as learning to play video games directly from pixels. DRL algorithms are also applied to robotics, allowing control policies for robots to be learned directly from camera inputs in the real world. In this survey, we begin with an introduction to the general field of RL, then progress to the main streams of value-based and policy-based methods. Our survey will cover central algorithms in deep RL, including the deep Q-network (DQN), trust region policy optimization (TRPO), and asynchronous advantage actor critic. In parallel, we highlight the unique advantages of deep neural networks, focusing on visual understanding via RL. To conclude, we describe several current areas of research within the field.

深度强化学习(DRL)有望彻底改变人工智能(AI)领域,它是迈向构建对视觉世界有更高层次理解的自主系统的一步。目前,深度学习正使得强化学习(RL)可以扩展到以前难以解决的问题,如直接从屏幕像素学习玩电子游戏。DRL算法也适用于机器人,允许机器人的控制策略直接从现实世界的摄像头输入中学习。在本篇综述中,我们首先介绍RL的一般方法,然后过渡到基于价值和基于策略的主流方法。本篇综述将涵盖Deep RL中的主要算法,包括深度Q网络(deep Q-network, DQN)、信任域策略优化算法(trust region policy optimization, TRPO)和异步优势演员-评论家算法(asynchronous advantage actor-critic, A3C)。同时,我们强调了深度神经网络的独特优势,重点关注通过RL实现的视觉理解。最后,我们描述了深度强化学习领域目前的几个研究重点。

1. 引言 Introduction

One of the primary goals of the field of artificial intelligence (AI) is to produce fully autonomous agents that interact with their environments to learn optimal behaviours, improving over time through trial and error. Crafting AI systems that are responsive and can effectively learn has been a long-standing challenge, ranging from robots, which can sense and react to the world around them, to purely software-based agents, which can interact with natural language and multimedia. A principled mathematical framework for experience-driven autonomous learning is reinforcement learning (RL) [135]. Although RL had some successes in the past [141, 129, 62, 93], previous approaches lacked scalability and were inherently limited to fairly low-dimensional problems. These limitations exist because RL algorithms share the same complexity issues as other algorithms: memory complexity, computational complexity, and in the case of machine learning algorithms, sample complexity [133]. What we have witnessed in recent years (the rise of deep learning, relying on the powerful function approximation and representation learning properties of deep neural networks) has provided us with new tools for overcoming these problems.

人工智能(AI)领域的主要目标之一是生产完全自主的Agent,这些Agent与环境交互,学习最佳的行为,并通过反复试验不断改进。从能够感知周围世界并做出反应的机器人,到能够与自然语言和多媒体交互的纯软件Agent,打造具有响应能力和有效学习能力的人工智能系统一直是一个长期存在的挑战。经验驱动式自主学习的一个基本数学框架是强化学习(RL)[135]。尽管RL在过去取得了一些成功[141、129、62、93],但以前的方法缺乏可伸缩性,并且天生局限于相当低维的问题。这些局限性的存在是因为RL算法与其他算法有相同的复杂性问题:内存复杂性、计算复杂性,以及在机器学习算法中的样本复杂性[133]。近年来,我们所看到的深度学习的兴起,依靠深度神经网络强大的函数逼近和表征学习的特性,为我们提供了克服这些问题的新工具。

The advent of deep learning has had a significant impact on many areas in machine learning, dramatically improving the state of the art in tasks such as object detection, speech recognition, and language translation [39]. The most important property of deep learning is that deep neural networks can automatically find compact low-dimensional representations (features) of high-dimensional data (e.g., images, text, and audio). Through crafting inductive biases into neural network architectures, particularly that of hierarchical representations, machine-learning practitioners have made effective progress in addressing the curse of dimensionality [7]. Deep learning has similarly accelerated progress in RL, with the use of deep-learning algorithms within RL defining the field of DRL. The aim of this survey is to cover both seminal and recent developments in DRL, conveying the innovative ways in which neural networks can be used to bring us closer toward developing autonomous agents. For a more comprehensive survey of recent efforts in DRL, we refer readers to the overview by Li [43].

深度学习的出现对机器学习的许多领域产生了重大影响,极大地提高了目标检测、语音识别和语言翻译[39]等任务的技术水平。深度学习最重要的特性是,深度神经网络可以自动找到高维数据(如图像、文本和音频)的相关低维表示(特征)。通过将归纳偏差融入神经网络架构,特别是层次表示的架构,机器学习实践者在解决维度的诅咒方面取得了有效进展[7]。深度学习同样加速了RL的发展,RL内部的深度学习算法定义了DRL领域。本调查报告涵盖了DRL开创性的和近期的发展,传达了神经网络用于使我们更可能开发出自主Agent的新方法。为了更全面地调查近期在DRL方面的努力,我们建议读者参考Li[43]的概述。

Deep learning enables RL to scale to decision-making problems that were previously intractable, i.e., settings with high-dimensional state and action spaces. Among recent work in the field of DRL, there have been two outstanding success stories. The first, kick starting the revolution in DRL, was the development of an algorithm that could learn to play a range of Atari 2600 video games at a superhuman level, directly from image pixels [47]. Providing solutions for the instability of function approximation techniques in RL, this work was the first to convincingly demonstrate that RL agents could be trained on raw, high-dimensional observations, solely based on a reward signal. The second standout success was the development of a hybrid DRL system, AlphaGo, that defeated a human world champion in Go [73], paralleling the historic achievement of IBM’s Deep Blue in chess two decades earlier [9]. Unlike the handcrafted rules that have dominated chess-playing systems, AlphaGo comprised neural networks that were trained using supervised learning and RL, in combination with a traditional heuristic search algorithm.

深度学习使RL能够扩展到以前难以处理的决策问题,即具有高维状态和动作空间的环境。在DRL领域近期的工作中,有两个杰出的成功故事。首先,启动DRL革命的是一种算法的开发,该算法可以直接从图像像素学习以“superhuman level”的水平玩一系列雅达利2600的视频游戏[47]。这项工作为RL中函数近似技术的不稳定性提供了解决方案,它首次令人信服地证明了RL的Agent可以在原始的高维观察上仅基于奖励信号进行训练。第二个成就是开发了一种混合DRL系统——AlphaGo,它在围棋中击败了一位人类世界冠军,与20年前IBM的深蓝(Deep Blue)在国际象棋中的历史性成就相媲美[9]。与主宰国际象棋系统的手工规则不同,AlphaGo由神经网络组成,这些神经网络使用监督学习和RL,并结合传统的启发式搜索算法进行训练。

DRL algorithms have already been applied to a wide range of problems, such as robotics, where control policies for robots can now be learned directly from camera inputs in the real world [41], [42], succeeding controllers that used to be hand-engineered or learned from low-dimensional features of the robot’s state. In Figure 1, we showcase just some of the domains that DRL has been applied to, ranging from playing video games [47] to indoor navigation [100].

DRL算法已经应用于各种各样的问题,如机器人技术,机器人控制策略现在可以直接从摄像头输入的现实场景进行训练[41、42],接替过去手动设计或从机器人状态的低维特征中学习的控制器。在图1中,我们展示了DRL的一些应用领域,从玩电子游戏[47]到室内导航[100]。

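Figure 1. A range of visual RL domains.
(a) Three classic Atari 2600 video games, Enduro, Freeway, and Seaquest, from the Arcade Learning Environment (ALE) [5]. Because the supported games vary in genre, visuals, and difficulty, the ALE has become a standard test bed for DRL algorithms [20], [47], [48], [55], [70], [75], [92]. The ALE is one of several benchmarks currently used to standardize evaluation in RL.
(b) The TORCS car racing simulator, which has been used to test DRL algorithms that can output continuous actions [33], [44], [48] (as the games from the ALE only support discrete actions).
(c) Utilizing the potentially unlimited amount of training data that can be amassed in robotic simulators, several methods aim to transfer knowledge from the simulator to the real world [11], [64], [84].
(d) Two of the four robotic tasks designed by Levine et al. [41]: screwing on a bottle cap and placing a shaped block in the correct hole. Visuomotor policies were trained in an end-to-end fashion, showing that visual servoing can be learned directly from raw camera inputs by using deep neural networks.
(e) A real room, in which a wheeled robot trained to navigate the building is given a visual cue as input and must find the corresponding location [100].
(f) A natural image being processed by a neural network that uses RL to choose where to look [99]. By processing a small portion of the image for each word generated, the network can focus its attention on the most salient points.
Panels (b)-(f) are reproduced from [41], [44], [84], [99], and [100], respectively.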

2. 奖励驱动行为 Reward-Driven Behavior

Before examining the contributions of deep neural networks to RL, we will introduce the field of RL in general. The essence of RL is learning through interaction. An RL agent interacts with its environment and, upon observing the consequences of its actions, can learn to alter its own behavior in response to rewards received. This paradigm of trial-and-error learning has its roots in behaviorist psychology and is one of the main foundations of RL [78]. The other key influence on RL is optimal control, which has lent the mathematical formalisms (most notably dynamic programming [6]) that underpin the field.

在考察深度神经网络对RL的贡献之前,我们将对RL领域进行总体介绍。RL的本质是在互动中学习。一个RL的Agent与它处的环境相互作用,在观察其行为造成的结果后,可以通过学习改变自己的行为来响应所获得的奖励。这种试错学习的范式起源于行为心理学,是RL的主要基础之一[78]。对RL的另一个关键影响是最优控制,它提供了支持该领域的数学形式(最著名的是动态规划[6])。

In the RL setup, an autonomous agent, controlled by a machine-learning algorithm, observes a state $s_t$ from its environment at time step $t$. The agent interacts with the environment by taking an action $a_t$ in state $s_t$. When the agent takes an action, the environment and the agent transition to a new state, $s_{t+1}$, based on the current state and the chosen action. The state is a sufficient statistic of the environment and thereby comprises all the necessary information for the agent to take the best action, which can include parts of the agent such as the position of its actuators and sensors. In the optimal control literature, states and actions are often denoted by $x_t$ and $u_t$, respectively.

在RL环境中,一个自主的Agent,由机器学习算法控制,在时间 $t$ 从当前环境中观察到状态 $s_t$。Agent通过在状态为 $s_t$ 时执行动作 $a_t$ 与环境交互。当Agent采取一个动作时,环境和Agent将根据当前状态和选择的动作转换到一个新状态 $s_{t+1}$。状态是对环境的充分统计,因此包括了Agent采取最佳行动的所有必要信息,其中可以包括Agent的部分,如执行器和传感器的位置。在最优控制的文献中,状态和动作通常分别用 $x_t$ 和 $u_t$ 表示。

The best sequence of actions is determined by the rewards provided by the environment. Every time the environment transitions to a new state, it also provides a scalar reward $r_{t+1}$ to the agent as feedback. The goal of the agent is to learn a policy (control strategy) $\pi$ that maximizes the expected return (cumulative, discounted reward). Given a state, a policy returns an action to perform; an optimal policy is any policy that maximizes the expected return in the environment. In this respect, RL aims to solve the same problem as optimal control. However, the challenge in RL is that the agent needs to learn about the consequences of actions in the environment by trial and error, as, unlike in optimal control, a model of the state transition dynamics is not available to the agent. Every interaction with the environment yields information, which the agent uses to update its knowledge. This perception-action-learning loop is illustrated in Figure 2.

最佳行动顺序是由环境所提供的奖励所决定的。每当环境转换到一个新状态时,它还会向Agent提供一个标量奖励 $r_{t+1}$ 作为反馈。Agent的目标是学习一种策略(控制策略)$\pi$,使期望回报(累积的折扣奖励)最大化。给定一个状态,则策略会返回一个要执行的动作;最优策略是在环境中使期望回报最大化的任意策略。在这方面,RL的目标是解决与最优控制相同的问题。然而,RL的挑战在于,Agent需要通过试错来了解环境中行为的后果,因为与最优控制不同,Agent无法获得状态转移动态模型。与环境的每次交互都会产生信息,Agent使用这些信息来更新其知识。这个感知-动作-学习循环如图2所示。

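Figure 2. The perception-action-learning loop. At time $t$, the agent receives state $s_t$ from the environment. The agent uses its policy to choose an action $a_t$. Once the action is executed, the environment transitions a step, providing the next state $s_{t+1}$ as well as feedback in the form of a reward $r_{t+1}$. The agent uses knowledge of state transitions, of the form $(s_t, a_t, s_{t+1}, r_{t+1})$, to learn and improve its policy.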
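The loop in Figure 2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `CorridorEnv`, `random_policy`, and `run_episode` are hypothetical names invented here, and the corridor task is a toy stand-in for a real environment.

```python
import random


class CorridorEnv:
    """Hypothetical toy environment: walk right along a corridor to reach the goal."""

    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action is -1 (left) or +1 (right); the state is clipped to the corridor
        self.state = min(max(self.state + action, 0), self.length - 1)
        done = self.state == self.length - 1
        reward = 1.0 if done else 0.0  # r_{t+1}
        return self.state, reward, done


def random_policy(state, rng):
    # A placeholder policy that ignores the state and acts uniformly at random
    return rng.choice([-1, 1])


def run_episode(env, policy, rng, max_steps=1000):
    transitions = []
    s = env.reset()
    for _ in range(max_steps):
        a = policy(s, rng)
        s_next, r, done = env.step(a)
        transitions.append((s, a, s_next, r))  # (s_t, a_t, s_{t+1}, r_{t+1})
        s = s_next
        if done:
            break
    return transitions


rng = random.Random(0)
episode = run_episode(CorridorEnv(), random_policy, rng)
```

A learning agent would consume the tuples in `episode` to update its policy; here they are only collected.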

2.1. 马尔科夫决策过程 Markov Decision Processes

Formally, RL can be described as a Markov decision process (MDP), which consists of

  • a set of states $\mathcal{S}$, plus a distribution of starting states $p(s_0)$
  • a set of actions $\mathcal{A}$
  • transition dynamics $\mathcal{T}(s_{t+1} \mid s_t, a_t)$ that map a state-action pair at time $t$ onto a distribution of states at time $t+1$
  • an immediate/instantaneous reward function $\mathcal{R}(s_t, a_t, s_{t+1})$
  • a discount factor $\gamma\in[0,1]$, where lower values place more emphasis on immediate rewards.

在形式上,RL可以描述为一个马尔科夫决策过程(MDP),它由以下元素组成

  • 一组状态 $\mathcal{S}$,加上一个初始状态分布 $p(s_0)$
  • 一组动作 $\mathcal{A}$
  • 转移动态 $\mathcal{T}(s_{t+1} \mid s_t, a_t)$,它将时间 $t$ 的状态-动作对映射为时间 $t+1$ 的状态分布
  • 瞬时奖励函数 $\mathcal{R}(s_t, a_t, s_{t+1})$
  • 一个折扣因子 $\gamma\in[0,1]$,折扣因子的值越低越强调即时奖励。
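As a rough illustration of how the five MDP components fit together, the sketch below stores them in a small container and samples one transition. The `MDP` class, its field names, and the two-state `toy` problem are assumptions made for this example, not part of the survey.

```python
import random
from dataclasses import dataclass


@dataclass
class MDP:
    states: list
    actions: list
    start: dict        # p(s_0): state -> probability
    transitions: dict  # T: (s, a) -> {s_next: probability}
    rewards: dict      # R: (s, a, s_next) -> reward (0 if missing)
    gamma: float       # discount factor in [0, 1]

    def sample_start(self, rng):
        return rng.choices(list(self.start), weights=list(self.start.values()))[0]

    def step(self, s, a, rng):
        # Draw s_{t+1} from T(. | s_t, a_t) and look up the reward r_{t+1}
        dist = self.transitions[(s, a)]
        s_next = rng.choices(list(dist), weights=list(dist.values()))[0]
        return s_next, self.rewards.get((s, a, s_next), 0.0)


# Hypothetical two-state problem used only for illustration
toy = MDP(
    states=["A", "B"],
    actions=["stay", "go"],
    start={"A": 1.0},
    transitions={
        ("A", "stay"): {"A": 1.0},
        ("A", "go"): {"A": 0.2, "B": 0.8},
        ("B", "stay"): {"B": 1.0},
        ("B", "go"): {"A": 1.0},
    },
    rewards={("A", "go", "B"): 1.0},
    gamma=0.9,
)
```

In the RL setting the agent would not be given `transitions` or `rewards` directly; it could only call something like `step` and observe the outcome.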

In general, the policy $\pi$ is a mapping from states to a probability distribution over actions: $\pi:\mathcal{S}\to p(\mathcal{A}=a \mid \mathcal{S})$. If the MDP is episodic, i.e., the state is reset after each episode of length $T$, then the sequence of states, actions, and rewards in an episode constitutes a trajectory or rollout of the policy. Every rollout of a policy accumulates rewards from the environment, resulting in the return $R=\sum_{t=0}^{T-1}\gamma^t r_{t+1}$. The goal of RL is to find an optimal policy, $\pi^*$, that achieves the maximum expected return from all states:

$$\pi^*=\arg\max_{\pi}\mathbb{E}[R \mid \pi]$$

一般来说,策略 $\pi$ 是从状态到动作概率分布的映射:$\pi:\mathcal{S}\to p(\mathcal{A}=a \mid \mathcal{S})$。如果MDP是情景性的,即在每一长度为 $T$ 的episode之后状态会重置,那么每一episode的状态、动作和奖励的序列构成策略的一条轨迹(trajectory),或称rollout。策略的每次rollout都会从环境中积累奖励,得到回报 $R=\sum_{t=0}^{T-1}\gamma^t r_{t+1}$。RL的目标是找到一个最优策略 $\pi^*$,它能从所有状态中获得最大的期望回报:

$$\pi^*=\arg\max_{\pi}\mathbb{E}[R \mid \pi]$$

It is also possible to consider nonepisodic MDPs, where $T=\infty$. In this situation, $\gamma<1$ prevents an infinite sum of rewards from being accumulated. Furthermore, methods that rely on complete trajectories are no longer applicable, but those that use a finite set of transitions still are.

也可以考虑 $T=\infty$ 的非情景MDP。在这种情况下,$\gamma<1$ 阻止了奖励的无限累积。此外,依赖于完整轨迹的方法不再适用,但那些使用有限转移集合的方法仍然适用。
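A minimal sketch of the return computation, assuming the rewards of one rollout are stored in a list `[r_1, ..., r_T]` (the function name `discounted_return` is chosen here for illustration):

```python
def discounted_return(rewards, gamma):
    """R = sum_{t=0}^{T-1} gamma^t * r_{t+1}, with rewards = [r_1, ..., r_T]."""
    total = 0.0
    # Accumulate from the last reward backward: R_t = r_{t+1} + gamma * R_{t+1}
    for r in reversed(rewards):
        total = r + gamma * total
    return total


short = discounted_return([1.0, 1.0, 1.0], gamma=0.5)   # 1 + 0.5 + 0.25 = 1.75
long_run = discounted_return([1.0] * 1000, gamma=0.9)   # approaches 1 / (1 - 0.9) = 10
```

The second call hints at why $\gamma<1$ matters in the nonepisodic case: with a constant reward of 1, the discounted sum stays bounded by $1/(1-\gamma)$ however long the rollout runs.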

A key concept underlying RL is the Markov property: only the current state affects the next state, or, in other words, the future is conditionally independent of the past given the present state. This means that any decisions made at $s_t$ can be based solely on $s_{t-1}$, rather than $\{s_0, s_1, \ldots, s_{t-1}\}$. Although this assumption is held by the majority of RL algorithms, it is somewhat unrealistic, as it requires the states to be fully observable. Partially observable MDPs (POMDPs) are a generalization of MDPs in which the agent receives an observation $o_t\in\Omega$, where the distribution of the observation $p(o_{t+1} \mid s_{t+1}, a_t)$ is dependent on the current state and the previous action [27]. In a control and signal processing context, the observation would be described by a measurement/observation mapping in a state-space model that depends on the current state and the previously applied action.

RL的一个关键概念是马尔科夫属性:只有当前状态才会影响下一个状态,换句话说,在给定当前状态的条件下,未来与过去条件独立。这意味着在 $s_t$ 上做出的任何决定都可以仅基于 $s_{t-1}$,而不是 $\{s_0, s_1, \ldots, s_{t-1}\}$。尽管大多数RL算法都持有这个假设,但它有些不现实,因为它要求状态是完全可观察的。MDP的一种推广是部分可观察MDP(POMDP),其中Agent接收到一个观测 $o_t\in\Omega$,观测的分布 $p(o_{t+1} \mid s_{t+1}, a_t)$ 依赖于当前状态和前一个动作[27]。在控制和信号处理上下文中,观测将由状态空间模型中的测量/观测映射来描述,该映射依赖于当前状态和先前执行的动作。

POMDP algorithms typically maintain a belief over the current state given the previous belief state, the action taken, and the current observation. A more common approach in deep learning is to utilize recurrent neural networks (RNNs) [20], [21], [48], [96], which, unlike feedforward neural networks, are dynamical systems.

部分可观察马尔可夫决策过程(Partially Observable Markov Decision Process, POMDP)算法通常在给定前一个信念状态、所采取的行动和当前观察,在这个当前状态上维护一个信念。深度学习中更常见的方法是利用递归神经网络[20、21、48、96],与前馈神经网络不同,递归神经网络是动态系统。
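The belief maintenance described above can be sketched as a discrete Bayes filter. This is the standard textbook form of the update, not something spelled out in the survey; the function name, the dictionary layout of `T` and `O`, and the two-state listening example are assumptions for illustration.

```python
def belief_update(belief, action, observation, T, O):
    """b'(s') is proportional to O(o | s', a) * sum_s T(s' | s, a) * b(s)."""
    states = list(belief)
    unnormalized = {}
    for s_next in states:
        # Prediction step: push the previous belief through the transition model
        predicted = sum(T[(s, action)].get(s_next, 0.0) * belief[s] for s in states)
        # Correction step: weight by the likelihood of the received observation
        unnormalized[s_next] = O[(s_next, action)].get(observation, 0.0) * predicted
    total = sum(unnormalized.values())
    if total == 0.0:
        raise ValueError("observation has zero probability under the model")
    return {s: p / total for s, p in unnormalized.items()}


# Hypothetical example: the hidden state does not change, observations are 85% accurate
T = {("left", "listen"): {"left": 1.0}, ("right", "listen"): {"right": 1.0}}
O = {
    ("left", "listen"): {"hear_left": 0.85, "hear_right": 0.15},
    ("right", "listen"): {"hear_left": 0.15, "hear_right": 0.85},
}
b = belief_update({"left": 0.5, "right": 0.5}, "listen", "hear_left", T, O)
```

Starting from a uniform belief, one `hear_left` observation shifts the belief toward `left`. An RNN-based agent replaces this explicit update with a learned hidden state.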

2.2. 强化学习的挑战 Challenges in RL

It is instructive to emphasize some challenges faced in RL:

  • The optimal policy must be inferred by trial-and-error interaction with the environment. The only learning signal the agent receives is the reward.
  • The observations of the agent depend on its actions and can contain strong temporal correlations.
  • Agents must deal with long-range time dependencies: often the consequences of an action only materialize after many transitions of the environment. This is known as the (temporal) credit assignment problem [78].

强调在RL中面临的一些挑战是有益的:

  • 最优策略必须通过与环境的反复试错才能找出。Agent收到的唯一学习信号就是奖励。
  • Agent的观测依赖于其自身的动作,并可能包含强烈的时间相关性。
  • Agent必须处理长期的依赖关系:通常一个操作的结果只在多次环境转移之后才会具体化。这就是所谓的(时间)信用分配问题[78]。

We will illustrate these challenges in the context of an indoor robotic visual navigation task: if the goal location is specified, we may be able to estimate the distance remaining (and use it as a reward signal), but it is unlikely that we will know exactly what series of actions the robot needs to take to reach the goal. As the robot must choose where to go as it navigates the building, its decisions influence which rooms it sees and, hence, the statistics of the visual sequence captured. Finally, after navigating several junctions, the robot may find itself in a dead end. There is a range of problems, from learning the consequences of actions to balancing exploration versus exploitation, but ultimately these can all be addressed formally within the framework of RL.

我们将在室内机器人视觉导航任务的背景下说明这些挑战:如果指定目标位置,我们就可以估计剩余的距离(并使用它作为奖励的信号),但我们不太可能确切地知道机器人需要采取什么样的一系列行动才能达到目标。由于机器人在建筑中导航时必须选择要去的地方,所以它的决定会影响它看到的房间,从而影响捕获的视觉序列的统计数据。最后,在导航了几个路口之后,机器人可能会发现自己进入了一个死胡同。任务中存在一系列问题,从学习行动的结果到平衡探索与利用,但最终这些问题都可以在RL框架中得到正式解决。

3. 强化学习算法 RL Algorithms

So far, we have introduced the key formalism used in RL, the MDP, and briefly noted some challenges in RL. In the following, we will distinguish between different classes of RL algorithms. There are two main approaches to solving RL problems: methods based on value functions and methods based on policy search. There is also a hybrid actor-critic approach that employs both value functions and policy search. Next, we will explain these approaches and other useful concepts for solving RL problems.

到目前为止,我们已经介绍了RL中使用的关键形式,即MDP,并简要指出了RL中的一些挑战。下面,我们将区分不同类型的RL算法。解决RL问题主要有两种方法:基于价值函数的方法和基于策略搜索的方法。还有一种混合的演员-评论家算法,它同时使用价值函数和策略搜索。接下来,我们将解释这些方法和其他解决RL问题的有用概念。

3.1. 价值函数 Value Functions

Value function methods are based on estimating the value (expected return) of being in a given state. The state-value function $V^\pi(s)$ is the expected return when starting in state $s$ and following $\pi$ subsequently:

$$V^\pi(s)=\mathbb{E}[R \mid s,\pi]$$

The optimal policy, $\pi^*$, has a corresponding state-value function $V^*(s)$, and vice versa; the optimal state-value function can be defined as

$$V^*(s)=\max_{\pi}V^\pi(s)\quad \forall s \in \mathcal{S}$$

If we had $V^*(s)$ available, the optimal policy could be retrieved by choosing among all actions available at $s_t$ and picking the action $a$ that maximizes $\mathbb{E}_{s_{t+1}\sim\mathcal{T}(s_{t+1} \mid s_t,a)}[V^*(s_{t+1})]$.

In the RL setting, the transition dynamics $\mathcal{T}$ are unavailable. Therefore, we construct another function, the state-action value or quality function $Q^\pi(s,a)$, which is similar to $V^\pi$, except that the initial action $a$ is provided and $\pi$ is only followed from the succeeding state onward:

$$Q^\pi(s,a)=\mathbb{E}[R \mid s,a,\pi]$$

The best policy, given $Q^\pi(s,a)$, can be found by choosing $a$ greedily at every state: $\arg\max_{a}Q^\pi(s,a)$. Under this policy, we can also define $V^\pi(s)$ by maximizing $Q^\pi(s,a)$:

$$V^\pi(s)=\max_{a}Q^\pi(s,a)$$

值函数方法基于对给定状态的值(期望回报)的估计。状态-值函数 $V^\pi(s)$ 是从状态 $s$ 出发并随后遵循策略 $\pi$ 时的期望回报:

$$V^\pi(s)=\mathbb{E}[R \mid s,\pi]$$

最优策略 $\pi^*$ 对应的状态值函数是 $V^*(s)$,反之亦然,最优状态值函数可以定义为:

$$V^*(s)=\max_{\pi}V^\pi(s)\quad \forall s \in \mathcal{S}$$

如果我们有 $V^*(s)$ 可用,可以在 $s_t$ 时的所有可用动作中选择使 $\mathbb{E}_{s_{t+1}\sim\mathcal{T}(s_{t+1} \mid s_t,a)}[V^*(s_{t+1})]$ 最大化的动作 $a$ 来得到最优策略。

在RL环境中,转移动态 $\mathcal{T}$ 不可用。因此,我们构造另一个函数,状态-动作值函数或称质量函数 $Q^\pi(s,a)$,它类似于 $V^\pi$,只是给定了初始动作 $a$,并且只从后续状态开始遵循 $\pi$:

$$Q^\pi(s,a)=\mathbb{E}[R \mid s,a,\pi]$$

在给定 $Q^\pi(s,a)$ 的情况下,通过在每个状态下都贪婪地选择 $a$,可以得到最优策略:$\arg\max_{a}Q^\pi(s,a)$。在这个策略下,我们还可以通过最大化 $Q^\pi(s,a)$ 定义 $V^\pi(s)$:

$$V^\pi(s)=\max_{a}Q^\pi(s,a)$$

3.2. 动态规划 Dynamic Programming

To actually learn $Q^\pi$, we exploit the Markov property and define the function as a Bellman equation [6], which has the following recursive form:

$$Q^\pi(s_t,a_t)=\mathbb{E}_{s_{t+1}}[r_{t+1}+\gamma Q^\pi(s_{t+1},\pi(s_{t+1}))]$$

This means that $Q^\pi$ can be improved by bootstrapping, i.e., we can use the current values of our estimate of $Q^\pi$ to improve our estimate. This is the foundation of Q-learning [94] and the state-action-reward-state-action (SARSA) algorithm [62]:

$$Q^\pi(s_t,a_t)\leftarrow Q^\pi(s_t,a_t)+\alpha\delta$$

where $\alpha$ is the learning rate and $\delta=Y-Q^\pi(s_t,a_t)$ the temporal difference (TD) error; here, $Y$ is a target as in a standard regression problem. SARSA, an on-policy learning algorithm, is used to improve the estimate of $Q^\pi$ by using transitions generated by the behavioral policy (the policy derived from $Q^\pi$), which results in setting $Y=r_{t+1}+\gamma Q^\pi(s_{t+1}, a_{t+1})$. Q-learning is off-policy, as $Q^\pi$ is instead updated by transitions that were not necessarily generated by the derived policy. Instead, Q-learning uses $Y=r_{t+1}+\gamma\max_{a}Q^\pi(s_{t+1}, a)$, which directly approximates $Q^*$.

为了实际学习 $Q^\pi$,我们利用马尔可夫性质,将函数定义为贝尔曼方程[6],它有如下递归形式:

$$Q^\pi(s_t,a_t)=\mathbb{E}_{s_{t+1}}[r_{t+1}+\gamma Q^\pi(s_{t+1},\pi(s_{t+1}))]$$

这意味着 $Q^\pi$ 可以通过自举(bootstrapping)来改进,也就是说,我们可以使用 $Q^\pi$ 的当前估计值来改进我们的估计值。这是Q-learning[94]和state-action-reward-state-action (SARSA)算法[62]的基础:

$$Q^\pi(s_t,a_t)\leftarrow Q^\pi(s_t,a_t)+\alpha\delta$$

其中 $\alpha$ 为学习率,$\delta=Y-Q^\pi(s_t,a_t)$ 为时序差分(TD)误差;这里,$Y$ 是与标准回归问题中一样的目标值。SARSA是一种同策略(on-policy)学习算法,它通过使用行为策略(由 $Q^\pi$ 派生的策略)产生的转移来改进 $Q^\pi$ 的估计,结果是设置 $Y=r_{t+1}+\gamma Q^\pi(s_{t+1}, a_{t+1})$。Q-learning是异策略(off-policy)的,因为 $Q^\pi$ 是通过不一定由派生策略生成的转移来更新的。Q-learning使用 $Y=r_{t+1}+\gamma\max_{a}Q^\pi(s_{t+1}, a)$,它直接近似于 $Q^*$。
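The two targets differ only in how the next state is valued, which a tabular sketch makes concrete. The function names and the dictionary-based `Q` table are assumptions for illustration; terminal states (where the bootstrap term should be dropped) are not handled here.

```python
def sarsa_target(Q, r, s_next, a_next, gamma):
    # On-policy: bootstrap from the action a_{t+1} actually chosen by the behavioral policy
    return r + gamma * Q[(s_next, a_next)]


def q_learning_target(Q, r, s_next, actions, gamma):
    # Off-policy: bootstrap from the greedy action, whatever was actually taken
    return r + gamma * max(Q[(s_next, a)] for a in actions)


def td_update(Q, s, a, target, alpha):
    delta = target - Q[(s, a)]      # TD error: delta = Y - Q(s_t, a_t)
    Q[(s, a)] += alpha * delta      # Q(s_t, a_t) <- Q(s_t, a_t) + alpha * delta
    return delta


# Hypothetical two-state table: one transition (s0, "r") -> s1 with reward 1
actions = ["l", "r"]
Q = {("s0", "l"): 0.0, ("s0", "r"): 0.0, ("s1", "l"): 1.0, ("s1", "r"): 2.0}
y_sarsa = sarsa_target(Q, 1.0, "s1", "l", gamma=0.9)        # 1 + 0.9 * 1.0
y_q = q_learning_target(Q, 1.0, "s1", actions, gamma=0.9)   # 1 + 0.9 * 2.0
delta = td_update(Q, "s0", "r", y_q, alpha=0.5)
```

When the behavioral policy happens to pick the greedy action, the two targets coincide; they differ exactly when exploration selects something else.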

To find $Q^*$ from an arbitrary $Q^\pi$, we use generalized policy iteration, where policy iteration consists of policy evaluation and policy improvement. Policy evaluation improves the estimate of the value function, which can be achieved by minimizing TD errors from trajectories experienced by following the policy. As the estimate improves, the policy can naturally be improved by choosing actions greedily based on the updated value function. Instead of performing these steps separately to convergence (as in policy iteration), generalized policy iteration allows for interleaving the steps, such that progress can be made more rapidly.

为了从任意的 $Q^\pi$ 中找到 $Q^*$,我们使用了广义策略迭代,其中策略迭代包括策略评估和策略改进。策略评估可以改善价值函数的估计,这可以通过最小化遵循策略轨迹的TD误差来实现。随着估计值的提高,策略自然可以通过基于更新后的值函数贪婪地选择动作来改进。与单独执行这些步骤直至收敛(就像在策略迭代中那样)不同,广义策略迭代允许交叉执行这些步骤,这样可以更快地取得进展。
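The interleaving of evaluation and improvement can be illustrated on a tiny problem. Note the simplification: the sketch assumes a known, deterministic model (`MODEL`) so that evaluation is an exact Bellman backup, whereas in RL the same role is played by TD updates from sampled transitions. All names and numbers are hypothetical.

```python
GAMMA = 0.9
STATES = ["A", "B"]
ACTIONS = ["stay", "go"]
# Hypothetical deterministic model: (s, a) -> (s_next, reward)
MODEL = {
    ("A", "stay"): ("A", 0.0),
    ("A", "go"): ("B", 1.0),
    ("B", "stay"): ("B", 2.0),
    ("B", "go"): ("A", 0.0),
}


def evaluate(policy, Q, sweeps):
    """Partial policy evaluation: a few Bellman backups of Q^pi."""
    for _ in range(sweeps):
        for (s, a), (s_next, r) in MODEL.items():
            Q[(s, a)] = r + GAMMA * Q[(s_next, policy[s_next])]


def improve(Q):
    """Policy improvement: act greedily with respect to the current Q."""
    return {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in STATES}


policy = {s: "stay" for s in STATES}
Q = {key: 0.0 for key in MODEL}
for _ in range(200):
    evaluate(policy, Q, sweeps=1)  # interleaved: one sweep, not run to convergence
    policy = improve(Q)
```

Classical policy iteration would call `evaluate` with many sweeps before each `improve`; here a single sweep per improvement step is enough for the policy to settle on going from A to B and staying there.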

3.3. 采样 Sampling

Instead of bootstrapping value functions using dynamic programming methods, Monte Carlo methods estimate the expected return from a state, $\mathbb{E}[R \mid s,\pi]$, by averaging the return from multiple rollouts of a policy. Because of this, pure Monte Carlo methods can also be applied in non-Markovian environments. On the other hand, they can only be used in episodic MDPs, as a rollout has to terminate for the return to be calculated. It is possible to get the best of both methods by combining TD learning and Monte Carlo policy evaluation, as is done in the TD($\lambda$) algorithm [78]. Similarly to the discount factor, the $\lambda$ in TD($\lambda$) is used to interpolate between Monte Carlo evaluation and bootstrapping. As demonstrated in Figure 3, this results in an entire spectrum of RL methods based around the amount of sampling utilized.

蒙特卡罗方法不是使用动态规划方法对值函数进行自举,而是通过对策略多次rollout的回报取平均来估计某一状态的期望回报 $\mathbb{E}[R \mid s,\pi]$。正因为如此,纯蒙特卡罗方法也可以应用于非马尔可夫环境。另一方面,它们只能在情景性的MDP中使用,因为轨迹必须终止才能计算回报。通过结合TD学习和蒙特卡罗策略评估,有可能获得这两种方法的最佳效果,正如在TD($\lambda$)算法中所做的那样[78]。类似于折扣因子,TD($\lambda$)中的 $\lambda$ 用于在蒙特卡罗评估和自举之间进行插值。如图3所示,这将基于所使用的采样量产生一整个谱系的RL方法。

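Figure 3. Two dimensions of RL algorithms, based on the backups used to learn or construct a policy. At the extremes of these dimensions are (a) dynamic programming, (b) exhaustive search, (c) one-step TD learning, and (d) Monte Carlo approaches. Bootstrapping extends from (c) one-step TD learning to n-step TD learning methods [78], with (d) pure Monte Carlo approaches not relying on bootstrapping at all. Another possible dimension of variation is that (c) and (d) sample actions, whereas (a) and (b) take the expectation over all choices. (Figure recreated based on [78].)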
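One common way to write the interpolation that $\lambda$ controls is the recursive $\lambda$-return, $G_t^\lambda = r_{t+1} + \gamma[(1-\lambda)V(s_{t+1}) + \lambda G_{t+1}^\lambda]$. The sketch below assumes this forward-view form (the survey itself does not give the formula) and an episode that ends in a terminal state of value zero.

```python
def lambda_returns(rewards, values, gamma, lam):
    """rewards[t] is r_{t+1}; values[t] is V(s_t) for t = 0..T-1; V(s_T) = 0."""
    T = len(rewards)
    returns = [0.0] * T
    next_return, next_value = 0.0, 0.0  # both are zero at the terminal state
    for t in reversed(range(T)):
        # Mix the bootstrapped value and the sampled return of the next step
        returns[t] = rewards[t] + gamma * ((1 - lam) * next_value + lam * next_return)
        next_return, next_value = returns[t], values[t]
    return returns


rewards = [0.0, 0.0, 1.0]
values = [0.5, 0.5, 0.5]  # hypothetical value estimates
monte_carlo = lambda_returns(rewards, values, gamma=1.0, lam=1.0)  # pure sampling
one_step_td = lambda_returns(rewards, values, gamma=1.0, lam=0.0)  # pure bootstrapping
```

With `lam=1.0` every target is the full Monte Carlo return, with `lam=0.0` every target is the one-step TD target, and intermediate values blend the two.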

Another major value-function-based method relies on learning the advantage function $A^\pi(s,a)$ [3]. Unlike producing absolute state-action values, as with $Q^\pi$, $A^\pi$ instead represents relative state-action values. Learning relative values is akin to removing a baseline or average level of a signal; more intuitively, it is easier to learn that one action has better consequences than another than it is to learn the actual return from taking the action. $A^\pi$ represents a relative advantage of actions through the simple relationship $A^\pi=Q^\pi-V^\pi$ and is also closely related to the baseline method of variance reduction within gradient-based policy search methods [97]. The idea of advantage updates has been utilized in many recent DRL algorithms [19], [48], [71], [92].

另一个主要的基于值函数的方法依赖于学习优势函数 $A^\pi(s,a)$[3]。与生成绝对的状态-动作值(如 $Q^\pi$)不同,$A^\pi$ 代表的是相对的状态-动作值。学习相对值类似于去除信号的基线或平均水平;更直观地说,学习一个动作比另一个动作有更好的结果,要比学习执行该动作的实际回报更容易。$A^\pi$ 通过简单的关系 $A^\pi=Q^\pi-V^\pi$ 代表了动作的相对优势,也与基于梯度的策略搜索方法中方差缩减的基线方法密切相关[97]。最近的很多DRL算法[19、48、71、92]都采用了优势更新的思想。
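For a single state, the relationship $A^\pi=Q^\pi-V^\pi$ can be written out directly. The sketch assumes $V^\pi(s)=\sum_a \pi(a \mid s)\,Q^\pi(s,a)$ for a stochastic policy; the function name and the numbers are made up for illustration.

```python
def advantages(q_values, policy_probs):
    """A(s, a) = Q(s, a) - V(s), with V(s) = sum_a pi(a | s) * Q(s, a)."""
    v = sum(policy_probs[a] * q for a, q in q_values.items())
    return {a: q - v for a, q in q_values.items()}


q = {"left": 1.0, "right": 3.0}      # hypothetical Q^pi(s, .) for one state
pi = {"left": 0.5, "right": 0.5}     # hypothetical policy probabilities pi(. | s)
adv = advantages(q, pi)              # {"left": -1.0, "right": 1.0}
```

The advantages average to zero under the policy, which is the sense in which they measure values relative to a baseline.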

3.4. 策略搜索 Policy Search

Policy search methods do not need to maintain a value function model but directly search for an optimal policy $\pi^*$. Typically, a parameterized policy $\pi_\theta$ is chosen, whose parameters are updated to maximize the expected return $\mathbb{E}[R \mid \theta]$ using either gradient-based or gradient-free optimization [12]. Neural networks that encode policies have been successfully trained using both gradient-free [17], [33] and gradient-based [22], [41], [44], [70], [71], [96], [97] methods. Gradient-free optimization can effectively cover low-dimensional parameter spaces, but, despite some successes in applying them to large networks [33], gradient-based training remains the method of choice for most DRL algorithms, being more sample efficient when policies possess a large number of parameters.

策略搜索方法不需要维持值函数模型,而是直接搜索最优策略 $\pi^*$。通常,选择一个参数化策略 $\pi_\theta$,使用基于梯度或无梯度的优化更新其参数,以最大化期望回报 $\mathbb{E}[R \mid \theta]$ [12]。使用无梯度[17、33]和基于梯度[22、41、44、70、71、96、97]方法成功训练了编码策略的神经网络。无梯度优化可以有效覆盖低维参数空间,但尽管在大型网络上有一些成功的应用[33],基于梯度的训练仍然是大多数DRL算法所选择的方法,当策略具有大量参数时,这种方法具有更高的样本效率。

When constructing the policy directly, it is common to output parameters for a probability distribution; for continuous actions, this could be the mean and standard deviations of Gaussian distributions, while for discrete actions this could be the individual probabilities of a multinomial distribution. The result is a stochastic policy from which we can directly sample actions. With gradient-free methods, finding better policies requires a heuristic search across a predefined class of models. Methods such as evolution strategies essentially perform hill climbing in a subspace of policies [65], while more complex methods, such as compressed network search, impose additional inductive biases [33]. Perhaps the greatest advantage of gradient-free policy search is that it can also optimize nondifferentiable policies.

当直接构造策略时,输出的参数通常是一个概率分布;对于连续行为,参数可以是高斯分布的均值和标准差,而对于离散行为,参数可以是多项式分布的个体概率。其结果是一个随机策略,我们可以直接从中抽取行为样本。使用无梯度方法,寻找更好的策略需要在预定义的模型类中进行启发式搜索。进化策略等方法本质上是在策略的子空间中进行爬山[65],而更复杂的方法,如压缩网络搜索,会施加额外的归纳偏差[33]。无梯度策略搜索的最大优点可能是它还可以优化不可微分策略。
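A minimal sketch of gradient-free search is random-perturbation hill climbing, shown below. It is a simpler relative of the evolution strategies cited above, not an implementation of them, and `expected_return` is a hypothetical stand-in for an estimate that would normally come from rollouts.

```python
import random


def expected_return(theta):
    """Hypothetical stand-in for E[R | theta]; in practice estimated from rollouts."""
    return -sum((x - 1.0) ** 2 for x in theta)


def hill_climb(theta, steps, sigma, rng):
    best = expected_return(theta)
    for _ in range(steps):
        # Propose a random perturbation of the policy parameters
        candidate = [x + rng.gauss(0.0, sigma) for x in theta]
        score = expected_return(candidate)
        if score > best:  # keep the perturbation only if it improves the return
            theta, best = candidate, score
    return theta, best


theta, best = hill_climb([0.0, 0.0, 0.0], steps=2000, sigma=0.1, rng=random.Random(0))
```

No derivative of `expected_return` is ever needed, which is why the same loop would also work for a nondifferentiable policy.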

3.4.1. 策略梯度 Policy Gradients

Gradients can provide a strong learning signal as to how to improve a parameterized policy. However, to compute the expected return $\mathbb{E}[R \mid \pi]$ we need to average over plausible trajectories induced by the current policy parameterization. This averaging requires either deterministic approximations (e.g., linearization) or stochastic approximations via sampling [12]. Deterministic approximations can be only applied in a model-based setting where a model of the underlying transition dynamics is available. In the more common model-free RL setting, a Monte Carlo estimate of the expected return is determined. For gradient-based learning, this Monte Carlo approximation poses a challenge since gradients cannot pass through these samples of a stochastic function. Therefore, we turn to an estimator of the gradient, known in RL as the REINFORCE rule [97]. Intuitively, gradient ascent using the estimator increases the log probability of the sampled action, weighted by the return. More formally, the REINFORCE rule can be used to compute the gradient of an expectation over a function $f$ of a random variable $X$ with respect to parameters $\theta$:

$$\nabla_\theta\mathbb{E}_X[f(X;\theta)]=\mathbb{E}_X[f(X;\theta)\nabla_\theta\log p(X)]$$

梯度可以为如何改进参数化策略提供一个强大的学习信号。然而,为了计算期望回报 $\mathbb{E}[R \mid \pi]$,我们需要对当前策略参数化所产生的合理轨迹进行平均。这种平均需要确定性的近似(例如线性化)或通过采样的随机近似[12]。确定性近似只能应用于基于模型的设置,其中底层转移动态的模型是可用的。在更常见的无模型RL环境下,使用的是期望回报的蒙特卡罗估计。对于基于梯度的学习,这种蒙特卡罗近似提出了一个挑战,因为梯度无法穿过随机函数的这些样本进行传播。因此,我们求助于梯度的一个估计量,在RL中称为REINFORCE规则[97]。直观地说,使用该估计量的梯度上升会增加被采样动作的对数概率,并由回报加权。更正式地说,REINFORCE规则可用于计算随机变量 $X$ 的函数 $f$ 的期望关于参数 $\theta$ 的梯度:

$$\nabla_\theta\mathbb{E}_X[f(X;\theta)]=\mathbb{E}_X[f(X;\theta)\nabla_\theta\log p(X)]$$
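The estimator can be checked numerically on a case where the exact gradient is known. The sketch assumes $X \sim \mathrm{Bernoulli}(\sigma(\theta))$ and $f(X)=X$, so that $\mathbb{E}[f]=\sigma(\theta)$ and the true gradient is $\sigma(\theta)(1-\sigma(\theta))$; the setup and names are chosen here for illustration.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def reinforce_gradient(theta, f, n_samples, rng):
    """Monte Carlo estimate of d/dtheta E_X[f(X)], X ~ Bernoulli(sigmoid(theta))."""
    p = sigmoid(theta)
    total = 0.0
    for _ in range(n_samples):
        x = 1 if rng.random() < p else 0
        score = x - p  # d/dtheta log p(x; theta) for a Bernoulli with logit theta
        total += f(x) * score
    return total / n_samples


estimate = reinforce_gradient(0.0, lambda x: float(x), 20000, random.Random(0))
exact = sigmoid(0.0) * (1.0 - sigmoid(0.0))  # 0.25
```

The sampling step itself is never differentiated; only the log probability of the sampled outcome is, which is what lets the gradient be estimated from samples alone.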

As this computation relies on the empirical return of a trajectory, the resulting gradients possess a high variance. By introducing unbiased estimates that are less noisy, it is possible to reduce the variance. The general methodology for performing this is to subtract a baseline, which means weighting updates by an advantage rather than the pure return. The simplest baseline is the average return taken over several episodes [97], but there are many more options available [71].

由于这种计算依赖于轨迹的经验回报,因此得到的梯度具有高方差。通过引入噪声较小的无偏估计,有可能减少方差。执行这一操作的一般方法是减去一个基线,这意味着以优势而不是纯回报来加权更新。最简单的基线是几个episode的平均回报[97],但还有更多的选择[71]。
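The effect of a baseline can be seen on a small made-up example in which the return has a large constant offset. Each sample is a (return, score) pair for a Bernoulli policy with success probability 0.5; the numbers are assumptions, and the baseline is simply the average sampled return.

```python
import random
import statistics


def gradient_terms(returns_and_scores, baseline=0.0):
    # Per-sample REINFORCE terms: (return - baseline) * score
    return [(ret - baseline) * score for ret, score in returns_and_scores]


rng = random.Random(0)
p = 0.5
samples = []
for _ in range(10000):
    x = 1 if rng.random() < p else 0
    samples.append((10.0 + x, x - p))  # hypothetical return 10 + x, score x - p

plain = gradient_terms(samples)
baseline = statistics.fmean([ret for ret, _ in samples])
with_baseline = gradient_terms(samples, baseline)
```

Both lists average to roughly the same gradient (0.25 in this setup), but the terms in `with_baseline` are spread far less widely, so fewer samples are needed for a usable estimate.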

3.4.2. 演员-评论家算法 Actor-Critic Methods

It is possible to combine value functions with an explicit representation of the policy, resulting in actor-critic methods, as shown in Figure 4. The “actor” (policy) learns by using feedback from the “critic” (value function). In doing so, these methods trade off variance reduction of policy gradients with bias introduction from value function methods [32], [71].

可以将值函数与策略的显式表示相结合,从而产生演员-评论家算法,如图4所示。演员(即策略)通过评论家(即价值函数)的反馈进行学习。在此过程中,这种方法在降低策略梯度的方差与引入价值函数方法的偏差之间进行权衡[32、71]。

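Figure 4. The actor-critic setup. The actor (policy) receives a state from the environment and chooses an action to perform. At the same time, the critic (value function) receives the state and reward resulting from the previous interaction. The critic uses the TD error calculated from this information to update itself and the actor. (Figure recreated based on [78].)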

Actor-critic methods use the value function as a baseline for policy gradients, such that the only fundamental difference between actor-critic methods and other baseline methods is that actor-critic methods utilize a learned value function. For this reason, we will later discuss actor-critic methods as a subset of policy gradient methods.

演员-评论家方法使用价值函数作为策略梯度的基线,这样演员-评论家方法和其他基线方法之间的唯一根本区别是演员-评论家方法利用了一个学习到的价值函数。出于这个原因,我们稍后将讨论作为策略梯度方法子集的演员-评论家方法。
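
As a rough illustration of how the critic's TD error drives both updates, here is a minimal tabular sketch. It is an assumption-laden toy rather than any algorithm from the cited works: the critic is a list of state values, the actor is a softmax over per-state action preferences, and `actor_critic_step` and its learning rates are hypothetical names and values.

```python
import math


def softmax(prefs):
    """Numerically stable softmax over a list of action preferences."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def actor_critic_step(values, prefs, s, a, r, s_next, done,
                      gamma=0.99, lr_critic=0.1, lr_actor=0.1):
    """One-step update: the TD error trains the critic and weights the actor."""
    bootstrap = 0.0 if done else values[s_next]
    td_error = r + gamma * bootstrap - values[s]

    probs = softmax(prefs[s])          # policy before the update
    values[s] += lr_critic * td_error  # critic: move V(s) toward the TD target

    # Actor: gradient of log softmax w.r.t. preference b is 1[a == b] - pi(b|s).
    for b in range(len(prefs[s])):
        grad_log_pi = (1.0 if b == a else 0.0) - probs[b]
        prefs[s][b] += lr_actor * td_error * grad_log_pi
    return td_error


values = [0.0, 0.0]                # V(s) for two states
prefs = [[0.0, 0.0], [0.0, 0.0]]   # action preferences per state
delta = actor_critic_step(values, prefs, s=0, a=0, r=1.0, s_next=1, done=True)
print("TD error:", delta)
print("V:", values)
print("pi(.|s=0):", softmax(prefs[0]))
```

A positive TD error raises both the value of the visited state and the probability of the action that was taken, which is the sense in which the learned value function acts as a baseline.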

3.5. 规划和学习 Planning and Learning

Given a model of the environment, it is possible to use dynamic programming over all possible actions [Figure 3(a)], sample trajectories for heuristic search (as was done by AlphaGo [73]), or even perform an exhaustive search [Figure 3(b)]. Sutton and Barto [78] define planning as any method that utilizes a model to produce or improve a policy. This includes distribution models, which include $\mathcal{T}$ and $\mathcal{R}$, and sample models, from which only samples of transitions can be drawn.

给定一个环境模型,可以对所有可能的动作使用动态规划[图3(a)]、为启发式搜索采样轨迹(如AlphaGo所做的[73]),甚至可以执行穷举搜索[图3(b)]。Sutton和Barto将规划定义为任何利用模型来产生或改进策略的方法[78]。这包括分布模型(其中包含 $\mathcal{T}$ 和 $\mathcal{R}$)以及采样模型,从后者中只能抽取转移的样本。

In RL, we focus on learning without access to the underlying model of the environment. However, interactions with the environment could be used to learn value functions, policies, and also a model. Model-free RL methods learn directly from interactions with the environment, but model-based RL methods can simulate transitions using the learned model, resulting in increased sample efficiency. This is particularly important in domains where each interaction with the environment is expensive. However, learning a model introduces extra complexities, and there is always the danger of suffering from model errors, which in turn affects the learned policy. Although deep neural networks can potentially produce very complex and rich models [14], [55], [75], sometimes simpler, more data-efficient methods are preferable [19]. These considerations also play a role in actor-critic methods with learned value functions [32], [71].

在RL中,我们专注于学习,而不需要访问环境的底层模型。然而,与环境的交互可以用来学习价值函数、策略和模型。无模型的RL方法直接从与环境的交互中学习,但基于模型的RL方法可以使用所学的模型模拟转换,从而提高了样本效率。在与环境的每次交互都需要付出极高代价的领域中,这一点尤其重要。然而,学习一个模型会带来额外的复杂性,而且总是有遭遇模型错误的风险,而这反过来又会影响所学习的策略。虽然深度神经网络可能产生非常复杂和丰富的模型[14、55、75],但有时更简单、数据效率更高的方法更可取[19]。这些考虑也在具有学习值函数[32]的演员-评论家方法中发挥作用[71]。

3.6. 深度强化学习的兴起 The Rise of DRL

Many of the successes in DRL have been based on scaling up prior work in RL to high-dimensional problems. This is due to the learning of low-dimensional feature representations and the powerful function approximation properties of neural networks. By means of representation learning, DRL can deal efficiently with the curse of dimensionality, unlike tabular and traditional nonparametric methods [7]. For instance, convolutional neural networks (CNNs) can be used as components of RL agents, allowing them to learn directly from raw, high-dimensional visual inputs. In general, DRL is based on training deep neural networks to approximate the optimal policy $\pi^*$ and/or the optimal value functions $V^*$, $Q^*$, and $A^*$.

DRL的许多成功都建立在将RL的已有工作扩展到高维问题的基础上。这得益于低维特征表示的学习和神经网络强大的函数近似特性。与表格方法和传统的非参数方法不同,DRL通过表示学习可以有效地处理维数灾难[7]。例如,卷积神经网络(CNNs)可以用作RL中Agent的组件,使其能够直接从原始的高维视觉输入中学习。一般来说,DRL基于训练深度神经网络来近似最优策略 $\pi^*$ 和/或最优价值函数 $V^*$、$Q^*$ 和 $A^*$。

4. 价值函数 Value Functions

The well-known function approximation properties of neural networks led naturally to the use of deep learning to regress functions for use in RL agents. Indeed, one of the earliest success stories in RL is TD-Gammon, a neural network that reached expert-level performance in backgammon in the early 1990s [81]. Using TD methods, the network took in the state of the board to predict the probability of black or white winning. Although this simple idea has been echoed in later work [73], progress in RL research has favored the explicit use of value functions, which can capture the structure underlying the environment. From early value function methods in DRL, which took simple states as input [61], current methods are now able to tackle visually and conceptually complex environments [47], [48], [70], [100].

众所周知的神经网络的函数近似特性自然导致使用深度学习回归函数用于RL的Agent。事实上,RL中最早的成功案例之一是TD-Gammon,这是一种神经网络,在20世纪90年代早期在双陆棋中达到了专家级别的性能[81]。该网络利用TD方法,在棋盘的状态下预测黑棋赢或白棋赢的概率。虽然这个简单的想法在后来的研究中得到了呼应[73],但RL研究的进展倾向于明确使用价值函数,这可以捕捉环境的底层结构。从早期的以简单状态作为输入的DRL值函数方法[61],现在的方法能够处理视觉上和概念上复杂的环境[47、48、70、100]。

4.1. 函数估计和深度Q学习 Function Approximation and the DQN

We begin our survey of value-function-based DRL algorithms with the DQN [47], illustrated in Figure 5, which achieved scores across a wide range of classic Atari 2600 video games [5] that were comparable to that of a professional video games tester. The inputs to the DQN are four gray-scale frames of the game, concatenated over time, which are initially processed by several convolutional layers to extract spatiotemporal features, such as the movement of the ball in Pong or Breakout. The final feature map from the convolutional layers is processed by several fully connected layers, which more implicitly encode the effects of actions. This contrasts with more traditional controllers that use fixed preprocessing steps, which are therefore unable to adapt their processing of the state in response to the learning signal.

我们从图5所示的DQN[47]开始调查基于价值函数的DRL算法,它在经典的Atari 2600电子游戏中获得了与专业电子游戏测评员相当的分数[5]。DQN的输入是游戏的四个灰度帧,随着时间的推移连接在一起,最初由几个卷积层进行处理,以提取时空特征,例如乒乓球或突破球的运动。卷积层的最终特征图由几个全连接层处理,这些层更隐式地编码动作的效果。这与使用固定预处理步骤的传统控制器形成了鲜明的对比,后者因此无法适应其状态处理以响应学习信号。


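Figure 5. The DQN [47]. The network takes the state (a stack of gray-scale frames from the video game) and processes it with convolutional and fully connected layers, with ReLU nonlinearities between each layer. At the final layer, the network outputs a discrete action, which corresponds to one of the possible control inputs for the game. Given the current state and chosen action, the game returns a new score. The DQN uses the reward (the difference between the new score and the previous one) to learn from its decision. More precisely, the reward is used to update its estimate of the Q-value, and the error between its previous estimate and its new estimate is backpropagated through the network.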

A forerunner of the DQN—neural-fitted Q (NFQ) iteration—involved training a neural network to return the Q-value given a state-action pair [61]. NFQ was later extended to train a network to drive a slot car using raw visual inputs from a camera over the race track, by combining a deep autoencoder to reduce the dimensionality of the inputs with a separate branch to predict Q-values [38]. Although the previous network could have been trained for both reconstruction and RL tasks simultaneously, it was both more reliable and computationally efficient to train the two parts of the network sequentially.

DQN的前身是神经拟合Q(neural-fitted Q, NFQ)迭代训练神经网络,以返回给定状态-动作对的Q值[61]。随后,NFQ得到扩展,其训练的网络用来驾驶一辆槽车,使用赛道上的摄像机作为原始视觉输入,通过结合一个深度自动编码器来减少输入的维度,并使用一个单独的分支来预测Q值[38]。虽然之前的网络可以同时训练重构和RL任务,但顺序训练这两个部分的网络更可靠,计算效率更高。

The DQN [47] is closely related to the model proposed by Lange et al. [38] but was the first RL algorithm that was demonstrated to work directly from raw visual inputs and on a wide variety of environments. It was designed such that the final fully connected layer outputs $Q^\pi(s, \cdot)$ for all action values in a discrete set of actions (in this case, the various directions of the joystick and the fire button). This not only enables the best action, $\arg\max_{a} Q^\pi(s,a)$, to be chosen after a single forward pass of the network, but also allows the network to more easily encode action-independent knowledge in the lower, convolutional layers. With merely the goal of maximizing its score on a video game, the DQN learns to extract salient visual features, jointly encoding objects, their movements, and, most importantly, their interactions. Using techniques originally developed for explaining the behavior of CNNs in object recognition tasks, we can also inspect what parts of its view the agent considers important (see Figure 6).

DQN[47]与Lange等人提出的模型[38]密切相关,但它是第一个被证明可以直接从原始视觉输入出发、并在多种环境中工作的RL算法。在设计上,最终的全连接层为一组离散动作(在本例中是操纵杆的各个方向和射击按钮)中的所有动作值输出 $Q^\pi(s, \cdot)$。这不仅允许在一次网络前向传递后选择最佳动作 $\arg\max_{a} Q^\pi(s,a)$,而且允许网络在较低的卷积层中更容易编码与动作无关的知识。仅以在电子游戏中获得最高分数为目标,DQN就学会了提取显著的视觉特征,联合编码对象、它们的运动,以及最重要的它们之间的交互。使用最初为解释CNN在对象识别任务中的行为而开发的技术,我们还可以检查Agent认为其视野中的哪些部分是重要的(参见图6)。

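Figure 6. A saliency map of a trained DQN [47] playing Space Invaders [5]. By backpropagating the training signal to the image space, it is possible to see what a neural-network-based agent is attending to. In this frame, the most salient points (shown with the red overlay) are the laser that the agent recently fired and the enemy that it anticipates hitting in a few time steps.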

The true underlying state of the game is contained within 128 bytes of Atari 2600 random-access memory. However, the DQN was designed to directly learn from visual inputs (210×160 pixel 8-bit RGB images), which it takes as the state $s$. It is impractical to represent $Q^\pi(s, a)$ exactly as a lookup table: each of the $3 \times 210 \times 160$ color channel values can take 256 levels, so when combined with 18 possible actions, we obtain a Q-table of size $|S| \times |A| = 18 \times 256^{3 \times 210 \times 160}$. Even if it were feasible to create such a table, it would be sparsely populated, and information gained from one state-action pair cannot be propagated to other state-action pairs. The strength of the DQN lies in its ability to compactly represent both high-dimensional observations and the Q-function using deep neural networks. Without this ability, tackling the discrete Atari domain from raw visual inputs would be impractical.

游戏真正的底层状态包含在128字节的Atari 2600随机存取内存中。而DQN的设计是直接从视觉输入(210×160像素8位RGB图像)中学习,并将其作为状态 $s$。将 $Q^\pi(s, a)$ 精确地表示为一个查找表是不切实际的:因为当结合18种可能的动作时,我们将得到一个大小为 $|S| \times |A| = 18 \times 256^{3 \times 210 \times 160}$ 的Q表格。即使创建这样一个Q表格是可行的,它也会是稀疏的,并且从一个状态-动作对获得的信息不能传播到其他状态-动作对。DQN的优点在于它能够使用深度神经网络紧凑地表示高维观测和Q函数。如果没有这种能力,从原始的视觉输入中处理离散的Atari域是不切实际的。
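
The size of such a lookup table can be checked with a few lines of arithmetic. The sketch below only restates the figures from the paragraph above (18 actions, 8-bit RGB frames of 210×160 pixels); the variable names are illustrative.

```python
import math

num_actions = 18
levels_per_channel = 256          # 8-bit color depth
channels, height, width = 3, 210, 160

# Number of independent channel values in one frame.
values_per_frame = channels * height * width

# |S| = 256 ** values_per_frame is far too large to print, so work in logs.
log10_states = values_per_frame * math.log10(levels_per_channel)
log10_entries = log10_states + math.log10(num_actions)

# Python integers are exact, so the bit length can be taken directly.
table_entries = num_actions * levels_per_channel ** values_per_frame

print("channel values per frame:", values_per_frame)
print("log10 of Q-table entries:", log10_entries)
print("bits needed just to index the table:", table_entries.bit_length())
```

The entry count has more than 240,000 decimal digits, which is why a tabular representation is ruled out regardless of available memory.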

The DQN addressed the fundamental instability problem of using function approximation in RL [83] by the use of two techniques: experience replay [45] and target networks. Experience replay memory stores transitions of the form $(s_t, a_t, s_{t+1}, r_{t+1})$ in a cyclic buffer, enabling the RL agent to sample from and train on previously observed data offline. Not only does this massively reduce the number of interactions needed with the environment, but batches of experience can be sampled, reducing the variance of learning updates. Furthermore, by sampling uniformly from a large memory, the temporal correlations that can adversely affect RL algorithms are broken. Finally, from a practical perspective, batches of data can be efficiently processed in parallel by modern hardware, increasing throughput. While the original DQN algorithm used uniform sampling [47], later work showed that prioritizing samples based on TD errors is more effective for learning [67].

DQN通过使用两种技术:经验重放[45]和目标网络,解决了RL中使用函数近似的基本不稳定问题[83]。经验重放存储器将 $(s_t, a_t, s_{t+1}, r_{t+1})$ 形式的转移存储在循环缓冲区中,使RL的Agent能够离线地对先前观察到的数据进行采样和训练。这不仅大大减少了所需的与环境的交互次数,而且可以对经验进行批量采样,减少学习更新的方差。此外,通过从大容量存储器中均匀采样,可以打破对RL算法产生不利影响的时间相关性。最后,从实用的角度来看,现代硬件可以高效地并行处理成批的数据,从而提高吞吐量。虽然原来的DQN算法使用的是均匀采样[47],但后来的研究表明,基于TD误差对样本设定优先级对学习更有效[67]。
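
A cyclic replay memory with uniform sampling can be sketched in a few lines. This is an illustrative toy, not the DQN implementation: the class name `ReplayBuffer`, its methods, and the tiny capacity are assumptions made for the example.

```python
import random


class ReplayBuffer:
    """Fixed-capacity cyclic buffer of (s, a, s_next, r) transitions."""

    def __init__(self, capacity, seed=None):
        self.capacity = capacity
        self.storage = []
        self.next_index = 0            # slot that the next write goes to
        self.rng = random.Random(seed)

    def add(self, state, action, next_state, reward):
        transition = (state, action, next_state, reward)
        if len(self.storage) < self.capacity:
            self.storage.append(transition)
        else:
            # Buffer is full: overwrite the oldest transition.
            self.storage[self.next_index] = transition
        self.next_index = (self.next_index + 1) % self.capacity

    def sample(self, batch_size):
        """Uniform sampling without replacement breaks temporal ordering."""
        return self.rng.sample(self.storage, batch_size)

    def __len__(self):
        return len(self.storage)


buffer = ReplayBuffer(capacity=3, seed=0)
for t in range(5):
    buffer.add(state=t, action=0, next_state=t + 1, reward=1.0)

# Only the three most recent transitions remain after five writes.
print(sorted(tr[0] for tr in buffer.storage))
print(buffer.sample(2))
```

Prioritized replay [67] would replace the uniform `sample` with a draw weighted by TD error; that variant is not shown here.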

The second stabilizing method, introduced by Mnih et al. [47], is the use of a target network that initially contains the weights of the network enacting the policy but is kept frozen for a large period of time. Rather than having to calculate the TD error based on its own rapidly fluctuating estimates of the Q-values, the policy network uses the fixed target network. During training, the weights of the target network are updated to match the policy network after a fixed number of steps. Both experience replay and target networks have gone on to be used in subsequent DRL works [19], [44], [50], [93].

第二种稳定方法是由Mnih等人引入的[47],即使用一个目标网络,该网络最初包含执行策略的网络的权值,但会在很长一段时间内保持冻结。策略网络不必基于其自身快速波动的Q值估计来计算TD误差,而是使用固定的目标网络。在训练过程中,每经过固定的步数,目标网络的权值就被更新为与策略网络一致。经验重放和目标网络都已经在后续的DRL工作中使用[19],[44],[50],[93]。
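
The interplay between the frozen target and the periodically synchronized weights can be sketched with tables standing in for the two networks. Everything here is an illustrative assumption (the class name `QLearnerWithTarget`, the tabular representation, the step sizes); the point is only where the target values come from and when they are refreshed.

```python
import copy


class QLearnerWithTarget:
    """Tabular stand-in for a policy (online) network and a target network."""

    def __init__(self, num_states, num_actions, sync_every,
                 gamma=0.99, lr=0.1):
        self.online = [[0.0] * num_actions for _ in range(num_states)]
        self.target = copy.deepcopy(self.online)  # starts as an exact copy
        self.sync_every = sync_every
        self.gamma = gamma
        self.lr = lr
        self.steps = 0

    def update(self, s, a, s_next, r, done):
        # The bootstrap term reads the frozen target, not the online estimate.
        bootstrap = 0.0 if done else max(self.target[s_next])
        td_error = r + self.gamma * bootstrap - self.online[s][a]
        self.online[s][a] += self.lr * td_error

        self.steps += 1
        if self.steps % self.sync_every == 0:
            # Periodic hard update: copy the online values into the target.
            self.target = copy.deepcopy(self.online)
        return td_error


agent = QLearnerWithTarget(num_states=2, num_actions=2, sync_every=2)
first = agent.update(s=0, a=0, s_next=1, r=1.0, done=False)
frozen = agent.target[0][0]           # still the initial value after one step
second = agent.update(s=1, a=0, s_next=0, r=0.0, done=False)
print(first, frozen, second, agent.target[0][0])
```

In the second update the bootstrap value for state 0 is still the old one, even though the online estimate has already moved; only after the synchronization step does the target see the change.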

4.2. Q函数的修改 Q-Function Modifications

Considering that one of the key components of the DQN is a function approximator for the Q-function, it can benefit from fundamental advances in RL. In [86], van Hasselt showed that the single estimator used in the Q-learning update rule overestimates the expected return due to the use of the maximum action value as an approximation of the maximum expected action value. Double-Q learning provides a better estimate through the use of a double estimator [86]. While double-Q learning requires an additional function to be learned, later work proposed using the already available target network from the DQN algorithm, resulting in significantly better results with only a small change in the update step [87].

考虑到DQN的一个关键组成部分是Q函数的函数近似器,它可以从RL的基本进展中受益。在[86]中,van Hasselt证明了Q学习更新规则中使用的单一估计器会高估预期回报,原因是它用最大动作值来近似最大期望动作值。双Q学习通过使用双估计器提供了更好的估计[86]。虽然双Q学习需要学习一个额外的函数,但后续的工作提出使用DQN算法中已有的目标网络,只需对更新步骤做很小的改动就得到了明显更好的结果[87]。
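
The change to the update step can be shown by comparing the two bootstrap targets side by side. The sketch assumes the common reading of [87]: the online network selects the next action and the target network evaluates it. The function names and the example values are illustrative.

```python
def q_learning_target(reward, gamma, q_target_next):
    """Standard target: the same estimates both select and evaluate the action."""
    return reward + gamma * max(q_target_next)


def double_q_target(reward, gamma, q_online_next, q_target_next):
    """Double-Q style target: select with one estimator, evaluate with the other."""
    best_action = max(range(len(q_online_next)), key=q_online_next.__getitem__)
    return reward + gamma * q_target_next[best_action]


# Hypothetical next-state action values from the two estimators.
q_online_next = [1.0, 2.0, 0.5]
q_target_next = [1.5, 1.0, 3.0]

standard = q_learning_target(1.0, 0.5, q_target_next)
double = double_q_target(1.0, 0.5, q_online_next, q_target_next)
print("standard target:", standard)
print("double-Q target:", double)
```

When the two estimators disagree about which action is best, the double estimator avoids picking up the largest (and possibly noisiest) value, which is the source of the reduced overestimation.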

Yet another way to adjust the DQN architecture is to decompose the Q-function into meaningful functions, such as constructing $Q^\pi$ by adding together separate layers that compute the state-value function $V^\pi$ and advantage function $A^\pi$ [92]. Rather than having to come up with accurate Q-values for all actions, the duelling DQN [92] benefits from a single baseline for the state in the form of $V^\pi$ and easier-to-learn relative values in the form of $A^\pi$. The combination of the duelling DQN with prioritized experience replay [67] is one of the state-of-the-art techniques in discrete action settings. Further insight into the properties of $A^\pi$ by Gu et al. [19] led them to modify the DQN with a convex advantage layer that extended the algorithm to work over sets of continuous actions, creating the normalized advantage function (NAF) algorithm. Benefiting from experience replay, target networks, and advantage updates, NAF is one of several state-of-the-art techniques in continuous control problems [19].

另一种调整DQN架构的方法是将Q函数分解为有意义的函数,例如通过将计算状态-价值函数 $V^\pi$ 和优势函数 $A^\pi$ 的不同层相加来构造 $Q^\pi$ [92]。对抗DQN(duelling DQN)[92]不必为所有动作给出精确的Q值,而是受益于以 $V^\pi$ 形式给出的单一状态基线和以 $A^\pi$ 形式给出的更容易学习的相对值。对抗DQN与优先经验重放[67]的结合是离散动作设置中最先进的技术之一。Gu等人[19]对 $A^\pi$ 性质的进一步研究促使他们使用凸优势层修改DQN,将算法扩展到连续动作集合上,得到了归一化优势函数(NAF)算法。得益于经验重放、目标网络和优势更新,NAF是连续控制问题中几种最先进的技术之一[19]。
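
The decomposition can be illustrated with the output layer alone. The survey only says the two streams are added together; subtracting the mean advantage, as done below, is an assumption borrowed from common duelling implementations so that the split between $V^\pi$ and $A^\pi$ is identifiable. The function name is hypothetical.

```python
def duelling_q_values(state_value, advantages):
    """Combine a scalar V(s) with per-action advantages into Q(s, a).

    The mean advantage is subtracted (an assumed convention) so that adding
    a constant to every advantage does not change the resulting Q-values.
    """
    mean_advantage = sum(advantages) / len(advantages)
    return [state_value + adv - mean_advantage for adv in advantages]


q_values = duelling_q_values(2.0, [1.0, 0.0, -1.0])
shifted = duelling_q_values(2.0, [6.0, 5.0, 4.0])  # same advantages plus 5
print(q_values)
print(shifted)
```

Both calls produce the same Q-values, so the value stream alone carries the overall level of the state while the advantage stream only has to rank the actions.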

5. 策略搜索 Policy Search

Policy search methods aim to directly find policies by means of gradient-free or gradient-based methods. Prior to the current surge of interest in DRL, several successful methods in DRL eschewed the commonly used back propagation algorithm in favor of evolutionary algorithms [17], [33], which are gradient-free policy search algorithms. Evolutionary methods rely on evaluating the performance of a population of agents. Hence, they are expensive for large populations or agents with many parameters. However, as black-box optimization methods, they can be used to optimize arbitrary, non-differentiable models and naturally allow for more exploration in the parameter space. In combination with a compressed representation of neural network weights, evolutionary algorithms can even be used to train large networks; such a technique resulted in the first deep neural network to learn an RL task, straight from high-dimensional visual inputs [33]. Recent work has reignited interest in evolutionary methods for RL as they can potentially be distributed at larger scales than techniques that rely on gradients [65].

策略搜索方法的目的是通过无梯度或基于梯度的方法直接找到策略。在当前DRL研究热潮兴起之前,DRL中有几种成功的方法避开了常用的反向传播算法,而选择了进化算法[17]、[33],这是一类无梯度策略搜索算法。进化方法依赖于评估一个Agent种群的表现。因此,对于大规模种群或参数很多的Agent来说,它们的代价很高。然而,作为黑箱优化方法,它们可以用于优化任意的、不可微的模型,并且自然允许在参数空间中进行更多的探索。与神经网络权重的压缩表示相结合后,进化算法甚至可以用于训练大型网络;这样的技术产生了第一个直接从高维视觉输入学习RL任务的深度神经网络[33]。最近的研究重新点燃了人们对RL进化方法的兴趣,因为与依赖梯度的技术相比,它们有可能在更大的规模上进行分布式计算[65]。

5.1. 通过随机函数的反向传播 Backpropagation Through Stochastic Functions

The workhorse of DRL, however, remains backpropagation. The previously discussed REINFORCE rule [97] allows neural networks to learn stochastic policies in a task-dependent manner, such as deciding where to look in an image to track [69] or caption [99] objects. In these cases, the stochastic variable would determine the coordinates of a small crop of the image and hence reduce the amount of computation needed. This usage of RL to make discrete, stochastic decisions over inputs is known in the deep-learning literature as hard attention and is one of the more compelling uses of basic policy search methods in recent years, having many applications outside of traditional RL domains.

然而,反向传播依然是DRL的主力。前面讨论的REINFORCE规则[97]允许神经网络以任务相关的方式学习随机策略,例如决定在图像中看向何处以跟踪对象[69]或为对象生成描述[99]。在这些情况下,随机变量将决定图像中一小块裁剪区域的坐标,从而减少所需的计算量。在深度学习文献中,这种使用RL对输入做出离散的、随机的决策的做法被称为硬注意力(hard attention),是近年来基本策略搜索方法最引人注目的用途之一,在传统的RL领域之外有许多应用。

5.2. 复合误差 Compounding Errors

Searching directly for a policy represented by a neural network with very many parameters can be difficult and suffer from severe local minima. One way around this is to use guided policy search (GPS), which takes a few sequences of actions from another controller (which could be constructed using a separate method, such as optimal control). GPS learns from them by using supervised learning in combination with importance sampling, which corrects for off-policy samples [40]. This approach effectively biases the search toward a good (local) optimum. GPS works in a loop, by optimizing policies to match sampled trajectories and optimizing trajectory distributions to match the policy and minimize costs. Levine et al. [41] showed that it was possible to train visuo-motor policies for a robot “end to end,” straight from the RGB pixels of the camera to motor torques, and, hence, provide one of the seminal works in DRL.

直接搜索由一个参数非常多的神经网络所代表的策略可能是很困难的,并且会遇到严重的局部极小值。解决这一问题的一种方法是使用引导策略搜索(GPS),它从另一个控制器(可以使用单独的方法,如最优控制)中获取一些序列的动作。GPS通过将监督学习与重要性采样相结合来学习它们,从而修正了离策略样本[40]。这种方法有效地使搜索趋向于良好的(局部的)最优。GPS的工作方式是循环的,通过优化策略来匹配采样轨迹,并优化轨迹分布来匹配策略并使代价最小化。Levine等人的研究[41]表明,训练机器人的“端到端”视觉运动策略是可能的,直接从相机的RGB像素到运动力矩,因此,这在DRL领域提出了一项开创性工作。

A more commonly used method is to use a trust region, in which optimization steps are restricted to lie within a region where the approximation of the true cost function still holds. By preventing updated policies from deviating too wildly from previous policies, the chance of a catastrophically bad update is lessened, and many algorithms that use trust regions guarantee or practically result in monotonic improvement in policy performance. The idea of constraining each policy gradient update, as measured by the Kullback-Leibler (KL) divergence between the current and proposed policy, has a long history in RL [28]. One of the newer algorithms in this line of work, TRPO, has been shown to be relatively robust and applicable to domains with high-dimensional inputs [70]. To achieve this, TRPO optimizes a surrogate objective function: specifically, it optimizes an (importance sampled) advantage estimate, constrained using a quadratic approximation of the KL divergence. While TRPO can be used as a pure policy gradient method with a simple baseline, later work by Schulman et al. [71] introduced generalized advantage estimation (GAE), which proposed several more advanced variance reduction baselines. The combination of TRPO and GAE remains one of the state-of-the-art RL techniques in continuous control.

一种更常用的方法是使用信赖域,即把优化步骤限制在真实代价函数的近似仍然成立的区域内。通过防止更新后的策略与之前的策略偏离过大,可以降低出现灾难性糟糕更新的可能性,并且许多使用信赖域的算法保证或在实践中带来策略性能的单调改进。用当前策略和候选策略之间的Kullback-Leibler(KL)散度来衡量并约束每次策略梯度更新的思想在RL中有很长的历史[28]。这一研究方向中较新的算法之一是信赖域策略优化(TRPO),它已被证明相对稳健,并适用于具有高维输入的领域[70]。为了实现这一点,TRPO优化了一个替代目标函数,具体来说,它优化了(经过重要性采样的)优势估计,并使用KL散度的二次近似进行约束。虽然TRPO可以作为一种带有简单基线的纯策略梯度方法,但Schulman等人[71]后来的工作引入了广义优势估计(generalized advantage estimation, GAE),提出了几种更先进的方差缩减基线。TRPO和GAE的结合仍然是连续控制中最先进的RL技术之一。

5.3. 演员-评论家算法 Actor-Critic Methods

Actor-critic approaches have grown in popularity as an effective means of combining the benefits of policy search methods with learned value functions, which are able to learn from full returns and/or TD errors. They can benefit from improvements in both policy gradient methods, such as GAE [71], and value function methods, such as target networks [47]. In the last few years, DRL actor-critic methods have been scaled up from learning simulated physics tasks [22], [44] to real robotic visual navigation tasks [100], directly from image pixels.

作为一种将策略搜索方法的好处与学习到的价值函数相结合的有效手段,Actor-Critic方法越来越受欢迎。学习到的价值函数能够从全反馈或TD误差中学习。他们可以从策略梯度方法(如GAE[71])和价值函数方法(如目标网络[47])的改进中获益。在过去的几年中,DRL的演员-评论家方法已经从学习模拟物理任务[22],[44]扩展到了直接从图像像素进行的真实机器人视觉导航任务[100]。

One recent development in the context of actor-critic algorithms is deterministic policy gradients (DPGs) [72], which extend the standard policy gradient theorems for stochastic policies [97] to deterministic policies. One of the major advantages of DPGs is that, while stochastic policy gradients integrate over both state and action spaces, DPGs only integrate over the state space, requiring fewer samples in problems with large action spaces. In the initial work on DPGs, Silver et al. [72] introduced and demonstrated an off-policy actor-critic algorithm that vastly improved upon a stochastic policy gradient equivalent in high-dimensional continuous control problems. Later work introduced deep DPG, which utilized neural networks to operate on high-dimensional, visual state spaces [44]. In the same vein as DPGs, Heess et al. [22] devised a method for calculating gradients to optimize stochastic policies by “reparameterizing” [30], [60] the stochasticity away from the network, thereby allowing standard gradients to be used (instead of the high-variance REINFORCE estimator [97]). The resulting stochastic value gradient (SVG) methods are flexible and can be used both with (SVG(0) and SVG(1)) and without (SVG($\infty$)) value function critics, and with (SVG($\infty$) and SVG(1)) and without (SVG(0)) models. Later work proceeded to integrate DPGs and SVGs with RNNs, allowing them to solve continuous control problems in POMDPs, learning directly from pixels [21]. Together, DPGs and SVGs can be considered algorithmic approaches for improving learning efficiency in DRL.

演员-评论家算法的一个最新成果是确定性策略梯度(DPG)[72],它将随机策略的标准策略梯度定理[97]扩展到确定性策略。DPG的一个主要优点是,随机策略梯度需要在状态空间和动作空间上同时积分,而DPG只在状态空间上积分,因此在动作空间较大的问题中需要的样本更少。在关于DPG的最初工作中,Silver等人[72]介绍并演示了一种离策略的演员-评论家算法,该算法在高维连续控制问题中大大优于对应的随机策略梯度算法。后来的研究引入了深度DPG(DDPG),它利用神经网络在高维视觉状态空间上进行操作[44]。与DPG一脉相承,Heess等人[22]设计了一种计算梯度以优化随机策略的方法,通过“重参数化”[30],[60]将随机性移到网络之外,从而允许使用标准梯度(而不是高方差的REINFORCE估计器[97])。由此得到的随机值梯度(SVG)方法很灵活,既可以带有(SVG(0)和SVG(1))也可以不带(SVG($\infty$))价值函数评论家,既可以带有(SVG($\infty$)和SVG(1))也可以不带(SVG(0))模型。随后的工作继续将DPG和SVG与RNN结合,使它们能够解决POMDP中的连续控制问题,并直接从像素进行学习[21]。总之,DPG和SVG可以看作是提高DRL学习效率的算法途径。
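
The benefit of reparameterization can be seen on a one-dimensional toy problem that is not from the survey: estimating $\frac{d}{d\mu}\mathbb{E}[x^2]$ for $x \sim \mathcal{N}(\mu, \sigma^2)$, whose exact value is $2\mu$. The sketch compares the score-function (REINFORCE-style) estimator with the pathwise estimator obtained by writing $x = \mu + \sigma\epsilon$. The function names and sample count are assumptions.

```python
import random


def gradient_samples(mu, sigma, n, seed=0):
    """Per-sample estimates of d/dmu E[x^2] using two estimators."""
    rng = random.Random(seed)
    score_fn, reparam = [], []
    for _ in range(n):
        eps = rng.gauss(0.0, 1.0)
        x = mu + sigma * eps  # noise is sampled outside the "network"
        # Score function: f(x) * d/dmu log N(x; mu, sigma^2).
        score_fn.append(x * x * (x - mu) / sigma ** 2)
        # Pathwise: differentiate f(mu + sigma * eps) directly, giving 2x.
        reparam.append(2.0 * x)
    return score_fn, reparam


def mean(xs):
    return sum(xs) / len(xs)


def variance(xs):
    m = mean(xs)
    return sum((v - m) ** 2 for v in xs) / len(xs)


score_fn, reparam = gradient_samples(mu=1.0, sigma=1.0, n=20000)
print("score function: mean", mean(score_fn), "variance", variance(score_fn))
print("reparameterized: mean", mean(reparam), "variance", variance(reparam))
```

Both estimators agree on the mean, but the pathwise one has a much smaller spread in this example because it uses the derivative of the function itself instead of weighting whole outcomes by a log-probability gradient.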

An orthogonal approach to speeding up learning is to exploit parallel computation. By keeping a canonical set of parameters that are read by and updated in an asynchronous fashion by multiple copies of a single network, computation can be efficiently distributed over both processing cores in a single central processing unit (CPU), and across CPUs in a cluster of machines. Using a distributed system, Nair et al. [51] developed a framework for training multiple DQNs in parallel, achieving both better performance and a reduction in training time. However, the simpler asynchronous advantage actor-critic (A3C) algorithm [48], developed for both single and distributed machine settings, has become one of the most popular DRL techniques in recent times. A3C combines advantage updates with the actor-critic formulation and relies on asynchronously updated policy and value function networks trained in parallel over several processing threads. The use of multiple agents, situated in their own, independent environments, not only stabilizes improvements in the parameters, but conveys an additional benefit in allowing for more exploration to occur. A3C has been used as a standard starting point in many subsequent works, including the work of Zhu et al. [100], who applied it to robotic navigation in the real world through visual inputs.

加速学习的一种正交方法是利用并行计算。通过维护一组规范参数,并让单个网络的多个副本以异步方式读取和更新这些参数,计算既可以高效地分布在单个中央处理单元(CPU)的多个处理核心上,也可以分布在机器集群中的多个CPU上。使用分布式系统,Nair等人[51]开发了一个并行训练多个DQN的框架,既得到了更好的性能,又减少了训练时间。然而,针对单机和分布式环境开发的更简单的异步优势演员-评论家算法(A3C)[48],已经成为近来最流行的DRL技术之一。A3C将优势更新与演员-评论家形式相结合,并依赖于在多个处理线程上并行训练、异步更新的策略网络和价值函数网络。让多个Agent处于各自独立的环境中,不仅稳定了参数的改进,而且带来了允许进行更多探索的额外好处。A3C在之后的许多工作中都被用作标准起点,包括Zhu等人的工作[100],他们通过视觉输入将其应用于现实世界中的机器人导航。

There have been several major advancements on the original A3C algorithm that reflect various motivations in the field of DRL. The first is actor-critic with experience replay [93], which adds off-policy bias correction to A3C, allowing it to use experience replay to improve sample complexity. Others have attempted to bridge the gap between value and policy-based RL, utilizing theoretical advancements to improve upon the original A3C [50], [54]. Finally, there is a growing trend toward exploiting auxiliary tasks to improve the representations learned by DRL agents and, hence, improve both the learning speed and final performance of these agents [26], [46].

在最初的A3C算法之上有几项主要的改进,反映了DRL领域的各种动机。第一种是带经验重放的演员-评论家算法[93],它为A3C加入了离策略偏差校正,使其能够使用经验重放来改善样本复杂度。其他方法则试图弥合基于价值的RL和基于策略的RL之间的差距,利用理论进展来改进最初的A3C[50],[54]。最后,利用辅助任务来改善DRL中Agent学到的表示,从而提高这些Agent的学习速度和最终性能的趋势日益明显[26],[46]。

6. 当前的研究和挑战 Current Research and Challenges

To conclude, we will highlight some current areas of research in DRL and the challenges that still remain. Previously, we have focused mainly on model-free methods, but we will now examine a few model-based DRL algorithms in more detail. Model-based RL algorithms play an important role in making RL data efficient and in trading off exploration and exploitation. After tackling exploration strategies, we shall then address hierarchical RL (HRL), which imposes an inductive bias on the final policy by explicitly factorizing it into several levels. When available, trajectories from other controllers can be used to bootstrap the learning process, leading us to imitation learning and inverse RL (IRL). For the final topic, we will look at multiagent systems, which have their own special considerations.

最后,我们将强调目前DRL的一些研究领域和仍然存在的挑战。以前,我们主要关注无模型方法,但是现在我们将更详细地研究一些基于模型的DRL算法。基于模型的RL算法在提高RL数据的效率和平衡挖掘和利用方面发挥着重要作用。在解决探索策略之后,我们将解决分层的RL (HRL),它通过明确地将最终策略分解为几个层次来施加归纳偏置。当其他控制器的轨迹可用时,可用于引导学习过程,引导我们进行模仿学习和逆RL(IRL)。对于最后一个主题,我们将看看多Agent系统,它们有自己的特殊考虑事项。

6.1. 基于模型的强化学习 Model-Based RL

The key idea behind model-based RL is to learn a transition model that allows for simulation of the environment without interacting with the environment directly. Model-based RL does not assume specific prior knowledge. However, in practice, we can incorporate prior knowledge (e.g., physics-based models [29]) to speed up learning. Model learning plays an important role in reducing the number of required interactions with the (real) environment, which may be limited in practice. For example, it is unrealistic to perform millions of experiments with a robot in a reasonable amount of time and without significant hardware wear and tear. There are various approaches to learn predictive models of dynamical systems using pixel information. Based on the deep dynamical model [90], where high-dimensional observations are embedded into a lower-dimensional space using autoencoders, several model-based DRL algorithms have been proposed for learning models and policies from pixel information [55], [91], [95]. If a sufficiently accurate model of the environment can be learned, then even simple controllers can be used to control a robot directly from camera images [14]. Learned models can also be used to guide exploration purely based on simulation of the environment, with deep models allowing these techniques to be scaled up to high-dimensional visual domains [75].

基于模型的RL背后的关键思想是学习一个转移模型,该模型允许在不直接与环境交互的情况下模拟环境。基于模型的RL并不假设特定的先验知识。然而,在实践中,我们可以结合先验知识(如基于物理的模型[29])来加速学习。模型学习在减少与(真实)环境的交互中扮演着重要的角色,这在实际效果可能不尽如人意。例如,在合理的时间内用一个机器人进行数百万次实验而没有明显的硬件磨损是不现实的。有多种方法可以使用像素信息学习动态系统的预测模型。基于深度动力学模型[90],高维观测数据通过自编码器嵌入到低维空间,提出了几种基于模型的DRL算法,用于从像素信息中学习模型和策略[55],[91],[95]。如果能够学习到足够准确的环境模型,那么即使是简单的控制器也可以直接从摄像头图像控制机器人[14]。学习到的模型也可以用于指导纯粹基于环境模拟的探索过程,深度模型允许这些技术扩展到高维视觉领域[75]。

Although deep neural networks can make reasonable predictions in simulated environments over hundreds of time steps [10], they typically require many samples to tune the large number of parameters they contain. Training these models often requires more samples (interaction with the environment) than simpler models. For this reason, Gu et al. [19] train locally linear models for use with the NAF algorithm—the continuous equivalent of the DQN [47]—to improve the algorithm’s sample complexity in the robotic domain where samples are expensive. It seems likely that the usage of deep models in model-based DRL could be massively spurred by general advances in improving the data efficiency of neural networks.

虽然深度神经网络可以在模拟环境中对数百个时间步做出合理的预测[10],但它们通常需要很多样本来调整其包含的大量参数。训练这些模型通常需要比简单模型更多的样本(与环境的交互)。出于这个原因,Gu等人[19]训练局部线性模型并与NAF算法(DQN[47]在连续动作上的对应算法)一起使用,以在样本昂贵的机器人领域中改善算法的样本复杂度。神经网络数据效率方面的普遍进展,很可能会极大地推动深度模型在基于模型的DRL中的应用。

6.2. 探索和利用 Exploration Versus Exploitation

One of the greatest difficulties in RL is the fundamental dilemma of exploration versus exploitation: When should the agent try out (perceived) nonoptimal actions to explore the environment (and potentially improve the model), and when should it exploit the optimal action to make useful progress? Off-policy algorithms, such as the DQN [47], typically use the simple $\epsilon$-greedy exploration policy, which chooses a random action with probability $\epsilon\in[0,1]$, and the optimal action otherwise. By decreasing $\epsilon$ over time, the agent progresses toward exploitation. Although adding independent noise for exploration is usable in continuous control problems, more sophisticated strategies inject noise that is correlated over time (e.g., from stochastic processes) to better preserve momentum [44].

RL中最大的困难之一是探索与利用的基本困境:Agent何时应该尝试(它所认为的)非最优动作来探索环境(并可能改进模型),何时应该利用最优动作来取得有用的进展?离策略算法,如DQN[47],通常使用简单的 $\epsilon$-贪心探索策略,即以概率 $\epsilon\in[0,1]$ 选择随机动作,否则选择最优动作。随着时间的推移逐渐减小 $\epsilon$,Agent将逐步转向利用。虽然在连续控制问题中可以通过添加独立噪声来进行探索,但更复杂的策略会注入在时间上相关的噪声(例如来自随机过程的噪声),以更好地保持动量[44]。
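
An $\epsilon$-greedy policy with a decaying $\epsilon$ can be sketched as follows. The linear schedule and its endpoints are assumptions chosen for illustration (the text only says that $\epsilon$ is decreased over time), and the function names are hypothetical.

```python
import random


def linear_epsilon(step, start=1.0, end=0.1, decay_steps=1000):
    """Linearly anneal epsilon from `start` to `end`, then hold it at `end`."""
    fraction = min(step / decay_steps, 1.0)
    return start + (end - start) * fraction


def epsilon_greedy(q_values, epsilon, rng):
    """Random action with probability epsilon, greedy action otherwise."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)


rng = random.Random(0)
q_values = [0.1, 0.9, 0.3]

for step in (0, 500, 1000, 5000):
    eps = linear_epsilon(step)
    action = epsilon_greedy(q_values, eps, rng)
    print(f"step={step} epsilon={eps:.2f} action={action}")
```

Early in training almost every action is random; once the schedule has run out, the agent mostly follows its current Q-value estimates while still exploring occasionally.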

The observation that temporal correlation is important led Osband et al. [56] to propose the bootstrapped DQN, which maintains several Q-value “heads” that learn different values through a combination of different weight initializations and bootstrapped sampling from experience replay memory. At the beginning of each training episode, a different head is chosen, leading to temporally extended exploration. Usunier et al. [85] later proposed a similar method that performed exploration in policy space by adding noise to a single output head, using zero-order gradient estimates to allow back propagation through the policy.

时间相关性很重要这一观察促使Osband等人[56]提出了自举DQN(bootstrapped DQN),它维护若干个Q值“head”,通过不同的权重初始化和从经验重放存储器中的自举采样相结合来学习不同的值。在每个训练episode开始时选择一个不同的head,从而实现时间上延展的探索。Usunier等人[85]后来提出了一种类似的方法,通过向单个输出head添加噪声在策略空间中进行探索,并使用零阶梯度估计使反向传播能够穿过策略。

One of the main principled exploration strategies is the upper confidence bound (UCB) algorithm, based on the principle of “optimism in the face of uncertainty” [36]. The idea behind UCB is to pick actions that maximize $\mathbb{E}[R]+k\sigma[R]$, where $\sigma[R]$ is the standard deviation of the return and $k>0$. UCB therefore encourages exploration in regions with high uncertainty and moderate expected return. While easily achievable in small tabular cases, the use of powerful density models has allowed this algorithm to scale to high-dimensional visual domains with DRL [4].

主要的有原则的探索策略之一是基于“面对不确定性时保持乐观”原则的置信上界(upper confidence bound, UCB)算法[36]。UCB背后的思想是选择最大化 $\mathbb{E}[R]+k\sigma[R]$ 的动作,其中 $\sigma[R]$ 是回报的标准差,$k>0$。因此,UCB鼓励在不确定性高、预期回报适中的区域进行探索。虽然这在小规模的表格情形下很容易实现,但强大的密度模型的使用使该算法能够借助DRL扩展到高维视觉领域[4]。
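
In the small tabular case, the selection rule $\mathbb{E}[R]+k\sigma[R]$ reduces to a few lines. The sketch below assumes that the mean and standard deviation are estimated from a handful of observed returns per action, which is a simplification for illustration; the function name and the sample data are hypothetical.

```python
import statistics


def ucb_action(returns_per_action, k=1.0):
    """Pick the action that maximizes mean return plus k standard deviations."""
    scores = []
    for returns in returns_per_action:
        expected = statistics.fmean(returns)
        spread = statistics.pstdev(returns)  # empirical stand-in for sigma[R]
        scores.append(expected + k * spread)
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Action 0: reliable return of 1.0. Action 1: lower mean but far more uncertain.
observed = [
    [1.0, 1.0, 1.0, 1.0],
    [0.0, 2.0, 0.0, 1.2],
]

optimistic_choice, _ = ucb_action(observed, k=1.0)
cautious_choice, _ = ucb_action(observed, k=0.1)
print("k=1.0 picks action", optimistic_choice)
print("k=0.1 picks action", cautious_choice)
```

A larger `k` makes the agent favor the uncertain action despite its lower average, while a small `k` lets the well-understood action win; this is the trade-off the bonus term controls.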

UCB can also be considered one way of implementing intrinsic motivation, which is a general concept that advocates decreasing uncertainty/making progress in learning about the environment [68]. There have been several DRL algorithms that try to implement intrinsic motivation via minimizing model prediction error [57], [75] or maximizing information gain [25], [49].

UCB也可以被认为是实现内在动机的一种方式,内在动机是提倡减少不确定性,在学习环境方面取得进展的一般概念[68]。有几种DRL算法试图通过最小化模型预测误差[57],[75]或最大化信息增益[25],[49]来实现内在动机。

6.3. 层次化强化学习 Hierarchical RL

In the same way that deep learning relies on hierarchies of features, HRL relies on hierarchies of policies. Early work in this area introduced options, in which, apart from primitive actions (single time-step actions), policies could also run other policies (multitime-step “actions”) [79]. This approach allows top-level policies to focus on higher-level goals, while subpolicies are responsible for fine control. Several works in DRL have attempted HRL by using one top-level policy that chooses between subpolicies, where the division of states or goals into subpolicies is achieved either manually [1], [34], [82] or automatically [2], [88], [89]. One way to help construct subpolicies is to focus on discovering and reaching goals, which are specific states in the environment; they may often be locations to which an agent should navigate. Whether utilized with HRL or not, the discovery and generalization of goals is also an important area of ongoing research [35], [66], [89].

正如深度学习依赖于特征的层次结构一样,HRL依赖于策略的层次结构。这一领域的早期工作引入了选项(options)的概念:除了原始动作(单时间步动作)之外,策略还可以运行其他策略(多时间步“动作”)[79]。这种方法允许顶层策略关注更高层的目标,而子策略负责精细的控制。DRL中的一些工作尝试了HRL,使用一个顶层策略在子策略之间进行选择,其中状态或目标到子策略的划分可以手动完成[1],[34],[82],也可以自动完成[2],[88],[89]。构建子策略的一种方法是专注于发现和到达目标,即环境中的特定状态,它们通常是Agent应当前往的位置。无论是否与HRL一起使用,目标的发现和泛化也是当前研究的一个重要方向[35],[66],[89]。
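
The division of labor between a top-level policy and goal-reaching subpolicies can be sketched on a one-dimensional corridor. Everything in this example is a hypothetical, hand-designed setup (the corridor, the goal positions 3 and 7, and the names `make_goal_option` and `top_level_policy`); it corresponds to the manual division into subpolicies rather than to any learned hierarchy.

```python
def primitive_step(position, action):
    # Primitive action: move one cell left (-1) or right (+1).
    return position + action

def make_goal_option(goal):
    """Build a subpolicy that takes primitive steps until it reaches `goal`."""
    def run(position, max_steps=50):
        steps = 0
        while position != goal and steps < max_steps:
            action = 1 if goal > position else -1
            position = primitive_step(position, action)
            steps += 1
        return position, steps
    return run

# Two multi-time-step "actions" available to the top-level policy.
options = {"to_key": make_goal_option(3), "to_door": make_goal_option(7)}

def top_level_policy(has_key):
    # Chooses between subpolicies based on a high-level condition only.
    return "to_door" if has_key else "to_key"

position, has_key, log = 0, False, []
while not (has_key and position == 7):
    name = top_level_policy(has_key)
    position, steps = options[name](position)
    log.append((name, steps))
    has_key = has_key or position == 3
```

The top-level policy makes only two decisions here, while the subpolicies account for all seven primitive steps.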

6.4. 模仿学习与逆强化学习 Imitation Learning and Inverse RL

One may ask why, if given a sequence of “optimal” actions from expert demonstrations, it is not possible to use supervised learning in a straightforward manner—a case of “learning from demonstration.” This is indeed possible and is known as behavioral cloning in traditional RL literature. Taking advantage of the stronger signals available in supervised learning problems, behavioral cloning enjoyed success in earlier neural network research, with the most notable success being ALVINN, one of the earliest autonomous cars [59]. However, behavioral cloning cannot adapt to new situations, and small deviations from the demonstration during the execution of the learned policy can compound and lead to scenarios where the policy is unable to recover. A more generalizable solution is to use provided trajectories to guide the learning of suitable state-action pairs but fine-tune the agent using RL [23].

有人可能会问,如果给定了来自专家演示的“最优”动作序列,为什么不能直接使用监督学习,也就是所谓的“从演示中学习”。这确实是可行的,在传统的RL文献中被称为行为克隆。利用监督学习问题中更强的可用信号,行为克隆在早期的神经网络研究中取得了成功,其中最著名的是ALVINN,即最早的自动驾驶汽车之一[59]。然而,行为克隆无法适应新的情况,并且在执行学到的策略时,相对于演示的微小偏差会不断累积,最终导致策略无法恢复的局面。一个泛化能力更强的解决方案是使用所提供的轨迹来指导Agent学习合适的状态-动作对,再使用RL对Agent进行微调[23]。
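
In its simplest tabular form, behavioral cloning amounts to supervised learning on demonstrated state-action pairs. The sketch below assumes discrete states and picks the most frequently demonstrated action per state; the function name `behavioral_cloning` and the demonstration data are made up for illustration.

```python
from collections import Counter, defaultdict

def behavioral_cloning(demonstrations):
    """Fit a tabular policy by majority vote over (state, action) pairs."""
    votes = defaultdict(Counter)
    for state, action in demonstrations:
        votes[state][action] += 1
    return {
        state: counter.most_common(1)[0][0]
        for state, counter in votes.items()
    }

# Hypothetical expert data covering only states 0, 1, and 2.
demos = [
    (0, "right"), (1, "right"), (2, "right"),
    (1, "right"), (2, "left"), (2, "right"),
]
policy = behavioral_cloning(demos)
unseen = policy.get(5)  # state never visited by the expert
```

The cloned policy has no entry for state 5, which mirrors the weakness described above: once execution drifts away from the demonstrated states, the policy has nothing to fall back on.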

The goal of inverse RL (IRL) is to estimate an unknown reward function from observed trajectories that characterize a desired solution [52]; IRL can be used in combination with RL to improve upon demonstrated behavior. Using the power of deep neural networks, it is now possible to learn complex, nonlinear reward functions for IRL [98]. Ho and Ermon [24] showed that policies are uniquely characterized by their occupancies (visited state and action distributions), allowing IRL to be reduced to the problem of measure matching. With this insight, they were able to use generative adversarial training [18] to facilitate reward-function learning in a more flexible manner, resulting in the generative adversarial imitation learning algorithm.

IRL的目标是根据刻画期望解的观测轨迹来估计未知的奖励函数[52];IRL可以与RL结合使用,以改进所演示的行为。借助深度神经网络的能力,现在IRL可以学习复杂的、非线性的奖励函数[98]。Ho和Ermon[24]的研究表明,策略可由其占用度量(访问过的状态和动作的分布)唯一刻画,从而使IRL可以简化为测度匹配问题。基于这一认识,他们得以使用生成式对抗训练[18],以更灵活的方式促进奖励函数的学习,并由此提出了生成式对抗模仿学习算法。
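
To make the notion of occupancy concrete, the sketch below estimates a discounted state-action visitation distribution from sampled trajectories and compares two such estimates. Normalizing the weights to sum to one and using total variation distance as the mismatch score are simplifying assumptions chosen for this example; the adversarial training described above is not reproduced here, and the trajectories are invented.

```python
from collections import defaultdict

def occupancy(trajectories, gamma=0.9):
    """Empirical discounted visitation weights over (state, action) pairs."""
    weights = defaultdict(float)
    for trajectory in trajectories:
        for t, (state, action) in enumerate(trajectory):
            weights[(state, action)] += gamma ** t
    total = sum(weights.values())
    # Normalized so that the weights form a probability distribution.
    return {pair: w / total for pair, w in weights.items()}

def total_variation(p, q):
    """Half the L1 distance between two distributions given as dicts."""
    pairs = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in pairs)

expert = [[("s0", "a"), ("s1", "a")]]
imitator = [[("s0", "a"), ("s1", "a")]]
other = [[("s0", "b"), ("s2", "b")]]

match = total_variation(occupancy(expert), occupancy(imitator))
mismatch = total_variation(occupancy(expert), occupancy(other))
```

A policy that visits the same state-action pairs as the expert yields zero distance, while one with disjoint visitation yields the maximum distance of one.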

6.5. 多Agent强化学习 Multiagent RL

Usually, RL considers a single learning agent in a stationary environment. In contrast, multiagent RL (MARL) considers multiple agents learning through RL and often the nonstationarity introduced by other agents changing their behaviors as they learn [8]. In DRL, the focus has been on enabling (differentiable) communication between agents, which allows them to cooperate. Several approaches have been proposed for this purpose, including passing messages to agents sequentially [15], using a bidirectional channel (providing ordering with less signal loss) [58], and an all-to-all channel [77]. The addition of communication channels is a natural strategy to apply to MARL in complex scenarios and does not preclude the usual practice of modeling cooperative or competing agents as applied elsewhere in the MARL literature [8].

通常,RL考虑的是平稳环境中的单个学习Agent。与此相反,多Agent强化学习(multiagent RL, MARL)考虑多个Agent通过RL进行学习的情形,并且通常还要考虑其他Agent在学习过程中改变自身行为所引入的非平稳性[8]。在DRL中,研究重点是实现Agent之间(可微分)的通信,使它们能够相互合作。为此,已经提出了几种方法,包括按顺序向Agent传递消息[15]、使用双向信道(在提供顺序的同时减少信号损失)[58]以及使用全对全信道[77]。增加通信信道是在复杂场景中应用MARL的一种自然策略,并且不排除MARL文献中其他地方采用的对合作或竞争Agent进行建模的通常做法[8]。
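
An all-to-all channel can be pictured as every agent receiving an aggregate of what all other agents broadcast. The sketch below assumes mean aggregation over the other agents' hidden vectors, which is one simple choice that stays differentiable; the function name `all_to_all_messages` and the example vectors are hypothetical, and no learning is performed.

```python
def all_to_all_messages(hidden_states):
    """For each agent, average the hidden vectors of all other agents.

    hidden_states is a list of equal-length lists of floats, one per agent.
    At least two agents are required.
    """
    n = len(hidden_states)
    dim = len(hidden_states[0])
    # Sum each coordinate once, then subtract the agent's own contribution.
    totals = [sum(h[d] for h in hidden_states) for d in range(dim)]
    return [
        [(totals[d] - h[d]) / (n - 1) for d in range(dim)]
        for h in hidden_states
    ]

hidden = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0]]
messages = all_to_all_messages(hidden)
```

Because each message is a plain average, gradients could in principle flow from one agent's input back to the other agents' hidden states, which is what makes this kind of channel trainable end to end.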

7. 结论:超越模式识别 Conclusion: Beyond Pattern Recognition

Despite the successes of DRL, many problems need to be addressed before these techniques can be applied to a wide range of complex real-world problems [37]. Recent work with (nondeep) generative causal models demonstrated superior generalization over standard DRL algorithms [48], [63] in some benchmarks [5], achieved by reasoning about causes and effects in the environment [29]. For example, the schema networks of Kansky et al. [29] trained on the game Breakout immediately adapted to a variant where a small wall was placed in front of the target blocks, while progressive (A3C) networks [63] failed to match the performance of the schema networks even after training on the new domain. Although DRL has already been combined with AI techniques, such as search [73] and planning [80], a deeper integration with other traditional AI approaches promises benefits such as better sample complexity, generalization, and interpretability [16]. In time, we also hope that our theoretical understanding of the properties of neural networks (particularly within DRL) will improve, as it currently lags far behind practice.

尽管DRL取得了成功,但在将这些技术应用于广泛而复杂的现实问题之前,还需要解决许多问题[37]。最近使用(非深度)生成因果模型的工作表明,在一些基准上[5],通过对环境中的因果关系进行推理[29],可以获得优于标准DRL算法[48],[63]的泛化能力。例如,Kansky等人提出的模式网络(schema networks)[29]在游戏Breakout上训练后,能够立即适应一种在目标砖块前放置了一堵小墙的变体,而渐进式(A3C)网络[63]即使在新环境上训练之后,也无法达到模式网络的性能。尽管DRL已经与搜索[73]和规划[80]等人工智能技术相结合,但与其他传统人工智能方法更深入的融合有望带来更好的样本复杂度、泛化能力和可解释性等好处[16]。随着时间的推移,我们也希望对神经网络特性(特别是在DRL中)的理论理解能够得到改善,因为它目前远远落后于实践。

To conclude, it is worth revisiting the overarching goal of all of this research: the creation of general-purpose AI systems that can interact with and learn from the world around them. Interaction with the environment is simultaneously the advantage and disadvantage of RL. While there are many challenges in seeking to understand our complex and ever-changing world, RL allows us to choose how we explore it. In effect, RL endows agents with the ability to perform experiments to better understand their surroundings, enabling them to learn even high-level causal relationships. The availability of high-quality visual renderers and physics engines now enables us to take steps in this direction, with works that try to learn intuitive models of physics in visual environments [13]. Challenges remain before this will be possible in the real world, but steady progress is being made in agents that learn the fundamental principles of the world through observation and action. Perhaps, then, we are not too far away from AI systems that learn and act in more human-like ways in increasingly complex environments.

总之,我们有必要重新审视所有这些研究的总体目标:创造能够与周围世界互动并从中学习的通用AI系统。与环境的交互既是RL的优点,也是它的缺点。虽然在试图理解我们复杂多变的世界时存在许多挑战,但RL让我们可以选择如何探索它。实际上,RL赋予了Agent进行实验的能力,以便更好地了解它们所处的环境,使它们甚至能够学习高层次的因果关系。高质量的视觉渲染器和物理引擎的出现,使我们现在能够朝这个方向迈进,已有工作尝试在视觉环境中学习直观的物理模型[13]。要在现实世界中实现这一点仍然面临挑战,但在通过观察和行动来学习世界基本原理的Agent方面,研究正在稳步取得进展。那么,或许我们离能够在日益复杂的环境中以更接近人类的方式学习和行动的AI系统已经不远了。

8. 致谢 Acknowledgments

Kai Arulkumaran would like to acknowledge Ph.D. funding from the Department of Bioengineering at Imperial College London. This research has been partially funded by a Google Faculty Research Award to Marc Deisenroth.

Kai Arulkumaran感谢伦敦帝国理工学院生物工程系提供的博士研究资助。这项研究的部分资金来源于谷歌授予Marc Deisenroth的Faculty Research Award(教师研究奖)。
