1. Basic Concepts
- Agent: the acting entity;
- State: the situation the agent is in (position, velocity, etc.); State Space: the set of all states;
- Action: a move available to the agent, which depends on the state: \(\mathcal{A}(s_i)\); State Transition: the change of state after taking an action;
- Policy: the action the agent takes in a given state, expressed as a probability \(\pi(a_i \mid s_1)\);
- Reward: the (possibly negative) reward received after taking an action, e.g. \(p(r = 1 \mid s_1, a_1)\);
- Trajectory: a state-action-reward chain; Return: the sum of all rewards along a trajectory; Discount Rate: the factor \(\gamma\) by which rewards decay per step when computing the Discounted Return;
- Episode: a trajectory that reaches a Terminal State; tasks without a terminal state are called Continuing Tasks; by treating the target state as an absorbing state (every action leads back to it with reward 0), or simply as an ordinary state, episodic and continuing tasks can be handled in a unified way;
- Markov Decision Process (MDP) (a minimal data sketch follows this list):
  - has States (\(\mathcal{S}\)), Actions (\(\mathcal{A}(s)\)), and Rewards (\(\mathcal{R}(s, a)\));
  - has state transition probabilities \(p(s'|s, a)\) and reward probabilities \(p(r|s, a)\);
  - has a Policy \(\pi(a|s)\);
  - is memoryless (the Markov property): \(p(s_{t+1} \mid a_{t+1}, s_t, \cdots, a_1, s_0) = p(s_{t+1} \mid a_{t+1}, s_t)\), \(p(r_{t+1} \mid a_{t+1}, s_t, \cdots, a_1, s_0) = p(r_{t+1} \mid a_{t+1}, s_t)\)
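As a concrete reference, here is a minimal sketch (in Python, using a hypothetical two-state toy problem) of how the MDP ingredients above can be written down as plain data: the state space, per-state action sets, and \(p(s'|s,a)\) together with the reward model collapsed into one table.

```python
import random

# A minimal, hypothetical two-state MDP written as plain data structures.
# Each (state, action) pair maps to a list of (probability, next_state, reward),
# encoding p(s'|s, a) and the reward model together.
STATES = ["s0", "s1"]
ACTIONS = {"s0": ["stay", "go"], "s1": ["stay"]}           # A(s) depends on the state
TRANSITIONS = {
    ("s0", "stay"): [(1.0, "s0", 0.0)],
    ("s0", "go"):   [(0.9, "s1", 1.0), (0.1, "s0", 0.0)],  # stochastic transition
    ("s1", "stay"): [(1.0, "s1", 0.0)],                    # target treated as absorbing state
}

def sample_step(rng, s, a):
    """Sample (s', r) from the model for state s and action a."""
    probs, next_states, rewards = zip(*TRANSITIONS[(s, a)])
    i = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return next_states[i], rewards[i]

print(sample_step(random.Random(0), "s0", "go"))
```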
2. The Bellman Equation
- For a particular trajectory, let \(v_i\) be the total return obtained from state \(s_i\) onward; then \(v_i = r_i + \gamma v_{\text{next}}\). In vector form, \(v = r + \gamma \mathrm{P} v\), where \(v\) and \(r\) are vectors and \(\mathrm{P}\) is the transition matrix; this is the Bellman equation of a deterministic problem (a task with no randomness);
- For stochastic tasks:
  - \(S_t \xrightarrow{A_t} R_{t+1}, S_{t+1} \xrightarrow{A_{t+1}} R_{t+2}, S_{t+2} \xrightarrow{A_{t+2}} R_{t+3}, \dots\)
  - \(G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = R_{t+1} + \gamma G_{t+1}\)
- State Value: the expected return obtained from the current state onward under the current policy; it splits into an immediate-reward part and a future-reward part:
  - \(v_{\pi}(s) = \mathbb{E}[G_t|S_t = s] = \mathbb{E}[R_{t+1}|S_t = s] + \gamma\mathbb{E}[G_{t+1}|S_t = s] \\= \sum_a \pi(a|s) \sum_r p(r|s, a) r + \gamma\sum_{s'} v_\pi(s') \sum_a p(s'|s, a) \pi(a|s) \\= \sum_a \pi(a|s) \left[ \sum_r p(r|s, a) r + \gamma \sum_{s'} p(s'|s, a) v_\pi(s') \right], \quad \forall s \in \mathcal{S}.\)
- Stacking the Bellman equations of all states gives \(v_\pi = r_\pi + \gamma P_\pi v_\pi\), where \(v_\pi\) and \(r_\pi\) are column vectors whose length is the number of states \(n\);
- Two ways to solve the Bellman equation: solve the linear system directly (a solution always exists, but this is computationally expensive), or iterate [1]: \(v_{k+1} = r_\pi + \gamma P_\pi v_k\), which provably converges to the closed-form solution (both are sketched below);
- Action Value: \(q_\pi(s, a) = \mathbb{E}[G_t|S_t = s, A_t = a]\), which gives \(v_\pi(s) = \sum_a \pi(a|s) q_\pi(s, a)\);
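A minimal sketch of both solution methods for \(v_\pi = r_\pi + \gamma P_\pi v_\pi\), assuming \(r_\pi\) and \(P_\pi\) are already given as NumPy arrays (the values below are made up for illustration):

```python
import numpy as np

gamma = 0.9
# Hypothetical 3-state chain under some fixed policy pi.
r_pi = np.array([0.0, 0.0, 1.0])                 # expected immediate reward per state
P_pi = np.array([[0.5, 0.5, 0.0],                # state-transition matrix under pi
                 [0.0, 0.5, 0.5],
                 [0.0, 0.0, 1.0]])

# Closed-form solution: v = (I - gamma * P_pi)^{-1} r_pi
v_closed = np.linalg.solve(np.eye(3) - gamma * P_pi, r_pi)

# Iterative solution [1]: v_{k+1} = r_pi + gamma * P_pi v_k
v = np.zeros(3)
for _ in range(1000):
    v = r_pi + gamma * P_pi @ v

print(v_closed, v)   # the two agree up to numerical tolerance
```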
3. The Bellman Optimality Equation
- If \(v_{\pi_1}(s) \geq v_{\pi_2}(s)\) for every state \(s\), then \(\pi_1\) is said to be better than \(\pi_2\);
- Bellman Optimality Equation (BOE): \(v = \max_\pi (r_\pi + \gamma P_\pi v)\), i.e. find the \(\pi\) that maximizes the \(v_\pi(s)\) of the Bellman equation;
- Using the Contraction Mapping Theorem one can show that repeatedly switching the policy to deterministically take the action with the largest action value [2] converges exponentially to the optimal solution, and that this solution exists, is unique, and is deterministic;
- Iterating \(v_{k+1} = f(v_k) = \max_\pi(r_\pi + \gamma P_\pi v_k)\) converges to \(v^*\) satisfying \(v^* = \max_\pi(r_\pi + \gamma P_\pi v^*)\); then \(\pi^* = \operatorname{argmax}_\pi (r_\pi + \gamma P_\pi v^*)\) is an optimal policy (see the sketch below);
- Theorem: applying a positive affine transformation to all rewards (\(ar + b\) with \(a > 0\)) leaves the optimal policy unchanged;
- Note: the discount \(\gamma\) makes the optimal policy prefer reaching the target state as quickly as possible;
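A minimal sketch of the fixed-point iteration on the BOE (this is exactly the value iteration of the next section), assuming the model \(p(s'|s,a)\) and rewards are available as the hypothetical arrays below:

```python
import numpy as np

gamma = 0.9
n_states, n_actions = 3, 2
# Hypothetical model: P[a, s, s'] = p(s'|s, a), R[s, a] = expected immediate reward.
P = np.zeros((n_actions, n_states, n_states))
P[0] = [[1, 0, 0], [1, 0, 0], [0, 0, 1]]      # action 0: move "left"
P[1] = [[0, 1, 0], [0, 0, 1], [0, 0, 1]]      # action 1: move "right"
R = np.array([[0.0, 0.0],
              [0.0, 1.0],                     # reaching the target state pays 1
              [0.0, 0.0]])

v = np.zeros(n_states)
for _ in range(200):
    # q[s, a] = R[s, a] + gamma * sum_s' p(s'|s, a) v(s')
    q = R + gamma * np.einsum("ast,t->sa", P, v)
    v_new = q.max(axis=1)                     # v_{k+1} = max_a q_k(s, a)
    if np.max(np.abs(v_new - v)) < 1e-10:     # contraction => exponential convergence
        break
    v = v_new

pi_star = q.argmax(axis=1)                    # greedy, deterministic optimal policy [2]
print(v, pi_star)
```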
4. Value Iteration and Policy Iteration
- Value iteration: start from an initial \(v_0\); compute \(q_k(s, a)\) under the current values and take \(a^*_k(s) = \operatorname{argmax}_a q_k(s, a)\); apply [2] once to obtain \(\pi_{k+1}\), then apply [1] once to obtain \(v_{k+1}(s)\);
- Policy iteration: start from a random \(\pi_0\); use [1] to compute the current \(v_{\pi_k}\) (this step itself runs many inner iterations), then use [2] to obtain \(\pi_{k+1}\), and repeat (sketched below);
- Truncated policy iteration: a compromise between the two, running only a few value-update steps per round;
- Policy Evaluation: \(v_{\pi_k} = r_{\pi_k} + \gamma P_{\pi_k} v_{\pi_k}\);
- Policy Improvement: \(\pi_{k+1} = \operatorname{argmax}_\pi (r_\pi + \gamma P_\pi v_{\pi_k})\);
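A minimal policy iteration loop, reusing the hypothetical model arrays `P`, `R`, `gamma` from the previous sketch; truncating the inner evaluation loop (small `eval_sweeps`) gives truncated policy iteration, and a single sweep essentially recovers value iteration.

```python
import numpy as np

def policy_iteration(P, R, gamma, eval_sweeps=100):
    """P[a, s, s'] = p(s'|s, a), R[s, a] = expected reward; returns (pi, v)."""
    n_actions, n_states, _ = P.shape
    pi = np.zeros(n_states, dtype=int)              # arbitrary initial deterministic policy
    v = np.zeros(n_states)
    while True:
        # Policy evaluation: iterate v <- r_pi + gamma * P_pi v  (rule [1])
        for _ in range(eval_sweeps):                # eval_sweeps=1 ~ truncated / value iteration
            r_pi = R[np.arange(n_states), pi]
            P_pi = P[pi, np.arange(n_states), :]
            v = r_pi + gamma * P_pi @ v
        # Policy improvement: greedy with respect to q  (rule [2])
        q = R + gamma * np.einsum("ast,t->sa", P, v)
        new_pi = q.argmax(axis=1)
        if np.array_equal(new_pi, pi):
            return pi, v
        pi = new_pi

# e.g. policy_iteration(P, R, gamma) with the arrays from the previous sketch
```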
5. Monte Carlo Methods
- Model-free, offline policy-update methods: the update has to wait until an episode finishes;
- MC Basic: to estimate \(q_k(s, a)\), go back to the expectation form from Section 2 rather than the form that needs the probability model; that is, run Monte Carlo rollouts to termination (episodes) as samples, then aggregate the sampled returns to approximate the expectation;
- MC Exploring Starts: every suffix of an episode is itself an episode, so one sampled episode can update the values of many (s, a) pairs; every (s, a) pair must be used as a start at least once;
- MC \(\varepsilon\)-Greedy: non-greedy actions also get a nonzero probability; a single episode can be very long, and it is no longer necessary to start from every (s, a); this balances exploitation and exploration (sketched below);
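A minimal every-visit MC \(\varepsilon\)-greedy sketch, assuming a hypothetical `env_step(s, a) -> (s_next, r, done)` simulator and a discrete action list; the return is accumulated backward so every (s, a) suffix of the episode gets updated.

```python
import random
from collections import defaultdict

def mc_epsilon_greedy(env_step, actions, start, gamma=0.9,
                      epsilon=0.1, episodes=1000, max_len=200):
    q = defaultdict(float)                      # q[(s, a)] estimates
    counts = defaultdict(int)
    for _ in range(episodes):
        # Generate one episode with the current epsilon-greedy policy.
        s, trajectory = start, []
        for _ in range(max_len):
            if random.random() < epsilon:
                a = random.choice(actions)      # explore
            else:
                a = max(actions, key=lambda a_: q[(s, a_)])  # exploit
            s_next, r, done = env_step(s, a)
            trajectory.append((s, a, r))
            s = s_next
            if done:
                break
        # Walk the episode backward: g is the return of each suffix.
        g = 0.0
        for s_t, a_t, r_t in reversed(trajectory):
            g = r_t + gamma * g
            counts[(s_t, a_t)] += 1
            q[(s_t, a_t)] += (g - q[(s_t, a_t)]) / counts[(s_t, a_t)]  # incremental mean
    return q
```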
6. Stochastic Approximation and Stochastic Gradient Descent
- Incremental mean: \(w_{k+1} = w_k - \frac{1}{k} (w_k - x_k)\);
- Stochastic Approximation: the family of stochastic iterative algorithms;
- Robbins-Monro (RM) algorithm: \(w_{k+1} = w_k - a_k \tilde{g}(w_k, \eta_k)\); convergence requires strict conditions;
- To find \(\theta^*\) such that \(\mathbb{E}[f(\theta^*, X)] = 0\), use \(w_{k+1} = w_k - a_k f(w_k, X_k)\) (matching the RM form above), where \(X\) is a random variable and \(X_k\) is the sample drawn at step \(k\);
- Stochastic Gradient Descent (SGD): a special case of the RM algorithm;
  - GD (requires knowing the function): \(w_{k+1} = w_k - \alpha_k \nabla_w \mathbb{E}[f(w_k, X)] = w_k - \alpha_k \mathbb{E}[\nabla_w f(w_k, X)]\);
  - BGD (batch): \(\mathbb{E}[\nabla_w f(w_k, X)] \approx \frac{1}{n} \sum_{i=1}^{n} \nabla_w f(w_k, x_i)\);
  - SGD: the batch with \(n = 1\); MBGD (mini-batch): split the samples into groups and use one group of size \(m\) per step (a comparison is sketched below);
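A minimal sketch contrasting BGD, MBGD, and SGD on the simplest example: estimating a mean by minimizing \(J(w) = \tfrac12\mathbb{E}[(w - X)^2]\), so \(\nabla_w f(w, x) = w - x\) and SGD reduces to the incremental-mean / RM update above. The data and step sizes are made up for illustration.

```python
import random

random.seed(0)
data = [random.gauss(5.0, 2.0) for _ in range(10_000)]    # samples of the random variable X

def grad(w, batch):
    """Average of the per-sample gradients (w - x) over a batch."""
    return sum(w - x for x in batch) / len(batch)

w_bgd = w_mbgd = w_sgd = 0.0
for k in range(1, 1001):
    a_k = 1.0 / k                                          # RM-style diminishing step size
    w_bgd  -= a_k * grad(w_bgd,  data)                     # BGD: the full batch every step
    w_mbgd -= a_k * grad(w_mbgd, random.sample(data, 32))  # MBGD: a mini-batch of m = 32
    w_sgd  -= a_k * grad(w_sgd,  [random.choice(data)])    # SGD: a single sample (n = 1)

print(w_bgd, w_mbgd, w_sgd, sum(data) / len(data))         # all three approach the sample mean
```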
7. Temporal-Difference Learning
TD algorithm: an RM algorithm that solves the Bellman equation
- For the visited state \(s_t\): \(v_{t+1}(s_t) = v_t(s_t) - \alpha_t(s_t) \left[ v_t(s_t) - \left[ r_{t+1} + \gamma v_t(s_{t+1}) \right] \right]\)
- For every other (unvisited) state, the state-value estimate stays unchanged: \(v_{t+1}(s) = v_{t}(s)\);
- The second term is the learning rate times the gap between the current estimate and the correction target;
- TD Target: \(\bar{v}_t \doteq r_{t+1} + \gamma v_t(s_{t+1})\); TD Error: \(\delta_t \doteq v_t(s_t) - \left[ r_{t+1} + \gamma v_t(s_{t+1}) \right]\);
- This algorithm only estimates the state value of a given policy, i.e. it solves the Bellman equation without a model;
- The model-free Bellman equation: \(v_\pi(s) = \mathbb{E} \left[ R + \gamma v_\pi(S') \mid S = s \right]\) (a TD(0) sketch follows);
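A minimal TD(0) policy-evaluation sketch, assuming a hypothetical `env_step(s, a) -> (s_next, r, done)` simulator and a fixed policy given as a function `policy(s) -> a`:

```python
from collections import defaultdict

def td0_evaluate(env_step, policy, start, gamma=0.9, alpha=0.1,
                 episodes=1000, max_len=200):
    v = defaultdict(float)                    # v[s]: state-value estimate
    for _ in range(episodes):
        s = start
        for _ in range(max_len):
            a = policy(s)
            s_next, r, done = env_step(s, a)
            td_target = r + gamma * (0.0 if done else v[s_next])
            td_error = v[s] - td_target
            v[s] -= alpha * td_error          # only the visited state is updated
            s = s_next
            if done:
                break
    return v
```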
Sarsa (state-action-reward-state-action): estimates action values, used for policy evaluation
- Sarsa: \(q_{t+1}(s_t, a_t) = q_t(s_t, a_t) - \alpha_t(s_t, a_t) \left[ q_t(s_t, a_t) - \left[ r_{t+1} + \gamma q_t(s_{t+1}, a_{t+1}) \right] \right]\)
- The model-free action value: \(q_\pi(s, a) = \mathbb{E} \left[ R + \gamma q_\pi(S', A') \mid s, a \right]\)
- An online algorithm: it can update after every single step;
- Expected Sarsa: replace the last term of Sarsa with \(\gamma \mathbb{E}[q_t(s_{t+1}, A)]\), i.e. use \(v_t(s_{t+1}) = \mathbb{E}[q_t(s_{t+1}, A)]\)
  - The corresponding Bellman equation: \(q_\pi(s, a) = \mathbb{E} \left[ R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s, A_t = a \right]\)
- Sarsa: \(G_t^{(1)} = R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1})\)
- n-step Sarsa (a "half-online" algorithm): \(G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^n q_\pi(S_{t+n}, A_{t+n})\)
- MC: \(G_t^{(\infty)} = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots\)
- Accordingly, the n-step Sarsa update is \(q_{t+1}(s_t, a_t) = q_t(s_t, a_t) - \alpha_t(s_t, a_t) \left[ q_t(s_t, a_t) - \left( r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^n q_t(s_{t+n}, a_{t+n}) \right) \right]\) (a one-step Sarsa sketch follows);
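A minimal one-step Sarsa sketch with an \(\varepsilon\)-greedy behavior policy, again assuming a hypothetical `env_step(s, a) -> (s_next, r, done)` simulator and a discrete action list:

```python
import random
from collections import defaultdict

def sarsa(env_step, actions, start, gamma=0.9, alpha=0.1, epsilon=0.1,
          episodes=1000, max_len=200):
    q = defaultdict(float)

    def eps_greedy(s):
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a_: q[(s, a_)])

    for _ in range(episodes):
        s, a = start, eps_greedy(start)
        for _ in range(max_len):
            s_next, r, done = env_step(s, a)
            a_next = eps_greedy(s_next)                 # the action actually taken next
            target = r + gamma * (0.0 if done else q[(s_next, a_next)])
            q[(s, a)] -= alpha * (q[(s, a)] - target)   # on-policy update
            s, a = s_next, a_next
            if done:
                break
    return q
```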
Q-learning: directly estimates the optimal action value
- \(q_{t+1}(s_t, a_t) = q_t(s_t, a_t) - \alpha_t(s_t, a_t) \left[ q_t(s_t, a_t) - \left( r_{t+1} + \gamma \max_{a \in \mathcal{A}} q_t(s_{t+1}, a) \right) \right]\)
- It solves the Bellman optimality equation: \(q(s, a) = \mathbb{E} \left[ R_{t+1} + \gamma \max_{a} q(S_{t+1}, a) \,\middle|\, S_t = s, A_t = a \right]\) (sketched below)
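The only change from Sarsa is the target. A sketch of the single-step Q-learning update; the transition (s, a, r, s') may come from any behavior policy, which is why Q-learning can be off-policy:

```python
def q_learning_update(q, s, a, r, s_next, actions, done,
                      gamma=0.9, alpha=0.1):
    """One Q-learning step on a defaultdict q[(s, a)]; the target uses max_a, not the action taken."""
    best_next = 0.0 if done else max(q[(s_next, a_)] for a_ in actions)
    target = r + gamma * best_next
    q[(s, a)] -= alpha * (q[(s, a)] - target)
```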
On-policy vs. off-policy
- Behavior Policy: the policy that generates the experience; Target Policy: the policy being optimized;
- If the two coincide the method is on-policy; otherwise it is off-policy;
- Sarsa and MC are on-policy; Q-learning can be off-policy;
Summary:
$$
\begin{array}{l|l} \textbf{Algorithm} & \textbf{Expression of } \bar{q}_t \\ \hline \text{Sarsa} & \bar{q}_t = r_{t+1} + \gamma q_t(s_{t+1}, a_{t+1}) \\ \text{n-step Sarsa} & \bar{q}_t = r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^n q_t(s_{t+n}, a_{t+n}) \\ \text{Expected Sarsa} & \bar{q}_t = r_{t+1} + \gamma \sum_a \pi_t(a \mid s_{t+1}) q_t(s_{t+1}, a) \\ \text{Q-learning} & \bar{q}_t = r_{t+1} + \gamma \max_a q_t(s_{t+1}, a) \\ \text{Monte Carlo} & \bar{q}_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots \end{array}
$$
8. Value Function Approximation
- Value function approximation: fit the points \(v_\pi(s_1), \ldots, v_\pi(s_{|\mathcal{S}|})\) with a function \(\hat{v}(s, w)\);
- Linear function approximation: \(\hat{v}(s, w) = \phi^T(s) w\), e.g. something of the form \(as^2 + bs + c\), where \(w = [a, b, c]^T\);
- Tabular algorithms update the \(q\) table directly, while function-approximation algorithms update \(w\); the update formulas simply gain an extra parameter \(w\) in \(q\) (these are still value-update-stage algorithms);
- The objective to optimize: \(J(w) = \mathbb{E} \left[ \left( v_\pi(S) - \hat{v}(S, w) \right)^2 \right]\)
- Two choices of state weighting: the uniform distribution (all states weighted equally) and the stationary distribution (each state weighted by how often it appears in an episode);
- Stationary distribution: \(\{d_\pi(s)\}_{s \in \mathcal{S}}\) is the fraction of an episode spent in state \(s\); as the episode length goes to infinity it converges to a fixed value satisfying \(d_\pi^T = d_\pi^T P_\pi\); then \(J(w) = \sum_{s \in \mathcal{S}} d_\pi(s) \left( v_\pi(s) - \hat{v}(s, w) \right)^2\);
- GD: \(w_{k+1} = w_k - \alpha_k \nabla_w J(w_k)\); to avoid the expectation we use SGD (one sample at a time): \(w_{t+1} = w_t + \alpha_t \left( v_\pi(s_t) - \hat{v}(s_t, w_t) \right) \nabla_w \hat{v}(s_t, w_t)\);
- Approximate the unknown \(v_\pi(s_t)\) iteratively with MC (replace it by \(g_t\)) or TD (replace it by \(r_{t+1} + \gamma \hat{v}(s_{t+1}, w_t)\));
- DQN: fit \(\hat{q}(s, a, w)\) with a neural network
- Recall the objective Q-learning minimizes: \(J(w) = \mathbb{E} \left[ \left( R + \gamma \max_{a \in \mathcal{A}(S')} \hat{q}(S', a, w) - \hat{q}(S, A, w) \right)^2 \right]\);
- Now use two networks for the two \(q\) terms: \(J = \mathbb{E} \left[ \left( R + \gamma \max_{a \in \mathcal{A}(S')} \hat{q}(S', a, w_T) - \hat{q}(S, A, w) \right)^2 \right]\); keep \(w_T\) fixed while updating \(w\), then switch, and repeat;
- So we feed \(s\), \(a\) (as input features) and \(y_T = R + \gamma \max_{a \in \mathcal{A}(S')} \hat{q}(S', a, w_T)\) (as the label) into the network and train \(w\); after a fixed number of iterations, set \(w_T \leftarrow w\) (a training-step sketch follows);
- Experience Replay: shuffle the samples and draw them uniformly, so that samples from the same episode are decorrelated;
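A minimal sketch of one DQN-style training step in plain NumPy, with a tiny linear "network" standing in for the neural net; class and parameter names (`TinyDQN`, `sync_every`, etc.) are illustrative, not from the original notes.

```python
import random
import numpy as np

class TinyDQN:
    """Linear q_hat(s, a, w) = w[a] . s, used here as a stand-in for a neural network."""
    def __init__(self, state_dim, n_actions, gamma=0.9, lr=0.01,
                 buffer_size=10_000, batch_size=32, sync_every=100):
        self.w = np.zeros((n_actions, state_dim))        # trained parameters w
        self.w_target = self.w.copy()                    # frozen target parameters w_T
        self.gamma, self.lr = gamma, lr
        self.batch_size, self.sync_every = batch_size, sync_every
        self.buffer, self.buffer_size, self.steps = [], buffer_size, 0

    def store(self, s, a, r, s_next, done):
        """Experience replay: keep transitions so later sampling breaks episode correlations."""
        self.buffer.append((np.asarray(s), a, r, np.asarray(s_next), done))
        if len(self.buffer) > self.buffer_size:
            self.buffer.pop(0)

    def train_step(self):
        if len(self.buffer) < self.batch_size:
            return
        batch = random.sample(self.buffer, self.batch_size)   # uniform replay sampling
        for s, a, r, s_next, done in batch:
            # Label y_T is computed with the frozen target parameters w_T.
            q_next = 0.0 if done else np.max(self.w_target @ s_next)
            y_T = r + self.gamma * q_next
            # One SGD step on (y_T - q_hat(s, a, w))^2 with respect to w only.
            error = y_T - self.w[a] @ s
            self.w[a] += self.lr * error * s
        self.steps += 1
        if self.steps % self.sync_every == 0:
            self.w_target = self.w.copy()                      # w_T <- w
```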
9. Policy Gradient Methods
- Write the policy as a function rather than a table: \(\pi(a|s, \theta)\); the gradient method is \(\theta_{t+1} = \theta_t + \alpha \nabla_\theta J(\theta_t)\);
- Two objectives:
  - average state value \(\bar{v}_\pi = \mathbb{E}[v_\pi(S)] = \mathbb{E} \left[ \sum_{t=0}^{\infty} \gamma^t R_{t+1} \right] = \sum_{s \in \mathcal{S}} d(s) \, v_\pi(s) = d^T v_\pi\)
  - average reward \(\bar{r}_\pi \doteq \sum_{s \in \mathcal{S}} d_\pi(s)\, r_\pi(s) = \mathbb{E}[r_\pi(S)] = \lim_{n \to \infty} \frac{1}{n} \, \mathbb{E} \left[ \sum_{k=1}^{n} R_{t+k} \right]\), where \(r_\pi(s) \doteq \sum_{a \in \mathcal{A}} \pi(a|s) \, r(s, a)\);
  - they satisfy \(\bar{r}_\pi = (1 - \gamma) \, \bar{v}_\pi\);
- Gradient of the objective: \(\nabla_\theta J(\theta) = \sum_{s \in \mathcal{S}} \eta(s) \sum_{a \in \mathcal{A}} \nabla_\theta \pi(a|s, \theta) \, q_\pi(s, a) = \mathbb{E} \left[ \nabla_\theta \ln \pi(A \mid S, \theta) \, q_\pi(S, A) \right]\), where \(J(\theta)\) may be \(\bar{v}_\pi\), \(\bar{r}_\pi\), or \(\bar{v}_\pi^0\);
- Gradient ascent:
  - \(\theta_{t+1} = \theta_t + \alpha \nabla_\theta J(\theta) = \theta_t + \alpha \, \mathbb{E} \left[ \nabla_\theta \ln \pi(A|S, \theta_t) \, q_\pi(S, A) \right]\);
  - Replacing the expectation with a stochastic sample and approximating \(q_\pi\) with MC or TD gives \(\theta_{t+1} = \theta_t + \alpha \nabla_\theta \ln \pi(a_t|s_t, \theta_t) \, q_t(s_t, a_t)\); the variant that uses the MC approximation is called REINFORCE (sketched below);
  - Rewriting the update and treating the coefficient in parentheses as \(\beta_t\) shows that it effectively optimizes \(\pi(a_t|s_t, \theta_t)\): \(\theta_{t+1} = \theta_t + \alpha \left( \frac{q_t(s_t, a_t)}{\pi(a_t \mid s_t, \theta_t)} \right) \nabla_\theta \pi(a_t|s_t, \theta_t)\); the step size \(\beta_t\) grows with \(q_t\) and shrinks with \(\pi\), which matches intuition (favor high-value actions, but also boost rarely chosen ones);
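A minimal REINFORCE sketch for a softmax policy over a discrete action set with a tabular parameter \(\theta[s, a]\), so \(\nabla_\theta \ln \pi\) has the simple closed form used below; `env_step` and `start` are hypothetical, as before.

```python
import numpy as np

def reinforce(env_step, start, n_states, n_actions, gamma=0.9, alpha=0.01,
              episodes=1000, max_len=200):
    rng = np.random.default_rng(0)
    theta = np.zeros((n_states, n_actions))          # softmax policy parameters

    def pi(s):
        z = np.exp(theta[s] - theta[s].max())        # numerically stable softmax
        return z / z.sum()

    for _ in range(episodes):
        # 1) Sample one full episode with the current policy (MC => wait for the end).
        s, episode = start, []
        for _ in range(max_len):
            a = rng.choice(n_actions, p=pi(s))
            s_next, r, done = env_step(s, a)
            episode.append((s, a, r))
            s = s_next
            if done:
                break
        # 2) Walk backward: g approximates q_t(s_t, a_t) by the sampled return.
        g = 0.0
        for s_t, a_t, r_t in reversed(episode):
            g = r_t + gamma * g
            grad_log = -pi(s_t)                      # d/d theta[s_t, b] of ln pi(a_t | s_t)
            grad_log[a_t] += 1.0
            theta[s_t] += alpha * g * grad_log       # gradient-ascent step
    return theta
```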
10. Actor-Critic Methods
- The Actor performs the policy update (the \(\theta_t\) iteration of the previous section); the Critic performs policy evaluation (estimating \(q_t\) with MC or TD; the TD-based variant is the QAC method);
- A2C / TD actor-critic: it is easy to show that subtracting a baseline \(b(S)\) from the gradient of the previous section leaves it unchanged: \(\nabla_\theta J(\theta)= \mathbb{E}_{S \sim \eta,\, A \sim \pi} \left[ \nabla_\theta \log \pi(A|S, \theta_t) \left( q_\pi(S, A) - b(S) \right) \right]\); for convenience, \(b(S)\) is usually chosen as \(v_\pi(S)\), which reduces the variance and makes the estimate more accurate; this also has an intuitive reading: \(q_\pi(S, A) - v_\pi(S)\) is the "advantage function", and what matters is how good an action is relative to the others, not its absolute value (sketched below);
- In update form: \(\theta_{t+1} = \theta_t + \alpha \nabla_\theta \log \pi(a_t|s_t, \theta_t) \, \delta_t(s_t, a_t) = \theta_t + \alpha \left( \frac{\delta_t(s_t, a_t)}{\pi(a_t \mid s_t, \theta_t)} \right) \nabla_\theta \pi(a_t \mid s_t, \theta_t)\), where \(\delta\) is the advantage;
- Off-policy learning via importance sampling: \(\mathbb{E}_{X \sim p_0}[X]= \sum_x p_0(x)\, x = \sum_x p_1(x) \frac{p_0(x)}{p_1(x)}\, x = \mathbb{E}_{X \sim p_1} \left[ \frac{p_0(X)}{p_1(X)} X \right]\), which lets us estimate an expectation under \(p_0\) from samples drawn under \(p_1\); so we simply weight the original expression by the target policy and divide by the behavior policy \(\beta(a|s)\);
- Because \(\pi(a|s, \theta) > 0\), all the methods above are stochastic; Deterministic Policy Gradient (DPG) switches to a deterministic policy \(a = \mu(s, \theta) \doteq \mu(s)\); a similar derivation gives \(\nabla_\theta J(\theta) = \sum_{s \in \mathcal{S}} \rho_\mu(s) \, \nabla_\theta \mu(s) \, \nabla_a q_\mu(s, a) \big|_{a = \mu(s)}\); this method is off-policy;
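A minimal one-step A2C sketch with a tabular softmax actor \(\theta[s, a]\) and a tabular critic \(v[s]\); the critic's TD error serves as the advantage estimate \(\delta_t\) (names are illustrative, as before).

```python
import numpy as np

def a2c_step(theta, v, s, a, r, s_next, done,
             gamma=0.9, alpha_actor=0.01, alpha_critic=0.1):
    """One actor-critic update on tabular theta[s, a] (softmax actor) and v[s] (critic)."""
    # Critic: the TD error doubles as the advantage estimate delta_t ~ q(s, a) - v(s).
    td_target = r + gamma * (0.0 if done else v[s_next])
    delta = td_target - v[s]
    v[s] += alpha_critic * delta                     # policy evaluation (TD)
    # Actor: gradient-ascent step on ln pi(a|s, theta) weighted by the advantage.
    z = np.exp(theta[s] - theta[s].max())
    pi_s = z / z.sum()
    grad_log = -pi_s
    grad_log[a] += 1.0
    theta[s] += alpha_actor * delta * grad_log       # policy update
    return delta
```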
11. PPO
- TRPO (Trust Region Policy Optimization):
  - Also uses importance sampling, and guarantees that within a trust region the new policy is no worse than the old one;
  - The derivation gives \(J(\theta') - J(\theta) = \sum_t \mathbb{E}_{s_t \sim p_\theta(s_t)} \left[ \mathbb{E}_{a_t \sim \pi_\theta(a_t | s_t)} \left[ \frac{\pi_{\theta'}(a_t | s_t)}{\pi_\theta(a_t | s_t)} \cdot \gamma^t \cdot A_{\pi_\theta}(s_t, a_t) \right] \right]\);
  - So the problem becomes \(\arg\max_{\theta'} \mathbb{E}_{s \sim \nu_\theta,\, a \sim \pi_\theta(\cdot|s)} \left[ \frac{\pi_{\theta'}(a|s)}{\pi_\theta(a | s)} \cdot A_{\pi_\theta}(s, a) \right] \quad \text{s.t.} \quad D_{\mathrm{KL}}(\pi_\theta(\cdot|s) \,\|\, \pi_{\theta'}(\cdot|s)) \leq \delta\);
- PPO (Proximal Policy Optimization):
  - Four models: policy (Actor), value (Critic), reward, and reference; only the first two have their parameters updated;
  - Two losses: a policy loss and a value loss;
  - On top of TRPO, replace the hard constraint with a \(\beta\)-weighted KL penalty adjusted on the fly: \(\arg\max_{\theta'} \mathbb{E}_{s \sim \nu_\theta,\, a \sim \pi_\theta(\cdot|s)} \left[ \frac{\pi_{\theta'}(a | s)}{\pi_\theta(a | s)} \cdot \hat{A}_{\pi_\theta}(s, a) -\beta \, D_{\mathrm{KL}}\left[ \pi_\theta(\cdot|s) \, \| \, \pi_{\theta'}(\cdot|s) \right] \right]\), with \(\left\{ \begin{aligned} \beta &\leftarrow \beta / 2 & \text{if } D_{\mathrm{KL}} < \delta / 1.5 \\ \beta &\leftarrow \beta \times 2 & \text{if } D_{\mathrm{KL}} > \delta \times 1.5 \end{aligned} \right.\)
  - Or simply clip the ratio (see the sketch below): \(\arg\max_{\theta'} \mathbb{E}_{s \sim \nu_\theta,\, a \sim \pi_\theta(\cdot|s)} \left[ \min \left( \frac{\pi_{\theta'}(a|s)}{\pi_\theta(a|s)} \hat{A}_{\pi_\theta}(s,a), \ \operatorname{clip}\left(\frac{\pi_{\theta'}(a|s)}{\pi_\theta(a|s)}, 1 - \epsilon, 1 + \epsilon\right) \hat{A}_{\pi_\theta}(s,a) \right) \right]\)
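A minimal NumPy sketch of the clipped PPO surrogate for a batch of samples; `log_prob_new`, `log_prob_old`, and `advantages` are assumed to be precomputed arrays (illustrative names, not from the notes).

```python
import numpy as np

def ppo_clip_objective(log_prob_new, log_prob_old, advantages, epsilon=0.2):
    """Clipped PPO surrogate (to be maximized), averaged over a batch of (s, a) samples."""
    ratio = np.exp(log_prob_new - log_prob_old)          # pi_theta'(a|s) / pi_theta(a|s)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
    return np.mean(np.minimum(unclipped, clipped))       # min(.) keeps updates conservative

# Example with a made-up batch of three samples.
obj = ppo_clip_objective(np.log([0.30, 0.10, 0.55]),
                         np.log([0.25, 0.20, 0.50]),
                         np.array([1.0, -0.5, 0.2]))
print(obj)
```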