Thoughts on Reinforcement Learning (2): Introduction to Reinforcement Learning
Notes on the introduction to reinforcement learning.
Table of Contents
- Thoughts on Reinforcement Learning (1): Preface
- Thoughts on Reinforcement Learning (2): Introduction to Reinforcement Learning
- Thoughts on Reinforcement Learning (3): Markov Decision Processes
- Thoughts on Reinforcement Learning (4): Imitation Learning and Supervised Learning
- Thoughts on Reinforcement Learning (5): Dynamic Programming
- Thoughts on Reinforcement Learning (6): Monte Carlo and Temporal Difference
- Thoughts on Reinforcement Learning (7): Policy Gradient
- Thoughts on Reinforcement Learning (8): Actor-Critic Methods
- Thoughts on Reinforcement Learning (9): Value Function Methods
- Thoughts on Reinforcement Learning (10): Deep Q Network
- Thoughts on Reinforcement Learning (11): Advanced Policy Gradient
Categories of Reinforcement Learning
Model Free / Model Based
Model Free: the model of the environment is not known to the agent.
Model Based: the model of the environment is known to the agent.
Prediction / Control
Prediction: evaluate the future with a given policy. (evaluation)
Control: Find the best policy to optimise the future. (evaluation + improvement)
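Written out in the standard notation (the discount factor $\gamma$ and the expectation over trajectories are introduced properly in the MDP post of this series), prediction evaluates a fixed policy, while control searches over policies:

$$
V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t) \,\middle|\, s_0 = s\right]
\qquad \text{(prediction: evaluate a given } \pi)
$$

$$
\pi^{*} = \arg\max_{\pi} V^{\pi}(s), \qquad V^{*}(s) = \max_{\pi} V^{\pi}(s)
\qquad \text{(control: find the best policy)}
$$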
Learning / Planning
- Learning:
  - The environment is initially unknown
  - The agent interacts with the environment
  - The agent improves its policy
- Planning:
  - A model of the environment is known
  - The agent performs computations with its model (without any external interaction)
  - The agent improves its policy
Value Based / Policy Based / Actor-Critic
- Value Based
  - Learnt Value Function
  - Implicit policy (ϵ-greedy)
- Policy Based
  - Learnt Policy
- Actor-Critic
  - Learnt Value Function
  - Learnt Policy
On-Policy / Off-Policy
- On-Policy:
  - The agent being trained and the agent interacting with the environment are the same agent.
  - each time the policy is changed, even a little bit, we need to generate new samples
  - “Learn on the job”
  - Learn about policy π from experience sampled from π
- Off-Policy:
  - The agent being trained and the agent interacting with the environment are different agents.
  - able to improve the policy without generating new samples from that policy
  - “Look over someone’s shoulder”
  - Learn about policy π from experience sampled from μ
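As a concrete illustration (not from the original post), the one-step bootstrapped targets of SARSA and Q-learning, both covered later in this series, show the difference: the on-policy target uses the action the behaviour policy actually took next, while the off-policy target uses the greedy action regardless of who generated the data. `Q` here is a hypothetical tabular value function.

```python
import numpy as np

def sarsa_target(Q, reward, next_state, next_action, gamma=0.99):
    # On-policy (SARSA): bootstrap with the action the behaviour policy π
    # actually chose in next_state, so the samples must come from π itself.
    return reward + gamma * Q[next_state, next_action]

def q_learning_target(Q, reward, next_state, gamma=0.99):
    # Off-policy (Q-learning): bootstrap with the greedy action, no matter
    # which policy μ generated the transition, so old samples can be reused.
    return reward + gamma * np.max(Q[next_state])
```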
Exploration / Exploitation
- Exploration finds more information about the environment, e.g. (both sketched below):
  - Epsilon Greedy
  - Boltzmann Exploration
- Exploitation exploits known information to maximise reward
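A minimal sketch (not from the original post) of the two exploration strategies named above, given a vector of estimated action values:

```python
import numpy as np

def epsilon_greedy(q_values, epsilon=0.1):
    # With probability epsilon take a random action (explore),
    # otherwise take the currently best action (exploit).
    if np.random.rand() < epsilon:
        return np.random.randint(len(q_values))
    return int(np.argmax(q_values))

def boltzmann(q_values, temperature=1.0):
    # Sample an action with probability proportional to exp(Q / temperature);
    # a higher temperature means more exploration.
    logits = np.asarray(q_values, dtype=float) / temperature
    logits -= logits.max()                        # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return int(np.random.choice(len(q_values), p=probs))
```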
The Reinforcement Learning Framework
A typical reinforcement learning algorithm follows these three steps:
1. generate samples
2. fit a model / estimate the return
3. improve the policy
generate samples
- sample one step
- sample N steps
- sample complete trajectory
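A minimal sketch of the third mode, sampling a complete trajectory, assuming a Gym-style environment with the classic `reset()` / `step()` API that returns `(next_state, reward, done, info)`; the `env` and `policy` here are placeholders, and one-step or N-step sampling is the same loop cut short.

```python
def sample_trajectory(env, policy, max_steps=1000):
    # Roll out one episode and record (s, a, r, s') transitions.
    trajectory = []
    state = env.reset()
    for _ in range(max_steps):
        action = policy(state)                      # policy: state -> action
        next_state, reward, done, _ = env.step(action)
        trajectory.append((state, action, reward, next_state))
        if done:                                    # episode finished
            break
        state = next_state
    return trajectory
```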
fit a model / estimate the return
- Value-based: fit $V(s)$ or $Q(s,a)$
- Policy gradients: evaluate returns $R_\tau = \sum_t r(s_t, a_t)$
- Model-based: learn $p(s_{t+1} \mid s_t, a_t)$
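As a concrete instance of "estimate the return", a minimal sketch of computing the return at every step of a sampled trajectory; the discount factor `gamma` is an assumption added here (the formula above is the undiscounted sum, i.e. `gamma=1.0`):

```python
def returns_to_go(rewards, gamma=0.99):
    # R_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ...
    # computed by scanning the reward sequence backwards.
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    return list(reversed(returns))

# Example: returns_to_go([1, 0, 2], gamma=1.0) == [3, 2, 2]
```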
improve the policy
- Value-based: if we have policy $\pi$ and we know $Q^\pi(s,a)$, then we can improve $\pi$
  - set $\pi'(a \mid s) = 1$ if $a = \arg\max_a Q^\pi(s,a)$
  - this policy is at least as good as $\pi$, and it doesn't matter what $\pi$ is
- Policy gradients: compute a gradient to increase the probability of good actions $a$
  - modify $\pi(a \mid s)$ to increase the probability of $a$ if $Q^\pi(s,a) > V^\pi(s)$
  - if $Q^\pi(s,a) > V^\pi(s)$, then $a$ is better than average
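A minimal sketch of the value-based improvement step above for the tabular case: the improved policy is simply greedy with respect to $Q^\pi$ (the array shapes are assumptions for illustration):

```python
import numpy as np

def greedy_improvement(Q):
    # Q has shape (num_states, num_actions); return π'(a|s) as a table
    # that puts probability 1 on argmax_a Q(s, a) in every state.
    num_states, _ = Q.shape
    new_policy = np.zeros_like(Q, dtype=float)
    best_actions = np.argmax(Q, axis=1)
    new_policy[np.arange(num_states), best_actions] = 1.0
    return new_policy
```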
Common Algorithms
Policy gradients: directly differentiate the reward objective
Value-based: estimate value function or Q-function of the optimal policy (no explicit policy)
Actor-critic: estimate value function or Q-function of the current policy, use it to improve policy
Model-based RL: estimate the transition model, and then
- Use the model to plan (no explicit policy)
  - Trajectory optimization/optimal control (primarily in continuous spaces) – essentially backpropagation to optimize over actions
  - Discrete planning in discrete action spaces – e.g., Monte Carlo tree search
- Use it to improve a policy
  - Backpropagate gradients into the policy
    - Requires some tricks to make it work
- Use the model to learn a value function
  - Dynamic programming
  - Generate simulated experience for model-free learner
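To spell out what "directly differentiate the reward objective" means for the policy-gradient family listed above, the standard objective and its likelihood-ratio (REINFORCE) gradient estimator are, anticipating the policy gradient post later in this series:

$$
J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_t r(s_t, a_t)\right],
\qquad
\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\Big(\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\Big)\, R_\tau\right]
$$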
Comparison
Comparison: sample efficiency
- Note that being sample efficient is not the same as being run-time efficient.
Comparison: stability and ease of use
Comparing SL and RL Algorithms
Supervised learning
- almost always gradient descent
Reinforcement learning
- often not gradient descent
- Q-learning: fixed point iteration
- Policy gradient: is gradient descent, but also often the least efficient!
- Model-based RL: model is not optimized for expected reward
Comparing RL Algorithms
Value function fitting
- At best, minimizes error of fit (“Bellman error”)
- Not the same as expected reward
- At worst, doesn’t optimize anything
- Many popular deep RL value fitting algorithms are not guaranteed to converge to anything in the nonlinear case
Policy gradient
- The only one that actually performs gradient descent (ascent) on the true objective
Model-based RL
- Model minimizes error of fit
- This will converge
- No guarantee that better model = better policy
Comparison: assumptions
- Common assumption #1: full observability
- Generally assumed by value function fitting methods
- Can be mitigated by adding recurrence
- Common assumption #2: episodic learning
- Often assumed by pure policy gradient methods
- Assumed by some model-based RL methods
- Common assumption #3: continuity or smoothness
- Assumed by some continuous value function learning methods
- Often assumed by some model-based RL methods
References and Acknowledgements
All references are listed in the preface; once again, my heartfelt thanks to the masters and pioneers of reinforcement learning!