
(Repost) Some RL literature (and notes)

Copied from: https://zhuanlan.zhihu.com/p/25770890 

Introductions

Introduction to reinforcement learning
Index of /rowan/files/rl

ICML Tutorials:


NIPS Tutorials:
CS 294 Deep Reinforcement Learning, Spring 2017


Deep Q-Learning

DQN:
[1312.5602] Playing Atari with Deep Reinforcement Learning (and its Nature version)

Double DQN
[1509.06461] Deep Reinforcement Learning with Double Q-learning
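For concreteness, a minimal sketch of the difference between the two targets, assuming hypothetical q_online and q_target callables that map a batch of next states to Q-value arrays of shape (batch, n_actions):

```python
import numpy as np

def dqn_target(q_target, rewards, next_states, dones, gamma=0.99):
    # Standard DQN target: the target network both selects and evaluates
    # the next action, which tends to overestimate Q-values.
    q_next = q_target(next_states)                      # (batch, n_actions)
    return rewards + gamma * (1.0 - dones) * q_next.max(axis=1)

def double_dqn_target(q_online, q_target, rewards, next_states, dones, gamma=0.99):
    # Double DQN: the online network selects the argmax action and the
    # target network evaluates it, reducing the overestimation bias.
    best_a = q_online(next_states).argmax(axis=1)       # selection
    q_next = q_target(next_states)                      # evaluation
    idx = np.arange(len(best_a))
    return rewards + gamma * (1.0 - dones) * q_next[idx, best_a]
```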

Bootstrapped DQN
[1602.04621] Deep Exploration via Bootstrapped DQN

Prioritized Experience Replay


Dueling DQN
[1511.06581] Dueling Network Architectures for Deep Reinforcement Learning
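The core of the dueling architecture is how the value and advantage streams are recombined; a minimal NumPy sketch of just that aggregation step (not the full network):

```python
import numpy as np

def dueling_q(value, advantage):
    # Combine the value stream V(s) and advantage stream A(s, a);
    # subtracting the mean advantage keeps the decomposition identifiable:
    # Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)
    return value[:, None] + advantage - advantage.mean(axis=1, keepdims=True)

# value: shape (batch,), advantage: shape (batch, n_actions)
print(dueling_q(np.array([1.0]), np.array([[0.5, -0.5, 0.0]])))
```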

Classic Literature

Sutton & Barto's book (Reinforcement Learning: An Introduction)

Book

David Silver's thesis


Policy Gradient Methods for Reinforcement Learning with Function Approximation

(Policy gradient theorem)

1. A policy-based approach can be better than a value-based one: the policy is a smooth function of its parameters, whereas a policy obtained by greedily picking actions from a value function can change discontinuously.

2. The policy gradient method.
The objective function is averaged over the stationary state distribution (starting from s0).
For the average-reward formulation, the distribution needs to be truly stationary.
For the start-state formulation (with discount), if all experience starts from s0, then the objective is averaged over a discounted state-visitation distribution (not necessarily fully stationary). If we start from arbitrary states, then the objective is averaged over the (discounted) stationary distribution.
Policy gradient theorem: the gradient operator can “pass through” the state distribution, even though that distribution depends on the parameters (and, at first glance, should also be differentiated).

3. Q^\pi(s, a) can be replaced by a function approximator f_w(s, a) without biasing the gradient only when f satisfies the compatibility condition df/dw = (d\pi/d\theta) / \pi (see the equations after this list).
If \pi(a|s) is log-linear with respect to some features, then f has to be linear in those features and satisfy \sum_a \pi(a|s) f(s, a) = 0 (so f is an advantage function).

4. First result to show that an RL algorithm with a relatively free-form function approximator converges to a local optimum.
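A compact restatement of the two results above (notation follows Sutton et al., 2000):

```latex
% Policy gradient theorem; d^{\pi} is the (discounted) state distribution under \pi_{\theta}:
\[
  \nabla_{\theta} J(\theta)
    = \sum_{s} d^{\pi}(s) \sum_{a} \nabla_{\theta} \pi_{\theta}(a \mid s)\, Q^{\pi}(s, a).
\]
% Compatible function approximation: Q^{\pi} can be replaced by f_w without
% biasing the gradient when
\[
  \nabla_{w} f_{w}(s, a)
    = \frac{\nabla_{\theta} \pi_{\theta}(a \mid s)}{\pi_{\theta}(a \mid s)}
    = \nabla_{\theta} \log \pi_{\theta}(a \mid s).
\]
% For a log-linear (Gibbs) policy this forces f_w to be linear in the same features,
% with \sum_{a} \pi_{\theta}(a \mid s)\, f_{w}(s, a) = 0, i.e. f_w is an advantage function.
```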

DAgger

Actor-Critic Models

Asynchronous Advantage Actor-Critic Model
[1602.01783] Asynchronous Methods for Deep Reinforcement Learning

Tensorpack's BatchA3C (ppwwyyxx/tensorpack) and GA3C ([1611.06256] Reinforcement Learning through Asynchronous Advantage Actor-Critic on a GPU)
Instead of using a separate model for each actor (in separate CPU threads), they feed all the data generated by the actors through a single model, which is updated regularly via optimization. 

On actor-critic algorithms.

Only read the first part of the paper. It proves that actor-critic converges to a local optimum when the feature space used to linearly represent Q(s, a) covers the space spanned by \nabla log \pi(a|s) (the compatibility condition) and the actor learns on a slower timescale than the critic. 


Natural Actor-Critic
Natural gradient applied to the actor-critic method. When the compatibility condition from the policy gradient paper is satisfied (i.e., Q(s, a) is linear in \nabla log \pi(a|s), so that the gradient estimated with this approximate Q equals the true gradient computed with the unknown exact Q of the current policy), the natural gradient of the policy's parameters is simply the linear coefficient of Q. 
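A sketch of why the natural gradient collapses to the critic's weights under the compatibility condition (this is the standard argument, written out rather than quoted from the paper):

```latex
% With compatible features \psi(s, a) = \nabla_{\theta} \log \pi_{\theta}(a \mid s)
% and the approximation Q_{w}(s, a) = w^{\top} \psi(s, a):
\[
  \nabla_{\theta} J(\theta)
    = \mathbb{E}\!\left[ \psi(s, a)\, \psi(s, a)^{\top} \right] w
    = F(\theta)\, w,
  \qquad
  F(\theta) = \mathbb{E}\!\left[ \psi(s, a)\, \psi(s, a)^{\top} \right],
\]
% so the natural gradient is just the linear coefficient of Q:
\[
  \widetilde{\nabla}_{\theta} J(\theta) = F(\theta)^{-1} \nabla_{\theta} J(\theta) = w.
\]
```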

A Survey of Actor-Critic Reinforcement Learning: Standard and Natural Policy Gradients

Covers the above two papers.

Continuous State/Action

Reinforcement Learning with Deep Energy-Based Policies 
Uses the soft-Q formulation proposed by  (in the math section) and naturally incorporates the entropy term into the Q-learning paradigm. For continuous action spaces, both training (the Bellman update) and sampling from the resulting policy (expressed in terms of Q) are intractable. For the former, they propose a surrogate action distribution and compute the gradient with importance sampling. For the latter, they use the Stein variational method to match a deterministic function a = f(e, s) to the learned Q-distribution. In terms of performance, they are comparable with DDPG. But since the learned Q can be diverse (multimodal) under the maximum-entropy principle, it can serve as a common initialization for many specific tasks (example: pretrain = learn to run in arbitrary directions, task = run in a maze). 
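The soft-Q quantities referred to above, written out in the standard maximum-entropy form (temperature \alpha; treat this as a paraphrase rather than the paper's exact notation):

```latex
\[
  \pi^{*}(a \mid s) \propto \exp\!\Big( \tfrac{1}{\alpha} Q_{\mathrm{soft}}(s, a) \Big),
  \qquad
  V_{\mathrm{soft}}(s) = \alpha \log \int_{\mathcal{A}}
      \exp\!\Big( \tfrac{1}{\alpha} Q_{\mathrm{soft}}(s, a) \Big)\, da,
\]
\[
  Q_{\mathrm{soft}}(s, a) = r(s, a) + \gamma\, \mathbb{E}_{s'}\big[ V_{\mathrm{soft}}(s') \big].
\]
% The soft-max integral over a continuous action space is what makes both the
% Bellman update and sampling from \pi^{*} intractable without approximation.
```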

Deterministic Policy Gradient Algorithms

Silver's paper. Learns an actor that predicts a deterministic action (rather than a conditional probability distribution \pi(a|s)) in Q-learning. When trained with Q-learning, gradients propagate through Q into \pi. Analogous to the policy gradient theorem (the gradient operator can “pass through” the parameter-dependent state distribution), there is also a deterministic version of the theorem. Also an interesting comparison with the stochastic off-policy actor-critic model (stochastic = \pi(a|s)). 
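The deterministic policy gradient theorem mentioned above, in equation form:

```latex
% For a deterministic actor a = \mu_{\theta}(s) and state distribution \rho^{\mu}:
\[
  \nabla_{\theta} J(\theta)
    = \mathbb{E}_{s \sim \rho^{\mu}}\!\Big[
        \nabla_{\theta} \mu_{\theta}(s)\,
        \nabla_{a} Q^{\mu}(s, a) \big|_{a = \mu_{\theta}(s)}
      \Big],
\]
% i.e. the critic's action-gradient is chained through the actor, which is how
% gradients are "propagated through Q to \pi" in practice.
```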

Continuous control with deep reinforcement learning (DDPG)
Deep version of DPG (with DQN tricks). Neural networks + minibatches alone are not stable, so they also add a target network and a replay buffer. 
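A minimal sketch of those two stabilization tricks, assuming parameters are stored as plain NumPy arrays in a dict (the names soft_update and ReplayBuffer are illustrative, not the paper's code):

```python
import random
from collections import deque

import numpy as np

def soft_update(target_params, online_params, tau=0.001):
    # Slowly track the online network: theta' <- tau * theta + (1 - tau) * theta'
    for name in target_params:
        target_params[name] = tau * online_params[name] + (1.0 - tau) * target_params[name]

class ReplayBuffer:
    """Uniform replay buffer; sampling past transitions decorrelates minibatches."""
    def __init__(self, capacity=100000):
        self.buffer = deque(maxlen=capacity)

    def add(self, transition):
        # transition = (state, action, reward, next_state, done)
        self.buffer.append(transition)

    def sample(self, batch_size=64):
        batch = random.sample(list(self.buffer), batch_size)
        # Returns [states, actions, rewards, next_states, dones] as arrays
        return [np.array(x) for x in zip(*batch)]
```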

Reward Shaping

Policy invariance under reward transformations: theory and application to reward shaping.

Andrew Ng's reward shaping paper. It proves that the optimal policy is invariant under reward shaping if and only if the added shaping term is a difference of a potential function, F(s, a, s') = \gamma \Phi(s') - \Phi(s). 
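A small illustrative helper showing what potential-based shaping looks like in code; the potential function used in the example is an arbitrary stand-in, not something from the paper:

```python
def shaped_reward(r, s, s_next, potential, gamma=0.99):
    # Potential-based shaping: F(s, a, s') = gamma * Phi(s') - Phi(s).
    # This is the only additive shaping term that leaves the optimal policy unchanged.
    return r + gamma * potential(s_next) - potential(s)

# Example potential: negative Manhattan distance to a goal at the origin of a gridworld.
potential = lambda s: -abs(s[0]) - abs(s[1])
print(shaped_reward(0.0, (3, 2), (2, 2), potential))  # moving closer yields a bonus
```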

Theoretical considerations of potential-based reward shaping for multi-agent systems
Potential-based reward shaping helps a single agent reach the optimal solution without changing the optimal policy (or, in games, the Nash equilibrium). This paper extends the result to the multi-agent case.

Reinforcement Learning with Unsupervised Auxiliary Tasks
[1611.05397] Reinforcement Learning with Unsupervised Auxiliary Tasks
ICLR17 oral. Adds auxiliary tasks to improve performance on Atari games and navigation. The auxiliary tasks include maximizing pixel changes and maximizing the activation of individual neurons. 
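A rough sketch of a pixel-change auxiliary reward of the kind described above; the cell size and normalization are assumptions for illustration and may differ from the paper:

```python
import numpy as np

def pixel_change_reward(frame, prev_frame, cell=4):
    # Assumes frames are HxWxC uint8 images; returns one auxiliary reward per
    # non-overlapping spatial cell, equal to the mean absolute pixel change there.
    diff = np.abs(frame.astype(np.float32) - prev_frame.astype(np.float32)).mean(axis=-1)
    h, w = diff.shape
    cropped = diff[: h - h % cell, : w - w % cell]
    cells = cropped.reshape(cropped.shape[0] // cell, cell, cropped.shape[1] // cell, cell)
    return cells.mean(axis=(1, 3))
```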

Navigation

Learning to Navigate in Complex Environments
https://openreview.net/forum?id=SJMGPrcle&noteId=SJMGPrcle
Raia's group at DeepMind. ICLR17 poster: adds depth prediction as an auxiliary task to improve navigation performance (also uses SLAM results as network input).

[1611.05397] Reinforcement Learning with Unsupervised Auxiliary Tasks (in reward shaping)

Deep Reinforcement Learning with Successor Features for Navigation across Similar Environments
Goal: navigation without SLAM.
Learn successor features (analogous to Q and V before the last layer; these features satisfy a similar Bellman equation) for transfer learning: learn k sets of top weights simultaneously while sharing the successor features, with a DQN acting on those features. In addition to the successor features, they also reconstruct the frame. (A sketch of the successor-feature recursion follows the experiment details below.)

Experiments on simulation.
state: 96x96x four most recent frames.
action: four discrete actions. (still, left, right, straight(1m))
baseline: train a CNN to directly predict the action of A*
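The successor-feature recursion referenced above, in its standard form (the paper's exact parameterization may differ):

```latex
% Assume the reward factorizes as r(s, a) \approx \phi(s, a)^{\top} w.
% The feature expectations \psi^{\pi} obey a Bellman equation of the same form as Q:
\[
  \psi^{\pi}(s, a) = \phi(s, a) + \gamma\, \mathbb{E}\big[ \psi^{\pi}(s', a') \big],
  \qquad
  Q^{\pi}(s, a) = \psi^{\pi}(s, a)^{\top} w,
\]
% so transferring to a similar environment only requires learning a new set of
% top weights w while the shared features \psi are reused.
```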

Deep Recurrent Q-Learning for Partially Observable MDPs
There is not much performance difference between frame-stacking DQN and DRQN. DRQN may be more robust when the game screen flickers (some frames are zeroed out).

Counterfactual Regret Minimization

Dynamic Thresholding

With proofs:


Studies game state abstraction and its effect on Leduc Poker.






Decomposition:
Solving Imperfect Information Games Using Decomposition


Safe and Nested Endgame Solving for Imperfect-Information Games


Game-specific RL

Atari Game


Go
AlphaGo 

DarkForest [1511.06410] Better Computer Go Player with Neural Network and Long-term Prediction

Super Smash Bros


Doom
Arnold: [1609.05521] Playing FPS Games with Deep Reinforcement Learning
Intel: [1611.01779] Learning to Act by Predicting the Future
F1: https://openreview.net/forum?id=Hk3mPK5gg&noteId=Hk3mPK5gg

Poker
Limit Texas hold 'em


No-Limit Texas hold 'em 
DeepStack: Expert-Level Artificial Intelligence in No-Limit Poker
