
Reinforcement Learning Basics: Monte Carlo and Temporal Difference

Source: original, 2024/5/7 21:41:56

Solution method: Monte Carlo

Problem 1 (left figure): estimate the state-value function $v_{\pi}$ corresponding to a policy $\pi$
- First-visit MC estimates $v_{\pi}(s)$
- Every-visit MC estimates $v_{\pi}(s)$
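
The difference between the two estimators can be sketched in a few lines. This is a minimal illustration, not the original author's code: the function name `mc_prediction_v`, the episode layout (a list of `(state, reward)` pairs, with the reward received after leaving that state) and the sample episode are all assumptions made for the example.

```python
from collections import defaultdict


def mc_prediction_v(episodes, gamma=1.0, first_visit=True):
    """Estimate v_pi(s) by averaging sampled returns (hypothetical helper).

    episodes: list of episodes; each episode is a list of (state, reward)
    pairs, where reward is received after leaving that state (assumed layout).
    """
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    for episode in episodes:
        # Compute the return G_t for every time step, working backwards.
        G = 0.0
        returns = [0.0] * len(episode)
        for t in range(len(episode) - 1, -1, -1):
            G = episode[t][1] + gamma * G
            returns[t] = G
        seen = set()
        for t, (state, _) in enumerate(episode):
            if first_visit and state in seen:
                continue  # first-visit: only the first occurrence counts
            seen.add(state)
            returns_sum[state] += returns[t]
            returns_count[state] += 1
    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}


# Hypothetical episode in which state "A" is visited twice.
episode = [("A", 1.0), ("B", 0.0), ("A", 2.0)]
v_first = mc_prediction_v([episode], first_visit=True)
v_every = mc_prediction_v([episode], first_visit=False)
```

With this single episode the two variants disagree only on state "A": first-visit uses the return from its first occurrence, while every-visit averages the returns from both occurrences.
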
Problem 2 (right figure): estimate the action-value function $q_{\pi}$
- First-visit MC estimates $q_{\pi}(s,a)$
- Every-visit MC estimates $q_{\pi}(s,a)$

Problem 3 (left figure): get the optimal policy $\pi_*$
- relationship between the mean and individual return: $\bar{Q}_k=\frac{\sum_{i=1}^kG_i}{k}=\bar{Q}_{k-1}+\frac{1}{k}(G_k-\bar{Q}_{k-1})$
- $\epsilon$-greedy: Exploration vs Exploitation
  - with probability $1-\epsilon$, select the greedy action ${\pi}(s)=\arg \max _{a \in \mathcal{A}(s)} Q(s, a)$ (Exploitation)
  - with probability $\epsilon$, select an action (uniformly) at random ${\pi}(a|s)=\frac{1}{|\mathcal{A}(s)|}$ (Exploration)
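
Both ingredients above are small enough to sketch directly. The helper names (`incremental_mean`, `epsilon_greedy_probs`, `epsilon_greedy_action`) and the list-of-values representation of $Q(s,\cdot)$ are assumptions for illustration only.

```python
import random


def incremental_mean(returns):
    """Running mean via Q_k = Q_{k-1} + (G_k - Q_{k-1}) / k."""
    q = 0.0
    for k, g in enumerate(returns, start=1):
        q += (g - q) / k
    return q


def epsilon_greedy_probs(q_values, epsilon):
    """Action probabilities of the epsilon-greedy policy for one state.

    Every action gets epsilon / |A(s)| from the uniform exploration step;
    the greedy action additionally gets the 1 - epsilon exploitation mass.
    """
    n = len(q_values)
    probs = [epsilon / n] * n
    greedy = max(range(n), key=lambda a: q_values[a])
    probs[greedy] += 1.0 - epsilon
    return probs


def epsilon_greedy_action(q_values, epsilon, rng=random):
    """Sample an action index: explore with probability epsilon, else exploit."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

Note that the greedy action can also be picked during the exploration step, so its total probability is $1-\epsilon+\epsilon/|\mathcal{A}(s)|$.
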
Problem 4 (right figure): modify the algorithm to put more weight on the most recent returns
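
One common way to do this, assumed here since the figure is not available, is to replace the $\frac{1}{k}$ step size with a constant $\alpha$, giving $Q \leftarrow Q + \alpha(G - Q)$. The sketch below (hypothetical helper names) shows that the weight on each return then decays geometrically with its age.

```python
def constant_alpha_update(q, g, alpha):
    """One constant step-size update: Q <- Q + alpha * (G - Q)."""
    return q + alpha * (g - q)


def return_weights(num_returns, alpha):
    """Weight that each return G_1..G_k carries after k constant-alpha updates.

    The i-th return is weighted by alpha * (1 - alpha)^(k - i); the initial
    estimate keeps the remaining weight (1 - alpha)^k.
    """
    k = num_returns
    return [alpha * (1 - alpha) ** (k - i) for i in range(1, k + 1)]


# Hypothetical returns, starting from an initial estimate of 0.
q = 0.0
for g in [1.0, 2.0, 3.0]:
    q = constant_alpha_update(q, g, alpha=0.5)
weights = return_weights(3, alpha=0.5)
```

The most recent return gets weight $\alpha$, the one before it $\alpha(1-\alpha)$, and so on, whereas the running mean weights all returns equally.
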

Solution method: Temporal Difference

Monte Carlo (MC) prediction methods must wait until the end of an episode to update the value function estimate, whereas temporal-difference (TD) methods update the value function after every time step.

Problem 1 (left figure): estimate the state-value function $v_{\pi}$ (the estimation of $q_{\pi}$ is similar)
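
A minimal sketch of the one-step TD update, usually written $V(S_t) \leftarrow V(S_t) + \alpha\,(R_{t+1} + \gamma V(S_{t+1}) - V(S_t))$, is given below. The function name `td0_update`, the dictionary representation of $V$ and the default value of 0 for unseen states are assumptions for illustration.

```python
def td0_update(V, state, reward, next_state, done, alpha=0.1, gamma=1.0):
    """Apply one TD(0) update to the value table V (a dict) in place.

    The target bootstraps from the current estimate of the next state;
    a terminal next state contributes no future value.
    """
    next_value = 0.0 if done else V.get(next_state, 0.0)
    target = reward + gamma * next_value
    current = V.get(state, 0.0)
    V[state] = current + alpha * (target - current)
    return V


# Hypothetical transitions: A -> B (reward 1), then B -> terminal (reward 2).
V = {}
td0_update(V, "A", 1.0, "B", done=False, alpha=0.5)
td0_update(V, "B", 2.0, None, done=True, alpha=0.5)
td0_update(V, "A", 1.0, "B", done=False, alpha=0.5)
```

Each call uses a single transition, so the estimate changes after every step rather than at the end of the episode.
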
Problem 2 (right figure): get the optimal action-value function $q_*$
- On policy: the agent interacts with the environment by following the same policy $\pi$ that it seeks to evaluate (or improve)
- Sarsa(0) is an on-policy method
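
The on-policy nature of Sarsa(0) shows up in its update target, which uses the action $A_{t+1}$ that the agent actually selects next. The sketch below assumes a table `Q` keyed by `(state, action)` pairs; the name `sarsa_update` and its signature are illustrative.

```python
from collections import defaultdict


def sarsa_update(Q, s, a, r, s_next, a_next, done, alpha=0.1, gamma=1.0):
    """One Sarsa(0) update on Q, a mapping from (state, action) to value.

    The target uses Q(S', A') for the action A' chosen by the behaviour
    policy itself, which is what makes the method on-policy.
    """
    target = r if done else r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    return Q


# Hypothetical values: the agent picks "right" in B although "left" looks better.
Q = defaultdict(float)
Q[("B", "left")] = 10.0
Q[("B", "right")] = 2.0
sarsa_update(Q, "A", "right", 1.0, "B", "right", done=False, alpha=0.5)
```

In a full control loop the next action would typically be drawn from an $\epsilon$-greedy policy over `Q`.
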

Problem 3: modified algorithm to get the optimal action-value function $q_*$
- Off policy: the agent interacts with the environment by following a policy $b$ that is different from the policy $\pi$ that it seeks to evaluate (or improve)
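
A standard off-policy TD control update is Q-learning (also called Sarsamax). The source does not name the modified algorithm, so treating it as Q-learning is an assumption here, as are the function name and signature.

```python
from collections import defaultdict


def q_learning_update(Q, s, a, r, s_next, actions, done, alpha=0.1, gamma=1.0):
    """One Q-learning (Sarsamax) update on Q, keyed by (state, action).

    The target uses the greedy value max_b Q(S', b), regardless of which
    action the behaviour policy takes next, which makes it off-policy.
    """
    best_next = 0.0 if done else max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    return Q


# Hypothetical values for the next state B.
Q = defaultdict(float)
Q[("B", "left")] = 10.0
Q[("B", "right")] = 2.0
q_learning_update(Q, "A", "right", 1.0, "B", ["left", "right"], done=False, alpha=0.5)
```

Here the evaluated policy $\pi$ is greedy with respect to `Q`, while the behaviour policy $b$ that generates the transitions can be, for example, $\epsilon$-greedy.
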

Problem 4: another modified algorithm to get the optimal action-value function $q_*$
- Expected Sarsa is an on-policy method
- $\pi(a|S_{t+1})$ is derived from $Q$ (e.g., $\epsilon$-greedy)
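
Expected Sarsa replaces the sampled next-action value with its expectation under $\pi(a|S_{t+1})$. The sketch below assumes $\pi$ is $\epsilon$-greedy with respect to `Q`; the function name, signature and table layout are illustrative assumptions.

```python
from collections import defaultdict


def expected_sarsa_update(Q, s, a, r, s_next, actions, done,
                          epsilon=0.1, alpha=0.1, gamma=1.0):
    """One Expected Sarsa update with an epsilon-greedy policy derived from Q."""
    if done:
        expected = 0.0
    else:
        q_next = [Q[(s_next, b)] for b in actions]
        n = len(actions)
        # pi(b | s_next): epsilon / n for every action, plus 1 - epsilon
        # extra probability mass on the greedy action.
        probs = [epsilon / n] * n
        probs[q_next.index(max(q_next))] += 1.0 - epsilon
        expected = sum(p * q for p, q in zip(probs, q_next))
    Q[(s, a)] += alpha * (r + gamma * expected - Q[(s, a)])
    return Q


# Hypothetical values for the next state B.
Q = defaultdict(float)
Q[("B", "left")] = 10.0
Q[("B", "right")] = 2.0
expected_sarsa_update(Q, "A", "right", 1.0, "B", ["left", "right"],
                      done=False, epsilon=0.5, alpha=0.5)
```

With $\epsilon = 0$ the expectation collapses to the maximum over next actions, so the update coincides with the Q-learning target.
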

Reposted from: https://www.cnblogs.com/sunwq06/p/11084512.html
