
[PyTorch][chapter 64][Reinforcement Learning - DQN]

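Preface:

            DQN is an algorithm that combines deep learning with reinforcement learning. It was first proposed by DeepMind at NIPS 2013, and its core theory builds on the Markov decision process and the Bellman equation.

            The core of Q-learning is the Q table: building a Q table provides guidance for choosing actions. This only works when the state and action spaces are discrete and low-dimensional. When they are high-dimensional and continuous, the Q table becomes enormous, and neither maintaining it nor looking values up in it is practical.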


1: DQN history

2: DQN network parameter configuration

3: DQN network model construction


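I. DQN History

     DQN is similar to the Q-Learning algorithm from temporal-difference learning in machine learning.

    1.1 Q-Learning algorithm

In Q-Learning we keep a Q table that records the Q value of every action in every state.

We use the Q table to update the current policy.

Role of the Q table: given an input state S, a table lookup returns the action A with the largest Q value.

However, in many scenarios the state S is not discrete, which makes such a table hard to define.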
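The table lookup and update described above can be sketched in a few lines of plain Python. This is a minimal illustration, not code from the original post: the function names `q_learning_update` and `greedy_action`, the learning rate `alpha`, and the toy transition are all assumptions made for the example (`gamma = 0.98` matches the value used in main.py).

```python
from collections import defaultdict


def q_learning_update(q_table, state, action, reward, next_state, done,
                      n_actions, alpha=0.1, gamma=0.98):
    """Apply one tabular Q-learning update in place and return the TD error."""
    # Value of the best action in the next state (0 if the episode ended).
    best_next = 0.0 if done else max(q_table[(next_state, a)]
                                     for a in range(n_actions))
    td_target = reward + gamma * best_next
    td_error = td_target - q_table[(state, action)]
    q_table[(state, action)] += alpha * td_error
    return td_error


def greedy_action(q_table, state, n_actions):
    """Look up the action with the largest Q value for `state`."""
    return max(range(n_actions), key=lambda a: q_table[(state, a)])


# Hypothetical toy transition: in state 0, action 1 yields reward 1.0.
q_table = defaultdict(float)  # unseen (state, action) pairs default to 0.0
q_learning_update(q_table, state=0, action=1, reward=1.0,
                  next_state=1, done=False, n_actions=2)
best = greedy_action(q_table, state=0, n_actions=2)
```

The table is keyed by `(state, action)` pairs, which is exactly why the approach breaks down once states are continuous: no two observations are ever equal, so no entry is ever reused.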

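 1.2  DQN development

     Deep network + Q-learning = DQN

     DQN and the Q table are not fundamentally different:

     Q-table: maintains a Q table internally

     DQN:   replaces the Q table with a neural network, a = NN(s): the network scores every action for state s, and the highest-scoring action is selected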
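To make the "network instead of table" idea concrete, here is a small sketch in which a linear function stands in for the neural network. The weights, biases, and the sample CartPole-like state are invented for illustration; a real DQN would learn these parameters and use a deeper model.

```python
def q_values(state, weights, biases):
    """Linear stand-in for a Q-network: returns one Q value per action."""
    return [sum(w * x for w, x in zip(w_row, state)) + b
            for w_row, b in zip(weights, biases)]


def greedy_action(state, weights, biases):
    """Pick the action whose predicted Q value is largest."""
    q = q_values(state, weights, biases)
    return max(range(len(q)), key=q.__getitem__)


# Hypothetical parameters: 2 actions, 4 state features.
weights = [[0.0, 0.0, -1.0, 0.0],   # action 0
           [0.0, 0.0, 1.0, 0.0]]    # action 1
biases = [0.0, 0.0]

# A continuous state that would never appear as a key in a finite Q table.
state = [0.0, 0.1, 0.05, -0.2]
action = greedy_action(state, weights, biases)
```

Because the function accepts any real-valued state, nearby states share parameters and therefore share what has been learned, which a lookup table cannot do.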

   


II. Network Model

    2.1 DQN algorithm
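As a compact summary, the update that Agent.py in the code section computes can be written as follows. The notation is an assumption of this sketch rather than something taken from the original text: θ denotes the parameters of the online network `q_net`, θ⁻ those of the target network `target_q_net`, d is the `done` flag (1 at episode end, else 0), and N is the batch size.

```latex
y_i = r_i + \gamma \,(1 - d_i)\, \max_{a'} Q(s'_i, a'; \theta^{-}),
\qquad
L(\theta) = \frac{1}{N} \sum_{i=1}^{N} \bigl( Q(s_i, a_i; \theta) - y_i \bigr)^2
```

In that code, θ⁻ is overwritten with θ once every `target_update` calls to `update`.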

  2.2 Model

Model parameters


III. Code Implementation:

 5.1 main.py

   

# -*- coding: utf-8 -*-
"""
Created on Fri Nov 17 16:53:02 2023

@author: chengxf2
"""
import numpy as np
import torch
import gym
import random
from Replaybuffer import Replay
from Agent import DQN
import rl_utils
import matplotlib.pyplot as plt
from tqdm import tqdm  # progress bar

lr = 5e-3
hidden_dim = 128
num_episodes = 500
minimal_size = 500
gamma = 0.98
epsilon = 0.01
target_update = 10
buffer_size = 10000
mini_size = 500  # unused; same value as minimal_size
batch_size = 64
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

if __name__ == "__main__":
    # Written against the classic Gym API (before 0.26):
    # env.seed(), reset() -> state, step() -> 4 values.
    env_name = 'CartPole-v0'
    env = gym.make(env_name)
    random.seed(0)
    np.random.seed(0)
    env.seed(0)
    torch.manual_seed(0)
    replay_buffer = Replay(buffer_size)
    state_dim = env.observation_space.shape[0]
    action_dim = env.action_space.n
    agent = DQN(state_dim, hidden_dim, action_dim, lr, gamma, epsilon,
                target_update, device)
    return_list = []
    for i in range(10):
        with tqdm(total=int(num_episodes / 10), desc='Iteration %d' % i) as pbar:
            for i_episode in range(int(num_episodes / 10)):
                episode_return = 0
                state = env.reset()
                done = False
                while not done:
                    action = agent.take_action(state)
                    next_state, reward, done, _ = env.step(action)
                    replay_buffer.add(state, action, reward, next_state, done)
                    state = next_state
                    episode_return += reward
                    # Train the Q network only once the buffer holds enough transitions
                    if replay_buffer.size() > minimal_size:
                        b_s, b_a, b_r, b_ns, b_d = replay_buffer.sample(batch_size)
                        transition_dict = {
                            'states': b_s,
                            'actions': b_a,
                            'next_states': b_ns,
                            'rewards': b_r,
                            'dones': b_d
                        }
                        agent.update(transition_dict)
                return_list.append(episode_return)
                if (i_episode + 1) % 10 == 0:
                    pbar.set_postfix({
                        'episode': '%d' % (num_episodes / 10 * i + i_episode + 1),
                        'return': '%.3f' % np.mean(return_list[-10:])
                    })
                pbar.update(1)

    episodes_list = list(range(len(return_list)))
    plt.figure(1)
    plt.subplot(1, 2, 1)  # figure 1 has a 1x2 layout; draw the left panel
    plt.plot(episodes_list, return_list, c='r')
    plt.xlabel('Episodes')
    plt.ylabel('Returns')
    plt.title('DQN on {}'.format(env_name))
    plt.figure(1)  # keep working on figure 1
    plt.subplot(1, 2, 2)  # switch to the right panel of figure 1
    mv_return = rl_utils.moving_average(return_list, 9)
    plt.plot(episodes_list, mv_return, c='g')
    plt.xlabel('Episodes')
    plt.ylabel('Returns')
    plt.title('DQN on {}'.format(env_name))
    plt.show()
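main.py relies on the classic Gym calling convention, where `reset()` returns only the state and `step()` returns four values. Gym 0.26 and later, and Gymnasium, instead return `(observation, info)` from `reset` and `(observation, reward, terminated, truncated, info)` from `step`. The helpers below are one possible way to bridge the two. The names `reset_env` and `step_env` and the stand-in `_FakeNewEnv` class are hypothetical and exist only for this sketch.

```python
def reset_env(env, seed=None):
    """Reset `env` and return only the observation, under either API."""
    try:
        out = env.reset(seed=seed)   # newer API accepts a seed keyword
    except TypeError:
        out = env.reset()            # classic API has no seed keyword
    # Newer API returns (observation, info); classic API returns the observation.
    if isinstance(out, tuple) and len(out) == 2 and isinstance(out[1], dict):
        return out[0]
    return out


def step_env(env, action):
    """Step `env` and return the classic 4-tuple (obs, reward, done, info)."""
    out = env.step(action)
    if len(out) == 5:
        obs, reward, terminated, truncated, info = out
        return obs, reward, bool(terminated or truncated), info
    return out


class _FakeNewEnv:
    """Stand-in environment that mimics the newer return shapes."""

    def reset(self, seed=None):
        return [0.0, 0.0, 0.0, 0.0], {}

    def step(self, action):
        return [0.1, 0.0, 0.0, 0.0], 1.0, False, True, {}


env = _FakeNewEnv()
state = reset_env(env, seed=0)
next_state, reward, done, info = step_env(env, 1)
```

Folding `truncated` into `done` keeps the training loop unchanged, but it also makes the TD target treat a time-limit cutoff as a true terminal state. That simplification may or may not be acceptable for a given experiment.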

5.2  Agent.py

# -*- coding: utf-8 -*-
"""
Created on Fri Nov 17 16:00:46 2023

@author: chengxf2
"""
import random
import numpy as np
from torch import nn
import torch
import torch.nn.functional as F


class QNet(torch.nn.Module):
    def __init__(self, state_dim, hidden_dim, action_dim):
        super(QNet, self).__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            # Added: without a nonlinearity the two Linear layers
            # collapse into a single linear map.
            nn.ReLU(),
            nn.Linear(hidden_dim, action_dim))

    def forward(self, state):
        qvalue = self.net(state)
        return qvalue


class DQN:
    def __init__(self, state_dim, hidden_dim, action_dim, learning_rate,
                 discount, epsilon, target_update, device):
        self.action_dim = action_dim
        self.q_net = QNet(state_dim, hidden_dim, action_dim).to(device)
        self.target_q_net = QNet(state_dim, hidden_dim, action_dim).to(device)
        # Adam optimizer (only the online network is trained)
        self.optimizer = torch.optim.Adam(self.q_net.parameters(), lr=learning_rate)
        self.gamma = discount  # discount factor
        self.epsilon = epsilon  # epsilon-greedy exploration rate
        self.target_update = target_update  # target network update interval
        self.device = device
        self.count = 0  # number of update() calls so far

    def take_action(self, state):
        rnd = np.random.random()  # uniform random number in [0, 1)
        if rnd < self.epsilon:
            # explore: pick a random action
            action = np.random.randint(0, self.action_dim)
        else:
            # exploit: pick the action with the largest predicted Q value
            state = torch.tensor(np.array([state]), dtype=torch.float).to(self.device)
            qvalue = self.q_net(state)
            action = qvalue.argmax().item()
        return action

    def update(self, data):
        states = torch.tensor(data['states'], dtype=torch.float).to(self.device)
        actions = torch.tensor(data['actions']).view(-1, 1).to(self.device)
        rewards = torch.tensor(data['rewards'], dtype=torch.float).view(-1, 1).to(self.device)
        next_states = torch.tensor(data['next_states'], dtype=torch.float).to(self.device)
        dones = torch.tensor(data['dones'], dtype=torch.float).view(-1, 1).to(self.device)

        # Q value of the action actually taken, gathered by index; shape [batch, 1]
        q_value = self.q_net(states).gather(1, actions)
        # Largest Q value in the next state, from the target network
        max_next_q_values = self.target_q_net(next_states).max(1)[0].view(-1, 1)
        q_targets = rewards + self.gamma * max_next_q_values * (1 - dones)
        loss = F.mse_loss(q_value, q_targets)  # already averaged over the batch

        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()

        if self.count % self.target_update == 0:
            # Copy the online network into the target network
            self.target_q_net.load_state_dict(self.q_net.state_dict())
        self.count += 1
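The line `q_targets = rewards + self.gamma * max_next_q_values * (1 - dones)` in `update` is easier to check with concrete numbers. The sketch below repeats that arithmetic in plain Python. The helper name `td_targets` and the two-sample batch are made up for illustration; only the formula and `gamma = 0.98` come from the code above.

```python
def td_targets(rewards, next_q_values, dones, gamma=0.98):
    """Compute r + gamma * max_a' Q(s', a') * (1 - done) for each sample."""
    targets = []
    for r, q_next, d in zip(rewards, next_q_values, dones):
        # (1 - done) zeroes the bootstrap term when the episode has ended.
        targets.append(r + gamma * max(q_next) * (1.0 - float(d)))
    return targets


# Hypothetical batch of two transitions with two actions each.
rewards = [1.0, 1.0]
next_q_values = [[0.5, 2.0],   # sample 0: episode continues
                 [3.0, 1.0]]   # sample 1: episode ended
dones = [False, True]

targets = td_targets(rewards, next_q_values, dones)
# sample 0: 1.0 + 0.98 * 2.0 = 2.96
# sample 1: 1.0 (no bootstrap because done is True)
```

The second sample shows why the `dones` mask matters: without it the target would include value from a state the agent never gets to act in.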

 5.3 Replaybuffer.py

   

# -*- coding: utf-8 -*-
"""
Created on Fri Nov 17 15:50:07 2023

@author: chengxf2
"""
import collections
import random
import numpy as np


class Replay:
    def __init__(self, capacity):
        # Double-ended queue; once full, appending drops the oldest item.
        self.buffer = collections.deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        # Store one transition in the buffer
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Draw a random batch of transitions
        data = random.sample(self.buffer, batch_size)
        state, action, reward, next_state, done = zip(*data)
        return np.array(state), action, reward, np.array(next_state), done

    def size(self):
        return len(self.buffer)
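The two behaviours the `Replay` class depends on, automatic eviction from a bounded `deque` and unpacking a sampled batch with `zip(*data)`, can be seen in isolation with a small stdlib-only sketch. The capacity of 3 and the dummy integer transitions are arbitrary choices for the demonstration.

```python
import collections
import random

random.seed(0)

# A buffer that holds at most 3 transitions.
buffer = collections.deque(maxlen=3)
for t in range(5):
    # (state, action, reward, next_state, done) with dummy values
    buffer.append((t, 0, 1.0, t + 1, False))

# Transitions 0 and 1 were evicted when 3 and 4 arrived.
stored_states = [transition[0] for transition in buffer]

# Sample without replacement, then regroup field by field.
batch = random.sample(list(buffer), 2)
states, actions, rewards, next_states, dones = zip(*batch)
```

After the loop, `stored_states` is `[2, 3, 4]`. Sampling uniformly from old and new transitions is what breaks the correlation between consecutive steps of one episode.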

 5.4 rl_utils.py

from tqdm import tqdm
import numpy as np
import torch
import collections
import random


class ReplayBuffer:
    def __init__(self, capacity):
        self.buffer = collections.deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        transitions = random.sample(self.buffer, batch_size)
        state, action, reward, next_state, done = zip(*transitions)
        return np.array(state), action, reward, np.array(next_state), done

    def size(self):
        return len(self.buffer)


def moving_average(a, window_size):
    cumulative_sum = np.cumsum(np.insert(a, 0, 0))
    middle = (cumulative_sum[window_size:] - cumulative_sum[:-window_size]) / window_size
    r = np.arange(1, window_size-1, 2)
    begin = np.cumsum(a[:window_size-1])[::2] / r
    end = (np.cumsum(a[:-window_size:-1])[::2] / r)[::-1]
    return np.concatenate((begin, middle, end))


def train_on_policy_agent(env, agent, num_episodes):
    return_list = []
    for i in range(10):
        with tqdm(total=int(num_episodes/10), desc='Iteration %d' % i) as pbar:
            for i_episode in range(int(num_episodes/10)):
                episode_return = 0
                transition_dict = {'states': [], 'actions': [], 'next_states': [], 'rewards': [], 'dones': []}
                state = env.reset()
                done = False
                while not done:
                    action = agent.take_action(state)
                    next_state, reward, done, _ = env.step(action)
                    transition_dict['states'].append(state)
                    transition_dict['actions'].append(action)
                    transition_dict['next_states'].append(next_state)
                    transition_dict['rewards'].append(reward)
                    transition_dict['dones'].append(done)
                    state = next_state
                    episode_return += reward
                return_list.append(episode_return)
                agent.update(transition_dict)
                if (i_episode+1) % 10 == 0:
                    pbar.set_postfix({'episode': '%d' % (num_episodes/10 * i + i_episode+1), 'return': '%.3f' % np.mean(return_list[-10:])})
                pbar.update(1)
    return return_list


def train_off_policy_agent(env, agent, num_episodes, replay_buffer, minimal_size, batch_size):
    return_list = []
    for i in range(10):
        with tqdm(total=int(num_episodes/10), desc='Iteration %d' % i) as pbar:
            for i_episode in range(int(num_episodes/10)):
                episode_return = 0
                state = env.reset()
                done = False
                while not done:
                    action = agent.take_action(state)
                    next_state, reward, done, _ = env.step(action)
                    replay_buffer.add(state, action, reward, next_state, done)
                    state = next_state
                    episode_return += reward
                    if replay_buffer.size() > minimal_size:
                        b_s, b_a, b_r, b_ns, b_d = replay_buffer.sample(batch_size)
                        transition_dict = {'states': b_s, 'actions': b_a, 'next_states': b_ns, 'rewards': b_r, 'dones': b_d}
                        agent.update(transition_dict)
                return_list.append(episode_return)
                if (i_episode+1) % 10 == 0:
                    pbar.set_postfix({'episode': '%d' % (num_episodes/10 * i + i_episode+1), 'return': '%.3f' % np.mean(return_list[-10:])})
                pbar.update(1)
    return return_list


def compute_advantage(gamma, lmbda, td_delta):
    td_delta = td_delta.detach().numpy()
    advantage_list = []
    advantage = 0.0
    for delta in td_delta[::-1]:
        advantage = gamma * lmbda * advantage + delta
        advantage_list.append(advantage)
    advantage_list.reverse()
    return torch.tensor(advantage_list, dtype=torch.float)
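The cumulative-sum arithmetic in `moving_average` is terse, so it may help to see what it appears to compute: a centered moving average whose window shrinks near the two ends, which keeps the output the same length as the input. The plain-Python version below is a readable equivalent under two assumptions, namely that `window_size` is odd and that the input is at least `window_size` long. The name `centered_moving_average` is invented for this sketch.

```python
def centered_moving_average(values, window_size):
    """Centered moving average with shrinking windows at both edges."""
    half = (window_size - 1) // 2
    n = len(values)
    smoothed = []
    for i in range(n):
        # Use the largest symmetric window that fits around position i.
        k = min(half, i, n - 1 - i)
        window = values[i - k:i + k + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed


# A single spike gets spread over its neighbours.
smoothed = centered_moving_average([0, 0, 3, 0, 0], 3)
```

Here `smoothed` is `[0.0, 1.0, 1.0, 1.0, 0.0]`. In main.py the window is 9, so each plotted point in the right panel averages up to four episodes on either side.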

