
[HuggingFace Transformers] The Core of OpenAIGPTModel: A Walkthrough of the Block Source Code

Source: original, 2024/9/20 7:55:47

1. Introduction to Block
2. Block class source walkthrough
3. Attention class source walkthrough
4. MLP class source walkthrough

1. Introduction to Block

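In the GPT model, the Block is the core building unit of the Transformer architecture. Each Block consists of three kinds of component: an Attention layer, an MLP, and two LayerNorm layers. First, the Attention layer computes attention weights between the positions of the input and produces a weighted representation. The Attention output is then added to the input through a residual connection and passed through the first LayerNorm, giving an intermediate state. Next, the MLP processes this intermediate state and introduces a non-linear transformation through its activation function. Finally, the MLP output is added to the MLP input through a second residual connection and normalized by the second LayerNorm, which yields the output of the Block. Stacking such Blocks lets the model extract and transform increasingly complex features of the sequence, while the residual connections and layer normalization help keep deep stacks trainable. The structure of a Block is as follows: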
[Figure: structure of a Block. Image source: Improving Language Understanding by Generative Pre-Training]
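The order of operations described above (residual add first, LayerNorm second, done twice) can be sketched with toy sublayers. This is only an illustration: `TinyPostLNBlock` is a hypothetical name, a plain `nn.Linear` stands in for the real attention sublayer, and the sizes are arbitrary.

```python
import torch
from torch import nn


class TinyPostLNBlock(nn.Module):
    """Hypothetical stand-in for Block: same residual/LayerNorm order, toy sublayers."""

    def __init__(self, n_embd):
        super().__init__()
        self.attn = nn.Linear(n_embd, n_embd)  # placeholder for the attention sublayer
        self.ln_1 = nn.LayerNorm(n_embd)
        self.mlp = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd), nn.GELU(), nn.Linear(4 * n_embd, n_embd)
        )
        self.ln_2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        n = self.ln_1(x + self.attn(x))  # residual add, then first LayerNorm
        h = self.ln_2(n + self.mlp(n))   # residual add, then second LayerNorm
        return h


torch.manual_seed(0)
block = TinyPostLNBlock(n_embd=16)
x = torch.randn(2, 5, 16)  # (batch, sequence, embedding)
h = block(x)
print(h.shape)  # same shape as the input
```

Because the last operation is a LayerNorm, every position of the output has roughly zero mean over the embedding dimension, whatever the sublayers produce.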

2. Block class source walkthrough

Source: transformers/src/transformers/models/openai/modeling_openai.py

# -*- coding: utf-8 -*-
# @time: 2024/9/3 20:42
from torch import nn
from transformers.models.openai.modeling_openai import Attention, MLP


class Block(nn.Module):
    def __init__(self, n_positions, config, scale=False):
        super().__init__()
        nx = config.n_embd
        self.attn = Attention(nx, n_positions, config, scale)  # attention layer
        self.ln_1 = nn.LayerNorm(nx, eps=config.layer_norm_epsilon)  # first LayerNorm
        self.mlp = MLP(4 * nx, config)  # MLP layer
        self.ln_2 = nn.LayerNorm(nx, eps=config.layer_norm_epsilon)  # second LayerNorm

    def forward(self, x, attention_mask=None, head_mask=None, output_attentions=False):
        # Self-attention
        attn_outputs = self.attn(
            x,
            attention_mask=attention_mask,
            head_mask=head_mask,
            output_attentions=output_attentions,
        )
        a = attn_outputs[0]  # the attention output a

        n = self.ln_1(x + a)  # residual connection, then the first layer normalization
        m = self.mlp(n)  # feed-forward network
        h = self.ln_2(n + m)  # residual connection, then the second layer normalization

        # Output: hidden states first, followed by the attention weights if they were requested
        outputs = [h] + attn_outputs[1:]
        return outputs

3. Attention class source walkthrough

Source: transformers/src/transformers/models/openai/modeling_openai.py

# -*- coding: utf-8 -*-
# @time: 2024/9/3 20:44
import math

import torch
from torch import nn
from transformers.pytorch_utils import Conv1D, find_pruneable_heads_and_indices, prune_conv1d_layer


class Attention(nn.Module):
    def __init__(self, nx, n_positions, config, scale=False):
        super().__init__()
        # The hidden size n_state equals the embedding size nx
        n_state = nx  # in Attention: n_state=768 (nx=n_embd)
        # [switch nx => n_state from Block to Attention to keep identical to TF implementation]
        # n_state must be divisible by the number of attention heads; raise otherwise
        if n_state % config.n_head != 0:
            raise ValueError(f"Attention n_state shape: {n_state} must be divisible by config.n_head {config.n_head}")
        # Register a non-persistent buffer named "bias" that holds a lower-triangular matrix.
        # It is the causal mask that stops a position from attending to later positions.
        self.register_buffer(
            "bias",
            torch.tril(torch.ones(n_positions, n_positions)).view(1, 1, n_positions, n_positions),
            persistent=False,
        )
        self.n_head = config.n_head  # number of attention heads
        self.split_size = n_state  # chunk size used to split the fused projection into Q, K and V
        self.scale = scale  # whether to scale the attention scores

        self.c_attn = Conv1D(n_state * 3, nx)  # Conv1D layer (in effect a linear projection) producing Q, K and V at once; output size is 3 * n_state
        self.c_proj = Conv1D(n_state, nx)  # Conv1D layer that projects the final attention output
        self.attn_dropout = nn.Dropout(config.attn_pdrop)  # dropout on the attention probabilities
        self.resid_dropout = nn.Dropout(config.resid_pdrop)  # dropout on the output before the residual add
        self.pruned_heads = set()  # heads that have already been pruned

    # Prune the given attention heads (optional helper)
    def prune_heads(self, heads):
        # Nothing to do if no heads were given
        if len(heads) == 0:
            return
        # Work out which heads can still be pruned and the indices of the entries to keep
        heads, index = find_pruneable_heads_and_indices(
            heads, self.n_head, self.split_size // self.n_head, self.pruned_heads
        )
        index_attn = torch.cat([index, index + self.split_size, index + (2 * self.split_size)])  # the same indices repeated for the Q, K and V parts of the fused weight
        # Prune conv1d layers
        self.c_attn = prune_conv1d_layer(self.c_attn, index_attn, dim=1)  # prune c_attn
        self.c_proj = prune_conv1d_layer(self.c_proj, index, dim=0)  # prune c_proj
        # Update hyper params
        self.split_size = (self.split_size // self.n_head) * (self.n_head - len(heads))  # update split_size
        self.n_head = self.n_head - len(heads)  # update n_head
        self.pruned_heads = self.pruned_heads.union(heads)  # record the newly pruned heads

    # Compute attention
    def _attn(self, q, k, v, attention_mask=None, head_mask=None, output_attentions=False):
        w = torch.matmul(q, k)  # raw attention scores: Q times K (k arrives already transposed by split_heads)
        # Scale the scores if self.scale is set
        if self.scale:
            w = w / math.sqrt(v.size(-1))
        # w = w * self.bias + -1e9 * (1 - self.bias)  # TF implementation method: mask_attn_weights
        # XD: self.b may be larger than w, so we need to crop it
        b = self.bias[:, :, : w.size(-2), : w.size(-1)]  # crop the lower-triangular mask to the size of the score matrix
        w = w * b + -1e4 * (1 - b)  # keep allowed scores and set future positions to a large negative value

        # If an attention mask is provided, add it to the score matrix w
        if attention_mask is not None:
            # Apply the attention mask
            w = w + attention_mask

        # Normalize w with softmax, then apply dropout
        w = nn.functional.softmax(w, dim=-1)
        w = self.attn_dropout(w)

        # Mask heads if we want to
        # If a head mask is provided, use it to zero out specific heads
        if head_mask is not None:
            w = w * head_mask

        outputs = [torch.matmul(w, v)]  # weighted sum of the values
        # If attention weights were requested, append them to the output list
        if output_attentions:
            outputs.append(w)
        return outputs

    # Merge the per-head outputs back into the original hidden dimension
    def merge_heads(self, x):
        """Reorder the dimensions of x, then merge its last two dimensions into one."""
        x = x.permute(0, 2, 1, 3).contiguous()
        new_x_shape = x.size()[:-2] + (x.size(-2) * x.size(-1),)
        return x.view(*new_x_shape)  # in Tensorflow implementation: fct merge_states

    # Split the input into multiple attention heads
    def split_heads(self, x, k=False):
        """Split the last dimension of x into (num_head, head_dim), then permute depending on k."""
        new_x_shape = x.size()[:-1] + (self.n_head, x.size(-1) // self.n_head)
        x = x.view(*new_x_shape)  # in Tensorflow implementation: fct split_states
        if k:
            return x.permute(0, 2, 3, 1)  # keys: (batch, head, head_dim, seq), ready for Q @ K
        else:
            return x.permute(0, 2, 1, 3)  # queries and values: (batch, head, seq, head_dim)

    def forward(self, x, attention_mask=None, head_mask=None, output_attentions=False):
        # Compute query, key and value with c_attn, then split them into heads
        x = self.c_attn(x)
        query, key, value = x.split(self.split_size, dim=2)
        query = self.split_heads(query)
        key = self.split_heads(key, k=True)
        value = self.split_heads(value)

        # Compute attention
        attn_outputs = self._attn(query, key, value, attention_mask, head_mask, output_attentions)
        a = attn_outputs[0]

        a = self.merge_heads(a)  # merge the heads
        a = self.c_proj(a)  # apply the output projection c_proj
        a = self.resid_dropout(a)  # then apply dropout

        # Pack the attention output together with the attention weights (if any) into a list
        outputs = [a] + attn_outputs[1:]
        return outputs  # a, (attentions)
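The shape bookkeeping and the causal mask in `_attn` are easier to follow on a small tensor. The sketch below re-implements only those steps as free functions with made-up sizes; it is not the library code, the module-level `split_heads` and `merge_heads` are simplified copies, and dropout, `attention_mask` and `head_mask` are left out.

```python
import math
import torch

torch.manual_seed(0)
batch, seq_len, n_embd, n_head = 2, 4, 8, 2  # arbitrary toy sizes
head_dim = n_embd // n_head


def split_heads(x, k=False):
    # (batch, seq, n_embd) -> (batch, head, seq, head_dim); keys come out transposed
    x = x.view(batch, seq_len, n_head, head_dim)
    return x.permute(0, 2, 3, 1) if k else x.permute(0, 2, 1, 3)


def merge_heads(x):
    # (batch, head, seq, head_dim) -> (batch, seq, n_embd)
    return x.permute(0, 2, 1, 3).contiguous().view(batch, seq_len, n_embd)


q_in, k_in, v_in = (torch.randn(batch, seq_len, n_embd) for _ in range(3))
q, k, v = split_heads(q_in), split_heads(k_in, k=True), split_heads(v_in)

w = torch.matmul(q, k) / math.sqrt(head_dim)  # scores: (batch, head, seq, seq)
b = torch.tril(torch.ones(seq_len, seq_len)).view(1, 1, seq_len, seq_len)
w = w * b + -1e4 * (1 - b)                    # same masking formula as in _attn
probs = torch.softmax(w, dim=-1)
out = merge_heads(torch.matmul(probs, v))

print(probs[0, 0])  # entries above the diagonal are zero
print(out.shape)    # back to (batch, seq, n_embd)
```

Each row of `probs` still sums to 1, but only over the current and earlier positions, which is what makes the attention causal.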

4. MLP class source walkthrough

Source: transformers/src/transformers/models/openai/modeling_openai.py

# -*- coding: utf-8 -*-
# @time: 2024/9/3 20:45
from torch import nn
from transformers.pytorch_utils import Conv1D
from transformers.activations import silu, gelu_new

# ACT_FNS: lookup table of activation functions
ACT_FNS = {"relu": nn.ReLU(), "silu": silu, "gelu": gelu_new, "swish": silu}


class MLP(nn.Module):
    def __init__(self, n_state, config):  # in MLP: n_state=3072 (4 * n_embd)
        super().__init__()
        nx = config.n_embd
        self.c_fc = Conv1D(n_state, nx)  # Conv1D layer c_fc: nx inputs, n_state outputs
        self.c_proj = Conv1D(nx, n_state)  # Conv1D layer c_proj: n_state inputs, nx outputs
        self.act = ACT_FNS[config.afn]  # pick the activation function named by config.afn
        self.dropout = nn.Dropout(config.resid_pdrop)  # dropout layer, to reduce overfitting

    def forward(self, x):
        h = self.act(self.c_fc(x))  # expand x from nx to n_state dimensions, then apply the activation
        h2 = self.c_proj(h)  # project the activated h from n_state back to nx dimensions
        return self.dropout(h2)  # apply dropout to h2 and return the result

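In this MLP module, the first transformation expands the input from nx to 4 * nx dimensions, and the second projects it back from 4 * nx to nx. Why is it done this way, and what does it achieve? Feel free to discuss in the comments and share your answer!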
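To make the expand-then-project shapes concrete, here is a rough sketch with the default sizes mentioned in the source comments (n_embd=768, n_state=3072). As an assumption for illustration, `nn.Linear` stands in for `Conv1D` and PyTorch's built-in GELU stands in for `gelu_new`; dropout is omitted.

```python
import torch
from torch import nn

n_embd = 768
n_state = 4 * n_embd  # 3072

c_fc = nn.Linear(n_embd, n_state)    # stand-in for Conv1D(n_state, nx)
c_proj = nn.Linear(n_state, n_embd)  # stand-in for Conv1D(nx, n_state)

x = torch.randn(2, 5, n_embd)         # (batch, sequence, embedding)
h = nn.functional.gelu(c_fc(x))       # expanded hidden state
h2 = c_proj(h)                        # projected back to the embedding size

print(x.shape, h.shape, h2.shape)

# Weights and biases of both layers
mlp_params = sum(p.numel() for p in list(c_fc.parameters()) + list(c_proj.parameters()))
print(mlp_params)
```

The input and output sizes match, which is what allows the residual addition `n + m` in Block, while the wider hidden layer holds most of the MLP's parameters.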

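Supplement:
Conv1D in effect implements a linear transformation and is very similar to a standard fully connected layer (nn.Linear). It performs the matrix multiplication and the bias addition in a single torch.addmm call. The main difference from nn.Linear is that the weight is stored transposed, with shape (nx, nf), which keeps the layer compatible with the model's weight layout.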

Conv1D class source walkthrough:
Source: transformers/src/transformers/pytorch_utils.py

# -*- coding: utf-8 -*-
# @time: 2024/9/4 11:31
import torch
from torch import nn


class Conv1D(nn.Module):
    """
    1D-convolutional layer as defined by Radford et al. for OpenAI GPT (and also used in GPT-2).

    Basically works like a linear layer but the weights are transposed.

    Args:
        nf (`int`): The number of output features.
        nx (`int`): The number of input features.
    """

    def __init__(self, nf, nx):
        super().__init__()
        self.nf = nf  # number of output features
        self.nx = nx  # number of input features
        self.weight = nn.Parameter(torch.empty(nx, nf))  # weight parameter of shape (nx, nf)
        self.bias = nn.Parameter(torch.zeros(nf))  # bias parameter of shape (nf,), initialized to zero
        nn.init.normal_(self.weight, std=0.02)  # initialize the weight from a normal distribution with mean 0 and std 0.02

    def __repr__(self) -> str:
        return "Conv1D(nf={nf}, nx={nx})".format(**self.__dict__)  # string form of the layer, showing output and input feature counts

    def forward(self, x):
        size_out = x.size()[:-1] + (self.nf,)  # output shape: all dimensions of x except the last, followed by self.nf
        x = torch.addmm(self.bias, x.view(-1, x.size(-1)), self.weight)  # flatten x to 2D, multiply by the weight and add the bias
        x = x.view(size_out)  # restore the leading dimensions
        return x
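The similarity between Conv1D and nn.Linear can be checked numerically. The following is a minimal sketch with arbitrary toy sizes: it reproduces the Conv1D forward pass with torch.addmm on hand-made tensors, copies the transposed weight into an nn.Linear, and compares the two outputs.

```python
import torch
from torch import nn

torch.manual_seed(0)
nx, nf = 8, 24  # arbitrary input and output feature counts

# Conv1D-style parameters: the weight has shape (nx, nf)
weight = torch.randn(nx, nf) * 0.02
bias = torch.randn(nf)

x = torch.randn(2, 5, nx)
# Conv1D forward: flatten the leading dimensions, addmm, restore the shape
conv1d_out = torch.addmm(bias, x.view(-1, nx), weight).view(2, 5, nf)

# nn.Linear stores its weight as (nf, nx), so copy the transpose
linear = nn.Linear(nx, nf)
with torch.no_grad():
    linear.weight.copy_(weight.t())
    linear.bias.copy_(bias)
linear_out = linear(x).detach()

max_diff = (conv1d_out - linear_out).abs().max().item()
print(max_diff)  # difference is at the level of floating-point rounding
```

Under these assumptions the two layers compute the same function; only the storage layout of the weight differs.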