
An Introduction to Mixture-of-Experts (MoE) Models

Source: original, 2024/9/20 16:59:40

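Model scale is a key driver of large language model (LLM) performance, but it also raises compute cost. The Mixture of Experts (MoE) architecture combines distributed expert layers with a dynamic gating mechanism so that each token activates only a few experts. This cuts the compute needed per token and lets a model grow its parameter count while still running efficiently.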

Mixtral of Experts

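Code: mistral-inference/src/mistral_inference at main · mistralai/mistral-inference

Figure: the mixture-of-experts layer. There are 8 experts in total; a gating network scores the input vector and routes it to 2 of them, and their outputs are combined as a weighted sum.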

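Mixtral 8x7B model configuration:

Parameter	Meaning	Value
dim	Embedding (hidden) dimension	4096
n_layers	Number of Transformer layers	32
head_dim	Dimension of each attention head (MHA)	128
hidden_dim	Hidden dimension of the FFN	14336
n_heads	Number of attention heads (MHA)	32
n_kv_heads	Number of key/value heads (GQA)	8
context_len	Context length	32768
vocab_size	Vocabulary size	32000
num_experts	Number of experts	8
top_k_experts	Number of experts each token is routed to	2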
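To see how this configuration trades total size against per-token cost, the sketch below counts parameters from the table, using hidden_dim = 14336. It rests on assumptions the table does not state: each expert is taken to be a SwiGLU FFN with three dim x hidden_dim matrices, linear layers have no bias, each layer has two norm weight vectors, and the input and output embeddings are not tied. Treat the result as an estimate under those assumptions.

```python
# Rough parameter count for Mixtral 8x7B from the configuration table.
# Assumptions (not in the table): SwiGLU experts with three weight matrices,
# bias-free linear layers, two norm vectors per layer, untied embeddings.
dim, n_layers, head_dim, hidden_dim = 4096, 32, 128, 14336
n_heads, n_kv_heads, vocab_size = 32, 8, 32000
num_experts, top_k_experts = 8, 2

# Attention: Q and O projections use n_heads, K and V use n_kv_heads (GQA).
attn = 2 * dim * n_heads * head_dim + 2 * dim * n_kv_heads * head_dim
expert = 3 * dim * hidden_dim            # one FFN expert
gate = dim * num_experts                 # routing layer
norms = 2 * dim                          # two norm weight vectors per layer
embeddings = 2 * vocab_size * dim + dim  # input + output embeddings, final norm


def count_params(experts_per_layer: int) -> int:
    """Parameters touched when `experts_per_layer` experts are counted per layer."""
    per_layer = attn + experts_per_layer * expert + gate + norms
    return n_layers * per_layer + embeddings


total_params = count_params(num_experts)      # everything stored in memory
active_params = count_params(top_k_experts)   # what a single token actually uses
print(f"total:  {total_params / 1e9:.1f}B")            # total:  46.7B
print(f"active: {active_params / 1e9:.1f}B per token")  # active: 12.9B per token
```

Under these assumptions only about 28% of the weights take part in processing any one token, which is the efficiency argument made in the introduction.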

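MoE layer details:

Let the experts of the MoE layer be $\{E_0, E_1, \dots, E_{n-1}\}$ and let the routing (gating) layer be $G$. The layer output is:
$\sum^{n-1}_{i=0}G(x)_i \cdot E_i(x)$
In Mixtral, every expert is an FFN. The routing layer supplies one weight per expert, and the MoE output is the sum of the expert outputs weighted by them.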

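The routing layer is itself a linear layer whose output dimension equals the number of experts. Let $W_g$ be its weight matrix, of shape (dim, num_experts):
$G(x)=Softmax(TopK(x\cdot W_g))$
When top_k_experts=2, the two highest-scoring experts are selected and the Softmax is taken along the num_experts dimension; TopK sets the logits of the unselected experts to $-\infty$, so their weights are 0.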
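The routing formula can be checked on a single token with plain Python. This is a minimal sketch: the logits are made-up values standing in for $x\cdot W_g$, and `softmax` and `route` are helper names introduced here for illustration.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]   # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


def route(logits, k):
    """G(x) = Softmax(TopK(x . W_g)) for one token's router logits."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # TopK keeps the k largest logits and sets all the others to -inf
    masked = [logits[i] if i in top else -math.inf for i in range(len(logits))]
    return softmax(masked)


# Hypothetical router logits for one token over 8 experts
logits = [0.1, 2.0, -1.0, 0.5, 1.0, -0.3, 0.0, 0.2]
weights = route(logits, k=2)
print([round(w, 3) for w in weights])
# [0.0, 0.731, 0.0, 0.0, 0.269, 0.0, 0.0, 0.0]
```

Only experts 1 and 4 get a non-zero weight, and the two weights sum to 1, so the other six experts never need to be evaluated for this token.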

Mixtral MoE implementation (PyTorch):

    import torch
    from torch import nn


    class MoE(nn.Module):
        def __init__(
            self,
            num_experts: int,
            num_experts_per_tok: int,
            **kwargs,
        ):
            super().__init__()
            # Initialize the experts, e.g. 8 experts split across 2 GPUs.
            # FeedForward is the FFN block defined elsewhere in the repository.
            # Note: with experts on different GPUs, inputs must be moved to
            # each expert's device before the call in forward().
            self.experts = nn.ModuleList(
                [FeedForward(**kwargs).to(f"cuda:{i//4}") for i in range(num_experts)]
            )
            # Gating linear layer
            self.gate = nn.Linear(kwargs["dim"], num_experts, bias=False)
            # Number of experts each token is routed to
            self.num_experts_per_tok = num_experts_per_tok

        def forward(self, x):
            orig_shape = x.shape
            # (b, n, d) -> (b*n, d)
            x = x.view(-1, x.shape[-1])
            # shape: (b*n, num_experts)
            scores = self.gate(x)
            # Two tensors of shape (b*n, k): the weights and the expert indices
            expert_weights, expert_indices = torch.topk(scores, self.num_experts_per_tok, dim=-1)
            # Normalize so that the k weights sum to 1
            expert_weights = expert_weights.softmax(dim=-1)
            # (b*n, k) -> (b*n*k,)
            flat_expert_indices = expert_indices.view(-1)
            # (b*n, d) -> (b*n*k, d)
            x = x.repeat_interleave(self.num_experts_per_tok, dim=0)
            y = torch.empty_like(x)
            for i, expert in enumerate(self.experts):
                # Route by index: feed each token copy to the expert its index names
                y[flat_expert_indices == i] = expert(x[flat_expert_indices == i])
            # (b*n, k, d) * (b*n, k, 1) -> (b*n, d)
            y = (y.view(*expert_weights.shape, -1) * expert_weights.unsqueeze(-1)).sum(dim=1)
            return y.view(*orig_shape)  # (b*n, d) -> (b, n, d)
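The index bookkeeping in `forward` is easier to follow on a toy case. The sketch below mirrors the same three steps (flatten the indices, repeat each token k times, fill in outputs expert by expert) with plain Python lists. The scalar tokens, the three lambda "experts", and the routing weights are all invented for illustration and are not part of the Mixtral code.

```python
# Toy setup: 3 tokens, 3 experts, k = 2. Tokens are scalars and every
# "expert" is a plain function, purely for illustration.
experts = [lambda v: v + 1, lambda v: v * 10, lambda v: -v]
x = [1.0, 2.0, 3.0]                                     # (b*n,)
expert_indices = [[0, 2], [1, 2], [0, 1]]               # (b*n, k)
expert_weights = [[0.7, 0.3], [0.6, 0.4], [0.5, 0.5]]   # each row sums to 1
k = 2

# (b*n, k) -> (b*n*k,), the equivalent of expert_indices.view(-1)
flat_expert_indices = [i for row in expert_indices for i in row]
# (b*n,) -> (b*n*k,), the equivalent of x.repeat_interleave(k, dim=0)
x_rep = [v for v in x for _ in range(k)]

y = [None] * len(x_rep)
for i, expert in enumerate(experts):
    # Fill in the positions whose routing index names expert i
    for pos, idx in enumerate(flat_expert_indices):
        if idx == i:
            y[pos] = expert(x_rep[pos])

# Regroup into (b*n, k) and take the weighted sum over the k slots
out = [
    sum(w * y[t * k + j] for j, w in enumerate(expert_weights[t]))
    for t in range(len(x))
]
print(y)    # per-copy expert outputs, in the original token order
print(out)  # one combined output per token
```

Because each token's k copies sit next to each other after the repeat, reshaping back to (b*n, k) lines every output up with its routing weight without any extra index lookup.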

DeepSeekMoE

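Code: modeling_deepseek.py · deepseek-ai/deepseek-moe-16b-base at main (huggingface.co)

DeepSeekMoE targets two problems:

Knowledge Hybridity: the number of experts is limited while text covers many kinds of knowledge, so tokens of very different types are routed to the same expert
Knowledge Redundancy: different experts may all end up learning the same general, common-sense knowledge, which wastes parameters

It proposes two solutions:

Fine-Grained Expert Segmentation: keep the total parameter count fixed, split each expert into smaller ones, and combine them flexibly to fit different contexts
Shared Expert Isolation: keep some experts always active so that they learn the knowledge common to all contexts, which lets the remaining experts specialize more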

Figure: (a) conventional top-2 expert routing, as in Mixtral; (b) fine-grained expert segmentation; (c) shared expert isolation

DeepSeekMoE model configurations:

Params	Layers	Hidden Size	Attn Heads	Shared Experts	Routed Experts	Relative Expert Size	Sequence Length
2.0B	9	1280	10	1	63 (7 act.)	0.25	2048
16.4B	28	2048	16	2	64 (6 act.)	0.25	4096
144.6B	62	4096	32	4	128 (12 act.)	0.125	4096

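Relative Expert Size is each expert's intermediate dimension relative to a standard FFN layer. As the number of experts grows, each expert has to shrink: with 64 experts, cutting every FFN to 1/4 of its original size keeps the total parameter count on par with 16 standard-size experts.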
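The budget argument can be checked against the three configurations directly. The sketch below expresses each model's MoE layer in units of one standard FFN; it takes 63 routed experts for the 2.0B model, the value given in the DeepSeekMoE paper. The final line counts expert combinations for the paper's illustrative case of 16 experts with top-2 routing versus 64 quarter-size experts with top-8 routing.

```python
from math import comb

# (shared experts, routed experts, activated routed experts, relative expert size)
configs = {
    "2.0B": (1, 63, 7, 0.25),
    "16.4B": (2, 64, 6, 0.25),
    "144.6B": (4, 128, 12, 0.125),
}

budgets = {}
for name, (shared, routed, active, rel_size) in configs.items():
    total_ffn = (shared + routed) * rel_size    # parameters, in standard-FFN units
    active_ffn = (shared + active) * rel_size   # compute per token, same units
    budgets[name] = (total_ffn, active_ffn)
    print(f"{name}: {total_ffn} FFNs of parameters, {active_ffn} FFNs active per token")

# Splitting each of 16 experts into 4 and activating 4x as many keeps the
# budget fixed but multiplies the number of possible expert combinations.
print(comb(16, 2), comb(64, 8))   # 120 4426165368
```

All three models come out at roughly 16 standard FFNs of parameters with 2 active per token, the same budget as a conventional 16-expert, top-2 layer.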

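DeepSeekMoE implementation details:

A fine-grained expert is simply the original FFN with fewer parameters, used in larger numbers
A shared expert is an expert that is always active; it is also a reduced-size FFN
There is also a load-balancing loss that penalizes routing too many tokens to one or a few experts; it acts as a regularizer

Load-balancing loss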
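Written in the notation of the MoEGate code later in this article (`Pi`, `fi`, `alpha`), the loss can be stated as follows. Here $N$ is the number of routed experts, $K$ the number of experts activated per token, $T$ the number of tokens, and $s_{i,t}$ the softmax gate score of expert $i$ for token $t$; these symbols are introduced here for illustration.

$L_{bal}=\alpha\sum_{i=1}^{N}f_i P_i$
$f_i=\frac{N}{KT}\sum_{t=1}^{T}\mathbb{1}(\text{token } t \text{ selects expert } i),\quad P_i=\frac{1}{T}\sum_{t=1}^{T}s_{i,t}$

With perfectly uniform routing, every $f_i=1$ and every $P_i=1/N$, so the sum equals 1; concentrating tokens and scores on a few experts makes it larger.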

As an example, assume 6 tokens, 4 experts, and top-K = 2, and ignore the constant factor in front of the loss.

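When the load is balanced, every expert receives roughly the same number of tokens: 6 tokens routed to 2 experts each give 12 assignments, or 3 per expert.

In this case the load-balancing loss is $3 * (0.24 + 0.26 + 0.25 + 0.25) = 3$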

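When the load is unbalanced, tokens concentrate on experts 1 and 2, which receive 6 and 4 of the 12 assignments.

In this case the load-balancing loss is $6 * 0.47 + 4 * 0.32 + 1 * 0.10 + 1 * 0.11 = 4.31$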
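The two hand calculations can be reproduced in a few lines. In this sketch `balance_loss` is a helper name introduced here; by default it drops the constant factor as the example does, and it can optionally restore the $N/(KT)$ scaling used by the gate code.

```python
def balance_loss(tokens_per_expert, mean_scores, num_tokens=None, top_k=None):
    """Sum of f_i * P_i over the experts.

    Without num_tokens/top_k, f_i is the raw assignment count (constant
    dropped, as in the hand calculation). With them, f_i is scaled by
    N / (K * T), where N is the number of experts.
    """
    n_experts = len(tokens_per_expert)
    scale = 1.0
    if num_tokens is not None and top_k is not None:
        scale = n_experts / (top_k * num_tokens)
    return sum(scale * f * p for f, p in zip(tokens_per_expert, mean_scores))


# 6 tokens, 4 experts, top-K = 2 -> 12 assignments in total
balanced = balance_loss([3, 3, 3, 3], [0.24, 0.26, 0.25, 0.25])
unbalanced = balance_loss([6, 4, 1, 1], [0.47, 0.32, 0.10, 0.11])
print(round(balanced, 2), round(unbalanced, 2))   # 3.0 4.31

# The same comparison with the N / (K * T) factor restored
balanced_scaled = balance_loss([3, 3, 3, 3], [0.24, 0.26, 0.25, 0.25], 6, 2)
unbalanced_scaled = balance_loss([6, 4, 1, 1], [0.47, 0.32, 0.10, 0.11], 6, 2)
print(round(balanced_scaled, 3), round(unbalanced_scaled, 3))
```

The unbalanced routing scores higher in both versions, so minimizing this term pushes the gate toward spreading tokens evenly.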

Reading the deepseek-moe-16b-base code:

Gate network

    import torch
    import torch.nn.functional as F
    from torch import nn


    class MoEGate(nn.Module):
        def __init__(self, config):
            super().__init__()
            self.config = config
            self.top_k = config.num_experts_per_tok  # top-K
            self.n_routed_experts = config.n_routed_experts  # number of routed experts
            self.scoring_func = config.scoring_func  # softmax
            self.alpha = config.aux_loss_alpha  # weight of the load-balancing loss
            # Gating network
            self.gating_dim = config.hidden_size
            # Note: torch.empty leaves the weight uninitialized; it must be
            # initialized or loaded from a checkpoint before use
            self.weight = nn.Parameter(torch.empty((self.n_routed_experts, self.gating_dim)))

        def forward(self, hidden_states):
            bsz, seq_len, h = hidden_states.shape
            hidden_states = hidden_states.view(-1, h)
            # Score every expert
            logits = F.linear(hidden_states, self.weight, None)
            if self.scoring_func == 'softmax':
                scores = logits.softmax(dim=-1)
            else:
                raise NotImplementedError(f'insupportable scoring function for MoE gating: {self.scoring_func}')
            # Select the top-k experts, shape: (bsz * seq_len, k)
            topk_weight, topk_idx = torch.topk(scores, k=self.top_k, dim=-1, sorted=False)
            # Compute the load-balancing loss
            scores_for_aux = scores
            aux_topk = self.top_k
            topk_idx_for_aux_loss = topk_idx.view(bsz, -1)
            # (bsz * seq_len, n_routed_experts) -> (n_routed_experts, )
            Pi = scores_for_aux.mean(0)
            # One-hot encode the assigned expert indices,
            # shape: (bsz * seq_len * k, n_routed_experts)
            mask_ce = F.one_hot(topk_idx_for_aux_loss.view(-1), num_classes=self.n_routed_experts)
            ce = mask_ce.float().mean(0)
            fi = ce * self.n_routed_experts
            aux_loss = (Pi * fi).sum() * self.alpha
            return topk_idx, topk_weight, aux_loss
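One detail worth noticing is the order of operations. As written in this excerpt, MoEGate applies the softmax over all experts and then takes the top-k, whereas the Mixtral MoE layer earlier takes the top-k logits first and applies the softmax afterwards. The sketch below compares the two orders on made-up logits; `softmax` and `topk` are plain-Python helpers introduced here.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def topk(values, k):
    """Return (values, indices) of the k largest entries, largest first."""
    idx = sorted(range(len(values)), key=lambda i: values[i], reverse=True)[:k]
    return [values[i] for i in idx], idx


logits = [2.0, 1.0, 0.0, -1.0]   # hypothetical router logits for 4 experts
k = 2

# MoEGate order: softmax over all experts, then keep the top-k scores
deepseek_w, deepseek_idx = topk(softmax(logits), k)
# Mixtral MoE order: keep the top-k logits, then softmax over those k
top_logits, mixtral_idx = topk(logits, k)
mixtral_w = softmax(top_logits)

print(deepseek_idx == mixtral_idx)   # True: softmax preserves the ranking
print(round(sum(deepseek_w), 3))     # below 1: unselected experts keep some mass
print(round(sum(mixtral_w), 3))      # 1.0
```

Both orders pick the same experts; they differ only in whether the selected weights are renormalized to sum to 1.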

DeepSeekMoE layer

    # DeepseekMLP and AddAuxiliaryLoss are defined in the same
    # modeling_deepseek.py file and are not shown here.
    class DeepseekMoE(nn.Module):
        def __init__(self, config):
            super().__init__()
            self.config = config
            self.num_experts_per_tok = config.num_experts_per_tok
            self.experts = nn.ModuleList(
                [DeepseekMLP(config, intermediate_size=config.moe_intermediate_size)
                 for i in range(config.n_routed_experts)]
            )
            self.gate = MoEGate(config)
            if config.n_shared_experts is not None:
                # All shared experts are fused into a single, wider MLP
                intermediate_size = config.moe_intermediate_size * config.n_shared_experts
                self.shared_experts = DeepseekMLP(config=config, intermediate_size=intermediate_size)

        def forward(self, hidden_states):
            identity = hidden_states
            orig_shape = hidden_states.shape
            # shape: (bsz * seq_len, k)
            topk_idx, topk_weight, aux_loss = self.gate(hidden_states)
            # (bsz, seq_len, d) -> (bsz*seq_len, d)
            hidden_states = hidden_states.view(-1, hidden_states.shape[-1])
            flat_topk_idx = topk_idx.view(-1)
            # Duplicate the input k times so a simple for loop can process it
            hidden_states = hidden_states.repeat_interleave(self.num_experts_per_tok, dim=0)
            y = torch.empty_like(hidden_states)
            # Compute the routed experts' outputs
            for i, expert in enumerate(self.experts):
                y[flat_topk_idx == i] = expert(hidden_states[flat_topk_idx == i])
            y = (y.view(*topk_weight.shape, -1) * topk_weight.unsqueeze(-1)).sum(dim=1)
            y = y.view(*orig_shape)
            y = AddAuxiliaryLoss.apply(y, aux_loss)
            # Add the shared experts' output
            if self.config.n_shared_experts is not None:
                y = y + self.shared_experts(identity)
            return y

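As the code shows, neither fine-grained experts nor shared experts change the structure of an individual expert; what changes is the semantics each expert is responsible for. Fine-grained segmentation refines those semantics and yields more combinations to cover a wider range of contexts, while shared experts factor out the common semantics so that the remaining fine-grained experts can focus on what is specific to them.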

DeepSeekMoE V2

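The code has not been open-sourced; the weights and configuration files are available at: deepseek-ai/DeepSeek-V2 at main (huggingface.co)

The paper's main contribution is MLA (Multi-Head Latent Attention): Equipped with low-rank key-value joint compression, boosting inference efficiency.

To speed up inference, large language models (LLMs) store intermediate results in a Key-Value (KV) cache, but the cache adds memory pressure and limits batch size and sequence length.

Several self-attention variants have been developed for Transformers to reduce the KV cache: MHA (the baseline), GQA, MQA, and MLA.

The table below compares them: $n_h$ is the number of heads, $d_h$ the dimension of each head, $l$ the number of Transformer layers, $n_{kv}$ the number of KV heads in GQA, and $d_c$ the dimension of the compressed KV vector in MLA.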

Name	MHA	GQA	MQA	MLA
KV cache	$2n_hd_hl$	$2n_{kv}d_hl$	$2d_hl$	$d_cl$

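Core idea: project the keys and values from a high-dimensional space into a low-dimensional one, cache only the low-dimensional vectors (the projection matrices are ordinary model weights), and reconstruct the keys and values from them at inference time.

After the linear projections, $q, k, v$ each have dimension $n_hd_h$:
$q=W^Qh,\ k=W^Kh,\ v=W^Vh$
The attention computation is:
$o=Softmax(\frac{q^Tk}{\sqrt{d_h}})v \\ u=W^Oo$
MLA compresses the keys and values into a low-rank vector $c^{KV}$ of dimension $d_c \ll n_hd_h$, and at inference time $c^{KV}$ is projected back up to the full-size keys and values:
$c^{KV}=W^{DKV}h \\ k^C=W^{UK}c^{KV},\ v^C=W^{UV}c^{KV}$
Operator fusion: the up-projection $W^{UK}$ can be fused into $W^Q$, and $W^{UV}$ into $W^O$, so the keys and values never have to be computed explicitly; the attention takes $c^{KV}$ as input and produces $u$ directly.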
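The fusion follows from associativity of matrix multiplication. As a sketch for a single head, with $h_t$ the hidden state of the current token, $c_j^{KV}$ the cached vector of an earlier token $j$, and $a_{t,j}$ the softmax attention weights (the indices $t$, $j$ and the symbol $a_{t,j}$ are notation introduced here):

$q_t^Tk_j^C=(W^Qh_t)^T(W^{UK}c_j^{KV})=h_t^T\big((W^Q)^TW^{UK}\big)c_j^{KV}$
$u_t=W^O\sum_j a_{t,j}W^{UV}c_j^{KV}=\big(W^OW^{UV}\big)\sum_j a_{t,j}c_j^{KV}$

The two bracketed products depend only on the weights, so they can be computed once ahead of time, and the attention then works on the cached $c_j^{KV}$ alone.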

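In addition, to reduce memory use, the paper applies low-rank compression to the query as well during training, even though this adds a small amount of computation.

However, operator fusion also introduces a problem: it is incompatible with rotary position embedding (RoPE), because RoPE has to be applied directly to the query and key.

Solution: split the query and key into two parts, handle one with MLA and the other with MQA, and apply RoPE only to the MQA part.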

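Revising the MLA entry in the earlier table: $c^{KV}$ is a single $d_c$-dimensional vector shared by all heads, and the shared RoPE key has to be cached alongside it, so the actual MLA KV cache is $(d_c+d_R)l$, where $d_R$ is the dimension of the shared key in the MQA part.

As an example, assume 8 heads of dimension 64 each, $n_{kv}=4$ for GQA, and $d_c=512$, $d_R=64$ for MLA: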

Name	MHA	GQA	MQA	MLA
KV cache	$1024 l$	$512 l$	$128 l$	$(512 + 64) l = 576 l$
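The example row can be recomputed with a small helper. This sketch assumes the MLA cache holds one $d_c$-dimensional compressed vector plus one $d_R$-dimensional RoPE key per token, as described in the DeepSeek-V2 paper, with $d_c=512$ and $d_R=64$; `kv_cache_per_token` is a name introduced here.

```python
def kv_cache_per_token(n_h, d_h, n_kv, d_c, d_r):
    """KV-cache elements per token and per layer for each attention variant."""
    return {
        "MHA": 2 * n_h * d_h,    # a key and a value for every head
        "GQA": 2 * n_kv * d_h,   # a key and a value for every KV head
        "MQA": 2 * d_h,          # one key and one value shared by all heads
        "MLA": d_c + d_r,        # one compressed vector plus the shared RoPE key
    }


sizes = kv_cache_per_token(n_h=8, d_h=64, n_kv=4, d_c=512, d_r=64)
for name, size in sizes.items():
    print(f"{name}: {size} l")
# MHA: 1024 l
# GQA: 512 l
# MQA: 128 l
# MLA: 576 l
```

Multiplying each value by the number of layers $l$, the sequence length, and the bytes per element gives the cache size for one sequence.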

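Routing tokens to experts with a for loop is very inefficient. For efficient inference, token reordering and dispatch are usually reimplemented as part of the parallelism strategy (data parallelism, model parallelism, expert parallelism).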
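One common way to do the reordering is to sort the token copies by expert index, so that each expert processes one contiguous batch, and then scatter the results back. The sketch below shows only that regrouping step, in a single process with toy list-based "experts"; it is an illustration under those assumptions, not the DeepSeek or Mixtral implementation, and it leaves out the cross-device communication that expert parallelism adds.

```python
def dispatch_sorted(x, flat_expert_indices, experts):
    """Group routed token copies by expert, run each expert once, restore order."""
    # Stable argsort by expert id: copies for the same expert become contiguous
    order = sorted(range(len(x)), key=lambda pos: flat_expert_indices[pos])
    y = [None] * len(x)
    start = 0
    for i, expert in enumerate(experts):
        count = flat_expert_indices.count(i)    # copies assigned to expert i
        if count == 0:
            continue                            # idle expert: nothing to run
        positions = order[start:start + count]
        batch = [x[pos] for pos in positions]   # one contiguous batch per expert
        for pos, out in zip(positions, expert(batch)):
            y[pos] = out                        # scatter back to the original slot
        start += count
    return y


# Toy experts that process a whole batch at once (hypothetical stand-ins for FFNs)
experts = [
    lambda batch: [v + 1 for v in batch],
    lambda batch: [v * 10 for v in batch],
    lambda batch: [-v for v in batch],
]
x_rep = [1.0, 1.0, 2.0, 2.0, 3.0, 3.0]    # tokens already repeated k = 2 times
flat_expert_indices = [0, 2, 1, 2, 0, 1]
y = dispatch_sorted(x_rep, flat_expert_indices, experts)
print(y)   # [2.0, -1.0, 20.0, -2.0, 4.0, 30.0]
```

Each expert is called exactly once on a dense batch instead of being applied through a boolean mask, and idle experts are skipped altogether.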

References:

Mixture of Experts Explained (huggingface.co)
Mixtral of Experts
DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model