Machine Learning Notes, Linear Classification: Gaussian Discriminant Analysis (Part 2), Solving for the Optimal Parameters
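Introduction
The previous section introduced how the strategy for Gaussian Discriminant Analysis (GDA) is constructed. This section builds on that strategy and solves for the optimal parameters of the probability distributions.
Recap: how the GDA strategy is constructed
Gaussian discriminant analysis is a typical probabilistic generative model. Its core step is to use Bayes' theorem to turn the search for the label with the largest posterior probability into a maximization of the product of the likelihood and the prior: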
$$\begin{aligned}\hat{\mathcal Y}_{pred} & = \mathop{\arg\max}\limits_{\mathcal Y \in \{0,1\}} P(\mathcal Y \mid \mathcal X) \\ & = \mathop{\arg\max}\limits_{\mathcal Y \in \{0,1\}} P(\mathcal X \mid \mathcal Y)P(\mathcal Y) \end{aligned}$$
The second equality holds because $P(\mathcal Y \mid \mathcal X) \propto P(\mathcal X \mid \mathcal Y)P(\mathcal Y)$: the evidence $P(\mathcal X)$ does not depend on $\mathcal Y$.
Under the binary-classification assumption, let $\mathcal Y$ follow a Bernoulli distribution. The probability mass function of the prior $P(\mathcal Y)$ is then:
$$P(\mathcal Y) = \phi^{\mathcal Y}(1 - \phi)^{1 - \mathcal Y}$$
where $\phi$ is the probability that $\mathcal Y$ takes the label $1$. Given the prior $P(\mathcal Y)$, the class-conditional likelihoods $P(\mathcal X \mid \mathcal Y=1)$ and $P(\mathcal X \mid \mathcal Y = 0)$ are both assumed to be Gaussian, with a shared covariance matrix $\Sigma$:
$$\begin{cases}\mathcal X \mid \mathcal Y=1 \sim \mathcal N(\mu_1,\Sigma) \\ \mathcal X \mid \mathcal Y=0 \sim \mathcal N(\mu_2,\Sigma) \end{cases}$$
Combining the two cases into a single expression:
$$P(\mathcal X \mid \mathcal Y) = \mathcal N(\mu_1,\Sigma)^{\mathcal Y} \mathcal N(\mu_2,\Sigma)^{1 - \mathcal Y}$$
At this point both the prior $P(\mathcal Y)$ and the likelihood $P(\mathcal X \mid \mathcal Y)$ are fully specified. Together they contain four distribution parameters:
$$\theta = \{\mu_1,\mu_2,\Sigma,\phi\}$$
Let $\mathcal L(\theta)$ denote the log-likelihood function, written as follows.
Note: this function is built from the joint distribution $P(x, y)$, not from the likelihood $P(x \mid y)$ alone.
$$\begin{aligned} \mathcal L(\theta) & = \log \prod_{i=1}^N P(x^{(i)},y^{(i)}) \\ & = \log \prod_{i=1}^N P(x^{(i)} \mid y^{(i)})P(y^{(i)}) \\ & = \sum_{i=1}^N \left[\log P(x^{(i)} \mid y^{(i)}) + \log P(y^{(i)})\right] \end{aligned}$$
Substituting the distributions above into $\mathcal L(\theta)$:
$$\begin{aligned} \mathcal L(\theta) & = \sum_{i=1}^N \left\{\log\left[\mathcal N(\mu_1,\Sigma)^{y^{(i)}}\mathcal N(\mu_2,\Sigma)^{1- y^{(i)}}\right] + \log \left[\phi^{y^{(i)}}(1- \phi)^{1 - y^{(i)}}\right]\right\} \\ & = \sum_{i=1}^N \left\{\log \left[\mathcal N(\mu_1,\Sigma)^{y^{(i)}}\right] + \log \left[\mathcal N(\mu_2,\Sigma)^{1 - y^{(i)}}\right] + \log \left[\phi^{y^{(i)}}(1- \phi)^{1 - y^{(i)}}\right]\right\} \end{aligned}$$
Finally, the model parameters $\theta$ are obtained by maximum likelihood estimation:
$$\hat {\theta} = \mathop{\arg\max}\limits_{\theta} \mathcal L(\theta)$$
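To make the objective concrete, here is a minimal NumPy sketch that evaluates $\mathcal L(\theta)$ for given parameters. The function names `gaussian_log_pdf` and `joint_log_likelihood` and the toy data are illustrative assumptions, not part of the original notes.

```python
import numpy as np

def gaussian_log_pdf(X, mu, Sigma):
    """Log density of N(mu, Sigma) evaluated at each row of X (shape N x p)."""
    p = X.shape[1]
    diff = X - mu
    # Mahalanobis terms (x - mu)^T Sigma^{-1} (x - mu), one per row,
    # computed with a linear solve instead of an explicit inverse.
    maha = np.sum(diff * np.linalg.solve(Sigma, diff.T).T, axis=1)
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * (p * np.log(2 * np.pi) + logdet + maha)

def joint_log_likelihood(X, y, phi, mu1, mu2, Sigma):
    """L(theta) = sum_i [ log P(x_i | y_i) + log P(y_i) ]."""
    log_lik = (y * gaussian_log_pdf(X, mu1, Sigma)
               + (1 - y) * gaussian_log_pdf(X, mu2, Sigma))
    log_prior = y * np.log(phi) + (1 - y) * np.log(1 - phi)
    return float(np.sum(log_lik + log_prior))

# Hypothetical toy data: two samples per class in two dimensions.
X = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 3.0], [4.0, 3.0]])
y = np.array([0, 0, 1, 1])
value = joint_log_likelihood(
    X, y, phi=0.5,
    mu1=np.array([3.5, 3.0]), mu2=np.array([0.5, 0.0]), Sigma=np.eye(2),
)
print(value)
```

Maximizing this function over $\phi$, $\mu_1$, $\mu_2$ and $\Sigma$ is exactly the problem the rest of the section solves in closed form.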
Solution procedure
Fully expanded, $\mathcal L(\theta)$ reads:
$$\mathcal L(\theta) = \sum_{i=1}^N \log \left[\mathcal N(\mu_1,\Sigma)^{y^{(i)}}\right] + \sum_{i=1}^N \log \left[\mathcal N(\mu_2,\Sigma)^{1 - y^{(i)}}\right] + \sum_{i=1}^N \log \left[\phi^{y^{(i)}}(1- \phi)^{1 - y^{(i)}}\right]$$
Solving for the optimal prior parameter $\phi$
The expansion of $\mathcal L(\theta)$ has three terms, and only the last one contains $\phi$. Therefore:
$$\begin{aligned}\hat {\phi} & = \mathop{\arg\max}\limits_{\phi} \mathcal L(\theta) \\ & = \mathop{\arg\max}\limits_{\phi} \sum_{i=1}^N \log \left[\phi^{y^{(i)}}(1- \phi)^{1 - y^{(i)}}\right] \end{aligned}$$
Expanding this expression:
$$\begin{aligned}\hat \phi & = \mathop{\arg\max}\limits_{\phi} \sum_{i=1}^N\left[\log \phi^{y^{(i)}} + \log (1 - \phi)^{1 - y^{(i)}}\right] \\ & = \mathop{\arg\max}\limits_{\phi} \sum_{i=1}^N\left[y^{(i)} \log \phi + (1 - y^{(i)})\log(1 - \phi)\right] \end{aligned}$$
Since $\phi$ is the only parameter involved, define $\mathcal L(\phi) = \sum_{i=1}^N\left[y^{(i)} \log \phi + (1 - y^{(i)})\log(1 - \phi)\right]$ and differentiate with respect to $\phi$.
Because the denominator does not depend on $i$, the summation can be moved into the numerator.
$$\begin{aligned}\frac{\partial \mathcal L(\phi)}{\partial \phi} & = \sum_{i=1}^N \left[\frac{y^{(i)}}{\phi} - \frac{1 - y^{(i)}}{1 - \phi}\right] \\ & = \sum_{i=1}^N \frac{y^{(i)}(1 - \phi) - \phi(1 - y^{(i)})}{\phi(1 - \phi)} \\ & = \frac{\sum_{i=1}^N\left[y^{(i)}(1 - \phi) - \phi(1 - y^{(i)})\right]}{\phi(1 - \phi)} \end{aligned}$$
Setting $\frac{\partial \mathcal L(\phi)}{\partial \phi} = 0$ requires the numerator to vanish:
$$\sum_{i=1}^N \left[y^{(i)}(1 - \phi) - \phi(1 - y^{(i)})\right] = \sum_{i=1}^N \left(y^{(i)} - \phi\right) = \sum_{i=1}^N y^{(i)} - N\phi = 0$$
$$\hat \phi = \frac{1}{N} \sum_{i=1}^N y^{(i)}$$
Since $y^{(i)} \in \{0,1\}$, $\hat \phi$ is simply the fraction of samples whose label is 1. Writing $N_1 = \sum_{i=1}^N y^{(i)}$:
$$\hat \phi = \frac{N_1}{N}$$
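The closed form $\hat\phi = N_1/N$ can be checked numerically. The sketch below, using hypothetical labels, compares it with a brute-force grid search over the Bernoulli log-likelihood $\mathcal L(\phi)$; the grid search is only a sanity check, not part of the derivation.

```python
import numpy as np

# Hypothetical binary labels: five ones out of eight samples.
y = np.array([1, 0, 0, 1, 1, 0, 1, 1])

# Closed-form maximum likelihood estimate: N_1 / N.
phi_hat = y.mean()

def bernoulli_log_lik(phi, y):
    """L(phi) = sum_i [ y_i log(phi) + (1 - y_i) log(1 - phi) ]."""
    return np.sum(y * np.log(phi) + (1 - y) * np.log(1 - phi))

# Brute-force check: evaluate L(phi) on a grid with spacing 0.001
# and keep the grid point with the largest value.
grid = np.linspace(0.001, 0.999, 999)
values = np.array([bernoulli_log_lik(phi, y) for phi in grid])
phi_grid = grid[np.argmax(values)]

print(phi_hat, phi_grid)
```

The two values agree up to the grid resolution, as expected for a concave objective with a single stationary point.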
Solving for the optimal mean parameters $\mu$ of the likelihood
Solving for the optimal $\hat{\mu}_1$
The two class-conditional likelihoods have different mean parameters, so we take $\mu_1$ as the example and solve for the optimal parameter $\hat{\mu}_1$. Of the three terms in the expansion of $\mathcal L(\theta)$, only the first contains $\mu_1$. Therefore:
$$\begin{aligned}\hat{\mu}_1 & = \mathop{\arg\max}\limits_{\mu_1} \mathcal L(\theta) \\ & = \mathop{\arg\max}\limits_{\mu_1} \sum_{i=1}^N \log \left[\mathcal N(\mu_1,\Sigma)^{y^{(i)}}\right] \\ & = \mathop{\arg\max}\limits_{\mu_1} \sum_{i=1}^N y^{(i)} \log \left[\mathcal N(\mu_1,\Sigma)\right] \end{aligned}$$
Since $\mathcal N(\mu_1,\Sigma)$ is a $p$-dimensional Gaussian, its probability density evaluated at the sample $x^{(i)}$ is:
$$\mathcal N(\mu_1,\Sigma) = \frac{1}{(2\pi)^{\frac{p}{2}}|\Sigma|^{\frac{1}{2}}}e^{-\frac{1}{2}(x^{(i)} - \mu_1)^{T}\Sigma^{-1}(x^{(i)} - \mu_1)}$$
where $|\Sigma|$ is the determinant of the covariance matrix $\Sigma$. Substituting the density into the expression above gives:
$$\begin{aligned}\hat{\mu}_1 & = \mathop{\arg\max}\limits_{\mu_1} \sum_{i=1}^N y^{(i)} \log \left[\frac{1}{(2\pi)^{\frac{p}{2}}|\Sigma|^{\frac{1}{2}}}e^{-\frac{1}{2}(x^{(i)} - \mu_1)^{T}\Sigma^{-1}(x^{(i)} - \mu_1)}\right] \\ & = \mathop{\arg\max}\limits_{\mu_1} \sum_{i=1}^N \left\{ y^{(i)} \log \left[\frac{1}{(2\pi)^{\frac{p}{2}}|\Sigma|^{\frac{1}{2}}}\right] + y^{(i)} \log \left[e^{-\frac{1}{2}(x^{(i)} - \mu_1)^{T}\Sigma^{-1}(x^{(i)} - \mu_1)}\right] \right\} \\ & = \mathop{\arg\max}\limits_{\mu_1} \sum_{i=1}^N \left\{ y^{(i)} \log \left[\frac{1}{(2\pi)^{\frac{p}{2}}|\Sigma|^{\frac{1}{2}}}\right] + y^{(i)} \left[-\frac{1}{2}(x^{(i)} - \mu_1)^{T}\Sigma^{-1}(x^{(i)} - \mu_1)\right]\right\} \end{aligned}$$
Because we are solving for $\hat{\mu}_1$, the term $y^{(i)} \log \left[\frac{1}{(2\pi)^{\frac{p}{2}}|\Sigma|^{\frac{1}{2}}}\right]$ does not depend on $\mu_1$ and can be treated as a constant. Define $\mathcal L(\mu_1) = \sum_{i=1}^N y^{(i)} \left[-\frac{1}{2}(x^{(i)} - \mu_1)^{T}\Sigma^{-1}(x^{(i)} - \mu_1)\right]$, which expands as follows:
$$\begin{aligned} \mathcal L(\mu_1) & = -\frac{1}{2} \sum_{i=1}^N y^{(i)}\left({x^{(i)}}^{T}\Sigma^{-1} - \mu_1^{T}\Sigma^{-1}\right)\left(x^{(i)} - \mu_1\right) \\ & = -\frac{1}{2} \sum_{i=1}^N y^{(i)}\left({x^{(i)}}^{T}\Sigma^{-1}x^{(i)} - \mu_1^{T}\Sigma^{-1}x^{(i)} - {x^{(i)}}^{T}\Sigma^{-1}\mu_1 + \mu_1^{T} \Sigma^{-1}\mu_1\right) \end{aligned}$$
Consider the two cross terms $\mu_1^{T}\Sigma^{-1}x^{(i)}$ and ${x^{(i)}}^{T} \Sigma^{-1} \mu_1$. Both $x^{(i)}$ and $\mu_1$ are $p$-dimensional column vectors and $\Sigma^{-1}$ is a $p \times p$ square matrix, so both terms are scalars.
A scalar equals its own transpose, and $\Sigma^{-1}$ is symmetric, so the two terms are equal:
$$\mu_1^{T}\Sigma^{-1}x^{(i)} = \left(\mu_1^{T}\Sigma^{-1}x^{(i)}\right)^{T} = {x^{(i)}}^{T} \Sigma^{-1} \mu_1 \in \mathbb R$$
Merging the two cross terms gives:
$$\mathcal L(\mu_1) = -\frac{1}{2} \sum_{i=1}^N y^{(i)}\left({x^{(i)}}^{T}\Sigma^{-1}x^{(i)} - 2\mu_1^{T}\Sigma^{-1}x^{(i)} + \mu_1^{T} \Sigma^{-1}\mu_1\right)$$
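As a quick numerical illustration of this step, the sketch below checks that the two cross terms coincide and that the merged expansion reproduces the original quadratic form. The vectors and the matrix are randomly generated stand-ins, not data from the notes.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 3
x = rng.normal(size=p)      # stands in for one sample x^(i)
mu1 = rng.normal(size=p)    # stands in for mu_1

# Build a symmetric positive-definite Sigma (hypothetical), then invert it.
A = rng.normal(size=(p, p))
Sigma = A @ A.T + p * np.eye(p)
Sigma_inv = np.linalg.inv(Sigma)

cross_a = mu1 @ Sigma_inv @ x    # mu_1^T Sigma^{-1} x
cross_b = x @ Sigma_inv @ mu1    # x^T Sigma^{-1} mu_1

# Original quadratic form versus the merged three-term expansion.
quad = (x - mu1) @ Sigma_inv @ (x - mu1)
merged = x @ Sigma_inv @ x - 2 * cross_a + mu1 @ Sigma_inv @ mu1

print(np.isclose(cross_a, cross_b), np.isclose(quad, merged))
```

The equality of the cross terms relies on the symmetry of $\Sigma^{-1}$; with a non-symmetric matrix in its place the first check would generally fail.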
Differentiating with respect to $\mu_1$ uses two standard matrix-calculus identities, which hold here because $\Sigma^{-1}$ is symmetric:
$$\frac{\partial(\mu_1^{T} \Sigma^{-1}\mu_1)}{\partial \mu_1} = 2\Sigma^{-1}\mu_1, \qquad \frac{\partial(\mu_1^{T} \Sigma^{-1}x^{(i)})}{\partial \mu_1} = \Sigma^{-1}x^{(i)}$$
$$\begin{aligned}\frac{\partial \mathcal L(\mu_1)}{\partial \mu_1} & = -\frac{1}{2} \sum_{i=1}^N y^{(i)}\left(-2 \Sigma^{-1}x^{(i)} + 2\Sigma^{-1}\mu_1\right) \\ & = \sum_{i=1}^N y^{(i)}\left(\Sigma^{-1}x^{(i)} - \Sigma^{-1}\mu_1\right) \\ & = \sum_{i=1}^N y^{(i)}\Sigma^{-1}\left(x^{(i)} - \mu_1\right)\end{aligned}$$
Setting $\frac{\partial \mathcal L(\mu_1)}{\partial \mu_1} = 0$ and using the fact that $\Sigma^{-1}$ is invertible:
$$\begin{aligned} \Sigma^{-1}\left[\sum_{i=1}^N y^{(i)}\left(x^{(i)} - \mu_1\right)\right] & = 0 \\ \sum_{i=1}^N y^{(i)}\mu_1 & = \sum_{i=1}^N y^{(i)}x^{(i)} \\ \hat{\mu}_1 & = \frac{\sum_{i=1}^N y^{(i)}x^{(i)}}{\sum_{i=1}^N y^{(i)}} \end{aligned}$$
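The result says that $\hat{\mu}_1$ is a label-weighted average of the samples, which is just the mean of the label-1 samples. A small sketch with made-up data (the arrays and the covariance matrix are assumptions for illustration) confirms this and checks that the gradient vanishes at $\hat{\mu}_1$:

```python
import numpy as np

# Hypothetical data: four 2-D samples with binary labels.
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]])
y = np.array([1, 0, 1, 0])

# Closed form: sum_i y_i x_i / sum_i y_i.
mu1_hat = (y[:, None] * X).sum(axis=0) / y.sum()

# Equivalent view: the plain mean of the samples labelled 1.
mu1_direct = X[y == 1].mean(axis=0)

# Gradient of L(mu_1) at mu1_hat for an arbitrary symmetric
# positive-definite Sigma; it should be the zero vector.
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
grad = np.linalg.inv(Sigma) @ (y[:, None] * (X - mu1_hat)).sum(axis=0)

print(mu1_hat, mu1_direct, grad)
```

Note that the estimate does not depend on $\Sigma$: any invertible covariance matrix gives the same stationary point.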
Solving for the optimal $\hat{\mu}_2$
The derivation for $\mu_2$ is the same as for $\mu_1$; the only difference is that the exponent is $1 - y^{(i)}$:
$$\hat{\mu}_2 = \mathop{\arg\max}\limits_{\mu_2} \sum_{i=1}^N \left\{(1 - y^{(i)}) \log \left[\frac{1}{(2\pi)^{\frac{p}{2}}|\Sigma|^{\frac{1}{2}}}\right] + (1 - y^{(i)})\left[-\frac{1}{2}(x^{(i)} - \mu_2)^{T}\Sigma^{-1}(x^{(i)} - \mu_2)\right] \right\}$$
The intermediate steps are identical to those for $\mu_1$ and are omitted.
The optimal solution $\hat{\mu}_2$ is obtained from the result for $\mu_1$ by replacing $y^{(i)}$ with $1 - y^{(i)}$:
$$\hat{\mu}_2 = \frac{\sum_{i=1}^N(1 - y^{(i)})x^{(i)}}{\sum_{i=1}^N(1 - y^{(i)})}$$
Solving for the optimal covariance parameter $\Sigma$ of the likelihood
Setup
To solve for $\hat \Sigma$, the sample set is first partitioned by label:
$$\mathcal X_1 = \{x^{(i)} \mid y^{(i)} = 1\}_{i=1,2,\cdots,N} \\ \mathcal X_2 = \{x^{(i)} \mid y^{(i)} = 0\}_{i=1,2,\cdots,N}$$
Let $N_1$ be the number of samples in $\mathcal X_1$ and $N_2$ the number of samples in $\mathcal X_2$. The sets then satisfy:
$$N_1 + N_2 = N \\ \mathcal X_1 \cup \mathcal X_2 = \mathcal X$$
The overall sample mean $\mu_{\mathcal X}$, the per-class means $\mu_{\mathcal X_{i}}$ and the per-class covariance matrices $\mathcal S_{\mathcal X_i}$ are:
$$\begin{aligned} \mu_{\mathcal X} & = \frac{1}{N} \sum_{i=1}^N x^{(i)}\\ \mu_{\mathcal X_i} & = \frac{1}{N_i} \sum_{x^{(j)} \in \mathcal X_i} x^{(j)} \quad (i=1,2) \\ \mathcal S_{\mathcal X_i} & = \frac{1}{N_i} \sum_{x^{(j)} \in \mathcal X_i}(x^{(j)} - \mu_{\mathcal X_i})(x^{(j)} - \mu_{\mathcal X_i})^{T} \quad (i=1,2) \end{aligned}$$
With this notation, the optimal means $\hat{\mu}_1,\hat{\mu}_2$ can be simplified further:
$$\begin{aligned} \hat{\mu}_1 & = \frac{\sum_{i=1}^N y^{(i)}x^{(i)}}{\sum_{i=1}^N y^{(i)}} = \frac{\sum_{x^{(j)} \in \mathcal X_1} x^{(j)}}{N_1} = \frac{N_1}{N_1} \mu_{\mathcal X_1} = \mu_{\mathcal X_1}\\ \hat{\mu}_2 & = \frac{\sum_{i=1}^N(1 - y^{(i)})x^{(i)}}{\sum_{i=1}^N(1 - y^{(i)})} = \frac{N \cdot \mu_{\mathcal X} - N_1 \cdot \mu_{\mathcal X_1}}{N - N_1} = \frac{N \cdot \mu_{\mathcal X} - N_1 \cdot \mu_{\mathcal X_1}}{N_2} = \mu_{\mathcal X_2} \end{aligned}$$
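The identity for $\hat{\mu}_2$ can be verified on a tiny made-up data set. In the sketch below all arrays are hypothetical; it computes $\hat{\mu}_2$ three ways: from the weighted-sum formula, from the overall mean and the class-1 mean, and directly as the mean of the label-0 samples.

```python
import numpy as np

# Hypothetical data: four 2-D samples with binary labels.
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]])
y = np.array([1, 0, 1, 0])

N = len(y)
N1 = int(y.sum())
N2 = N - N1

mu_X = X.mean(axis=0)            # overall sample mean
mu_X1 = X[y == 1].mean(axis=0)   # class-1 mean

# (a) weighted-sum formula with weights 1 - y_i
mu2_weighted = ((1 - y)[:, None] * X).sum(axis=0) / (1 - y).sum()
# (b) via the overall mean and the class-1 mean
mu2_from_means = (N * mu_X - N1 * mu_X1) / N2
# (c) direct mean of the class-0 samples
mu2_direct = X[y == 0].mean(axis=0)

print(mu2_weighted, mu2_from_means, mu2_direct)
```

Form (b) is mainly of algebraic interest; in practice form (c) is the simplest to compute.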
Solution procedure
Looking again at the expansion of $\mathcal L(\theta)$, only the first and second terms contain $\Sigma$. Define $\mathcal L(\Sigma)$:
$$\mathcal L(\Sigma) = \sum_{i=1}^N \log \left[\mathcal N(\mu_1,\Sigma)^{y^{(i)}}\right] + \sum_{i=1}^N \log \left[\mathcal N(\mu_2,\Sigma)^{1 - y^{(i)}}\right]$$
Take either term, for example $\sum_{i=1}^N \log \left[\mathcal N(\mu_1,\Sigma)^{y^{(i)}}\right]$. Whenever $y^{(i)}=0$, we have $\log \left[\mathcal N(\mu_1,\Sigma)^{y^{(i)}}\right] = 0$, so both sums contain many zero terms. Using the partition from the setup and dropping all the zero terms, the expression simplifies to:
$$\mathcal L(\Sigma) = \sum_{x^{(j)} \in \mathcal X_1} \log \mathcal N(\mu_1,\Sigma) + \sum_{x^{(j)} \in \mathcal X_2} \log \mathcal N(\mu_2, \Sigma)$$
Take either term, for example $\sum_{x^{(j)} \in \mathcal X_1} \log \mathcal N(\mu_1,\Sigma)$, substitute the density, and expand:
$$\begin{aligned} \sum_{x^{(j)} \in \mathcal X_1} \log \mathcal N(\mu_1,\Sigma) & = \sum_{x^{(j)} \in \mathcal X_1} \log \left\{\frac{1}{(2\pi)^{\frac{p}{2}}|\Sigma|^{\frac{1}{2}}}e^{-\frac{1}{2}(x^{(j)}-\mu_1)^{T}\Sigma^{-1}(x^{(j)} - \mu_1)}\right\} \\ & = \sum_{x^{(j)} \in \mathcal X_1} \left\{\log \left[\frac{1}{(2\pi)^{\frac{p}{2}}}\right] + \log \left[|\Sigma|^{-\frac{1}{2}}\right] + \left[-\frac{1}{2} (x^{(j)} - \mu_1)^{T} \Sigma^{-1}(x^{(j)} - \mu_1)\right]\right\} \\ & = \sum_{x^{(j)} \in \mathcal X_1} \log\left[\frac{1}{(2\pi)^{\frac{p}{2}}}\right] + \sum_{x^{(j)} \in \mathcal X_1} \log \left[|\Sigma|^{-\frac{1}{2}}\right] + \sum_{x^{(j)} \in \mathcal X_1} \left[-\frac{1}{2} (x^{(j)} - \mu_1)^{T} \Sigma^{-1}(x^{(j)} - \mu_1)\right] \end{aligned}$$
Of the three terms inside the braces, the first does not contain $\Sigma$ and is treated as a constant. Look closely at the third:
$$-\frac{1}{2} \sum_{x^{(j)} \in\mathcal X_1}(x^{(j)} - \mu_1)^{T} \Sigma^{-1}(x^{(j)} - \mu_1)$$
Both $x^{(j)}$ and $\mu_1$ are $p$-dimensional column vectors, so $(x^{(j)} - \mu_1)^{T}$ has dimension $1 \times p$; $\Sigma^{-1}$, the inverse of the covariance matrix, is a $p \times p$ square matrix; and $(x^{(j)} - \mu_1)$ has dimension $p \times 1$.
Hence $(x^{(j)} - \mu_1)^{T} \Sigma^{-1}(x^{(j)} - \mu_1)$ is a real number. A real number is also a square matrix (of size $1 \times 1$), so we can bring in the trace from linear algebra, written $tr$: the trace of a real number is the number itself.
The third term can therefore be written as:
$$-\frac{1}{2}\sum_{x^{(j)} \in \mathcal X_1} tr\left[(x^{(j)} - \mu_1)^{T}\Sigma^{-1}(x^{(j)} - \mu_1)\right]$$
Two properties of the trace let us rewrite this expression.
Cyclic property: for matrices $A, B, C$ whose products are defined and square, $tr(ABC) = tr(CAB) = tr(BCA)$.
Linearity: the trace of a sum is the sum of the traces, so the summation $\sum_{x^{(j)} \in \mathcal X_1}$ can be moved inside $tr$.
$$\begin{aligned} -\frac{1}{2} \sum_{x^{(j)} \in \mathcal X_1} tr\left[( x^{(j)} - \mu_1)^{T}\Sigma^{-1}(x^{(j)} - \mu_1) \right] & = -\frac{1}{2} \sum_{x^{(j)} \in \mathcal X_1} tr\left[(x^{(j)} - \mu_1)( x^{(j)} - \mu_1)^{T}\Sigma^{-1}\right] \\ & = -\frac{1}{2} tr\left[\sum_{x^{(j)} \in \mathcal X_1} (x^{(j)} - \mu_1)( x^{(j)} - \mu_1)^{T}\Sigma^{-1}\right] \end{aligned}$$
Since $\Sigma^{-1}$ does not depend on $j$, it can be factored out of the sum:
$$-\frac{1}{2} tr\left[\left(\sum_{x^{(j)} \in \mathcal X_1} (x^{(j)} - \mu_1)( x^{(j)} - \mu_1)^{T}\right)\Sigma^{-1}\right]$$
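Note that $\sum_{x^{(j)} \in \mathcal X_1} (x^{(j)} - \mu_1)( x^{(j)} - \mu_1)^{T}$ differs from the covariance matrix of the label-1 samples only by a factor of $N_1$, once $\mu_1$ is taken to be the class mean $\mu_{\mathcal X_1}$, as it is at the optimum.
Write $\mathcal S_1$ for the covariance matrix of the label-1 samples and $\mathcal S_2$ for that of the label-0 samples (the matrices $\mathcal S_{\mathcal X_1}$ and $\mathcal S_{\mathcal X_2}$ from the setup). The third term then becomes:
$$-\frac{1}{2} N_1 \cdot tr(\mathcal S_1 \cdot \Sigma^{-1})$$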
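The trace rewriting is easy to sanity-check numerically. The following sketch, using randomly generated stand-in data, compares the sum of quadratic forms over one class with $N_1 \cdot tr(\mathcal S_1 \Sigma^{-1})$, where $\mu_1$ is set to the class mean.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical class-1 samples: 50 points in 3 dimensions.
X1 = rng.normal(size=(50, 3))
N1 = len(X1)
mu1 = X1.mean(axis=0)

diff = X1 - mu1
S1 = diff.T @ diff / N1          # class covariance with the 1/N_1 convention

# An arbitrary symmetric positive-definite Sigma and its inverse.
A = rng.normal(size=(3, 3))
Sigma = A @ A.T + 3 * np.eye(3)
Sigma_inv = np.linalg.inv(Sigma)

# Left side: sum_j (x_j - mu1)^T Sigma^{-1} (x_j - mu1).
quad_sum = np.einsum('ij,jk,ik->', diff, Sigma_inv, diff)
# Right side: N_1 * tr(S_1 Sigma^{-1}).
trace_form = N1 * np.trace(S1 @ Sigma_inv)

print(quad_sum, trace_form)
```

The two numbers agree to floating-point precision, which is what the cyclic property and linearity of the trace predict.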
Therefore $\sum_{x^{(j)} \in \mathcal X_1} \log \mathcal N(\mu_1,\Sigma)$ can be written as:
$$-\frac{1}{2} N_1 \cdot \log |\Sigma| - \frac{1}{2} N_1 \cdot tr\left(\mathcal S_1 \cdot \Sigma^{-1}\right) + \mathcal C_1 \quad \left(\mathcal C_1 = \sum_{x^{(j)} \in \mathcal X_1} \log \left[\frac{1}{(2\pi)^{\frac{p}{2}}}\right]\right)$$
Similarly, $\sum_{x^{(j)} \in \mathcal X_2} \log \mathcal N(\mu_2,\Sigma)$ can be written as:
$$-\frac{1}{2} N_2 \cdot \log |\Sigma| - \frac{1}{2} N_2 \cdot tr\left(\mathcal S_2 \cdot \Sigma^{-1}\right) + \mathcal C_2 \quad \left(\mathcal C_2 = \sum_{x^{(j)} \in \mathcal X_2} \log \left[\frac{1}{(2\pi)^{\frac{p}{2}}}\right]\right)$$
$\mathcal L(\Sigma)$ can now be written as:
$$\begin{aligned} \mathcal L(\Sigma) & = \sum_{x^{(j)} \in \mathcal X_1} \log \mathcal N(\mu_1,\Sigma) + \sum_{x^{(j)} \in \mathcal X_2} \log \mathcal N(\mu_2,\Sigma) \\ & = -\frac{1}{2}(N_1 + N_2) \log |\Sigma| - \frac{1}{2}N_1 \cdot tr(\mathcal S_1 \cdot \Sigma^{-1}) - \frac{1}{2}N_2 \cdot tr(\mathcal S_2 \cdot \Sigma^{-1}) + (\mathcal C_1 + \mathcal C_2) \\ & = -\frac{1}{2} \left[N \log |\Sigma| + N_1 \cdot tr(\mathcal S_1 \cdot \Sigma^{-1}) + N_2 \cdot tr(\mathcal S_2 \cdot \Sigma^{-1})\right] + \mathcal C \quad (\mathcal C = \mathcal C_1 + \mathcal C_2) \end{aligned}$$
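As a check on the compact form of $\mathcal L(\Sigma)$, the sketch below evaluates it alongside the direct sum of Gaussian log densities on synthetic data. The helper `gaussian_log_pdf`, the data and the trial $\Sigma$ are assumptions for illustration; the means are set to the class means, which is what the compact form requires.

```python
import numpy as np

def gaussian_log_pdf(X, mu, Sigma):
    """Log density of N(mu, Sigma) at each row of X."""
    p = X.shape[1]
    diff = X - mu
    maha = np.sum(diff * np.linalg.solve(Sigma, diff.T).T, axis=1)
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * (p * np.log(2 * np.pi) + logdet + maha)

rng = np.random.default_rng(2)
X = rng.normal(size=(40, 2))
y = np.array([1] * 15 + [0] * 25)      # hypothetical labels
X1, X2 = X[y == 1], X[y == 0]
N, N1, N2, p = len(X), len(X1), len(X2), X.shape[1]

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
S1 = (X1 - mu1).T @ (X1 - mu1) / N1
S2 = (X2 - mu2).T @ (X2 - mu2) / N2

# Any symmetric positive-definite trial value of Sigma works here.
Sigma = np.array([[1.5, 0.3], [0.3, 0.8]])
Sigma_inv = np.linalg.inv(Sigma)
_, logdet = np.linalg.slogdet(Sigma)

# Direct evaluation: sum of log densities over both classes.
direct = (gaussian_log_pdf(X1, mu1, Sigma).sum()
          + gaussian_log_pdf(X2, mu2, Sigma).sum())

# Compact form: -1/2 [N log|Sigma| + N1 tr(S1 Sigma^-1) + N2 tr(S2 Sigma^-1)] + C.
C = -0.5 * N * p * np.log(2 * np.pi)
compact = -0.5 * (N * logdet
                  + N1 * np.trace(S1 @ Sigma_inv)
                  + N2 * np.trace(S2 @ Sigma_inv)) + C

print(direct, compact)
```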
We now differentiate this expression with respect to $\Sigma$.
The derivation needs the derivative of a determinant and the derivatives of a trace. For an invertible matrix $A$:
$$\frac{\partial tr(AB)}{\partial A} = B^{T}, \qquad \frac{\partial |A|}{\partial A} = |A|\cdot (A^{-1})^{T}, \qquad \frac{\partial tr(BA^{-1})}{\partial A} = -(A^{-1}BA^{-1})^{T}$$
When $A$ and $B$ are symmetric, the transposes can be dropped. Note that $\Sigma^{-1}$ is a matrix inverse, so it cannot be differentiated like a scalar power.
Since $\mathcal S_1,\mathcal S_2$ are covariance matrices, they are real symmetric, as is $\Sigma$:
$$\mathcal S_1^{T} = \mathcal S_1,\quad \mathcal S_2^{T} = \mathcal S_2$$
The derivative is therefore:
$$\begin{aligned} \frac{\partial \mathcal L(\Sigma)}{\partial \Sigma} & = -\frac{1}{2}\left[N \cdot \frac{|\Sigma| \cdot\Sigma^{-1}}{|\Sigma|} - N_1 \cdot \Sigma^{-1}\mathcal S_1\Sigma^{-1} - N_2 \cdot \Sigma^{-1}\mathcal S_2 \Sigma^{-1}\right] \\ & = -\frac{1}{2}\left[N\cdot \Sigma^{-1} - \Sigma^{-1}\left(N_1 \mathcal S_1 + N_2 \mathcal S_2\right) \Sigma^{-1}\right] \end{aligned}$$
Setting $\frac{\partial \mathcal L(\Sigma)}{\partial \Sigma} = 0$ gives:
$$N\cdot \Sigma^{-1} - \Sigma^{-1}\left(N_1 \mathcal S_1 + N_2 \mathcal S_2\right) \Sigma^{-1} = 0$$
Multiplying both sides by $\Sigma$ on the left and on the right yields:
$$N \Sigma - N_1 \mathcal S_1 - N_2 \mathcal S_2 = 0 \\ \hat \Sigma = \frac{N_1\mathcal S_1 + N_2 \mathcal S_2}{N}$$
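The pooled estimate $\hat\Sigma = (N_1 \mathcal S_1 + N_2 \mathcal S_2)/N$ can be computed in a few lines. The sketch below uses synthetic data and a hypothetical helper `class_cov`; it also evaluates the gradient expression $-\frac{1}{2}\left[N\Sigma^{-1} - \Sigma^{-1}(N_1\mathcal S_1 + N_2\mathcal S_2)\Sigma^{-1}\right]$ at $\hat\Sigma$ to confirm that it vanishes there.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 2))            # hypothetical samples
y = np.array([1] * 12 + [0] * 18)       # hypothetical labels
N, N1, N2 = len(y), int(y.sum()), int((1 - y).sum())

def class_cov(Xc):
    """Covariance of one class with the 1/N_i (maximum likelihood) convention."""
    d = Xc - Xc.mean(axis=0)
    return d.T @ d / len(Xc)

S1, S2 = class_cov(X[y == 1]), class_cov(X[y == 0])

# Closed-form estimate: sample-size-weighted average of the class covariances.
Sigma_hat = (N1 * S1 + N2 * S2) / N

# Gradient of L(Sigma) evaluated at Sigma_hat; should be the zero matrix.
Sigma_inv = np.linalg.inv(Sigma_hat)
grad = -0.5 * (N * Sigma_inv - Sigma_inv @ (N1 * S1 + N2 * S2) @ Sigma_inv)

# Equivalent computation: covariance of the residuals after subtracting
# each sample's own class mean.
mu = np.where(y[:, None] == 1, X[y == 1].mean(axis=0), X[y == 0].mean(axis=0))
resid = X - mu
Sigma_pooled = resid.T @ resid / N

print(Sigma_hat)
print(np.abs(grad).max())
```

The residual-based form is often the more convenient one in code, since it avoids computing the per-class covariance matrices separately.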
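Reflections
$\Sigma$ was declared to be shared by both classes when the likelihood distributions were defined. The result for $\hat \Sigma$ suggests that, in theory, $\mathcal S_1$ and $\mathcal S_2$ should be the same matrix: substituting $\mathcal S_1 = \mathcal S_2$ into the formula above reduces it to $\hat \Sigma = \mathcal S_1 = \mathcal S_2$, since $N_1 + N_2 = N$. In practice the two differ because of the randomness of sampling from the Gaussian distributions.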
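A small simulation can illustrate this point. The sketch below draws two classes from Gaussians that share one covariance matrix; the means, the covariance matrix and the sample size are arbitrary assumptions. The per-class estimates $\mathcal S_1$ and $\mathcal S_2$ come out close but not identical, and the pooled estimate lands near the true $\Sigma$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed ground truth: both classes share this covariance matrix.
Sigma_true = np.array([[2.0, 0.5], [0.5, 1.0]])
n = 20000  # samples per class

X1 = rng.multivariate_normal([2.0, 2.0], Sigma_true, size=n)
X2 = rng.multivariate_normal([-1.0, 0.0], Sigma_true, size=n)

# Per-class covariance estimates with the 1/N_i convention (bias=True).
S1 = np.cov(X1, rowvar=False, bias=True)
S2 = np.cov(X2, rowvar=False, bias=True)

# Pooled estimate (N_1 S_1 + N_2 S_2) / N with N_1 = N_2 = n.
Sigma_hat = (n * S1 + n * S2) / (2 * n)

gap = np.abs(S1 - S2).max()               # sampling noise between classes
err = np.abs(Sigma_hat - Sigma_true).max()  # error of the pooled estimate

print(gap, err)
```

Shrinking `n` makes the gap between $\mathcal S_1$ and $\mathcal S_2$ visibly larger, which is the sampling effect described above.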
The next section introduces another probabilistic generative model: naive Bayes.