
Extreme Multi-Label Learning: PLT

Source: original, 2024/5/6 14:07:20

《Probabilistic label trees for extreme multi-label classification》

Main contribution: the paper proposes probabilistic label trees (PLT).

Contents

- Problem definition
- PLT model
- Consistency and regret-bound analysis
- Online PLT

Problem definition

Notation:

| Notation | Meaning |
| --- | --- |
| $\mathcal{X}$ | Instance space |
| $\mathcal{L} = \{1, \dots, m\}$ | Label set |
| $\mathcal{Y} = \{0, 1\}^m$ | Label space |
| $\mathbf{x} \in \mathcal{X}$ | An instance |
| $\mathbf{y} \in \mathcal{Y}$ | The label vector corresponding to $\mathbf{x}$ |
| $\mathcal{L}_\mathbf{x} \subseteq \mathcal{L}$ | Relevant (positive) labels of $\mathbf{x}$; all other labels are irrelevant (negative). $y_j = 1 \Leftrightarrow j \in \mathcal{L}_\mathbf{x}$ |
| $R(\cdot)$ | Expected loss, or risk |
| $\mathbf{P}(\mathbf{x},\mathbf{y})$ | Probability distribution of observations $(\mathbf{x},\mathbf{y})$; each observation is assumed to be sampled independently |
| $\ell(\mathbf{y},\hat{\mathbf{y}})$ | Loss |
| $T$ | The tree |
| $L_T$ | Leaf set; leaf $l_j \in L_T$ corresponds to label $j \in \mathcal{L}$ |
| $V_T$ | Set of all nodes |
| $L_v$ | All leaves under internal node $v$ |
| $\mathcal{L}_v \subseteq \mathcal{L}$ | Set of labels of all leaves under internal node $v$ |
| $\uparrow(v), \downarrow(v)$ | Parent node of $v$, and the set of direct children of $v$ |
| $\text{Path}(v)$ | Path from $v$ to the root |
| $\text{len}_v$ | Path length |
| $\text{deg}_v$ | Degree of node $v$ |
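
To make the tree notation concrete, here is a minimal Python sketch of a label tree. The `Node` class, the helper names `path` and `leaf_labels`, and the toy tree are all hypothetical, chosen only to illustrate the symbols in the table; they are not taken from the paper or from any PLT implementation.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)
class Node:
    """One node v of the label tree T (hypothetical helper class)."""
    name: str
    label: Optional[int] = None      # set only for leaves: leaf l_j <-> label j
    parent: Optional["Node"] = None  # the parent of v
    children: list = field(default_factory=list)  # the direct children of v

    def add_child(self, child: "Node") -> "Node":
        child.parent = self
        self.children.append(child)
        return child


def path(v: Node) -> list:
    """Path(v): the nodes from v up to the root, both included."""
    nodes = []
    while v is not None:
        nodes.append(v)
        v = v.parent
    return nodes


def leaf_labels(v: Node) -> set:
    """Labels of all leaves in the subtree rooted at v."""
    if not v.children:
        return {v.label}
    labels = set()
    for child in v.children:
        labels |= leaf_labels(child)
    return labels


# Toy tree over m = 4 labels: root -> (a, b), a -> (l1, l2), b -> (l3, l4)
root = Node("root")
a = root.add_child(Node("a"))
b = root.add_child(Node("b"))
leaves = [a.add_child(Node("l1", 1)), a.add_child(Node("l2", 2)),
          b.add_child(Node("l3", 3)), b.add_child(Node("l4", 4))]

print([n.name for n in path(leaves[2])])  # ['l3', 'b', 'root']
print(leaf_labels(a))                     # {1, 2}
```

In this toy tree the leaf set $L_T$ is `leaves`, and `leaf_labels(root)` recovers the full label set $\{1, 2, 3, 4\}$.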

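The authors state the problem definition clearly, and it reads smoothly. The other XC papers I had read either described the problem less well or gave no problem definition at all.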

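Extreme multi-label classification can be defined analogously to ordinary multi-label classification: find a classifier $\mathbf{h}(\mathbf{x}) = (h_1(\mathbf{x}),\dots,h_m(\mathbf{x}))$ from a function class $\mathcal{H}^m:\mathcal{X}\to \mathbb{R}^m$ that minimizes the expected loss: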
$R_\ell(\mathbf{h}) = \mathbb{E}_{(\mathbf{x}, \mathbf{y}) \sim \mathbf{P}(\mathbf{x},\mathbf{y})}(\ell(\mathbf{y},\mathbf{h}(\mathbf{x})))$
Typically $m\geq 10^5$ and $|\mathcal{L}_\mathbf{x}| \ll m$. The optimal classifier for the loss $\ell$ is then:
$\mathbf{h}_\ell^* = \argmin_{\mathbf{h}} R_\ell(\mathbf{h})$
The paper defines the regret of a classifier $\mathbf{h}$ with respect to the loss $\ell$ as:
$\text{reg}_\ell(\mathbf{h}) = R_\ell(\mathbf{h}) - R_\ell(\mathbf{h}_\ell^*) = R_\ell(\mathbf{h}) - R_\ell^*$
The smaller the regret, the better.
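
The risk and regret definitions can be checked by brute force on a tiny problem. The sketch below makes several assumptions of its own: the four-point distribution `P` is made up, the loss is Hamming loss (chosen only because it is easy to read), and classifiers are restricted to binary outputs so that every one of them can be enumerated. None of this comes from the paper.

```python
from itertools import product

# Hypothetical toy distribution P(x, y): 2 instances, m = 2 labels.
P = {
    ("x0", (1, 0)): 0.3,
    ("x0", (0, 0)): 0.2,
    ("x1", (1, 1)): 0.4,
    ("x1", (0, 1)): 0.1,
}


def hamming(y, y_hat):
    """Example loss: number of labels on which y and y_hat disagree."""
    return sum(a != b for a, b in zip(y, y_hat))


def risk(h, loss=hamming):
    """Expected loss of h under P, with h given as a dict x -> prediction."""
    return sum(p * loss(y, h[x]) for (x, y), p in P.items())


# Enumerate every binary classifier on this tiny space to find the optimal risk.
xs = ["x0", "x1"]
outputs = list(product([0, 1], repeat=2))
all_h = [dict(zip(xs, ys)) for ys in product(outputs, repeat=len(xs))]
best_risk = min(risk(h) for h in all_h)

h = {"x0": (0, 0), "x1": (1, 1)}  # some candidate classifier
regret = risk(h) - best_risk
print(round(risk(h), 3), round(best_risk, 3), round(regret, 3))
```

Here the candidate misses label 1 on `x0`, where that label is relevant with conditional probability 0.6, so its risk exceeds the optimal risk and its regret is positive.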
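Let $\eta_j(\mathbf{x}) = \mathbf{P}(y_j=1|\mathbf{x})$ for $j\in \mathcal{L}$, and let $\hat{\eta}_j(\mathbf{x})$ be an estimate of $\eta_j(\mathbf{x})$. The goal is to make the $L_1$ estimation error as small as possible:
$|\eta_j(\mathbf{x}) - \hat{\eta}_j(\mathbf{x})|$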
Let $\ell_\text{log}$ be the cross-entropy loss. Its conditional risk (that is, its expected loss) at an instance $\mathbf{x}$ is:
$\mathbb{E}_\mathbf{y}\ell_{\text{log}}(\mathbf{y},\mathbf{h}(\mathbf{x})) = \sum_{j=1}^m R_\text{log}(h_j(\mathbf{x})|\mathbf{x})$
The optimal prediction is then
$h_j^*(\mathbf{x}) = \argmin_{h_j}R_\text{log}(h_j(\mathbf{x})|\mathbf{x}) = \eta_j(\mathbf{x})$
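
The claim that the optimal prediction equals $\eta_j(\mathbf{x})$ can be verified numerically. The sketch below assumes the usual per-label binary cross-entropy, $R_\text{log}(h|\mathbf{x}) = \eta \log\frac{1}{h} + (1-\eta)\log\frac{1}{1-h}$, which is not written out in the text above; the value `eta = 0.3` and the grid resolution are arbitrary.

```python
import math


def cond_log_risk(h: float, eta: float) -> float:
    """Conditional log-loss risk of predicting h when P(y_j = 1 | x) = eta."""
    return -eta * math.log(h) - (1.0 - eta) * math.log(1.0 - h)


eta = 0.3  # hypothetical value of eta_j(x)
grid = [i / 1000 for i in range(1, 1000)]  # candidate predictions in (0, 1)
best_h = min(grid, key=lambda h: cond_log_risk(h, eta))
print(best_h)  # 0.3
```

The same result follows analytically: setting the derivative $-\eta/h + (1-\eta)/(1-h)$ to zero gives $h = \eta$.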
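Note that the cross-entropy loss really corresponds only to the plain 1-vs-all approach (the paper uses what seems to be the idiomatic word for it: vanilla).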
The more popular evaluation metrics are $P@k$, $nDCG@k$, $PSP@k$, and so on; in other words, people usually care only about the top-$k$ predictions.
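
As an illustration of top-$k$ evaluation, here is a minimal sketch of $P@k$ and $nDCG@k$ for a single instance. It assumes the definitions commonly used in the XC literature (they are not spelled out in the text above), and the function names, label vector, and scores are made up. $PSP@k$ is left out because it additionally needs per-label propensity scores.

```python
import math


def top_k(scores, k):
    """Indices of the k highest-scoring labels, best first."""
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]


def precision_at_k(y, scores, k):
    """P@k: fraction of the k top-scored labels that are relevant."""
    return sum(y[j] for j in top_k(scores, k)) / k


def ndcg_at_k(y, scores, k):
    """nDCG@k: DCG of the top-k ranking divided by the best achievable DCG."""
    dcg = sum(y[j] / math.log2(r + 2) for r, j in enumerate(top_k(scores, k)))
    n_rel = min(k, sum(y))  # an ideal ranking puts all relevant labels first
    ideal = sum(1.0 / math.log2(r + 2) for r in range(n_rel))
    return dcg / ideal if ideal > 0 else 0.0


y = [1, 0, 0, 1, 0]                  # hypothetical ground truth, m = 5
scores = [0.9, 0.8, 0.1, 0.4, 0.2]   # hypothetical classifier scores
print(precision_at_k(y, scores, 3))  # 2 of the top 3 labels are relevant
print(ndcg_at_k(y, scores, 3))
```

Both metrics look only at the $k$ highest-scoring labels, so they depend on the ranking the scores induce rather than on the scores being calibrated probabilities. The choice of logarithm base cancels in the $nDCG@k$ ratio.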