[Deep Learning Study Notes] Deep Learning for NLP (without Magic), Bengio, slides, ACL 2012

Source: original post, 2024/5/11 4:57:45

Getting through more than 180 slides was no small task. My running notes follow.

Five reasons to explore deep learning:

1. learning representations; 2. the need for distributed representations (the curse of dimensionality); 3. unsupervised feature and weight learning; 4. multiple levels of representation; 5. why now (RBMs, new training methods, and so on have appeared)

1. the basics

1.1 from logistic regression to neural nets

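This is an interesting way to look at it. Logistic regression is itself a neural network with a single neuron (much like a perceptron, but with a sigmoid output). A (three-layer) neural network is then several logistic regression models placed side by side, each producing its own output, with a softmax layer added on top to turn the whole thing into a classifier.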
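The view above can be sketched in a few lines of NumPy. This is only an illustration: the layer sizes, the random weights, and the function names `logistic_unit` and `forward` are made up for the example and do not come from the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    z = z - np.max(z)          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def logistic_unit(x, w, b):
    """One neuron = one logistic regression model."""
    return sigmoid(w @ x + b)

def forward(x, W, b, U, c):
    """Hidden layer: each row of W is one logistic regression; softmax on top."""
    h = sigmoid(W @ x + b)     # every hidden unit produces its own output
    return softmax(U @ h + c)  # class probabilities

rng = np.random.default_rng(0)
x = rng.normal(size=4)                        # 4 input features (illustrative)
W, b = rng.normal(size=(3, 4)), np.zeros(3)   # 3 hidden units
U, c = rng.normal(size=(2, 3)), np.zeros(2)   # 2 output classes
probs = forward(x, W, b, U, c)
h0 = logistic_unit(x, W[0], b[0])             # same value as hidden unit 0
```

Each row of `W` plays the role of one logistic regression model, and `probs` sums to 1 because of the softmax layer.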

From Maxent Classifiers to Neural Networks

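The functional form of a maxent classifier can be rewritten as a sigmoid (in the two-class case), so a maxent model is also equivalent to a neural network with a single neuron. In practice, a maxent classifier can also serve as the softmax layer.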
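A quick numerical check of the two-class case, as a sketch: the weight vectors `w_pos` and `w_neg` are hypothetical per-class maxent weights, and the point is only that the softmax over two classes equals a sigmoid of the weight difference.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

rng = np.random.default_rng(1)
x = rng.normal(size=5)
w_pos, w_neg = rng.normal(size=5), rng.normal(size=5)    # one weight vector per class

p_maxent = softmax(np.array([w_pos @ x, w_neg @ x]))[0]  # P(class = pos | x)
p_sigmoid = sigmoid((w_pos - w_neg) @ x)                 # same value from one neuron
```

The two values agree because e^a / (e^a + e^b) = 1 / (1 + e^(b - a)).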

Training a neural network: (1) stochastic gradient descent (SGD); (2) conjugate gradient or L-BFGS

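Why do neural networks need non-linearities? If every layer were linear, a multi-layer network would have exactly the representational power of a single-layer network.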
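The collapse of stacked linear layers is easy to verify numerically. The shapes and random weights below are arbitrary choices for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=4)
W1, b1 = rng.normal(size=(5, 4)), rng.normal(size=5)
W2, b2 = rng.normal(size=(3, 5)), rng.normal(size=3)

two_linear_layers = W2 @ (W1 @ x + b1) + b2
W, b = W2 @ W1, W2 @ b1 + b2            # the equivalent single layer
one_linear_layer = W @ x + b

# With a tanh in between, the result is no longer this single affine map.
with_tanh = W2 @ np.tanh(W1 @ x + b1) + b2
```

`two_linear_layers` and `one_linear_layer` are identical up to rounding, while `with_tanh` differs.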

1.2 word representation

one-hot representation;

distributional representation;

class-based representation (hard class -- cluster, or soft class -- LDA);

word embedding

1.3 unsupervised word vector learning

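Feed-forward computation: how do we compute a score (loosely, a probability) for a phrase s such as "cat chills on a mat"?

Build a three-layer neural network: the input layer holds the words (each mapped to its real-valued vector), followed by a hidden layer, and the output layer is a single unit whose value scores the phrase.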

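During training, take an n-gram window, build the network above, and compute the n-gram's score s. At the same time, construct a negative example from the current n-gram and compute its score s_c with the same network. The objective is then to minimize

J = max(0, 1 - s + s_c)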

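This margin-based ranking objective is the one used by Collobert and Weston. Google's word2vec is similar in spirit (it also contrasts observed text with sampled negative examples), but it optimizes a different objective: skip-gram or CBOW with negative sampling or hierarchical softmax.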

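To optimize this objective, use gradient descent: compute the gradients with backpropagation and update the network weights layer by layer.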
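A minimal sketch of the scoring network and the ranking loss J = max(0, 1 - s + s_c). Several details are assumptions made for the example rather than taken from the notes: the toy vocabulary, the tanh hidden layer, the layer sizes, and building the negative example by replacing the centre word of the window.

```python
import numpy as np

rng = np.random.default_rng(3)
vocab = ["cat", "chills", "on", "a", "mat", "jeju"]   # toy vocabulary (made up)
idx = {w: i for i, w in enumerate(vocab)}
d, n, hdim = 4, 5, 8                  # embedding size, window size, hidden units
L = rng.normal(scale=0.1, size=(len(vocab), d))       # word embedding matrix
W = rng.normal(scale=0.1, size=(hdim, n * d))
b = np.zeros(hdim)
u = rng.normal(scale=0.1, size=hdim)

def score(words):
    x = np.concatenate([L[idx[w]] for w in words])    # look up and concatenate
    h = np.tanh(W @ x + b)                            # hidden layer
    return u @ h                                      # single scalar score

def ranking_loss(window, corrupt_word):
    corrupted = list(window)
    corrupted[len(window) // 2] = corrupt_word        # replace the centre word
    s, s_c = score(window), score(corrupted)
    return max(0.0, 1.0 - s + s_c)

J = ranking_loss(["cat", "chills", "on", "a", "mat"], "jeju")
```

The loss is zero only once the true window outscores the corrupted one by a margin of at least 1; with small random weights both scores are near 0, so J starts close to 1.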

1.4 backpropagation training

Introduces the basic principles of backpropagation.
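As a small worked example of backpropagation, consider a scoring network of the form s = u · tanh(Wx + b). The architecture and the function name `forward_backward` are assumptions chosen for illustration; the gradients follow from the chain rule.

```python
import numpy as np

def forward_backward(x, W, b, u):
    """Return the score s and its gradients w.r.t. u, W, b and the input x."""
    z = W @ x + b
    h = np.tanh(z)
    s = u @ h
    # Backward pass: apply the chain rule from the output back to the input.
    grad_u = h
    delta = u * (1.0 - h ** 2)        # ds/dz, using tanh'(z) = 1 - tanh(z)^2
    grad_W = np.outer(delta, x)
    grad_b = delta
    grad_x = W.T @ delta              # gradient with respect to the input vector
    return s, grad_u, grad_W, grad_b, grad_x

rng = np.random.default_rng(4)
x = rng.normal(size=6)
W, b, u = rng.normal(size=(3, 6)), rng.normal(size=3), rng.normal(size=3)
s, grad_u, grad_W, grad_b, grad_x = forward_backward(x, W, b, u)
```

`grad_x` is the piece that lets the input word vectors themselves be trained, since x is a concatenation of embeddings.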

1.5 learning word-level classifiers: POS and NER

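The network is similar to the n-gram network trained in 1.3, except that it "replaces the single scalar score with a Softmax/Maxent classifier": the top layer is a softmax layer that acts as the classifier.

The interesting twist in deep learning is that the input features are also learned. Unlike conventional backpropagation, the input vectors (the word embeddings) are updated during training as well.

Word embeddings also help share information across resources (such as lexicons): the word is the unit at which the information sources are fused.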

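1.6 sharing statistical strength

Semi-supervised learning: pretrain with unsupervised learning, then fine-tune with supervised learning. One reason pretraining can work: in principle we want the conditional probability p(c|x), while pretraining models p(x); the hypothesis is that structure learned for p(x) is shared with p(c|x) and makes it easier to estimate.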

Autoencoder: a multi-layer NN trained so that output = input

PCA = linear manifold = linear autoencoder

An ordinary (non-linear) autoencoder is roughly a non-linear PCA.
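The PCA side of this correspondence can be sketched as a linear encoder and decoder that share the top-k principal directions. The toy data and the choice k = 2 are assumptions for the example; a trained linear autoencoder with squared error would span the same subspace, but is not trained here.

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))  # toy data, 5 features
X = X - X.mean(axis=0)                                   # PCA assumes centred data

_, S, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
V_k = Vt[:k].T                   # top-k principal directions, shape (5, k)

code = X @ V_k                   # linear "encoder": project onto the subspace
X_hat = code @ V_k.T             # linear "decoder": map back to input space

reconstruction_error = np.sum((X - X_hat) ** 2)
discarded_energy = np.sum(S[k:] ** 2)   # squared singular values that were dropped
```

The reconstruction error equals the energy in the discarded directions, which is the smallest error any rank-k linear reconstruction can reach.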

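Note: the word "manifold" can be read here in the sense of "making copies": small variations exist along certain directions, but overall the result is still the same as the original object.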

Minimizing reconstruction error forces the latent representation of "similar inputs" to stay on the manifold.

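Autoencoder improvements: for discrete inputs, use cross-entropy or log-likelihood as the criterion. Variants that keep the autoencoder from simply copying its input: undercomplete, sparse, denoising, and contractive. The sparse variant forces the hidden unit activations to stay near 0.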

2. recursive NN

2.1 motivation

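An RNN (here: recursive neural network) can learn the syntactic structure of a sentence, but only in the form of a binary tree.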

2.2 RNN for parsing

See "Learning meanings for sentences".

2.3 theory: bp through structure

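The treatment is brief, but the basic procedure is the same as ordinary backpropagation.

For each node in the parse tree, the node's label can be computed by adding a softmax layer on top of the node's vector representation; this is used for both training and labeling.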
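A sketch of how a recursive network can build a binary tree and label a node. The composition p = tanh(W[c1; c2] + b), the scoring vector `u`, and the greedy merging strategy are assumptions for illustration (greedy merging is only one simple search option), and all sizes and weights are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(6)
d, n_labels = 4, 3                     # vector size and label count (illustrative)
W = rng.normal(scale=0.5, size=(d, 2 * d))
b = np.zeros(d)
u = rng.normal(size=d)                 # scores how plausible a merge is
W_label = rng.normal(size=(n_labels, d))

def compose(c1, c2):
    """Parent vector and merge score for two adjacent children."""
    p = np.tanh(W @ np.concatenate([c1, c2]) + b)
    return p, u @ p

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def greedy_parse(leaves):
    """Repeatedly merge the best-scoring adjacent pair until one node is left."""
    nodes = [(vec, i) for i, vec in enumerate(leaves)]   # (vector, subtree)
    while len(nodes) > 1:
        candidates = [compose(nodes[i][0], nodes[i + 1][0])
                      for i in range(len(nodes) - 1)]
        best = max(range(len(candidates)), key=lambda i: candidates[i][1])
        parent = candidates[best][0]
        tree = (nodes[best][1], nodes[best + 1][1])
        nodes[best:best + 2] = [(parent, tree)]          # replace pair by parent
    return nodes[0]

leaves = [rng.normal(size=d) for _ in range(5)]   # stand-ins for word vectors
root_vec, tree = greedy_parse(leaves)
label_probs = softmax(W_label @ root_vec)         # label distribution for the root
```

Because every merge combines exactly two adjacent nodes, the resulting `tree` is always binary, and the same softmax layer could be applied to any internal node's vector.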

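Experiments show that this approach works fairly well on short sentences and noticeably worse on long ones.

A few applications are also covered: paraphrase detection and scene parsing (applying NLP-style parsing to images in order to analyze image structure).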

2.4 recursive autoencoders

Similar to the RNN, except that the objective is no longer a supervised score but the reconstruction error.

Semi-supervised recursive autoencoders add a cross-entropy term to the objective.
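A sketch of the objective at a single tree node. The linear decoder, the squared-error form, and the weighting `alpha` between reconstruction error and cross-entropy are assumptions for the example, as are all names and sizes.

```python
import numpy as np

rng = np.random.default_rng(7)
d, n_labels = 4, 2
W_enc, b_enc = rng.normal(scale=0.5, size=(d, 2 * d)), np.zeros(d)
W_dec, b_dec = rng.normal(scale=0.5, size=(2 * d, d)), np.zeros(2 * d)
W_label = rng.normal(size=(n_labels, d))

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def rae_node(c1, c2):
    """Encode two children into a parent, decode again, measure the error."""
    children = np.concatenate([c1, c2])
    parent = np.tanh(W_enc @ children + b_enc)
    reconstructed = W_dec @ parent + b_dec
    error = 0.5 * np.sum((children - reconstructed) ** 2)
    return parent, error

def semi_supervised_loss(parent, error, target, alpha=0.2):
    """Weighted sum of reconstruction error and label cross-entropy."""
    probs = softmax(W_label @ parent)
    cross_entropy = -np.log(probs[target])
    return alpha * error + (1.0 - alpha) * cross_entropy

c1, c2 = rng.normal(size=d), rng.normal(size=d)
parent, error = rae_node(c1, c2)
loss = semi_supervised_loss(parent, error, target=1)
```

With `alpha = 1` this reduces to the purely unsupervised reconstruction objective.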

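2.5 applications to sentiment detection and paraphrase detection

Sentiment detection: rather than relying on bag-of-words features, use the automatically learned vectors described here, then build a classifier on top to decide whether the sentiment is positive or negative.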

paraphrase detection：how to compare the meanings of two sentences?

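Recursive autoencoder for full-sentence paraphrase detection (Socher 2011): use the method from 2.3 to compute the parse tree of each sentence along with its non-leaf nodes. Together with the leaf nodes, compute the similarity between the nodes of the two trees to form a similarity matrix, then apply an NN on top of that matrix to compute how likely the pair is a paraphrase.

Personal question: sentences differ in length, so the similarity matrices differ in size (in both dimensions). How can matrices of different sizes be fed to the same NN to compute a similarity value? The slides do not say, so the only option is to read Socher's original paper.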

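2.6 compositionality through recursive matrix-vector spaces

In the sections above, every internal node of the parse tree is represented by a vector. In the method of this section, each node has a matrix in addition to its vector. The method is fairly complex and the slides cover it only briefly.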

3. applications

3.1 applications

3.1.1 neural language model

LM: Bengio 2003

ASR: Mikolov 2011 (RNNLM)

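Output bottleneck: the output of an NNLM is usually a vector whose dimension depends on the vocabulary size. In the simplest case it is a one-hot style output with one unit per word. Alternatively, the output is a vector for the word to be predicted in the n-gram, but that vector then has to be compared against every word in the vocabulary to decide which word was predicted. Either way, the cost grows with the vocabulary size.

To address this, Mikolov borrows the idea of class-based language models: the NNLM first outputs a distribution over word classes, and p(word|context) is then recovered as p(class|context) · p(word|class, context).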
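A sketch of the class-factored output layer. The word-to-class assignment, the weight matrices, and the function name `word_prob` are made up for the example; the point is that each prediction needs a softmax over the classes plus a softmax over one class's words, instead of one softmax over the whole vocabulary.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

rng = np.random.default_rng(8)
hdim = 6
word_class = np.array([0, 0, 1, 1, 1, 2, 2, 2])    # class of each of 8 toy words
n_classes = word_class.max() + 1
W_class = rng.normal(size=(n_classes, hdim))       # class output weights
W_word = rng.normal(size=(len(word_class), hdim))  # word output weights

def word_prob(word, h):
    """p(word | context) = p(class | context) * p(word | class, context)."""
    c = word_class[word]
    p_class = softmax(W_class @ h)[c]
    members = np.flatnonzero(word_class == c)      # only words in the same class
    p_in_class = softmax(W_word[members] @ h)
    return p_class * p_in_class[np.searchsorted(members, word)]

h = rng.normal(size=hdim)                          # hidden state for some context
total = sum(word_prob(w, h) for w in range(len(word_class)))
```

Summing `word_prob` over the whole vocabulary still gives 1, so the factorization defines a proper distribution.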

SMT: also approached from the LM side, replacing the n-gram LM used in earlier SMT systems with an NNLM.

3.1.2 structured embeddings of knowledge bases

Bengio, AAAI 2011

3.1.3 assorted speech and NLP applications

Learn multiple word vectors: handles polysemy by representing a word with several word vectors.

......

3.2 resources (tutorials and code)

• See “Neural Net Language Models” Scholarpedia entry
•  Deep Learning tutorials: http://deeplearning.net/tutorials
•  Stanford deep learning tutorials with simple programming assignments and reading list
http://deeplearning.stanford.edu/wiki/
•  Recursive Autoencoder class project
http://cseweb.ucsd.edu/~elkan/250B/learningmeaning.pdf
•  Graduate Summer School: Deep Learning, Feature Learning
http://www.ipam.ucla.edu/programs/gss2012/
•  ICML 2012 Representation Learning tutorial http://www.iro.umontreal.ca/~bengioy/talks/deep-learning-tutorial-2012.html
•  Paper references in separate pdf

software

• Theano (Python CPU/GPU) mathematical and deep learning library http://deeplearning.net/software/theano
•  Can do automatic, symbolic differentiation
•  Senna: POS, Chunking, NER, SRL
•  by Collobert et al. http://ronan.collobert.com/senna/
•  State-of-the-art performance on many tasks
•  3500 lines of C, extremely fast and using very little memory
•  Recurrent Neural Network Language Model
http://www.fit.vutbr.cz/~imikolov/rnnlm/
•  Recursive Neural Net and RAE models for paraphrase detection, sentiment analysis, relation classification www.socher.org

3.3 deep learning tricks

•  Stochastic gradient descent and setting learning rates
•  Main hyper-parameters
•  Learning rate schedule & Early stopping
•  Minibatches
•  Parameter initialization
•  Number of hidden units
•  L1 or L2 weight decay
•  Sparsity regularization
•  Debugging → Finite difference gradient check (Yay)
•  How to efficiently search for hyper-parameter configurations
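The finite difference gradient check from the list above can be sketched as follows. The stand-in `loss` and its hand-derived `analytic_gradient` are assumptions chosen so that the example is self-contained; in practice the analytic gradient would come from backpropagation.

```python
import numpy as np

def numerical_gradient(f, theta, eps=1e-5):
    """Central-difference estimate of df/dtheta, one coordinate at a time."""
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        step = np.zeros_like(theta)
        step.flat[i] = eps
        grad.flat[i] = (f(theta + step) - f(theta - step)) / (2 * eps)
    return grad

def loss(theta):
    return np.sum(np.tanh(theta) ** 2)     # stand-in loss for the demo

def analytic_gradient(theta):
    t = np.tanh(theta)
    return 2 * t * (1 - t ** 2)            # derivative of tanh(theta)^2

theta = np.random.default_rng(9).normal(size=7)
max_diff = np.max(np.abs(numerical_gradient(loss, theta) - analytic_gradient(theta)))
```

A `max_diff` many orders of magnitude below the gradient's own scale suggests the analytic gradient is right; a large one usually points to a bug in the backward pass.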

tanh(z) = 2·logistic(2z) − 1
tanh often works better than the sigmoid (logistic) in deep networks
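The identity relating tanh and the logistic function follows from the definitions in a few steps, writing logistic(z) = 1 / (1 + e^{-z}):

```latex
\begin{aligned}
\tanh(z) &= \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}
          = \frac{1 - e^{-2z}}{1 + e^{-2z}} \\
         &= \frac{2}{1 + e^{-2z}} - 1
          = 2\,\mathrm{logistic}(2z) - 1 .
\end{aligned}
```

So tanh is a rescaled and shifted logistic function whose outputs lie in (−1, 1) and are centred at 0.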

Ordinary (full-batch) gradient descent is very slow and should never be used. If you use a batch method, pick a second-order one such as L-BFGS.

learning rate: Better results can generally be obtained by allowing learning rates to decrease, typically in O(1/t)
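One simple schedule with O(1/t) decay, as a sketch: the rate is held constant for the first `tau` updates and then shrinks in proportion to 1/t. The specific form and the default values of `eps0` and `tau` are assumptions for illustration.

```python
def learning_rate(t, eps0=0.1, tau=100):
    """Constant for the first tau updates, then decays as O(1/t)."""
    return eps0 * tau / max(t, tau)

rates = [learning_rate(t) for t in (1, 100, 200, 1000)]
```

Both `eps0` and `tau` are hyper-parameters that would need tuning for a real model.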

parameter initialization:

Initialize hidden layer biases to 0 and output (or reconstruction) biases to optimal value if weights were 0
Initialize weights ~ Uniform(-r, r), with r inversely proportional to fan-in (previous layer size) and fan-out (next layer size)
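A sketch of this initialization. The notes only say that r shrinks as fan-in and fan-out grow; the concrete choice r = sqrt(6 / (fan_in + fan_out)) and the 4x larger range for sigmoid units are assumptions here (a commonly used heuristic), and `init_layer` is a hypothetical helper name.

```python
import numpy as np

def init_layer(fan_in, fan_out, rng, sigmoid_units=False):
    """Uniform(-r, r) weights with r = sqrt(6 / (fan_in + fan_out)); zero biases."""
    r = np.sqrt(6.0 / (fan_in + fan_out))
    if sigmoid_units:
        r *= 4.0                       # larger range often used for sigmoid units
    W = rng.uniform(-r, r, size=(fan_out, fan_in))
    b = np.zeros(fan_out)              # hidden biases start at 0
    return W, b, r

rng = np.random.default_rng(10)
W, b, r = init_layer(fan_in=300, fan_out=100, rng=rng)
```

Wider layers get a smaller range, which keeps the initial pre-activations out of the flat regions of the non-linearity.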