word2vec summary
Word meaning
Use the dot product to compute similarity
Difference between the concepts of distributional similarity and distributed representation
Distributional similarity: the meaning of a word can be understood from the words it co-occurs with in context
This contrasts with the denotational way of defining a word, i.e., the way a dictionary explains it; the approach above is more like learning a word from example sentences
Distributed: a way of representing a word that is the opposite of a one-hot vector, namely a dense vector
Main idea of word2vec
Predict between every word and its context words
Two algorithms:
skip-gram
Predict context words given the target word (position independent); see the objective sketched after this list
continuous bag of words (CBOW)
predict target word from bag-of-words context
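A sketch of the skip-gram objective in its standard form ($T$ is the corpus length and $m$ the window size; these symbols are not defined elsewhere in these notes):

$$J(\theta) = -\frac{1}{T}\sum_{t=1}^{T}\ \sum_{-m \le j \le m,\ j \ne 0} \log p(w_{t+j} \mid w_t)$$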
Two (moderately efficient) training methods:
- Hierarchical softmax
- Negative sampling
Naive softmax
Todo: The trick is that since we have a one-hot target, which just marks the word that actually occurred, the only term left in the cross-entropy loss is the negative log probability of the true class. (Continued in the next class.)
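Written out, with a one-hot target $y$ and predicted distribution $\hat{y}$ over the vocabulary (a standard identity; $o$ indexes the word that actually occurred):

$$H(y, \hat{y}) = -\sum_{i=1}^{V} y_i \log \hat{y}_i = -\log \hat{y}_o$$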
The dot product is one way of computing similarity
Softmax is a way to map a vector in $R^V$ to a probability distribution:
1. exponentiate to make positive
2. normalize to give a probability
Why it is named softmax: it behaves like a soft version of max, giving most of the weight to the larger values
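Concretely, using the standard definitions (here $u_o$ is the context vector of word $o$ and $v_c$ the center-word vector, matching the two-vector setup in the Q&A below):

$$\operatorname{softmax}(x)_i = \frac{e^{x_i}}{\sum_{j=1}^{V} e^{x_j}}, \qquad p(o \mid c) = \frac{\exp(u_o^T v_c)}{\sum_{w=1}^{V} \exp(u_w^T v_c)}$$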
Q&A: Actually, we have two representations for the same word: one for when it is the center word and one for when it is a context word, i.e., two vectors per word
Advantages:
1. It makes the math easier, because the two representations are separate during optimization rather than tied to each other
2. In practice it empirically works a little better
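A minimal NumPy sketch of this setup (the names V_center and U_context and the toy sizes are placeholders, not from the lecture):

```python
import numpy as np

vocab_size, dim = 10, 4
rng = np.random.default_rng(0)

# Two vectors per word: one matrix for center words, one for context words.
V_center = rng.normal(size=(vocab_size, dim))   # rows are v_c
U_context = rng.normal(size=(vocab_size, dim))  # rows are u_o

def softmax(x):
    x = x - x.max()  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum()

def p_context_given_center(c):
    """Naive-softmax distribution over context words for center word index c."""
    scores = U_context @ V_center[c]  # dot products as similarity scores
    return softmax(scores)

probs = p_context_given_center(c=3)
print(probs.sum())  # ~1.0: a valid distribution over the vocabulary
```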
Thoughts:
Within the window, the effect of word position is actually not taken into account
Intermission introduction:
Sentence embeddings can be used for sentiment analysis; the hope is to represent a whole sentence with a single vector
Common methods:
- use bag-of-words
$v(\text{natural language processing}) = \frac{1}{3}\big(v(\text{natural}) + v(\text{language}) + v(\text{processing})\big)$
Other methods also used: recurrent neural networks, recursive neural networks, and convolutional neural networks
Princeton paper: a simple unsupervised method
Weighted bag-of-words, then remove a special direction (the first principal component)
Step 1: Similar in spirit to averaging, but down-weight the frequent words
$p(w)$ is the frequency of the word and $a$ is a constant
$v_s = \frac{1}{|s|}\sum_{w\in s} \frac{a}{a+p(w)}v_w$
Step 2: compute the first principal component $u$ of $\{v_s\}$
for each sentence $s$ in $S$ do
$v_s = v_s - uu^Tv_s$
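A minimal NumPy sketch of both steps (the input format, the dictionary names, and a = 1e-3 are placeholder assumptions, not values from the lecture):

```python
import numpy as np

def sif_embeddings(sentences, word_vecs, word_freq, a=1e-3):
    """sentences: list of token lists; word_vecs: dict token -> np.ndarray;
    word_freq: dict token -> unigram probability p(w); a: smoothing constant."""
    # Step 1: frequency-weighted average of the word vectors in each sentence.
    vs = []
    for s in sentences:
        tokens = [w for w in s if w in word_vecs]
        weights = np.array([a / (a + word_freq.get(w, 0.0)) for w in tokens])
        vectors = np.stack([word_vecs[w] for w in tokens])
        vs.append(weights @ vectors / len(tokens))
    vs = np.stack(vs)

    # Step 2: remove the projection onto the first principal component u,
    # taken here as the top right singular vector of the stacked sentence vectors.
    _, _, vt = np.linalg.svd(vs, full_matrices=False)
    u = vt[0]
    return vs - np.outer(vs @ u, u)  # v_s := v_s - u u^T v_s
```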
Interpretation: given the sentence representation, the probability of emitting a single word is determined by that word's frequency and by how close the word is to what the sentence as a whole expresses.
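One way to write this down (roughly the shape of the generative model in the Princeton paper; the exact normalizer $Z$ is glossed over here): with sentence vector $v_s$,

$$p(w \mid s) \;\propto\; \alpha\, p(w) + (1-\alpha)\, \frac{\exp(v_w^T v_s)}{Z}$$

The first term lets frequent words be emitted regardless of the sentence, and the second term grows with how close $v_w$ is to $v_s$.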
Matrix calculus identities (denominator layout convention): $\frac{dx^T}{dx} = I,\ \frac{dx^TA}{dx}=A,\ \frac{dAx}{dx}=A^T$
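As an application of these identities, the standard naive-softmax gradient with respect to the center vector (a textbook derivation, not copied from these notes):

$$\frac{\partial}{\partial v_c} \log p(o \mid c) = \frac{\partial}{\partial v_c}\Big( u_o^T v_c - \log\sum_{w=1}^{V} \exp(u_w^T v_c) \Big) = u_o - \sum_{w=1}^{V} p(w \mid c)\, u_w$$

i.e., the observed context vector minus the model's expected context vector.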
The cosine measure requires dividing by the norms of the two vectors, but here we do not need to divide: if one vector's norm is especially large, it multiplies the scores of all words alike, so every word is treated equally
Stochastic gradient descent: neural networks love noise