word2vec

word2vec summary

  1. word meaning

    1. Use the dot product to compute similarity

    2. The difference between the concepts of distributional similarity and distributed representations

      1. Distributional similarity: a word's meaning can be understood from the words that appear around it (its contexts).

        This contrasts with the denotational way of defining a word, i.e., how a dictionary explains it; the approach above is more like learning a word from example sentences.

      2. Distributed is the opposite of a one-hot representation of a word: it is a dense vector.

  2. main idea of word2vec

    Predict between every word and its context words

    Two algorithms:

    1. skip-gram

      Predict context words given the target word (position independent)

    2. continuous bag of words (CBOW)

      Predict the target word from a bag-of-words context

    Two (moderately efficient) training methods:

    1. Hierarchical softmax
    2. Negative sampling

    Naive softmax (the expensive baseline that the two methods above approximate)
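
A minimal NumPy sketch of the skip-gram model with the naive softmax, just to make the pieces concrete; the vocabulary size, embedding dimension, and word indices below are made up for illustration:

```python
import numpy as np

# Toy setup: vocabulary size V and embedding dimension d are arbitrary here.
V, d = 10, 4
rng = np.random.default_rng(0)
U = rng.normal(0, 0.1, (V, d))  # context ("outside") vectors, one row per word
W = rng.normal(0, 0.1, (V, d))  # center-word vectors, one row per word

def softmax(scores):
    # Exponentiate to make values positive, then normalize to a probability distribution.
    e = np.exp(scores - scores.max())
    return e / e.sum()

def skipgram_prob(center_id):
    # Naive softmax: P(o | c) = exp(u_o . v_c) / sum_w exp(u_w . v_c)
    v_c = W[center_id]
    return softmax(U @ v_c)  # distribution over every word as a context word

def loss(center_id, context_id):
    # Cross-entropy with a one-hot target = negative log-probability of the true context word.
    return -np.log(skipgram_prob(center_id)[context_id])

print(loss(center_id=3, context_id=7))
```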

Todo: But the trick is that since we have a one-hot target, which simply marks the word that actually occurred, the only term left in the cross-entropy loss is the negative log probability of the true class. (Continued in the next class.)
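
A tiny check of that point: with a one-hot target, the full cross-entropy sum collapses to the negative log-probability of the true class (the probabilities below are made-up numbers):

```python
import numpy as np

p = np.array([0.1, 0.7, 0.2])        # predicted distribution (illustrative numbers)
y = np.array([0, 1, 0])              # one-hot target: the word that actually occurred
cross_entropy = -np.sum(y * np.log(p))
print(cross_entropy, -np.log(p[1]))  # identical: only the true class contributes
```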

The dot product is one way to compute similarity.

Softmax is a way to map a vector in $R^V$ to a probability distribution:

1. Exponentiate to make the values positive
2. Normalize to get a probability distribution

The name "softmax" comes from the fact that it acts like a soft version of the max: it pushes most of the probability onto the larger values.
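
A quick illustration of that "soft max" behavior: as the scores are scaled up, the softmax output approaches a hard one-hot maximum (the scores below are made up):

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())  # exponentiate (shifted for numerical stability)
    return e / e.sum()       # normalize to a probability distribution

scores = np.array([1.0, 2.0, 3.0])
print(softmax(scores))       # ~[0.09 0.24 0.67]: a soft preference for the max
print(softmax(10 * scores))  # ~[0. 0. 1.]: sharpens toward a hard max
```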

Q&A: Actually, we have two vector representations for the same word: one when it is the center word and one when it is a context word, i.e., two vectors per word.

Advantages:

1. It makes the math easier, because the two representations are separate during optimization rather than tied to each other
2. In practice it empirically works a little bit better

Thoughts:

Within the window, the position of a context word is actually not taken into account.

Intermission introduction:

Sentence embeddings can be used for sentiment analysis; the goal is to represent a whole sentence with a single vector.

Common approaches:

  1. Use bag-of-words

v('natural language processing') = 1/3 (v('natural') + v('language') + v('processing'))

  2. Other methods that are also used: recurrent neural networks, recursive neural networks, and convolutional neural networks

  3. Princeton paper: a simple unsupervised method

    Weighted bag-of-words, then remove a special direction (see the sketch after the steps below)

    1. Step 1: similar in spirit to averaging, but down-weight frequent words; p(w) is the word's frequency and a is a constant

      $v_s = \frac{1}{|s|}\sum_{w\in s} \frac{a}{a+p(w)}v_w$

    2. Step 2: compute the first principal component u of {$v_s$}

      for every sentence s in S do

      $v_s = v_s - uu^Tv_s$

      Interpretation: given the sentence representation, the probability of emitting a particular word is determined by that word's frequency and by how close the word is to the meaning of the whole sentence.
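
A minimal sketch of both steps, assuming the word vectors `word_vec`, unigram frequencies `p`, and constant `a` are already available; all names and numbers here are illustrative, not taken from the paper's released code:

```python
import numpy as np

def sif_embeddings(sentences, word_vec, p, a=1e-3):
    """sentences: list of token lists; word_vec: dict word -> np.array;
    p: dict word -> unigram frequency; a: smoothing constant."""
    # Step 1: weighted average, down-weighting frequent words by a / (a + p(w)).
    V = np.array([
        np.mean([a / (a + p[w]) * word_vec[w] for w in s], axis=0)
        for s in sentences
    ])
    # Step 2: remove the projection onto the first principal component u of {v_s}.
    _, _, Vt = np.linalg.svd(V, full_matrices=False)
    u = Vt[0]                      # first right singular vector = first principal component
    return V - np.outer(V @ u, u)  # v_s <- v_s - u u^T v_s
```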

$\frac{dx^T}{dx} = I, \frac{dx^TA}{dx}=A, \frac{dAx}{dx}=A^T$
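These identities follow the denominator-layout convention, i.e., entry (i, j) of the Jacobian is $\partial f_j / \partial x_i$. A quick finite-difference check of the second and third identities (matrix sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 3
A = rng.normal(size=(n, m))  # for f(x) = x^T A
B = rng.normal(size=(m, n))  # for f(x) = B x
x = rng.normal(size=n)
eps = 1e-6

def jacobian(f, x):
    # Denominator layout: J[i, j] = d f(x)_j / d x_i, via central differences.
    J = np.zeros((x.size, f(x).size))
    for i in range(x.size):
        e = np.zeros_like(x); e[i] = eps
        J[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return J

print(np.allclose(jacobian(lambda v: v @ A, x), A, atol=1e-4))    # d(x^T A)/dx = A
print(np.allclose(jacobian(lambda v: B @ v, x), B.T, atol=1e-4))  # d(Bx)/dx = B^T
```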

The cosine measure requires dividing by the norms of the two vectors; here we do not divide, because even if a vector's norm is particularly large, every word's score is scaled by it equally, so all words remain on an equal footing.

Stochastic gradient descent: neural networks actually love the noise it introduces.