# Text classification
## Feature extraction
Model: bag of words
In this model, a text is represented as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity.
In other words, the bag-of-words model ignores grammar and word order and mainly records how many times each word occurs.
Multiplicity: in the multiset {a, a, a, b, b, b}, both a and b have multiplicity 3.
The list representation does not preserve the order of the words in the original sentences; this is precisely the defining feature of the bag-of-words model.
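As a minimal sketch (the two-sentence corpus here is made up for illustration), the counting can be done with nothing more than a vocabulary list and per-document word counts:

```python
from collections import Counter

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

# Build a fixed vocabulary from all documents, in order of first appearance.
vocab = []
for doc in docs:
    for word in doc.split():
        if word not in vocab:
            vocab.append(word)

# Represent each document as a vector of word counts (multiplicities).
# Word order within the document is discarded; only counts survive.
vectors = []
for doc in docs:
    counts = Counter(doc.split())
    vectors.append([counts[word] for word in vocab])

print(vocab)    # ['the', 'cat', 'sat', 'on', 'mat', 'dog', 'log']
print(vectors)  # [[2, 1, 1, 1, 1, 0, 0], [2, 0, 1, 1, 0, 1, 1]]
```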
However, term frequencies are not necessarily the best representation of a text. To address this problem, one of the most popular ways to normalize term frequencies is to weight each term by the inverse of its document frequency, i.e. tf-idf.
Tf-idf: term frequency-inverse document frequency
A numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus.
The tf-idf value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently in general.
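In the common formulation (one of several variants in use), tf-idf is the product of a term-frequency factor and an inverse-document-frequency factor:

$$
\text{tfidf}(t, d, D) = \text{tf}(t, d) \cdot \text{idf}(t, D),
\qquad
\text{idf}(t, D) = \log \frac{N}{|\{d \in D : t \in d\}|}
$$

where $N$ is the number of documents in the corpus $D$. A word that appears in every document gets $\text{idf} = \log 1 = 0$, so ubiquitous words like "the" are weighted down to nothing, while a word concentrated in a few documents keeps a high weight.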
Conceptually, we can view the bag-of-words model as a special case of the n-gram model with n=1. For n>1 the model is named w-shingling.
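For illustration, a tiny helper that extracts word n-grams (shingles); with n=1 it yields exactly the bag-of-words tokens (the function name and sample sentence are mine):

```python
def ngrams(tokens, n):
    """Return the list of contiguous n-grams (shingles) over the tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the cat sat on the mat".split()
print(ngrams(tokens, 1))  # unigrams: the bag-of-words case
print(ngrams(tokens, 2))  # bigrams: [('the', 'cat'), ('cat', 'sat'), ...]
```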
Hashing trick
A fast and space-efficient way of vectorizing features: turn arbitrary features into indices in a vector or matrix. It works by applying a hash function to the features and using their hash values as indices directly, rather than looking the indices up in an associative array.
For spam filtering, the problem with the dictionary approach is that as the training set grows, the vocabulary dictionaries take up a large amount of storage and keep growing. Conversely, if the vocabulary is kept fixed and not allowed to grow with the training set, an adversary may try to invent new words or misspellings that are not in the stored vocabulary in order to slip past the machine-learned filter.
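A minimal sketch of the idea, hashing tokens straight into a fixed-size count vector; md5 is used here only as a stable, deterministic hash, and n_features=16 is an arbitrary illustrative size:

```python
import hashlib

def hashed_vector(tokens, n_features=16):
    """Map tokens straight to vector indices via a hash function,
    with no vocabulary dictionary stored anywhere."""
    vec = [0] * n_features
    for token in tokens:
        # Stable hash of the token; Python's built-in hash() is
        # randomized per process, so md5 is used for determinism.
        digest = hashlib.md5(token.encode("utf-8")).hexdigest()
        index = int(digest, 16) % n_features
        vec[index] += 1
    return vec

print(hashed_vector("the cat sat on the mat".split()))
```

Because every token, even a never-before-seen misspelling, hashes to some index in the fixed-size vector, the memory footprint no longer grows with the vocabulary; scikit-learn ships this idea as HashingVectorizer.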
Zipf’s law
The frequency of any word is inversely proportional to its rank in the frequency table.
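Famously, in large English corpora "the" accounts for roughly 7% of tokens and "of" for about half that. A rough way to check the law on a real corpus is to rank words by frequency and see whether frequency × rank stays roughly constant (corpus.txt here is a placeholder for any large plain-text file):

```python
from collections import Counter

text = open("corpus.txt").read().lower().split()
counts = Counter(text)
for rank, (word, freq) in enumerate(counts.most_common(10), start=1):
    # Under Zipf's law, freq is proportional to 1/rank,
    # so freq * rank should be roughly constant.
    print(rank, word, freq, freq * rank)
```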
Questions:
Augmented frequency, to prevent a bias towards longer documents, e.g. raw frequency divided by the raw frequency of the most frequently occurring term in the document:
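$$
\text{tf}(t, d) = 0.5 + 0.5 \cdot \frac{f_{t,d}}{\max\{f_{t',d} : t' \in d\}}
$$

where $f_{t,d}$ is the raw count of term $t$ in document $d$. This is the usual augmented-frequency form: dividing by the count of the document's most frequent term cancels out document length, and the 0.5 floor keeps every present term's weight in the range [0.5, 1].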