Topic Identification

Standard pipeline

  1. Speech is tokenized into words or phones by ASR systems.
  2. Standard text-based processing techniques are applied to the resulting tokenizations.
  3. A vector representation is produced for each spoken document, typically a bag-of-words multinomial representation or a more compact vector given by probabilistic topic models.
  4. Topic ID is performed on the spoken document representations by supervised training of classifiers such as Bayesian classifiers and SVMs.
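Steps 3–4 above can be illustrated with a minimal, dependency-free sketch: bag-of-words counts fed to a multinomial naive Bayes classifier (one possible "Bayesian classifier"; the token lists below are hypothetical stand-ins for ASR tokenizations):

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Train multinomial naive Bayes on bag-of-words counts.
    docs: list of token lists (e.g. word/phone tokenizations)."""
    vocab = set(t for d in docs for t in d)
    by_class = defaultdict(list)
    for d, y in zip(docs, labels):
        by_class[y].append(d)
    priors, cond = {}, {}
    for y, ds in by_class.items():
        priors[y] = math.log(len(ds) / len(docs))
        counts = Counter(t for d in ds for t in d)
        total = sum(counts.values())
        # Laplace-smoothed log-likelihood of each token given the topic
        cond[y] = {t: math.log((counts[t] + alpha) / (total + alpha * len(vocab)))
                   for t in vocab}
    return priors, cond

def classify(doc, priors, cond):
    """Pick the topic maximizing log prior + summed token log-likelihoods."""
    scores = {y: priors[y] + sum(cond[y].get(t, 0.0) for t in doc)
              for y in priors}
    return max(scores, key=scores.get)
```

An SVM over the same count vectors is the other classifier the notes mention; the naive Bayes version is just the shortest self-contained illustration.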

Problem addressed: a difficult and realistic scenario where the speech corpus of a test language is annotated only with a minimal number of topic labels, i.e., no manual transcriptions or dictionaries for building an ASR system are available.

Previous work: cross-lingual phoneme recognizers can produce reasonable speech tokenizations.

Drawback: the performance is highly dependent on the language and environmental-condition (channel and noise) mismatch between the training and the test data.

Paper: Topic Identification for Speech without ASR

Idea: focus on unsupervised approaches that operate directly on the speech of interest

  1. Raw-acoustic-feature-based unsupervised term discovery (UTD) is one such approach, which aims to identify and cluster repeating word-like units across speech based on segmental dynamic time warping.

    UTD in 【NLP on spoken document without ASR】 is limited, since the acoustic features on which UTD is performed are produced by acoustic models trained on the transcribed speech of its evaluation corpus.

    Improvement in this paper:

    1. UTD operates on language-independent speech representations extracted from multilingual bottleneck networks trained on languages other than the test language
  2. Another alternative for producing speech tokenizations without language dependency is the model-based approach, i.e., unsupervised learning of hidden Markov model based phoneme-like units from untranscribed speech.

    This paper proposes:

    the variational Bayesian inference based acoustic unit discovery (AUD) framework in 【variational inference for acoustic unit discovery】, which allows parallelized large-scale training

  3. After speech tokenization, all previous works are limited to using bag-of-words features as spoken document representations.

    UTD only identifies relatively long repeated terms (0.5–1 s).

    AUD/ASR enables full-coverage segmentation of continuous speech into a sequence of units/words, and the resulting temporal sequence enables another feature learning architecture based on CNNs.

    This paper proposes:

    Instead of treating the sequential tokens as a bag of acoustic units or words, the whole token sequence is encoded as concatenated continuous vectors, followed by convolution and temporal pooling operations that capture local and global dependencies.

    Such continuous-space feature extraction frameworks have been used in various language processing tasks such as spoken language understanding and text classification.

    Three questions worth investigating in this paper:

    1. Whether such a CNN-based framework can perform as well on noisy, automatically discovered phoneme-like units as on orthographic words/characters.
    2. Whether pre-trained vectors of phoneme-like units from word2vec provide superior performance to random initialization, as evidenced in word-based tasks.
    3. Whether CNNs are still competitive in low-resource settings of hundreds to two thousand training exemplars, rather than on large/medium-sized datasets.
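The convolution-plus-temporal-pooling idea can be sketched without any framework: each token of the discovered-unit sequence is mapped to a continuous vector, a filter slides over windows of consecutive vectors, and a global max over time yields one fixed-size feature per filter. The embeddings and filters below are hypothetical toy values; a real model learns both:

```python
def conv1d_maxpool(embeddings, filters, width=3):
    """1-D convolution over a token-embedding sequence plus global
    max pooling. embeddings: list of d-dim vectors (one per token);
    filters: list of weight vectors of length width*d. Returns one
    max-over-time response per filter (a fixed-size document feature)."""
    features = []
    for w in filters:
        responses = []
        for t in range(len(embeddings) - width + 1):
            # Concatenate `width` consecutive token vectors into one window
            window = [x for vec in embeddings[t:t + width] for x in vec]
            responses.append(sum(wi * xi for wi, xi in zip(w, window)))
        features.append(max(responses))  # global (temporal) max pooling
    return features
```

Max pooling makes the feature length-independent, which is what lets variable-length spoken documents share one classifier on top.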

UTD: unsupervised term discovery

Aims to automatically identify and cluster repeated terms from speech.
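The core alignment recursion behind this can be sketched as plain dynamic time warping (shown here on scalar sequences for brevity; segmental DTW for UTD additionally restricts paths to diagonal bands and extracts low-distortion subpaths over frame-level acoustic feature vectors):

```python
def dtw_distance(a, b, dist=lambda x, y: abs(x - y)):
    """Minimum-cost alignment between two sequences a and b.
    A low cost between two speech segments suggests a repeated term."""
    n, m = len(a), len(b)
    INF = float("inf")
    # D[i][j]: cost of best alignment of a[:i] with b[:j]
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = dist(a[i - 1], b[j - 1])
            D[i][j] = c + min(D[i - 1][j],      # insertion
                              D[i][j - 1],      # deletion
                              D[i - 1][j - 1])  # match
    return D[n][m]
```

In the paper's setting the frames would be multilingual bottleneck features and `dist` a vector distance (e.g. cosine), but the recursion is unchanged.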

AUD is based on variational inference, rather than maximum likelihood training, which may oversimplify the parameter estimation, or Gibbs sampling training, which is not amenable to large-scale applications.

A phone-loop model is formulated where each phoneme-like