Statistical language models used in large-vocabulary speech recognition must properly encapsulate the various constraints, both local and global, present in the language. While local constraints are readily captured through n-gram modeling, global constraints, such as long-term semantic dependencies, have been more difficult to handle within a data-driven formalism. This paper focuses on the use of latent semantic analysis, a paradigm that automatically uncovers the salient semantic relationships between words and documents in a given corpus. In this approach, (discrete) words and documents are mapped onto a (continuous) semantic vector space, in which familiar clustering techniques can be applied. This leads to the specification of a powerful framework for automatic semantic classification, as well as the derivation of several language model families with various smoothing properties. Because of their large-span nature, these language models are well suited to complement conventional n-grams. An integrative formulation is proposed for harnessing this synergy, in which the latent semantic information is used to adjust the standard n-gram probability. Such hybrid language modeling compares favorably with the corresponding n-gram baseline: experiments conducted on the Wall Street Journal domain show a reduction in average word error rate of over 20%. This paper concludes with a discussion of intrinsic tradeoffs, such as the influence of training data selection on the resulting performance.
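To make the mapping described in the abstract concrete, the following is a minimal sketch of latent semantic analysis on a toy corpus, not the paper's actual implementation. The corpus, the rank `R`, the use of raw counts (rather than any weighted counts the paper may use), and the helper name `cosine` are all assumptions made for illustration. It builds a word-by-document count matrix, takes a truncated singular value decomposition, and compares words in the resulting continuous space.

```python
import numpy as np

# Toy corpus (hypothetical, for illustration only).
docs = [
    "stocks fell as bond yields rose",
    "bond prices and stocks moved lower",
    "the team won the final game",
    "the coach praised the team after the game",
]

vocab = sorted({w for d in docs for w in d.split()})
index = {w: i for i, w in enumerate(vocab)}

# Word-by-document matrix W (rows: words, columns: documents).
# Raw counts are a simplification assumed here.
W = np.zeros((len(vocab), len(docs)))
for j, d in enumerate(docs):
    for w in d.split():
        W[index[w], j] += 1.0

# Truncated SVD of rank R: W is approximated by U_R S_R V_R^T.
R = 2  # assumed dimension of the semantic space
U, s, Vt = np.linalg.svd(W, full_matrices=False)
word_vecs = U[:, :R] * s[:R]    # one R-dimensional vector per word
doc_vecs = Vt[:R, :].T * s[:R]  # one R-dimensional vector per document


def cosine(a, b):
    """Cosine similarity between two vectors in the semantic space."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))


# Words that co-occur in the same documents end up close together.
sim_fin = cosine(word_vecs[index["stocks"]], word_vecs[index["bond"]])
sim_cross = cosine(word_vecs[index["stocks"]], word_vecs[index["game"]])
print(f"stocks~bond: {sim_fin:.2f}, stocks~game: {sim_cross:.2f}")
```

In this toy setup the two financial words land much closer to each other than to a sports word, which is the kind of large-span semantic relationship that familiar clustering techniques can then exploit in the vector space.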
J.R. Bellegarda studied this question.
www.synapsesocial.com/papers/6a0968d3b0d552aa8b45ab1b — DOI: https://doi.org/10.1109/5.880084
J.R. Bellegarda
Proceedings of the IEEE
Apple (United States)