The sentence is a standard textual unit in natural language processing applications. In many languages the punctuation mark that indicates the end-of-sentence boundary is ambiguous; thus the tokenizers of most NLP systems must be equipped with special sentence boundary recognition rules for every new text collection. As an alternative, this article presents an efficient, trainable system for sentence boundary disambiguation. The system, called Satz, makes simple estimates of the parts of speech of the tokens immediately preceding and following each punctuation mark, and uses these estimates as input to a machine learning algorithm that then classifies the punctuation mark. Satz is very fast both in training and sentence analysis, and its combined robustness and accuracy surpass existing techniques. The system needs only a small lexicon and training corpus, and has been shown to transfer quickly and easily from English to other languages, as demonstrated on French and German.
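The approach in the abstract can be illustrated with a minimal Python sketch. It is not the Satz implementation: the tag set `TAGS`, the toy `LEXICON`, the window size `k`, and the helper names are all hypothetical, and a simple perceptron stands in for the unspecified "machine learning algorithm". Each token near a punctuation mark is mapped to a rough part-of-speech estimate, the estimates are concatenated into a context vector, and a classifier labels the mark as a sentence boundary or not.

```python
from typing import Dict, List, Sequence

# Hypothetical coarse tag set and toy lexicon (assumptions for illustration).
TAGS = ["noun", "verb", "abbrev", "number", "proper", "other"]
LEXICON: Dict[str, List[str]] = {
    "dr": ["abbrev"], "mr": ["abbrev"], "street": ["noun"],
    "left": ["verb"], "he": ["other"], "the": ["other"],
}

def estimate_tags(token: str) -> List[float]:
    """Return a rough distribution over TAGS for one token."""
    key = token.lower().rstrip(".")
    if key in LEXICON:
        tags = LEXICON[key]
    elif token[:1].isdigit():
        tags = ["number"]
    elif token[:1].isupper():
        tags = ["proper"]      # unknown capitalized word
    else:
        tags = ["other"]
    return [1.0 / len(tags) if t in tags else 0.0 for t in TAGS]

def context_vector(tokens: Sequence[str], i: int, k: int = 2) -> List[float]:
    """Concatenate tag estimates for k tokens up to and including token i
    (which carries the punctuation mark) and the k tokens after it."""
    vec: List[float] = []
    for j in range(i - k + 1, i + k + 1):
        if 0 <= j < len(tokens):
            vec.extend(estimate_tags(tokens[j]))
        else:
            vec.extend([0.0] * len(TAGS))  # pad past the text edges
    return vec

def predict(weights: Sequence[float], x: Sequence[float]) -> bool:
    """True if the punctuation mark is classified as a sentence boundary."""
    score = weights[-1] + sum(w * v for w, v in zip(weights, x))
    return score > 0

def train_perceptron(samples, labels, epochs: int = 20) -> List[float]:
    """Plain perceptron; the last weight is the bias term."""
    weights = [0.0] * (len(samples[0]) + 1)
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            if predict(weights, x) != y:
                delta = 1.0 if y else -1.0
                for idx, v in enumerate(x):
                    weights[idx] += delta * v
                weights[-1] += delta
    return weights

# Tiny hand-labeled training set: (tokens, index of token with the period, label).
TRAIN = [
    (["He", "saw", "Dr.", "Smith"], 2, False),    # abbreviation, no boundary
    (["He", "left.", "The", "street"], 1, True),  # true sentence boundary
]
samples = [context_vector(toks, i) for toks, i, _ in TRAIN]
labels = [label for _, _, label in TRAIN]
weights = train_perceptron(samples, labels)

print(predict(weights, context_vector(["She", "met", "Mr.", "Jones"], 2)))
```

The classifier never sees the words themselves, only the tag estimates around the punctuation mark, which is why a small lexicon can be enough and why swapping the lexicon is the main step when moving to another language.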
David D. Palmer
Marti A. Hearst
Computational Linguistics
Palo Alto Research Center
Mitre (United States)
www.synapsesocial.com/papers/6a07ff8e217278811afe14d2 — DOI: https://doi.org/10.5555/972695.972697