June 1, 1997

Adaptive multilingual sentence boundary disambiguation

Abstract

The sentence is a standard textual unit in natural language processing applications. In many languages the punctuation mark that indicates the end-of-sentence boundary is ambiguous; thus the tokenizers of most NLP systems must be equipped with special sentence boundary recognition rules for every new text collection. As an alternative, this article presents an efficient, trainable system for sentence boundary disambiguation. The system, called Satz, makes simple estimates of the parts of speech of the tokens immediately preceding and following each punctuation mark, and uses these estimates as input to a machine learning algorithm that then classifies the punctuation mark. Satz is very fast both in training and sentence analysis, and its combined robustness and accuracy surpass existing techniques. The system needs only a small lexicon and training corpus, and has been shown to transfer quickly and easily from English to other languages, as demonstrated on French and German.
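
The pipeline the abstract describes (estimate parts of speech for the tokens around a punctuation mark, then classify the mark) can be illustrated with a minimal Python sketch. This is not the Satz implementation: the tag set, the toy lexicon, the capitalization flag, the one-token context window, and all function names are assumptions made for illustration, and a plain perceptron stands in for whatever learning algorithm the article uses.

```python
from typing import Dict, List, Tuple

# Hypothetical coarse tag set; the real system's categories may differ.
TAGS = ["noun", "verb", "abbrev", "number", "other"]
DIM = len(TAGS) + 1  # one extra slot for a capitalization flag (assumption)

# Toy lexicon mapping words to estimated part-of-speech distributions.
LEXICON: Dict[str, Dict[str, float]] = {
    "dr": {"abbrev": 1.0},
    "mr": {"abbrev": 1.0},
    "cat": {"noun": 1.0},
    "sat": {"verb": 1.0},
    "run": {"noun": 0.3, "verb": 0.7},
}


def descriptor(token: str) -> List[float]:
    """Return a simple part-of-speech estimate vector for one token."""
    vec = [0.0] * DIM
    word = token.rstrip(".").lower()
    if word in LEXICON:
        for tag, prob in LEXICON[word].items():
            vec[TAGS.index(tag)] = prob
    elif word[:1].isdigit():
        vec[TAGS.index("number")] = 1.0
    else:
        vec[TAGS.index("other")] = 1.0
    if token[:1].isupper():
        vec[-1] = 1.0
    return vec


def context_features(tokens: List[str], i: int) -> List[float]:
    """Concatenate descriptors of the token carrying the mark and the next token."""
    before = descriptor(tokens[i])
    after = descriptor(tokens[i + 1]) if i + 1 < len(tokens) else [0.0] * DIM
    return before + after


def score(weights: List[float], bias: float, x: List[float]) -> float:
    return sum(w * v for w, v in zip(weights, x)) + bias


def train_perceptron(
    examples: List[Tuple[List[float], int]], epochs: int = 20
) -> Tuple[List[float], float]:
    """Train a perceptron; label 1 means the period ends a sentence."""
    weights = [0.0] * (2 * DIM)
    bias = 0.0
    for _ in range(epochs):
        for x, y in examples:
            pred = 1 if score(weights, bias, x) > 0 else 0
            err = y - pred
            if err:
                weights = [w + err * v for w, v in zip(weights, x)]
                bias += err
    return weights, bias


def find_boundaries(tokens: List[str], weights: List[float], bias: float) -> List[int]:
    """Return indices of period-bearing tokens classified as sentence-final."""
    return [
        i
        for i, tok in enumerate(tokens)
        if tok.endswith(".") and score(weights, bias, context_features(tokens, i)) > 0
    ]


# Tiny hand-labeled training text: index -> 1 if the period ends a sentence.
train_tokens = ["Dr.", "Smith", "sat.", "The", "cat", "ran."]
labels = {0: 0, 2: 1, 5: 1}
examples = [(context_features(train_tokens, i), y) for i, y in labels.items()]
weights, bias = train_perceptron(examples)

print(find_boundaries(["Mr.", "Jones", "left."], weights, bias))  # [2]
```

On this toy data the classifier learns that a period after a known abbreviation is unlikely to end a sentence, so the period in "Mr." is rejected while the one in "left." is accepted. A realistic setup would need a larger lexicon, a wider context window, and a labeled corpus.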

Authors

David D. Palmer

Marti A. Hearst

Journals

Computational Linguistics

Institutions

Palo Alto Research Center

Mitre (United States)
