A method for tokenizing text

R. M. Kaplan. Inquiries into Words, Constraints and Contexts, 2005. stanford.edu
The stream of characters in a natural language text must be broken up into distinct meaningful units (or tokens) before any language processing beyond the character level can be performed. If languages were perfectly punctuated, this would be a trivial thing to do: a simple program could separate the text into word and punctuation tokens simply by breaking it up at white-space and punctuation marks. But real languages are not perfectly punctuated, and the situation is always more complicated.

Even in a well (but not perfectly) punctuated language like English, there are cases where the correct tokenization cannot be determined simply by knowing the classification of individual characters, and even cases where several distinct tokenizations are possible. For example, the English string chap. can be taken either as an abbreviation for the word chapter or as the word chap appearing at the end of a sentence, and Jan. can be regarded either as an abbreviation for January or as a sentence-final proper name. The period should be part of the word token in the abbreviation readings but taken as a separate token in the sentence-final readings. As another example, white-space is a fairly reliable indicator of an English token boundary, but there are some multi-component words in English that include white-space as internal characters (e.g., to and fro, jack rabbit, General Motors, a priori).
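For illustration (this sketch is not from the paper), a minimal Python version of the "simple program" described above, which breaks text at white-space and punctuation marks, shows exactly where the character-level view fails:

```python
import re

def naive_tokenize(text):
    """Split text at white-space and punctuation: the 'trivial' tokenizer
    that assumes perfect punctuation."""
    # \w+ matches runs of word characters; [^\w\s] matches each punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text)

print(naive_tokenize("See chap. 4."))
# ['See', 'chap', '.', '4', '.']  -- the period is always split off,
# even when 'chap.' abbreviates 'chapter' and the period belongs to the token.

print(naive_tokenize("to and fro"))
# ['to', 'and', 'fro']  -- three tokens for one multi-component word.
```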
These difficulties for English are relatively limited, and text-processing applications often either ignore them (e.g., simply forget about abbreviations and multi-component words, since there are many more difficult problems to worry about) or treat them with special-purpose machinery, as sketched below. But this is a much bigger problem for other languages (e.g., Chinese text is very poorly punctuated) …
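As a sketch of what such special-purpose machinery might look like (a hypothetical illustration, not the method the paper develops; the ABBREVIATIONS list is invented for the example), the following enumerates every tokenization licensed by a hand-listed set of abbreviations:

```python
from itertools import product

ABBREVIATIONS = {"chap.", "jan.", "dr."}  # hypothetical, hand-maintained list

def analyses(chunk):
    """Return the possible token sequences for one white-space-delimited chunk."""
    if chunk.lower() in ABBREVIATIONS:
        # Ambiguous: one abbreviation token, or a word plus a sentence-final period.
        return [[chunk], [chunk[:-1], "."]]
    if chunk.endswith("."):
        return [[chunk[:-1], "."]]
    return [[chunk]]

def tokenizations(text):
    """Enumerate every distinct tokenization of the text."""
    per_chunk = [analyses(c) for c in text.split()]
    return [[tok for seq in combo for tok in seq] for combo in product(*per_chunk)]

for t in tokenizations("I met Jan."):
    print(t)
# ['I', 'met', 'Jan.']      -- 'Jan.' kept whole, as an abbreviation
# ['I', 'met', 'Jan', '.']  -- sentence-final proper name plus period
```

Enumerating both readings defers the choice to later processing; the cost is that the number of analyses grows with every ambiguous token in the string.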