Copyright ©1995 by NeXT Computer, Inc.  All Rights Reserved.

IXAttributeReader



Inherits From: Object
Declared In: indexing/IXAttributeReader.h



Class Description

An IXAttributeReader breaks a stream of text into lexemes and emits them in a format suitable for consumption by an IXAttributeParser.  Lexemes are the lexical components of the text.  They are usually words, though they may be phrases, numbers, formulas, or even binary-encoded graphics or sound bites.

An IXAttributeReader accepts text in one of three formats:  Attribute Reader Format (ARF), RTF, or ASCII.  Processed lexemes and unrecognized text, if any, are both output in ARF.  This allows multiple attribute readers to process a single stream in series, so that different parts of the stream are handled by different readers.  For more information on ARF, see "Attribute Reader Format" in the "Other Features" section of this chapter.

An IXAttributeReader can perform any of four predefined operations while analyzing a stream of text.  It can fold case, reducing uppercase characters to their lowercase equivalents; it can unique the lexemes, emitting numerical backward references instead of fully formed lexemes when duplicates are encountered; it can fold plurals, reducing plural terms to their singular form; and it can perform stemming, reducing terms to their stems (for example, "write," "writing," and "written" would all be reduced to "write").  The first two of these operations are fully implemented by IXAttributeReader.  The other two are declared as abstract methods for language-specific subclasses.
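
As a rough illustration of what uniquing means, the following plain C sketch records each distinct lexeme and, for a repeat, reports the position of the earlier occurrence.  The names LexemeTable and unique_lexeme, the fixed table sizes, and the use of a table index as the backward reference are all assumptions made for this example; the way IXAttributeReader actually encodes backward references in ARF is not shown here.

```c
#include <string.h>

#define MAX_LEXEMES 64   /* hypothetical capacity for this sketch */
#define MAX_LEN     32   /* hypothetical maximum lexeme size, including NUL */

/* Hypothetical table of the distinct lexemes seen so far. */
typedef struct {
    char words[MAX_LEXEMES][MAX_LEN];
    int  count;
} LexemeTable;

/*
 * Returns the index of an earlier identical lexeme (standing in for a
 * numerical backward reference), or -1 if the lexeme is new.  New
 * lexemes are recorded when they fit; otherwise they are reported as
 * new but not remembered.
 */
int unique_lexeme(LexemeTable *t, const char *lexeme)
{
    int i;

    for (i = 0; i < t->count; i++) {
        if (strcmp(t->words[i], lexeme) == 0)
            return i;
    }
    if (t->count < MAX_LEXEMES && strlen(lexeme) < MAX_LEN) {
        strcpy(t->words[t->count], lexeme);
        t->count++;
    }
    return -1;
}
```

A caller would set count to 0 before the first lexeme, then emit the full lexeme whenever the function returns -1 and a reference otherwise.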



Instance Variables

NXHashTable *stopWords;
const char *punctuation;
unsigned char *charMapping;
struct {
    unsigned caseFolding:1;
    unsigned pluralFolding:1;
    unsigned stemsReduced:1;
    unsigned lexemeUniquing:1;
} booleanOptions;


stopWords                        Words removed from output.
punctuation                      Characters that delimit words.
charMapping                      Character mapping table.
booleanOptions.caseFolding       YES if uppercase letters are converted to lowercase.
booleanOptions.pluralFolding     YES if plurals are converted to singular form.
booleanOptions.stemsReduced      YES if derivative terms are reduced to their stems.
booleanOptions.lexemeUniquing    YES if lexemes are uniqued for more compact output.



Method Types

Analyzing a text stream     analyzeStream:
Altering lexemes            foldPlural:inLength:
                            reduceStem:inLength:
Setting reader options      setCaseFolded:
                            isCaseFolded
                            setPluralsFolded:
                            arePluralsFolded
                            setStemsReduced:
                            areStemsReduced
                            setPunctuation:
                            punctuation
                            setStopWords:
                            stopWords



Instance Methods

analyzeStream:
(NXStream *)analyzeStream:(NXStream *)stream

Scans stream for lexemes and returns a stream that contains the results of the lexical analysis in Attribute Reader Format.  Several attribute readers may be chained together, each one further analyzing the output of its predecessor.



arePluralsFolded
(BOOL)arePluralsFolded

Returns YES if the IXAttributeReader reduces plurals to their singular forms when reading text, NO otherwise.  For example, if plurals are folded, then "boxes" will be folded to "box" and "children" to "child."  The default is NO.

IXAttributeReader itself doesn't fold plurals; a subclass must override foldPlural:inLength: to provide this functionality.  This method simply reports the value of a flag.

See also:  setPluralsFolded:, foldPlural:inLength:



areStemsReduced
(BOOL)areStemsReduced

Returns YES if the IXAttributeReader reduces derivatives to their stems, NO otherwise.  For example, if stems are reduced, "forest", "deforest", "reforest", "deforestation", "forestry", and "unforested" will all be reduced to "forest."  The default is NO.

IXAttributeReader itself doesn't reduce stems; a subclass must override reduceStem:inLength: to provide this functionality. This method simply reports the value of a flag.

See also:  setStemsReduced:, reduceStem:inLength:



foldPlural:inLength:
(unsigned int)foldPlural:(char *)aString inLength:(unsigned int)aLength

Does nothing and returns the length of aString.  Overridden by subclasses to perform language-specific plural-to-singular form conversion.

Subclass implementations should convert aString from a plural to a singular form in a language-specific manner.  aLength is the size of the buffer that holds aString, not the length of the string itself.  If aString is altered, its new length should be the return value of this method.
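
The contract described above can be sketched in plain C as a function that a subclass override might call.  This is a deliberately naive English-only rule set, not the behavior of any shipped subclass: the name fold_plural is hypothetical, the string is assumed to be NUL-terminated, and irregular plurals such as "children" would need a lookup table that is not shown.

```c
#include <string.h>

/*
 * Naive English plural folding with the same shape as
 * foldPlural:inLength:.  `s` is assumed to be a NUL-terminated word
 * held in a buffer of `buflen` bytes.  The word is rewritten in place
 * and its new length is returned.
 */
unsigned int fold_plural(char *s, unsigned int buflen)
{
    unsigned int len = (unsigned int)strlen(s);

    (void)buflen;   /* the word only ever shrinks, so capacity is not at risk */

    /* "boxes" -> "box", "classes" -> "class", "dishes" -> "dish" */
    if (len > 3 && strcmp(s + len - 2, "es") == 0) {
        char c = s[len - 3];
        if (c == 'x' || c == 's' || c == 'z' ||
            (len > 4 && c == 'h' &&
             (s[len - 4] == 's' || s[len - 4] == 'c'))) {
            len -= 2;
            s[len] = '\0';
            return len;
        }
    }

    /* "cats" -> "cat", but leave words such as "glass" alone */
    if (len > 3 && s[len - 1] == 's' && s[len - 2] != 's') {
        len -= 1;
        s[len] = '\0';
    }
    return len;
}
```

A word that matches neither rule is left untouched and its original length is returned, which matches the default behavior described for IXAttributeReader.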

See also:  arePluralsFolded, setPluralsFolded:



isCaseFolded
(BOOL)isCaseFolded

Returns YES if the IXAttributeReader converts text to lowercase when reading, NO otherwise.  The default is YES.

See also:  setCaseFolded:



punctuation
(char *)punctuation

Returns a string containing the characters used by the IXAttributeReader to separate lexemes.  The sender of this message is responsible for freeing the string.

See also:  setPunctuation:



reduceStem:inLength:
(unsigned int)reduceStem:(char *)aString inLength:(unsigned int)aLength

Does nothing and returns the length of aString.  Overridden by subclasses to perform language-specific stem reduction.

Subclass implementations should reduce aString to its stem in a language-specific manner.  aLength is the size of the buffer that holds aString, not the length of the string itself.  If aString is altered, its new length should be the return value of this method.
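
A subclass override might delegate to a helper along the following lines.  This plain C sketch strips a few common English suffixes and is only meant to show the in-place rewrite and returned length; the name reduce_stem and the suffix list are assumptions for this example, the string is assumed to be NUL-terminated, and a real stemmer needs considerably more rules (this one does not restore a dropped final "e", for instance).

```c
#include <string.h>

/*
 * Naive suffix stripping with the same shape as reduceStem:inLength:.
 * `s` is assumed to be a NUL-terminated word held in a buffer of
 * `buflen` bytes.  At most one suffix is removed, and only when at
 * least three characters of stem remain.  Returns the new length.
 */
unsigned int reduce_stem(char *s, unsigned int buflen)
{
    static const char *const suffixes[] = { "ing", "ed", "ly" };
    unsigned int len = (unsigned int)strlen(s);
    unsigned int i;

    (void)buflen;   /* the word only ever shrinks, so capacity is not at risk */

    for (i = 0; i < sizeof suffixes / sizeof suffixes[0]; i++) {
        unsigned int slen = (unsigned int)strlen(suffixes[i]);
        if (len >= slen + 3 && strcmp(s + len - slen, suffixes[i]) == 0) {
            len -= slen;
            s[len] = '\0';
            break;
        }
    }
    return len;
}
```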

See also:  areStemsReduced, setStemsReduced:



setCaseFolded:
setCaseFolded:(BOOL)flag

If flag is YES, the IXAttributeReader converts text to lowercase when reading.  If flag is NO, it doesn't affect the case of text. The default is YES.  Returns self.
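
Case folding of this kind is often done with a 256-entry byte mapping table.  The plain C sketch below shows that general technique; whether IXAttributeReader uses its charMapping table in this way is an assumption, and the function names are hypothetical.  The standard tolower() function stands in for whatever character classification the Indexing Kit uses.

```c
#include <ctype.h>

/* Fills `map` so that each byte maps to its lowercase equivalent. */
void build_case_fold_map(unsigned char map[256])
{
    int c;

    for (c = 0; c < 256; c++)
        map[c] = (unsigned char)tolower(c);
}

/* Rewrites the NUL-terminated string `s` in place through `map`. */
void apply_char_map(const unsigned char map[256], char *s)
{
    for (; *s != '\0'; s++)
        *s = (char)map[(unsigned char)*s];
}
```

Building the table once and reusing it keeps the per-character work to a single array lookup.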

See also:  isCaseFolded



setPluralsFolded:
setPluralsFolded:(BOOL)flag

If flag is YES, the IXAttributeReader reduces all plurals to singular form when it reads text.  If flag is NO, it doesn't alter plural forms.  The default is NO.  Returns self.

IXAttributeReader itself doesn't fold plurals; a subclass must override foldPlural:inLength: to provide this functionality.  This method simply sets the value of a flag.

See also:  arePluralsFolded, foldPlural:inLength:



setPunctuation:
setPunctuation:(const char *)aString

Sets the punctuation string to aString.  The punctuation string defines the characters that the IXAttributeReader uses to separate lexemes.  Returns self.

The ASCII null character (0) is always a punctuation character.  The default is the set of characters for which NXIsAlNum() returns 0 (false), except underscore.
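
The effect of a punctuation string on lexeme boundaries can be sketched in plain C with strspn() and strcspn().  The function names below are hypothetical, isalnum() stands in for NXIsAlNum(), and the sketch ignores every other reader option; it only shows how a delimiter set splits text.

```c
#include <ctype.h>
#include <string.h>

/*
 * Fills `out` with a default delimiter set as described above: every
 * non-NUL byte that is neither alphanumeric nor an underscore.
 * `out` must have room for 256 bytes.
 */
void default_punctuation(char out[256])
{
    int c, n = 0;

    for (c = 1; c < 256; c++) {
        if (!isalnum(c) && c != '_')
            out[n++] = (char)c;
    }
    out[n] = '\0';
}

/*
 * Splits `text` on the characters in `punct`.  Calls `visit` (if not
 * NULL) with the start and length of each lexeme and returns the
 * number of lexemes found.
 */
int split_lexemes(const char *text, const char *punct,
                  void (*visit)(const char *start, size_t len))
{
    const char *p = text;
    int count = 0;

    while (*p != '\0') {
        size_t n;

        p += strspn(p, punct);     /* skip a run of delimiters */
        n = strcspn(p, punct);     /* measure the next lexeme */
        if (n == 0)
            break;
        if (visit != NULL)
            visit(p, n);
        count++;
        p += n;
    }
    return count;
}
```

Because the terminating NUL ends the scan regardless of the delimiter set, it behaves as a punctuation character in this sketch as well.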

See also:  punctuation, NXIsAlNum() (Common Classes and Functions)



setStemsReduced:
setStemsReduced:(BOOL)flag

If flag is YES, the IXAttributeReader reduces words to common stems when it reads text.  If flag is NO, it doesn't reduce stems.  The default is NO.  Returns self.

The IXAttributeReader class itself doesn't reduce stems; a subclass must override reduceStem:inLength: to provide this functionality.  This method simply sets the value of a flag.

See also:  areStemsReduced, reduceStem:inLength:



setStopWords:
setStopWords:(const char *)stopWords

Sets the IXAttributeReader's stop word list using the newline-separated words in stopWords.  An IXAttributeReader deletes stop words from the stream of lexemes it produces.  The default is to use no stop words.

See also:  stopWords



stopWords
(char *)stopWords

Returns a string containing a newline-separated list of the stop words used by the IXAttributeReader.  An IXAttributeReader deletes stop words from the lexemes it emits.  The sender of this message is responsible for freeing the string.
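
The newline-separated format can be consumed with a simple scan, as in the plain C sketch below.  The function name is_stop_word is hypothetical, and the linear search is chosen only for brevity; the stopWords instance variable is an NXHashTable, so the class itself presumably uses hashed lookup.

```c
#include <string.h>

/*
 * Returns 1 if `word` matches a complete entry in the
 * newline-separated list `stop_words`, and 0 otherwise.
 */
int is_stop_word(const char *stop_words, const char *word)
{
    size_t wlen = strlen(word);
    const char *p = stop_words;

    while (*p != '\0') {
        size_t n = strcspn(p, "\n");   /* length of the current entry */

        if (n == wlen && strncmp(p, word, n) == 0)
            return 1;
        p += n;
        if (*p == '\n')
            p++;                       /* step over the separator */
    }
    return 0;
}
```

A reader built on this helper would drop any lexeme for which the function returns 1 before writing its output.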

See also:  setStopWords: