Copyright ©1995 by NeXT Computer, Inc.  All Rights Reserved.

IXAttributeParser



Inherits From: Object
Declared In: indexing/IXAttributeParser.h



Class Description

An IXAttributeParser breaks text streams down into lists of lexemes occurring in the text.  A lexeme is a word or phrase that should be treated as a single term.  Though not directly accessible, the lists are used by other classes in the Indexing Kit to build indexes for the text, or to resolve queries against the text.

An IXAttributeParser uses a number of IXAttributeReaders to divide a text stream into individual lexemes, each associated with a specific attribute, like Title, Author, or Abstract, and collects the lexemes into a histogram for each attribute.  The parser can weight the lexemes for a given attribute in several ways:  by the number of occurrences within the attribute, by the relative frequency of occurrence within the attribute, or by peculiarity within the attribute relative to a reference domain.  A lexeme's peculiarity is the square root of the ratio of its frequency within the attribute to its frequency within the reference domain; for example, the word "computer" has a much lower peculiarity with respect to the domain of computer science literature than to that of archaeological literature because it occurs much more frequently in the former.
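The peculiarity definition above can be written as a formula.  The symbols are assumptions introduced here for illustration, not the kit's own notation:  f_attr(t) is the relative frequency of lexeme t within the attribute, and f_dom(t) is its relative frequency within the reference domain.

```latex
\[
  \mathrm{peculiarity}(t) \;=\; \sqrt{\frac{f_{\mathrm{attr}}(t)}{f_{\mathrm{dom}}(t)}}
\]
```

With purely hypothetical numbers:  if "computer" makes up 1% of the lexemes in an abstract and also 1% of a computer science domain, its peculiarity is the square root of 1, or 1.  Against an archaeology domain in which it makes up only 0.01% of the lexemes, the ratio is 100 and the peculiarity is 10.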

An IXAttributeParser parses any of three text formats:  Attribute Reader Format (ARF), RTF, or ASCII text (it prefers them in that order).  A parser determines a file's or stream's format by examining the type argument to a parse... or analyze... method.  If that type is ARF, RTF, or ASCII, the parser can simply start processing the text.  If not, the parser will examine the first few bytes of the text to see if it is, indeed, in one of the parsable formats; for example, if it finds "{\rtf" at the beginning of a stream, it assumes that the stream contains RTF.  Failing this, the parser will attempt to convert the text into one of the parsable formats using the filtering services provided by the Application Kit.  If the text can't be converted into a parsable format using the filtering services, the parser simply treats the file or stream as though it were ASCII, checking first for nonprintable characters; if there is a significant number of them on the first page (more than 1 in 16), the file or stream isn't parsed at all.  For example, if told to parse a WordPerfect document, the parser would attempt to convert the document from WordPerfect format to one of the three parsable formats.  If the document couldn't be converted, it would be parsed as ASCII, control words, formatting commands, and all (unless the document contained so many nonprintable characters that the parser regarded it as binary).
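The two content checks described above (the "{\rtf" signature and the 1-in-16 nonprintable threshold) might look roughly like the following in C.  This is a minimal sketch, not the kit's implementation:  the function names are hypothetical, the definition of "printable" (ASCII 0x20 through 0x7E plus common whitespace) is an assumption, and the caller is assumed to pass in whatever portion of the text counts as the "first page".

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: nonzero if the buffer begins with the RTF
 * signature "{\rtf". */
static int looks_like_rtf(const char *buf, size_t len)
{
    static const char sig[] = "{\\rtf";
    size_t n = sizeof sig - 1;          /* signature length without NUL */

    return len >= n && memcmp(buf, sig, n) == 0;
}

/* Hypothetical helper: nonzero if more than 1 byte in 16 is nonprintable.
 * "Printable" here means ASCII 0x20..0x7E plus tab, newline, carriage
 * return, and form feed; that definition is an assumption. */
static int looks_binary(const unsigned char *buf, size_t len)
{
    size_t i, bad = 0;

    for (i = 0; i < len; i++) {
        unsigned char c = buf[i];
        int printable = (c >= 0x20 && c < 0x7f) ||
                        c == '\t' || c == '\n' || c == '\r' || c == '\f';
        if (!printable)
            bad++;
    }
    return bad * 16 > len;              /* strictly more than 1 in 16 */
}
```

A caller would try looks_like_rtf() first, and fall back to looks_binary() only when no filter could convert the text.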

To attempt conversion of a file of type mytype, the parser will call the Application Kit function NXCreateTypedFileName() to generate a typed file-name pasteboard type.  Thus, the filter must declare this as its input type in a services file in order to be visible to the parser.  If no filter is found by this approach, and the file is readable, then the parser will attempt conversion a second time using the function NXCreateTypedFileContents() to generate a typed file contents pasteboard type.

When a parser isn't supplied for a class or method that needs one (for example, an IXFileFinder), a default parser is created, along with a default reader for the current user's preferred language, as set in the Preferences application.  NeXT ships language-specific IXLanguageReaders for all supported user languages in /NextLibrary/Readers.  These IXLanguageReaders are dynamically loaded into an application when needed.  Your code can get a reader for a specific language by sending the IXLanguageReader class object a readerForLanguage: message.  If the language is specified as "Default", the reader for the current user's preferred language is loaded.  If a reader for the requested language can't be found, the English reader is used by default.



Instance Variables

None declared in this class.



Method Types

Initializing an instance       init
Managing readers               setAttributeReaders:
                               getAttributeReaders:
Managing text stream types     understandsType:
                               addSourceType:
                               removeSourceType:
Managing parse options         setMinimumWeight:
                               minimumWeight
                               setPercentPassed:
                               percentPassed
                               setWeightingDomain:
                               weightingDomain
                               setWeightingType:
                               weightingType
Parsing text                   parseFile:ofType:
                               parseStream:ofType:
                               analyzeFile:ofType:
                               analyzeStream:ofType:
                               reset



Instance Methods

addSourceType:
addSourceType:(const char *)aType

Records the Pasteboard type or file extension aType as one of the types for which the IXAttributeParser will respond YES when sent an understandsType: message, and which the IXAttributeParser will attempt to parse. If an IXAttributeParser has had no source types added, or has had all source types removed with removeSourceType:, it acts as though it understands any type, and will parse any file or stream.  Returns self.
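The rule that an empty type list means "understands everything" is easy to get backwards, so here is a minimal C sketch of the bookkeeping.  The SourceTypes structure and the function names are hypothetical stand-ins, not part of the kit; ignoring duplicate additions and capping the list at a fixed size are assumptions made to keep the example short.  The list stores the caller's pointers, so the strings must outlive it.

```c
#include <string.h>

#define MAX_TYPES 16

/* Hypothetical stand-in for the parser's list of understood types. */
typedef struct {
    const char *types[MAX_TYPES];
    int count;
} SourceTypes;

/* Adds a type; duplicates and additions beyond MAX_TYPES are ignored. */
static void add_source_type(SourceTypes *s, const char *type)
{
    int i;

    for (i = 0; i < s->count; i++)
        if (strcmp(s->types[i], type) == 0)
            return;
    if (s->count < MAX_TYPES)
        s->types[s->count++] = type;
}

/* Removes a type by moving the last entry into its slot. */
static void remove_source_type(SourceTypes *s, const char *type)
{
    int i;

    for (i = 0; i < s->count; i++) {
        if (strcmp(s->types[i], type) == 0) {
            s->types[i] = s->types[--s->count];
            return;
        }
    }
}

/* An empty list means every type is understood. */
static int understands_type(const SourceTypes *s, const char *type)
{
    int i;

    if (s->count == 0)
        return 1;
    for (i = 0; i < s->count; i++)
        if (strcmp(s->types[i], type) == 0)
            return 1;
    return 0;
}
```

Note the consequence:  removing the last remaining type widens the set of understood types back to "anything" rather than narrowing it to nothing.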

See also:  removeSourceType:, understandsType:, analyzeFile:ofType:, analyzeStream:ofType:, parseFile:ofType:, parseStream:ofType:, Pasteboard class of the Application Kit



analyzeFile:ofType:
(NXStream *)analyzeFile:(const char *)filename ofType:(const char *)aType

Parses the contents of filename, and returns the contents of filename in Attribute Reader Format as produced by the IXAttributeParser's IXAttributeReaders.  If the IXAttributeParser doesn't understand the type aType, this method returns NULL.  Otherwise, aType is used to determine whether the contents of filename are in a parsable format (one of ARF, RTF, or ASCII), or if not, to locate a filter service that can convert the contents of filename.  Files that can't be converted into a parsable format are parsed as though they contained ASCII text, unless they contain a significant amount of nonprintable text (for example, control characters), in which case the file is assumed to be binary, and not parsed.

See also:  analyzeStream:ofType:, parseFile:ofType:, parseStream:ofType:, understandsType:, addSourceType:, Attribute Reader Format ("Other Features" section), Pasteboard class of the Application Kit



analyzeStream:ofType:
(NXStream *)analyzeStream:(NXStream *)stream ofType:(const char *)aType

Parses stream, and returns the contents of stream in Attribute Reader Format as read by the IXAttributeParser's IXAttributeReaders.  If the IXAttributeParser doesn't understand the pasteboard type aType, this method returns NULL.  Otherwise, aType is used to determine whether stream is in a parsable format (one of ARF, RTF, or ASCII), or if not, to locate a filter service that can convert the contents of stream.  Streams that can't be converted into a parsable format are parsed as though they contained ASCII text, unless a significant amount of the text is nonprintable, in which case the stream isn't parsed.

See also:  analyzeFile:ofType:, parseStream:ofType:, parseFile:ofType:, understandsType:, addSourceType:, Attribute Reader Format ("Other Features" section), Pasteboard class of the Application Kit



getAttributeReaders:
getAttributeReaders:(List *)aList

Empties aList, fills it with the IXAttributeReaders used by the IXAttributeParser, and returns it by reference.  The sender of this message may free the List, but not its contents.  Returns self.

See also:  setAttributeReaders:



init
init

Initializes a newly created IXAttributeParser, setting the percent passed to 100 and the weighting type to IX_NoWeighting. Returns self.

See also:  setPercentPassed:, setWeightingType:



minimumWeight
(unsigned int)minimumWeight

Returns the minimum weight required for a lexeme to be included in the attribute/value list.

See also:  setMinimumWeight:, percentPassed



parseFile:ofType:
parseFile:(const char *)filename ofType:(const char *)aType

Parses the contents of filename, and returns self.  If the IXAttributeParser doesn't understand the type aType, this method returns nil.  Otherwise, aType is used to determine whether the contents of filename are in a parsable format (one of ARF, RTF, or ASCII), or if not, to locate a filter service that can convert the contents of filename.  Files that can't be converted into a parsable format are parsed as though they contained ASCII text, unless a significant amount of the text is nonprintable, in which case the file isn't parsed.

See also:  parseStream:ofType:, analyzeFile:ofType:, analyzeStream:ofType:, understandsType:, addSourceType:, Pasteboard class of the Application Kit



parseStream:ofType:
parseStream:(NXStream *)stream ofType:(const char *)aType

Parses stream, and returns self.  If the IXAttributeParser doesn't understand the type aType, this method returns nil.  Otherwise, aType is used to determine whether stream is in a parsable format (one of ARF, RTF, or ASCII), or if not, to locate a filter service that can convert the contents of stream.  Streams that can't be converted into a parsable format are parsed as though they contained ASCII text, unless a significant amount of the text is nonprintable, in which case the stream isn't parsed.

See also:  parseFile:ofType:, analyzeStream:ofType:, analyzeFile:ofType:, understandsType:, addSourceType:, Pasteboard class of the Application Kit



percentPassed
(unsigned int)percentPassed

Returns the percentage of the lexemes for each attribute that will be included in the result of a parse.  Lexemes are ranked by weight, and only the heaviest ones, up to this percentage of the total, are included.

See also:  setPercentPassed:, minimumWeight



removeSourceType:
removeSourceType:(const char *)aType

Removes the pasteboard type or file extension aType from the IXAttributeParser's list of understood types.  The IXAttributeParser will respond NO to subsequent understandsType: messages with aType as the argument, and won't parse files or streams of that type.  Returns self.

See also:  addSourceType:, understandsType:, Pasteboard class of the Application Kit



reset
reset

Clears the state built up by parsing a file or stream, preparing the IXAttributeParser to analyze a different file or stream.  It is possible to combine multiple streams or files by parsing them in sequence without resetting the IXAttributeParser, in which case the results accumulate in the attribute/value list.  Returns self.
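The difference between accumulating and resetting can be pictured with a toy histogram in C.  This is only an illustrative sketch:  the Histogram type and its functions are hypothetical, the kit's real attribute/value list is not directly accessible, and the fixed table size is an assumption.  The table stores the caller's pointers, so the strings must outlive it.

```c
#include <string.h>

#define MAX_LEXEMES 64

/* Hypothetical stand-in for the per-attribute lexeme histogram. */
typedef struct {
    const char  *lexeme[MAX_LEXEMES];
    unsigned int count[MAX_LEXEMES];
    int          used;
} Histogram;

/* Counts one occurrence of a lexeme; new lexemes are dropped once the
 * table is full. */
static void histogram_add(Histogram *h, const char *lexeme)
{
    int i;

    for (i = 0; i < h->used; i++) {
        if (strcmp(h->lexeme[i], lexeme) == 0) {
            h->count[i]++;
            return;
        }
    }
    if (h->used < MAX_LEXEMES) {
        h->lexeme[h->used] = lexeme;
        h->count[h->used] = 1;
        h->used++;
    }
}

/* Returns the number of occurrences recorded so far, or 0 if none. */
static unsigned int histogram_count(const Histogram *h, const char *lexeme)
{
    int i;

    for (i = 0; i < h->used; i++)
        if (strcmp(h->lexeme[i], lexeme) == 0)
            return h->count[i];
    return 0;
}

/* Discards everything accumulated so far. */
static void histogram_reset(Histogram *h)
{
    h->used = 0;
}
```

Feeding lexemes from two documents into the same Histogram without calling histogram_reset() in between yields combined counts, which mirrors parsing files in sequence without sending reset.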

See also:  analyzeFile:ofType:, analyzeStream:ofType:, parseFile:ofType:, parseStream:ofType:



setAttributeReaders:
setAttributeReaders:(List *)aList

Establishes the objects in aList as the IXAttributeReaders used by the IXAttributeParser, and frees any of the previous set of IXAttributeReaders that the IXAttributeParser will no longer use.  The List must contain instances of IXAttributeReader or a subclass.  Readers will be used on a stream of text in the order they appear in the List.  Returns self.

See also:  getAttributeReaders:



setMinimumWeight:
setMinimumWeight:(unsigned int)anInt

Sets the minimum weight required for inclusion in the parse result.  For example, setting the minimum weight to 10 causes all lexemes with weight less than 10 to be dropped from the result of a parse.  Returns self.

The IXAttributeParser uses only one of minimum weight or percent passed.  If the minimum weight is set, the percent passed is reset to 100; if the percent passed is set, the minimum weight is reset to 0.
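The way each setter resets the other option can be sketched in C as follows.  The ParseOptions structure and function names are hypothetical; the initial percent passed of 100 comes from the description of init, while the initial minimum weight of 0 is an assumption.

```c
/* Hypothetical stand-in for the parser's two filtering options. */
typedef struct {
    unsigned int minimum_weight;
    unsigned int percent_passed;
} ParseOptions;

/* Starts with both filters effectively off. */
static void options_init(ParseOptions *o)
{
    o->minimum_weight = 0;      /* assumed initial value */
    o->percent_passed = 100;    /* documented initial value */
}

/* Selecting a minimum weight switches the percent filter off. */
static void set_minimum_weight(ParseOptions *o, unsigned int weight)
{
    o->minimum_weight = weight;
    o->percent_passed = 100;
}

/* Selecting a percentage switches the minimum-weight filter off. */
static void set_percent_passed(ParseOptions *o, unsigned int percent)
{
    o->percent_passed = percent;
    o->minimum_weight = 0;
}
```

Whichever setter is called last wins, so code that needs both values should read them back after the final call instead of assuming earlier settings survive.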

See also:  minimumWeight, setPercentPassed:



setPercentPassed:
setPercentPassed:(unsigned int)anInt

Sets the percentage of lexemes for a given attribute that will be included in the result of a parse.  Lexemes are ranked by weight, and only the heaviest ones, up to this percentage of the total, are included.  For example, setting this value to 25 would include the top quarter of the lexemes in the parse result; if there were 2000 lexemes, the 500 heaviest would be included.

The IXAttributeParser uses only one of minimum weight or percent passed.  If the minimum weight is set, the percent passed is reset to 100; if the percent passed is set, the minimum weight is reset to 0.

Returns self.
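The percent-passed selection amounts to ranking lexemes by weight and keeping a fixed share from the top.  A minimal C sketch follows, under several assumptions:  the function names are hypothetical, fractional counts are truncated, ties at the cutoff are broken arbitrarily by the sort, and a percentage above 100 is clamped.  None of these details is specified by the kit.

```c
#include <stddef.h>
#include <stdlib.h>

/* qsort comparator that orders weights heaviest first. */
static int by_weight_desc(const void *a, const void *b)
{
    unsigned int x = *(const unsigned int *)a;
    unsigned int y = *(const unsigned int *)b;

    return (x < y) - (x > y);
}

/* Sorts weights heaviest first and returns how many of them pass;
 * the passing weights are weights[0] through weights[result - 1]. */
static size_t keep_top_percent(unsigned int *weights, size_t count,
                               unsigned int percent)
{
    if (percent > 100)
        percent = 100;                  /* clamping is an assumption */
    qsort(weights, count, sizeof weights[0], by_weight_desc);
    return count * percent / 100;       /* truncates toward zero */
}
```

With 2000 weights and a percentage of 25, the function returns 500, matching the example in the method description.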

See also:  percentPassed, setMinimumWeight:



setWeightingDomain:
setWeightingDomain:(IXWeightingDomain *)aDomain

Sets the weighting domain used by the IXAttributeParser to aDomain, and returns self.  The weighting domain is used to assign peculiarity weights to lexemes for a given attribute; the frequency of the lexeme within the attribute is divided by the frequency of the lexeme in the domain, and the square root of that ratio is the lexeme's peculiarity.  This is only done when the IXAttributeParser's weighting type is IX_PeculiarityWeighting.

See also:  weightingDomain, setWeightingType:



setWeightingType:
setWeightingType:(IXWeightingType)anInt

Sets the weighting type used by the IXAttributeParser to anInt and returns self.  The weighting type is used to determine how to calculate lexeme weights, and may be one of the following values:

IX_NoWeighting
IX_AbsoluteWeighting
IX_FrequencyWeighting
IX_PeculiarityWeighting

IX_NoWeighting means that all lexemes are assigned a weight of 0.  With IX_AbsoluteWeighting, each lexeme is assigned a weight equal to the number of times it occurs within the attribute.  IX_FrequencyWeighting results in each lexeme being weighted by relative frequency of occurrence:  the number of times it occurs in the attribute divided by the total number of lexemes in the attribute.  IX_PeculiarityWeighting uses a weighting domain to calculate a frequency relative to some large body of text; the final weight of a lexeme is the square root of the ratio of its frequency in the attribute to its frequency in the domain.  IX_PeculiarityWeighting is useful for lowering the significance of lexemes that are common in a particular set of texts.
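The two count-based weightings can be illustrated with a short C sketch.  The occurrence counts themselves are the absolute weights; the hypothetical function below derives the frequency weights from them.  Representing weights as double is an assumption made for clarity (the kit's accessors report weights as unsigned int, and how it scales them is not described here).

```c
#include <stddef.h>

/* Fills weights[i] with counts[i] divided by the sum of all counts,
 * and returns that sum.  counts[i] is the number of occurrences of
 * lexeme i within one attribute, which is also its absolute weight. */
static unsigned long frequency_weights(const unsigned int *counts,
                                       double *weights, size_t n)
{
    unsigned long total = 0;
    size_t i;

    for (i = 0; i < n; i++)
        total += counts[i];
    for (i = 0; i < n; i++)
        weights[i] = total ? (double)counts[i] / (double)total : 0.0;
    return total;
}
```

For an attribute whose three lexemes occur 2, 1, and 1 times, the absolute weights are 2, 1, and 1, and the frequency weights are 0.5, 0.25, and 0.25.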

See also:  weightingType, setWeightingDomain:



understandsType:
(BOOL)understandsType:(const char *)aType

Returns YES if the IXAttributeParser will parse files of the pasteboard type or file extension aType, NO if not.  If no types have been added with addSourceType:, or if all types added have been removed with removeSourceType:, this method always returns YES.

See also:  addSourceType:, removeSourceType:, Pasteboard class of the Application Kit



weightingDomain
(IXWeightingDomain *)weightingDomain

Returns the weighting domain used by the IXAttributeParser, or nil if there is none.

See also:  setWeightingDomain:, setWeightingType:



weightingType
(IXWeightingType)weightingType

Returns the weighting type used by the IXAttributeParser.  See setWeightingType: for a list of the possible values and their meanings.

See also:  setWeightingType:, setWeightingDomain: