Copyright ©1995 by NeXT Computer, Inc.  All Rights Reserved.




The Indexing Kit Query Language

The Indexing Kit defines a declarative query language for expressing assertions about objects.  The query language can be used to select objects from collections such as those in an IXFileFinder or IXRecordManager.  It can also be used to test assertions against attributable objects such as instances of IXAttributeParser or IXFileRecord.



Queries

Queries are assertions comprising simple predicates and boolean operators.  A predicate is made of literal values, operators, and attributes.  An attribute is a value provided by the object being examined.  For example, in a set of students, one attribute might be the student's age, another might be his or her home state, and another his or her grade point average.  Some informal predicates based on these attributes are:

Age is greater than 20.
Grade point average is 4.0.
Home state is Michigan.

These predicates also use literal values (20, 4.0, and "Michigan") and operators (greater than and equals).  The possible literal values and operators are defined by the query language, but attributes are defined by the programmer and represent the return values of Objective C methods that take no arguments and return a single scalar value (an integer, floating-point number, or null-terminated string).

Combining the predicates listed above with logical operators to form logical expressions gives:

Age is greater than 20 and grade point average is 4.0.
Age is over 20 or home state is Michigan.
Home state is Michigan and grade point average is not 4.0.
Age is greater than 20 and home state is Michigan, and grade point average is not 4.0.

A query expression is evaluated against an evaluation context that supplies the objects to be examined by the query.  The result of the query is the object or set of objects from the evaluation context that satisfy the query expression.  During query evaluation, each object supplied by the context is examined:  The attributes named in the expression are bound to the attributes of the object, and the assertion is evaluated.  If the assertion proves true, the object is included in the result.

Note that all of the values in a query can be bound (that is, they can be constant values), but this results in an expression that's either always true or always false.  An expression that's always true will select all objects supplied by the evaluation context, while one that's always false will select none.  For example, "5 is greater than 3" is always true, so a query consisting of this assertion will always select every object supplied by the context.



Types

There are four data types in the query language:  boolean, number, string, and vector.  A type called regular expression is also defined as a subtype of string.  There are no type constructors, so it isn't possible to define new types.

The first three types, boolean, number, and string, are scalar.  Scalar values are coerced to other types as needed.  Numbers can become strings, as generated by sprintf(); they can also become boolean values in the manner of C:  0 is false, any other value is true.  When passed to an arithmetic or boolean operator, a string is scanned for a numeric representation, or converted to 0 if none is found.  Boolean false is coerced to the number 0, while boolean true is coerced to the number 1.

A regular expression is a string interpreted as a pattern for matching other strings.  Two forms of regular expression notation are provided: the standard regular expression notation, as described in the UNIX manual page for ed(1), and the Bourne shell notation for specifying patterns in filenames, as described in the UNIX manual page for sh(1).

The only composite type in the query language is the vector.  A vector is a weighted set of unique scalar values, usually strings. Vectors often arise from the evaluation of attributes, since attribute values are typically generated by attribute parsers from unstructured text.  For more information on attribute parsers, see the IXAttirbuteParser class specification.



Symbols

The Indexing Kit query language defines three kinds of symbols:  literals, operators, and attributes.  Symbols are delimited by white space and commas, as well as by the parentheses associated with operators.  The query language is case-sensitive, except for strings given to the parse() operator (see below).

A literal is a scalar value, like a number or a string, that's specified in the query.  Literal values can be expressed for all of the query language's data types.  The boolean literals are yes, true, no, and false.  Numbers can simply be written in integer or floating-point form:  0, 1, 27.35.  Strings are indicated by balanced pairs of quotation marks (single or double), or by the quote() operator (described below).  To include a closing delimiter in a string, precede it with a backslash.  For example:

"This is a string with a \" quotation mark in it"
'This is \'too'
quote(This string contains an embedded \) parenthesis)

Other escape sequences allowed in strings are:  `\n' (newline), `\r' (carriage return), `\t' (tab), `\f' (form feed), `\b' (space), `\\' (backslash), and `\nnn', where n is any octal digit.  Vectors of strings may be introduced with the parse() operator.

Operators denote actions that are performed during the compilation or evaluation of a query, like adding two numbers or parsing a string.  Operators are formed with a name and a pair of parentheses, much like functions in C.  A fixed set of operators is defined by the query language; there is no way to define a new operator.

Attributes are values supplied by the object being examined; an attribute name may be supplied wherever a literal value is expected.  An attribute is associated with an Objective C method selector; the attribute's name is bound to a value by sending the selector to the object being examined.  Legal attribute names include any otherwise undefined symbols; that is, any unquoted alphanumeric string that isn't an operator name or object name.  By convention, they usually begin with a capital letter. Content, Age, and Publisher are examples of legal attribute names.  Content is a special attribute, described in the section "Predefined Attributes."



Operators

Operators take one or more arguments and produce a single result.  The query language operators fall into six categories: literal, projection, arithmetic, relational, boolean, and search.  The behavior of an operator often depends on the types of its arguments, which must be enclosed in parentheses and delimited by white space or punctuation.

A literal operator takes a single argument and produces a literal result.  quote(), as mentioned above, creates a string from the text in its parentheses.  regex() and shell() both transform the text between their parentheses into regular expressions.  regex() builds regular expressions from Berkeley regular expression strings, and shell() builds regular expressions from Bourne shell expansion notation.  For example, the following two expressions result in the same regular expression:

regex(term.*)
shell(term*)

The last transform operator is parse(), which produces a vector of strings derived from its argument text.  The parse() operator is analogous to the IXAttributeParser class; in fact, the query language implements the parse() operator with IXAttributeParser. By default, parse() folds the case of its operand; this can be turned off with IXAttributeReader's setFoldsCase: method.  See the IXAttributeParser and IXAttributeReader class specifications for more information.

There's only one projection operator, project().  It takes two arguments, the second of which is the result of a parse() expression.  The first argument is an attribute, which the project() operator extracts from the parsed text.  In the example below, the project() operator extracts the Name attribute from the Attribute Reader Format text:

project(Name parse({\\rtf0 {\\zd Name}{\\\z Jane Draper\\za1}}))

In the absence of the project() operator, the Default attribute is extracted from the result of a parse() by whatever operator takes it as an argument.

add(), sub(), mul(), div(), and neg() are the arithmetic operators:  add, subtract, multiply, divide, and negate, respectively.  All of these, except for neg(), require two scalar arguments; neg() allows only one.  These operators coerce their arguments to double-precision floating-point numbers, and produce a numeric result.

The relational operators are gt(), ge(), eq(), ne(), lt(), and le():  greater than, greater than or equal to, equal, not equal, less than, and less than or equal to, respectively.  These operators perform the expected comparisons with numbers and booleans, and lexical comparisons on strings, so that, for example, "graze" is less than "style."  All of these operators produce a boolean result, and require two scalar arguments.  eq() and ne() perform pattern-matching when one of their arguments is a regular expression.  When operands of differing types are passed to a relational operator, both operands are coerced to numbers.

The boolean operators perform logical operations on scalar values, or on the results of search operations.  When applied to scalar values, they coerce the values to boolean type and produce the appropriate logical result.  The or() and and() operators perform logical disjunction and conjunction, respectively.  With a single scalar argument, not() performs logical inversion.  With two scalar arguments, not() is a shorthand for and() with the second argument logically inverted; that is, not(a b) is equivalent to and(a not(b)).  The application of boolean operators to search operations is described below.

The example predicates given earlier can be expressed in the query language using the relational and set operators.  Naming the attributes Age, GPA, and HomeState, we have:

gt(Age 20)
eq(GPA 4.0)
eq(HomeState "Michigan")

Combining these predicates with boolean operators, we get the following compound assertions:

and(gt(Age 20), eq(GPA 4.0))
or(gt(Age 20), eq(HomeState "Michigan")
not(eq(HomeState "Michigan"), eq(GPA 4.0))
not(and(gt(Age 20), eq(HomeState "Michigan")), eq(GPA 4.0))

The last group of operators, the search operators, provide a general and powerful searching capability.  They take either one or two arguments, which may be strings or vectors of strings, and search for a subject string (or strings) in a target body of text. When two arguments are supplied, the first is the target of the search, and the second is the subject.  When one argument is supplied, the Default attribute of the object being examined is the target of the search, and the argument is the subject.

The three search operators are whole(), prefix(), and within().  Each matches strings in a different way.  whole() indicates that full-term matches are desired:  the desired string will only match one in the attribute if they match exactly from beginning to end.  prefix() indicates that the requested value need only match the beginning part of the attribute's value.  within() performs a substring search:  a value can match any portion of the attribute's value.

Here are some examples of search expressions involving strings (HomeState is assumed to be a string-valued attribute):

prefix(HomeState "Mi")
whole(HomeState regex(Mi.*))

The expressions above all search for objects whose HomeState begins with "Mi"--this can be used to find abbreviations of state names as well as full names.  They will match values like "Michigan" and "Mississippi," but will also match any word beginning with "Mi," like "Misery."  For this particular example, a more careful search would be better:

or(eq(HomeState "Michigan"), eq(HomeState "Mi"))
or(whole(HomeState "Michigan"), whole(HomeState parse(Mi))

These two examples explicitly check for the full name of the state, or for its abbreviation.  The first example has the weakness of being case-sensitive; the abbreviation must be exactly "Mi" to match.  The second example, by using the parse() operator, takes advantage of case folding (which is done by default, but can be turned off), so that "Mi", "MI", and "mi" will all match.

When the target of a search is a vector and the subject is a string, the search is satisfied by a match between the subject and one or more elements of the target.  When the target is a string and the subject is a vector, the search requires a match between the target and zero, one, or all of the elements of the subject; the desired interpretation is indicated by applying a unary boolean operator to the search expression, as follows:

applying unary not() requires that none of the subject elements match
applying unary or() requires that at least one of the subject elements match
applying unary and() requires that all of the subject elements match

In the absence of an explicit application of a unary boolean operator, and() is assumed.  When both the target and subject are vectors, the search requires a match between at least one element of the target and zero, one, or all of the elements of the subject; the desired interpretation is indicated by applying one of the unary boolean operators, as mentioned above.

Consider the following expressions:

whole(parse(small dog))
whole(Default and(parse(small dog)))
whole(Pets parse(small dog))

The first two expressions above are equivalent.  They search for objects whose Default attribute contains both "small" and "dog."  The third is similar, except that it searches the Pets attribute.

whole(or(parse(small dog)))
whole(Pets or(parse(small dog)))

The expressions above search for objects containing the words "small" or "dog"; only one need be present.  The first expression searches the Default attribute, while the second searches the Pets attribute.

prefix(parse(small dog))
prefix(or(parse(small dog)))
within(parse(small dog))
within(or(parse(small dog)))

The prefix() expressions above search for objects whose Default attribute contains words beginning with "small" and "dog"--for example, "smallpox" or "doggerel."  The within() expressions search for objects containing the sequences "small" or "dog" as substrings--words like "smallish," "dismally," "endogen," or "underdog."  The first example of each pair finds objects that have a match for each word, while the second finds objects that have a match for either.

within("The beasts of the Mohave desert, the Russian steppes,
and the Serengeti plain" Location)

The example above illustrates a different kind of search, in which objects whose Location attribute is a substring of the string on the left (such as "Russian steppes").  This is useful for finding objects related to a particular fragment of text.

The search operators differ from the other operators in that, as a side effect, they may assign weights to the objects returned by the query, as a measure of probable relevance.  That is, in addition to producing a boolean result used in evaluating the assertion, a search operator may produce a weight providing an heuristic measure of the likely relevance of the match.  The weights are derived from vector targets; weights for the results of several search operators are averaged.

A variation on the weighting algorithm is used when subject terms occur more than once within a parse() operator.  In that case, weights are assigned to the subject terms, and are used to filter the weights derived from the target.  The algorithm used for weighting the subject terms is determined by the weighting type assigned to the IXAttributeParser used to process the parse() operator.  One particularly useful application of this mechanism is similarity searching.  To begin with, a large piece of representative text is placed within a parse() as the search subject, and a unary or() is applied to the search result.  The query will select all objects containing matches for one or more of the subject terms, and the target vector weightings are filtered by the subject vector weightings.  This is semantically equivalent to the statistical correlation searching performed by numerous commercial products under the name "similarity searching."



Predefined Attributes

The query language implicitly defines one attribute, Default.  Any tokens that aren't explicitly assigned to another attribute are assigned to Default by IXAttributeParser.  Parsing can generate many attributes based on keywords, such as Title, Author, PublicationDate, and so on.  None of these is guaranteed to be generated for a body of text, so Default covers the general case.

If a query expression is evaluated against an IXFileFinder, the expression can also use the Content attribute.  Content is defined as the entire unparsed contents of a file, and can be used for literal substring search with an IXFileFinder.  Here are some examples of its use:

eq(Content "small dog")
whole(Content "small dog")

The expressions above are equivalent, and search for files whose contents are exactly "small dog".

prefix(Content "small dog")
within(Content "small dog")

The expressions above search for the literal phrase "small dog" at the beginning of the file and anywhere in the file, respectively.

The IXFileRecord class also defines a set of attributes, which can be specified in a query against an IXFileFinder or against a single IXFileRecord.  The Content attribute isn't available in queries evaluated against an IXFileRecord, as that class doesn't have any way of actually reading the file's contents.  Here are the attributes defined for IXFileRecords:

Attribute Name Description & Type
FileName The file's relative pathname (string)
BaseName The last component of the file's pathname (string)
FileType The file's type (string); for example, "rtf"
FileDevice The file's logical device number (number)
FileInode The file's inode number (number)
FileMode The file's permissions (number)
FileCount The number of hard links to the file (number)
FileOwner The file owner's user id (number)
FileGroup The file group's group id (number)
FileSize The file's physical size, in bytes (number)
AccessTime The time the file was last accessed (number)
ModifyTime The time the file's contents were last changed (number)
ChangeTime The time the file's status was last changed (number)
UnixType The file type, as defined in sys/stat.h (number)