W3C

XQuery 1.0 and XPath 2.0 Full-Text

W3C Working Draft 1 May 2006

This version:
http://www.w3.org/TR/2006/WD-xquery-full-text-20060501/
Latest version:
http://www.w3.org/TR/xquery-full-text/
Previous versions:
http://www.w3.org/TR/2005/WD-xquery-full-text-20051103/
http://www.w3.org/TR/2005/WD-xquery-full-text-20050915/
http://www.w3.org/TR/2005/WD-xquery-full-text-20050404/
http://www.w3.org/TR/2004/WD-xquery-full-text-20040709/
Editors:
Sihem Amer-Yahia, AT&T Labs - Research <sihem@research.att.com>
Chavdar Botev, Invited Expert <cbotev@cs.cornell.edu>
Stephen Buxton, Mark Logic Corporation <stephen.buxton@marklogic.com>
Pat Case, Library of Congress <pcase@crs.loc.gov>
Jochen Doerre, IBM <doerre@de.ibm.com>
Mary Holstege, Mark Logic Corporation <mary.holstege@marklogic.com>
Darin McBeath, Elsevier <D.McBeath@elsevier.com>
Michael Rys, Microsoft <mrys@microsoft.com>
Jayavel Shanmugasundaram, Invited Expert <jai@cs.cornell.edu>

This document is also available in these non-normative formats: XML.


Abstract

This document defines the syntax and formal semantics of XQuery 1.0 and XPath 2.0 Full-Text, a language that extends XQuery 1.0 [XQuery 1.0: An XML Query Language] and XPath 2.0 [XML Path Language (XPath) 2.0] with full-text search capabilities.

Status of this Document

This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.

This is a public W3C Working Draft for review by W3C members and other interested parties. Publication as a Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.

This is the fifth version of this document. Since the last version was published, several technical and editorial changes have been made to all sections of the document. Among the most significant changes are: the addition of a section describing the processing model for full-text search and how it integrates with the XQuery Processing Model; the reformulation of the AllMatches model so that a primitive match (TokenInfo) can now represent an interval of token positions and, hence, a match of a phrase (in the former version, phrases were modeled using distance constraints, which had certain unwanted implications when distance operators were explicitly applied to phrases); the restriction of the FTTimes operation to simple FTSelections; and several simplifications in the semantics functions that the latter two changes made possible, such as the removal of the AllMatches normalization. The XQuery functions that are used to define the semantics of the full-text operations have been thoroughly revised and are now syntax- and type-checked.

This document has been produced following the procedures set out for the W3C Process. This document was produced through the efforts of the XML Query Working Group and the XSL Working Group (both part of the XML Activity). It is designed to be read in conjunction with the following documents: W3C XQuery and XPath Full-Text Requirements [XQuery and XPath Full-Text Requirements] and the W3C XQuery Full-Text Use Cases [XQuery 1.0 and XPath 2.0 Full-Text Use Cases].

Public comments on this document and its open issues are invited. Comments should be entered into the issue tracking system for this specification (instructions can be found at http://www.w3.org/XML/2005/04/qt-bugzilla). If access to that system is not feasible, you may send your comments to the W3C mailing list, public-qt-comments@w3.org (http://lists.w3.org/Archives/Public/public-qt-comments/) with "[FT]" at the beginning of the subject field of email messages involving such comments.

This document was produced by groups operating under the 5 February 2004 W3C Patent Policy. W3C maintains public lists of any patent disclosures made in connection with the deliverables of the XML Query Working Group and the XSL Working Group; those pages also include instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.

Table of Contents

1 Introduction
    1.1 Full-Text Search and XML
    1.2 Organization of this document
    1.3 A word about namespaces
2 Full-Text Extensions to XQuery and XPath
    2.1 Processing Model
    2.2 Expression FTContainsExpr
        2.2.1 FTContainsExpr Description
        2.2.2 FTContainsExpr Examples
    2.3 Score Variables
        2.3.1 Using Weights Within a Scored FTContainsExpr
    2.4 Extensions to the Static Context
3 FTSelections
    3.1 Full-Text Operators
        3.1.1 FTWords
        3.1.2 FTOr
        3.1.3 FTAnd
        3.1.4 FTMildNot
        3.1.5 FTUnaryNot
        3.1.6 FTOrder
        3.1.7 FTScope
        3.1.8 FTDistance
        3.1.9 FTWindow
        3.1.10 FTTimes
        3.1.11 FTContent
    3.2 FTMatchOptions
        3.2.1 FTCaseOption
        3.2.2 FTDiacriticsOption
        3.2.3 FTStemOption
        3.2.4 FTThesaurusOption
        3.2.5 FTStopwordOption
        3.2.6 FTLanguageOption
        3.2.7 FTWildCardOption
    3.3 FTIgnoreOption
4 Semantics
    4.1 Tokenization
        4.1.1 Examples
        4.1.2 Representations of Tokenized Text and Matching
    4.2 Evaluation of FTSelections
        4.2.1 AllMatches
            4.2.1.1 Formal Model
            4.2.1.2 Examples
            4.2.1.3 XML representation
        4.2.2 FTSelections
            4.2.2.1 XML Representation
            4.2.2.2 The evaluate function
            4.2.2.3 Formal semantics functions
            4.2.2.4 FTWords
            4.2.2.5 FTOr
            4.2.2.6 FTAnd
            4.2.2.7 FTUnaryNot
            4.2.2.8 FTMildNot
            4.2.2.9 FTOrder
            4.2.2.10 FTScope
            4.2.2.11 FTContent
            4.2.2.12 FTDistance
            4.2.2.13 FTWindow
            4.2.2.14 FTTimes
        4.2.3 Match Options Semantics
            4.2.3.1 Types
            4.2.3.2 High-Level Semantics
            4.2.3.3 Formal Semantics Functions
            4.2.3.4 FTCaseOption
            4.2.3.5 FTDiacriticsOption
            4.2.3.6 FTStemOption
            4.2.3.7 FTThesaurusOption
            4.2.3.8 FTStopWordOption
            4.2.3.9 FTLanguageOption
            4.2.3.10 FTWildCardOption
    4.3 XQuery 1.0 and XPath 2.0 Full-Text and Scoring Expressions
        4.3.1 FTContainsExpr
            4.3.1.1 Semantics of FTContainsExpr
        4.3.2 Scoring
        4.3.3 Example

Appendices

A EBNF for XQuery 1.0 Grammar with Full-Text extensions
    A.1 Terminal Symbols
B EBNF for XPath 2.0 Grammar with Full-Text extensions
    B.1 Terminal Symbols
C Static Context Components
D Error Conditions
E References
    E.1 Normative References
    E.2 Non-normative References
F Acknowledgements (Non-Normative)
G Glossary (Non-Normative)
H Checklist of Implementation-Defined Features (Non-Normative)
I Change Log (Non-Normative)


1 Introduction

This document defines the language and the formal semantics of XQuery 1.0 and XPath 2.0 Full-Text. This language is designed to meet the requirements identified in W3C XQuery and XPath Full-Text Requirements [XQuery and XPath Full-Text Requirements] and to support the queries in the W3C XQuery Full-Text Use Cases [XQuery 1.0 and XPath 2.0 Full-Text Use Cases].

XQuery 1.0 and XPath 2.0 Full-Text extends the syntax and semantics of XQuery 1.0 and XPath 2.0.

1.1 Full-Text Search and XML

As XML becomes mainstream, users expect to be able to search their XML documents. This requires a standard way to do full-text search, as well as structured searches, against XML documents. A similar requirement for full-text search led ISO to define the SQL/MM-FT [SQL/MM] standard. SQL/MM-FT [SQL/MM] defines extensions to SQL to express full-text searches, providing functionality similar to that of this full-text language extension to XQuery 1.0 and XPath 2.0.

XML documents may contain highly-structured data (numbers, dates), unstructured data (untagged free-flowing text), and semi-structured data (text with embedded tags). Where a document contains unstructured or semi-structured data, it is important to be able to search using Information Retrieval techniques such as scoring and weighting.

Full-text search is different from substring search in many ways:

  1. A full-text search searches for tokens and phrases rather than substrings. A substring search for news items that contain the string "lease" will return a news item that contains "Foobar Corporation releases the 20.9 version ...". A full-text search for the token "lease" will not.

  2. There is an expectation that a full-text search will support language-based searches which substring search cannot. An example of a language-based search is "find me all the news items that contain a token with the same linguistic stem as 'mouse'" (finds "mouse" and "mice"). Another example, based on token proximity, is "find me all the news items that contain the tokens 'XML' and 'Query', allowing up to 3 intervening words".

  3. Full-text search must address the vagaries and nuances of language. Search results are often of varying usefulness. When you search a web site for cameras that cost less than $100, this is an exact search. There is a set of cameras that matches this search, and a set that does not. Similarly, when you do a string search across news items for "mouse", there is only one expected result set. When you do a full-text search for all the news items that contain the token "mouse", you probably expect to find news items containing the token "mice", and possibly "rodents", or possibly "computers". Not all results are equal. Some results are more "mousey" than others. Because full-text search may be inexact, we have the notion of score or relevance. We generally expect to see the most relevant results at the top of the results list.

    As XQuery and XPath evolve, they may apply the notion of score to querying structured data. For example, when making travel plans or shopping for cameras, it is sometimes useful to get an ordered list of near matches in addition to exact matches. If XQuery and XPath define a generalized inexact match, we expect XQuery and XPath to utilize the scoring framework provided by XQuery and XPath Full-Text.
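The contrast between substring search and token search described in item 1 can be illustrated with a minimal Python sketch. The `tokenize` function is a hypothetical tokenizer (lowercased runs of letters and digits) chosen only for illustration; actual tokenization is implementation-defined, and the function names are not part of this specification.

```python
import re

def tokenize(text):
    # Hypothetical tokenizer: lowercased runs of letters and digits.
    return re.findall(r"[a-z0-9]+", text.lower())

def substring_search(text, needle):
    # Matches the needle anywhere inside the character string.
    return needle.lower() in text.lower()

def token_search(text, needle):
    # Matches only when the needle equals a whole token.
    return needle.lower() in tokenize(text)

news_item = "Foobar Corporation releases the 20.9 version"
substring_hit = substring_search(news_item, "lease")  # "lease" occurs inside "releases"
token_hit = token_search(news_item, "lease")          # no token equals "lease"
```

Under this sample tokenization the substring search reports a hit while the token search does not, which is the behavior described for the "lease" example.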

The following definitions apply to full-text search:

  1. [Definition: Full-text queries are performed on tokens and phrases. Tokens and phrases are produced via tokenization.] Informally, tokenization breaks a character string into a sequence of words, units of punctuation, and spaces.

  2. [Definition: A token is defined as a character, n-gram, or sequence of characters returned by a tokenizer as a basic unit to be searched. Each instance of a token consists of one or more consecutive characters. Beyond that, tokens are implementation-defined.] Note that consecutive tokens need not be separated by either punctuation or space, and tokens may overlap. [Definition: A phrase is an ordered sequence of any number of tokens. Beyond that, phrases are implementation-defined.]

    Note:

    In some natural languages, tokens and words can be used interchangeably.

  3. Tokenization enables functions and operators that operate on a part or the root of the token (e.g., wildcards, stemming).

    Tokenization enables functions and operators which work with the relative positions of tokens (e.g., proximity operators).

    Tokenization also uniquely identifies sentences and paragraphs in which tokens appear. [Definition: A sentence is an ordered sequence of any number of tokens. Beyond that, sentences are implementation-defined. A tokenizer is not required to support sentences.] [Definition: A paragraph is an ordered sequence of any number of tokens. Beyond that, paragraphs are implementation-defined. A tokenizer is not required to support paragraphs.] Whatever a tokenizer for a particular language chooses to do, it must preserve the containment hierarchy: paragraphs contain sentences which contain tokens.

    The tokenizer has to evaluate two equal strings in the same way, i.e., it should identify the same tokens. Everything else is implementation-defined.

  4. This specification focuses on functionality that serves all languages. It also selectively includes functionalities useful within specific families of languages. For example, searching within sentences and paragraphs is useful to many western languages and to some non-western languages, so that functionality is incorporated into this specification.

  5. Some XML elements represent semantic markup, e.g., <title>. Others represent formatting markup, e.g., <b> to indicate bold. Semantic markup serves well as token boundaries, while formatting markup sometimes does not. Implementations are free to provide implementation-defined ways to differentiate between the markup's effect on token boundaries during tokenization.
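As an illustration of how relative token positions support proximity operators, the following Python sketch checks whether two tokens occur with at most a given number of intervening tokens. The tokenizer and the function `within_distance` are assumptions made for this example only; they are not defined by this specification.

```python
import re

def tokenize(text):
    # Hypothetical tokenizer: lowercased runs of letters and digits.
    return re.findall(r"[a-z0-9]+", text.lower())

def within_distance(text, first, second, max_intervening):
    tokens = tokenize(text)
    first_positions = [i for i, t in enumerate(tokens) if t == first.lower()]
    second_positions = [i for i, t in enumerate(tokens) if t == second.lower()]
    # Two tokens at positions p and q have |p - q| - 1 tokens between them.
    return any(abs(p - q) - 1 <= max_intervening
               for p in first_positions
               for q in second_positions)

near = within_distance("XML and the new Query language", "XML", "Query", 3)
far = within_distance("XML is a markup language and Query is a verb", "XML", "Query", 3)
```

The sketch ignores sentence and paragraph boundaries and treats the two tokens as unordered; an ordered or scoped variant would add further position checks.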

1.2 Organization of this document

This document is organized as follows. We first present a high-level syntax for the XQuery 1.0 and XPath 2.0 Full-Text language along with some examples. Then, we present the syntax and examples of the basic primitives in the XQuery 1.0 and XPath 2.0 Full-Text language. This is followed by the semantics of the XQuery 1.0 and XPath 2.0 Full-Text language. The appendices include an EBNF for the XQuery 1.0 Grammar with Full-Text extensions, an EBNF for the XPath 2.0 Grammar with Full-Text extensions, acknowledgements, and a glossary.

1.3 A word about namespaces

Certain namespace prefixes are predeclared by XQuery 1.0 and, by implication, by this specification, and bound to fixed namespace URIs. These namespace prefixes are as follows:

  • xml = http://www.w3.org/XML/1998/namespace

  • xs = http://www.w3.org/2001/XMLSchema

  • xsi = http://www.w3.org/2001/XMLSchema-instance

  • fn = http://www.w3.org/2005/xpath-functions

  • xdt = http://www.w3.org/2005/xpath-datatypes

  • local = http://www.w3.org/2005/xquery-local-functions

In addition to the prefixes in the above list, this document uses the prefix err to represent the namespace URI http://www.w3.org/2005/xqt-errors. This namespace prefix is not predeclared and its use in this document is not normative. Error codes that are not defined in this document are defined in other XQuery 1.0 and XPath 2.0 specifications, particularly [XML Path Language (XPath) 2.0] and [XQuery 1.0 and XPath 2.0 Functions and Operators].

Finally, this document uses the prefix fts to represent a namespace containing a number of functions used in this document to describe the semantics of XQuery 1.0 and XPath 2.0 Full-Text functions. There is no requirement that these functions be implemented, therefore no URI is associated with that prefix.

2 Full-Text Extensions to XQuery and XPath

XQuery 1.0 and XPath 2.0 Full-Text extends the languages of XQuery 1.0 and XPath 2.0 in three ways. It:

  1. Adds a new expression called FTContainsExpr;

  2. Enhances the syntax of FLWOR expressions in XQuery 1.0 and for expressions in XPath 2.0 with optional score variables; and

  3. Adds static context declarations for full-text match options to the query prolog.

Additionally, it extends the data model and processing models in various ways.

2.1 Processing Model

As part of the External Processing that is described in the XQuery Processing Model, when an XML document is parsed into an Infoset/PSVI and ultimately into an XQuery Data Model instance, an implementation-defined full-text process called tokenization is usually executed.

Tokenization, in general terms, is the process of converting a text string into smaller units that are used in query processing. Those units, called tokens, are the most basic text units that a full-text search can refer to. Full-text operators typically work on sequences of token occurrences found in the target text (nodes) of a search. These token occurrences are characterized by unique identifiers that capture the relative position of the token inside the string, the relative position of the sentence containing the token, and the relative position of the paragraph containing the token.

The tokenization process is implementation-dependent. For example, tokenization may differ from domain to domain and from language to language. This specification imposes only a small number of constraints on the semantics of a correct tokenizer. As a consequence, all the examples in this document are given for explanatory purposes only and their results are not mandated: the result of such full-text queries depends on the tokenizer that is being used.
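To make the notion of token occurrences concrete, the following Python sketch produces, for each token, its relative position in the string together with the relative positions of its containing sentence and paragraph. Everything here is an assumption for illustration: the `TokenOccurrence` structure, the rule that a blank line separates paragraphs, and the rule that ".", "!" or "?" ends a sentence. A conforming tokenizer may behave quite differently.

```python
import re
from typing import NamedTuple

class TokenOccurrence(NamedTuple):
    word: str
    position: int   # relative position of the token in the string
    sentence: int   # relative position of the containing sentence
    paragraph: int  # relative position of the containing paragraph

def tokenize(text):
    occurrences = []
    position = 0
    sentence = 0
    # Assumption: paragraphs are separated by blank lines.
    for para_no, para in enumerate(re.split(r"\n\s*\n", text.strip()), start=1):
        # Assumption: a sentence ends at ".", "!" or "?" followed by whitespace.
        for sent in re.split(r"(?<=[.!?])\s+", para.strip()):
            words = re.findall(r"[A-Za-z0-9]+", sent)
            if not words:
                continue
            sentence += 1
            for word in words:
                position += 1
                occurrences.append(
                    TokenOccurrence(word.lower(), position, sentence, para_no))
    return occurrences

sample = ("The usability of a Web site. A Web site should facilitate learning.\n\n"
          "This book has been approved.")
occurrences = tokenize(sample)
```

Because the function is deterministic, two equal strings always yield the same token occurrences, and every token belongs to exactly one sentence, which belongs to exactly one paragraph.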

A full-text expression or FTContainsExpr, evaluated within the normal Query Processing (XQuery Processing Model), is composed of several parts:

  1. An XPath 2.0 or XQuery 1.0 expression (RangeExpr) that specifies the sequence of items to be searched. Those items are called the search context.

  2. The full-text selection to be applied (FTSelections). FTSelections are, syntactically and semantically, fully composable and contain:

    • Required:

      • Words and phrases for which a search is performed (FTWords).

    • Optional:

      • Match options, such as indicators for case sensitivity and stop words (FTMatchOptions);

      • Boolean full-text operators, that compose an FTSelection from simpler FTSelections;

      • Other full-text operators that are constraints on the positions of matches, such as indicators for distance between tokens and for the cardinality of matches; and

      • The weighting information. Each individual search term in an FTSelection may be annotated with optional weight information. This information may be used during the evaluation of the FTSelections to calculate scoring information, which quantifies the relevance of the result to the given search criteria.

  3. An optional XPath 2.0 or XQuery 1.0 expression (UnionExpr) that specifies the set of nodes, descendants of the RangeExpr, whose contents may be ignored for the purpose of determining a match during the search (FTIgnoreOption).

The results of the evaluation of the FTSelection operators are instances of the AllMatches model, which complements the XQuery Data Model (XDM) for processing full-text queries. An AllMatches instance describes all possible solutions to the full-text query for a given search context item. Each solution is described by a Match instance. A Match instance contains the tokens from the search context item that must be included (described using StringInclude instances, which model the positive terms) and the tokens from the search context item that must be excluded (described using StringExclude instances, which model the negative terms). Each negative or positive term is modeled as a tuple: the position of the query word or phrase in the FTSelection, and a TokenInfo structure that describes a consecutive sequence of token occurrences in the text string which match the query word or phrase.

Processing Model Extensions

Figure 1 provides a schematic overview of the XQuery 1.0 and XPath 2.0 Full-Text processing steps that are discussed in detail below. Some of these steps are completely outside the domain of XQuery; in Figure 1, these are depicted outside the black line that represents the boundaries of the language. The diagram shows only the central pieces of the XQuery Processing Model (see Section 2.2 Processing Model in [XQuery 1.0: An XML Query Language]), but zooms in on the Execution Engine, where the processing of the Full-Text extensions takes place. The full-text processing steps are labeled as FTn within the diagram and are referenced within the text.

Like all XQuery expressions, an FTContainsExpr returns an XDM Instance (see Fig. 1). With the exception of FTWords, which consumes TokenInfos, all FTSelections are closed under the AllMatches data model, i.e., their input and output are AllMatches instances. Tokenization normally occurs at the time of parsing of the original XML documents, for example, during the Data Model Generation process (see Figure 1). However, it may also occur "on the fly", transforming an XDM instance into TokenInfos, which ultimately get converted into AllMatches instances by the evaluation of FTSelections. Thus, the evaluation of nested full-text and XQuery expressions moves back and forth between these two models.

The resulting AllMatches instance obtained by the evaluation of a Full Text expression is converted into a Boolean value before being returned to the enclosing XPath or XQuery operation, as follows. If at least one member of the disjunction contains only positive terms, then the value returned is true. If all members of the disjunction contain negative terms, the result is false.
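The conversion of an AllMatches instance into a Boolean value can be sketched in Python as follows. The `Match` class and its field names are simplified stand-ins invented for this example; the formal AllMatches model is defined in Section 4. An AllMatches instance is represented here as a plain list of `Match` objects, read as a disjunction.

```python
from dataclasses import dataclass, field

@dataclass
class Match:
    # Simplified stand-in for a Match in the AllMatches model.
    string_includes: list = field(default_factory=list)  # positive terms
    string_excludes: list = field(default_factory=list)  # negative terms

def all_matches_to_boolean(all_matches):
    # True if at least one member of the disjunction has no negative terms.
    return any(not match.string_excludes for match in all_matches)

only_positive = Match(string_includes=["usability"])
with_negative = Match(string_includes=["web"], string_excludes=["testing"])

result_mixed = all_matches_to_boolean([with_negative, only_positive])
result_negative = all_matches_to_boolean([with_negative])
result_empty = all_matches_to_boolean([])
```

In this sketch an AllMatches instance with no members yields false, since there is no member that contains only positive terms.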

Weighting information may be used, in an implementation-dependent fashion, when calculating the scoring information computed and made available by FTContainsExpr to the optional score construct.

Section 3 describes the syntax and the informal semantics of Full Text operators. Their formal semantics is defined in Section 4. The AllMatches data model is formally defined in Section 4.

Given the components of a given Full Text expression, the evaluation algorithm will proceed according to the following steps, also referenced in the processing model diagram as steps FTn (see Fig. 1):

  1. Evaluate the search context expression, resulting in the set of search context items. (FT1 covers the evaluation of any XPath 2.0 or XQuery 1.0 expressions that generate or modify the search context, as well as the query string(s) in a partially evaluated FTSelection expression.)

  2. Evaluate the (optional) ignore expression, resulting in the set of ignored nodes, and virtually delete the ignored nodes from the tree of search context nodes. (Included in FT1)

  3. Apply the tokenization algorithm to query string(s). (FT2.1 -- this is implementation-dependent)

  4. For each search context item:

    1. Apply the tokenization algorithm in order to extract potentially matching terms together with their positional information. This step results in a sequence of token occurrences. (FT2.2 -- this is implementation-dependent)

    2. Evaluate the simple FTWords operators in the FTSelection against the tokenized input. This results in a set of AllMatches instances. (FT3)

    3. Evaluate the rest of the FTSelection operator tree in a bottom-up fashion. At each step, the AllMatches instances produced by the previous steps are given as input, and a new AllMatches instance is obtained as output. At each step, the FTMatchOptions control the semantics of the application of the FTWords operator. (FT4)

  5. Convert the AllMatches instance into a Boolean value. (FT5)

The additional scoring information (also part of FT5) that is produced by the evaluation of the Full Text expression is implementation-dependent and is not specified in this document. It is made available at the same time the Boolean value is returned.

2.2 Expression FTContainsExpr

As a syntactic construct, an FTContainsExpr behaves similarly to a comparison expression (see Section 3.5.2 General Comparisons in [XQuery 1.0: An XML Query Language]). The following grammar rule introduces FTContainsExpr.

[50]    ComparisonExpr    ::=    FTContainsExpr ( (ValueComp | GeneralComp | NodeComp) FTContainsExpr )?

An FTContainsExpr may be used anywhere a ComparisonExpr may be used. FTContainsExprs have higher precedence than comparison operators, so the results of FTContainsExpr may be compared without enclosing them in parentheses.

2.2.1 FTContainsExpr Description

[51]    FTContainsExpr    ::=    RangeExpr ( "ftcontains" FTSelection FTIgnoreOption? )?

An FTContainsExpr returns a Boolean value. It returns true if there is some node in RangeExpr that, after tokenization, matches FTSelection. For the purpose of determining a match, certain descendants of nodes in RangeExpr may be ignored, as specified in FTIgnoreOption.

2.2.2 FTContainsExpr Examples

The following example in extended XQuery 1.0 returns the author of each book with a title containing a token with the same root as dog and the token cat.

for $b in /books/book
where $b/title ftcontains ("dog" with stemming) && "cat" 
return $b/author

The same example in extended XPath 2.0 is written as:


/books/book[title ftcontains ("dog" with stemming) && "cat"]/author

2.3 Score Variables

Besides specifying a match of a full-text search as a Boolean condition, full-text search applications typically also have the ability to associate scores with the results. [Definition: Scores express the relevance of those results to the full-text search conditions.]

XQuery 1.0 and XPath 2.0 Full-Text extends the languages of XQuery 1.0 and XPath 2.0 further by adding optional score variables to the for and let clauses of FLWOR expressions.

The production for the extended for clause follows.

[35]    ForClause    ::=    "for" "$" VarName TypeDeclaration? PositionalVar? FTScoreVar? "in" ExprSingle ("," "$" VarName TypeDeclaration? PositionalVar? FTScoreVar? "in" ExprSingle)*
[37]    FTScoreVar    ::=    "score" "$" VarName

When a score variable is present in a for clause, the evaluation of the expression following the in keyword must determine not only the result sequence of the expression, i.e., the sequence of items which are iteratively bound to the for variable, but also, in each iteration, the relevance "score" value of the current item, to which the score variable is bound.

The following example selects the book elements that satisfy the condition [content ftcontains "web site" && "usability" and .//chapter/title ftcontains "testing"] and returns the scores assigned to them.

for $b score $s 
    in /books/book[content ftcontains "web site" && "usability" 
                   and .//chapter/title ftcontains "testing"]
return $s

XPath 2.0 Full-Text extends the language of XPath 2.0 in the for expression in the same way: with optional score variables. The example above is also a legal example of the XPath 2.0 extension.

Scores are typically used to order results, as in the following, more complete example.

for $b score $s 
    in /books/book[content ftcontains "web site" && "usability"]
where $s > 0.5
order by $s descending
return <result>  
          <title> {$b//title} </title> 
          <score> {$s} </score> 
       </result>

The score variable is bound to a value which reflects the relevance of the match criteria in the FTSelections to the nodes in the respective RangeExprs. The calculation of relevance is implementation-dependent, but score evaluation must follow these rules:

  1. Score values are of type xs:double in the range [0, 1].

  2. For score values greater than 0, a higher score must imply a higher degree of relevance.
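A consumer of scores might check the range rule and order results as in the following Python sketch, which mirrors the `where $s > 0.5` and `order by $s descending` clauses of the preceding example. The function names and the sample scores are invented for illustration; how scores are computed is implementation-dependent and is not shown.

```python
def check_score(score):
    # Rule 1: a score is a double in the closed range [0, 1].
    if not isinstance(score, float) or not 0.0 <= score <= 1.0:
        raise ValueError(f"invalid score: {score!r}")
    return score

def rank(scored_items, threshold=0.5):
    # scored_items is an iterable of (item, score) pairs.
    checked = [(item, check_score(score)) for item, score in scored_items]
    kept = [(item, score) for item, score in checked if score > threshold]
    # Rule 2: a higher score means higher relevance, so sort descending.
    return sorted(kept, key=lambda pair: pair[1], reverse=True)

ranked = rank([("book A", 0.4), ("book B", 0.9), ("book C", 0.7)])
```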

Similar to their use in a for clause, score variables may be specified in a let clause. A score variable in a let clause is also bound to the score of the expression evaluation, but in the let clause one score is determined for the complete result. The let variable may be dropped from the let clause, if the score variable is present.

The production for the extended let clause follows.

[38]    LetClause    ::=    (("let" "$" VarName TypeDeclaration? FTScoreVar?) | ("let" "score" "$" VarName)) ":=" ExprSingle ("," (("$" VarName TypeDeclaration? FTScoreVar?) | FTScoreVar) ":=" ExprSingle)*

When the score option is used in a for clause, the expression following the in keyword serves the dual purpose of filtering, i.e., driving the iteration, and determining the scores. It is possible to specify separate expressions for filtering and scoring by combining a simple for clause with a let clause that uses scoring. The following is an example of this.

for $b in /books/book[.//chapter/title ftcontains "testing"]
let score $s := $b/content ftcontains "web site" && "usability" 
order by $s descending
return <result score="{$s}">{$b}</result>

This example returns book elements with chapter titles that contain "testing", along with their scores. These scores, however, reflect whether the book content contains "web site" and "usability".

Note that the score of an FTContainsExpr is not required to be 0 if the expression evaluates to false, nor to be non-zero if the expression evaluates to true. Hence, in the example above it is not possible to infer the Boolean value of the FTContainsExpr in the let clause from the calculated score of a returned result element. For instance, an implementation may want to assign a non-zero score to a book that contains only "web site", but not "usability", as this may be considered more relevant than a book that contains neither.

The use of score variables introduces a second-order aspect to the evaluation of expressions which cannot be emulated by (first-order) XQuery functions. Consider the following replacement of the clause let score $s := FTContainsExpr:

let $s := score(FTContainsExpr)

where a function score is applied to some FTContainsExpr. If the function score were first-order, it would only be applied to the result of the evaluation of its argument, which is one of the Boolean constants true or false. Hence, there would be at most two possible values such a score function would be able to return and no further differentiation would be possible.
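The limitation of a first-order score function can be demonstrated with a small Python sketch. Both scoring functions and the sample data are hypothetical: `first_order_score` sees only the Boolean result of the match, while `second_order_score` stands for a scorer that has access to the operands of the expression (here it simply returns the fraction of query terms found, which is not a formula prescribed by this document).

```python
def first_order_score(match_result: bool) -> float:
    # Hypothetical first-order scorer: it receives only the Boolean result.
    return 1.0 if match_result else 0.0

def second_order_score(content_tokens, query_terms):
    # Hypothetical scorer that can inspect the searched content:
    # the fraction of query terms present in the content.
    found = sum(1 for term in query_terms if term in content_tokens)
    return found / len(query_terms)

query_terms = ["web", "usability"]
books = {
    "both terms": ["web", "site", "usability"],
    "one term": ["web", "site", "design"],
    "no terms": ["cooking", "basics"],
}

# Boolean result of a conjunctive match, as a first-order function would see it.
booleans = {name: all(term in tokens for term in query_terms)
            for name, tokens in books.items()}
first_order = {name: first_order_score(b) for name, b in booleans.items()}
second_order = {name: second_order_score(tokens, query_terms)
                for name, tokens in books.items()}
```

The first-order scorer produces at most two distinct values and cannot separate the book with one term from the book with none, whereas the second scorer can.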

2.3.1 Using Weights Within a Scored FTContainsExpr

[Definition: Scoring may be influenced by adding weight declarations to search tokens, phrases, and expressions.] Syntactically weight declarations are introduced in the FTSelection production, described in FTSelections.

for $b in /books/book
let score $s := $b/content ftcontains ("web site" weight 0.2)
                                  && ("usability" weight 0.8)
return <result score="{$s}">{$b}</result>

The effect of weights on the result score is implementation-dependent. However, weight declarations must follow these rules:

  1. Weights in an FTContainsExpr are significant only in relation to each other; and

  2. When no explicit weight is specified, the default weight is 0.5.

Weight declarations in an FTContainsExpr for which no scores are evaluated are ignored.
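One possible reading of the two weight rules is sketched below in Python: missing weights are filled with the default of 0.5, and the weights are then normalized so that only their proportions matter. This normalization is an assumption made for illustration, as is the restriction to positive weights; the actual effect of weights on the score is implementation-dependent.

```python
DEFAULT_WEIGHT = 0.5

def relative_weights(weights):
    # None stands for a search term without an explicit weight.
    filled = [DEFAULT_WEIGHT if w is None else float(w) for w in weights]
    total = sum(filled)
    if total <= 0:
        # Assumption of this sketch: weights are positive.
        raise ValueError("weights must sum to a positive value")
    # Only the proportions between the weights are kept.
    return [w / total for w in filled]

declared = relative_weights([0.2, 0.8])   # as in the example above
scaled = relative_weights([2, 8])         # same proportions, different scale
defaults = relative_weights([None, None])
```

Under this reading, the declarations `weight 0.2` and `weight 0.8` have the same effect as `weight 2` and `weight 8`, because the proportions are identical.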

2.4 Extensions to the Static Context

The XQuery Static Context is extended by a component for each of the full-text match options. Thus, the default of a match option in a query may be changed by providing a setting in the static context using the following declaration syntax.

[6]    Prolog    ::=    ((DefaultNamespaceDecl | Setter | NamespaceDecl | Import) Separator)* ((VarDecl | FunctionDecl | OptionDecl | FTOptionDecl) Separator)*
[14]    FTOptionDecl    ::=    "declare" "ft-option" FTMatchOption

Match options modify the match semantics of full-text expressions. They are described in detail in Section 3.2 FTMatchOptions. When a match option is specified explicitly in a query, that setting overrides the setting of the respective match option in the static context.
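The override order described above can be pictured as layered settings, as in the following Python sketch. The option names and values (`"case"`, `"stemming"`, and so on) and the function `effective_match_options` are placeholders invented for this example; the actual match options are described in Section 3.2 FTMatchOptions.

```python
def effective_match_options(implementation_defaults,
                            prolog_declarations,
                            explicit_options):
    # Start from the defaults, apply settings declared in the query prolog
    # (the static context), then apply options written explicitly in the
    # query, which override the static context.
    options = dict(implementation_defaults)
    options.update(prolog_declarations)
    options.update(explicit_options)
    return options

defaults = {"case": "insensitive", "stemming": "off"}   # placeholder values
prolog = {"case": "sensitive"}       # stands for a "declare ft-option ..." setting
explicit = {"stemming": "on"}        # stands for an option on one FTSelection

effective = effective_match_options(defaults, prolog, explicit)
```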

3 FTSelections

This section describes FTSelections which contain the full-text operators in the FTContainsExpr, and the match options in FTMatchOptions which modify the matching semantics of the full-text selection expressions.

The FTSelection production specifies the possible full-text search conditions.

[144]    FTSelection    ::=    FTOr (FTMatchOption | FTProximity)* ("weight" RangeExpr)?

The "weight" value is the result of evaluating the RangeExpr that follows the "weight" keyword and can be any numeric value.

The syntax and semantics of the individual full-text selection operators follow.

The XML document fragment shown below is the source document for the examples in this section.

Tokenization is implementation-defined. A sample tokenization is used for the examples in this section. The results may be different for other tokenizations.

Unless stated otherwise, the results assume a case-insensitive match.

<book number="1">
  <title shortTitle="Improving Web Site Usability">Improving  
      the Usability of a Web Site Through Expert Reviews and
      Usability Testing</title>
   <author>Millicent Marigold</author>
   <author>Montana Marigold</author>
   <editor>Véra Tudor-Medina</editor>
   <content>
     <p>The usability of a Web site is how well the  
         site supports the users in achieving specified  
         goals. A Web site should facilitate learning,  
         and enable efficient and effective task  
         completion, while propagating few errors.
     </p>
     <note>This book has been approved by the Web Site  
         Users Association.
     </note>
   </content>
 </book>

3.1 Full-Text Operators

[Definition: Full-text operators perform operations on tokens, phrases, and expressions. Some require that the relative positions of tokens in the document be known (e.g., proximity operators).]

3.1.1 FTWords

FTWords specifies the tokens and phrases that are being searched as the right-hand side argument of FTContainsExpr.

[150]    FTWords    ::=    FTWordsValue FTAnyallOption?
[151]    FTWordsValue    ::=    Literal | ("{" Expr "}")
[166]    FTAnyallOption    ::=    ("any" "word"?) | ("all" "words"?) | "phrase"

An FTWords is an FTWordsValue followed by the optional modifier FTAnyallOption. The right-hand side of FTWordsValue is an XQuery expression which must evaluate to a sequence of string values or nodes of type "xs:string". The result is then atomized into a sequence of strings which is tokenized into a sequence of tokens and phrases. If the atomized sequence is not a subtype of "xs:string*", an error is raised: [err:XPTY0004]XP.

If the "any" option is specified, a match occurs if and only if at least one token or phrase in the sequence has a match in the searched text.

If the "all" option is specified, a match occurs if and only if all of the tokens and phrases in the sequence are matched in the searched text.

If the "phrase" option is specified, all words and phrases are used to create a sequence of ordered words representing a new phrase. A match occurs if and only if the resulting phrase is matched in the searched text.

If the "any word" option is specified, a match occurs if and only if at least one token in the sequence of tokens and phrases is matched in the searched text.

If the "all words" option is specified, a match occurs if and only if all tokens in the sequence of tokens and phrases are matched in the searched text.

If no option is specified, "any" is the default.

If the result is a single string, "any", "all", and "phrase" are equivalent.
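
The option semantics above can be illustrated with a short, non-normative Python sketch. The tokenizer (lowercased runs of word characters) and the helper names tokenize, contains_phrase, and ft_words are assumptions made for illustration only; actual tokenization is implementation-defined, and match options are not modeled.

```python
import re

def tokenize(text):
    # Assumed sample tokenization: lowercased runs of word characters.
    return re.findall(r"\w+", text.lower())

def contains_phrase(doc_tokens, phrase_tokens):
    # True if phrase_tokens occurs as a contiguous run in doc_tokens.
    n = len(phrase_tokens)
    return n > 0 and any(doc_tokens[i:i + n] == phrase_tokens
                         for i in range(len(doc_tokens) - n + 1))

def ft_words(text, strings, option="any"):
    # strings plays the role of the atomized FTWordsValue sequence.
    doc = tokenize(text)
    phrases = [tokenize(s) for s in strings]       # one phrase per string
    words = [[t] for p in phrases for t in p]      # every individual token
    if option == "any":
        return any(contains_phrase(doc, p) for p in phrases)
    if option == "all":
        return all(contains_phrase(doc, p) for p in phrases)
    if option == "phrase":
        # All strings are concatenated, in order, into one new phrase.
        return contains_phrase(doc, [t for p in phrases for t in p])
    if option == "any word":
        return any(contains_phrase(doc, w) for w in words)
    if option == "all words":
        return all(contains_phrase(doc, w) for w in words)
    raise ValueError("unknown option: " + option)
```

Under these assumptions, calling ft_words on the text of the sample p element with ["Web Site Usability"] distinguishes the default "any" option, which requires the whole phrase, from "all words", which only requires each token to be present somewhere.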

/book[@number="1" and ./title ftcontains "Expert"]

returns the book element whose number is 1, because its title element contains the token "Expert".

/book[@number="1" and ./title ftcontains "Expert Reviews"]

returns the book element whose number is 1, because its title element contains the phrase "Expert Reviews".

/book[@number="1" and ./title ftcontains {"Expert",
"Reviews"} all]

returns the book element whose number is 1, because its title element contains two tokens "Expert" and "Reviews".

/book[@number="1"]//p ftcontains "Web Site Usability"

returns false, because the p element doesn't contain the phrase "Web Site Usability" although it contains all of the tokens in the phrase.

for $book in /book[.//author ftcontains "Marigold"] 
let score $score := $book/title ftcontains "Web Site Usability" 
where $score > 0.8 
order by $score descending
return $book/@number

returns book numbers of book elements by "Marigold" with a title about "Web Site Usability" sorting them in descending score order.

3.1.2 FTOr

[145]    FTOr    ::=    FTAnd ( "||" FTAnd )*

FTOr finds matches that satisfy at least one of the selection criteria.

A match must satisfy at least one of the FTSelection criteria.

 /book[.//author ftcontains "Millicent" ||
"Voltaire"] 

returns the book element written by "Millicent".

3.1.3 FTAnd

[146]    FTAnd    ::=    FTMildnot ( "&&" FTMildnot )*

FTAnd finds matches that satisfy all of the selection criteria.

A match must satisfy all of the FTSelection criteria which are specified by one or more FTMildNot expressions.

/book[@number="1"]/title ftcontains ("usability" && "testing")

returns true, since the book title contains "usability" and "testing".

/book/author ftcontains "Millicent" && "Montana"

returns false, because "Millicent" and "Montana" are not contained by the same author element in any book element.

3.1.4 FTMildNot

[147]    FTMildnot    ::=    FTUnaryNot ( "not" "in" FTUnaryNot )*

FTMildNot is a milder form of && ! (and not). 'a not in b' matches an expression that contains "a", but not when it is a part of "b". For example, a search for "Mexico" not in "New Mexico" returns, among others, a document which is all about "Mexico" but mentions at the end that "New Mexico was named after Mexico", which would not be returned by an "and not" search.

A match to FTMildNot must contain at least one token occurrence that satisfies the first condition and does not satisfy the second condition. If it contains a token occurrence that satisfies both the first and the second condition, the occurrence is not considered as a result.

/book ftcontains "usability" not in "usability
testing"

returns true, because "usability" appears in the title and the p elements and the occurrence within the phrase "Usability Testing" in the title element is not considered.
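
The filtering performed by FTMildNot can be sketched, non-normatively, in Python for the simple case in which both operands are plain phrases. The sketch assumes that an occurrence of the first operand is discarded when it lies entirely within an occurrence of the second operand; the tokenizer and the names tokenize, phrase_spans, and mild_not are illustrative assumptions, not part of this specification.

```python
import re

def tokenize(text):
    # Assumed sample tokenization: lowercased runs of word characters.
    return re.findall(r"\w+", text.lower())

def phrase_spans(doc, phrase):
    # Inclusive (start, end) token positions of every occurrence of phrase.
    n = len(phrase)
    return [(i, i + n - 1) for i in range(len(doc) - n + 1)
            if doc[i:i + n] == phrase]

def mild_not(text, a, b):
    # Occurrences of phrase a that are not part of an occurrence of phrase b.
    doc = tokenize(text)
    excluded = phrase_spans(doc, tokenize(b))
    return [(s, e) for (s, e) in phrase_spans(doc, tokenize(a))
            if not any(bs <= s and e <= be for (bs, be) in excluded)]
```

A non-empty result corresponds to a successful match: for the text "New Mexico was named after Mexico", only the final occurrence of "Mexico" survives the filter.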

The right-hand side of a FTMildNot may not contain an FTSelection that evaluates to an AllMatches that contains a StringExclude. Such FTSelections are FTUnaryNot and FTTimes with at most, from-to, and exactly occurrences ranges.

3.1.5 FTUnaryNot

[148]    FTUnaryNot    ::=    ("!")? FTWordsSelection

FTUnaryNot finds matches that do not satisfy the selection criteria.

/book[. ftcontains ! "usability"]

returns the empty sequence, because all book elements contain "usability".

/book ftcontains "information" &&
"retrieval" && ! "information retrieval"

returns true, because book elements contain "information" and "retrieval" but not "information retrieval".

/book[. ftcontains "web site usability" && 
!"usability testing"]

returns book elements containing "web site usability" but not "usability testing".

3.1.6 FTOrder

[153]    FTOrderedIndicator    ::=    "ordered"

FTOrder controls the order of tokens and phrases to be the same as the order in which they are written in the query.

The default is unordered. Unordered is in effect when ordered is not specified in the query. Unordered cannot be written explicitly in the query.

FTOrder finds matches which must satisfy the nested selection condition and the match must contain the tokens in the order specified in the query.

/book/title ftcontains ("web site" && "usability")
ordered 

returns true, because titles of book elements contain "web site" and "usability" in the order in which they are written in the query, i.e., "web site" must precede "usability".

/book[@number="1"] ftcontains ("Montana" &&
"Millicent") ordered 

returns false, because although "Montana" and "Millicent" appear in the book element, they do not appear in the order in which they are written in the query.

3.1.7 FTScope

[171]    FTScope    ::=    ("same" | "different") FTBigUnit
[173]    FTBigUnit    ::=    "sentence" | "paragraph"

FTScope finds tokens and phrases contained in the same or a different scope.

Possible scopes are sentences and paragraphs.

By default, there are no restrictions on the scope of the matches.

If two tokens appear in the same sentence and in different sentences, then both same sentence and different sentence return true. The same is true for same paragraph and different paragraph.

/book ftcontains "usability"
&& "Marigold" same sentence

returns false, because the tokens "usability" and "Marigold" are not contained within the same sentence.

/book ftcontains "usability"
&& "Marigold" different sentence

returns true, because the tokens "usability" and "Marigold" are contained within different sentences.

/book[. ftcontains "usability" && "testing"
same paragraph] 

returns a book element, because it contains "usability" and "testing" in the same paragraph.

/book[. ftcontains "site" && "errors"
same sentence] 

returns a book element, because "site" and "errors" appear in the same sentence.

Some subtle relationships between FTScope and FTDistance will be discussed in Section 4.

3.1.8 FTDistance

[168]    FTDistance    ::=    "distance" FTRange FTUnit
[167]    FTRange    ::=    ("exactly" UnionExpr)
| ("at" "least" UnionExpr)
| ("at" "most" UnionExpr)
| ("from" UnionExpr "to" UnionExpr)
[172]    FTUnit    ::=    "words" | "sentences" | "paragraphs"

FTDistance finds matches by specifying the distance between tokens and phrases in FTUnits (tokens, sentences, and paragraphs). The number of intervening FTUnits is specified in the integer value of FTRange.

FTRange specifies a range of integer values, providing a minimum and maximum value. Each UnionExpr in an FTRange must evaluate (after atomization) to a singleton sequence with an atomic value of type "xs:integer". Otherwise, an error is raised [err:XPTY0004]XP.

Let the value of the first (or only) UnionExpr be M. If "from" is specified, let the value of the second UnionExpr be N. FTDistance may cross element boundaries when computing distance.

The following rule applies to FTDistance:

  • Zero words (sentences, paragraphs) means adjacent tokens (sentences, paragraphs).

If "exactly" is specified, then the range is the closed interval [M, M]. If "at least" is specified, then the range is the half-closed interval [M, unbounded). If "at most" is specified, then the range is the closed interval [0, M]. If "from-to" is specified, then the range is the closed interval [M, N].

Here are some examples of FTRanges:

  1. 'exactly 0' specifies the range [0, 0].

  2. 'at least 1' specifies the range [1, unbounded).

  3. 'at most 1' specifies the range [0, 1].

  4. 'from 5 to 10' specifies the range [5, 10].
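
The mapping from FTRange to intervals, and the rule that a distance of zero means adjacency, can be sketched non-normatively in Python. The function names ft_range, word_distance, and satisfies are hypothetical, and token positions are assumed to be consecutive integers.

```python
import math

def ft_range(kind, m, n=None):
    # Map an FTRange to inclusive (low, high) bounds; math.inf means unbounded.
    if kind == "exactly":
        return (m, m)
    if kind == "at least":
        return (m, math.inf)
    if kind == "at most":
        return (0, m)
    if kind == "from-to":
        if n is None:
            raise ValueError("'from-to' needs both M and N")
        return (m, n)
    raise ValueError("unknown FTRange kind: " + kind)

def word_distance(pos_a, pos_b):
    # Number of intervening tokens; 0 means the two tokens are adjacent.
    return abs(pos_a - pos_b) - 1

def satisfies(distance, bounds):
    # True if the distance lies within the range computed by ft_range.
    low, high = bounds
    return low <= distance <= high
```

Representing the unbounded upper limit as infinity lets a single comparison cover all four forms of FTRange.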

The distances computed by FTDistance are not affected by the presence or absence of element boundaries in the text. Stop words are counted in those computations whether they are ignored or not.

/book ftcontains ("information" &&
"retrieval") not in ("information" && "retrieval" 
distance at least 11 words)

returns false, because "information" and "retrieval" are at least 11 tokens apart.

/book ftcontains "web" && "site" &&
"usability" distance at most 2 words

returns true, because "web", "site", and "usability" have at most 2 intervening tokens between them.

/book[. ftcontains "web site"
&& "usability" distance at most 1 words]/title 

returns the book title. A similar query for the p element would return false because "web site" and "usability" have two intervening tokens between them.

3.1.9 FTWindow

[169]    FTWindow    ::=    "window" UnionExpr FTUnit

FTWindow finds matches within a number of FTUnits (tokens, sentences, and paragraphs). The number of FTUnits is specified as an integer.

FTWindow may cross element boundaries. The size of the window is not affected by the presence or absence of element boundaries. Stop words are included in those computations whether they are ignored or not.

UnionExpr must evaluate to an atom of type "xs:integer".

A match of an FTSelection is considered a match within a window, if there exists a window of the given number of consecutive units (tokens, sentences, or paragraphs) in the document within which the match lies.
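
For windows measured in words, the condition above amounts to checking that the first and last token positions of a match span no more than the given number of tokens. The following non-normative Python sketch covers only conjunctions of single tokens; the tokenizer and the function name any_match_in_window are assumptions made for illustration.

```python
import re
from itertools import product

def tokenize(text):
    # Assumed sample tokenization: lowercased runs of word characters.
    return re.findall(r"\w+", text.lower())

def any_match_in_window(text, words, size):
    # True if some choice of one occurrence per word fits in `size` tokens.
    if not words:
        return False
    doc = tokenize(text)
    occurrences = [[i for i, t in enumerate(doc) if t == w.lower()]
                   for w in words]
    # Each combination picks one position per word, i.e. one candidate match.
    return any(max(c) - min(c) + 1 <= size for c in product(*occurrences))
```

With the sample title as input, the words "web", "site", and "usability" fit in a window of 5 tokens ("Usability of a Web Site") but not in a window of 4.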

/book/title ftcontains "web" && "site"
&& "usability" window 5 words

returns true, because "web", "site", and "usability" are within a window of 5 tokens in the title element.

/book ftcontains ("web" && "site" ordered)
&& ("usability" || "testing") window 10 words

returns true, because "web" and "site" in the order they are written in the query and either "usability" or "testing" are within a window of at most 10 tokens.

/book//title ftcontains "web site" &&
"usability" window 3 words

returns true, because the title element contains "Web Site Usability". A similar query on the p element would not return true, because its occurrences of "web site" and "usability" are not within a window of 3.

/book[@number="1" and . ftcontains "efficient" 
&& ! "and" window 3 words]

returns the empty sequence, because in the selected book element, there is no occurrence of "efficient" within a window of 3 tokens which would not also contain an occurrence of "and".

3.1.10 FTTimes

[170]    FTTimes    ::=    "occurs" FTRange "times"

FTTimes finds matches in which an FTSelection occurs a specified number of times.

FTTimes limits the number of different occurrences of FTSelection, within the specified range.

In the document fragment "very very big":

  1. The FTSelection "very big" has 1 occurrence consisting of the second "very" and "big".

  2. The FTSelection "very && big" has 2 occurrences; one consisting of the first "very" and "big", and the other containing the second "very" and "big".

  3. The FTSelection "very || big" has 3 occurrences.

  4. The FTSelection ! "small" has 1 occurrence.
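
The first three occurrence counts above can be reproduced with a small, non-normative Python sketch. It assumes whitespace-style tokenization and models only a phrase, a conjunction of two tokens, and a disjunction of two tokens; the negation case is not modeled. All function names are hypothetical.

```python
import re

def tokenize(text):
    # Assumed sample tokenization: lowercased runs of word characters.
    return re.findall(r"\w+", text.lower())

def positions(doc, word):
    # Token positions at which word occurs.
    return [i for i, t in enumerate(doc) if t == word]

def count_phrase(doc, phrase):
    # One occurrence per position where the whole phrase appears contiguously.
    n = len(phrase)
    return sum(1 for i in range(len(doc) - n + 1) if doc[i:i + n] == phrase)

def count_and(doc, a, b):
    # One occurrence per pairing of an occurrence of a with an occurrence of b.
    return len(positions(doc, a)) * len(positions(doc, b))

def count_or(doc, a, b):
    # One occurrence per occurrence of either token.
    return len(positions(doc, a)) + len(positions(doc, b))
```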

/book[. ftcontains "usability" occurs at least 2 times]/@number

returns book numbers because book elements contain 2 or more occurrences of "usability".

/book[@number="1" and title ftcontains "usability" ||
"testing" occurs at most 2 times] 

returns the empty sequence, because there are 3 occurrences of "usability" || "testing" in the designated title.

/book ftcontains "usability" occurs at least 2 times

returns true, because the book element contains 2 occurrences of "usability" in its title element, although its p element contains only 1 occurrence.

3.1.11 FTContent

[165]    FTContent    ::=    ("at" "start") | ("at" "end") | ("entire" "content")

FTContent finds matches in which the tokens and phrases are the first, last or all of the tokens and phrases in the tokenized form of the items being searched.

The "at" "start" option finds matches in which the tokens or phrases are the first tokens or phrases in the tokenized string value of the element being searched.

The "at" "end" option finds matches in which the tokens or phrases are the last tokens or phrases in the tokenized string value of the element being searched.

The "entire" "content" option finds matches in which the tokens or phrases are the entire content of the tokenized string value of the element being searched.
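
For a single phrase, the three FTContent options reduce to comparisons against the beginning, the end, or the whole of the token sequence. The following non-normative Python sketch assumes a simple tokenizer; the names tokenize and ft_content are hypothetical.

```python
import re

def tokenize(text):
    # Assumed sample tokenization: lowercased runs of word characters.
    return re.findall(r"\w+", text.lower())

def ft_content(text, phrase, option):
    # Compare the phrase tokens with the start, end, or whole of the text.
    doc, query = tokenize(text), tokenize(phrase)
    n = len(query)
    if n == 0:
        return False
    if option == "at start":
        return doc[:n] == query
    if option == "at end":
        return doc[-n:] == query
    if option == "entire content":
        return doc == query
    raise ValueError("unknown option: " + option)
```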

/books//title[. ftcontains "improving the usability
of a web site" at start]

returns each title element starting with the phrase "improving the usability of a web site".

/books//p[. ftcontains "propagat*" && "few
errors" distance at most 2 words at end]

returns each p element ending with the phrase "propagating few errors".

/books//note[. ftcontains "this site has been
approved by the web site users association" entire content]

returns each note element whose entire content is "this site has been approved by the web site users association".

3.2 FTMatchOptions

FTMatchOptions modify the operational semantics of the FTSelection on which they are applied.

[154]    FTMatchOption    ::=    FTCaseOption
| FTDiacriticsOption
| FTStemOption
| FTThesaurusOption
| FTStopwordOption
| FTLanguageOption
| FTWildCardOption

FTMatchOptions set environments for the matching options of FTSelection. [Definition: Match options modify the set of tokens and phrases in the query. Some of these options (e.g., stemming) have behaviors which depend on the language of the document, the language of the query, or both.] If a match option isn't specified explicitly in the query, its value is given by its static context component. Details about these context components, including their default values, are given in Appendix C Static Context Components.

If no match options declarations are present in the prolog and the implementation does not define any overwriting of the static context components for the match options, the query:

/book/title ftcontains "usability" 

is equivalent to the query

/book/title ftcontains "usability" case insensitive 
    diacritics insensitive 
    without stemming without thesaurus  
    without stop words language "none" without wildcards

FTMatchOptions are applied in the order in which they are written in the query. More information on their semantics is given in 4.2.3 Match Options Semantics.

We describe each match option in more detail in the following sections.

3.2.1 FTCaseOption

[155]    FTCaseOption    ::=    "lowercase"
| "uppercase"
| ("case" "sensitive")
| ("case" "insensitive")

FTCaseOption modifies token and phrase matching by specifying how uppercase and lowercase characters are considered.

FTCaseOption influences the way FTWords is applied.

There are four possible character case options:

  1. The option "uppercase" matches tokens and phrases with uppercase characters, regardless of the case of characters of the tokens and phrases as they are written in the query.

  2. The option "lowercase" matches tokens and phrases with lowercase characters, regardless of the case of characters of the tokens and phrases as they are written in the query.

  3. The option "case" "insensitive" matches the uppercase and lowercase characters of tokens and phrases. The case of characters as they are written in the query is not considered.

  4. The option "case" "sensitive" matches the case of the characters in tokens and phrases as they are written in the query.

The default is "case insensitive".

The following table summarizes the interactions between the case match options and the use of the default collations.

Case Matrix

Default collation options/Case options | UCC (Unicode Codepoint Collation) | CCS (some generic case-sensitive collation) | CCI (some generic case-insensitive collation)
insensitive | compare as if both lower | case-insensitive variant of CCS if it exists, else error | CCI
sensitive | UCC | CCS | case-sensitive variant of CCI if it exists, else error
uppercase | uppercase(Expr) + UCC | uppercase(Expr) + CCS | CCI
lowercase | lowercase(Expr) + UCC | lowercase(Expr) + CCS | CCI

Note:

In this table, "else error" means "Otherwise, an error is raised: [err:FOCH0002]FO". The phrase "if it exists" is used, because the case-sensitive collation CCS does not always have a case-insensitive variant (and, even if one exists, it may not be possible to determine it algorithmically), and because the case-insensitive collation CCI does not always have a case-sensitive variant (and, even if one exists, it may not be possible to determine it algorithmically).

/book[@number="1"]/title ftcontains "Usability" lowercase 

returns false, because the title element doesn't contain "usability" in lower-case characters.

/book[@number="1"]/title ftcontains "usability" 
case insensitive

returns true, because the character case is not considered.

3.2.2 FTDiacriticsOption

[156]    FTDiacriticsOption    ::=    ("with" "diacritics")
| ("without" "diacritics")
| ("diacritics" "sensitive")
| ("diacritics" "insensitive")

FTDiacriticsOption modifies token and phrase matching by specifying how diacritics are considered.

There are four possible diacritics options:

  1. The option "with" "diacritics" matches tokens and phrases with diacritics, regardless of whether the diacritics are written in the query.

  2. The option "without" "diacritics" matches tokens and phrases without diacritics, regardless of whether the diacritics are written in the query.

  3. The option "diacritics" "insensitive" matches tokens and phrases with and without diacritics. Whether diacritics are written in the query or not is not considered.

  4. The option "diacritics" "sensitive" matches tokens and phrases only if they contain the diacritics as they are written in the query.

The default is "diacritics insensitive".

The following table summarizes the interactions between the diacritics match options and the use of the default collations.

Diacritics Matrix

Default collation options/Diacritics options | UCC (Unicode Codepoint Collation) | CDS (some generic diacritics-sensitive collation) | CDI (some generic diacritics-insensitive collation)
insensitive | compare as if with and without | diacritics-insensitive variant of CDS if it exists, else error | CDI
sensitive | UCC | CDS | diacritics-sensitive variant of CDI if it exists, else error
with diacritics | "resume diacritic insensitive" not in "resume" | "resume diacritic insensitive" not in "resume" | CDI
without diacritics | "resume" not in "resume diacritic sensitive" | "resume" not in "resume diacritic sensitive" | CDI

Note:

In this table, "else error" means "Otherwise, an error is raised: [err:FOCH0002]FO". The phrase "if it exists" is used, because the diacritics-sensitive collation CDS does not always have a diacritics-insensitive variant (and, even if one exists, it may not be possible to determine it algorithmically), and because the diacritics-insensitive collation CDI does not always have a diacritics-sensitive variant (and, even if one exists, it may not be possible to determine it algorithmically).

/book[@number="1"]//editor ftcontains "Vera" with diacritics 

returns true, because the editor element contains the token "Vera" with an acute accent.

/book[@number="1"]//editor ftcontains "Véra" without diacritics 

returns false, because the editor element does not contain the token "Vera" without an acute accent.

3.2.3 FTStemOption

[157]    FTStemOption    ::=    ("with" "stemming") | ("without" "stemming")

FTStemOption modifies token and phrase matching by specifying whether stemming is applied or not.

FTStemOption influences the way FTWords is applied. It produces a disjunction of the query tokens by expanding the tokens into the list of tokens that share the same stem. By definition, the query tokens are included in that disjunction.

The "with stemming" option specifies that matches may contain tokens that have the same stem as the tokens and phrases written in the query. It is implementation-defined what a stem of a token is.

The "without stemming" option specifies that the tokens and phrases are not stemmed.

It is implementation-defined whether the stemming is based on an algorithm, dictionary, or mixed approach.

The default is "without stemming".

/book[@number="1"]/title ftcontains "improve" with stemming 

returns true, because the title of the specified book contains "improving" which has the same stem as "improve".

3.2.4 FTThesaurusOption

[158]    FTThesaurusOption    ::=    ("with" "thesaurus" (FTThesaurusID | "default"))
| ("with" "thesaurus" "(" (FTThesaurusID | "default") ("," FTThesaurusID)* ")")
| ("without" "thesaurus")
[159]    FTThesaurusID    ::=    "at" StringLiteral ("relationship" StringLiteral)? (FTRange "levels")?

FTThesaurusOption modifies token and phrase matching by specifying whether a thesaurus is used or not. If thesauri are used, it locates the thesauri by default or URI reference. It also states the relationship to be applied and how many levels within the thesaurus to be traversed.

FTThesaurusOption influences the way FTWords is applied.

The StringLiteral following the keyword at in FTThesaurusID is of the form of a URI Reference.

Thesauri add related tokens and phrases to the search. Thus, the user may narrow, broaden, or otherwise modify the search using synonyms, hypernyms (more generic terms), etc. The search is performed as though the user has specified all related search tokens and phrases in a disjunction (FTOr).

Note:

A thesaurus may be standards-based or locally-defined. It may be a traditional thesaurus, or a taxonomy, soundex, ontology, or topic map. How the thesaurus is represented is implementation-dependent.

FTThesaurusID specifies the relationship sought between tokens and phrases written in the query and terms in the thesaurus and the number of levels to be queried in hierarchical relationships by including an FTRange "levels". If no levels are specified, the default is to query all levels in hierarchical relationships.

Relationships include, but are not limited to, the relationships and their abbreviations presented in [ISO 2788] and their equivalents in other languages:

  1. equivalence relationships (synonyms): PREFERRED TERM (USE), NONPREFERRED USED FOR TERM (UF);

  2. hierarchical relationships: BROADER TERM (BT), NARROWER TERM (NT), BROADER TERM GENERIC (BTG), NARROWER TERM GENERIC (NTG), BROADER TERM PARTITIVE (BTP), NARROWER TERM PARTITIVE (NTP), TOP TERM (TT); and

  3. associative relationships: RELATED TERM (RT).

The "with thesaurus" option specifies that string matches include tokens that can be found in one of the specified thesauri.

The "without thesaurus" option specifies that no thesaurus will be used.

The "with thesaurus default" option specifies that a system-defined default thesaurus with a system-defined relationship is used. The default thesaurus may be used in combination with other explicitly specified thesauri.

The default is "without thesaurus".

count(.//book/content ftcontains "duties" with
thesaurus at "http://bstore1.example.com/UsabilityThesaurus.xml"
relationship "synonyms")>0

returns true, because it finds a content element containing "tasks" which the thesaurus identified as a synonym for "duties".

doc("http://bstore1.example.com/full-text.xml")
/books/book[count(./content ftcontains "web site components" with
thesaurus at "http://bstore1.example.com/UsabilityThesaurus.xml"
relationship "narrower terms" at most 2 levels)>0]

returns book elements, because it finds a content element containing "web site components", and narrower terms "navigation" and "layout".

doc("http://bstore1.example.com/full-text.xml")
/books/book[count(. ftcontains "Merrygould" with thesaurus at
"http://bstore1.example.com/UsabilitySoundex.xml" relationship
"sounds like")>0]

returns a book element containing "Marigold", which sounds like "Merrygould".

3.2.5 FTStopwordOption

[160]    FTStopwordOption    ::=    ("with" "stop" "words" FTRefOrList FTInclExclStringLiteral*)
| ("without" "stop" "words")
| ("with" "default" "stop" "words" FTInclExclStringLiteral*)
[161]    FTRefOrList    ::=    ("at" StringLiteral)
| ("(" StringLiteral ("," StringLiteral)* ")")
[162]    FTInclExclStringLiteral    ::=    ("union" | "except") FTRefOrList

FTStopWordOption controls word matching by specifying whether stop words are used or not. It can be used to define a set of tokens that will be replaced with a search on any token if used as search tokens.

FTStopWordOption influences the way FTWords is applied.

FTRefOrList specifies the list of stop words either explicitly as a comma-separated list of string literals, or by a URI following the keyword at. If a URI is used, it must point to a sequence of string atoms or nodes of type "xs:string". In both cases, no tokenization is performed on the strings: they are used as they occur in the sequence.

The "with stop words" option specifies that if a token is within the specified collection of stop words, it is removed from the search and any token may be substituted for it. Stop words retain their position numbers and are counted in FTDistance and FTWindow searches.

Multiple stop word lists may be combined using "union" or "except". If "union" is specified, every string occurring in the lists specified by the left-hand side or the right-hand side is a stop word. If "except" is specified, only strings occurring in the list specified by the left-hand side but not in the list specified by the right-hand side are stop words.
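
The combination of stop word lists and the substitution of any token for a stop word can be sketched non-normatively in Python. The sketch assumes that "union" and "except" are applied from left to right and that the document has already been tokenized; the names combine_stop_words and phrase_matches are hypothetical.

```python
def combine_stop_words(base, operations):
    # operations: sequence of ("union" | "except", list_of_strings) pairs,
    # assumed here to be applied from left to right.
    result = set(base)
    for op, words in operations:
        if op == "union":
            result |= set(words)
        elif op == "except":
            result -= set(words)
        else:
            raise ValueError("unknown operation: " + op)
    return result

def phrase_matches(doc_tokens, query_tokens, stop_words):
    # A query token that is a stop word matches any document token at its
    # position; document tokens are never treated as stop words.
    n = len(query_tokens)
    return any(
        all(q in stop_words or q == d
            for q, d in zip(query_tokens, doc_tokens[i:i + n]))
        for i in range(len(doc_tokens) - n + 1)
    )
```

Because a stop word is replaced by a placeholder rather than deleted, the query phrase keeps its length, which is consistent with stop words retaining their positions in FTDistance and FTWindow computations.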

The "with default stop words" option specifies that an implementation-defined collection of stop words is used.

The "without stop words" option specifies that no stop words are used. This is equivalent to specifying an empty list of stop words.

The default is "without stop words".

/book[@number="1"]//p ftcontains "propagation of errors"
with stemming with stop words ("a", "the", "of") 

returns true, because the document contains the phrase "propagating few errors".

Note the asymmetry in the stop word semantics: the property of being a stop word is only relevant to query terms, not to document terms. Hence, it is irrelevant for the above-mentioned match whether "few" is a stop word or not, and on the other hand we do not want the query above to match "propagation" followed by 2 stop words, or even a sequence of 3 stop words in the document.

/book[@number="1"]//p ftcontains "propagation of errors" 
with stemming without stop words

returns false, because "of" is not in the p element between "propagating" and "errors".

doc("http://bstore1.example.com/full-text.xml")
/books/book[count(.//content ftcontains "planning then 
conducting" with stop words at 
"http://bstore1.example.com/StopWordList.xml")>0]

uses the stop word list specified at the URL. Assuming that the specified stop word list contains "then", this query is reduced to a query on the phrase "planning X conducting", allowing any token as a substitute for X. It returns a book element, because its content element contains "planning then conducting". It would also have returned the book if the phrase "planning and conducting" or "planning before conducting" had been in its content.

doc("http://bstore1.example.com/full-text.xml")
/books/book[count(.//content ftcontains "planning then conducting"
with stop words at "http://bstore1.example.com/StopWordList.xml"
except ("the", "then"))>0]

returns books containing "planning then conducting", but does not return books containing "planning and conducting", since it exempts "then" from being a stop word.

3.2.6 FTLanguageOption

[163]    FTLanguageOption    ::=    "language" StringLiteral

FTLanguageOption modifies token matching by specifying the language of search tokens and phrases.

FTLanguageOption influences the way FTWords is applied.

The StringLiteral following the keyword language designates one language. It must either be castable to "xs:language", or be the value "none". Otherwise, an error is raised: [err:XPTY0004]XP.

The "language" option influences tokenization, stemming, and stop words.

If the language "none" option is specified, no language is selected.

The set of valid language identifiers is implementation-defined.

By default, there is no language selected.

/book[@number="1"]//editor ftcontains "salon de the"
with default stop words language "fr"

This is an example where the language option is used to select the appropriate stop word list.

3.2.7 FTWildCardOption

[164]    FTWildCardOption    ::=    ("with" "wildcards") | ("without" "wildcards")

FTWildCardOption modifies token and phrase matching by specifying whether wildcards are used or not.

FTWildCardOption influences the way FTWords is applied.

In addition to specifying the "with wildcards" option, indicators (represented by periods (.)) and qualifiers are appended to or inserted into the tokens being searched. Zero or more characters replace each indicator and qualifier.

Indicators are mandatory. When the "with wildcards" option is present, one or more periods (.) must be placed at the beginning or end of tokens or inserted into tokens. If the period is at the beginning of a token, the wildcard is a prefix wildcard. If the period is at the end of a token, it is a suffix wildcard. If the period is inserted into a token, it is an infix wildcard.

When the "with wildcards" option and one or more periods (.) appended to or inserted into tokens are present, characters are appended or inserted at each of the periods. Any characters may be appended or inserted except newline characters (#xA), return characters (#xD), and tab characters (#x9). The number of characters depends on the qualifier. Qualifiers available are none, question mark, asterisk, plus sign, and two numbers separated by a comma, both enclosed by curly braces.

  1. If a period is present, but no qualifiers, one character is appended or inserted.

  2. If a period is followed by a question mark (.?), zero or one characters are appended or inserted.

  3. If a period is followed by an asterisk (.*), zero or more characters are appended or inserted.

  4. If a period is followed by a plus sign (.+), one or more characters are appended or inserted.

  5. If a period is followed by two numbers separated by a comma, both enclosed by curly braces (.{n,m}), a specified range of characters is appended or inserted.

The "without wildcards" option finds tokens without recognizing wildcard indicators and qualifiers. Periods, question marks, asterisks, plus signs, and two numbers separated by a comma and enclosed in curly braces are recognized as regular characters.

The default is "without wildcards".

/book[@number="1"]/title ftcontains "improv.*" with
wildcards

returns true, because the title element contains "improving".

/book[@number="1"]/title ftcontains ".?site" with
wildcards

returns true, because the title element contains "site".

/book[@number="1"]/p ftcontains "w.ll" with
wildcards

returns true, because the p element contains "well".
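The indicator and qualifier rules above map closely onto regular expressions. The following is a minimal Python sketch of that mapping, not part of this specification: the function names `wildcard_to_regex` and `matches` are hypothetical, whole-token matching is assumed, and case-insensitive comparison is an assumption made for the illustration.

```python
import re

# A wildcard indicator stands for any character except
# newline (#xA), carriage return (#xD), and tab (#x9).
ANY_CHAR = r"[^\n\r\t]"

# An indicator (.) optionally followed by one qualifier: ?, *, + or {n,m}.
WILDCARD = re.compile(r"\.(\?|\*|\+|\{\d+,\d+\})?")


def wildcard_to_regex(token):
    """Translate a query token written with wildcard syntax into a regex."""
    parts = []
    pos = 0
    for m in WILDCARD.finditer(token):
        # Literal text before the indicator is matched verbatim.
        parts.append(re.escape(token[pos:m.start()]))
        # No qualifier means exactly one character.
        parts.append(ANY_CHAR + (m.group(1) or ""))
        pos = m.end()
    parts.append(re.escape(token[pos:]))
    # Case-insensitive matching is an assumption of this sketch.
    return re.compile("".join(parts), re.IGNORECASE)


def matches(query_token, doc_token, wildcards=False):
    """Compare one query token with one document token."""
    if wildcards:
        return wildcard_to_regex(query_token).fullmatch(doc_token) is not None
    # "without wildcards": periods and qualifiers are regular characters.
    return query_token.lower() == doc_token.lower()
```

Under these assumptions, `matches("improv.*", "improving", wildcards=True)` and `matches("w.ll", "well", wildcards=True)` both hold, while `matches("w.ll", "well")` does not, because the period is then an ordinary character.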

3.3 FTIgnoreOption

[174]    FTIgnoreOption    ::=    "without" "content" UnionExpr

FTIgnoreOption specifies a set of element nodes whose content is ignored. [Definition: Ignored nodes are the set of element nodes whose content is ignored.] Ignored nodes are identified by the XQuery expression UnionExpr. Let N1, N2, ..., Nk be the sequence of nodes of the search context. The expression UnionExpr is evaluated in the context of each node Ni being searched. That is, the search context expression of the ftcontains predicate creates a new focus for the evaluation of the UnionExpr given with FTIgnoreOption, similar to the creation of the dynamic context of a path expression E1/E2 or a filter expression E1[E2] (see Section 2.1.2 Dynamic ContextXQ).

Now, let I1, I2, ..., In be the sequence of items that UnionExpr evaluates to. For each Ni (i=1..k) a copy is made that omits each node Ij (j=1..n) that is not Ni. Those copies form the new search context. If UnionExpr evaluates to an empty sequence no nodes are omitted.

In the following fragment, if .//annotation is ignored, "Web Usability" will be found 2 times: once in the title element and once in the editor element. The 2 occurrences in the 2 annotation elements are ignored. On the other hand, "expert" will not be found, as it appears only in an annotation element.

<book>
   <title>Web Usability and Practice</title>
   <author>Montana <annotation> this author is an expert in Web Usability</annotation> 
           Marigold
   </author>
   <editor>Véra Tudor-Medina on Web <annotation> best editor on Web Usability</annotation>
           Usability
   </editor>
 </book>

By default, no element content is ignored.
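The copy-and-omit step can be sketched in Python with the standard `xml.etree.ElementTree` module. This is only an illustration under stated assumptions: the helper names `copy_without` and `count_phrase` are hypothetical, the ignored nodes are selected by element name rather than by evaluating a UnionExpr, and tokens are taken to be runs of word characters compared case-insensitively.

```python
import copy
import re
import xml.etree.ElementTree as ET

BOOK = """<book>
   <title>Web Usability and Practice</title>
   <author>Montana <annotation> this author is an expert in Web Usability</annotation>
           Marigold
   </author>
   <editor>Véra Tudor-Medina on Web <annotation> best editor on Web Usability</annotation>
           Usability
   </editor>
</book>"""


def copy_without(node, ignored_tag):
    """Return a copy of node that omits every descendant named ignored_tag."""
    clone = copy.deepcopy(node)
    for parent in list(clone.iter()):
        previous = None
        for child in list(parent):
            if child.tag != ignored_tag:
                previous = child
                continue
            # ElementTree stores the text after an element in its .tail,
            # so that text must be re-attached before the element is dropped.
            tail = child.tail or ""
            if previous is None:
                parent.text = (parent.text or "") + tail
            else:
                previous.tail = (previous.tail or "") + tail
            parent.remove(child)
    return clone


def count_phrase(node, phrase):
    """Count occurrences of phrase as consecutive tokens of node's text."""
    words = phrase.lower().split()
    toks = re.findall(r"\w+", "".join(node.itertext()).lower())
    return sum(toks[i:i + len(words)] == words
               for i in range(len(toks) - len(words) + 1))


book = ET.fromstring(BOOK)
pruned = copy_without(book, "annotation")
print(count_phrase(pruned, "Web Usability"))
print(count_phrase(pruned, "expert"))
```

The original tree is left untouched; only the copy that forms the new search context loses the annotation elements, which is why "Web" and "Usability" become adjacent in the editor element.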

4 Semantics

This section describes the formal semantics of XQuery 1.0 and XPath 2.0 Full-Text. The figure below shows how XQuery 1.0 and XPath 2.0 Full-Text integrates with XQuery 1.0 and XPath 2.0.

The following diagram represents the interaction of XQuery 1.0 and XPath 2.0 Full-Text with the rest of the XQuery 1.0 and XPath 2.0 languages. It specifies how full-text expressions can be nested within XQuery 1.0 and XPath 2.0 expressions and vice versa.

Composability diagram

The functions and schemas defined in this section are considered to be within the fts: namespace. These functions and schemas are used only for describing the semantics. There is no requirement that these functions and schemas be implemented, so no URI is associated with the fts: prefix.

4.1 Tokenization

[Definition: Formally, tokenization is the process of converting the string value of a node to a sequence of token occurrences, taking the structural information of the node into account to identify token, sentence, and paragraph boundaries.]

Tokenization is subject to the following constraint:

  1. Attribute values are not tokenized.

4.1.1 Examples

The following document fragment is the source document for examples in this section. Tokenization is implementation-defined. A sample tokenization is used for the examples in this section. The results might be different for other tokenizations.

Unless stated otherwise, the results assume a case-insensitive match.

<offers>
    <offer id="1000" price="10000">
        Ford Mustang 2000, 65K, excellent condition, runs 
        great, AC, CC, power all
    </offer>
    <offer id="1001" price="8000">
        Honda Accord 1999, 78K, A/C, cruise control, runs 
        and looks great, excellent condition
    </offer>
    <offer id="1005" price="5500">
        Ford Mustang, 1995, 150K highway mileage, no rust, 
        excellent condition
    </offer>
</offers>
        

In this sample tokenization, tokens are delimited by punctuation and whitespace symbols.

  • The token "Ford" is at relative position 1.

  • The token "Mustang" is at relative position 2.

  • The token "2000" is at relative position 3.

  • Relative position numbers are assigned sequentially through the end of the document.

Hence each token occupies exactly one position, and no overlapping of tokens occurs. The relative positions of token occurrences are shown below in parentheses.

<offers>
    <offer id="1000" price="10000">
        Ford(1) Mustang(2) 2000(3), 65K(4), excellent(5)
        condition(6), runs(7) great(8), AC(9), CC(10), 
        power(11) all(12)
    </offer>
    <offer id="1001" price="8000">
        Honda(13) Accord(14) 1999(15), 78K(16), A(17)/C(18),
        cruise(19) control(20), runs(21) and(22) looks(23)
        great(24), excellent(25) condition(26)
    </offer>
    <offer id="1005" price="5500">
        Ford(27) Mustang(28), 1995(29), 150K(30) highway(31)
        mileage(32), no(33) rust(34), excellent(35) 
        condition(36)
    </offer>
</offers>
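
The position numbering of this sample tokenization can be reproduced with a short Python sketch. It is only an illustration: the names `tokenize` and `positions_of` are hypothetical, tokens are assumed to be runs of ASCII letters and digits, and attribute values are skipped simply because `itertext()` yields element text only.

```python
import re
import xml.etree.ElementTree as ET

OFFERS = """<offers>
    <offer id="1000" price="10000">
        Ford Mustang 2000, 65K, excellent condition, runs
        great, AC, CC, power all
    </offer>
    <offer id="1001" price="8000">
        Honda Accord 1999, 78K, A/C, cruise control, runs
        and looks great, excellent condition
    </offer>
    <offer id="1005" price="5500">
        Ford Mustang, 1995, 150K highway mileage, no rust,
        excellent condition
    </offer>
</offers>"""


def tokenize(root):
    """Return (relative position, token) pairs in document order."""
    # itertext() visits text nodes only, so attribute values are not tokenized.
    text = "".join(root.itertext())
    # Punctuation and whitespace delimit tokens in this sample tokenization.
    return list(enumerate(re.findall(r"[A-Za-z0-9]+", text), start=1))


def positions_of(tokens, word):
    """Relative positions at which word occurs, ignoring case."""
    return [pos for pos, tok in tokens if tok.lower() == word.lower()]


tokens = tokenize(ET.fromstring(OFFERS))
print(positions_of(tokens, "Mustang"))
```

Other tokenizers may split "A/C" or "65K" differently, which would shift every later position.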
        

The relative positions of paragraphs are determined similarly. In this sample tokenization, the paragraph delimiters are start tags, end tags, and end of line characters.

  • The tokens in the first element are assigned relative paragraph number 1.

  • The tokens from the next element are assigned relative paragraph number 2.

  • Relative paragraph numbers are assigned sequentially through the end of the document.

The relative positions of sentences are determined similarly using sentence delimiters.

Implementations may provide the means to ignore or side-step certain structural elements when performing tokenization. In the following example, the implementation has decided to ignore the markup for <bold> and prune out the entire subtree headed by <deleted>.

<para><deleted>This sentence was deleted.</deleted>
This <bold>entire paragraph</bold> is one sentence
as far as the tokenizer is concerned.
</para>

Using the same notation as before, this sample tokenization is shown below. All the token occurrences marked with a token position also have the same sentence and paragraph relative positions. Note that there are no tokens marked for the ignored subtree.

<para><deleted>This sentence was deleted.</deleted>
This(1) <bold>entire(2) paragraph(3)</bold> is(4) one(5) sentence(6)
as(7) far(8) as(9) the(10) tokenizer(11) is(12) concerned(13).
</para>

4.1.2 Representations of Tokenized Text and Matching

Two representations of tokenized text will be employed in the formal semantics functions, one for the search strings of a query and one for matched token occurrences of search context items.

A [Definition: SearchItem is a sequence of SearchTokenInfos representing the sequence of tokens derived from tokenizing one search string. ]

A [Definition: SearchTokenInfo is the identity of a token inside a search string. ] Each SearchTokenInfo is associated with a unique identifier that captures the relative position of the search string in the query in document order.

A [Definition: TokenInfo represents a sequence of consecutive token occurrences inside an XML document. ] Each TokenInfo is associated with:

  • a unique identifier that captures the relative position of the first token occurrence of the sequence in the document order: startPos

  • a unique identifier that captures the relative position of the last token occurrence of the sequence in the document order: endPos

  • the relative position of the sentence containing the first token occurrence or zero if the tokenizer does not report sentences: startSent

  • the relative position of the sentence containing the last token occurrence or zero if the tokenizer does not report sentences: endSent

  • the relative position of the paragraph containing the first token occurrence or zero if the tokenizer does not report paragraphs: startPara

  • the relative position of the paragraph containing the last token occurrence or zero if the tokenizer does not report paragraphs: endPara

The following matching function is the central implementation-defined primitive performing the full-text retrieval.

declare function fts:matchTokenInfos (
      $searchContext as item(),
      $matchOptions as element(fts:matchOptions),
      $stopWords as xs:string*,
      $searchTokens as element(fts:searchToken)* )
   as element(fts:tokenInfo)*  external;
            

The above function returns the TokenInfos in items in $searchContext that match the search string represented by the sequence $searchTokens, when using the match options in $matchOptions and stop words in $stopWords. If $searchTokens is a sequence of more than one search token, each returned TokenInfo must represent a phrase matching that sequence.

Note:

While this matching function assumes a tokenized representation of the search strings, it does not assume a tokenized representation of the input items in $searchContext, i.e. the texts in which the search happens. Hence, the tokenization of the search context is implicit in this function and coupled to the retrieval of matches. Of course, this does not imply that tokenization of the search context cannot be done a priori. Because tokenization is implementation-defined, the tokenization of each item in $searchContext does not necessarily take into account the match options in $matchOptions or the search tokens in $searchTokens. This allows implementations to tokenize and index input data without the knowledge of particular match options used in full-text queries.
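To make the shape of the result concrete, here is a heavily simplified Python sketch of such a matching primitive. It is not the implementation-defined function itself: the name `match_token_infos` and the snake_case field names are hypothetical, the search context is assumed to be already tokenized, match options and stop words are ignored, comparison is case-insensitive, and sentence and paragraph positions are left at zero as if the tokenizer did not report them.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TokenInfo:
    """A run of consecutive token occurrences, by relative position."""
    start_pos: int
    end_pos: int
    start_sent: int = 0   # zero: sentences not reported by the tokenizer
    end_sent: int = 0
    start_para: int = 0   # zero: paragraphs not reported by the tokenizer
    end_para: int = 0


def match_token_infos(doc_tokens: List[str],
                      search_tokens: List[str]) -> List[TokenInfo]:
    """Return one TokenInfo per occurrence of search_tokens as a phrase."""
    wanted = [t.lower() for t in search_tokens]
    n = len(wanted)
    found: List[TokenInfo] = []
    if n == 0:
        return found
    for i in range(len(doc_tokens) - n + 1):
        window = [t.lower() for t in doc_tokens[i:i + n]]
        if window == wanted:
            # Relative positions are 1-based; a phrase covers an interval.
            found.append(TokenInfo(start_pos=i + 1, end_pos=i + n))
    return found


doc = ["Ford", "Mustang", "2000", "Ford", "Mustang", "1995"]
print(match_token_infos(doc, ["Ford", "Mustang"]))
```

A single search token yields TokenInfos whose start and end positions coincide, while a multi-token search string yields one TokenInfo per phrase occurrence, spanning the whole phrase.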

4.2 Evaluation of FTSelections

The sequence of nodes in the XQuery 1.0 and XPath 2.0 Data Model is inadequate to support fully composable FTSelections. Full-text operations, such as FTSelections, operate on linguistic units, such as positions of tokens, which are not captured in the XQuery 1.0 and XPath 2.0 Data Model (XDM).

XQuery 1.0 and XPath 2.0 Full-Text adds relative token, sentence, and paragraph position numbers via AllMatches. AllMatches make FTSelections fully composable.

4.2.1 AllMatches

4.2.1.1 Formal Model

[Definition: An AllMatches describes the possible results of an FTSelection.] The UML Static Class diagram of AllMatches is shown in the diagram below.

AllMatches class diagram

The AllMatches object contains zero or more Matches.

[Definition: Each Match describes one result to the FTSelection.] The result is described in terms of zero or more StringIncludes and zero or more StringExcludes.

[Definition: A StringMatch is a possible match of a sequence of search tokens with a corresponding sequence of consecutive token occurrences in a document. A StringMatch may be a StringInclude or StringExclude.] The queryPos attribute specifies the position of the search token in the query. This attribute is needed for FTOrders. The matched document token sequence is described in the TokenInfo associated with the StringMatch.

[Definition: A StringInclude is a StringMatch that describes a TokenInfo that must be contained in the document.]

[Definition: A StringExclude is a StringMatch that describes a TokenInfo that must not be contained in the document.]

Intuitively, AllMatches specifies the TokenInfos that a node contains and does not contain to satisfy an FTSelection.

The AllMatches structure resembles the Disjunctive Normal Form (DNF) in propositional and first-order logic. The AllMatches is a disjunction of Matches. Each Match is a conjunction of StringIncludes and StringExcludes.
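
The DNF reading can be illustrated with a small Python sketch. It is a simplification for exposition only: the names `Match`, `match_holds`, and `all_matches_holds` are hypothetical, a TokenInfo is reduced to a (startPos, endPos) pair, and a node is modelled as the set of such pairs it contains. The positions used are illustrative values.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple

Span = Tuple[int, int]  # (startPos, endPos) of a TokenInfo


@dataclass
class Match:
    includes: List[Span] = field(default_factory=list)  # StringIncludes
    excludes: List[Span] = field(default_factory=list)  # StringExcludes


def match_holds(match: Match, present: Set[Span]) -> bool:
    """Conjunction: every include is present and no exclude is present."""
    return (all(span in present for span in match.includes)
            and not any(span in present for span in match.excludes))


def all_matches_holds(matches: List[Match], present: Set[Span]) -> bool:
    """Disjunction: at least one Match holds."""
    return any(match_holds(m, present) for m in matches)


# Two alternative results, each requiring one token and forbidding another.
all_matches = [
    Match(includes=[(2, 2)], excludes=[(34, 34)]),
    Match(includes=[(28, 28)], excludes=[(34, 34)]),
]
print(all_matches_holds(all_matches, {(2, 2)}))
print(all_matches_holds(all_matches, {(2, 2), (34, 34)}))
```

An AllMatches with no Matches is an empty disjunction and therefore never holds, whereas a Match with no StringIncludes and no StringExcludes is an empty conjunction and always holds.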

4.2.1.2 Examples

Since in most of the examples below the tokens span only a single position, we characterize the TokenInfo instance by simply giving this position, written as "Pos:X". This should be read as the value of both the startPos and the endPos attributes. Furthermore, for expository reasons, each StringMatch example includes an attribute "query string", set to the original query string, to show which query string the match came from.

The simplest example of an FTSelection is an FTWords such as "Mustang". The AllMatches corresponding to this FTWords is given below.

Sample AllMatches

As shown, the AllMatches consists of two Matches. Each Match represents one possible result of the FTWords "Mustang". The result represented by the first Match, represented as a StringInclude, contains the token "Mustang" at position 2. The result described by the second Match contains the token "Mustang" at position 28.

A more complex example of an FTSelection is an FTWords such as "Ford Mustang". The AllMatches for this FTWords is given below.

Sample AllMatches

There are two possible results for this FTWords, and these are represented by the two Matches. Each of the Matches requires two tokens to be matched. The first Match is obtained by matching "Ford" at position 1 and matching "Mustang" at position 2. Similarly, the second Match is obtained by matching "Ford" at position 27 and "Mustang" at position 28.

An even more complex example of an FTSelection is an FTSelection such as "Mustang" && ! "rust" that searches for "Mustang" but not "rust". The AllMatches for this FTSelection is given below.

Sample AllMatches

This example introduces StringExclude. StringExclude corresponds to negation in DNF. It specifies that the result described by the corresponding Match must not match the token at the specified position. In this example, the first Match specifies that "Mustang" is matched at position 2, and that the token "rust" at position 34 is not matched.

4.2.1.3 XML representation

AllMatches has a well-defined hierarchical structure. Therefore, the AllMatches can be easily modeled in XML. This XML representation and those which follow formally describe the semantics of FTSelections. For example, the XML representation of AllMatches formally specifies how an FTSelection operates on zero or more AllMatches to produce a resulting AllMatches.

The XML schema for representing AllMatches is given below.

<xs:schema 
     xmlns:xs="http://www.w3.org/2001/XMLSchema" 
     xmlns:fts="http://www.w3.org/2006/xquery-full-text"
     targetNamespace="http://www.w3.org/2006/xquery-full-text"
     elementFormDefault="qualified" 
     attributeFormDefault="unqualified">

  <xs:complexType name="AllMatches">
    <xs:sequence>
      <xs:element ref="fts:match" 
                  minOccurs="0" 
                  maxOccurs="unbounded"/>
    </xs:sequence>
    <xs:attribute name="stokenNum" type="xs:integer" use="required" />
  </xs:complexType>

  <xs:element name="allMatches" type="fts:AllMatches"/>

  <xs:complexType name="Match">
    <xs:sequence>
      <xs:element ref="fts:stringInclude" 
                  minOccurs="0" 
                  maxOccurs="unbounded"/>
      <xs:element ref="fts:stringExclude" 
                  minOccurs="0" 
                  maxOccurs="unbounded"/>
    </xs:sequence>
  </xs:complexType>
  
  <xs:element name="stringInclude" 
              type="fts:StringMatch" />

  <xs:element name="stringExclude" 
              type="fts:StringMatch" />

  <xs:element name="match" type="fts:Match"/>

  <xs:complexType name="StringMatch">
    <xs:sequence>
      <xs:element ref="fts:tokenInfo"/>
    </xs:sequence>
    <xs:attribute name="queryPos" 
                  type="xs:integer" 
                  use="required"/>
  </xs:complexType>

  <xs:complexType name="TokenInfo">
    <xs:attribute name="startPos" 
                  type="xs:integer" 
                  use="required"/>
    <xs:attribute name="endPos" 
                  type="xs:integer" 
                  use="required"/>
    <xs:attribute name="startSent" 
                  type="xs:integer" 
                  use="required"/>
    <xs:attribute name="endSent" 
                  type="xs:integer" 
                  use="required"/>
    <xs:attribute name="startPara" 
                  type="xs:integer" 
                  use="required"/>
    <xs:attribute name="endPara" 
                  type="xs:integer" 
                  use="required"/>
  </xs:complexType>

  <xs:element name="tokenInfo" type="fts:TokenInfo"/>

  <xs:complexType name="SearchItem">
    <xs:sequence>
      <xs:element ref="fts:searchToken" 
                  minOccurs="0" 
                  maxOccurs="unbounded"/>
    </xs:sequence>
  </xs:complexType>

  <xs:complexType name="SearchTokenInfo">
    <xs:attribute name="word" 
                  type="xs:string" 
                  use="required"/>
    <xs:attribute name="queryPos" 
                  type="xs:integer" 
                  use="required"/>
  </xs:complexType>

  <xs:element name="searchToken" type="fts:SearchTokenInfo"/>
</xs:schema>
                

The stokenNum attribute in AllMatches is related to the representation of the semantics as XQuery functions.
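
As an illustration of what an instance of this schema might look like, the following Python sketch builds the AllMatches for an FTWords such as "Mustang" matched at positions 2 and 28 and serializes it. Several values are assumptions made only for the example: the helper name `build_all_matches` is hypothetical, stokenNum and queryPos are set to the placeholder value 1, and all sentence and paragraph positions are 0, as if the tokenizer reported neither.

```python
import xml.etree.ElementTree as ET

FTS_NS = "http://www.w3.org/2006/xquery-full-text"
ET.register_namespace("fts", FTS_NS)


def q(local_name):
    """Qualify a local name with the fts namespace (Clark notation)."""
    return "{%s}%s" % (FTS_NS, local_name)


def build_all_matches(positions, query_pos=1, stoken_num=1):
    """One fts:match with a single fts:stringInclude per token position."""
    root = ET.Element(q("allMatches"), {"stokenNum": str(stoken_num)})
    for pos in positions:
        match = ET.SubElement(root, q("match"))
        include = ET.SubElement(match, q("stringInclude"),
                                {"queryPos": str(query_pos)})
        # A single-token match: startPos and endPos coincide.
        ET.SubElement(include, q("tokenInfo"), {
            "startPos": str(pos), "endPos": str(pos),
            "startSent": "0", "endSent": "0",   # placeholder values
            "startPara": "0", "endPara": "0",   # placeholder values
        })
    return root


doc = build_all_matches([2, 28])
print(ET.tostring(doc, encoding="unicode"))
```

Attributes are left unqualified and elements are namespace-qualified, in line with the attributeFormDefault and elementFormDefault settings of the schema.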