SRU (Search/Retrieval Using URL)

SRU VERSION 1.1 ARCHIVE

Common Query Language

CQL Version 1.1 13th February 2004

Sample Queries - BNF - Rules - Features - Conformance - Context Sets - the CQL Context Set - Relations - Modifiers - Masking - Result Sets - Proximity

CQL is a formal language for representing queries to information retrieval systems such as web indexes, bibliographic catalogs and museum collection information. The design objective is that queries be human readable and writable, and that the language be intuitive while maintaining the expressiveness of more complex languages.

Traditionally, query languages have fallen into two camps: powerful, expressive languages that are not easily readable or writable by non-experts (e.g. SQL, PQF, and XQuery); or simple and intuitive languages not powerful enough to express complex concepts (e.g. CCL and Google). CQL tries to combine simplicity and intuitiveness of expression for simple, everyday queries, with the richness of more expressive languages to accommodate complex concepts when necessary.


Sample Queries

Following are examples of simple CQL queries. These are all self-explanatory:

dinosaur
"complete dinosaur"
title = "complete dinosaur"
title exact "the complete dinosaur"
dinosaur or bird
dinosaur and "ice age"
dinosaur not reptile
dinosaur and bird or dinobird
(bird or dinosaur) and (feathers or scales)
"feathered dinosaur" and (yixian or jehol)
publicationYear < 1980
lengthOfFemur > 2.4
bioMass >= 100

The following are a bit more complicated:

Example                                         Explanation

title all "complete dinosaur"                   Title contains all of the words "complete" and "dinosaur"
title any "dinosaur bird reptile"               Title contains any of the words "dinosaur", "bird", or "reptile"
(caudal or dorsal) prox vertebra                A proximity query: either "caudal" or "dorsal" near "vertebra"
ribs prox/distance<=5 chevrons                  A more specific proximity query: "ribs" within 5 words of "chevrons"
ribs prox/unit=sentence chevrons                "ribs" in the same sentence as "chevrons"
ribs prox/distance>0/unit=paragraph chevrons    "ribs" and "chevrons" occurring in the same document in different paragraphs
subject any/relevant "fish frog"                Find documents that would seem relevant either to "fish" or "frog"
subject any/rel.lr "fish frog"                  Same as previous, but use a specific relevance algorithm (linear regression)


Formal Definition: CQL BNF

Following is the Backus Naur Form (BNF) definition for CQL. ["::=" represents "is defined as"]

cqlQuery         ::= prefixAssignment cqlQuery | scopedClause
prefixAssignment ::= '>' prefix '=' uri | '>' uri
scopedClause     ::= scopedClause booleanGroup searchClause | searchClause
booleanGroup     ::= boolean [modifierList]
boolean          ::= 'and' | 'or' | 'not' | 'prox'
searchClause     ::= '(' cqlQuery ')'
                   | index relation searchTerm
                   | searchTerm
relation         ::= comparitor [modifierList]
comparitor       ::= comparitorSymbol | namedComparitor
comparitorSymbol ::= '=' | '>' | '<' | '>=' | '<=' | '<>'
namedComparitor  ::= identifier
modifierList     ::= modifierList modifier | modifier
modifier         ::= '/' modifierName [comparitorSymbol modifierValue]
prefix, uri, modifierName, modifierValue, searchTerm, index
                 ::= term
term             ::= identifier | 'and' | 'or' | 'not' | 'prox'
identifier       ::= charString1 | charString2

charString1 ::= Any sequence of characters that does not include any of the following:

whitespace
( (open parenthesis)
) (close parenthesis)
=
<
>
'"' (double quote)
/

If the final sequence is a reserved word, that token is returned instead. Note that '.' (period) may be included, and a sequence of digits is also permitted. Reserved words are 'and', 'or', 'not', and 'prox' (case insensitive). When a reserved word is used in a search term, case is preserved.

charString2 ::= Double quotes enclosing a sequence of any characters except double quote (unless preceded by a backslash, \). Backslash escapes the character following it. The resultant value includes all backslash characters except those releasing a double quote (this allows other systems to interpret the backslash character). The surrounding double quotes are not included.
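
The two character-string rules can be illustrated with a small tokenizer. The following Python sketch is not part of the specification: the names TOKEN_RE, unescape_quotes and tokenize, and the (kind, value) token shape, are assumptions chosen for illustration. It splits a query into quoted strings (charString2), symbols, reserved words and bare words (charString1), and removes only those backslashes that release a double quote.

```python
import re

# One alternative per token class; whitespace is matched only to be skipped.
TOKEN_RE = re.compile(r'''
    (?P<space>\s+)
  | "(?P<quoted>(?:\\.|[^"\\])*)"     # charString2
  | (?P<symbol><=|>=|<>|[()=<>/])     # comparison symbols, parentheses, slash
  | (?P<bare>[^\s()=<>"/]+)           # charString1
''', re.VERBOSE | re.DOTALL)

RESERVED = {"and", "or", "not", "prox"}


def unescape_quotes(text):
    # Keep every backslash except one that releases a double quote.
    return re.sub(r'\\(.)',
                  lambda m: '"' if m.group(1) == '"' else m.group(0),
                  text, flags=re.DOTALL)


def tokenize(query):
    """Return a list of (kind, value) pairs for a CQL query string."""
    tokens, pos = [], 0
    while pos < len(query):
        m = TOKEN_RE.match(query, pos)
        if m is None:  # e.g. an unterminated double quote
            raise ValueError(f"cannot tokenize at offset {pos}")
        pos = m.end()
        if m.lastgroup == "space":
            continue
        value = m.group(m.lastgroup)
        if m.lastgroup == "quoted":
            tokens.append(("string", unescape_quotes(value)))
        elif m.lastgroup == "symbol":
            tokens.append(("symbol", value))
        elif value.lower() in RESERVED:
            tokens.append(("reserved", value))  # case is preserved
        else:
            tokens.append(("word", value))
    return tokens
```

A parser built on this would still have to decide, from position, whether a reserved word is acting as a boolean operator or as a search term.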


General Rules

  1. CQL Query
    A CQL query is essentially a search clause, or multiple search clauses connected by boolean operators. (In addition it may include prefix assignments which assign short names to known contexts. See context sets.)

  2. Search Clause
    A search clause consists of an index, relation, and search term, or a search term alone. Thus every search clause has a search term, but the index and relation may be omitted; the clause must include either both or neither of them. (Note that the use of the "index" concept in CQL is not intended to have any implementation implications; it does not imply the presence of a physical index.)

    Examples:
        Index/relation/search term:  title = cat
        Search term only:            cat

  3. Search Term
    Search terms may be enclosed in double quotes. Search terms must be enclosed in double quotes if they contain any of the following characters: < > = / ( ) and whitespace. The search term may be empty, but must be present in a search clause. An empty search term is expressed as "" and has no defined semantics.

  4. Index Name
    An index name always includes a base name and may also include a prefix. The prefix provides a context for the index name: it names the context set of which the index is a part. If the context is not supplied, it is determined by the server. If the index is not supplied, it is determined by the server. (Note that the index may be omitted only when the relation is also omitted. Either both must be supplied, or both omitted.)

    Examples:
        title = cat       context determined by the server
        dc.title = cat    index context is dc
        cat               context and index determined by the server

  5. Relation
    The relation in a search clause specifies the relationship between the index and search term. Like an index name, it always includes a base name and may also include a prefix providing a context for the relation. If a relation is supplied with no accompanying context, the context is 'cql' (the cql context set). If no relation is supplied, then cql.scr (server choice) is assumed, which means that the relation is determined by the server. (Note that the relation may be omitted only when the index is also omitted. Either both must be supplied, or both omitted.)

    Examples:
        title = cat          context for relation is 'cql'; fully qualified relation is cql.=
        title cql.any cat    relation is 'any'; relation context is 'cql'. Equivalent to: title any cat
        cat                  index and relation are determined by the server (formally the relation is 'cql.scr')

  6. Relation Modifiers
    Relation modifiers may accompany a relation. These also may be accompanied by a context. If a context is not supplied for a modifier, the default is the cql context set. Relation modifiers are separated from each other and from the relation by slashes (/). Whitespace may be present on either side of a / character, but the relation plus modifiers group may not end in a /.

    Examples:
        dc.title any/relevant/rel.CORI "cat fish"
            the relation 'any' is modified by (1) 'relevant' whose context is 'cql' and (2) 'CORI' whose context is 'rel'.
        dc.author exact/stem "smith, j."
            the relation 'exact' is modified by 'stem' whose context is 'cql'.

  7. Boolean Operators
    Search clauses may be linked by boolean operators. These are: and, or, not and prox. (Note that not is really and-not, that is, it may not be used as a unary operator.) Boolean operators all have the same precedence; they are evaluated left-to-right. Parentheses may be used to override left-to-right evaluation.

  8. Boolean Modifiers
    Just as a relation may have modifiers, a boolean operator may have modifiers, separated by '/' characters. Boolean modifiers may come from any context set. If not supplied, the context is the CQL context set. (Note that Boolean operators themselves are limited to the built-in set of four.)

    Example:  dc.title=cat and/rel.sum dc.title=dog

  9. Case Insensitive
    All parts of CQL are case insensitive apart from user supplied search terms, which may or may not be case sensitive. 'OR', 'or', 'Or' and 'oR' are all the same boolean operator, just as 'dc.title', 'DC.Title' and 'dC.TiTLe' are all the same context set plus index name.
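
The left-to-right evaluation described in rule 7 can be sketched with Python sets. This is only an illustration: the function name evaluate and the record ids are hypothetical, each search clause is assumed to have already been resolved to a set of matching record ids, and 'prox' is left out because it needs positional information.

```python
def evaluate(clauses, operators):
    """Combine result sets left to right; every operator has equal precedence."""
    result = set(clauses[0])
    for op, operand in zip(operators, clauses[1:]):
        op = op.lower()  # boolean operators are case insensitive
        if op == "and":
            result &= operand
        elif op == "or":
            result |= operand
        elif op == "not":
            result -= operand  # 'not' is really and-not
        else:
            raise ValueError(f"unsupported operator: {op}")
    return result


# Hypothetical record ids matching each term.
dinosaur = {1, 2, 3}
bird = {3, 4}
dinobird = {5}

# dinosaur and bird or dinobird  is read as  (dinosaur and bird) or dinobird
left_to_right = evaluate([dinosaur, bird, dinobird], ["and", "or"])

# dinosaur and (bird or dinobird)  requires explicit parentheses
grouped = evaluate([dinosaur, evaluate([bird, dinobird], ["or"])], ["and"])
```

With these sample sets the two readings give different results, which is why parentheses matter when mixing operators.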


Additional CQL Features

The following are all formally defined by the CQL context set but described here for convenience.

Relations

For ordered (e.g. numeric) terms:
<, >, <=, >=, and <> mean "less than", "greater than", "less or equal", "greater or equal", and "not equal".

When the term is a list of words:

  •   '=' is used for word adjacency -- the words appear in that order with no others intervening.  (Note the dual use of '=': it is also used for numeric equality.)

  •   'any' means "any of these words"

  •   'all' means "all of these words"

When the term is a character string:
'exact' is used for exact string matching.

When the term has multiple dimensions:
'within' may be used to search for values that fall within the range, area or volume described by the search term.

When the index's data has multiple dimensions:
'encloses' may be used to search for records where the database's term fully encloses the search term.

  Examples:

This query                      Would match this                             But not this

title = "cat in the hat"        "a day in the life of the cat in the hat"    "hat in the cat" or "cat in the green hat"
title all "cat hat"             "hat in the cat"                             "cat in the grass"
title any "cat hat"             "cat in the grass"                           "dog in the grass"
title exact "cat in the hat"    "cat in the hat"                             "a day in the life of the cat in the hat"
date within "2002 2005"         2004                                         2006
dateRange encloses 2003         "2002 2005"                                  "2004 2005"

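The relations above can be approximated in a few lines of Python. This is a rough sketch rather than a definition: the function names are hypothetical, words are assumed to be whitespace-separated, matching is assumed to be case insensitive, and range endpoints for 'within' and 'encloses' are assumed to be inclusive.

```python
def words(text):
    # Assumption: words are separated by whitespace and compared in lower case.
    return text.lower().split()


def rel_any(field, term):
    """'any': at least one word of the term occurs in the field."""
    return any(w in words(field) for w in words(term))


def rel_all(field, term):
    """'all': every word of the term occurs in the field."""
    return all(w in words(field) for w in words(term))


def rel_adjacent(field, term):
    """'=' on a word list: the term's words occur in order, with none intervening."""
    f, t = words(field), words(term)
    return any(f[i:i + len(t)] == t for i in range(len(f) - len(t) + 1))


def rel_exact(field, term):
    """'exact': the whole field equals the term as a single string."""
    return field.lower() == term.lower()


def rel_within(value, term):
    """'within': a single value lies inside the range given by the term."""
    low, high = (float(x) for x in term.split())
    return low <= float(value) <= high


def rel_encloses(field_range, term):
    """'encloses': the range stored in the field contains the term."""
    low, high = (float(x) for x in field_range.split())
    return low <= float(term) <= high
```

For instance, rel_adjacent("a day in the life of the cat in the hat", "cat in the hat") holds, while rel_exact on the same pair does not.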

Relation Modifiers - Term Functions

These relation modifiers request that the server perform some algorithm on the term before processing.

  • stem
    The server should apply a stemming algorithm to the words within the term. For example, walked, walking, walker etc. would all be represented by the stem word walk. This allows a search like title =/stem "these completed dinosaurs" to match The Complete Dinosaur.

  • relevant
    The server should use a relevancy algorithm for determining matches and the order of the result set.

    Example: subject any/relevant "fish frog"
    would find records relevant to "fish" or "frog" and order the result set by relevance to fish or frog.

Relation Modifiers - Qualifiers

These modifiers qualify the relation to more precisely determine its semantics.

  • word
    The term consists of words (rather than being an opaque string).

  • string
    The term is a single item, and should not be broken up.

  • isoDate
    Each item within the term conforms to the ISO 8601 specification for expressing dates.

  • number
    Each item within the term is a number.

  • uri
    Each item within the term is a URI.

  • masked
    This means that the masking rules (see next) apply. Masking is assumed even if not specified, unless 'unmasked' is specified (so there is never any reason to include 'masked').

  • unmasked
    Do not apply masking rules.

Masking Rules

  • A single asterisk (*) is used to mask zero or more characters.

  • A single question mark (?) is used to mask a single character, thus N consecutive question-marks means mask N characters.

  • Caret/hat (^) is used as an anchor character for terms that are word lists, that is, where the relation is 'all' or 'any', or '=' when used for word adjacency. It may not be used to anchor a string, that is, when the relation is 'exact' (string matches are, by definition, anchored). It may occur at the beginning or end of a word (with no intervening space) to mean left or right anchored, respectively. "^" has no special meaning when it occurs within a word (not at the beginning or end) or string, but must be escaped nevertheless.

  • Backslash (\) is used to escape '*', '?', quote (") and '^', as well as itself. The use of a backslash not followed immediately by one of these characters is reserved for future definition.

Masking examples:

  • dc.title = c*t (matches cat and coast etc.)

  • dc.title = c?t (matches cat and cot, not coast or ct)
    " ?" (matches any single character)

  • dc.title = "^cat in the hat" (matches 'cat in the hat' where it is at the beginning of the field)

  • dc.title any "^cat eats rat" (matches 'cat eats rat', 'cat eats dog', 'cat', but not 'rat eats cat')

  • dc.title any "^cat ^dog eats rat" (matches 'cat eats rat', 'dog eats cat', 'cat loves bat', but not 'bat loves cat')

  • dc.title = "\"Of Course\" she said"
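
One way a server might implement the '*' and '?' masking rules is to translate a masked word into a regular expression. The Python sketch below assumes the term has already been taken out of its surrounding quotes; the function name mask_to_regex is hypothetical, and the '^' anchor is deliberately not handled (an unescaped '^' is rejected) to keep the example short.

```python
import re

ESCAPABLE = '*?"^\\'  # characters that a backslash may escape


def mask_to_regex(term):
    """Translate a masked CQL word into a compiled regular expression."""
    out = []
    i = 0
    while i < len(term):
        ch = term[i]
        if ch == "\\":
            if i + 1 >= len(term) or term[i + 1] not in ESCAPABLE:
                raise ValueError("this use of backslash is reserved")
            out.append(re.escape(term[i + 1]))  # escaped character is literal
            i += 2
            continue
        if ch == "*":
            out.append(".*")   # zero or more characters
        elif ch == "?":
            out.append(".")    # exactly one character
        elif ch == "^":
            raise ValueError("anchors are not handled in this sketch")
        else:
            out.append(re.escape(ch))
        i += 1
    return re.compile("".join(out), re.DOTALL)


# Use fullmatch so that the pattern must cover the whole word.
matches_cat = mask_to_regex("c?t").fullmatch("cat") is not None
matches_coast = mask_to_regex("c?t").fullmatch("coast") is not None
```

Here matches_cat is True and matches_coast is False, in line with the c?t example above.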

Result Set Name Used in Query

A search clause may be a result set name. This is a special case, employing the context set 'cql'. The index and relation are expressed as "cql.resultSetId =" and the term is a result set name that has been returned by the server in the 'resultSetName' parameter of the response. It may be used by itself in a query to refer to an existing result set from which records are desired. It may also be used in conjunction with other resultSetName clauses or other indexes, combined by boolean operators. The semantics of resultSetId with relations other than "=" are undefined.

Example: cql.resultSetId = "resultA" and cql.resultSetId = "resultB"

Proximity

The proximity boolean operator is expressed in terms of distance, unit, and ordering.

Examples:

  • dc.title = "cat" prox/distance=1/unit=word dc.title = "in"
  • "cat" prox/distance>2/ordered "hat"

distance takes the form:
     distance [relation] [value]
where relation is one of: "<", ">", "<=", ">=", "=", "<>"; default "<="
and value is a non-negative integer; default: 1 for word, zero otherwise

unit takes the form
     unit=[value]
where value is one of "word", "sentence", "paragraph", or "element" (default "word").

ordering is "ordered" or "unordered"; default "unordered"
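
The defaults for distance, unit and ordering can be made concrete with a small parser for the operator and its modifiers. This Python sketch is an illustration only: the function name parse_prox and the dictionary keys are assumptions, and it accepts just the three modifiers described in this section.

```python
import re

# modifierName, optionally followed by a comparison symbol and a value
MODIFIER_RE = re.compile(r"^\s*(\w+)\s*(?:(<=|>=|<>|=|<|>)\s*(\w+))?\s*$")


def parse_prox(operator):
    """Parse e.g. 'prox/distance>2/ordered' and fill in the defaults."""
    name, *modifiers = operator.split("/")
    if name.strip().lower() != "prox":
        raise ValueError("not a proximity operator")
    parsed = {"unit": "word", "ordering": "unordered"}
    for mod in modifiers:
        m = MODIFIER_RE.match(mod)
        if m is None:
            raise ValueError(f"malformed modifier: {mod!r}")
        key, rel, value = m.group(1).lower(), m.group(2), m.group(3)
        if key == "distance":
            if rel is not None:
                parsed["relation"], parsed["distance"] = rel, int(value)
        elif key == "unit" and value is not None:
            parsed["unit"] = value.lower()
        elif key in ("ordered", "unordered"):
            parsed["ordering"] = key
        else:
            raise ValueError(f"unsupported modifier: {mod!r}")
    # Defaults: relation "<=", distance 1 for words and zero otherwise.
    parsed.setdefault("relation", "<=")
    parsed.setdefault("distance", 1 if parsed["unit"] == "word" else 0)
    return parsed
```

The distance default is applied last because it depends on the unit, which may appear after the distance modifier in the query.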

 


CQL Context Sets

Context sets permit CQL users to create their own indexes, relations, relation modifiers and boolean modifiers without fear of choosing the same name as someone else and thereby having an ambiguous query. All of these four aspects of CQL must come from a context set; however, there are rules for determining the prevailing default if one is not supplied. Context sets allow CQL to be used by communities in ways which the designers could not have foreseen, while still maintaining the same rules for parsing which allow interoperability.

When defining a new context set, it is necessary to provide a description of the semantics of each item within it. While context sets may contain indexes, relations, relation modifiers and boolean modifiers, there is no requirement that all should be present; in fact it is expected that most context sets will only define indexes.

Each context set has a unique identifier, a URI. When sending the context set in a query, a short form is used. These short names may be sent as a mapping within the query itself (see next), or be published by the recipient of the query in some protocol dependent fashion. The prefix 'cql' is reserved for the CQL context set, but authors may wish to recommend a short name for use with their set.

An index, relation, or modifier qualified by a context is represented in the form prefix.value, where prefix is a short name for a unique context set identifier.

Binding Short Name to URI
The binding of short name to URI is defined either within the query or by the server. A prefix map may occur at any place in the query and applies to anything which follows. Example:

>dc="http://www.dublincore.org/" dc.title = "cat"

In the following query:

>a="http:/x.com/y" a.title=cat and (>a="http:/f.com/g" a.title=hat) and a.title=rat

both the "a" in "a.title=cat" and in "a.title=rat" refer to http:/x.com/y, while the "a" in "a.title=hat" refers to http:/f.com/g.
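
The scoping in this example can be traced with a stack of prefix maps. The Python sketch below is a simplification under several assumptions: the query is already split into a flat list of string tokens, only the "> prefix = uri" form is handled, and any token of the form prefix.name whose prefix is currently bound is treated as a qualified name. The function name resolve_prefixes is hypothetical.

```python
def resolve_prefixes(tokens):
    """Return (name, uri) pairs for prefixed names, honouring parenthesised scope."""
    scopes = [{}]          # innermost prefix map is last
    resolved = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok == "(":
            scopes.append(dict(scopes[-1]))  # inner scope inherits outer bindings
        elif tok == ")":
            scopes.pop()                     # bindings made inside are dropped
        elif tok == ">" and tokens[i + 2:i + 3] == ["="]:
            scopes[-1][tokens[i + 1]] = tokens[i + 3]  # > prefix = uri
            i += 4
            continue
        elif "." in tok and tok.split(".", 1)[0] in scopes[-1]:
            resolved.append((tok, scopes[-1][tok.split(".", 1)[0]]))
        i += 1
    return resolved


query = [">", "a", "=", "http:/x.com/y", "a.title", "=", "cat", "and",
         "(", ">", "a", "=", "http:/f.com/g", "a.title", "=", "hat", ")",
         "and", "a.title", "=", "rat"]
bindings = resolve_prefixes(query)
```

For the query above, the three a.title names resolve to http:/x.com/y, http:/f.com/g and http:/x.com/y in that order.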

Default Context
When no context is attached to a relation, relation modifier, or boolean modifier, the context is the cql context set.  When no context is attached to an index the context is determined by the server.
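
These default rules can be summarised in a short helper. The Python sketch below is illustrative only: the function name qualify and the kind labels are hypothetical, the prefix is assumed to end at the first period, and names are lower-cased because CQL names are case insensitive. A context of None stands for "determined by the server".

```python
def qualify(name, kind):
    """Split a possibly prefixed name into (context, base name).

    kind is one of 'index', 'relation', 'relationModifier' or
    'booleanModifier' (labels invented for this sketch).
    """
    prefix, dot, base = name.partition(".")
    if not dot:
        # No prefix supplied: indexes are left to the server,
        # everything else defaults to the cql context set.
        prefix = None if kind == "index" else "cql"
        base = name
    return (prefix.lower() if prefix else None, base.lower())


print(qualify("title", "index"))
print(qualify("dc.title", "index"))
print(qualify("any", "relation"))
print(qualify("rel.CORI", "relationModifier"))
```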

 


Conformance

In order to claim conformance to CQL a server must support one of the following three levels:

Level 0

  1. Must be able to process a term-only query.
    (The term is either a single word or, if it consists of multiple words separated by spaces, the entire search term is quoted. If the term includes quote marks, they must be escaped by preceding them with a backslash, e.g. "raising the \"titanic\"".)

  2. If an unsupported query is supplied, must be able to respond with a diagnostic to say that the query is not supported.

Level 1

  1. Support for Level 0.

  2. Ability to parse both:
    (a) search clauses consisting of 'index relation searchTerm'; and
    (b) queries where search terms are combined with booleans, e.g. "term1 AND term2"

  3. Support for at least one of (a) and (b).

Note that (b) does not necessarily include queries such as:

index relation term1 AND index relation term2

but rather queries where the search clauses are terms-only (do not include index or relation).

Level 2

  1. Support for Level 1.

  2. Ability to parse all of CQL and respond with appropriate diagnostics.

Note that Level 2 does not require support for all of CQL; it requires that the server be able to parse all of CQL (and respond with proper diagnostics for the parts not supported).