SRU (Search/Retrieval via URL)

The CQL Context Set version 1.2   

see also version 1.1

The CQL context set defines a set of indexes, relations and relation modifiers. The indexes supplied are 'utility' indexes which are generally useful across all applications of the language. These utility indexes are for instances when CQL is required to express a concept not directly related to the records, or for indexes applicable in practically every context.

  • The reserved name for this context set is: cql
  • The identifier for this context set is: info:srw/cql-context-set/1/cql-v1.2

Sections: Indexes | Relations | Relation Modifiers | Booleans | Boolean Modifiers


INDEXES

  • resultSetId
    A search clause may be a result set id. This is a special case, where the index and relation are expressed as "cql.resultSetId =" and the term is the result set id returned by the server in the 'resultSetId' parameter of the SRU response. It may be used by itself in a query to refer to an existing result set from which records are desired. It may also be used in conjunction with other resultSetId clauses or other indexes, combined by boolean operators. The semantics of resultSetId with relations other than "=" is undefined. The semantics of resultSetId with scan is also undefined.

    Examples:

    • cql.resultSetId = "5940824f-a2ae-41d0-99af-9a20bc4047b1"
      Match the result set with the given identifier.
  • allRecords
    A special index which matches every record available. Every record is matched no matter what values are provided for the relation and term, but the recommended syntax is: cql.allRecords = 1. The semantics for scanning allRecords is not defined.

    Examples:

    • cql.allRecords = 1 NOT dc.title = fish
      Search for all records that do not match 'fish' as a word in title.
  • allIndexes
    Alias: anywhere
    The 'allIndexes' index will result in a search equivalent to searching all of the indexes (in all of the context sets) that the server has access to. The semantics for scanning allIndexes is not defined.

    Examples:

    • cql.allIndexes = fish
      If the server had three indexes title, creator and date, then this would be the same as title = fish or creator = fish or date = fish
  • anyIndexes
    Alias: serverChoice
    The 'anyIndexes' index allows the server to determine how to search for the given term. The server may choose one or more indexes in which to search, which may or may not be generally available via CQL. It may choose a different index to search every time, based on the term for example, and hence may not produce consistent results via scan.

    This is the default when the index and relation are omitted from a search clause. The relation used when the index is omitted is '='.

    Examples:

    • cql.anyIndexes = fish
      Search in any one or more indexes for the term fish
  • keywords
    The keywords index is an index of terms from the record, determined by the server as being generally descriptive or meaningful to search on. It might include the full text of a document, descriptive metadata fields, or anything else generally useful to search as an initial entry point to the data. Exactly which fields make up this index is determined by the server; however, the choice must be consistent, unlike anyIndexes above, where the choice can differ between searches.

    Examples:

    • cql.keywords any/relevant "code computer calculator programming"
      Search in descriptive locations for the given terms
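
The allIndexes equivalence described above can be sketched in a few lines of Python. This is only an illustration, not part of the context set: the function name expand_all_indexes is hypothetical, and the index list is assumed to be something the server already knows.

```python
def expand_all_indexes(term, indexes):
    """Rewrite 'cql.allIndexes = term' as an 'or' over each named index."""
    if not indexes:
        raise ValueError("at least one index is required")
    return " or ".join(f"{index} = {term}" for index in indexes)


# The index names are the ones used in the allIndexes example above.
print(expand_all_indexes("fish", ["title", "creator", "date"]))
# prints: title = fish or creator = fish or date = fish
```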

RELATIONS

Implicit Relations
These relations are defined as such in the grammar of CQL. The cql context set only defines their meaning, rather than their existence.

Note: the relations 'scr' and 'exact' have been replaced by '=' and '==', respectively, in this version.  The relation '=' in the previous version had been used for adjacency, and in this version adjacency is now 'adj'.

  • =
    This is the default relation, and the server can choose any appropriate relation or means of comparing the query term with the terms from the data being searched. If the term is numeric, the most commonly chosen relation is '=='. For a string term, it is either 'adj' or '==', as appropriate for the index and term.

    Examples:

    • animal.numberOfLegs = 4
      Recommended to use '=='
    • dc.identifier = "gb 141 staff a-m"
      Recommended to use '=='
    • dc.title = "lord of the rings"
      Recommended to use 'adj'
    • dc.date = "2004 2006"
      Recommended to use 'within'
  • ==
    This relation is used for exact equality matching. The term in the data is exactly equal to the term in the search.

    Examples:

    • dc.identifier == "gb 141 staff a-m"
      Search for the string 'gb 141 staff a-m' in the identifier index.
    • dc.date == "2006-09-01 12:00:00"
      Search for the given datestamp.
    • animal.numberOfLegs == 4
      Search for animals with exactly 4 legs.
  • <>
    This relation means 'not equal to' and matches anything which is not exactly equal to the search term.

    Examples:

    • dc.date <> 2004-01-01
      Search for any date except the first of January, 2004
    • dc.identifier <> ""
      Search for any identifier which is not the empty string.
  • <, >, <=, >=
    These relations retain their regular meanings as pertaining to ordered terms (less than, greater than, less than or equal to, greater than or equal to).

    Examples:

    • dc.date > 2006-09-01
      Search for dates after the 1st of September, 2006
    • animal.numberOfLegs < 4
      Search for animals with fewer than 4 legs.

Defined Relations
These relations are defined as being widely useful as part of a default context set.

  • adj
    This relation is used for phrase searches. All of the words in the search term must appear, and must be adjacent to each other in the record in the order of the search term. The query could also be expressed using the PROX boolean operator.

    Examples:

    • dc.title adj "lord of the rings"
      Search for the phrase 'lord of the rings' somewhere in the title.
    • dc.description adj "blue shirt"
      Search for 'blue' followed by 'shirt' in the description.
  • all, any
    These relations may be used when the term contains multiple items to indicate "all of these items" or "any of these items". These queries could be expressed using boolean AND and OR respectively. These relations have an implicit relation modifier of 'cql.word', which may be changed by use of alternative relation modifiers.

    Examples:

    • dc.title all "lord rings"
      Search for both lord and rings in the title.
    • dc.description any "computer calculator"
      Search for either computer or calculator in the description.
  • within
    Within may be used with a search term that has multiple dimensions. It matches if the database's term falls completely within the range, area or volume described by the search term, inclusive of the extents given.

    Examples:

    • dc.date within "2002 2003"
      Search for dates between 2002 and 2003 inclusive.
    • animal.numberOfLegs within "2 5"
      Search for animals that have 2,3,4 or 5 legs.
  • encloses
    Conversely, encloses is used when the index's data has multiple dimensions. It matches if the database's term fully encloses the search term.

    Examples:

    • foo.dateRange encloses 2002
      Search for ranges of dates that include the year 2002.
    • geo.area encloses "45.3, 19.0"
      Search for any area that encloses the point 45.3, 19.0
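
For one-dimensional terms, the 'within' and 'encloses' relations can be sketched as simple range checks. The following Python is a minimal illustration under stated assumptions: terms are numeric, a range is written as two space-separated values, and 'encloses' is taken to be inclusive of its end points (the text above only states inclusivity for 'within'). The function names are hypothetical.

```python
def parse_range(term):
    """Split a two-item term such as '2 5' into numeric (low, high) bounds."""
    low, high = (float(part) for part in term.split())
    return low, high


def within(db_value, term):
    """True if the database value lies inside the search range, inclusive."""
    low, high = parse_range(term)
    return low <= db_value <= high


def encloses(db_range, search_value):
    """True if the database range contains the search value (assumed inclusive)."""
    low, high = db_range
    return low <= search_value <= high


legs = {"bird": 2, "dog": 4, "ant": 6}
print([name for name, count in legs.items() if within(count, "2 5")])
# prints: ['bird', 'dog']
print(encloses((2000, 2005), 2002))
# prints: True
```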

RELATION MODIFIERS

Functional Modifiers

  • stem
    The server should apply a stemming algorithm to the words within the term, for example so that computing and computer both match the stem 'compute'.

  • relevant
    The server should use a relevancy algorithm for determining matches and the order of the result set.

  • phonetic
    The server should use a phonetic algorithm for determining words which sound like the term.

  • fuzzy
    The server should be liberal in what it counts as a match. The exact details of this are left up to the server, but might include permutations of character order, off-by-one for numerical terms and so forth.

  • partial
    When used with within or encloses, there may be some section which extends outside of the term. This permits the database term to be partially enclosed by, or fall partially within, the search term.

Note: all of the following functional relation-modifiers are new in version 1.2.

  • ignoreCase, respectCase
    The server is instructed to either ignore or respect the case of the search term, rather than its default behaviour (which is unspecified). This modifier may be used in sort keys to ensure that terms with the same letters in different cases are sorted together or separately, respectively.

  • ignoreAccents, respectAccents
    The server is instructed to either ignore or respect diacritics in terms, rather than its default behaviour (which is unspecified, but respectAccents is recommended). This modifier may be used in sort keys, to ensure that characters with diacritics are sorted together or separately from those without them.

  • locale=value
    The term should be treated as being from the specified locale. Locales will in general include specifications for whether sort order is case-sensitive or insensitive, how it treats accents, and so forth. The default locale is determined by the server. The value is usually of the form C, french, fr_CH, fr_CH.iso88591 or similar. This modifier may be used in sort keys.

Examples:

  • dc.title any/stem "computing disestablishmentarianism"
    Find the local stemmed form of 'computing' and 'disestablishmentarianism', and search for those stems in the stemmed forms of the terms in titles.
  • person.phoneNumber =/fuzzy "0151 795-4252"
    Search for a phone number which is something similar to '0151 795-4252' but not necessarily exactly that
  • "fish" sortBy dc.title/ignoreCase
    Search for 'fish', and then sort the results by title, case insensitively.
  • dc.title within/locale=fr "l m"
    Find all titles between l and m, ensure that the locale is 'fr' for determining the order for what is between l and m.
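
As an illustration of ignoreCase and ignoreAccents in a sort key, the sketch below normalises terms before comparing them. It is one possible approach, not the behaviour required of a server: the function name normalize is hypothetical, and stripping Unicode combining marks is an assumed way of ignoring diacritics.

```python
import unicodedata


def normalize(term, ignore_case=False, ignore_accents=False):
    """Return a comparison key for term under the requested modifiers."""
    if ignore_accents:
        # Decompose characters, then drop the combining marks (diacritics).
        decomposed = unicodedata.normalize("NFD", term)
        term = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    if ignore_case:
        term = term.casefold()
    return term


titles = ["fish", "Zebra", "Fish", "apple"]
print(sorted(titles))
# prints: ['Fish', 'Zebra', 'apple', 'fish']
print(sorted(titles, key=lambda t: normalize(t, ignore_case=True)))
# prints: ['apple', 'fish', 'Fish', 'Zebra']
```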

Term-format Modifiers

These modifiers specify the format of the search term to ensure that the correct comparison is performed by the server. These modifiers may all be used in sort keys.

  • word
    The term should be broken into words, according to the server's definition of a 'word'

  • string
    The term is a single item, and should not be broken up.

  • isoDate
    Each item within the term conforms to the ISO 8601 specification for expressing dates.

  • number
    Each item within the term is a number.

  • uri
    Each item within the term is a URI.

  • oid
    Each item within the term is an ISO object identifier, dot-separated format.

Examples:

  • dc.title =/string Jaws
    Search in title for the string 'Jaws', rather than Jaws as a word. (Equivalent to the use of == as the relation)
  • zeerex.set ==/oid "1.2.840.10003.3.1"
    Search for the given OID as an attribute set.
  • squirrel sortby numberOfLegs/number
    Search for squirrel, and sort by the numberOfLegs index ensuring that it is treated as a number, not a string (e.g. '2' would sort after '10' as a string, but before it as a number).
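
The difference between sorting with the 'string' and 'number' term formats can be seen in a short Python sketch. The helper sort_key is hypothetical and covers only these two formats.

```python
def sort_key(value, term_format="string"):
    """Return the key to sort value by under the given term format."""
    if term_format == "number":
        return float(value)
    return value


legs = ["10", "2", "33", "4"]
print(sorted(legs, key=sort_key))
# prints: ['10', '2', '33', '4']
print(sorted(legs, key=lambda v: sort_key(v, "number")))
# prints: ['2', '4', '10', '33']
```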

Masking

  • masked (default modifier)
    The following masking rules and special characters apply for search terms, unless overridden in a profile via a relation modifier. To explicitly request this functionality, add 'cql.masked' as a relation modifier.

    1. A single asterisk (*) is used to mask zero or more characters.

    2. A single question mark (?) is used to mask a single character, thus N consecutive question-marks means mask N characters.

    3. Caret/hat (^) is used as an anchor character for terms that are word lists, that is, where the relation is 'all', 'any' or 'adj'. It may not be used to anchor a string, that is, when the relation is '==' (string matches are, by default, anchored). It may occur at the beginning or end of a word (with no intervening space) to mean left or right anchored, respectively. "^" has no special meaning when it occurs within a word (not at the beginning or end) or string, but must be escaped nevertheless.

    4. Backslash (\) is used to escape '*', '?', quote (") and '^', as well as itself. Backslash not followed immediately by one of these characters is an error.

    Examples:

    • dc.title = c*t
      Matches words that start with c and end in t
    • dc.title adj "*fish food*"
      Matches a word that ends in fish, followed by a word that starts with food
    • dc.title = c?t
      Matches a three letter word that starts with c and ends in t.
    • dc.title adj "^cat in the hat"
      Matches 'cat in the hat' where it is at the beginning of the field
    • dc.title any "^cat ^dog rat^"
      Matches cat at the beginning, dog at the beginning or rat at the end
    • dc.title == "\"Of Course\", she said"
      Escape internal double quotes within the term.
  • unmasked
    Do not apply masking rules, all characters are literal.
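
One way a server might apply masking rules 1, 2 and 4 is to translate each masked word into a regular expression. The Python below is a minimal sketch under that assumption. The function names are hypothetical, and anchoring with '^' (rule 3) is deliberately left out: a caller would need to strip a leading or trailing caret and handle field position separately.

```python
import re

# Characters that rule 4 allows after a backslash.
ESCAPABLE = set('*?"^\\')


def masked_word_to_regex(word):
    """Translate one masked word into regular expression source text."""
    parts = []
    i = 0
    while i < len(word):
        ch = word[i]
        if ch == "\\":
            if i + 1 >= len(word) or word[i + 1] not in ESCAPABLE:
                raise ValueError("backslash must precede *, ?, ^, a quote or a backslash")
            parts.append(re.escape(word[i + 1]))  # escaped character is literal
            i += 2
            continue
        if ch == "*":
            parts.append(".*")  # rule 1: zero or more characters
        elif ch == "?":
            parts.append(".")   # rule 2: exactly one character
        else:
            parts.append(re.escape(ch))
        i += 1
    return "".join(parts)


def matches(masked_word, candidate):
    """True if candidate, taken as a whole word, matches the masked word."""
    return re.fullmatch(masked_word_to_regex(masked_word), candidate) is not None


print(matches("c*t", "cat"), matches("c*t", "ct"))
# prints: True True
print(matches("c?t", "cat"), matches("c?t", "coat"))
# prints: True False
```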

Note: the following modifiers are new in version 1.2.

  • substring
    The 'substring' modifier may be used to specify a range of characters (first and last character) indicating the desired substring within the field to be searched. The modifier takes a value, of the form "start:end" where start and end obey the following rules:
    1. Positive integers count forwards through the string, starting at 1. The first character is 1, the tenth character is 10.
    2. Negative integers count backwards through the string, with -1 being the last character.
    3. Both start and end are inclusive of that character.
    4. If omitted, start defaults to 1 and end defaults to -1.

    Examples:

    • marc.008 =/substring="1:6" 920102
    • dc.title =/substring=":" "The entire title"
    • dc.title =/substring="2:2" h
    • dc.title =/substring="-5:" title
  • regexp
    The term should be treated as a regular expression. Any features beyond those found in modern POSIX regular expressions are considered to be server dependent. This modifier overrides the default 'masked' modifier, above. It may be used in either a string or word context.

    Examples:

    • dc.title adj/regexp "(lord|king|ruler) of th[ea] r.*s"
      Match lord or king or ruler, followed by of, followed by the or tha, followed by r plus zero or more characters plus s
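
The four rules for the 'substring' modifier map onto a Python slice as follows. This is a sketch only: the function name cql_substring is hypothetical, and rejecting position 0 is an assumption, since the rules above do not say how 0 should be treated.

```python
def cql_substring(value, spec):
    """Return the part of value selected by a 'start:end' substring spec."""
    start_text, end_text = spec.split(":")
    start = int(start_text) if start_text else 1   # rule 4: default start
    end = int(end_text) if end_text else -1        # rule 4: default end
    if start == 0 or end == 0:
        # Assumption: positions are 1-based, so 0 is treated as invalid.
        raise ValueError("positions start at 1 or -1, not 0")
    length = len(value)
    # Convert inclusive 1-based (or negative) positions to slice bounds.
    low = start - 1 if start > 0 else length + start
    high = end if end > 0 else length + end + 1
    return value[low:high]


print(cql_substring("The entire title", "2:2"))
# prints: h
print(cql_substring("The entire title", "-5:"))
# prints: title
```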


BOOLEANS

The CQL context set does not define booleans, as these can only be defined by the CQL grammar. It gives the semantics of the booleans defined.

  • AND
    The combination of two sets of records with AND will result in the set of records that appear in both of the sets.

  • OR
    The combination of two sets of records with OR will result in the set of records that appear in either or both of the sets. It is therefore inclusive OR, not exclusive OR.

  • NOT
    The combination of two sets of records with NOT will result in the set of records that appear in the left set, but not in the right hand set. It cannot be used as a unary operator.

  • PROX
    The prox (short for proximity) boolean operator allows for the relative locations of the terms to be used in order to determine the resulting set of records. The semantics of when a match occurs is defined by the modifiers or defaults for those modifiers, as described below.
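
The semantics of AND, OR and NOT correspond to set intersection, union and difference. The sketch below shows this with Python sets of record identifiers. The function name combine and the sample identifiers are hypothetical, and PROX is omitted because it needs term positions, not just record sets.

```python
def combine(left, boolean, right):
    """Combine two sets of record identifiers with a CQL boolean."""
    boolean = boolean.lower()
    if boolean == "and":
        return left & right   # records in both sets
    if boolean == "or":
        return left | right   # records in either or both sets
    if boolean == "not":
        return left - right   # records in the left set but not the right
    raise ValueError(f"unsupported boolean: {boolean}")


title_fish = {1, 2, 3}
creator_smith = {3, 4}
print(combine(title_fish, "and", creator_smith))
# prints: {3}
print(combine(title_fish, "not", creator_smith))
# prints: {1, 2}
```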



BOOLEAN MODIFIERS

The CQL context set defines four boolean modifiers, which are only used with the prox boolean operator.

  • distance symbol value
    The distance that the two terms should be separated by.
    1. Symbol is one of: < > <= >= = <>
      If the modifier is not supplied, it defaults to <=.
    2. Value is a non-negative integer.
      If the modifier is not supplied, it defaults to 1 when unit=word, or 0 for all other units.
  • unit=value
    The type of unit for the distance.
    Value is one of 'paragraph', 'sentence', 'word' and 'element', and defaults to 'word'. These values are explicitly undefined: they are subject to interpretation by the server. See Proximity Units.
  • unordered
    The order of the two terms is unimportant. This is the default.

  • ordered
    The order of the two terms must be as per the query.

Examples:

  • cat prox/unit=word/distance>2/ordered hat
    Find 'cat' where it appears more than two words before 'hat'.
    ("ordered" means 'cat' and 'hat' in that order. "distance >2" means that the proximity between 'cat' and 'hat' is greater than two words. Would exclude "Cat in the Hat" but would find "The Big Red Cat in the Big Red Hat".)
  • cat prox/unit=paragraph hat
    Find cat and hat appearing in the same paragraph (distance defaulting to 0) in either order (unordered default)
  • zeerex.set = cql prox/unit=element/distance=0 zeerex.index = resultSetId
    Find the cql context set in the same element as the index name resultSetId. E.g. search for cql.resultSetId
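
To illustrate how the distance, ordered and unordered modifiers interact, here is a minimal Python sketch for unit=paragraph. It rests on several assumptions that the cql context set leaves to the server: a document is a list of paragraph strings, words are split on whitespace and compared case-insensitively, and distance is the difference between paragraph numbers. All names are hypothetical.

```python
import operator

COMPARATORS = {
    "<": operator.lt, ">": operator.gt, "<=": operator.le,
    ">=": operator.ge, "=": operator.eq, "<>": operator.ne,
}


def positions(paragraphs, word):
    """Return (paragraph_index, word_index) pairs where word occurs."""
    return [
        (p, w)
        for p, text in enumerate(paragraphs)
        for w, token in enumerate(text.lower().split())
        if token == word
    ]


def prox_paragraph(paragraphs, left, right, symbol="<=", distance=0, ordered=False):
    """True if left and right occur at a paragraph distance satisfying symbol."""
    compare = COMPARATORS[symbol]
    for left_pos in positions(paragraphs, left):
        for right_pos in positions(paragraphs, right):
            if ordered and left_pos >= right_pos:
                continue  # 'ordered' requires left to come first
            if compare(abs(right_pos[0] - left_pos[0]), distance):
                return True
    return False


doc = ["The cat sat down", "A hat fell", "The end"]
print(prox_paragraph(doc, "cat", "hat"))
# prints: False
print(prox_paragraph(doc, "cat", "hat", distance=1))
# prints: True
print(prox_paragraph(doc, "hat", "cat", distance=1, ordered=True))
# prints: False
```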

 

Proximity Units

As noted above, the proximity units 'paragraph', 'sentence', 'word' and 'element' are explicitly undefined, that is, they are undefined when used by the CQL context set. Other context sets may assign them specific values.

Thus compare "prox/unit=word" with "prox/xyz.unit=word". In the first, 'unit' is a prox modifier from the CQL set, and as such its values are undefined, so 'word' is subject to interpretation by the server. In the second, 'unit' is a prox modifier defined by the xyz context set, which may assign the unit 'word' a specific meaning.

Other context sets may define additional units, for example, 'street':

prox/xyz.unit="street"

Note that this approach, 'prox/xyz.unit="street"', is preferable to 'prox/unit=xyz.street'. In the first case, 'unit' is a modifier defined in the xyz context set, and 'street' is a value defined for that modifier. In the second, 'unit' is a modifier from the cql context set, so its value would have to be one that is defined in the cql context set, yet 'xyz.street' is a value defined in a different set. Pairing a modifier from one set with a value from another is not good practice.
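
The distinction can be made concrete by splitting a modifier into its context set, name and value. The Python below is a hedged sketch: the function name parse_prox_modifier is hypothetical, it handles only modifiers written with '=', and it assumes that an unqualified modifier name belongs to the cql set.

```python
def parse_prox_modifier(modifier, default_set="cql"):
    """Split a modifier such as 'xyz.unit=street' into (set, name, value)."""
    name, _, value = modifier.partition("=")
    context_set, dot, base = name.rpartition(".")
    if not dot:
        # No prefix on the modifier name: assume the default context set.
        context_set, base = default_set, name
    return context_set, base, value.strip('"')


print(parse_prox_modifier('xyz.unit="street"'))
# prints: ('xyz', 'unit', 'street')
print(parse_prox_modifier("unit=xyz.street"))
# prints: ('cql', 'unit', 'xyz.street')
```

In the second call the modifier comes from the cql set while the value is qualified with another set's name, which is the mixed pairing the note above advises against.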