LC_COLLATE Category for the Locale Definition Source File Format

Purpose

Defines character or string collation information.

Description

A collating element is the unit of comparison for collation. A collating element may be a character or a sequence of characters. Every collating element in the locale has a set of weights, which determine whether the collating element collates before, equal to, or after the other collating elements in the locale. Each collating element is assigned collation weights by the localedef command when the locale definition source file is converted. These collation weights are then used by application programs that compare strings.
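As an illustration of how an application program might consume these weights, the following minimal sketch uses the strcoll and strxfrm functions of the Python locale module. The "C" locale is chosen only because it is always available; the word list is hypothetical, and a locale compiled with localedef would be selected the same way by name.

```python
import functools
import locale

# Select the collation rules to use. "C" is an assumption made for
# portability; substitute the name of any installed locale.
locale.setlocale(locale.LC_COLLATE, "C")

words = ["b", "A", "a"]  # hypothetical sample data

# strcoll compares two strings according to the LC_COLLATE weights.
by_strcoll = sorted(words, key=functools.cmp_to_key(locale.strcoll))

# strxfrm transforms a string so that ordinary comparison of the results
# matches strcoll; it is usually cheaper when sorting many strings.
by_strxfrm = sorted(words, key=locale.strxfrm)
```

Both lists come out in the same order, because strxfrm is defined to agree with strcoll.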

Comparison of strings is performed by comparing the collation weights of each character in the string until either a difference is found or the strings are determined to be equal. This comparison may be performed several times if the locale defines multiple collation orders. For example, in the French locale, the strings are first compared using a primary set of collation weights. If they are equal on the basis of this comparison, they are compared again using a secondary set of collation weights. The number of collation weights associated with a collating element is equal to the number of collation orders defined for the locale.
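The repeated comparison described above can be sketched as follows. This is a simplified model, not the algorithm used by any particular system: the WEIGHTS table and its numeric values are hypothetical, and every collating element is assumed to be a single character.

```python
# Hypothetical two-level weight table: (primary, secondary) per character.
# The numbers are illustrative only and come from no real locale.
WEIGHTS = {
    "a": (1, 1),
    "A": (1, 2),
    "b": (2, 1),
    "B": (2, 2),
}


def compare(s1, s2, levels=2):
    """Return -1, 0, or 1, comparing one collation order at a time."""
    for level in range(levels):
        w1 = [WEIGHTS[ch][level] for ch in s1]
        w2 = [WEIGHTS[ch][level] for ch in s2]
        if w1 != w2:
            # A difference at this level decides the result; later
            # levels are consulted only when this one is a tie.
            return -1 if w1 < w2 else 1
    return 0
```

With these weights, compare("ab", "AB") is decided only at the secondary level, because both strings have identical primary weights, whereas compare("AB", "b") is already decided at the primary level.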

Every character defined in the charmap file (or every character in the portable character set if no charmap file is specified) is itself a collating element. Additional collating elements can be defined using the collating-element statement. The syntax is:

collating-element character-symbol from string

The LC_COLLATE category begins with the LC_COLLATE keyword and ends with the END LC_COLLATE keyword.

The following keywords are recognized in the LC_COLLATE category:

Item                 Description
copy                 The copy statement specifies the name of an existing locale to be used as the definition of this category. If a copy statement is included in the file, no other keyword can be specified.
collating-element    The collating-element statement specifies multicharacter collating elements.

The syntax for the collating-element statement is:

collating-element <collating-symbol> from <string>

The collating-symbol value names a string of characters that is to be treated as a single collating element. The collating-symbol value cannot duplicate any symbolic name in the current charmap file, or any other symbolic name defined in this collation definition. The string value specifies the string of two or more characters that the collating-symbol value represents. Following are examples of the syntax for the collating-element statement:

collating-element <ch> from <c><h>
collating-element <e-acute> from <acute><e>
collating-element <11> from <1><1>

A collating-symbol value defined by the collating-element statement is recognized only within the LC_COLLATE category.
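To illustrate the effect of a multicharacter collating element, the following sketch splits a string into collating elements before any weights are looked up. The MULTI table is hypothetical, and the longest-match-first strategy is an assumption of this sketch rather than something stated in this description.

```python
# Hypothetical multicharacter collating elements, in the spirit of
# "collating-element <ch> from <c><h>".
MULTI = {"ch": "<ch>", "Ch": "<Ch>"}


def to_elements(text):
    """Split text into collating elements, preferring the longest match."""
    longest = max(len(key) for key in MULTI)
    elements = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, down to two characters.
        for size in range(longest, 1, -1):
            chunk = text[i:i + size]
            if chunk in MULTI:
                elements.append(MULTI[chunk])
                i += size
                break
        else:
            # No multicharacter element starts here, so the single
            # character is itself a collating element.
            elements.append(text[i])
            i += 1
    return elements
```

For example, to_elements("chico") treats the leading "ch" as one element and the remaining characters as one element each.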

Item                 Description
collating-symbol     The collating-symbol statement specifies collation symbols for use in collation sequence statements.

The syntax for the collating-symbol statement is:

collating-symbol <collating-symbol>

The collating-symbol value cannot duplicate any symbolic name in the current charmap file, or any other symbolic name defined in this collation definition. Following are examples of the syntax for the collating-symbol statement:

collating-symbol <UPPER_CASE>
collating-symbol <HIGH>

A collating-symbol value defined by the collating-symbol statement is recognized only within the LC_COLLATE category.

Item                 Description
order_start          The order_start statement must be followed by one or more collation order statements, assigning collation weights to collating elements. This statement is mandatory.

The syntax for the order_start statement is:

order_start <sort-rules>;<sort-rules>;...<sort-rules>
collation order statements
order_end

The <sort-rules> directives have the following syntax:

keyword, keyword,...keyword; keyword, keyword,...keyword

where keyword is one of the keywords forward, backward, and position.

The sort-rules directives are optional. If present, they define the rules to apply during string comparison. The number of specified sort-rules directives defines the number of weights each collating element is assigned (that is, the number of collation orders in the locale). If no sort-rules directives are present, one forward keyword is assumed and comparisons are made on a character basis rather than a string basis. If present, the first sort-rules directive applies when comparing strings using primary weight, the second when comparing strings using the secondary weight, and so on. Each set of sort-rules directives is separated by a ; (semicolon). A sort-rules directive consists of one or more comma-separated keywords. The following keywords are supported:

Item         Description
forward      Specifies that collation weight comparisons proceed from the beginning of a string toward the end of the string.
backward     Specifies that collation weight comparisons proceed from the end of a string toward the beginning of the string.
position     Specifies that collation weight comparisons consider the relative position of elements in the string not subject to the special symbol IGNORE. That is, if strings compare equal, the element with the shortest distance from the starting point of the string collates first.

The forward and backward keywords are mutually exclusive. Following is an example of the syntax for the <sort-rules> directives:

order_start forward; backward, position
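The difference between the forward and backward keywords can be sketched as follows. The WEIGHTS table is hypothetical: accented letters share the primary weight of their base letter and differ only at the secondary level. The position keyword is not modeled here.

```python
# Hypothetical (primary, secondary) weights; values are illustrative only.
WEIGHTS = {
    "c": (1, 0), "o": (2, 0), "ô": (2, 1),
    "t": (3, 0), "e": (4, 0), "é": (4, 1),
}


def sort_key(word, rules=("forward", "backward")):
    """Build one weight list per sort rule, reversed for 'backward'."""
    key = []
    for level, rule in enumerate(rules):
        weights = [WEIGHTS[ch][level] for ch in word]
        if rule == "backward":
            # Compare from the end of the string toward the beginning.
            weights.reverse()
        key.append(weights)
    return key


words = ["côté", "coté", "côte", "cote"]
backward_secondary = sorted(words, key=sort_key)
forward_secondary = sorted(
    words, key=lambda w: sort_key(w, ("forward", "forward"))
)
```

All four words have the same primary weights, so the secondary rule decides the order. With a backward secondary rule the result is cote, côte, coté, côté; with a forward secondary rule it is cote, coté, côte, côté.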

The optional operands for each collation element are used to define the primary, secondary, or subsequent weights for the collating element. The special symbol IGNORE is used to indicate a collating element that is to be ignored when strings are compared.

A collation statement with the ellipsis keyword on the left-hand side results in the collating-element-list on the right-hand side being applied to every character with an encoding that falls numerically between the character on the left-hand side in the preceding statement and the character on the left-hand side of the following statement. If the ellipsis occurs in the first statement, it is interpreted as though the preceding line specified the NUL character. (The NUL character is a character with all bits set to 0.) If the ellipsis occurs in the last statement, it is interpreted as though the following line specified the greatest encoded value.

An ellipsis keyword appearing in place of a collating-element-list indicates the weights are to be assigned, for the characters in the identified range, in numerically increasing order from the weight for the character symbol on the left-hand side of the preceding statement.

Note: The use of the ellipsis keyword results in a locale that may collate differently when compiled with different character set description (charmap) source files. For this reason, the localedef command issues a warning when the ellipsis keyword is encountered.
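The expansion of an ellipsis range can be sketched as follows. The function name expand_ellipsis and the integer weights are hypothetical, and Python code points stand in for the encoded values that a real charmap file would supply.

```python
def expand_ellipsis(prev_char, next_char, start_weight):
    """Assign increasing weights to the characters encoded strictly
    between prev_char and next_char.

    start_weight is the weight of prev_char; the first character in the
    range receives start_weight + 1, the next start_weight + 2, and so on.
    """
    codes = range(ord(prev_char) + 1, ord(next_char))
    return {
        chr(code): start_weight + offset
        for offset, code in enumerate(codes, start=1)
    }


between = expand_ellipsis("a", "e", 10)
```

Because the range is derived from encoded values, the same source line can cover a different set of characters under a different charmap, which is the portability concern raised in the note above.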

All characters in the character set must be placed in the collation order, either explicitly or implicitly by using the UNDEFINED special symbol. The UNDEFINED special symbol includes all coded character set values not specified explicitly or with an ellipsis symbol. These characters are inserted in the character collation order at the point indicated by the UNDEFINED special symbol in the order of their character code set values. If no UNDEFINED special symbol exists and the collation order does not specify all collation elements from the coded character set, a warning is issued and all undefined characters are placed at the end of the character collation order.
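The placement of unlisted characters can be sketched as follows. The function build_order and its list-based representation of a collation order are hypothetical simplifications; the warning that is issued when no UNDEFINED symbol exists is not modeled.

```python
def build_order(listed, charset):
    """Return the full collation order for the characters in charset.

    listed is the explicit order; the string "UNDEFINED" marks where the
    unlisted characters are inserted. If it is absent, the unlisted
    characters are placed at the end.
    """
    explicit = {item for item in listed if item != "UNDEFINED"}
    # Unlisted characters keep the order of their coded values.
    unlisted = sorted((ch for ch in charset if ch not in explicit), key=ord)
    if "UNDEFINED" not in listed:
        return list(listed) + unlisted
    order = []
    for item in listed:
        if item == "UNDEFINED":
            order.extend(unlisted)
        else:
            order.append(item)
    return order


with_marker = build_order(["b", "UNDEFINED", "a"], "abcd")
without_marker = build_order(["b", "a"], "abcd")
```

In the first call the unlisted characters c and d are inserted between b and a; in the second they follow a at the end of the order.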

Examples

The following is an example of a collation order statement in the LC_COLLATE locale definition source file category:

order_start     forward;backward
UNDEFINED       IGNORE;IGNORE
<LOW>           <LOW>;<space>
...             <LOW>;...
<a>             <a>;<a>
<a-acute>       <a>;<a-acute>
<a-grave>       <a>;<a-grave>
<A>             <a>;<A>
<A-acute>       <a>;<A-acute>
<A-grave>       <a>;<A-grave>
<ch>            <ch>;<ch>
<Ch>            <ch>;<Ch>
<s>             <s>;<s>
<ss>            <s><s>;<s><s>
<eszet>         <s><s>;<eszet><eszet>
...             <HIGH>;...
<HIGH>
order_end

This example is interpreted as follows:

Files

Item                      Description
/usr/lib/nls/loc/*        Specifies locale definition source files for supported locales.
/usr/lib/nls/charmap/*    Specifies character set description (charmap) source files for supported locales.