Defines UCS-2 (Unicode) conversion mappings for input to the uconvdef command.
Conversion mapping values are defined using UCS-2 symbolic character names followed by character encoding (code point) values for the multibyte code set. For example,
<U0020> \x20
represents the mapping between the <U0020> UCS-2 symbolic character name for the space character and the \x20 hexadecimal code point for the space character in ASCII.
In addition to the code set mappings, directives are interpreted by the uconvdef command to produce the compiled table. These directives must precede the code set mapping section. They consist of the following keywords surrounded by < > (angle brackets), starting in column 1, followed by white space and the value to be assigned to the symbol:
Item | Description |
---|---|
<code_set_name> | The name of the coded character set, enclosed in quotation marks (" "), for which the character set description file is defined. |
<mb_cur_max> | The maximum number of bytes in a multibyte character. The default value is 1. |
<mb_cur_min> | An unsigned positive integer value that defines the minimum number of bytes in a character for the encoded character set. The value is less than or equal to <mb_cur_max>. If not specified, the minimum number is equal to <mb_cur_max>. |
<escape_char> | The escape character used to indicate that the character following is interpreted in a special way. This defaults to a backslash (\). |
<comment_char> | The character that, when placed in column 1 of a charmap line, is used to indicate that the line is ignored. The default character is the number sign (#). |
<char_name_mask> | A quoted string consisting of format specifiers for the UCS-2 symbolic names. The value must be AXXXX, indicating an alphabetic character followed by 4 hexadecimal digits. The alphabetic character must be a U, and the hexadecimal digits must represent the UCS-2 code point for the character. For example, <U0020> is the symbolic character name for the Unicode space character. |
<uconv_class> | Specifies the type of the code set. The value must be one of the code set types that uconvdef supports. It tells uconvdef which type of table to build, and it is also stored in the table to indicate the type of processing algorithm used by the UCS conversion methods. |
<locale> | Specifies the default locale name to be used if locale information is needed. |
<subchar> | Specifies the encoding of the default substitute character in the multibyte code set. |
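
As an illustration of how the directive section might be read, the following Python sketch collects the `<keyword> value` pairs that precede the CHARMAP declaration and applies the defaults listed in the table. It is not part of uconvdef; the function name `parse_directives` and the code set name `EXAMPLE-CS` are hypothetical, and the sketch assumes one directive per line.

```python
import re

# Defaults stated in the directive table; other directives have no stated default.
DEFAULTS = {"mb_cur_max": "1", "escape_char": "\\", "comment_char": "#"}

# A directive is a keyword in angle brackets starting in column 1,
# followed by white space and a value.
DIRECTIVE = re.compile(r"^<([a-z_]+)>\s+(.*\S)\s*$")


def parse_directives(lines):
    """Collect <keyword> value pairs that appear before the CHARMAP line."""
    found = dict(DEFAULTS)
    for line in lines:
        if line.strip() == "CHARMAP":
            break
        match = DIRECTIVE.match(line)
        if match:
            key, value = match.groups()
            found[key] = value.strip('"')  # drop surrounding quotation marks
    # <mb_cur_min> falls back to <mb_cur_max> when it is not specified.
    found.setdefault("mb_cur_min", found["mb_cur_max"])
    return found


sample = [
    '<code_set_name> "EXAMPLE-CS"',  # hypothetical code set name
    '<mb_cur_max>    2',
    '<char_name_mask> "AXXXX"',
    'CHARMAP',
    '<U0020> \\x20',
]
settings = parse_directives(sample)
print(settings)
```

The values are kept as strings; a real tool would also validate them, for example by checking that `<mb_cur_min>` does not exceed `<mb_cur_max>`.
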
The mapping definition section consists of a sequence of mapping definition lines preceded by a CHARMAP declaration and terminated by an END CHARMAP declaration. Empty lines and lines containing <comment_char> in the first column are ignored.
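The section boundaries described above can be sketched in Python as follows. This is a minimal illustration, not the uconvdef parser; the function name `mapping_lines` is hypothetical, and the sketch assumes the CHARMAP and END CHARMAP declarations each occupy a line of their own.

```python
def mapping_lines(lines, comment_char="#"):
    """Yield the mapping lines between CHARMAP and END CHARMAP."""
    inside = False
    for line in lines:
        stripped = line.strip()
        if not inside:
            # Everything before the CHARMAP declaration is skipped here.
            inside = stripped == "CHARMAP"
            continue
        if stripped == "END CHARMAP":
            return
        # Skip empty lines and lines with the comment character in column 1.
        if not stripped or line.startswith(comment_char):
            continue
        yield stripped


sample = [
    "<mb_cur_max> 2",
    "CHARMAP",
    "# this line is ignored",
    "",
    "<U0020> \\x20",
    "END CHARMAP",
    "<U0021> \\x21",
]
print(list(mapping_lines(sample)))
```
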
Symbolic character names in mapping lines must follow the pattern specified in the <char_name_mask>, except for the reserved symbolic name, <unassigned>, that indicates the associated code points are unassigned.
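A symbolic name check along these lines could look like the following Python sketch. The function name `code_point` is hypothetical, and accepting lowercase hexadecimal digits is an assumption, since the text above does not state whether case matters.

```python
import re

# AXXXX mask with the letter fixed to U: <U> followed by 4 hexadecimal digits.
# Accepting lowercase digits is an assumption of this sketch.
NAME = re.compile(r"^<U([0-9A-Fa-f]{4})>$")


def code_point(name):
    """Return the UCS-2 code point for a symbolic name, or None for <unassigned>."""
    if name == "<unassigned>":
        return None  # reserved name: the associated code points are unassigned
    match = NAME.match(name)
    if match is None:
        raise ValueError(f"name does not follow the AXXXX mask: {name!r}")
    return int(match.group(1), 16)


print(hex(code_point("<U0020>")))
```
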
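Each noncomment line of the character set mapping definition must be in one of the following three formats, each described below with an example.
The first format maps a single symbolic character name. For example: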
<U3004> \x81\x57
This format defines a single symbolic character name and a corresponding encoding. The encoding part is expressed as one or more concatenated decimal, hexadecimal, or octal constants; the examples here use hexadecimal constants, written as \x followed by two hexadecimal digits.
Each constant represents a single-byte value. When constants are concatenated for multibyte character values, the last value specifies the least significant octet and preceding constants specify successively more significant octets.
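To make the byte ordering concrete, here is a hedged Python sketch that turns an encoding such as `\x81\x57` into bytes, most significant octet first. The function name `parse_encoding` is hypothetical. The sketch assumes the default backslash escape character, and it assumes the decimal (`\d` plus digits) and octal (backslash plus digits) spellings follow the POSIX charmap convention, which the text above does not spell out.

```python
import re

# One constant: \xHH (hexadecimal), \dNNN (decimal), or \NNN (octal).
# The \d and octal spellings are an assumption based on the POSIX charmap
# convention; only the \x form appears in this document's examples.
CONSTANT = re.compile(r"\\(?:x([0-9A-Fa-f]{2})|d([0-9]{2,3})|([0-7]{2,3}))")


def parse_encoding(text):
    """Convert concatenated constants to bytes, most significant octet first."""
    out = bytearray()
    pos = 0
    while pos < len(text):
        match = CONSTANT.match(text, pos)
        if match is None:
            raise ValueError(f"bad constant at offset {pos} in {text!r}")
        hex_digits, dec_digits, oct_digits = match.groups()
        if hex_digits is not None:
            value = int(hex_digits, 16)
        elif dec_digits is not None:
            value = int(dec_digits)
        else:
            value = int(oct_digits, 8)
        if value > 0xFF:
            raise ValueError(f"constant does not fit in one byte: {match.group(0)}")
        out.append(value)
        pos = match.end()
    if not out:
        raise ValueError("encoding is empty")
    return bytes(out)


encoded = parse_encoding(r"\x81\x57")
# The last constant is the least significant octet of the multibyte value.
print(hex(int.from_bytes(encoded, "big")))
```
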
The second format maps a range of symbolic character names. For example:
<U3003>...<U3006> \x81\x56
This format defines a range of symbolic character names and corresponding encodings. The range is interpreted as a series of symbolic names formed from the alphabetic prefix and all the values in the range defined by the numeric suffixes. The listed encoding value is assigned to the first symbolic name, and subsequent symbolic names in the range are assigned corresponding incremental values. For example, the line:
<U3003>...<U3006> \x81\x56
is interpreted as:
<U3003> \x81\x56
<U3004> \x81\x57
<U3005> \x81\x58
<U3006> \x81\x59
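The expansion shown above can be reproduced with a short Python sketch. The function name `expand_range` is hypothetical. The sketch handles hexadecimal constants only and assumes that "incremental values" means incrementing the encoding as a single big-endian integer, which matches the example but is not stated explicitly.

```python
def expand_range(line):
    """Expand '<Uxxxx>...<Uyyyy> encoding' into one mapping line per name."""
    names, encoding = line.split()
    first, last = names.split("...")
    start = int(first[2:-1], 16)  # digits between "<U" and ">"
    end = int(last[2:-1], 16)
    # Hexadecimal constants only: "\x81\x56" becomes the bytes 0x81 0x56.
    raw = bytes.fromhex(encoding.replace("\\x", ""))
    base = int.from_bytes(raw, "big")
    result = []
    for offset in range(end - start + 1):
        # Assumption: the encoding is incremented as one big-endian integer.
        value = (base + offset).to_bytes(len(raw), "big")
        text = "".join(f"\\x{byte:02x}" for byte in value)
        result.append(f"<U{start + offset:04X}> {text}")
    return result


for mapping in expand_range(r"<U3003>...<U3006> \x81\x56"):
    print(mapping)
```
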
This format defines a range of one or more unassigned encodings. For example, the line:
<unassigned> \x9b...\x9c
is interpreted as:
<unassigned> \x9b
<unassigned> \x9c
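A matching Python sketch for unassigned ranges follows. The function name `expand_unassigned` is hypothetical. The sketch handles hexadecimal constants only, and it assumes that both ends of the range have the same byte length and that a line with a single encoding (no `...`) denotes a range of one.

```python
def expand_unassigned(line):
    """Expand '<unassigned> first...last' into one line per encoding."""
    keyword, span = line.split()
    if keyword != "<unassigned>":
        raise ValueError("line does not start with <unassigned>")
    first, _, last = span.partition("...")
    last = last or first  # assumption: a single encoding is a range of one
    low = bytes.fromhex(first.replace("\\x", ""))
    high = bytes.fromhex(last.replace("\\x", ""))
    if len(low) != len(high):
        raise ValueError("range ends differ in byte length")
    start = int.from_bytes(low, "big")
    end = int.from_bytes(high, "big")
    lines = []
    for number in range(start, end + 1):
        value = number.to_bytes(len(low), "big")
        lines.append("<unassigned> " + "".join(f"\\x{byte:02x}" for byte in value))
    return lines


for entry in expand_unassigned(r"<unassigned> \x9b...\x9c"):
    print(entry)
```
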