NAME

NLS — Native Language Support Overview

DESCRIPTION

Native Language Support (NLS) provides commands for a single worldwide operating system base. An internationalized system has no built-in assumptions or dependencies on language-specific or cultural-specific conventions such as:

Character classifications
Character comparison rules
Character collation order
Numeric and monetary formatting
Date and time formatting
Message-text language
Character sets

All information pertaining to cultural conventions and language is obtained at program run time.

“Internationalization” (often abbreviated “i18n”) refers to the process by which system software is developed to support multiple culture-specific and language-specific conventions. This is a generalization process that frees the system from relying only on English strings and other English-specific conventions. “Localization” (often abbreviated “l10n”) refers to the process by which the user environment is customized to handle input and output in a way appropriate for a specific language and set of cultural conventions. This is a specialization process, in which the generic methods already implemented in an internationalized system are used in specific ways. The formal description of the cultural conventions of a country, together with all associated translations into the native language, is called the “locale”.

NetBSD provides extensive support to programmers and system developers to enable internationalized software to be developed. NetBSD also supplies a large variety of locales for system localization.

Localization of Information

All locale information is accessible to programs at run time so that data is processed and displayed correctly for specific cultural conventions and language.

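A locale is divided into categories. A category is a group of language-specific and culture-specific conventions as outlined in the list above. ISO C and POSIX together specify the following six standard categories supported by NetBSD:

LC_COLLATE	string-collation order information
LC_CTYPE	character classification, case conversion, and other character attributes
LC_MESSAGES	the format for affirmative and negative responses
LC_MONETARY	rules and symbols for formatting monetary numeric information
LC_NUMERIC	rules and symbols for formatting nonmonetary numeric information
LC_TIME	rules and symbols for formatting time and date information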

Localization of the system is achieved by setting appropriate values in environment variables to identify which locale should be used. The environment variables have the same names as their respective locale categories. Additionally, the LANG, LC_ALL, and NLSPATH environment variables are used. The NLSPATH environment variable specifies a colon-separated list of directory names where the message catalog files of the NLS database are located. The LC_ALL and LANG environment variables also determine the current locale.

The value of each of these environment variables is a string of the form:

	language[_territory][.codeset][@modifier]

Valid values for the language field come from the ISO 639 standard, which defines two-letter codes for many languages. Some common language codes are:

Language Name	Code	Language Family
ABKHAZIAN	AB	IBERO-CAUCASIAN
AFAN (OROMO)	OM	HAMITIC
AFAR	AA	HAMITIC
AFRIKAANS	AF	GERMANIC
ALBANIAN	SQ	INDO-EUROPEAN (OTHER)
AMHARIC	AM	SEMITIC
ARABIC	AR	SEMITIC
ARMENIAN	HY	INDO-EUROPEAN (OTHER)
ASSAMESE	AS	INDIAN
AYMARA	AY	AMERINDIAN
AZERBAIJANI	AZ	TURKIC/ALTAIC
BASHKIR	BA	TURKIC/ALTAIC
BASQUE	EU	BASQUE
BENGALI	BN	INDIAN
BHUTANI	DZ	ASIAN
BIHARI	BH	INDIAN
BISLAMA	BI
BRETON	BR	CELTIC
BULGARIAN	BG	SLAVIC
BURMESE	MY	ASIAN
BYELORUSSIAN	BE	SLAVIC
CAMBODIAN	KM	ASIAN
CATALAN	CA	ROMANCE
CHINESE	ZH	ASIAN
CORSICAN	CO	ROMANCE
CROATIAN	HR	SLAVIC
CZECH	CS	SLAVIC
DANISH	DA	GERMANIC
DUTCH	NL	GERMANIC
ENGLISH	EN	GERMANIC
ESPERANTO	EO	INTERNATIONAL AUX.
ESTONIAN	ET	FINNO-UGRIC
FAROESE	FO	GERMANIC
FIJI	FJ	OCEANIC/INDONESIAN
FINNISH	FI	FINNO-UGRIC
FRENCH	FR	ROMANCE
FRISIAN	FY	GERMANIC
GALICIAN	GL	ROMANCE
GEORGIAN	KA	IBERO-CAUCASIAN
GERMAN	DE	GERMANIC
GREEK	EL	LATIN/GREEK
GREENLANDIC	KL	ESKIMO
GUARANI	GN	AMERINDIAN
GUJARATI	GU	INDIAN
HAUSA	HA	NEGRO-AFRICAN
HEBREW	HE	SEMITIC
HINDI	HI	INDIAN
HUNGARIAN	HU	FINNO-UGRIC
ICELANDIC	IS	GERMANIC
INDONESIAN	ID	OCEANIC/INDONESIAN
INTERLINGUA	IA	INTERNATIONAL AUX.
INTERLINGUE	IE	INTERNATIONAL AUX.
INUKTITUT	IU
INUPIAK	IK	ESKIMO
IRISH	GA	CELTIC
ITALIAN	IT	ROMANCE
JAPANESE	JA	ASIAN
JAVANESE	JV	OCEANIC/INDONESIAN
KANNADA	KN	DRAVIDIAN
KASHMIRI	KS	INDIAN
KAZAKH	KK	TURKIC/ALTAIC
KINYARWANDA	RW	NEGRO-AFRICAN
KIRGHIZ	KY	TURKIC/ALTAIC
KIRUNDI	RN	NEGRO-AFRICAN
KOREAN	KO	ASIAN
KURDISH	KU	IRANIAN
LAOTHIAN	LO	ASIAN
LATIN	LA	LATIN/GREEK
LATVIAN	LV	BALTIC
LINGALA	LN	NEGRO-AFRICAN
LITHUANIAN	LT	BALTIC
MACEDONIAN	MK	SLAVIC
MALAGASY	MG	OCEANIC/INDONESIAN
MALAY	MS	OCEANIC/INDONESIAN
MALAYALAM	ML	DRAVIDIAN
MALTESE	MT	SEMITIC
MAORI	MI	OCEANIC/INDONESIAN
MARATHI	MR	INDIAN
MOLDAVIAN	MO	ROMANCE
MONGOLIAN	MN
NAURU	NA
NEPALI	NE	INDIAN
NORWEGIAN	NO	GERMANIC
OCCITAN	OC	ROMANCE
ORIYA	OR	INDIAN
PASHTO	PS	IRANIAN
PERSIAN (farsi)	FA	IRANIAN
POLISH	PL	SLAVIC
PORTUGUESE	PT	ROMANCE
PUNJABI	PA	INDIAN
QUECHUA	QU	AMERINDIAN
RHAETO-ROMANCE	RM	ROMANCE
ROMANIAN	RO	ROMANCE
RUSSIAN	RU	SLAVIC
SAMOAN	SM	OCEANIC/INDONESIAN
SANGHO	SG	NEGRO-AFRICAN
SANSKRIT	SA	INDIAN
SCOTS GAELIC	GD	CELTIC
SERBIAN	SR	SLAVIC
SERBO-CROATIAN	SH	SLAVIC
SESOTHO	ST	NEGRO-AFRICAN
SETSWANA	TN	NEGRO-AFRICAN
SHONA	SN	NEGRO-AFRICAN
SINDHI	SD	INDIAN
SINGHALESE	SI	INDIAN
SISWATI	SS	NEGRO-AFRICAN
SLOVAK	SK	SLAVIC
SLOVENIAN	SL	SLAVIC
SOMALI	SO	HAMITIC
SPANISH	ES	ROMANCE
SUNDANESE	SU	OCEANIC/INDONESIAN
SWAHILI	SW	NEGRO-AFRICAN
SWEDISH	SV	GERMANIC
TAGALOG	TL	OCEANIC/INDONESIAN
TAJIK	TG	IRANIAN
TAMIL	TA	DRAVIDIAN
TATAR	TT	TURKIC/ALTAIC
TELUGU	TE	DRAVIDIAN
THAI	TH	ASIAN
TIBETAN	BO	ASIAN
TIGRINYA	TI	SEMITIC
TONGA	TO	OCEANIC/INDONESIAN
TSONGA	TS	NEGRO-AFRICAN
TURKISH	TR	TURKIC/ALTAIC
TURKMEN	TK	TURKIC/ALTAIC
TWI	TW	NEGRO-AFRICAN
UIGUR	UG
UKRAINIAN	UK	SLAVIC
URDU	UR	INDIAN
UZBEK	UZ	TURKIC/ALTAIC
VIETNAMESE	VI	ASIAN
VOLAPUK	VO	INTERNATIONAL AUX.
WELSH	CY	CELTIC
WOLOF	WO	NEGRO-AFRICAN
XHOSA	XH	NEGRO-AFRICAN
YIDDISH	YI	GERMANIC
YORUBA	YO	NEGRO-AFRICAN
ZHUANG	ZA
ZULU	ZU	NEGRO-AFRICAN

For example, the locale for the Danish language spoken in Denmark using the ISO 8859-1 character set is da_DK.ISO8859-1. The da stands for the Danish language and the DK stands for Denmark. The short form of da_DK is sufficient to indicate this locale.

The environment variable settings are consulted in the following order of priority:

If the LC_ALL environment variable is set, all six categories use the locale it specifies.
If the LC_ALL environment variable is not set, each individual category uses the locale specified by its corresponding environment variable.
If the LC_ALL environment variable is not set, and a value for a particular LC_* environment variable is not set, the value of the LANG environment variable specifies the default locale for all categories. Only the LANG environment variable should be set in /etc/profile, since this makes it easiest for the user to override the system default using the individual LC_* variables.
If the LC_ALL environment variable is not set, a value for a particular LC_* environment variable is not set, and the value of the LANG environment variable is not set, the locale for that specific category defaults to the C locale. The C or POSIX locale assumes the ASCII character set and defines information for the six categories.

Character Sets

A character is any symbol used for the organization, control, or representation of data. A group of such symbols used to describe a particular language makes up a character set. It is the encoding values in a character set that provide the interface between the system and its input and output devices.

The following character sets are supported in NetBSD:

Font Sets

A font set contains the glyphs to be displayed on the screen for a corresponding character in a character set. A display must support a suitable font to display a character set. If suitable fonts are available to the X server, then X clients can include support for different character sets. xterm(1) includes support for Unicode with UTF-8 encoding. xfd(1) is useful for displaying all the characters in an X font.

The NetBSD wscons(4) console provides support for loading fonts using the wsfontload(8) utility. Currently, only fonts for the ISO8859-1 family of character sets are supported.

Internationalization for Programmers

To facilitate translations of messages into various languages and to make the translated messages available to the program based on a user's locale, it is necessary to keep messages separate from the programs and provide them in the form of message catalogs that a program can access at run time.

Access to locale information is provided through the setlocale(3) and nl_langinfo(3) interfaces. See their respective man pages for further information.

Message source files containing application messages are created by the programmer and converted to message catalogs. These catalogs are used by the application to retrieve and display messages, as needed.

NetBSD supports two message catalog interfaces: the X/Open catgets(3) interface and the Uniforum gettext(3) interface. The catgets(3) interface has the advantage of belonging to a well-supported standard. Unfortunately, the interface is complicated to use and the catalogs are difficult to maintain. The implementation also does not support different character sets. The gettext(3) interface has not yet been standardized, but it is supported by an increasing number of systems. It also provides many additional tools which make programming and catalog maintenance much easier.

Support for Multi-byte Encodings

Some character sets with multi-byte encodings may be difficult to decode, or may contain state (i.e., adjacent characters are dependent). ISO C specifies a set of functions using 'wide characters' which can handle multi-byte encodings properly. The behaviour of these functions is affected by the LC_CTYPE category of the current locale.

A wide character is specified in ISO C as being a fixed number of bits wide and stateless. There are two types for wide characters: wchar_t and wint_t. wchar_t can hold one wide character and is used in the same way that the 'char' type is used for one character. wint_t can hold one wide character or WEOF (wide EOF).

There are functions that operate on wchar_t, and substitute for functions operating on 'char'. See wmemchr(3) and towlower(3) for details. There are some additional functions that operate on wchar_t. See wctype(3) and wctrans(3) for details.

Wide characters should be used for all I/O processing which may rely on locale-specific strings. The two primary approaches to using wide characters are:

All I/O is performed using multi-byte characters. Input data is converted into wide characters immediately after reading, and data for output is converted from wide characters to the multi-byte encoding immediately before writing. Conversion is performed by the mbstowcs(3), mbsrtowcs(3), wcstombs(3), wcsrtombs(3), mblen(3), mbrlen(3), and mbsinit(3) functions.
Wide characters are used directly for I/O, using getwchar(3), fgetwc(3), getwc(3), ungetwc(3), fgetws(3), putwchar(3), fputwc(3), putwc(3), and fputws(3). They are also used with the formatted I/O functions for wide characters, such as fwscanf(3), wscanf(3), swscanf(3), fwprintf(3), wprintf(3), swprintf(3), vfwprintf(3), vwprintf(3), and vswprintf(3), and with the wide-character conversion specifiers %lc, %C, %ls, and %S of the conventional formatted I/O functions.

BUGS

This man page is incomplete.

February 21, 2007

NetBSD 6.1