C    Using Internationalization Features

This appendix describes the internationalization features of the operating system. These features provide users with the ability to process data and to interact with the system in a manner appropriate to their native language, customs, and geographic region (their locale).

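After reading this appendix, you will be able to do the following:

Understand what a locale is (Section C.1)

Understand how locale affects the processing and display of data (Section C.2)

Determine whether a locale has been set (Section C.3)

Set a locale (Section C.4)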

If your site is in the United States and you plan to use the American English language and its conventions, there is no need to set a locale because the system default is American English.

If your site is outside the United States, the locale will most likely have already been specified by the system administrator. If the locale has already been set, you may want to only skim this appendix for background information on internationalization. If the locale has not been set, the information in this appendix is essential to you.


C.1    Understanding Locale

Because Digital UNIX is an internationalized operating system, it can present information in a variety of ways. Users tell the operating system how to process and present information in a way appropriate for their language, country, and cultural customs by specifying a locale. See Section C.4 for information about how to specify a locale.

A locale generally consists of three parts: language, territory, and codeset. All three are important for specifying how information is processed and displayed.

At this point, some background information about codesets may be helpful.

The ASCII codeset has traditionally been used on UNIX systems to express American English. Each letter of the English alphabet (A to Z, a to z), as well as each digit, control character, and symbol, is uniquely identified using only 7 of the 8 bits in a standard byte. However, the introduction of new codesets or expansion of old ones has been necessary to include non-English characters. Because so many programs rely on ASCII characters in one way or another, the most commonly used codesets begin with ASCII and build from there.

By using all 8 bits of a standard byte, a single codeset can uniquely identify characters in several alphabetic languages. The most popular codesets are a series called ISO 8859. The first in the series is called ISO 8859/1, the second is ISO 8859/2, and so on through ISO 8859/10. The ISO 8859/1 codeset, often called Latin-1, supports English and other Western European languages.

To identify all ideographic symbols in Asian languages, such as Chinese and Japanese, character encoding requires more than one byte. Numerous codesets using multibyte character encoding, which is not supported by the ISO 8859 series of codesets, have been developed for Asian languages.


C.2    How Locale Affects Processing and Display of Data

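As previously mentioned, the locale specified on your system influences how information is processed and displayed. Specifically, locale affects how the software:

Collates (sorts) data

Formats dates and times

Formats numeric and monetary values

Displays messages

Interprets responses to yes/no prompts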

The following sections describe the items in this list.


C.2.1    Collation

Collation is the action of arranging elements of a set into a particular order. Collation always follows a set of rules. Some languages require collation rules that are not used in English.

Note

Because collation rules differ among languages, you cannot assume that the range [A to Z, a to z] includes every letter of an alphabet. For example, the Danish alphabet includes three characters that sort after z.
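The effect of a collating sequence can be seen with the sort command. The following is a minimal sketch, assuming a Bourne-compatible shell; the function name sort_in_c_locale is hypothetical. It sorts in the C locale by setting the LC_ALL variable (described in Section C.4.1) inside a subshell. In the C locale, characters are ordered by their ASCII codes, so every uppercase letter sorts before every lowercase letter. Under a language-specific locale, the same input can come out in a different order.

```shell
# sort_in_c_locale: sort standard input using the C locale's collating
# sequence (plain ASCII code order). Hypothetical helper for illustration.
sort_in_c_locale() {
    # The subshell keeps the locale change from leaking into the session.
    ( LC_ALL=C; export LC_ALL; sort )
}

printf 'b\nA\na\nB\n' | sort_in_c_locale
# In the C locale this prints A, B, a, b (one per line).
```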


C.2.2    Date and Time Formats

Users around the world express dates and times with different formatting conventions. When spelling out day and month names, people in the United States generally write dates as follows:

Tuesday, May 22, 1996

The French, on the other hand, express dates this way:

mardi, 22 mai 1996

The following examples show alternative formats for the date, March 20, 1996. A given format is not the only way to write the date in the listed country:

3/20/96 (United States)

20/3/96 (Great Britain)

20.3.96 (France and Germany)

20-III-96 (Italy)

96/3/20 (Japan)

8/3/20 (Japan, Emperor format)

In Japan's Emperor format, the year (8, in the preceding example) is expressed as the number of years that the current emperor has reigned.

As with dates, there are many conventions for expressing the time of day. In the United States, people often use the 12-hour clock with its a.m. and p.m. designations. People in most other countries use the 24-hour clock to express the time.

In addition to the 12-hour/24-hour clock differences, punctuation for written times can vary, for example:

3:20 p.m. (United States)

15h20 (France)

15.20 (Germany)

15:20 (Japan)
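To see which date and time conventions a locale applies, you can ask the date command for the locale's preferred formats. The following is a sketch under stated assumptions: the function name show_date_formats is hypothetical, and the %x (date) and %X (time) conversions are assumed to be supported by your date command. Only the C locale is used here because it is always present; any other locale name must be installed before it has an effect.

```shell
# show_date_formats LOCALE: print the current date and time using the
# preferred formats of LOCALE (%x = date, %X = time). Hypothetical helper.
show_date_formats() {
    # The subshell keeps the locale change from leaking into the session.
    ( LC_ALL=$1; export LC_ALL; date '+%x %X' )
}

show_date_formats C
# In the C locale the output has the form MM/DD/YY HH:MM:SS.
# To compare, try a locale from your system's list if it is installed:
#   show_date_formats fr_FR.ISO8859-1
```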


C.2.3    Numeric and Monetary Formats

The characters used to format numeric and monetary values vary from place to place. In the United States, the convention is to use a period (.) as the radix character (the character that separates whole and fractional quantities), and a comma (,) as the thousands separator. In many European countries, these conventions are reversed. For example:

1,234.56 (United States)

1.234,56 (France)

Here are some sample formats for monetary items:

$1,234.56 (United States, dollars)

kr1.234,56 (Norway, kroner)

SFrs.1,234.56 (Switzerland, Swiss francs)

Note that some formats for monetary amounts include more than two places for fractional digits.
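The radix character that a locale defines can be queried rather than assumed. The sketch below is illustrative only: the function name radix_char is hypothetical, and it assumes that your locale command accepts the standard LC_NUMERIC keyword decimal_point as an argument.

```shell
# radix_char LOCALE: print the radix character defined by LOCALE.
# decimal_point is a standard keyword of the LC_NUMERIC category.
radix_char() {
    # The subshell keeps the locale change from leaking into the session.
    ( LC_ALL=$1; export LC_ALL; locale decimal_point )
}

radix_char C
# The C locale uses a period (.) as the radix character.
```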


C.2.4    Messages

Programs are sometimes written with English messages embedded in the program itself. In an internationalized program, messages are kept in a separate file and replaced in the program with calls to a messaging system. Messages kept in a separate file can be translated and made available to the program. When translated messages are available, users can interact with the system in their native language.


C.2.5    Yes/No Prompts

Many programs ask questions that need a positive or negative response. Those programs typically look for the English string literals y or yes, n or no. An internationalized program lets users enter the characters or words that are appropriate to their language. For example, a French user should be able to enter o or oui.
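A shell script can avoid hard-coded English answers by asking the locale for its affirmative-response pattern. The following is a minimal sketch, not a definitive implementation: the function name is_yes is hypothetical, and it assumes that your locale command reports the standard LC_MESSAGES keyword yesexpr (an extended regular expression). If the keyword is unavailable, the sketch falls back to the English pattern.

```shell
# is_yes RESPONSE: succeed if RESPONSE is affirmative in the current locale.
# yesexpr holds an extended regular expression for "yes" answers; fall back
# to the English pattern if the keyword cannot be read.
is_yes() {
    yes_pattern=$(locale yesexpr 2>/dev/null)
    [ -n "$yes_pattern" ] || yes_pattern='^[yY]'
    printf '%s\n' "$1" | grep -E -- "$yes_pattern" >/dev/null
}

if is_yes "yes"; then
    echo "affirmative"
else
    echo "negative"
fi
```

With a French locale in effect, the same function would be expected to accept o or oui as well, provided the locale defines its yesexpr pattern that way.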


C.3    Determining Whether a Locale Has Been Set

If your system is functioning in accordance with the language and conventions of your country, you can assume that the locale has been set correctly. If you are not sure whether or not your locale has been set, enter the locale command to display current settings of the locale environment variables, for example:

locale

LANG=fr_FR.ISO8859-1
LC_COLLATE="fr_FR.ISO8859-1"
LC_CTYPE="fr_FR.ISO8859-1"
LC_MONETARY="fr_FR.ISO8859-1"
LC_NUMERIC="fr_FR.ISO8859-1"
LC_TIME="fr_FR.ISO8859-1"
LC_MESSAGES="fr_FR.ISO8859-1"
LC_ALL=

The locale environment variables, described in Section C.4.1, define the locale names used for messages, collation, codeset, numeric formats, monetary formats, date and time formats, and yes/no responses:

LANG

LC_COLLATE

LC_CTYPE

LC_NUMERIC

LC_MONETARY

LC_TIME

LC_MESSAGES

LC_ALL

In most cases, only the LANG variable has been set to a locale name, which then applies to other locale variables with the exception of LC_ALL.


C.4    Setting a Locale

When you specify a locale, you specify a locale name that indicates language, territory, and codeset. On Digital UNIX systems, locale names adhere to the following format:

lang_terr.codeset

lang
Is a 2-letter, lowercase abbreviation for the language name. The abbreviations are specified in ISO 639 Code for the Representation of Names of Languages, for example: en (English), fr (French), de (German, from "Deutsch"), ja (Japanese).

terr
Is a 2-letter, uppercase abbreviation for the territory name. The abbreviations are specified in ISO 3166 Codes for the Representation of Names of Countries, for example: US (United States), NL (the Netherlands), FR (France), DE (Germany, from "Deutschland"), JP (Japan).

codeset
Is a string that identifies the codeset, for example: ISO8859-1 (ISO 8859/1), SJIS (Shift Japanese Industrial Standard), AJEC (Advanced Japanese EUC).

Full locale names include: en_US.ISO8859-1 (English, incorporating customs for the United States), fr_FR.ISO8859-1 (French, incorporating customs for France), de_DE.ISO8859-1 (German, incorporating customs for Germany).
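The name format can be taken apart with ordinary shell pattern matching. The sketch below is for illustration; the function name split_locale_name and its output layout are hypothetical. It also accepts the optional @modifier suffix described in Section C.4.1, and it prints a hyphen for any field that is absent (as in the C and POSIX locale names).

```shell
# split_locale_name NAME: print the language, territory, codeset, and
# modifier fields of a name in lang_terr.codeset[@modifier] form.
# Missing fields are printed as "-". Hypothetical helper for illustration.
split_locale_name() {
    name=$1
    lang=- terr=- codeset=- modifier=-
    case $name in
        *@*) modifier=${name#*@}; name=${name%%@*} ;;
    esac
    case $name in
        *.*) codeset=${name#*.}; name=${name%%.*} ;;
    esac
    case $name in
        *_*) terr=${name#*_}; lang=${name%%_*} ;;
        *)   lang=$name ;;
    esac
    echo "language=$lang territory=$terr codeset=$codeset modifier=$modifier"
}

split_locale_name de_CH.ISO8859-1
# prints: language=de territory=CH codeset=ISO8859-1 modifier=-
```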

A locale can be set by the system administrator or an individual user. If your system administrator sets the locale at your site, it is likely that a default locale has been specified for all systems, including yours. You can override this default for your own sessions.

To set a locale, you assign a locale name to one or more environment variables. The easiest way to do this is to assign a locale name to the LANG environment variable because this variable covers all the pieces of a locale (codeset, collating sequence, numeric, monetary, and date and time formats, messages, and so forth).

Table C-1 lists the locales available when you install the subset, Single-byte European Locales. Additional locales may be available if language-variant software for the operating system is installed on your system.

Table C-1: Locale Names

Language     Country           Codeset       Locale Name
-            -                 ASCII         C
-            -                 ASCII         POSIX
Danish       Denmark           Latin-1       da_DK.ISO8859-1
German       Switzerland       Latin-1       de_CH.ISO8859-1
German       Germany           Latin-1       de_DE.ISO8859-1
Greek        Greece            Latin/Greek   el_GR.ISO8859-7
English      Great Britain     Latin-1       en_GB.ISO8859-1
English      United States     Latin-1       en_US.ISO8859-1
Spanish      Spain             Latin-1       es_ES.ISO8859-1
Finnish      Finland           Latin-1       fi_FI.ISO8859-1
French       Belgium           Latin-1       fr_BE.ISO8859-1
French       Canada            Latin-1       fr_CA.ISO8859-1
French       Switzerland       Latin-1       fr_CH.ISO8859-1
French       France            Latin-1       fr_FR.ISO8859-1
Italian      Italy             Latin-1       it_IT.ISO8859-1
Dutch        Belgium           Latin-1       nl_BE.ISO8859-1
Dutch        The Netherlands   Latin-1       nl_NL.ISO8859-1
Norwegian    Norway            Latin-1       no_NO.ISO8859-1
Portuguese   Portugal          Latin-1       pt_PT.ISO8859-1
Swedish      Sweden            Latin-1       sv_SE.ISO8859-1
Turkish      Turkey            Latin-5       tr_TR.ISO8859-9

The C locale is the default if no locales are set on your system. The POSIX locale is equivalent to the C locale; only letters in the English alphabet are included in the ASCII codeset that is specified for the POSIX and C locales.


C.4.1    Locale Categories

Table C-2 describes environment variables that influence locale functions.

Table C-2: Environment Variables That Influence Locale Functions

Variable      Description
LC_COLLATE    Specifies the collating sequence to use when sorting strings and when character ranges occur in patterns.
LC_CTYPE      Specifies the character classification (codeset) information.
LC_MONETARY   Specifies monetary formats.
LC_NUMERIC    Specifies numeric formats.
LC_MESSAGES   Specifies the language in which messages will appear if translations are available. In addition, this variable specifies strings for affirmative and negative responses.
LC_TIME       Specifies date and time formats.
LC_ALL        Overrides all preceding variables and the LANG environment variable. In general, this variable is used only in programs and should not be set by system managers and users. See Section C.4.2.3 for more information.

As is true for the LANG variable, all of the variables in Table C-2 can be assigned locale names. Consider the case where your company is located in the United States but the prevalent language spoken by employees is Spanish. The LANG environment variable could be set to the name of a Spanish language locale and the LC_NUMERIC and LC_MONETARY variables set to the name of a United States English locale. The explicit setting of the LC_NUMERIC and LC_MONETARY variables overrides what they were implicitly set to by LANG. The LC_CTYPE, LC_MESSAGES, LC_TIME, and LC_COLLATE variables would still be implicitly set to the Spanish locale. The following are the variable assignments for the C shell to implement this example:

setenv LANG es_ES.ISO8859-1
setenv LC_NUMERIC en_US.ISO8859-1
setenv LC_MONETARY en_US.ISO8859-1

The following are the same variable assignments for the Bourne and Korn shells:

LANG=es_ES.ISO8859-1
export LANG
LC_NUMERIC=en_US.ISO8859-1
export LC_NUMERIC
LC_MONETARY=en_US.ISO8859-1
export LC_MONETARY

Sometimes different versions of the same locale are available locally to meet the needs of certain languages or software applications. The names of such locales end with the at sign (@) plus a modifier field. For example, the collating sequence used for the telephone book in some languages is different from the collating sequence used for dictionaries. If the standard locale for a language defined the dictionary collating sequence, another version of the locale might exist to support the telephone book collating sequence. In this case the alternative locale version might have a name like fr_FR.ISO8859-1@phone.


C.4.2    Limitations of Locale Settings

The ability to set locale allows you to tailor your environment, but it does not protect you from making mistakes. The following sections discuss problems that can arise when you define locale variables.


C.4.2.1    Locale Settings Are Not Validated

There is nothing to prevent you from defining implausible combinations of locale names for different aspects of a locale. For example, you could set the LANG environment variable to a French locale and the LC_CTYPE variable to a Greek locale, which uses a different codeset. The results would probably be undesirable; for example, French message translations would likely contain characters not specified in the Greek locale's codeset. If you define locale variables in addition to LANG, you are responsible for ensuring a valid combination of locale settings.
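Although the system does not validate settings, a script can at least confirm that a locale name is installed before assigning it. The following is a minimal sketch: the function name locale_available is hypothetical, and it assumes that your locale command supports the -a option to list installed locales. The check catches misspelled or missing names only; it cannot tell you whether a combination of locales is sensible, and the spelling reported by locale -a on your system may differ from the names in Table C-1.

```shell
# locale_available NAME: succeed if NAME appears in the list of installed
# locales reported by "locale -a". Hypothetical helper for illustration.
locale_available() {
    locale -a 2>/dev/null | grep -F -x -- "$1" >/dev/null
}

# xx_XX.NOSUCH is a made-up name used to show the failure case.
for name in C xx_XX.NOSUCH; do
    if locale_available "$name"; then
        echo "$name: installed"
    else
        echo "$name: not installed"
    fi
done
```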


C.4.2.2    File Data Is Not Bound to a Locale

The system has no way of knowing what locale was set when a file was created. Therefore, the system cannot prevent you from processing the file's data using a different locale. For example, suppose you copy to your system a file that was created when the LANG variable was set to a German locale. If, on your system, LANG is set to a French locale and you use the grep command to search for a string in the file, the grep command will use French collation and pattern matching rules on the German data. It is therefore your responsibility to know what kind of language data a file contains and to set the locale accordingly.
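One way to act on this responsibility is to set the locale for a single command rather than for the whole session. The sketch below assumes a Bourne-compatible shell; the function name run_in_locale is hypothetical. It clears the category variables and LC_ALL inside a subshell so that LANG alone determines the locale for that command. The C locale is used in the example because it is always present; for a German file you would pass the name of an installed German locale instead.

```shell
# run_in_locale LOCALE COMMAND [ARGS...]: run COMMAND with LANG set to
# LOCALE and no other locale variables set. Hypothetical helper.
run_in_locale() {
    (
        # Changes made in this subshell do not affect the calling session.
        unset LC_ALL LC_COLLATE LC_CTYPE LC_MONETARY LC_NUMERIC LC_TIME LC_MESSAGES
        LANG=$1
        export LANG
        shift
        "$@"
    )
}

printf 'Zebra\napple\n' | run_in_locale C grep '^[a-z]'
# In the C locale the range a-z covers only lowercase ASCII letters,
# so this prints: apple
```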


C.4.2.3    Setting LC_ALL Overrides All Other Locale Variables

The LC_ALL variable overrides all other locale-dependent environment variables, even if you set it before setting category-specific variables, such as LC_COLLATE. The only way to cancel the influence of LC_ALL is to undefine the variable. For example, in the C shell, enter the command unsetenv LC_ALL; in the Bourne and Korn shells, enter the command unset LC_ALL.

The LC_ALL variable is available for users familiar with the System V environment. In that environment, users set locale either by setting LC_ALL or by setting all the locale category variables individually.
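The order in which the variables take effect can be summarized in a few lines of shell. The following is a sketch for illustration, not the system's own logic: the function name effective_locale is hypothetical, and it only inspects variables, so it does not check that a name refers to an installed locale. It reports LC_ALL if that variable is set, then the category variable, then LANG, and finally the C locale.

```shell
# effective_locale CATEGORY: print the locale name in effect for CATEGORY
# (for example LC_COLLATE), following the order LC_ALL, CATEGORY, LANG, C.
# Hypothetical helper; it inspects variables only.
effective_locale() {
    case $1 in
        LC_COLLATE|LC_CTYPE|LC_MONETARY|LC_NUMERIC|LC_TIME|LC_MESSAGES) ;;
        *) echo "unknown category: $1" >&2; return 1 ;;
    esac
    # Read the variable whose name was passed as the first argument.
    eval "category_value=\${$1:-}"
    if [ -n "${LC_ALL:-}" ]; then
        echo "$LC_ALL"
    elif [ -n "$category_value" ]; then
        echo "$category_value"
    elif [ -n "${LANG:-}" ]; then
        echo "$LANG"
    else
        echo C
    fi
}

effective_locale LC_COLLATE
```

If the function reports a value you did not expect, check whether LC_ALL is set, because it masks every other variable.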