This appendix describes the internationalization features of the operating system. These features provide users with the ability to process data and to interact with the system in a manner appropriate to their native language, customs, and geographic region (their locale).
After reading this appendix, you will be able to do the following:
If your site is in the United States and you plan to use the American English language and its conventions, there is no need to set a locale because the system default is American English.
If your site is outside the United States, the locale will most likely have already been specified by the system administrator. If the locale has already been set, you may want to only skim this appendix for background information on internationalization. If the locale has not been set, the information in this appendix is essential to you.
Because Digital UNIX is an internationalized operating system, it can present information in a variety of ways. Users tell the operating system how to process and present information in a way appropriate for their language, country, and cultural customs by specifying a locale. See Section C.4 for information about how to specify a locale.
A locale generally consists of three parts: language, territory, and codeset. All three are important for specifying how information is processed and displayed:
At this point, some background information about codesets may be helpful.
The ASCII codeset has traditionally been used on UNIX systems to express American English. Every letter of the English alphabet (A to Z, a to z), digit, control character, and symbol is uniquely identified using only 7 of the 8 bits in a standard byte. However, new codesets have had to be introduced, and old ones expanded, to include non-English characters. Because so many programs rely on ASCII characters in one way or another, the most commonly used codesets begin with ASCII and build from there.
By using all 8 bits of a standard byte, a single codeset can uniquely identify characters in several alphabetic languages. The most popular codesets are a series called ISO 8859. The first in the series is called ISO 8859/1, the second is ISO 8859/2, and so on through ISO 8859/10. The ISO 8859/1 codeset, often called Latin-1, supports English and other Western European languages.
To identify all ideographic symbols in Asian languages, such as Chinese and Japanese, character encoding requires more than one byte. Numerous codesets using multibyte character encoding, which is not supported by the ISO 8859 series of codesets, have been developed for Asian languages.
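The difference between 7-bit, 8-bit, and multibyte codesets can be seen by counting encoded bytes. The following Python sketch is an illustration only: the sample characters and the helper names `encoded_length` and `representable` are assumptions made for this example, not part of the operating system.

```python
def encoded_length(char, codeset):
    """Return the number of bytes that `char` occupies in `codeset`."""
    return len(char.encode(codeset))


def representable(char, codeset):
    """Return True if `codeset` contains `char`, False otherwise."""
    try:
        char.encode(codeset)
        return True
    except UnicodeEncodeError:
        return False


E_ACUTE = "\u00e9"   # e with acute accent, a Western European letter
KANJI = "\u65e5"     # a Japanese ideograph

# ASCII covers the English letter but not the accented one.
ascii_has_a = representable("A", "ascii")
ascii_has_e_acute = representable(E_ACUTE, "ascii")

# Latin-1 (ISO 8859/1) uses all 8 bits of a single byte.
latin1_bytes = encoded_length(E_ACUTE, "latin-1")

# Japanese codesets need more than one byte for an ideograph.
sjis_bytes = encoded_length(KANJI, "shift_jis")
euc_bytes = encoded_length(KANJI, "euc_jp")
```

Here the accented letter is absent from ASCII, fits in one byte in Latin-1, and the ideograph needs two bytes in both Japanese codesets.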
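As previously mentioned, the locale specified on your system influences how information is processed and displayed. Specifically, locale affects how the software:
Collates (sorts) data
Formats dates and times
Formats numeric and monetary values
Displays program messages
Interprets yes/no responses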
The following sections describe the items in this list.
Collation is the action of arranging elements of a set into a particular order. Collation always follows a set of rules. Some languages require collation rules that are not used in English.
Some languages include groups of characters that all sort to the same primary location. Additional sort rules then order the characters within each group. For example, the French characters a, á, à, and â all sort to the same primary location. Words that begin with these characters first collate to the same location and are then sorted within the group. These words are in correct French order:
a
á
abord
âpre
après
âpreté
azur
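The two-level rule behind the French list above can be sketched in Python as a sort key with a primary and a secondary part. This is a simplified illustration, not the collation algorithm a French locale actually uses; the function names are assumptions made for this example.

```python
import unicodedata


def strip_accents(text):
    """Map accented letters to their base letters (primary sort location)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def two_level_key(word):
    # Primary level: compare the base letters only.
    # Secondary level: break ties using the accented spelling itself.
    return (strip_accents(word), unicodedata.normalize("NFC", word))


words = ["azur", "âpreté", "abord", "á", "après", "a", "âpre"]
ordered = sorted(words, key=two_level_key)
```

With this key, `ordered` holds the seven words in the same sequence as the list above. A plain `sorted(words)` would instead push every word that starts with an accented letter to the end.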
In some languages, certain single characters are treated as if they were two characters. For example, the German sharp s (ß) is sorted as if it were "ss".
Some languages treat a string of characters as if it were a single element. For example, the Spanish ch and ll sequences are treated as unique characters in the Spanish alphabet. The following words are in correct Spanish order:
canto
construir
curioso
chapa
chocolate
dama
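The Spanish ordering above can be reproduced by splitting each word into collating elements before comparing. The Python sketch below is a minimal illustration under that assumption; `ALPHABET`, `collating_elements`, and `spanish_key` are hypothetical names, and accented vowels are not handled.

```python
# Traditional Spanish alphabet, with "ch" and "ll" as single elements.
ALPHABET = "a b c ch d e f g h i j k l ll m n ñ o p q r s t u v w x y z".split()
RANK = {element: position for position, element in enumerate(ALPHABET)}
DIGRAPHS = {"ch", "ll"}


def collating_elements(word):
    """Split a word into collating elements, preferring the digraphs."""
    elements = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair in DIGRAPHS:
            elements.append(pair)
            i += 2
        else:
            elements.append(word[i])
            i += 1
    return elements


def spanish_key(word):
    # Characters outside the alphabet sort after every known element.
    return [RANK.get(e, len(ALPHABET)) for e in collating_elements(word.lower())]


words = ["dama", "chocolate", "canto", "curioso", "chapa", "construir"]
for word in sorted(words, key=spanish_key):
    print(word)
```

Because "ch" ranks between "c" and "d", the words that begin with "ch" follow all the other words that begin with "c".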
Some collation rules ignore certain characters. For example, if the hyphen (-) is defined as a character to be ignored, the strings "re-locate" and "relocate" sort to the same position.
Note
This means that you cannot assume that the range [A to Z, a to z] includes every letter of an alphabet. For example, the Danish alphabet includes three characters that sort after z.
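The pitfall described in the note can be shown with a pattern that hard-codes the English range. In this Python sketch the Danish sample word and the function names are assumptions chosen for illustration; the three Danish letters that follow z are æ, ø, and å.

```python
import re

ENGLISH_RANGE = re.compile(r"[A-Za-z]+")


def is_word_english_range(text):
    """True only if every character falls in A to Z or a to z."""
    return ENGLISH_RANGE.fullmatch(text) is not None


def is_word_any_alphabet(text):
    """True if every character is a letter in any alphabet."""
    return text.isalpha()


DANISH_WORD = "bl\u00e5b\u00e6r"   # Danish for "blueberry", with å and æ

range_accepts = is_word_english_range(DANISH_WORD)
alpha_accepts = is_word_any_alphabet(DANISH_WORD)
```

The hard-coded range rejects the Danish word, while the character-class test accepts it.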
Users around the world express dates and times with different formatting conventions. When day and month names are spelled out, people in the United States generally write dates like this:
Wednesday, May 22, 1996
The French, on the other hand, express dates this way:
mercredi, 22 mai 1996
The following examples show alternative formats for the date, March 20, 1996. A given format is not the only way to write the date in the listed country:
3/20/96 (United States)
20/3/96 (Great Britain)
20.3.96 (France and Germany)
20-III-96 (Italy)
96/3/20 (Japan)
8/3/20 (Japan, Emperor format)
In Japan's Emperor format, the year (8, in the preceding example) is expressed as the year of the current emperor's reign.
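The Western-calendar formats listed above differ only in field order, separator, and month notation, which a small formatter can make explicit. The following Python sketch is illustrative; the style keys ("US", "GB", and so on) and the function name are assumptions, not locale names.

```python
from datetime import date

ROMAN_MONTHS = ["I", "II", "III", "IV", "V", "VI",
                "VII", "VIII", "IX", "X", "XI", "XII"]


def short_date(d, style):
    """Format a date in one of the short numeric styles listed above."""
    yy = f"{d.year % 100:02d}"          # two-digit year
    if style == "US":
        return f"{d.month}/{d.day}/{yy}"
    if style == "GB":
        return f"{d.day}/{d.month}/{yy}"
    if style == "FR_DE":
        return f"{d.day}.{d.month}.{yy}"
    if style == "IT":
        return f"{d.day}-{ROMAN_MONTHS[d.month - 1]}-{yy}"
    if style == "JP":
        return f"{yy}/{d.month}/{d.day}"
    raise ValueError(f"unknown style: {style}")


sample = date(1996, 3, 20)
for style in ("US", "GB", "FR_DE", "IT", "JP"):
    print(style, short_date(sample, style))
```

An internationalized program would take these conventions from the locale rather than hard-coding them as this sketch does.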
As with dates, there are many conventions for expressing the time of day. In the United States, people often use the 12-hour clock with its a.m. and p.m. designations. People in most other countries use the 24-hour clock to express the time.
In addition to the 12-hour/24-hour clock differences, punctuation for written times can vary, for example:
3:20 p.m. (United States)
15h20 (France)
15.20 (Germany)
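The same time of day can be rendered in each of these conventions with a few lines of Python. This is a sketch only; the style keys and the function name are assumptions made for illustration.

```python
from datetime import time


def format_time(t, style):
    """Render a time using one of the punctuation conventions above."""
    if style == "US":
        hour12 = t.hour % 12 or 12                 # 0 and 12 both display as 12
        suffix = "a.m." if t.hour < 12 else "p.m."
        return f"{hour12}:{t.minute:02d} {suffix}"
    if style == "FR":
        return f"{t.hour}h{t.minute:02d}"
    if style == "DE":
        return f"{t.hour}.{t.minute:02d}"
    raise ValueError(f"unknown style: {style}")


afternoon = time(15, 20)
for style in ("US", "FR", "DE"):
    print(style, format_time(afternoon, style))
```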
The characters used to format numeric and monetary values vary from place to place. In the United States, the convention is to use a period (.) as the radix character (the character that separates whole and fractional quantities), and a comma (,) as the thousands separator. In many European countries, these conventions are reversed. For example:
1,234.56 (United States)
1.234,56 (France)
Here are some sample formats for monetary items:
$1,234.56 (United States, dollars)
kr1.234,56 (Norway, krone)
SFrs.1,234.56 (Switzerland, Swiss francs)
Note that some formats for monetary amounts include more than two places for fractional digits.
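The numeric and monetary conventions above reduce to three parameters: the thousands separator, the radix character, and the number of fractional digits. The Python sketch below shows one way to apply them; the function names and parameters are assumptions for illustration and do not correspond to a system interface.

```python
def format_number(value, thousands_sep, radix, places=2):
    """Format `value` with the given separators and fractional digits."""
    text = f"{value:,.{places}f}"      # grouped with "," and "." first
    # Swap through a placeholder so the two separators do not collide.
    return (text.replace(",", "\0")
                .replace(".", radix)
                .replace("\0", thousands_sep))


def format_money(value, symbol, thousands_sep, radix, places=2):
    """Prefix a formatted amount with its currency symbol."""
    return symbol + format_number(value, thousands_sep, radix, places)


print(format_number(1234.56, ",", "."))
print(format_number(1234.56, ".", ","))
print(format_money(1234.56, "kr", ".", ","))
```

The `places` parameter covers monetary formats that use more than two fractional digits.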
Programs are sometimes written with English messages embedded in the program itself. In an internationalized program, messages are kept in a separate file and replaced in the program with calls to a messaging system. Messages kept in a separate file can be translated and made available to the program. When translated messages are available, users can interact with the system in their native language.
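The idea of replacing embedded strings with calls to a messaging system can be sketched as a lookup by message identifier. The Python example below is a toy model, not the system's message catalog interface: the in-memory `CATALOGS` table, the message identifier, and the translated text are all assumptions for illustration.

```python
# Hypothetical catalogs; a real system keeps these in separate files.
CATALOGS = {
    "en": {"file_not_found": "File not found: {name}"},
    "fr": {"file_not_found": "Fichier introuvable : {name}"},
}
DEFAULT_LANGUAGE = "en"


def message(msg_id, language, **params):
    """Look up a message, falling back to English if no translation exists."""
    catalog = CATALOGS.get(language, {})
    template = catalog.get(msg_id, CATALOGS[DEFAULT_LANGUAGE][msg_id])
    return template.format(**params)


print(message("file_not_found", "fr", name="notes.txt"))
print(message("file_not_found", "de", name="notes.txt"))   # no German catalog
```

The program text contains only the identifier, so adding a language means adding a catalog rather than editing the program.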
Many programs ask questions that need a positive or negative response. Those programs typically look for the English string literals y or yes, n or no. An internationalized program lets users enter the characters or words that are appropriate to their language. For example, a French user should be able to enter o or oui.
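A program can accept language-appropriate answers by matching the reply against per-language patterns instead of the literals y and n. The following Python sketch assumes a hypothetical pattern table; the patterns shown are illustrative and are not copied from any system locale definition.

```python
import re

# Hypothetical patterns for affirmative and negative replies.
RESPONSE_PATTERNS = {
    "en": {"yes": r"^[yY]", "no": r"^[nN]"},
    "fr": {"yes": r"^[oOyY]", "no": r"^[nN]"},
}


def classify_response(answer, language):
    """Return True for yes, False for no, None if the reply is unrecognized."""
    patterns = RESPONSE_PATTERNS[language]
    reply = answer.strip()
    if re.match(patterns["yes"], reply):
        return True
    if re.match(patterns["no"], reply):
        return False
    return None


print(classify_response("oui", "fr"))
print(classify_response("oui", "en"))
```

Under these assumed patterns, "oui" counts as an affirmative reply for a French user but is unrecognized when the English patterns are in effect.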
If your system is functioning in accordance with the language and conventions of your country, you can assume that the locale has been set correctly. If you are not sure whether or not your locale has been set, enter the locale command to display current settings of the locale environment variables, for example:
% locale
LANG=fr_FR.ISO8859-1
LC_COLLATE="fr_FR.ISO8859-1"
LC_CTYPE="fr_FR.ISO8859-1"
LC_MONETARY="fr_FR.ISO8859-1"
LC_NUMERIC="fr_FR.ISO8859-1"
LC_TIME="fr_FR.ISO8859-1"
LC_MESSAGES="fr_FR.ISO8859-1"
LC_ALL=
The locale environment variables, described in Section C.4.1, define the locale names used for messages, collation, codeset, numeric formats, monetary formats, date and time formats, and yes/no responses:
LANG
LC_COLLATE
LC_CTYPE
LC_NUMERIC
LC_MONETARY
LC_TIME
LC_MESSAGES
LC_ALL
In most cases, only the LANG variable has been set to a locale name, which then applies to the other locale variables with the exception of LC_ALL.
When you specify a locale, you specify a locale name that indicates language, territory, and codeset. On Digital UNIX systems, locale names adhere to the following format:
lang_terr.codeset
lang: the language, for example en (English), fr (French), de (German, from "Deutsch"), ja (Japanese).
terr: the territory, for example US (United States), NL (the Netherlands), FR (France), DE (Germany, from "Deutschland"), JP (Japan).
codeset: the codeset, for example ISO8859-1 (ISO 8859/1), SJIS (Shift Japanese Industrial Standard), AJEC (Advanced Japanese EUC).
Full locale names include:
en_US.ISO8859-1 (English, incorporating customs for the United States),
fr_FR.ISO8859-1 (French, incorporating customs for France),
de_DE.ISO8859-1 (German, incorporating customs for Germany).
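The three-part name format can be taken apart mechanically. The Python sketch below is one possible parser, written for illustration: the `LocaleName` type, the function name, and the exact pattern (two lowercase letters, two uppercase letters, then a codeset) are assumptions based on the examples above, and names that do not follow that format are rejected.

```python
import re
from typing import NamedTuple


class LocaleName(NamedTuple):
    lang: str
    terr: str
    codeset: str


# Assumed shape: two lowercase letters, "_", two uppercase letters, ".", codeset.
_NAME_PATTERN = re.compile(r"^([a-z]{2})_([A-Z]{2})\.([A-Za-z0-9-]+)$")


def parse_locale_name(name):
    """Split a lang_terr.codeset name into its three parts."""
    match = _NAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"not in lang_terr.codeset format: {name!r}")
    return LocaleName(*match.groups())


parsed = parse_locale_name("fr_FR.ISO8859-1")
print(parsed.lang, parsed.terr, parsed.codeset)
```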
A locale can be set by the system administrator or by an individual user. If your system administrator sets the locale at your site, it is likely that a default locale has been specified for all systems, including yours. You can override the default locale if you choose.
To set a locale, you assign a locale name to one or more environment variables. The easiest way to do this is to assign a locale name to the LANG environment variable because this variable covers all the pieces of a locale (codeset, collating sequence, numeric, monetary, and date and time formats, messages, and so forth).
Table C-1 lists the locales available when you install the subset, Single-byte European Locales. Additional locales may be available if language-variant software for the operating system is installed on your system.
Language | Country | Codeset | Locale Name |
- | - | ASCII | C |
- | - | ASCII | POSIX |
Danish | Denmark | Latin-1 | da_DK.ISO8859-1 |
German | Switzerland | Latin-1 | de_CH.ISO8859-1 |
German | Germany | Latin-1 | de_DE.ISO8859-1 |
Greek | Greece | Latin/Greek | el_GR.ISO8859-7 |
English | Great Britain | Latin-1 | en_GB.ISO8859-1 |
English | United States | Latin-1 | en_US.ISO8859-1 |
Spanish | Spain | Latin-1 | es_ES.ISO8859-1 |
Finnish | Finland | Latin-1 | fi_FI.ISO8859-1 |
French | Belgium | Latin-1 | fr_BE.ISO8859-1 |
French | Canada | Latin-1 | fr_CA.ISO8859-1 |
French | Switzerland | Latin-1 | fr_CH.ISO8859-1 |
French | France | Latin-1 | fr_FR.ISO8859-1 |
Italian | Italy | Latin-1 | it_IT.ISO8859-1 |
Dutch | Belgium | Latin-1 | nl_BE.ISO8859-1 |
Dutch | The Netherlands | Latin-1 | nl_NL.ISO8859-1 |
Norwegian | Norway | Latin-1 | no_NO.ISO8859-1 |
Portuguese | Portugal | Latin-1 | pt_PT.ISO8859-1 |
Swedish | Sweden | Latin-1 | sv_SE.ISO8859-1 |
Turkish | Turkey | Latin-5 | tr_TR.ISO8859-9 |
The C locale is the default if no locales are set on your system. The POSIX locale is equivalent to the C locale; only letters in the English alphabet are included in the ASCII codeset that is specified for the POSIX and C locales.
Table C-2 describes environment variables that influence locale functions.
Variable | Description |
LC_COLLATE | Specifies the collating sequence to use when sorting strings and when character ranges occur in patterns. |
LC_CTYPE | Specifies the character classification (codeset) information. |
LC_MONETARY | Specifies monetary formats. |
LC_NUMERIC | Specifies numeric formats. |
LC_MESSAGES | Specifies the language in which messages will appear if translations are available. In addition, this variable specifies strings for affirmative and negative responses. |
LC_TIME | Specifies date and time formats. |
LC_ALL | Overrides all preceding variables and the LANG environment variable. In general, this variable is used only in programs and should not be set by system managers and users. See the following section on limitations of locale variables for more information. |
As is true for the LANG variable, all of the variables in Table C-2 can be assigned locale names. Consider the case where your company is located in the United States but the prevalent language spoken by employees is Spanish. The LANG environment variable could be set to the name of a Spanish language locale and the LC_NUMERIC and LC_MONETARY variables set to the name of a United States English locale. The explicit setting of the LC_NUMERIC and LC_MONETARY variables overrides what they were implicitly set to by LANG. The LC_CTYPE, LC_MESSAGES, LC_TIME, and LC_COLLATE variables would still be implicitly set to the Spanish locale.
The following are the variable assignments for the C shell to implement this example:
setenv LANG es_ES.ISO8859-1
setenv LC_NUMERIC en_US.ISO8859-1
setenv LC_MONETARY en_US.ISO8859-1
The following are the same variable assignments for the Bourne and Korn shells:
LANG=es_ES.ISO8859-1
export LANG
LC_NUMERIC=en_US.ISO8859-1
export LC_NUMERIC
LC_MONETARY=en_US.ISO8859-1
export LC_MONETARY
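The way these variables combine can be modeled as a lookup in order of precedence: LC_ALL first, then the category variable, then LANG, and finally the C locale. The Python sketch below illustrates that order using a plain dictionary in place of the real environment; the function name is an assumption made for this example.

```python
CATEGORIES = ("LC_COLLATE", "LC_CTYPE", "LC_MONETARY",
              "LC_NUMERIC", "LC_TIME", "LC_MESSAGES")


def effective_locale(category, env):
    """Return the locale name that applies to one locale category."""
    for variable in ("LC_ALL", category, "LANG"):
        value = env.get(variable)
        if value:                  # unset and empty variables are skipped
            return value
    return "C"                     # default when nothing is set


# The Spanish-language, United States example from the text.
env = {
    "LANG": "es_ES.ISO8859-1",
    "LC_NUMERIC": "en_US.ISO8859-1",
    "LC_MONETARY": "en_US.ISO8859-1",
}
for category in CATEGORIES:
    print(category, effective_locale(category, env))
```

In this model, adding an `LC_ALL` entry to the dictionary changes the result for every category, which is why that variable is best left unset in normal use.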
Sometimes different versions of the same locale are available locally to meet the needs of certain languages or software applications. The names of such locales end with the at sign (@) plus a modifier field. For example, the collating sequence used for the telephone book in some languages is different from the collating sequence used for dictionaries. If the standard locale for a language defines the dictionary collating sequence, another version of the locale might exist to support the telephone book collating sequence. In this case, the alternative locale version might have a name like fr_FR.ISO8859-1@phone.
The ability to set locale allows you to tailor your environment, but it does not protect you from making mistakes. The following sections discuss problems that can arise when you define locale variables.
There is nothing to prevent you from defining implausible combinations of locale names for different aspects of a locale. For example, you could set the LANG environment variable to a French locale and the LC_CTYPE variable to a Norwegian locale. The results would probably be undesirable; for example, French message translations would likely contain characters not specified in the Norwegian locale. If you define locale variables in addition to LANG, you are responsible for ensuring a valid combination of locale settings.
The system has no way of knowing what locale was set when a file was created. Therefore, the system cannot prevent you from processing the file's data using a different locale. For example, suppose you copy to your system a file that was created when the LANG variable was set to a German locale. If, on your system, LANG is set to a French locale and you use the grep command to search for a string in the file, the grep command will use French collation and pattern matching rules on the German data. It is therefore your responsibility to know what kind of language data a file contains and to set the locale accordingly.
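The same principle applies to codesets: a file carries bytes, not a record of the codeset that produced them. The Python sketch below illustrates this with a single byte read under two of the ISO 8859 codesets; it extends the collation example above to codesets, and the function name is an assumption made for illustration.

```python
def read_as(data, codeset):
    """Interpret raw file bytes under the given codeset."""
    return data.decode(codeset)


RAW = b"\xe1"   # one byte, as it might be stored in a file

as_latin1 = read_as(RAW, "iso-8859-1")   # Western European reading
as_greek = read_as(RAW, "iso-8859-7")    # Greek reading

same_character = (as_latin1 == as_greek)
```

The byte decodes to the letter a with an acute accent under ISO 8859/1 and to the Greek letter alpha under ISO 8859/7, and nothing in the data itself says which reading was intended.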
The LC_ALL variable overrides all other locale-dependent environment variables, even if you set it before setting category-specific variables, such as LC_COLLATE. The only way to cancel the influence of LC_ALL is to undefine the variable. For example, enter the command unsetenv LC_ALL in the C shell, or unset LC_ALL in the Bourne and Korn shells.
The LC_ALL variable is available for users familiar with the System V environment. In that environment, users set locale either by setting LC_ALL or by setting all the locale category variables individually.