Internationalization refers to the process of developing software programs without prior knowledge of the language, cultural data, or character-encoding schemes that the programs are expected to handle. In system terms, internationalization refers to the provision of interfaces that let programs produce varying output, depending on the specific environment in which they are run. The mnemonic I18N is frequently used as an abbreviation for internationalization.
This manual describes Tru64 UNIX interfaces and utilities that help you develop internationalized programs. These interfaces and utilities conform to specifications in the X/Open UNIX standard, which allows for implementation-defined behavior in certain areas. This manual identifies those software characteristics that are specific to the Tru64 UNIX operating system.
1.1 Language
An internationalized program makes no assumptions about the language of character data (text) that the program is designed to handle. The term data refers to data generated internally, data extracted from or written to files, and message text used for communication with the program's user.
Language has implications for processing text for such things as character handling and word ordering. Tru64 UNIX provides interfaces that allow internationalized programs to manipulate text according to the language requirements of individual users.
Language differences require the separation of message text from program code. Tru64 UNIX provides facilities that allow message text to be separated from the code, translated into different languages, and accessed by the program at run time. Chapter 3 explains how an internationalized program that uses the Worldwide Portability Interfaces (WPI) generates and accesses messages.
An internationalized program that uses X and Motif interfaces can separate message text from program code in the following ways:
By defining menu items, titles, text fields, and messages in UIL (User Interface Language) files
By specifying titles and font lists in application resource files
By specifying help messages in files that the Help widget uses
For information about separating message text from program code for X and Motif interfaces, refer to the following books:
X Window System Toolkit
OSF/Motif Programmer's Guide
Common Desktop Environment: Internationalization Programmer's Guide
1.2 Cultural Data
Cultural data refers to the conventions of a geographic area or territory for such things as date, time, and currency formats. An internationalized program cannot assume in advance how these formats are set; it uses system facilities to determine the formats at run time. This capability is provided through a language information database that programs can query for the required formats of cultural data items.
1.3 Character Sets
A character set is a set of alphabetic or other characters used to construct the words and other elementary units of a native language or computer language. A coded character set (or codeset) is a set of unambiguous rules that establishes a character set and the one-to-one relationship between each character of the set and its bit representation.
For a program to be able to handle text recorded in different codesets, the program cannot make assumptions about the size or bit assignment of character encodings. In particular, the program cannot assume that any part of an area used to store a character is available for other uses.
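As a minimal sketch of this rule, the following C function walks a string one character at a time with the standard mblen() function instead of assuming that every character occupies one byte. The function name count_chars is hypothetical and is not part of any Tru64 UNIX interface.

```c
#include <stdlib.h>

/* Count characters (not bytes) in a multibyte string, using the rules
 * of the current LC_CTYPE locale. Returns -1 if the string contains a
 * byte sequence that is not a valid character in that locale.
 * count_chars is a hypothetical helper name. */
long count_chars(const char *s)
{
    long count = 0;

    mblen(NULL, 0);                      /* reset any shift state */
    while (*s != '\0') {
        int len = mblen(s, MB_CUR_MAX);  /* bytes in the next character */
        if (len <= 0)                    /* invalid or incomplete sequence */
            return -1;
        s += len;
        count++;
    }
    return count;
}
```

In a single-byte locale the result equals the byte length of the string; in a multibyte locale it can be smaller.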
1.4 Localization
Localization refers to the process of implementing local requirements within a computer system. Some of these requirements are addressed by locales. Each locale is a set of data that supports a particular combination of native language, cultural data, and codeset. The type of information a locale can contain and the interfaces that use a locale are subject to standardization. However, where locales reside on the system and how they are named can vary from one vendor to another.
There is more to localization than providing locales. For example, the localization process means making sure that translations are available for software messages; that appropriate fonts and measurement systems are supported and available for display and printing devices; and, in some cases, that additional software is written to handle local requirements.
The mnemonic L10N is frequently used as an abbreviation for localization.
1.4.1 Collating Sequence
The ordering of characters may be implicit in underlying hardware but can be defined for software to conform to the way language is used in a particular territory. Many languages have more complex rules for sorting than English. The following list describes some collating rules that do not exist for English:
A single letter is not necessarily represented by a single character. In traditional Spanish, for example, the character combination ch sorts between the characters c and d.
A single character can be equivalent to a combined set of characters. For example, the ß character is equivalent to ss in standard and Swiss German and to sz in Austrian German.
Accented letters do not always follow unaccented letters. In many languages, this is true only if the words that contain those letters are otherwise identical. In other languages, a particular accented letter may be considered unique and sort after a letter that is different from the unaccented counterpart.
Characters can be sorted in multiple ways for the same language. The ideographic characters in Asian languages have sort orders based on pronunciation and on two visually recognized components (radicals, which are pictograms for elements of meaning, and the number of strokes).
Each locale contains information about collating sequences that informs string comparison functions about the relative ordering of characters defined in the associated codeset. Internationalized regular expressions also use the collating sequence for implementing character ranges, collating symbols, and equivalence classes.
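The following sketch shows one way a program might sort words according to the collating sequence of the current locale, assuming the standard C function strcoll() as the string comparison function. The names compare_collated and sort_words are hypothetical.

```c
#include <stdlib.h>
#include <string.h>

/* qsort() callback: order two strings by the collating sequence of
 * the current LC_COLLATE locale rather than by raw byte values. */
static int compare_collated(const void *a, const void *b)
{
    const char *const *sa = a;
    const char *const *sb = b;
    return strcoll(*sa, *sb);
}

/* Sort an array of n strings in locale-specific collation order.
 * sort_words is a hypothetical helper name. */
void sort_words(const char **words, size_t n)
{
    qsort(words, n, sizeof *words, compare_collated);
}
```

Before the program binds to a locale, strcoll() compares strings the same way strcmp() does; the ordering changes only after the program selects a locale whose collation rules differ.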
1.4.2 Character Classification
Character classification information describes the characteristics associated with each valid character code; that is, whether the code defines an alphabetic, uppercase, lowercase, punctuation, control, space, or other kind of character. Character classification functions and internationalized regular expressions use this information to determine character classes.
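As an illustration, a program can ask the locale whether a character is alphabetic instead of testing for fixed code ranges such as A through Z. The sketch below assumes the standard wide-character classification function iswalpha(); the name count_alpha is hypothetical.

```c
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

/* Count the characters in a wide-character string that the current
 * LC_CTYPE locale classifies as alphabetic.
 * count_alpha is a hypothetical helper name. */
size_t count_alpha(const wchar_t *ws)
{
    size_t n = 0;

    for (; *ws != L'\0'; ws++) {
        if (iswalpha((wint_t)*ws))
            n++;
    }
    return n;
}
```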
1.4.3 Case Conversion
Case conversion refers to information that identifies the possible alternative case of each valid character code. Case conversion functions use this information to change characters from uppercase to lowercase or from lowercase to uppercase. Note that in some languages case is not a characteristic of every letter, and in others no character has case at all.
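A short sketch of locale-driven case conversion follows, assuming the standard function towupper(). Characters that have no uppercase form in the current locale are returned unchanged, so the loop is safe for text that contains caseless characters. The name upcase_wide is hypothetical.

```c
#include <wchar.h>
#include <wctype.h>

/* Convert a wide-character string to uppercase in place, using the
 * case-conversion data of the current LC_CTYPE locale. Characters
 * without an uppercase equivalent are left as they are.
 * upcase_wide is a hypothetical helper name. */
void upcase_wide(wchar_t *ws)
{
    for (; *ws != L'\0'; ws++)
        *ws = (wchar_t)towupper((wint_t)*ws);
}
```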
1.4.4 Language Information
Language information (or langinfo database) refers to localization data that describes the format and setting of cultural data that can vary from one locale to another. This information includes the appropriate formats and characters for date and time, currency, and numeric values.
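The sketch below shows how a program might query the langinfo database for the date format at run time rather than hard-coding it. It assumes the X/Open nl_langinfo() function with the D_FMT item and passes the result to strftime(); the name format_local_date is hypothetical.

```c
#include <langinfo.h>
#include <stddef.h>
#include <time.h>

/* Format the date in *when using the date format that the current
 * locale stores in its langinfo database (item D_FMT). Returns the
 * number of bytes written, or 0 if the result does not fit in buf.
 * format_local_date is a hypothetical helper name. */
size_t format_local_date(char *buf, size_t size, const struct tm *when)
{
    const char *fmt = nl_langinfo(D_FMT);  /* locale's date format */
    return strftime(buf, size, fmt, when);
}
```

Other langinfo items, such as RADIXCHAR for the decimal separator, can be queried in the same way.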
1.4.5 Message Catalogs
A message catalog is a file or storage area that contains program messages, command prompts, and responses to prompts for a particular language. Motif applications also use resource files and UIL files in addition to or in place of message catalogs for text and other values that can vary from one locale to another. Chapter 3 describes the messaging system.
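As a rough sketch only, the following helper retrieves a message from an open catalog and falls back to a built-in default when the catalog is unavailable. It assumes the X/Open catgets() interface declared in nl_types.h; Chapter 3 remains the authoritative description of the messaging system. The name lookup_message is hypothetical.

```c
#include <nl_types.h>

/* Return message msg_id from set set_id of an open catalog, or the
 * caller's fallback text if the catalog could not be opened or the
 * message is missing. lookup_message is a hypothetical helper name. */
const char *lookup_message(nl_catd catd, int set_id, int msg_id,
                           const char *fallback)
{
    if (catd == (nl_catd)-1)       /* catopen() failed earlier */
        return fallback;
    return catgets(catd, set_id, msg_id, fallback);
}
```

A caller would typically open the catalog once with catopen("myapp.cat", NL_CAT_LOCALE), where myapp.cat is a placeholder name, and release it with catclose() before exiting.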
1.5 Language Announcement
Language announcement is the mechanism by which language, cultural data, and codeset requirements are set for the system as a whole, by an application, or by individual users. Language announcement is performed by setting a locale name in a set of reserved environment variables. System managers can set the default values for these variables for different shell environments; refer to the System Administration book for information about setting locale defaults for shells. Users can also set locale variables on a per-process basis.
Typically, internationalized programs read locale variables at run time and use them to attach settings to locale categories in the programs' operational environment. However, programs can also set these categories internally when appropriate. Therefore, the binding to a particular locale need not be general for all parts of a program. Within one execution cycle, different parts of the program can request different localizations.
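The two cases described above can be sketched with the standard setlocale() function. The first helper binds every category to the locale announced in the environment (variables such as LANG and LC_ALL); the second temporarily binds one category, LC_NUMERIC, to the C locale for a single operation and then restores the previous setting. Both function names are hypothetical.

```c
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Bind all locale categories to the locale announced in the
 * environment; fall back to the C locale if that locale is not
 * available. Returns the name of the locale now in effect. */
const char *announce_locale(void)
{
    const char *name = setlocale(LC_ALL, "");
    if (name == NULL)
        name = setlocale(LC_ALL, "C");
    return name;
}

/* Format a value with C-locale numeric rules (for example, for a data
 * file that other programs parse), whatever the user's locale is,
 * then restore the previous LC_NUMERIC setting. */
int format_portable(char *buf, size_t size, double value)
{
    const char *current = setlocale(LC_NUMERIC, NULL);
    char *saved = NULL;
    int n;

    if (current != NULL) {         /* copy: later calls may overwrite it */
        saved = malloc(strlen(current) + 1);
        if (saved != NULL)
            strcpy(saved, current);
    }
    setlocale(LC_NUMERIC, "C");
    n = snprintf(buf, size, "%.2f", value);
    if (saved != NULL) {
        setlocale(LC_NUMERIC, saved);
        free(saved);
    }
    return n;
}
```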
1.6 Terms and Definitions
This section defines terms used extensively in this guide. Less common terms are defined when they first appear.
1.6.1 Characters and Strings
A character is a sequence of one or more bytes that represents a single graphic symbol or control code. Do not confuse the term character with the C programming language char data type, which represents an object large enough to store any member of the basic execution character set and which is usually mapped as an 8-bit value. Unlike the char data type in C, a character can be represented by a value that is one or more bytes. The expression multibyte character is synonymous with the term character; that is, both refer to character values of any length, including single-byte values.
A character string or string is a contiguous sequence of bytes terminated by and including the null byte. A string is an array of type char in the C programming language. The null byte is a value with all bits set to zero (0).
A wide character is an integral type that is large enough to hold any member of the extended execution character set. In program terms, a wide character is an object of type wchar_t, which is defined in the header files /usr/include/stddef.h (for conformance to the X/Open XSH specification) and /usr/include/stdlib.h (for conformance to the ANSI C standard). The file locations where this data type is defined are determined by standards organizations; however, the definition itself is implementation specific. For example, implementations that support only single-byte codesets (not the case for Tru64 UNIX) might define wchar_t as a byte value.
A wide-character string is a contiguous sequence of wide characters terminated by and including the null wide character. A wide-character string is an array of type wchar_t. The null wide character is a wchar_t value with all bits set to zero (0).
An empty string is a character string whose first element is the null byte. Similarly, an empty wide-character string is a wide-character string whose first element is the null wide character.
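To relate these terms to code, the sketch below converts a character string (an array of char that may hold multibyte characters) to a wide-character string (an array of wchar_t) with the standard mbstowcs() function. The name to_wide is hypothetical, and the caller is assumed to free the result.

```c
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Convert a multibyte character string to a newly allocated
 * wide-character string under the current LC_CTYPE locale. Returns
 * NULL if memory is exhausted or the input contains an invalid
 * character. to_wide is a hypothetical helper name. */
wchar_t *to_wide(const char *s)
{
    /* Every character occupies at least one byte, so the byte length
     * plus one element for the null wide character is always enough. */
    size_t max = strlen(s) + 1;
    wchar_t *ws = malloc(max * sizeof *ws);

    if (ws == NULL)
        return NULL;
    if (mbstowcs(ws, s, max) == (size_t)-1) {
        free(ws);
        return NULL;
    }
    return ws;
}
```

Converting an empty string yields an empty wide-character string, that is, one whose first element is the null wide character.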
1.6.2 Portable Character Set
The Portable Character Set (PCS) is supported in both compile-time (source) and run-time (executable) environments for all locales. The PCS contains:
The 26 uppercase letters of the English alphabet:
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
The 26 lowercase letters of the English alphabet:
a b c d e f g h i j k l m n o p q r s t u v w x y z
The 10 decimal digits:
0 1 2 3 4 5 6 7 8 9
The following 32 graphic characters:
! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~
The space character, plus control characters that represent the horizontal tab, vertical tab, and form feed.
In addition to the preceding characters, the execution version of the PCS contains control characters that represent alert, backspace, carriage return, and new line.
The PCS as defined by X/Open is similar to the basic source and basic execution character sets defined in ISO/IEC 9899:1990, except that the X/Open version also includes the dollar sign ($), commercial at sign (@), and grave accent ( ` ) characters.
Some locales (for example, ISO 646 variants) may make substitutions for one or more of the preceding characters. In such cases, the substituted character has the same syntactic meaning as the character it replaces in the PCS. An example of a character substitution might be the British pound sign ( £ ) for the number sign (#) that is the default.
The definition of a character set that is portable across all codesets is particularly relevant to encoding formats that support a limited set of native languages. This is typical for most of the character encoding formats developed for UNIX systems. In other words, the codeset used for a Chinese locale must include all the PCS characters in addition to characters that are part of the Chinese language. However, that same codeset probably would not include characters needed to support Russian or Icelandic. Similarly, the codeset used for the Russian language probably would not include any Chinese characters but must include all the PCS characters. Therefore, no matter what the locale setting, programs can assume that characters in the PCS are available.
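For illustration only, the characters listed in this section can be collected into a table that a program uses to check whether text stays within the portable set, for example before writing identifiers or file names that must be valid in every locale. The table covers only the characters enumerated above, and the names pcs_listed and in_listed_pcs are hypothetical.

```c
#include <string.h>

/* The PCS characters enumerated in this section: letters, digits,
 * the 32 graphic characters, space, and the listed control
 * characters (execution version included). */
static const char pcs_listed[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789"
    "!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~"
    " \t\v\f"
    "\a\b\r\n";

/* Return 1 if c is one of the characters in the table, 0 otherwise.
 * The null byte is rejected explicitly because strchr() would match
 * the table's own terminator. */
int in_listed_pcs(int c)
{
    return c != '\0' && strchr(pcs_listed, c) != NULL;
}
```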
1.6.3 The Universal Character Set
The Universal Character Set (UCS) was developed to support all characters in all native languages. This character set supports the philosophy that applications should be able to manipulate characters in any language by using the same encoding format and set of rules. The first implementation of this character encoding format, widely known as Unicode, was limited to the 16-bit values supported by early PC systems. However, current standards (ISO/IEC 10646 and the Unicode Standard) specify a 32-bit (UCS-4) encoding format that expands the number of characters that can be supported and is more efficiently manipulated as process code on larger computer systems.
The operating system supports various UCS encoding formats through a set of locales and codeset converters. The locales and some library functions allow applications to use UCS-4 as internal process code. The codeset converters allow file data to be converted to encoding formats supported by fonts and other software resident on the system.