[Contents] [Previous Chapter] [Next Section] [Next Chapter] [Index] [Help]


2     Codesets and Codeset Conversion

DIGITAL UNIX fully supports the following Japanese codesets by including locales and codeset conversion support:

It also provides codeset conversion support for the following codesets:




2.1   DEC Kanji

DEC Kanji is the codeset currently used by all DIGITAL Japanese products. Thus, software supporting this codeset can exchange data with existing Japanese products. This codeset is denoted as deckanji in the DIGITAL UNIX system.

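DEC Kanji is formed by the following character sets: ASCII or the JIS X 0201 Roman letter set, JIS X 0208, and user-defined characters (UDCs). Sections 2.1.1 through 2.1.3 describe each in turn.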

DEC Kanji uses a combination of single-byte data and two-byte data to represent ASCII characters, symbols, and ideographic characters.




2.1.1   ASCII or JIS X 0201 Roman Letter Code

All ASCII characters or JIS X 0201 Roman letters are represented as single-byte 7-bit data in DEC Kanji. That is, the most significant bit (MSB) of these characters is always off (0).




2.1.2   JIS X 0208 Code

Each JIS X 0208 character is represented by a two-byte code in DEC Kanji. The MSB of both bytes is always set on to distinguish it from an ASCII/JIS Roman character or a user-defined character.

Figure 2-1: Representation of a JIS X 0208 Character in DEC Kanji


The first byte of a two-byte code determines its row number, while the second determines its column number. The following formula illustrates the code of a JIS X 0208 character in relation to its row and column numbers:

1st byte = A0 (hex) + row number

2nd byte = A0 (hex) + column number

For example, if a character is positioned at the first column of the 36th row (36 decimal is 24 hex), its encoding value can be calculated as follows:

1st byte = A0 (hex) + 24 (hex) = C4 (hex)

2nd byte = A0 (hex) + 01 (hex) = A1 (hex)

In this case, the character code is C4A1.
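
The calculation above can be sketched in Python. This is a minimal illustration, not part of DIGITAL UNIX; the function name deckanji_jisx0208 is hypothetical, and the 1-94 bounds check assumes the standard 94 x 94 layout of JIS X 0208.

```python
def deckanji_jisx0208(row: int, col: int) -> bytes:
    """Encode a JIS X 0208 position (row, column) as a DEC Kanji two-byte code."""
    if not (1 <= row <= 94 and 1 <= col <= 94):
        raise ValueError("row and column must be in the range 1-94")
    # Both bytes are offset by A0 (hex), so both have the MSB set.
    return bytes([0xA0 + row, 0xA0 + col])


# Row 36, column 1, as in the example above.
print(deckanji_jisx0208(36, 1).hex().upper())  # C4A1
```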




2.1.3   User-Defined Character Code

In addition to the ASCII or JIS Roman Code and the JIS X 0208 Code, DEC Kanji provides an area of 2,914 positions for user-defined characters. This UDC code range is shown in Table 2-1.

Table 2-1: DEC Kanji UDC Code Range

Area Usage     Row Range   Number of Characters   Code Range
User Area      1-31        2,914                  A121-BF7E
DEC Reserved   32-94

A UDC is also represented by a two-byte code, just like a JIS X 0208 character. However, the MSB of the second byte is off, which distinguishes it from a JIS X 0208 character, as shown in Figure 2-2.

Figure 2-2: Representation of a UDC in DEC Kanji


The following formula illustrates the code of a UDC in relation to its row and column numbers:

1st byte = A0 (hex) + row number

2nd byte = 20 (hex) + column number

For example, if a UDC is positioned at the first column of the 16th row (16 decimal is 10 hex), its encoding value can be calculated as follows:

1st byte = A0 (hex) + 10 (hex) = B0 (hex)

2nd byte = 20 (hex) + 01 (hex) = 21 (hex)

In this case, the character code is B021.
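
The UDC formula can be sketched the same way. The function name deckanji_udc is hypothetical; the row limit of 31 comes from the User Area in Table 2-1, and the column limit of 94 is an assumption based on the A121-BF7E code range.

```python
def deckanji_udc(row: int, col: int) -> bytes:
    """Encode a DEC Kanji user-defined character position as a two-byte code."""
    if not (1 <= row <= 31 and 1 <= col <= 94):
        raise ValueError("row must be in 1-31 and column in 1-94")
    # The first byte has the MSB set; the second byte (20 hex + column) does not.
    return bytes([0xA0 + row, 0x20 + col])


# Row 16, column 1, as in the example above.
print(deckanji_udc(16, 1).hex().upper())  # B021
```

The first and last positions of the User Area, (1, 1) and (31, 94), give A121 and BF7E, which matches the code range in Table 2-1.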




2.1.4   Two-Byte Code Space

Figure 2-3 illustrates the division of the two-byte code space and the position of JIS X 0208 and User-Defined Characters in DEC Kanji:

Figure 2-3: Two-Byte Code Space for DEC Kanji

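
The MSB rules from Sections 2.1.1 through 2.1.3 are enough to split a DEC Kanji byte string into characters. The following Python sketch shows one way to do it; the function name split_deckanji and the labels it returns are hypothetical, and it does not check that each byte falls inside the valid row and column ranges.

```python
def split_deckanji(data: bytes):
    """Yield (kind, raw_bytes) for each character in a DEC Kanji byte string."""
    i = 0
    while i < len(data):
        first = data[i]
        if first < 0x80:
            # MSB off: single-byte ASCII or JIS Roman character.
            yield ("ascii", data[i:i + 1])
            i += 1
            continue
        if i + 1 >= len(data):
            raise ValueError("truncated two-byte character")
        second = data[i + 1]
        # MSB of the second byte on: JIS X 0208; off: user-defined character.
        kind = "jisx0208" if second & 0x80 else "udc"
        yield (kind, data[i:i + 2])
        i += 2


for kind, raw in split_deckanji(b"9\xc4\xa1\xb0\x21"):
    print(kind, raw.hex().upper())
```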




2.2   Japanese EUC

Extended UNIX Code (EUC) is an encoding method that allows up to four character sets to be combined in a single data stream. Japanese EUC, denoted as eucJP, is the EUC codeset for representing Japanese data.

Figure 2-4: Encoding of Japanese EUC


CS0 is called the primary character set, while CS1 through CS3 are the supplementary character sets. The MSB of each byte in the primary character set must be off, while the MSB of all bytes in the supplementary character sets must be on. This scheme is used to determine the character set to which a character belongs.

The representation of ASCII/JIS Roman and JIS X 0208 characters is similar to that of DEC Kanji. In addition, two more character sets, JIS Katakana and JIS X 0212, are encoded in Japanese EUC by making use of the Single-Shift 2 (SS2) and Single-Shift 3 (SS3) control characters.
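
As an illustration of this scheme, the following Python sketch splits a Japanese EUC byte string into characters and labels each with its character set. The function name split_eucjp is hypothetical. The SS2 and SS3 values (8E and 8F hex) are the standard single-shift codes; the sketch does not validate byte ranges beyond the first byte.

```python
SS2 = 0x8E  # Single-Shift 2: the next byte is a JIS Katakana character (CS2)
SS3 = 0x8F  # Single-Shift 3: the next two bytes are a JIS X 0212 character (CS3)


def split_eucjp(data: bytes):
    """Yield (character_set, raw_bytes) for each character in Japanese EUC data."""
    i = 0
    while i < len(data):
        first = data[i]
        if first < 0x80:
            name, size = "CS0", 1   # ASCII/JIS Roman, MSB off
        elif first == SS2:
            name, size = "CS2", 2   # SS2 + one Katakana byte
        elif first == SS3:
            name, size = "CS3", 3   # SS3 + two JIS X 0212 bytes
        else:
            name, size = "CS1", 2   # JIS X 0208, both bytes with MSB on
        if i + size > len(data):
            raise ValueError("truncated character")
        yield (name, data[i:i + size])
        i += size


for name, raw in split_eucjp(b"A\xa4\xa2\x8e\xb1\x8f\xb0\xa1"):
    print(name, raw.hex().upper())
```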

Japanese EUC provides two areas for defining a UDC as shown in Table 2-2.

Table 2-2: Japanese EUC UDC Code Range

Area Usage   Row Range   Number of Characters   Code Range
JIS X 0208   85-94       940                    F5A1-FEFE
JIS X 0212   78-94       1,598                  SS3 + EEA1-FEFE

Note

JIS X 0212 characters (JIS Supplementary Kanji) are not supported in this release of the DIGITAL UNIX operating system.




2.3   Super DEC Kanji

Super DEC Kanji, denoted as sdeckanji, is an extension to DEC Kanji which supports the CS2 (JIS Katakana) and CS3 (JIS X 0212) character sets as encoded in Japanese EUC. It is a superset of both DEC Kanji and Japanese EUC. Data encoded in both DEC Kanji and Japanese EUC can be handled with this unified codeset. This codeset was invented to ease the transition from DEC Kanji to Japanese EUC. Figure 2-5 illustrates the encoding of Super DEC Kanji.

Figure 2-5: Encoding of Super DEC Kanji


Super DEC Kanji provides three areas for defining UDCs, as shown in Table 2-3.

Table 2-3: Super DEC Kanji UDC Code Range

Area Usage   Row Range   Number of Characters   Code Range
JIS X 0208   85-94       940                    F5A1-FEFE
JIS X 0212   78-94       1,598                  SS3 + EEA1-FEFE
UDC          1-94        8,836                  A121-FE7E




2.4   Shift JIS

Shift JIS, denoted as SJIS, is a popular codeset which is widely used in the PC market.

Shift JIS codes use a combination of single-byte data and two-byte data to represent characters defined in JIS X 0201 and JIS X 0208. To allow the characters defined in these standards to be encoded in a single codeset, the first byte of each JIS X 0208 character is encoded in the ranges 81-9F and E0-FC, while the second byte is in the range 40-FC (except 7F), as shown in Table 2-4.

Table 2-4: Code Range of JIS X 0208 Characters in Shift JIS

Byte          Range
First byte    81-9F, E0-FC
Second byte   40-FC (except 7F)

Figure 2-6 illustrates the first and second byte code space of Shift JIS.

Figure 2-6: Code Space of Shift JIS


Table 2-5 illustrates the mapping from the encoding of the first byte to the corresponding character sets in the Shift JIS encoding.

Table 2-5: Character Set Mapping in Shift JIS

Code Range of First Byte   Character Set           Bytes per Character
00-7F                      JIS Roman (X 0201)      1
81-9F                      JIS X 0208              2
A1-DF                      JIS Katakana (X 0201)   1
E0-FC                      JIS X 0208              2
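
The mapping in Table 2-5 translates directly into a lookup on the first byte. The following Python sketch is illustrative only; the function name sjis_first_byte is hypothetical, and first-byte values that the table does not list are rejected.

```python
def sjis_first_byte(byte: int):
    """Return (character_set, bytes_per_character) for a Shift JIS first byte."""
    if 0x00 <= byte <= 0x7F:
        return ("JIS Roman (X 0201)", 1)
    if 0x81 <= byte <= 0x9F or 0xE0 <= byte <= 0xFC:
        return ("JIS X 0208", 2)
    if 0xA1 <= byte <= 0xDF:
        return ("JIS Katakana (X 0201)", 1)
    raise ValueError(f"{byte:#04x} is not listed in Table 2-5")


print(sjis_first_byte(0x41))
print(sjis_first_byte(0x82))
print(sjis_first_byte(0xB1))
```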

Shift JIS provides an area for defining UDCs as follows:

Number of characters: 2,444

Code range: F040-FCFC




2.5   JIS Kanji

The JIS Kanji codesets use the ISO 2022 methodology for encoding the JIS X 0208 and JIS X 0201 character sets. There are two types of JIS Kanji encoding: 7-bit JIS Kanji code and 8-bit JIS Kanji code.




2.5.1   7-Bit JIS Kanji Code

In 7-bit JIS Kanji encoding, all characters are represented as 7 bits. Characters are interpreted according to control sequences as follows:

Kanji in sequence (ESC $ B)

Code values following the Kanji in sequence (ESC $ B) are treated as characters in the JIS X 0208 Kanji character set.

Kanji out sequence (ESC ( B)

Code values following the Kanji out sequence (ESC ( B) are treated as ASCII characters.

Supplementary Kanji in sequence (ESC $ ( D)

Code values following the supplementary Kanji in sequence (ESC $ ( D) are treated as characters in the JIS X 0212 supplementary Kanji character set.

User-Defined Character (UDC) in sequence (ESC $ ( 0)

Code values following the UDC in sequence (ESC $ ( 0) are treated as characters in the vendor-defined or user-defined character set.

Kana in (SO) and Kana out (SI) sequences

Code values following the Shift-Out (SO) control character (0x0e) and preceding the Shift-In (SI) control character (0x0f) are treated as characters in the JIS X 0201 Katakana character set.

Katakana in sequence (ESC ( I)

Code values following the Katakana in sequence (ESC ( I) are treated as characters in the JIS X 0201 Katakana character set. In this case, the Kanji out sequence is used to switch back to ASCII code.

The Katakana in and Kanji out sequences are an alternative to using the Kana in and out sequences (SO/SI).
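
The control sequences listed above can be handled with a small state machine. The following Python sketch is one possible reading, not the DIGITAL UNIX implementation. It makes several assumptions: the spaces in notation such as ESC $ B are not part of the byte sequence, the final character of the UDC sequence is the digit zero, data starts in ASCII, and Kana out (SI) restores whichever character set was active before Kana in (SO). The function name split_jis7 is hypothetical.

```python
ESC = b"\x1b"
SO, SI = 0x0E, 0x0F  # Kana in / Kana out control characters
KATAKANA = "JIS X 0201 Katakana"

# Control sequences from the list above, written without the notational spaces.
DESIGNATIONS = {
    ESC + b"$B": "JIS X 0208",
    ESC + b"(B": "ASCII",
    ESC + b"$(D": "JIS X 0212",
    ESC + b"$(0": "UDC",
    ESC + b"(I": KATAKANA,
}
TWO_BYTE_SETS = {"JIS X 0208", "JIS X 0212", "UDC"}


def split_jis7(data: bytes):
    """Yield (character_set, code_bytes) for each character in 7-bit JIS data."""
    charset = "ASCII"      # assumption: data starts in ASCII
    before_so = charset    # assumption: SI restores the set active before SO
    i = 0
    while i < len(data):
        byte = data[i]
        if byte == 0x1B:
            for seq, name in DESIGNATIONS.items():
                if data.startswith(seq, i):
                    charset = name
                    i += len(seq)
                    break
            else:
                raise ValueError(f"unknown escape sequence at offset {i}")
        elif byte == SO:
            if charset != KATAKANA:
                before_so = charset
            charset = KATAKANA
            i += 1
        elif byte == SI:
            charset = before_so
            i += 1
        else:
            size = 2 if charset in TWO_BYTE_SETS else 1
            if i + size > len(data):
                raise ValueError("truncated character")
            yield (charset, data[i:i + size])
            i += size


sample = b"9" + ESC + b"$B" + b"0!" + ESC + b"(B" + b"A\x0e1\x0fB"
for charset, code in split_jis7(sample):
    print(charset, code.hex())
```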




2.5.2   8-Bit JIS Kanji Code

In 8-bit JIS Kanji encoding, the JIS X 0201 Katakana characters are represented as 8 bits. Using this form of encoding, control sequences have the following effect:

Kanji in sequence (ESC $ B)

Code values following the Kanji in sequence (ESC $ B) are treated as characters in the JIS X 0208 Kanji character set.

Supplementary Kanji in sequence (ESC $ ( D)

Code values following the supplementary Kanji in sequence (ESC $ ( D) are treated as characters in the JIS X 0212 supplementary Kanji character set.

User-Defined Character (UDC) in sequence (ESC $ ( 0)

Code values following the UDC in sequence (ESC $ ( 0) are treated as vendor-defined or user-defined characters.

Kanji out sequence (ESC ( B)

Code values following the Kanji out sequence (ESC ( B) are treated as ASCII characters.

Kana in and out sequences (SO/SI)

These sequences are ignored.





2.5.3   Restrictions

The JIS Kanji codesets can be used in codeset conversion and terminal display.

For codeset conversion using the iconv utility, the string JIS7 indicates 7-bit JIS Kanji code in which Katakana characters follow a Katakana in sequence (ESC ( I), and the string jiskanji7 indicates 7-bit JIS Kanji code in which Katakana characters are entered between Kana in and out sequences (SO/SI). The following sequences are valid within the input data, but iconv does not generate these sequences when converting to JIS Kanji:

For terminal display using tty, the string jis7 indicates 7-bit JIS Kanji code and the string jis8 indicates 8-bit JIS Kanji code. When the terminal code is set to jis7, the Kana in and out sequences (SO/SI) are used for JIS X 0201 Katakana character representation.




2.6   ISO-2022-JP

The ISO-2022-JP codeset consists of the following character sets:

Note

JIS X 0208-1990 is a revised version of JIS X 0208-1978. Some characters of JIS X 0208-1978 were mapped to other positions.

Before a character set is used, it must be identified using an escape sequence as follows:

Escape Sequence   Character Set
ESC ( B           ASCII
ESC ( J           JIS X 0201-1976 (left-hand part)
ESC $ @           JIS X 0208-1978
ESC $ B           JIS X 0208-1990

It is assumed that the starting code of a line is ASCII (including CR alone and LF alone, but not including the combination CRLF). If there are JIS X 0208 characters on a line, there must be a switch to ASCII or to the left-hand part of JIS X 0201-1976 (Roman letters) before the end of the line (in other words, before the CRLF, or carriage return and line feed).

For example, if a line starts with the ASCII character 9, followed by the JIS X 0208-1978 character at row 16 column 1, the line is encoded as follows:

39h ESC $ @ 30h 21h .... ESC ( B .... CRLF

If a line starts with the JIS X 0208-1978 character at row 16 column 1, followed by the ASCII character 9, then the line is encoded as follows:

ESC $ @ 30h 21h ESC ( B 39h .... CRLF

Once a character set is designated, there is no need to redesignate the character set if the adjacent character belongs to the same character set. For example, the following practice is not recommended:

ESC $ B .... ESC $ B ....
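
The line rules above can be sketched as a small encoder. This Python example is illustrative only: the function name encode_line and its input format (ASCII strings mixed with (row, column) tuples) are hypothetical. The two-byte value of 20 hex plus the row or column number is inferred from the row 16, column 1 example (30h 21h). The encoder emits an escape sequence only when the character set changes and returns to ASCII before CRLF.

```python
ESC = b"\x1b"
DESIGNATE = {
    "ascii": ESC + b"(B",
    "jisx0208-1978": ESC + b"$@",
    "jisx0208-1990": ESC + b"$B",
}


def encode_line(items, kanji_set: str = "jisx0208-1978") -> bytes:
    """Encode ASCII strings and JIS X 0208 (row, column) tuples as one line."""
    out = bytearray()
    current = "ascii"  # a line is assumed to start in ASCII
    for item in items:
        if isinstance(item, str):
            wanted, payload = "ascii", item.encode("ascii")
        else:
            row, col = item
            wanted, payload = kanji_set, bytes([0x20 + row, 0x20 + col])
        if wanted != current:
            # Designate only when the character set actually changes.
            out += DESIGNATE[wanted]
            current = wanted
        out += payload
    if current != "ascii":
        out += DESIGNATE["ascii"]  # switch back before the end of the line
    return bytes(out) + b"\r\n"


print(encode_line(["9", (16, 1)]))
print(encode_line([(16, 1), "9"]))
```

Two adjacent JIS X 0208 characters, such as encode_line([(16, 1), (16, 2)]), share a single designation sequence.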

Currently, the ISO-2022-JP codeset can be used in codeset conversion.

The iconv utility uses the following escape sequences when code is converted to ISO-2022-JP.

Escape Sequence   Character Set
ESC ( B           ASCII
ESC $ B           JIS X 0208




2.7   Extended ISO-2022-JP

The extended ISO-2022-JP codeset, denoted as ISO-2022-JPext, is an extended version of the ISO-2022-JP codeset. It is extended to support narrow JIS X 0201 Katakana characters, JIS X 0212 characters, and user-defined characters (UDC).

This codeset can be used in codeset conversion.

The iconv utility uses the following escape sequences when code is converted to ISO-2022-JPext:

Escape Sequence   Character Set
ESC ( B           ASCII
ESC $ B           JIS X 0208
ESC ( I           JIS X 0201 Katakana
ESC $ ( D         JIS X 0212
ESC $ ( 0         UDC




2.8   UCS-2/UCS-4

UCS is a standard character encoding for the universal character set specified in the Unicode and ISO/IEC 10646 standards. UCS has two forms: UCS-2 (16-bit, or 2-octet units) and UCS-4 (32-bit, or 4-octet units). Unicode uses the UCS-2 form, which is commonly used on personal computers. ISO/IEC 10646 allows either UCS-2 or UCS-4 encoding. UCS-4 encoding is in use on systems that can support the larger data unit size.

The current version of the DIGITAL UNIX operating system supports both UCS-2 and UCS-4 encoding. UCS-4 is available in some Japanese locales, and can be used in codeset conversion. For information about codeset conversion, see Section 2.10. For information about locales, see Chapter 3, Locales.




2.9   UTF-8

Unicode and ISO/IEC 10646 standards define transformation formats for the universal character set. For the most part, the following UCS transformation formats (UTFs) exist to transform UCS values into sequences of bytes for handling by various byte-oriented protocols:

The current version of the DIGITAL UNIX operating system supports UTF-8 and UTF-16 but not UTF-7. UTF-8 can be used in codeset conversion and in the universal.utf8 locale. For information about codeset conversion, see Section 2.10. For information about locale variants, see Chapter 3, Locales.
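
To make the difference between the UCS forms and UTF-8 concrete, the following Python sketch prints one character in each form. It uses Python's own codecs rather than the DIGITAL UNIX converters. Python has no codec named UCS-2, so utf-16-be stands in for it, which gives the same bytes for any character that fits in 16 bits. Big-endian byte order is an assumption, and the function name universal_encodings is hypothetical.

```python
def universal_encodings(ch: str) -> dict:
    """Return hex strings for one character in UCS-2, UCS-4, and UTF-8."""
    if len(ch) != 1 or ord(ch) > 0xFFFF:
        raise ValueError("expects one character that fits in UCS-2")
    return {
        "UCS-2": ch.encode("utf-16-be").hex(),  # one 16-bit unit
        "UCS-4": ch.encode("utf-32-be").hex(),  # one 32-bit unit
        "UTF-8": ch.encode("utf-8").hex(),      # one to three bytes here
    }


# U+4E9C is the kanji at row 16, column 1 of JIS X 0208.
print(universal_encodings("\u4e9c"))
# {'UCS-2': '4e9c', 'UCS-4': '00004e9c', 'UTF-8': 'e4ba9c'}
```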




2.10   Codeset Conversion

The iconv utility provided by DIGITAL UNIX converts the encoding of characters in one codeset to another and writes the results to standard output. The following pairs of Japanese codeset converters are provided:

                 DEC    Japanese  Super DEC  Shift        ISO-2022-  ISO-2022-
                 Kanji  EUC       Kanji      JIS    JIS7  JP         JPext      UCS-4  UTF-8
DEC Kanji        -      Y         Y          Y      Y     Y          Y          Y      Y
Japanese EUC     Y      -         Y          Y      Y     Y          Y          Y      Y
Super DEC Kanji  Y      Y         -          Y      Y     Y          Y          Y      Y
Shift JIS        Y      Y         Y          -      Y     Y          Y          Y      Y
JIS7             Y      Y         Y          Y      -     N          N          N      N
ISO-2022-JP      Y      Y         Y          Y      N     -          N          N      N
ISO-2022-JPext   Y      Y         Y          Y      N     N          -          N      N
UCS-4            Y      Y         Y          Y      N     N          N          -      Y
UTF-8            Y      Y         Y          Y      N     N          N          Y      -
IBM Kanji        Y      Y         Y          Y      N     N          N          N      N
JEF              Y      Y         Y          Y      N     N          N          N      N
KEIS             Y      Y         Y          Y      N     N          N          N      N

For example, you can enter the following command to convert a DEC Kanji file to a Shift JIS file:

% iconv -f deckanji -t SJIS <file>

Use the strings shown in Table 2-6 as the parameters to the iconv utility.

Table 2-6: Codeset Names

Codeset                       String
DEC Kanji                     deckanji
Japanese EUC                  eucJP
Super DEC Kanji               sdeckanji
Shift JIS                     SJIS
JIS7 (ESC ( I for katakana)   JIS7
JIS7 (SO/SI for katakana)     jiskanji7
ISO-2022-JP                   ISO-2022-JP
Extended ISO-2022-JP          ISO-2022-JPext
UCS-2                         UCS-2
UCS-4                         UCS-4
UTF-8                         UTF-8
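
On a system without the DIGITAL UNIX converters, the same kind of conversion can be approximated in Python. This is a rough analogue, not a replacement for iconv: the codec names euc_jp and shift_jis are Python's, not the strings in Table 2-6, and Python has no codec for deckanji or sdeckanji, so Japanese EUC is used as the source here. The function name convert is hypothetical.

```python
def convert(data: bytes, from_codec: str = "euc_jp", to_codec: str = "shift_jis") -> bytes:
    """Convert bytes between two codesets by decoding and then re-encoding."""
    return data.decode(from_codec).encode(to_codec)


# The JIS X 0208 character at row 16, column 1 is B0A1 in Japanese EUC.
print(convert(b"\xb0\xa1").hex().upper())  # 889F
```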




2.10.1   User-Defined Character Mappings

Four supported Japanese codesets are used in the Japanese locales: DEC Kanji, Super DEC Kanji, Japanese EUC, and Shift JIS (SJIS). Each one has its own UDC ranges. There is a predefined mapping for UDCs among these four codesets, as shown in the following table:

SJIS            deckanji        sdeckanji       eucJP
0xf040-0xf4fc   0xa121-0xaa7e   0xa121-0xaa7e   0xf5a1-0xfefe
0xf540-0xf9fc   0xab21-0xb47e   0xab21-0xb47e   0x8ff5a1-0x8ffefe
0xfa40-0xfcfc   0xb521-0xbb7e   0xb521-0xbb7e   0x8feea1-0x8ff3fe

If you try to modify the codeset of a UDC, the UDC manager will ask if you want the other codeset values to be changed accordingly. Always choose the default answer to avoid problems with other software. For instance, if you define a SJIS UDC value of 0xf040, it will be mapped to the deckanji and sdeckanji value of 0xa121 and the eucJP value of 0xf5a1 automatically.

You should not use UDCs outside the ranges defined in the above table; if you do, the automatic mapping will not work properly.
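
The table gives only the endpoints of each range. The following Python sketch assumes that characters map in positional order within a range, which is consistent with the endpoints of the first two rows but is not stated in the table itself. It covers only those two rows (SJIS 0xf040-0xf9fc). All function names are hypothetical, and this is not the UDC manager's implementation.

```python
def sjis_udc_index(code: int) -> int:
    """Position of a Shift JIS UDC code, counted from 0xf040."""
    lead, trail = code >> 8, code & 0xFF
    if not 0xF0 <= lead <= 0xF9:
        raise ValueError("first byte outside the ranges covered by this sketch")
    if not 0x40 <= trail <= 0xFC or trail == 0x7F:
        raise ValueError("invalid second byte")
    # Each first byte has 188 valid second bytes: 40-7E and 80-FC.
    offset = trail - 0x40 if trail < 0x7F else trail - 0x41
    return (lead - 0xF0) * 188 + offset


def sjis_udc_to_deckanji(code: int) -> int:
    """Map to the deckanji/sdeckanji value: rows of 94 starting at 0xa121."""
    row, col = divmod(sjis_udc_index(code), 94)
    return ((0xA1 + row) << 8) | (0x21 + col)


def sjis_udc_to_eucjp(code: int) -> int:
    """Map to the eucJP value; the second block of 940 is prefixed by SS3 (0x8f)."""
    block, index = divmod(sjis_udc_index(code), 940)
    row, col = divmod(index, 94)
    value = ((0xF5 + row) << 8) | (0xA1 + col)
    return value if block == 0 else (0x8F << 16) | value


print(hex(sjis_udc_to_deckanji(0xF040)), hex(sjis_udc_to_eucjp(0xF040)))
# 0xa121 0xf5a1
```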




2.11   Codeset for Peripheral Devices

The DIGITAL UNIX operating system provides a mechanism by which you can configure your system to run applications with peripherals, such as terminals and printers, that support different codesets. You can specify the codesets for the applications, terminals, and printers independently, as shown in Table 2-7. The DIGITAL UNIX software automatically converts data to the appropriate codeset.

The DEC terminal codeset (dec) is similar to DEC Kanji, but also supports Kana characters (as in eucJP). It supports JIS X 0208 and JIS X 0208-1978, but not the JIS X 0212 characters of eucJP. The dec78 codeset supports the older JIS X 0208-1978, which has characters that differ slightly from the JIS X 0208-1983 characters supported in dec and deckanji.

Table 2-7: Feasible Codeset for Applications, Terminals, and Printers

Application Code   Terminal Code      Printer Code
DEC Kanji          DEC (dec)          DEC Kanji
                   DEC78 (dec78)
Japanese EUC       Japanese EUC       Japanese EUC
Super DEC Kanji                       Super DEC Kanji
Shift JIS          Shift JIS (SJIS)   Shift JIS (SJIS)
                   JIS7 (jis7)
                   JIS8 (jis8)

Note

Japanese DECterm software supports the deckanji, sdeckanji, or eucJP codeset (except for the user-defined characters) as its terminal code.

For details about setting up terminal code and printer code, see Writing Software for the International Market or Nihongo Kinou Guide Book (written in Japanese).

