Unicode(5) — Macro Packages and Conventions

NAME

Unicode, unicode, universal.utf8, UCS-2, UCS-4, UTF-8, iso10646 − Support for the Unicode and ISO/IEC 10646 standards

DESCRIPTION

The operating system provides locales and codeset converters that support the following standards:

•The Unicode Standard, Version 2.0, Unicode, Inc., 1996

•Information Technology−Universal Multiple-Octet Coded Character Set, ISO/IEC 10646:1993

The Basic Multilingual Plane defined by this standard is identical with the main body of Unicode character encoding.

These standards define generalized character encoding rules that can be applied to characters in most native language scripts. The Unicode Standard specifies a universal character set (UCS) that, in Version 2.0, defines 38,885 characters and also includes a Private Use Area for vendor- or user-defined characters. The following list summarizes the main features of this character set:

•All characters are treated as 16-bit units.

•Each 16-bit unit has an abstract character identity.

•Certain sequences of 16-bit characters in a text stream are transformed into other characters, called composed characters.

•Characters have properties, such as base, numeric, spacing, combination, and directionality. The Unicode standard provides rules for ordering characters with different properties so that parsing of character sequences is unambiguous.

•The relationship between Unicode characters and the glyphs in the native language script that users see, type, or print is not necessarily one-to-one. A glyph may be mapped to a single abstract character or a composed character. Conversely, more than one glyph can be mapped to a character.

•The ISO 8859-1 character set occupies the first 256 code positions (and the ASCII character set the first 128 positions) of the UCS.
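The last item in the list above can be checked directly. The following Python sketch is only an illustration; it relies on Python's built-in "latin-1" and "ascii" codecs rather than on the operating system's converters, and the function names are hypothetical.

```python
# Illustrative check: every ISO 8859-1 byte value decodes to the UCS
# character with the same code position, and ASCII covers the first 128.

def latin1_matches_ucs() -> bool:
    """Return True if each of the 256 ISO 8859-1 byte values maps to the
    UCS code position with the same numeric value."""
    for value in range(256):
        ch = bytes([value]).decode("latin-1")
        if ord(ch) != value:
            return False
    return True


def ascii_matches_ucs() -> bool:
    """Return True if each of the 128 ASCII byte values maps to the UCS
    code position with the same numeric value."""
    return all(ord(bytes([v]).decode("ascii")) == v for v in range(128))


if __name__ == "__main__":
    print(latin1_matches_ucs(), ascii_matches_ucs())
```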

The ISO/IEC 10646 standard specifies a 32-bit unit, rather than 16-bit unit, for each abstract character defined in the UCS. The 16-bit character values in Unicode are zero-extended through a second 16-bit unit to conform to ISO/IEC 10646. The second, or high-order, 16-bit unit is reserved for future use in both standards.

The Unicode and ISO/IEC 10646 standards specify a uniform character size and allow character units to be processed for all languages by using the same set of rules. Therefore, system support for the universal character set does not need to include multiple algorithms (one or more per language) for converting between file code and internal process code. However, the two different character sizes (16-bit or 32-bit) that the standards support require different parsing schemes for data input and output. Universal character encoding that an implementation parses in 16-bit units (2 octets) is known as UCS-2. This is the canonical Unicode encoding in wide use on PC systems. Universal character encoding that an implementation parses in 32-bit units (4 octets) is known as UCS-4. This is the canonical ISO/IEC 10646 encoding that is in use on systems that can support the larger data unit size.
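The difference between the two unit sizes can be sketched in a few lines of Python. This is a minimal illustration, not the system's converter code: the function names are hypothetical, and big-endian byte order is an assumption made only so the zero-extension is easy to see.

```python
import struct

# Assumption: big-endian byte order, chosen for readability only.

def ucs2_units(text: str) -> bytes:
    """Encode each character as one 16-bit unit (2 octets)."""
    out = b""
    for ch in text:
        cp = ord(ch)
        if cp > 0xFFFF:
            raise ValueError("character does not fit in one 16-bit unit")
        out += struct.pack(">H", cp)
    return out


def ucs4_units(text: str) -> bytes:
    """Encode each character as one 32-bit unit (4 octets)."""
    return b"".join(struct.pack(">I", ord(ch)) for ch in text)


if __name__ == "__main__":
    # The UCS-4 form of a 16-bit value is the same value with two
    # zero octets in the high-order position.
    print(ucs2_units("A").hex(), ucs4_units("A").hex())
```

For a character in the 16-bit range, the UCS-4 form is the UCS-2 form preceded by two zero octets, which is the zero-extension described earlier.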

The standards define three transformation formats for the universal character set. For the most part, the following UCS transformation formats (UTFs) exist to transform UCS values into sequences of bytes for handling by various byte-oriented protocols:

•UTF-8, the standard method for transforming UCS-4 encoding into a sequence of 8-bit bytes and ensuring interchange transparency for characters in C0 code positions (0 to 31), the SPACE (32) character, and the DEL (127) character

•UTF-7, the standard interchange format for environments that strip the eighth bit from each byte

•UTF-1, which is similar to UTF-8 but also ensures interchange transparency of characters in C1 code positions (128 to 159)
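The transparency property described for UTF-8 can be illustrated with a small hand-written encoder. The sketch below is an assumption-laden illustration: the function name is hypothetical, and it covers only code positions up to 0x10FFFF (the range used by current Unicode), not the longer sequences that the original UCS-4 definition allowed.

```python
def utf8_encode(cp: int) -> bytes:
    """Transform one UCS code position into its UTF-8 byte sequence.

    Sketch only: handles code positions 0 through 0x10FFFF.
    """
    if cp < 0:
        raise ValueError("code position must be non-negative")
    if cp < 0x80:
        # Positions 0 to 127 (C0 controls, SPACE, DEL, and the rest of
        # ASCII) are emitted unchanged as a single byte.
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    if cp <= 0x10FFFF:
        return bytes([
            0xF0 | (cp >> 18),
            0x80 | ((cp >> 12) & 0x3F),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    raise ValueError("code position outside the range handled by this sketch")


if __name__ == "__main__":
    print(utf8_encode(0x41), utf8_encode(0xE9), utf8_encode(0xD55C))
```

Every byte of a multi-byte sequence has its high bit set, so a byte in the range 0 to 127 always stands for the character with that code position.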

The ISO/IEC 10646:1993 standard includes a fourth transformation format, UTF-16. This transformation format is equivalent to surrogate character extensions defined within UCS-2 by Version 2.0 of the Unicode Standard. DIGITAL UNIX provides locales and codeset converters that provide limited support for UCS-4 and UTF-8. The operating system supports UCS-2 only through codeset converters. The operating system provides no support for the UTF-1 and UTF-7 transformation formats.
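The surrogate extension that UTF-16 shares with UCS-2 represents a code position above the 16-bit range as a pair of 16-bit units. The following Python sketch shows the arithmetic as it is generally defined for UTF-16; the function names are hypothetical, and the sketch says nothing about what the operating system's converters accept.

```python
def to_surrogate_pair(cp: int) -> tuple:
    """Split a code position in the range 0x10000..0x10FFFF into a
    (high, low) pair of 16-bit surrogate units."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code position does not need a surrogate pair")
    offset = cp - 0x10000
    high = 0xD800 | (offset >> 10)    # upper 10 bits of the offset
    low = 0xDC00 | (offset & 0x3FF)   # lower 10 bits of the offset
    return high, low


def from_surrogate_pair(high: int, low: int) -> int:
    """Recombine a (high, low) surrogate pair into one code position."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


if __name__ == "__main__":
    high, low = to_surrogate_pair(0x1F600)
    print(hex(high), hex(low))
```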

Codeset Conversion

If the worldwide support subsets are installed on your system, you can enter the following commands to find the converters that are available for converting file data to and from UCS-2, UCS-4, and UTF-8 format:

% cd /usr/lib/nls/loc/iconv
% ls | grep UTF
% ls | grep UCS

Among the converters listed, you will find some that handle conversion of data in the code-page format used on PC systems. See the code_page(5) reference page for more information about converting between codeset and code-page formats. All codeset converters can be used with the iconv command and associated library functions.

Note

There was a change in the mapping of Korean Hangul characters between Version 1.1 and Version 2.0 of the Unicode Standard. By default, UCS-2, UCS-4, and UTF-8 conversion assumes Version 2.0 character mapping for Hangul characters. Therefore, data in Version 1.1 format must be converted to Version 2.0 format before it is converted from UCS-2, UCS-4, or UTF-8 to an entirely different format.

The format of a codeset converter name is from-codeset_to-codeset. In converter names, the Version 1.1 codeset formats for UCS-2, UCS-4, and UTF-8 are represented by UNICODE-1-1, UNICODE-1-1-UCS-4, and UNICODE-1-1-UTF-8, respectively. The Version 2.0 codeset names are represented by UCS-2, UCS-4, and UTF-8. For example, if Korean data is currently in UCS-4 Version 1.1 format, the data must first be processed by the UNICODE-1-1-UCS-4_UCS-4 converter before being processed by the UCS-4_deckorean converter.
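The naming rule and the two-step Hangul conversion in this note can be sketched as follows. The Python helpers are hypothetical and only build converter names from the from-codeset_to-codeset rule; they do not check whether a given converter is actually installed in /usr/lib/nls/loc/iconv.

```python
# Version 1.1 codeset names as given in the note above.
VERSION_1_1_NAMES = {
    "UCS-2": "UNICODE-1-1",
    "UCS-4": "UNICODE-1-1-UCS-4",
    "UTF-8": "UNICODE-1-1-UTF-8",
}


def converter_name(from_codeset: str, to_codeset: str) -> str:
    """Build a converter name of the form from-codeset_to-codeset."""
    return f"{from_codeset}_{to_codeset}"


def hangul_conversion_chain(ucs_format: str, target: str,
                            version_1_1: bool) -> list:
    """Return the converter names to apply, in order.

    ucs_format is "UCS-2", "UCS-4", or "UTF-8". If the data uses the
    Version 1.1 Hangul mapping, a Version 1.1 to Version 2.0 step is
    placed before the conversion to the target codeset.
    """
    if ucs_format not in VERSION_1_1_NAMES:
        raise ValueError("unknown UCS format: " + ucs_format)
    chain = []
    if version_1_1:
        chain.append(converter_name(VERSION_1_1_NAMES[ucs_format], ucs_format))
    chain.append(converter_name(ucs_format, target))
    return chain


if __name__ == "__main__":
    print(hangul_conversion_chain("UCS-4", "deckorean", version_1_1=True))
```

Called with UCS-4 data in Version 1.1 format and the deckorean target, the helper returns the same two converter names used in the example in this note.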

See the iconv_intro(5) reference page for general information on codeset conversion.

Locales

The worldwide support subsets provide the universal.utf8@ucs4 locale, plus @ucs4 variants for locales that support a specific combination of language, country, and codeset. When an application runs in any of the following locales, ucs4 is the internal process code; applications can use these locales to test a UCS-4 character to determine its classification:

•universal.utf8@ucs4

This locale converts data in UTF-8 file format to ucs4 process code. The locale can be used to test any UCS-4 character to determine if it is included in one of the following classes defined for the LC_CTYPE category: alnum, alpha, blank, cntrl, digit, graph, lower, print, punct, space, upper, or xdigit.

In the universal.utf8@ucs4 locale, the LC_MESSAGES, LC_MONETARY, LC_NUMERIC, and LC_TIME category definitions match those for the POSIX (C) locale.

•native_locale_name@ucs4

These locales (for example, fr_FR.ISO8859-1@ucs4) perform the same function as the universal.utf8@ucs4 locale but are different in the following ways:

—
The file code is specified by the codeset portion (for example, ISO8859-1) of native_locale_name.

—
Classification information is not provided for the full set of UCS-4 characters, but only for those in a particular native language (for example, French).

—
Country-specific data is also available to the application. The LC_COLLATE, LC_MESSAGES, LC_MONETARY, LC_NUMERIC, and LC_TIME category definitions match those defined in native_locale_name.
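The relationship between a locale name, its file code, and the UCS-4 process code can be sketched in Python. This is an analogy only: the function names are hypothetical, the decoding uses Python's own codecs, and the classification uses Python's built-in Unicode character database rather than the LC_CTYPE definitions of any system locale.

```python
from typing import NamedTuple


class LocaleName(NamedTuple):
    language: str
    territory: str
    codeset: str
    modifier: str


def parse_locale_name(name: str) -> LocaleName:
    """Split a name such as fr_FR.ISO8859-1@ucs4 into its parts."""
    base, _, modifier = name.partition("@")
    lang_terr, _, codeset = base.partition(".")
    language, _, territory = lang_terr.partition("_")
    return LocaleName(language, territory, codeset, modifier)


def to_process_code(file_bytes: bytes, locale_name: str) -> list:
    """Decode file-code bytes into a list of UCS-4 code positions.

    Assumption: the codeset part of the locale name is also a valid
    Python codec name (true for "utf8" and "ISO8859-1").
    """
    codeset = parse_locale_name(locale_name).codeset
    return [ord(ch) for ch in file_bytes.decode(codeset)]


def classify(cp: int) -> set:
    """Return a few LC_CTYPE-style class names for one code position,
    based on Python's Unicode data (not on a system locale)."""
    ch = chr(cp)
    checks = {
        "alpha": ch.isalpha(),
        "digit": ch.isdigit(),
        "upper": ch.isupper(),
        "lower": ch.islower(),
        "space": ch.isspace(),
    }
    return {name for name, matched in checks.items() if matched}


if __name__ == "__main__":
    for cp in to_process_code(b"\xe9", "fr_FR.ISO8859-1@ucs4"):
        print(hex(cp), sorted(classify(cp)))
```

The same character reaches the same process code whether it is read as one ISO8859-1 byte or as a two-byte UTF-8 sequence; only the file code named in the locale differs.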

Font Support

The operating system does not provide display or printer fonts for UCS characters; however, UCS data can be converted into other codeset formats for which there is font support.
