iconv_intro(5)  —  Macro Packages and Conventions

NAME

iconv_intro, iconv - Introduction to codeset conversion

DESCRIPTION

Conversion of character encoding from one coded character set (codeset) to another is an operation that often has to be performed by the operating system and some applications. For example, the man command supports codeset conversion to allow one set of reference page files to meet the needs of locales that support the same language and territory but different codesets (see man(1)). 

The following commands and library interfaces give users and application developers direct access to codeset conversion operations:

       • The iconv command converts characters in a data file from one codeset to another (see iconv(1)).

       • The iconv(), iconv_open(), and iconv_close() functions convert a string of characters from one codeset to another (see iconv(3), iconv_open(3), and iconv_close(3)).  The iconv command uses these interfaces to convert characters.
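
As an illustration of the library interfaces listed above, the following is a minimal C sketch that converts one string by calling iconv_open(), iconv(), and iconv_close(). The helper name convert_string is hypothetical, and the codeset names a caller passes (for example, ISO-8859-1 and UTF-8) are assumptions; the names accepted vary between platforms. Some older systems declare the second argument of iconv() as const char **, in which case the cast shown here is not needed.

```c
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: convert the NUL-terminated string `in` from codeset
 * `from` to codeset `to`, writing a NUL-terminated result into `out`
 * (capacity `outsz` bytes). Returns the number of bytes written, not
 * counting the NUL, or -1 on any error (unknown codeset, invalid input,
 * or output buffer too small). */
long convert_string(const char *from, const char *to,
                    const char *in, char *out, size_t outsz)
{
    if (outsz == 0)
        return -1;

    /* Note the argument order: destination codeset first, then source. */
    iconv_t cd = iconv_open(to, from);
    if (cd == (iconv_t)-1)
        return -1;

    char *inp = (char *)in;        /* iconv() does not modify the input bytes */
    size_t inleft = strlen(in);
    char *outp = out;
    size_t outleft = outsz - 1;    /* reserve one byte for the terminator */

    size_t rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    if (rc != (size_t)-1) {
        /* Flush any pending shift sequence for stateful codesets. */
        rc = iconv(cd, NULL, NULL, &outp, &outleft);
    }

    iconv_close(cd);
    if (rc == (size_t)-1)
        return -1;

    *outp = '\0';
    return (long)(outp - out);
}
```

On a system that provides both codesets, converting the four-byte ISO-8859-1 string "caf\xe9" to UTF-8 with this helper yields five bytes, because the final character needs two bytes in UTF-8.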

There are two types of codeset converters: algorithmic and table. Algorithmic converters, which reside in the /usr/lib/nls/loc/iconv directory, are shared libraries with a predefined entry point for invocation by functions in the libiconv.so library.  Algorithmic converters are more common for the conversion of multibyte codesets, in part because early versions of table converters could not handle the required number of character values and also because some of these codesets require complex handling (see NOTES). Algorithmic converters are supplied as part of the operating system product; the internal interfaces that they require are not published for external use. 

Table converters, which reside in the /usr/lib/nls/loc/iconvTable directory, can be created by using the genxlt command (see genxlt(1)). They can handle up to 65,536 encoded values. 

Names of codeset converters are in the following form:

from-codeset_to-codeset

For example, the following converter converts values from Super DEC Kanji to Japanese Extended UNIX Code:

sdeckanji_eucJP
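
The naming rule above can be sketched in C as follows. The function name converter_name is hypothetical; it only joins the two codeset names with an underscore and does not check that such a converter is installed.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: build "from-codeset_to-codeset" in `buf`.
 * Returns 0 on success, or -1 if the name does not fit in `bufsz` bytes. */
int converter_name(const char *from, const char *to, char *buf, size_t bufsz)
{
    int n = snprintf(buf, bufsz, "%s_%s", from, to);
    if (n < 0 || (size_t)n >= bufsz)
        return -1;   /* encoding error or truncated output */
    return 0;
}
```

Called with "sdeckanji" and "eucJP", the helper produces the converter name shown above.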

ENVIRONMENT VARIABLES

Some codeset converters require more complex algorithms than can be provided through tables. The following environment variables provide control over conversion behavior for different kinds of codeset converters:

ICONV_ACTION
Controls how one-to-many value mappings are handled when converting Simplified Chinese to Traditional Chinese (except for Traditional Chinese encoded in Telecode). The valid settings for this environment variable are as follows:

batch
Specifies that the preferred mapping value (the first one in the one-to-many mapping list) is always taken. The batch setting is the ICONV_ACTION default. 

conv_all
Specifies that all the possible values are printed to the standard output, enclosed by braces ({ }), so that the user can later manually edit the converted file and select the one to use. 

conv_all_nosym
Specifies that all the possible values are printed to the standard output except for punctuation symbols, for which only the preferred mapping value is printed. As with conv_all, the conv_all_nosym setting prints value choices enclosed by braces so that the converted file can later be edited.

ICONV_BYTEORDER
Sets byte ordering for UCS-2 or UCS-4 converters only. Valid values are little-endian (the default) or big-endian. Setting this environment variable may be necessary when producing UCS-2 or UCS-4 output that will be processed by codeset converters on platforms other than DIGITAL UNIX. 

ICONV_DEFSTR[_from-codeset_to-codeset]
Defines the default string to be substituted in output for those characters that cannot be converted from the source codeset to the destination codeset. This environment variable affects all converters except those that convert from one Japanese codeset to another or from a Korean codeset to another codeset. The variable value can be an arbitrary string or a code number. If the value is a code number (for example, 10, 07, 0x10, or U+1234), the corresponding character in the output codeset, or to-codeset, is printed. Code numbers are the only valid values for this variable when the output is in UCS-2, UCS-4 or UTF-8 format; for converters handling these formats, any string value other than a null string is ignored. 

For a given type of codeset conversion, a matching ICONV_DEFSTR_from-codeset_to-codeset variable has precedence over the ICONV_DEFSTR variable without the from-codeset_to-codeset suffix.  When defining the variable with the suffix, replace from-codeset_to-codeset with the name of the codeset converter to which the variable applies. The ICONV_DEFSTR variable (defined without the suffix) is used by a converter when no ICONV_DEFSTR_from-codeset_to-codeset variable has been defined specifically for the type of conversion being done.

If these variables are not defined or are set to the null string, the characters that cannot be converted are skipped and have no representation in converted output. 
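
The precedence rule for ICONV_DEFSTR can be sketched in C as shown below. This is an illustration only, not the converters' actual code, which is not published. The function name lookup_defstr is hypothetical, and the sketch assumes that a converter-specific variable that is defined but set to the null string still takes precedence, so that unconvertible characters are skipped rather than replaced by the general ICONV_DEFSTR value.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: return the substitution string that applies to the
 * converter named `converter` (in from-codeset_to-codeset form), or NULL
 * when unconvertible characters should be skipped. */
const char *lookup_defstr(const char *converter)
{
    char name[256];
    const char *value = NULL;

    /* 1. Look for the converter-specific variable first. */
    int n = snprintf(name, sizeof name, "ICONV_DEFSTR_%s", converter);
    if (n > 0 && (size_t)n < sizeof name)
        value = getenv(name);

    /* 2. Fall back to the general variable only if the specific one
     *    is not defined at all (assumption, see the text above). */
    if (value == NULL)
        value = getenv("ICONV_DEFSTR");

    /* 3. Undefined or null string: characters are skipped. */
    if (value == NULL || value[0] == '\0')
        return NULL;
    return value;
}
```

The returned string is either an arbitrary replacement string or a code number such as 0x10 or U+1234; interpreting a code number in the output codeset is left to the caller in this sketch.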

ICONV_NOBOM
Disables generation of the byte-order mark at the beginning of UCS-2 or UCS-4 output. A valid setting is any value other than a null string. By default, or if this variable is set to a null string, the byte-order mark is generated at the beginning of UCS-2 or UCS-4 output.

Codeset converters that process UCS-2 or UCS-4 data on platforms other than DIGITAL UNIX usually require the byte-order mark. Therefore, the current default behavior of DIGITAL UNIX codeset converters produces output that is more likely to be supported as input to codeset converters on other platforms.  Use the ICONV_NOBOM variable only if you need backward compatibility with output produced by codeset converters that were included in versions of DIGITAL UNIX prior to DIGITAL UNIX Version 4.0D. 
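
To show what the byte-order mark looks like in practice, here is a minimal C sketch for UCS-2 data. It relies only on the fact that the byte-order mark is the character U+FEFF, so its two bytes appear as FE FF in big-endian order and FF FE in little-endian order. The names byte_order, put_ucs2_bom, and detect_ucs2_order are hypothetical, and UCS-4 data, which uses a four-byte mark, is outside the scope of this sketch.

```c
#include <stddef.h>

enum byte_order { ORDER_UNKNOWN = -1, ORDER_LITTLE = 0, ORDER_BIG = 1 };

/* Write the UCS-2 byte-order mark (U+FEFF) into out[0..1] in the given
 * byte order. Returns the number of bytes written (2), or 0 if `order`
 * is not a concrete byte order. */
size_t put_ucs2_bom(enum byte_order order, unsigned char *out)
{
    if (order == ORDER_BIG) {
        out[0] = 0xFE;
        out[1] = 0xFF;
    } else if (order == ORDER_LITTLE) {
        out[0] = 0xFF;
        out[1] = 0xFE;
    } else {
        return 0;
    }
    return 2;
}

/* Inspect the first two bytes of a UCS-2 buffer and report the byte
 * order announced by its byte-order mark, or ORDER_UNKNOWN if the
 * buffer is too short or does not start with a mark. */
enum byte_order detect_ucs2_order(const unsigned char *buf, size_t len)
{
    if (len >= 2) {
        if (buf[0] == 0xFE && buf[1] == 0xFF)
            return ORDER_BIG;
        if (buf[0] == 0xFF && buf[1] == 0xFE)
            return ORDER_LITTLE;
    }
    return ORDER_UNKNOWN;
}
```

A consumer that receives output produced with ICONV_NOBOM set would get ORDER_UNKNOWN from a check like this and would have to learn the byte order by some other means.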

ICONV_PHRCONV
Activates phrase conversion for converters that convert between Chinese codesets (except for Traditional Chinese encoded in Telecode). When phrase conversion is activated, a whole phrase in Traditional Chinese is converted to a different phrase in Simplified Chinese or the reverse.

If ICONV_PHRCONV is set to mark, the converted phrases are bracketed by [ and ] to highlight the conversion result for visual checking.

The phrase conversion databases in the /usr/share/phrdb directory are normal text files with the same file names as those of the algorithmic converters in /usr/lib/nls/loc/iconv/*.  These phrase conversion databases contain entries for phrase conversion pairs.

FILES

/usr/lib/nls/loc/iconv/*
Algorithmic converters

/usr/lib/nls/loc/iconvTable/*
Table converters

/usr/share/phrdb/*
Phrase conversion databases

SEE ALSO

Commands: genxlt(1), iconv(1), phrase(1)

Functions: iconv(3), iconv_close(3), iconv_open(3)

Others: i18n_intro(5), l10n_intro(5)
