COLLATE8(4) — HP-UX

NAME

collate8 − collating sequence table for languages with 8-bit character sets

DESCRIPTION

There are four language-dependent collation algorithms for European languages:

Two_to_one conversions: Some languages, such as Spanish, require two adjacent characters to occupy one position in the collating sequence. Examples are “CH” (which follows “C”) and “LL” (which follows “L”).

One_to_two conversions: Some languages, such as German, require one character (e.g. “sharp S”) to occupy two adjacent positions in the collating sequence.

Don’t care characters: Some languages designate certain characters to be ignored in character comparisons. For example, if “−” is a “don’t care” character, then the strings “REACT” and “RE−ACT” compare as equal.

Case and accent priority: Many languages require a “two pass” collating algorithm. In pass one, the accents are stripped off the letters and the resulting two strings are compared; if they are equal, a second pass with the accents back in place is performed to break the tie. The case of letters may also be used in this fashion.

This table has four sections: a file header, a sequence table, a two_to_one mapping table, and a one_to_two mapping table.

The file header has the following format:

struct header {
	short int	table_len;	/* Table length */
	short int	lang_id;	/* Language id number */
	short int	reserved1;	/* Reserved */
	short int	seq_tab;	/* Address of sequence table */
	short int	seq_len;	/* Length of sequence table */
	short int	two_to_one;	/* Address of two_to_one table */
	short int	two_to_one_len;	/* Length of two_to_one table */
	short int	one_to_two;	/* Address of one_to_two table */
	short int	one_to_two_len;	/* Length of one_to_two table */
	char	low_char;	/* Lowest character */
	char	high_char;	/* Highest character */
};
Sequence Table

Entries in the sequence table have the following format:

struct seq_ent {
	unsigned char	seq_no;	/* Sequence number */
	unsigned char	type_info;	/* Character type */
};
The byte value of a given character is used as an index into the sequence table. The first two bits of type_info record the character type. A value of zero means the character is a one_to_one character, and the other six bits of type_info contain its priority. A value of one or two means that type_info contains an index into the two_to_one or the one_to_two mapping table, respectively. A value of zero in seq_no means the character is a “don’t care” character.

Mapping Table for two_to_one Mapped Characters

Entries in the two_to_one table have the following format:

struct two_to_one {
	char	reserved1;	/* Reserved */
	char	legal_char;	/* Legal character */
	struct seq_ent	seq2;	/* Sequence entry for this pair */
};

“Legal” two_to_one characters are listed for each particular character. “Legal” means that the combination of the two characters is treated as a single character. If a match is found, the corresponding sequence entry is used for the pair. Whenever a legal successor is not found in the table, the character is treated according to one_to_one mapping, and the priority in the last entry, combined with the sequence number of the character, creates the sequence entry.

Mapping Table for one_to_two Mapped Characters

Entries in the one_to_two mapping table have the same format as entries in the sequence table. The sequence number of the first character is known from the entry in the sequence table. The sequence number of the second character is found in the one_to_two mapping entry, and the priority is used for both characters.

WARNING

This file is provided for historical reasons only. The recommended interface for native language support collation is the routines nl_strcmp and nl_strncmp (see string(3C)).

AUTHOR

Collate8 was developed by the Hewlett-Packard Company.

INTERNATIONAL SUPPORT

8-bit data.

Hewlett-Packard Company — Version B.1, May 11, 2021
