COLLATE8(4) — HP-UX
NAME
collate8 − collating sequence table for languages with 8-bit character sets
DESCRIPTION
There are four language dependent collation algorithms for European languages. These algorithms are:

Two_to_one conversions: Some languages such as Spanish require two adjacent characters to occupy one position in the collating sequence. Examples are “CH” (which follows “C”) and “LL” (which follows “L”).

One_to_two conversions: Some languages such as German require one character (e.g. “sharp S”) to occupy two adjacent positions in the collating sequence.

Don’t care characters: Some languages designate certain characters to be ignored in character comparisons. For example, if “−” is a “don’t care” character, then the strings “REACT” and “RE−ACT” would equal each other when compared.

Case and accent priority: Many languages require a “two pass” collating algorithm: in pass one, the accents are stripped off the letters and the resulting two strings are compared; if they are equal, a second pass with the accents back in place is performed to break the tie. The case of letters may also be used in this fashion.

This table has four sections: a file header, a sequence table, a two_to_one mapping table, and a one_to_two mapping table.

The file header has the following format:

    struct header {
        short int table_len;        /* Table length */
        short int lang_id;          /* Language id number */
        short int reserved1;        /* Reserved */
        short int seq_tab;          /* Address of sequence table */
        short int seq_len;          /* Length of sequence table */
        short int two_to_one;       /* Address of two_to_one table */
        short int two_to_one_len;   /* Length of two_to_one table */
        short int one_to_two;       /* Address of one_to_two table */
        short int one_to_two_len;   /* Length of one_to_two table */
        char      low_char;         /* Lowest character */
        char      high_char;        /* Highest character */
    };
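As an illustration only, the header fields could be pulled out of a raw table image as sketched below. This page does not state the on-disk field width, byte order, or padding, so the sketch assumes 16-bit fields stored most significant byte first and no padding (20 bytes in total). The names parse_header, get_be16 and HEADER_BYTES are hypothetical and are not part of any HP-UX interface.

```c
#include <stddef.h>

/* Same fields as the file header described above. */
struct header {
    short int table_len;
    short int lang_id;
    short int reserved1;
    short int seq_tab;
    short int seq_len;
    short int two_to_one;
    short int two_to_one_len;
    short int one_to_two;
    short int one_to_two_len;
    char      low_char;
    char      high_char;
};

/* Assumption: nine 2-byte fields plus two chars, with no padding. */
#define HEADER_BYTES 20

/* Assumption: each 16-bit field is stored most significant byte first. */
static short int get_be16(const unsigned char *p)
{
    int v = (p[0] << 8) | p[1];
    return (short int)(v >= 0x8000 ? v - 0x10000 : v);
}

/* Hypothetical helper: fill *h from the first HEADER_BYTES bytes of buf.
 * Returns 0 on success, -1 if an argument is NULL or buf is too short. */
int parse_header(const unsigned char *buf, size_t len, struct header *h)
{
    if (buf == NULL || h == NULL || len < HEADER_BYTES)
        return -1;
    h->table_len      = get_be16(buf + 0);
    h->lang_id        = get_be16(buf + 2);
    h->reserved1      = get_be16(buf + 4);
    h->seq_tab        = get_be16(buf + 6);
    h->seq_len        = get_be16(buf + 8);
    h->two_to_one     = get_be16(buf + 10);
    h->two_to_one_len = get_be16(buf + 12);
    h->one_to_two     = get_be16(buf + 14);
    h->one_to_two_len = get_be16(buf + 16);
    h->low_char       = (char)buf[18];
    h->high_char      = (char)buf[19];
    return 0;
}
```

Decoding field by field, rather than overlaying the struct on the buffer, keeps the sketch independent of the host's byte order and struct padding. The units of the address and length fields are not given here, so the sketch does not interpret them.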
Sequence Table

Entries in the sequence table have the following format:

    struct seq_ent {
        unsigned char seq_no;       /* Sequence number */
        unsigned char type_info;    /* Character type */
    };
The byte value of a given character is used as an index into the sequence table. The first two bits of type_info record the character type. A value of zero means the character is a one_to_one character, and the other six bits of type_info contain its priority. A value of one or two means that type_info contains an index into the two_to_one or the one_to_two mapping table, respectively. A value of zero in seq_no means the character is a “don’t care” character.

Mapping Table for two_to_one Mapped Characters
Entries in the two_to_one table have the following format:

    struct two_to_one {
        char           reserved1;   /* Reserved */
        char           legal_char;  /* Legal character */
        struct seq_ent seq2;        /* Sequence entry for this pair */
    };

“Legal” two_to_one characters are listed for each particular character. “Legal” means that the combination of the two characters is treated as a single character. If a match is found, the corresponding sequence entry is used for the pair. Whenever a legal successor is not found in the table, the character is treated according to one_to_one mapping, and the priority in the last entry, combined with the sequence number of the character, forms the sequence entry.

Mapping Table for one_to_two Mapped Characters
Entries in the one_to_two mapping table have the same format as entries in the sequence table. The sequence number of the first character is known from the entry in the sequence table. The sequence number of the second character is found in the one_to_two mapping entry, and the priority is used for both characters.
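The rules above for type_info, “don’t care” characters, and one_to_two expansion can be sketched as a two-pass comparison. This is a minimal illustration under stated assumptions, not the routine used by HP-UX: the “first two bits” are taken to be the two most significant bits, the sequence table is taken to hold 256 entries indexed directly by byte value, a lower priority value is taken to sort first, and the two_to_one pair lookup is left out. The names expand_weights, collate_cmp, ent_type, ent_low6 and MAXW are hypothetical.

```c
#include <stddef.h>

struct seq_ent {
    unsigned char seq_no;     /* Sequence number */
    unsigned char type_info;  /* Character type */
};

enum { ONE_TO_ONE = 0, TWO_TO_ONE = 1, ONE_TO_TWO = 2 };

/* Assumption: the "first two bits" are the two most significant bits. */
static unsigned ent_type(struct seq_ent e) { return e.type_info >> 6; }
static unsigned ent_low6(struct seq_ent e) { return e.type_info & 0x3Fu; }

/* Expand s into (seq_no, priority) pairs; the priority is kept in
 * type_info of each output entry. Returns the number of pairs written. */
size_t expand_weights(const unsigned char *s,
                      const struct seq_ent seq[256],
                      const struct seq_ent *one_to_two,
                      struct seq_ent *out, size_t cap)
{
    size_t n = 0;
    for (; *s != '\0' && n < cap; s++) {
        struct seq_ent e = seq[*s];
        unsigned t = ent_type(e);
        if (e.seq_no == 0)
            continue;                       /* "don't care" character */
        if (t == ONE_TO_TWO) {
            /* Second sequence number and the shared priority come
             * from the one_to_two mapping entry. */
            struct seq_ent m = one_to_two[ent_low6(e)];
            unsigned char prio = (unsigned char)ent_low6(m);
            out[n].seq_no = e.seq_no;
            out[n].type_info = prio;
            n++;
            if (n < cap) {
                out[n].seq_no = m.seq_no;
                out[n].type_info = prio;
                n++;
            }
        } else {
            /* one_to_one: the low six bits are the priority.
             * two_to_one characters need the pair lookup described in
             * the text; this sketch omits it and uses priority 0. */
            out[n].seq_no = e.seq_no;
            out[n].type_info =
                (unsigned char)(t == ONE_TO_ONE ? ent_low6(e) : 0);
            n++;
        }
    }
    return n;
}

#define MAXW 256   /* longer inputs are truncated in this sketch */

/* Returns -1, 0 or 1. Pass one compares sequence numbers only;
 * pass two breaks ties on priority (assumption: lower sorts first). */
int collate_cmp(const unsigned char *a, const unsigned char *b,
                const struct seq_ent seq[256],
                const struct seq_ent *one_to_two)
{
    struct seq_ent wa[MAXW], wb[MAXW];
    size_t na = expand_weights(a, seq, one_to_two, wa, MAXW);
    size_t nb = expand_weights(b, seq, one_to_two, wb, MAXW);
    size_t i, n = na < nb ? na : nb;

    for (i = 0; i < n; i++)
        if (wa[i].seq_no != wb[i].seq_no)
            return wa[i].seq_no < wb[i].seq_no ? -1 : 1;
    if (na != nb)
        return na < nb ? -1 : 1;
    for (i = 0; i < n; i++)
        if (wa[i].type_info != wb[i].type_info)
            return wa[i].type_info < wb[i].type_info ? -1 : 1;
    return 0;
}
```

With a table in which the hyphen has seq_no zero, this sketch compares strings that differ only by hyphens as equal, and two strings that differ only in priority (for example, case) are separated in the second pass.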
WARNING
This file is provided for historical reasons only. The recommended interface for native language support collation is the routines nl_strcmp and nl_strncmp (see string(3C)).
AUTHOR
Collate8 was developed by the Hewlett-Packard Company.
SEE ALSO
INTERNATIONAL SUPPORT
8-bit data.
Hewlett-Packard Company — Version B.1, May 11, 2021