Category: Developer and Licensee Technical Support Documentation - PFS-102
Title: Character Name Designation Methodology in Production First Software Fonts
Date: November 11, 1994
Company: Production First Software
Address: P.O. Box 31528; San Francisco, CA 94131-0528; (415)-431-FONT(-3668)

1.0 Introduction
Character and character glyph names are important for at least two reasons:
1) They provide human-readable data tags which give more immediate and direct human access to data, and which are also useful in documenting data.
2) They are required in the architecture and management of PostScript and PostScript fonts. Names are used in encoding, composition and decomposition (for composite characters), data sorting, and other tasks.
There are many sources from which to derive a complete set of character names, or a procedure for deriving them:
1) Pre-existing sets of character names which have been in wide use since PostScript was first introduced ("legacy names");
2) ISO standards relating to character descriptions or names (e.g. SGML standard names in ISO 8879, or ISO-originated names in ISO 10036);
3) Algorithmically-derived character names;
4) Names based on a catalog or registry (e.g. IBM standard names, names derived from the AFII glyph register, names derived from an all-encompassing character set or encoding like ISO 10646);
5) Names derived from a meaningful sequence of numbers (e.g. from 0 to 16N+1); and
6) Possibly other ways, including a hierarchical combination of the above.
This report describes the principles leading to the convention used to name characters and character glyphs for Production First Software fonts created from 1990 to the present, and the reasons for using it.
It should be noted that the ISO 8879 SGML character names are not consistent with the goals and principles described in this report.
Also, the character description names listed in the ISO 10036 standard (many of which are several words long) are not suitable as character names because of their length, descriptive inconsistency, algorithmic irregularity, and inconsistency with legacy names.
Note: at the time this report was prepared, Unicode 1.1 was being introduced. This was the first version of Unicode conformant with ISO 10646; previous versions were not conformant. Whether future versions will remain conformant is an open question, because the time required to adopt amendments or changes to Unicode is much shorter than the time required for the corresponding process at ISO.
It should be emphasized that the naming rules and principles described in this report were designed after careful consideration of the ramifications and needs of nationalism, counterbalanced by the fact that modern computer technology originated and evolved in the English-speaking world. This is particularly sensitive in the case of character naming for non-Latin (non-Roman) scripts, since an alphabet letterform name may refer to one character glyph shape (and character) in one country but to a different glyph shape and character in another country. While not present in the Latin script, this problem does arise in Arabic, Cyrillic, Greek, and other scripts. This sensitivity may also have influenced the evolution of multibyte encoding standards like Unicode and ISO 10646.
Some developers have proposed that AFII (Association for Font Information Interchange) Glyph Register signature numbers be used to identify glyphs where a numerically-derived name is desired. This would be implemented as 'AFII_ _ _ _ _', where each of the five blanks is a decimal digit (0 through 9). Production First Software does not use AFII-based descriptors because:
1) The names cannot span the 32-bit code space inherent in ISO 10646 (over 4 billion numbers) using 5 decimal digits.
2) The names are too long, especially if AFII names are used for a base character and one or more diacritical marks constituting a composite character which is not already in the Glyph Register.
3) Cases could arise where different character names refer to the same character glyph. One such case is where a new composite character name uses AFII names for the glyph components, and at a later time that new composite is registered so that it can be referred to by a single AFII name. Another case is where consistent English names can be constructed or assigned, but an AFII name could also be used.
4) All variants (due to script, style within the same script, combining characters, non-combining characters, true alternates, set position, or scale) are assignable to the AFII Glyph Register. The register is therefore like a minestrone soup, some ingredients of which represent style variations (which should logically be classified as font variants) and some of which do not. What agency arbitrates whether an AFII-derived name, or which AFII-derived name if more than one is possible, is the "official" name for a character or glyph?
5) AFII-derived names are not immediately human-understandable, and this could become a very significant disadvantage.

2.0 Character and Character Glyph Name Requirements
Note: both characters and glyphs can have assignable names. Production First Software glyph names are, most of the time, a subset of the set of character names. Exceptions exist for Arabic and Indic scripts, where some glyph names do not necessarily match the corresponding character names because of the complex contextual relationships of those scripts. However, for the sake of brevity, the term "character name" will also be understood to apply to "glyph name", except for specific cases which will be identified when they arise.
The naming convention used for Production First Software fonts is a hierarchical collection of naming rules. The following major principles apply:
I. Names must be constructed for certain specific classes of characters:
1) composite characters (base character + ... + base character + diacritical mark + ... + diacritical mark);
2) characters composed from combinations of other base characters (ligatures, tied letters, N-graphs);
3) characters with physical descriptors as part of their name (connoting some aspect of shape, direction, or placement);
4) characters with descriptors connoting alternate styles of functionally identical characters.
II. Composite character names must be created from existing base character names and diacritical marks, with ISOLatin-1 and legacy names given preference. A hierarchical stacking order is generally invoked in constructing composite character names.
III. Names must be parseable if possible, so that descriptive information can be extracted for categorization, character substitution, and post-creation complex or composite synthetic on-the-fly construction.
IV. Except for ISOLatin-1 and legacy names, a base character name particle precedes the adjectives and adverbs which describe it in the formation of a name, so that similar names sort together.
V. Complicated composite names must possess a commutative property for the glyph names which are used to derive the full name, or which are to be usable in synthetic on-the-fly construction.
VI. Qualifier descriptors are defined so as to be backward compatible and consistent with as many widely-used ISOLatin-1 and legacy character names as possible.
VII. Cultural, geographic, or linguistic name variants or name components, and transliterated or Latin-phoneticized non-Latin name components, are to be avoided unless they have been in common typographic use (like the letterforms of the Greek alphabet), with two exceptions. (Note 1, Note 2)
VIII. Subject to VII, non-numerically derived alphabetic name components must be in English.
IX. Names used for diacritical marks are to be spelled out, transliterated into English, as they are named in the native language in which they were first commonly used in typography, provided that language uses the Latin script. If the language does not use the Latin script, a numerically-derived name is to be used. This was actually the method used to name many legacy-named diacritical marks. For example, the Vietnamese tone mark has the character name 'hoi' because Latin script is used to spell modern Vietnamese, and some of the other tone marks used in Vietnamese are represented by diacritical marks already coded in ISO 10646/Unicode; in Greek, 'dieresis' is used, rather than "dialaktica" spelled in Greek; 'u055A' is used for the Armenian apostrophe; and 'v0196' and 'v0197' are used for the uppercase and lowercase Cyrillic short sound (which are not currently encoded in Unicode).
Diacritical mark characters are peculiar in that some of them are used cross-script. For example, Chuang and Hausa use hybrid alphabets (a mixture of Cyrillic, Greek, and Latin), and a specific diacritical mark might be applied to base character letterforms drawn from those three scripts. Therefore, it makes no sense to encode a clone of a diacritical mark for a specific script unless (like the short sound) it is used only in that script.
Regardless of which naming principle is used, name components are concatenated to construct composite character names in the writing direction of the script of the character being named.
X. Numerically-derived names are to be used for non-Latin alphabetic scripts, Han ideographs, or in cases of last resort. (Note 2)
XI. Names and name components must be characterized in the fewest number of letters so as to minimize file and resource sizes. (Note 3)
XII. Character glyphs which are duplicates except for a placement or behavioral aspect must be consistently named, starting with the name chosen for the glyph not subject to that placement or behavioral aspect.
XIII. General naming of characters and character glyphs has been, and will continue to be, based on names from the following hierarchy:
1) legacy names, if existing
2) names based on the names from ISO 10036, if existing and suitable
3) names based on the American Mathematical Society database, if existing and suitable
4) commonly known names, if existing and suitable
5) names composed by Production First Software
XIV. Ligature names are constructed from component character names in the writing order of the script.

3.0 General Hierarchy of Chosen Name Categories
Starting with the highest and proceeding to the lowest:
1) ISOLatin-1 and legacy names (except with a conflict from non-Latin names).
2) Algorithmically-derived composite character names.
3) Formal symbol names.
4) Miscellaneous descriptive names.
5) Qualifier-modified names.
6) Numerically-derived names.

4.0 Qualifier Descriptors and Descriptor Hierarchy
The "Hierarchical Order" is (highest at the left, reducing towards the right):