CJK Unified Ideograph: Jargon Notes

On “CJK Unified Ideograph”: an apology

In Unicode/ISO parlance, certain blocks of 漢 Hàn characters are called “CJK Unified <a href='#NOT' title='Info-tech Jargon Alert!'>Ideographs</a>”. CJK (a trademark of the <a href="http://www.rlg.org/">RLG</a>) stands for “Chinese, Japanese, and Korean”, and is sometimes extended to <a href='http://www.oreilly.com/catalog/cjkvinfo/' title='Ken Lunde’s CJKV Information Processing'>CJKV</a> “Chinese, Japanese, Korean and Vietnamese” (and it could be extended further, to include all <a href='http://appsrv.cse.cuhk.edu.hk/~irg/' title='Ideographic Rapporteur Group'>IRG</a> <a href='http://unicode.org/~rscook/html/CJKU_SR_stats_20100404.html' title='statistics on Unicode CJK repertory (including forthcoming Extension D)'>contributors</a>). Scripts in all of these locales make use of CJKV (Chinese-derived) characters. These characters are “Chinese-derived” in that the principles for character creation originated in China (more than 3,000 years ago). These characters are sometimes also termed 漢 (“<a title='漢 Hàn ‘Chinese’ [in Modern Standard (Beijing) Chinese the pronunciation of the “a” in pinyin “Hàn” is between that of English “man” and “father” (but more like the latter than the former); the word 漢字 Hànzì ‘Chinese character(s)’ in Modern Standard Chinese is read Kanji in Japanese'>Hàn</a>” as in the name of Unicode’s Hàn database [a.k.a. <a href='http://www.unicode.org/reports/tr38/'>UniHan</a>]), reflecting the legacy of the influential 東漢 Dōng Hàn ‘Eastern Hàn’ Dynasty script analyses (<a href='#EHC' title='Shuō Wén Jiě Zì, The Eastern Hàn Chinese Grammaticon'>《說文解字》</a>, c. 121 AD). These characters are “Unified” in that (many though not all) locale-specific differences in character forms (stylistic conventions, typeface expectations) have been ignored (as non-distinctive) for encoding purposes. 
Of course, there are characters in all locales which are unique to those locales, and so unification also involves superset <a href='http://appsrv.cse.cuhk.edu.hk/~irg/irg/irg30/IRGN1468IVS_Recommendation.pdf' title='rules governing CJK unification first set forth in Annex S of ISO/IEC 10646 are currently being revised, with special focus on use of CDL & Variation Selectors, as in this IRG document by Cook & Lunde'>definition</a>.

Like Hàn, the term ideograph (sometimes also [mis-]written “ideogram”) is today used in information-technology (info-tech) circles to signify ‘the uniquely CJKV script entity’, which is to say, “CJKV ideographs” constitute a certain subset of the “characters” to be found in Asian texts. Japanese Kana (elements of Hiragana and Katakana syllabaries) are also “characters” in Japanese, but are not termed “ideographs” (though they <a href="http://linguistics.berkeley.edu/~rscook/pdf/HanKana.fp3.pdf" title='see a chart of kana derivations' >derive</a> from Chinese-derived “ideographs”). The English term “Han” has some advantages over “ideograph”, but reliance on a specific pronunciation of the character “漢” (which varies by locale) presents its own challenges to general acceptance as a cover term. Why English “Han” and not “Kan”? Let’s just chalk this up to emphasis on the Chinese-derived principles governing character formation, and deference to Modern Standard Chinese (北京話 Běijīnghuà = 官話 Guānhuà = 普通話 Pǔtōnghuà ‘Mandarin’) pronunciation of the character 漢 Hàn (as in <a title='Dōng Hàn ‘Eastern Hàn Dynasty’'>東漢</a>).
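
The info-tech sense of “ideograph” shows up directly in Unicode character names. The following is a minimal sketch using Python's standard `unicodedata` module; the helper name `is_unified_ideograph` is an illustrative assumption, and a name-prefix test is only a rough stand-in for consulting the actual block and property data.

```python
import unicodedata

def is_unified_ideograph(ch: str) -> bool:
    """True when the Unicode character name labels ch a CJK Unified Ideograph."""
    # The default "" avoids a ValueError for unnamed code points.
    return unicodedata.name(ch, "").startswith("CJK UNIFIED IDEOGRAPH")

# Two Han characters, one Hiragana, one Katakana, one Latin letter.
samples = ["漢", "字", "か", "カ", "A"]
for ch in samples:
    print(ch, unicodedata.name(ch), is_unified_ideograph(ch))
```

Kana and Latin letters are “characters” with names of their own, but only the Han characters carry the “ideograph” label.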

Though perhaps more politically acceptable than Han as a cover term for CJK characters, the term ideograph is nevertheless something of a misnomer, if “ideograph” primarily means ‘completely pure idea writing, not conveying the sounds of words, but conveying only ideas independent of specific verbal communication’. As originally applied to Chinese writing (by early missionaries), such usage may reflect misunderstanding, a sense that the writing did not convey sound values at all. Of course, Chinese characters are used to write spoken language, not simply wordless ideas or ideas unconnected with specific spoken language. However, if early sinologists were not confused about the meaning of “ideogram”, then they were guilty of hyperbole, and specifically chose to extend the meaning to include also relatively imprecise writing of specific speech sounds. Perhaps the impression was that the degree of phonological imprecision was such as to present almost a total lack of phonological information. Certainly, a first glance at phonological variation (<a title='through time, historical'>diachronic</a> and <a title='at a specific time'>synchronic</a>) in character readings within and across CJKV locales can only aggravate the opinion that if the writing is not completely <a title='writing not conveying speech-sound values; see below'>aphonographic</a>, then it is either extremely complex or completely and utterly chaotic. And lacking readily apparent systematic phonological information, what remains but semantics or the basic idea to be conveyed? Indeed, for casual purposes the ideas conveyed by CJK characters are often intelligible across locales, though the spoken languages are not themselves mutually intelligible. Certain pronunciation details may be irrecoverable from even a close phonetic transcription, but CJK writing is an especially lossy means of phonological information storage.

Although info-tech usage of the term ideograph may be an unfortunate neologism derivative of an original misnomer, it seems to have arisen as an informed compromise in a specific usage context, in clear lack of a more precise preferable pre-existing English word. Terms such as logograph (‘one-to-one relation between sign and “word” [itself not terribly well-defined]’; not all CJK syllables are words, and CJK words are not always monosyllabic) or morphosyllabograph (a specialist Greco- mouthful!) might have been similarly imprecise and clumsy, and might have been preferable for not perpetuating a misconception of “pure <a href='http://www.amazon.com/dp/0824826566/' title='Book alert! Ideogram: Chinese Characters and the Myth of Disembodied Meaning'>disembodied</a> idea writing”, but might promote some other misconceptions.

So, the term ideograph in modern info-tech usage might best be understood (or rationalized ex post facto, along with early sinological usage) as indicative of a difference of degree: that the phonological information conveyed is somewhat limited relative to more fully phonographic scripts (those using alphabets and isographic syllabaries to convey specific sound values).

“A syllabary being a system for writing the elements of the <a title='syllable canon ‘regex-like rule describing possible syllable structure, e.g. <(C)V(C)(T)>; full set of attested syllable types satisfying that rule’; canon ‘rule; set of rules; repertory’'>syllable canon</a> of a language, the syllabograph would be a graphical element of a syllabary. When there is a one-to-one correspondence between syllable type and syllabograph, this is an isographic syllabary. In that it sometimes has <a title='often semantically differentiating, though sometimes non-distinctive'>multiple representations</a> of a given syllable type, the Chinese writing system might be termed an imperfect or heterographic syllabary. Chinese characters, the elements of a heterographic syllabary, might be termed heterographic syllabographs, or heterosyllabographs. No matter what they are called, there is clearly some degree of imprecision in the Chinese script, in terms of its ability to convey specific sound values.” [Cook, <a href='http://linguistics.berkeley.edu/~rscook/html/writing.html#EHC' title='The Eastern Hàn Chinese Grammaticon' >2003</a>:195]
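
The quoted definitions can be sketched in code, purely as an illustration. The segment inventories below are toy assumptions (not the phonology of any real language), and the function names are hypothetical; the Chinese sample pairs use readings cited elsewhere in these notes.

```python
import re
from collections import defaultdict

# Toy inventories: assumptions for illustration only.
CONSONANTS = "ptkmnsl"
VOWELS = "aiu"
TONES = "123"

# Regex rendering of the canon <(C)V(C)(T)>: every slot except V is optional.
CANON = re.compile(rf"[{CONSONANTS}]?[{VOWELS}][{CONSONANTS}]?[{TONES}]?")

def fits_canon(syllable: str) -> bool:
    """True when the whole string satisfies the toy syllable canon."""
    return CANON.fullmatch(syllable) is not None

def is_isographic(pairs) -> bool:
    """One-to-one: each syllable type has one graph, and each graph one syllable type."""
    graphs_by_syllable = defaultdict(set)
    syllables_by_graph = defaultdict(set)
    for graph, syllable in pairs:
        graphs_by_syllable[syllable].add(graph)
        syllables_by_graph[graph].add(syllable)
    return (all(len(g) == 1 for g in graphs_by_syllable.values())
            and all(len(s) == 1 for s in syllables_by_graph.values()))

kana_pairs = [("か", "ka"), ("き", "ki")]
han_pairs = [("階", "jiē"), ("諧", "xié"), ("偕", "xié")]

print(fits_canon("kan1"), fits_canon("kran"))
print(is_isographic(kana_pairs), is_isographic(han_pairs))
```

The Han sample fails the one-to-one test because 諧 and 偕 both write the syllable type xié, which is the sense of “heterographic” in the passage above.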

Many Asian languages (and CJK languages in particular) are termed monosyllabic. Of Chinese languages (or dialects) in particular, this means that the syllable (phoneme cluster with tone-bearing vocalic nucleus) bears much functional weight, and is traditionally extremely well-defined, both phonologically (phonemically) and morphologically (morphemically), presumably in natural speech as in orthography and lexicographic descriptions. Most syllables are associated with (one or more) distinct units of meaning (morpheme + syllable = morphosyllable) apparent in and productively used in formation of polysyllabic words. If a syllable has multiple clearly distinct meanings, each of these would “ideally” be written with a distinct character, though actual orthographic practices reflect multiple layers of subjective and irregular development. The syllable has status similar to that of the phoneme in other languages, as evident in meaning-driven “transcription” of foreign words (nativizing morpho-analytic re-syllabification). Characters have been associated with single syllables for a long time, presumably from the beginning of Chinese writing, and this implies that the language before writing was also monosyllabic. It might, however, be more accurately termed “<a title='‘comprised of one and a half syllables’; term coined by Matisoff'>sesquisyllabic</a>”, since historical studies suggest that syllable boundaries are rather fluid, and that prominent syllabic nuclei may assimilate adjacent relatively unstressed (or destressed) elements over time. Though the earliest specific evidence dates from no earlier than the first 反切 fǎnqiè ‘sound glosses’ (perhaps in the 7th c. AD), the traditional opinion is that the character-to-syllable connection goes back to the earliest writing (this is reflected in traditional monosyllabic reconstructions of Old Chinese phonology).

Syllables have long had reality for native speakers/writers as functional nuggets of meaning+sound, as is evident e.g. in ancient Chinese character lexicons organized into (homophonic) syllable classes, with both sound and meaning glosses. Of the six major traditional Chinese character types (六書:象形,指事,會意,形聲,轉注,假借), by far the most common is the so-called 形聲 xíngshēng ‘sematophonic’ compound, combining one semantic determiner with one phonographic component. Thus, homophonic (<a title='having different meanings'>heteromorphic</a>, <a title='having different meanings; in the continuum of synonym, paronym, heteronym'>heteronymous</a>) characters may be written with the same phonographic component, but are semantically differentiated by means of a (non-shared) semantic determiner. But the phonographic component does not always give a very consistent indication of the pronunciation, due to local variation and historical changes. For example, the character 皆 (MC /kei/, c. 1000 AD) is today commonly pronounced jiē, but when it is used as a component the resulting compound is not always pronounced jiē (e.g. 階 jiē, 諧 xié, 偕 xié, 揩 kāi, 楷 kǎi). Such variation is sometimes regular, but at other times it seems unpredictable in the light of available historical evidence. The phonographic component is sometimes said also to have a (non-phonographic) semantic function, and such characters (simultaneously 形聲 xíngshēng ‘sematophonic’ and “會意” huìyì ‘semantic complexes’) are termed “亦聲” yìshēng [lit.] ‘also phonetic’. Thus, even in phonographic writings the phonographic component is sometimes not entirely devoid of (non-phonologic) semantic value, and serves in combination with the determiner to specify the meaning of the syllable. Morphosyllabographs are fraught with meaning, but often only a small part of that meaning is clearly phonological, and sometimes none of it is at all.
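
To make the inconsistency of the 皆 phonetic series concrete, here is a minimal sketch that tabulates the five compounds and readings cited above. The variable and function names are illustrative assumptions, and the tone-stripping helper is only adequate for these particular readings.

```python
import unicodedata
from collections import defaultdict

PHONETIC_READING = "jiē"  # modern reading of 皆, as given in the text

# Compounds containing 皆 as a component, with the readings cited in the text.
SERIES = [("階", "jiē"), ("諧", "xié"), ("偕", "xié"), ("揩", "kāi"), ("楷", "kǎi")]

def nfc(text: str) -> str:
    """Normalize so precomposed and decomposed tone marks compare equal."""
    return unicodedata.normalize("NFC", text)

def strip_tone(pinyin: str) -> str:
    """Drop tone diacritics by decomposing and discarding combining marks.

    Caveat: this would also strip the diaeresis of ü, which does not occur here.
    """
    decomposed = unicodedata.normalize("NFD", pinyin)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

# Compounds whose reading matches the phonetic component exactly (tone included).
exact_matches = [ch for ch, reading in SERIES if nfc(reading) == nfc(PHONETIC_READING)]

# Compounds grouped by toneless syllable.
by_syllable = defaultdict(list)
for ch, reading in SERIES:
    by_syllable[strip_tone(reading)].append(ch)

print(exact_matches)
print(dict(by_syllable))
```

Only one of the five compounds shares the reading of its phonographic component; the rest fall into two other toneless syllable groups.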

<a name='NOT'></a>At any rate, even if the issues surrounding Chinese character typologies are complex, and usages of the term ideograph are imprecise, contradictory, confused and/or confusing, the term ideograph is most certainly not used in info-tech to indicate “pure idea writing” (bypassing graphic representation of speech), nor is it used to indicate any of the traditional classes of Chinese-derived characters such as 象形 xiàngxíng ‘pictographic’ characters, nor the very small traditional set of so-called 指事 zhǐshì ‘indicative of the deed’ characters. All Chinese-derived characters indicate syllabic speech units, and though the spoken languages of their speakers may not be mutually intelligible, the writings sometimes are.

A word on the character of the word “character”

In naming CDL, we use “character” with (one of) its common English meaning(s), intentionally avoiding the uncommonly understood (or commonly misunderstood) information-technology terms “ideograph” and “glyph”. (No, info-tech glyph does not mean we’re suddenly talking about Mayan here, as Pulleyblank once complained in the early 1990s.) Arguably, there are some other good reasons not to call CDL a “CJK Ideographic Glyph Description Language”.

  1. In terms of a “character” vs. “glyph” distinction: CDL descriptions lie somewhere between abstract character (script entity class) and concrete glyph (instantiated member of a script entity class). In info-tech-speak: character is to glyph as class [set] is to class-member [element]. The underlying stroke-based CDL description is rather abstract in that it specifies only a skeleton trajectory to be fleshed out by the CDL interpreter (rather than a complete outline). Interpreted and rasterized, the abstract CDL character becomes concrete CDL glyph.
  2. In terms of an “ideograph” vs. “character” (e.g. Kanji vs. Kana) distinction: CDL is truly a “character description language” in that its basic principles are applicable to any script entity in any script (not simply so-called “ideographs” in a CJK script). Wenlin’s own stroked Latin font, for example, is implemented using CDL technology: base characters and diacritics are composed of curved and straight segments; these basic components may then be combined to describe precomposed Latin script entities. The difference is that where the set of Sòng stroke types is really the set of key distinctive features for CJK (due to the importance of Sòng calligraphic standards in modern orthography and standardization), for Latin letters there is not really the same emphasis on a particular style or a corresponding set of stroke types. Although we can have a set of graphical primitives for Latin, these do not necessarily correspond to the way that people actually write and index Latin letters.
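
The “character is to glyph as class is to class-member” analogy in point 1 can be sketched as follows. This is not the CDL format or the CDL interpreter: the class names, the stroke-type labels, and the unit-square coordinates are all hypothetical, and the `instantiate` method merely scales a skeleton to a pixel grid rather than fleshing out and rasterizing an outline.

```python
from dataclasses import dataclass

Point = tuple[float, float]

@dataclass(frozen=True)
class Stroke:
    kind: str                # hypothetical stroke-type label
    path: tuple[Point, ...]  # skeleton trajectory in a unit square

@dataclass(frozen=True)
class Glyph:
    """Concrete instance: the skeleton placed on a pixel grid of a given size."""
    name: str
    size: int
    paths: tuple[tuple[tuple[int, int], ...], ...]

@dataclass(frozen=True)
class CharacterDescription:
    """Abstract description: stroke skeletons only, no size and no outline."""
    name: str
    strokes: tuple[Stroke, ...]

    def instantiate(self, size: int) -> Glyph:
        # Scale each skeleton point from the unit square to integer pixels.
        scaled = tuple(
            tuple((round(x * size), round(y * size)) for x, y in stroke.path)
            for stroke in self.strokes
        )
        return Glyph(self.name, size, scaled)

# 十 described as one horizontal and one vertical skeleton stroke.
ten = CharacterDescription(
    name="十",
    strokes=(
        Stroke("horizontal", ((0.1, 0.5), (0.9, 0.5))),
        Stroke("vertical", ((0.5, 0.1), (0.5, 0.9))),
    ),
)

small = ten.instantiate(10)
large = ten.instantiate(100)
print(small.paths)
print(large.paths)
```

One abstract description yields many concrete glyphs, which differ in size but remain members of the same script entity class.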

Finally, if you are not bothered by the jargon and prefer to think of CDL as a C(IG)DL “CJK (Ideographic Glyph) Description Language”, please feel free to do so.