Difference between revisions of "Character Description Language"
Revision as of 08:30, 11 April 2017
Appendix G of the Wenlin User’s Guide
This appendix documents features relating to Wenlin Institute’s Character Description Language (CDL), a powerful font and character description technology.
- 1 Wenlin CDL Feature Overview
- 2 Wenlin Stroking Box: Advanced CDL Features
- 3 Wenlin CDL Variants
- 4 Wenlin CDL Clones
- 5 Wenlin CDL Screenshots and Videos
- 6 For Wenlin CDL Developers
文林研究所 ‧ 字形描述语言 (字描语)
Wenlin CDL Feature Overview
◦ For basic information about CDL see Wenlin CDL Technology Overview.
◦ CDL is Wenlin’s Character Description Language, an XML application for rendering and indexing all Han (CJKV) characters.
◦ Wenlin’s CDL font technology is the powerhouse behind numerous commonly used features of Wenlin Software for Learning Chinese, including:
- Stroking Box and Stroking Diagrams
After selecting Song Hanzi (CDL) or Plain Hanzi (CDL) in the Wenlin Font Menu, all Chinese text you see in Wenlin is rendered using CDL font technology. Likewise, after you choose Monospace Pinyin in the Wenlin Font Menu, all Pinyin, English and much other text you see in Wenlin is also rendered using CDL.
◦ Wenlin’s CDL font technology underlies numerous advanced features of Wenlin Software for Learning Chinese, some of which are available when the Advanced CDL Options are enabled, including:
- CDL glyph editing and export functions
- CDL character variant and component analyses
- Advanced CDL indexing features
- Advanced Shuowen character variant analyses
The following image shows some CDL variants of the character 𤁉[U+24049] / 漢[U+6F22] / 汉[U+6C49] Hàn rendered in the CDL Stroking Box.
CDL has always been a part of Wenlin, but the underlying language was invisible until Wenlin version 4.0. Now it is possible for end-users to view and manipulate the CDL description for any character-variant that can be viewed in the Stroking Box.
Wenlin Stroking Box: Advanced CDL Features
Basic features of Wenlin’s Stroking Box are described in Chapter 7. To explore advanced CDL features, users need only choose Advanced Options from the Options menu, and turn on the option labeled Enable advanced CDL (Character Description Language) features. Then, when you are viewing any character in the Stroking Box, there will be a checkbox labeled advanced, and when it is checked, additional buttons (at the lower right of the image below) will be available.
Advanced Stroking Box buttons include:
- CDL: to display the character's description in XML format.
- Points: to show the control points for manipulating the arrangement of strokes and components.
- EPS: to convert the character glyph into Encapsulated PostScript, an outline usable in graphics programs.
- Strokes: to convert the description into one that uses only <stroke> elements, not <comp> elements.
- SVG: to convert the character glyph into Scalable Vector Graphics, an outline usable in web browsers and other programs.
- Scale: to ensure that the coordinates fit the entire grid, when editing.
Each of the above-listed buttons is discussed below. In addition to the buttons just listed, if you are a Wenlin Developer, the Advanced Stroking Box may include any number of unpublished or experimental features documented only in Wenlin Source Code.
The CDL Button
- After pushing the CDL button, the underlying CDL description appears in XML form in a new Editing CDL window (illustration below).
In the above image, there is one top-level cdl element, with char and uni attributes associating this CDL description with a Unicode code point [U+24049]. This CDL description serves as the default representation of [U+24049](V=0): it is variant='0' and so has no explicit CDL variant attribute (more on variants below). Note that an explicit Unicode code point assignment in the top-level cdl element is optional: a CDL description can be associated with zero or more Unicode code point values.
The top-level cdl element in the above illustration also contains a points attribute determining the scale of the description as a whole.
Below the top-level cdl element in the above illustration, there are two indented CDL comp (component) elements. There are no CDL stroke elements at this level of this CDL description.
Each comp element has char, uni and points attributes. The char and uni attributes identify the specific variant form of the component for use in this context. The points attribute determines the scale of the component, here in the default 128x128 CDL grid-space (em-square). (Note: The default grid-space is configurable, but 128x128 has proven adequate for common rendering purposes, and for distinctive-feature analysis. If you need floating-point coordinates in your CDL points attributes, let us know!)
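The structure just described can be sketched in a few lines of Python. This is only an illustration, not Wenlin's own code. The element and attribute names (cdl, comp, char, uni, points, variant) come from the text above, but the XML literal is a hypothetical stand-in for the pictured description. The coordinate values are invented, the left-side component 氵 is an assumption, and the helper name parse_points is made up for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical CDL description. Element and attribute names follow the text;
# the coordinates and the left-side component are assumptions for illustration.
CDL_TEXT = """
<cdl char='𤁉' uni='24049' points='0,0 128,128'>
    <comp char='氵' uni='6C35' points='0,0 40,128'/>
    <comp char='𡏳' uni='213F3' points='44,0 128,128'/>
</cdl>
""".strip()


def parse_points(value):
    """Turn a points attribute such as '0,0 40,128' into [(0, 0), (40, 128)]."""
    return [tuple(int(n) for n in pair.split(",")) for pair in value.split()]


root = ET.fromstring(CDL_TEXT)

# A missing variant attribute is treated as the default variant '0'.
components = [
    (comp.get("uni"), comp.get("variant", "0"), parse_points(comp.get("points")))
    for comp in root.findall("comp")
]

for uni, variant, box in components:
    print(f"U+{uni} variant={variant} box={box}")
```

Each tuple in components holds the code point, the variant (defaulting to '0' when the attribute is absent, as described above) and the upper-left and lower-right corners of the component's box.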
The Points Button
- After pushing the Points button, the underlying CDL description also appears in a separate window (bottom of the illustration below) if it was not displayed already (but the Stroking Box remains in the foreground).
In the Stroking Box itself the control points for positioning the components appear at the upper-left and lower-right corner of each component.
Dragging any control point of a component will change the proportions of that component, and update the points attribute value accordingly, in the Editing CDL window.
The Strokes Button
- After pushing the Strokes button, the underlying CDL description appearing in the separate window has been converted to Stroke-Level CDL: it now consists of stroke elements only, with components inserted as XML comments recursively at each depth (indentation level). This is an extremely powerful feature for advanced CDL editing: new CDL descriptions with custom components can be easily created by mingling and tweaking various elements of pre-existing CDL descriptions.
If the resulting CDL description is re-loaded into the Stroking Box (by pushing the ▷cdl button) and the Points button is pushed again, the control points for positioning the individual strokes appear in the Stroking Box.
Dragging any control point of a stroke will change the features of that stroke instance.
After pushing the Strokes button, if the resulting stroke-level CDL description (with comp elements interspersed as XML comments) is again re-loaded into the Stroking Box (by pushing the ▷cdl button), pushing the Strokes button again will strip the XML comments, leaving only stroke elements.
The image above shows Stroke-Level CDL, with one stroke element for each of the 17 strokes of [U+24049](V=0). This form of CDL is considerably more compact than the version with XML comments interspersed. It is also completely self-contained and relatively portable: all that is needed to render it is the CDL Engine; the CDL Database is not required. (Various attributes of stroke, cdl and comp elements appearing in the above illustrations are as yet undiscussed. These are introduced in the CDL Specification.)
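The idea behind Stroke-Level CDL, recursively replacing each component by its own strokes, can be sketched as follows. This is a minimal sketch under stated assumptions, not the CDL Engine's algorithm. It assumes each description is drawn in its own 0..128 grid and mapped linearly into the box given by the comp's points. The tiny database, its keys "A" and "B", and the tuple layout are hypothetical.

```python
GRID = 128  # default grid size named in the text

# Hypothetical mini-database. Each description is a list of elements:
#   ("stroke", stroke_type, [points])  or  ("comp", key, (x0, y0, x1, y1))
DATABASE = {
    "A": [("stroke", "h", [(0, 64), (128, 64)]),
          ("stroke", "s", [(64, 0), (64, 128)])],
    "B": [("comp", "A", (0, 0, 64, 128)),
          ("comp", "A", (64, 0, 128, 128))],
}


def flatten(key, box=(0, 0, GRID, GRID)):
    """Return the strokes of DATABASE[key], placed inside box, with no comps left."""
    x0, y0, x1, y1 = box

    def place(point):
        # Assumed linear mapping from the description's own grid into the box.
        x, y = point
        return (x0 + x * (x1 - x0) / GRID, y0 + y * (y1 - y0) / GRID)

    strokes = []
    for element in DATABASE[key]:
        if element[0] == "stroke":
            _, stroke_type, points = element
            strokes.append((stroke_type, [place(p) for p in points]))
        else:
            _, child_key, child_box = element
            # The child's box is itself expressed in this description's grid.
            cx0, cy0 = place(child_box[:2])
            cx1, cy1 = place(child_box[2:])
            strokes.extend(flatten(child_key, (cx0, cy0, cx1, cy1)))
    return strokes


for stroke in flatten("B"):
    print(stroke)
```

Because the result contains only strokes with absolute coordinates, nothing needs to be looked up to draw it. That matches the observation above that stroke-level CDL needs only the engine and not the database.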
Shift-clicking on the ▷cdl button in the above window converts the multi-line CDL description into the in-line version: all newlines are stripped, and a new window opens showing the CDL description rendered as a single character. Such in-line CDL is suitable for use in your documents when you do not want to (or cannot) store the CDL in the CDL database. Such in-line CDL may be associated with zero or more Unicode code points: if there is no suitable Unicode code point, then the description cannot be stored in the CDL database except in the Private-Use Area. Such anonymous CDL descriptions can feed into the Unicode encoding process.
The Scale Button
- After editing a CDL Hanzi description, you should always push the Scale button in the Stroking Box, to ensure that the CDL description completely fills the em-square.
Because CDL descriptions are built up recursively, it is important for proper scaling and positioning of components that each sub-component completely fill its em-square. The Scale button takes care of this for you.
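What "completely filling the em-square" means can be illustrated with a small Python sketch. The text does not say how the Scale button computes its result, so this is an assumption: each axis is stretched independently so that the bounding box of all points spans the default 0..128 grid. The function name scale_to_grid is hypothetical.

```python
GRID = 128  # default grid size named in the text


def scale_to_grid(strokes, grid=GRID):
    """Rescale all points so their bounding box spans 0..grid on both axes.

    strokes is a list of point lists, e.g. [[(10, 20), (50, 20)], ...].
    """
    xs = [x for points in strokes for x, _ in points]
    ys = [y for points in strokes for _, y in points]
    min_x, max_x = min(xs), max(xs)
    min_y, max_y = min(ys), max(ys)

    def rescale(value, low, high):
        if high == low:
            # Degenerate extent (all points share this coordinate): centre it.
            return grid // 2
        return round((value - low) * grid / (high - low))

    return [
        [(rescale(x, min_x, max_x), rescale(y, min_y, max_y)) for x, y in points]
        for points in strokes
    ]


# A cross drawn in only part of the grid is stretched to the full em-square.
print(scale_to_grid([[(10, 20), (50, 20)], [(30, 20), (30, 100)]]))
```

Once every description fills its own grid in this way, a comp's points attribute alone determines where the component lands in its parent.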
The EPS Button
- Pushing the EPS button in the Advanced Stroking Box generates an Encapsulated PostScript version of the glyph, suitable for use in any application that renders EPS. Save the resulting text to a file, and open it in your EPS application.
The SVG Button
- Pushing the SVG button in the Advanced Stroking Box generates a Scalable Vector Graphics version of the glyph, suitable for use in any application that renders SVG. Save the resulting text to a file, and open it in your SVG application.
Wenlin CDL Variants
Wenlin Variation Sequences
Wenlin CDL adds a new dimension to the Unicode code space, with a variant mechanism for associating an unlimited number of CDL descriptions with any Unicode codepoint.
- Unicode Variation Sequences standardize certain glyph variants, and those relating to Unihan characters are managed in Unicode’s fledgling Ideographic Variation Database. (Support for standard Unicode Variation Sequences on Mac OS X is New in 4.2, and depends on the OS font support.)
- Wenlin Variation Sequences depend on Wenlin CDL font support: Wenlin uses a whole plane of Private-Use Area (PUA) characters (U+F0000..U+FFFFD), to define its own Private-Use Variation Selectors (PVS). Wenlin uses these PVS to define Wenlin’s own Private Variation Sequences, used to manage glyph variation in the CDL Database.
Wenlin’s CDL Database aims to capture all of the glyph variation that is relevant to Unicode’s Unihan data management, and for many years Wenlin CDL development has gone hand-in-hand with the effort to manage and extend Unicode’s Unihan character set and property data. A long-term Wenlin CDL development goal is to prepare a submission to standardize CDL variants in IVD, and to manage IVD data directly with CDL.
Wenlin CDL Variants are specified by a sequence involving a base character followed by PVS. Using the rightside component of 𤁉[U+24049] as an example, such a codepoint sequence looks like this:
If the above sequence for variant='1' of uni='213F3' is decoded in Wenlin, and hidden codes are revealed, the resulting character sequence (Wenlin Variation Sequence) will be seen as on the left-side of the arrow in the image below:
When codes are revealed, Wenlin substitutes circled digits (①, ②, ③, ...) in rendering of PVS code points ([U+F0001], [U+F0002], [U+F0003], ...), and the default glyph for the base character (variant='0' of uni='213F3') is displayed (in this case before the ①). Note that no PVS suffix is necessary to select the default glyph ([U+F0000] is not used, and ⓪ is not rendered).
But when codes are hidden, the variation selector is not rendered separately; instead it selects the specific CDL description, and the result is displayed as on the right-side of the arrow in the image above.
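The relationship between a variant number, its Private-Use Variation Selector and the circled digit shown when codes are revealed can be sketched in Python. The sketch assumes, following the examples above, that variant n is selected by the code point U+F0000 + n. The function names are hypothetical. Only ① to ⑳ are handled, because the text does not say how Wenlin displays higher variant numbers.

```python
PVS_BASE = 0xF0000        # first code point of the private-use plane named above
CIRCLED_ONE = 0x2460      # Unicode CIRCLED DIGIT ONE; the run ends at ⑳ (U+2473)


def variation_sequence(base_uni, variant):
    """Build base character + PVS for a hex code point string and variant number."""
    base = chr(int(base_uni, 16))
    if variant == 0:
        return base  # the default glyph needs no PVS suffix
    return base + chr(PVS_BASE + variant)


def reveal_codes(sequence):
    """Replace PVS code points by circled digits, as in the revealed-codes view."""
    shown = []
    for ch in sequence:
        offset = ord(ch) - PVS_BASE
        if 1 <= offset <= 20:
            shown.append(chr(CIRCLED_ONE + offset - 1))
        else:
            shown.append(ch)
    return "".join(shown)


sequence = variation_sequence("213F3", 1)
print([f"U+{ord(ch):04X}" for ch in sequence])
print(reveal_codes(sequence))
```

For variant='1' of uni='213F3' the sequence consists of U+213F3 followed by U+F0001, and the revealed form shows the base character followed by ①.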
The CDL glyph for variant='1' of uni='213F3' is selected (highlighted in yellow) in the illustration below.
Advanced CDL Zidian Entries
The image below shows the Zìdiǎn entry for 𡏳[U+213F3] as it is displayed when Advanced CDL Features are enabled.
A total of five variant glyphs of [U+213F3] are shown in the above image, including the default glyph (V=0) and four other variants (V=1..V=4).
- Each button such as “V=1:▷” etc. opens the Stroking Box, for viewing the animated stroke-by-stroke rendering, and for editing that particular variant CDL description (as the “▷stroke” button does for V=0).
- Each CDL variant is followed by stroke-count information in parentheses, e.g. (14), followed by a stroke diagram. (Note: Reveal Codes to explore the skdig tag.)
- Stroke-type information is given below the stroke diagram, including:
- a numeric string like 12212513434121 (the 札 Zhá 'Five-Type' Key), followed by
- a comma-delimited string like h,s,s,h,s,hz,h,wp,n,p,d,h,s,h (the full CDL stroke-type key).
Such information is available for all Unihan CJK characters and for all CDL variants, and will be displayed in the Zìdiǎn entry of the base form (V=0), when Advanced CDL Features are enabled.
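The two keys shown above are related: each stroke type in the full key collapses to one of five digits. The following Python sketch uses a mapping inferred solely from the single example in this section, so it covers only the seven stroke types that occur there. The complete set of stroke types and their classes is defined in the CDL Specification, not here. The names FIVE_TYPE and five_type_key are hypothetical.

```python
# Mapping inferred from the example pair above; other stroke types are not covered.
FIVE_TYPE = {
    "h": "1",
    "s": "2",
    "p": "3",
    "wp": "3",
    "n": "4",
    "d": "4",
    "hz": "5",
}


def five_type_key(stroke_types):
    """Collapse a comma-delimited stroke-type key into a five-type digit string."""
    return "".join(FIVE_TYPE[name] for name in stroke_types.split(","))


print(five_type_key("h,s,s,h,s,hz,h,wp,n,p,d,h,s,h"))  # 12212513434121
```

A stroke type missing from the table raises KeyError, which is deliberate here: guessing a class for an unknown type would silently produce a wrong key.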
Depending on your build of Wenlin (if you are a Wenlin Developer), additional buttons may appear to the right of each variant, as in the above illustration:
- ▷CDL : Opens a new window showing the CDL description of that variant. (Same as the Stroking Box: CDL button.)
- ▷new : Creates a new CDL description using that particular variant as the template, at the next available PVS position (or if shift-key is down, specify the Target in the dialog).
- ▷clone : Creates a clone of that Source CDL description at Target (via a dialog to specify Target).
At the bottom of the above illustration are two other buttons available only when Advanced CDL Features are enabled.
- ▷list similar characters : This button in the above illustration (New in 4.2) produces a list of similar characters; similarity is based on CDL, as used for Handwriting Recognition.
- ▷as a CDL comp : This button in the above illustration (New in 4.2) provides access to the list of all CDL descriptions containing any variant of the character as CDL component (at any depth of recursion). The resulting list is tabulated in a tree-style report containing frequency distributions for each variant. (The ability to list characters containing a given CDL component has been enabled since Wenlin 4.0, but was previously available only if the shift-key was held down when choosing “List: Characters containing components”, as documented in components.wenlin.)
The image above shows partial output of the ▷as a CDL comp button. Depending on your build of Wenlin (if you are a Wenlin Developer), various statistics and bracketed tree diagrams are appended to the end of the CDL comp report.
Wenlin CDL Clones
Note that small red Hanzi in the image above are BMP (Basic Multilingual Plane) Private-Use Area (PUA) characters which are clones of non-PUA masters.
- A CDL clone is any CDL description which has exactly one comp element, and no stroke elements at all.
- The ultimate CDL master of a clone is not simply the CDL description which serves as the single comp element of the clone, but rather, it is that comp recursively resolved (decloned) to the CDL description containing at least one stroke element. (Compare stroke-level CDL.)
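The two definitions above translate directly into a short Python sketch. It is an illustration only: the database layout, keyed by (uni, variant), and the function names are hypothetical, and the strokes listed for the master are placeholders. The mapping from [U+E223] to [U+213F3](V=2) is the example clone discussed in this section.

```python
# Hypothetical database keyed by (uni, variant). Each entry lists its elements:
#   ("comp", (uni, variant))  or  ("stroke", stroke_type)
DATABASE = {
    ("E223", 0): [("comp", ("213F3", 2))],            # PUA clone
    ("213F3", 2): [("stroke", "h"), ("stroke", "s")],  # placeholder strokes
}


def is_clone(elements):
    """A clone has exactly one comp element and no stroke elements."""
    return len(elements) == 1 and elements[0][0] == "comp"


def resolve_master(key, database=DATABASE):
    """Follow clone links until a description that is not a clone is reached.

    Returns the master's key and the number of clone levels traversed.
    """
    depth = 0
    seen = {key}
    while is_clone(database[key]):
        key = database[key][0][1]
        if key in seen:
            raise ValueError(f"clone cycle detected at {key}")
        seen.add(key)
        depth += 1
    return key, depth


print(resolve_master(("E223", 0)))
```

The returned depth corresponds to the level of recursion at which the master was found. A description that is not a clone resolves to itself at depth 0.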
In CDL development it is sometimes necessary or advantageous to migrate descriptions from PUA to formal UCS code points, as new characters are encoded in Unicode, or as the CDL database is refined (with PVS). The CDL cloning mechanism makes this possible, ensuring both consistency and backwards compatibility. (Note: Wenlin PUA assignments are guaranteed stable: Wenlin PUA assignments once published are not re-used.)
If advanced CDL features are enabled, PUA clones are scaled down small (the top-level points attribute is '30,30 98,98') as a visual aid for dictionary editors, to call out the fact that they are indeed PUA clones. If you use a CDL clone as a comp in another CDL description, it will automatically be resolved to its master when the description is saved.
If you open the Zìdiǎn entry for any clone, you will see additional information about the clone, including the mapping to its master and the level of recursion at which that master is found. In some cases that master may be a variant (V>=0).
The above image shows the Zìdiǎn entry for an example clone, BMP PUA character [U+E223], which maps to CDL master [U+213F3](V=2). This second variant of [U+213F3] is a left-side combining variant of [U+213F3]; its distribution in other CDL descriptions can be seen in the ▷as a CDL comp image above.
Note that if Advanced CDL Features are not enabled, PUA clones display at the normal Hanzi size.
Wenlin CDL Screenshots and Videos
- CDL Demo: Feature Overview (introduction to basic features)
- CDL: Feature Walk-Through (0) (introduction to advanced/experimental features)
- CDL: Feature Walk-Through (1) (overview of more advanced/experimental features)
- CDL Demo: Automatic Coordinate Adjustment (experimental clip of CDL automatically approximating a target glyph in another font)
- CDL and Pinyin integration (playlist: experimental use of CDL with pīnyīn input methods, to input any CJK character)
For Wenlin CDL Developers
Wenlin CDL Technology Overview
For advanced CDL users and Wenlin Developers, the following list summarizes some of the unique features of Wenlin’s CDL font technology, features which make possible all of the CDL-related features of the Wenlin Software for Learning Chinese application, and much much more.
- CDL is the engine (C source code) behind CJKV Unicode megafonts, breaking the 64K glyph barrier! (A CDL font can contain an unlimited number of glyphs.)
- CDL is an XML application, a standards-based font and encoding technology designed for precise and compact description, rendering, and indexing of all 漢/汉 Han (Chinese, Japanese, Korean, and Vietnamese = CJKV) characters, encoded and unencoded.
- CDL is a font database containing (to date) XML / Unicode descriptions of nearly 100,000 characters, complete Unicode 7.1 CJK character support, and more.
- CDL adds a third dimension to the code space, with a variant mechanism for associating an unlimited number of CDL descriptions with any Unicode codepoint.
- Each CDL description can be associated with zero or more Unicode code points, making CDL the ideal tool for extending The Unicode Standard.
- CDL means consistent stroke/component analyses, built-in indexing and variant mappings, and high-quality graphic images as outlines convertible to SVG, PostScript, MetaFont, and more.
- CDL is a compressed binary with an incredibly small memory footprint (~1.5 MB!), suitable for use in limited-memory mobile devices that want full Unicode CJK support.
- CDL technology has applications for machine learning, for handwriting recognition and input methods, for optical character recognition (OCR), and most importantly for human language-learning.
- The basic elements of CDL are a two-dimensional coordinate space, and a set of basic stroke types. Using these simple elements, CDL provides a framework for describing characters and components, and for (recursive) reuse of character and component descriptions in the descriptions of other characters and components.
- CDL has applications beyond CJK, for organizing information underlying the rendering of any complex script.
Core CDL Resources
- A Specification for CDL [Bishop (毕晓普 (畢曉普)) & Cook (曲理查)] (PDF, 2003-10-31).
- Set of Basic Stroke Types [Bishop & Cook] (PDF, 2003-11-04; minor revisions 2004-05-23).
- A character description language for CJK [Bishop & Cook] (PDF; Multilingual, #91, Volume 18 Issue 7 (p. 62-8); October/November 2007).
- Wénlín CDL Online (web interface to the CDL Engine and Database; 2008-06-06).
- A draft CDL DTD (Document Type Definition) defining the CDL tags (elements and attributes).
- The Unicode Standard Version 6.1 – Core Specification: Appendix F: “CJK Strokes Documentation” (all CJK glyphs in this appendix were created by the CDL team using CDL, and all text derives from the CDL Spec. and from CJK Strokes work in WG2/IRG:N3063) [2012-01-31].
- CJK Strokes block, Unicode 5.1 [April 4, 2008] (PDF).
Contact the CDL Development Team
If you are interested in building CDL font technology into your application, want to build cutting-edge CJK fonts, or need to digitize difficult CJK texts, don't try to reinvent the wheel: contact the CDL Development Team. We can help build a solution to meet your programming needs.
All CDL descriptions provided with Wenlin Software for Learning Chinese are copyright © 2015 Wenlin Institute, Inc., All Rights Reserved. To use CDL in your applications and publications, please contact Wenlin Institute. Conventional fonts can be exported from CDL by various methods, and the CDL Engine and Database are available for licensing.