Chapter 3 of the Wenlin User’s Guide
One of the main ideas behind Wenlin is to provide an environment for reading Chinese texts in which looking up vocabulary is quick and easy.
The Wenlin software package includes a variety of Chinese texts that you can read and study. Much of this chapter is about the different electronic file formats (or encodings) that Wenlin supports. But you don’t need to know about file formats to view any of the sample documents in the Wenlin package. Simply use the Sample Texts and Open Files in Wenlin Folder... commands, as described below.
If the file you want to read was included with Wenlin, or was created by Wenlin and saved with a file extension that Wenlin recognizes, you may not even need the Open... command: double-clicking the file icon may open the file in Wenlin. This behavior depends somewhat on your operating system, but in general, saving your documents in a Unicode encoding form with the “.wenlin” file extension ensures that the file is associated with Wenlin. For more information, see File Name Conventions below.
Wenlin can also help you to read your own documents or any other documents that are in any of the file formats recognized by Wenlin. To do so, you may need to know a few things about file formats and file naming conventions that are described in this chapter.
- 1 The Open Command
- 2 Open Recent Files
- 3 The Open Files in Wenlin Folder Command
- 4 The Open Wenlin Folder in Explorer/Finder Command
- 5 Sample Texts
- 6 The Text Folder
- 7 What Format Is This File?
- 8 File Formats Supported by Wenlin
- 9 File Name Conventions
- 10 Hypertext Links
The Open Command
To view a document, first you open it by letting Wenlin know the name of the file and its location.
The main way to open a document is to choose Open... from the File menu.
If you don’t know the difference between a file and a folder, here is a brief introduction.
A plain text file is a document: it’s a string of characters, with a name and a location (and a date and maybe an icon, etc.). We’ll discuss File Name Conventions below. The location of a file depends on two things: what disk it’s on, and what folder it’s in.
A disk (also called a volume) can be a removable diskette, a CD, or a hard disk. A drive was originally the hardware that read a disk, but hard drive and hard disk (or drive and disk) are practically synonymous today. Your computer almost certainly has an internal hard drive, which you use even though you may not see it unless you open the case. Hard drives with spinning internal disks are gradually being replaced by solid state flash memory devices (with no moving parts), sometimes called flash drives. Such devices may be removable, and also go by names such as USB drive or memory stick. Whether it is called a drive, a disk, or a memory stick, each of these is a volume of data storage available to your computer.
A folder (also called a directory) is a subdivision of a disk, and a container for files and other folders. Instead of putting all the files on a drive in one long list, the disk can be organized into separate folders. For example, all the writings of the author Lu Xun can be put together inside one folder. There can be folders inside of folders; a Lu Xun folder might contain one folder for fiction, and another for non-fiction.
Suppose you had a file named “madman.uni” (containing Lu Xun’s story Diary of a Madman) inside a folder named “fiction” which in turn was inside a folder named “luxun”. Then in some circumstances you might see the location written like this (depending on your operating system; these are also called file paths):
- C:\luxun\fiction\madman.uni (MS-Windows)
- /luxun/fiction/madman.uni (Macintosh)
The hard drive is named “C:” (or “D:” etc.) for MS-Windows, while for Macintosh it might be “Macintosh HD” and its name might be omitted (if it’s the default drive). Also, for MS-Windows the separator is a backslash (\), while on Macintosh it is a forward-slash (/).
The path to a file is the series of folders leading to the file; if the path begins with a drive letter or volume name, it is an absolute path. Wenlin uses absolute path names in the File menu under Open Recent.
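To make the idea of a path concrete, here is a minimal Python sketch that takes the two example paths above apart. It uses Python's standard pathlib module, which is unrelated to Wenlin; the variable names are only illustrative.

```python
from pathlib import PurePosixPath, PureWindowsPath

# The two example paths from the text, each parsed with the matching rules.
win = PureWindowsPath(r"C:\luxun\fiction\madman.uni")
mac = PurePosixPath("/luxun/fiction/madman.uni")

print(win.drive)        # C:
print(win.parent.name)  # fiction  (the folder that holds the file)
print(win.name)         # madman.uni
print(mac.parts)        # ('/', 'luxun', 'fiction', 'madman.uni')

# A path that starts at the drive or the root is absolute; one that starts
# in the middle of the hierarchy is not.
relative = PurePosixPath("fiction/madman.uni")
print(win.is_absolute(), mac.is_absolute(), relative.is_absolute())  # True True False
```

Reading the parts from left to right gives the same sequence of steps you follow in an Open dialog box: the disk first, then each folder, then the file.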
When you choose Open... from the File menu, a dialog box appears for you to specify the file you want to open. The appearance of the dialog box, and the way it works, depend on which operating system you are using. These dialog boxes can be very confusing. It may help to form a mental picture of the “path” to a file. You need to navigate to the right disk, then, for example, open the “luxun” folder, then open the “fiction” folder, then find the file there. The dialog box enables you to navigate in the hierarchy of folders on any disk, and also to see, near the “top” level of this hierarchy, the available disks. There’s also a Desktop level, which should show the same items that you would see on the screen if there were no windows or dialog boxes in the way (except, perversely, on OS X, the disks/volumes aren’t shown).
The Open Command for Macintosh
When you choose Open... on a Macintosh, an ordinary Macintosh “Open” dialog box appears. The files listed are those in the current folder. If there are a lot of files, you can scroll to bring more file names into view. To open one of the files listed, select (highlight) its name by clicking on it, then click on the Open button (or double-click on the file name). Folders inside the current folder are also included in the list; by opening a folder you can see the files it contains.
To open a document in a different location (for example, on a different disk), scroll to the left (using the horizontal scrollbar) to see the highest level of the hierarchy, where the disks (and networks) are shown. Then you can open the disk, the folder, and (finally) the file of your choice.
The Open Command for MS-Windows
When you choose Open... using MS-Windows, an ordinary MS-Windows “Open” dialog box appears. The box to the right of the caption “Look in:” indicates the current disk, or the current folder if it is not the top level of the disk. By clicking on that box, you can navigate to a different location. The large box that occupies most of the dialog shows all the files and folders in the current location. If there are a lot of files, you can scroll to bring more file names into view. To open one of the files listed, select (highlight) its name by clicking on it, then click on the Open button (or double-click on the file name). Folders inside the current folder are also included in the list; by opening a folder you can see the files it contains.
Alternatively, you can click on the box with the caption “File name:” and type the name of the file you want to open there. You can even type the complete path of the file; for example,
(Capital and lower case letters are equivalent.)
Open Recent Files
Besides using the Open... command, you can sometimes open a document by selecting its name from the Open Recent list in the File menu.
Wenlin remembers up to 36 of the most recent documents that have been opened, and stores the absolute path to these files in the Open Recent list in the File menu. This list is updated every time you open a document, and the most recently opened documents appear at the top of the list.
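The behavior of such a list can be sketched in a few lines of Python. This is only an illustration, not Wenlin's actual code: the function name remember is hypothetical, and the sketch assumes that reopening a file moves its existing entry to the top instead of listing it twice.

```python
MAX_RECENT = 36  # the limit stated in the text


def remember(recent, path, limit=MAX_RECENT):
    """Return a new list with `path` at the top and at most `limit` entries."""
    # Assumption: a path that is already listed is moved, not duplicated.
    updated = [path] + [p for p in recent if p != path]
    return updated[:limit]


recent = []
for name in ["/luxun/fiction/madman.uni", "/notes/a.wenlin", "/luxun/fiction/madman.uni"]:
    recent = remember(recent, name)

print(recent)  # ['/luxun/fiction/madman.uni', '/notes/a.wenlin']
```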
The Open Files in Wenlin Folder Command
In the File menu, there is a command called Open Files in Wenlin Folder.... It is almost identical to the Open... command. The only difference is that whenever you choose Open Files in Wenlin Folder..., the dialog box always starts out showing the “Wenlin4” folder, which contains the files that were supplied along with the Wenlin application.
The Open Wenlin Folder in Explorer/Finder Command
The next command in the File menu is called Open Wenlin Folder in Explorer (MS-Windows) or Open Wenlin Folder in Finder (Macintosh). Instead of a dialog box, it causes the “Wenlin4” folder to be displayed by the MS-Windows Explorer or the Macintosh Finder. This may be useful both as an alternative method for opening files (which you can drag and drop onto the Wenlin application icon), and as a method for doing other things with the files, such as making backup copies, or opening them with programs other than Wenlin. Be careful not to move, damage, or delete any of the files that Wenlin depends on; otherwise you might prevent Wenlin from running. More information about these folders and files is in Appendix C.
Sample Texts
The Sample Texts command in the File menu opens a window listing some of the sample texts included with the Wenlin software package, all located in the “Text” folder.
The Text Folder
If you install Wenlin on a writable disk, the “Text” folder is copied automatically. Certain files in the “Text” folder must be present for Wenlin to run properly. For example, “gua.wenlin” must be present, so that when you click on any hexagram (a symbol like this: ䷾), the corresponding section of the 《易經》 Yìjīng (I Ching) is displayed.
Additional, optional sample texts are in the “Text2” folder. You can open them using Open Files in Wenlin Folder....
Wenlin Institute is not responsible for the contents of the documents that we downloaded from the Internet; in fact we may not have read through them all!
What Format Is This File?
By format, we mean a standard for encoding the characters in a file. At a fundamental level that’s normally visible only to programmers, each letter or character is represented by a numeric code. (Ultimately, a sequence of zeros and ones; Chapter 6 has a brief introduction to bits, bytes, nybbles, and hexadecimal, in the section for List Characters by Unicode.) There are several different formats commonly used for Chinese characters, which sometimes makes Chinese computing difficult. Wenlin reduces this difficulty by supporting all the most common formats, including Unicode, which is well on its way to being the universal standard for all the languages on Earth. You can open any text file supplied with Wenlin, without having to know what its format is. (If you get the message “What format is this file?” for a file supplied with Wenlin, it must not be a text file—it’s a binary file, such as a font or database.) However, when you open a file from another source, you may need to know its format. For example, you might have written a Chinese file using some other Chinese text editing software, or you might have downloaded a Chinese file from the Internet.
When you open a file, if Wenlin can’t automatically determine its format, it asks you, “What format is this file?” Don’t panic: it may well be in a format that Wenlin can open. The problem is that none of the popular Chinese formats, except for Unicode, has a signature that says, in effect, “I am a such-and-such format file”. Unicode files normally do have an internal signature (the first two bytes have certain values), meaning: “I am a Unicode file.” So Unicode files are no trouble. Other formats can be recognized by file naming conventions. For example, a GB file should have a name like “GoodFile.gb”; then Wenlin will know it’s GB. Otherwise, Wenlin presents a list of choices. You can click on the ▷ button for the correct format. To help you make the right choice, Wenlin shows one-line previews.
In the illustration below, a mysterious file is named “Example.123”. Following the ▷Big5 button is a preview that assumes it is a Big5 file; but the preview is garbage (the Chinese characters make no sense), so it’s evidently not Big5. The preview following the ▷GB button does make sense, and it includes Chinese characters, so this must be a GB file.
If the preview contains a lot of � symbols (diamonds with question marks), that’s a bad sign, meaning invalid character codes; however, it’s not unusual for a file obtained from some odd source to contain a few invalid character codes, so this isn’t a sure sign of a wrong format for the file as a whole.
If you can’t be sure from the previews, you can try opening the file in one format after another, until it doesn’t look like nonsense, or until you’ve exhausted all the possibilities.
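The idea behind the one-line previews can be imitated outside Wenlin. The Python sketch below decodes the first line of some bytes under several candidate formats so that you can judge by eye which one makes sense. The function name previews and the list of candidates are assumptions made for this illustration; the codec names are Python's, and this is not how Wenlin itself is implemented.

```python
# Candidate formats to try, using Python's codec names (an assumption of this sketch).
CANDIDATES = ["gb18030", "big5", "utf-8", "latin-1"]


def previews(data: bytes, width: int = 40) -> dict:
    """Decode the first line of `data` once per candidate format."""
    lines = data.splitlines()
    first_line = lines[0] if lines else b""
    # errors="replace" turns invalid codes into U+FFFD, the diamond with a question mark.
    return {enc: first_line.decode(enc, errors="replace")[:width] for enc in CANDIDATES}


# A short GB-encoded sample standing in for a file of unknown format.
sample = "狂人日记\n".encode("gb18030")
for enc, text in previews(sample).items():
    print(f"{enc:8} {text}")
```

Only the gb18030 preview shows the original four characters; the utf-8 preview is filled with replacement symbols, and the others show unrelated characters.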
To make this more convenient, Wenlin provides the Re-open as... command in the File menu. The window you already opened needs to be the active window when you choose this command.
File Formats Supported by Wenlin
Wenlin can read and write documents in the following plain text electronic file formats:
- Unicode (ISO 10646), an international standard for nearly all the world’s languages. Unicode supports both simple and full form Chinese characters simultaneously. By assigning a unique number to each character in every major world language, Unicode solves the problems of incompatibility between the many different standards for particular languages. There are a few different ways to encode Unicode text; Wenlin considers Big-endian UTF-16 to be the normal one.
- UTF-8 (Unicode Transformation Format, 8-bit), a popular Unicode variety, which stores exactly the same information but in a way that is more compatible with the earlier ASCII standard (see below). (It stores the 128 ASCII characters with one byte each, and stores other characters as two, three, or four bytes. In contrast, UTF-16 uses 16 bits, or two bytes, for each character, except some extremely rare characters that it encodes as 32 bits, or four bytes.)
- Little-endian Unicode, a backward variety of UTF-16. (The technical terms big-endian and little-endian, which refer to the ordering of bytes in a two-byte number, were inspired by Jonathan Swift’s book Gulliver’s Travels, in which the Big-Endians are people who insist on breaking their eggs at the big end rather than the little end.)
- GB (Guojia Biaozhun 18030), a national standard of the People’s Republic of China. It’s an extension of GBK, which was an extension of the older GB2312 standard, which was for simple form Chinese characters only. There is now a one-to-one mapping between GB18030 and Unicode, but some software still only supports the original simple-form-only GB.
- HZ, an old variant of GB2312, now rare and not recommended. Wenlin supports reading HZ but not writing it.
- Big5+ (“Big Five Plus”), an industrial standard originating in Taiwan. It’s an extension of the older Big5 standard, which was for full form Chinese characters only. It contains codes for all 20,092 Chinese characters that were in the first version of the Unicode standard, but is missing tens of thousands of additional characters that have been added to Unicode and GB18030.
- ASCII (American Standard Code For Information Interchange), for English text only. It consists of the first 128 Unicode characters, encoded as one byte each.
- Latin1 (ISO-8859-1), an extension of ASCII that includes some European characters. It consists of the first 256 Unicode characters, encoded as one byte each.
- MacRoman, an extension of ASCII that includes some European characters, formerly common on Apple Macintosh computers. It is a one-byte encoding. The first 128 characters are the same as in ASCII, Latin1, and Unicode, but the additional characters are arranged completely differently.
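As a rough illustration of how the formats listed above differ at the byte level, the following Python sketch encodes the same characters in several of them and prints the bytes in hexadecimal. The codec names are Python's, not Wenlin's, and Python's “big5” codec covers plain Big5 only, not Big5+; the helper name encoded_hex is hypothetical.

```python
def encoded_hex(text: str, encoding: str) -> str:
    """Return the bytes of `text` in `encoding`, written as hexadecimal."""
    return text.encode(encoding).hex()


# The character 中 (U+4E2D) in five different formats.
for enc in ["utf-16-be", "utf-16-le", "utf-8", "gb18030", "big5"]:
    print(f"{enc:10} {encoded_hex('中', enc)}")
# utf-16-be  4e2d
# utf-16-le  2d4e
# utf-8      e4b8ad
# gb18030    d6d0
# big5       a4a4

# An ASCII letter takes one byte in UTF-8 but two in UTF-16.
print(encoded_hex("A", "utf-8"), encoded_hex("A", "utf-16-be"))  # 41 0041

# A rare character outside the first 65,536 code points takes four bytes in both.
rare = "\U00020000"
print(len(rare.encode("utf-8")), len(rare.encode("utf-16-be")))  # 4 4
```

The two UTF-16 lines show the same two bytes in opposite order, which is all that big-endian and little-endian mean here.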
Wenlin doesn’t support any of the fancy text (as opposed to plain text) file formats used by commercial word processors. Fancy text contains codes for page layout, fonts, etc. Most word processors can save or “export” plain text. Note that even plain text formats all include essential control codes such as spaces, tabs and carriage returns. Currently, Wenlin can’t properly display Arabic, Bengali, or any script other than Chinese and English. In fact, depending on what Unicode fonts you have, you may be able to display a wide variety of scripts in Wenlin, including Japanese and Russian; but in order to be displayed correctly, most scripts require not only a font but also other special handling, which we hope to implement in future editions.
When you create or edit files, it’s advisable to use Unicode (or UTF-8 with a .u8 extension) whenever possible. If you must use non-Unicode files, for compatibility with other software, give them names that indicate their formats (such as “Example.gb” or “Example.b5” —see below for file name extensions), and only use simple form characters in a GB file or full form characters in a Big5(+) file, unless you are certain the other software you’re using supports GB18030 and/or Big5+.
The Unicode file “signature” (BOM = Byte Order Mark) is normally invisible (to humans), and you probably don’t need to know anything more about it. In case you are interested and mathematically inclined, for UTF-16 the signature simply consists of the first two bytes of the file being FE and FF hexadecimal (i.e., 11111110 and 11111111 binary). For little-endian, the two bytes are reversed. The Unicode standard recommends using this signature, and Wenlin always writes it (except for UTF-8); but Wenlin can also open a Unicode file that is missing the signature, in case such a file is created by other software.
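For the curious, checking for the signature takes only a few lines. The Python sketch below is a minimal illustration, not Wenlin's own code. The function name unicode_signature is hypothetical, and the check for a UTF-8 signature (the three bytes EF BB BF) is an extra assumption of this sketch, since the text notes that Wenlin does not write one for UTF-8. The sketch also ignores UTF-32, whose little-endian signature begins with the same two bytes as UTF-16.

```python
import codecs


def unicode_signature(data: bytes):
    """Return the Unicode form named by the leading signature, or None."""
    if data.startswith(codecs.BOM_UTF8):      # EF BB BF (assumption: also checked)
        return "utf-8"
    if data.startswith(codecs.BOM_UTF16_BE):  # FE FF
        return "utf-16-be"
    if data.startswith(codecs.BOM_UTF16_LE):  # FF FE, the same two bytes reversed
        return "utf-16-le"
    return None  # no signature: the format has to be found some other way


print(unicode_signature(b"\xfe\xff\x4e\x2d"))  # utf-16-be
print(unicode_signature(b"\xff\xfe\x2d\x4e"))  # utf-16-le
print(unicode_signature(b"plain ASCII text"))  # None
```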
File Name Conventions
File Names in General
File names can use a variety of characters. For the specific characters that are legal in file names on your computer, please refer to your operating system documentation. Most modern operating systems support a wide range of Unicode characters in file names, though some characters may not be permitted. We recommend avoiding punctuation marks other than hyphen (-) and underscore (_), and using period (.) for filename extensions only.
File Name Extensions
The name of a file can help to indicate the file’s format. Wenlin follows the widespread convention of recognizing certain file name extensions (also known as suffixes). The extension is the part of the name that starts with a period. For example, in “Example.ASC”, the extension is “.ASC”. (If there are two or more periods, the extension is what follows the last period.) Wenlin recognizes these extensions for text files:
| Extension | Format |
|---|---|
| .wenlin | Unicode (either UTF-16 or UTF-8) |
| .u8 | UTF-8 (8-bit Unicode Transformation Format) |
| .gb | GB (sometimes limited to simple form) |
| .b5 | Big5(+) (sometimes limited to full form) |
| .hz | HZ (obsolete variant of GB) |
| .asc | ASCII (English only) |
Upper and lower case aren’t distinguished for this purpose, so “.asc” is treated the same as “.ASC”.
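The rules just described (the extension is whatever follows the last period, and case is ignored) can be sketched in Python as follows. The table mirrors the extensions listed above; the function name format_from_name is hypothetical and this is not Wenlin's own code.

```python
# Extensions listed above, mapped to the formats they indicate.
EXTENSION_FORMATS = {
    ".wenlin": "Unicode (UTF-16 or UTF-8)",
    ".u8": "UTF-8",
    ".gb": "GB",
    ".b5": "Big5(+)",
    ".hz": "HZ",
    ".asc": "ASCII",
}


def format_from_name(filename: str):
    """Return the format suggested by the extension, or None if unrecognized."""
    dot = filename.rfind(".")        # only the last period counts
    if dot == -1:
        return None                  # no extension at all
    extension = filename[dot:].lower()  # ".ASC" and ".asc" are the same
    return EXTENSION_FORMATS.get(extension)


print(format_from_name("Example.ASC"))   # ASCII
print(format_from_name("notes.old.b5"))  # Big5(+)
print(format_from_name("Example.123"))   # None
```

A result of None corresponds to the situation in which the name gives no hint and the format has to be determined some other way.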
A UTF-16 file doesn’t need a particular extension, since the file itself has a signature in the first two bytes. Therefore, the file name can have almost any extension, as far as Wenlin is concerned. It might be a good idea to use the extension “.uni” if in doubt. Or, if you want to associate the file with Wenlin (for example, so that it opens with Wenlin when you double-click on its icon), give it the “.wenlin” extension.
The popular “.txt” extension strongly suggests text, but otherwise conveys nothing about file formats. Most web page files have “.htm” or “.html” extensions (for Hyper-Text Markup Language); these extensions don’t convey which encoding is used for Chinese text, but sometimes HTML files do contain instructions like “charset=utf-8” which Wenlin is occasionally clever enough to recognize.
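To show what recognizing such an instruction might involve, here is a minimal Python sketch that searches the beginning of an HTML file for a “charset=” declaration. It is a simplification under stated assumptions, not Wenlin's method: the function name declared_charset and the 1024-byte limit are hypothetical, and real web pages can declare their encoding in ways this pattern misses.

```python
import re

# Matches charset=utf-8, charset="GB2312", charset = 'big5', and similar.
CHARSET_RE = re.compile(rb"charset\s*=\s*[\"']?([A-Za-z0-9._-]+)", re.IGNORECASE)


def declared_charset(data: bytes, limit: int = 1024):
    """Return the charset named near the start of `data`, in lower case, or None."""
    match = CHARSET_RE.search(data[:limit])  # assumption: only the start is examined
    if match is None:
        return None
    return match.group(1).decode("ascii").lower()


page = b'<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
print(declared_charset(page))                        # utf-8
print(declared_charset(b'<meta charset="GB2312">'))  # gb2312
print(declared_charset(b"<html><body>"))             # None
```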
Earlier versions of Wenlin sometimes used “.han” for Chinese character text files, “.eng” for English text files, and “.pin” for pinyin text files (all in UTF-16 format); but these aren’t widespread conventions and they can’t be recommended. Wenlin doesn’t give any special treatment to these extensions. Avoid making up your own file extensions: they might be misinterpreted in strange unpredictable ways by software you didn’t even know was on your computer.
Both Microsoft and Apple have a perverse fondness for keeping people ignorant, so their operating systems often make file name extensions invisible. You might be able to thwart this behavior. For MS-Windows, in an Explorer folder window, choose Options... from a View menu, and un-check a check box that says “Hide file extensions for files types that are registered.” For Macintosh, in the Finder, choose Preferences from the Finder menu, click Advanced, and check a check box that says “Show all file extensions.”
Hypertext Links
Besides using the Open... command, you can sometimes open a document by pressing a hypertext link triangle ▷ button. This kind of triangle button shows up in:
- Any window that displays the results of a global search (details are in Chapter 10)
- Choices for “What format is this file?” (described earlier in this chapter)
- Miscellaneous documents
A hypertext triangle button may point to another document, a particular position in another document, or a different position in the same document that contains the button. In the last case, clicking the button doesn’t open a new window; the window simply scrolls to the new position.
To follow a hypertext link, click on the triangle button. If you want to include hypertext links in your own documents, see Chapter 8 for a brief explanation of the hidden codes for creating hypertext links.