Frequency Statistics

Appendix A of the Wenlin User’s Guide

People often ask, “How many Chinese characters are there?” A moderately sized Chinese dictionary might have entries for seven thousand characters. The Unicode Standard (6.0) specifies codes for over seventy thousand (so far). The authoritative Kangxi dictionary of 1716 A.D. contains forty thousand, which Dr. L. Wieger said “may be plainly divided as follows: 4000 characters in common use; 2000 proper names and doubles of limited use; 34,000 monstrosities of no practical use.” (Chinese Characters, Dover Publications, Inc., New York.) We would hesitate to call any character a “monstrosity”, but it’s true that some characters are used far more often than others, and the practical importance of a character is closely related to its frequency of usage. Nobody really knows forty thousand characters.

If you know a thousand Chinese characters, you know a lot more than a thousand Chinese words or phrases. A Chinese word may contain one or more characters, and a large number of words or phrases can be built from a small number of characters. The approximate meaning of a polysyllabic compound word is usually easy to learn, given the meanings of the individual characters in the compound. (Looking at it another way, “knowing” a character really entails knowing the meanings of the compounds in which it commonly occurs.)

If you know the one thousand most common characters, then you will recognize 90% of the characters on an average page of modern Chinese text. Knowing the two thousand most common characters will give you over 97% average recognition. All the remaining thousands of characters account for less than 3% of usage. The significance of these figures is that you may obtain the fastest rewards by learning the most common characters first. Elementary textbooks are generally designed with this in mind, so you will learn the more common characters first without having to pay attention to this issue. However, at some point, when you have learned hundreds of characters, you may wish for ways to review and assess your progress. Memorization is greatly facilitated by making lists and going over them repeatedly. It’s a real boost for your morale to know when you pass certain mileposts: one hundred characters; five hundred characters; one thousand characters; and so on.
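As a rough illustration of how such coverage percentages can be computed, here is a minimal Python sketch. The function name coverage and the toy sample string are assumptions made for illustration; they are not Wenlin's code or data, and a tiny sample like this says nothing about real usage.

```python
from collections import Counter

def coverage(counts, top_n):
    """Return the fraction of all character occurrences that is
    accounted for by the top_n most frequent characters."""
    total = sum(counts.values())
    top = sum(n for _, n in counts.most_common(top_n))
    return top / total

# Toy sample (hypothetical): 14 characters, 7 distinct.
sample = "的一是不了的一是的一的的人在"
counts = Counter(sample)

# The two most frequent characters (的 and 一) cover 8 of the 14 occurrences.
print(coverage(counts, 2))
```

With counts taken from a large sample, coverage(counts, 1000) and coverage(counts, 2000) would give figures of the kind quoted above.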

The actual order in which one learns new vocabulary is a highly individual matter. Wenlin isn’t designed to impose a rigid order or method on the student. It simply makes statistical information available, and emphasizes the most common vocabulary.

Two distinct types of frequency statistics are useful. The first is obtained by counting occurrences of characters, the second by counting occurrences of words (including vocabulary items that might also be called “phrases”). When learning characters, we may emphasize the most common characters. Then when learning each character, we may emphasize the most common words or phrases that contain that character. The word frequencies are also helpful, for example, when one wants to know which of two synonymous Chinese words is more often used. Here, by word we mean to include monosyllabic words, as well as polysyllabic words and phrases.

Counting occurrences of characters is fairly straightforward, aside from the fundamental decision of what texts to use as a representative sample of vocabulary (and some other details). Take the pile of books, newspapers, documents etc. constituting the sample and count how many times each character occurs in the whole pile. The characters may then be ranked as the first, second, third, etc., most common character.
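The counting and ranking procedure just described might be sketched as follows. This is only an illustration under stated assumptions: the function name rank_characters is hypothetical, only characters in the basic CJK Unified Ideographs block (U+4E00 to U+9FFF) are counted, and ties are broken by code point, none of which comes from the studies themselves.

```python
from collections import Counter

def rank_characters(texts):
    """Count each Chinese character across all texts and return a
    dict mapping character -> rank (1 = most common)."""
    counts = Counter()
    for text in texts:
        # Assumption: keep only the basic CJK Unified Ideographs block,
        # which skips punctuation, digits and Latin letters.
        counts.update(ch for ch in text if "\u4e00" <= ch <= "\u9fff")
    # Sort by descending count; ties are broken by code point (an assumption).
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {ch: rank for rank, (ch, _) in enumerate(ranked, start=1)}

# A very small hypothetical "pile" of texts.
pile = ["我是人。", "我是", "我"]
print(rank_characters(pile))
```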

Counting occurrences of words is slightly more problematic, since the concept of a Chinese “word” is not well-defined. A Chinese word may be a single character or several characters in length. Sometimes, however, a two-character word might alternatively be regarded as a phrase consisting of two single-character words. This is reflected in the way one tends to learn Chinese vocabulary: each new character seems to present a specific challenge, while learning new compounds made of already-known characters takes relatively little effort.

In Wenlin’s Zìdiǎn (character dictionary), the most common character is ranked first and assigned the number 1; the second most common character is ranked second and assigned the number 2, etc. (see Chapter 6, List Characters by Frequency). When a character is not among the 3,000 most common characters, the notation “----”, “====”, or “xxxx” appears instead of a number (characters marked “----” are more common than those marked “====”, according to GB and Big5 codes; “xxxx” marks characters that were in neither of the original GB and Big5 standards).

Notation   Meaning
123        (numeric) frequency rank
----       more common (according to GB and Big5 codes)
====       less common (according to GB and Big5 codes)
xxxx       characters in neither of the original GB and Big5 standards

The frequency rankings are based on the set of simple form characters. Full form characters take on the ranking of the corresponding simple form character. In cases where multiple full form characters correspond to a single simple form character, all of the full forms take on identical frequency ranks.
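The inheritance of ranks by full form characters can be pictured as a two-step lookup. The sketch below is only an illustration: the names simple_rank, full_to_simple and rank_of are hypothetical, and the rank numbers are placeholder values, not Wenlin's actual ranks.

```python
# Hypothetical ranks for two simple form characters (placeholder values).
simple_rank = {"发": 100, "国": 50}

# Full form -> simple form. Both 發 and 髮 correspond to the simple form 发.
full_to_simple = {"發": "发", "髮": "发", "國": "国"}

def rank_of(ch):
    """Return the rank of ch, falling back to the rank of its
    simple form when ch is a full form character."""
    return simple_rank.get(full_to_simple.get(ch, ch))

# All full forms of 发 share its rank.
print(rank_of("發"), rank_of("髮"), rank_of("发"))
```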

In view of the differences between character frequency and word frequency, the statistics are indicated in two different ways in Wenlin’s Zìdiǎn and Cídiǎn. Character frequency ranks are prominently displayed as “1” for the most common character, “2” for the second most common, and so forth. Cídiǎn entries are not numbered in this way. Instead, the Cídiǎn entry specifies how many times the word (or phrase) occurs per million characters of text on average. In any case, you may ignore the numbers and still enjoy their benefit, since lists of characters and words are automatically displayed with the most common ones first.

The statistics employed in Wenlin are based on three studies, by scholars who are to be thanked for their hard work in selecting and entering vast quantities of text, performing the calculations, and publishing the results in these three books:

(1) Modern Chinese Frequency Dictionary 《现代汉语频率词典》Xiàndài Hànyǔ Pínlǜ Cídiǎn (Beijing Language Institute, 1986) [Character and word frequencies from a sample of 1,807,389 characters.]
(2) Which are the Most Commonly Used Chinese Characters? 《最常用的汉字是哪些?》Zuì Chángyòng de Hànzì shì Nǎxiē? (Chinese Writing Reform Committee and National Standards Office, 1982) [Character frequencies from a sample of 11,080,000 characters.]
(3) Cracking the Chinese Puzzles (T. K. Ann, © 1982 Stockflows Co., Ltd., 37 Queen’s Road, Central, Hong Kong) [Character frequencies from a sample of 1,408,573 characters.]

In addition, at Wenlin Institute we calculated two sets of statistics using electronic Chinese magazines published on the Internet between April 1991 and April 1995:

(4) 华夏文摘 Huáxià Wénzhāi (HXWZ) [Sample of 4,189,874 characters.]
(5) 枫华园 Fēng Huá Yuán (FHY) and 联谊通讯 Liányì Tōngxùn (LYTX) [Combined sample of 1,227,883 characters.]

The character frequency ranks in Wenlin were derived by combining and averaging these five sets of statistics. Since the text samples were different, the rankings are not identical, yet they are fairly close. For example, all the studies rank 的 de and 一 yī as the two most common characters.

Character  Pinyin  Study 1  Study 2  Study 3  Study 4  Study 5  Adjusted Median
的         de      1        1        1        1        1        1
一         yī      2        2        2        2        2        2
是         shì     4        3        3        3        3        3
不         bù      5        5        5        4        4        4
了         le      3        6        14       5        6        5
人         rén     9        9        4        6        5        6
在         zài     7        4        6        8        7        7
                   6        17       28       7        8        8
有         yǒu     8        7        7        9        9        9
中         zhōng   61       11       11       12       11       10
...
文         wén     170      200      117      51       59       115
...
林         lín     721      377      480      356      398      384

The character 文 wén is ranked 170th, 200th, 117th, 51st, and 59th in the five studies. The median rank is 117, which is adjusted to 115, Wenlin’s frequency rank for 文. A slight adjustment is needed to assign one and only one character to each number from 1 to 3000 – otherwise, there would be no character ranked 4th; 不 would be ranked 5th; and both 了 and 人 would be ranked 6th.
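The median-and-adjust step can be sketched in Python using the ranks quoted above for the six most common characters. This is a sketch under assumptions, not Wenlin's actual procedure: the function name adjusted_ranks is hypothetical, and since the text does not say how tied medians are resolved, the sketch simply keeps tied characters in input order.

```python
from statistics import median

# Ranks in Studies 1 to 5 for the six most common characters.
study_ranks = {
    "的": [1, 1, 1, 1, 1],
    "一": [2, 2, 2, 2, 2],
    "是": [4, 3, 3, 3, 3],
    "不": [5, 5, 5, 4, 4],
    "了": [3, 6, 14, 5, 6],
    "人": [9, 9, 4, 6, 5],
}

def adjusted_ranks(ranks_by_char):
    """Order characters by median rank, then number them 1, 2, 3, ...
    so that every number is used exactly once."""
    medians = {ch: median(r) for ch, r in ranks_by_char.items()}
    # sorted() is stable, so tied medians keep their input order
    # (an assumption; the real tie-breaking rule is not stated).
    ordered = sorted(medians, key=medians.get)
    return {ch: i for i, ch in enumerate(ordered, start=1)}

print(median([170, 200, 117, 51, 59]))  # median rank of 文: 117
print(adjusted_ranks(study_ranks))
```

The raw medians here are 1, 2, 3, 5, 6 and 6; renumbering fills the gap at 4 and separates the tie at 6, so 不 becomes 4th, 了 5th and 人 6th.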

Only the Beijing study (1) calculated word frequencies; the researchers actually parsed the entire sample of text into monosyllabic and polysyllabic words. The statistics in Wenlin’s Cídiǎn entries are therefore derived entirely from the Beijing study. Since the sample size was 1,807,389 characters, we divided the number of occurrences of each word by 1.807389 to obtain the number of occurrences per million characters. Thanks to Dr. Richard Cook for providing Wenlin Institute with a large part of this data in electronic form.
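The per-million conversion is a single division. Here is a minimal sketch; the function name per_million and the example count of 3,615 occurrences are hypothetical, and only the sample size comes from the text.

```python
SAMPLE_SIZE = 1_807_389  # characters in the Beijing study sample

def per_million(occurrences, sample_size=SAMPLE_SIZE):
    """Convert a raw occurrence count into occurrences per million
    characters of text."""
    return occurrences / (sample_size / 1_000_000)

# A hypothetical word counted 3,615 times in the sample:
print(round(per_million(3615)))  # about 2000 occurrences per million characters
```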

The frequency statistics may change in future editions of Wenlin, based on further research.

