Chinese language database (中文数据库)
The Chinese language is a great hobby of mine. While I am also studying it in general (I passed HSK 3 a couple of years back and plan to pass HSK 5 soon), I enjoy learning and compiling facts about it even more. Specifically, I love everything that has to do with the Chinese writing system, including learning the characters, studying their history, and practicing calligraphy. The “discrete” nature of the Chinese language appeals to my love of statistics: with no grammatical word forms and a fixed set of characters in use, everything in Chinese can be counted and analyzed.
My biggest project related to the Chinese language is the Chinese language database (中文数据库). The database consists of two large parts: the first is dedicated to the language in general and can be of interest to anyone, and the second is dedicated to my own progress in learning the language and can help those who want to start learning it.
General information
The first part contains extensive lists of Chinese characters and words with statistics for them. There are a total of eight lists:
- Five lists of Chinese characters:
  - The list of Chinese characters by their frequency in the language, based on Jun Da’s Modern Chinese Character Frequency List. This list contains a total of 9,933 characters.
  - The list of Chinese characters from the General Standard. This list is the official standard and contains a total of 8,105 characters: 3,500 frequent, 3,000 common, and 1,605 rare.
  - The full list of Chinese characters, obtained by merging the first two lists. This list contains 11,062 characters and can be treated as the exhaustive list of characters for which data can be found in an automated way.
  - The list of Chinese characters from the Hanyu Shuiping Kaoshi (HSK) 2.0, the main international exam for the Chinese language. This list is split into levels of the exam, contains 2,663 characters, and represents the version of the exam as it was from 2010 to 2020.
  - The list of Chinese characters from the Hanyu Shuiping Kaoshi 3.0. This list is split into bands of the exam, contains 3,000 characters, and represents the version of the exam as it has run from 2021 onwards.
- Three lists of Chinese words (multi-character):
  - The list of Chinese words by their frequency in the language, based on the BLCU Chinese Corpus. This list contains all the words with at least 2,000 occurrences in the corpus (a total of 93,279 words).
  - The list of Chinese words from the Hanyu Shuiping Kaoshi 2.0. This list is split into levels of the exam, contains 4,287 words, and represents the version of the exam as it was from 2010 to 2020.
  - The list of Chinese words from the Hanyu Shuiping Kaoshi 3.0. This list is split into bands of the exam, contains 9,446 words, and represents the version of the exam as it has run from 2021 onwards.
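The merge that produces the full character list can be pictured as an order-preserving union. The snippet below is only a minimal sketch with toy data: the function name `merge_character_lists` and the sample characters are illustrative assumptions, not the actual procedure or contents of the database. It also applies the same counting logic to the list sizes quoted above, assuming neither source list repeats a character internally.

```python
def merge_character_lists(frequency_list, standard_list):
    """Return the characters of both lists in first-seen order, without duplicates."""
    merged = []
    seen = set()
    for char in list(frequency_list) + list(standard_list):
        if char not in seen:
            seen.add(char)
            merged.append(char)
    return merged


# Toy data standing in for the real lists (hypothetical, not taken from the database).
frequency_list = ["的", "一", "是", "不"]
standard_list = ["一", "乙", "二", "的"]

merged = merge_character_lists(frequency_list, standard_list)
print(merged)

# Inclusion-exclusion on the sizes quoted in the list above:
# |A ∩ B| = |A| + |B| - |A ∪ B|, assuming no duplicates inside either list.
overlap = 9933 + 8105 - 11062
print(overlap)  # 6976 characters would appear in both source lists
```

Under that assumption, roughly 6,976 characters are shared by the frequency list and the General Standard, and the rest appear in only one of the two.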
For every character in the lists, the database provides pronunciation, meaning, dictionary keys (radicals), and stroke count. For the words from the HSK levels, it provides pronunciations and meanings. An additional list in the database compiles statistics about all 11,062 characters.
Learning progress
The second part of the database describes my own learning progress and can be of use to anyone who decides to learn the language. The main sheet lists all the characters that I have learned and their distribution among the frequency ranks and the HSK levels, as well as the words and phrases I have learned. Additionally, the database tracks progress toward the goals I have set: for example, learning all the HSK characters, learning the 3,000 most frequent characters, and so on.
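Tracking a goal of this kind comes down to counting how many characters of a target list are already in the learned set. The following is a rough sketch of that idea, not the way the database itself computes it: the function `goal_progress` and all sample data are hypothetical.

```python
def goal_progress(learned, target):
    """Return (learned_count, target_size, fraction) for one learning goal."""
    target_set = set(target)
    done = len(target_set & set(learned))
    fraction = done / len(target_set) if target_set else 0.0
    return done, len(target_set), fraction


# Toy data (hypothetical): characters learned so far and two small target lists.
learned = {"的", "一", "是", "学"}
frequency_ranked = ["的", "一", "是", "不", "了"]  # most frequent first
hsk_characters = ["一", "二", "学", "不"]

# Goal 1: the N most frequent characters (N = 3 here; 3,000 in the real goal).
top_n_progress = goal_progress(learned, frequency_ranked[:3])

# Goal 2: all characters of an exam list.
hsk_progress = goal_progress(learned, hsk_characters)

for name, (done, total, fraction) in [
    ("top-3 frequent", top_n_progress),
    ("HSK characters", hsk_progress),
]:
    print(f"{name}: {done}/{total} ({fraction:.0%})")
```

Using sets keeps each check independent of list order, so the same function works for frequency-based and exam-based goals alike.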
I hope that the database can help you or make you interested in the Chinese language!