Chinese language database
The Chinese language is my biggest hobby. Besides studying the language itself (I have passed HSK 3 and plan to pass HSK 4 or 5 soon), I enjoy learning and compiling facts about it even more. Specifically, I love everything that has to do with the Chinese writing system, including learning the characters, studying their history, and practicing calligraphy. The “discrete” nature of the Chinese language appeals to my love of statistics: with no grammatical inflection and a fixed set of characters in use, everything in Chinese can be counted and analyzed.
My biggest project in regard to the Chinese language is the Chinese language database (中文数据库). The database consists of two large parts: one is dedicated to the characters in general and can be of interest to anyone, and the second one is dedicated to my own progress in learning the language and can help those who want to start learning it.
The first part contains extensive lists of Chinese characters and statistics for them. There are a total of four lists:
- The list of the Chinese characters by their frequency in the language, based on Jun Da’s Modern Chinese Character Frequency List. This list contains a total of 9,933 characters.
- The list of the Chinese characters from the General Standard. This list contains a total of 8,105 characters (3,500 frequent, 3,000 common, and 1,605 rare) and is the official standard.
- The full list of the Chinese characters, obtained by merging the first two lists. This list contains 11,062 characters and can be treated as an exhaustive list of the characters for which data can be found in an automated way.
- The list of the Chinese characters from the Hanyu Shuiping Kaoshi (HSK), the main international exam for the Chinese language. This list contains 2,663 characters and is split by exam level.
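The sizes of the first three lists also show how much the two sources overlap: if neither source list contains duplicates, then 9,933 + 8,105 − 11,062 = 6,976 characters appear in both. The merge itself can be illustrated with a minimal Python sketch. The function name and the tiny sample lists below are hypothetical and only show the idea; they are not the actual data or the method used to build the database.

```python
def merge_character_lists(frequency_list, standard_list):
    """Merge two character lists, keeping first-seen order and dropping duplicates."""
    merged = []
    seen = set()
    for char in list(frequency_list) + list(standard_list):
        if char not in seen:
            seen.add(char)
            merged.append(char)
    return merged


# Tiny illustrative samples (assumed for this example, not the real lists).
frequency_sample = ["的", "一", "是", "不"]
standard_sample = ["一", "乙", "二", "是"]

merged = merge_character_lists(frequency_sample, standard_sample)

# Characters present in both lists: |A| + |B| - |A ∪ B|
overlap = len(frequency_sample) + len(standard_sample) - len(merged)

print(merged)
print(overlap)
```

Characters from the frequency list come first in the result, so the merged list stays roughly ordered by frequency, with the characters found only in the General Standard appended at the end.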
For all the characters in the lists, the database provides various data: pronunciation, meaning, dictionary keys, and stroke count. An additional list in the database compiles statistics about all 11,062 characters.
The second part of the database describes my own learning progress and can be of use to anyone who decides to learn the language. The main sheet lists all the characters that I have learned, their distribution across the frequency ranks and the HSK levels, as well as the words and phrases I have learned. Additionally, the database tracks progress toward the goals I have set: for example, learning all the HSK characters, learning the 3,000 most frequent characters, etc.
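Progress toward a goal of this kind comes down to counting how many of the goal's characters are already among the learned ones. A minimal Python sketch of that calculation follows, assuming both are available as plain collections of characters. The function name and the sample data are hypothetical and stand in for the real sheets.

```python
def goal_progress(learned, goal_characters):
    """Return (learned count, goal size, fraction complete) for one goal."""
    goal = set(goal_characters)
    done = goal & set(learned)
    if goal:
        fraction = len(done) / len(goal)
    else:
        fraction = 0.0
    return len(done), len(goal), fraction


# Illustrative data only: a few learned characters and a five-character goal.
learned_sample = {"的", "一", "是", "我"}
goal_sample = ["的", "一", "是", "不", "了"]

done, total, fraction = goal_progress(learned_sample, goal_sample)
print(f"{done}/{total} characters learned ({fraction:.0%})")
```

The same function works for any goal expressed as a list of characters, such as the characters of one HSK level or the first 3,000 entries of the frequency list.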
I hope that the database can help you or make you interested in the Chinese language!