My progress over a year of learning Chinese, shown through a lot of colorful language statistics: character frequency, language coverage, syllables, HSK levels, and stroke count.

Hello everyone!

This is a project that I have been working on for the last year and would finally like to share. The data lives in a Google Spreadsheet, which I would like to walk you through. I am very fond of statistics, and I want to encourage people to take another look at learning Chinese.

The statistics in the spreadsheet describe my own learning, not the entire language, so let me start with the one part that does describe the entire language and may interest any Chinese language enthusiast. The “Frequency data” sheet is based on Jun Da's Modern Chinese Character Frequency List and lists the top 10,000 modern Chinese characters by their frequency in a corpus of texts. In my table, those frequencies are recalculated into the percentage of the language that each character represents. As one can see, the most frequent character, 的, takes up about 4% of the language, so roughly every 25th character you encounter is this one.
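For anyone who wants to redo this recalculation outside a spreadsheet, here is a minimal Python sketch. It is not how the sheet itself is built; the names `to_shares` and `one_in_n` are made up for illustration, and the counts are toy numbers rather than Jun Da's real figures.

```python
def to_shares(raw_counts):
    """Convert raw corpus counts into percentage shares of the whole corpus."""
    total = sum(raw_counts.values())
    return {char: 100.0 * count / total for char, count in raw_counts.items()}


def one_in_n(share_percent):
    """A character with a share of p% shows up roughly once every 100/p characters."""
    return round(100.0 / share_percent)


# Toy counts for illustration only, not real corpus data.
raw_counts = {"的": 40, "一": 15, "是": 12, "everything else": 933}
shares = to_shares(raw_counts)

print(shares["的"])            # 4.0
print(one_in_n(shares["的"]))  # 25
```

A 4% share and "every 25th character" are the same statement, since 100 / 4 = 25.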

The main body of the document is the “Characters” sheet, where I list the characters I have learned. They are divided into Sources and Lessons, corresponding to the apps I used to study them. For every character, the following information is listed: index number, frequency, HSK level (HSK is, basically, the Chinese language proficiency exam), stroke count, the language coverage of this character (as described above), total language coverage up to this point, the character itself, its traditional form if it differs, pronunciation, and, of course, meaning. The table is color-coded for frequency and pronunciation.
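If you prefer code to spreadsheets, one row of such a table could be modeled roughly like this. It is only a sketch: the class name `CharacterRow`, the field names, and the reading of "frequency" as a rank are my assumptions, not the actual column names of the sheet.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CharacterRow:
    index: int                        # running number of the row
    frequency_rank: int               # assumed: position in the frequency list
    hsk_level: Optional[int]          # None if the character is not in any HSK level
    stroke_count: int
    coverage: Optional[float]         # this character's share of the language, in percent
    total_coverage: Optional[float]   # running sum of coverage up to this row
    simplified: str
    traditional: Optional[str]        # None when identical to the simplified form
    pinyin: str
    meaning: str

    def display_form(self) -> str:
        """Show the traditional form in brackets only when it differs."""
        if self.traditional:
            return f"{self.simplified} ({self.traditional})"
        return self.simplified


# The index is a placeholder and the coverage values are left unfilled.
row = CharacterRow(
    index=1, frequency_rank=2478, hsk_level=1, stroke_count=8,
    coverage=None, total_coverage=None,
    simplified="苹", traditional="蘋", pinyin="píng", meaning="apple",
)
print(row.display_form())  # 苹 (蘋)
```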

Apart from the table itself, the first sheet contains all the statistics and the corresponding figures:

  • Firstly, it shows how many characters I have learned among the first 100, 500, 1000, etc. most frequent characters, and how much of the language I would understand if I knew all of them. I must note that “knowing X%” here stands for “recognizing X% of the characters in a random non-specific text”, while, of course, most Chinese words are compounds of several characters. Still, with only 100 characters you would technically "recognize" 42% of the text! The sheet also states how many characters I’ve learned and how much of the language they cover, the rarest character that I know, etc. There is a histogram of the characters and a graph of how the language coverage grows with each character (reaching saturation, unfortunately).
  • Secondly, it shows how many characters I’ve learned for each pronunciation and lists the most common syllables (disregarding the tones), with “shi” having an astonishing 29 characters so far. The figures show the popularity of the syllables and of the tones.
  • Thirdly, there is a bar chart of HSK levels, showing what percentage of each level I have covered and how much of it I have left. So far, I have finished HSK3, and I have an exam in December.
  • Fourthly, there is a graph of stroke count, demonstrating a nice bell curve around the value of 8-9 strokes.
  • Lastly, there is a graph of my level of knowledge of characters from the Pleco app :)
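The coverage numbers from the first point can be sketched in a few lines of Python. Again, this is only an illustration under my own assumptions: the function names are hypothetical, and the shares below are toy numbers, not the real ones from the frequency list.

```python
def coverage_of_top(ranked, n):
    """Share of a text you would recognize knowing the n most frequent characters."""
    return sum(share for _, share in ranked[:n])


def learned_in_top(ranked, learned, n):
    """How many of the n most frequent characters are already learned."""
    return sum(1 for char, _ in ranked[:n] if char in learned)


def learned_coverage(ranked, learned):
    """Total share covered by the characters learned so far."""
    return sum(share for char, share in ranked if char in learned)


def rarest_learned(ranked, learned):
    """Return (rank, char) for the least frequent learned character, or None."""
    known = [(rank, char)
             for rank, (char, _) in enumerate(ranked, start=1)
             if char in learned]
    return max(known) if known else None


# (character, share in percent), most frequent first. Toy numbers only.
ranked = [("的", 4.0), ("一", 1.5), ("是", 1.25), ("不", 1.0), ("了", 0.75)]
learned = {"的", "是", "了"}

print(coverage_of_top(ranked, 3))          # 6.75
print(learned_in_top(ranked, learned, 3))  # 2
print(learned_coverage(ranked, learned))   # 6.0
print(rarest_learned(ranked, learned))     # (5, '了')
```

The saturation in the coverage graph follows directly from this: each new character sits further down the list, so it adds a smaller share than the one before.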

The other sheets are mainly auxiliary, but they are nice to look at.

The “Words and Phrases” sheet simply lists the words and phrases that I have learned, also color-coded for tones.

The “Syllables” sheet is a full version of the pronunciation chart, showing the most popular syllables both with the tone taken into account and without it.
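Counting syllables with and without tones is easy to reproduce in code if the pronunciations are stored as pinyin with tone marks. Below is a minimal sketch, assuming that input format; the function name `split_tone` and the sample readings are mine, and the neutral tone is written as 5 by convention.

```python
import unicodedata
from collections import Counter

# Combining marks used for the four tones: macron, acute, caron, grave.
TONE_MARKS = {"\u0304": 1, "\u0301": 2, "\u030c": 3, "\u0300": 4}


def split_tone(pinyin):
    """Return (toneless syllable, tone number); 5 stands for the neutral tone."""
    tone = 5
    kept = []
    # NFD splits "ì" into "i" plus a combining grave accent.
    for ch in unicodedata.normalize("NFD", pinyin):
        if ch in TONE_MARKS:
            tone = TONE_MARKS[ch]
        else:
            kept.append(ch)  # keeps the two dots of "ü", which are not a tone
    return unicodedata.normalize("NFC", "".join(kept)), tone


# A few sample readings, one per learned character.
readings = ["shì", "shí", "shī", "shì", "de", "lǜ"]

with_tone = Counter(split_tone(p) for p in readings)
without_tone = Counter(split_tone(p)[0] for p in readings)

print(without_tone.most_common(1))  # [('shi', 4)]
print(with_tone[("shi", 4)])        # 2
print(split_tone("lǜ"))             # ('lü', 4)
```

Only the four tone marks are removed, so “lü” does not collapse into “lu”.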

The last two sheets are simply different visualizations of the characters. “Character frequency spreadsheet” is a giant table of all the characters that shows a character if it is learned and an "X" if it isn’t. “Characters by HSK level” is more interesting: the idea is the same, but the characters are grouped by HSK level. One might find it interesting to see how the level relates to the frequency of the characters: for example, level 1 (the most basic one) requires the character “苹” (as in “apple”), which is only the 2478th most frequent one, while the 168th most frequent one, “斯”, is only taught at level 6, the super-advanced one. Of course, this is natural: learning the language requires some basic, child-like vocabulary, and while the latter character is very frequent, it is used a lot in loanwords. Still, the distribution is of interest.
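Mismatches like these two can also be picked out automatically. Here is a small sketch that uses only the two examples above; the function name `mismatches` and the cutoffs (levels 1-2 as "early", levels 5-6 as "late", rank 1000 as the border between frequent and rare) are arbitrary assumptions of mine.

```python
def mismatches(rows, easy_level=2, hard_level=5, rank_cutoff=1000):
    """Split rows of (char, hsk_level, rank) into two kinds of surprises."""
    # Rare characters that are nevertheless taught early.
    rare_but_early = [char for char, level, rank in rows
                      if level <= easy_level and rank > rank_cutoff]
    # Frequent characters that are nevertheless taught late.
    frequent_but_late = [char for char, level, rank in rows
                         if level >= hard_level and rank <= rank_cutoff]
    return rare_but_early, frequent_but_late


# (character, HSK level, frequency rank), the two examples from the post.
rows = [("苹", 1, 2478), ("斯", 6, 168)]

rare_but_early, frequent_but_late = mismatches(rows)
print(rare_but_early)     # ['苹']
print(frequent_but_late)  # ['斯']
```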

Overall, I hope that my spreadsheet can make people more interested in Mandarin, especially those who love statistics, because the granular nature of this language makes it perfect for such people.

submitted by /u/areyde
