How to I learnt Chinese.

Todo: this is a post-mortem of learning Chinese as a foreign language, for those who would like to try it themselves, to assess time and effort needed.

1. TODO Body

1.1. TODO References

  1. https://github.com/cschiller/zhongwen/ :: firefox extension dictionary
  2. panlatin (sound hints) :: firefox extension
  3. Pleco
  4. ChinesePOD
  5. Mao’s Little Red Book
  6. The Little Schemer
  7. Zhihu
  8. Private offline teacher
  9. Word plotting from pleco backups
  10. Chinese Word Separator
  11. fcitx5
  12. mdbg
  13. cedict, cc-cedict
  14. opencc
  15. Anki
  16. ctext.org
  17. Book of Change
  18. Emacs’ “pinyin with tone marks” input method.

1.2. TODO Text

1.3. Counting words I have learnt overtime

1.3.1. Counting words

#!/bin/bash
# -*- mode:sh; -*-
# Time-stamp: <2024-12-12 19:32:34 lockywolf>
#+title: Print word usage from time, learning Chinese.
#+author: 
#+date: 
#+created: <2024-05-09 Thu 13:19:26>
#+refiled:
#+language: bash
#+category: 
#+filetags: 
#+creator: Emacs-30.0.50/org-mode-9.7-pre

if [[ $1 == "" ]] ; then
  printf "usage: %s <dir>" "$0"
fi

{
echo date words-learnt
for f in "$1"/*.pqb ; do
#  echo "$f"
  REG='flashbackup-([[:digit:]]{10})\.pqb'
  if [[ "$f" =~ $REG ]] ; then
#    echo matches "${BASH_REMATCH[1]}"
    cdate="${BASH_REMATCH[1]}"
    ndate=$(stat --format='%y' "$f" | cut -f 1 -d ' ')
    #ndate=$(date --date="$ndate" +%s)
  else
    echo "not matches!"
    exit 1
  fi
  nwords=$(sqlite3 "$f" 'select count(*) from pleco_flash_cards;')
  printf "%s %s\n" "$ndate" "$nwords"

done } > print-word-usages.csv

echo finished successfully

1.3.3. Result

(This plot is outdated, and not re-generated regularly.)

output_graph.jpg

1.4. Generating a list of hard characters

I used this list for the Chinese “alphabet” https://www.hanyuguoxue.com/zidian/guifanhanzi-sn-3 , because full UNICODE is too huge. It has four groups: easy characters, middle characters, hard characters, and the rest, seldom used characters.

This script has a few nice features:

  1. It is still pure UNIX: bash, sed, grep, and awk, even though it does “statistical data analysis”.
  2. Sequential string character processing in bash is horrible (unicode BAD, BAD!), but there is a nice trick around it.
  3. OpenCC is a very nice tool for simplified<->traditional conversion, and this is the first time in my life I actually used co-processes in a script (albeit not in BASH).
  4. grep -P '[\p{Han}]' is really useful.
#!/bin/bash
# -*- mode:sh; -*-
# Time-stamp: <2025-07-07 09:05:01 lockywolf>
#+title: Make word statistics in bash.
#+author: 
#+date: 
#+created: <2025-07-05 Sat 21:44:56>
#+refiled:
#+language: bash
#+category: 
#+tags: 
#+creator: Emacs-31.0.50/org-mode-9.7-pre

set -x

#echo "你好嗎 新年好。全型句號" | sed -e 's/\(.\)/\1\n/g'



for i in 1 2 3; do
  time sed -e 's/\(.\)/\1\n/g' level-$i-source.txt | grep -P '[\p{Han}]' | sort\
    | uniq  > level-$i-processed.txt
done


filename=$( cd ~/Syncthing_2021/Oneplus-5t-sdcard/2020-06-15-Pleco_backups
            readlink -f $(find . -iname 'flashbackup-*.pqb' | sort  -r | head -n1))

echo "$filename"

time sqlite3 -cmd 'select * from pleco_flash_cards ; ' "$filename" </dev/null \
  | sed -e 's/\(.\)/\1\n/g' | grep -P '[\p{Han}]' | sort | uniq > my-characters.txt



#  | awk --field-separator='' '{ for (i=1; i<NF ; i++)  ;}'

if [[ ! -d ./sources ]] ; then
  echo "you must mkdir ./sources and copy your source files there"
  exit 2
fi

time find -L . -type f -iname '*html' -exec cat {} + \
  | awk --field-separator='' 'BEGIN{distribution[a]=1} { for (i=1; i<NF ; i++) if( $i in distribution ) { distribution[$i]++ } else {distribution[$i]=1} ;} END{ for (char in distribution ) { print distribution[char] " : " char  } ; }' \
  | grep -P '[\p{Han}]'  | sort -V -r  \
  | awk 'BEGIN{while(( getline tmp < "my-characters.txt") > 0) {mine[tmp]=1 ; } ;
               while(( getline tmp < "level-1-processed.txt") > 0) {level1[tmp]=1 ; } ;
               while(( getline tmp < "level-2-processed.txt") > 0) {level2[tmp]=1 ; } ;
               while(( getline tmp < "level-3-processed.txt") > 0) {level3[tmp]=1 ; } ;
  }
{printf( "%4d" " : " "%6d : %s"  " : " , NR, $1, $3);
if ($3 in mine) {printf("  seen") ; } else {printf("unseen")}; printf(" : ");
if ($3 in level1) {printf("1"); } else if ($3 in level2) {printf("2") } else if ($3 in level3) {printf("3");} else {printf("0");}
 print $3 |& "opencc -c /usr/share/opencc/t2s.json" ;
 "opencc -c /usr/share/opencc/t2s.json" |& getline tmp ;
 printf(" : " tmp )
 print ""; }' > sorted.txt