How to I learnt Chinese.

Todo: this is a post-mortem of learning Chinese as a foreign language, for those who would like to try it themselves, to assess time and effort needed.

1. TODO Body

1.1. TODO References

https://github.com/cschiller/zhongwen/ :: firefox extension dictionary
panlatin (sound hints) :: firefox extension
Pleco
ChinesePOD
Mao’s Little Red Book
The Little Schemer
Zhihu
Private offline teacher
Word plotting from pleco backups
Chinese Word Separator
fcitx5
mdbg
cedict, cc-cedict
opencc
Anki
ctext.org
Book of Change
Emacs’ “pinyin with tone marks” input method.

1.3. Counting words I have learnt overtime

1.3.1. Counting words

#!/bin/bash
# -*- mode:sh; -*-
# Time-stamp: <2024-12-12 19:32:34 lockywolf>
#+title: Print word usage from time, learning Chinese.
#+author: 
#+date: 
#+created: <2024-05-09 Thu 13:19:26>
#+refiled:
#+language: bash
#+category: 
#+filetags: 
#+creator: Emacs-30.0.50/org-mode-9.7-pre

if [[ $1 == "" ]] ; then
  printf "usage: %s <dir>" "$0"
fi

{
echo date words-learnt
for f in "$1"/*.pqb ; do
#  echo "$f"
  REG='flashbackup-([[:digit:]]{10})\.pqb'
  if [[ "$f" =~ $REG ]] ; then
#    echo matches "${BASH_REMATCH[1]}"
    cdate="${BASH_REMATCH[1]}"
    ndate=$(stat --format='%y' "$f" | cut -f 1 -d ' ')
    #ndate=$(date --date="$ndate" +%s)
  else
    echo "not matches!"
    exit 1
  fi
  nwords=$(sqlite3 "$f" 'select count(*) from pleco_flash_cards;')
  printf "%s %s\n" "$ndate" "$nwords"

done } > print-word-usages.csv

echo finished successfully

1.3.3. Result

(This plot is outdated, and not re-generated regularly.)

1.4. Generating a list of hard characters

I used this list for the Chinese “alphabet” https://www.hanyuguoxue.com/zidian/guifanhanzi-sn-3 , because full UNICODE is too huge. It has four groups: easy characters, middle characters, hard characters, and the rest, seldom used characters.

This script has a few nice features:

It is still pure UNIX: bash, sed, grep, and awk, even though it does “statistical data analysis”.
Sequential string character processing in bash is horrible (unicode BAD, BAD!), but there is a nice trick around it.
OpenCC is a very nice tool for simplified<->traditional conversion, and this is the first time in my life I actually used co-processes in a script (albeit not in BASH).
grep -P '[\p{Han}]' is really useful.

#!/bin/bash
# -*- mode:sh; -*-
# Time-stamp: <2025-07-07 09:05:01 lockywolf>
#+title: Make word statistics in bash.
#+author: 
#+date: 
#+created: <2025-07-05 Sat 21:44:56>
#+refiled:
#+language: bash
#+category: 
#+tags: 
#+creator: Emacs-31.0.50/org-mode-9.7-pre

set -x

#echo "你好嗎 新年好。全型句號" | sed -e 's/\(.\)/\1\n/g'



for i in 1 2 3; do
  time sed -e 's/\(.\)/\1\n/g' level-$i-source.txt | grep -P '[\p{Han}]' | sort\
    | uniq  > level-$i-processed.txt
done


filename=$( cd ~/Syncthing_2021/Oneplus-5t-sdcard/2020-06-15-Pleco_backups
            readlink -f $(find . -iname 'flashbackup-*.pqb' | sort  -r | head -n1))

echo "$filename"

time sqlite3 -cmd 'select * from pleco_flash_cards ; ' "$filename" </dev/null \
  | sed -e 's/\(.\)/\1\n/g' | grep -P '[\p{Han}]' | sort | uniq > my-characters.txt



#  | awk --field-separator='' '{ for (i=1; i<NF ; i++)  ;}'

if [[ ! -d ./sources ]] ; then
  echo "you must mkdir ./sources and copy your source files there"
  exit 2
fi

time find -L . -type f -iname '*html' -exec cat {} + \
  | awk --field-separator='' 'BEGIN{distribution[a]=1} { for (i=1; i<NF ; i++) if( $i in distribution ) { distribution[$i]++ } else {distribution[$i]=1} ;} END{ for (char in distribution ) { print distribution[char] " : " char  } ; }' \
  | grep -P '[\p{Han}]'  | sort -V -r  \
  | awk 'BEGIN{while(( getline tmp < "my-characters.txt") > 0) {mine[tmp]=1 ; } ;
               while(( getline tmp < "level-1-processed.txt") > 0) {level1[tmp]=1 ; } ;
               while(( getline tmp < "level-2-processed.txt") > 0) {level2[tmp]=1 ; } ;
               while(( getline tmp < "level-3-processed.txt") > 0) {level3[tmp]=1 ; } ;
  }
{printf( "%4d" " : " "%6d : %s"  " : " , NR, $1, $3);
if ($3 in mine) {printf("  seen") ; } else {printf("unseen")}; printf(" : ");
if ($3 in level1) {printf("1"); } else if ($3 in level2) {printf("2") } else if ($3 in level3) {printf("3");} else {printf("0");}
 print $3 |& "opencc -c /usr/share/opencc/t2s.json" ;
 "opencc -c /usr/share/opencc/t2s.json" |& getline tmp ;
 printf(" : " tmp )
 print ""; }' > sorted.txt