This document is likely to become obsolete by ~2024.
Working with natural languages is hard. Many of us remember high school language classes as a horror.
Nevertheless, people expect computers to check spelling. At some point I decided to make spellchecking work where I need it.
Predictably, it turned out to be hard, and this document is a memo on how it was done.
The field is huge and heterogeneous, as Linux is an anarchic system, so I will limit the scope to the following targets:
- English (British)
- Chinese (if possible)
- OmegaT (if possible)
- Jabber Messaging (Dino, Pidgin, maybe others)
There are, at the moment, essentially three approaches to checking spelling:
- Rule-based
- Machine-Learning based
- Human-based (or outsourced)
Much of natural language processing theory focused on rule-based methods, until the Internet giants accumulated enough data to train models that outperform even the most advanced rule-based systems. Whenever you can, I suggest using Machine-Learning based methods.
However, when the task gets closer to implementing spellchecking rather than using it, rule-based methods remain valid, because they are much easier to hook into your system. Naturally, most people have no means to train their own models.
There are remote services that let you submit your text for checking. This may be better in some cases – remote parties may have better resources for developing spellchecking methods. However, sometimes the Internet connection is poor, and sometimes you are not allowed to share your texts.
A spell checker essentially consists of four components:
- Dictionary (sometimes grammar database)
- Engine
- Dictionary package
- Application interface
The difference between a dictionary and a dictionary package usually comes from a broken abstraction barrier: dictionary packages turn out to be engine-specific. Even worse is a dictionary package that is application-specific (sadly, this happens). Converting between different dictionaries and dictionary packages is often not too hard, but it still requires work.
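The split between a dictionary and an engine can be illustrated with a toy checker. This is only a minimal sketch, assuming the dictionary is a plain word list with one word per line. The function name `unknown_words` and the sample words are made up for illustration, and real engines also handle affix rules and suggestions, which this does not.

```shell
#!/bin/sh
# Toy "engine": print every word from stdin that is absent from a word list.
# The word list plays the role of the dictionary (one word per line).
unknown_words() {
    wordlist="$1"
    tr -cs "[:alpha:]'" '\n' |        # split the input into one word per line
        tr '[:upper:]' '[:lower:]' |  # fold case so "The" matches "the"
        sort -u |                     # report each word only once
        grep -Fxv -f "$wordlist"      # keep only words missing from the list
}

# Usage with a throwaway three-word dictionary (hypothetical data).
dict=$(mktemp)
printf '%s\n' the colour of > "$dict"
echo "The color of the colour" | unknown_words "$dict"
rm -f "$dict"
```

With this British-only word list the script prints `color`. Swapping the word list changes the verdict without touching the function, which is the abstraction barrier that engine-specific dictionary packages tend to break.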
Things do not appear out of nowhere in this world. Everything is done by some people driven by different motives.
This section lists several projects that continue improving natural language support in computing.
- https://www.cs.hmc.edu/~geoff/ispell.html – English dictionary and the oldest API for spelling.
- Moby Word List (by dwyl)
  - https://github.com/dwyl/english-words, originally from https://www.gutenberg.org/ebooks/3201 – English dictionary
- Alexander Lebedev
  - http://scon155.phys.msu.su/eng/lebedev.html – Russian dictionary, 2004 (ye, yo, and both variants): ftp://scon155.phys.msu.su/pub/russian/ispell/rus-ispell.tar.gz
- Konstantin Knizhnik
  - http://www.garret.ru/~knizhnik – Russian dictionary: http://www.garret.ru/~knizhnik/rispell.tar.gz
- https://ftp.gnu.org/gnu/aspell/dict/0index.html – dictionaries (includes Lebedev’s 2004 dictionary for Russian)
- Alexander Slovesnik
  - https://addons.mozilla.org/en-US/firefox/addon/russian-spellchecking-dic-3703/ – repack of Alexander Lebedev’s 3 dictionaries for Firefox
- Alexander Klukvin
  - https://code.google.com/archive/p/hunspell-ru/, https://addons.mozilla.org/en-US/firefox/addon/russian-hunspell-dictionary/ – 2013, an improvement of Lebedev’s dictionary in Hunspell format. See also https://addons.mozilla.org/ru/firefox/addon/russian-hunspell-dictionary/, https://sites.google.com/site/dictru/
- Russian Friends of Hunspell
  - https://addons.mozilla.org/en-US/firefox/addon/%D1%81%D0%BB%D0%BE%D0%B2%D0%B0%D1%80%D1%8C-%D0%BE%D1%80%D1%84%D0%BE%D0%B3%D1%80%D0%B0%D1%84%D0%B8%D0%B8-%D0%B2%D0%B8%D0%BA%D0%B8/ – repack of Wiktionary for Hunspell
- http://hunspell.github.io/ – dictionaries and more.
- https://abiword.github.io/enchant/ – a unified API for spellcheckers
- https://cgit.freedesktop.org/libreoffice/dictionaries/tree/, https://extensions.libreoffice.org/?Tag%5B0%5D=50&q=
- AOT Project
  - https://github.com/sokirko74/aot – also has a dictionary; Yakov’s dictionary, 2015: https://addons.mozilla.org/ru/firefox/addon/russian-spellcheck-dict-aot/
- Mozilla-Russia Unified Dictionary
- Alexander Petrenas
  - https://addons.mozilla.org/ru/firefox/addon/unified-russian-english-spell/ – United Dictionary
- Apache OpenOffice
- https://www.languagetool.org/ – grammar checker (!), with extensions for Chrome, Firefox, LibreOffice
ispell is the oldest and most widespread engine. Slackware only has an English dictionary for it.
aspell is supposedly the best engine for English, and Slackware has dictionaries for most languages.
The Russian one is from Lebedev.
A library that is built into Hunspell.
Supposedly, the best spellchecker for all languages, except, maybe, English.
Package for Slackware.
Enchant is a meta-checker that is ispell-compatible, but can use other engines. It should be used if the software allows you to choose a language.
*:nuspell,hunspell,aspell,ispell
en:aspell,hunspell,nuspell
en_GB:aspell,hunspell,nuspell
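The ordering above has to live in a file that Enchant reads. A minimal sketch of installing it for one user follows. It assumes Enchant 2 picks up a per-user `enchant.ordering` from `$XDG_CONFIG_HOME/enchant` (`~/.config/enchant` by default); check the manual page of your Enchant version before relying on that path. Note that the script overwrites an existing ordering file.

```shell
#!/bin/sh
# Assumption: Enchant 2 reads a per-user ordering file from
# $XDG_CONFIG_HOME/enchant/enchant.ordering (~/.config/enchant by default).
conf_dir="${XDG_CONFIG_HOME:-$HOME/.config}/enchant"
mkdir -p "$conf_dir"

# One "language:providers" pair per line; "*" is the fallback entry.
# Providers are tried from left to right.
cat > "$conf_dir/enchant.ordering" <<'EOF'
*:nuspell,hunspell,aspell,ispell
en:aspell,hunspell,nuspell
en_GB:aspell,hunspell,nuspell
EOF
```

Listing `en` and `en_GB` separately is deliberate in this sketch: it does not assume that a rule for the bare language code also covers the regional tag.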
Check for a dictionary:
enchant-lsmod-2 -list-dicts | grep ru
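The same check can be repeated for each back end, since Enchant can only use a dictionary that the underlying engine has installed. The sketch below is an assumption-laden helper, not part of any of these tools: `check_backend` and `report_dicts` are hypothetical names, and the match is a plain substring search like the `grep ru` above, so it can give false positives (for example on directory names printed by `hunspell -D`).

```shell
#!/bin/sh
# check_backend LANG LABEL COMMAND...
# Runs COMMAND (expected to list dictionaries) and reports whether its
# output mentions LANG. Missing tools are reported, not treated as errors.
check_backend() {
    lang="$1"; label="$2"; shift 2
    if ! command -v "$1" >/dev/null 2>&1; then
        echo "$label: not installed"
    elif "$@" </dev/null 2>&1 | grep -q "$lang"; then
        echo "$label: found a dictionary matching '$lang'"
    else
        echo "$label: no dictionary matching '$lang'"
    fi
}

report_dicts() {
    check_backend "$1" aspell   aspell dicts
    # hunspell -D prints its search path and dictionaries to stderr.
    check_backend "$1" hunspell hunspell -D
    check_backend "$1" enchant  enchant-lsmod-2 -list-dicts
}

report_dicts ru
```

The script always prints three lines, one per back end, so a missing tool shows up as a line of its own instead of as silence.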
LanguageTool is a grammar checker that supports English and Russian.
It is a great tool, actually, and the Emacs package is of great quality. It eats a lot of CPU, though.
It is on slackbuilds.org.
proselint is a stylistics checker, and quite a strong one.
flycheck has built-in support for proselint.
You need to enable it in Emacs, using
textlint is an auxiliary (but enormous) checker for everything text.
On Slackware it needs nodeenv, packaged at https://gitlab.com/Lockywolf/lwfslackbuilds/-/tree/master/nodeenv
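DATE=$(date --iso)
# Braces are needed: "$DATE_textlint" would expand an unset variable named DATE_textlint.
DNAME="${DATE}_textlint"
mkdir -p ~/bin
# Assumption: the environment is created with nodeenv, which makes the directory itself.
nodeenv ~/bin/"$DNAME"
source ~/bin/"$DNAME/bin/activate"
# Inside the activated environment, --global installs into the environment.
npm install --global textlint
npm install --global textlint-rule-rousseau
npm install --global textlint-rule-diacritics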
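The installed rules do nothing until a configuration file enables them. A minimal sketch of such a file follows, assuming textlint's usual convention that a package named `textlint-rule-NAME` is enabled under the key `NAME`, and assuming the configuration lives in `~/.textlintrc`. The script overwrites an existing file at that path.

```shell
#!/bin/sh
# Assumption: textlint reads JSON from ~/.textlintrc and enables a rule
# package called textlint-rule-NAME under the key "NAME".
rc="$HOME/.textlintrc"

cat > "$rc" <<'EOF'
{
  "rules": {
    "rousseau": true,
    "diacritics": true
  }
}
EOF

echo "wrote $rc"
```

Each rule can later be given an options object instead of `true`; the available options depend on the rule and are not covered here.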
- ispell-mode for Emacs
This checks spelling only when you ask.
(use-package ispell
  :demand t
  :ensure t
  ;; use-package appends "-hook" itself, so name the mode, not the hook.
  :hook (tex-mode . (lambda () (setq ispell-parser 'tex)))
  :config
  ;; (ispell-change-dictionary "british-ise-w_accents" t) ;; aspell-specific
  ;; Russian test words, the first one misspelled: превед привет
  (setf ispell-program-name (executable-find "aspell"))
  ;; This will not change the language automatically.
  ;; You still have to select it manually.
  ;; Entry layout: name, casechars, not-casechars, otherchars,
  ;; many-otherchars-p, ispell-args, extended-character-mode, coding system.
  (setq ispell-local-dictionary-alist
        '(("ru-local"
           "[АБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЬЫЪЭЮЯабвгдеёжзийклмнопрстуфхцчшщьыъэюя]"
           "[^АБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЬЫЪЭЮЯабвгдеёжзийклмнопрстуфхцчшщьыъэюя]"
           "[-]" nil ("-d" "ru-yo") nil utf-8)
          ;; aspell
          ("british-local"
           "[A-Za-z]" "[^A-Za-z]" "[']" nil ("-d" "en_GB-ise") nil utf-8)))
  (ispell-change-dictionary "british-local" t)
  (setq ispell-silently-savep t))
ispell-dictionary-alist is probably not the best choice here.
In Russian files, switch the language by using
flyspell has to be configured in some weird way in order to check both languages.
(use-package flyspell
  :demand t
  :ensure t
  :hook ((text-mode . flyspell-mode)
         (prog-mode . flyspell-prog-mode))
  :config
  (diminish 'flyspell-mode "🦋🧙")
  (setf flyspell-use-meta-tab nil))

(use-package flyspell-correct ; -ido
  :ensure t
  :demand t
  :bind (:map flyspell-mode-map
              ("C-;" . flyspell-correct-wrapper))
  :init
  (setq flyspell-correct-interface #'flyspell-correct-popup))
Works, but I would recommend trying spell-fu.
spell-fu only uses aspell, but covers what both flyspell-mode and flyspell-prog-mode did. It is also very fast.
(use-package spell-fu
  :ensure t
  :demand t
  :config
  (setf spell-fu-faces-exclude '(org-meta-line org-link org-code))
  (global-spell-fu-mode)
  :bind (("C-," . spell-fu-goto-next-error)))

;;;; LanguageTool
(use-package languagetool
  :ensure t
  :demand t
  :config
  (setf languagetool-language-tool-jar
        "/usr/share/LanguageTool/languagetool-commandline.jar")
  (setq languagetool-java-arguments '("-Dfile.encoding=UTF-8"))
  (setq languagetool-default-language "en-GB")
  (setq languagetool-server-language-tool-jar
        "/usr/share/LanguageTool/languagetool-server.jar")
  (languagetool-server-start))
Add languagetool-server-mode to your hooks, if you have a strong CPU.
and enable the spelling checker there.
(use-package flycheck
  :demand t
  :ensure t
  :config
  (setf flycheck-textlint-executable
        "~/binary_software/2021-05-29_textlint/bin/textlint")
  (setf flycheck-textlint-config "~/.textlintrc")
  (add-to-list 'flycheck-checkers 'proselint) ;; also enabled somewhere in customize
  (global-flycheck-mode)
  (diminish 'flycheck-mode "🦋✓"))