Spellchecking on Linux with English and Russian (Work In Progress).
This document is likely to become obsolete by ~2024.
Working with natural languages is hard. Many of us remember high school language classes as a horror.
Nevertheless, spellchecking is an expected application for computers. At some point I decided to make spell checking work where I need it to work.
Obviously, it turned out to be hard, and this document was created as a memo on how it was done.
The field is huge and heterogeneous, as Linux is an anarchic system, so I will limit the scope to the following targets:
Languages:
- English (British)
- Russian
- Chinese (if possible)
Applications:
- Emacs
- Firefox
- LibreOffice
- OmegaT (if possible)
- Jabber Messaging (Dino, Pidgin, maybe others)
References:
1. Body
1.1. Methods of spellchecking
There are, at the moment, essentially three approaches to checking spelling.
- Rule-based
- Machine-Learning based
- Human-based (or outsourced)
Much of natural language processing theory focused on the rule-based approach, until Internet giants accumulated enough data to train models that outperform even the most advanced rule-based systems. Whenever you can, I suggest using Machine-Learning based methods.
However, when the task gets closer to implementing spellchecking rather than using it, rule-based methods remain valid, because they are much easier to hook into your system. Naturally, most people have no ability to train their own models.
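To make the rule-based idea concrete, here is a minimal sketch of its simplest form: a plain word-list lookup built from standard Unix tools. The word list, the sample sentence, and the function name unknown_words are made up for illustration. Real engines add affix rules and suggestions on top of this, and this toy version only handles ASCII letters.

```shell
#!/bin/sh
# Toy rule-based check: report every word that is absent from a word list.
# The word list and the sample text are illustrative, not a real dictionary.

workdir=$(mktemp -d)
printf '%s\n' hello world spelling is hard > "$workdir/words.txt"

# unknown_words WORDLIST: read text on stdin, print unknown words, one per line.
unknown_words() {
    tr -cs '[:alpha:]' '\n' |          # split the text into one word per line
        sed '/^$/d' |                  # drop empty lines left by leading punctuation
        tr '[:upper:]' '[:lower:]' |   # fold case
        sort -u |                      # drop duplicates
        grep -v -x -F -f "$1"          # keep only lines that are not in the word list
}

result=$(printf 'Hello wrold, spelling is hard\n' | unknown_words "$workdir/words.txt")
echo "$result"
rm -r "$workdir"
```

Everything a lookup like this cannot do (inflected forms, suggestions, context) is what the engines described later add on top.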
1.2. Outsourcing
There are services that let you submit your text to a remote party for checking. This may be better in some cases, since remote parties may have better resources for developing spellchecking methods. However, sometimes the Internet connection is not very good, and sometimes you are not allowed to share your texts.
1.3. Spell-checking components
A spell checker essentially consists of four components:
- Dictionary (sometimes grammar database)
- Dictionary package
- Engine
- Application interface
The difference between a dictionary and a dictionary package usually comes from a broken abstraction barrier: dictionary packages turn out to be engine-specific. Even worse is a dictionary package that is application-specific (sadly, this happens). And although converting between different dictionaries and dictionary packages is often not too hard, it requires work.
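As an illustration of why a dictionary package is engine-specific, the sketch below wraps a three-word toy dictionary in the two-file layout that Hunspell reads (a .dic word list plus an .aff affix file). The words, the file name toy, and the single plural rule are invented for the example. Ispell and Aspell expect different file formats, which is where the conversion work comes from.

```shell
#!/bin/sh
# Build a toy dictionary package in Hunspell's two-file layout.
pkgdir=$(mktemp -d)

# toy.aff: the encoding, plus one suffix rule "S" that appends "s" to a stem.
cat > "$pkgdir/toy.aff" <<'EOF'
SET UTF-8
SFX S Y 1
SFX S 0 s .
EOF

# toy.dic: the first line is the approximate word count; "/S" attaches rule S.
cat > "$pkgdir/toy.dic" <<'EOF'
3
word/S
engine/S
spelling
EOF

# If Hunspell happens to be installed, list the misspelled words of a sample.
if command -v hunspell >/dev/null 2>&1; then
    printf 'words engines spellings\n' | hunspell -d "$pkgdir/toy" -l ||
        echo "hunspell could not load the toy package" >&2
fi
```

The same three words would have to be repackaged before another engine could use them. Remove the scratch directory with rm -r "$pkgdir" when done.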
1.4. Projects-Sources
Things do not appear out of nowhere in this world. Everything is done by some people driven by different motives.
This section lists several projects that continue improving natural language support in computing.
- Ispell-Project
- https://www.cs.hmc.edu/~geoff/ispell.html – English dictionary and the oldest API for spelling.
- Moby Word List (by dwyl)
- https://github.com/dwyl/english-words, originally from https://www.gutenberg.org/ebooks/3201 – English Dictionary
- Alexander Lebedev
- http://scon155.phys.msu.su/eng/lebedev.html – Russian Dictionary 2004, ye, yo, both ftp://scon155.phys.msu.su/pub/russian/ispell/rus-ispell.tar.gz
- Konstantin Knizhnik
- http://www.garret.ru/~knizhnik – Russian Dictionary http://www.garret.ru/~knizhnik/rispell.tar.gz
- Aspell-Project
- https://ftp.gnu.org/gnu/aspell/dict/0index.html – Dictionaries (Has Lebedev’s 2004 dictionary for Russian)
- Alexander Slovesnik
- https://addons.mozilla.org/en-US/firefox/addon/russian-spellchecking-dic-3703/, repack of Alexander Lebedev’s 3 dictionaries for Firefox
- Alexander Klukvin
- https://code.google.com/archive/p/hunspell-ru/, https://addons.mozilla.org/en-US/firefox/addon/russian-hunspell-dictionary/ 2013, improvement of the Lebedev’s dictionary. Hunspell format. Also has https://addons.mozilla.org/ru/firefox/addon/russian-hunspell-dictionary/, https://sites.google.com/site/dictru/
- Russian Friends of Hunspell
- https://mozilla-russia.org/projects/dictionary/hunspell.html
- Kek
- https://addons.mozilla.org/en-US/firefox/addon/%D1%81%D0%BB%D0%BE%D0%B2%D0%B0%D1%80%D1%8C-%D0%BE%D1%80%D1%84%D0%BE%D0%B3%D1%80%D0%B0%D1%84%D0%B8%D0%B8-%D0%B2%D0%B8%D0%BA%D0%B8/ – repacking of Wiktionary for Hunspell
- Wiktionary
- www.wiktionary.org
- Hunspell
- http://hunspell.github.io/ has dictionaries, and such.
- Enchant
- https://abiword.github.io/enchant/ a unified api for spellcheckers
- LibreOffice
- https://cgit.freedesktop.org/libreoffice/dictionaries/tree/, https://extensions.libreoffice.org/?Tag%5B0%5D=50&q=
- Nuspell
- https://github.com/nuspell
- AOT Project
- https://github.com/sokirko74/aot , has dictionary too, Yakov’s dictionary https://addons.mozilla.org/ru/firefox/addon/russian-spellcheck-dict-aot/ 2015
- Mozilla-Russia Unified Dictionary
- https://forum.mozilla-russia.org/viewtopic.php?id=75564
- Alexander Petrenas
- https://addons.mozilla.org/ru/firefox/addon/unified-russian-english-spell/ , United Dictionary
- Apache OpenOffice
- https://extensions.openoffice.org
- myooo
- http://myooo.ru/
- LanguageTool
- https://www.languagetool.org/ , Grammar checker (!), extensions for Chrome, Firefox, LibreOffice
1.5. Dictionaries
1.6. Engines
1.6.1. Ispell
The oldest and the most widespread engine. Slackware only has an English dictionary for it.
It is very old, and only checks spelling on a word-by-word basis. It has some limited form of morphology, called “affixes”. Still, its interface is what everyone tries to emulate. https://www.cs.hmc.edu/~geoff/ispell.html
By default it has only one dictionary, American English. You can download more here: https://www.cs.hmc.edu/~geoff/ispell-dictionaries.html#Russian-dicts
1.6.2. Aspell
Supposedly the best engine for English, but Slackware has dictionaries for most languages.
The Russian one is from Lebedev.
1.6.3. myspell
The predecessor of Hunspell. Hunspell is based on it and still reads MySpell dictionaries.
1.6.4. Hunspell
Supposedly, the best spellchecker for all languages, except, maybe, English.
1.6.5. TODO Nuspell, an improved version of Hunspell.
Package for Slackware.
1.6.6. Enchant
A meta-checker that is ispell-compatible, but can use other engines. Should be used if the software allows you to choose a language.
~/.config/enchant/enchant.ordering
*:nuspell,hunspell,aspell,ispell
en:aspell,hunspell,nuspell
en_GB:aspell,hunspell,nuspell
Check for a dictionary:
enchant-lsmod-2 -list-dicts | grep ru
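Before tuning the ordering, it helps to know which engines are installed at all. A small sketch, assuming the usual command names (ispell, aspell, hunspell, nuspell, enchant-2); the names on your distribution may differ, and the function name list_engines is made up.

```shell
#!/bin/sh
# Report which spellchecking commands are on PATH.
# The command names are assumptions; adjust them for your distribution.
list_engines() {
    for cmd in "$@"; do
        if command -v "$cmd" >/dev/null 2>&1; then
            echo "$cmd: found"
        else
            echo "$cmd: missing"
        fi
    done
}

list_engines ispell aspell hunspell nuspell enchant-2 enchant-lsmod-2
```

An engine reported as missing can stay in enchant.ordering, but it cannot be the one that serves the dictionary.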
1.6.7. LanguageTool
A grammar checker, supports English and Russian.
A great tool, actually. The Emacs package is of great quality. It eats a lot of CPU, though.
It is available on slackbuilds.org.
1.6.8. proselint
A style checker, quite strong.
https://gitlab.com/Lockywolf/lwfslackbuilds/-/tree/master/proselint
flycheck has built-in support for proselint.
You need to enable it in Emacs, using
(flycheck-verify-setup)
1.6.9. textlint
On Slackware you need nodeenv, packaged at https://gitlab.com/Lockywolf/lwfslackbuilds/-/tree/master/nodeenv
Auxiliary (but enormous) checker for everything text.
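DATE=$(date --iso)
# Braces are needed: "$DATE_textlint" would expand an unset variable named DATE_textlint.
DNAME="${DATE}_textlint"
mkdir -p ~/bin
# Assumed step: create the Node environment with nodeenv, so that bin/activate exists.
nodeenv ~/bin/"$DNAME"
source ~/bin/"$DNAME"/bin/activate
npm install --global textlint
npm install --global textlint-rule-rousseau
npm install --global textlint-rule-diacritics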
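textlint does nothing until rules are enabled in a configuration file. Below is a minimal sketch of a configuration for the two rules installed above, written to a scratch location so that an existing ~/.textlintrc is not overwritten. The assumption is the usual JSON form, in which a rule is named without its textlint-rule- prefix.

```shell
#!/bin/sh
# Write a minimal textlint configuration enabling the two installed rules.
# Written to a temporary file on purpose; copy it to ~/.textlintrc if it suits you.
rcfile=$(mktemp)
cat > "$rcfile" <<'EOF'
{
  "rules": {
    "rousseau": true,
    "diacritics": true
  }
}
EOF
echo "wrote $rcfile"
```

With the environment activated, something like textlint --config "$rcfile" notes.txt should then check a file (notes.txt is a placeholder name).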
1.7. APIs
1.7.1. Emacs
- ispell-mode for Emacs
This checks spelling when you ask.
(use-package ispell
  ;; :demand t
  :ensure t
  ;; use-package appends "-hook" itself, so the mode is named without the suffix.
  :hook (tex-mode . (lambda () (setq ispell-parser 'tex)))
  :config
  ;; (ispell-change-dictionary "british-ise-w_accents" t)
  ;; aspell-specific
  ;; Test words (the first is misspelled): превед привет
  (setf ispell-program-name (executable-find "aspell"))
  ;; This will not change the language automatically.
  ;; You still have to select it manually.
  (setq ispell-local-dictionary-alist
        '(("ru-local"
           "[АБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЬЫЪЭЮЯабвгдеёжзийклмнопрстуфхцчшщьыъэюя]"
           "[^АБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЬЫЪЭЮЯабвгдеёжзийклмнопрстуфхцчшщьыъэюя]"
           "[-]" nil ("-d" "ru-yo") nil utf-8)
          ;; aspell
          ("british-local"
           "[A-Za-z]" "[^A-Za-z]" "[']" nil ("-d" "en_GB-ise") nil utf-8)))
  (ispell-change-dictionary "british-local" t)
  (setq ispell-silently-savep t))
ispell-dictionary-alist is probably not the best here. In Russian files, switch the language by using ispell-change-dictionary. Should be configured in some weird way in order to check both languages.
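The two dictionary names passed with -d in the configuration above (ru-yo and en_GB-ise) have to exist in your Aspell installation, otherwise switching to them will not work. A hedged sketch for checking that from the shell: the helper missing_dicts is made up, and aspell dump dicts is used to list the installed dictionaries.

```shell
#!/bin/sh
# missing_dicts NAME...: read installed dictionary names on stdin (one per line)
# and print each NAME that is not among them.
missing_dicts() {
    installed=$(cat)
    for name in "$@"; do
        printf '%s\n' "$installed" | grep -q -x -F -- "$name" || echo "$name"
    done
}

# Only query Aspell if it is installed.
if command -v aspell >/dev/null 2>&1; then
    aspell dump dicts | missing_dicts ru-yo en_GB-ise
fi
```

No output means both dictionaries are present. A quick end-to-end check is to pipe the two test words from the configuration comment through aspell -d ru-yo list, which prints the words Aspell does not recognise.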
- flyspell-mode
(use-package flyspell
  :demand t
  :ensure t
  :hook ((text-mode . flyspell-mode)
         (prog-mode . flyspell-prog-mode))
  :config
  (diminish 'flyspell-mode "🦋🧙")
  (setf flyspell-use-meta-tab nil))

(use-package flyspell-correct ; -ido
  :ensure t
  :demand t
  :bind (:map flyspell-mode-map
              ("C-;" . flyspell-correct-wrapper))
  :init
  (setq flyspell-correct-interface #'flyspell-correct-popup))
Works, but I would recommend trying spell-fu.
- spell-fu
Only uses aspell, but covers what flyspell-mode and flyspell-prog-mode used to do. It is also very fast.
(use-package spell-fu
  :ensure t
  :demand t
  :config
  (setf spell-fu-faces-exclude '(org-meta-line org-link org-code))
  (global-spell-fu-mode)
  :bind (("C-," . spell-fu-goto-next-error)))
- flycheck-aspell
Quite unfinished.
- languagetool
;;;; LanguageTool
(use-package languagetool
  :ensure t
  :demand t
  :config
  (setf languagetool-language-tool-jar
        "/usr/share/LanguageTool/languagetool-commandline.jar")
  (setq languagetool-java-arguments '("-Dfile.encoding=UTF-8"))
  (setq languagetool-default-language "en-GB")
  (setq languagetool-server-language-tool-jar
        "/usr/share/LanguageTool/languagetool-server.jar")
  (languagetool-server-start))
Add languagetool-server-mode to your hooks, if you have a strong CPU. Otherwise, use languagetool-check.
- flycheck-proselint
(flycheck-verify-setup)
and enable the spelling checker there.
- flycheck-textlint
(use-package flycheck
  :demand t
  :ensure t
  :config
  (setf flycheck-textlint-executable
        "~/binary_software/2021-05-29_textlint/bin/textlint")
  (setf flycheck-textlint-config "~/.textlintrc")
  (add-to-list 'flycheck-checkers 'proselint)
  ;; also enabled somewhere in customize
  (global-flycheck-mode)
  (diminish 'flycheck-mode "🦋✓"))
- guess-language mode from MELPA
- company-wordfreq
- dyncloze
- langtool
- langtool-ignore-fonts
- langtool-popup
- dictionary-mode (built into emacs 28.1)