Spellchecking on Linux with English and Russian (Work In Progress).

This document is likely to become obsolete by ~2024.

Working with natural languages is hard. Many of us remember high school language classes as a horror.

Nevertheless, spellchecking is an expected application for computers. At some point I decided to make spell checking work where I need it to work.

Obviously, it turned out to be hard, and this document was created as a memo on how it was done.

The field is huge and heterogeneous, as Linux is an anarchic system, so I will limit the scope to the following targets:

Languages:

  1. English (British)
  2. Russian
  3. Chinese (if possible)

Applications:

  1. Emacs
  2. Firefox
  3. LibreOffice
  4. OmegaT (if possible)
  5. Jabber Messaging (Dino, Pidgin, maybe others)

References:

1. Body

1.1. Methods of spellchecking

There are, at the moment, essentially three approaches to checking spelling.

  • Rule-based
  • Machine-Learning based
  • Human-based (or outsourced)

A lot of natural language processing theory focused on rule-based methods, until Internet giants accumulated enough data to train models that outperform even the most advanced rule-based systems. Whenever you can, I suggest using Machine-Learning based methods.

However, when the task gets closer to implementing spellchecking rather than using it, rule-based methods still retain their validity, because they are much easier to hook into your system. Most people, naturally, have no ability to train their own models.
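To illustrate how little it takes to hook a dictionary-based check into a system, here is a minimal sketch of the simplest possible form, a plain word-list lookup. The function name unknown_words and the word-list file are hypothetical, and the sketch only understands ASCII letters, so it is an illustration rather than a usable checker for Russian.

```shell
# Print every word from stdin that is absent from the word list in $1.
# Assumption: the word list holds one lowercase word per line.
unknown_words() {
    wordlist=$1
    # Split on anything that is not a letter or an apostrophe, lowercase,
    # drop empty lines, deduplicate, then keep only words not in the list.
    tr -cs "[:alpha:]'" '\n' \
        | tr '[:upper:]' '[:lower:]' \
        | grep -v '^$' \
        | sort -u \
        | grep -vxFf "$wordlist" || true
}
```

It could be used as `unknown_words words.txt < letter.txt`. Real engines add affix rules and suggestions on top of this idea.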

1.2. Outsourcing

There are services that let you submit your text to a remote party for checking. This may be better in some cases – remote parties may have better resources for developing spellchecking methods. However, sometimes the Internet connection is not very good, and sometimes you are not allowed to share your texts.

1.3. Spell-checking components

A spell checker essentially consists of four components:

  1. Dictionary (sometimes grammar database)
  2. Dictionary package
  3. Engine
  4. Application interface

The difference between a dictionary and a dictionary package usually comes from the breakage of an abstraction barrier: dictionary packages turn out to be engine-specific. It is even worse when a dictionary package is application-specific (sadly, this happens). And although converting between different dictionaries and dictionary packages is often not too hard, it requires work.
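To make "engine-specific dictionary package" concrete, here is a sketch that writes a toy package in the Hunspell format: an affix file (.aff) and a word list (.dic). The function name, the file name tiny and the two words are made up for illustration; real packages are far larger.

```shell
# Write a two-word Hunspell dictionary package (tiny.aff + tiny.dic) into DIR.
make_tiny_hunspell_dict() {
    dir=$1
    mkdir -p "$dir"
    # Affix file: the encoding, then one suffix rule "S" that appends "s".
    printf '%s\n' 'SET UTF-8' 'SFX S Y 1' 'SFX S 0 s .' > "$dir/tiny.aff"
    # Word list: the first line is the entry count; "/S" applies rule S.
    printf '%s\n' '2' 'cat/S' 'dog/S' > "$dir/tiny.dic"
}
```

If Hunspell is installed, `make_tiny_hunspell_dict /tmp/tiny && echo "cats dogz" | hunspell -d /tmp/tiny/tiny -l` lists the words this toy dictionary does not accept. Aspell and Ispell would need the same words repackaged in their own formats.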

1.4. Projects-Sources

Things do not appear out of nowhere in this world. Everything is done by some people driven by different motives.

This section lists several projects that continue improving natural language support in computing.

Ispell-Project
https://www.cs.hmc.edu/~geoff/ispell.html – English dictionary and the oldest API for spelling.
Moby Word List (by dwyl)
https://github.com/dwyl/english-words, originally from https://www.gutenberg.org/ebooks/3201 – English Dictionary
Alexander Lebedev
http://scon155.phys.msu.su/eng/lebedev.html – Russian Dictionary (2004), in ye, yo and combined variants: ftp://scon155.phys.msu.su/pub/russian/ispell/rus-ispell.tar.gz
Konstantin Knizhnik
http://www.garret.ru/~knizhnik – Russian Dictionary http://www.garret.ru/~knizhnik/rispell.tar.gz
Aspell-Project
https://ftp.gnu.org/gnu/aspell/dict/0index.html – Dictionaries (Has Lebedev’s 2004 dictionary for Russian)
Alexander Slovesnik
https://addons.mozilla.org/en-US/firefox/addon/russian-spellchecking-dic-3703/, repack of Alexander Lebedev’s 3 dictionaries for Firefox
Alexander Klukvin
https://code.google.com/archive/p/hunspell-ru/, https://addons.mozilla.org/en-US/firefox/addon/russian-hunspell-dictionary/ 2013, improvement of the Lebedev’s dictionary. Hunspell format. Also has https://addons.mozilla.org/ru/firefox/addon/russian-hunspell-dictionary/, https://sites.google.com/site/dictru/
Russian Friends of Hunspell
https://mozilla-russia.org/projects/dictionary/hunspell.html
Kek
https://addons.mozilla.org/en-US/firefox/addon/%D1%81%D0%BB%D0%BE%D0%B2%D0%B0%D1%80%D1%8C-%D0%BE%D1%80%D1%84%D0%BE%D0%B3%D1%80%D0%B0%D1%84%D0%B8%D0%B8-%D0%B2%D0%B8%D0%BA%D0%B8/ – repacking of Wiktionary for Hunspell
Wiktionary
www.wiktionary.org
Hunspell
http://hunspell.github.io/ – has dictionaries and related material.
Enchant
https://abiword.github.io/enchant/ – a unified API for spellcheckers
LibreOffice
https://cgit.freedesktop.org/libreoffice/dictionaries/tree/, https://extensions.libreoffice.org/?Tag%5B0%5D=50&q=
Nuspell
https://github.com/nuspell
AOT Project
https://github.com/sokirko74/aot , has dictionary too, Yakov’s dictionary https://addons.mozilla.org/ru/firefox/addon/russian-spellcheck-dict-aot/ 2015
Mozilla-Russia Unified Dictionary
https://forum.mozilla-russia.org/viewtopic.php?id=75564
Alexander Petrenas
https://addons.mozilla.org/ru/firefox/addon/unified-russian-english-spell/ , United Dictionary
Apache OpenOffice
https://extensions.openoffice.org
myooo
http://myooo.ru/
LanguageTool
https://www.languagetool.org/ , Grammar checker (!), extensions for Chrome, Firefox, LibreOffice

1.6. Engines

1.6.1. Ispell

The oldest and the most widespread engine. Slackware only has an English dictionary for it.

It is very old, and only checks spelling on a word-by-word basis. It has some limited form of morphology, called “affixes”. Still, its interface is what everyone tries to emulate. https://www.cs.hmc.edu/~geoff/ispell.html

By default it only has one dictionary, for American English. You can download more here: https://www.cs.hmc.edu/~geoff/ispell-dictionaries.html#Russian-dicts
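The interface that other engines emulate is the pipe mode (-a): the engine prints one status line per word, where * means "correct" and lines starting with &, ? or # describe a word that was not found. A minimal sketch of a consumer for that output follows; the function name is hypothetical.

```shell
# Read ispell/aspell "-a" output on stdin and print the rejected words.
# "&" = not found, with suggestions; "?" = not found, with guesses;
# "#" = not found, no suggestions. The second field is the original word.
misspelled_from_pipe() {
    awk '$1 == "&" || $1 == "?" || $1 == "#" { print $2 }'
}
```

With aspell and the British dictionary used elsewhere in this document, it could be called as `echo "helo world" | aspell -a -d en_GB-ise | misspelled_from_pipe`.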

1.6.2. Aspell

Supposedly the best engine for English, but Slackware has dictionaries for most languages.

The Russian one is from Lebedev.
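The Emacs configuration later in this document asks aspell for the dictionaries ru-yo and en_GB-ise, so it is worth checking that they are installed. A small sketch, assuming the list of installed names comes from `aspell dump dicts`; the helper name missing_dicts is hypothetical.

```shell
# Read a list of installed dictionary names on stdin (one per line) and
# print each name given as an argument that is not in that list.
missing_dicts() {
    available=$(cat)
    for d in "$@"; do
        printf '%s\n' "$available" | grep -qxF -- "$d" || printf '%s\n' "$d"
    done
}
```

Usage: `aspell dump dicts | missing_dicts ru-yo en_GB-ise`. Empty output means both dictionaries are present.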

1.6.3. myspell

The predecessor of Hunspell. Hunspell is based on it and can read MySpell dictionaries.

1.6.4. Hunspell

Supposedly, the best spellchecker for all languages, except, maybe, English.

1.6.5. TODO Nuspell, an improved version of Hunspell.

Package for Slackware.

1.6.6. Enchant

A meta-checker that is ispell-compatible, but can use other engines. Should be used if the software allows you to choose a language.

~/.config/enchant/enchant.ordering

*:nuspell,hunspell,aspell,ispell
en:aspell,hunspell,nuspell
en_GB:aspell,hunspell,nuspell

Check for a dictionary:

enchant-lsmod-2 -list-dicts | grep ru
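The ordering file shown above can be created by hand or with a small helper. The sketch below writes exactly the three lines from this section; the function name is hypothetical, and it overwrites an existing enchant.ordering without asking.

```shell
# Write the provider ordering from this section into DIR/enchant.ordering.
# DIR defaults to ~/.config/enchant, the location used above.
write_enchant_ordering() {
    dir=${1:-"$HOME/.config/enchant"}
    mkdir -p "$dir"
    printf '%s\n' \
        '*:nuspell,hunspell,aspell,ispell' \
        'en:aspell,hunspell,nuspell' \
        'en_GB:aspell,hunspell,nuspell' > "$dir/enchant.ordering"
}
```

Run `write_enchant_ordering` once, then repeat the enchant-lsmod-2 check above to see whether a Russian dictionary is visible.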

1.6.7. LanguageTool

A grammar checker, supports English and Russian.

A great tool, actually, and the Emacs package is of great quality. It eats a lot of CPU, though.

Is on slackbuilds.org
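LanguageTool can also be run from a terminal, which is handy for testing it before wiring it into an editor. A minimal sketch, assuming the jar lives at the path used in the Emacs configuration later in this document; the wrapper name lt_check and the LT_JAR variable are hypothetical.

```shell
# Path to the command-line jar; override by exporting LT_JAR beforehand.
LT_JAR=${LT_JAR:-/usr/share/LanguageTool/languagetool-commandline.jar}

# Check FILE with LanguageTool; the language defaults to British English.
lt_check() {
    file=$1
    lang=${2:-en-GB}
    java -Dfile.encoding=UTF-8 -jar "$LT_JAR" -l "$lang" "$file"
}
```

For example, `lt_check notes.txt` checks an English file and `lt_check notes.txt ru-RU` a Russian one.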

1.6.8. proselint

Stylistics checker, quite strong.

https://gitlab.com/Lockywolf/lwfslackbuilds/-/tree/master/proselint

flycheck has built-in support for proselint.

You need to enable it in Emacs, using

(flycheck-verify-setup)

1.6.9. textlint

On Slackware you need nodeenv, packaged at https://gitlab.com/Lockywolf/lwfslackbuilds/-/tree/master/nodeenv

Auxiliary (but enormous) checker for everything text.

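DATE=$(date --iso-8601)
# Braces are required: "$DATE_textlint" would expand an unset variable.
DNAME="${DATE}_textlint"
mkdir -p ~/bin
# Create the Node.js environment before activating it (nodeenv makes the directory itself).
nodeenv ~/bin/"$DNAME"
source ~/bin/"$DNAME/bin/activate"
npm install --global textlint
npm install --global textlint-rule-rousseau
npm install --global textlint-rule-diacritics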
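Installed rules do nothing until a configuration file enables them. The sketch below writes a minimal ~/.textlintrc (the file the Emacs configuration later in this document points at) that turns on the two rules installed above; the function name is hypothetical, and it overwrites an existing file.

```shell
# Write a minimal textlint configuration enabling the two installed rules.
# Rule names are the package names without the "textlint-rule-" prefix.
write_textlintrc() {
    rc=${1:-"$HOME/.textlintrc"}
    printf '%s\n' \
        '{' \
        '  "rules": {' \
        '    "rousseau": true,' \
        '    "diacritics": true' \
        '  }' \
        '}' > "$rc"
}
```

After `write_textlintrc`, a file can be checked with `textlint --config ~/.textlintrc notes.txt` inside the activated environment.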

1.7. APIs

1.7.1. Emacs

  1. ispell-mode for Emacs

    This checks spelling when you ask.

    (use-package ispell
      :demand t
      :ensure t
      :hook
      ;; use-package appends "-hook" itself, so name the mode, not the hook
      (tex-mode . (lambda () (setq ispell-parser 'tex)))
      :config
      ;; (ispell-change-dictionary "british-ise-w_accents" t)  ;; aspell-specific
      ;; test words: превед привет
      (setf ispell-program-name (executable-find "aspell"))
      ;; This will not change the language automatically. You still have to select it manually.
      (setq ispell-local-dictionary-alist
            '(("ru-local"
               "[АБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЬЫЪЭЮЯабвгдеёжзийклмнопрстуфхцчшщьыъэюя]"
               "[^АБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЬЫЪЭЮЯабвгдеёжзийклмнопрстуфхцчшщьыъэюя]"
               "[-]" nil ("-d" "ru-yo") nil utf-8) ;; aspell
              ("british-local"
               "[A-Za-z]" "[^A-Za-z]"
               "[']" nil ("-d" "en_GB-ise") nil utf-8)))
      (ispell-change-dictionary "british-local" t)
      (setq ispell-silently-savep t))
    

    ispell-dictionary-alist is probably not the best here.

    In Russian files, switch the language by using ispell-change-dictionary.

    Checking both languages in the same buffer needs extra, non-obvious configuration.

  2. flyspell-mode
    (use-package flyspell
      :demand t
      :ensure t
      :hook
      ((text-mode . flyspell-mode)
       (prog-mode . flyspell-prog-mode))
      :config
      (diminish 'flyspell-mode "🦋🧙")
      (setf flyspell-use-meta-tab nil))
    (use-package flyspell-correct ; -ido
      :ensure t
      :demand t
      :bind
      (:map flyspell-mode-map
            ("C-;" . flyspell-correct-wrapper))
      :init
      (setq flyspell-correct-interface #'flyspell-correct-popup))
    

    Works, but I would recommend trying spell-fu.

  3. spell-fu

    Only uses aspell, but covers what flyspell-mode and flyspell-prog-mode used to do. It is also very fast.

    (use-package spell-fu
                 :ensure t
                 :demand t
                 :config
                 (setf spell-fu-faces-exclude '(org-meta-line org-link org-code))
                 (global-spell-fu-mode)
                 :bind
                 (("C-," . spell-fu-goto-next-error)))
    
    
  4. flycheck-aspell

    Quite unfinished.

  5. languagetool
    ;;;; LanguageTool
    (use-package languagetool
                  :ensure t
                  :demand t
                  :config
                  (setf languagetool-language-tool-jar
                     "/usr/share/LanguageTool/languagetool-commandline.jar")
                  (setq languagetool-java-arguments '("-Dfile.encoding=UTF-8"))
                  (setq languagetool-default-language "en-GB")
                  (setq languagetool-server-language-tool-jar "/usr/share/LanguageTool/languagetool-server.jar")
                  (languagetool-server-start))
    

    Add languagetool-server-mode to your hooks, if you have a strong CPU.

    Otherwise, use languagetool-check.

  6. flycheck-proselint
    (flycheck-verify-setup)
    

    and enable the proselint checker there.

  7. flycheck-textlint
    (use-package flycheck
                 :demand t
                 :ensure t
                 :config
                 (setf flycheck-textlint-executable "~/binary_software/2021-05-29_textlint/bin/textlint")
                 (setf flycheck-textlint-config "~/.textlintrc")
                 (add-to-list 'flycheck-checkers 'proselint) ;; also enabled somewhere in customize
                 (global-flycheck-mode)
                 (diminish 'flycheck-mode "🦋✓"))
    
  8. guess-language mode from MELPA
  9. company-wordfreq
  10. dyncloze
  11. langtool
  12. langtool-ignore-fonts
  13. langtool-popup
  14. dictionary-mode (built into emacs 28.1)