Build a committed Russian common-noun word list (dictprep/russian/scrabble.txt) from the RAN orthographic dictionary, for the Эрудит ruleset.

- Stage 1 (Go, dictprep/ruwords): orfo_dict_2025.txt -> all.txt; extracts headwords, reconstructs "ед." singulars (suppressing plurals), pairs "и" variants.
- Stage 2 (Python brain, dictprep/ru_stage2.py): OpenCorpora (mawo-pymorphy3) + libmorph + orthographic notes select common nouns (nom. sing.); --trace explains a word's fate, --dump writes the in-memory buckets.
- libmorph C++ bridge (libmorph_check.cpp); manual_confirm.txt is merged in.
- orfo_dict_2025.txt is the committed pdftotext source of truth.
- See dictprep/README.md for methodology and reproducibility.
Russian word-list preparation (dictprep)
Builds the Russian noun word list for the Scrabble/Эрудит solver out of the official Russian academic orthographic dictionary, cross-checked against two independent morphological dictionaries.
The goal of the pipeline is a list of common nouns in the nominative singular
(dictprep/russian/scrabble.txt), plus an ambiguous tail for manual review.
This directory is self-contained tooling for building the word list. It is not part of the solver library. The committed result lives in
dictprep/russian/.
Source
orfo_dict_2025.pdf — Русский орфографический словарь РАН (≈ 200 000 entries), the
authority for spelling. It encodes declension type in its grammatical notes but does
not reliably mark part of speech.
- Source: https://ruslang.ru/sites/default/files/doc/normativnyje_slovari/orfograficheskij_slovar.pdf
- Mirror: https://rus-gos.spbu.ru/index.php/dictionary
The PDF is git-ignored (large, third-party); place it here as orfo_dict_2025.pdf. Its
pdftotext output is committed as russian/orfo_dict_2025.txt, so the word list rebuilds
from the text alone — the binary PDF is needed only to regenerate that text.
Outputs (dictprep/russian/)
Four files are committed (marked ✓ below); every other bucket stays in the Stage-2
process's memory (dump it with --dump, query it with --trace WORD).
| File | Committed | Meaning |
|---|---|---|
| orfo_dict_2025.txt | ✓ | the pdftotext output, the parsed source of truth (the PDF binary is not needed to rebuild). |
| all.txt | ✓ | Stage 1 base: every clean Cyrillic headword/variant; a plural headword with a singular is replaced by that singular. |
| manual_confirm.txt | ✓ | hand-reviewed nouns from the undefined tail; the brain merges them into the result. |
| scrabble.txt | ✓ | Stage 2 result: common nouns, nominative singular (+ pluralia tantum), length 2–15; the working dictionary. |
| undefined.txt | ✗ | the ambiguous tail; kept in memory, written only with --dump. |
--dump also writes adjectives.txt, verbs.txt, singulars.txt and fate.tsv (every
word with the reason it did or did not reach the dictionary); these are git-ignored debug
artifacts. Stage 1 also writes /tmp/ru_{skip,singulars,variants}.txt, intermediate inputs
the brain consumes.
Prerequisites
# 1. pdftotext (Poppler)
sudo apt-get install -y poppler-utils
# 2. Go toolchain (Stage 1) — already required by the parent module
# 3. Python + the OpenCorpora analyser (Stage 2)
sudo apt-get install -y python3-venv python3-pip
python3 -m venv ru-venv
ru-venv/bin/pip install mawo-pymorphy3 # bundles OpenCorpora 2025 (words.dawg)
# 4. libmorph — the independent morphological dictionary (Stage 2 cross-check)
sudo apt-get install -y morphrus morphrus-dev moonycode-dev morphapi-dev
g++ -std=c++17 -O2 dictprep/libmorph_check.cpp -lmorphrus -lmoonycode -o dictprep/libmorph_check
If dictprep/libmorph_check is absent, Stage 2 still runs — it simply drops libmorph from
the stack and reports libmorph_helper=MISSING.
How to run
# Stage 0 — PDF -> plain text (committed as the source of truth; run once)
pdftotext dictprep/orfo_dict_2025.pdf dictprep/russian/orfo_dict_2025.txt
# Stage 1 — build the base word list (Go): dictprep/russian/all.txt + /tmp/ru_*.txt
go run ./dictprep/ruwords
# Stage 2 — the brain (Python + mawo + libmorph): writes scrabble.txt
ru-venv/bin/python dictprep/ru_stage2.py
# ask how a word did or did not reach the dictionary
ru-venv/bin/python dictprep/ru_stage2.py --trace травмпункт
# also write the in-memory buckets (undefined, adjectives, verbs, singulars, fate.tsv)
ru-venv/bin/python dictprep/ru_stage2.py --dump
The Stage 1 flags -from/-to (defaults 452 and 168808) bound the columnar word-list section of
russian/orfo_dict_2025.txt: line 452 is the first entry (а1, …) and line 168808 is the last
(я́щурный). The preface above line 452 is prose and is skipped. Verify these bounds if the
PDF is re-exported.
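A quick sanity check for these bounds after a re-export could look like the sketch below. It assumes the bounds are 1-indexed line numbers and that the first and last entries still begin with а and я́щурный; the helper name `check_bounds` is hypothetical and not part of ruwords.

```python
STRESS_MARKS = str.maketrans("", "", "\u0300\u0301")  # combining grave / acute


def check_bounds(lines, first=452, last=168808):
    """Return True if the 1-indexed lines `first` and `last` look like the
    first and last dictionary entries (assumed: 'а...' and 'ящурный...')."""
    if len(lines) < last or first < 1:
        return False
    # lstrip() also removes the form-feed pdftotext emits at a page top.
    head = lines[first - 1].translate(STRESS_MARKS).lstrip()
    tail = lines[last - 1].translate(STRESS_MARKS).lstrip()
    return head.startswith("а") and tail.startswith("ящурный")
```

When loading the file for this check, split on "\n" explicitly (for example `text.split("\n")`): Python's `str.splitlines()` also treats the form-feed character as a line boundary, which would shift the line numbers.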
Algorithm
Stage 1 — ruwords (Go)
Per dictionary line in [from, to] it collects, normalised (stress marks U+0300/U+0301
stripped, lowercased, ё kept, hyphenated/capitalised/non-Cyrillic rejected):
- the headword (leading token). Leading whitespace, including the form-feed \f that pdftotext puts at every page top, is trimmed; otherwise the first headword of each page is lost;
- the singular of a plural headword when the entry gives it after ед., in full (ящеры, …, ед. ящер) or as a replacement suffix (…, ед. -вец, spliced where the suffix best overlaps the headword); the plural is then dropped (a plural that has a singular is never needed) and the singular is also recorded (/tmp/ru_singulars.txt);
- variant headwords after и that carry their own grammatical note (аблатив, -а и аблятив, -а; регги и реггей, нескл.), excluding inflected forms.
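The normalisation rules named above can be illustrated in Python (the real Stage 1 is written in Go). This is a minimal sketch under assumptions: the function name `normalise` is hypothetical, the caller is assumed to have already split off trailing punctuation, and "capitalised" is read as "first letter is upper case".

```python
import re

STRESS_MARKS = str.maketrans("", "", "\u0300\u0301")  # combining grave / acute
CYRILLIC_WORD = re.compile(r"[а-яё]+")                 # ё is kept as its own letter


def normalise(token):
    """Return the cleaned headword, or None if the token is rejected."""
    # strip() removes ordinary whitespace and the page-top form-feed (\f).
    word = token.strip().translate(STRESS_MARKS)
    if not word:
        return None
    if word[0].isupper():          # capitalised: rejected
        return None
    word = word.lower()
    if "-" in word:                # hyphenated: rejected
        return None
    if not CYRILLIC_WORD.fullmatch(word):
        return None                # anything outside а-я/ё: rejected
    return word
```

Stripping the form-feed before anything else is the step that keeps the first headword of each page from being lost.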
Everything else (every maximal Cyrillic token not selected above) goes to
/tmp/ru_skip.txt, a safety net for a later morphology re-check.
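The ед. reconstruction from the Stage 1 list can be sketched as follows. How ruwords actually decides where a suffix "best overlaps" is not spelled out here, so this sketch assumes one plausible reading: pick the position in the plural whose following letters share the longest prefix with the suffix, preferring the rightmost position on ties. The function name and the test words литовцы and австрийцы are illustrative choices, not taken from the pipeline.

```python
def singular_from_note(plural, form):
    """Rebuild the singular from the form given after 'ед.'.

    `form` is either a full word ('ящер') or a replacement suffix ('-вец').
    """
    if not form.startswith("-"):
        return form                       # given in full: use as-is
    suffix = form[1:]
    best_pos, best_len = len(plural), 0   # fallback (assumption): append
    for pos in range(len(plural)):
        overlap = 0
        for a, b in zip(plural[pos:], suffix):
            if a != b:
                break
            overlap += 1
        # '>=' keeps the rightmost position among equally good overlaps.
        if overlap and overlap >= best_len:
            best_pos, best_len = pos, overlap
    return plural[:best_pos] + suffix
```

For example, `singular_from_note("литовцы", "-вец")` splices at the в and yields литовец.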
Stage 2 — ru_stage2.py (Python)
Each Stage-1 word (length 2–15) is routed by three sources, most authoritative first:
- OpenCorpora (words.dawg, read directly, not the predictor): a common-noun reading ⇒ keep the OpenCorpora lemma. The full OpenCorpora common-noun lexicon is also added (so nouns absent from the PDF are included).
- libmorph (independent dictionary, via libmorph_check): a common-noun reading ⇒ keep the libmorph lemma. The two dictionaries are treated as complementary: a noun reading in either is enough (their disagreements were reviewed and resolved this way, since each is incomplete in different places). A singular reconstructed from "ед." that neither dictionary knows is accepted as a noun (the orthographic note attests it).
- A word both dictionaries miss is classified by the orthographic note (-ая, -ое ⇒ adjective; -ть, сов./несов. ⇒ verb; single genitive -а/-и or нескл., м./ж./с. ⇒ noun). A note-noun goes straight to scrabble.txt; an adjective or verb is dropped; anything undecided goes to undefined.txt.
- Variant rescue: when the dictionary joins two spellings with "и" (травмопункт и травмпункт, регги и реггей) and one is already a confirmed noun, the other is moved from review/undefined into the result as well, propagated transitively through chains. The plural-form variants the dictionaries already resolve never reach this step.
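The transitive propagation in the variant-rescue step amounts to a graph traversal over the "и" pairs. The sketch below shows that idea only; the function name `rescue_variants` is hypothetical, and it ignores which bucket a variant currently sits in, which the real Stage 2 may well take into account.

```python
from collections import defaultdict, deque


def rescue_variants(confirmed, pairs):
    """Return `confirmed` plus every spelling reachable from it via 'и' pairs."""
    graph = defaultdict(set)
    for a, b in pairs:                 # the pairing is symmetric
        graph[a].add(b)
        graph[b].add(a)

    result = set(confirmed)
    queue = deque(w for w in confirmed if w in graph)
    while queue:                       # breadth-first walk along variant chains
        word = queue.popleft()
        for other in graph[word]:
            if other not in result:
                result.add(other)
                queue.append(other)
    return result
```

With the pair (травмопункт, травмпункт) and травмопункт confirmed, травмпункт is rescued; a pair in which neither spelling is confirmed, such as (регги, реггей) on its own, is left untouched.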
The nominative singular always comes from the dictionary that recognised the word, or from
the orthographic ед. note — never from a predictor guess (libmorph and the predictor
mis-lemmatise out-of-dictionary words, e.g. витебчане → витебчан instead of витебчанин).
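The precedence described in this section, including where the lemma comes from, can be summarised in a small routing function. This is a sketch under assumptions, not the code of ru_stage2.py: the function name, its parameters and the bucket labels ("scrabble", "dropped", "undefined", "skipped") are hypothetical, and the dictionary lookups and note classification are taken as precomputed inputs.

```python
def route(word, oc_lemma=None, lm_lemma=None, ed_singular=False, note_class=None):
    """Return (bucket, lemma) for one Stage-1 word.

    oc_lemma / lm_lemma: lemma of a common-noun reading in OpenCorpora /
    libmorph, or None if that dictionary has no such reading.
    ed_singular: the word was reconstructed from an 'ед.' note.
    note_class: 'noun', 'adjective', 'verb' or None (undecided).
    """
    if not 2 <= len(word) <= 15:
        return ("skipped", None)         # outside the routed length range
    if oc_lemma is not None:             # most authoritative source first
        return ("scrabble", oc_lemma)
    if lm_lemma is not None:             # a noun reading in either is enough
        return ("scrabble", lm_lemma)
    if ed_singular or note_class == "noun":
        return ("scrabble", word)        # attested by the orthographic note
    if note_class in ("adjective", "verb"):
        return ("dropped", None)
    return ("undefined", None)           # left for manual review
```

Note that no branch derives a lemma by guessing: it is either a dictionary lemma or the word exactly as the orthographic dictionary gives it.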
The libmorph bridge — libmorph_check.cpp
libmorph (A. Kovalenko, MIT) ships as libmorphrus.so. libmorph_check is a thin
stdin→stdout filter: one UTF-8 word per line in, one line out:
<known>\t<pos>:<lemma>\t<pos>:<lemma>...
<known> is CheckWord (1 = in the dictionary). <pos> is wdInfo & 0x3f, the part of
speech. The codes were reverse-engineered (the docs omit the table):
| codes | part of speech |
|---|---|
| 7–21, 24 | noun (all genders / declensions / animacy; pluralia tantum is 24) |
| 1–3 | verb |
| 25, 27 | adjective |
| 28–32 | pronoun |
| 33–36 | numeral |
| 38–39 | proper noun (excluded) |
| 48–58 | comparative/adverb |
| 49–53 | function words |
The analyser instance is requested with the key libmorph.api.v4:utf-8 so words are
passed and lemmas returned in UTF-8.
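On the Python side, one line of libmorph_check output can be parsed along these lines. This is a minimal sketch, assuming the `<pos>` field is printed as a decimal integer and that readings of words with `<known>` = 0 are ignored; the function names are hypothetical, and only the noun and proper-noun codes from the table are used.

```python
NOUN_CODES = set(range(7, 22)) | {24}   # 7-21 and 24 (pluralia tantum)
PROPER_NOUN_CODES = {38, 39}            # excluded from the word list


def parse_line(line):
    """Split '<known>\\t<pos>:<lemma>\\t...' into (known, [(pos, lemma), ...])."""
    fields = line.rstrip("\n").split("\t")
    known = fields[0] == "1"
    readings = []
    for field in fields[1:]:
        if not field:
            continue
        pos, _, lemma = field.partition(":")
        readings.append((int(pos), lemma))
    return known, readings


def common_noun_lemmas(line):
    """Lemmas of common-noun readings, or [] for out-of-dictionary words."""
    known, readings = parse_line(line)
    if not known:
        return []
    return [lemma for pos, lemma in readings if pos in NOUN_CODES]
```

Dropping every reading of an unknown word mirrors the rule that a lemma never comes from a guess.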
Notes & caveats
- The hard tail (≈ 35 000 Stage-1 words / our candidates) is in no morphological dictionary; only the orthographic dictionary attests them, so the PDF note is the sole signal there. Compound and very recent nouns (робототехник, толкинист) live here.
- OpenCorpora and libmorph are near-equal in size (≈ 99 500 words each on all.txt) and ≈ 96 % overlapping, but complementary (each contributes ≈ 2 200 unique nouns), which is why both are kept. The mawo predictor "knows" ≈ 98 % of everything by guessing and is therefore used only as a weak confirming vote, never as dictionary membership.
- Licensing: OpenCorpora data is CC BY-SA 3.0; libmorph is MIT; the orthographic dictionary has its own copyright. A list derived from CC BY-SA data inherits that licence.