# Russian word-list preparation (`tools`)

Builds the Russian **noun** word list for the Scrabble/Эрудит solver from the official Russian academic **orthographic dictionary**, cross-checked against two independent morphological dictionaries. The goal of the pipeline is a list of **common nouns in the nominative singular** (`sources/scrabble_ru/scrabble.txt`), plus an ambiguous tail for manual review.

> This directory is self-contained tooling for *building* the word list. It is not part
> of the solver library. The committed result lives in `sources/scrabble_ru/`.

## Source

`orfo_dict_2025.pdf`: *Русский орфографический словарь РАН* (≈ 200 000 entries), the authority for **spelling**. It encodes declension type in its grammatical notes but does **not** reliably mark part of speech.

- Source:
- Mirror:

The PDF is git-ignored (large, third-party); place it here as `orfo_dict_2025.pdf`. Its pdftotext output is committed as `sources/scrabble_ru/orfo_dict_2025.txt`, so the word list rebuilds from the text alone; the binary PDF is needed only to regenerate that text.

## Outputs (`sources/scrabble_ru/`)

The committed files are marked ✓ in the table below; every other bucket stays in the memory of the Stage 2 process (write the buckets out with `--dump`, query a single word with `--trace WORD`).

| File | Committed | Meaning |
|------|:--:|---------|
| `orfo_dict_2025.txt` | ✓ | the pdftotext output, the parsed source of truth (the PDF binary is not needed to rebuild). |
| `all.txt` | ✓ | Stage 1 base: every clean Cyrillic headword/variant; a plural headword with a singular is replaced by that singular. |
| `manual_confirm.txt` | ✓ | hand-reviewed nouns from the undefined tail; Stage 2 (the "brain") merges them into the result. |
| `scrabble.txt` | ✓ | **Stage 2 result**: common nouns, nominative singular (+ pluralia tantum), length 2–15. This is the working dictionary. |
| `undefined.txt` | ✗ | the ambiguous tail; kept in memory, written only with `--dump`. |

`--dump` also writes `adjectives.txt`, `verbs.txt`, `singulars.txt` and `fate.tsv` (every word with the reason it did or did not reach the dictionary); these are git-ignored debug artifacts. Stage 1 also writes `/tmp/ru_{skip,singulars,variants}.txt`, intermediate inputs that Stage 2 consumes.

## Prerequisites

```sh
# 1. pdftotext (Poppler)
sudo apt-get install -y poppler-utils

# 2. Go toolchain (Stage 1): already required by the parent module

# 3. Python + the OpenCorpora analyser (Stage 2)
sudo apt-get install -y python3-venv python3-pip
python3 -m venv ru-venv
ru-venv/bin/pip install mawo-pymorphy3   # bundles OpenCorpora 2025 (words.dawg)

# 4. libmorph: the independent morphological dictionary (Stage 2 cross-check)
sudo apt-get install -y morphrus morphrus-dev moonycode-dev morphapi-dev
g++ -std=c++17 -O2 tools/libmorph_check.cpp -lmorphrus -lmoonycode -o tools/libmorph_check
```

If `tools/libmorph_check` is absent, Stage 2 still runs: it drops libmorph from the stack and reports `libmorph_helper=MISSING`.

## How to run

```sh
# Stage 0: PDF -> plain text (committed as the source of truth; run once)
pdftotext tools/orfo_dict_2025.pdf sources/scrabble_ru/orfo_dict_2025.txt

# Stage 1: build the base word list (Go): sources/scrabble_ru/all.txt + /tmp/ru_*.txt
go run ./tools/ruwords

# Stage 2: the brain (Python + mawo + libmorph): writes scrabble.txt
ru-venv/bin/python tools/ru_stage2.py

# ask how a word did or did not reach the dictionary
ru-venv/bin/python tools/ru_stage2.py --trace травмпункт

# also write the in-memory buckets (undefined, adjectives, verbs, singulars, fate.tsv)
ru-venv/bin/python tools/ru_stage2.py --dump
```

`-from`/`-to` (defaulting to 452/168808) bound the column word-list section of `sources/scrabble_ru/orfo_dict_2025.txt`: line 452 is the first entry (`а1, …`) and line 168808 is the last (`я́щурный`). The preface above line 452 is prose and is skipped. Verify these bounds if the PDF is re-exported.
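One way to verify the bounds after a re-export is a small script that checks the two boundary lines. The following is a minimal sketch, not part of the tooling: the function name `check_bounds` and its parameters are hypothetical, and it assumes that lines are counted by `\n` only, which may or may not match how `ruwords` counts them.

```python
STRESS_MARKS = {"\u0300", "\u0301"}  # combining grave / acute accents


def strip_stress(text):
    """Remove combining stress marks so accented headwords compare equal."""
    return "".join(ch for ch in text if ch not in STRESS_MARKS)


def check_bounds(lines, first=452, last=168808,
                 first_prefix="а1", last_prefix="ящурный"):
    """Return a list of problems; an empty list means the bounds still hold.

    `lines` is the text file as a list of strings; `first` and `last` are
    1-based line numbers mirroring the -from/-to defaults (hypothetical helper).
    """
    needed = max(first, last)
    if len(lines) < needed:
        return [f"file has only {len(lines)} lines, expected at least {needed}"]
    problems = []
    for number, prefix in ((first, first_prefix), (last, last_prefix)):
        # strip() also removes a leading form feed left by pdftotext
        entry = strip_stress(lines[number - 1].strip())
        if not entry.startswith(prefix):
            problems.append(
                f"line {number}: expected {prefix!r}, got {entry[:30]!r}")
    return problems


# Synthetic four-line stand-in for the real file, with the bounds at 2 and 4.
demo = ["preface", "\fа1, нескл., с.", "...", "я\u0301щурный"]
print(check_bounds(demo, first=2, last=4))  # prints []
```

To run it against the real text, read the file with `open(path, encoding="utf-8").read().split("\n")` rather than `str.splitlines()`: `splitlines()` also treats the form feed `\f` as a line boundary, which would shift every line number after the first page break.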
## Algorithm

### Stage 1: `ruwords` (Go)

For each dictionary line in `[from, to]`, Stage 1 collects the following tokens, normalised (stress marks U+0300/U+0301 stripped, lowercased, `ё` kept; hyphenated, capitalised and non-Cyrillic tokens rejected):

- the **headword** (leading token). Leading whitespace, including the form feed `\f` that pdftotext puts at the top of every page, is trimmed; otherwise the first headword of each page would be lost;
- the **singular of a plural headword** when the entry gives it after `ед.`, in full (`ящеры, …, ед. ящер`) or as a replacement suffix (`…, ед. -вец`, spliced where the suffix best overlaps the headword); the plural is then dropped (a plural that has a singular is never needed) and the singular is also recorded (`/tmp/ru_singulars.txt`);
- **variant headwords** after `и` that carry their own grammatical note (`аблатив, -а и аблятив, -а`; `регги и реггей, нескл.`), excluding inflected forms.

Everything else (every maximal Cyrillic token not selected above) goes to `/tmp/ru_skip.txt`, a safety net for a later morphology re-check.

### Stage 2: `ru_stage2.py` (Python)

Each Stage-1 word (length 2–15) is routed by three sources, most authoritative first; a fourth step then rescues spelling variants:

1. **OpenCorpora** (`words.dawg`, read directly, *not* the predictor): a common-noun reading ⇒ keep the OpenCorpora lemma. The full OpenCorpora common-noun lexicon is also added (so nouns absent from the PDF are included).
2. **libmorph** (independent dictionary, via `libmorph_check`): a common-noun reading ⇒ keep the libmorph lemma. The two dictionaries are treated as **complementary**: a noun reading in *either* is enough (their disagreements were reviewed and resolved this way, since each is incomplete in different places). A singular reconstructed from "ед." that neither dictionary knows is accepted as a noun (the orthographic note attests it).
3. A word **both dictionaries miss** is classified by the orthographic **note** (`-ая, -ое` ⇒ adjective; `-ть`, `сов./несов.` ⇒ verb; single genitive `-а/-и` or `нескл., м./ж./с.` ⇒ noun). A note-noun goes straight to `scrabble.txt`; an adjective or verb is dropped; anything undecided goes to `undefined.txt`.
4. **Variant rescue**: when the dictionary joins two spellings with "и" (`травмопункт и травмпункт`, `регги и реггей`) and one is already a confirmed noun, the other is moved from review/undefined into the result as well, propagated transitively through chains. The plural-form variants the dictionaries already resolve never reach this step.

The nominative singular always comes from the dictionary that recognised the word, or from the orthographic `ед.` note, never from a predictor guess (libmorph and the predictor mis-lemmatise out-of-dictionary words, e.g. `витебчане → витебчан` instead of `витебчанин`).

### The libmorph bridge: `libmorph_check.cpp`

libmorph (A. Kovalenko, MIT) ships as `libmorphrus.so`. `libmorph_check` is a thin stdin→stdout filter: one UTF-8 word per line in, one line out:

```
\t:\t:...
```

`` is `CheckWord` (1 = in the dictionary). `` is `wdInfo & 0x3f`, the part of speech. The codes were reverse-engineered (the docs omit the table):

| codes | part of speech |
|------|----------------|
| **7–21, 24** | **noun** (all genders / declensions / animacy; pluralia tantum is 24) |
| 1–3 | verb |
| 25, 27 | adjective |
| 28–32 | pronoun |
| 33–36 | numeral |
| 38–39 | **proper noun** (excluded) |
| 48–58 | comparative/adverb |
| 49–53 | function words |

The analyser instance is requested with the key `libmorph.api.v4:utf-8` so that words are passed and lemmas returned in UTF-8.

## Notes & caveats

- The hard tail (≈ 35 000 Stage-1 words, i.e. our candidates) is in **no** morphological dictionary; only the orthographic dictionary attests them, so the PDF note is the sole signal there. Compound and very recent nouns (`робототехник`, `толкинист`) live here.
- OpenCorpora and libmorph are near-equal in size (≈ 99 500 words each on `all.txt`) and ≈ 96 % overlapping, but **complementary** (each contributes ≈ 2 200 unique nouns), which is why both are kept. The mawo *predictor* "knows" ~98 % of everything by guessing and is therefore used only as a weak confirming vote, never as dictionary membership.
- Licensing: OpenCorpora data is CC BY-SA 3.0; libmorph is MIT; the orthographic dictionary has its own copyright. A list derived from CC BY-SA data inherits that licence.
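
As an illustration of the Stage 1 normalisation rules listed under Algorithm, the sketch below applies them to a single token. It is not the `ruwords` implementation (Stage 1 is written in Go; Python is used here only to keep the example short). The function name `normalise` is hypothetical, the NFC step is an added assumption to recompose a decomposed `й` or `ё`, and the sketch does not handle homonym indices such as the `1` in `а1`.

```python
import re
import unicodedata

STRESS_MARKS = ("\u0300", "\u0301")     # combining grave and acute accents
CYRILLIC_WORD = re.compile(r"[а-яё]+")  # lowercase Russian letters only


def normalise(token):
    """Return the normalised word, or None if the token is rejected."""
    for mark in STRESS_MARKS:
        token = token.replace(mark, "")
    # Assumption: recompose a decomposed й / ё once the stress marks are gone.
    token = unicodedata.normalize("NFC", token)
    if not token or token[0].isupper():  # capitalised tokens are rejected
        return None
    token = token.lower()
    # fullmatch rejects hyphens, digits, Latin letters and punctuation at once.
    return token if CYRILLIC_WORD.fullmatch(token) else None


for raw in ["я\u0301щурный", "ёж", "Москва", "кое-кто"]:
    print(repr(raw), "->", normalise(raw))
```

The stress marks are removed before the NFC step on purpose: NFC would otherwise fuse `и` + U+0300 into the single code point `ѝ`, which the Cyrillic pattern would then reject.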