dictprep: Russian orthographic dictionary → Scrabble noun pipeline

Build a committed Russian common-noun word list (dictprep/russian/scrabble.txt) from the RAN orthographic dictionary, for the Эрудит ruleset.

- Stage 1 (Go, dictprep/ruwords): orfo_dict_2025.txt -> all.txt; extracts headwords, reconstructs "ед." singulars (suppressing plurals), pairs "и" variants.
- Stage 2 (Python brain, dictprep/ru_stage2.py): OpenCorpora (mawo-pymorphy3) + libmorph + orthographic notes select common nouns (nom. sing.); --trace explains a word's fate, --dump writes the in-memory buckets.
- libmorph C++ bridge (libmorph_check.cpp); manual_confirm.txt is merged in.
- orfo_dict_2025.txt is the committed pdftotext source of truth.
- See dictprep/README.md for methodology and reproducibility.
2026-06-01 23:27:17 +02:00
parent 15c7959d96
commit 540ee32178
9 changed files with 402226 additions and 1 deletions
@@ -0,0 +1,164 @@
+# Russian word-list preparation (`dictprep`)
+
+Builds the Russian **noun** word list for the Scrabble/Эрудит solver out of the official
+Russian academic **orthographic dictionary**, cross-checked against two independent
+morphological dictionaries.
+
+The goal of the pipeline is a list of **common nouns in the nominative singular**
+(`dictprep/russian/scrabble.txt`), plus an ambiguous tail for manual review.
+
+> This directory is self-contained tooling for *building* the word list. It is not part
+> of the solver library. The committed result lives in `dictprep/russian/`.
+
+## Source
+
+`orfo_dict_2025.pdf` — *Русский орфографический словарь РАН* (≈ 200 000 entries), the
+authority for **spelling**. It encodes declension type in its grammatical notes but does
+**not** reliably mark part of speech.
+
+- Source: <https://ruslang.ru/sites/default/files/doc/normativnyje_slovari/orfograficheskij_slovar.pdf>
+- Mirror: <https://rus-gos.spbu.ru/index.php/dictionary>
+
+The PDF is git-ignored (large, third-party); place it here as `orfo_dict_2025.pdf`. Its
+pdftotext output is committed as `russian/orfo_dict_2025.txt`, so the word list rebuilds
+from the text alone — the binary PDF is needed only to regenerate that text.
+
+## Outputs (`dictprep/russian/`)
+
+The committed result is **four** files; every other bucket stays in the Stage-2
+process's memory (dump it with `--dump`, query it with `--trace WORD`).
+
+| File | Committed | Meaning |
+|------|:--:|---------|
+| `orfo_dict_2025.txt` | ✓ | the pdftotext output — the parsed source of truth (the PDF binary is not needed to rebuild). |
+| `all.txt` | ✓ | Stage 1 base: every clean Cyrillic headword/variant; a plural headword with a singular is replaced by that singular. |
+| `manual_confirm.txt` | ✓ | hand-reviewed nouns from the undefined tail; the brain merges them into the result. |
+| `scrabble.txt` | ✓ | **Stage 2 result**: common nouns, nominative singular (+ pluralia tantum), length 2–15 — the working dictionary. |
+| `undefined.txt` | — | the ambiguous tail; kept in memory, written only with `--dump`. |
+
+`--dump` also writes `adjectives.txt`, `verbs.txt`, `singulars.txt` and `fate.tsv` (every
+word with the reason it did or did not reach the dictionary); these are git-ignored debug
+artifacts. Stage 1 also writes `/tmp/ru_{skip,singulars,variants}.txt`, intermediate inputs
+the brain consumes.
+
+## Prerequisites
+
+```sh
+# 1. pdftotext (Poppler)
+sudo apt-get install -y poppler-utils
+
+# 2. Go toolchain (Stage 1) — already required by the parent module
+
+# 3. Python + the OpenCorpora analyser (Stage 2)
+sudo apt-get install -y python3-venv python3-pip
+python3 -m venv ru-venv
+ru-venv/bin/pip install mawo-pymorphy3            # bundles OpenCorpora 2025 (words.dawg)
+
+# 4. libmorph — the independent morphological dictionary (Stage 2 cross-check)
+sudo apt-get install -y morphrus morphrus-dev moonycode-dev morphapi-dev
+g++ -std=c++17 -O2 dictprep/libmorph_check.cpp -lmorphrus -lmoonycode -o dictprep/libmorph_check
+```
+
+If `dictprep/libmorph_check` is absent, Stage 2 still runs — it simply drops libmorph from
+the stack and reports `libmorph_helper=MISSING`.
+
+## How to run
+
+```sh
+# Stage 0 — PDF -> plain text (committed as the source of truth; run once)
+pdftotext dictprep/orfo_dict_2025.pdf dictprep/russian/orfo_dict_2025.txt
+
+# Stage 1 — build the base word list (Go): dictprep/russian/all.txt + /tmp/ru_*.txt
+go run ./dictprep/ruwords
+
+# Stage 2 — the brain (Python + mawo + libmorph): writes scrabble.txt
+ru-venv/bin/python dictprep/ru_stage2.py
+
+# ask how a word did or did not reach the dictionary
+ru-venv/bin/python dictprep/ru_stage2.py --trace травмпункт
+# also write the in-memory buckets (undefined, adjectives, verbs, singulars, fate.tsv)
+ru-venv/bin/python dictprep/ru_stage2.py --dump
+```
+
+The Stage 1 flags `-from`/`-to` (defaulting to 452/168808) bound the column word-list
+section of `russian/orfo_dict_2025.txt`: line 452 is the first entry (`а1, …`) and line
+168808 is the last (`я́щурный`). The preface above line 452 is prose and is skipped.
+Verify these bounds whenever the PDF is re-exported, because the line numbers may shift.
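
The bound check described above can be done mechanically. The following is a minimal sketch, not part of the pipeline: `check_bounds` and `strip_stress` are hypothetical helper names, and the sketch assumes the first entry still starts with `а1` and the last with `ящурный` once stress marks are removed.

```python
from typing import Sequence

STRESS_MARKS = ("\u0300", "\u0301")  # combining grave and acute accents


def strip_stress(text: str) -> str:
    """Remove combining stress marks so headwords compare as plain Cyrillic."""
    for mark in STRESS_MARKS:
        text = text.replace(mark, "")
    return text


def check_bounds(lines: Sequence[str], start: int, end: int,
                 first: str = "а1", last: str = "ящурный") -> bool:
    """Return True if the 1-indexed lines `start` and `end` begin with the expected entries."""
    if not (1 <= start <= end <= len(lines)):
        return False
    # str.strip() also removes the form feed pdftotext puts at page tops.
    head = strip_stress(lines[start - 1].strip())
    tail = strip_stress(lines[end - 1].strip())
    return head.startswith(first) and tail.startswith(last)


# Tiny in-memory stand-in for the text file: two preface lines, then entries.
sample = ["Preface", "", "\fа1, …", "аблатив, -а и аблятив, -а", "я\u0301щурный"]
print(check_bounds(sample, 3, 5))  # bounds that sit on the first and last entry
print(check_bounds(sample, 2, 5))  # start still points into the preface
```

Against the real file one would read `dictprep/russian/orfo_dict_2025.txt` as UTF-8, split it into lines, and pass the current `-from`/`-to` values.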
+
+## Algorithm
+
+### Stage 1 — `ruwords` (Go)
+
+For each dictionary line in `[from, to]`, Stage 1 collects the items below, normalised
+(stress marks U+0300/U+0301 stripped, lowercased, `ё` kept, hyphenated/capitalised/non-Cyrillic
+rejected):
+
+- the **headword** (leading token). Leading whitespace including the form-feed `\f`
+  pdftotext puts at every page top is trimmed — otherwise the first headword of each page
+  is lost;
+- the **singular of a plural headword** when the entry gives it after `ед.`, in full
+  (`ящеры, …, ед. ящер`) or as a replacement suffix (`…, ед. -вец`, spliced where the
+  suffix best overlaps the headword); the plural is then dropped (a plural that has a
+  singular is never needed) and the singular is also recorded (`/tmp/ru_singulars.txt`);
+- **variant headwords** after `и` that carry their own grammatical note
+  (`аблатив, -а и аблятив, -а`; `регги и реггей, нескл.`), excluding inflected forms.
+
+Everything else (every maximal Cyrillic token not selected above) goes to
+`/tmp/ru_skip.txt`, a safety net for a later morphology re-check.
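
The normalisation rules listed for Stage 1 can be illustrated in a few lines. This is a Python sketch under stated assumptions, not the Go code in `ruwords`: the function name `normalise`, the order of the checks, and the NFC step are assumptions made for illustration.

```python
import unicodedata
from typing import Optional

STRESS_MARKS = {"\u0300", "\u0301"}                    # combining grave and acute
ALPHABET = set("абвгдеёжзийклмнопрстуфхцчшщъыьэюя")   # lowercase Russian letters, ё included


def normalise(token: str) -> Optional[str]:
    """Return the cleaned headword, or None if the token is rejected."""
    # Trim whitespace, including the form feed pdftotext emits at page tops.
    word = token.strip()
    # Drop stress marks, then recompose letters such as й and ё (assumed step).
    word = "".join(ch for ch in word if ch not in STRESS_MARKS)
    word = unicodedata.normalize("NFC", word)
    if not word:
        return None
    # Assumption: a capitalised token is rejected before lowercasing.
    if word[0].isupper():
        return None
    word = word.lower()
    # Hyphens, digits and Latin letters are not in ALPHABET, so such tokens fail here.
    if any(ch not in ALPHABET for ch in word):
        return None
    return word


for raw in ["\fя\u0301щурный", "ёж", "Витебск", "кое-кто", "abc"]:
    print(repr(raw), "->", normalise(raw))
```

Keeping `ё` distinct from `е` only requires that `ё` is listed in the accepted alphabet and that no step folds it away.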
+
+### Stage 2 — `ru_stage2.py` (Python)
+
+Each Stage-1 word (length 2–15) is routed by three sources, most authoritative first
+(steps 1–3), followed by a variant-rescue pass (step 4):
+
+1. **OpenCorpora** (`words.dawg`, read directly — *not* the predictor): a common-noun
+   reading ⇒ keep the OpenCorpora lemma. The full OpenCorpora common-noun lexicon is also
+   added (so nouns absent from the PDF are included).
+2. **libmorph** (independent dictionary, via `libmorph_check`): a common-noun reading ⇒
+   keep the libmorph lemma. The two dictionaries are treated as **complementary** — a noun
+   reading in *either* is enough (their disagreements were reviewed and resolved this way,
+   since each is incomplete in different places). A singular reconstructed from "ед." that
+   neither dictionary knows is accepted as a noun (the orthographic note attests it).
+3. A word **both dictionaries miss** is classified by the orthographic **note**
+   (`-ая, -ое` ⇒ adjective; `-ть`, `сов./несов.` ⇒ verb; single genitive `-а/-и` or
+   `нескл., м./ж./с.` ⇒ noun). A note-noun goes straight to `scrabble.txt`; an adjective or
+   verb is dropped; anything undecided goes to `undefined.txt`.
+4. **Variant rescue**: when the dictionary joins two spellings with "и" (`травмопункт и
+   травмпункт`, `регги и реггей`) and one is already a confirmed noun, the other is moved
+   from review/undefined into the result as well, propagated transitively through chains.
+   The plural-form variants the dictionaries already resolve never reach this step.
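
The transitive propagation in the variant-rescue step can be sketched as a fixed-point loop over the "и" pairs. This is an illustration only: `rescue_variants` and its arguments are hypothetical names, and the real `ru_stage2.py` may organise the step differently.

```python
from typing import Iterable, Set, Tuple


def rescue_variants(nouns: Set[str], undecided: Set[str],
                    pairs: Iterable[Tuple[str, str]]) -> Set[str]:
    """Return the undecided words linked, directly or through a chain, to a confirmed noun."""
    pairs = list(pairs)
    confirmed = set(nouns)
    rescued: Set[str] = set()
    changed = True
    while changed:  # repeat until a full pass adds nothing (transitive closure)
        changed = False
        for a, b in pairs:
            for known, other in ((a, b), (b, a)):
                if known in confirmed and other in undecided and other not in confirmed:
                    confirmed.add(other)
                    rescued.add(other)
                    changed = True
    return rescued


# Toy data: one pair has a confirmed side, the other pair has none.
nouns = {"травмопункт"}
undecided = {"травмпункт", "регги", "реггей"}
pairs = [("травмопункт", "травмпункт"), ("регги", "реггей")]
print(sorted(rescue_variants(nouns, undecided, pairs)))
```

A pair in which neither spelling is confirmed stays where it is, so the rescue never promotes a word without at least one attested noun in its chain.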
+
+The nominative singular always comes from the dictionary that recognised the word, or from
+the orthographic `ед.` note — never from a predictor guess (libmorph and the predictor
+mis-lemmatise out-of-dictionary words, e.g. `витебчане → витебчан` instead of `витебчанин`).
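
The precedence of the sources, and the rule that the emitted form comes from whichever dictionary recognised the word, can be sketched as below. This is a simplified illustration and not the code in `ru_stage2.py`: `route`, `Lookup`, and the bucket names are hypothetical, each lookup is assumed to return a lemma only for a common-noun reading, and words that the dictionaries know only as non-nouns are not modelled.

```python
from typing import Callable, Optional, Tuple

# word -> lemma of a common-noun reading, or None if the dictionary has none
Lookup = Callable[[str], Optional[str]]


def route(word: str, opencorpora: Lookup, libmorph: Lookup,
          note_class: Optional[str], from_ed_note: bool = False) -> Tuple[str, str]:
    """Return (bucket, form); bucket is 'scrabble', 'dropped' or 'undefined'."""
    if not 2 <= len(word) <= 15:
        return ("dropped", word)           # outside the length window (assumed handling)
    for lookup in (opencorpora, libmorph):  # most authoritative first
        lemma = lookup(word)
        if lemma is not None:
            return ("scrabble", lemma)     # form comes from the dictionary that knew the word
    if from_ed_note:                        # singular reconstructed from an "ед." note
        return ("scrabble", word)
    if note_class == "noun":                # orthographic note says noun
        return ("scrabble", word)
    if note_class in ("adjective", "verb"):
        return ("dropped", word)
    return ("undefined", word)              # left for manual review


# Toy lookups standing in for the two dictionaries.
oc = {"травмопункт": "травмопункт"}.get
lm = {}.get
print(route("травмопункт", oc, lm, None))
print(route("витебчанин", oc, lm, None, from_ed_note=True))
print(route("толкинист", oc, lm, "noun"))
```

Note that no branch calls a predictor: a word unknown to both dictionaries keeps the spelling attested by the orthographic dictionary.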
+
+### The libmorph bridge — `libmorph_check.cpp`
+
+libmorph (A. Kovalenko, MIT) ships as `libmorphrus.so`. `libmorph_check` is a thin
+stdin→stdout filter: one UTF-8 word per line in, one line out:
+
+```
+<known>\t<pos>:<lemma>\t<pos>:<lemma>...
+```
+
+`<known>` is the result of `CheckWord` (1 = in the dictionary). `<pos>` is `wdInfo & 0x3f`,
+the part of speech. The codes were reverse-engineered (the docs omit the table):
+
+| codes | part of speech |
+|------|----------------|
+| **7–21, 24** | **noun** (all genders / declensions / animacy; pluralia tantum is 24) |
+| 1–3 | verb |
+| 25, 27 | adjective |
+| 28–32 | pronoun |
+| 33–36 | numeral |
+| 38–39 | **proper noun** (excluded) |
+| 48–58 | comparative/adverb |
+| 49–53 | function words |
+
+The analyser instance is requested with the key `libmorph.api.v4:utf-8` so words are
+passed and lemmas returned in UTF-8.
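
On the Python side, a line in this format can be parsed as follows. This is a sketch under stated assumptions: `parse_line` and `noun_lemmas` are hypothetical names, the sample lines are made up to exercise the format, and ignoring the readings of unknown words is a choice made here rather than documented behaviour.

```python
from typing import List, Tuple

NOUN_CODES = set(range(7, 22)) | {24}  # 7–21 plus 24 (pluralia tantum), per the table
PROPER_CODES = {38, 39}                # proper nouns, excluded from the word list


def parse_line(line: str) -> Tuple[bool, List[Tuple[int, str]]]:
    """Split '<known>\\t<pos>:<lemma>...' into a flag and (pos, lemma) readings."""
    fields = line.rstrip("\n").split("\t")
    known = fields[0] == "1"
    readings: List[Tuple[int, str]] = []
    for field in fields[1:]:
        if not field:
            continue
        pos, _, lemma = field.partition(":")
        readings.append((int(pos), lemma))
    return known, readings


def noun_lemmas(line: str) -> List[str]:
    """Return the lemmas of common-noun readings, but only for words the dictionary knows."""
    known, readings = parse_line(line)
    if not known:
        return []  # assumed policy: never trust readings of out-of-dictionary words
    return [lemma for pos, lemma in readings if pos in NOUN_CODES]


# Made-up sample lines in the documented format.
print(noun_lemmas("1\t7:ящер\t38:Ящер\n"))
print(noun_lemmas("0\t7:витебчан\n"))
```

Filtering on `<known>` first keeps guessed lemmas for out-of-dictionary words from leaking into the result.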
+
+## Notes & caveats
+
+- The hard tail (≈ 35 000 of the Stage-1 candidate words) is in **no** morphological
+  dictionary; only the orthographic dictionary attests these words, so the PDF note is the
+  sole signal there. Compound and very recent nouns (`робототехник`, `толкинист`) live here.
+- OpenCorpora and libmorph are near-equal in size (≈ 99 500 words each on `all.txt`)
+  and ≈ 96 % overlapping, but **complementary** (each contributes ≈ 2 200 unique nouns),
+  which is why both are kept. The mawo *predictor* "knows" ≈ 98 % of everything by guessing
+  and is therefore used only as a weak confirming vote, never as evidence of dictionary
+  membership.
+- Licensing: OpenCorpora data is CC BY-SA 3.0; libmorph is MIT; the orthographic
+  dictionary has its own copyright. A list derived from CC BY-SA data inherits that licence.