# Russian word-list preparation (`dictprep`)

Builds the Russian **noun** word list for the Scrabble/Эрудит solver out of the official
Russian academic **orthographic dictionary**, cross-checked against two independent
morphological dictionaries.

The goal of the pipeline is a list of **common nouns in the nominative singular**
(`dictprep/russian/scrabble.txt`), plus an ambiguous tail for manual review.

> This directory is self-contained tooling for *building* the word list. It is not part
> of the solver library. The committed result lives in `dictprep/russian/`.

## Source

`orfo_dict_2025.pdf` is the *Русский орфографический словарь РАН* (Russian Orthographic
Dictionary of the Russian Academy of Sciences, ≈ 200 000 entries), the authority for
**spelling**. It encodes declension type in its grammatical notes but does **not**
reliably mark part of speech.

- Source: <https://ruslang.ru/sites/default/files/doc/normativnyje_slovari/orfograficheskij_slovar.pdf>
- Mirror: <https://rus-gos.spbu.ru/index.php/dictionary>

The PDF is git-ignored (large, third-party); place it here as `orfo_dict_2025.pdf`. Its
pdftotext output is committed as `russian/orfo_dict_2025.txt`, so the word list rebuilds
from the text alone; the binary PDF is needed only to regenerate that text.

## Outputs (`dictprep/russian/`)

The committed result is **four** files; every other bucket stays in the Stage-2
process's memory (dump it with `--dump`, query it with `--trace WORD`).

| File | Committed | Meaning |
|------|:--:|---------|
| `orfo_dict_2025.txt` | ✓ | the pdftotext output, i.e. the parsed source of truth (the PDF binary is not needed to rebuild). |
| `all.txt` | ✓ | Stage 1 base: every clean Cyrillic headword/variant; a plural headword with a singular is replaced by that singular. |
| `manual_confirm.txt` | ✓ | hand-reviewed nouns from the undefined tail; the brain merges them into the result. |
| `scrabble.txt` | ✓ | **Stage 2 result**: common nouns, nominative singular (+ pluralia tantum), length 2–15. This is the working dictionary. |
| `undefined.txt` | ✗ | the ambiguous tail; kept in memory, written only with `--dump`. |

`--dump` also writes `adjectives.txt`, `verbs.txt`, `singulars.txt` and `fate.tsv` (every
word with the reason it did or did not reach the dictionary); these are git-ignored debug
artifacts. Stage 1 also writes `/tmp/ru_{skip,singulars,variants}.txt`, intermediate inputs
that the Stage 2 brain consumes.

## Prerequisites

```sh
# 1. pdftotext (Poppler)
sudo apt-get install -y poppler-utils

# 2. Go toolchain (Stage 1): already required by the parent module

# 3. Python + the OpenCorpora analyser (Stage 2)
sudo apt-get install -y python3-venv python3-pip
python3 -m venv ru-venv
ru-venv/bin/pip install mawo-pymorphy3   # bundles OpenCorpora 2025 (words.dawg)

# 4. libmorph: the independent morphological dictionary (Stage 2 cross-check)
sudo apt-get install -y morphrus morphrus-dev moonycode-dev morphapi-dev
g++ -std=c++17 -O2 dictprep/libmorph_check.cpp -lmorphrus -lmoonycode -o dictprep/libmorph_check
```

If `dictprep/libmorph_check` is absent, Stage 2 still runs: it simply drops libmorph from
the stack and reports `libmorph_helper=MISSING`.

## How to run

```sh
# Stage 0: PDF -> plain text (committed as the source of truth; run once)
pdftotext dictprep/orfo_dict_2025.pdf dictprep/russian/orfo_dict_2025.txt

# Stage 1 (Go): build the base word list, dictprep/russian/all.txt + /tmp/ru_*.txt
go run ./dictprep/ruwords

# Stage 2 (Python + mawo + libmorph): the brain, writes scrabble.txt
ru-venv/bin/python dictprep/ru_stage2.py

# ask how a word did or did not reach the dictionary
ru-venv/bin/python dictprep/ru_stage2.py --trace травмпункт
# also write the in-memory buckets (undefined, adjectives, verbs, singulars, fate.tsv)
ru-venv/bin/python dictprep/ru_stage2.py --dump
```

The Stage 1 flags `-from`/`-to` (defaulting to 452/168808) bound the column word-list
section of `russian/orfo_dict_2025.txt` (line 452 = the first entry `а1, …`; line 168808 =
the last, `я́щурный`). The preface above line 452 is prose and is skipped. Verify these
bounds if the PDF is re-exported.

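After a re-export, the bounds can be re-checked with a few lines of Python. This is a minimal sketch, not part of the repository: `strip_stress` and `check_bounds` are hypothetical helper names, and the check relies only on what is stated above, namely that the first entry starts with `а1` and the last one is `я́щурный` once stress marks are ignored.

```python
STRESS_MARKS = {"\u0300", "\u0301"}  # combining grave / acute accents


def strip_stress(text: str) -> str:
    """Remove combining stress marks, leaving every other character as is."""
    return "".join(ch for ch in text if ch not in STRESS_MARKS)


def check_bounds(lines, first=452, last=168808):
    """Return True if the 1-based lines `first` and `last` look like the
    first and last dictionary entries described in this README."""
    if len(lines) < last:
        return False
    # lstrip() also drops a leading form feed left by pdftotext.
    head = strip_stress(lines[first - 1]).lstrip()
    tail = strip_stress(lines[last - 1]).lstrip()
    return head.startswith("а1") and tail.startswith("ящурный")


# Hypothetical usage against the committed text. split("\n") is used on
# purpose: str.splitlines() would also break lines at form feeds and
# shift the line numbers.
# with open("dictprep/russian/orfo_dict_2025.txt", encoding="utf-8") as f:
#     print(check_bounds(f.read().split("\n")))
```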

## Algorithm

### Stage 1: `ruwords` (Go)

For each dictionary line in `[from, to]`, Stage 1 collects the following items, normalised
(stress marks U+0300/U+0301 stripped, lowercased, `ё` kept, hyphenated/capitalised/non-Cyrillic
rejected):

- the **headword** (leading token). Leading whitespace, including the form feed `\f` that
  pdftotext puts at the top of every page, is trimmed; otherwise the first headword of
  each page would be lost;
- the **singular of a plural headword** when the entry gives it after `ед.`, in full
  (`ящеры, …, ед. ящер`) or as a replacement suffix (`…, ед. -вец`, spliced where the
  suffix best overlaps the headword); the plural is then dropped (a plural that has a
  singular is never needed) and the singular is also recorded (`/tmp/ru_singulars.txt`);
- **variant headwords** after `и` that carry their own grammatical note
  (`аблатив, -а и аблятив, -а`; `регги и реггей, нескл.`), excluding inflected forms.

Everything else (every maximal Cyrillic token not selected above) goes to
`/tmp/ru_skip.txt`, a safety net for a later morphology re-check.

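The normalisation rules above can be sketched in Python. The real Stage 1 is written in Go, so this is an illustration rather than the actual code: `normalise` is a hypothetical name, and treating a token as clean Cyrillic when it consists only of the letters `а`–`я` and `ё` is an assumption of this sketch.

```python
import re

STRESS_MARKS = re.compile("[\u0300\u0301]")  # combining grave / acute accents
CLEAN_CYRILLIC = re.compile("[а-яё]+")       # lowercase Russian letters only


def normalise(token: str):
    """Return the normalised word, or None if the token is rejected."""
    # strip() also removes the form feed pdftotext puts at each page top.
    word = STRESS_MARKS.sub("", token.strip())
    if not word or word[0].isupper():
        return None  # empty or capitalised: rejected
    word = word.lower()
    if not CLEAN_CYRILLIC.fullmatch(word):
        return None  # hyphenated or non-Cyrillic: rejected
    return word
```

Stress marks are removed as individual code points rather than through Unicode decomposition, because NFD would also split `ё` and `й` into a base letter plus a combining mark.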

### Stage 2: `ru_stage2.py` (Python)

Each Stage-1 word (length 2–15) is routed by three sources, most authoritative first,
followed by a variant-rescue pass:

1. **OpenCorpora** (`words.dawg`, read directly, *not* through the predictor): a common-noun
   reading ⇒ keep the OpenCorpora lemma. The full OpenCorpora common-noun lexicon is also
   added (so nouns absent from the PDF are included).
2. **libmorph** (independent dictionary, via `libmorph_check`): a common-noun reading ⇒
   keep the libmorph lemma. The two dictionaries are treated as **complementary**: a noun
   reading in *either* is enough (their disagreements were reviewed and resolved this way,
   since each is incomplete in different places). A singular reconstructed from "ед." that
   neither dictionary knows is accepted as a noun (the orthographic note attests it).
3. A word **both dictionaries miss** is classified by the orthographic **note**
   (`-ая, -ое` ⇒ adjective; `-ть`, `сов./несов.` ⇒ verb; single genitive `-а/-и` or
   `нескл., м./ж./с.` ⇒ noun). A note-noun goes straight to `scrabble.txt`; an adjective or
   verb is dropped; anything undecided goes to `undefined.txt`.
4. **Variant rescue**: when the dictionary joins two spellings with "и" (`травмопункт и
   травмпункт`, `регги и реггей`) and one is already a confirmed noun, the other is moved
   from review/undefined into the result as well, propagated transitively through chains.
   The plural-form variants the dictionaries already resolve never reach this step.

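Steps 1 to 3 can be summarised as a small decision function. This is a sketch under stated assumptions, not the code of `ru_stage2.py`: the names `classify_note` and `route` and the boolean inputs are hypothetical, the note patterns are simplified to the examples listed in step 3, and words that a dictionary knows only as a non-noun are outside its scope.

```python
import re


def classify_note(note: str) -> str:
    """Guess the part of speech from an orthographic note (simplified)."""
    if re.search(r"-ая\b.*-ое\b", note):
        return "adjective"
    if re.search(r"-ть\b|сов\.", note):  # "сов." also matches "несов."
        return "verb"
    if re.search(r"нескл\.,\s*[мжс]\.", note) or re.fullmatch(r"\s*-[аи]\s*", note):
        return "noun"
    return "undecided"


def route(oc_noun: bool, lm_noun: bool, from_ed: bool, note: str) -> str:
    """Return the bucket for one Stage-1 word.

    oc_noun / lm_noun: the word has a common-noun reading in OpenCorpora /
    libmorph. from_ed: it is a singular reconstructed from an "ед." note.
    If neither flag is set, the word is assumed to be missing from both.
    """
    if oc_noun or lm_noun:      # steps 1-2: either dictionary is enough
        return "scrabble"
    if from_ed:                 # attested by the orthographic note
        return "scrabble"
    kind = classify_note(note)  # step 3: only the note is left
    if kind == "noun":
        return "scrabble"
    if kind in ("adjective", "verb"):
        return "dropped"
    return "undefined"
```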

The nominative singular always comes from the dictionary that recognised the word, or from
the orthographic `ед.` note, never from a predictor guess (libmorph and the predictor
mis-lemmatise out-of-dictionary words, e.g. `витебчане → витебчан` instead of `витебчанин`).

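The transitive variant rescue of step 4 amounts to a fixed-point iteration over the "и" pairs. The following is an illustrative sketch only: `rescue_variants` and its inputs (a set of confirmed nouns, a set of still-undecided words and a list of variant pairs) are hypothetical names, not the actual data structures of `ru_stage2.py`.

```python
def rescue_variants(confirmed, undecided, pairs):
    """Move undecided words into `confirmed` when an "и" variant is confirmed.

    The loop repeats until nothing changes, so a chain a-b, b-c is resolved
    even if only `a` was confirmed at the start. New sets are returned and
    the inputs are left untouched.
    """
    confirmed, undecided = set(confirmed), set(undecided)
    changed = True
    while changed:
        changed = False
        for left, right in pairs:
            # A pair is symmetric: either side may rescue the other.
            for known, other in ((left, right), (right, left)):
                if known in confirmed and other in undecided:
                    undecided.discard(other)
                    confirmed.add(other)
                    changed = True
    return confirmed, undecided
```

Repeating the pass until it makes no change keeps the result independent of the order in which the pairs are listed.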

### The libmorph bridge: `libmorph_check.cpp`

libmorph (A. Kovalenko, MIT) ships as `libmorphrus.so`. `libmorph_check` is a thin
stdin→stdout filter: one UTF-8 word per line in, one line out:

```
<known>\t<pos>:<lemma>\t<pos>:<lemma>...
```

`<known>` is the result of `CheckWord` (1 = in the dictionary). `<pos>` is `wdInfo & 0x3f`,
the part of speech. The codes were reverse-engineered (the docs omit the table):

| codes | part of speech |
|------|----------------|
| **7–21, 24** | **noun** (all genders / declensions / animacy; pluralia tantum is 24) |
| 1–3 | verb |
| 25, 27 | adjective |
| 28–32 | pronoun |
| 33–36 | numeral |
| 38–39 | **proper noun** (excluded) |
| 48–58 | comparative/adverb |
| 49–53 | function words |

The analyser instance is requested with the key `libmorph.api.v4:utf-8` so words are
passed and lemmas returned in UTF-8.

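On the Python side, one line of `libmorph_check` output could be decoded as follows. This is a sketch based only on the line format and the code table above: `parse_bridge_line`, `common_noun_lemmas` and `NOUN_CODES` are hypothetical names, and it assumes that `<pos>` is printed as a decimal integer.

```python
# Codes 7-21 and 24 (pluralia tantum); proper nouns (38-39) are left out.
NOUN_CODES = set(range(7, 22)) | {24}


def parse_bridge_line(line: str):
    """Split '<known>\\t<pos>:<lemma>...' into (known, [(pos, lemma), ...])."""
    fields = line.rstrip("\n").split("\t")
    known = fields[0] == "1"
    readings = []
    for field in fields[1:]:
        if not field:
            continue  # tolerate a trailing tab
        pos, _, lemma = field.partition(":")
        readings.append((int(pos), lemma))
    return known, readings


def common_noun_lemmas(line: str):
    """Lemmas of the common-noun readings, for dictionary words only."""
    known, readings = parse_bridge_line(line)
    if not known:
        return []  # never trust a reading of an out-of-dictionary word
    return [lemma for pos, lemma in readings if pos in NOUN_CODES]
```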

## Notes & caveats

- The hard tail (≈ 35 000 Stage-1 candidate words) is in **no** morphological
  dictionary; only the orthographic dictionary attests them, so the PDF note is the sole
  signal there. Compound and very recent nouns (`робототехник`, `толкинист`) live here.
- OpenCorpora and libmorph are near-equal in size (≈ 99 500 words each on `all.txt`)
  and ≈ 96 % overlapping, but **complementary** (each contributes ≈ 2 200 unique nouns),
  which is why both are kept. The mawo *predictor* "knows" ~98 % of everything by guessing
  and is therefore used only as a weak confirming vote, never as evidence of dictionary
  membership.
- Licensing: OpenCorpora data is CC BY-SA 3.0; libmorph is MIT; the orthographic
  dictionary has its own copyright. A list derived from CC BY-SA data inherits that licence.