dictprep: Russian orthographic dictionary → Scrabble noun pipeline

Build a committed Russian common-noun word list (dictprep/russian/scrabble.txt)
from the RAN orthographic dictionary, for the Эрудит ruleset.

- Stage 1 (Go, dictprep/ruwords): orfo_dict_2025.txt -> all.txt; extracts
  headwords, reconstructs "ед." singulars (suppressing plurals), pairs "и" variants.
- Stage 2 (Python brain, dictprep/ru_stage2.py): OpenCorpora (mawo-pymorphy3) +
  libmorph + orthographic notes select common nouns (nom. sing.); --trace explains
  a word's fate, --dump writes the in-memory buckets.
- libmorph C++ bridge (libmorph_check.cpp); manual_confirm.txt is merged in.
- orfo_dict_2025.txt is the committed pdftotext source of truth.
- See dictprep/README.md for methodology and reproducibility.
Ilia Denisov
2026-06-01 23:27:17 +02:00
parent 15c7959d96
commit 540ee32178
9 changed files with 402226 additions and 1 deletions
# Russian word-list preparation (`dictprep`)
Builds the Russian **noun** word list for the Scrabble/Эрудит solver out of the official
Russian academic **orthographic dictionary**, cross-checked against two independent
morphological dictionaries.
The goal of the pipeline is a list of **common nouns in the nominative singular**
(`dictprep/russian/scrabble.txt`), plus an ambiguous tail for manual review.
> This directory is self-contained tooling for *building* the word list. It is not part
> of the solver library. The committed result lives in `dictprep/russian/`.
## Source
`orfo_dict_2025.pdf`: *Русский орфографический словарь РАН* (≈ 200 000 entries), the
authority for **spelling**. It encodes declension type in its grammatical notes but does
**not** reliably mark part of speech.
- Source: <https://ruslang.ru/sites/default/files/doc/normativnyje_slovari/orfograficheskij_slovar.pdf>
- Mirror: <https://rus-gos.spbu.ru/index.php/dictionary>
The PDF is git-ignored (large, third-party); place it here as `orfo_dict_2025.pdf`. Its
pdftotext output is committed as `russian/orfo_dict_2025.txt`, so the word list rebuilds
from the text alone — the binary PDF is needed only to regenerate that text.
## Outputs (`dictprep/russian/`)
The committed result is **four** files; every other bucket stays in the Stage-2
process's memory (dump it with `--dump`, query it with `--trace WORD`).
| File | Committed | Meaning |
|------|:--:|---------|
| `orfo_dict_2025.txt` | ✓ | the pdftotext output — the parsed source of truth (the PDF binary is not needed to rebuild). |
| `all.txt` | ✓ | Stage 1 base: every clean Cyrillic headword/variant; a plural headword with a singular is replaced by that singular. |
| `manual_confirm.txt` | ✓ | hand-reviewed nouns from the undefined tail; the brain merges them into the result. |
| `scrabble.txt` | ✓ | **Stage 2 result**: common nouns, nominative singular (+ pluralia tantum), length 2–15; this is the working dictionary. |
| `undefined.txt` | — | the ambiguous tail; kept in memory, written only with `--dump`. |
`--dump` also writes `adjectives.txt`, `verbs.txt`, `singulars.txt` and `fate.tsv` (every
word with the reason it did or did not reach the dictionary); these are git-ignored debug
artifacts. Stage 1 also writes `/tmp/ru_{skip,singulars,variants}.txt`, intermediate inputs
the brain consumes.
## Prerequisites
```sh
# 1. pdftotext (Poppler)
sudo apt-get install -y poppler-utils
# 2. Go toolchain (Stage 1) — already required by the parent module
# 3. Python + the OpenCorpora analyser (Stage 2)
sudo apt-get install -y python3-venv python3-pip
python3 -m venv ru-venv
ru-venv/bin/pip install mawo-pymorphy3 # bundles OpenCorpora 2025 (words.dawg)
# 4. libmorph — the independent morphological dictionary (Stage 2 cross-check)
sudo apt-get install -y morphrus morphrus-dev moonycode-dev morphapi-dev
g++ -std=c++17 -O2 dictprep/libmorph_check.cpp -lmorphrus -lmoonycode -o dictprep/libmorph_check
```
If `dictprep/libmorph_check` is absent, Stage 2 still runs — it simply drops libmorph from
the stack and reports `libmorph_helper=MISSING`.
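The fallback described above can be sketched in Python as follows. This is a minimal illustration, not the code in `ru_stage2.py`: the function name `run_libmorph` and its return convention (`None` when the helper is unavailable) are assumptions made for the example.

```python
import os
import subprocess


def run_libmorph(words, helper="dictprep/libmorph_check"):
    """Return the helper's output lines, or None when the helper is unavailable.

    Hypothetical wrapper: the name and the None convention are illustrative.
    """
    if not (os.path.isfile(helper) and os.access(helper, os.X_OK)):
        # Mirrors the status string the README mentions.
        print("libmorph_helper=MISSING")
        return None
    proc = subprocess.run(
        [helper],
        input="".join(w + "\n" for w in words),  # one UTF-8 word per line
        capture_output=True,
        text=True,
        encoding="utf-8",
        check=True,
    )
    return proc.stdout.splitlines()
```

A caller would treat `None` as "skip the libmorph vote" and continue with the remaining sources.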
## How to run
```sh
# Stage 0 — PDF -> plain text (committed as the source of truth; run once)
pdftotext dictprep/orfo_dict_2025.pdf dictprep/russian/orfo_dict_2025.txt
# Stage 1 — build the base word list (Go): dictprep/russian/all.txt + /tmp/ru_*.txt
go run ./dictprep/ruwords
# Stage 2 — the brain (Python + mawo + libmorph): writes scrabble.txt
ru-venv/bin/python dictprep/ru_stage2.py
# ask how a word did or did not reach the dictionary
ru-venv/bin/python dictprep/ru_stage2.py --trace травмпункт
# also write the in-memory buckets (undefined, adjectives, verbs, singulars, fate.tsv)
ru-venv/bin/python dictprep/ru_stage2.py --dump
```
The Stage 1 flags `-from`/`-to` (defaults 452 and 168808) delimit the columnar word-list
section of `russian/orfo_dict_2025.txt`: line 452 is the first entry (`а1, …`) and line
168808 is the last (`я́щурный`). The preface above line 452 is prose and is skipped.
Re-check both bounds whenever the PDF is re-exported, since the line numbers may shift.
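One way to re-check the bounds after a re-export is to print the two boundary lines and compare them with the expected entries. The helper names `check_bounds` and `read_lines` below are hypothetical and not part of the repository.

```python
def check_bounds(lines, first=452, last=168808):
    """Return the text of the first and last entry lines (1-indexed)."""
    if not 1 <= first <= last <= len(lines):
        raise ValueError(f"bounds {first}..{last} outside 1..{len(lines)}")
    # strip() also removes the form-feed that pdftotext emits at page tops.
    return lines[first - 1].strip(), lines[last - 1].strip()


def read_lines(path):
    """Read a text file into a list of lines, splitting on newlines only.

    Iterating the file is deliberate: str.splitlines() would also split on
    form-feed characters and shift every line number after the first page.
    """
    with open(path, encoding="utf-8") as fh:
        return [line.rstrip("\n") for line in fh]
```

Calling `check_bounds(read_lines("dictprep/russian/orfo_dict_2025.txt"))` should then return a pair whose first element starts with `а1` and whose second starts with the last headword, if the defaults still hold.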
## Algorithm
### Stage 1 — `ruwords` (Go)
For each dictionary line in `[from, to]`, Stage 1 normalises tokens (stress marks
U+0300/U+0301 stripped, lowercased, `ё` kept; hyphenated, capitalised and non-Cyrillic
tokens rejected) and collects:
- the **headword** (leading token). Leading whitespace including the form-feed `\f`
pdftotext puts at every page top is trimmed — otherwise the first headword of each page
is lost;
- the **singular of a plural headword** when the entry gives it after `ед.`, in full
(`ящеры, …, ед. ящер`) or as a replacement suffix (`…, ед. -вец`, spliced where the
suffix best overlaps the headword); the plural is then dropped (a plural that has a
singular is never needed) and the singular is also recorded (`/tmp/ru_singulars.txt`);
- **variant headwords** after `и` that carry their own grammatical note
(`аблатив, -а и аблятив, -а`; `регги и реггей, нескл.`), excluding inflected forms.
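The normalisation rules above can be illustrated with a short Python sketch. Stage 1 itself is written in Go; the function below is only an approximation, and the order of the checks (reject capitalised tokens before lowercasing) is an assumption.

```python
# Combining grave (U+0300) and acute (U+0301) accents mark stress in the source.
STRESS_MARKS = {0x0300: None, 0x0301: None}


def normalise(token):
    """Return the cleaned lowercase word, or None if the token is rejected."""
    # strip() removes surrounding whitespace, including a leading form-feed.
    word = token.strip().translate(STRESS_MARKS)
    if not word or "-" in word or word[0].isupper():
        return None  # empty, hyphenated or capitalised
    word = word.lower()
    # Accept only the basic Russian lowercase letters; keep "ё" as is.
    if not all("а" <= ch <= "я" or ch == "ё" for ch in word):
        return None
    return word
```

For example, a page-initial token with a form-feed and a stress mark reduces to the bare lowercase word, while a capitalised or hyphenated token returns `None`.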
Everything else (every maximal Cyrillic token not selected above) goes to
`/tmp/ru_skip.txt`, a safety net for a later morphology re-check.
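The replacement-suffix case (`ед. -вец`) is the least obvious part of Stage 1. The sketch below shows one plausible reading of "spliced where the suffix best overlaps the headword": pick the position where the suffix shares the longest run of leading letters with the headword, preferring the rightmost position on ties. This heuristic and the sample words are assumptions for illustration, not the Go implementation.

```python
def splice_singular(plural, suffix):
    """Rebuild a singular from a plural headword and a '-xxx' replacement suffix."""
    tail = suffix.lstrip("-")
    if not tail:
        return None
    best_pos, best_len = None, 0
    for pos in range(len(plural)):
        # Length of the common prefix of plural[pos:] and the suffix.
        n = 0
        while n < len(tail) and pos + n < len(plural) and plural[pos + n] == tail[n]:
            n += 1
        if n and n >= best_len:  # ">=" keeps the rightmost best match
            best_pos, best_len = pos, n
    if best_pos is None:
        return None  # no overlap: leave the entry for manual review
    return plural[:best_pos] + tail
```

With the illustrative inputs `splice_singular("витебчане", "-нин")` the only overlap is the final `н`, so the result is `витебчанин`.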
### Stage 2 — `ru_stage2.py` (Python)
Each Stage-1 word (length 2–15) is routed by three sources, most authoritative first:
1. **OpenCorpora** (`words.dawg`, read directly — *not* the predictor): a common-noun
reading ⇒ keep the OpenCorpora lemma. The full OpenCorpora common-noun lexicon is also
added (so nouns absent from the PDF are included).
2. **libmorph** (independent dictionary, via `libmorph_check`): a common-noun reading ⇒
keep the libmorph lemma. The two dictionaries are treated as **complementary** — a noun
reading in *either* is enough (their disagreements were reviewed and resolved this way,
since each is incomplete in different places). A singular reconstructed from "ед." that
neither dictionary knows is accepted as a noun (the orthographic note attests it).
3. A word **both dictionaries miss** is classified by the orthographic **note**
(`-ая, -ое` ⇒ adjective; `-ть`, `сов./несов.` ⇒ verb; single genitive `-а/-и` or
`нескл., м./ж./с.` ⇒ noun). A note-noun goes straight to `scrabble.txt`; an adjective or
verb is dropped; anything undecided goes to `undefined.txt`.
4. **Variant rescue**: when the dictionary joins two spellings with "и" (`травмопункт и
травмпункт`, `регги и реггей`) and one is already a confirmed noun, the other is moved
from review/undefined into the result as well, propagated transitively through chains.
The plural-form variants the dictionaries already resolve never reach this step.
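The transitive propagation in step 4 amounts to a breadth-first walk over the "и" pairs. The following is a minimal sketch under the assumption that the pairs are available as tuples and that only words still pending (review or undefined) may be rescued; the function name and parameters are hypothetical.

```python
from collections import defaultdict, deque


def rescue_variants(confirmed, pending, pairs):
    """Return confirmed nouns plus pending words reachable through variant pairs."""
    graph = defaultdict(set)
    for a, b in pairs:
        graph[a].add(b)
        graph[b].add(a)
    result = set(confirmed)
    queue = deque(w for w in confirmed if w in graph)
    while queue:
        word = queue.popleft()
        for variant in graph[word]:
            if variant in pending and variant not in result:
                result.add(variant)
                queue.append(variant)  # keep walking the chain
    return result
```

A pair in which neither spelling is confirmed contributes nothing, so both spellings stay in the undefined tail.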
The nominative singular always comes from the dictionary that recognised the word, or from
the orthographic `ед.` note — never from a predictor guess (libmorph and the predictor
mis-lemmatise out-of-dictionary words, e.g. `витебчане → витебчан` instead of `витебчанин`).
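The routing order of steps 1 to 3 can be condensed into a small Python sketch. It is deliberately simplified and rests on several assumptions: each dictionary is modelled as a plain `dict` mapping a word to one `(pos, lemma)` reading, the length bound is read as 2 to 15 letters, a word known to a dictionary only as a non-noun is treated as dropped, and `classify_note` covers only some of the note patterns quoted above (the `-ть` pattern is omitted). None of the names below come from `ru_stage2.py`.

```python
import re


def classify_note(note):
    """Classify an orthographic note as 'noun', 'adjective', 'verb' or None."""
    note = note.strip()
    if "-ая" in note and "-ое" in note:
        return "adjective"
    if "сов." in note:  # also matches "несов."
        return "verb"
    if re.fullmatch(r"-[аи]", note) or re.search(r"нескл\.,\s*[мжс]\.", note):
        return "noun"
    return None


def route(word, oc, lm, note="", from_ed=False):
    """Return (bucket, lemma) for one Stage-1 word."""
    if not 2 <= len(word) <= 15:
        return "dropped", None
    # Steps 1 and 2: a common-noun reading in either dictionary is enough.
    for source in (oc, lm):
        reading = source.get(word)
        if reading and reading[0] == "noun":
            return "scrabble", reading[1]
    if word in oc or word in lm:
        return "dropped", None  # known, but not as a common noun (assumption)
    # A singular reconstructed from "ед." is attested by the note itself.
    if from_ed:
        return "scrabble", word
    # Step 3: both dictionaries miss the word, so the note decides.
    kind = classify_note(note)
    if kind == "noun":
        return "scrabble", word
    if kind in ("adjective", "verb"):
        return "dropped", None
    return "undefined", None
```

The lemma in the result always comes from a dictionary reading or from the word as attested in the orthographic dictionary, which matches the rule that no predictor guess is used.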
### The libmorph bridge — `libmorph_check.cpp`
libmorph (A. Kovalenko, MIT) ships as `libmorphrus.so`. `libmorph_check` is a thin
stdin→stdout filter: one UTF-8 word per line in, one line out:
```
<known>\t<pos>:<lemma>\t<pos>:<lemma>...
```
`<known>` is the `CheckWord` result (1 = in the dictionary). `<pos>` is `wdInfo & 0x3f`, the
part of speech. The codes were reverse-engineered (the docs omit the table):
| codes | part of speech |
|------|----------------|
| **7–21, 24** | **noun** (all genders / declensions / animacy; pluralia tantum is 24) |
| 1–3 | verb |
| 25, 27 | adjective |
| 28–32 | pronoun |
| 33–36 | numeral |
| 38–39 | **proper noun** (excluded) |
| 48–58 | comparative/adverb |
| 49–53 | function words |
The analyser instance is requested with the key `libmorph.api.v4:utf-8` so words are
passed and lemmas returned in UTF-8.
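On the Python side, one helper output line can be parsed as sketched below. The example reads the noun codes as the range 7 to 21 plus 24 (all values fit in the 6-bit mask `0x3f`); the function names and the sample line in the usage note are hypothetical.

```python
NOUN_CODES = set(range(7, 22)) | {24}  # common nouns; 24 is pluralia tantum


def parse_line(line):
    """Split '<known>\\t<pos>:<lemma>...' into (known, [(pos, lemma), ...])."""
    fields = line.rstrip("\n").split("\t")
    known = fields[0] == "1"
    readings = []
    for field in fields[1:]:
        if not field:
            continue
        pos, _, lemma = field.partition(":")
        readings.append((int(pos), lemma))
    return known, readings


def common_noun_lemmas(line):
    """Return the lemmas of all common-noun readings of a known word."""
    known, readings = parse_line(line)
    if not known:
        return []
    return [lemma for pos, lemma in readings if pos in NOUN_CODES]
```

For an illustrative line such as `1\t7:кот\t38:Кот`, only the reading with code 7 survives, because codes 38 and 39 mark proper nouns and are excluded.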
## Notes & caveats
- The hard tail (≈ 35 000 Stage-1 candidate words) is in **no** morphological
  dictionary; only the orthographic dictionary attests these words, so the PDF note is the
  sole signal there. Compound and very recent nouns (`робототехник`, `толкинист`) live here.
- OpenCorpora and libmorph are near-equal in size (≈ 99 500 words each on `all.txt`)
and ≈ 96 % overlapping, but **complementary** (each contributes ≈ 2 200 unique nouns),
which is why both are kept. The mawo *predictor* "knows" ~98 % of everything by guessing
and is therefore used only as a weak confirming vote, never as dictionary membership.
- Licensing: OpenCorpora data is CC BY-SA 3.0; libmorph is MIT; the orthographic
dictionary has its own copyright. A list derived from CC BY-SA data inherits that licence.