Ilia Denisov 540ee32178 dictprep: Russian orthographic dictionary → Scrabble noun pipeline

Build a committed Russian common-noun word list (dictprep/russian/scrabble.txt)
from the RAN orthographic dictionary, for the Эрудит ruleset.

- Stage 1 (Go, dictprep/ruwords): orfo_dict_2025.txt -> all.txt; extracts
  headwords, reconstructs "ед." singulars (suppressing plurals), pairs "и" variants.
- Stage 2 (Python brain, dictprep/ru_stage2.py): OpenCorpora (mawo-pymorphy3) +
  libmorph + orthographic notes select common nouns (nom. sing.); --trace explains
  a word's fate, --dump writes the in-memory buckets.
- libmorph C++ bridge (libmorph_check.cpp); manual_confirm.txt is merged in.
- orfo_dict_2025.txt is the committed pdftotext source of truth.
- See dictprep/README.md for methodology and reproducibility.

2026-06-01 23:27:17 +02:00


Russian word-list preparation (`dictprep`)

Builds the Russian noun word list for the Scrabble/Эрудит solver out of the official Russian academic orthographic dictionary, cross-checked against two independent morphological dictionaries.

The goal of the pipeline is a list of common nouns in the nominative singular (dictprep/russian/scrabble.txt), plus an ambiguous tail for manual review.

This directory is self-contained tooling for building the word list. It is not part of the solver library. The committed result lives in dictprep/russian/.

Source

orfo_dict_2025.pdf — Русский орфографический словарь РАН (≈ 200 000 entries), the authority for spelling. It encodes declension type in its grammatical notes but does not reliably mark part of speech.

The PDF is git-ignored (large, third-party); place it here as orfo_dict_2025.pdf. Its pdftotext output is committed as russian/orfo_dict_2025.txt, so the word list rebuilds from the text alone — the binary PDF is needed only to regenerate that text.

Outputs (`dictprep/russian/`)

Four files are committed (marked ✓ below); every other bucket stays in the memory of the Stage 2 process (write it out with --dump, query it with --trace WORD).

File	Committed	Meaning
`orfo_dict_2025.txt`	✓	the pdftotext output — the parsed source of truth (the PDF binary is not needed to rebuild).
`all.txt`	✓	Stage 1 base: every clean Cyrillic headword/variant; a plural headword with a singular is replaced by that singular.
`manual_confirm.txt`	✓	hand-reviewed nouns from the undefined tail; the brain merges them into the result.
`scrabble.txt`	✓	Stage 2 result: common nouns, nominative singular (+ pluralia tantum), length 2–15 — the working dictionary.
`undefined.txt`	—	the ambiguous tail; kept in memory, written only with `--dump`.

--dump also writes adjectives.txt, verbs.txt, singulars.txt and fate.tsv (every word with the reason it did or did not reach the dictionary); these are git-ignored debug artifacts. Stage 1 also writes /tmp/ru_{skip,singulars,variants}.txt, intermediate inputs the brain consumes.

Prerequisites

# 1. pdftotext (Poppler)
sudo apt-get install -y poppler-utils

# 2. Go toolchain (Stage 1) — already required by the parent module

# 3. Python + the OpenCorpora analyser (Stage 2)
sudo apt-get install -y python3-venv python3-pip
python3 -m venv ru-venv
ru-venv/bin/pip install mawo-pymorphy3            # bundles OpenCorpora 2025 (words.dawg)

# 4. libmorph — the independent morphological dictionary (Stage 2 cross-check)
sudo apt-get install -y morphrus morphrus-dev moonycode-dev morphapi-dev
g++ -std=c++17 -O2 dictprep/libmorph_check.cpp -lmorphrus -lmoonycode -o dictprep/libmorph_check

If dictprep/libmorph_check is absent, Stage 2 still runs — it simply drops libmorph from the stack and reports libmorph_helper=MISSING.
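
The optional-helper behaviour described above could be wrapped roughly as follows. This is a minimal sketch, not the code in ru_stage2.py: the function name `run_libmorph` and its return shape are assumptions; only the helper path and the one-word-per-line stdin/stdout protocol come from this README.

```python
import os
import subprocess


def run_libmorph(words, helper="dictprep/libmorph_check"):
    """Return {word: raw helper output line}, or None when the helper is not built.

    Hypothetical wrapper: the real Stage 2 script may be organised differently.
    """
    if not (os.path.isfile(helper) and os.access(helper, os.X_OK)):
        return None  # the caller then drops libmorph from the stack

    proc = subprocess.run(
        [helper],
        input="".join(word + "\n" for word in words),  # one UTF-8 word per line
        capture_output=True,
        encoding="utf-8",
        check=True,
    )
    # The helper answers with exactly one line per input word, in order.
    return dict(zip(words, proc.stdout.splitlines()))
```

A caller would treat a `None` result as the `libmorph_helper=MISSING` case and continue with OpenCorpora and the orthographic notes only.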

How to run

# Stage 0 — PDF -> plain text (committed as the source of truth; run once)
pdftotext dictprep/orfo_dict_2025.pdf dictprep/russian/orfo_dict_2025.txt

# Stage 1 — build the base word list (Go): dictprep/russian/all.txt + /tmp/ru_*.txt
go run ./dictprep/ruwords

# Stage 2 — the brain (Python + mawo + libmorph): writes scrabble.txt
ru-venv/bin/python dictprep/ru_stage2.py

# ask how a word did or did not reach the dictionary
ru-venv/bin/python dictprep/ru_stage2.py --trace травмпункт
# also write the in-memory buckets (undefined, adjectives, verbs, singulars, fate.tsv)
ru-venv/bin/python dictprep/ru_stage2.py --dump

The ruwords flags -from and -to (defaults 452 and 168808) bound the word-list section of russian/orfo_dict_2025.txt: line 452 is the first entry (а1, …) and line 168808 is the last (я́щурный). Everything above line 452 is the prose preface and is skipped. Re-verify both bounds whenever the PDF is re-exported.
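
One way to re-verify the bounds after a re-export is a quick check that the two boundary lines still hold the expected headwords. The sketch below is an assumption about how such a check might look (the helper name `bounds_look_right` is hypothetical); it only tests that the first line starts with "а" and the last with "ящурный" once stress marks are removed.

```python
STRESS_MARKS = {0x0300: None, 0x0301: None}  # combining grave and acute accents


def bounds_look_right(lines, first=452, last=168808):
    """lines: the text file as a list of lines; first/last are 1-based line numbers."""
    if len(lines) < last:
        return False

    def head(line):
        # strip() also removes the form feed pdftotext emits at page tops
        return line.strip().translate(STRESS_MARKS)

    return (head(lines[first - 1]).startswith("а")
            and head(lines[last - 1]).startswith("ящурный"))
```

When loading the file for this check, split on "\n" only (for example `text.split("\n")`): Python's `str.splitlines()` also breaks on form feeds, which would shift the line numbers.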

Algorithm

Stage 1 — `ruwords` (Go)

Per dictionary line in [from, to] it collects, normalised (stress marks U+0300/U+0301 stripped, lowercased, ё kept, hyphenated/capitalised/non-Cyrillic rejected):

- the headword (leading token). Leading whitespace, including the form feed \f that pdftotext puts at the top of every page, is trimmed; otherwise the first headword of each page is lost;
- the singular of a plural headword when the entry gives it after ед., either in full (ящеры, …, ед. ящер) or as a replacement suffix (…, ед. -вец, spliced where the suffix best overlaps the headword); the plural is then dropped (a plural that has a singular is never needed) and the singular is also recorded in /tmp/ru_singulars.txt;
- variant headwords after и that carry their own grammatical note (аблатив, -а и аблятив, -а; регги и реггей, нескл.), excluding inflected forms.
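
The normalisation rules listed above can be illustrated in Python, although the actual Stage 1 tool is written in Go. This is a sketch under stated assumptions: the names `leading_token` and `normalise` are hypothetical, and "capitalised" is taken to mean "first letter is upper case".

```python
import re

STRESS_MARKS = {0x0300: None, 0x0301: None}  # combining grave and acute accents
CYRILLIC_WORD = re.compile(r"[а-яё]+")       # lower-case Russian letters, ё included


def leading_token(line):
    """First token of a dictionary line; lstrip() also removes a page-top form feed."""
    match = re.match(r"[^\s,]+", line.lstrip())
    return match.group(0) if match else None


def normalise(token):
    """Return the cleaned word, or None when the token must be rejected."""
    token = token.translate(STRESS_MARKS)   # drop stress marks, leave ё untouched
    if not token or token[0].isupper():     # assumption: capitalised means proper name
        return None
    token = token.lower()
    # Hyphens, digits and non-Cyrillic letters all fail the full match.
    return token if CYRILLIC_WORD.fullmatch(token) else None
```

Only the two stress code points are removed rather than applying a Unicode decomposition, because decomposing would split ё into е plus a combining diaeresis.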

Everything else (every maximal Cyrillic token not selected above) goes to /tmp/ru_skip.txt, a safety net for a later morphology re-check.
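
The suffix form of the ед. note mentioned above needs a splice rule. The README does not spell out what "best overlaps" means, so the sketch below assumes one plausible reading: take the longest prefix of the suffix that occurs in the plural, at its rightmost position, and replace everything from there. The function name and the sample note `-чанин` are illustrative, not taken from the dictionary.

```python
def splice_singular(plural, suffix):
    """Rebuild a singular from a plural headword and an 'ед. -xxx' suffix.

    Assumed rule: longest prefix of the suffix found in the plural wins;
    returns None when there is no overlap at all.
    """
    suffix = suffix.lstrip("-")
    for size in range(len(suffix), 0, -1):
        index = plural.rfind(suffix[:size])
        if index > 0:  # keep at least one letter of the original stem
            return plural[:index] + suffix
    return None
```

With this rule, `splice_singular("витебчане", "-чанин")` overlaps on "чан" and yields "витебчанин".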

Stage 2 — `ru_stage2.py` (Python)

Each Stage-1 word (length 2–15) is routed by three sources, most authoritative first:

1. OpenCorpora (words.dawg, read directly, not via the predictor): a common-noun reading ⇒ keep the OpenCorpora lemma. The full OpenCorpora common-noun lexicon is also added (so nouns absent from the PDF are included).
2. libmorph (independent dictionary, via libmorph_check): a common-noun reading ⇒ keep the libmorph lemma. The two dictionaries are treated as complementary: a noun reading in either is enough (their disagreements were reviewed and resolved this way, since each is incomplete in different places). A singular reconstructed from "ед." that neither dictionary knows is accepted as a noun (the orthographic note attests it).
3. A word both dictionaries miss is classified by the orthographic note (-ая, -ое ⇒ adjective; -ть, сов./несов. ⇒ verb; single genitive -а/-и or нескл., м./ж./с. ⇒ noun). A note-noun goes straight to scrabble.txt; an adjective or verb is dropped; anything undecided goes to undefined.txt.
4. Variant rescue: when the dictionary joins two spellings with "и" (травмопункт и травмпункт, регги и реггей) and one is already a confirmed noun, the other is moved from review/undefined into the result as well, propagated transitively through chains. The plural-form variants the dictionaries already resolve never reach this step.
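
The transitive variant rescue can be pictured as a breadth-first walk over the "и" pairs, starting from the confirmed nouns. The following is a minimal sketch, assuming the pairs, the confirmed set and the pending (review/undefined) set are available as plain Python collections; `rescue_variants` is a hypothetical name, and the third spelling in the usage note is invented to show a chain.

```python
from collections import defaultdict, deque


def rescue_variants(pairs, confirmed, pending):
    """Return the pending words reachable from a confirmed noun via variant pairs."""
    graph = defaultdict(set)
    for left, right in pairs:
        graph[left].add(right)
        graph[right].add(left)

    rescued = set()
    queue = deque(word for word in confirmed if word in graph)
    while queue:
        word = queue.popleft()
        for other in graph[word]:
            # Only words still awaiting a decision are rescued and propagate further.
            if other in pending and other not in rescued:
                rescued.add(other)
                queue.append(other)
    return rescued
```

For example, with the pairs (травмопункт, травмпункт), (регги, реггей) and a made-up link (реггей, рэгги), confirming травмопункт and регги rescues травмпункт, реггей and, through the chain, рэгги. A word that is not pending (for instance, one already dropped as an adjective) does not carry the chain further in this sketch.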

The nominative singular always comes from the dictionary that recognised the word, or from the orthographic ед. note — never from a predictor guess (libmorph and the predictor mis-lemmatise out-of-dictionary words, e.g. витебчане → витебчан instead of витебчанин).
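
The routing order described in this section can be summarised as a small decision cascade. The sketch below is an approximation under several assumptions: the function names are hypothetical, the note is tokenised on commas and whitespace, "нескл." or a gender marker alone is taken as enough for a noun, and words that a dictionary knows only as non-nouns are not modelled.

```python
import re


def classify_note(note):
    """Classify an orthographic note as 'adjective', 'verb', 'noun' or 'undefined'."""
    tokens = [t for t in re.split(r"[,;\s]+", note) if t]
    if "-ая" in tokens or "-ое" in tokens:
        return "adjective"
    if "-ть" in tokens or "сов." in tokens or "несов." in tokens:
        return "verb"
    if "нескл." in tokens or {"м.", "ж.", "с."} & set(tokens):
        return "noun"
    endings = [t for t in tokens if t.startswith("-")]
    if len(endings) == 1 and endings[0] in ("-а", "-и"):
        return "noun"  # a single genitive ending
    return "undefined"


def route(word, oc_lemma=None, lm_lemma=None, from_ed=False, note=""):
    """Return (bucket, form). Lemmas are the common-noun lemmas, or None if absent."""
    if oc_lemma:                 # OpenCorpora is consulted first
        return ("scrabble", oc_lemma)
    if lm_lemma:                 # then libmorph
        return ("scrabble", lm_lemma)
    if from_ed:                  # singular attested by an "ед." note
        return ("scrabble", word)
    kind = classify_note(note)
    if kind == "noun":
        return ("scrabble", word)
    if kind in ("adjective", "verb"):
        return ("dropped", word)
    return ("undefined", word)
```

Note that the lemma always comes from a dictionary or from the word itself, never from a predictor, which mirrors the rule stated above.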

The libmorph bridge — `libmorph_check.cpp`

libmorph (A. Kovalenko, MIT) ships as libmorphrus.so. libmorph_check is a thin stdin→stdout filter: one UTF-8 word per line in, one line out:

<known>\t<pos>:<lemma>\t<pos>:<lemma>...

<known> is CheckWord (1 = in the dictionary). <pos> is wdInfo & 0x3f, the part of speech. The codes were reverse-engineered (the docs omit the table):

codes	part of speech
7–21, 24	noun (all genders / declensions / animacy; pluralia tantum is 24)
1–3	verb
25, 27	adjective
28–32	pronoun
33–36	numeral
38–39	proper noun (excluded)
48–58	comparative/adverb
49–53	function words
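
On the Python side, a helper output line in the format above could be parsed as follows. This is a sketch only: the function names are hypothetical, the noun codes are copied from the table, and requiring `<known>` to be 1 is an assumption made here to keep predictor-style guesses out. The sample lines in the usage note are illustrative, not real helper output.

```python
NOUN_CODES = set(range(7, 22)) | {24}  # 7-21 plus 24 (pluralia tantum), per the table


def parse_helper_line(line):
    """Split '<known>\\t<pos>:<lemma>...' into (known, [(pos, lemma), ...])."""
    fields = line.rstrip("\n").split("\t")
    known = fields[0] == "1"
    readings = []
    for field in fields[1:]:
        if not field:
            continue  # tolerate a trailing tab
        pos, _, lemma = field.partition(":")
        readings.append((int(pos), lemma))
    return known, readings


def noun_lemmas(line):
    """Lemmas of the common-noun readings, only for words libmorph really knows."""
    known, readings = parse_helper_line(line)
    if not known:
        return []
    return [lemma for pos, lemma in readings if pos in NOUN_CODES]
```

For instance, a line such as `"1\t7:ящер\t25:ящерный"` would yield `["ящер"]`, while a line starting with `0` yields an empty list regardless of the readings that follow.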

The analyser instance is requested with the key libmorph.api.v4:utf-8 so words are passed and lemmas returned in UTF-8.

Notes & caveats

- The hard tail (≈ 35 000 Stage-1 candidate words) appears in no morphological dictionary; only the orthographic dictionary attests these words, so the PDF note is the sole signal for them. Compound and very recent nouns (робототехник, толкинист) live here.
- OpenCorpora and libmorph are near-equal in size (≈ 99 500 words each on all.txt) and ≈ 96 % overlapping, but complementary (each contributes ≈ 2 200 unique nouns), which is why both are kept. The mawo predictor "knows" ~98 % of everything by guessing and is therefore used only as a weak confirming vote, never as dictionary membership.
- Licensing: OpenCorpora data is CC BY-SA 3.0; libmorph is MIT; the orthographic dictionary has its own copyright. A list derived from CC BY-SA data inherits that licence.
