dictprep: Russian orthographic dictionary → Scrabble noun pipeline

Build a committed Russian common-noun word list (dictprep/russian/scrabble.txt) from the RAN orthographic dictionary, for the Эрудит ruleset.
- Stage 1 (Go, dictprep/ruwords): orfo_dict_2025.txt -> all.txt; extracts headwords, reconstructs "ед." singulars (suppressing plurals), pairs "и" variants.
- Stage 2 (Python brain, dictprep/ru_stage2.py): OpenCorpora (mawo-pymorphy3) + libmorph + orthographic notes select common nouns (nom. sing.); --trace explains a word's fate, --dump writes the in-memory buckets.
- libmorph C++ bridge (libmorph_check.cpp); manual_confirm.txt is merged in.
- orfo_dict_2025.txt is the committed pdftotext source of truth.
- See dictprep/README.md for methodology and reproducibility.
2026-06-01 23:27:17 +02:00
parent 15c7959d96
commit 540ee32178
9 changed files with 402226 additions and 1 deletions
@@ -6,4 +6,13 @@
 # Local scratch
 /tmp/
-*.pdf
+
 # Compiled libmorph bridge (build artifact; see dictprep/README.md)
 /dictprep/libmorph_check
 # Stage 2 --dump debug buckets (committed: all, scrabble, manual_confirm, orfo_dict_2025)
 /dictprep/russian/undefined.txt
 /dictprep/russian/adjectives.txt
 /dictprep/russian/verbs.txt
 /dictprep/russian/singulars.txt
 /dictprep/russian/fate.tsv
@@ -0,0 +1,164 @@
 # Russian word-list preparation (`dictprep`)
 Builds the Russian **noun** word list for the Scrabble/Эрудит solver out of the official
 Russian academic **orthographic dictionary**, cross-checked against two independent
 morphological dictionaries.
 The goal of the pipeline is a list of **common nouns in the nominative singular**
 (`dictprep/russian/scrabble.txt`), plus an ambiguous tail for manual review.
 > This directory is self-contained tooling for *building* the word list. It is not part
 > of the solver library. The committed result lives in `dictprep/russian/`.
 ## Source
 `orfo_dict_2025.pdf` — *Русский орфографический словарь РАН* (≈ 200 000 entries), the
 authority for **spelling**. It encodes declension type in its grammatical notes but does
 **not** reliably mark part of speech.
 - Source: <https://ruslang.ru/sites/default/files/doc/normativnyje_slovari/orfograficheskij_slovar.pdf>
 - Mirror: <https://rus-gos.spbu.ru/index.php/dictionary>
 The PDF is git-ignored (large, third-party); place it here as `orfo_dict_2025.pdf`. Its
 pdftotext output is committed as `russian/orfo_dict_2025.txt`, so the word list rebuilds
 from the text alone — the binary PDF is needed only to regenerate that text.
 ## Outputs (`dictprep/russian/`)
 The committed result is **four** files; every other bucket stays in the Stage-2
 process's memory (dump it with `--dump`, query it with `--trace WORD`).
 | File | Committed | Meaning |
 |------|:--:|---------|
 | `orfo_dict_2025.txt` | ✓ | the pdftotext output — the parsed source of truth (the PDF binary is not needed to rebuild). |
 | `all.txt` | ✓ | Stage 1 base: every clean Cyrillic headword/variant; a plural headword with a singular is replaced by that singular. |
 | `manual_confirm.txt` | ✓ | hand-reviewed nouns from the undefined tail; the brain merges them into the result. |
 | `scrabble.txt` | ✓ | **Stage 2 result**: common nouns, nominative singular (+ pluralia tantum), length 2–15 — the working dictionary. |
 | `undefined.txt` | — | the ambiguous tail; kept in memory, written only with `--dump`. |
 `--dump` also writes `adjectives.txt`, `verbs.txt`, `singulars.txt` and `fate.tsv` (every
 word with the reason it did or did not reach the dictionary); these are git-ignored debug
 artifacts. Stage 1 also writes `/tmp/ru_{skip,singulars,variants}.txt`, intermediate inputs
 the brain consumes.
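As a rough illustration of how a dumped `fate.tsv` might be inspected, the sketch below counts words per outcome. It assumes only the `word<TAB>reason` layout described above, and that the outcome is the text before the first colon of the reason, as in the reasons `ru_stage2.py` emits. The helper `summarize_fate` and the sample words are hypothetical.

```python
from collections import Counter


def summarize_fate(lines):
    """Count words per outcome from fate.tsv-style lines (word<TAB>reason)."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        _word, _, reason = line.partition("\t")
        # The outcome is the part of the reason before the first colon.
        counts[reason.split(":", 1)[0]] += 1
    return counts


# Placeholder words paired with reason strings of the kind Stage 2 writes.
SAMPLE = [
    "первое\tscrabble: сущ. по OpenCorpora\n",
    "второе\tscrabble: сущ. по libmorph\n",
    "третье\tотброшено: глагол (помета орфословаря)\n",
    "четвёртое\tundefined: неоднозначное (нет в словарях, помета не определяет)\n",
]

if __name__ == "__main__":
    for outcome, n in summarize_fate(SAMPLE).most_common():
        print(outcome, n)
```

On a real dump the same function can be fed the open file object directly, since it only iterates over lines.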
 ## Prerequisites
 ```sh
 # 1. pdftotext (Poppler)
 sudo apt-get install -y poppler-utils
 # 2. Go toolchain (Stage 1) — already required by the parent module
 # 3. Python + the OpenCorpora analyser (Stage 2)
 sudo apt-get install -y python3-venv python3-pip
 python3 -m venv ru-venv
 ru-venv/bin/pip install mawo-pymorphy3            # bundles OpenCorpora 2025 (words.dawg)
 # 4. libmorph — the independent morphological dictionary (Stage 2 cross-check)
 sudo apt-get install -y morphrus morphrus-dev moonycode-dev morphapi-dev
 g++ -std=c++17 -O2 dictprep/libmorph_check.cpp -lmorphrus -lmoonycode -o dictprep/libmorph_check
 ```
 If `dictprep/libmorph_check` is absent, Stage 2 still runs: the libmorph lookup returns
 nothing, so libmorph simply drops out of the stack.
 ## How to run
 ```sh
 # Stage 0 — PDF -> plain text (committed as the source of truth; run once)
 pdftotext dictprep/orfo_dict_2025.pdf dictprep/russian/orfo_dict_2025.txt
 # Stage 1 — build the base word list (Go): dictprep/russian/all.txt + /tmp/ru_*.txt
 go run ./dictprep/ruwords
 # Stage 2 — the brain (Python + mawo + libmorph): writes scrabble.txt
 ru-venv/bin/python dictprep/ru_stage2.py
 # ask how a word did or did not reach the dictionary
 ru-venv/bin/python dictprep/ru_stage2.py --trace травмпункт
 # also write the in-memory buckets (undefined, adjectives, verbs, singulars, fate.tsv)
 ru-venv/bin/python dictprep/ru_stage2.py --dump
 ```
 Stage 1's `-from`/`-to` flags (defaulting to 452/168808) bound the column word-list section of
 `russian/orfo_dict_2025.txt` (line 452 = the first entry `а1, …`; line 168808 = the last,
 `я́щурный`). The preface above line 452 is prose and is skipped. Stage 2 hard-codes the same
 bounds as `WL_FROM`/`WL_TO` in `ru_stage2.py`; verify both if the PDF is re-exported.
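One way to check the bounds after a re-export is to confirm that the first and last lines of the section start with the expected headwords. The sketch below is an assumption about how such a check could look, not part of the pipeline; `bounds_ok` is a hypothetical helper and the four-line text is a toy stand-in for the real file.

```python
def destress(s):
    """Drop combining grave/acute accents (U+0300/U+0301) and lowercase."""
    return "".join(c for c in s if ord(c) not in (0x0300, 0x0301)).lower()


def bounds_ok(lines, first, last, first_head, last_head):
    """True if lines first..last (1-based, inclusive) start and end on the expected headwords."""
    section = lines[first - 1:last]
    if not section:
        return False
    # lstrip() also removes the form feed pdftotext puts at the top of a page.
    head = destress(section[0].lstrip())
    tail = destress(section[-1].lstrip())
    return head.startswith(first_head) and tail.startswith(last_head)


# Toy stand-in for the pdftotext output: one line of preface, then three entries.
TOY = ["Предисловие", "а1, ...", "\fаба\u0301, -ы\u0301", "я\u0301щурный"]

if __name__ == "__main__":
    print(bounds_ok(TOY, 2, 4, "а1", "ящурный"))
    print(bounds_ok(TOY, 1, 4, "а1", "ящурный"))
```

Against the committed text the call would use the documented bounds, for example `bounds_ok(lines, 452, 168808, "а1", "ящурный")`.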
 ## Algorithm
 ### Stage 1 — `ruwords` (Go)
 Per dictionary line in `[from, to]` it collects, normalised (stress marks U+0300/U+0301
 stripped, lowercased, `ё` kept, hyphenated/capitalised/non-Cyrillic rejected):
 - the **headword** (leading token). Leading whitespace including the form-feed `\f`
  pdftotext puts at every page top is trimmed — otherwise the first headword of each page
  is lost;
 - the **singular of a plural headword** when the entry gives it after `ед.`, in full
  (`ящеры, …, ед. ящер`) or as a replacement suffix (`…, ед. -вец`, spliced where the
  suffix best overlaps the headword); the plural is then dropped (a plural that has a
  singular is never needed) and the singular is also recorded (`/tmp/ru_singulars.txt`);
 - **variant headwords** after `и` that carry their own grammatical note
  (`аблатив, -а и аблятив, -а`; `регги и реггей, нескл.`), excluding inflected forms.
 Everything else (every maximal Cyrillic token not selected above) goes to
 `/tmp/ru_skip.txt`, a safety net for a later morphology re-check.
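The normalisation and the suffix splice can be sketched in a few lines. This is a Python paraphrase of what the Go tool is described as doing, written for illustration only; the names `clean_word` and `reconstruct_singular` mirror the Go functions, and the pair `новгородцы` / `-дец` is a made-up input, not a quoted dictionary entry.

```python
STRESS = {"\u0300", "\u0301"}  # combining grave and acute accents
SOFT_HYPHEN = "\u00ad"


def clean_word(run):
    """Normalise a headword token; return None for capitalised, hyphenated or non-Cyrillic runs."""
    if not run or run[0].isupper():
        return None
    out = []
    for ch in run:
        if ch in STRESS or ch == SOFT_HYPHEN:
            continue  # dropped silently
        if ch == "-":
            return None  # a real hyphen means a hyphenated word
        out.append(ch.lower())
    word = "".join(out)
    if not word or not all("а" <= c <= "я" or c == "ё" for c in word):
        return None
    return word


def reconstruct_singular(headword, suffix):
    """Splice the suffix in at the headword position where it overlaps longest."""
    best_k, best_len = -1, 0
    for k in range(len(headword)):
        m = 0
        while k + m < len(headword) and m < len(suffix) and headword[k + m] == suffix[m]:
            m += 1
        if m > best_len:  # strict comparison: the first best position wins
            best_k, best_len = k, m
    return headword[:best_k] + suffix if best_k >= 0 else ""


if __name__ == "__main__":
    print(clean_word("я\u0301щер"))
    print(reconstruct_singular("новгородцы", "дец"))
```

Because the splice is a heuristic, a suffix whose first letter occurs earlier in the stem with an equally long overlap is spliced at that earlier position, which is one reason Stage 2 re-checks the result.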
 ### Stage 2 — `ru_stage2.py` (Python)
 Each Stage-1 word (length 2–15) is routed by three sources, most authoritative first; a
 final variant-rescue pass (step 4) then revisits the undecided tail:
 1. **OpenCorpora** (`words.dawg`, read directly — *not* the predictor): a common-noun
   reading ⇒ keep the OpenCorpora lemma. The full OpenCorpora common-noun lexicon is also
   added (so nouns absent from the PDF are included).
 2. **libmorph** (independent dictionary, via `libmorph_check`): a common-noun reading ⇒
   keep the libmorph lemma. The two dictionaries are treated as **complementary** — a noun
   reading in *either* is enough (their disagreements were reviewed and resolved this way,
   since each is incomplete in different places). A singular reconstructed from "ед." that
   neither dictionary knows is accepted as a noun (the orthographic note attests it).
 3. A word **both dictionaries miss** is classified by the orthographic **note**
   (`-ая, -ое` ⇒ adjective; `-ть`, `сов./несов.` ⇒ verb; single genitive `-а/-и` or
   `нескл., м./ж./с.` ⇒ noun). A note-noun goes straight to `scrabble.txt`; an adjective or
   verb is dropped; anything undecided goes to `undefined.txt`.
 4. **Variant rescue**: when the dictionary joins two spellings with "и" (`травмопункт и
   травмпункт`, `регги и реггей`) and one is already a confirmed noun, the other is moved
   from review/undefined into the result as well, propagated transitively through chains.
   The plural-form variants the dictionaries already resolve never reach this step.
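The variant rescue in step 4 amounts to a fixed-point loop over the "и" pairs. The sketch below follows the loop in `ru_stage2.py` but is reduced to plain sets; `rescue_variants` is a hypothetical name, and which words start out confirmed or pending is chosen here only for the demonstration.

```python
def rescue_variants(confirmed, pending, pairs):
    """Move pending words into confirmed when an "и" pair links them to a confirmed noun."""
    confirmed, pending = set(confirmed), set(pending)
    changed = True
    while changed:  # repeat until a full pass rescues nothing, so chains propagate
        changed = False
        for a, b in pairs:
            for x, y in ((a, b), (b, a)):  # the pair is symmetric
                if x in confirmed and y in pending:
                    confirmed.add(y)
                    pending.discard(y)
                    changed = True
    return confirmed, pending


if __name__ == "__main__":
    pairs = [("травмопункт", "травмпункт"), ("регги", "реггей")]
    done, left = rescue_variants({"травмопункт"}, {"травмпункт", "реггей"}, pairs)
    print(sorted(done), sorted(left))
```

In this run `реггей` stays pending because its partner `регги` is not among the confirmed nouns; a word is rescued only through a partner that has already been accepted.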
 The nominative singular always comes from the dictionary that recognised the word, or from
 the orthographic `ед.` note — never from a predictor guess (libmorph and the predictor
 mis-lemmatise out-of-dictionary words, e.g. `витебчане → витебчан` instead of `витебчанин`).
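The priority order of steps 1 to 3 can be summarised as a single decision function. The sketch below mirrors the order of checks in `build()` in `ru_stage2.py`, including the case where a dictionary knows the word only as a non-noun; the function name `route`, its flat argument list and the outcome labels are simplifications invented for this example.

```python
def route(word, oc_noun, oc_known, lm_lemma, lm_known, ed_singulars, note_class):
    """Return an outcome label for one Stage-1 word.

    oc_noun / oc_known: OpenCorpora has a common-noun reading / any reading.
    lm_lemma / lm_known: libmorph noun lemma (or None) / any reading.
    ed_singulars: singulars reconstructed from "ед." notes.
    note_class: "noun", "adj", "verb" or anything else for undecided.
    """
    if oc_noun:
        return "noun:opencorpora"
    if lm_lemma is not None:
        return "noun:libmorph"
    if oc_known or lm_known:
        return "dropped:known-non-noun"  # a dictionary knows it, but not as a noun
    if word in ed_singulars:
        return "noun:ed-note"
    by_note = {"noun": "noun:note", "adj": "dropped:adj", "verb": "dropped:verb"}
    return by_note.get(note_class, "undefined")


if __name__ == "__main__":
    print(route("aa", False, False, None, False, {"aa"}, None))
    print(route("bb", False, True, None, False, set(), "noun"))
```

The second call shows why the order matters: a word OpenCorpora knows only as a non-noun is dropped before its orthographic note is ever consulted.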
 ### The libmorph bridge — `libmorph_check.cpp`
 libmorph (A. Kovalenko, MIT) ships as `libmorphrus.so`. `libmorph_check` is a thin
 stdin→stdout filter: one UTF-8 word per line in, one line out:
 ```
 <known>\t<pos>:<lemma>\t<pos>:<lemma>...
 ```
 `<known>` is `CheckWord` (1 = in the dictionary). `<pos>` is `wdInfo & 0x3f`, the part of
 speech. The codes were reverse-engineered (the docs omit the table):
 | codes | part of speech |
 |------|----------------|
 | **7–21, 24** | **noun** (all genders / declensions / animacy; pluralia tantum is 24) |
 | 1–3 | verb |
 | 25, 27 | adjective |
 | 28–32 | pronoun |
 | 33–36 | numeral |
 | 38–39 | **proper noun** (excluded) |
 | 48–58 | comparative/adverb |
 | 49–53 | function words |
 The analyser instance is requested with the key `libmorph.api.v4:utf-8` so words are
 passed and lemmas returned in UTF-8.
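Parsing one output line of the bridge is straightforward. The sketch below follows the logic of `libmorph_analyze` in `ru_stage2.py` for a single line; `parse_line` is a hypothetical helper, and the sample lines are invented to show the format rather than captured from a real libmorph run.

```python
NOUN_CODES = set(range(7, 22)) | {24}  # 7..21 plus 24 (pluralia tantum)


def parse_line(word, line):
    """Turn '<known>\\t<pos>:<lemma>...' into (known, noun_lemma_or_None, codes)."""
    fields = line.rstrip("\n").split("\t")
    known = fields[0] == "1"
    codes, noun_lemmas = set(), []
    for field in fields[1:]:
        code, sep, lemma = field.partition(":")
        if not sep or not code.isdigit():
            continue  # ignore malformed fields
        codes.add(int(code))
        if int(code) in NOUN_CODES:
            noun_lemmas.append(lemma)
    if not noun_lemmas:
        noun_lemma = None
    elif word in noun_lemmas:
        noun_lemma = word  # prefer the reading whose lemma is the word itself
    else:
        noun_lemma = noun_lemmas[0]
    return known, noun_lemma, codes


if __name__ == "__main__":
    # Invented sample lines in the documented format.
    print(parse_line("стол", "1\t7:стол"))
    print(parse_line("бежать", "1\t1:бежать"))
    print(parse_line("ыыы", "0"))
```

Preferring the lemma equal to the input word keeps a headword that is already a nominative singular from being replaced by the lemma of a homonymous form.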
 ## Notes & caveats
 - The hard tail (≈ 35 000 Stage-1 candidates) is in **no** morphological
  dictionary; only the orthographic dictionary attests these words, so the PDF note is the
  sole signal there. Compound and very recent nouns (`робототехник`, `толкинист`) live here.
 - OpenCorpora and libmorph are near-equal in size (≈ 99 500 words each on `all.txt`)
  and ≈ 96 % overlapping, but **complementary** (each contributes ≈ 2 200 unique nouns),
  which is why both are kept. The mawo *predictor* "knows" ~98 % of everything by guessing
  and is therefore used only as a weak confirming vote, never as dictionary membership.
 - Licensing: OpenCorpora data is CC BY-SA 3.0; libmorph is MIT; the orthographic
  dictionary has its own copyright. A list derived from CC BY-SA data inherits that licence.
@@ -0,0 +1,47 @@
 // libmorph_check: a thin stdin->stdout bridge to the libmorph Russian morphological
 // analyser, for use by the Stage-2 classifier (dictprep/ru_stage2.py).
 //
 // Reads one word per line; the bytes are passed through verbatim in the encoding that
 // the factory key selects (UTF-8 by default, see main). For each word it writes
 // a line:
 //
 //     <known>\t<pos>:<lemma>\t<pos>:<lemma>...
 //
 // where <known> is CheckWord's result (1 = in the dictionary, 0 = not), and each
 // following field is one lexeme: its part of speech (wdInfo & 0x3f) and lemma.
 //
 // Build: g++ -std=c++17 -O2 dictprep/libmorph_check.cpp -lmorphrus -lmoonycode -o dictprep/libmorph_check
 #include <libmorph/rus.h>
 #include <libmorph/api.hpp>
 #include <cstdio>
 #include <iostream>
 #include <string>
 int main(int argc, char** argv) {
  // The factory key selects the code page: "libmorph.api.v4:<charset>". Use the
  // UTF-8 instance so words pass through verbatim. IMlmaMbXX only adds non-virtual
  // convenience wrappers over IMlmaMb, so the filled pointer can be used as such.
  const char* key = argc > 1 ? argv[1] : "libmorph.api.v4:utf-8";
  IMlmaMbXX* mlma = nullptr;
  int rc = mlmaruGetAPI(key, (void**)&mlma);
  if (mlma == nullptr) {
    std::fprintf(stderr, "libmorph_check: GetAPI('%s') failed, rc=%d\n", key, rc);
    return 1;
  }
  std::string line;
  while (std::getline(std::cin, line)) {
    if (!line.empty() && line.back() == '\r') line.pop_back();
    IMlmaMbXX::inword w(line.c_str(), line.size());
    int known = mlma->CheckWord(w, sfIgnoreCapitals);
    std::cout << known;
    try {
      for (auto& lx : mlma->Lemmatize(w, sfIgnoreCapitals)) {
        unsigned pos = lx.ngrams > 0 ? (lx.pgrams[0].wdInfo & 0x3f) : 0xffu;
        std::cout << '\t' << pos << ':' << (lx.plemma ? lx.plemma : "");
      }
    } catch (...) {
      // Lemmatize failed: keep whatever fields were already written and move on.
    }
    std::cout << '\n';
  }
  return 0;
 }
@@ -0,0 +1,341 @@
 #!/usr/bin/env python3
 """Stage 2 — the "brain" of the Russian Scrabble word-list pipeline.
 It reads the Stage-1 base word list (built once by ruwords so the heavy PDF is not
 re-parsed) together with the grammatical notes and the singular/variant structure, runs
 the whole noun-selection logic in memory, and writes a minimal result (undefined.txt only with --dump):
    dictprep/russian/scrabble.txt   — the working dictionary (common nouns, nom. sing.)
    dictprep/russian/undefined.txt  — the ambiguous tail, left for manual review
 (dictprep/russian/all.txt is the Stage-1 base.) Every other bucket — adjectives, verbs,
 the merged note-nouns, singulars, variants — stays in memory. Pass --dump to also write
 them; pass --trace WORD to ask how a single word did or did not reach the dictionary.
 Note: all.txt is a plain word list, so the grammatical notes, "ед." singulars and "и"
 variants are read from the pdftotext output (russian/orfo_dict_2025.txt) and the Stage-1 side files; the
 expensive PDF parse itself runs only once.
 Sources, most authoritative first: OpenCorpora (mawo-pymorphy3), libmorph (libmorph_check),
 and the orthographic dictionary's own notes. See dictprep/README.md.
 Run:  ru-venv/bin/python dictprep/ru_stage2.py [--dump] [--trace WORD]
 """
 import argparse
 import os
 import re
 import subprocess
 HERE = os.path.dirname(os.path.abspath(__file__))
 OUT_DIR = os.path.join(HERE, "russian")
 SLOV = os.path.join(OUT_DIR, "orfo_dict_2025.txt")  # committed pdftotext output (source of truth)
 WL_FROM, WL_TO = 452, 168808  # 1-based inclusive bounds of the column word-list section
 OC_CACHE = "/tmp/oc_nouns.txt"
 LIBMORPH_BIN = os.path.join(HERE, "libmorph_check")
 ALPHABET = "абвгдеёжзийклмнопрстуфхцчшщъыьэюя"
 ORDER = {c: i for i, c in enumerate(ALPHABET)}
 PROPER = {"Name", "Surn", "Patr", "Geox", "Orgn", "Trad"}
 LIBMORPH_NOUN_CODES = set(range(7, 22)) | {24}  # 7..21 plus 24 (pluralia tantum)
 ADJ_END = {"ая", "яя", "ое", "ее", "ье", "ья", "ьи"}
 VERB3 = ("ет", "ёт", "ит", "ют", "ут", "ает", "яет", "ует", "уют", "нет", "жет", "чет")
 GENPL = ("ов", "ёв", "ев", "ей")
 def key(w):
    return [ORDER.get(c, 99) for c in w]
 def destress(s):
    return "".join(c for c in s if ord(c) not in (0x0300, 0x0301)).lower()
 def cyr_ok(w):
    return 2 <= len(w) <= 15 and all(("а" <= c <= "я") or c == "ё" for c in w)
 def load(p):
    if not os.path.exists(p):
        return []
    with open(p, encoding="utf-8") as f:
        return [l.strip() for l in f if l.strip()]
 def write(path, words):
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(sorted(set(words), key=key)) + "\n")
 import mawo_pymorphy3  # noqa: E402
 M = mawo_pymorphy3.MorphAnalyzer()
 D = M._dawg_dict
 def oc_noun_lemmas():
    """Every common-noun lemma (nom. sing. / pluralia tantum) in OpenCorpora's words.dawg."""
    gp, pt = D.get_paradigm, D.parse_tag_string
    para0, tagc = {}, {}
    def g0(pid):
        r = para0.get(pid)
        if r is None:
            suf0, tag0, pre0 = gp(pid, 0)
            _, gr = pt(tag0)
            r = (pre0, suf0, gr)
            para0[pid] = r
        return r
    def gt(pid, idx):
        k = (pid, idx)
        r = tagc.get(k)
        if r is None:
            suf, tag, pre = gp(pid, idx)
            pos, gr = pt(tag)
            r = (suf, pre, pos, gr)
            tagc[k] = r
        return r
    out = set()
    for word, rec in D.words_dawg.iteritems():
        pid, idx = rec
        suf, pre, pos, gr = gt(pid, idx)
        if pos != "NOUN":
            continue
        pre0, suf0, gr0 = g0(pid)
        if (PROPER & gr) or (PROPER & gr0):
            continue
        stem = word[len(pre):len(word) - len(suf)] if suf else word[len(pre):]
        out.add(pre0 + stem + suf0)
    return {w for w in out if cyr_ok(w)}
 def oc_status(word):
    """(is_common_noun, in_dictionary) for word, from OpenCorpora only."""
    parses = D.get_word_parses(word)
    if not parses:
        return False, False
    gp, pt = D.get_paradigm, D.parse_tag_string
    for pid, idx in parses:
        suf, tag, pre = gp(pid, idx)
        pos, gr = pt(tag)
        if pos == "NOUN":
            _, tag0, _ = gp(pid, 0)
            _, gr0 = pt(tag0)
            if not (PROPER & gr or PROPER & gr0):
                return True, True
    return False, True
 def libmorph_analyze(words):
    """Map each word to (known, noun_lemma, codes) per libmorph; noun_lemma is None when it
    is not a common noun there. Empty result if the helper binary is not built."""
    words = list(words)
    if not words or not os.path.exists(LIBMORPH_BIN):
        return {}
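    # Force UTF-8 on the pipe (not the locale default): the helper's default key is the UTF-8 instance.
    proc = subprocess.run([LIBMORPH_BIN], input="\n".join(words), capture_output=True, encoding="utf-8")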
    out = {}
    for w, line in zip(words, proc.stdout.split("\n")):
        fields = line.split("\t")
        known = fields[:1] == ["1"]
        codes, noun_lemmas = set(), []
        for field in fields[1:]:
            code, _, lex = field.partition(":")
            if code.isdigit():
                codes.add(int(code))
                if int(code) in LIBMORPH_NOUN_CODES:
                    noun_lemmas.append(lex)
        lemma = (w if w in noun_lemmas else noun_lemmas[0]) if noun_lemmas else None
        out[w] = (known, lemma, codes)
    return out
 def build_notes():
    """Map each headword (destressed, lowercased) to its grammatical note."""
    def is_hw(ch):
        o = ord(ch)
        return (0x0430 <= o <= 0x044F) or (0x0410 <= o <= 0x042F) or o in (0x0401, 0x0451, 0x0300, 0x0301)
    hmap = {}
    lines = open(SLOV, encoding="utf-8").read().split("\n")
    for l in lines[WL_FROM - 1:WL_TO]:
        s = l.lstrip()
        e = 0
        for ch in s:
            if is_hw(ch):
                e += 1
            else:
                break
        hw = destress(s[:e])
        if hw and hw not in hmap:
            hmap[hw] = destress(s[e:]).strip()
    return hmap
 def classify(w, note):
    """Coarse part of speech of an out-of-dictionary word from its PDF note."""
    if note is None:
        return "amb"
    n = re.sub(r"\([^)]*\)", "", note).strip()  # drop domain/etymology parentheticals
    if "кр. ф" in n or "кр.ф" in n or "прич." in n or "прил." in n:
        return "adj"
    ends = re.findall(r"-([а-яё]+)", n)
    if any(e in ADJ_END for e in ends):
        return "adj"
    if "сов." in n or "несов." in n or "безл." in n:
        return "verb"
    if w.endswith("ся"):  # reflexive: no Russian noun ends in -ся
        return "verb"
    if any(e.endswith(VERB3) for e in ends) and not any(m in n for m in ("ед.", "тв.", "род.", "м.", "ж.", "с.")):
        return "verb"
    if n == "" and w.endswith(("ый", "ий", "ой", "ая", "ое", "ые", "ие", "яя", "ее")):
        return "adj"
    if "нескл" in n:
        return "noun" if any(g in n for g in ("м.", "ж.", "с.", "мн.")) else "amb"
    if ends:
        return "noun"
    if n == "" and w.endswith(("ать", "ять", "еть", "ить", "оть", "уть", "ыть", "ти", "чь")):
        return "verb"
    return "amb"
 def singular(w, note):
    """Nominative singular of a noun headword from the PDF note (authoritative) or, for a
    plural headword without an explicit singular, the mawo lemma; pluralia tantum kept."""
    n = note or ""
    full = re.search(r"ед\.\s+([а-яё]+)", n)
    if full:
        return full.group(1)
    suf = re.search(r"ед\.\s+-([а-яё]+)", n)
    if suf:
        s = suf.group(1)
        i = w.rfind(s[0])
        return w[:i] + s if i > 0 else w
    ends = re.findall(r"-([а-яё]+)", re.sub(r"\([^)]*\)", "", n))
    if ends and ends[0].endswith(GENPL):
        for p in M.parse(w):
            if str(p.tag.POS) == "NOUN":
                return p.normal_form
        return w
    return w
 def build():
    """Run the whole pipeline in memory. Returns the result sets plus a `fate` map giving
    every word's outcome, so a word's path can be traced or the buckets dumped."""
    oc = set(load(OC_CACHE)) or oc_noun_lemmas()
    if not os.path.exists(OC_CACHE):
        write(OC_CACHE, oc)
    hmap = build_notes()
    all_words = load(os.path.join(OUT_DIR, "all.txt"))
    ed_nouns = set(load("/tmp/ru_singulars.txt"))
    pairs = [tuple(p) for l in load("/tmp/ru_variants.txt") if len(p := l.split("\t")) == 2]
    pdf = [w for w in all_words if cyr_ok(w)]
    lm = libmorph_analyze(pdf)
    def to_singular(w):
        s = singular(w, hmap.get(w))
        return s if cyr_ok(s) else w
    fate = {}
    scrabble = set(oc)
    adj, verb, amb = [], [], []
    for w in pdf:
        oc_noun, oc_known = oc_status(w)
        if oc_noun:
            fate[w] = "scrabble: сущ. по OpenCorpora"
            continue
        lm_known, lm_lemma, _ = lm.get(w, (False, None, frozenset()))
        if lm_lemma is not None:
            s = lm_lemma if cyr_ok(lm_lemma) else to_singular(w)
            scrabble.add(s)
            fate[w] = "scrabble: сущ. по libmorph" + ("" if s == w else f" → {s}")
            continue
        if oc_known or lm_known:
            fate[w] = "отброшено: словарь знает как не-существительное"
            continue
        if w in ed_nouns:
            scrabble.add(w)
            fate[w] = "scrabble: ед.ч. по помете «ед.»"
            continue
        c = classify(w, hmap.get(w))
        if c == "noun":
            s = to_singular(w)
            scrabble.add(s)
            fate[w] = "scrabble: сущ. по помете орфословаря" + ("" if s == w else f" → {s}")
        elif c == "adj":
            adj.append(w)
            fate[w] = "отброшено: прилагательное (помета орфословаря)"
        elif c == "verb":
            verb.append(w)
            fate[w] = "отброшено: глагол (помета орфословаря)"
        else:
            amb.append(w)
            fate[w] = "undefined: неоднозначное (нет в словарях, помета не определяет)"
    # Manual confirmations: nouns the maintainer approved from the undefined tail.
    for w in load(os.path.join(OUT_DIR, "manual_confirm.txt")):
        if cyr_ok(w):
            scrabble.add(w)
            fate[w] = "scrabble: подтверждено вручную (manual_confirm.txt)"
    # Variant rescue: a word joined by "и" to a confirmed noun is itself a noun.
    pending = set(amb) - scrabble
    changed = True
    while changed:
        changed = False
        for a, b in pairs:
            for x, y in ((a, b), (b, a)):
                if x in scrabble and y in pending:
                    scrabble.add(y)
                    pending.discard(y)
                    fate[y] = f"scrabble: вариант от «{x}» (через «и»)"
                    changed = True
    undefined = [w for w in amb if w not in scrabble]
    return {
        "oc": oc, "scrabble": scrabble, "undefined": undefined,
        "adjectives": adj, "verbs": verb, "singulars": ed_nouns,
        "fate": fate, "all": set(all_words),
    }
 def trace(word, r):
    w = destress(word)
    if w in r["fate"]:
        return r["fate"][w]
    if w in r["scrabble"]:
        return "scrabble: лексикон OpenCorpora" if w in r["oc"] else "scrabble: производная/лемма"
    if w not in r["all"]:
        return "нет в all.txt (не извлечено на Stage 1 — нет в .pdf, либо имя собств./дефис/форма)"
    if not cyr_ok(w):
        return "отсеяно: длина или символы вне диапазона (2–15 кириллица)"
    return "не определено"
 def main():
    ap = argparse.ArgumentParser(description="Stage 2 brain: build the noun dictionary, trace a word, or dump buckets.")
    ap.add_argument("--dump", action="store_true", help="also write the in-memory buckets (undefined, adjectives, verbs, singulars, fate)")
    ap.add_argument("--trace", metavar="WORD", help="report how WORD did or did not reach the dictionary, then exit")
    args = ap.parse_args()
    r = build()
    if args.trace:
        print(f"{args.trace}: {trace(args.trace, r)}")
        return
    write(os.path.join(OUT_DIR, "scrabble.txt"), r["scrabble"])
    print(f"=> dictprep/russian/scrabble.txt   {len(r['scrabble'])}")
    print(f"   undefined kept in memory: {len(set(r['undefined']))} (use --dump to write it)")
    if args.dump:
        write(os.path.join(OUT_DIR, "undefined.txt"), r["undefined"])
        write(os.path.join(OUT_DIR, "adjectives.txt"), r["adjectives"])
        write(os.path.join(OUT_DIR, "verbs.txt"), r["verbs"])
        write(os.path.join(OUT_DIR, "singulars.txt"), r["singulars"])
        fate_path = os.path.join(OUT_DIR, "fate.tsv")
        os.makedirs(OUT_DIR, exist_ok=True)
        with open(fate_path, "w", encoding="utf-8") as f:
            for w in sorted(r["fate"], key=key):
                f.write(f"{w}\t{r['fate'][w]}\n")
        print(f"   dumped: undefined.txt ({len(set(r['undefined']))}), adjectives.txt, verbs.txt, singulars.txt, fate.tsv")
 if __name__ == "__main__":
    main()
@@ -0,0 +1,135 @@
 артгруппа
 бутень
 вебинар
 видеодневник
 водозащита
 генацвале
 жакоб
 оберфюрер
 околоть
 особина
 полбазара
 полбака
 полбалкона
 полбанана
 полбарана
 полбатальона
 полбатона
 полбиблиотеки
 полблокнота
 полбокала
 полбуханки
 полвагона
 полвечера
 полвзвода
 полвинта
 полгазеты
 полгектара
 полгостиницы
 полграмма
 полгруппы
 полдачи
 полдвора
 полдекабря
 полдеревни
 полдетсада
 полдивана
 полдивизии
 полдыни
 полжурнала
 ползавода
 ползарплаты
 полздания
 полканикул
 полканистры
 полкартофелины
 полкастрюли
 полквартиры
 полкилограмма
 полкласса
 полкниги
 полколлекции
 полкольца
 полкоманды
 полкоробки
 полкочана
 полкурса
 полкуска
 полмагазина
 полмандарина
 полмарта
 полматча
 полмиллиметра
 полмузея
 полноября
 полпакета
 полпарка
 полпартии
 полпинты
 полпирога
 полпирожка
 полпируэта
 полпоезда
 полполена
 полполка
 полполки
 полполосы
 полпомидора
 полпоросёнка
 полпосёлка
 полпредовский
 полпроцента
 полпузырька
 полрайона
 полромана
 полроты
 полрулона
 полряда
 полсада
 полсажени
 полсезона
 полсентября
 полсловаря
 полсостава
 полсрока
 полстада
 полстены
 полстолетия
 полстраницы
 полстроки
 полтаблетки
 полтайма
 полтакта
 полтарелки
 полтетради
 полтома
 полтона
 полторта
 полтысячелетия
 полтюбика
 полусанаторий
 полфакультета
 полфевраля
 полфлакона
 полфразы
 полхаты
 полцарства
 полцентнера
 полцистерны
 полчайника
 полчемодана
 полшажка
 полшажочка
 полшара
 полшкафа
 полшколы
 полщеки
 принт
 промо
 рентгеноаппарат
 сивец
 соцнаём
 срывка
 флеш
 флешмобер
 шиноремонт
@@ -0,0 +1,434 @@
 // Command ruwords extracts a clean Cyrillic word list from the plain text of a Russian
 // orthographic dictionary (the output of `pdftotext`).
 //
 // Stage 1 (this tool): from the column word-list section [from, to] it collects, per
 // entry, the headword (the leading token). When the headword is plural and the entry
 // gives its singular after "ед." — in full ("ящеры, …, ед. ящер") or as a replacement
 // suffix ("…, ед. -вец") — only the singular is kept, since a plural that has a singular
 // is never needed. It drops stress marks, lowercases, keeps ё, and discards proper nouns
 // (capitalized), hyphenated words, acronyms and non-Cyrillic tokens. The result is
 // de-duplicated and sorted in Russian alphabetical order (ё right after е), LF-separated.
 //
 // It also collects a variant headword joined by "и" when it carries its own grammatical
 // note (e.g. "аблатив, -а и аблятив, -а"). Suffix-singular reconstruction is heuristic;
 // Stage 2 (dictprep/ru_stage2.py) re-checks the words against real dictionaries.
 //
 //	pdftotext dictprep/orfo_dict_2025.pdf /tmp/slov.txt
 //	go run ./dictprep/ruwords -in /tmp/slov.txt -from 452 -to 168808 \
 //	    -out russian_all.txt -skip russian_skip.txt
 package main
 import (
 	"bufio"
 	"flag"
 	"fmt"
 	"log"
 	"os"
 	"path/filepath"
 	"sort"
 	"strings"
 	"unicode"
 )
 // ruAlphabet is the Russian alphabet in collation order (ё directly after е).
 const ruAlphabet = "абвгдеёжзийклмнопрстуфхцчшщъыьэюя"
 var ruRank = func() map[rune]int {
 	m := make(map[rune]int, len(ruAlphabet))
 	for i, r := range []rune(ruAlphabet) {
 		m[r] = i
 	}
 	return m
 }()
 func isCyrLetter(r rune) bool {
 	return (r >= 'а' && r <= 'я') || (r >= 'А' && r <= 'Я') || r == 'ё' || r == 'Ё'
 }
 func isUpperCyr(r rune) bool { return (r >= 'А' && r <= 'Я') || r == 'Ё' }
 func isStress(r rune) bool { return r == 0x0300 || r == 0x0301 }
 // cleanWord normalizes a run of letters/stress-marks into a lowercase Cyrillic word, or
 // returns ok=false for proper nouns (capitalized), hyphenated or non-Cyrillic runs.
 func cleanWord(run []rune) (string, bool) {
 	if len(run) == 0 || isUpperCyr(run[0]) {
 		return "", false
 	}
 	var b strings.Builder
 	for _, r := range run {
 		switch {
 		case isStress(r), r == '\u00AD': // drop stress accents and soft hyphens
 		case r == '-': // a real hyphen means a hyphenated word: reject it
 			return "", false
 		default:
 			b.WriteRune(unicode.ToLower(r))
 		}
 	}
 	w := b.String()
 	if w == "" {
 		return "", false
 	}
 	for _, r := range w {
 		if !((r >= 'а' && r <= 'я') || r == 'ё') {
 			return "", false
 		}
 	}
 	return w, true
 }
 // headword returns the entry's headword: the leading run of letters, stress marks and
 // hyphens, normalized.
 func headword(line string) (string, bool) {
 	// Trim leading whitespace, including the form-feed (U+000C) that pdftotext puts at
 	// the top of each page — otherwise the first headword on every page is lost.
 	line = strings.TrimLeftFunc(line, unicode.IsSpace)
 	var run []rune
 	for _, r := range line {
 		if isCyrLetter(r) || isStress(r) || r == '-' || r == '\u00AD' {
 			run = append(run, r)
 		} else {
 			break
 		}
 	}
 	return cleanWord(run)
 }
 // embeddedSingulars returns the singular form of a plural headword spelled out after
 // "ед.", either in full ("ед. ящер") or as a replacement suffix ("ед. -вец",
 // reconstructed from headword). It skips gender marks ("ед. м") and abbreviations that
 // merely contain "ед." ("ед. измер.", "ден. ед.").
 func embeddedSingulars(line, headword string) []string {
 	var out []string
 	for i := 0; ; {
 		j := strings.Index(line[i:], "ед.")
 		if j < 0 {
 			break
 		}
 		i += j + len("ед.")
 		rest := strings.TrimLeft(line[i:], " \u00a0\t")
 		if strings.HasPrefix(rest, "-") { // suffix form: reconstruct from the headword
 			var suf []rune
 			for _, r := range rest[len("-"):] {
 				if isCyrLetter(r) || isStress(r) {
 					suf = append(suf, r)
 				} else {
 					break
 				}
 			}
 			if s, ok := cleanWord(suf); ok && len([]rune(s)) >= 2 {
 				if recon := reconstructSingular(headword, s); recon != "" {
 					out = append(out, recon)
 				}
 			}
 			continue
 		}
 		var run []rune
 		consumed := 0
 		for _, r := range rest {
 			if isCyrLetter(r) || isStress(r) {
 				run = append(run, r)
 				consumed += len(string(r))
 			} else {
 				break
 			}
 		}
 		if len(run) == 0 {
 			continue
 		}
 		if strings.HasPrefix(rest[consumed:], ".") {
 			continue // an abbreviation like "ед. измер." rather than a singular form
 		}
 		w, ok := cleanWord(run)
 		if !ok || len([]rune(w)) < 2 { // 2+ letters excludes the gender marks м/ж/с
 			continue
 		}
 		out = append(out, w)
 	}
 	return out
 }
 // reconstructSingular builds the singular from a plural headword and the replacement
 // suffix from "ед. -<suffix>", splicing where the suffix best overlaps the tail of the
 // headword (the position of longest common prefix between the suffix and a headword
 // suffix). It is a heuristic; Stage 2 re-checks the words against real dictionaries.
 func reconstructSingular(headword, suffix string) string {
 	hw, sf := []rune(headword), []rune(suffix)
 	bestK, bestLen := -1, 0
 	for k := 0; k < len(hw); k++ {
 		m := 0
 		for k+m < len(hw) && m < len(sf) && hw[k+m] == sf[m] {
 			m++
 		}
 		if m > bestLen {
 			bestK, bestLen = k, m
 		}
 	}
 	if bestK < 0 {
 		return ""
 	}
 	return string(hw[:bestK]) + suffix
 }
 // headwordNotes are the grammatical notes that mark a parallel headword (a lemma) after
 // "и", as opposed to an inflected form. A "-" ending also marks one; form labels such as
 // деепр. (gerund) or сравн. (comparative) deliberately do not.
 var headwordNotes = map[string]bool{
 	"нескл": true, "неизм": true, "предлог": true, "предл": true, "нареч": true,
 	"нар": true, "прил": true, "союз": true, "частица": true, "част": true,
 	"межд": true, "мн": true, "ед": true, "тв": true, "числ": true, "мест": true,
 	"м": true, "ж": true, "с": true, "вводн": true, "сказ": true,
 }
 // variantNoteOK reports whether the note following a candidate variant marks a headword:
 // a "-" inflection ending or one of headwordNotes (and not a bare inflected word).
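 //
 // For example (read off headwordNotes above): the notes "-а" and "нескл." are accepted,
 // while "деепр." is rejected because its stem "деепр" is not in the map.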
 func variantNoteOK(note string) bool {
 	if strings.HasPrefix(note, "-") {
 		return true
 	}
 	var stem []rune
 	for _, r := range note {
 		if (r >= 'а' && r <= 'я') || r == 'ё' {
 			stem = append(stem, r)
 		} else {
 			break
 		}
 	}
 	return headwordNotes[string(stem)]
 }
 // variants returns the second (and further) headwords of an entry, written as a parallel
 // form after " и ": the entry "аблатив, -а и аблятив, -а" yields "аблятив", and the entry
 // "регги и реггей, нескл." yields "реггей". Requiring a headword note after the comma
 // keeps this from matching "и" inside examples or picking up inflected forms.
 func variants(line string) []string {
 	var out []string
 	const sep = " и "
 	for i := 0; ; {
 		j := strings.Index(line[i:], sep)
 		if j < 0 {
 			break
 		}
 		i += j + len(sep)
 		rest := line[i:]
 		var run []rune
 		consumed := 0
 		for _, r := range rest {
 			if isCyrLetter(r) || isStress(r) {
 				run = append(run, r)
 				consumed += len(string(r))
 			} else {
 				break
 			}
 		}
 		if len(run) == 0 {
 			continue
 		}
 		after := rest[consumed:]
 		if !strings.HasPrefix(after, ", ") || !variantNoteOK(after[len(", "):]) {
 			continue
 		}
 		if w, ok := cleanWord(run); ok && len([]rune(w)) >= 2 {
 			out = append(out, w)
 		}
 	}
 	return out
 }
 // normToken normalizes any token (a run of letters and stress marks) for the skip set:
 // lowercase, stress removed, kept only if it is 2+ all-Cyrillic letters. Unlike
 // cleanWord it does NOT reject capitalized tokens — a lowercased proper noun belongs in
 // the skip set so it can be re-checked by a morphological analyzer.
 func normToken(run []rune) (string, bool) {
 	var b strings.Builder
 	for _, r := range run {
 		if isStress(r) {
 			continue
 		}
 		b.WriteRune(unicode.ToLower(r))
 	}
 	w := b.String()
 	if len([]rune(w)) < 2 {
 		return "", false
 	}
 	for _, r := range w {
 		if !((r >= 'а' && r <= 'я') || r == 'ё') {
 			return "", false
 		}
 	}
 	return w, true
 }
 // tokens returns every maximal run of Cyrillic letters (plus stress marks) in the line,
 // normalized; runs are split on every other character (so hyphens split a word).
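 //
 // Illustrative trace on a made-up line, assuming isCyrLetter accepts capital letters (as
 // the normToken comment implies): "кое-кто и Москва" yields "кое", "кто", "москва";
 // the one-letter run "и" is dropped by normToken.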
 func tokens(line string) []string {
 	var out []string
 	var run []rune
 	flush := func() {
 		if len(run) > 0 {
 			if w, ok := normToken(run); ok {
 				out = append(out, w)
 			}
 			run = run[:0]
 		}
 	}
 	for _, r := range line {
 		if isCyrLetter(r) || isStress(r) {
 			run = append(run, r)
 		} else {
 			flush()
 		}
 	}
 	flush()
 	return out
 }
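 // lessRu reports whether a sorts before b in Russian alphabetical order: runes are
 // compared by their ruRank, and a proper prefix sorts before the longer word.
 func lessRu(a, b string) bool {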
 	ra, rb := []rune(a), []rune(b)
 	for i := 0; i < len(ra) && i < len(rb); i++ {
 		if ra[i] != rb[i] {
 			return ruRank[ra[i]] < ruRank[rb[i]]
 		}
 	}
 	return len(ra) < len(rb)
 }
 // sortedRu returns the words of set in lessRu order.
 func sortedRu(set map[string]struct{}) []string {
 	words := make([]string, 0, len(set))
 	for w := range set {
 		words = append(words, w)
 	}
 	sort.Slice(words, func(i, j int) bool { return lessRu(words[i], words[j]) })
 	return words
 }
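 // writeWords writes words to path, one per line, creating the parent directory if
 // needed. Write errors surface through Flush, since bufio.Writer errors are sticky.
 func writeWords(path string, words []string) error {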
 	if dir := filepath.Dir(path); dir != "" && dir != "." {
 		if err := os.MkdirAll(dir, 0o755); err != nil {
 			return err
 		}
 	}
 	o, err := os.Create(path)
 	if err != nil {
 		return err
 	}
 	w := bufio.NewWriter(o)
 	for _, word := range words {
 		w.WriteString(word)
 		w.WriteByte('\n')
 	}
 	if err := w.Flush(); err != nil {
 		o.Close()
 		return err
 	}
 	return o.Close()
 }
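 // main runs Stage 1: it scans lines -from..-to of the -in file and writes the base word
 // list, the skip set, the reconstructed singulars and the variant pairs.
 func main() {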
 	in := flag.String("in", "dictprep/russian/orfo_dict_2025.txt", "plain-text dictionary (pdftotext output)")
 	out := flag.String("out", "dictprep/russian/all.txt", "output: the base word list (clean headwords + reconstructed singulars + variants)")
 	skip := flag.String("skip", "/tmp/ru_skip.txt", "output: every other token, for a later morphology re-check")
 	sings := flag.String("singulars", "/tmp/ru_singulars.txt", "output: singulars reconstructed from \"ед.\" (known nouns)")
 	varsOut := flag.String("variants", "/tmp/ru_variants.txt", "output: variant pairs joined by \"и\" (primary<TAB>variant)")
 	from := flag.Int("from", 452, "first line of the word-list section (1-based, inclusive)")
 	to := flag.Int("to", 168808, "last line of the word-list section (inclusive)")
 	flag.Parse()
 	if *in == "" {
 		log.Fatal("ruwords: -in is required")
 	}
 	f, err := os.Open(*in)
 	if err != nil {
 		log.Fatal(err)
 	}
 	defer f.Close()
 	all := make(map[string]struct{})
 	allTokens := make(map[string]struct{})
 	singulars := make(map[string]struct{})
 	variantPairs := make(map[string]struct{})
 	entries, fromHead, fromSing, fromVar := 0, 0, 0, 0
 	sc := bufio.NewScanner(f)
 	sc.Buffer(make([]byte, 1<<20), 1<<20)
 	for line := 0; sc.Scan(); {
 		line++
 		if line < *from || line > *to {
 			continue
 		}
 		entries++
 		text := sc.Text()
 		hw, hwOK := headword(text)
 		var embedded []string // not "sings": that name is the -singulars flag above
 		if hwOK {
 			embedded = embeddedSingulars(text, hw)
 		}
 		primary := ""
 		if len(embedded) > 0 {
 			// the headword is plural and the entry gives its singular: keep only the singular
 			primary = embedded[0]
 			for _, w := range embedded {
 				if _, seen := all[w]; !seen {
 					fromSing++
 					all[w] = struct{}{}
 				}
 				singulars[w] = struct{}{}
 			}
 		} else if hwOK {
 			primary = hw
 			if _, seen := all[hw]; !seen {
 				fromHead++
 			}
 			all[hw] = struct{}{}
 		}
 		for _, w := range variants(text) {
 			if _, seen := all[w]; !seen {
 				fromVar++
 				all[w] = struct{}{}
 			}
 			if primary != "" && primary != w {
 				variantPairs[primary+"\t"+w] = struct{}{}
 			}
 		}
 		for _, w := range tokens(text) {
 			allTokens[w] = struct{}{}
 		}
 	}
 	if err := sc.Err(); err != nil {
 		log.Fatal(err)
 	}
 	skipSet := make(map[string]struct{})
 	for w := range allTokens {
 		if _, ok := all[w]; !ok {
 			skipSet[w] = struct{}{}
 		}
 	}
 	allWords := sortedRu(all)
 	skipWords := sortedRu(skipSet)
 	if err := writeWords(*out, allWords); err != nil {
 		log.Fatal(err)
 	}
 	if err := writeWords(*skip, skipWords); err != nil {
 		log.Fatal(err)
 	}
 	if err := writeWords(*sings, sortedRu(singulars)); err != nil {
 		log.Fatal(err)
 	}
 	pairList := make([]string, 0, len(variantPairs))
 	for p := range variantPairs {
 		pairList = append(pairList, p)
 	}
 	// Pairs are sorted in plain byte order (sort.Strings), not with lessRu.
 	sort.Strings(pairList)
 	if err := writeWords(*varsOut, pairList); err != nil {
 		log.Fatal(err)
 	}
 	fmt.Printf("scanned %d entries\n", entries)
 	fmt.Printf("  %-20s %7d words (%d headwords + %d embedded singulars + %d variants)\n", *out, len(allWords), fromHead, fromSing, fromVar)
 	fmt.Printf("  %-20s %7d words (tokens not in %s; for a morphology re-check)\n", *skip, len(skipWords), *out)
 	fmt.Printf("  %-20s %7d words (singulars from \"ед.\"; known nouns)\n", *sings, len(singulars))
 	fmt.Printf("  %-20s %7d pairs (variants joined by \"и\")\n", *varsOut, len(variantPairs))
 }