Implement Scrabble move generator (DAWG) #1

Merged
owner merged 6 commits from feat/scrabble-solver into master 2026-06-01 22:02:56 +00:00
9 changed files with 402226 additions and 1 deletions
Showing only changes of commit 540ee32178
+10 -1
@@ -6,4 +6,13 @@
# Local scratch
/tmp/
*.pdf
# Compiled libmorph bridge (build artifact; see dictprep/README.md)
/dictprep/libmorph_check
# Stage 2 --dump debug buckets (committed: all, scrabble, manual_confirm, orfo_dict_2025)
/dictprep/russian/undefined.txt
/dictprep/russian/adjectives.txt
/dictprep/russian/verbs.txt
/dictprep/russian/singulars.txt
/dictprep/russian/fate.tsv
+164
@@ -0,0 +1,164 @@
# Russian word-list preparation (`dictprep`)
Builds the Russian **noun** word list for the Scrabble/Эрудит solver out of the official
Russian academic **orthographic dictionary**, cross-checked against two independent
morphological dictionaries.
The goal of the pipeline is a list of **common nouns in the nominative singular**
(`dictprep/russian/scrabble.txt`), plus an ambiguous tail for manual review.
> This directory is self-contained tooling for *building* the word list. It is not part
> of the solver library. The committed result lives in `dictprep/russian/`.
## Source
`orfo_dict_2025.pdf`: *Русский орфографический словарь РАН* (the Russian Academy of
Sciences orthographic dictionary, ≈ 200 000 entries), the authority for **spelling**. It
encodes declension type in its grammatical notes but does **not** reliably mark part of speech.
- Source: <https://ruslang.ru/sites/default/files/doc/normativnyje_slovari/orfograficheskij_slovar.pdf>
- Mirror: <https://rus-gos.spbu.ru/index.php/dictionary>
The PDF is git-ignored (large, third-party); place it here as `orfo_dict_2025.pdf`. Its
pdftotext output is committed as `russian/orfo_dict_2025.txt`, so the word list rebuilds
from the text alone — the binary PDF is needed only to regenerate that text.
## Outputs (`dictprep/russian/`)
The committed result is **four** files; every other bucket stays in the Stage-2
process's memory (dump it with `--dump`, query it with `--trace WORD`).
| File | Committed | Meaning |
|------|:--:|---------|
| `orfo_dict_2025.txt` | ✓ | the pdftotext output — the parsed source of truth (the PDF binary is not needed to rebuild). |
| `all.txt` | ✓ | Stage 1 base: every clean Cyrillic headword/variant; a plural headword with a singular is replaced by that singular. |
| `manual_confirm.txt` | ✓ | hand-reviewed nouns from the undefined tail; the brain merges them into the result. |
| `scrabble.txt` | ✓ | **Stage 2 result**: common nouns, nominative singular (+ pluralia tantum), length 2–15; the working dictionary. |
| `undefined.txt` | — | the ambiguous tail; kept in memory, written only with `--dump`. |
`--dump` also writes `adjectives.txt`, `verbs.txt`, `singulars.txt` and `fate.tsv` (every
word with the reason it did or did not reach the dictionary); these are git-ignored debug
artifacts. Stage 1 also writes `/tmp/ru_{skip,singulars,variants}.txt`, intermediate inputs
the brain consumes.
## Prerequisites
```sh
# 1. pdftotext (Poppler)
sudo apt-get install -y poppler-utils
# 2. Go toolchain (Stage 1) — already required by the parent module
# 3. Python + the OpenCorpora analyser (Stage 2)
sudo apt-get install -y python3-venv python3-pip
python3 -m venv ru-venv
ru-venv/bin/pip install mawo-pymorphy3 # bundles OpenCorpora 2025 (words.dawg)
# 4. libmorph — the independent morphological dictionary (Stage 2 cross-check)
sudo apt-get install -y morphrus morphrus-dev moonycode-dev morphapi-dev
g++ -std=c++17 -O2 dictprep/libmorph_check.cpp -lmorphrus -lmoonycode -o dictprep/libmorph_check
```
If `dictprep/libmorph_check` is absent, Stage 2 still runs — it simply drops libmorph from
the stack and reports `libmorph_helper=MISSING`.
## How to run
```sh
# Stage 0 — PDF -> plain text (committed as the source of truth; run once)
pdftotext dictprep/orfo_dict_2025.pdf dictprep/russian/orfo_dict_2025.txt
# Stage 1 — build the base word list (Go): dictprep/russian/all.txt + /tmp/ru_*.txt
go run ./dictprep/ruwords
# Stage 2 — the brain (Python + mawo + libmorph): writes scrabble.txt
ru-venv/bin/python dictprep/ru_stage2.py
# ask how a word did or did not reach the dictionary
ru-venv/bin/python dictprep/ru_stage2.py --trace травмпункт
# also write the in-memory buckets (undefined, adjectives, verbs, singulars, fate.tsv)
ru-venv/bin/python dictprep/ru_stage2.py --dump
```
The `ruwords` flags `-from`/`-to` (defaults 452 and 168808) bound the word-list section of
`russian/orfo_dict_2025.txt`: line 452 is the first entry (`а1, …`) and line 168808 the last
(`я́щурный`). The preface above line 452 is prose and is skipped. `ru_stage2.py` hard-codes
the same bounds as `WL_FROM`/`WL_TO`, so verify both whenever the PDF is re-exported.
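One way to check the bounds after a re-export is a small script like the sketch below. It is not part of the pipeline: `check_bounds` and its `expected_first`/`expected_last` parameters are hypothetical names, and the sample lines are a toy stand-in for the real text.

```python
def destress(s):
    """Strip combining grave/acute stress marks (U+0300/U+0301) and lowercase."""
    return "".join(c for c in s if ord(c) not in (0x0300, 0x0301)).lower()


def check_bounds(lines, first, last, expected_first="а1", expected_last="ящурный"):
    """True when the 1-based lines `first` and `last` start with the expected entries.

    Leading whitespace (including the form feed pdftotext emits at page tops)
    is ignored before comparing.
    """
    if not (1 <= first <= last <= len(lines)):
        return False
    head = destress(lines[first - 1].lstrip())
    tail = destress(lines[last - 1].lstrip())
    return head.startswith(expected_first) and tail.startswith(expected_last)


# Toy input: two preface lines, then a three-entry "word list".
sample = ["Preface", "", "а1, нескл., с.", "\fабажу\u0301р, -а", "я\u0301щурный"]
print(check_bounds(sample, 3, 5))  # True
print(check_bounds(sample, 2, 5))  # False: line 2 is not the first entry
```

Against the real file, the same function could be called with the text split on `"\n"` and the bounds 452 and 168808.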
## Algorithm
### Stage 1 — `ruwords` (Go)
Per dictionary line in `[from, to]` it collects, normalised (stress marks U+0300/U+0301
stripped, lowercased, `ё` kept, hyphenated/capitalised/non-Cyrillic rejected):
- the **headword** (leading token). Leading whitespace including the form-feed `\f`
pdftotext puts at every page top is trimmed — otherwise the first headword of each page
is lost;
- the **singular of a plural headword** when the entry gives it after `ед.`, in full
(`ящеры, …, ед. ящер`) or as a replacement suffix (`…, ед. -вец`, spliced where the
suffix best overlaps the headword); the plural is then dropped (a plural that has a
singular is never needed) and the singular is also recorded (`/tmp/ru_singulars.txt`);
- **variant headwords** after `и` that carry their own grammatical note
(`аблатив, -а и аблятив, -а`; `регги и реггей, нескл.`), excluding inflected forms.
Everything else (every maximal Cyrillic token not selected above) goes to
`/tmp/ru_skip.txt`, a safety net for a later morphology re-check.
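The suffix splice can be illustrated with a short Python sketch of the same idea: find the headword position where the replacement suffix overlaps best, cut there, and append the suffix. The function name `reconstruct_singular` and the note `ед. -нин` used in the example are assumptions for illustration; the real dictionary entry may be worded differently.

```python
def reconstruct_singular(headword, suffix):
    """Splice `suffix` onto `headword` at the earliest position of longest overlap.

    For each start k in the headword, count how many leading characters of the
    suffix match headword[k:]; cut the headword at the best k and append the
    suffix. Returns "" when the suffix shares no character with the headword.
    """
    best_k, best_len = -1, 0
    for k in range(len(headword)):
        m = 0
        while k + m < len(headword) and m < len(suffix) and headword[k + m] == suffix[m]:
            m += 1
        if m > best_len:  # strict '>' keeps the earliest best position
            best_k, best_len = k, m
    return headword[:best_k] + suffix if best_k >= 0 else ""


print(reconstruct_singular("витебчане", "нин"))  # витебчанин
print(repr(reconstruct_singular("ящеры", "зк")))  # '' (no overlap, nothing reconstructed)
```

Because the splice is a heuristic, a reconstructed form is only a candidate until Stage 2 has checked it.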
### Stage 2 — `ru_stage2.py` (Python)
Each Stage-1 word (length 2–15) is routed by three sources, most authoritative first:
1. **OpenCorpora** (`words.dawg`, read directly — *not* the predictor): a common-noun
reading ⇒ keep the OpenCorpora lemma. The full OpenCorpora common-noun lexicon is also
added (so nouns absent from the PDF are included).
2. **libmorph** (independent dictionary, via `libmorph_check`): a common-noun reading ⇒
keep the libmorph lemma. The two dictionaries are treated as **complementary** — a noun
reading in *either* is enough (their disagreements were reviewed and resolved this way,
since each is incomplete in different places). A singular reconstructed from "ед." that
neither dictionary knows is accepted as a noun (the orthographic note attests it).
3. A word **both dictionaries miss** is classified by the orthographic **note**
(`-ая, -ое` ⇒ adjective; `-ть`, `сов./несов.` ⇒ verb; single genitive `-а/-и` or
`нескл., м./ж./с.` ⇒ noun). A note-noun goes straight to `scrabble.txt`; an adjective or
verb is dropped; anything undecided goes to `undefined.txt`.
4. **Variant rescue**: when the dictionary joins two spellings with "и" (`травмопункт и
травмпункт`, `регги и реггей`) and one is already a confirmed noun, the other is moved
from review/undefined into the result as well, propagated transitively through chains.
The plural-form variants the dictionaries already resolve never reach this step.
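The transitive variant rescue in step 4 amounts to a fixed-point loop over the variant pairs. The sketch below shows that loop in isolation; `rescue_variants` is a hypothetical helper name, and the Latin letters in the second example are placeholders rather than dictionary words.

```python
def rescue_variants(confirmed, pending, pairs):
    """Promote pending words that are paired with a confirmed noun.

    Repeats until a full pass makes no change, so a chain a-b, b-c is
    resolved even when the pairs are listed in an unhelpful order.
    """
    confirmed, pending = set(confirmed), set(pending)
    changed = True
    while changed:
        changed = False
        for a, b in pairs:
            for x, y in ((a, b), (b, a)):  # a pair works in both directions
                if x in confirmed and y in pending:
                    confirmed.add(y)
                    pending.discard(y)
                    changed = True
    return confirmed, pending


# One spelling is confirmed, so its variant is rescued; the other pair has
# no confirmed member and stays pending.
pairs = [("травмопункт", "травмпункт"), ("регги", "реггей")]
ok, rest = rescue_variants({"травмопункт"}, {"травмпункт", "реггей"}, pairs)
print(sorted(ok), sorted(rest))

# A chain listed "backwards" still resolves, thanks to the outer loop.
ok, rest = rescue_variants({"a"}, {"b", "c"}, [("c", "b"), ("b", "a")])
print(sorted(ok), sorted(rest))  # ['a', 'b', 'c'] []
```

A single pass would miss `c` in the second example, which is why the loop repeats until nothing changes.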
The nominative singular always comes from the dictionary that recognised the word, or from
the orthographic `ед.` note — never from a predictor guess (libmorph and the predictor
mis-lemmatise out-of-dictionary words, e.g. `витебчане → витебчан` instead of `витебчанин`).
### The libmorph bridge — `libmorph_check.cpp`
libmorph (A. Kovalenko, MIT) ships as `libmorphrus.so`. `libmorph_check` is a thin
stdin→stdout filter: one UTF-8 word per line in, one line out:
```
<known>\t<pos>:<lemma>\t<pos>:<lemma>...
```
`<known>` is `CheckWord` (1 = in the dictionary). `<pos>` is `wdInfo & 0x3f`, the part of
speech. The codes were reverse-engineered (the docs omit the table):
| codes | part of speech |
|------|----------------|
| **7–21, 24** | **noun** (all genders / declensions / animacy; pluralia tantum is 24) |
| 13 | verb · 25, 27 adjective · 28–32 pronoun · 33–36 numeral |
| 38–39 | **proper noun** (excluded) · 48–58 comparative/adverb · 49–53 function words |
The analyser instance is requested with the key `libmorph.api.v4:utf-8` so words are
passed and lemmas returned in UTF-8.
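A consumer of this output only needs to split each line on tabs and colons. The sketch below parses one line and picks out the common-noun lemmas (codes 7 to 21 plus 24). `parse_bridge_line` is a hypothetical name, and the sample lines are made up for illustration rather than captured from a real run.

```python
NOUN_CODES = set(range(7, 22)) | {24}  # 7..21 plus 24 (pluralia tantum)


def parse_bridge_line(line):
    """Split one libmorph_check output line into (known, lexemes, noun_lemmas).

    `lexemes` is a list of (pos_code, lemma) pairs; fields without a numeric
    code before the colon are ignored.
    """
    fields = line.rstrip("\n").split("\t")
    known = fields[0] == "1"
    lexemes = []
    for field in fields[1:]:
        code, sep, lemma = field.partition(":")
        if sep and code.isdigit():
            lexemes.append((int(code), lemma))
    noun_lemmas = [lemma for code, lemma in lexemes if code in NOUN_CODES]
    return known, lexemes, noun_lemmas


print(parse_bridge_line("1\t7:ящер"))  # (True, [(7, 'ящер')], ['ящер'])
print(parse_bridge_line("0"))          # (False, [], [])
```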
## Notes & caveats
- The hard tail (≈ 35 000 of the Stage-1 candidate words) is in **no** morphological
dictionary; only the orthographic dictionary attests these words, so the PDF note is the
sole signal for them. Compound and very recent nouns (`робототехник`, `толкинист`) live here.
- OpenCorpora and libmorph are near-equal in size (≈ 99 500 words each on `all.txt`)
and ≈ 96 % overlapping, but **complementary** (each contributes ≈ 2 200 unique nouns),
which is why both are kept. The mawo *predictor* "knows" ~98 % of everything by guessing
and is therefore used only as a weak confirming vote, never as dictionary membership.
- Licensing: OpenCorpora data is CC BY-SA 3.0; libmorph is MIT; the orthographic
dictionary has its own copyright. A list derived from CC BY-SA data inherits that licence.
+47
@@ -0,0 +1,47 @@
// libmorph_check: a thin stdin->stdout bridge to the libmorph Russian morphological
// analyser, for use by the Stage-2 classifier (dictprep/ru_stage2.py).
//
// Reads one word per line. Bytes are passed through verbatim, so the caller must encode
// them in the charset named by the factory key (UTF-8 by default; pass a different
// "libmorph.api.v4:<charset>" key as argv[1] to change it). For each word it writes
// a line:
//
// <known>\t<pos>:<lemma>\t<pos>:<lemma>...
//
// where <known> is CheckWord's result (1 = in the dictionary, 0 = not), and each
// following field is one lexeme: its part of speech (wdInfo & 0x3f) and lemma.
//
// Build: g++ -std=c++17 -O2 dictprep/libmorph_check.cpp -lmorphrus -lmoonycode -o dictprep/libmorph_check
#include <libmorph/rus.h>
#include <libmorph/api.hpp>
#include <cstdio>
#include <iostream>
#include <string>
int main(int argc, char** argv) {
    // The factory key selects the code page: "libmorph.api.v4:<charset>". Use the
    // UTF-8 instance so words pass through verbatim. IMlmaMbXX only adds non-virtual
    // convenience wrappers over IMlmaMb, so the filled pointer can be used as such.
    const char* key = argc > 1 ? argv[1] : "libmorph.api.v4:utf-8";
    IMlmaMbXX* mlma = nullptr;
    int rc = mlmaruGetAPI(key, (void**)&mlma);
    if (mlma == nullptr) {
        std::fprintf(stderr, "libmorph_check: GetAPI('%s') failed, rc=%d\n", key, rc);
        return 1;
    }
    std::string line;
    while (std::getline(std::cin, line)) {
        if (!line.empty() && line.back() == '\r') line.pop_back();
        IMlmaMbXX::inword w(line.c_str(), line.size());
        int known = mlma->CheckWord(w, sfIgnoreCapitals);
        std::cout << known;
        try {
            for (auto& lx : mlma->Lemmatize(w, sfIgnoreCapitals)) {
                unsigned pos = lx.ngrams > 0 ? (lx.pgrams[0].wdInfo & 0x3f) : 0xffu;
                std::cout << '\t' << pos << ':' << (lx.plemma ? lx.plemma : "");
            }
        } catch (...) {
            // Lemmatization failed: emit only <known>, so the output stays one line per word.
        }
        std::cout << '\n';
    }
    return 0;
}
+341
@@ -0,0 +1,341 @@
#!/usr/bin/env python3
"""Stage 2 — the "brain" of the Russian Scrabble word-list pipeline.
It reads the Stage-1 base word list (built once by ruwords so the heavy PDF is not
re-parsed) together with the grammatical notes and the singular/variant structure, runs
the whole noun-selection logic in memory, and writes a minimal result:
    dictprep/russian/scrabble.txt   the working dictionary (common nouns, nom. sing.)
    dictprep/russian/undefined.txt  the ambiguous tail for manual review (written only with --dump)
(dictprep/russian/all.txt is the Stage-1 base.) Every other bucket — adjectives, verbs,
the merged note-nouns, singulars, variants — stays in memory. Pass --dump to also write
them; pass --trace WORD to ask how a single word did or did not reach the dictionary.
Note: all.txt is a plain word list, so the grammatical notes, "ед." singulars and "и"
variants are read from the pdftotext output (russian/orfo_dict_2025.txt) and the Stage-1
side files; the expensive PDF parse itself runs only once.
Sources, most authoritative first: OpenCorpora (mawo-pymorphy3), libmorph (libmorph_check),
and the orthographic dictionary's own notes. See dictprep/README.md.
Run: ru-venv/bin/python dictprep/ru_stage2.py [--dump] [--trace WORD]
"""
import argparse
import os
import re
import subprocess
HERE = os.path.dirname(os.path.abspath(__file__))
OUT_DIR = os.path.join(HERE, "russian")
SLOV = os.path.join(OUT_DIR, "orfo_dict_2025.txt") # committed pdftotext output (source of truth)
WL_FROM, WL_TO = 452, 168808 # 1-based inclusive bounds of the column word-list section
OC_CACHE = "/tmp/oc_nouns.txt"
LIBMORPH_BIN = os.path.join(HERE, "libmorph_check")
ALPHABET = "абвгдеёжзийклмнопрстуфхцчшщъыьэюя"
ORDER = {c: i for i, c in enumerate(ALPHABET)}
PROPER = {"Name", "Surn", "Patr", "Geox", "Orgn", "Trad"}
LIBMORPH_NOUN_CODES = set(range(7, 22)) | {24} # 7..21 plus 24 (pluralia tantum)
ADJ_END = {"ая", "яя", "ое", "ее", "ье", "ья", "ьи"}
VERB3 = ("ет", "ёт", "ит", "ют", "ут", "ает", "яет", "ует", "уют", "нет", "жет", "чет")
GENPL = ("ов", "ёв", "ев", "ей")
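def key(w):
    """Sort key in Russian alphabetical order (ё directly after е)."""
    return [ORDER.get(c, 99) for c in w]


def destress(s):
    """Strip combining stress marks (U+0300/U+0301) and lowercase."""
    return "".join(c for c in s if ord(c) not in (0x0300, 0x0301)).lower()


def cyr_ok(w):
    """True for a 2..15-letter word made only of lowercase Cyrillic letters."""
    return 2 <= len(w) <= 15 and all(("а" <= c <= "я") or c == "ё" for c in w)


def load(p):
    """Non-empty stripped lines of p, or [] when the file does not exist."""
    if not os.path.exists(p):
        return []
    with open(p, encoding="utf-8") as f:
        return [l.strip() for l in f if l.strip()]


def write(path, words):
    """Write the de-duplicated words, sorted in Russian order, one per line."""
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(sorted(set(words), key=key)) + "\n")

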
import mawo_pymorphy3 # noqa: E402
M = mawo_pymorphy3.MorphAnalyzer()
D = M._dawg_dict


def oc_noun_lemmas():
    """Every common-noun lemma (nom. sing. / pluralia tantum) in OpenCorpora's words.dawg."""
    gp, pt = D.get_paradigm, D.parse_tag_string
    para0, tagc = {}, {}

    def g0(pid):
        r = para0.get(pid)
        if r is None:
            suf0, tag0, pre0 = gp(pid, 0)
            _, gr = pt(tag0)
            r = (pre0, suf0, gr)
            para0[pid] = r
        return r

    def gt(pid, idx):
        k = (pid, idx)
        r = tagc.get(k)
        if r is None:
            suf, tag, pre = gp(pid, idx)
            pos, gr = pt(tag)
            r = (suf, pre, pos, gr)
            tagc[k] = r
        return r

    out = set()
    for word, rec in D.words_dawg.iteritems():
        pid, idx = rec
        suf, pre, pos, gr = gt(pid, idx)
        if pos != "NOUN":
            continue
        pre0, suf0, gr0 = g0(pid)
        if (PROPER & gr) or (PROPER & gr0):
            continue
        stem = word[len(pre):len(word) - len(suf)] if suf else word[len(pre):]
        out.add(pre0 + stem + suf0)
    return {w for w in out if cyr_ok(w)}


def oc_status(word):
    """(is_common_noun, in_dictionary) for word, from OpenCorpora only."""
    parses = D.get_word_parses(word)
    if not parses:
        return False, False
    gp, pt = D.get_paradigm, D.parse_tag_string
    for pid, idx in parses:
        suf, tag, pre = gp(pid, idx)
        pos, gr = pt(tag)
        if pos == "NOUN":
            _, tag0, _ = gp(pid, 0)
            _, gr0 = pt(tag0)
            if not (PROPER & gr or PROPER & gr0):
                return True, True
    return False, True


def libmorph_analyze(words):
    """Map each word to (known, noun_lemma, codes) per libmorph; noun_lemma is None when it
    is not a common noun there. Empty result if the helper binary is not built."""
    words = list(words)
    if not words or not os.path.exists(LIBMORPH_BIN):
        return {}
    # The bridge defaults to the UTF-8 analyser instance, so force UTF-8 on the pipe
    # instead of relying on the locale's default encoding.
    proc = subprocess.run([LIBMORPH_BIN], input="\n".join(words), capture_output=True,
                          text=True, encoding="utf-8")
    out = {}
    for w, line in zip(words, proc.stdout.split("\n")):
        fields = line.split("\t")
        known = fields[:1] == ["1"]
        codes, noun_lemmas = set(), []
        for field in fields[1:]:
            code, _, lex = field.partition(":")
            if code.isdigit():
                codes.add(int(code))
                if int(code) in LIBMORPH_NOUN_CODES:
                    noun_lemmas.append(lex)
        # Prefer the word itself when it is one of the noun lemmas, else the first one.
        lemma = (w if w in noun_lemmas else noun_lemmas[0]) if noun_lemmas else None
        out[w] = (known, lemma, codes)
    return out


def build_notes():
    """Map each headword (destressed, lowercased) to its grammatical note."""
    def is_hw(ch):
        o = ord(ch)
        return (0x0430 <= o <= 0x044F) or (0x0410 <= o <= 0x042F) or o in (0x0401, 0x0451, 0x0300, 0x0301)

    hmap = {}
    with open(SLOV, encoding="utf-8") as f:
        lines = f.read().split("\n")
    for l in lines[WL_FROM - 1:WL_TO]:
        s = l.lstrip()
        e = 0
        for ch in s:
            if is_hw(ch):
                e += 1
            else:
                break
        hw = destress(s[:e])
        if hw and hw not in hmap:
            hmap[hw] = destress(s[e:]).strip()
    return hmap


def classify(w, note):
    """Coarse part of speech of an out-of-dictionary word from its PDF note."""
    if note is None:
        return "amb"
    n = re.sub(r"\([^)]*\)", "", note).strip()  # drop domain/etymology parentheticals
    if "кр. ф" in n or "кр.ф" in n or "прич." in n or "прил." in n:
        return "adj"
    ends = re.findall(r"-([а-яё]+)", n)
    if any(e in ADJ_END for e in ends):
        return "adj"
    if "сов." in n or "несов." in n or "безл." in n:
        return "verb"
    if w.endswith("ся"):  # reflexive: no Russian noun ends in -ся
        return "verb"
    if any(e.endswith(VERB3) for e in ends) and not any(m in n for m in ("ед.", "тв.", "род.", "м.", "ж.", "с.")):
        return "verb"
    if n == "" and w.endswith(("ый", "ий", "ой", "ая", "ое", "ые", "ие", "яя", "ее")):
        return "adj"
    if "нескл" in n:
        return "noun" if any(g in n for g in ("м.", "ж.", "с.", "мн.")) else "amb"
    if ends:
        return "noun"
    if n == "" and w.endswith(("ать", "ять", "еть", "ить", "оть", "уть", "ыть", "ти", "чь")):
        return "verb"
    return "amb"


def singular(w, note):
    """Nominative singular of a noun headword from the PDF note (authoritative) or, for a
    plural headword without an explicit singular, the mawo lemma; pluralia tantum kept."""
    n = note or ""
    full = re.search(r"ед\.\s+([а-яё]+)", n)
    if full:
        return full.group(1)
    suf = re.search(r"ед\.\s+-([а-яё]+)", n)
    if suf:
        s = suf.group(1)
        i = w.rfind(s[0])
        return w[:i] + s if i > 0 else w
    ends = re.findall(r"-([а-яё]+)", re.sub(r"\([^)]*\)", "", n))
    if ends and ends[0].endswith(GENPL):
        for p in M.parse(w):
            if str(p.tag.POS) == "NOUN":
                return p.normal_form
        return w
    return w


def build():
    """Run the whole pipeline in memory. Returns the result sets plus a `fate` map giving
    every word's outcome, so a word's path can be traced or the buckets dumped."""
    oc = set(load(OC_CACHE)) or oc_noun_lemmas()
    if not os.path.exists(OC_CACHE):
        write(OC_CACHE, oc)
    hmap = build_notes()
    all_words = load(os.path.join(OUT_DIR, "all.txt"))
    ed_nouns = set(load("/tmp/ru_singulars.txt"))
    pairs = [tuple(p) for l in load("/tmp/ru_variants.txt") if len(p := l.split("\t")) == 2]
    pdf = [w for w in all_words if cyr_ok(w)]
    lm = libmorph_analyze(pdf)

    def to_singular(w):
        s = singular(w, hmap.get(w))
        return s if cyr_ok(s) else w

    fate = {}
    scrabble = set(oc)
    adj, verb, amb = [], [], []
    for w in pdf:
        # 1. OpenCorpora: a common-noun reading settles it (the lexicon is already in `scrabble`).
        oc_noun, oc_known = oc_status(w)
        if oc_noun:
            fate[w] = "scrabble: сущ. по OpenCorpora"
            continue
        # 2. libmorph: a common-noun reading is enough on its own.
        lm_known, lm_lemma, _ = lm.get(w, (False, None, frozenset()))
        if lm_lemma is not None:
            s = lm_lemma if cyr_ok(lm_lemma) else to_singular(w)
            scrabble.add(s)
            fate[w] = "scrabble: сущ. по libmorph" + ("" if s == w else f" → {s}")
            continue
        if oc_known or lm_known:
            fate[w] = "отброшено: словарь знает как не-существительное"
            continue
        if w in ed_nouns:
            scrabble.add(w)
            fate[w] = "scrabble: ед.ч. по помете «ед.»"
            continue
        # 3. Both dictionaries miss the word: fall back to the orthographic note.
        c = classify(w, hmap.get(w))
        if c == "noun":
            s = to_singular(w)
            scrabble.add(s)
            fate[w] = "scrabble: сущ. по помете орфословаря" + ("" if s == w else f" → {s}")
        elif c == "adj":
            adj.append(w)
            fate[w] = "отброшено: прилагательное (помета орфословаря)"
        elif c == "verb":
            verb.append(w)
            fate[w] = "отброшено: глагол (помета орфословаря)"
        else:
            amb.append(w)
            fate[w] = "undefined: неоднозначное (нет в словарях, помета не определяет)"
    # Manual confirmations: nouns the maintainer approved from the undefined tail.
    for w in load(os.path.join(OUT_DIR, "manual_confirm.txt")):
        if cyr_ok(w):
            scrabble.add(w)
            fate[w] = "scrabble: подтверждено вручную (manual_confirm.txt)"
    # Variant rescue: a word joined by "и" to a confirmed noun is itself a noun.
    pending = set(amb) - scrabble
    changed = True
    while changed:
        changed = False
        for a, b in pairs:
            for x, y in ((a, b), (b, a)):
                if x in scrabble and y in pending:
                    scrabble.add(y)
                    pending.discard(y)
                    fate[y] = f"scrabble: вариант от «{x}» (через «и»)"
                    changed = True
    undefined = [w for w in amb if w not in scrabble]
    return {
        "oc": oc, "scrabble": scrabble, "undefined": undefined,
        "adjectives": adj, "verbs": verb, "singulars": ed_nouns,
        "fate": fate, "all": set(all_words),
    }


def trace(word, r):
    """Explain how `word` did or did not reach the dictionary, given build()'s result."""
    w = destress(word)
    if w in r["fate"]:
        return r["fate"][w]
    if w in r["scrabble"]:
        return "scrabble: лексикон OpenCorpora" if w in r["oc"] else "scrabble: производная/лемма"
    if w not in r["all"]:
        return "нет в all.txt (не извлечено на Stage 1 — нет в .pdf, либо имя собств./дефис/форма)"
    if not cyr_ok(w):
        return "отсеяно: длина или символы вне диапазона (2–15 кириллица)"
    return "не определено"


def main():
    ap = argparse.ArgumentParser(description="Stage 2 brain: build the noun dictionary, trace a word, or dump buckets.")
    ap.add_argument("--dump", action="store_true", help="also write the in-memory buckets (undefined, adjectives, verbs, singulars, fate.tsv)")
    ap.add_argument("--trace", metavar="WORD", help="report how WORD did or did not reach the dictionary, then exit")
    args = ap.parse_args()
    r = build()
    if args.trace:
        print(f"{args.trace}: {trace(args.trace, r)}")
        return
    write(os.path.join(OUT_DIR, "scrabble.txt"), r["scrabble"])
    print(f"=> dictprep/russian/scrabble.txt {len(r['scrabble'])}")
    print(f" undefined kept in memory: {len(set(r['undefined']))} (use --dump to write it)")
    if args.dump:
        write(os.path.join(OUT_DIR, "undefined.txt"), r["undefined"])
        write(os.path.join(OUT_DIR, "adjectives.txt"), r["adjectives"])
        write(os.path.join(OUT_DIR, "verbs.txt"), r["verbs"])
        write(os.path.join(OUT_DIR, "singulars.txt"), r["singulars"])
        fate_path = os.path.join(OUT_DIR, "fate.tsv")
        os.makedirs(OUT_DIR, exist_ok=True)
        with open(fate_path, "w", encoding="utf-8") as f:
            for w in sorted(r["fate"], key=key):
                f.write(f"{w}\t{r['fate'][w]}\n")
        print(f" dumped: undefined.txt ({len(set(r['undefined']))}), adjectives.txt, verbs.txt, singulars.txt, fate.tsv")


if __name__ == "__main__":
    main()
File diff suppressed because it is too large
+135
@@ -0,0 +1,135 @@
артгруппа
бутень
вебинар
видеодневник
водозащита
генацвале
жакоб
оберфюрер
околоть
особина
полбазара
полбака
полбалкона
полбанана
полбарана
полбатальона
полбатона
полбиблиотеки
полблокнота
полбокала
полбуханки
полвагона
полвечера
полвзвода
полвинта
полгазеты
полгектара
полгостиницы
полграмма
полгруппы
полдачи
полдвора
полдекабря
полдеревни
полдетсада
полдивана
полдивизии
полдыни
полжурнала
ползавода
ползарплаты
полздания
полканикул
полканистры
полкартофелины
полкастрюли
полквартиры
полкилограмма
полкласса
полкниги
полколлекции
полкольца
полкоманды
полкоробки
полкочана
полкурса
полкуска
полмагазина
полмандарина
полмарта
полматча
полмиллиметра
полмузея
полноября
полпакета
полпарка
полпартии
полпинты
полпирога
полпирожка
полпируэта
полпоезда
полполена
полполка
полполки
полполосы
полпомидора
полпоросёнка
полпосёлка
полпредовский
полпроцента
полпузырька
полрайона
полромана
полроты
полрулона
полряда
полсада
полсажени
полсезона
полсентября
полсловаря
полсостава
полсрока
полстада
полстены
полстолетия
полстраницы
полстроки
полтаблетки
полтайма
полтакта
полтарелки
полтетради
полтома
полтона
полторта
полтысячелетия
полтюбика
полусанаторий
полфакультета
полфевраля
полфлакона
полфразы
полхаты
полцарства
полцентнера
полцистерны
полчайника
полчемодана
полшажка
полшажочка
полшара
полшкафа
полшколы
полщеки
принт
промо
рентгеноаппарат
сивец
соцнаём
срывка
флеш
флешмобер
шиноремонт
File diff suppressed because it is too large
File diff suppressed because it is too large
+434
@@ -0,0 +1,434 @@
// Command ruwords extracts a clean Cyrillic word list from the plain text of a Russian
// orthographic dictionary (the output of `pdftotext`).
//
// Stage 1 (this tool): from the column word-list section [from, to] it collects, per
// entry, the headword (the leading token). When the headword is plural and the entry
// gives its singular after "ед." — in full ("ящеры, …, ед. ящер") or as a replacement
// suffix ("…, ед. -вец") — only the singular is kept, since a plural that has a singular
// is never needed. It drops stress marks, lowercases, keeps ё, and discards proper nouns
// (capitalized), hyphenated words, acronyms and non-Cyrillic tokens. The result is
// de-duplicated and sorted in Russian alphabetical order (ё right after е), LF-separated.
//
// It also collects a variant headword joined by "и" when it carries its own grammatical
// note (e.g. "аблатив, -а и аблятив, -а"). Suffix-singular reconstruction is heuristic;
// Stage 2 (dictprep/ru_stage2.py) re-checks the words against real dictionaries.
//
// pdftotext dictprep/orfo_dict_2025.pdf /tmp/slov.txt
// go run ./dictprep/ruwords -in /tmp/slov.txt -from 452 -to 168808 \
// -out russian_all.txt -skip russian_skip.txt
package main
import (
"bufio"
"flag"
"fmt"
"log"
"os"
"path/filepath"
"sort"
"strings"
"unicode"
)
// ruAlphabet is the Russian alphabet in collation order (ё directly after е).
const ruAlphabet = "абвгдеёжзийклмнопрстуфхцчшщъыьэюя"
var ruRank = func() map[rune]int {
m := make(map[rune]int, len(ruAlphabet))
for i, r := range []rune(ruAlphabet) {
m[r] = i
}
return m
}()
func isCyrLetter(r rune) bool {
return (r >= 'а' && r <= 'я') || (r >= 'А' && r <= 'Я') || r == 'ё' || r == 'Ё'
}
func isUpperCyr(r rune) bool { return (r >= 'А' && r <= 'Я') || r == 'Ё' }
func isStress(r rune) bool { return r == 0x0300 || r == 0x0301 }
// cleanWord normalizes a run of letters/stress-marks into a lowercase Cyrillic word, or
// returns ok=false for proper nouns (capitalized), hyphenated or non-Cyrillic runs.
func cleanWord(run []rune) (string, bool) {
if len(run) == 0 || isUpperCyr(run[0]) {
return "", false
}
var b strings.Builder
for _, r := range run {
switch {
case isStress(r), r == '\u00ad': // drop stress accents and soft hyphens (U+00AD)
case r == '-': // a real hyphen means a hyphenated word: reject it
return "", false
default:
b.WriteRune(unicode.ToLower(r))
}
}
w := b.String()
if w == "" {
return "", false
}
for _, r := range w {
if !((r >= 'а' && r <= 'я') || r == 'ё') {
return "", false
}
}
return w, true
}
// headword returns the entry's headword: the leading run of letters, stress marks and
// hyphens, normalized.
func headword(line string) (string, bool) {
// Trim leading whitespace, including the form-feed (U+000C) that pdftotext puts at
// the top of each page — otherwise the first headword on every page is lost.
line = strings.TrimLeftFunc(line, unicode.IsSpace)
var run []rune
for _, r := range line {
if isCyrLetter(r) || isStress(r) || r == '-' || r == '\u00ad' { // '\u00ad' is the soft hyphen
run = append(run, r)
} else {
break
}
}
return cleanWord(run)
}
// embeddedSingulars returns the singular form of a plural headword spelled out after
// "ед.", either in full ("ед. ящер") or as a replacement suffix ("ед. -вец",
// reconstructed from headword). It skips gender marks ("ед. м") and abbreviations that
// merely start with "ед." ("ед. измер.", "ден. ед.").
func embeddedSingulars(line, headword string) []string {
var out []string
for i := 0; ; {
j := strings.Index(line[i:], "ед.")
if j < 0 {
break
}
i += j + len("ед.")
rest := strings.TrimLeft(line[i:], "  \t")
if strings.HasPrefix(rest, "-") { // suffix form: reconstruct from the headword
var suf []rune
for _, r := range rest[len("-"):] {
if isCyrLetter(r) || isStress(r) {
suf = append(suf, r)
} else {
break
}
}
if s, ok := cleanWord(suf); ok && len([]rune(s)) >= 2 {
if recon := reconstructSingular(headword, s); recon != "" {
out = append(out, recon)
}
}
continue
}
var run []rune
consumed := 0
for _, r := range rest {
if isCyrLetter(r) || isStress(r) {
run = append(run, r)
consumed += len(string(r))
} else {
break
}
}
if len(run) == 0 {
continue
}
if strings.HasPrefix(rest[consumed:], ".") {
continue // an abbreviation like "ед. измер." rather than a singular form
}
w, ok := cleanWord(run)
if !ok || len([]rune(w)) < 2 { // 2+ letters excludes the gender marks м/ж/с
continue
}
out = append(out, w)
}
return out
}
// reconstructSingular builds the singular from a plural headword and the replacement
// suffix from "ед. -<suffix>", splicing where the suffix best overlaps the tail of the
// headword (the position of longest common prefix between the suffix and a headword
// suffix). It is a heuristic; Stage 2 re-checks the words against real dictionaries.
func reconstructSingular(headword, suffix string) string {
hw, sf := []rune(headword), []rune(suffix)
bestK, bestLen := -1, 0
for k := 0; k < len(hw); k++ {
m := 0
for k+m < len(hw) && m < len(sf) && hw[k+m] == sf[m] {
m++
}
if m > bestLen {
bestK, bestLen = k, m
}
}
if bestK < 0 {
return ""
}
return string(hw[:bestK]) + suffix
}
// headwordNotes are the grammatical notes that mark a parallel headword (a lemma) after
// "и", as opposed to an inflected form. A "-" ending also marks one; form labels such as
// деепр. (gerund) or сравн. (comparative) deliberately do not.
var headwordNotes = map[string]bool{
"нескл": true, "неизм": true, "предлог": true, "предл": true, "нареч": true,
"нар": true, "прил": true, "союз": true, "частица": true, "част": true,
"межд": true, "мн": true, "ед": true, "тв": true, "числ": true, "мест": true,
"м": true, "ж": true, "с": true, "вводн": true, "сказ": true,
}
// variantNoteOK reports whether the note following a candidate variant marks a headword:
// a "-" inflection ending or one of headwordNotes (and not a bare inflected word).
func variantNoteOK(note string) bool {
if strings.HasPrefix(note, "-") {
return true
}
var stem []rune
for _, r := range note {
if (r >= 'а' && r <= 'я') || r == 'ё' {
stem = append(stem, r)
} else {
break
}
}
return headwordNotes[string(stem)]
}
// variants returns the second (and further) headwords of an entry, written as a parallel
// form after " и ", e.g. "аблатив, -а и аблятив, -а" yields "аблятив" and "регги и реггей,
// нескл." yields "реггей". Requiring a headword note after the comma keeps this from
// matching "и" inside examples or picking up inflected forms.
func variants(line string) []string {
var out []string
const sep = " и "
for i := 0; ; {
j := strings.Index(line[i:], sep)
if j < 0 {
break
}
i += j + len(sep)
rest := line[i:]
var run []rune
consumed := 0
for _, r := range rest {
if isCyrLetter(r) || isStress(r) {
run = append(run, r)
consumed += len(string(r))
} else {
break
}
}
if len(run) == 0 {
continue
}
after := rest[consumed:]
if !strings.HasPrefix(after, ", ") || !variantNoteOK(after[len(", "):]) {
continue
}
if w, ok := cleanWord(run); ok && len([]rune(w)) >= 2 {
out = append(out, w)
}
}
return out
}
// normToken normalizes any token (a run of letters and stress marks) for the skip set:
// lowercase, stress removed, kept only if it is 2+ all-Cyrillic letters. Unlike
// cleanWord it does NOT reject capitalized tokens — a lowercased proper noun belongs in
// the skip set so it can be re-checked by a morphological analyzer.
func normToken(run []rune) (string, bool) {
var b strings.Builder
for _, r := range run {
if isStress(r) {
continue
}
b.WriteRune(unicode.ToLower(r))
}
w := b.String()
if len([]rune(w)) < 2 {
return "", false
}
for _, r := range w {
if !((r >= 'а' && r <= 'я') || r == 'ё') {
return "", false
}
}
return w, true
}
// tokens returns every maximal run of Cyrillic letters (plus stress marks) in the line,
// normalized; runs are split on every other character (so hyphens split a word).
func tokens(line string) []string {
var out []string
var run []rune
flush := func() {
if len(run) > 0 {
if w, ok := normToken(run); ok {
out = append(out, w)
}
run = run[:0]
}
}
for _, r := range line {
if isCyrLetter(r) || isStress(r) {
run = append(run, r)
} else {
flush()
}
}
flush()
return out
}
func lessRu(a, b string) bool {
ra, rb := []rune(a), []rune(b)
for i := 0; i < len(ra) && i < len(rb); i++ {
if ra[i] != rb[i] {
return ruRank[ra[i]] < ruRank[rb[i]]
}
}
return len(ra) < len(rb)
}
func sortedRu(set map[string]struct{}) []string {
words := make([]string, 0, len(set))
for w := range set {
words = append(words, w)
}
sort.Slice(words, func(i, j int) bool { return lessRu(words[i], words[j]) })
return words
}
func writeWords(path string, words []string) error {
if dir := filepath.Dir(path); dir != "" && dir != "." {
if err := os.MkdirAll(dir, 0o755); err != nil {
return err
}
}
o, err := os.Create(path)
if err != nil {
return err
}
w := bufio.NewWriter(o)
for _, word := range words {
w.WriteString(word)
w.WriteByte('\n')
}
if err := w.Flush(); err != nil {
o.Close()
return err
}
return o.Close()
}
func main() {
in := flag.String("in", "dictprep/russian/orfo_dict_2025.txt", "plain-text dictionary (pdftotext output)")
out := flag.String("out", "dictprep/russian/all.txt", "output: the base word list (clean headwords + reconstructed singulars + variants)")
skip := flag.String("skip", "/tmp/ru_skip.txt", "output: every other token, for a later morphology re-check")
sings := flag.String("singulars", "/tmp/ru_singulars.txt", "output: singulars reconstructed from \"ед.\" (known nouns)")
varsOut := flag.String("variants", "/tmp/ru_variants.txt", "output: variant pairs joined by \"и\" (primary<TAB>variant)")
from := flag.Int("from", 452, "first line of the word-list section (1-based, inclusive)")
to := flag.Int("to", 168808, "last line of the word-list section (inclusive)")
flag.Parse()
if *in == "" {
log.Fatal("ruwords: -in is required")
}
f, err := os.Open(*in)
if err != nil {
log.Fatal(err)
}
defer f.Close()
all := make(map[string]struct{})
allTokens := make(map[string]struct{})
singulars := make(map[string]struct{})
variantPairs := make(map[string]struct{})
entries, fromHead, fromSing, fromVar := 0, 0, 0, 0
sc := bufio.NewScanner(f)
sc.Buffer(make([]byte, 1<<20), 1<<20)
for line := 0; sc.Scan(); {
line++
if line < *from || line > *to {
continue
}
entries++
text := sc.Text()
hw, hwOK := headword(text)
var sings []string
if hwOK {
sings = embeddedSingulars(text, hw)
}
primary := ""
if len(sings) > 0 {
// the headword is plural and the entry gives its singular: keep only the singular
primary = sings[0]
for _, w := range sings {
if _, seen := all[w]; !seen {
fromSing++
all[w] = struct{}{}
}
singulars[w] = struct{}{}
}
} else if hwOK {
primary = hw
if _, seen := all[hw]; !seen {
fromHead++
}
all[hw] = struct{}{}
}
for _, w := range variants(text) {
if _, seen := all[w]; !seen {
fromVar++
all[w] = struct{}{}
}
if primary != "" && primary != w {
variantPairs[primary+"\t"+w] = struct{}{}
}
}
for _, w := range tokens(text) {
allTokens[w] = struct{}{}
}
}
if err := sc.Err(); err != nil {
log.Fatal(err)
}
skipSet := make(map[string]struct{})
for w := range allTokens {
if _, ok := all[w]; !ok {
skipSet[w] = struct{}{}
}
}
allWords := sortedRu(all)
skipWords := sortedRu(skipSet)
if err := writeWords(*out, allWords); err != nil {
log.Fatal(err)
}
if err := writeWords(*skip, skipWords); err != nil {
log.Fatal(err)
}
if err := writeWords(*sings, sortedRu(singulars)); err != nil {
log.Fatal(err)
}
pairList := make([]string, 0, len(variantPairs))
for p := range variantPairs {
pairList = append(pairList, p)
}
sort.Strings(pairList)
if err := writeWords(*varsOut, pairList); err != nil {
log.Fatal(err)
}
fmt.Printf("scanned %d entries\n", entries)
fmt.Printf(" %-20s %7d words (%d headwords + %d embedded singulars + %d variants)\n", *out, len(allWords), fromHead, fromSing, fromVar)
fmt.Printf(" %-20s %7d words (tokens not in %s; for a morphology re-check)\n", *skip, len(skipWords), *out)
fmt.Printf(" %-20s %7d words (singulars from \"ед.\"; known nouns)\n", *sings, len(singulars))
fmt.Printf(" %-20s %7d pairs (variants joined by \"и\")\n", *varsOut, len(variantPairs))
}