dictprep: Russian orthographic dictionary → Scrabble noun pipeline
Build a committed Russian common-noun word list (dictprep/russian/scrabble.txt) from the RAN orthographic dictionary, for the Эрудит ruleset.

- Stage 1 (Go, dictprep/ruwords): orfo_dict_2025.txt -> all.txt; extracts headwords, reconstructs "ед." singulars (suppressing plurals), pairs "и" variants.
- Stage 2 (Python brain, dictprep/ru_stage2.py): OpenCorpora (mawo-pymorphy3) + libmorph + orthographic notes select common nouns (nom. sing.); --trace explains a word's fate, --dump writes the in-memory buckets.
- libmorph C++ bridge (libmorph_check.cpp); manual_confirm.txt is merged in.
- orfo_dict_2025.txt is the committed pdftotext source of truth.
- See dictprep/README.md for methodology and reproducibility.
+10
-1
@@ -6,4 +6,13 @@
|
|||||||
|
|
||||||
# Local scratch
|
# Local scratch
|
||||||
/tmp/
|
/tmp/
|
||||||
*.pdf
|
|
||||||
|
# Compiled libmorph bridge (build artifact; see dictprep/README.md)
|
||||||
|
/dictprep/libmorph_check
|
||||||
|
|
||||||
|
# Stage 2 --dump debug buckets (committed: all, scrabble, manual_confirm, orfo_dict_2025)
|
||||||
|
/dictprep/russian/undefined.txt
|
||||||
|
/dictprep/russian/adjectives.txt
|
||||||
|
/dictprep/russian/verbs.txt
|
||||||
|
/dictprep/russian/singulars.txt
|
||||||
|
/dictprep/russian/fate.tsv
|
||||||
|
|||||||
@@ -0,0 +1,164 @@
|
|||||||
|
# Russian word-list preparation (`dictprep`)
|
||||||
|
|
||||||
|
Builds the Russian **noun** word list for the Scrabble/Эрудит solver out of the official
|
||||||
|
Russian academic **orthographic dictionary**, cross-checked against two independent
|
||||||
|
morphological dictionaries.
|
||||||
|
|
||||||
|
The goal of the pipeline is a list of **common nouns in the nominative singular**
|
||||||
|
(`dictprep/russian/scrabble.txt`), plus an ambiguous tail for manual review.
|
||||||
|
|
||||||
|
> This directory is self-contained tooling for *building* the word list. It is not part
|
||||||
|
> of the solver library. The committed result lives in `dictprep/russian/`.
|
||||||
|
|
||||||
|
## Source
|
||||||
|
|
||||||
|
`orfo_dict_2025.pdf` — *Русский орфографический словарь РАН* (≈ 200 000 entries), the
|
||||||
|
authority for **spelling**. It encodes declension type in its grammatical notes but does
|
||||||
|
**not** reliably mark part of speech.
|
||||||
|
|
||||||
|
- Source: <https://ruslang.ru/sites/default/files/doc/normativnyje_slovari/orfograficheskij_slovar.pdf>
|
||||||
|
- Mirror: <https://rus-gos.spbu.ru/index.php/dictionary>
|
||||||
|
|
||||||
|
The PDF is git-ignored (large, third-party); place it here as `orfo_dict_2025.pdf`. Its
|
||||||
|
pdftotext output is committed as `russian/orfo_dict_2025.txt`, so the word list rebuilds
|
||||||
|
from the text alone — the binary PDF is needed only to regenerate that text.
|
||||||
|
|
||||||
|
## Outputs (`dictprep/russian/`)
|
||||||
|
|
||||||
|
The committed result is **four** files; every other bucket stays in the Stage-2
process's memory (dump it with `--dump`, query it with `--trace WORD`).
|
||||||
|
|
||||||
|
| File | Committed | Meaning |
|
||||||
|
|------|:--:|---------|
|
||||||
|
| `orfo_dict_2025.txt` | ✓ | the pdftotext output — the parsed source of truth (the PDF binary is not needed to rebuild). |
|
||||||
|
| `all.txt` | ✓ | Stage 1 base: every clean Cyrillic headword/variant; a plural headword with a singular is replaced by that singular. |
|
||||||
|
| `manual_confirm.txt` | ✓ | hand-reviewed nouns from the undefined tail; the brain merges them into the result. |
|
||||||
|
| `scrabble.txt` | ✓ | **Stage 2 result**: common nouns, nominative singular (+ pluralia tantum), length 2–15 — the working dictionary. |
|
||||||
|
| `undefined.txt` | — | the ambiguous tail; kept in memory, written only with `--dump`. |
|
||||||
|
|
||||||
|
`--dump` also writes `adjectives.txt`, `verbs.txt`, `singulars.txt` and `fate.tsv` (every
|
||||||
|
word with the reason it did or did not reach the dictionary); these are git-ignored debug
|
||||||
|
artifacts. Stage 1 also writes `/tmp/ru_{skip,singulars,variants}.txt`, intermediate inputs
|
||||||
|
the brain consumes.
|
||||||
|
|
||||||
|
## Prerequisites
|
||||||
|
|
||||||
|
```sh
|
||||||
|
# 1. pdftotext (Poppler)
|
||||||
|
sudo apt-get install -y poppler-utils
|
||||||
|
|
||||||
|
# 2. Go toolchain (Stage 1) — already required by the parent module
|
||||||
|
|
||||||
|
# 3. Python + the OpenCorpora analyser (Stage 2)
|
||||||
|
sudo apt-get install -y python3-venv python3-pip
|
||||||
|
python3 -m venv ru-venv
|
||||||
|
ru-venv/bin/pip install mawo-pymorphy3 # bundles OpenCorpora 2025 (words.dawg)
|
||||||
|
|
||||||
|
# 4. libmorph — the independent morphological dictionary (Stage 2 cross-check)
|
||||||
|
sudo apt-get install -y morphrus morphrus-dev moonycode-dev morphapi-dev
|
||||||
|
g++ -std=c++17 -O2 dictprep/libmorph_check.cpp -lmorphrus -lmoonycode -o dictprep/libmorph_check
|
||||||
|
```
|
||||||
|
|
||||||
|
If `dictprep/libmorph_check` is absent, Stage 2 still runs — it simply drops libmorph from
|
||||||
|
the stack and reports `libmorph_helper=MISSING`.
|
||||||
|
|
||||||
|
## How to run
|
||||||
|
|
||||||
|
```sh
|
||||||
|
# Stage 0 — PDF -> plain text (committed as the source of truth; run once)
|
||||||
|
pdftotext dictprep/orfo_dict_2025.pdf dictprep/russian/orfo_dict_2025.txt
|
||||||
|
|
||||||
|
# Stage 1 — build the base word list (Go): dictprep/russian/all.txt + /tmp/ru_*.txt
|
||||||
|
go run ./dictprep/ruwords
|
||||||
|
|
||||||
|
# Stage 2 — the brain (Python + mawo + libmorph): writes scrabble.txt
|
||||||
|
ru-venv/bin/python dictprep/ru_stage2.py
|
||||||
|
|
||||||
|
# ask how a word did or did not reach the dictionary
|
||||||
|
ru-venv/bin/python dictprep/ru_stage2.py --trace травмпункт
|
||||||
|
# also write the in-memory buckets (undefined, adjectives, verbs, singulars, fate.tsv)
|
||||||
|
ru-venv/bin/python dictprep/ru_stage2.py --dump
|
||||||
|
```
|
||||||
|
|
||||||
|
`-from`/`-to` (defaulting to 452/168808) bound the column word-list section of
|
||||||
|
`russian/orfo_dict_2025.txt` (line 452 = the first entry `а1, …`; line 168808 = the last,
|
||||||
|
`я́щурный`). The preface above line 452 is prose and is skipped. Verify these bounds if the
|
||||||
|
PDF is re-exported.
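
A re-export can shift line numbers, so the bounds are worth checking mechanically. The sketch below is a hypothetical helper (neither `section` nor `bounds_look_right` exists in the repository). It assumes the same 1-based inclusive convention as `-from`/`-to` and compares the first and last lines of the slice against the expected headwords after stripping stress marks.

```python
STRESS_MARKS = ("\u0300", "\u0301")  # combining grave / acute used as stress marks


def section(lines, first, last):
    """Return lines first..last, 1-based and inclusive, like -from/-to."""
    return lines[first - 1:last]


def bounds_look_right(lines, first, last, first_head, last_head):
    """True if the slice starts and ends with the expected (destressed) headwords."""
    sec = section(lines, first, last)
    if not sec:
        return False

    def head(raw):
        # lstrip() also removes the form feed pdftotext puts at a page top.
        stripped = raw.lstrip()
        return "".join(c for c in stripped if c not in STRESS_MARKS)

    return head(sec[0]).startswith(first_head) and head(sec[-1]).startswith(last_head)
```

With the committed text this would be called as `bounds_look_right(lines, 452, 168808, "а1", "ящурный")`, where `lines` is the file split on newlines.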
|
||||||
|
|
||||||
|
## Algorithm
|
||||||
|
|
||||||
|
### Stage 1 — `ruwords` (Go)
|
||||||
|
|
||||||
|
Per dictionary line in `[from, to]` it collects, normalised (stress marks U+0300/U+0301
|
||||||
|
stripped, lowercased, `ё` kept, hyphenated/capitalised/non-Cyrillic rejected):
|
||||||
|
|
||||||
|
- the **headword** (leading token). Leading whitespace including the form-feed `\f`
|
||||||
|
pdftotext puts at every page top is trimmed — otherwise the first headword of each page
|
||||||
|
is lost;
|
||||||
|
- the **singular of a plural headword** when the entry gives it after `ед.`, in full
|
||||||
|
(`ящеры, …, ед. ящер`) or as a replacement suffix (`…, ед. -вец`, spliced where the
|
||||||
|
suffix best overlaps the headword); the plural is then dropped (a plural that has a
|
||||||
|
singular is never needed) and the singular is also recorded (`/tmp/ru_singulars.txt`);
|
||||||
|
- **variant headwords** after `и` that carry their own grammatical note
|
||||||
|
(`аблатив, -а и аблятив, -а`; `регги и реггей, нескл.`), excluding inflected forms.
|
||||||
|
|
||||||
|
Everything else (every maximal Cyrillic token not selected above) goes to
|
||||||
|
`/tmp/ru_skip.txt`, a safety net for a later morphology re-check.
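
The normalisation rule can be sketched in a few lines. This is a Python approximation, not the Go code in `dictprep/ruwords`, and `normalise` is a hypothetical name. It assumes that a token containing any capital letter, hyphen, or non-Cyrillic character is rejected outright, so no separate lowercasing step is needed.

```python
STRESS_MARKS = {"\u0300", "\u0301"}  # combining grave / acute
ALPHABET = "абвгдеёжзийклмнопрстуфхцчшщъыьэюя"  # lowercase only, ё kept


def normalise(token):
    """Return the clean word, or None when the token is rejected."""
    bare = "".join(c for c in token if c not in STRESS_MARKS)
    if not bare:
        return None
    # Capitals, hyphens, digits and Latin letters all fall outside ALPHABET.
    if any(c not in ALPHABET for c in bare):
        return None
    return bare
```

Under these assumptions a stressed headword such as `я́щер` comes out as `ящер`, while a capitalised or hyphenated token yields `None`.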
|
||||||
|
|
||||||
|
### Stage 2 — `ru_stage2.py` (Python)
|
||||||
|
|
||||||
|
Each Stage-1 word (length 2–15) is routed by three sources, most authoritative first:
|
||||||
|
|
||||||
|
1. **OpenCorpora** (`words.dawg`, read directly — *not* the predictor): a common-noun
|
||||||
|
reading ⇒ keep the OpenCorpora lemma. The full OpenCorpora common-noun lexicon is also
|
||||||
|
added (so nouns absent from the PDF are included).
|
||||||
|
2. **libmorph** (independent dictionary, via `libmorph_check`): a common-noun reading ⇒
|
||||||
|
keep the libmorph lemma. The two dictionaries are treated as **complementary** — a noun
|
||||||
|
reading in *either* is enough (their disagreements were reviewed and resolved this way,
|
||||||
|
since each is incomplete in different places). A singular reconstructed from "ед." that
|
||||||
|
neither dictionary knows is accepted as a noun (the orthographic note attests it).
|
||||||
|
3. A word **both dictionaries miss** is classified by the orthographic **note**
|
||||||
|
(`-ая, -ое` ⇒ adjective; `-ть`, `сов./несов.` ⇒ verb; single genitive `-а/-и` or
|
||||||
|
`нескл., м./ж./с.` ⇒ noun). A note-noun goes straight to `scrabble.txt`; an adjective or
|
||||||
|
verb is dropped; anything undecided goes to `undefined.txt`.
|
||||||
|
4. **Variant rescue**: when the dictionary joins two spellings with "и" (`травмопункт и
|
||||||
|
травмпункт`, `регги и реггей`) and one is already a confirmed noun, the other is moved
|
||||||
|
from review/undefined into the result as well, propagated transitively through chains.
|
||||||
|
The plural-form variants the dictionaries already resolve never reach this step.
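
The transitive propagation in step 4 is a small fixed-point loop. The sketch below mirrors the loop in `build()` of `ru_stage2.py`, but the function name `rescue_variants` and its signature are illustrative only. It repeats passes over the "и" pairs until a pass confirms nothing new.

```python
def rescue_variants(confirmed, pending, pairs):
    """Move pending words into confirmed when an "и" pair links them to a confirmed noun."""
    confirmed, pending = set(confirmed), set(pending)
    changed = True
    while changed:  # repeat until a full pass adds nothing (chains resolve transitively)
        changed = False
        for a, b in pairs:
            for known, other in ((a, b), (b, a)):  # the link works in both directions
                if known in confirmed and other in pending:
                    confirmed.add(other)
                    pending.discard(other)
                    changed = True
    return confirmed, pending
```

Because a pair is only acted on when one side is already confirmed, the order of `pairs` does not affect the final result, only the number of passes.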
|
||||||
|
|
||||||
|
The nominative singular always comes from the dictionary that recognised the word, or from
|
||||||
|
the orthographic `ед.` note — never from a predictor guess (libmorph and the predictor
|
||||||
|
mis-lemmatise out-of-dictionary words, e.g. `витебчане → витебчан` instead of `витебчанин`).
|
||||||
|
|
||||||
|
### The libmorph bridge — `libmorph_check.cpp`
|
||||||
|
|
||||||
|
libmorph (A. Kovalenko, MIT) ships as `libmorphrus.so`. `libmorph_check` is a thin
|
||||||
|
stdin→stdout filter: one UTF-8 word per line in, one line out:
|
||||||
|
|
||||||
|
```
|
||||||
|
<known>\t<pos>:<lemma>\t<pos>:<lemma>...
|
||||||
|
```
|
||||||
|
|
||||||
|
`<known>` is `CheckWord` (1 = in the dictionary). `<pos>` is `wdInfo & 0x3f`, the part of
|
||||||
|
speech. The codes were reverse-engineered (the docs omit the table):
|
||||||
|
|
||||||
|
| codes | part of speech |
|------|----------------|
| 1–3 | verb |
| **7–21, 24** | **noun** (all genders / declensions / animacy; pluralia tantum is 24) |
| 25, 27 | adjective |
| 28–32 | pronoun |
| 33–36 | numeral |
| 38–39 | **proper noun** (excluded) |
| 48–58 | comparative/adverb |
| 49–53 | function words |
|
||||||
|
|
||||||
|
The analyser instance is requested with the key `libmorph.api.v4:utf-8` so words are
|
||||||
|
passed and lemmas returned in UTF-8.
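
On the consuming side, one output line can be decoded as follows. This is a minimal sketch in the spirit of `libmorph_analyze` in `ru_stage2.py`. The name `parse_bridge_line` is hypothetical, the noun codes are the reverse-engineered ones from the table above, and any sample line fed to it is made up for illustration.

```python
NOUN_CODES = set(range(7, 22)) | {24}  # 7..21 plus 24 (pluralia tantum)


def parse_bridge_line(line):
    """Split '<known>\\t<pos>:<lemma>...' into (known, [common-noun lemmas])."""
    fields = line.rstrip("\n").split("\t")
    known = fields[0] == "1"
    noun_lemmas = []
    for field in fields[1:]:
        code, sep, lemma = field.partition(":")
        # Skip malformed fields and every part of speech that is not a common noun.
        if sep and code.isdigit() and int(code) in NOUN_CODES:
            noun_lemmas.append(lemma)
    return known, noun_lemmas
```

A word can be known to libmorph and still return an empty lemma list, for example when its only reading is a proper noun (codes 38–39).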
|
||||||
|
|
||||||
|
## Notes & caveats
|
||||||
|
|
||||||
|
- The hard tail (≈ 35 000 of the Stage-1 candidate words) is in **no** morphological
|
||||||
|
dictionary; only the orthographic dictionary attests them, so the PDF note is the sole
|
||||||
|
signal there. Compound and very recent nouns (`робототехник`, `толкинист`) live here.
|
||||||
|
- OpenCorpora and libmorph are near-equal in size (≈ 99 500 words each on `all.txt`)
|
||||||
|
and ≈ 96 % overlapping, but **complementary** (each contributes ≈ 2 200 unique nouns),
|
||||||
|
which is why both are kept. The mawo *predictor* "knows" ~98 % of everything by guessing
|
||||||
|
and is therefore used only as a weak confirming vote, never as dictionary membership.
|
||||||
|
- Licensing: OpenCorpora data is CC BY-SA 3.0; libmorph is MIT; the orthographic
|
||||||
|
dictionary has its own copyright. A list derived from CC BY-SA data inherits that licence.
|
||||||
@@ -0,0 +1,47 @@
|
|||||||
|
// libmorph_check: a thin stdin->stdout bridge to the libmorph Russian morphological
// analyser, for use by the Stage-2 classifier (dictprep/ru_stage2.py).
//
// Reads one word per line. Bytes are passed through verbatim, so the caller must send
// them in the code page selected by the factory key (UTF-8 by default, see main).
// For each word it writes a line:
//
//   <known>\t<pos>:<lemma>\t<pos>:<lemma>...
//
// where <known> is CheckWord's result (1 = in the dictionary, 0 = not), and each
// following field is one lexeme: its part of speech (wdInfo & 0x3f) and lemma.
//
// Build: g++ -std=c++17 -O2 dictprep/libmorph_check.cpp -lmorphrus -lmoonycode -o dictprep/libmorph_check
|
||||||
|
#include <libmorph/rus.h>
#include <libmorph/api.hpp>
#include <cstdio>
#include <iostream>
#include <string>

int main(int argc, char** argv) {
    // The factory key selects the code page: "libmorph.api.v4:<charset>". Use the
    // UTF-8 instance so words pass through verbatim. IMlmaMbXX only adds non-virtual
    // convenience wrappers over IMlmaMb, so the filled pointer can be used as such.
    const char* key = argc > 1 ? argv[1] : "libmorph.api.v4:utf-8";
    IMlmaMbXX* mlma = nullptr;
    int rc = mlmaruGetAPI(key, (void**)&mlma);
    if (mlma == nullptr) {
        std::fprintf(stderr, "libmorph_check: GetAPI('%s') failed, rc=%d\n", key, rc);
        return 1;
    }
    std::string line;
    while (std::getline(std::cin, line)) {
        if (!line.empty() && line.back() == '\r') line.pop_back();
        IMlmaMbXX::inword w(line.c_str(), line.size());
        int known = mlma->CheckWord(w, sfIgnoreCapitals);
        std::cout << known;
        try {
            for (auto& lx : mlma->Lemmatize(w, sfIgnoreCapitals)) {
                unsigned pos = lx.ngrams > 0 ? (lx.pgrams[0].wdInfo & 0x3f) : 0xffu;
                std::cout << '\t' << pos << ':' << (lx.plemma ? lx.plemma : "");
            }
        } catch (...) {
        }
        std::cout << '\n';
    }
    return 0;
}
|
||||||
@@ -0,0 +1,341 @@
|
|||||||
|
#!/usr/bin/env python3
"""Stage 2: the "brain" of the Russian Scrabble word-list pipeline.

It reads the Stage-1 base word list (built once by ruwords so the heavy PDF is not
re-parsed) together with the grammatical notes and the singular/variant structure, runs
the whole noun-selection logic in memory, and writes a minimal result:

    dictprep/russian/scrabble.txt    the working dictionary (common nouns, nom. sing.)
    dictprep/russian/undefined.txt   the ambiguous tail, left for manual review
                                     (written only with --dump)

(dictprep/russian/all.txt is the Stage-1 base.) Every other bucket (adjectives, verbs,
the merged note-nouns, singulars, variants) stays in memory. Pass --dump to also write
them; pass --trace WORD to ask how a single word did or did not reach the dictionary.

Note: all.txt is a plain word list, so the grammatical notes, "ед." singulars and "и"
variants are read from the pdftotext output (orfo_dict_2025.txt) and the Stage-1 side
files; the expensive PDF parse itself runs only once.

Sources, most authoritative first: OpenCorpora (mawo-pymorphy3), libmorph (libmorph_check),
and the orthographic dictionary's own notes. See dictprep/README.md.

Run: ru-venv/bin/python dictprep/ru_stage2.py [--dump] [--trace WORD]
"""
import argparse
import os
import re
import subprocess

HERE = os.path.dirname(os.path.abspath(__file__))
OUT_DIR = os.path.join(HERE, "russian")
SLOV = os.path.join(OUT_DIR, "orfo_dict_2025.txt")  # committed pdftotext output (source of truth)
WL_FROM, WL_TO = 452, 168808  # 1-based inclusive bounds of the column word-list section
OC_CACHE = "/tmp/oc_nouns.txt"
LIBMORPH_BIN = os.path.join(HERE, "libmorph_check")

ALPHABET = "абвгдеёжзийклмнопрстуфхцчшщъыьэюя"
ORDER = {c: i for i, c in enumerate(ALPHABET)}
PROPER = {"Name", "Surn", "Patr", "Geox", "Orgn", "Trad"}
LIBMORPH_NOUN_CODES = set(range(7, 22)) | {24}  # 7..21 plus 24 (pluralia tantum)
ADJ_END = {"ая", "яя", "ое", "ее", "ье", "ья", "ьи"}
VERB3 = ("ет", "ёт", "ит", "ют", "ут", "ает", "яет", "ует", "уют", "нет", "жет", "чет")
GENPL = ("ов", "ёв", "ев", "ей")


def key(w):
    return [ORDER.get(c, 99) for c in w]


def destress(s):
    return "".join(c for c in s if ord(c) not in (0x0300, 0x0301)).lower()


def cyr_ok(w):
    return 2 <= len(w) <= 15 and all(("а" <= c <= "я") or c == "ё" for c in w)


def load(p):
    return [l.strip() for l in open(p, encoding="utf-8") if l.strip()] if os.path.exists(p) else []


def write(path, words):
    os.makedirs(os.path.dirname(path), exist_ok=True)
    open(path, "w", encoding="utf-8").write("\n".join(sorted(set(words), key=key)) + "\n")


import mawo_pymorphy3  # noqa: E402

M = mawo_pymorphy3.MorphAnalyzer()
D = M._dawg_dict
|
||||||
|
|
||||||
|
|
||||||
|
def oc_noun_lemmas():
    """Every common-noun lemma (nom. sing. / pluralia tantum) in OpenCorpora's words.dawg."""
    gp, pt = D.get_paradigm, D.parse_tag_string
    para0, tagc = {}, {}

    def g0(pid):
        r = para0.get(pid)
        if r is None:
            suf0, tag0, pre0 = gp(pid, 0)
            _, gr = pt(tag0)
            r = (pre0, suf0, gr)
            para0[pid] = r
        return r

    def gt(pid, idx):
        k = (pid, idx)
        r = tagc.get(k)
        if r is None:
            suf, tag, pre = gp(pid, idx)
            pos, gr = pt(tag)
            r = (suf, pre, pos, gr)
            tagc[k] = r
        return r

    out = set()
    for word, rec in D.words_dawg.iteritems():
        pid, idx = rec
        suf, pre, pos, gr = gt(pid, idx)
        if pos != "NOUN":
            continue
        pre0, suf0, gr0 = g0(pid)
        if (PROPER & gr) or (PROPER & gr0):
            continue
        stem = word[len(pre):len(word) - len(suf)] if suf else word[len(pre):]
        out.add(pre0 + stem + suf0)
    return {w for w in out if cyr_ok(w)}


def oc_status(word):
    """(is_common_noun, in_dictionary) for word, from OpenCorpora only."""
    parses = D.get_word_parses(word)
    if not parses:
        return False, False
    gp, pt = D.get_paradigm, D.parse_tag_string
    for pid, idx in parses:
        suf, tag, pre = gp(pid, idx)
        pos, gr = pt(tag)
        if pos == "NOUN":
            _, tag0, _ = gp(pid, 0)
            _, gr0 = pt(tag0)
            if not (PROPER & gr or PROPER & gr0):
                return True, True
    return False, True
|
||||||
|
|
||||||
|
|
||||||
|
def libmorph_analyze(words):
    """Map each word to (known, noun_lemma, codes) per libmorph; noun_lemma is None when it
    is not a common noun there. Empty result if the helper binary is not built."""
    words = list(words)
    if not words or not os.path.exists(LIBMORPH_BIN):
        return {}
    proc = subprocess.run([LIBMORPH_BIN], input="\n".join(words), capture_output=True, text=True)
    out = {}
    for w, line in zip(words, proc.stdout.split("\n")):
        fields = line.split("\t")
        known = fields[:1] == ["1"]
        codes, noun_lemmas = set(), []
        for field in fields[1:]:
            code, _, lex = field.partition(":")
            if code.isdigit():
                codes.add(int(code))
                if int(code) in LIBMORPH_NOUN_CODES:
                    noun_lemmas.append(lex)
        lemma = (w if w in noun_lemmas else noun_lemmas[0]) if noun_lemmas else None
        out[w] = (known, lemma, codes)
    return out


def build_notes():
    """Map each headword (destressed, lowercased) to its grammatical note."""
    def is_hw(ch):
        o = ord(ch)
        return (0x0430 <= o <= 0x044F) or (0x0410 <= o <= 0x042F) or o in (0x0401, 0x0451, 0x0300, 0x0301)

    hmap = {}
    lines = open(SLOV, encoding="utf-8").read().split("\n")
    for l in lines[WL_FROM - 1:WL_TO]:
        s = l.lstrip()
        e = 0
        for ch in s:
            if is_hw(ch):
                e += 1
            else:
                break
        hw = destress(s[:e])
        if hw and hw not in hmap:
            hmap[hw] = destress(s[e:]).strip()
    return hmap
|
||||||
|
|
||||||
|
|
||||||
|
def classify(w, note):
    """Coarse part of speech of an out-of-dictionary word from its PDF note."""
    if note is None:
        return "amb"
    n = re.sub(r"\([^)]*\)", "", note).strip()  # drop domain/etymology parentheticals
    if "кр. ф" in n or "кр.ф" in n or "прич." in n or "прил." in n:
        return "adj"
    ends = re.findall(r"-([а-яё]+)", n)
    if any(e in ADJ_END for e in ends):
        return "adj"
    if "сов." in n or "несов." in n or "безл." in n:
        return "verb"
    if w.endswith("ся"):  # reflexive: no Russian noun ends in -ся
        return "verb"
    if any(e.endswith(VERB3) for e in ends) and not any(m in n for m in ("ед.", "тв.", "род.", "м.", "ж.", "с.")):
        return "verb"
    if n == "" and w.endswith(("ый", "ий", "ой", "ая", "ое", "ые", "ие", "яя", "ее")):
        return "adj"
    if "нескл" in n:
        return "noun" if any(g in n for g in ("м.", "ж.", "с.", "мн.")) else "amb"
    if ends:
        return "noun"
    if n == "" and w.endswith(("ать", "ять", "еть", "ить", "оть", "уть", "ыть", "ти", "чь")):
        return "verb"
    return "amb"


def singular(w, note):
    """Nominative singular of a noun headword from the PDF note (authoritative) or, for a
    plural headword without an explicit singular, the mawo lemma; pluralia tantum kept."""
    n = note or ""
    full = re.search(r"ед\.\s+([а-яё]+)", n)
    if full:
        return full.group(1)
    suf = re.search(r"ед\.\s+-([а-яё]+)", n)
    if suf:
        s = suf.group(1)
        i = w.rfind(s[0])
        return w[:i] + s if i > 0 else w
    ends = re.findall(r"-([а-яё]+)", re.sub(r"\([^)]*\)", "", n))
    if ends and ends[0].endswith(GENPL):
        for p in M.parse(w):
            if str(p.tag.POS) == "NOUN":
                return p.normal_form
        return w
    return w
|
||||||
|
|
||||||
|
|
||||||
|
def build():
    """Run the whole pipeline in memory. Returns the result sets plus a `fate` map giving
    every word's outcome, so a word's path can be traced or the buckets dumped."""
    oc = set(load(OC_CACHE)) or oc_noun_lemmas()
    if not os.path.exists(OC_CACHE):
        write(OC_CACHE, oc)
    hmap = build_notes()
    all_words = load(os.path.join(OUT_DIR, "all.txt"))
    ed_nouns = set(load("/tmp/ru_singulars.txt"))
    pairs = [tuple(p) for l in load("/tmp/ru_variants.txt") if len(p := l.split("\t")) == 2]
    pdf = [w for w in all_words if cyr_ok(w)]
    lm = libmorph_analyze(pdf)

    def to_singular(w):
        s = singular(w, hmap.get(w))
        return s if cyr_ok(s) else w

    fate = {}
    scrabble = set(oc)
    adj, verb, amb = [], [], []
    for w in pdf:
        oc_noun, oc_known = oc_status(w)
        if oc_noun:
            fate[w] = "scrabble: сущ. по OpenCorpora"
            continue
        lm_known, lm_lemma, _ = lm.get(w, (False, None, frozenset()))
        if lm_lemma is not None:
            s = lm_lemma if cyr_ok(lm_lemma) else to_singular(w)
            scrabble.add(s)
            fate[w] = "scrabble: сущ. по libmorph" + ("" if s == w else f" → {s}")
            continue
        if oc_known or lm_known:
            fate[w] = "отброшено: словарь знает как не-существительное"
            continue
        if w in ed_nouns:
            scrabble.add(w)
            fate[w] = "scrabble: ед.ч. по помете «ед.»"
            continue
        c = classify(w, hmap.get(w))
        if c == "noun":
            s = to_singular(w)
            scrabble.add(s)
            fate[w] = "scrabble: сущ. по помете орфословаря" + ("" if s == w else f" → {s}")
        elif c == "adj":
            adj.append(w)
            fate[w] = "отброшено: прилагательное (помета орфословаря)"
        elif c == "verb":
            verb.append(w)
            fate[w] = "отброшено: глагол (помета орфословаря)"
        else:
            amb.append(w)
            fate[w] = "undefined: неоднозначное (нет в словарях, помета не определяет)"

    # Manual confirmations: nouns the maintainer approved from the undefined tail.
    for w in load(os.path.join(OUT_DIR, "manual_confirm.txt")):
        if cyr_ok(w):
            scrabble.add(w)
            fate[w] = "scrabble: подтверждено вручную (manual_confirm.txt)"

    # Variant rescue: a word joined by "и" to a confirmed noun is itself a noun.
    pending = set(amb) - scrabble
    changed = True
    while changed:
        changed = False
        for a, b in pairs:
            for x, y in ((a, b), (b, a)):
                if x in scrabble and y in pending:
                    scrabble.add(y)
                    pending.discard(y)
                    fate[y] = f"scrabble: вариант от «{x}» (через «и»)"
                    changed = True

    undefined = [w for w in amb if w not in scrabble]
    return {
        "oc": oc, "scrabble": scrabble, "undefined": undefined,
        "adjectives": adj, "verbs": verb, "singulars": ed_nouns,
        "fate": fate, "all": set(all_words),
    }
|
||||||
|
|
||||||
|
|
||||||
|
def trace(word, r):
    w = destress(word)
    if w in r["fate"]:
        return r["fate"][w]
    if w in r["scrabble"]:
        return "scrabble: лексикон OpenCorpora" if w in r["oc"] else "scrabble: производная/лемма"
    if w not in r["all"]:
        return "нет в russian_all (не извлечено на Stage 1 — нет в .pdf, либо имя собств./дефис/форма)"
    if not cyr_ok(w):
        return "отсеяно: длина или символы вне диапазона (2–15 кириллица)"
    return "не определено"


def main():
    ap = argparse.ArgumentParser(description="Stage 2 brain: build the noun dictionary, trace a word, or dump buckets.")
    ap.add_argument("--dump", action="store_true", help="also write the in-memory buckets (undefined, adjectives, verbs, singulars, fate)")
    ap.add_argument("--trace", metavar="WORD", help="report how WORD did or did not reach the dictionary, then exit")
    args = ap.parse_args()

    r = build()
    if args.trace:
        print(f"{args.trace}: {trace(args.trace, r)}")
        return

    write(os.path.join(OUT_DIR, "scrabble.txt"), r["scrabble"])
    print(f"=> dictprep/russian/scrabble.txt {len(r['scrabble'])}")
    print(f" undefined kept in memory: {len(set(r['undefined']))} (use --dump to write it)")
    if args.dump:
        write(os.path.join(OUT_DIR, "undefined.txt"), r["undefined"])
        write(os.path.join(OUT_DIR, "adjectives.txt"), r["adjectives"])
        write(os.path.join(OUT_DIR, "verbs.txt"), r["verbs"])
        write(os.path.join(OUT_DIR, "singulars.txt"), r["singulars"])
        fate_path = os.path.join(OUT_DIR, "fate.tsv")
        os.makedirs(OUT_DIR, exist_ok=True)
        with open(fate_path, "w", encoding="utf-8") as f:
            for w in sorted(r["fate"], key=key):
                f.write(f"{w}\t{r['fate'][w]}\n")
        print(f" dumped: undefined.txt ({len(set(r['undefined']))}), adjectives.txt, verbs.txt, singulars.txt, fate.tsv")


if __name__ == "__main__":
    main()
|
||||||
+148900
File diff suppressed because it is too large
@@ -0,0 +1,135 @@
|
|||||||
|
артгруппа
бутень
вебинар
видеодневник
водозащита
генацвале
жакоб
оберфюрер
околоть
особина
полбазара
полбака
полбалкона
полбанана
полбарана
полбатальона
полбатона
полбиблиотеки
полблокнота
полбокала
полбуханки
полвагона
полвечера
полвзвода
полвинта
полгазеты
полгектара
полгостиницы
полграмма
полгруппы
полдачи
полдвора
полдекабря
полдеревни
полдетсада
полдивана
полдивизии
полдыни
полжурнала
ползавода
ползарплаты
полздания
полканикул
полканистры
полкартофелины
полкастрюли
полквартиры
полкилограмма
полкласса
полкниги
полколлекции
полкольца
полкоманды
полкоробки
полкочана
полкурса
полкуска
полмагазина
полмандарина
полмарта
полматча
полмиллиметра
полмузея
полноября
полпакета
полпарка
полпартии
полпинты
полпирога
полпирожка
полпируэта
полпоезда
полполена
полполка
полполки
полполосы
полпомидора
полпоросёнка
полпосёлка
полпредовский
полпроцента
полпузырька
полрайона
полромана
полроты
полрулона
полряда
полсада
полсажени
полсезона
полсентября
полсловаря
полсостава
полсрока
полстада
полстены
полстолетия
полстраницы
полстроки
полтаблетки
полтайма
полтакта
полтарелки
полтетради
полтома
полтона
полторта
полтысячелетия
полтюбика
полусанаторий
полфакультета
полфевраля
полфлакона
полфразы
полхаты
полцарства
полцентнера
полцистерны
полчайника
полчемодана
полшажка
полшажочка
полшара
полшкафа
полшколы
полщеки
принт
промо
рентгеноаппарат
сивец
соцнаём
срывка
флеш
флешмобер
шиноремонт
|
||||||
File diff suppressed because it is too large
File diff suppressed because it is too large
@@ -0,0 +1,434 @@
|
|||||||
|
// Command ruwords extracts a clean Cyrillic word list from the plain text of a Russian
|
||||||
|
// orthographic dictionary (the output of `pdftotext`).
|
||||||
|
//
|
||||||
|
// Stage 1 (this tool): from the column word-list section [from, to] it collects, per
|
||||||
|
// entry, the headword (the leading token). When the headword is plural and the entry
|
||||||
|
// gives its singular after "ед." — in full ("ящеры, …, ед. ящер") or as a replacement
|
||||||
|
// suffix ("…, ед. -вец") — only the singular is kept, since a plural that has a singular
|
||||||
|
// is never needed. It drops stress marks, lowercases, keeps ё, and discards proper nouns
|
||||||
|
// (capitalized), hyphenated words, acronyms and non-Cyrillic tokens. The result is
|
||||||
|
// de-duplicated and sorted in Russian alphabetical order (ё right after е), LF-separated.
|
||||||
|
//
|
||||||
|
// It also collects a variant headword joined by "и" when it carries its own grammatical
|
||||||
|
// note (e.g. "аблатив, -а и аблятив, -а"). Suffix-singular reconstruction is heuristic;
|
||||||
|
// Stage 2 (dictprep/ru_stage2.py) re-checks the words against real dictionaries.
|
||||||
|
//
|
||||||
|
// pdftotext dictprep/orfo_dict_2025.pdf /tmp/slov.txt
|
||||||
|
// go run ./dictprep/ruwords -in /tmp/slov.txt -from 452 -to 168808 \
|
||||||
|
// -out russian_all.txt -skip russian_skip.txt
|
||||||
|
package main
|
||||||
|
|
||||||
|
import (
|
||||||
|
"bufio"
|
||||||
|
"flag"
|
||||||
|
"fmt"
|
||||||
|
"log"
|
||||||
|
"os"
|
||||||
|
"path/filepath"
|
||||||
|
"sort"
|
||||||
|
"strings"
|
||||||
|
"unicode"
|
||||||
|
)
|
||||||
|
|
||||||
|
// ruAlphabet is the Russian alphabet in collation order (ё directly after е).
|
||||||
|
const ruAlphabet = "абвгдеёжзийклмнопрстуфхцчшщъыьэюя"
|
||||||
|
|
||||||
|
var ruRank = func() map[rune]int {
|
||||||
|
m := make(map[rune]int, len(ruAlphabet))
|
||||||
|
for i, r := range []rune(ruAlphabet) {
|
||||||
|
m[r] = i
|
||||||
|
}
|
||||||
|
return m
|
||||||
|
}()
|
||||||
|
|
||||||
|
func isCyrLetter(r rune) bool {
|
||||||
|
return (r >= 'а' && r <= 'я') || (r >= 'А' && r <= 'Я') || r == 'ё' || r == 'Ё'
|
||||||
|
}
|
||||||
|
|
||||||
|
func isUpperCyr(r rune) bool { return (r >= 'А' && r <= 'Я') || r == 'Ё' }
|
||||||
|
|
||||||
|
func isStress(r rune) bool { return r == 0x0300 || r == 0x0301 }
|
||||||
|
|
||||||
|
// cleanWord normalizes a run of letters/stress-marks into a lowercase Cyrillic word, or
|
||||||
|
// returns ok=false for proper nouns (capitalized), hyphenated or non-Cyrillic runs.
|
||||||
|
func cleanWord(run []rune) (string, bool) {
|
||||||
|
if len(run) == 0 || isUpperCyr(run[0]) {
|
||||||
|
return "", false
|
||||||
|
}
|
||||||
|
var b strings.Builder
|
||||||
|
for _, r := range run {
|
||||||
|
switch {
|
||||||
|
case isStress(r), r == '\u00ad': // drop stress accents and soft hyphens (U+00AD)
|
||||||
|
case r == '-': // a real hyphen means a hyphenated word: reject it
|
||||||
|
return "", false
|
||||||
|
default:
|
||||||
|
b.WriteRune(unicode.ToLower(r))
|
||||||
|
}
|
||||||
|
}
|
||||||
|
w := b.String()
|
||||||
|
if w == "" {
|
||||||
|
return "", false
|
||||||
|
}
|
||||||
|
for _, r := range w {
|
||||||
|
if !((r >= 'а' && r <= 'я') || r == 'ё') {
|
||||||
|
return "", false
|
||||||
|
}
|
||||||
|
}
|
||||||
|
return w, true
|
||||||
|
}
|
||||||
|
|
||||||
|
// headword returns the entry's headword: the leading run of letters, stress marks and
|
||||||
|
// hyphens, normalized.
|
||||||
|
func headword(line string) (string, bool) {
|
||||||
|
// Trim leading whitespace, including the form-feed (U+000C) that pdftotext puts at
|
||||||
|
// the top of each page — otherwise the first headword on every page is lost.
|
||||||
|
line = strings.TrimLeftFunc(line, unicode.IsSpace)
|
||||||
|
var run []rune
|
||||||
|
for _, r := range line {
|
||||||
|
if isCyrLetter(r) || isStress(r) || r == '-' || r == '\u00ad' {
|
||||||
|
run = append(run, r)
|
||||||
|
} else {
|
||||||
|
break
|
||||||
|
}
|
||||||
|
}
|
||||||
|
return cleanWord(run)
|
||||||
|
}
|
||||||
|
|
||||||
|
// embeddedSingulars returns the singular form of a plural headword spelled out after
|
||||||
|
// "ед.", either in full ("ед. ящер") or as a replacement suffix ("ед. -вец",
|
||||||
|
// reconstructed from headword). It skips gender marks ("ед. м") and abbreviations that
|
||||||
|
// merely contain "ед." ("ед. измер.", "ден. ед.").
|
||||||
|
func embeddedSingulars(line, headword string) []string {
|
||||||
|
var out []string
|
||||||
|
for i := 0; ; {
|
||||||
|
j := strings.Index(line[i:], "ед.")
|
||||||
|
if j < 0 {
|
||||||
|
break
|
||||||
|
}
|
||||||
|
i += j + len("ед.")
|
||||||
|
rest := strings.TrimLeft(line[i:], " \t")
|
||||||
|
|
||||||
|
if strings.HasPrefix(rest, "-") { // suffix form: reconstruct from the headword
|
||||||
|
var suf []rune
|
||||||
|
for _, r := range rest[len("-"):] {
|
||||||
|
if isCyrLetter(r) || isStress(r) {
|
||||||
|
suf = append(suf, r)
|
||||||
|
} else {
|
||||||
|
break
|
||||||
|
}
|
||||||
|
}
|
||||||
|
if s, ok := cleanWord(suf); ok && len([]rune(s)) >= 2 {
|
||||||
|
if recon := reconstructSingular(headword, s); recon != "" {
|
||||||
|
out = append(out, recon)
|
||||||
|
}
|
||||||
|
}
|
||||||
|
continue
|
||||||
|
}
|
||||||
|
|
||||||
|
var run []rune
|
||||||
|
consumed := 0
|
||||||
|
for _, r := range rest {
|
||||||
|
if isCyrLetter(r) || isStress(r) {
|
||||||
|
run = append(run, r)
|
||||||
|
consumed += len(string(r))
|
||||||
|
} else {
|
||||||
|
break
|
||||||
|
}
|
||||||
|
}
|
||||||
|
if len(run) == 0 {
|
||||||
|
continue
|
||||||
|
}
|
||||||
|
if strings.HasPrefix(rest[consumed:], ".") {
|
||||||
|
continue // an abbreviation like "ед. измер." rather than a singular form
|
||||||
|
}
|
||||||
|
w, ok := cleanWord(run)
|
||||||
|
if !ok || len([]rune(w)) < 2 { // 2+ letters excludes the gender marks м/ж/с
|
||||||
|
continue
|
||||||
|
}
|
||||||
|
out = append(out, w)
|
||||||
|
}
|
||||||
|
return out
|
||||||
|
}
|
||||||
|
|
||||||
|
// reconstructSingular builds the singular from a plural headword and the replacement
|
||||||
|
// suffix from "ед. -<suffix>", splicing where the suffix best overlaps the tail of the
|
||||||
|
// headword (the position k where hw[k:] shares its longest prefix with the suffix; the
|
||||||
|
// leftmost k wins a tie). It is a heuristic; Stage 2 re-checks the words against real dictionaries.
|
||||||
|
func reconstructSingular(headword, suffix string) string {
|
||||||
|
hw, sf := []rune(headword), []rune(suffix)
|
||||||
|
bestK, bestLen := -1, 0
|
||||||
|
for k := 0; k < len(hw); k++ {
|
||||||
|
m := 0
|
||||||
|
for k+m < len(hw) && m < len(sf) && hw[k+m] == sf[m] {
|
||||||
|
m++
|
||||||
|
}
|
||||||
|
if m > bestLen {
|
||||||
|
bestK, bestLen = k, m
|
||||||
|
}
|
||||||
|
}
|
||||||
|
if bestK < 0 {
|
||||||
|
return ""
|
||||||
|
}
|
||||||
|
return string(hw[:bestK]) + suffix
|
||||||
|
}
|
||||||
|
|
||||||
|
// headwordNotes are the grammatical notes that mark a parallel headword (a lemma) after
|
||||||
|
// "и", as opposed to an inflected form. A "-" ending also marks one; form labels such as
|
||||||
|
// деепр. (gerund) or сравн. (comparative) deliberately do not.
|
||||||
|
var headwordNotes = map[string]bool{
|
||||||
|
"нескл": true, "неизм": true, "предлог": true, "предл": true, "нареч": true,
|
||||||
|
"нар": true, "прил": true, "союз": true, "частица": true, "част": true,
|
||||||
|
"межд": true, "мн": true, "ед": true, "тв": true, "числ": true, "мест": true,
|
||||||
|
"м": true, "ж": true, "с": true, "вводн": true, "сказ": true,
|
||||||
|
}
|
||||||
|
|
||||||
|
// variantNoteOK reports whether the note following a candidate variant marks a headword:
|
||||||
|
// a "-" inflection ending or one of headwordNotes (and not a bare inflected word).
|
||||||
|
func variantNoteOK(note string) bool {
|
||||||
|
if strings.HasPrefix(note, "-") {
|
||||||
|
return true
|
||||||
|
}
|
||||||
|
var stem []rune
|
||||||
|
for _, r := range note {
|
||||||
|
if (r >= 'а' && r <= 'я') || r == 'ё' {
|
||||||
|
stem = append(stem, r)
|
||||||
|
} else {
|
||||||
|
break
|
||||||
|
}
|
||||||
|
}
|
||||||
|
return headwordNotes[string(stem)]
|
||||||
|
}
|
||||||
|
|
||||||
|
// variants returns the second (and further) headwords of an entry, written as a parallel
|
||||||
|
// form after " и ", e.g. "аблатив, -а и аблятив, -а" yields "аблятив" and "регги и реггей,
|
||||||
|
// нескл." yields "реггей". Requiring a headword note after the comma keeps this from
|
||||||
|
// matching "и" inside examples or picking up inflected forms.
|
||||||
|
func variants(line string) []string {
|
||||||
|
var out []string
|
||||||
|
const sep = " и "
|
||||||
|
for i := 0; ; {
|
||||||
|
j := strings.Index(line[i:], sep)
|
||||||
|
if j < 0 {
|
||||||
|
break
|
||||||
|
}
|
||||||
|
i += j + len(sep)
|
||||||
|
rest := line[i:]
|
||||||
|
var run []rune
|
||||||
|
consumed := 0
|
||||||
|
for _, r := range rest {
|
||||||
|
if isCyrLetter(r) || isStress(r) {
|
||||||
|
run = append(run, r)
|
||||||
|
consumed += len(string(r))
|
||||||
|
} else {
|
||||||
|
break
|
||||||
|
}
|
||||||
|
}
|
||||||
|
if len(run) == 0 {
|
||||||
|
continue
|
||||||
|
}
|
||||||
|
after := rest[consumed:]
|
||||||
|
if !strings.HasPrefix(after, ", ") || !variantNoteOK(after[len(", "):]) {
|
||||||
|
continue
|
||||||
|
}
|
||||||
|
if w, ok := cleanWord(run); ok && len([]rune(w)) >= 2 {
|
||||||
|
out = append(out, w)
|
||||||
|
}
|
||||||
|
}
|
||||||
|
return out
|
||||||
|
}
|
||||||
|
|
||||||
|
// normToken normalizes any token (a run of letters and stress marks) for the skip set:
|
||||||
|
// lowercase, stress removed, kept only if it is 2+ all-Cyrillic letters. Unlike
|
||||||
|
// cleanWord it does NOT reject capitalized tokens — a lowercased proper noun belongs in
|
||||||
|
// the skip set so it can be re-checked by a morphological analyzer.
|
||||||
|
func normToken(run []rune) (string, bool) {
|
||||||
|
var b strings.Builder
|
||||||
|
for _, r := range run {
|
||||||
|
if isStress(r) {
|
||||||
|
continue
|
||||||
|
}
|
||||||
|
b.WriteRune(unicode.ToLower(r))
|
||||||
|
}
|
||||||
|
w := b.String()
|
||||||
|
if len([]rune(w)) < 2 {
|
||||||
|
return "", false
|
||||||
|
}
|
||||||
|
for _, r := range w {
|
||||||
|
if !((r >= 'а' && r <= 'я') || r == 'ё') {
|
||||||
|
return "", false
|
||||||
|
}
|
||||||
|
}
|
||||||
|
return w, true
|
||||||
|
}
|
||||||
|
|
||||||
|
// tokens returns every maximal run of Cyrillic letters (plus stress marks) in the line,
|
||||||
|
// normalized; runs are split on every other character (so hyphens split a word).
|
||||||
|
func tokens(line string) []string {
|
||||||
|
var out []string
|
||||||
|
var run []rune
|
||||||
|
flush := func() {
|
||||||
|
if len(run) > 0 {
|
||||||
|
if w, ok := normToken(run); ok {
|
||||||
|
out = append(out, w)
|
||||||
|
}
|
||||||
|
run = run[:0]
|
||||||
|
}
|
||||||
|
}
|
||||||
|
for _, r := range line {
|
||||||
|
if isCyrLetter(r) || isStress(r) {
|
||||||
|
run = append(run, r)
|
||||||
|
} else {
|
||||||
|
flush()
|
||||||
|
}
|
||||||
|
}
|
||||||
|
flush()
|
||||||
|
return out
|
||||||
|
}
|
||||||
|
|
||||||
|
func lessRu(a, b string) bool {
|
||||||
|
ra, rb := []rune(a), []rune(b)
|
||||||
|
for i := 0; i < len(ra) && i < len(rb); i++ {
|
||||||
|
if ra[i] != rb[i] {
|
||||||
|
return ruRank[ra[i]] < ruRank[rb[i]]
|
||||||
|
}
|
||||||
|
}
|
||||||
|
return len(ra) < len(rb)
|
||||||
|
}
|
||||||
|
|
||||||
|
func sortedRu(set map[string]struct{}) []string {
|
||||||
|
words := make([]string, 0, len(set))
|
||||||
|
for w := range set {
|
||||||
|
words = append(words, w)
|
||||||
|
}
|
||||||
|
sort.Slice(words, func(i, j int) bool { return lessRu(words[i], words[j]) })
|
||||||
|
return words
|
||||||
|
}
|
||||||
|
|
||||||
|
func writeWords(path string, words []string) error {
|
||||||
|
if dir := filepath.Dir(path); dir != "" && dir != "." {
|
||||||
|
if err := os.MkdirAll(dir, 0o755); err != nil {
|
||||||
|
return err
|
||||||
|
}
|
||||||
|
}
|
||||||
|
o, err := os.Create(path)
|
||||||
|
if err != nil {
|
||||||
|
return err
|
||||||
|
}
|
||||||
|
w := bufio.NewWriter(o)
|
||||||
|
for _, word := range words {
|
||||||
|
w.WriteString(word)
|
||||||
|
w.WriteByte('\n')
|
||||||
|
}
|
||||||
|
if err := w.Flush(); err != nil {
|
||||||
|
o.Close()
|
||||||
|
return err
|
||||||
|
}
|
||||||
|
return o.Close()
|
||||||
|
}
|
||||||
|
|
||||||
|
func main() {
|
||||||
|
in := flag.String("in", "dictprep/russian/orfo_dict_2025.txt", "plain-text dictionary (pdftotext output)")
|
||||||
|
out := flag.String("out", "dictprep/russian/all.txt", "output: the base word list (clean headwords + reconstructed singulars + variants)")
|
||||||
|
skip := flag.String("skip", "/tmp/ru_skip.txt", "output: every other token, for a later morphology re-check")
|
||||||
|
sings := flag.String("singulars", "/tmp/ru_singulars.txt", "output: singulars reconstructed from \"ед.\" (known nouns)")
|
||||||
|
varsOut := flag.String("variants", "/tmp/ru_variants.txt", "output: variant pairs joined by \"и\" (primary<TAB>variant)")
|
||||||
|
from := flag.Int("from", 452, "first line of the word-list section (1-based, inclusive)")
|
||||||
|
to := flag.Int("to", 168808, "last line of the word-list section (inclusive)")
|
||||||
|
flag.Parse()
|
||||||
|
if *in == "" {
|
||||||
|
log.Fatal("ruwords: -in is required")
|
||||||
|
}
|
||||||
|
|
||||||
|
f, err := os.Open(*in)
|
||||||
|
if err != nil {
|
||||||
|
log.Fatal(err)
|
||||||
|
}
|
||||||
|
defer f.Close()
|
||||||
|
|
||||||
|
all := make(map[string]struct{})
|
||||||
|
allTokens := make(map[string]struct{})
|
||||||
|
singulars := make(map[string]struct{})
|
||||||
|
variantPairs := make(map[string]struct{})
|
||||||
|
entries, fromHead, fromSing, fromVar := 0, 0, 0, 0
|
||||||
|
sc := bufio.NewScanner(f)
|
||||||
|
sc.Buffer(make([]byte, 1<<20), 1<<20)
|
||||||
|
for line := 0; sc.Scan(); {
|
||||||
|
line++
|
||||||
|
if line < *from || line > *to {
|
||||||
|
continue
|
||||||
|
}
|
||||||
|
entries++
|
||||||
|
text := sc.Text()
|
||||||
|
hw, hwOK := headword(text)
|
||||||
|
var sings []string
|
||||||
|
if hwOK {
|
||||||
|
sings = embeddedSingulars(text, hw)
|
||||||
|
}
|
||||||
|
primary := ""
|
||||||
|
if len(sings) > 0 {
|
||||||
|
// the headword is plural and the entry gives its singular: keep only the singular
|
||||||
|
primary = sings[0]
|
||||||
|
for _, w := range sings {
|
||||||
|
if _, seen := all[w]; !seen {
|
||||||
|
fromSing++
|
||||||
|
all[w] = struct{}{}
|
||||||
|
}
|
||||||
|
singulars[w] = struct{}{}
|
||||||
|
}
|
||||||
|
} else if hwOK {
|
||||||
|
primary = hw
|
||||||
|
if _, seen := all[hw]; !seen {
|
||||||
|
fromHead++
|
||||||
|
}
|
||||||
|
all[hw] = struct{}{}
|
||||||
|
}
|
||||||
|
for _, w := range variants(text) {
|
||||||
|
if _, seen := all[w]; !seen {
|
||||||
|
fromVar++
|
||||||
|
all[w] = struct{}{}
|
||||||
|
}
|
||||||
|
if primary != "" && primary != w {
|
||||||
|
variantPairs[primary+"\t"+w] = struct{}{}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
for _, w := range tokens(text) {
|
||||||
|
allTokens[w] = struct{}{}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
if err := sc.Err(); err != nil {
|
||||||
|
log.Fatal(err)
|
||||||
|
}
|
||||||
|
|
||||||
|
skipSet := make(map[string]struct{})
|
||||||
|
for w := range allTokens {
|
||||||
|
if _, ok := all[w]; !ok {
|
||||||
|
skipSet[w] = struct{}{}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
allWords := sortedRu(all)
|
||||||
|
skipWords := sortedRu(skipSet)
|
||||||
|
if err := writeWords(*out, allWords); err != nil {
|
||||||
|
log.Fatal(err)
|
||||||
|
}
|
||||||
|
if err := writeWords(*skip, skipWords); err != nil {
|
||||||
|
log.Fatal(err)
|
||||||
|
}
|
||||||
|
if err := writeWords(*sings, sortedRu(singulars)); err != nil {
|
||||||
|
log.Fatal(err)
|
||||||
|
}
|
||||||
|
pairList := make([]string, 0, len(variantPairs))
|
||||||
|
for p := range variantPairs {
|
||||||
|
pairList = append(pairList, p)
|
||||||
|
}
|
||||||
|
sort.Strings(pairList)
|
||||||
|
if err := writeWords(*varsOut, pairList); err != nil {
|
||||||
|
log.Fatal(err)
|
||||||
|
}
|
||||||
|
|
||||||
|
fmt.Printf("scanned %d entries\n", entries)
|
||||||
|
fmt.Printf(" %-20s %7d words (%d headwords + %d embedded singulars + %d variants)\n", *out, len(allWords), fromHead, fromSing, fromVar)
|
||||||
|
fmt.Printf(" %-20s %7d words (tokens not in %s; for a morphology re-check)\n", *skip, len(skipWords), *out)
|
||||||
|
fmt.Printf(" %-20s %7d words (singulars from \"ед.\"; known nouns)\n", *sings, len(singulars))
|
||||||
|
fmt.Printf(" %-20s %7d pairs (variants joined by \"и\")\n", *varsOut, len(variantPairs))
|
||||||
|
}
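The suffix splice performed by `reconstructSingular` is easiest to follow on concrete inputs. The sketch below restates the same loop under a hypothetical name, `spliceSingular`, and runs it on two hand-picked plural/suffix pairs. These pairs are illustrative assumptions, not entries quoted from the dictionary text. The second pair shows how a tie between two equally long overlaps resolves to the leftmost position and yields a non-word, which is the kind of output Stage 2 is expected to filter.

```go
package main

import "fmt"

// spliceSingular is a hypothetical stand-alone restatement of the heuristic:
// find the position k in the plural headword where hw[k:] shares the longest
// prefix with the replacement suffix, then replace the tail from k onward.
func spliceSingular(headword, suffix string) string {
	hw, sf := []rune(headword), []rune(suffix)
	bestK, bestLen := -1, 0
	for k := range hw {
		m := 0
		for k+m < len(hw) && m < len(sf) && hw[k+m] == sf[m] {
			m++
		}
		if m > bestLen { // strict comparison: on a tie the leftmost k is kept
			bestK, bestLen = k, m
		}
	}
	if bestK < 0 {
		return "" // the suffix's first letter never occurs in the headword
	}
	return string(hw[:bestK]) + suffix
}

func main() {
	// Single overlap at the final "н": the tail "нцы" is replaced by "нец".
	fmt.Println(spliceSingular("азербайджанцы", "нец")) // азербайджанец

	// Two one-letter overlaps on "а"; the leftmost wins, giving a non-word.
	fmt.Println(spliceSingular("нанайцы", "аец")) // наец
}
```

Preferring the rightmost tie would handle the second input, but that is a different heuristic from the one in this file; the sketch only documents the behaviour as written.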
|
||||||