Commit Graph

3 Commits

Author SHA1 Message Date
Ilia Denisov c8c2eb927d dictprep: add source PDF orfo_dict_2025.pdf (kept for provenance/history) 2026-06-01 23:37:55 +02:00
Ilia Denisov 0d9b998db3 dawg: build committed dictionary DAWGs (en SOWPODS, ru Scrabble, ru Эрудит)
- cmd/builddict: add -alphabet latin|russian (russian = alphabet.Embedded(LangRu)).
- dictprep/fold_yo.py: fold Ё→Е and de-duplicate; this is the Эрудит dictionary prep.
- Makefile: `make dawg` rebuilds dawg/{en_sowpods,ru_scrabble,ru_erudit}.dawg.
- dawg/: committed DAWGs verified by enumeration — 267752 / 83385 / 83343 words.
2026-06-01 23:35:37 +02:00
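
Note on the Ё→Е fold mentioned in the commit above: the following is a minimal sketch of what such a step could look like, not the actual contents of dictprep/fold_yo.py. The function name `fold_yo`, the NFC normalization, and the choice to keep first-seen order are all assumptions made for illustration. The word counts reported above differ by 42 (83385 − 83343), which would be consistent with folding collapsing that many Ё/Е pairs, though the commit does not state this.

```python
import unicodedata

# Hypothetical helper: maps both cases of Ё to Е. The real script may differ.
_YO_TABLE = str.maketrans({"Ё": "Е", "ё": "е"})


def fold_yo(words):
    """Fold Ё→Е in each word and drop duplicates, keeping first-seen order."""
    seen = set()
    result = []
    for word in words:
        # NFC first, so a decomposed "е" + combining diaeresis becomes "ё"
        # and is caught by the translation table (an assumption about input).
        folded = unicodedata.normalize("NFC", word.strip()).translate(_YO_TABLE)
        if folded and folded not in seen:
            seen.add(folded)
            result.append(folded)
    return result


if __name__ == "__main__":
    sample = ["ёлка", "елка", "мёд", "ёж", "еж", "мёд"]
    for word in fold_yo(sample):
        print(word)
```

Blank lines are skipped, and words that become identical after folding (such as "ёж" and "еж") are emitted once.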
Ilia Denisov 540ee32178 dictprep: Russian orthographic dictionary → Scrabble noun pipeline
Build a committed Russian common-noun word list (dictprep/russian/scrabble.txt)
from the RAN orthographic dictionary, for the Эрудит ruleset.

- Stage 1 (Go, dictprep/ruwords): orfo_dict_2025.txt -> all.txt; extracts
  headwords, reconstructs "ед." singulars (suppressing plurals), pairs "и" variants.
- Stage 2 (Python brain, dictprep/ru_stage2.py): OpenCorpora (mawo-pymorphy3) +
  libmorph + orthographic notes select common nouns (nom. sing.); --trace explains
  a word's fate, --dump writes the in-memory buckets.
- libmorph C++ bridge (libmorph_check.cpp); manual_confirm.txt is merged in.
- orfo_dict_2025.txt is the committed pdftotext source of truth.
- See dictprep/README.md for methodology and reproducibility.
2026-06-01 23:27:17 +02:00