dictprep: Russian orthographic dictionary → Scrabble noun pipeline

Build a committed Russian common-noun word list (dictprep/russian/scrabble.txt)
from the RAN orthographic dictionary, for the Эрудит ruleset.

- Stage 1 (Go, dictprep/ruwords): orfo_dict_2025.txt -> all.txt; extracts
  headwords, reconstructs "ед." singulars (suppressing plurals), pairs "и" variants.
- Stage 2 (Python, dictprep/ru_stage2.py): combines OpenCorpora (mawo-pymorphy3),
  libmorph, and orthographic notes to select common nouns (nom. sing.); --trace
  explains a word's fate, and --dump writes the in-memory buckets to files.
- libmorph C++ bridge (libmorph_check.cpp); manual_confirm.txt is merged in.
- orfo_dict_2025.txt is the committed pdftotext source of truth.
- See dictprep/README.md for methodology and reproducibility.
Ilia Denisov
2026-06-01 23:27:17 +02:00
parent 15c7959d96
commit 540ee32178
9 changed files with 402226 additions and 1 deletion
+10 -1
@@ -6,4 +6,13 @@
# Local scratch
/tmp/
*.pdf
# Compiled libmorph bridge (build artifact; see dictprep/README.md)
/dictprep/libmorph_check
# Stage 2 --dump debug buckets (committed: all, scrabble, manual_confirm, orfo_dict_2025)
/dictprep/russian/undefined.txt
/dictprep/russian/adjectives.txt
/dictprep/russian/verbs.txt
/dictprep/russian/singulars.txt
/dictprep/russian/fate.tsv
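
The effect of these ignore rules can be sanity-checked with a small sketch. It is a simplified matcher that handles only the three pattern shapes used in this hunk (root-anchored directory, root-anchored path, bare glob) and is not a full implementation of gitignore semantics. The function name `is_ignored` is hypothetical, and the location of libmorph_check.cpp directly under dictprep/ is an assumption made for the example.

```python
from fnmatch import fnmatch

# Patterns copied from the hunk above.
IGNORE_PATTERNS = [
    "/tmp/",
    "*.pdf",
    "/dictprep/libmorph_check",
    "/dictprep/russian/undefined.txt",
    "/dictprep/russian/adjectives.txt",
    "/dictprep/russian/verbs.txt",
    "/dictprep/russian/singulars.txt",
    "/dictprep/russian/fate.tsv",
]


def is_ignored(path: str) -> bool:
    """Hypothetical helper: approximate matching for the pattern shapes above only."""
    rooted = "/" + path.lstrip("/")
    for pattern in IGNORE_PATTERNS:
        if pattern.endswith("/"):
            # Root-anchored directory: everything beneath it is ignored.
            if rooted.startswith(pattern):
                return True
        elif pattern.startswith("/"):
            # Root-anchored path: treated here as an exact match.
            if rooted == pattern:
                return True
        else:
            # Bare glob: matched against the file name at any depth.
            if fnmatch(rooted.rsplit("/", 1)[-1], pattern):
                return True
    return False


if __name__ == "__main__":
    for candidate in (
        "dictprep/russian/scrabble.txt",   # committed word list
        "dictprep/russian/fate.tsv",       # --dump debug bucket
        "dictprep/libmorph_check",         # compiled bridge
        "dictprep/libmorph_check.cpp",     # bridge source (assumed location)
    ):
        print(candidate, "ignored" if is_ignored(candidate) else "tracked")
```

For an authoritative answer on a real checkout, `git check-ignore -v <path>` reports which rule, if any, matches a path.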