Tidy sources into sources/<variant>/ + tools/

Consolidate the scattered build inputs (dictionaries/english/, dictprep/russian/)
into one sources/ tree keyed by the variant labels (scrabble_en/scrabble_ru/
erudit_ru), and move the Russian prep pipeline to tools/. The dawg outputs and
their filenames are unchanged — rebuilt byte-identical (en_sowpods/ru_scrabble/
ru_erudit) — so the release artifact and the backend are unaffected.

ru_stage2.py OUT_DIR and the ruwords flag defaults are repointed to
sources/scrabble_ru/; Makefile, CI, the cmd/builddict default and README are
updated; pipeline intermediates are git-ignored. Verified: make dawg output is
byte-identical to the committed baseline, and py_compile and go vet pass on the
moved tools. The full Russian regeneration pipeline (pymorphy3/libmorph/orfo
PDF) was not run here.
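
The byte-identical check mentioned above could be reproduced along these lines. This is only a sketch, not the procedure actually used for this commit: the two directory arguments are hypothetical (a checkout of the committed baseline and a fresh `make dawg` output), and the `*.dawg` glob pattern is an assumption about how the outputs are named.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            h.update(chunk)
    return h.hexdigest()


def compare_dirs(baseline_dir, rebuilt_dir, pattern="*.dawg"):
    """Map each baseline file name to True if the rebuilt copy is byte-identical.

    A file that is missing from rebuilt_dir is reported as False.
    The "*.dawg" pattern is an assumption, not taken from the repository.
    """
    results = {}
    for base in sorted(Path(baseline_dir).glob(pattern)):
        rebuilt = Path(rebuilt_dir) / base.name
        results[base.name] = rebuilt.is_file() and sha256_of(base) == sha256_of(rebuilt)
    return results
```

Calling `compare_dirs("baseline", "rebuilt")` with two such directories returns a dict of file name to match status; any `False` entry would mean the restructuring changed an output.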
Ilia Denisov
2026-06-09 12:25:33 +02:00
parent 38ad6d3a19
commit dd61ff1d51
17 changed files with 76 additions and 41 deletions
@@ -0,0 +1,27 @@
#!/usr/bin/env python3
"""Fold Ё/ё → Е/е in a word list and de-duplicate — the dictionary prep for "Эрудит".
The Эрудит ruleset has no Ё tile and treats Е/Ё as one letter, so its dictionary must be
folded before the DAWG is built. Folding merges pairs like ёж/еж, hence the de-dup. Output
is sorted (Russian order over the 32 folded letters) and LF-separated.
Run: python3 tools/fold_yo.py sources/scrabble_ru/scrabble.txt > /tmp/ru_erudit_words.txt
"""
import sys

ORDER = {c: i for i, c in enumerate("абвгдежзийклмнопрстуфхцчшщъыьэюя")}  # 32 letters, no ё


def key(w):
    return [ORDER.get(c, 99) for c in w]


def main():
    src = sys.argv[1] if len(sys.argv) > 1 else "/dev/stdin"
    words = {line.strip().replace("ё", "е").replace("Ё", "Е") for line in open(src, encoding="utf-8")}
    words.discard("")
    sys.stdout.write("\n".join(sorted(words, key=key)) + "\n")


if __name__ == "__main__":
    main()
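
To illustrate what the fold does, here is a small in-memory sketch of the same steps (fold Ё/ё, drop blanks and duplicates, sort in Russian order). It is not part of fold_yo.py: the `fold_words` helper and the sample list are made up for illustration, and only the `ORDER` table and the replace/sort logic mirror the script.

```python
# Sketch of the fold-and-sort step on an in-memory sample (hypothetical helper).
ORDER = {c: i for i, c in enumerate("абвгдежзийклмнопрстуфхцчшщъыьэюя")}  # 32 letters, no ё


def fold_words(lines):
    """Fold ё/Ё into е/Е, drop blank entries and duplicates, sort in Russian order."""
    words = {ln.strip().replace("ё", "е").replace("Ё", "Е") for ln in lines}
    words.discard("")
    # Characters outside ORDER (for example uppercase letters) sort last via the 99 fallback.
    return sorted(words, key=lambda w: [ORDER.get(c, 99) for c in w])


sample = ["ёж\n", "еж\n", "ёлка\n", "дом\n", "\n", "яма\n"]
result = fold_words(sample)
print(result)  # ['дом', 'еж', 'елка', 'яма']
```

The pair ёж/еж collapses into a single еж entry, which is why the de-duplication step is needed after folding.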