Publish as versioned Gitea module; move dictionary pipeline out

- Rename module to gitea.iliadenisov.ru/developer/scrabble-solver so it can be
  consumed as a versioned dependency (no go.work replace / CI clone).
- De-internalize wordlist and dictdawg as public packages.
- Remove cmd/builddict, dictprep/, the dictionaries submodule and the dawg
  Makefile: the word-list parsing and DAWG build now live in the separate
  scrabble-dictionary repository, which publishes the DAWG set as a release artifact.
- internal/dict loads the committed dawg/en_sowpods.dawg fixture for cmd/stress.
- Update README/CLAUDE docs accordingly.
Ilia Denisov
2026-06-04 19:11:46 +02:00
parent 63a7c663bf
commit 256999b42c
41 changed files with 93 additions and 402477 deletions
-3
@@ -1,3 +0,0 @@
[submodule "dictionaries"]
path = dictionaries
url = https://github.com/kamilmielnik/scrabble-dictionaries
+11 -12
@@ -17,23 +17,21 @@ Russian **Эрудит** (`rules` package); Эрудит has no Ё tile and fold
- `board/`, `rack/`, `rules/` — board grid (+ transpose), rack as per-letter counts,
and rulesets (geometry, premium layout, tile values/counts, alphabet, bonus):
`rules.English()`, `rules.RussianScrabble()`, `rules.Erudit()`.
- `internal/` `dictdawg` (build/load/serialise DAWGs over dafsa), `wordlist`
(encode/filter/sort/dedupe + `FoldYo`), `graph`, `dict`.
- `cmd/builddict` — word list → serialised DAWG (`-alphabet latin|russian`).
- `dictdawg/`, `wordlist/` — **public** helpers: `dictdawg` (build/load/serialise DAWGs
over dafsa), `wordlist` (encode/filter/sort/dedupe + `FoldYo`). Imported by the separate
`scrabble-dictionary` repo that builds and publishes the DAWG set.
- `internal/` — `encoding`, `graph`, `dict` (loads the committed `dawg/en_sowpods.dawg`
for `cmd/stress`).
- `cmd/stress`, `selfplay/` — the self-play stress harness behind `RESULTS.md`.
- `dawg/` — **committed** dictionaries: `en_sowpods.dawg`, `ru_scrabble.dawg`,
`ru_erudit.dawg` (Ё→Е folded). Rebuild with `make dawg`.
- `dictionaries/` — `kamilmielnik/scrabble-dictionaries` git submodule (English source).
- `dictprep/` — self-contained tooling that turns the Russian academic orthographic
dictionary into a common-noun word list. See `dictprep/README.md`. Committed output is
`dictprep/russian/{all,scrabble}.txt` (+ `orfo_dict_2025.{pdf,txt}`, `manual_confirm.txt`).
Running Stage 2 needs a Python venv with `mawo-pymorphy3` and the `libmorph` apt packages
(see `dictprep/README.md`).
`ru_erudit.dawg` (Ё→Е folded). The word-list sources and build pipeline live in the
separate [`scrabble-dictionary`](https://gitea.iliadenisov.ru/developer/scrabble-dictionary)
repo (which publishes the DAWG set as a release artifact); these committed copies are
test fixtures.
## Build & test
go test ./... # all packages green; also run go vet ./... and gofmt
make dawg # rebuild dawg/*.dawg from the word lists
Scoring and move generation are validated against **real tournament games** in GCG format
(`scrabble/gcg_test.go` + `scrabble/testdata/*.gcg`, including the 700+ club): for every
@@ -46,4 +44,5 @@ produces the played move with that score — canonical play, not invented cases.
and output bytes only — never inside the graph). The public API is byte-indexed.
- DAWG is the production generator; the GADDAG was removed after measurement.
- Detailed docs: `ALGORITHM.md` (the algorithm — single source of truth), `PLAN.md`
(design and decisions), `RESULTS.md` (DAWG-vs-GADDAG), `dictprep/README.md` (RU pipeline).
(design and decisions), `RESULTS.md` (DAWG-vs-GADDAG). The RU word-list pipeline and the
DAWG build now live in the `scrabble-dictionary` repo.
-28
@@ -1,28 +0,0 @@
# Scrabble-solver build helpers.
#
# `make dawg` (re)builds the committed dictionary DAWGs under dawg/ from their word lists:
# en_sowpods.dawg — English SOWPODS (Latin alphabet)
# ru_scrabble.dawg — Russian Scrabble nouns (Cyrillic, 33 letters)
# ru_erudit.dawg — Эрудит (the same list with Ё→Е folded and de-duped)
GO ?= go
PYTHON ?= python3
DAWG_DIR := dawg
BUILDDICT := $(GO) run ./cmd/builddict
.PHONY: dawg dawg-en dawg-ru dawg-erudit clean-dawg
dawg: dawg-en dawg-ru dawg-erudit

dawg-en:
	$(BUILDDICT) -dict dictionaries/english/sowpods.txt -alphabet latin -name en_sowpods -out $(DAWG_DIR)

dawg-ru:
	$(BUILDDICT) -dict dictprep/russian/scrabble.txt -alphabet russian -name ru_scrabble -out $(DAWG_DIR)

dawg-erudit:
	$(PYTHON) dictprep/fold_yo.py dictprep/russian/scrabble.txt > /tmp/ru_erudit_words.txt
	$(BUILDDICT) -dict /tmp/ru_erudit_words.txt -alphabet russian -name ru_erudit -out $(DAWG_DIR)

clean-dawg:
	rm -f $(DAWG_DIR)/*.dawg
+7 -10
@@ -24,27 +24,24 @@ See [`ALGORITHM.md`](ALGORITHM.md) for the algorithm (the single source of truth
```
scrabble/ public API: Solver, Move/Play types, DAWG generator, scoring, validation
board/ rack/ rules/ board grid (+transpose), rack, rulesets (English/Russian/Эрудит)
internal/ encoding (byte conventions), wordlist, dictdawg, dict, graph
cmd/builddict/ word list -> serialized DAWG in testdata
wordlist/ dictdawg/ public word-list parsing and DAWG build/load helpers
internal/ encoding (byte conventions), dict (committed-DAWG loader), graph
cmd/stress/ greedy self-play benchmark of the generator
selfplay/ bag + greedy player + game loop
```
## Setup
```sh
git submodule update --init # the dictionaries submodule (SOWPODS, TWL06, …)
go run ./cmd/builddict # build testdata/sowpods.dawg (≈0.2 s, ~730 KB)
```
`go.mod` carries `replace github.com/iliadenisov/dafsa => ../dafsa`: the solver needs
dafsa's low-level traversal `Cursor` (see the patch notes in `../dafsa/SCRABBLE_API.md`).
The committed dictionary DAWGs under `dawg/` (`en_sowpods`, `ru_scrabble`, `ru_erudit`)
are used directly — no build step. The word-list parsing and DAWG build pipeline lives in
the separate [`scrabble-dictionary`](https://gitea.iliadenisov.ru/developer/scrabble-dictionary)
repository, which publishes the DAWG set as a release artifact.
## Usage
```go
rs := rules.English()
finder, _ := dict.EnglishDAWG() // loads testdata/sowpods.dawg
finder, _ := dict.EnglishDAWG() // loads dawg/en_sowpods.dawg
s := scrabble.NewSolver(rs, finder)
b := board.New(rs.Rows, rs.Cols) // empty board (first move)
+1 -1
@@ -9,7 +9,7 @@ import (
"github.com/iliadenisov/alphabet"
"scrabble-solver/internal/encoding"
"gitea.iliadenisov.ru/developer/scrabble-solver/internal/encoding"
)
// Board is a row-major grid of encoded cells.
+2 -2
@@ -5,8 +5,8 @@ import (
"github.com/iliadenisov/alphabet"
"scrabble-solver/board"
"scrabble-solver/internal/encoding"
"gitea.iliadenisov.ru/developer/scrabble-solver/board"
"gitea.iliadenisov.ru/developer/scrabble-solver/internal/encoding"
)
func TestParseAndAccess(t *testing.T) {
-75
@@ -1,75 +0,0 @@
// Command builddict converts a word list into a serialized DAWG. By default it reads the
// English SOWPODS list (Latin alphabet); pass -alphabet russian for the Cyrillic lists.
package main

import (
	"flag"
	"fmt"
	"log"
	"os"
	"path/filepath"
	"time"

	"github.com/iliadenisov/alphabet"

	"scrabble-solver/internal/dictdawg"
	"scrabble-solver/internal/wordlist"
)

func main() {
	dict := flag.String("dict", "dictionaries/english/sowpods.txt", "word list file (one word per line)")
	out := flag.String("out", "testdata", "output directory")
	name := flag.String("name", "sowpods", "base name for the output file")
	minLen := flag.Int("min", 2, "minimum word length")
	maxLen := flag.Int("max", 15, "maximum word length")
	alpha := flag.String("alphabet", "latin", "alphabet: latin (English) or russian")
	flag.Parse()

	var idx alphabet.Indexer
	switch *alpha {
	case "latin":
		idx = alphabet.Latin()
	case "russian":
		idx = alphabet.Embedded(alphabet.Langs.LangRu)
	default:
		log.Fatalf("unknown -alphabet %q (want latin or russian)", *alpha)
	}

	t0 := time.Now()
	words, err := wordlist.Read(*dict, idx, *minLen, *maxLen)
	if err != nil {
		log.Fatalf("read %s: %v", *dict, err)
	}
	fmt.Printf("loaded %d words from %s in %s\n", len(words), *dict, time.Since(t0).Round(time.Millisecond))

	if err := os.MkdirAll(*out, 0o755); err != nil {
		log.Fatal(err)
	}
	t := time.Now()
	f, err := dictdawg.Build(idx, words)
	if err != nil {
		log.Fatalf("build dawg: %v", err)
	}
	path := filepath.Join(*out, *name+".dawg")
	if err := dictdawg.Save(f, path); err != nil {
		log.Fatalf("save: %v", err)
	}
	size := int64(0)
	if fi, err := os.Stat(path); err == nil {
		size = fi.Size()
	}
	fmt.Printf("DAWG %d nodes, %s, built+saved in %s -> %s\n",
		f.NumNodes(), humanBytes(size), time.Since(t).Round(time.Millisecond), path)
}

func humanBytes(n int64) string {
	switch {
	case n >= 1<<20:
		return fmt.Sprintf("%.2f MB", float64(n)/(1<<20))
	case n >= 1<<10:
		return fmt.Sprintf("%.1f KB", float64(n)/(1<<10))
	default:
		return fmt.Sprintf("%d B", n)
	}
}
+5 -5
@@ -12,10 +12,10 @@ import (
"strings"
"time"
"scrabble-solver/internal/dict"
"scrabble-solver/rules"
"scrabble-solver/scrabble"
"scrabble-solver/selfplay"
"gitea.iliadenisov.ru/developer/scrabble-solver/internal/dict"
"gitea.iliadenisov.ru/developer/scrabble-solver/rules"
"gitea.iliadenisov.ru/developer/scrabble-solver/scrabble"
"gitea.iliadenisov.ru/developer/scrabble-solver/selfplay"
)
func main() {
@@ -24,7 +24,7 @@ func main() {
rs := rules.English()
if !dict.EnglishAvailable() {
log.Fatal("English dictionary not available; run `go run ./cmd/builddict` first")
log.Fatal("English dictionary not available: dawg/en_sowpods.dawg missing")
}
f, err := dict.EnglishDAWG()
if err != nil {
@@ -6,8 +6,8 @@ import (
"github.com/iliadenisov/alphabet"
"scrabble-solver/internal/dictdawg"
"scrabble-solver/internal/wordlist"
"gitea.iliadenisov.ru/developer/scrabble-solver/dictdawg"
"gitea.iliadenisov.ru/developer/scrabble-solver/wordlist"
)
func TestBuildAndQuery(t *testing.T) {
Submodule dictionaries deleted from 92f81b2861
-164
@@ -1,164 +0,0 @@
# Russian word-list preparation (`dictprep`)
Builds the Russian **noun** word list for the Scrabble/Эрудит solver out of the official
Russian academic **orthographic dictionary**, cross-checked against two independent
morphological dictionaries.
The goal of the pipeline is a list of **common nouns in the nominative singular**
(`dictprep/russian/scrabble.txt`), plus an ambiguous tail for manual review.
> This directory is self-contained tooling for *building* the word list. It is not part
> of the solver library. The committed result lives in `dictprep/russian/`.
## Source
`orfo_dict_2025.pdf` — *Русский орфографический словарь РАН* (≈ 200 000 entries), the
authority for **spelling**. It encodes declension type in its grammatical notes but does
**not** reliably mark part of speech.
- Source: <https://ruslang.ru/sites/default/files/doc/normativnyje_slovari/orfograficheskij_slovar.pdf>
- Mirror: <https://rus-gos.spbu.ru/index.php/dictionary>
The PDF is git-ignored (large, third-party); place it here as `orfo_dict_2025.pdf`. Its
pdftotext output is committed as `russian/orfo_dict_2025.txt`, so the word list rebuilds
from the text alone — the binary PDF is needed only to regenerate that text.
## Outputs (`dictprep/russian/`)
The committed result is **three** files; every other bucket stays in the Stage-2
process's memory (dump it with `--dump`, query it with `--trace WORD`).
| File | Committed | Meaning |
|------|:--:|---------|
| `orfo_dict_2025.txt` | ✓ | the pdftotext output — the parsed source of truth (the PDF binary is not needed to rebuild). |
| `all.txt` | ✓ | Stage 1 base: every clean Cyrillic headword/variant; a plural headword with a singular is replaced by that singular. |
| `manual_confirm.txt` | ✓ | hand-reviewed nouns from the undefined tail; the brain merges them into the result. |
| `scrabble.txt` | ✓ | **Stage 2 result**: common nouns, nominative singular (+ pluralia tantum), length 2–15 — the working dictionary. |
| `undefined.txt` | — | the ambiguous tail; kept in memory, written only with `--dump`. |
`--dump` also writes `adjectives.txt`, `verbs.txt`, `singulars.txt` and `fate.tsv` (every
word with the reason it did or did not reach the dictionary); these are git-ignored debug
artifacts. Stage 1 also writes `/tmp/ru_{skip,singulars,variants}.txt`, intermediate inputs
the brain consumes.
## Prerequisites
```sh
# 1. pdftotext (Poppler)
sudo apt-get install -y poppler-utils
# 2. Go toolchain (Stage 1) — already required by the parent module
# 3. Python + the OpenCorpora analyser (Stage 2)
sudo apt-get install -y python3-venv python3-pip
python3 -m venv ru-venv
ru-venv/bin/pip install mawo-pymorphy3 # bundles OpenCorpora 2025 (words.dawg)
# 4. libmorph — the independent morphological dictionary (Stage 2 cross-check)
sudo apt-get install -y morphrus morphrus-dev moonycode-dev morphapi-dev
g++ -std=c++17 -O2 dictprep/libmorph_check.cpp -lmorphrus -lmoonycode -o dictprep/libmorph_check
```
If `dictprep/libmorph_check` is absent, Stage 2 still runs — it simply drops libmorph from
the stack and reports `libmorph_helper=MISSING`.
## How to run
```sh
# Stage 0 — PDF -> plain text (committed as the source of truth; run once)
pdftotext dictprep/orfo_dict_2025.pdf dictprep/russian/orfo_dict_2025.txt
# Stage 1 — build the base word list (Go): dictprep/russian/all.txt + /tmp/ru_*.txt
go run ./dictprep/ruwords
# Stage 2 — the brain (Python + mawo + libmorph): writes scrabble.txt
ru-venv/bin/python dictprep/ru_stage2.py
# ask how a word did or did not reach the dictionary
ru-venv/bin/python dictprep/ru_stage2.py --trace травмпункт
# also write the in-memory buckets (undefined, adjectives, verbs, singulars, fate.tsv)
ru-venv/bin/python dictprep/ru_stage2.py --dump
```
`-from`/`-to` (defaulting to 452/168808) bound the column word-list section of
`russian/orfo_dict_2025.txt` (line 452 = the first entry `а1, …`; line 168808 = the last,
`я́щурный`). The preface above line 452 is prose and is skipped. Verify these bounds if the
PDF is re-exported.
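
As a rough illustration of how such bounds behave, the sketch below slices a list of lines with 1-based, inclusive limits, mirroring the `lines[WL_FROM - 1:WL_TO]` expression in `ru_stage2.py`. The function name `word_list_section` and the sample data are made up for this example and are not part of the tooling.

```python
def word_list_section(lines, first=452, last=168808):
    """Return lines first..last, where both bounds are 1-based and inclusive."""
    # Python slices are 0-based and end-exclusive, hence the -1 on the start only.
    return lines[first - 1:last]


# Tiny stand-in for the real text: three lines of preface, then entries.
sample = ["preface 1", "preface 2", "preface 3", "entry a", "entry b", "entry c"]
section = word_list_section(sample, first=4, last=5)
```

If a re-exported PDF shifts the preface by a few lines, only the two numbers need to change.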
## Algorithm
### Stage 1 — `ruwords` (Go)
Per dictionary line in `[from, to]` it collects, normalised (stress marks U+0300/U+0301
stripped, lowercased, `ё` kept, hyphenated/capitalised/non-Cyrillic rejected):
- the **headword** (leading token). Leading whitespace including the form-feed `\f`
pdftotext puts at every page top is trimmed — otherwise the first headword of each page
is lost;
- the **singular of a plural headword** when the entry gives it after `ед.`, in full
(`ящеры, …, ед. ящер`) or as a replacement suffix (`…, ед. -вец`, spliced where the
suffix best overlaps the headword); the plural is then dropped (a plural that has a
singular is never needed) and the singular is also recorded (`/tmp/ru_singulars.txt`);
- **variant headwords** after `и` that carry their own grammatical note
(`аблатив, -а и аблятив, -а`; `регги и реггей, нескл.`), excluding inflected forms.
Everything else (every maximal Cyrillic token not selected above) goes to
`/tmp/ru_skip.txt`, a safety net for a later morphology re-check.
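
The normalisation rule above can be sketched in a few lines of Python. This is only an illustration under the stated rules (the real Stage 1 tool is the Go program `ruwords`); the helper name `normalise` is hypothetical, and rejecting any token that contains an uppercase letter is an assumption about how "capitalised" is applied.

```python
import re

STRESS_MARKS = {0x0300, 0x0301}  # combining grave and acute accents
CYRILLIC_WORD = re.compile(r"[а-яё]+")  # lowercase Russian letters only, ё kept


def normalise(token):
    """Return the cleaned headword, or None when the token is rejected."""
    bare = "".join(ch for ch in token if ord(ch) not in STRESS_MARKS)
    if not bare or bare != bare.lower():
        return None  # empty, or contains a capital (proper noun / acronym)
    # A hyphen or any non-Cyrillic character makes fullmatch fail.
    return bare if CYRILLIC_WORD.fullmatch(bare) else None


cleaned = [normalise(t) for t in ("я\u0301щер", "ёж", "Москва", "кое-кто", "abc")]
```

Here the stressed headword loses its accent, `ёж` keeps its `ё`, and the capitalised, hyphenated and Latin tokens are all rejected.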
### Stage 2 — `ru_stage2.py` (Python)
Each Stage-1 word (length 2–15) is routed by three sources, most authoritative first:
1. **OpenCorpora** (`words.dawg`, read directly — *not* the predictor): a common-noun
reading ⇒ keep the OpenCorpora lemma. The full OpenCorpora common-noun lexicon is also
added (so nouns absent from the PDF are included).
2. **libmorph** (independent dictionary, via `libmorph_check`): a common-noun reading ⇒
keep the libmorph lemma. The two dictionaries are treated as **complementary** — a noun
reading in *either* is enough (their disagreements were reviewed and resolved this way,
since each is incomplete in different places). A singular reconstructed from "ед." that
neither dictionary knows is accepted as a noun (the orthographic note attests it).
3. A word **both dictionaries miss** is classified by the orthographic **note**
(`-ая, -ое` ⇒ adjective; `-ть`, `сов./несов.` ⇒ verb; single genitive `-а/-и` or
`нескл., м./ж./с.` ⇒ noun). A note-noun goes straight to `scrabble.txt`; an adjective or
verb is dropped; anything undecided goes to `undefined.txt`.
4. **Variant rescue**: when the dictionary joins two spellings with "и" (`травмопункт и
травмпункт`, `регги и реггей`) and one is already a confirmed noun, the other is moved
from review/undefined into the result as well, propagated transitively through chains.
The plural-form variants the dictionaries already resolve never reach this step.
The nominative singular always comes from the dictionary that recognised the word, or from
the orthographic `ед.` note — never from a predictor guess (libmorph and the predictor
mis-lemmatise out-of-dictionary words, e.g. `витебчане → витебчан` instead of `витебчанин`).
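
The variant rescue in step 4 is a small fixed-point loop. The sketch below shows the idea in isolation, assuming the pairs are already available as 2-tuples; the function name `rescue_variants` and the single-letter sample words are placeholders for illustration, not names from the pipeline.

```python
def rescue_variants(confirmed, pending, pairs):
    """Move a pending word into confirmed when its paired variant is confirmed.

    Repeats until a full pass changes nothing, so a confirmation propagates
    along chains such as a-b, b-c.
    """
    confirmed, pending = set(confirmed), set(pending)
    changed = True
    while changed:
        changed = False
        for a, b in pairs:
            # Each pair is checked in both directions.
            for known, other in ((a, b), (b, a)):
                if known in confirmed and other in pending:
                    confirmed.add(other)
                    pending.discard(other)
                    changed = True
    return confirmed, pending


# "a" is a confirmed noun; "b" and "c" hang off it through a chain, "d" does not.
confirmed, pending = rescue_variants({"a"}, {"b", "c", "d"}, [("b", "c"), ("a", "b")])
```

The pair order is deliberately unfavourable: `c` is only rescued on the second pass, after `b` has been confirmed, which is why the loop has to repeat.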
### The libmorph bridge — `libmorph_check.cpp`
libmorph (A. Kovalenko, MIT) ships as `libmorphrus.so`. `libmorph_check` is a thin
stdin→stdout filter: one UTF-8 word per line in, one line out:
```
<known>\t<pos>:<lemma>\t<pos>:<lemma>...
```
`<known>` is `CheckWord` (1 = in the dictionary). `<pos>` is `wdInfo & 0x3f`, the part of
speech. The codes were reverse-engineered (the docs omit the table):
| codes | part of speech |
|------|----------------|
| **7–21, 24** | **noun** (all genders / declensions / animacy; pluralia tantum is 24) |
| 1–3 | verb · 25, 27 adjective · 28–32 pronoun · 33–36 numeral |
| 38–39 | **proper noun** (excluded) · 48–58 comparative/adverb · 49–53 function words |
The analyser instance is requested with the key `libmorph.api.v4:utf-8` so words are
passed and lemmas returned in UTF-8.
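
A consumer of this line format might parse it roughly as follows. This is a simplified sketch, not the code in `ru_stage2.py`: the function name `parse_line` is hypothetical, the sample input line is invented for the example, and the noun codes (7 to 21, plus 24) follow the `LIBMORPH_NOUN_CODES` constant in the Stage 2 script.

```python
NOUN_CODES = set(range(7, 22)) | {24}  # 7..21 plus 24 (pluralia tantum)


def parse_line(line):
    """Split one libmorph_check output line into (known, lexemes, noun_lemmas)."""
    fields = line.rstrip("\n").split("\t")
    known = fields[0] == "1"
    lexemes = []
    for field in fields[1:]:
        code, sep, lemma = field.partition(":")
        if sep and code.isdigit():
            lexemes.append((int(code), lemma))
    noun_lemmas = [lemma for code, lemma in lexemes if code in NOUN_CODES]
    return known, lexemes, noun_lemmas


# Invented sample line: a known word with one noun reading and one adjective reading.
result = parse_line("1\t7:ящер\t25:ящерный")
unknown = parse_line("0")
```

Keeping every `(code, lemma)` pair alongside the filtered noun lemmas makes it easy to see why a word was or was not treated as a noun.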
## Notes & caveats
- The hard tail (≈ 35 000 Stage-1 words / our candidates) is in **no** morphological
dictionary; only the orthographic dictionary attests them, so the PDF note is the sole
signal there. Compound and very recent nouns (`робототехник`, `толкинист`) live here.
- OpenCorpora and libmorph are near-equal in size (≈ 99 500 words each on `all.txt`)
and ≈ 96 % overlapping, but **complementary** (each contributes ≈ 2 200 unique nouns),
which is why both are kept. The mawo *predictor* "knows" ~98 % of everything by guessing
and is therefore used only as a weak confirming vote, never as dictionary membership.
- Licensing: OpenCorpora data is CC BY-SA 3.0; libmorph is MIT; the orthographic
dictionary has its own copyright. A list derived from CC BY-SA data inherits that licence.
-27
@@ -1,27 +0,0 @@
#!/usr/bin/env python3
"""Fold Ё/ё → Е/е in a word list and de-duplicate — the dictionary prep for "Эрудит".

The Эрудит ruleset has no Ё tile and treats Е/Ё as one letter, so its dictionary must be
folded before the DAWG is built. Folding merges pairs like ёж/еж, hence the de-dup. Output
is sorted (Russian order over the 32 folded letters) and LF-separated.

Run: python3 dictprep/fold_yo.py dictprep/russian/scrabble.txt > /tmp/ru_erudit_words.txt
"""
import sys

ORDER = {c: i for i, c in enumerate("абвгдежзийклмнопрстуфхцчшщъыьэюя")}  # 32 letters, no ё


def key(w):
    return [ORDER.get(c, 99) for c in w]


def main():
    src = sys.argv[1] if len(sys.argv) > 1 else "/dev/stdin"
    words = {line.strip().replace("ё", "е").replace("Ё", "Е") for line in open(src, encoding="utf-8")}
    words.discard("")
    sys.stdout.write("\n".join(sorted(words, key=key)) + "\n")


if __name__ == "__main__":
    main()
-47
@@ -1,47 +0,0 @@
// libmorph_check: a thin stdin->stdout bridge to the libmorph Russian morphological
// analyser, for use by the Stage-2 classifier (scripts/ru_stage2.py).
//
// Reads one word per line (bytes are passed through verbatim — the caller encodes to
// the code page the libmorph char interface expects, CP1251). For each word it writes
// a line:
//
// <known>\t<pos>:<lemma>\t<pos>:<lemma>...
//
// where <known> is CheckWord's result (1 = in the dictionary, 0 = not), and each
// following field is one lexeme: its part of speech (wdInfo & 0x3f) and lemma.
//
// Build: g++ -std=c++17 -O2 scripts/libmorph_check.cpp -lmorphrus -lmoonycode -o libmorph_check

#include <libmorph/rus.h>
#include <libmorph/api.hpp>
#include <cstdio>
#include <iostream>
#include <string>

int main(int argc, char** argv) {
    // The factory key selects the code page: "libmorph.api.v4:<charset>". Use the
    // UTF-8 instance so words pass through verbatim. IMlmaMbXX only adds non-virtual
    // convenience wrappers over IMlmaMb, so the filled pointer can be used as such.
    const char* key = argc > 1 ? argv[1] : "libmorph.api.v4:utf-8";
    IMlmaMbXX* mlma = nullptr;
    int rc = mlmaruGetAPI(key, (void**)&mlma);
    if (mlma == nullptr) {
        std::fprintf(stderr, "libmorph_check: GetAPI('%s') failed, rc=%d\n", key, rc);
        return 1;
    }
    std::string line;
    while (std::getline(std::cin, line)) {
        if (!line.empty() && line.back() == '\r') line.pop_back();
        IMlmaMbXX::inword w(line.c_str(), line.size());
        int known = mlma->CheckWord(w, sfIgnoreCapitals);
        std::cout << known;
        try {
            for (auto& lx : mlma->Lemmatize(w, sfIgnoreCapitals)) {
                unsigned pos = lx.ngrams > 0 ? (lx.pgrams[0].wdInfo & 0x3f) : 0xffu;
                std::cout << '\t' << pos << ':' << (lx.plemma ? lx.plemma : "");
            }
        } catch (...) {
        }
        std::cout << '\n';
    }
    return 0;
}
Binary file not shown.
-341
@@ -1,341 +0,0 @@
#!/usr/bin/env python3
"""Stage 2 — the "brain" of the Russian Scrabble word-list pipeline.
It reads the Stage-1 base word list (built once by ruwords so the heavy PDF is not
re-parsed) together with the grammatical notes and the singular/variant structure, runs
the whole noun-selection logic in memory, and writes a minimal result:
dictprep/russian/scrabble.txt — the working dictionary (common nouns, nom. sing.)
dictprep/russian/undefined.txt — the ambiguous tail, left for manual review
(dictprep/russian/all.txt is the Stage-1 base.) Every other bucket — adjectives, verbs,
the merged note-nouns, singulars, variants — stays in memory. Pass --dump to also write
them; pass --trace WORD to ask how a single word did or did not reach the dictionary.
Note: all.txt is a plain word list, so the grammatical notes, "ед." singulars and "и"
variants are read from the pdftotext output (slov.txt) and the Stage-1 side files; the
expensive PDF parse itself runs only once.
Sources, most authoritative first: OpenCorpora (mawo-pymorphy3), libmorph (libmorph_check),
and the orthographic dictionary's own notes. See dictprep/README.md.
Run: ru-venv/bin/python dictprep/ru_stage2.py [--dump] [--trace WORD]
"""
import argparse
import os
import re
import subprocess
HERE = os.path.dirname(os.path.abspath(__file__))
OUT_DIR = os.path.join(HERE, "russian")
SLOV = os.path.join(OUT_DIR, "orfo_dict_2025.txt") # committed pdftotext output (source of truth)
WL_FROM, WL_TO = 452, 168808 # 1-based inclusive bounds of the column word-list section
OC_CACHE = "/tmp/oc_nouns.txt"
LIBMORPH_BIN = os.path.join(HERE, "libmorph_check")
ALPHABET = "абвгдеёжзийклмнопрстуфхцчшщъыьэюя"
ORDER = {c: i for i, c in enumerate(ALPHABET)}
PROPER = {"Name", "Surn", "Patr", "Geox", "Orgn", "Trad"}
LIBMORPH_NOUN_CODES = set(range(7, 22)) | {24} # 7..21 plus 24 (pluralia tantum)
ADJ_END = {"ая", "яя", "ое", "ее", "ье", "ья", "ьи"}
VERB3 = ("ет", "ёт", "ит", "ют", "ут", "ает", "яет", "ует", "уют", "нет", "жет", "чет")
GENPL = ("ов", "ёв", "ев", "ей")
def key(w):
    return [ORDER.get(c, 99) for c in w]


def destress(s):
    return "".join(c for c in s if ord(c) not in (0x0300, 0x0301)).lower()


def cyr_ok(w):
    return 2 <= len(w) <= 15 and all(("а" <= c <= "я") or c == "ё" for c in w)


def load(p):
    return [l.strip() for l in open(p, encoding="utf-8") if l.strip()] if os.path.exists(p) else []


def write(path, words):
    os.makedirs(os.path.dirname(path), exist_ok=True)
    open(path, "w", encoding="utf-8").write("\n".join(sorted(set(words), key=key)) + "\n")
import mawo_pymorphy3 # noqa: E402
M = mawo_pymorphy3.MorphAnalyzer()
D = M._dawg_dict
def oc_noun_lemmas():
    """Every common-noun lemma (nom. sing. / pluralia tantum) in OpenCorpora's words.dawg."""
    gp, pt = D.get_paradigm, D.parse_tag_string
    para0, tagc = {}, {}

    def g0(pid):
        r = para0.get(pid)
        if r is None:
            suf0, tag0, pre0 = gp(pid, 0)
            _, gr = pt(tag0)
            r = (pre0, suf0, gr)
            para0[pid] = r
        return r

    def gt(pid, idx):
        k = (pid, idx)
        r = tagc.get(k)
        if r is None:
            suf, tag, pre = gp(pid, idx)
            pos, gr = pt(tag)
            r = (suf, pre, pos, gr)
            tagc[k] = r
        return r

    out = set()
    for word, rec in D.words_dawg.iteritems():
        pid, idx = rec
        suf, pre, pos, gr = gt(pid, idx)
        if pos != "NOUN":
            continue
        pre0, suf0, gr0 = g0(pid)
        if (PROPER & gr) or (PROPER & gr0):
            continue
        stem = word[len(pre):len(word) - len(suf)] if suf else word[len(pre):]
        out.add(pre0 + stem + suf0)
    return {w for w in out if cyr_ok(w)}
def oc_status(word):
    """(is_common_noun, in_dictionary) for word, from OpenCorpora only."""
    parses = D.get_word_parses(word)
    if not parses:
        return False, False
    gp, pt = D.get_paradigm, D.parse_tag_string
    for pid, idx in parses:
        suf, tag, pre = gp(pid, idx)
        pos, gr = pt(tag)
        if pos == "NOUN":
            _, tag0, _ = gp(pid, 0)
            _, gr0 = pt(tag0)
            if not (PROPER & gr or PROPER & gr0):
                return True, True
    return False, True
def libmorph_analyze(words):
    """Map each word to (known, noun_lemma, codes) per libmorph; noun_lemma is None when it
    is not a common noun there. Empty result if the helper binary is not built."""
    words = list(words)
    if not words or not os.path.exists(LIBMORPH_BIN):
        return {}
    proc = subprocess.run([LIBMORPH_BIN], input="\n".join(words), capture_output=True, text=True)
    out = {}
    for w, line in zip(words, proc.stdout.split("\n")):
        fields = line.split("\t")
        known = fields[:1] == ["1"]
        codes, noun_lemmas = set(), []
        for field in fields[1:]:
            code, _, lex = field.partition(":")
            if code.isdigit():
                codes.add(int(code))
                if int(code) in LIBMORPH_NOUN_CODES:
                    noun_lemmas.append(lex)
        lemma = (w if w in noun_lemmas else noun_lemmas[0]) if noun_lemmas else None
        out[w] = (known, lemma, codes)
    return out
def build_notes():
    """Map each headword (destressed, lowercased) to its grammatical note."""
    def is_hw(ch):
        o = ord(ch)
        return (0x0430 <= o <= 0x044F) or (0x0410 <= o <= 0x042F) or o in (0x0401, 0x0451, 0x0300, 0x0301)

    hmap = {}
    lines = open(SLOV, encoding="utf-8").read().split("\n")
    for l in lines[WL_FROM - 1:WL_TO]:
        s = l.lstrip()
        e = 0
        for ch in s:
            if is_hw(ch):
                e += 1
            else:
                break
        hw = destress(s[:e])
        if hw and hw not in hmap:
            hmap[hw] = destress(s[e:]).strip()
    return hmap
def classify(w, note):
    """Coarse part of speech of an out-of-dictionary word from its PDF note."""
    if note is None:
        return "amb"
    n = re.sub(r"\([^)]*\)", "", note).strip()  # drop domain/etymology parentheticals
    if "кр. ф" in n or "кр.ф" in n or "прич." in n or "прил." in n:
        return "adj"
    ends = re.findall(r"-([а-яё]+)", n)
    if any(e in ADJ_END for e in ends):
        return "adj"
    if "сов." in n or "несов." in n or "безл." in n:
        return "verb"
    if w.endswith("ся"):  # reflexive: no Russian noun ends in -ся
        return "verb"
    if any(e.endswith(VERB3) for e in ends) and not any(m in n for m in ("ед.", "тв.", "род.", "м.", "ж.", "с.")):
        return "verb"
    if n == "" and w.endswith(("ый", "ий", "ой", "ая", "ое", "ые", "ие", "яя", "ее")):
        return "adj"
    if "нескл" in n:
        return "noun" if any(g in n for g in ("м.", "ж.", "с.", "мн.")) else "amb"
    if ends:
        return "noun"
    if n == "" and w.endswith(("ать", "ять", "еть", "ить", "оть", "уть", "ыть", "ти", "чь")):
        return "verb"
    return "amb"
def singular(w, note):
    """Nominative singular of a noun headword from the PDF note (authoritative) or, for a
    plural headword without an explicit singular, the mawo lemma; pluralia tantum kept."""
    n = note or ""
    full = re.search(r"ед\.\s+([а-яё]+)", n)
    if full:
        return full.group(1)
    suf = re.search(r"ед\.\s+-([а-яё]+)", n)
    if suf:
        s = suf.group(1)
        i = w.rfind(s[0])
        return w[:i] + s if i > 0 else w
    ends = re.findall(r"-([а-яё]+)", re.sub(r"\([^)]*\)", "", n))
    if ends and ends[0].endswith(GENPL):
        for p in M.parse(w):
            if str(p.tag.POS) == "NOUN":
                return p.normal_form
        return w
    return w
def build():
    """Run the whole pipeline in memory. Returns the result sets plus a `fate` map giving
    every word's outcome, so a word's path can be traced or the buckets dumped."""
    oc = set(load(OC_CACHE)) or oc_noun_lemmas()
    if not os.path.exists(OC_CACHE):
        write(OC_CACHE, oc)
    hmap = build_notes()
    all_words = load(os.path.join(OUT_DIR, "all.txt"))
    ed_nouns = set(load("/tmp/ru_singulars.txt"))
    pairs = [tuple(p) for l in load("/tmp/ru_variants.txt") if len(p := l.split("\t")) == 2]
    pdf = [w for w in all_words if cyr_ok(w)]
    lm = libmorph_analyze(pdf)

    def to_singular(w):
        s = singular(w, hmap.get(w))
        return s if cyr_ok(s) else w

    fate = {}
    scrabble = set(oc)
    adj, verb, amb = [], [], []
    for w in pdf:
        oc_noun, oc_known = oc_status(w)
        if oc_noun:
            fate[w] = "scrabble: сущ. по OpenCorpora"
            continue
        lm_known, lm_lemma, _ = lm.get(w, (False, None, frozenset()))
        if lm_lemma is not None:
            s = lm_lemma if cyr_ok(lm_lemma) else to_singular(w)
            scrabble.add(s)
            fate[w] = "scrabble: сущ. по libmorph" + ("" if s == w else f"{s}")
            continue
        if oc_known or lm_known:
            fate[w] = "отброшено: словарь знает как не-существительное"
            continue
        if w in ed_nouns:
            scrabble.add(w)
            fate[w] = "scrabble: ед.ч. по помете «ед.»"
            continue
        c = classify(w, hmap.get(w))
        if c == "noun":
            s = to_singular(w)
            scrabble.add(s)
            fate[w] = "scrabble: сущ. по помете орфословаря" + ("" if s == w else f"{s}")
        elif c == "adj":
            adj.append(w)
            fate[w] = "отброшено: прилагательное (помета орфословаря)"
        elif c == "verb":
            verb.append(w)
            fate[w] = "отброшено: глагол (помета орфословаря)"
        else:
            amb.append(w)
            fate[w] = "undefined: неоднозначное (нет в словарях, помета не определяет)"

    # Manual confirmations: nouns the maintainer approved from the undefined tail.
    for w in load(os.path.join(OUT_DIR, "manual_confirm.txt")):
        if cyr_ok(w):
            scrabble.add(w)
            fate[w] = "scrabble: подтверждено вручную (manual_confirm.txt)"

    # Variant rescue: a word joined by "и" to a confirmed noun is itself a noun.
    pending = set(amb) - scrabble
    changed = True
    while changed:
        changed = False
        for a, b in pairs:
            for x, y in ((a, b), (b, a)):
                if x in scrabble and y in pending:
                    scrabble.add(y)
                    pending.discard(y)
                    fate[y] = f"scrabble: вариант от «{x}» (через «и»)"
                    changed = True

    undefined = [w for w in amb if w not in scrabble]
    return {
        "oc": oc, "scrabble": scrabble, "undefined": undefined,
        "adjectives": adj, "verbs": verb, "singulars": ed_nouns,
        "fate": fate, "all": set(all_words),
    }
def trace(word, r):
    w = destress(word)
    if w in r["fate"]:
        return r["fate"][w]
    if w in r["scrabble"]:
        return "scrabble: лексикон OpenCorpora" if w in r["oc"] else "scrabble: производная/лемма"
    if w not in r["all"]:
        return "нет в russian_all (не извлечено на Stage 1 — нет в .pdf, либо имя собств./дефис/форма)"
    if not cyr_ok(w):
        return "отсеяно: длина или символы вне диапазона (2–15 кириллица)"
    return "не определено"
def main():
    ap = argparse.ArgumentParser(description="Stage 2 brain: build the noun dictionary, trace a word, or dump buckets.")
    ap.add_argument("--dump", action="store_true", help="also write the in-memory buckets (adjectives, verbs, singulars, variants, fate)")
    ap.add_argument("--trace", metavar="WORD", help="report how WORD did or did not reach the dictionary, then exit")
    args = ap.parse_args()

    r = build()
    if args.trace:
        print(f"{args.trace}: {trace(args.trace, r)}")
        return

    write(os.path.join(OUT_DIR, "scrabble.txt"), r["scrabble"])
    print(f"=> dictprep/russian/scrabble.txt {len(r['scrabble'])}")
    print(f" undefined kept in memory: {len(set(r['undefined']))} (use --dump to write it)")
    if args.dump:
        write(os.path.join(OUT_DIR, "undefined.txt"), r["undefined"])
        write(os.path.join(OUT_DIR, "adjectives.txt"), r["adjectives"])
        write(os.path.join(OUT_DIR, "verbs.txt"), r["verbs"])
        write(os.path.join(OUT_DIR, "singulars.txt"), r["singulars"])
        fate_path = os.path.join(OUT_DIR, "fate.tsv")
        os.makedirs(OUT_DIR, exist_ok=True)
        with open(fate_path, "w", encoding="utf-8") as f:
            for w in sorted(r["fate"], key=key):
                f.write(f"{w}\t{r['fate'][w]}\n")
        print(f" dumped: undefined.txt ({len(set(r['undefined']))}), adjectives.txt, verbs.txt, singulars.txt, fate.tsv")


if __name__ == "__main__":
    main()
File diff suppressed because it is too large
-135
@@ -1,135 +0,0 @@
артгруппа
бутень
вебинар
видеодневник
водозащита
генацвале
жакоб
оберфюрер
околоть
особина
полбазара
полбака
полбалкона
полбанана
полбарана
полбатальона
полбатона
полбиблиотеки
полблокнота
полбокала
полбуханки
полвагона
полвечера
полвзвода
полвинта
полгазеты
полгектара
полгостиницы
полграмма
полгруппы
полдачи
полдвора
полдекабря
полдеревни
полдетсада
полдивана
полдивизии
полдыни
полжурнала
ползавода
ползарплаты
полздания
полканикул
полканистры
полкартофелины
полкастрюли
полквартиры
полкилограмма
полкласса
полкниги
полколлекции
полкольца
полкоманды
полкоробки
полкочана
полкурса
полкуска
полмагазина
полмандарина
полмарта
полматча
полмиллиметра
полмузея
полноября
полпакета
полпарка
полпартии
полпинты
полпирога
полпирожка
полпируэта
полпоезда
полполена
полполка
полполки
полполосы
полпомидора
полпоросёнка
полпосёлка
полпредовский
полпроцента
полпузырька
полрайона
полромана
полроты
полрулона
полряда
полсада
полсажени
полсезона
полсентября
полсловаря
полсостава
полсрока
полстада
полстены
полстолетия
полстраницы
полстроки
полтаблетки
полтайма
полтакта
полтарелки
полтетради
полтома
полтона
полторта
полтысячелетия
полтюбика
полусанаторий
полфакультета
полфевраля
полфлакона
полфразы
полхаты
полцарства
полцентнера
полцистерны
полчайника
полчемодана
полшажка
полшажочка
полшара
полшкафа
полшколы
полщеки
принт
промо
рентгеноаппарат
сивец
соцнаём
срывка
флеш
флешмобер
шиноремонт
File diff suppressed because it is too large
File diff suppressed because it is too large
-434
@@ -1,434 +0,0 @@
// Command ruwords extracts a clean Cyrillic word list from the plain text of a Russian
// orthographic dictionary (the output of `pdftotext`).
//
// Stage 1 (this tool): from the column word-list section [from, to] it collects, per
// entry, the headword (the leading token). When the headword is plural and the entry
// gives its singular after "ед." — in full ("ящеры, …, ед. ящер") or as a replacement
// suffix ("…, ед. -вец") — only the singular is kept, since a plural that has a singular
// is never needed. It drops stress marks, lowercases, keeps ё, and discards proper nouns
// (capitalized), hyphenated words, acronyms and non-Cyrillic tokens. The result is
// de-duplicated and sorted in Russian alphabetical order (ё right after е), LF-separated.
//
// It also collects a variant headword joined by "и" when it carries its own grammatical
// note (e.g. "аблатив, -а и аблятив, -а"). Suffix-singular reconstruction is heuristic;
// Stage 2 (dictprep/ru_stage2.py) re-checks the words against real dictionaries.
//
// pdftotext dictprep/orfo_dict_2025.pdf /tmp/slov.txt
// go run ./dictprep/ruwords -in /tmp/slov.txt -from 452 -to 168808 \
// -out russian_all.txt -skip russian_skip.txt
package main
import (
"bufio"
"flag"
"fmt"
"log"
"os"
"path/filepath"
"sort"
"strings"
"unicode"
)
// ruAlphabet is the Russian alphabet in collation order (ё directly after е).
const ruAlphabet = "абвгдеёжзийклмнопрстуфхцчшщъыьэюя"
var ruRank = func() map[rune]int {
m := make(map[rune]int, len(ruAlphabet))
for i, r := range []rune(ruAlphabet) {
m[r] = i
}
return m
}()
func isCyrLetter(r rune) bool {
return (r >= 'а' && r <= 'я') || (r >= 'А' && r <= 'Я') || r == 'ё' || r == 'Ё'
}
func isUpperCyr(r rune) bool { return (r >= 'А' && r <= 'Я') || r == 'Ё' }
func isStress(r rune) bool { return r == 0x0300 || r == 0x0301 }
// cleanWord normalizes a run of letters/stress-marks into a lowercase Cyrillic word, or
// returns ok=false for proper nouns (capitalized), hyphenated or non-Cyrillic runs.
func cleanWord(run []rune) (string, bool) {
if len(run) == 0 || isUpperCyr(run[0]) {
return "", false
}
var b strings.Builder
for _, r := range run {
switch {
case isStress(r), r == '­': // drop stress accents and soft hyphens
case r == '-': // a real hyphen means a hyphenated word: reject it
return "", false
default:
b.WriteRune(unicode.ToLower(r))
}
}
w := b.String()
if w == "" {
return "", false
}
for _, r := range w {
if !((r >= 'а' && r <= 'я') || r == 'ё') {
return "", false
}
}
return w, true
}
// headword returns the entry's headword: the leading run of letters, stress marks and
// hyphens, normalized.
func headword(line string) (string, bool) {
// Trim leading whitespace, including the form-feed (U+000C) that pdftotext puts at
// the top of each page — otherwise the first headword on every page is lost.
line = strings.TrimLeftFunc(line, unicode.IsSpace)
var run []rune
for _, r := range line {
if isCyrLetter(r) || isStress(r) || r == '-' || r == '­' {
run = append(run, r)
} else {
break
}
}
return cleanWord(run)
}
// embeddedSingulars returns the singular form of a plural headword spelled out after
// "ед.", either in full ("ед. ящер") or as a replacement suffix ("ед. -вец",
// reconstructed from headword). It skips gender marks ("ед. м") and abbreviations that
// merely start with "ед." ("ед. измер.", "ден. ед.").
func embeddedSingulars(line, headword string) []string {
var out []string
for i := 0; ; {
j := strings.Index(line[i:], "ед.")
if j < 0 {
break
}
i += j + len("ед.")
rest := strings.TrimLeft(line[i:], "  \t")
if strings.HasPrefix(rest, "-") { // suffix form: reconstruct from the headword
var suf []rune
for _, r := range rest[len("-"):] {
if isCyrLetter(r) || isStress(r) {
suf = append(suf, r)
} else {
break
}
}
if s, ok := cleanWord(suf); ok && len([]rune(s)) >= 2 {
if recon := reconstructSingular(headword, s); recon != "" {
out = append(out, recon)
}
}
continue
}
var run []rune
consumed := 0
for _, r := range rest {
if isCyrLetter(r) || isStress(r) {
run = append(run, r)
consumed += len(string(r))
} else {
break
}
}
if len(run) == 0 {
continue
}
if strings.HasPrefix(rest[consumed:], ".") {
continue // an abbreviation like "ед. измер." rather than a singular form
}
w, ok := cleanWord(run)
if !ok || len([]rune(w)) < 2 { // 2+ letters excludes the gender marks м/ж/с
continue
}
out = append(out, w)
}
return out
}
// reconstructSingular builds the singular from a plural headword and the replacement
// suffix from "ед. -<suffix>", splicing where the suffix best overlaps the tail of the
// headword (the position of longest common prefix between the suffix and a headword
// suffix). It is a heuristic; Stage 2 re-checks the words against real dictionaries.
func reconstructSingular(headword, suffix string) string {
hw, sf := []rune(headword), []rune(suffix)
bestK, bestLen := -1, 0
for k := 0; k < len(hw); k++ {
m := 0
for k+m < len(hw) && m < len(sf) && hw[k+m] == sf[m] {
m++
}
if m > bestLen {
bestK, bestLen = k, m
}
}
if bestK < 0 {
return ""
}
return string(hw[:bestK]) + suffix
}
// headwordNotes are the grammatical notes that mark a parallel headword (a lemma) after
// "и", as opposed to an inflected form. A "-" ending also marks one; form labels such as
// деепр. (gerund) or сравн. (comparative) deliberately do not.
var headwordNotes = map[string]bool{
"нескл": true, "неизм": true, "предлог": true, "предл": true, "нареч": true,
"нар": true, "прил": true, "союз": true, "частица": true, "част": true,
"межд": true, "мн": true, "ед": true, "тв": true, "числ": true, "мест": true,
"м": true, "ж": true, "с": true, "вводн": true, "сказ": true,
}
// variantNoteOK reports whether the note following a candidate variant marks a headword:
// a "-" inflection ending or one of headwordNotes (and not a bare inflected word).
func variantNoteOK(note string) bool {
if strings.HasPrefix(note, "-") {
return true
}
var stem []rune
for _, r := range note {
if (r >= 'а' && r <= 'я') || r == 'ё' {
stem = append(stem, r)
} else {
break
}
}
return headwordNotes[string(stem)]
}
// variants returns the second (and further) headwords of an entry, written as a parallel
// form after " и ", e.g. "аблатив, -а и аблятив, -а" yields "аблятив" and "регги и реггей,
// нескл." yields "реггей". Requiring a headword note after the comma keeps this from
// matching "и" inside examples or picking up inflected forms.
func variants(line string) []string {
var out []string
const sep = " и "
for i := 0; ; {
j := strings.Index(line[i:], sep)
if j < 0 {
break
}
i += j + len(sep)
rest := line[i:]
var run []rune
consumed := 0
for _, r := range rest {
if isCyrLetter(r) || isStress(r) {
run = append(run, r)
consumed += len(string(r))
} else {
break
}
}
if len(run) == 0 {
continue
}
after := rest[consumed:]
if !strings.HasPrefix(after, ", ") || !variantNoteOK(after[len(", "):]) {
continue
}
if w, ok := cleanWord(run); ok && len([]rune(w)) >= 2 {
out = append(out, w)
}
}
return out
}
// normToken normalizes any token (a run of letters and stress marks) for the skip set:
// lowercase, stress removed, kept only if it is 2+ all-Cyrillic letters. Unlike
// cleanWord it does NOT reject capitalized tokens — a lowercased proper noun belongs in
// the skip set so it can be re-checked by a morphological analyzer.
func normToken(run []rune) (string, bool) {
var b strings.Builder
for _, r := range run {
if isStress(r) {
continue
}
b.WriteRune(unicode.ToLower(r))
}
w := b.String()
if len([]rune(w)) < 2 {
return "", false
}
for _, r := range w {
if !((r >= 'а' && r <= 'я') || r == 'ё') {
return "", false
}
}
return w, true
}
// tokens returns every maximal run of Cyrillic letters (plus stress marks) in the line,
// normalized; runs are split on every other character (so hyphens split a word).
func tokens(line string) []string {
var out []string
var run []rune
flush := func() {
if len(run) > 0 {
if w, ok := normToken(run); ok {
out = append(out, w)
}
run = run[:0]
}
}
for _, r := range line {
if isCyrLetter(r) || isStress(r) {
run = append(run, r)
} else {
flush()
}
}
flush()
return out
}
func lessRu(a, b string) bool {
ra, rb := []rune(a), []rune(b)
for i := 0; i < len(ra) && i < len(rb); i++ {
if ra[i] != rb[i] {
return ruRank[ra[i]] < ruRank[rb[i]]
}
}
return len(ra) < len(rb)
}
func sortedRu(set map[string]struct{}) []string {
words := make([]string, 0, len(set))
for w := range set {
words = append(words, w)
}
sort.Slice(words, func(i, j int) bool { return lessRu(words[i], words[j]) })
return words
}
func writeWords(path string, words []string) error {
if dir := filepath.Dir(path); dir != "" && dir != "." {
if err := os.MkdirAll(dir, 0o755); err != nil {
return err
}
}
o, err := os.Create(path)
if err != nil {
return err
}
w := bufio.NewWriter(o)
for _, word := range words {
w.WriteString(word)
w.WriteByte('\n')
}
if err := w.Flush(); err != nil {
o.Close()
return err
}
return o.Close()
}
func main() {
in := flag.String("in", "dictprep/russian/orfo_dict_2025.txt", "plain-text dictionary (pdftotext output)")
out := flag.String("out", "dictprep/russian/all.txt", "output: the base word list (clean headwords + reconstructed singulars + variants)")
skip := flag.String("skip", "/tmp/ru_skip.txt", "output: every other token, for a later morphology re-check")
sings := flag.String("singulars", "/tmp/ru_singulars.txt", "output: singulars reconstructed from \"ед.\" (known nouns)")
varsOut := flag.String("variants", "/tmp/ru_variants.txt", "output: variant pairs joined by \"и\" (primary<TAB>variant)")
from := flag.Int("from", 452, "first line of the word-list section (1-based, inclusive)")
to := flag.Int("to", 168808, "last line of the word-list section (inclusive)")
flag.Parse()
if *in == "" {
log.Fatal("ruwords: -in is required")
}
f, err := os.Open(*in)
if err != nil {
log.Fatal(err)
}
defer f.Close()
all := make(map[string]struct{})
allTokens := make(map[string]struct{})
singulars := make(map[string]struct{})
variantPairs := make(map[string]struct{})
entries, fromHead, fromSing, fromVar := 0, 0, 0, 0
sc := bufio.NewScanner(f)
sc.Buffer(make([]byte, 1<<20), 1<<20)
for line := 0; sc.Scan(); {
line++
if line < *from || line > *to {
continue
}
entries++
text := sc.Text()
hw, hwOK := headword(text)
var sings []string
if hwOK {
sings = embeddedSingulars(text, hw)
}
primary := ""
if len(sings) > 0 {
// the headword is plural and the entry gives its singular: keep only the singular
primary = sings[0]
for _, w := range sings {
if _, seen := all[w]; !seen {
fromSing++
all[w] = struct{}{}
}
singulars[w] = struct{}{}
}
} else if hwOK {
primary = hw
if _, seen := all[hw]; !seen {
fromHead++
}
all[hw] = struct{}{}
}
for _, w := range variants(text) {
if _, seen := all[w]; !seen {
fromVar++
all[w] = struct{}{}
}
if primary != "" && primary != w {
variantPairs[primary+"\t"+w] = struct{}{}
}
}
for _, w := range tokens(text) {
allTokens[w] = struct{}{}
}
}
if err := sc.Err(); err != nil {
log.Fatal(err)
}
skipSet := make(map[string]struct{})
for w := range allTokens {
if _, ok := all[w]; !ok {
skipSet[w] = struct{}{}
}
}
allWords := sortedRu(all)
skipWords := sortedRu(skipSet)
if err := writeWords(*out, allWords); err != nil {
log.Fatal(err)
}
if err := writeWords(*skip, skipWords); err != nil {
log.Fatal(err)
}
if err := writeWords(*sings, sortedRu(singulars)); err != nil {
log.Fatal(err)
}
pairList := make([]string, 0, len(variantPairs))
for p := range variantPairs {
pairList = append(pairList, p)
}
sort.Strings(pairList)
if err := writeWords(*varsOut, pairList); err != nil {
log.Fatal(err)
}
fmt.Printf("scanned %d entries\n", entries)
fmt.Printf(" %-20s %7d words (%d headwords + %d embedded singulars + %d variants)\n", *out, len(allWords), fromHead, fromSing, fromVar)
fmt.Printf(" %-20s %7d words (tokens not in %s; for a morphology re-check)\n", *skip, len(skipWords), *out)
fmt.Printf(" %-20s %7d words (singulars from \"ед.\"; known nouns)\n", *sings, len(singulars))
fmt.Printf(" %-20s %7d pairs (variants joined by \"и\")\n", *varsOut, len(variantPairs))
}
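The suffix-splice heuristic in `reconstructSingular` from the removed file above is easier to follow with concrete inputs. The sketch below copies that function unchanged and runs it on a few plural/suffix pairs; the sample pairs are illustrative and are not quoted from the source dictionary:

```go
package main

import "fmt"

// reconstructSingular is copied from the removed ruwords tool: it finds the
// position in headword where the suffix overlaps best (longest common prefix
// with the remaining tail) and splices the suffix in at that position.
func reconstructSingular(headword, suffix string) string {
	hw, sf := []rune(headword), []rune(suffix)
	bestK, bestLen := -1, 0
	for k := 0; k < len(hw); k++ {
		m := 0
		for k+m < len(hw) && m < len(sf) && hw[k+m] == sf[m] {
			m++
		}
		if m > bestLen {
			bestK, bestLen = k, m
		}
	}
	if bestK < 0 {
		return "" // the suffix shares no letter with the headword
	}
	return string(hw[:bestK]) + suffix
}

func main() {
	cases := []struct{ plural, suffix string }{
		{"литовцы", "вец"},  // overlap at "в": "лито" + "вец"
		{"кавказцы", "зец"}, // overlap at "з": "кавка" + "зец"
		{"ящеры", "кот"},    // no overlap: empty result
	}
	for _, c := range cases {
		fmt.Printf("%s, ед. -%s => %q\n", c.plural, c.suffix, reconstructSingular(c.plural, c.suffix))
	}
	// литовцы, ед. -вец => "литовец"
	// кавказцы, ед. -зец => "кавказец"
	// ящеры, ед. -кот => ""
}
```

Because the first position with the longest overlap wins, a headword that repeats the suffix's first letter may be spliced too early, which is consistent with the file's own note that the result is re-checked in Stage 2.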
+1 -1
@@ -1,4 +1,4 @@
module scrabble-solver
module gitea.iliadenisov.ru/developer/scrabble-solver
go 1.26.3
+13 -43
@@ -1,24 +1,18 @@
// Package dict loads the English test dictionary as a DAWG, preferring the serialized
// cache under testdata and falling back to building from the dictionaries submodule.
// Paths are resolved relative to the repository root so it works both from the repo root
// (commands) and from a package directory (tests).
// Package dict loads the English test dictionary as a DAWG from the committed
// dawg/en_sowpods.dawg fixture, for the cmd/stress benchmark. The dictionary build
// pipeline (word-list parsing and DAWG construction from sources) now lives in the
// separate scrabble-dictionary repository; this package only loads the committed
// artifact. Paths are resolved relative to the repository root so it works both from
// the repo root (commands) and from a package directory (tests).
package dict
import (
"os"
"path/filepath"
"github.com/iliadenisov/alphabet"
dawg "github.com/iliadenisov/dafsa"
"scrabble-solver/internal/dictdawg"
"scrabble-solver/internal/wordlist"
)
// MinLen and MaxLen bound playable word lengths (a 15x15 board holds at most 15).
const (
MinLen = 2
MaxLen = 15
"gitea.iliadenisov.ru/developer/scrabble-solver/dictdawg"
)
func exists(p string) bool { _, err := os.Stat(p); return err == nil }
@@ -42,35 +36,11 @@ func Root() string {
}
}
// DAWGCache and WordlistPath locate the English cache file and source word list,
// relative to the repository root.
func DAWGCache() string { return filepath.Join(Root(), "testdata", "sowpods.dawg") }
func WordlistPath() string { return filepath.Join(Root(), "dictionaries", "english", "sowpods.txt") }
// DAWGCache locates the committed English DAWG, relative to the repository root.
func DAWGCache() string { return filepath.Join(Root(), "dawg", "en_sowpods.dawg") }
// EnglishAvailable reports whether the English dictionary can be loaded (cache or source).
func EnglishAvailable() bool {
return exists(DAWGCache()) || exists(WordlistPath())
}
// EnglishAvailable reports whether the committed English DAWG is present.
func EnglishAvailable() bool { return exists(DAWGCache()) }
// EnglishWords returns the encoded English word list (from the submodule source).
func EnglishWords() ([][]byte, error) {
return wordlist.Read(WordlistPath(), alphabet.Latin(), MinLen, MaxLen)
}
// EnglishDAWG returns the English DAWG, loading the cache if present, otherwise building
// it from the word list and caching it (best effort).
func EnglishDAWG() (dawg.Finder, error) {
if exists(DAWGCache()) {
return dictdawg.Load(DAWGCache())
}
words, err := EnglishWords()
if err != nil {
return nil, err
}
f, err := dictdawg.Build(alphabet.Latin(), words)
if err != nil {
return nil, err
}
_ = dictdawg.Save(f, DAWGCache())
return f, nil
}
// EnglishDAWG loads the committed English DAWG.
func EnglishDAWG() (dawg.Finder, error) { return dictdawg.Load(DAWGCache()) }
+1 -1
@@ -6,7 +6,7 @@ import (
"github.com/iliadenisov/alphabet"
dawg "github.com/iliadenisov/dafsa"
"scrabble-solver/internal/graph"
"gitea.iliadenisov.ru/developer/scrabble-solver/internal/graph"
)
// TestSpellSmoke also exercises the go.mod replace => ../dafsa wiring and the new
+2 -2
@@ -1,8 +1,8 @@
package scrabble
import (
"scrabble-solver/board"
"scrabble-solver/internal/encoding"
"gitea.iliadenisov.ru/developer/scrabble-solver/board"
"gitea.iliadenisov.ru/developer/scrabble-solver/internal/encoding"
)
// Apply places a move's newly-placed tiles on the board. The move must be legal for the
+2 -2
@@ -3,8 +3,8 @@ package scrabble
import (
dawg "github.com/iliadenisov/dafsa"
"scrabble-solver/board"
"scrabble-solver/internal/encoding"
"gitea.iliadenisov.ru/developer/scrabble-solver/board"
"gitea.iliadenisov.ru/developer/scrabble-solver/internal/encoding"
)
// letterSet is a bit set over alphabet letter indexes (alphabets are at most 63
+2 -2
@@ -6,8 +6,8 @@ import (
"github.com/iliadenisov/alphabet"
dawg "github.com/iliadenisov/dafsa"
"scrabble-solver/internal/dictdawg"
"scrabble-solver/internal/wordlist"
"gitea.iliadenisov.ru/developer/scrabble-solver/dictdawg"
"gitea.iliadenisov.ru/developer/scrabble-solver/wordlist"
)
func bruteCrossSet(words [][]byte, above, below []byte, size int) letterSet {
+5 -5
@@ -8,9 +8,9 @@ import (
"strings"
"testing"
"scrabble-solver/board"
"scrabble-solver/internal/dictdawg"
"scrabble-solver/rules"
"gitea.iliadenisov.ru/developer/scrabble-solver/board"
"gitea.iliadenisov.ru/developer/scrabble-solver/dictdawg"
"gitea.iliadenisov.ru/developer/scrabble-solver/rules"
)
// TestScoreRealGames replays real tournament games recorded in GCG format and checks that
@@ -19,11 +19,11 @@ import (
//
// The games come from cross-tables.com (annotated self-play) and are stored under
// testdata/. They use the standard English board and SOWPODS, so the test loads the
// committed dawg/en_sowpods.dawg (build it with `make dawg`).
// committed dawg/en_sowpods.dawg.
func TestScoreRealGames(t *testing.T) {
finder, err := dictdawg.Load("../dawg/en_sowpods.dawg")
if err != nil {
t.Skipf("need dawg/en_sowpods.dawg (run `make dawg`): %v", err)
t.Skipf("need dawg/en_sowpods.dawg: %v", err)
}
s := NewSolver(rules.English(), finder)
games, _ := filepath.Glob("testdata/*.gcg")
+3 -3
View File
@@ -1,9 +1,9 @@
package scrabble
import (
"scrabble-solver/board"
"scrabble-solver/rack"
"scrabble-solver/rules"
"gitea.iliadenisov.ru/developer/scrabble-solver/board"
"gitea.iliadenisov.ru/developer/scrabble-solver/rack"
"gitea.iliadenisov.ru/developer/scrabble-solver/rules"
)
// generateBoth runs an across-generator on the board (for horizontal plays) and on its
+4 -4
View File
@@ -3,10 +3,10 @@ package scrabble
import (
dawg "github.com/iliadenisov/dafsa"
"scrabble-solver/board"
"scrabble-solver/internal/encoding"
"scrabble-solver/rack"
"scrabble-solver/rules"
"gitea.iliadenisov.ru/developer/scrabble-solver/board"
"gitea.iliadenisov.ru/developer/scrabble-solver/internal/encoding"
"gitea.iliadenisov.ru/developer/scrabble-solver/rack"
"gitea.iliadenisov.ru/developer/scrabble-solver/rules"
)
// DAWGGenerator generates moves with the Appel-Jacobson two-phase algorithm
+6 -6
View File
@@ -5,12 +5,12 @@ import (
"github.com/iliadenisov/alphabet"
"scrabble-solver/board"
"scrabble-solver/internal/dictdawg"
"scrabble-solver/internal/encoding"
"scrabble-solver/internal/wordlist"
"scrabble-solver/rack"
"scrabble-solver/rules"
"gitea.iliadenisov.ru/developer/scrabble-solver/board"
"gitea.iliadenisov.ru/developer/scrabble-solver/dictdawg"
"gitea.iliadenisov.ru/developer/scrabble-solver/internal/encoding"
"gitea.iliadenisov.ru/developer/scrabble-solver/rack"
"gitea.iliadenisov.ru/developer/scrabble-solver/rules"
"gitea.iliadenisov.ru/developer/scrabble-solver/wordlist"
)
func makeRack(letters string, blanks int) rack.Rack {
+2 -2
View File
@@ -1,8 +1,8 @@
package scrabble
import (
"scrabble-solver/board"
"scrabble-solver/rack"
"gitea.iliadenisov.ru/developer/scrabble-solver/board"
"gitea.iliadenisov.ru/developer/scrabble-solver/rack"
)
// Generator produces every legal play for a position. The DAWG generator
+3 -3
View File
@@ -1,9 +1,9 @@
package scrabble
import (
"scrabble-solver/board"
"scrabble-solver/rack"
"scrabble-solver/rules"
"gitea.iliadenisov.ru/developer/scrabble-solver/board"
"gitea.iliadenisov.ru/developer/scrabble-solver/rack"
"gitea.iliadenisov.ru/developer/scrabble-solver/rules"
)
// dict is a membership set of words (alphabet-index strings) for the oracle.
+3 -3
View File
@@ -5,9 +5,9 @@ import (
"fmt"
"sort"
"scrabble-solver/board"
"scrabble-solver/internal/encoding"
"scrabble-solver/rules"
"gitea.iliadenisov.ru/developer/scrabble-solver/board"
"gitea.iliadenisov.ru/developer/scrabble-solver/internal/encoding"
"gitea.iliadenisov.ru/developer/scrabble-solver/rules"
)
// coord maps a line coordinate (fixed, axis) to a board (row, col) for direction dir.
+3 -3
View File
@@ -3,9 +3,9 @@ package scrabble
import (
"testing"
"scrabble-solver/board"
"scrabble-solver/internal/encoding"
"scrabble-solver/rules"
"gitea.iliadenisov.ru/developer/scrabble-solver/board"
"gitea.iliadenisov.ru/developer/scrabble-solver/internal/encoding"
"gitea.iliadenisov.ru/developer/scrabble-solver/rules"
)
const plain7 = `.......
+3 -3
View File
@@ -7,9 +7,9 @@ import (
dawg "github.com/iliadenisov/dafsa"
"scrabble-solver/board"
"scrabble-solver/rack"
"scrabble-solver/rules"
"gitea.iliadenisov.ru/developer/scrabble-solver/board"
"gitea.iliadenisov.ru/developer/scrabble-solver/rack"
"gitea.iliadenisov.ru/developer/scrabble-solver/rules"
)
// Solver is the high-level entry point: it generates ranked plays and scores or
+3 -3
View File
@@ -5,9 +5,9 @@ import (
"github.com/iliadenisov/alphabet"
"scrabble-solver/board"
"scrabble-solver/internal/dictdawg"
"scrabble-solver/internal/wordlist"
"gitea.iliadenisov.ru/developer/scrabble-solver/board"
"gitea.iliadenisov.ru/developer/scrabble-solver/dictdawg"
"gitea.iliadenisov.ru/developer/scrabble-solver/wordlist"
)
func newTestSolver(t *testing.T) *Solver {
+4 -4
View File
@@ -7,10 +7,10 @@ import (
"sort"
"time"
"scrabble-solver/board"
"scrabble-solver/rack"
"scrabble-solver/rules"
"scrabble-solver/scrabble"
"gitea.iliadenisov.ru/developer/scrabble-solver/board"
"gitea.iliadenisov.ru/developer/scrabble-solver/rack"
"gitea.iliadenisov.ru/developer/scrabble-solver/rules"
"gitea.iliadenisov.ru/developer/scrabble-solver/scrabble"
)
// blankTile marks a blank in the bag and in a player's hand.
+5 -5
View File
@@ -5,11 +5,11 @@ import (
"github.com/iliadenisov/alphabet"
"scrabble-solver/internal/dictdawg"
"scrabble-solver/internal/wordlist"
"scrabble-solver/rules"
"scrabble-solver/scrabble"
"scrabble-solver/selfplay"
"gitea.iliadenisov.ru/developer/scrabble-solver/dictdawg"
"gitea.iliadenisov.ru/developer/scrabble-solver/rules"
"gitea.iliadenisov.ru/developer/scrabble-solver/scrabble"
"gitea.iliadenisov.ru/developer/scrabble-solver/selfplay"
"gitea.iliadenisov.ru/developer/scrabble-solver/wordlist"
)
func TestPlayGameSmoke(t *testing.T) {