Implement Scrabble move generator (DAWG) with English and Russian rules
A Go library that returns every legal play ranked by score and scores or validates plays, using the Appel-Jacobson DAWG algorithm over github.com/iliadenisov/dafsa v1.1.0. - DAWG move generation (across / down / both), full tournament scoring with a per-tile breakdown; public Solver: GenerateMoves (ranked), ScorePlay, ValidatePlay. - Rulesets: English Scrabble, Russian Scrabble, Эрудит (parameterizable Ruleset). - cmd/builddict (build the DAWG from the dictionaries submodule), cmd/stress (self-play benchmark), selfplay engine; brute-force test oracle. - A GADDAG was implemented, benchmarked and removed (the DAWG was smaller and faster for a scoring solver); see RESULTS.md and ALGORITHM.md.
This commit is contained in:
@@ -0,0 +1,174 @@
|
||||
# Scrabble Solver — Implementation Plan
|
||||
|
||||
## Outcome (current state)
|
||||
|
||||
Both generators were implemented and verified to produce identical moves, then compared
|
||||
by self-play stress test (`RESULTS.md`). The **GADDAG was removed**: for a scoring solver
|
||||
it was ~7× larger and no faster than the **DAWG**, which is now the sole generator.
|
||||
Shipped: the DAWG generator, full scoring + breakdown, the public `Solver`
|
||||
(`GenerateMoves`/`ScorePlay`/`ValidatePlay`), and three rulesets (English Scrabble,
|
||||
Russian Scrabble, Эрудит). The rest of this document is the original roadmap, kept for
|
||||
history; the DAWG/GADDAG comparison it describes is preserved in `RESULTS.md`.
|
||||
|
||||
## Context
|
||||
|
||||
We are building a Go library that, given a dictionary, a current game position and a
|
||||
player's rack, returns every legal new play ranked by descending score. The core is a
|
||||
fast finite-automaton move generator based on two papers (analysed in `ALGORITHM.md`):
|
||||
|
||||
- Appel & Jacobson, *The World's Fastest Scrabble Program* (CACM 1988) — the **DAWG**
|
||||
algorithm (cross-checks, anchor squares, `LeftPart`/`ExtendRight`, edge encoding,
|
||||
cross-sums for scoring, transpose for the perpendicular direction).
|
||||
- Gordon, *A Faster Scrabble Move Generation Algorithm* (SP&E 1994) — the **GADDAG**
|
||||
(`REV(x)◊y` representation, single `Gen`/`GoOn` generator, deterministic cross-sets,
|
||||
construction algorithm).
|
||||
|
||||
The graph engine is `github.com/iliadenisov/dafsa` — a compact, bit-packed minimized
|
||||
DAWG with a 6-bit "compact alphabet" (`alphabet.Indexer`, ≤63 symbols) and an
|
||||
index-based (`*B`) API, checked out locally at `../dafsa`.
|
||||
|
||||
### Headline approach
|
||||
|
||||
**Implement BOTH generators — DAWG and GADDAG — behind one shared `Generator`
|
||||
interface, then decide which becomes the production default empirically**, via a clean
|
||||
self-play stress test (two greedy players, several games) on the *same* dictionary,
|
||||
measuring speed and memory. The choice is made **after** implementation and measurement.
|
||||
Both implementations are kept; the comparison output (`RESULTS.md`) picks the default.
|
||||
|
||||
### Locked decisions
|
||||
|
||||
| # | Topic | Decision |
|
||||
|---|---|---|
|
||||
| 1 | Core algorithm | **Implement both**: DAWG (Appel-Jacobson) and GADDAG (Gordon, over `dafsa` as a DAWG of `{REV(x)◊y}` with per-node final flags). Pick the default after a self-play stress test. |
|
||||
| 2 | dafsa changes | **Edit `../dafsa` directly**, wire via `go.mod replace`. Leave a spec/CHANGELOG there. |
|
||||
| 3 | Ruleset scope | Default **standard English Scrabble**, fully **parameterizable** (board geometry, premium layout DL/TL/DW/TW, tile values & counts, alphabet, blank count, bingo bonus). Must support **Russian "Эрудит"** (same 15×15 board + premiums; different tile values/counts; Cyrillic alphabet; one word per move, horizontal **or** vertical). |
|
||||
| 4 | Scoring | **Full tournament scoring + breakdown**: main word + all cross-words + premiums (newly-placed tiles only) + bingo bonus; result carries formed cross-words and a per-tile breakdown. |
|
||||
| 5 | Symbol encoding | **`0x80` = wildcard/blank** flag (board/rack/output only — never in the graph). **GADDAG separator `◊` = index == alphabet size** (`cbits` minimal; measured optimum). |
|
||||
| 6 | State model | **Compact byte board** is the generation core; a structured **`Play`** type + a constructor that applies plays to build a board provide the full game-state overlay. |
|
||||
| 7 | API scope | **Generation + scoring + validation** of arbitrary plays. |
|
||||
| 8 | Dictionary | `kamilmielnik/scrabble-dictionaries` as a **git submodule**; `cmd/builddict` builds serialized structures **cached in `testdata/` (gitignored)**. English now; Russian later. |
|
||||
|
||||
### Static structure probe (informs expectations; NOT the decision)
|
||||
|
||||
Full SOWPODS (267,752 words, Σlen = 2,439,269), built through `dafsa`:
|
||||
|
||||
| Structure | nodes | bytes | bits/char | build | ns/arc |
|
||||
|---|---:|---:|---:|---:|---:|
|
||||
| DAWG (a–z) | 77,808 | 750 KB | 2.46 | 186 ms | 48.7 |
|
||||
| GADDAG sep=size(26) · cbits5 | 587,940 | 5.37 MB | 17.61 | 2.92 s | 61.0 |
|
||||
| GADDAG sep=62 · cbits6 (measured) | 587,940 | 5.53 MB | 18.13 | — | 60.9 |
|
||||
| GADDAG sep=0x40 · cbits7 (extrapolated) | 587,940 | ~5.69 MB | ~18.6 | — | ~61 |
|
||||
|
||||
GADDAG is ~7× the DAWG and ~25% costlier per arc, but ~2× faster at actual move
|
||||
generation (Gordon Table IV: ~2.5× fewer arcs). The stress test settles it.
|
||||
|
||||
## Deliverable documents
|
||||
|
||||
1. **`ALGORITHM.md`** — single source of truth (verbatim pseudocode + our adaptation).
|
||||
2. **`PLAN.md`** — this plan.
|
||||
3. **`RESULTS.md`** — stress-test comparison + the production-default decision.
|
||||
|
||||
## Architecture & package layout
|
||||
|
||||
```
|
||||
scrabble-solver/
|
||||
go.mod # + replace github.com/iliadenisov/dafsa => ../dafsa
|
||||
PLAN.md ALGORITHM.md RESULTS.md README.md
|
||||
dictionaries/ # git submodule: kamilmielnik/scrabble-dictionaries
|
||||
testdata/ # gitignored: cached serialized DAWG + GADDAG
|
||||
internal/
|
||||
gaddag/ # REV(x)◊y transform + build + traversal wrapper over dafsa
|
||||
dictdawg/ # plain-DAWG build + traversal wrapper over dafsa
|
||||
encoding/ # byte conventions (wildcard 0x80, separator, board cells)
|
||||
board/ # compact board grid, transpose, premium layout
|
||||
rack/ # rack as per-letter counts + blanks
|
||||
rules/ # Ruleset: geometry, premiums, tile values/counts, alphabet, bonus
|
||||
scrabble/ (public pkg) # Solver + Generator interface; Play/Move types;
|
||||
gen_dawg.go # DAWG generator (LeftPart/ExtendRight)
|
||||
gen_gaddag.go # GADDAG generator (Gen/GoOn)
|
||||
selfplay/ # bag + greedy player + game loop (self-play engine)
|
||||
cmd/builddict/ # word list -> serialized DAWG/GADDAG -> testdata
|
||||
cmd/stress/ # run N self-play games per generator, emit comparison
|
||||
```
|
||||
|
||||
Shared `Generator` interface so the harness can swap implementations:
|
||||
|
||||
```go
|
||||
type Generator interface {
|
||||
GenerateMoves(b *board.Board, r rack.Rack, mode Mode) []Move // ranked, descending score
|
||||
Name() string
|
||||
}
|
||||
```
|
||||
|
||||
Board, rack, rules and **scoring are shared**; cross-set computation is per-generator
|
||||
(DAWG: probe the dictionary DAWG incl. the non-deterministic left set; GADDAG:
|
||||
deterministic GADDAG walk) — that difference is part of what is measured.
|
||||
|
||||
## Changes to `../dafsa` (additive ⇒ backward compatible)
|
||||
|
||||
1. **Low-level traversal API**: `type Node` (opaque bit-offset); `Root() Node`;
|
||||
`Next(n, ch) (child, final, ok)` (= Gordon's `NextArc`, wraps private `getEdge`);
|
||||
`Arcs(n, fn)` (wraps `getNode`, for blanks/cross-sets); a reusable allocation-free
|
||||
cursor for hot-path `Next`.
|
||||
2. **Custom-alphabet persistence**: `WriteTo`/`SaveWith` (allow non-embedded alphabet);
|
||||
`ReadWith`/`LoadWith` (inject a known indexer, skip language reconstruction).
|
||||
3. (Optional) accurate serialized arc/node count; document that `NumEdges()` is build-time.
|
||||
|
||||
## Data model & compact formats
|
||||
|
||||
- **Byte symbol**: low 6 bits = alphabet index; `0x80` = wildcard/blank (I/O only);
|
||||
`◊` = index `len(alphabet)` (GADDAG graph only).
|
||||
- **Board**: `[]byte`, row-major. `0` = empty; occupied = `letterIndex+1`; blank =
|
||||
`(letterIndex+1) | 0x80`. Helpers: `At`, `Set`, `Transpose`, premium lookup.
|
||||
- **Rack**: `[]byte` counts, length `alphabetSize+1`; last slot = blank count.
|
||||
- **`Play`**: `{Row, Col; Dir; Tiles []byte (0x80 flags); Main; CrossWords; Score;
|
||||
Breakdown}` — input for apply/validate/score and the output element.
|
||||
- **Modes**: `Both`, `Horizontal`, `Vertical`.
|
||||
|
||||
## Staged implementation
|
||||
|
||||
- **Stage 0** — Scaffolding & docs: `ALGORITHM.md`, `PLAN.md`, `dictionaries/` submodule,
|
||||
`go.mod` replace, `.gitignore`.
|
||||
- **Stage 1** — dafsa traversal API (shared): `Node`, `Root`, `Next`, `Arcs`, cursor; tests.
|
||||
- **Stage 2** — dafsa custom-alphabet persistence: `SaveWith`/`ReadWith`; round-trip.
|
||||
- **Stage 3** — Shared infra: encoding, board (+transpose), rack, rules (EN; Эрудит stub),
|
||||
scoring, `Generator` interface, `Move`/`Play` types.
|
||||
- **Stage 4** — Dictionary build: `internal/dictdawg` + `internal/gaddag`; `cmd/builddict`
|
||||
caching serialized DAWG **and** GADDAG in `testdata`.
|
||||
- **Stage 5** — Cross-sets: DAWG cross-checks (incl. non-deterministic left set) and GADDAG
|
||||
deterministic cross-sets; validated against each other + brute force on a small lexicon.
|
||||
- **Stage 6** — DAWG generator (`LeftPart`/`ExtendRight`).
|
||||
- **Stage 7** — GADDAG generator (`Gen`/`GoOn`).
|
||||
- **Stage 8** — Correctness gate: DAWG and GADDAG identical move sets on random positions
|
||||
(each move once) + brute force on a tiny dictionary. Must pass before perf comparison.
|
||||
- **Stage 9** — Self-play stress test: `selfplay` engine (bag, racks, greedy policy,
|
||||
seeded RNG, end conditions); `cmd/stress` plays N games per generator measuring time,
|
||||
arcs, allocations (`runtime.MemStats`), peak RSS (`/proc/self/status` VmHWM), footprint;
|
||||
emit `RESULTS.md`.
|
||||
- **Stage 10** — Decision + public API: choose default from `RESULTS.md` (both selectable);
|
||||
finalize `Solver` API, Play↔board constructors, examples.
|
||||
- **Stage 11** — Polish: benchmarks, README, optional prebuilt-graph distribution.
|
||||
|
||||
## Verification
|
||||
|
||||
- `go test ./...`, `go vet`, lint green per stage.
|
||||
- Mutual oracle (Stage 8): identical move sets; brute force on a tiny dictionary.
|
||||
- Build EN structures from the SOWPODS submodule via `cmd/builddict`; run `GenerateMoves`
|
||||
on canonical positions (e.g. Gordon's "CARE on ABLE") and assert top moves/scores.
|
||||
- Run `cmd/stress` (100–1000 seeded games per generator) → `RESULTS.md`.
|
||||
|
||||
## Assumptions & caveats
|
||||
|
||||
- Both algorithms ship; the production default is decided by the stress test. Both remain
|
||||
selectable.
|
||||
- Self-play policy defaults to **greedy** (deterministic tie-break, seeded RNG); tunable.
|
||||
- Separator = real 27th token (`◊`, index = size, `cbits=5`); `0x40` reserved on the board.
|
||||
Wildcard/blank = `0x80`, never in the graph.
|
||||
- **Stateless per-call** generation in v1 (anchors + cross-sets recomputed per call);
|
||||
incremental maintenance is a later optimization (both generators run stateless — a
|
||||
fairness note for the comparison).
|
||||
- Persistence stores only the graph; the (custom) alphabet is injected on load.
|
||||
- Russian "Эрудит" alphabet specifics (Е/Ё handling, tile values/counts) resolved at
|
||||
Stage 3/4; "one word per move, H or V" is satisfied by the modes.
|
||||
- The final-flag GADDAG is larger than Gordon's letter-set form; letter-sets-on-arcs
|
||||
remain a possible future optimization.
|
||||
Reference in New Issue
Block a user