Implement Scrabble move generator (DAWG) with English and Russian rules

A Go library that returns every legal play ranked by score and scores or
validates plays, using the Appel-Jacobson DAWG algorithm over
github.com/iliadenisov/dafsa v1.1.0.

- DAWG move generation (across / down / both), full tournament scoring with a
  per-tile breakdown; public Solver: GenerateMoves (ranked), ScorePlay,
  ValidatePlay.
- Rulesets: English Scrabble, Russian Scrabble, Эрудит (parameterizable Ruleset).
- cmd/builddict (build the DAWG from the dictionaries submodule), cmd/stress
  (self-play benchmark), selfplay engine; brute-force test oracle.
- A GADDAG was implemented, benchmarked and removed (the DAWG was smaller and
  faster for a scoring solver); see RESULTS.md and ALGORITHM.md.
This commit is contained in:
Ilia Denisov
2026-06-01 16:07:32 +02:00
parent f51a1fe2f2
commit 15c7959d96
43 changed files with 3406 additions and 0 deletions
+164
View File
@@ -0,0 +1,164 @@
# Scrabble move generation — algorithm reference (single source of truth)
This is the authoritative description of the algorithm the solver implements: the DAWG
move generator of Appel & Jacobson. It is distilled from the source paper (extracted
from the PDF in this repo) plus our adaptation. **Work from this file; do not re-parse
the PDFs.**
Source: **[AJ88]** A. Appel, G. Jacobson, *The World's Fastest Scrabble Program*,
Commun. ACM 31(5):572578, 585 (1988).
> A GADDAG (Gordon, 1994) was also implemented and benchmarked against the DAWG, then
> **removed**: for a scoring solver it was ~7× larger and no faster. See `RESULTS.md`.
Notation: letters are alphabet **indexes** (English `a..z``0..25`). "Across" =
horizontal play; "down" = vertical. The board grid and all I/O use the byte encoding in
`internal/encoding` (low 6 bits = letter+1, `0` = empty, bit 7 = blank).
---
## 1. Reduction to one dimension [AJ88 §3.1]
Every play is either **across** (one row) or **down** (one column). Down plays are
across plays on the **transposed** board, so the generator implements only "across" and
runs on the board and/or its transpose. This is the two-mode requirement:
- `OnlyHorizontal` = across on the board.
- `OnlyVertical` = across on the transpose.
- `Both` = both. (A move found on the transpose has its coordinates swapped back.)
### Anchors [AJ88 §3.1.2]
An across word must include a newly-placed tile adjacent to a tile already on the board.
Generation starts only from **anchors** — empty squares orthogonally adjacent to a
filled square — which guarantees connectivity and prunes most of the search. The very
first move has no anchors and is special-cased through the board's centre square.
### Cross-checks (cross-sets) [AJ88 §3.1.1]
For each empty square the **cross-set** is the set of letters that, placed there, form a
legal **perpendicular** word with the tiles above/below it; it is stored as a bit-vector
(`letterSet`). A square with no perpendicular neighbour allows every letter. Cross-sets
let move generation stay one-dimensional. Off-board squares are treated as blocking
sentinels.
---
## 2. Lexicon: trie → DAWG [AJ88 §3.2]
The lexicon is a trie whose edges are labelled by letters; a word is a root-to-node
path; terminal nodes are marked. Minimizing the trie (merging equivalent sub-tries)
yields a **DAWG**, the minimum-state automaton for the word set. `dafsa` is exactly such
a minimized, bit-packed DAWG, with a compact ≤6-bit alphabet and a **per-node** `final`
flag. Its node identity is a bit-offset; edges of a node are sorted by letter.
---
## 3. Move generation — [AJ88 §3.3] (verbatim)
Two phases per anchor: place a **left part**, then **extend right**. The left part is
either tiles already on the board (read them, then `ExtendRight` from the corresponding
node) or rack tiles placed by a pruned DAWG traversal bounded by `limit` = number of
non-anchor squares to the left (this bound, reset at each anchor, makes each move appear
once).
```
ExtendRight(PartialWord, node N, square):
if square is vacant then
if N is terminal then LegalMove(PartialWord)
for each edge E out of N:
let l = letter of E
if l is in the rack and l is in the cross-set of square then
remove a tile l from the rack
ExtendRight(PartialWord || l, node(E), square+1)
put the tile l back into the rack
else
let l = letter occupying square
if N has an edge labelled l leading to N' then
ExtendRight(PartialWord || l, N', square+1)
LeftPart(PartialWord, node N, limit):
ExtendRight(PartialWord, N, AnchorSquare)
if limit > 0 then
for each edge E out of N:
let l = letter of E
if l is in the rack then
remove a tile l from the rack
LeftPart(PartialWord || l, node(E), limit - 1)
put the tile l back into the rack
```
To generate from an anchor with `k` non-anchor squares to its left: `LeftPart("", root,
k)`. Implementation notes:
- A word is **recorded only past the anchor** (`square > anchor`), so every recorded play
covers the anchor square — the connection guaranteed by the anchor's filled neighbour.
- **Empty prefix** when a tile sits left of the anchor: skip `LeftPart`; `ExtendRight`
directly from the node reached by walking the on-board left context.
- **Blanks**: when scanning the rack for a letter, also allow a blank to stand for it;
the placed tile is flagged (bit 7) and scores 0; restore on backtrack.
- **First move**: the only anchor is the centre; the left limit equals its column, so the
word covers the centre.
This maps onto the `dafsa` traversal API: `Cursor.Root`, `Cursor.Next(node, letter)`
(an edge, with the destination's `final` flag), `Cursor.Final(node)`, and `Cursor.Arcs`
(enumerate a node's edges, used for placements and cross-sets).
---
## 4. Cross-set computation
For a square with tiles `above` (top→bottom) and `below`, the cross-set is
`{ X : above·X·below ∈ dict }`:
- **Right extension** (no `below`): deterministic — `X` just completes the prefix
`above`. Walk `above` to a node, then read the **completers** (one arc enumeration:
the letters whose arc leads straight to an accepting node).
- **Left extension** (tiles `below`): non-deterministic — probe each `X` (walk `above`,
`X`, then `below`, test acceptance). This is the asymmetry inherent to a left-to-right
DAWG.
Cross-sets are recomputed per generation for the squares that need them (cached within a
call). Scoring is done separately by `Evaluate` (§5), so cross-sums are not precomputed.
---
## 5. Scoring (full tournament rules + breakdown)
A play's score = main word + every cross-word formed + the all-tiles bonus. Per word:
- Sum `tileValue(letter)` over its tiles; a **blank** scores 0.
- A **letter** premium (DL/TL) multiplies the value of a tile placed on it **only when
newly placed** this turn.
- A **word** premium (DW/TW) multiplies the whole word **only when a newly-placed tile
sits on it**; multiple word premiums multiply.
- Each cross-word counts its new tile plus the existing perpendicular run.
The all-tiles bonus is added when the play uses a full rack. Board geometry, premium
layout, tile values/counts, blank count, rack size and the bonus are all part of the
`rules.Ruleset` (`English`, `RussianScrabble`, `Erudit`). `Evaluate` returns the main
word, the cross-words and a per-tile breakdown. `ValidatePlay` adds dictionary and
connectivity checks.
---
## 6. Rulesets
- **English** Scrabble: 15×15, standard premiums, 26 letters, 100 tiles (98 + 2 blanks),
7-tile rack, 50-point bonus.
- **Russian Scrabble**: 33-letter alphabet (incl. Ё), standard board, 104 tiles, 50-bonus.
- **Эрудит**: 33-letter alphabet with **Ё unused** (no tile — fold Ё→Е when preparing the
dictionary, `wordlist.FoldYo`; an out-of-engine step), the **centre does not double**,
131 tiles (128 + 3 blanks), blanks score 0, **15-point bonus**.
---
## 7. Special cases checklist
- **First move**: no anchors; the play must cover the centre.
- **Blanks**: any letter, score 0, flagged via bit 7; expand to every cross-set-allowed
letter during generation.
- **Off-board sentinels**: stop extension at the edge.
- **A single newly-placed tile** can form both an across and a down word.
- **Dedup**: each legal move is generated once (anchor + left-part limit); a canonical
move key guards against any residual duplicate.