Files
Ilia Denisov 15c7959d96 Implement Scrabble move generator (DAWG) with English and Russian rules
A Go library that returns every legal play ranked by score and scores or
validates plays, using the Appel-Jacobson DAWG algorithm over
github.com/iliadenisov/dafsa v1.1.0.

- DAWG move generation (across / down / both), full tournament scoring with a
  per-tile breakdown; public Solver: GenerateMoves (ranked), ScorePlay,
  ValidatePlay.
- Rulesets: English Scrabble, Russian Scrabble, Эрудит (parameterizable Ruleset).
- cmd/builddict (build the DAWG from the dictionaries submodule), cmd/stress
  (self-play benchmark), selfplay engine; brute-force test oracle.
- A GADDAG was implemented, benchmarked and removed (the DAWG was smaller and
  faster for a scoring solver); see RESULTS.md and ALGORITHM.md.
2026-06-01 16:07:32 +02:00

165 lines
7.4 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Scrabble move generation — algorithm reference (single source of truth)
This is the authoritative description of the algorithm the solver implements: the DAWG
move generator of Appel & Jacobson. It is distilled from the source paper (extracted
from the PDF in this repo) plus our adaptation. **Work from this file; do not re-parse
the PDFs.**
Source: **[AJ88]** A. Appel, G. Jacobson, *The World's Fastest Scrabble Program*,
Commun. ACM 31(5):572578, 585 (1988).
> A GADDAG (Gordon, 1994) was also implemented and benchmarked against the DAWG, then
> **removed**: for a scoring solver it was ~7× larger and no faster. See `RESULTS.md`.
Notation: letters are alphabet **indexes** (English `a..z``0..25`). "Across" =
horizontal play; "down" = vertical. The board grid and all I/O use the byte encoding in
`internal/encoding` (low 6 bits = letter+1, `0` = empty, bit 7 = blank).
---
## 1. Reduction to one dimension [AJ88 §3.1]
Every play is either **across** (one row) or **down** (one column). Down plays are
across plays on the **transposed** board, so the generator implements only "across" and
runs on the board and/or its transpose. This is the two-mode requirement:
- `OnlyHorizontal` = across on the board.
- `OnlyVertical` = across on the transpose.
- `Both` = both. (A move found on the transpose has its coordinates swapped back.)
### Anchors [AJ88 §3.1.2]
An across word must include a newly-placed tile adjacent to a tile already on the board.
Generation starts only from **anchors** — empty squares orthogonally adjacent to a
filled square — which guarantees connectivity and prunes most of the search. The very
first move has no anchors and is special-cased through the board's centre square.
### Cross-checks (cross-sets) [AJ88 §3.1.1]
For each empty square the **cross-set** is the set of letters that, placed there, form a
legal **perpendicular** word with the tiles above/below it; it is stored as a bit-vector
(`letterSet`). A square with no perpendicular neighbour allows every letter. Cross-sets
let move generation stay one-dimensional. Off-board squares are treated as blocking
sentinels.
---
## 2. Lexicon: trie → DAWG [AJ88 §3.2]
The lexicon is a trie whose edges are labelled by letters; a word is a root-to-node
path; terminal nodes are marked. Minimizing the trie (merging equivalent sub-tries)
yields a **DAWG**, the minimum-state automaton for the word set. `dafsa` is exactly such
a minimized, bit-packed DAWG, with a compact ≤6-bit alphabet and a **per-node** `final`
flag. Its node identity is a bit-offset; edges of a node are sorted by letter.
---
## 3. Move generation — [AJ88 §3.3] (verbatim)
Two phases per anchor: place a **left part**, then **extend right**. The left part is
either tiles already on the board (read them, then `ExtendRight` from the corresponding
node) or rack tiles placed by a pruned DAWG traversal bounded by `limit` = number of
non-anchor squares to the left (this bound, reset at each anchor, makes each move appear
once).
```
ExtendRight(PartialWord, node N, square):
if square is vacant then
if N is terminal then LegalMove(PartialWord)
for each edge E out of N:
let l = letter of E
if l is in the rack and l is in the cross-set of square then
remove a tile l from the rack
ExtendRight(PartialWord || l, node(E), square+1)
put the tile l back into the rack
else
let l = letter occupying square
if N has an edge labelled l leading to N' then
ExtendRight(PartialWord || l, N', square+1)
LeftPart(PartialWord, node N, limit):
ExtendRight(PartialWord, N, AnchorSquare)
if limit > 0 then
for each edge E out of N:
let l = letter of E
if l is in the rack then
remove a tile l from the rack
LeftPart(PartialWord || l, node(E), limit - 1)
put the tile l back into the rack
```
To generate from an anchor with `k` non-anchor squares to its left: `LeftPart("", root,
k)`. Implementation notes:
- A word is **recorded only past the anchor** (`square > anchor`), so every recorded play
covers the anchor square — the connection guaranteed by the anchor's filled neighbour.
- **Empty prefix** when a tile sits left of the anchor: skip `LeftPart`; `ExtendRight`
directly from the node reached by walking the on-board left context.
- **Blanks**: when scanning the rack for a letter, also allow a blank to stand for it;
the placed tile is flagged (bit 7) and scores 0; restore on backtrack.
- **First move**: the only anchor is the centre; the left limit equals its column, so the
word covers the centre.
This maps onto the `dafsa` traversal API: `Cursor.Root`, `Cursor.Next(node, letter)`
(an edge, with the destination's `final` flag), `Cursor.Final(node)`, and `Cursor.Arcs`
(enumerate a node's edges, used for placements and cross-sets).
---
## 4. Cross-set computation
For a square with tiles `above` (top→bottom) and `below`, the cross-set is
`{ X : above·X·below ∈ dict }`:
- **Right extension** (no `below`): deterministic — `X` just completes the prefix
`above`. Walk `above` to a node, then read the **completers** (one arc enumeration:
the letters whose arc leads straight to an accepting node).
- **Left extension** (tiles `below`): non-deterministic — probe each `X` (walk `above`,
`X`, then `below`, test acceptance). This is the asymmetry inherent to a left-to-right
DAWG.
Cross-sets are recomputed per generation for the squares that need them (cached within a
call). Scoring is done separately by `Evaluate` (§5), so cross-sums are not precomputed.
---
## 5. Scoring (full tournament rules + breakdown)
A play's score = main word + every cross-word formed + the all-tiles bonus. Per word:
- Sum `tileValue(letter)` over its tiles; a **blank** scores 0.
- A **letter** premium (DL/TL) multiplies the value of a tile placed on it **only when
newly placed** this turn.
- A **word** premium (DW/TW) multiplies the whole word **only when a newly-placed tile
sits on it**; multiple word premiums multiply.
- Each cross-word counts its new tile plus the existing perpendicular run.
The all-tiles bonus is added when the play uses a full rack. Board geometry, premium
layout, tile values/counts, blank count, rack size and the bonus are all part of the
`rules.Ruleset` (`English`, `RussianScrabble`, `Erudit`). `Evaluate` returns the main
word, the cross-words and a per-tile breakdown. `ValidatePlay` adds dictionary and
connectivity checks.
---
## 6. Rulesets
- **English** Scrabble: 15×15, standard premiums, 26 letters, 100 tiles (98 + 2 blanks),
7-tile rack, 50-point bonus.
- **Russian Scrabble**: 33-letter alphabet (incl. Ё), standard board, 104 tiles, 50-bonus.
- **Эрудит**: 33-letter alphabet with **Ё unused** (no tile — fold Ё→Е when preparing the
dictionary, `wordlist.FoldYo`; an out-of-engine step), the **centre does not double**,
131 tiles (128 + 3 blanks), blanks score 0, **15-point bonus**.
---
## 7. Special cases checklist
- **First move**: no anchors; the play must cover the centre.
- **Blanks**: any letter, score 0, flagged via bit 7; expand to every cross-set-allowed
letter during generation.
- **Off-board sentinels**: stop extension at the edge.
- **A single newly-placed tile** can form both an across and a down word.
- **Dedup**: each legal move is generated once (anchor + left-part limit); a canonical
move key guards against any residual duplicate.