
# Russian word-list preparation (`tools`)

Builds the Russian **noun** word list for the Scrabble/Эрудит solver from the official
Russian academic **orthographic dictionary**, cross-checked against two independent
morphological dictionaries.

The goal of the pipeline is a list of **common nouns in the nominative singular**
(`sources/scrabble_ru/scrabble.txt`), plus an ambiguous tail for manual review.

> This directory is self-contained tooling for *building* the word list. It is not part
> of the solver library. The committed result lives in `sources/scrabble_ru/`.

## Source

`orfo_dict_2025.pdf` — *Русский орфографический словарь РАН* (≈ 200 000 entries), the
authority for **spelling**. It encodes declension type in its grammatical notes but does
**not** reliably mark part of speech.

- Source: <https://ruslang.ru/sites/default/files/doc/normativnyje_slovari/orfograficheskij_slovar.pdf>
- Mirror: <https://rus-gos.spbu.ru/index.php/dictionary>

The PDF is git-ignored (large, third-party); place it here as `orfo_dict_2025.pdf`. Its
pdftotext output is committed as `sources/scrabble_ru/orfo_dict_2025.txt`, so the word list
rebuilds from the text alone; the binary PDF is needed only to regenerate that text.

## Outputs (`sources/scrabble_ru/`)

The committed result is **four** files; every other bucket stays in the memory of the
Stage-2 process (write them out with `--dump`, query a single word with `--trace WORD`).

| File | Committed | Meaning |
|------|:--:|---------|
| `orfo_dict_2025.txt` | ✓ | the pdftotext output — the parsed source of truth (the PDF binary is not needed to rebuild). |
| `all.txt` | ✓ | Stage 1 base: every clean Cyrillic headword/variant; a plural headword with a singular is replaced by that singular. |
| `manual_confirm.txt` | ✓ | hand-reviewed nouns from the undefined tail; Stage 2 ("the brain") merges them into the result. |
| `scrabble.txt` | ✓ | **Stage 2 result**: common nouns, nominative singular (+ pluralia tantum), length 2–15 — the working dictionary. |
| `undefined.txt` | — | the ambiguous tail; kept in memory, written only with `--dump`. |

`--dump` also writes `adjectives.txt`, `verbs.txt`, `singulars.txt` and `fate.tsv` (every
word with the reason it did or did not reach the dictionary); these are git-ignored debug
artifacts. Stage 1 also writes `/tmp/ru_{skip,singulars,variants}.txt`, intermediate inputs
the brain consumes.

## Prerequisites

```sh
# 1. pdftotext (Poppler)
sudo apt-get install -y poppler-utils

# 2. Go toolchain (Stage 1) — already required by the parent module

# 3. Python + the OpenCorpora analyser (Stage 2)
sudo apt-get install -y python3-venv python3-pip
python3 -m venv ru-venv
ru-venv/bin/pip install mawo-pymorphy3            # bundles OpenCorpora 2025 (words.dawg)

# 4. libmorph — the independent morphological dictionary (Stage 2 cross-check)
sudo apt-get install -y morphrus morphrus-dev moonycode-dev morphapi-dev
g++ -std=c++17 -O2 tools/libmorph_check.cpp -lmorphrus -lmoonycode -o tools/libmorph_check
```

If `tools/libmorph_check` is absent, Stage 2 still runs — it simply drops libmorph from
the stack and reports `libmorph_helper=MISSING`.

## How to run

```sh
# Stage 0 — PDF -> plain text (committed as the source of truth; run once)
pdftotext tools/orfo_dict_2025.pdf sources/scrabble_ru/orfo_dict_2025.txt

# Stage 1 — build the base word list (Go): sources/scrabble_ru/all.txt + /tmp/ru_*.txt
go run ./tools/ruwords

# Stage 2 — the brain (Python + mawo + libmorph): writes scrabble.txt
ru-venv/bin/python tools/ru_stage2.py

# ask how a word did or did not reach the dictionary
ru-venv/bin/python tools/ru_stage2.py --trace травмпункт
# also write the in-memory buckets (undefined, adjectives, verbs, singulars, fate.tsv)
ru-venv/bin/python tools/ru_stage2.py --dump
```

The `ruwords` flags `-from`/`-to` (defaults 452 and 168808) bound the word-list section of
`sources/scrabble_ru/orfo_dict_2025.txt`: line 452 is the first entry (`а1, …`) and line
168808 is the last (`я́щурный`). The preface above line 452 is prose and is skipped. Verify
these bounds whenever the PDF is re-exported.
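
One way to verify the bounds after a re-export is to confirm that the two boundary lines still begin with the expected headwords. The sketch below is not part of the tooling: `first_token`, `load_lines` and `check_bounds` are hypothetical helpers, and it assumes the line numbers are 1-indexed and that stress is encoded as a combining accent.

```python
# Combining grave (U+0300) and acute (U+0301) accents, mapped to "delete".
STRESS = {0x0300: None, 0x0301: None}


def first_token(line: str) -> str:
    """Return the leading token of a dictionary line with stress marks removed."""
    stripped = line.lstrip()  # also drops the form-feed pdftotext emits at page tops
    if not stripped:
        return ""
    return stripped.split()[0].rstrip(",").translate(STRESS)


def load_lines(path: str) -> list:
    """Read the pdftotext output, splitting on '\\n' only.

    str.splitlines() would also split on form-feeds and shift the line numbers.
    """
    with open(path, encoding="utf-8") as f:
        return f.read().split("\n")


def check_bounds(lines, first=452, last=168808,
                 first_head="а1", last_head="ящурный") -> bool:
    """True if the 1-indexed lines `first` and `last` start with the expected headwords."""
    if first < 1 or len(lines) < last:
        return False
    return (first_token(lines[first - 1]) == first_head
            and first_token(lines[last - 1]) == last_head)
```

Calling `check_bounds(load_lines("sources/scrabble_ru/orfo_dict_2025.txt"))` from the repository root should return `True` while the defaults still match the text. A `False` means `-from`/`-to` need adjusting.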

## Algorithm

### Stage 1 — `ruwords` (Go)

For each dictionary line in `[from, to]`, Stage 1 collects the tokens listed below. Each
token is normalised: stress marks (U+0300, U+0301) are stripped, the word is lowercased,
`ё` is kept, and hyphenated, capitalised or non-Cyrillic tokens are rejected.

- the **headword** (leading token). Leading whitespace including the form-feed `\f`
  pdftotext puts at every page top is trimmed — otherwise the first headword of each page
  is lost;
- the **singular of a plural headword** when the entry gives it after `ед.`, in full
  (`ящеры, …, ед. ящер`) or as a replacement suffix (`…, ед. -вец`, spliced where the
  suffix best overlaps the headword); the plural is then dropped (a plural that has a
  singular is never needed) and the singular is also recorded (`/tmp/ru_singulars.txt`);
- **variant headwords** after `и` that carry their own grammatical note
  (`аблатив, -а и аблятив, -а`; `регги и реггей, нескл.`), excluding inflected forms.
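
The normalisation and the suffix splice can be illustrated in Python. This is a hedged sketch of the logic described here, not a port of the Go code: `normalise` and `splice_singular` are hypothetical names, the order of the checks is an assumption, and "best overlap" is approximated as the longest prefix of the suffix found at its right-most position in the headword.

```python
import re
from typing import Optional

STRESS_MARKS = {0x0300: None, 0x0301: None}  # combining grave and acute accents
CYRILLIC_WORD = re.compile(r"[а-яё]+")       # ё (U+0451) lies outside the а-я range


def normalise(token: str) -> Optional[str]:
    """Return the cleaned word, or None if the token is rejected."""
    word = token.translate(STRESS_MARKS)
    if not word or word[0].isupper():  # capitalised tokens are rejected
        return None
    word = word.lower()
    # Hyphens, Latin letters, digits and punctuation all fail this match.
    return word if CYRILLIC_WORD.fullmatch(word) else None


def splice_singular(plural: str, suffix: str) -> Optional[str]:
    """Rebuild a singular from a replacement suffix such as '-вец'."""
    tail = suffix.lstrip("-")
    # Try ever shorter prefixes of the suffix; splice at the right-most hit.
    for k in range(len(tail), 0, -1):
        pos = plural.rfind(tail[:k])
        if pos > 0:
            return plural[:pos] + tail
    return None  # no overlap: leave the entry for manual review


if __name__ == "__main__":
    print(normalise("я\u0301щер"))              # stress mark removed
    print(splice_singular("литовцы", "-вец"))   # illustrative input
```

With the illustrative input `литовцы` and suffix `-вец`, the only overlap is the letter `в`, so the splice yields `литовец`. Real entries may need a stricter overlap rule than this approximation.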

Everything else (every maximal Cyrillic token not selected above) goes to
`/tmp/ru_skip.txt`, a safety net for a later morphology re-check.

### Stage 2 — `ru_stage2.py` (Python)

Each Stage-1 word (length 2–15) is routed by three sources, most authoritative first
(steps 1–3); a final variant-rescue pass (step 4) then recovers alternate spellings:

1. **OpenCorpora** (`words.dawg`, read directly — *not* the predictor): a common-noun
   reading ⇒ keep the OpenCorpora lemma. The full OpenCorpora common-noun lexicon is also
   added (so nouns absent from the PDF are included).
2. **libmorph** (independent dictionary, via `libmorph_check`): a common-noun reading ⇒
   keep the libmorph lemma. The two dictionaries are treated as **complementary** — a noun
   reading in *either* is enough (their disagreements were reviewed and resolved this way,
   since each is incomplete in different places). A singular reconstructed from "ед." that
   neither dictionary knows is accepted as a noun (the orthographic note attests it).
3. A word **both dictionaries miss** is classified by the orthographic **note**
   (`-ая, -ое` ⇒ adjective; `-ть`, `сов./несов.` ⇒ verb; single genitive `-а/-и` or
   `нескл., м./ж./с.` ⇒ noun). A note-noun goes straight to `scrabble.txt`; an adjective or
   verb is dropped; anything undecided goes to `undefined.txt`.
4. **Variant rescue**: when the dictionary joins two spellings with "и" (`травмопункт и
   травмпункт`, `регги и реггей`) and one is already a confirmed noun, the other is moved
   from review/undefined into the result as well, propagated transitively through chains.
   The plural-form variants the dictionaries already resolve never reach this step.
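
The routing in steps 1–3 can be summarised as a small decision function. The sketch below is a simplification under stated assumptions: `classify_note` and `route` are hypothetical names, the note patterns cover only the examples listed above, and words that a dictionary knows only as a non-noun are outside its scope.

```python
import re

ADJECTIVE = re.compile(r"-ая,\s*-ое")
VERB = re.compile(r"-ть\b|(?:не)?сов\.")
NOUN_GENITIVE = re.compile(r"-(?:а|и)")          # a single genitive ending
NOUN_INDECL = re.compile(r"нескл\.,\s*[мжс]\.")  # indeclinable with a gender mark


def classify_note(note: str) -> str:
    """Guess the part of speech from the orthographic note alone."""
    note = note.strip()
    if ADJECTIVE.search(note):
        return "adjective"
    if VERB.search(note):
        return "verb"
    if NOUN_GENITIVE.fullmatch(note) or NOUN_INDECL.search(note):
        return "noun"
    return "undefined"


def route(oc_noun: bool, lm_noun: bool, from_ed_note: bool, note: str) -> str:
    """Return the bucket for one word.

    oc_noun / lm_noun: the dictionary has a common-noun reading.
    from_ed_note: the word is a singular reconstructed from an "ед." note.
    """
    if oc_noun or lm_noun:
        return "scrabble"   # a noun reading in either dictionary is enough
    if from_ed_note:
        return "scrabble"   # attested by the orthographic note
    kind = classify_note(note)
    if kind == "noun":
        return "scrabble"
    if kind in ("adjective", "verb"):
        return "dropped"
    return "undefined"
```

Treating the two dictionaries symmetrically in the first branch mirrors the "complementary" policy: neither can veto a noun reading found by the other.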

The nominative singular always comes from the dictionary that recognised the word, or from
the orthographic `ед.` note — never from a predictor guess (libmorph and the predictor
mis-lemmatise out-of-dictionary words, e.g. `витебчане → витебчан` instead of `витебчанин`).
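
The transitive propagation in the variant rescue (step 4) amounts to a breadth-first walk over "и" pairs. The following is a minimal sketch with hypothetical names (`rescue_variants`, and the third spelling `рэгги` is invented to show a chain); it assumes only words still in the undefined bucket are eligible for rescue.

```python
from collections import defaultdict, deque


def rescue_variants(confirmed, undefined, pairs):
    """Return the undefined words reachable from a confirmed noun via variant pairs."""
    graph = defaultdict(set)
    for a, b in pairs:          # "a и b" links both spellings
        graph[a].add(b)
        graph[b].add(a)

    rescued = set()
    queue = deque(w for w in confirmed if w in graph)
    while queue:
        word = queue.popleft()
        for other in graph[word]:
            # Only undecided words are promoted; a rescued word can rescue others.
            if other in undefined and other not in rescued:
                rescued.add(other)
                queue.append(other)
    return rescued


if __name__ == "__main__":
    pairs = [("травмопункт", "травмпункт"),
             ("регги", "реггей"),
             ("реггей", "рэгги")]          # hypothetical chain
    confirmed = {"травмопункт", "регги"}
    undefined = {"травмпункт", "реггей", "рэгги", "прочее"}
    print(sorted(rescue_variants(confirmed, undefined, pairs)))
```

Because a chain stops at any word outside the undefined bucket, a spelling already classified as an adjective or verb does not pass the noun status on.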

### The libmorph bridge — `libmorph_check.cpp`

libmorph (A. Kovalenko, MIT) ships as `libmorphrus.so`. `libmorph_check` is a thin
stdin→stdout filter: one UTF-8 word per line in, one line out:

```
<known>\t<pos>:<lemma>\t<pos>:<lemma>...
```

`<known>` is `CheckWord` (1 = in the dictionary). `<pos>` is `wdInfo & 0x3f`, the part of
speech. The codes were reverse-engineered (the docs omit the table):

| codes | part of speech |
|------|----------------|
| **7–21, 24** | **noun** (all genders / declensions / animacy; pluralia tantum is 24) |
| 1–3 | verb |
| 25, 27 | adjective |
| 28–32 | pronoun |
| 33–36 | numeral |
| 38–39 | **proper noun** (excluded) |
| 48–58 | comparative/adverb |
| 49–53 | function words |

The analyser instance is requested with the key `libmorph.api.v4:utf-8` so words are
passed and lemmas returned in UTF-8.
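
On the Python side, one output line of the helper can be parsed as sketched below. The function names are hypothetical, and the sketch assumes `<pos>` is printed as a decimal integer and that lemmas contain no tab or colon characters.

```python
NOUN_CODES = set(range(7, 22)) | {24}  # 7-21 and 24: common nouns
PROPER_CODES = {38, 39}                # proper nouns, excluded from the list


def parse_line(line: str):
    """Split one helper output line into (known, [(pos, lemma), ...])."""
    fields = line.rstrip("\n").split("\t")
    known = fields[0] == "1"
    readings = []
    for field in fields[1:]:
        if not field:
            continue
        pos, _, lemma = field.partition(":")
        readings.append((int(pos), lemma))
    return known, readings


def common_noun_lemmas(line: str):
    """Return the lemmas of all common-noun readings, or [] for unknown words."""
    known, readings = parse_line(line)
    if not known:
        return []  # out-of-dictionary guesses are not trusted
    return [lemma for pos, lemma in readings if pos in NOUN_CODES]


if __name__ == "__main__":
    # Illustrative line: one common-noun reading and one proper-noun reading.
    print(common_noun_lemmas("1\t9:ящер\t38:Ящер"))
```

Ignoring every reading when `<known>` is 0 keeps guessed lemmas for out-of-dictionary words out of the result.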

## Notes & caveats

- The hard tail (≈ 35 000 Stage-1 candidate words) is in **no** morphological
  dictionary; only the orthographic dictionary attests them, so the PDF note is the sole
  signal there. Compound and very recent nouns (`робототехник`, `толкинист`) live here.
- OpenCorpora and libmorph are near-equal in size (≈ 99 500 words each on `all.txt`)
  and ≈ 96 % overlapping, but **complementary** (each contributes ≈ 2 200 unique nouns),
  which is why both are kept. The mawo *predictor* "knows" ~98 % of everything by guessing
  and is therefore used only as a weak confirming vote, never as dictionary membership.
- Licensing: OpenCorpora data is CC BY-SA 3.0; libmorph is MIT; the orthographic
  dictionary has its own copyright. A list derived from CC BY-SA data inherits that licence.