Initial dictionary producer: builddict, word-list sources, DAWG build + CI

- builddict drives the de-internalized scrabble-solver dictdawg/wordlist builders
  (pinned v1.0.0) to produce the three DAWGs (en_sowpods, ru_scrabble, ru_erudit),
  byte-identical to the solver's committed fixtures (same dafsa/alphabet v1.1.0 -> no
  index drift with the running backend).
- Sources: english/sowpods.txt vendored from kamilmielnik/scrabble-dictionaries;
  russian/scrabble.txt + the dictprep tooling moved out of scrabble-solver.
- CI builds the DAWGs on push/PR and, on a vX.Y.Z tag, packages them flat into
  scrabble-dawg-<tag>.tar.gz and attaches it to the Gitea release.
Ilia Denisov
2026-06-04 19:18:19 +02:00
commit d04470b741
15 changed files with 352547 additions and 0 deletions
@@ -0,0 +1,62 @@
name: build
# Builds the dictionary DAWGs on every push/PR (validation) and, on a vX.Y.Z tag,
# packages them flat into scrabble-dawg-<tag>.tar.gz and attaches it to the Gitea release.
# The build pins the published scrabble-solver builders (GOPRIVATE -> direct VCS fetch from
# this Gitea), so the on-disk format matches the running backend exactly.
on:
  push:
    branches: [master]
    tags: ['v*']
  pull_request:
    branches: [master]

jobs:
  dawg:
    runs-on: ubuntu-latest
    defaults:
      run:
        shell: bash
    env:
      GOPRIVATE: gitea.iliadenisov.ru/*
    steps:
      - name: Checkout
        uses: actions/checkout@v4

      - name: Set up Go
        uses: actions/setup-go@v5
        with:
          go-version-file: go.mod
          cache: true

      - name: Build DAWGs
        run: |
          make dawg
          ls -la dawg/
          for f in en_sowpods ru_scrabble ru_erudit; do
            test -s "dawg/$f.dawg" || { echo "missing dawg/$f.dawg"; exit 1; }
          done

      - name: Package and publish release artifact
        if: startsWith(github.ref, 'refs/tags/v')
        env:
          TOKEN: ${{ github.token }}
          API: ${{ github.server_url }}/api/v1/repos/${{ github.repository }}
        run: |
          set -eo pipefail
          tag="${GITHUB_REF_NAME}"
          art="scrabble-dawg-${tag}.tar.gz"
          tar czf "$art" -C dawg en_sowpods.dawg ru_scrabble.dawg ru_erudit.dawg
          # Create the release (or fetch it if it already exists), then upload the asset.
          code=$(curl -sS -o /tmp/rel.json -w '%{http_code}' -X POST "$API/releases" \
            -H "Authorization: token $TOKEN" -H 'Content-Type: application/json' \
            -d "{\"tag_name\":\"$tag\",\"name\":\"$tag\",\"body\":\"Dictionary DAWG set $tag (en_sowpods, ru_scrabble, ru_erudit).\"}")
          if [ "$code" != "201" ]; then
            echo "release POST returned $code; fetching existing release for tag $tag"
            curl -sS -o /tmp/rel.json "$API/releases/tags/$tag" -H "Authorization: token $TOKEN"
          fi
          rel_id=$(python3 -c 'import json;print(json.load(open("/tmp/rel.json"))["id"])')
          curl -sS -X POST "$API/releases/$rel_id/assets?name=$art" \
            -H "Authorization: token $TOKEN" -F "attachment=@$art" -o /tmp/asset.json
          echo "published $art to release $rel_id"
@@ -0,0 +1,3 @@
# Built DAWGs are release artifacts (published by CI on a vX.Y.Z tag), not committed.
/dawg/
/scrabble-dawg-*.tar.gz
@@ -0,0 +1,34 @@
# scrabble-dictionary build helpers.
#
# `make dawg` (re)builds the dictionary DAWGs under dawg/ from their word lists, using the
# published scrabble-solver dictdawg/wordlist builders (pinned in go.mod), so the on-disk
# format and letter indexing match the running backend exactly (no index drift):
# en_sowpods.dawg — English SOWPODS (Latin alphabet)
# ru_scrabble.dawg — Russian Scrabble nouns (Cyrillic, 33 letters)
# ru_erudit.dawg — Эрудит (the same list with Ё→Е folded and de-duped)
#
# The CI workflow packages dawg/*.dawg into a release artifact on a vX.Y.Z tag.
export GOPRIVATE := gitea.iliadenisov.ru/*

GO        ?= go
PYTHON    ?= python3
DAWG_DIR  := dawg
BUILDDICT := $(GO) run ./cmd/builddict

.PHONY: dawg dawg-en dawg-ru dawg-erudit clean-dawg

dawg: dawg-en dawg-ru dawg-erudit

dawg-en:
	$(BUILDDICT) -dict dictionaries/english/sowpods.txt -alphabet latin -name en_sowpods -out $(DAWG_DIR)

dawg-ru:
	$(BUILDDICT) -dict dictprep/russian/scrabble.txt -alphabet russian -name ru_scrabble -out $(DAWG_DIR)

dawg-erudit:
	$(PYTHON) dictprep/fold_yo.py dictprep/russian/scrabble.txt > /tmp/ru_erudit_words.txt
	$(BUILDDICT) -dict /tmp/ru_erudit_words.txt -alphabet russian -name ru_erudit -out $(DAWG_DIR)

clean-dawg:
	rm -f $(DAWG_DIR)/*.dawg
@@ -0,0 +1,44 @@
# scrabble-dictionary
Versioned **dictionary artifacts** for the Scrabble game backend: the word-list sources and
the build pipeline that produces the dictionary DAWGs, published as a **release artifact**
(the DAWGs are data, not a Go module).
The build uses the published
[`scrabble-solver`](https://gitea.iliadenisov.ru/developer/scrabble-solver) `dictdawg`/`wordlist`
packages (pinned in `go.mod`) over `github.com/iliadenisov/{dafsa,alphabet}` (v1.1.0), so the
on-disk format and letter indexing match the running backend **exactly** — there is no index
drift, because the backend pins the same `dafsa`/`alphabet`. The DAWGs this repo builds are
byte-identical to the solver's committed test fixtures.
## Artifact
`make dawg` builds three DAWGs into `dawg/`:
| file | variant | source |
| --- | --- | --- |
| `en_sowpods.dawg` | English (SOWPODS) | `dictionaries/english/sowpods.txt` |
| `ru_scrabble.dawg` | Russian Scrabble | `dictprep/russian/scrabble.txt` |
| `ru_erudit.dawg` | Эрудит | the Russian list with Ё→Е folded (`dictprep/fold_yo.py`) |
The CI (`.gitea/workflows/build.yaml`) builds them on every push/PR and, on a `vX.Y.Z` tag,
packages them flat into `scrabble-dawg-<tag>.tar.gz` and attaches it to the Gitea release. The
backend deploy unpacks that tarball into `BACKEND_DICT_DIR`; **one semver label versions the
whole set** (additive — a new version is a new release, never breaking a running backend).
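A deploy step might sanity-check the tarball before unpacking it. The sketch below is an assumption, not part of this repo: `check_dawg_tarball` is a hypothetical helper that confirms the archive is flat and contains the three DAWGs named above.

```python
import tarfile

# The three files the CI packages flat into the release tarball.
EXPECTED = {"en_sowpods.dawg", "ru_scrabble.dawg", "ru_erudit.dawg"}


def check_dawg_tarball(path):
    """Return the sorted member names if the archive is flat and complete.

    Raises ValueError if a member sits in a subdirectory or a DAWG is missing.
    """
    with tarfile.open(path, "r:gz") as tar:
        names = {m.name for m in tar.getmembers() if m.isfile()}
    nested = sorted(n for n in names if "/" in n)
    if nested:
        raise ValueError(f"archive is not flat: {nested}")
    missing = sorted(EXPECTED - names)
    if missing:
        raise ValueError(f"missing DAWGs: {missing}")
    return sorted(names)
```

Running such a check before the unpack keeps a truncated or mis-packaged release from replacing a working dictionary directory.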
## Sources / provenance
- **English:** `dictionaries/english/sowpods.txt`, vendored from
[`kamilmielnik/scrabble-dictionaries`](https://github.com/kamilmielnik/scrabble-dictionaries).
- **Russian:** `dictprep/russian/scrabble.txt`, derived from the Russian academic orthographic
dictionary by the tooling under `dictprep/` (see `dictprep/README.md`). Only the prepared word
list is vendored; the heavy upstream source (the orfo PDF/text) is not.
## Build
```sh
make dawg # -> dawg/{en_sowpods,ru_scrabble,ru_erudit}.dawg
```
Requires Go (module deps fetched with `GOPRIVATE=gitea.iliadenisov.ru/*`, exported by the
Makefile) and `python3` (for the Ё→Е fold).
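The Ё→Е fold can be illustrated in memory. This is a minimal sketch that mirrors what `dictprep/fold_yo.py` does on a file; the function form `fold_yo(words)` is illustrative and not an API of this repo.

```python
RU32 = "абвгдежзийклмнопрстуфхцчшщъыьэюя"  # 32 letters, no ё
ORDER = {c: i for i, c in enumerate(RU32)}


def fold_yo(words):
    """Fold ё to е, drop blanks and duplicates, sort in Russian alphabetical order."""
    folded = {w.strip().replace("ё", "е").replace("Ё", "Е") for w in words}
    folded.discard("")
    return sorted(folded, key=lambda w: [ORDER.get(c, 99) for c in w])


print(fold_yo(["ёж", "еж", "ель", "ёлка"]))  # ['еж', 'елка', 'ель']
```

The pair `ёж`/`еж` collapses into one entry, which is why the fold is followed by de-duplication.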
@@ -0,0 +1,75 @@
// Command builddict converts a word list into a serialized DAWG. By default it reads the
// English SOWPODS list (Latin alphabet); pass -alphabet russian for the Cyrillic lists.
package main

import (
	"flag"
	"fmt"
	"log"
	"os"
	"path/filepath"
	"time"

	"github.com/iliadenisov/alphabet"

	"gitea.iliadenisov.ru/developer/scrabble-solver/dictdawg"
	"gitea.iliadenisov.ru/developer/scrabble-solver/wordlist"
)

func main() {
	dict := flag.String("dict", "dictionaries/english/sowpods.txt", "word list file (one word per line)")
	out := flag.String("out", "testdata", "output directory")
	name := flag.String("name", "sowpods", "base name for the output file")
	minLen := flag.Int("min", 2, "minimum word length")
	maxLen := flag.Int("max", 15, "maximum word length")
	alpha := flag.String("alphabet", "latin", "alphabet: latin (English) or russian")
	flag.Parse()

	var idx alphabet.Indexer
	switch *alpha {
	case "latin":
		idx = alphabet.Latin()
	case "russian":
		idx = alphabet.Embedded(alphabet.Langs.LangRu)
	default:
		log.Fatalf("unknown -alphabet %q (want latin or russian)", *alpha)
	}

	t0 := time.Now()
	words, err := wordlist.Read(*dict, idx, *minLen, *maxLen)
	if err != nil {
		log.Fatalf("read %s: %v", *dict, err)
	}
	fmt.Printf("loaded %d words from %s in %s\n", len(words), *dict, time.Since(t0).Round(time.Millisecond))

	if err := os.MkdirAll(*out, 0o755); err != nil {
		log.Fatal(err)
	}

	t := time.Now()
	f, err := dictdawg.Build(idx, words)
	if err != nil {
		log.Fatalf("build dawg: %v", err)
	}
	path := filepath.Join(*out, *name+".dawg")
	if err := dictdawg.Save(f, path); err != nil {
		log.Fatalf("save: %v", err)
	}
	size := int64(0)
	if fi, err := os.Stat(path); err == nil {
		size = fi.Size()
	}
	fmt.Printf("DAWG %d nodes, %s, built+saved in %s -> %s\n",
		f.NumNodes(), humanBytes(size), time.Since(t).Round(time.Millisecond), path)
}

// humanBytes formats a byte count as B, KB or MB.
func humanBytes(n int64) string {
	switch {
	case n >= 1<<20:
		return fmt.Sprintf("%.2f MB", float64(n)/(1<<20))
	case n >= 1<<10:
		return fmt.Sprintf("%.1f KB", float64(n)/(1<<10))
	default:
		return fmt.Sprintf("%d B", n)
	}
}
File diff suppressed because it is too large
@@ -0,0 +1,164 @@
# Russian word-list preparation (`dictprep`)
Builds the Russian **noun** word list for the Scrabble/Эрудит solver out of the official
Russian academic **orthographic dictionary**, cross-checked against two independent
morphological dictionaries.
The goal of the pipeline is a list of **common nouns in the nominative singular**
(`dictprep/russian/scrabble.txt`), plus an ambiguous tail for manual review.
> This directory is self-contained tooling for *building* the word list. It is not part
> of the solver library. The committed result lives in `dictprep/russian/`.
## Source
`orfo_dict_2025.pdf`: *Русский орфографический словарь РАН* (the Russian Academy of Sciences' orthographic dictionary, ≈ 200 000 entries), the
authority for **spelling**. It encodes declension type in its grammatical notes but does
**not** reliably mark part of speech.
- Source: <https://ruslang.ru/sites/default/files/doc/normativnyje_slovari/orfograficheskij_slovar.pdf>
- Mirror: <https://rus-gos.spbu.ru/index.php/dictionary>
The PDF is git-ignored (large, third-party); place it here as `orfo_dict_2025.pdf`. Its
pdftotext output is committed as `russian/orfo_dict_2025.txt`, so the word list rebuilds
from the text alone — the binary PDF is needed only to regenerate that text.
## Outputs (`dictprep/russian/`)
**Four** files are committed (marked ✓ below); every other bucket stays in the Stage-2
process's memory (write it out with `--dump`, query it with `--trace WORD`).
| File | Committed | Meaning |
|------|:--:|---------|
| `orfo_dict_2025.txt` | ✓ | the pdftotext output — the parsed source of truth (the PDF binary is not needed to rebuild). |
| `all.txt` | ✓ | Stage 1 base: every clean Cyrillic headword/variant; a plural headword with a singular is replaced by that singular. |
| `manual_confirm.txt` | ✓ | hand-reviewed nouns from the undefined tail; the brain merges them into the result. |
| `scrabble.txt` | ✓ | **Stage 2 result**: common nouns, nominative singular (+ pluralia tantum), length 2–15 — the working dictionary. |
| `undefined.txt` | — | the ambiguous tail; kept in memory, written only with `--dump`. |
`--dump` also writes `adjectives.txt`, `verbs.txt`, `singulars.txt` and `fate.tsv` (every
word with the reason it did or did not reach the dictionary); these are git-ignored debug
artifacts. Stage 1 also writes `/tmp/ru_{skip,singulars,variants}.txt`, intermediate inputs
the brain consumes.
## Prerequisites
```sh
# 1. pdftotext (Poppler)
sudo apt-get install -y poppler-utils
# 2. Go toolchain (Stage 1) — already required by the parent module
# 3. Python + the OpenCorpora analyser (Stage 2)
sudo apt-get install -y python3-venv python3-pip
python3 -m venv ru-venv
ru-venv/bin/pip install mawo-pymorphy3 # bundles OpenCorpora 2025 (words.dawg)
# 4. libmorph — the independent morphological dictionary (Stage 2 cross-check)
sudo apt-get install -y morphrus morphrus-dev moonycode-dev morphapi-dev
g++ -std=c++17 -O2 dictprep/libmorph_check.cpp -lmorphrus -lmoonycode -o dictprep/libmorph_check
```
If `dictprep/libmorph_check` is absent, Stage 2 still runs — it simply drops libmorph from
the stack and reports `libmorph_helper=MISSING`.
## How to run
```sh
# Stage 0 — PDF -> plain text (committed as the source of truth; run once)
pdftotext dictprep/orfo_dict_2025.pdf dictprep/russian/orfo_dict_2025.txt
# Stage 1 — build the base word list (Go): dictprep/russian/all.txt + /tmp/ru_*.txt
go run ./dictprep/ruwords
# Stage 2 — the brain (Python + mawo + libmorph): writes scrabble.txt
ru-venv/bin/python dictprep/ru_stage2.py
# ask how a word did or did not reach the dictionary
ru-venv/bin/python dictprep/ru_stage2.py --trace травмпункт
# also write the in-memory buckets (undefined, adjectives, verbs, singulars, fate.tsv)
ru-venv/bin/python dictprep/ru_stage2.py --dump
```
The `ruwords` flags `-from`/`-to` (defaulting to 452/168808) bound the column word-list
section of `russian/orfo_dict_2025.txt` (line 452 = the first entry `а1, …`; line 168808 =
the last, `я́щурный`). The preface above line 452 is prose and is skipped. If the PDF is
re-exported, verify these bounds and update `WL_FROM`/`WL_TO` in `ru_stage2.py` to match.
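One way to verify the bounds after a re-export is to check that the two boundary lines still start with the expected headwords. The sketch below is an assumption: `check_bounds` is a hypothetical helper, not part of `dictprep/`, and it only compares line prefixes.

```python
def destress(s):
    """Strip combining grave/acute accents (U+0300/U+0301) and lowercase."""
    return "".join(c for c in s if ord(c) not in (0x0300, 0x0301)).lower()


def check_bounds(lines, first=452, last=168808, first_hw="а1", last_hw="ящурный"):
    """True if the 1-based lines `first` and `last` start with the expected headwords."""
    if first < 1 or last > len(lines):
        return False
    head = destress(lines[first - 1].lstrip())
    tail = destress(lines[last - 1].lstrip())
    return head.startswith(first_hw) and tail.startswith(last_hw)
```

It would be called with the lines of `russian/orfo_dict_2025.txt`, for example `check_bounds(open(path, encoding="utf-8").read().split("\n"))`.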
## Algorithm
### Stage 1 — `ruwords` (Go)
Per dictionary line in `[from, to]` it collects, normalised (stress marks U+0300/U+0301
stripped, lowercased, `ё` kept, hyphenated/capitalised/non-Cyrillic rejected):
- the **headword** (leading token). Leading whitespace including the form-feed `\f`
pdftotext puts at every page top is trimmed — otherwise the first headword of each page
is lost;
- the **singular of a plural headword** when the entry gives it after `ед.`, in full
(`ящеры, …, ед. ящер`) or as a replacement suffix (`…, ед. -вец`, spliced where the
suffix best overlaps the headword); the plural is then dropped (a plural that has a
singular is never needed) and the singular is also recorded (`/tmp/ru_singulars.txt`);
- **variant headwords** after `и` that carry their own grammatical note
(`аблатив, -а и аблятив, -а`; `регги и реггей, нескл.`), excluding inflected forms.
Everything else (every maximal Cyrillic token not selected above) goes to
`/tmp/ru_skip.txt`, a safety net for a later morphology re-check.
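The headword normalisation described above can be sketched in Python. The real Stage 1 is the Go tool `ruwords`; the helpers below (`clean_word`, `headword`) are illustrative names and cover only the headword rule, not the `ед.` or `и` handling.

```python
STRESS = "\u0300\u0301"  # combining grave and acute accents


def is_cyr(c):
    """True for a lowercase Russian letter, including ё."""
    return "а" <= c <= "я" or c == "ё"


def clean_word(token):
    """Return the normalised word, or None if the token is rejected."""
    bare = "".join(c for c in token if c not in STRESS)
    if not bare or "-" in bare or bare[0].isupper():
        return None  # empty, hyphenated, or capitalised (proper noun)
    word = bare.lower()
    return word if all(is_cyr(c) for c in word) else None


def headword(line):
    """Extract and normalise the leading token of a dictionary line."""
    s = line.lstrip()  # also drops the form feed pdftotext emits at page tops
    end = 0
    while end < len(s) and (s[end] in STRESS or s[end] == "-" or is_cyr(s[end].lower())):
        end += 1
    return clean_word(s[:end])
```

For example, `headword("\fаба\u0301жур, -а")` yields `абажур`, while a hyphenated or capitalised headword yields `None`.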
### Stage 2 — `ru_stage2.py` (Python)
Each Stage-1 word (length 2–15) is routed by three sources, most authoritative first:
1. **OpenCorpora** (`words.dawg`, read directly — *not* the predictor): a common-noun
reading ⇒ keep the OpenCorpora lemma. The full OpenCorpora common-noun lexicon is also
added (so nouns absent from the PDF are included).
2. **libmorph** (independent dictionary, via `libmorph_check`): a common-noun reading ⇒
keep the libmorph lemma. The two dictionaries are treated as **complementary** — a noun
reading in *either* is enough (their disagreements were reviewed and resolved this way,
since each is incomplete in different places). A singular reconstructed from "ед." that
neither dictionary knows is accepted as a noun (the orthographic note attests it).
3. A word **both dictionaries miss** is classified by the orthographic **note**
(`-ая, -ое` ⇒ adjective; `-ть`, `сов./несов.` ⇒ verb; single genitive `-а/-и` or
`нескл., м./ж./с.` ⇒ noun). A note-noun goes straight to `scrabble.txt`; an adjective or
verb is dropped; anything undecided goes to `undefined.txt`.
4. **Variant rescue**: when the dictionary joins two spellings with "и" (`травмопункт и
травмпункт`, `регги и реггей`) and one is already a confirmed noun, the other is moved
from review/undefined into the result as well, propagated transitively through chains.
The plural-form variants the dictionaries already resolve never reach this step.
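The variant rescue in step 4 is a fixed-point iteration over the `и`-joined pairs. A minimal sketch that mirrors the loop in `ru_stage2.py`; the standalone function `rescue_variants` is illustrative:

```python
def rescue_variants(confirmed, pending, pairs):
    """Move pending words into confirmed when an "и"-joined partner is confirmed.

    Repeats until a full pass changes nothing, so chains such as a-b, b-c
    are resolved transitively regardless of pair order.
    """
    confirmed, pending = set(confirmed), set(pending)
    changed = True
    while changed:
        changed = False
        for a, b in pairs:
            for known, other in ((a, b), (b, a)):
                if known in confirmed and other in pending:
                    confirmed.add(other)
                    pending.discard(other)
                    changed = True
    return confirmed, pending
```

Each pair is checked in both directions because either spelling may be the one already confirmed.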
The nominative singular always comes from the dictionary that recognised the word, or from
the orthographic `ед.` note — never from a predictor guess (libmorph and the predictor
mis-lemmatise out-of-dictionary words, e.g. `витебчане → витебчан` instead of `витебчанин`).
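The routing order of steps 1 to 3 can be condensed into one decision function. This is a sketch under assumptions: `route` and its boolean parameters are hypothetical, and the real `build()` in `ru_stage2.py` also records the lemma and a reason for each word.

```python
def route(oc_noun, oc_known, lm_lemma, lm_known, ed_singular, note_class):
    """Return the bucket for one Stage-1 word, checking sources in priority order."""
    if oc_noun:
        return "scrabble"   # 1. OpenCorpora common noun
    if lm_lemma is not None:
        return "scrabble"   # 2. libmorph common noun
    if oc_known or lm_known:
        return "dropped"    # a dictionary knows the word, but not as a noun
    if ed_singular:
        return "scrabble"   # singular attested by an "ед." note
    if note_class == "noun":
        return "scrabble"   # 3. the orthographic note indicates a noun
    if note_class in ("adj", "verb"):
        return "dropped"
    return "undefined"      # left for manual review
```

The note is consulted only when both dictionaries miss the word, which keeps the weaker signal from overriding a dictionary verdict.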
### The libmorph bridge — `libmorph_check.cpp`
libmorph (A. Kovalenko, MIT) ships as `libmorphrus.so`. `libmorph_check` is a thin
stdin→stdout filter: one UTF-8 word per line in, one line out:
```
<known>\t<pos>:<lemma>\t<pos>:<lemma>...
```
`<known>` is `CheckWord` (1 = in the dictionary). `<pos>` is `wdInfo & 0x3f`, the part of
speech. The codes were reverse-engineered (the docs omit the table):
| codes | part of speech |
|------|----------------|
| **7–21, 24** | **noun** (all genders / declensions / animacy; pluralia tantum is 24) |
| 1–3 | verb |
| 25, 27 | adjective |
| 28–32 | pronoun |
| 33–36 | numeral |
| 38–39 | **proper noun** (excluded) |
| 48–58 | comparative/adverb |
| 49–53 | function words |
The analyser instance is requested with the key `libmorph.api.v4:utf-8` so words are
passed and lemmas returned in UTF-8.
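Parsing one output line of the bridge might look like the sketch below. It follows the same rule as `libmorph_analyze` in `ru_stage2.py` (noun codes 7 to 21 plus 24); the function name `parse_line` is illustrative, and the sample lines in the usage note are made up.

```python
NOUN_CODES = set(range(7, 22)) | {24}  # 7..21 plus 24 (pluralia tantum)


def parse_line(word, line):
    """Return (known, noun_lemma) for one libmorph_check output line.

    noun_lemma is None when no lexeme carries a common-noun code.
    """
    fields = line.rstrip("\n").split("\t")
    known = fields[0] == "1"
    noun_lemmas = []
    for field in fields[1:]:
        code, _, lemma = field.partition(":")
        if code.isdigit() and int(code) in NOUN_CODES:
            noun_lemmas.append(lemma)
    if not noun_lemmas:
        return known, None
    # Prefer the word itself when it is one of the noun lemmas.
    return known, (word if word in noun_lemmas else noun_lemmas[0])
```

With a hypothetical line `"1\t2:стать\t15:сталь"` for the word `стали`, the verb lexeme is ignored and the result is `(True, "сталь")`.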
## Notes & caveats
- The hard tail (≈ 35 000 of the Stage-1 candidate words) is in **no** morphological
  dictionary; only the orthographic dictionary attests these words, so the PDF note is the
  sole signal there. Compound and very recent nouns (`робототехник`, `толкинист`) live here.
- OpenCorpora and libmorph are near-equal in size (≈ 99 500 words each on `all.txt`)
and ≈ 96 % overlapping, but **complementary** (each contributes ≈ 2 200 unique nouns),
which is why both are kept. The mawo *predictor* "knows" ~98 % of everything by guessing
and is therefore used only as a weak confirming vote, never as dictionary membership.
- Licensing: OpenCorpora data is CC BY-SA 3.0; libmorph is MIT; the orthographic
dictionary has its own copyright. A list derived from CC BY-SA data inherits that licence.
@@ -0,0 +1,27 @@
#!/usr/bin/env python3
"""Fold Ё/ё → Е/е in a word list and de-duplicate — the dictionary prep for "Эрудит".
The Эрудит ruleset has no Ё tile and treats Е/Ё as one letter, so its dictionary must be
folded before the DAWG is built. Folding merges pairs like ёж/еж, hence the de-dup. Output
is sorted (Russian order over the 32 folded letters) and LF-separated.
Run: python3 dictprep/fold_yo.py dictprep/russian/scrabble.txt > /tmp/ru_erudit_words.txt
"""
import sys

ORDER = {c: i for i, c in enumerate("абвгдежзийклмнопрстуфхцчшщъыьэюя")}  # 32 letters, no ё


def key(w):
    return [ORDER.get(c, 99) for c in w]


def main():
    src = sys.argv[1] if len(sys.argv) > 1 else "/dev/stdin"
    words = {line.strip().replace("ё", "е").replace("Ё", "Е") for line in open(src, encoding="utf-8")}
    words.discard("")
    sys.stdout.write("\n".join(sorted(words, key=key)) + "\n")


if __name__ == "__main__":
    main()
@@ -0,0 +1,47 @@
// libmorph_check: a thin stdin->stdout bridge to the libmorph Russian morphological
// analyser, for use by the Stage-2 classifier (dictprep/ru_stage2.py).
//
// Reads one word per line. Bytes are passed through verbatim: by default the UTF-8
// analyser instance is used, so the caller sends UTF-8; a different factory key (code
// page) can be given as the first argument. For each word it writes a line:
//
// <known>\t<pos>:<lemma>\t<pos>:<lemma>...
//
// where <known> is CheckWord's result (1 = in the dictionary, 0 = not), and each
// following field is one lexeme: its part of speech (wdInfo & 0x3f) and lemma.
//
// Build: g++ -std=c++17 -O2 dictprep/libmorph_check.cpp -lmorphrus -lmoonycode -o dictprep/libmorph_check
#include <libmorph/rus.h>
#include <libmorph/api.hpp>

#include <cstdio>
#include <iostream>
#include <string>

int main(int argc, char** argv) {
    // The factory key selects the code page: "libmorph.api.v4:<charset>". Use the
    // UTF-8 instance so words pass through verbatim. IMlmaMbXX only adds non-virtual
    // convenience wrappers over IMlmaMb, so the filled pointer can be used as such.
    const char* key = argc > 1 ? argv[1] : "libmorph.api.v4:utf-8";
    IMlmaMbXX* mlma = nullptr;
    int rc = mlmaruGetAPI(key, (void**)&mlma);
    if (mlma == nullptr) {
        std::fprintf(stderr, "libmorph_check: GetAPI('%s') failed, rc=%d\n", key, rc);
        return 1;
    }

    std::string line;
    while (std::getline(std::cin, line)) {
        if (!line.empty() && line.back() == '\r') line.pop_back();
        IMlmaMbXX::inword w(line.c_str(), line.size());
        int known = mlma->CheckWord(w, sfIgnoreCapitals);
        std::cout << known;
        try {
            for (auto& lx : mlma->Lemmatize(w, sfIgnoreCapitals)) {
                unsigned pos = lx.ngrams > 0 ? (lx.pgrams[0].wdInfo & 0x3f) : 0xffu;
                std::cout << '\t' << pos << ':' << (lx.plemma ? lx.plemma : "");
            }
        } catch (...) {
        }
        std::cout << '\n';
    }
    return 0;
}
@@ -0,0 +1,341 @@
#!/usr/bin/env python3
"""Stage 2 — the "brain" of the Russian Scrabble word-list pipeline.
It reads the Stage-1 base word list (built once by ruwords so the heavy PDF is not
re-parsed) together with the grammatical notes and the singular/variant structure, runs
the whole noun-selection logic in memory, and writes a minimal result:
dictprep/russian/scrabble.txt — the working dictionary (common nouns, nom. sing.)
dictprep/russian/undefined.txt — the ambiguous tail, left for manual review
(dictprep/russian/all.txt is the Stage-1 base.) Every other bucket — adjectives, verbs,
the merged note-nouns, singulars, variants — stays in memory. Pass --dump to also write
them; pass --trace WORD to ask how a single word did or did not reach the dictionary.
Note: all.txt is a plain word list, so the grammatical notes, "ед." singulars and "и"
variants are read from the committed pdftotext output (russian/orfo_dict_2025.txt) and the
Stage-1 side files; the expensive PDF parse itself runs only once.
Sources, most authoritative first: OpenCorpora (mawo-pymorphy3), libmorph (libmorph_check),
and the orthographic dictionary's own notes. See dictprep/README.md.
Run: ru-venv/bin/python dictprep/ru_stage2.py [--dump] [--trace WORD]
"""
import argparse
import os
import re
import subprocess

HERE = os.path.dirname(os.path.abspath(__file__))
OUT_DIR = os.path.join(HERE, "russian")
SLOV = os.path.join(OUT_DIR, "orfo_dict_2025.txt")  # committed pdftotext output (source of truth)
WL_FROM, WL_TO = 452, 168808  # 1-based inclusive bounds of the column word-list section
OC_CACHE = "/tmp/oc_nouns.txt"
LIBMORPH_BIN = os.path.join(HERE, "libmorph_check")
ALPHABET = "абвгдеёжзийклмнопрстуфхцчшщъыьэюя"
ORDER = {c: i for i, c in enumerate(ALPHABET)}
PROPER = {"Name", "Surn", "Patr", "Geox", "Orgn", "Trad"}
LIBMORPH_NOUN_CODES = set(range(7, 22)) | {24}  # 7..21 plus 24 (pluralia tantum)
ADJ_END = {"ая", "яя", "ое", "ее", "ье", "ья", "ьи"}
VERB3 = ("ет", "ёт", "ит", "ют", "ут", "ает", "яет", "ует", "уют", "нет", "жет", "чет")
GENPL = ("ов", "ёв", "ев", "ей")


def key(w):
    return [ORDER.get(c, 99) for c in w]


def destress(s):
    return "".join(c for c in s if ord(c) not in (0x0300, 0x0301)).lower()


def cyr_ok(w):
    return 2 <= len(w) <= 15 and all(("а" <= c <= "я") or c == "ё" for c in w)


def load(p):
    return [l.strip() for l in open(p, encoding="utf-8") if l.strip()] if os.path.exists(p) else []


def write(path, words):
    os.makedirs(os.path.dirname(path), exist_ok=True)
    open(path, "w", encoding="utf-8").write("\n".join(sorted(set(words), key=key)) + "\n")


import mawo_pymorphy3  # noqa: E402

M = mawo_pymorphy3.MorphAnalyzer()
D = M._dawg_dict


def oc_noun_lemmas():
    """Every common-noun lemma (nom. sing. / pluralia tantum) in OpenCorpora's words.dawg."""
    gp, pt = D.get_paradigm, D.parse_tag_string
    para0, tagc = {}, {}

    def g0(pid):
        r = para0.get(pid)
        if r is None:
            suf0, tag0, pre0 = gp(pid, 0)
            _, gr = pt(tag0)
            r = (pre0, suf0, gr)
            para0[pid] = r
        return r

    def gt(pid, idx):
        k = (pid, idx)
        r = tagc.get(k)
        if r is None:
            suf, tag, pre = gp(pid, idx)
            pos, gr = pt(tag)
            r = (suf, pre, pos, gr)
            tagc[k] = r
        return r

    out = set()
    for word, rec in D.words_dawg.iteritems():
        pid, idx = rec
        suf, pre, pos, gr = gt(pid, idx)
        if pos != "NOUN":
            continue
        pre0, suf0, gr0 = g0(pid)
        if (PROPER & gr) or (PROPER & gr0):
            continue
        stem = word[len(pre):len(word) - len(suf)] if suf else word[len(pre):]
        out.add(pre0 + stem + suf0)
    return {w for w in out if cyr_ok(w)}


def oc_status(word):
    """(is_common_noun, in_dictionary) for word, from OpenCorpora only."""
    parses = D.get_word_parses(word)
    if not parses:
        return False, False
    gp, pt = D.get_paradigm, D.parse_tag_string
    for pid, idx in parses:
        suf, tag, pre = gp(pid, idx)
        pos, gr = pt(tag)
        if pos == "NOUN":
            _, tag0, _ = gp(pid, 0)
            _, gr0 = pt(tag0)
            if not (PROPER & gr or PROPER & gr0):
                return True, True
    return False, True


def libmorph_analyze(words):
    """Map each word to (known, noun_lemma, codes) per libmorph; noun_lemma is None when it
    is not a common noun there. Empty result if the helper binary is not built."""
    words = list(words)
    if not words or not os.path.exists(LIBMORPH_BIN):
        return {}
    proc = subprocess.run([LIBMORPH_BIN], input="\n".join(words), capture_output=True, text=True)
    out = {}
    for w, line in zip(words, proc.stdout.split("\n")):
        fields = line.split("\t")
        known = fields[:1] == ["1"]
        codes, noun_lemmas = set(), []
        for field in fields[1:]:
            code, _, lex = field.partition(":")
            if code.isdigit():
                codes.add(int(code))
                if int(code) in LIBMORPH_NOUN_CODES:
                    noun_lemmas.append(lex)
        lemma = (w if w in noun_lemmas else noun_lemmas[0]) if noun_lemmas else None
        out[w] = (known, lemma, codes)
    return out


def build_notes():
    """Map each headword (destressed, lowercased) to its grammatical note."""
    def is_hw(ch):
        o = ord(ch)
        return (0x0430 <= o <= 0x044F) or (0x0410 <= o <= 0x042F) or o in (0x0401, 0x0451, 0x0300, 0x0301)

    hmap = {}
    lines = open(SLOV, encoding="utf-8").read().split("\n")
    for l in lines[WL_FROM - 1:WL_TO]:
        s = l.lstrip()
        e = 0
        for ch in s:
            if is_hw(ch):
                e += 1
            else:
                break
        hw = destress(s[:e])
        if hw and hw not in hmap:
            hmap[hw] = destress(s[e:]).strip()
    return hmap


def classify(w, note):
    """Coarse part of speech of an out-of-dictionary word from its PDF note."""
    if note is None:
        return "amb"
    n = re.sub(r"\([^)]*\)", "", note).strip()  # drop domain/etymology parentheticals
    if "кр. ф" in n or "кр.ф" in n or "прич." in n or "прил." in n:
        return "adj"
    ends = re.findall(r"-([а-яё]+)", n)
    if any(e in ADJ_END for e in ends):
        return "adj"
    if "сов." in n or "несов." in n or "безл." in n:
        return "verb"
    if w.endswith("ся"):  # reflexive: no Russian noun ends in -ся
        return "verb"
    if any(e.endswith(VERB3) for e in ends) and not any(m in n for m in ("ед.", "тв.", "род.", "м.", "ж.", "с.")):
        return "verb"
    if n == "" and w.endswith(("ый", "ий", "ой", "ая", "ое", "ые", "ие", "яя", "ее")):
        return "adj"
    if "нескл" in n:
        return "noun" if any(g in n for g in ("м.", "ж.", "с.", "мн.")) else "amb"
    if ends:
        return "noun"
    if n == "" and w.endswith(("ать", "ять", "еть", "ить", "оть", "уть", "ыть", "ти", "чь")):
        return "verb"
    return "amb"


def singular(w, note):
    """Nominative singular of a noun headword from the PDF note (authoritative) or, for a
    plural headword without an explicit singular, the mawo lemma; pluralia tantum kept."""
    n = note or ""
    full = re.search(r"ед\.\s+([а-яё]+)", n)
    if full:
        return full.group(1)
    suf = re.search(r"ед\.\s+-([а-яё]+)", n)
    if suf:
        s = suf.group(1)
        i = w.rfind(s[0])
        return w[:i] + s if i > 0 else w
    ends = re.findall(r"-([а-яё]+)", re.sub(r"\([^)]*\)", "", n))
    if ends and ends[0].endswith(GENPL):
        for p in M.parse(w):
            if str(p.tag.POS) == "NOUN":
                return p.normal_form
        return w
    return w


def build():
    """Run the whole pipeline in memory. Returns the result sets plus a `fate` map giving
    every word's outcome, so a word's path can be traced or the buckets dumped."""
    oc = set(load(OC_CACHE)) or oc_noun_lemmas()
    if not os.path.exists(OC_CACHE):
        write(OC_CACHE, oc)
    hmap = build_notes()
    all_words = load(os.path.join(OUT_DIR, "all.txt"))
    ed_nouns = set(load("/tmp/ru_singulars.txt"))
    pairs = [tuple(p) for l in load("/tmp/ru_variants.txt") if len(p := l.split("\t")) == 2]
    pdf = [w for w in all_words if cyr_ok(w)]
    lm = libmorph_analyze(pdf)

    def to_singular(w):
        s = singular(w, hmap.get(w))
        return s if cyr_ok(s) else w

    fate = {}
    scrabble = set(oc)
    adj, verb, amb = [], [], []
    for w in pdf:
        oc_noun, oc_known = oc_status(w)
        if oc_noun:
            fate[w] = "scrabble: сущ. по OpenCorpora"
            continue
        lm_known, lm_lemma, _ = lm.get(w, (False, None, frozenset()))
        if lm_lemma is not None:
            s = lm_lemma if cyr_ok(lm_lemma) else to_singular(w)
            scrabble.add(s)
            # Append the lemma, separated by an arrow, when it differs from the word.
            fate[w] = "scrabble: сущ. по libmorph" + ("" if s == w else f" → {s}")
            continue
        if oc_known or lm_known:
            fate[w] = "отброшено: словарь знает как не-существительное"
            continue
        if w in ed_nouns:
            scrabble.add(w)
            fate[w] = "scrabble: ед.ч. по помете «ед.»"
            continue
        c = classify(w, hmap.get(w))
        if c == "noun":
            s = to_singular(w)
            scrabble.add(s)
            fate[w] = "scrabble: сущ. по помете орфословаря" + ("" if s == w else f" → {s}")
        elif c == "adj":
            adj.append(w)
            fate[w] = "отброшено: прилагательное (помета орфословаря)"
        elif c == "verb":
            verb.append(w)
            fate[w] = "отброшено: глагол (помета орфословаря)"
        else:
            amb.append(w)
            fate[w] = "undefined: неоднозначное (нет в словарях, помета не определяет)"

    # Manual confirmations: nouns the maintainer approved from the undefined tail.
    for w in load(os.path.join(OUT_DIR, "manual_confirm.txt")):
        if cyr_ok(w):
            scrabble.add(w)
            fate[w] = "scrabble: подтверждено вручную (manual_confirm.txt)"

    # Variant rescue: a word joined by "и" to a confirmed noun is itself a noun.
    pending = set(amb) - scrabble
    changed = True
    while changed:
        changed = False
        for a, b in pairs:
            for x, y in ((a, b), (b, a)):
                if x in scrabble and y in pending:
                    scrabble.add(y)
                    pending.discard(y)
                    fate[y] = f"scrabble: вариант от «{x}» (через «и»)"
                    changed = True

    undefined = [w for w in amb if w not in scrabble]
    return {
        "oc": oc, "scrabble": scrabble, "undefined": undefined,
        "adjectives": adj, "verbs": verb, "singulars": ed_nouns,
        "fate": fate, "all": set(all_words),
    }


def trace(word, r):
    w = destress(word)
    if w in r["fate"]:
        return r["fate"][w]
    if w in r["scrabble"]:
        return "scrabble: лексикон OpenCorpora" if w in r["oc"] else "scrabble: производная/лемма"
    if w not in r["all"]:
        return "нет в russian_all (не извлечено на Stage 1 — нет в .pdf, либо имя собств./дефис/форма)"
    if not cyr_ok(w):
        return "отсеяно: длина или символы вне диапазона (2–15 кириллица)"
    return "не определено"


def main():
    ap = argparse.ArgumentParser(description="Stage 2 brain: build the noun dictionary, trace a word, or dump buckets.")
    ap.add_argument("--dump", action="store_true", help="also write the in-memory buckets (undefined, adjectives, verbs, singulars, fate)")
    ap.add_argument("--trace", metavar="WORD", help="report how WORD did or did not reach the dictionary, then exit")
    args = ap.parse_args()

    r = build()
    if args.trace:
        print(f"{args.trace}: {trace(args.trace, r)}")
        return

    write(os.path.join(OUT_DIR, "scrabble.txt"), r["scrabble"])
    print(f"=> dictprep/russian/scrabble.txt {len(r['scrabble'])}")
    print(f" undefined kept in memory: {len(set(r['undefined']))} (use --dump to write it)")
    if args.dump:
        write(os.path.join(OUT_DIR, "undefined.txt"), r["undefined"])
        write(os.path.join(OUT_DIR, "adjectives.txt"), r["adjectives"])
        write(os.path.join(OUT_DIR, "verbs.txt"), r["verbs"])
        write(os.path.join(OUT_DIR, "singulars.txt"), r["singulars"])
        fate_path = os.path.join(OUT_DIR, "fate.tsv")
        os.makedirs(OUT_DIR, exist_ok=True)
        with open(fate_path, "w", encoding="utf-8") as f:
            for w in sorted(r["fate"], key=key):
                f.write(f"{w}\t{r['fate'][w]}\n")
        print(f" dumped: undefined.txt ({len(set(r['undefined']))}), adjectives.txt, verbs.txt, singulars.txt, fate.tsv")


if __name__ == "__main__":
    main()
@@ -0,0 +1,135 @@
артгруппа
бутень
вебинар
видеодневник
водозащита
генацвале
жакоб
оберфюрер
околоть
особина
полбазара
полбака
полбалкона
полбанана
полбарана
полбатальона
полбатона
полбиблиотеки
полблокнота
полбокала
полбуханки
полвагона
полвечера
полвзвода
полвинта
полгазеты
полгектара
полгостиницы
полграмма
полгруппы
полдачи
полдвора
полдекабря
полдеревни
полдетсада
полдивана
полдивизии
полдыни
полжурнала
ползавода
ползарплаты
полздания
полканикул
полканистры
полкартофелины
полкастрюли
полквартиры
полкилограмма
полкласса
полкниги
полколлекции
полкольца
полкоманды
полкоробки
полкочана
полкурса
полкуска
полмагазина
полмандарина
полмарта
полматча
полмиллиметра
полмузея
полноября
полпакета
полпарка
полпартии
полпинты
полпирога
полпирожка
полпируэта
полпоезда
полполена
полполка
полполки
полполосы
полпомидора
полпоросёнка
полпосёлка
полпредовский
полпроцента
полпузырька
полрайона
полромана
полроты
полрулона
полряда
полсада
полсажени
полсезона
полсентября
полсловаря
полсостава
полсрока
полстада
полстены
полстолетия
полстраницы
полстроки
полтаблетки
полтайма
полтакта
полтарелки
полтетради
полтома
полтона
полторта
полтысячелетия
полтюбика
полусанаторий
полфакультета
полфевраля
полфлакона
полфразы
полхаты
полцарства
полцентнера
полцистерны
полчайника
полчемодана
полшажка
полшажочка
полшара
полшкафа
полшколы
полщеки
принт
промо
рентгеноаппарат
сивец
соцнаём
срывка
флеш
флешмобер
шиноремонт
File diff suppressed because it is too large
@@ -0,0 +1,434 @@
// Command ruwords extracts a clean Cyrillic word list from the plain text of a Russian
// orthographic dictionary (the output of `pdftotext`).
//
// Stage 1 (this tool): from input lines [from, to], the column word-list section, it
// collects the headword (the leading token) of each entry. When the headword is plural and the entry
// gives its singular after "ед." — in full ("ящеры, …, ед. ящер") or as a replacement
// suffix ("…, ед. -вец") — only the singular is kept, since a plural that has a singular
// is never needed. It drops stress marks, lowercases, keeps ё, and discards proper nouns
// (capitalized), hyphenated words, acronyms and non-Cyrillic tokens. The result is
// de-duplicated and sorted in Russian alphabetical order (ё right after е), LF-separated.
//
// It also collects a variant headword joined by "и" when it carries its own grammatical
// note (e.g. "аблатив, -а и аблятив, -а"). Suffix-singular reconstruction is heuristic;
// Stage 2 (dictprep/ru_stage2.py) re-checks the words against real dictionaries.
//
// pdftotext dictprep/orfo_dict_2025.pdf /tmp/slov.txt
// go run ./dictprep/ruwords -in /tmp/slov.txt -from 452 -to 168808 \
// -out russian_all.txt -skip russian_skip.txt
package main
import (
"bufio"
"flag"
"fmt"
"log"
"os"
"path/filepath"
"sort"
"strings"
"unicode"
)
// ruAlphabet is the Russian alphabet in collation order (ё directly after е).
const ruAlphabet = "абвгдеёжзийклмнопрстуфхцчшщъыьэюя"
var ruRank = func() map[rune]int {
m := make(map[rune]int, len(ruAlphabet))
for i, r := range []rune(ruAlphabet) {
m[r] = i
}
return m
}()
func isCyrLetter(r rune) bool {
return (r >= 'а' && r <= 'я') || (r >= 'А' && r <= 'Я') || r == 'ё' || r == 'Ё'
}
func isUpperCyr(r rune) bool { return (r >= 'А' && r <= 'Я') || r == 'Ё' }
func isStress(r rune) bool { return r == 0x0300 || r == 0x0301 }
// cleanWord normalizes a run of letters/stress-marks into a lowercase Cyrillic word, or
// returns ok=false for proper nouns (capitalized), hyphenated or non-Cyrillic runs.
func cleanWord(run []rune) (string, bool) {
if len(run) == 0 || isUpperCyr(run[0]) {
return "", false
}
var b strings.Builder
for _, r := range run {
switch {
case isStress(r), r == '\u00ad': // drop stress accents and soft hyphens (U+00AD)
case r == '-': // a real hyphen means a hyphenated word: reject it
return "", false
default:
b.WriteRune(unicode.ToLower(r))
}
}
w := b.String()
if w == "" {
return "", false
}
for _, r := range w {
if !((r >= 'а' && r <= 'я') || r == 'ё') {
return "", false
}
}
return w, true
}
// headword returns the entry's headword: the leading run of letters, stress marks and
// hyphens, normalized.
func headword(line string) (string, bool) {
// Trim leading whitespace, including the form-feed (U+000C) that pdftotext puts at
// the top of each page — otherwise the first headword on every page is lost.
line = strings.TrimLeftFunc(line, unicode.IsSpace)
var run []rune
for _, r := range line {
if isCyrLetter(r) || isStress(r) || r == '-' || r == '\u00ad' {
run = append(run, r)
} else {
break
}
}
return cleanWord(run)
}
// embeddedSingulars returns the singular forms of a plural headword spelled out after
// "ед.", either in full ("ед. ящер") or as a replacement suffix ("ед. -вец",
// reconstructed from headword). It skips gender marks ("ед. м") and abbreviations that
// merely contain "ед." ("ед. измер.", "ден. ед.").
func embeddedSingulars(line, headword string) []string {
var out []string
for i := 0; ; {
j := strings.Index(line[i:], "ед.")
if j < 0 {
break
}
i += j + len("ед.")
rest := strings.TrimLeft(line[i:], "  \t")
if strings.HasPrefix(rest, "-") { // suffix form: reconstruct from the headword
var suf []rune
for _, r := range rest[len("-"):] {
if isCyrLetter(r) || isStress(r) {
suf = append(suf, r)
} else {
break
}
}
if s, ok := cleanWord(suf); ok && len([]rune(s)) >= 2 {
if recon := reconstructSingular(headword, s); recon != "" {
out = append(out, recon)
}
}
continue
}
var run []rune
consumed := 0
for _, r := range rest {
if isCyrLetter(r) || isStress(r) {
run = append(run, r)
consumed += len(string(r))
} else {
break
}
}
if len(run) == 0 {
continue
}
if strings.HasPrefix(rest[consumed:], ".") {
continue // an abbreviation like "ед. измер." rather than a singular form
}
w, ok := cleanWord(run)
if !ok || len([]rune(w)) < 2 { // 2+ letters excludes the gender marks м/ж/с
continue
}
out = append(out, w)
}
return out
}
// reconstructSingular builds the singular from a plural headword and the replacement
// suffix from "ед. -<suffix>". It splices at the headword position k where the suffix
// shares its longest common prefix with hw[k:] (the earliest such k on ties) and returns
// hw[:k] + suffix, or "" when the suffix's first letter does not occur in the headword.
// It is a heuristic; Stage 2 re-checks the words against real dictionaries.
func reconstructSingular(headword, suffix string) string {
hw, sf := []rune(headword), []rune(suffix)
bestK, bestLen := -1, 0
for k := 0; k < len(hw); k++ {
m := 0
for k+m < len(hw) && m < len(sf) && hw[k+m] == sf[m] {
m++
}
if m > bestLen {
bestK, bestLen = k, m
}
}
if bestK < 0 {
return ""
}
return string(hw[:bestK]) + suffix
}
// headwordNotes are the grammatical notes that mark a parallel headword (a lemma) after
// "и", as opposed to an inflected form. A "-" ending also marks one; form labels such as
// деепр. (gerund) or сравн. (comparative) deliberately do not.
var headwordNotes = map[string]bool{
"нескл": true, "неизм": true, "предлог": true, "предл": true, "нареч": true,
"нар": true, "прил": true, "союз": true, "частица": true, "част": true,
"межд": true, "мн": true, "ед": true, "тв": true, "числ": true, "мест": true,
"м": true, "ж": true, "с": true, "вводн": true, "сказ": true,
}
// variantNoteOK reports whether the note following a candidate variant marks a headword:
// a "-" inflection ending or one of headwordNotes (and not a bare inflected word).
func variantNoteOK(note string) bool {
if strings.HasPrefix(note, "-") {
return true
}
var stem []rune
for _, r := range note {
if (r >= 'а' && r <= 'я') || r == 'ё' {
stem = append(stem, r)
} else {
break
}
}
return headwordNotes[string(stem)]
}
// variants returns the second (and further) headwords of an entry, written as a parallel
// form after " и ", e.g. "аблатив, -а и аблятив, -а" yields "аблятив" and "регги и реггей,
// нескл." yields "реггей". Requiring a headword note after the comma keeps this from
// matching "и" inside examples or picking up inflected forms.
func variants(line string) []string {
var out []string
const sep = " и "
for i := 0; ; {
j := strings.Index(line[i:], sep)
if j < 0 {
break
}
i += j + len(sep)
rest := line[i:]
var run []rune
consumed := 0
for _, r := range rest {
if isCyrLetter(r) || isStress(r) {
run = append(run, r)
consumed += len(string(r))
} else {
break
}
}
if len(run) == 0 {
continue
}
after := rest[consumed:]
if !strings.HasPrefix(after, ", ") || !variantNoteOK(after[len(", "):]) {
continue
}
if w, ok := cleanWord(run); ok && len([]rune(w)) >= 2 {
out = append(out, w)
}
}
return out
}
// normToken normalizes any token (a run of letters and stress marks) for the skip set:
// lowercase, stress removed, kept only if it is 2+ all-Cyrillic letters. Unlike
// cleanWord it does NOT reject capitalized tokens — a lowercased proper noun belongs in
// the skip set so it can be re-checked by a morphological analyzer.
func normToken(run []rune) (string, bool) {
var b strings.Builder
for _, r := range run {
if isStress(r) {
continue
}
b.WriteRune(unicode.ToLower(r))
}
w := b.String()
if len([]rune(w)) < 2 {
return "", false
}
for _, r := range w {
if !((r >= 'а' && r <= 'я') || r == 'ё') {
return "", false
}
}
return w, true
}
// tokens returns every maximal run of Cyrillic letters (plus stress marks) in the line,
// normalized; runs are split on every other character (so hyphens split a word).
func tokens(line string) []string {
var out []string
var run []rune
flush := func() {
if len(run) > 0 {
if w, ok := normToken(run); ok {
out = append(out, w)
}
run = run[:0]
}
}
for _, r := range line {
if isCyrLetter(r) || isStress(r) {
run = append(run, r)
} else {
flush()
}
}
flush()
return out
}
func lessRu(a, b string) bool {
ra, rb := []rune(a), []rune(b)
for i := 0; i < len(ra) && i < len(rb); i++ {
if ra[i] != rb[i] {
return ruRank[ra[i]] < ruRank[rb[i]]
}
}
return len(ra) < len(rb)
}
func sortedRu(set map[string]struct{}) []string {
words := make([]string, 0, len(set))
for w := range set {
words = append(words, w)
}
sort.Slice(words, func(i, j int) bool { return lessRu(words[i], words[j]) })
return words
}
func writeWords(path string, words []string) error {
if dir := filepath.Dir(path); dir != "" && dir != "." {
if err := os.MkdirAll(dir, 0o755); err != nil {
return err
}
}
o, err := os.Create(path)
if err != nil {
return err
}
w := bufio.NewWriter(o)
for _, word := range words {
w.WriteString(word)
w.WriteByte('\n')
}
if err := w.Flush(); err != nil {
o.Close()
return err
}
return o.Close()
}
func main() {
in := flag.String("in", "dictprep/russian/orfo_dict_2025.txt", "plain-text dictionary (pdftotext output)")
out := flag.String("out", "dictprep/russian/all.txt", "output: the base word list (clean headwords + reconstructed singulars + variants)")
skip := flag.String("skip", "/tmp/ru_skip.txt", "output: every other token, for a later morphology re-check")
sings := flag.String("singulars", "/tmp/ru_singulars.txt", "output: singulars reconstructed from \"ед.\" (known nouns)")
varsOut := flag.String("variants", "/tmp/ru_variants.txt", "output: variant pairs joined by \"и\" (primary<TAB>variant)")
from := flag.Int("from", 452, "first line of the word-list section (1-based, inclusive)")
to := flag.Int("to", 168808, "last line of the word-list section (inclusive)")
flag.Parse()
if *in == "" {
log.Fatal("ruwords: -in is required")
}
f, err := os.Open(*in)
if err != nil {
log.Fatal(err)
}
defer f.Close()
all := make(map[string]struct{})
allTokens := make(map[string]struct{})
singulars := make(map[string]struct{})
variantPairs := make(map[string]struct{})
entries, fromHead, fromSing, fromVar := 0, 0, 0, 0
sc := bufio.NewScanner(f)
sc.Buffer(make([]byte, 1<<20), 1<<20)
for line := 0; sc.Scan(); {
line++
if line < *from || line > *to {
continue
}
entries++
text := sc.Text()
hw, hwOK := headword(text)
var entrySings []string // per-entry singulars; named so it does not shadow the -singulars flag
if hwOK {
entrySings = embeddedSingulars(text, hw)
}
primary := ""
if len(entrySings) > 0 {
// the headword is plural and the entry gives its singular: keep only the singular
primary = entrySings[0]
for _, w := range entrySings {
if _, seen := all[w]; !seen {
fromSing++
all[w] = struct{}{}
}
singulars[w] = struct{}{}
}
} else if hwOK {
primary = hw
if _, seen := all[hw]; !seen {
fromHead++
}
all[hw] = struct{}{}
}
for _, w := range variants(text) {
if _, seen := all[w]; !seen {
fromVar++
all[w] = struct{}{}
}
if primary != "" && primary != w {
variantPairs[primary+"\t"+w] = struct{}{}
}
}
for _, w := range tokens(text) {
allTokens[w] = struct{}{}
}
}
if err := sc.Err(); err != nil {
log.Fatal(err)
}
skipSet := make(map[string]struct{})
for w := range allTokens {
if _, ok := all[w]; !ok {
skipSet[w] = struct{}{}
}
}
allWords := sortedRu(all)
skipWords := sortedRu(skipSet)
if err := writeWords(*out, allWords); err != nil {
log.Fatal(err)
}
if err := writeWords(*skip, skipWords); err != nil {
log.Fatal(err)
}
if err := writeWords(*sings, sortedRu(singulars)); err != nil {
log.Fatal(err)
}
pairList := make([]string, 0, len(variantPairs))
for p := range variantPairs {
pairList = append(pairList, p)
}
sort.Strings(pairList)
if err := writeWords(*varsOut, pairList); err != nil {
log.Fatal(err)
}
fmt.Printf("scanned %d entries\n", entries)
fmt.Printf(" %-20s %7d words (%d headwords + %d embedded singulars + %d variants)\n", *out, len(allWords), fromHead, fromSing, fromVar)
fmt.Printf(" %-20s %7d words (tokens not in %s; for a morphology re-check)\n", *skip, len(skipWords), *out)
fmt.Printf(" %-20s %7d words (singulars from \"ед.\"; known nouns)\n", *sings, len(singulars))
fmt.Printf(" %-20s %7d pairs (variants joined by \"и\")\n", *varsOut, len(variantPairs))
}
+13
@@ -0,0 +1,13 @@
module gitea.iliadenisov.ru/developer/scrabble-dictionary
go 1.26.3
require (
gitea.iliadenisov.ru/developer/scrabble-solver v1.0.0
github.com/iliadenisov/alphabet v1.1.0
)
require (
github.com/iliadenisov/dafsa v1.1.0 // indirect
golang.org/x/exp v0.0.0-20201008143054-e3b2a7f2fdc7 // indirect
)
+31
@@ -0,0 +1,31 @@
dmitri.shuralyov.com/gpu/mtl v0.0.0-20190408044501-666a987793e9/go.mod h1:H6x//7gZCb22OMCxBHrMx7a5I7Hp++hsVxbQ4BYO7hU=
gitea.iliadenisov.ru/developer/scrabble-solver v1.0.0 h1:ntN6m4cOB+4FelleO2nkAIZp8WSc+v25neetzfdUuuw=
gitea.iliadenisov.ru/developer/scrabble-solver v1.0.0/go.mod h1:G60OiGZtkrRyYX8P3SSsjVpU707fufmZkvCkNFPFWrY=
github.com/BurntSushi/xgb v0.0.0-20160522181843-27f122750802/go.mod h1:IVnqGOEym/WlBOVXweHU+Q+/VP0lqqI8lqeDx9IjBqo=
github.com/go-gl/glfw/v3.3/glfw v0.0.0-20200222043503-6f7a984d4dc4/go.mod h1:tQ2UAYgL5IevRw8kRxooKSPJfGvJ9fJQFa0TUsXzTg8=
github.com/iliadenisov/alphabet v1.1.0 h1:d87N7Rmpjj9FgL7bvEaqLdaIaNch2hC6HvkbKGhn7Hk=
github.com/iliadenisov/alphabet v1.1.0/go.mod h1:h6BhDBiJBLhMEb5XfsqJXZop3hhwXaD8lc5yf38Baqw=
github.com/iliadenisov/dafsa v1.1.0 h1:NV1ZOstMdHXI/cCyAZKOD3qnKLoYdMUunA0+Baj7vR4=
github.com/iliadenisov/dafsa v1.1.0/go.mod h1:mG6Y0DdfRrqdXGqTEMb9Zx0Fl0NkP3ZDYesvxR+e14o=
golang.org/x/crypto v0.0.0-20190308221718-c2843e01d9a2/go.mod h1:djNgcEr1/C05ACkg1iLfiJU5Ep61QUkGW8qpdssI0+w=
golang.org/x/crypto v0.0.0-20191011191535-87dc89f01550/go.mod h1:yigFU9vqHzYiE8UmvKecakEJjdnWj3jj499lnFckfCI=
golang.org/x/exp v0.0.0-20190306152737-a1d7652674e8/go.mod h1:CJ0aWSM057203Lf6IL+f9T1iT9GByDxfZKAQTCR3kQA=
golang.org/x/exp v0.0.0-20201008143054-e3b2a7f2fdc7 h1:2/QncOxxpPAdiH+E00abYw/SaQG353gltz79Nl1zrYE=
golang.org/x/exp v0.0.0-20201008143054-e3b2a7f2fdc7/go.mod h1:1phAWC201xIgDyaFpmDeZkgf70Q4Pd/CNqfRtVPtxNw=
golang.org/x/image v0.0.0-20190227222117-0694c2d4d067/go.mod h1:kZ7UVZpmo3dzQBMxlp+ypCbDeSB+sBbTgSJuh5dn5js=
golang.org/x/image v0.0.0-20190802002840-cff245a6509b/go.mod h1:FeLwcggjj3mMvU+oOTbSwawSJRM1uh48EjtB4UJZlP0=
golang.org/x/mobile v0.0.0-20190719004257-d2bd2a29d028/go.mod h1:E/iHnbuqvinMTCcRqshq8CkpyQDoeVncDDYHnLhea+o=
golang.org/x/mod v0.1.1-0.20191105210325-c90efee705ee/go.mod h1:QqPTAvyqsEbceGzBzNggFXnrqF1CaUcvgkdR5Ot7KZg=
golang.org/x/mod v0.3.1-0.20200828183125-ce943fd02449/go.mod h1:s0Qsj1ACt9ePp/hMypM3fl4fZqREWJwdYDEqhRiZZUA=
golang.org/x/net v0.0.0-20190404232315-eb5bcb51f2a3/go.mod h1:t9HGtf8HONx5eT2rtn7q6eTqICYqUVnKs3thJo3Qplg=
golang.org/x/net v0.0.0-20190620200207-3b0461eec859/go.mod h1:z5CRVTTTmAJ677TzLLGU+0bjPO0LkuOLi4/5GtJWs/s=
golang.org/x/sync v0.0.0-20190423024810-112230192c58/go.mod h1:RxMgew5VJxzue5/jJTE5uejpjVlOe/izrB70Jof72aM=
golang.org/x/sys v0.0.0-20190215142949-d0b11bdaac8a/go.mod h1:STP8DvDyc/dI5b8T5hshtkjS+E42TnysNCUPdjciGhY=
golang.org/x/sys v0.0.0-20190312061237-fead79001313/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs=
golang.org/x/sys v0.0.0-20190412213103-97732733099d/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs=
golang.org/x/sys v0.0.0-20191001151750-bb3f8db39f24/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs=
golang.org/x/text v0.3.0/go.mod h1:NqM8EUOU14njkJ3fqMW+pc6Ldnwhi/IjpwHt7yyuwOQ=
golang.org/x/tools v0.0.0-20191119224855-298f0cb1881e/go.mod h1:b+2E5dAYhXwXZwtnZ6UAqBI28+e2cm9otk0dWdXHAEo=
golang.org/x/tools v0.0.0-20200207183749-b753a1ba74fa/go.mod h1:TB2adYChydJhpapKDTa4BR/hXlZSLoq2Wpct/0txZ28=
golang.org/x/xerrors v0.0.0-20190717185122-a985d3407aa7/go.mod h1:I/5z698sn9Ka8TeJc9MKroUUfqBBauWjQqLJ2OPfmY0=
golang.org/x/xerrors v0.0.0-20191011141410-1b5146add898/go.mod h1:I/5z698sn9Ka8TeJc9MKroUUfqBBauWjQqLJ2OPfmY0=