Stage 12: observability & performance (OTel/OTLP, domain metrics, guest GC)
- pkg/telemetry: shared OTel provider bootstrap (none/stdout/otlp + W3C
propagators + Go runtime metrics); backend/internal/telemetry becomes a thin
facade keeping its gin middleware.
- Telemetry parity: gateway and the Telegram connector gain telemetry runtimes
and config (GATEWAY_/TELEGRAM_ SERVICE_NAME + OTEL_*); otelgrpc instruments the
backend push server, the gateway's backend+connector clients and the connector
server. Default exporter stays none (collector/dashboards are Stage 14).
- Operational metrics (variant attribute on game-scoped ones): game_replay_duration,
game_move_validate_duration, games_started_total, games_abandoned_total,
game_cache_active, chat_messages_total{kind}, gateway edge_request_duration.
Wired via the SetMetrics setter pattern (default no-op meter).
- TODO-3: account.GuestReaper deletes guests with no game seat past
BACKEND_GUEST_RETENTION (default 30d, swept every BACKEND_GUEST_REAP_INTERVAL).
- Tests: pkg/telemetry exporter selection; game/social/edge metric recording via
a manual reader; config (otlp accepted, guest knobs); inttest guest reaper.
- Docs: PLAN.md re-scopes Stage 12 and adds Stage 13 (alphabet-on-wire) + Stage 14
(CI/deploy) with the agreed dictionary-versioning resolution; ARCHITECTURE 11/13,
TESTING, the three READMEs and FUNCTIONAL(+ru) updated.
@@ -45,7 +45,9 @@ independent (see ARCHITECTURE §9.1).
 | 9 | Telegram integration (bot side-service, deep-link, push) | **done** |
 | 10 | Admin & dictionary ops (complaint review, version reload) | **done** |
 | 11 | Account linking & merge | **done** |
-| 12 | Polish (observability, perf with evidence, deploy) | todo |
+| 12 | Observability & performance (telemetry, metrics, guest GC) | todo |
+| 13 | Alphabet on the wire (UI alphabet-agnostic) | todo |
+| 14 | CI & deploy (multi-service, dictionary artifacts) | todo |

 Scaffolding is incremental: `go.work` lists only existing modules; each stage
 adds the modules it needs.
@@ -204,10 +206,68 @@ dedupe). High blast-radius — focused regression tests.
 Open details: conflict resolution (active games on both, duplicate friends,
 display-name collisions); irreversibility/audit; confirm-flow per platform.

-### Stage 12 — Polish
-Scope: observability dashboards, evidence-based performance work, prod
-build/deploy.
-Open details: deployment target/host; dashboards; load expectations.
+### Stage 12 — Observability & performance
+Scope: wire a configurable **OTLP** exporter (alongside `none`/`stdout`), shared in a
+new `pkg/telemetry`; add telemetry to the **gateway** and the **Telegram connector**
+(providers + `otelgrpc` on the gRPC hops) for parity with the backend; add
+domain/operational **metrics** close to the business (game replay/validate timings,
+started/abandoned games, live-cache size, chat/nudge counts, the edge roundtrip, Go
+runtime metrics); discharge **TODO-3** (abandoned-guest GC). The OTLP collector and
+dashboards are stood up with the deploy (Stage 14); the default exporter stays `none`,
+so CI needs no collector. Performance is operational-metric instrumentation, not
+speculative optimisation (the standing "evidence first" rule — no measured hotspot yet).
+Open details: exporter default and whether a collector is stood up now; the metric set
+and its attributes; the guest-reaper trigger given revoke-only sessions.
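As a rough illustration of the exporter selection in the Stage 12 scope, the Go sketch below maps a configured exporter name to one of the three modes and falls back to `none` when the value is empty. The `Exporter` type and `ParseExporter` function are hypothetical names for this example and are not taken from `pkg/telemetry`; building the actual OTel providers is left out.

```go
package main

import (
	"fmt"
	"strings"
)

// Exporter is a hypothetical enum of the three exporter modes named in the plan.
type Exporter int

const (
	ExporterNone   Exporter = iota // default: nothing is exported, so CI needs no collector
	ExporterStdout                 // print telemetry to standard output
	ExporterOTLP                   // send telemetry to an OTLP endpoint
)

// ParseExporter maps a config value to an Exporter.
// An empty value selects the default (none); unknown values are rejected.
func ParseExporter(v string) (Exporter, error) {
	switch strings.ToLower(strings.TrimSpace(v)) {
	case "", "none":
		return ExporterNone, nil
	case "stdout":
		return ExporterStdout, nil
	case "otlp":
		return ExporterOTLP, nil
	default:
		return ExporterNone, fmt.Errorf("unknown exporter %q (want none, stdout or otlp)", v)
	}
}

func main() {
	for _, v := range []string{"", "stdout", "otlp", "jaeger"} {
		e, err := ParseExporter(v)
		fmt.Printf("%q -> %d, err=%v\n", v, e, err)
	}
}
```

Rejecting unknown values at config load, rather than silently falling back, is a design choice of this sketch; it keeps a typo from disabling telemetry unnoticed.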
+
+### Stage 13 — Alphabet on the wire (TODO-4)
+Scope: make the UI **alphabet-agnostic**. On game-screen load the client receives the
+variant's alphabet table `(letter, index, value)` for **display only**, caches it in
+memory by variant (a request flag gates whether the table is included, so it is not
+resent on every state poll); live play then exchanges **letter indices** both ways, and
+**word-check** sends indices, constraining input to the variant's alphabet. The engine
+already works in alphabet-index bytes, so the wire does *less* decoding in live play; the
+durable journal / history / GCG stay decoded concrete characters (the §9.1
+dictionary-independent invariant is untouched). The alphabet comes from the **solver's
+rules** (not the DAWG), so the wire table is pinned by the solver version. **Index-drift
+caveat:** the running solver, the DAWGs (built against it — Stage 14 / TODO-2) and the
+wire table must agree, or letter indexing silently corrupts. Blast radius: `pkg/fbs`
+(a new Alphabet table; index fields in `StateView`/rack and in
+`SubmitPlay`/`Exchange`/`check_word`) → backend DTO encode/decode → UI
+`codec.ts`/`premiums.ts` → board/rack render, the move/exchange/word-check senders, the
+mock transport and the Vitest tests.
+Open details: the fbs shape and `include_alphabet` flag placement; whether to keep
+concrete-letter fields during the transition; whether tile exchange moves fully to
+indices; the premiums.ts parity-test rework.
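To make the index exchange concrete, here is a minimal Go sketch of a per-variant `(letter, index, value)` table with encode and decode helpers. The `Entry` and `Alphabet` types and the three-letter table are made up for illustration; the real table shape is still an open detail of the fbs schema, and the real indices come from the solver's rules.

```go
package main

import "fmt"

// Entry is one row of the (letter, index, value) table described above.
type Entry struct {
	Letter rune
	Index  byte
	Value  int
}

// Alphabet is a hypothetical in-memory table for one variant.
type Alphabet struct {
	byLetter map[rune]byte
	byIndex  map[byte]Entry
}

// NewAlphabet builds both lookup directions from the wire table.
func NewAlphabet(entries []Entry) *Alphabet {
	a := &Alphabet{byLetter: map[rune]byte{}, byIndex: map[byte]Entry{}}
	for _, e := range entries {
		a.byLetter[e.Letter] = e.Index
		a.byIndex[e.Index] = e
	}
	return a
}

// Encode converts a word to letter indices and rejects letters outside the alphabet.
func (a *Alphabet) Encode(word string) ([]byte, error) {
	out := make([]byte, 0, len(word))
	for _, r := range word {
		idx, ok := a.byLetter[r]
		if !ok {
			return nil, fmt.Errorf("letter %q is not in the alphabet", r)
		}
		out = append(out, idx)
	}
	return out, nil
}

// Decode converts letter indices back to concrete letters for display.
func (a *Alphabet) Decode(indices []byte) (string, error) {
	runes := make([]rune, 0, len(indices))
	for _, i := range indices {
		e, ok := a.byIndex[i]
		if !ok {
			return "", fmt.Errorf("index %d is not in the alphabet", i)
		}
		runes = append(runes, e.Letter)
	}
	return string(runes), nil
}

func main() {
	// Illustrative table only: the indices here are invented for the example.
	abc := NewAlphabet([]Entry{{'A', 0, 1}, {'B', 1, 3}, {'C', 2, 3}})
	idx, err := abc.Encode("CAB")
	fmt.Println(idx, err)
	word, err := abc.Decode(idx)
	fmt.Println(word, err)
}
```

The error path in `Encode` is what constrains word-check input to the variant's alphabet. The index-drift caveat shows up here too: decoding with a table built from a different solver version would yield wrong letters without any error.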
+
+### Stage 14 — CI & deploy
+Scope: the full **multi-service production deploy** plus the observability backend, also
+discharging **TODO-1** and **TODO-2**. Backend + gateway **Dockerfiles** (multi-stage
+distroless, mirroring the Stage 9 connector image); the gateway gains **static UI
+serving** (the §13 single-origin model — mini-landing at `/`, Mini App under
+`/telegram/`), documented since Stage 9 but **not yet implemented**; prod UI build vars
+(`VITE_TELEGRAM_BOT_ID` for the Login Widget, the Mini App URL / share link); a root
+`deploy/docker-compose.yml` (backend + gateway + Postgres + connector + the OTLP
+collector / Grafana stack) on the external `edge` network behind the host caddy, the VPN
+sidecar only for the connector; a **deploy workflow** mirroring `../15-puzzle` (host-mode
+runner, `docker compose up -d --build`, no external registry, env from Gitea secrets, a
+post-deploy probe). Stand up the **OTLP collector + dashboards** (the export wiring landed
+in Stage 12).
+- **TODO-1 — publish & version the solver:** tag/publish `scrabble-solver`, drop the
+`go.work` replace + the CI clone, pin a version in `backend/go.mod` (or keep cloning the
+sibling as the minimal-diff fallback). The DAWGs are delivered separately regardless.
+- **TODO-2 — versioned dictionary artifacts:** a **new versioned repo** for the wordlist
+parsers + built DAWGs, delivered as a **release artifact** (Gitea release / OCI / object
+store — not `go get`; DAWGs are data). **One semver label `vX.Y.Z` for the whole set**,
+additive: a deploy drops a new `BACKEND_DICT_DIR/<version>/` subdir;
+`engine.OpenWithVersions` loads every present subdir at boot; `BACKEND_DICT_VERSION`
+selects the default for **new** games. A new version never breaks a running backend
+(each game pins its `dict_version`; versions are additive); **only active games need a
+dictionary** (validate-at-submit — finished games replay the dictionary-independent
+journal), so a version is safe to retire once no active game pins it. The dict repo must
+build against the **same `dafsa`/`alphabet`/solver** the backend runs, or letter indexing
+drifts (ties into Stage 13).
+Open details: embed-vs-mount for the UI build and the DAWG set; the OTLP collector /
+dashboard stack; solver-publish vs clone-in-build; load expectations.
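The retirement rule for dictionary versions in TODO-2 can be sketched as a small Go helper. `Retirable` is a hypothetical function for illustration, not part of the engine; keeping the default version regardless of pins is an assumption of this sketch, made because new games still need it.

```go
package main

import (
	"fmt"
	"sort"
)

// Retirable returns the loaded dictionary versions that no active game pins.
// activePins maps a version label to the number of active games pinning it.
// The default version is never returned (assumption: new games still need it).
func Retirable(loaded []string, activePins map[string]int, defaultVersion string) []string {
	var out []string
	for _, v := range loaded {
		if v == defaultVersion || activePins[v] > 0 {
			continue
		}
		out = append(out, v)
	}
	sort.Strings(out)
	return out
}

func main() {
	loaded := []string{"v1.0.0", "v1.1.0", "v1.2.0"}
	pins := map[string]int{"v1.1.0": 4} // four active games still pin v1.1.0
	fmt.Println(Retirable(loaded, pins, "v1.2.0"))
}
```

Finished games do not appear in `activePins` at all, which matches the plan: they replay the dictionary-independent journal and need no dictionary.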

 ## Refinements logged during implementation

@@ -796,12 +856,50 @@ Open details: deployment target/host; dashboards; load expectations.
 ./pkg/... ./platform/telegram/...`; integration stays `./backend/...`. UI ~90 KB gzip
 JS (budget 100 KB). New error code `merge_active_game_conflict`.

+- **Stage 12** (interview + implementation):
+- **Re-scoped & split** (interview): the original "Polish (observability + perf +
+deploy)" was too large for one session, so it was split — **Stage 12** = observability
++ performance + guest GC; **Stage 13** = alphabet-on-the-wire (TODO-4); **Stage 14** =
+CI & deploy (TODO-1, TODO-2, the collector + dashboards). The latter two were written
+into the plan now as the agreed baseline (each still re-interviews at its own start).
+- **Shared telemetry** (interview): a new `pkg/telemetry` owns the OTel provider
+bootstrap (exporter selection, W3C propagators, shutdown, Go runtime metrics); the
+backend `internal/telemetry` is now a thin facade over it (keeping its gin middleware),
+and the gateway and connector gained telemetry runtimes. A configurable **`otlp`**
+exporter was added alongside `none`/`stdout`; the **default stays `none`**, the OTLP
+endpoint comes from the standard `OTEL_EXPORTER_OTLP_*` env, and the collector +
+dashboards are Stage 14 (so CI needs none). `otelgrpc` instruments the backend push
+server, the gateway's backend + connector clients, and the connector's gRPC server.
+New config `GATEWAY_SERVICE_NAME`/`GATEWAY_OTEL_*` and `TELEGRAM_SERVICE_NAME`/
+`TELEGRAM_OTEL_*`; the backend's existing `BACKEND_OTEL_*` gained the `otlp` value.
+- **Metrics = operational, business-near** (interview): histograms
+`game_replay_duration` and `game_move_validate_duration`; counters
+`games_started_total`, `games_abandoned_total` (a turn-timeout seat drop) and
+`chat_messages_total` (`kind`=message/nudge); an observable gauge `game_cache_active`;
+the gateway `edge_request_duration` (`message_type`/`result`); plus Go runtime/heap
+metrics. Game-scoped metrics carry a **`variant`** attribute
+(english/russian_scrabble/erudit — chosen over a coarser `language`, which it
+subsumes); the gateway edge metric is variant-agnostic. Optional wiring uses the
+established `SetMetrics`/`SetNotifier` setter pattern (default no-op meter), so existing
+constructors and tests are untouched. **No speculative optimisation** — there is no
+measured hotspot; the deliverable is the instrumentation (the standing "performance only
+with evidence" rule). pprof was not added (reframed away by the owner).
+- **Guest GC** (interview, TODO-3): age-based, no-seat-only — see the discharged TODO-3
+below; new config `BACKEND_GUEST_REAP_INTERVAL`/`BACKEND_GUEST_RETENTION`.
+- **Deps/CI**: new OTel modules (the OTLP exporters,
+`contrib/instrumentation/runtime`, `otelgrpc`) added with the no-tidy pattern
+(`go mod edit` + `go mod download` + `go work sync`; `pkg` carries no bare-path dep, so
+it tidies cleanly). No workflow change — the Go workflows already span
+`./backend/... ./gateway/... ./pkg/... ./platform/telegram/...`, integration stays
+`./backend/...`, and the default `none` exporter keeps CI collector-free.
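The setter pattern with a default no-op, mentioned in the metrics entry, might look roughly like the Go sketch below. All names here (`Recorder`, `Service`, `countingRecorder`) are hypothetical and use a hand-rolled interface instead of the OTel meter API, so the example stays self-contained.

```go
package main

import "fmt"

// Recorder is a hypothetical, minimal metrics interface for this sketch.
type Recorder interface {
	GameStarted(variant string)
}

// noopRecorder is the default: it accepts calls and records nothing.
type noopRecorder struct{}

func (noopRecorder) GameStarted(string) {}

// Service stands in for a domain service that optionally reports metrics.
type Service struct {
	metrics Recorder
}

// NewService keeps its signature free of metrics, so existing callers are untouched.
func NewService() *Service {
	return &Service{metrics: noopRecorder{}}
}

// SetMetrics swaps in a real recorder; a nil argument keeps the current one.
func (s *Service) SetMetrics(r Recorder) {
	if r != nil {
		s.metrics = r
	}
}

// StartGame records the event with the variant attribute.
func (s *Service) StartGame(variant string) {
	s.metrics.GameStarted(variant)
}

// countingRecorder is a test double that counts starts per variant.
type countingRecorder struct {
	started map[string]int
}

func (c *countingRecorder) GameStarted(variant string) {
	c.started[variant]++
}

func main() {
	svc := NewService()
	svc.StartGame("english") // goes to the no-op recorder

	rec := &countingRecorder{started: map[string]int{}}
	svc.SetMetrics(rec)
	svc.StartGame("erudit")
	fmt.Println(rec.started)
}
```

Because the constructor always installs the no-op recorder, call sites never need a nil check, and tests that do not care about metrics need no extra setup.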
+
 ## Deferred TODOs (cross-stage)

 - **TODO-1 — publish & version the solver.** Once `scrabble-solver` is stable,
 give it a real module URL and switch `backend` to a versioned dependency,
 dropping the `go.work` replace and the CI clone. Removes the floating
-`master` dependency accepted for now (Stage 2 interview).
+`master` dependency accepted for now (Stage 2 interview). **Planned for Stage 14**
+(it cleans up the backend Docker build; a clone-in-build fallback stays available).
 - **TODO-2 — split the solver into engine vs dictionary generator + versioned
 dictionary artifacts.** Owner's idea, with the caveats agreed at the Stage 2
 interview: the split is sound (build-time wordlist→DAWG vs runtime load have
@@ -817,12 +915,22 @@ Open details: deployment target/host; dashboards; load expectations.
 `BACKEND_DICT_DIR/<version>/` loaded via `Registry.LoadAvailable`, restart-restored by
 `engine.OpenWithVersions`) — keep the `BACKEND_DICT_DIR` directory as
 the runtime contract: a new `.dawg` appears in it and is loaded with
-`dawg.Load`.
-- **TODO-3 — garbage-collect abandoned guest accounts.** Stage 6 makes a guest a
-durable `accounts` row (no identity, `is_guest`), so an ephemeral guest leaves a
-row behind. Add a periodic reaper (or a finished-and-idle sweep) that deletes
-guest accounts with no active games once their last session is gone; the
-`ON DELETE CASCADE` foreign keys clean up the dependent rows.
+`dawg.Load`. **Planned for Stage 14**, agreed resolution: a **new versioned repo**
+for the parsers + built DAWGs, delivered as a **release artifact** (not `go get`),
+versioned with **one semver label for the whole set** (additive; old versions retired
+once no active game pins them — see Stage 14). The generator must build against the same
+`dafsa`/`alphabet`/solver as the runtime (the index-drift caveat, shared with TODO-4).
+- ~~**TODO-3 — garbage-collect abandoned guest accounts.**~~ **Done in Stage 12.**
+A periodic `account.GuestReaper` deletes guests (`is_guest`) **with no game seat at
+all** whose account age exceeds `BACKEND_GUEST_RETENTION` (default 30 d, swept every
+`BACKEND_GUEST_REAP_INTERVAL`, default 1 h). Two schema facts shaped this, narrowing
+the original sketch: (1) `game_players`/`chat_messages`/`complaints` reference accounts
+**without** `ON DELETE CASCADE`, and a finished game belongs to the other players'
+history, so a guest with any seat is retained (a delete would be blocked anyway) — hence
+"no seat", not "no active game"; (2) sessions are revoke-only with no maintained
+`last_seen_at`, so a lingering session never expires and **account age** is the
+abandonment trigger, not "last session gone". The reaped guest's `sessions`/`identities`/
+`account_stats` fall away via their own `ON DELETE CASCADE`.
 - **TODO-4 — put the per-game alphabet on the wire (owner's idea, Stage 7).** Today the
 client hardcodes each variant's letters/values (ported into `ui/src/lib/premiums.ts`
 from `scrabble-solver/rules/rules.go`) and the edge exchanges plays/hints by concrete
@@ -830,6 +938,10 @@ Open details: deployment target/host; dashboards; load expectations.
 value)` table so the UI stops duplicating it, and optionally moving tile exchange to
 letter **indices** end-to-end. Caveat (as for the dictionaries, TODO-2): the wire table
 must stay pinned to the same `rules.Alphabet` the engine uses, or indices drift.
+**Planned for Stage 13**, expanded (owner) to a fully **alphabet-agnostic UI**: the
+client caches the per-variant table (display only) behind an `include_alphabet` request
+flag and exchanges indices both ways, word-check included; the durable journal stays
+concrete characters (§9.1). See Stage 13.
 - **TODO-5 — QR friend codes (owner's idea, Stage 8).** *Partially done in Stage 9:*
 the deep-link scheme now exists (`f<code>`, shared Go ↔ TS), the bot redeems it on
 launch, and the UI shows a **share-to-Telegram** link for an issued code when