Stage 16: insert Stage 17 (test-contour verification); renumber prod deploy to 18

- PLAN.md: new Stage 17 "Test-contour verification & defect fixes" (exercise the
  deployed contour end-to-end and fix what it surfaces — connector liveness check,
  path-conditional CI); the former prod-deploy stage becomes Stage 18.
- Renumber every "Stage 17" prod-deploy reference to "Stage 18" across docs,
  compose, Caddyfile, ci.yaml and CLAUDE.md; the post-Stage-14 split range is now
  "Stages 15–18".
Ilia Denisov
2026-06-05 16:57:17 +02:00
parent 0ea35fe991
commit efbaf657c6
9 changed files with 38 additions and 19 deletions
+27 -8
@@ -50,7 +50,8 @@ independent (see ARCHITECTURE §9.1).
| 14 | Solver & dictionary split (publish solver + scrabble-dictionary repo/artifact) | **done** |
| 15 | Dual Telegram bots & language-gated variants | **done** |
| 16 | Deploy infra & test contour (Dockerfiles, gateway static UI, compose, observability) | **done** |
-| 17 | Prod contour deploy (SSH export/import, manual after merge) | todo |
+| 17 | Test-contour verification & defect fixes | todo |
+| 18 | Prod contour deploy (SSH export/import, manual after merge) | todo |
Scaffolding is incremental: `go.work` lists only existing modules; each stage
adds the modules it needs.
@@ -244,7 +245,7 @@ indices; the premiums.ts parity-test rework.
### Stage 14 — Solver & dictionary split (TODO-1 + TODO-2)
Re-scoped from the original "CI & deploy": that was several sessions of work, so the
-deploy + observability + the two-bots idea were split into **Stages 15–17** below and this
+deploy + observability + the two-bots idea were split into **Stages 15–18** below and this
stage took only the dependency/artifact split that everything else builds on. Scope: publish
`scrabble-solver` as a versioned Gitea module and split the dictionary build into a new
`scrabble-dictionary` repo delivering a **release artifact**, then make `scrabble-game` consume
@@ -297,7 +298,25 @@ h2c wrap — `/` + `/telegram/` mounts; a committed `dist` placeholder so `go bu
build); Postgres healthcheck/volume; whether the connector-scoped compose is retired for the root one;
collector/Tempo/Prometheus retention.
-### Stage 17 — Prod contour deploy
+### Stage 17 — Test-contour verification & defect fixes
+Scope: exercise the deployed **test contour** end-to-end and fix the defects it surfaces — the
+"does it actually work in the contour" pass before prod. Bring up the `development` deploy, then
+verify each piece against a real run: the gateway serves the SPA at `/` and `/telegram/`; the admin
+console and Grafana sit behind the single `/_gm` Basic-Auth; the Telegram **bots** start (test
+environment) and the Mini App launches/authenticates; a game can be created and played through (web
++ Mini App); the **observability** stack receives data (Prometheus targets up, the dashboards
+populate incl. `accounts_created_total`/`active_users`, traces reach Tempo); the out-of-app push
+works. Fix the defects found and harden where the run exposes gaps — notably a CI **connector
+liveness check** (the deploy probe only hits the gateway today, so a crash-looping connector is
+invisible — that is how the Stage 16 test-env miss went unnoticed) and **path-conditional CI** (skip
+the jobs whose code did not change, behind a single always-running gate job so branch-protection
+required checks stay satisfiable — a skipped required check otherwise blocks the merge).
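The gate-job idea in the scope above can be sketched as a small decision function. This is only an illustration under assumptions: the `Result` type, its values, and `GatePasses` are hypothetical names, not part of the repository or of any CI system's API. The sketch treats a job skipped for unchanged paths as acceptable and anything else short of success as blocking.

```go
package main

import "fmt"

// Result is a hypothetical per-job outcome, as a CI system might report it.
type Result string

const (
	Success Result = "success"
	Skipped Result = "skipped" // job did not run because its paths were unchanged
	Failure Result = "failure"
)

// GatePasses reports whether the single always-running gate job should
// succeed: every upstream job must have succeeded or been skipped.
// Any other outcome (failure, cancelled, unknown) blocks the merge.
func GatePasses(results map[string]Result) bool {
	for _, r := range results {
		if r != Success && r != Skipped {
			return false
		}
	}
	return true
}

func main() {
	// Only the unit job ran; the other two were skipped by the path filter.
	results := map[string]Result{
		"unit":        Success,
		"integration": Skipped,
		"ui":          Skipped,
	}
	fmt.Println(GatePasses(results))
}
```

Branch protection would then require only the gate job, so a skipped `integration` or `ui` job no longer leaves a required check unreported.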
+Open details (interview at start): the verification checklist + pass bar; which discovered defects
+are in-scope vs deferred; the changed-paths design + the aggregate gate job; the connector
+liveness-check grace period (the VPN sidecar handshake lets the connector restart a few times before
+it settles).
+### Stage 18 — Prod contour deploy
Scope: the **production contour** on a remote host over SSH. Deploy by **container export/import**
(`docker save` → `scp`/ssh → `docker load` → `docker compose up` on the remote), the SSH key + host IP
in Gitea secrets; **strictly manual** (`workflow_dispatch`) after `development` is merged to `master`
@@ -905,7 +924,7 @@ provided cert) at the contour caddy; prod VPN; rollback.
CI & deploy (TODO-1, TODO-2, the collector + dashboards). The latter two were written
into the plan now as the agreed baseline (each still re-interviews at its own start).
(Stage 14 was itself later re-scoped to the solver/dictionary split alone; deploy +
-observability + the dual-bot idea split into Stages 15–17.)
+observability + the dual-bot idea split into Stages 15–18.)
- **Shared telemetry** (interview): a new `pkg/telemetry` owns the OTel provider
bootstrap (exporter selection, W3C propagators, shutdown, Go runtime metrics); the
backend `internal/telemetry` is now a thin facade over it (keeping its gin middleware),
@@ -985,7 +1004,7 @@ provided cert) at the contour caddy; prod VPN; rollback.
- **Stage 14** (interview + implementation, re-scoped + discharges TODO-1/TODO-2):
- **Re-scoped to the split** (interview): the original "CI & deploy" was several sessions of work,
so it was cut to the **solver/dictionary split** (the dependency foundation) and the deploy +
-observability + the dual-bot idea were written into the plan as new **Stages 15–17**. The deploy
+observability + the dual-bot idea were written into the plan as new **Stages 15–18**. The deploy
decisions taken at the interview are recorded there (embed the UI in the gateway via `go:embed`;
full Collector+Prometheus+Tempo+Grafana stack; **two contours** — test = auto on feature-branch
push on the local host, prod = manual SSH `docker save`/`load` after merge; `TEST_`/`PROD_` secret
@@ -1047,7 +1066,7 @@ provided cert) at the contour caddy; prod VPN; rollback.
`.gitea/workflows/ci.yaml` (Gitea has no cross-workflow `needs`) runs `unit`+`integration`+`ui` on a PR
into `development`/`master` and a **gated `deploy`** job (`needs` the three) that auto-rolls the test
contour **on a PR into — or a push to — `development`** (owner's "и PR, и push"). A PR into `master` is
-test-only; prod is the manual Stage 17. The former `go-unit`/`integration`/`ui-test` workflows were
+test-only; prod is the manual Stage 18. The former `go-unit`/`integration`/`ui-test` workflows were
folded in (no path filters — full CI on every PR, per the owner). Console kept plain (`NO_COLOR`,
`docker compose --ansi never`, `--progress plain`).
- **Gateway serves the UI** (interview, the §13 single-origin): a new `gateway/internal/webui` embeds
@@ -1066,7 +1085,7 @@ provided cert) at the contour caddy; prod VPN; rollback.
**supersedes Stage 10's** gateway-fronts-`/_gm` model **in the deploy topology** (the gateway's own
`/_gm` proxy stays for a local non-caddy run). TLS: the **host caddy** terminates it for the test
contour and forwards to `scrabble:80`; the in-compose caddy is parameterised (`CADDY_SITE_ADDRESS`) to
-own ACME on prod (Stage 17) where there is no host caddy.
+own ACME on prod (Stage 18) where there is no host caddy.
- **Networks** (engineering): inter-service traffic on a private `internal` network (project-scoped DNS,
no name collisions on the shared `edge`); only caddy joins the external `edge` (alias `scrabble`). The
connector keeps its VPN sidecar (the only egress that needs the tunnel). The connector-scoped
@@ -1094,7 +1113,7 @@ provided cert) at the contour caddy; prod VPN; rollback.
verified, but the option is idiomatic and now has a `bot` test asserting the `/bot<token>/test/getMe`
path). The test contour **pins `TELEGRAM_TEST_ENV=true` in `ci.yaml`** (the contour is the test
environment) rather than via a `TEST_`-prefixed variable — removing a confusing double-`TEST` operator
-knob and the secret-vs-variable footgun; prod (Stage 17) leaves it `false`.
+knob and the secret-vs-variable footgun; prod (Stage 18) leaves it `false`.
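For orientation, the `/bot<token>/test/getMe` path asserted by the `bot` test follows the Telegram Bot API convention of inserting `/test/` between the token and the method name when targeting the test environment. The helper below is a hypothetical illustration of that URL shape only; it is not the project's bot code and does not reproduce the library option the plan refers to.

```go
package main

import "fmt"

// MethodURL builds a Bot API method URL. MethodURL and its parameters are
// illustrative names; testEnv mirrors the TELEGRAM_TEST_ENV setting.
func MethodURL(base, token, method string, testEnv bool) string {
	if testEnv {
		// Test environment: /bot<token>/test/<method>
		return fmt.Sprintf("%s/bot%s/test/%s", base, token, method)
	}
	// Production environment: /bot<token>/<method>
	return fmt.Sprintf("%s/bot%s/%s", base, token, method)
}

func main() {
	// "123:ABC" is a placeholder token, not a real credential.
	fmt.Println(MethodURL("https://api.telegram.org", "123:ABC", "getMe", true))
	fmt.Println(MethodURL("https://api.telegram.org", "123:ABC", "getMe", false))
}
```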
## Deferred TODOs (cross-stage)