Stage 16: insert Stage 17 (test-contour verification); renumber prod deploy to 18
CI / unit (pull_request) Successful in 9s
CI / integration (pull_request) Successful in 12s
CI / ui (pull_request) Successful in 20s
CI / deploy (pull_request) Successful in 21s

- PLAN.md: new Stage 17 "Test-contour verification & defect fixes" (exercise the
  deployed contour end-to-end and fix what it surfaces — connector liveness check,
  path-conditional CI); the former prod-deploy stage becomes Stage 18.
- Renumber every "Stage 17" prod-deploy reference to "Stage 18" across docs,
  compose, Caddyfile, ci.yaml and CLAUDE.md; the post-Stage-14 split range is now
  "Stages 15–18".
This commit is contained in:
Ilia Denisov
2026-06-05 16:57:17 +02:00
parent 0ea35fe991
commit efbaf657c6
9 changed files with 38 additions and 19 deletions
+2 -2
View File
@@ -9,7 +9,7 @@ name: CI
# `development` or `master` (the full test suite — the merge gate) and on a push
# to `development` (after a merge). The deploy job runs only for `development`
# (PR or merge), so a PR into `master` is test-only; the prod deploy is a manual
# workflow (Stage 17).
# workflow (Stage 18).
#
# Console output is kept plain (NO_COLOR + `docker compose --ansi never` +
# `--progress plain`) so the Gitea logs stay readable.
@@ -176,7 +176,7 @@ jobs:
TELEGRAM_GAME_CHANNEL_ID_EN: ${{ vars.TEST_TELEGRAM_GAME_CHANNEL_ID_EN }}
TELEGRAM_GAME_CHANNEL_ID_RU: ${{ vars.TEST_TELEGRAM_GAME_CHANNEL_ID_RU }}
# The test contour always uses Telegram's test environment — pinned here,
# not an operator variable. Stage 17's prod workflow leaves it false.
# not an operator variable. Stage 18's prod workflow leaves it false.
TELEGRAM_TEST_ENV: "true"
VITE_TELEGRAM_BOT_ID: ${{ vars.TEST_VITE_TELEGRAM_BOT_ID }}
VITE_TELEGRAM_LINK: ${{ vars.TEST_VITE_TELEGRAM_LINK }}
+1 -1
View File
@@ -61,7 +61,7 @@ conversation memory — is the source of continuity. Keep it that way.
(`docker compose up -d --build` on the runner host + a `GET /` probe). A PR into
`master` is test-only.
- Merge `development → master` only when CI is green; the **prod** deploy is then a
**manual** workflow (Stage 17), never automatic. Secrets/variables are prefixed
**manual** workflow (Stage 18), never automatic. Secrets/variables are prefixed
`TEST_` / `PROD_` per contour (Gitea 1.26 has no deployment environments).
- After any push, watch the run to green before declaring a stage done — use the
ready-made watcher, never an inline poll loop:
+27 -8
View File
@@ -50,7 +50,8 @@ independent (see ARCHITECTURE §9.1).
| 14 | Solver & dictionary split (publish solver + scrabble-dictionary repo/artifact) | **done** |
| 15 | Dual Telegram bots & language-gated variants | **done** |
| 16 | Deploy infra & test contour (Dockerfiles, gateway static UI, compose, observability) | **done** |
| 17 | Prod contour deploy (SSH export/import, manual after merge) | todo |
| 17 | Test-contour verification & defect fixes | todo |
| 18 | Prod contour deploy (SSH export/import, manual after merge) | todo |
Scaffolding is incremental: `go.work` lists only existing modules; each stage
adds the modules it needs.
@@ -244,7 +245,7 @@ indices; the premiums.ts parity-test rework.
### Stage 14 — Solver & dictionary split (TODO-1 + TODO-2)
Re-scoped from the original "CI & deploy": that was several sessions of work, so the
deploy + observability + the two-bots idea were split into **Stages 1517** below and this
deploy + observability + the two-bots idea were split into **Stages 1518** below and this
stage took only the dependency/artifact split that everything else builds on. Scope: publish
`scrabble-solver` as a versioned Gitea module and split the dictionary build into a new
`scrabble-dictionary` repo delivering a **release artifact**, then make `scrabble-game` consume
@@ -297,7 +298,25 @@ h2c wrap — `/` + `/telegram/` mounts; a committed `dist` placeholder so `go bu
build); Postgres healthcheck/volume; whether the connector-scoped compose is retired for the root one;
collector/Tempo/Prometheus retention.
### Stage 17 — Prod contour deploy
### Stage 17 — Test-contour verification & defect fixes
Scope: exercise the deployed **test contour** end-to-end and fix the defects it surfaces — the
"does it actually work in the contour" pass before prod. Bring up the `development` deploy, then
verify each piece against a real run: the gateway serves the SPA at `/` and `/telegram/`; the admin
console and Grafana sit behind the single `/_gm` Basic-Auth; the Telegram **bots** start (test
environment) and the Mini App launches/authenticates; a game can be created and played through (web
+ Mini App); the **observability** stack receives data (Prometheus targets up, the dashboards
populate incl. `accounts_created_total`/`active_users`, traces reach Tempo); the out-of-app push
works. Fix the defects found and harden where the run exposes gaps — notably a CI **connector
liveness check** (the deploy probe only hits the gateway today, so a crash-looping connector is
invisible — that is how the Stage 16 test-env miss went unnoticed) and **path-conditional CI** (skip
the jobs whose code did not change, behind a single always-running gate job so branch-protection
required checks stay satisfiable — a skipped required check otherwise blocks the merge).
Open details (interview at start): the verification checklist + pass bar; which discovered defects
are in-scope vs deferred; the changed-paths design + the aggregate gate job; the connector
liveness-check grace period (the VPN sidecar handshake lets the connector restart a few times before
it settles).
### Stage 18 — Prod contour deploy
Scope: the **production contour** on a remote host over SSH. Deploy by **container export/import**
(`docker save``scp`/ssh → `docker load``docker compose up` on the remote), the SSH key + host IP
in Gitea secrets; **strictly manual** (`workflow_dispatch`) after `development` is merged to `master`
@@ -905,7 +924,7 @@ provided cert) at the contour caddy; prod VPN; rollback.
CI & deploy (TODO-1, TODO-2, the collector + dashboards). The latter two were written
into the plan now as the agreed baseline (each still re-interviews at its own start).
(Stage 14 was itself later re-scoped to the solver/dictionary split alone; deploy +
observability + the dual-bot idea split into Stages 1517.)
observability + the dual-bot idea split into Stages 1518.)
- **Shared telemetry** (interview): a new `pkg/telemetry` owns the OTel provider
bootstrap (exporter selection, W3C propagators, shutdown, Go runtime metrics); the
backend `internal/telemetry` is now a thin facade over it (keeping its gin middleware),
@@ -985,7 +1004,7 @@ provided cert) at the contour caddy; prod VPN; rollback.
- **Stage 14** (interview + implementation, re-scoped + discharges TODO-1/TODO-2):
- **Re-scoped to the split** (interview): the original "CI & deploy" was several sessions of work,
so it was cut to the **solver/dictionary split** (the dependency foundation) and the deploy +
observability + the dual-bot idea were written into the plan as new **Stages 1517**. The deploy
observability + the dual-bot idea were written into the plan as new **Stages 1518**. The deploy
decisions taken at the interview are recorded there (embed the UI in the gateway via `go:embed`;
full Collector+Prometheus+Tempo+Grafana stack; **two contours** — test = auto on feature-branch
push on the local host, prod = manual SSH `docker save`/`load` after merge; `TEST_`/`PROD_` secret
@@ -1047,7 +1066,7 @@ provided cert) at the contour caddy; prod VPN; rollback.
`.gitea/workflows/ci.yaml` (Gitea has no cross-workflow `needs`) runs `unit`+`integration`+`ui` on a PR
into `development`/`master` and a **gated `deploy`** job (`needs` the three) that auto-rolls the test
contour **on a PR into — or a push to — `development`** (owner's "и PR, и push"). A PR into `master` is
test-only; prod is the manual Stage 17. The former `go-unit`/`integration`/`ui-test` workflows were
test-only; prod is the manual Stage 18. The former `go-unit`/`integration`/`ui-test` workflows were
folded in (no path filters — full CI on every PR, per the owner). Console kept plain (`NO_COLOR`,
`docker compose --ansi never`, `--progress plain`).
- **Gateway serves the UI** (interview, the §13 single-origin): a new `gateway/internal/webui` embeds
@@ -1066,7 +1085,7 @@ provided cert) at the contour caddy; prod VPN; rollback.
**supersedes Stage 10's** gateway-fronts-`/_gm` model **in the deploy topology** (the gateway's own
`/_gm` proxy stays for a local non-caddy run). TLS: the **host caddy** terminates it for the test
contour and forwards to `scrabble:80`; the in-compose caddy is parameterised (`CADDY_SITE_ADDRESS`) to
own ACME on prod (Stage 17) where there is no host caddy.
own ACME on prod (Stage 18) where there is no host caddy.
- **Networks** (engineering): inter-service traffic on a private `internal` network (project-scoped DNS,
no name collisions on the shared `edge`); only caddy joins the external `edge` (alias `scrabble`). The
connector keeps its VPN sidecar (the only egress that needs the tunnel). The connector-scoped
@@ -1094,7 +1113,7 @@ provided cert) at the contour caddy; prod VPN; rollback.
verified, but the option is idiomatic and now has a `bot` test asserting the `/bot<token>/test/getMe`
path). The test contour **pins `TELEGRAM_TEST_ENV=true` in `ci.yaml`** (the contour is the test
environment) rather than via a `TEST_`-prefixed variable — removing a confusing double-`TEST` operator
knob and the secret-vs-variable footgun; prod (Stage 17) leaves it `false`.
knob and the secret-vs-variable footgun; prod (Stage 18) leaves it `false`.
## Deferred TODOs (cross-stage)
+1 -1
View File
@@ -98,6 +98,6 @@ docker compose -f deploy/docker-compose.yml config # validate (needs the
CI auto-deploys the **test contour** on a PR into — or push to — `development`
(`.gitea/workflows/ci.yaml`); the **prod contour** is a manual deploy after
`development → master` (Stage 17). Env reference: [`deploy/.env.example`](deploy/.env.example);
`development → master` (Stage 18). Env reference: [`deploy/.env.example`](deploy/.env.example);
the topology and the two-contour model are in
[`docs/ARCHITECTURE.md`](docs/ARCHITECTURE.md) §13.
+2 -2
View File
@@ -1,5 +1,5 @@
# Environment for deploy/docker-compose.yml. The CI deploy job (ci.yaml) maps the
# Gitea TEST_-prefixed secrets/variables onto these unprefixed names; Stage 17
# Gitea TEST_-prefixed secrets/variables onto these unprefixed names; Stage 18
# maps the PROD_-prefixed set the same way. Copy to deploy/.env for a local run.
#
# Full reference (required vs optional, defaults, secret-vs-variable): deploy/README.md.
@@ -17,7 +17,7 @@ LOG_LEVEL=info
# --- Edge / caddy -----------------------------------------------------------
# Test: ":80" (the host caddy terminates TLS and forwards to scrabble:80 on the
# external `edge` network). Prod (Stage 17): a domain so caddy does its own ACME.
# external `edge` network). Prod (Stage 18): a domain so caddy does its own ACME.
CADDY_SITE_ADDRESS=:80
GM_BASICAUTH_USER=gm
GM_BASICAUTH_HASH= # required; `caddy hash-password` bcrypt hash
+2 -2
View File
@@ -38,7 +38,7 @@ cd deploy && docker compose up -d --build
**In CI** (the test contour) — `.gitea/workflows/ci.yaml`'s `deploy` job maps the
Gitea **`TEST_`-prefixed** secrets/variables onto the unprefixed names below and
runs `docker compose up -d --build` on the runner host. Stage 17 (prod) maps the
runs `docker compose up -d --build` on the runner host. Stage 18 (prod) maps the
**`PROD_`** set the same way. So a Gitea secret named `TEST_POSTGRES_PASSWORD`
feeds the compose's `POSTGRES_PASSWORD`, etc.
@@ -71,7 +71,7 @@ connector **fails at boot** if both are empty.
| `GRAFANA_ADMIN_PASSWORD` | secret | `admin` | Grafana admin password. Low impact (the login form is disabled, access is anonymous-admin behind caddy) but set it anyway. |
| `TELEGRAM_GAME_CHANNEL_ID_EN` | variable | _(empty)_ | English game-channel id; empty/`0` disables channel posts. |
| `TELEGRAM_GAME_CHANNEL_ID_RU` | variable | _(empty)_ | Russian game-channel id; empty/`0` disables channel posts. |
| `TELEGRAM_TEST_ENV` | _pinned_ | `false` | `true` routes the bot through Telegram's test environment (`.../bot<token>/test/METHOD`). **The CI test contour pins this to `true` in `ci.yaml`** (the contour is the test environment) — it is not a Gitea variable. Set it in `.env` for a local run; prod (Stage 17) leaves it `false`. |
| `TELEGRAM_TEST_ENV` | _pinned_ | `false` | `true` routes the bot through Telegram's test environment (`.../bot<token>/test/METHOD`). **The CI test contour pins this to `true` in `ci.yaml`** (the contour is the test environment) — it is not a Gitea variable. Set it in `.env` for a local run; prod (Stage 18) leaves it `false`. |
| `TELEGRAM_API_BASE_URL` | variable | _(empty)_ | Override the Bot API host (a mock/self-hosted server); empty = `https://api.telegram.org`. |
| `GATEWAY_DEFAULT_SUPPORTED_LANGUAGES` | variable | `en,ru` | Variant-gating set for non-Telegram logins (web/email/guest). |
| `VITE_TELEGRAM_BOT_ID` | variable | _(empty)_ | UI build-arg: numeric bot id for the web Login Widget. |
+1 -1
View File
@@ -4,7 +4,7 @@
# Connect edge) goes to the gateway. Mirrors ../galaxy-game's /_gm model.
#
# CADDY_SITE_ADDRESS is ":80" in the test contour (the host caddy terminates TLS
# and forwards); set it to a domain in prod (Stage 17) so this caddy does its own
# and forwards); set it to a domain in prod (Stage 18) so this caddy does its own
# ACME and the contour is self-contained.
{
admin off
+1 -1
View File
@@ -11,7 +11,7 @@
# - `edge` (external): the host caddy reaches this contour at `scrabble:80`
# (the in-compose caddy's alias). The in-compose caddy terminates only HTTP in
# the test contour; the host caddy terminates TLS and forwards. For prod
# (Stage 17, no host caddy) set CADDY_SITE_ADDRESS to the domain so the caddy
# (Stage 18, no host caddy) set CADDY_SITE_ADDRESS to the domain so the caddy
# does its own ACME — the contour is then self-contained.
# - The connector egresses to api.telegram.org through the `vpn` sidecar
# (network_mode: service:vpn); it answers internal gRPC at `telegram:9091`.
+1 -1
View File
@@ -569,7 +569,7 @@ Two contours, two secret/variable prefixes (`TEST_` / `PROD_`):
host, then a `GET /` probe through caddy). The host caddy terminates TLS and
forwards the domain to `scrabble:80`, so the in-compose caddy serves plain HTTP
(`CADDY_SITE_ADDRESS=:80`).
- **Prod** (Stage 17): a manual SSH deploy after `development → master`. There is no
- **Prod** (Stage 18): a manual SSH deploy after `development → master`. There is no
host caddy, so the contour ships its own caddy terminating TLS — set
`CADDY_SITE_ADDRESS` to the domain and the caddy does its own ACME.