831ecd0cab
Root cause of the Grafana "readdirent /etc/grafana/dashboards: no such file or
directory": the CI runner checks out into an ephemeral act workspace that is
removed after the job, so binding the compose config files straight from it
dangles the mounts in the long-lived containers (verified the act source dir is
emptied after the job). caddy/otelcol/prometheus/tempo read their config once at
startup so they survive, but would break on a restart — same latent bug.
Fix (mirrors ../galaxy-game's $HOME/.galaxy-dev/monitoring): the deploy job seeds
the config dirs to a stable $HOME/.scrabble-deploy and the compose binds them via
${SCRABBLE_CONFIG_DIR:-.} (local runs keep "."). Documented in the compose header,
deploy/README.md and the ci.yaml step.
# deploy

The full Scrabble contour: `backend` + `gateway` + Postgres + the Telegram
connector (with a VPN sidecar) + the observability stack (OTel Collector →
Prometheus + Tempo → Grafana), fronted by a **caddy** that owns a single `/_gm`
Basic-Auth (the admin console + Grafana). Topology and the decision record are in
[`../docs/ARCHITECTURE.md`](../docs/ARCHITECTURE.md) §13; this file is the
operational reference for **every environment variable**.

## Services

| Service | Image | Role |
| --- | --- | --- |
| `caddy` | `caddy:2-alpine` | Edge proxy (alias `scrabble` on `edge`): single `/_gm` Basic-Auth → admin console + Grafana; everything else → gateway. TLS per `CADDY_SITE_ADDRESS`. |
| `gateway` | built (`gateway/Dockerfile`) | Public edge; serves the embedded SPA at `/` and `/telegram/`; Connect-RPC edge. |
| `backend` | built (`backend/Dockerfile`) | Domain service; bakes in the DAWG dictionaries; runs migrations at boot. |
| `postgres` | `postgres:17-alpine` | Database (named volume, `pg_isready` healthcheck). |
| `vpn` + `telegram` | sidecar + built (`platform/telegram/Dockerfile`) | Telegram connector; egresses through the AmneziaWG sidecar; internal gRPC at `telegram:9091`. |
| `otelcol` | `otel/opentelemetry-collector-contrib` | OTLP/gRPC `:4317` → Prometheus scrape (`:9464`) + Tempo. |
| `prometheus` | `prom/prometheus` | Metrics, 15d retention. |
| `tempo` | `grafana/tempo` | Traces, 72h retention. |
| `grafana` | `grafana/grafana` | Dashboards (provisioned), anonymous-admin behind caddy's `/_gm/grafana`. |

Networking: inter-service traffic is on the private `internal` network
(project-scoped DNS); only `caddy` joins the shared external `edge` network so the
host caddy can reach it at `scrabble:80`. `edge` must already exist on the host
(`docker network create edge`).

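Because `docker network create` fails when the network already exists, a re-runnable setup step can check first. This is a minimal sketch; `ensure_network` is a hypothetical helper name, not something the repo ships.

```sh
# Create a Docker network only when it is missing, so the step is safe to re-run.
# `ensure_network` is a hypothetical helper, not part of this repository.
ensure_network() {
  name="$1"
  if docker network inspect "$name" >/dev/null 2>&1; then
    echo "network $name already exists"
  else
    docker network create "$name" >/dev/null || return 1
    echo "network $name created"
  fi
}

# Usage: ensure_network edge
```
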
## Run it

**Locally.** Copy the template, fill in the required values, and bring it up:

```sh
cp deploy/.env.example deploy/.env   # then edit deploy/.env
docker network create edge           # once, if it does not exist
cd deploy && docker compose up -d --build
```

**In CI** (the test contour). The `deploy` job in `.gitea/workflows/ci.yaml` maps the
Gitea **`TEST_`-prefixed** secrets/variables onto the unprefixed names below and
runs `docker compose up -d --build` on the runner host. Stage 18 (prod) maps the
**`PROD_`** set the same way. For example, a Gitea secret named
`TEST_POSTGRES_PASSWORD` feeds the compose file's `POSTGRES_PASSWORD`.

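The prefix mapping itself lives in `ci.yaml`, which is not reproduced here. As an illustration of the same idea in plain shell, the sketch below copies every exported `PREFIX_*` variable onto its unprefixed name; `map_prefixed_env` is a hypothetical helper and only an assumption about how one might reproduce the mapping outside the workflow.

```sh
# Copy every exported PREFIX_* variable onto its unprefixed name.
# `map_prefixed_env` is a hypothetical helper, not part of ci.yaml.
map_prefixed_env() {
  prefix="$1"
  # awk's ENVIRON lists exported variables and copes with multi-line values.
  for name in $(awk -v p="$prefix" 'BEGIN { for (k in ENVIRON) if (index(k, p) == 1) print k }'); do
    case "$name" in *[!A-Za-z0-9_]*) continue ;; esac   # skip non-identifier names
    target="${name#"$prefix"}"
    [ -n "$target" ] || continue
    eval "export $target=\"\$$name\""
  done
}

# Usage: map_prefixed_env TEST_   # TEST_POSTGRES_PASSWORD -> POSTGRES_PASSWORD
```
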
The deploy job also **seeds the config files** (`caddy`, `otelcol`, `prometheus`,
`tempo`, `grafana`) to a stable host path (`$HOME/.scrabble-deploy`) and sets
`SCRABBLE_CONFIG_DIR` to it before `up`. The runner's checkout is an ephemeral act
workspace that is removed after the job, so binding config straight from it would
leave dangling mounts in the long-lived containers (Grafana would log
`no such file or directory`). Locally `SCRABBLE_CONFIG_DIR` defaults to `.`, so the
compose file binds from this directory.

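The seeding step can be sketched roughly as follows. `seed_config` is a hypothetical helper, and the one-directory-per-service layout is an assumption; the actual step is the one in `ci.yaml`.

```sh
# Copy the config directories out of the ephemeral checkout into a stable path
# and point compose at it. `seed_config` is a hypothetical helper; the
# per-service directory layout is an assumption.
seed_config() {
  src="$1"    # the checkout's deploy/ directory
  dest="$2"   # stable path, e.g. "$HOME/.scrabble-deploy"
  for dir in caddy otelcol prometheus tempo grafana; do
    if [ ! -d "$src/$dir" ]; then
      echo "missing config dir: $src/$dir" >&2
      return 1
    fi
    mkdir -p "$dest/$dir" || return 1
    # Copy into the existing directory rather than replacing it.
    cp -R "$src/$dir/." "$dest/$dir/" || return 1
  done
  export SCRABBLE_CONFIG_DIR="$dest"
}

# Usage (from the repository root):
#   seed_config deploy "$HOME/.scrabble-deploy" && (cd deploy && docker compose up -d --build)
```

Copying into the existing directory, instead of deleting and recreating it, is a deliberate choice in this sketch: it keeps the bind-mounted directory in place for containers that are not recreated by `up`.
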
## Required variables

`docker compose` aborts immediately if any of these is unset or empty (they use `:?`):

| Variable | Gitea kind | Purpose |
| --- | --- | --- |
| `POSTGRES_PASSWORD` | secret | Postgres password (also embedded in `BACKEND_POSTGRES_DSN`). |
| `AWG_CONF` | secret | AmneziaWG config for the VPN sidecar (the connector's only egress). **Must not contain a `DNS=` line**: it hijacks the shared netns's resolv.conf and breaks the connector resolving `otelcol` (telemetry export). Without that line, Docker's resolver handles both `otelcol` and `api.telegram.org`. |
| `GM_BASICAUTH_HASH` | secret | bcrypt hash gating `/_gm` (admin console + Grafana). Generate with `docker run --rm caddy:2-alpine caddy hash-password --plaintext '<pw>'`. |
| `TELEGRAM_MINIAPP_URL` | variable | The Mini App URL the connector hands out in deep links / buttons. |

**Plus at least one bot token**: `TELEGRAM_BOT_TOKEN_EN` or `TELEGRAM_BOT_TOKEN_RU`
(secrets). Compose cannot express "one of", so both default to empty, but the
connector **fails at boot** if both are empty.

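Since compose cannot express the "one of" rule, a preflight check run before `up` can report every missing value at once. The sketch below is illustrative: `check_required_env` is a hypothetical helper, and it only inspects the current shell environment (as in CI), not a local `deploy/.env` file.

```sh
# Report every missing required value in one pass instead of waiting for
# compose or the connector to abort. `check_required_env` is a hypothetical
# helper; it inspects the current shell environment only.
check_required_env() {
  missing=0
  for name in POSTGRES_PASSWORD AWG_CONF GM_BASICAUTH_HASH TELEGRAM_MINIAPP_URL; do
    eval "value=\"\${$name:-}\""
    if [ -z "$value" ]; then
      echo "required variable $name is unset or empty" >&2
      missing=1
    fi
  done
  # At least one of the two bot tokens must be non-empty.
  if [ -z "${TELEGRAM_BOT_TOKEN_EN:-}${TELEGRAM_BOT_TOKEN_RU:-}" ]; then
    echo "set TELEGRAM_BOT_TOKEN_EN or TELEGRAM_BOT_TOKEN_RU" >&2
    missing=1
  fi
  return "$missing"
}

# Usage: check_required_env && docker compose up -d --build
```
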
## Optional variables (with defaults)

| Variable | Gitea kind | Default | Purpose |
| --- | --- | --- | --- |
| `POSTGRES_DB` | variable | `scrabble` | Database name. |
| `POSTGRES_USER` | variable | `scrabble` | Database user. |
| `DICT_VERSION` | variable | `v1.0.0` | `scrabble-dictionary` release tag baked into the backend image (build-arg). |
| `LOG_LEVEL` | variable | `info` | Shared log level for backend / gateway / connector (`debug\|info\|warn\|error`). |
| `CADDY_SITE_ADDRESS` | variable | `:80` | Caddy site address. Test: `:80` (host caddy terminates TLS). Prod: a domain, so caddy does its own ACME. |
| `GM_BASICAUTH_USER` | variable | `gm` | Username for the `/_gm` Basic-Auth. |
| `GRAFANA_ROOT_URL` | variable | `/_gm/grafana/` | Grafana root URL (sub-path serving). Set the full `https://<domain>/_gm/grafana/` behind a real domain. |
| `GRAFANA_ADMIN_PASSWORD` | secret | `admin` | Grafana admin password. Low impact (the login form is disabled, access is anonymous-admin behind caddy) but set it anyway. |
| `TELEGRAM_GAME_CHANNEL_ID_EN` | variable | _(empty)_ | English game-channel id; empty/`0` disables channel posts. |
| `TELEGRAM_GAME_CHANNEL_ID_RU` | variable | _(empty)_ | Russian game-channel id; empty/`0` disables channel posts. |
| `TELEGRAM_TEST_ENV` | _pinned_ | `false` | `true` routes the bot through Telegram's test environment (`.../bot<token>/test/METHOD`). **The CI test contour pins this to `true` in `ci.yaml`** (the contour is the test environment); it is not a Gitea variable. Set it in `.env` for a local run; prod (Stage 18) leaves it `false`. |
| `TELEGRAM_API_BASE_URL` | variable | _(empty)_ | Override the Bot API host (a mock/self-hosted server); empty = `https://api.telegram.org`. |
| `GATEWAY_DEFAULT_SUPPORTED_LANGUAGES` | variable | `en,ru` | Variant-gating set for non-Telegram logins (web/email/guest). |
| `VITE_TELEGRAM_BOT_ID` | variable | _(empty)_ | UI build-arg: numeric bot id for the web Login Widget. |
| `VITE_TELEGRAM_LINK` | variable | _(empty)_ | UI build-arg: deep-link base for share-to-Telegram (e.g. `https://t.me/<bot>/<app>`). |
| `VITE_GATEWAY_URL` | variable | _(empty)_ | UI build-arg: gateway origin; empty = same-origin (the usual single-origin deploy). |

The three `VITE_*` variables are **build-args** baked into the gateway image at
build time, so changing them requires a rebuild (`--build`), not just a restart.

## Fixed internal wiring (not operator-set)

These are hard-wired in `docker-compose.yml` (no `${...}`) and point the services
at each other on the `internal` network. They are listed here so they are not
mistaken for missing config:

- `BACKEND_POSTGRES_DSN` (→ `postgres`, `search_path=backend`)
- `GATEWAY_BACKEND_HTTP_URL`/`_GRPC_ADDR` (→ `backend`)
- `GATEWAY_CONNECTOR_ADDR`/`BACKEND_CONNECTOR_ADDR` (→ `telegram:9091`)
- all three services' `*_OTEL_*_EXPORTER=otlp` → `OTEL_EXPORTER_OTLP_ENDPOINT=http://otelcol:4317`
  (`_INSECURE=true`)

The connector shares the VPN sidecar's netns: routing to the collector's internal
IP is fine (connected route), but its `AWG_CONF` must **not** set a `DNS=`
directive. That directive hijacks resolv.conf and breaks resolving `otelcol`
("produced zero addresses"); without it the netns uses Docker's resolver, which
resolves both `otelcol` and `api.telegram.org`.

`GATEWAY_ADMIN_*` is intentionally **unset**: caddy owns `/_gm` in the contour.

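The `DNS=` rule for `AWG_CONF` is easy to violate when pasting a generated client config, so it can be checked mechanically before deploying. This is a sketch under the assumption that `AWG_CONF` holds the config text itself (not a file path); `awg_conf_has_no_dns` is a hypothetical helper name.

```sh
# Succeed only when the given AmneziaWG config text has no DNS setting.
# `awg_conf_has_no_dns` is a hypothetical helper; it assumes the first
# argument is the config text itself, not a path to a file.
awg_conf_has_no_dns() {
  # Match "DNS = ..." at the start of a line, ignoring case and indentation;
  # commented-out lines ("# DNS = ...") do not match.
  if printf '%s\n' "$1" | grep -Eiq '^[[:space:]]*DNS[[:space:]]*='; then
    echo "AWG_CONF contains a DNS= line; remove it" >&2
    return 1
  fi
}

# Usage: awg_conf_has_no_dns "$AWG_CONF" || exit 1
```
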
## Host-side setup (outside this repo)

- **`edge` network** must exist on the host (`docker network create edge`).
- **Host caddy** route `<domain> → scrabble:80` (the in-compose caddy serves HTTP
  in the test contour; the host caddy terminates TLS). Not needed on prod, where the
  contour caddy owns TLS (set `CADDY_SITE_ADDRESS` to the domain).
- **Branch protection** required-status-check names are `CI / unit`,
  `CI / integration`, `CI / ui` (see [`../CLAUDE.md`](../CLAUDE.md) "Branching & CI").