# `tools/dev-deploy/` — known issues
Issues that surface in the long-lived dev environment but are not yet
fixed. Each entry lists the observed symptom, the diagnostic evidence,
the working hypothesis, and the open questions that have to be
answered before a fix lands.
## Dev Sandbox game flips to `cancelled` after a `dev-deploy` redispatch
### Symptom
A previously `running` "Dev Sandbox" game (created by
`backend/internal/devsandbox`) transitions to `cancelled` ~15 minutes
after a `dev-deploy.yaml` workflow_dispatch run finishes. The user's
browser session survives (the same `device_session_id` keeps working),
but the lobby shows no game because the only game it had is now
terminal. `purgeTerminalSandboxGames` does pick it up on the **next**
boot and creates a fresh sandbox, but the first redispatch leaves
the user with an empty lobby until the backend restarts again.
### Diagnostic evidence
Backend logs from the broken cycle (timestamps abbreviated):
```text
20:24:40 dev_sandbox: purged terminal sandbox game game_id=<prev> status=cancelled
20:24:40 dev_sandbox: memberships ensured count=20 game_id=<new>
20:24:40 dev_sandbox: bootstrap complete user_id=<owner> game_id=<new> status=starting
...
20:25:09 user mail sent failed (diplomail tables missing — unrelated)
...
20:39:40 lobby: game cancelled by runtime reconciler game_id=<new>
op=reconcile status=removed message="container disappeared"
```
Between 20:24:40 (`status=starting`) and 20:39:40 (reconciler cancel)
the backend logs are silent on the runtime / engine paths — no
`engine spawned`, no `engine container started`, no `runtime
transition` lines. The reconciler then fires and reports the engine
container as missing.
`docker ps -a --filter 'label=org.opencontainers.image.title=galaxy-game-engine'`
returns no rows during this window: the engine container is neither
running nor stopped on the host, so it was either never spawned or
was removed before the snapshot was taken.
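To timestamp the disappearance in a future repro, a minimal polling
loop (interval and log path are arbitrary choices; run it in tmux or
a systemd unit on the dev host):
```sh
# Snapshot all engine containers (running or exited) once a minute so the
# exact minute the container vanishes is on record next time.
while true; do
  date -Is
  docker ps -a \
    --filter 'label=org.opencontainers.image.title=galaxy-game-engine' \
    --format '{{.ID}}  {{.Status}}  {{.Names}}'
  sleep 60
done | tee -a /tmp/engine-watch.log
```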
### What has been ruled out
A live `docker inspect` on a healthy engine container shows:
```text
Labels: galaxy.backend=1, galaxy.engine_version=0.1.0,
galaxy.game_id=<uuid>,
org.opencontainers.image.title=galaxy-game-engine,
com.galaxy.{cpu_quota,memory,pids_limit}
AutoRemove: false
RestartPolicy: on-failure
NetworkMode: galaxy-dev-internal
```
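For later re-checks, a `--format` one-liner pulls just these fields;
the command substitution assumes a single running engine container:
```sh
# Print only the reap-relevant fields from the first engine container found.
docker inspect --format \
  'Labels={{json .Config.Labels}}
AutoRemove={{.HostConfig.AutoRemove}}
RestartPolicy={{.HostConfig.RestartPolicy.Name}}
NetworkMode={{.HostConfig.NetworkMode}}' \
  "$(docker ps -q --filter 'label=org.opencontainers.image.title=galaxy-game-engine' | head -n1)"
```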
There are no `com.docker.compose.*` labels and `AutoRemove=false`,
so `--remove-orphans` cannot reap the engine and a `--rm`-style
self-destruct is not in play. Two redispatches captured with
`docker events` (filtered to the `create`, `start`, `die`,
`destroy`, `kill`, and `stop` events, one value per `--filter`
flag) also confirmed it: across both runs the only `die` / `destroy`
events were for `galaxy-dev-{backend,api,caddy}`. The live engine
container survived both redispatches, and the reconciler that
fires 60 seconds after the new backend boots correctly matched
it through `byGameID` / `byContainerID`.
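For the next capture, a sketch of the full command (the output path
is arbitrary):
```sh
# Stream container lifecycle events during a redispatch; start this before
# dispatching the workflow and Ctrl-C once the run completes. The exitCode
# attribute is only present on die events.
docker events \
  --filter type=container \
  --filter event=create --filter event=start \
  --filter event=die    --filter event=destroy \
  --filter event=kill   --filter event=stop \
  --format '{{.Time}} {{.Action}} {{.Actor.Attributes.name}} exitCode={{index .Actor.Attributes "exitCode"}}' \
  | tee /tmp/redispatch-events.log
```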
`backend/internal/runtime/service.go` only removes engine
containers from the explicit `runStop` / `runRestart` / `runPatch`
paths. There is no `runtime.Service.Shutdown` that proactively
kills containers on backend exit, so a graceful SIGTERM to
`galaxy-dev-backend` will not touch its child engine containers.
### Remaining hypotheses
1. **Engine self-crashed and was reaped by something host-side.**
   `RestartPolicy=on-failure` only restarts on a nonzero exit; if
   the engine exited cleanly (status 0) Docker does not restart it,
   but does keep the row in `docker ps -a`. The reproduction case
   had the engine missing from `docker ps -a` entirely, so a
   separate cleanup (cron `docker container prune`, a host script,
   manual `docker rm`) needs to be ruled out; the event-replay
   sketch after this list is the cheapest check.
2. **An out-of-band Docker daemon restart dropped the container.**
A `dockerd` restart that loses sight of an unmanaged container
is rare, but would explain why both the live tracking and
`docker ps -a` are empty. Correlate the gap with
`journalctl -u docker` on the host.
3. **`runStart` errored at `waitForEngineHealthz` or `Engine.Init`
and the engine exited on its own before `status=running`.**
   Bootstrap logs `status=starting` and then goes silent until the
   reconciler fires 15 minutes later; the runtime row in that case
should have been written with `status=engine_unreachable`, so
any reproduction needs a `runtime_records` snapshot from the
bad window — that table got wiped together with the cancelled
game on the next boot, so the post-mortem currently lacks it.
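Hypotheses 1 and 2 are cheap to test after the fact while `dockerd`
is still the same process: the daemon keeps a bounded in-memory
event buffer that `docker events --since/--until` can replay, and
the buffer does not survive a daemon restart (so an empty replay
over a window that demonstrably had events is itself evidence for
hypothesis 2). A sketch, with the window taken from the log excerpt
above:
```sh
# Replay past container events from the daemon's in-memory buffer; run soon
# after the incident, before a dockerd restart discards the buffer.
docker events \
  --since '2026-05-16T20:24:00' --until '2026-05-16T20:40:00' \
  --filter type=container \
  --format '{{.Time}} {{.Action}} {{.Actor.Attributes.name}}'
```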
### What to investigate next
- On the dev host: list cron jobs, systemd timers, and any custom
  shell scripts that periodically run `docker container prune` or
  `docker system prune` (a host sweep is sketched after this list).
  The host also runs gitea + crowdsec, so unrelated maintenance is
  plausible.
- Inspect `journalctl -u docker --since '2026-05-16 20:24:00'
  --until '2026-05-16 20:40:00'`, matching the silent gap in the
  log excerpt above, to confirm whether the daemon logged a
  container removal in that window.
- Re-run with backend logging level `debug` so the
`runtime.scheduler` and `runtime.workers` paths surface their
per-game timer / job decisions. The current `info` level says
nothing between bootstrap and the reconciler.
- Capture `runtime_records` for the broken game *before* the next
  boot purges it (a snapshot query is sketched after this list);
  the column set (`status`, `current_container_id`,
  `engine_endpoint`) shows whether the engine ever reached
  `running` or stopped at `engine_unreachable`.
- Reproduce on a freshly seeded `clean-data` volume to rule out
postgres-state ambiguity.
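A sketch bundling the two cheapest items above, the host sweep and
the `runtime_records` snapshot. `DATABASE_URL` and the crontab
locations are assumptions about the dev host; the table name comes
from the section above:
```sh
# 1) Sweep the host for scheduled prune jobs: crontabs, cron drop-ins,
#    systemd timers, and anything under /etc that mentions a docker prune.
crontab -l 2>/dev/null
sudo crontab -l 2>/dev/null
grep -rn 'docker.*prune' /etc/cron* /etc/systemd 2>/dev/null
systemctl list-timers --all | grep -Ei 'prune|docker|clean'

# 2) Snapshot runtime_records before the next boot purges it; the columns
#    called out above (status, current_container_id, engine_endpoint) are
#    the interesting ones.
psql "$DATABASE_URL" -c 'SELECT * FROM runtime_records;'
```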
### Workaround in use today
When the sandbox game flips to `cancelled`, redispatch `dev-deploy`:
```sh
curl -X POST -n -H 'Content-Type: application/json' \
  -d '{"ref":"<branch>"}' \
  https://gitea.iliadenisov.ru/api/v1/repos/developer/galaxy-game/actions/workflows/dev-deploy.yaml/dispatches
```
The next boot's `purgeTerminalSandboxGames` removes the cancelled
row, `findOrCreateSandboxGame` creates a fresh one, and
`ensureMembershipsAndDrive` puts the new game back to `running`.
### Owner
Unassigned. File an issue once the runtime / reconciler analysis
above is complete; reference this section in the issue body so
future redeploys can short-circuit the diagnostic loop.