tools/dev-deploy/ — known issues

Issues that surface in the long-lived dev environment but are not yet fixed. Each entry lists the observed symptom, the diagnostic evidence, the working hypothesis, and the open questions that have to be answered before a fix lands.

Dev Sandbox game flips to cancelled after a dev-deploy redispatch

Symptom

A previously running "Dev Sandbox" game (created by backend/internal/devsandbox) transitions to cancelled ~15 minutes after a dev-deploy.yaml workflow_dispatch run finishes. The user's browser session survives (the same device_session_id keeps working), but the lobby shows no game because the only game it had is now terminal. purgeTerminalSandboxGames does pick it up on the next boot and creates a fresh sandbox, but the first redispatch leaves the user with an empty lobby until the backend restarts again.

Diagnostic evidence

Backend logs from the broken cycle (timestamps abbreviated):

20:24:40 dev_sandbox: purged terminal sandbox game game_id=<prev> status=cancelled
20:24:40 dev_sandbox: memberships ensured count=20 game_id=<new>
20:24:40 dev_sandbox: bootstrap complete user_id=<owner> game_id=<new> status=starting
...
20:25:09 user mail sent failed (diplomail tables missing — unrelated)
...
20:39:40 lobby: game cancelled by runtime reconciler game_id=<new>
         op=reconcile status=removed message="container disappeared"

Between 20:24:40 (status=starting) and 20:39:40 (reconciler cancel) the backend logs are silent on the runtime / engine paths — no engine spawned, no engine container started, no runtime transition lines. The reconciler then fires and reports the engine container as missing.

docker ps -a --filter 'label=org.opencontainers.image.title=galaxy-game-engine' returns no rows during this window — the engine container is neither running nor stopped on the host, so it either was never spawned or was removed before the host snapshot.
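
Until the trigger is identified, the cheapest way to close this blind spot is a host-side watch loop that snapshots engine container state on an interval during the next repro. A minimal sketch, assuming GNU date; the label is the one used above, the output path is arbitrary:

# snapshot engine container state every 30s so the removal moment lands on record
while true; do
  date -Is >> /tmp/engine-watch.log
  docker ps -a \
    --filter 'label=org.opencontainers.image.title=galaxy-game-engine' \
    --format '{{.ID}} {{.Status}} {{.Names}}' >> /tmp/engine-watch.log
  sleep 30
done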

What has been ruled out

A live docker inspect on a healthy engine container shows:

Labels: galaxy.backend=1, galaxy.engine_version=0.1.0,
        galaxy.game_id=<uuid>,
        org.opencontainers.image.title=galaxy-game-engine,
        com.galaxy.{cpu_quota,memory,pids_limit}
AutoRemove:    false
RestartPolicy: on-failure
NetworkMode:   galaxy-dev-internal

There are no com.docker.compose.* labels and AutoRemove=false, so --remove-orphans cannot reap the engine and a --rm-style self-destruct is not in play. Two redispatches captured under docker events (filtered to create / start / die / destroy / kill / stop) also confirmed it: across both runs the only die / destroy events were for galaxy-dev-{backend,api,caddy}. The live engine container survived both redispatches, and the reconciler that fires 60 seconds after the new backend boots correctly matched it through byGameID / byContainerID.
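
The capture invocation was roughly the following (reconstructed, not preserved verbatim; each lifecycle event takes its own --filter flag):

docker events \
  --filter 'event=create' --filter 'event=start' \
  --filter 'event=die' --filter 'event=destroy' \
  --filter 'event=kill' --filter 'event=stop' \
  --format '{{.Time}} {{.Type}} {{.Action}} {{.Actor.Attributes.name}}' \
  | tee /tmp/redispatch-events.log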

backend/internal/runtime/service.go only removes engine containers from the explicit runStop / runRestart / runPatch paths. There is no runtime.Service.Shutdown that proactively kills containers on backend exit, so a graceful SIGTERM to galaxy-dev-backend will not touch its child engine containers.
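
This is quick to re-verify on the dev host. A sketch, assuming the backend container is named galaxy-dev-backend as above:

# send a graceful SIGTERM to the backend, then confirm the engine survives
docker kill --signal=TERM galaxy-dev-backend
sleep 10
docker ps \
  --filter 'label=org.opencontainers.image.title=galaxy-game-engine' \
  --format '{{.Names}} {{.Status}}'   # engine containers should still be up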

Remaining hypotheses

  1. Engine self-crashed and was reaped by something host-side. RestartPolicy=on-failure only restarts on a non-zero exit code (and only up to the configured retry limit, if one is set); a clean exit (status 0) is not restarted, but the stopped container stays visible in docker ps -a either way. The reproduction case had the engine missing from docker ps -a entirely, so a separate cleanup (a cron docker container prune, a host script, a manual docker rm) needs to be ruled out (the inspect sketch after this list pins down the exit state if the container is ever caught stopped).
  2. An out-of-band Docker daemon restart dropped the container. A dockerd restart that loses sight of an unmanaged container is rare, but would explain why both the live tracking and docker ps -a are empty. Correlate the gap with journalctl -u docker on the host.
  3. runStart errored at waitForEngineHealthz or Engine.Init and the engine exited on its own before status=running. Bootstrap logs status=starting and then goes silent until the reconciler fires 15 minutes later. In that case the runtime row should have been written with status=engine_unreachable, so any reproduction needs a runtime_records snapshot from the bad window; that table got wiped together with the cancelled game on the next boot, so the post-mortem currently lacks it.
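
If the watch loop above ever catches the engine in a stopped state, its exit metadata separates hypothesis 1 from hypothesis 3 before anything prunes the container. A sketch; <engine> stands for whatever container ID was recorded:

docker inspect <engine> --format \
  'exit={{.State.ExitCode}} oom={{.State.OOMKilled}} restarts={{.RestartCount}} finished={{.State.FinishedAt}}'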

What to investigate next

  • On the dev host: list cron jobs, systemd timers, and any custom shell script that periodically runs docker container prune or docker system prune (a sweep is sketched after this list). The host also runs gitea + crowdsec, so unrelated maintenance is plausible.
  • Inspect journalctl -u docker --since '2026-05-16 20:50:00' --until '2026-05-16 20:56:34' for the original repro window — confirm whether the daemon flagged a container removal in that gap.
  • Re-run with backend logging level debug so the runtime.scheduler and runtime.workers paths surface their per-game timer / job decisions. The current info level says nothing between bootstrap and the reconciler.
  • Capture runtime_records for the broken game before the next boot purges it (snapshot query sketched after this list); the column set (status, current_container_id, engine_endpoint) tells whether the engine ever reached running or stopped at engine_unreachable.
  • Reproduce on a freshly seeded clean-data volume to rule out postgres-state ambiguity.
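
A sketch covering the first and fourth items. The cron and timer sweep is generic; the psql invocation assumes a postgres container named galaxy-dev-postgres and a database/user galaxy, neither of which is confirmed above, so adjust to the actual service and credentials:

# look for host-side cleanup that could remove unmanaged containers
sudo ls /etc/cron.d/
sudo crontab -l; crontab -l
systemctl list-timers --all
grep -rI 'docker.*prune' /etc/cron* /etc/systemd/system 2>/dev/null

# snapshot the runtime row before the next boot wipes it
docker exec galaxy-dev-postgres psql -U galaxy -d galaxy -c \
  "SELECT status, current_container_id, engine_endpoint
     FROM runtime_records WHERE game_id = '<game>';"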

Workaround in use today

When the sandbox game flips to cancelled, redispatch dev-deploy:

curl -X POST -n -H 'Content-Type: application/json' \
  -d '{"ref":"<branch>"}' \
  https://gitea.iliadenisov.ru/api/v1/repos/developer/galaxy-game/actions/workflows/dev-deploy.yaml/dispatches

The next boot's purgeTerminalSandboxGames removes the cancelled row, findOrCreateSandboxGame creates a fresh one, and ensureMembershipsAndDrive puts the new game back to running.

Owner

Unassigned. File an issue once the investigation above converges on a trigger; reference this section in the issue body so future redeploys can short-circuit the diagnostic loop.