galaxy-game/tools/dev-deploy/KNOWN-ISSUES.md

# `tools/dev-deploy/` — known issues

Issues that surface in the long-lived dev environment but are not yet
fixed. Each entry lists the observed symptom, the diagnostic evidence,
the working hypothesis, and the open questions that have to be
answered before a fix lands.

## Dev Sandbox game flips to `cancelled` after a `dev-deploy` redispatch

### Symptom

A previously `running` "Dev Sandbox" game (created by
`backend/internal/devsandbox`) transitions to `cancelled` ~15 minutes
after a `dev-deploy.yaml` workflow_dispatch run finishes. The user's
browser session survives (the same `device_session_id` keeps working),
but the lobby shows no game because the only game it had is now
terminal. `purgeTerminalSandboxGames` does pick it up on the **next**
boot and a fresh sandbox is created, but the first redispatch leaves
the user with an empty lobby until the backend restarts again.

### Diagnostic evidence

Backend logs from the broken cycle (timestamps abbreviated):

```text
20:24:40 dev_sandbox: purged terminal sandbox game game_id=<prev> status=cancelled
20:24:40 dev_sandbox: memberships ensured count=20 game_id=<new>
20:24:40 dev_sandbox: bootstrap complete user_id=<owner> game_id=<new> status=starting
...
20:25:09 user mail sent failed (diplomail tables missing — unrelated)
...
20:39:40 lobby: game cancelled by runtime reconciler game_id=<new>
         op=reconcile status=removed message="container disappeared"
```

Between 20:24:40 (`status=starting`) and 20:39:40 (reconciler cancel)
the backend logs are silent on the runtime / engine paths — no
`engine spawned`, no `engine container started`, no `runtime
transition` lines. The reconciler then fires and reports the engine
container as missing.

`docker ps -a --filter 'label=org.opencontainers.image.title=galaxy-game-engine'`
returns no rows during this window — the engine container is neither
running nor stopped on the host, so it either was never spawned or
was removed before the host snapshot.
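
To repeat this check for one game rather than for every engine on the host, the lookup can be narrowed by game id. The sketch below assumes the `galaxy.game_id` label that healthy engine containers carry; the function names are illustrative and not part of the repository.

```sh
# List every engine container (running or exited) stamped with a game id.
# Assumption: engines carry the galaxy.game_id=<uuid> label.
engine_containers_for_game() {
  game_id=$1
  docker ps -a \
    --filter "label=galaxy.game_id=${game_id}" \
    --format '{{.ID}} {{.Names}} {{.Status}}'
}

# Succeeds only if at least one such container exists on this host.
engine_present() {
  [ -n "$(engine_containers_for_game "$1")" ]
}
```

For example, `engine_present "$GAME_ID" || echo "no engine container"` distinguishes a container that is gone from one that is merely stopped, since `docker ps -a` also lists exited containers.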

### What has been ruled out

A live `docker inspect` on a healthy engine container shows:

```text
Labels: galaxy.backend=1, galaxy.engine_version=0.1.0,
        galaxy.game_id=<uuid>,
        org.opencontainers.image.title=galaxy-game-engine,
        com.galaxy.{cpu_quota,memory,pids_limit}
AutoRemove:    false
RestartPolicy: on-failure
NetworkMode:   galaxy-dev-internal
```

There are no `com.docker.compose.*` labels and `AutoRemove=false`,
so `--remove-orphans` cannot reap the engine and a `--rm`-style
self-destruct is not in play. Two redispatches captured under
`docker events`, filtered to the `create`, `start`, `die`, `destroy`,
`kill`, and `stop` events, confirmed this: across both runs the only
`die` / `destroy` events were for `galaxy-dev-{backend,api,caddy}`.
The live engine
container survived both redispatches, and the reconciler that
fires 60 seconds after the new backend boots correctly matched
it through `byGameID` / `byContainerID`.
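
The inspect-based checks above can be repeated on any engine container with a small helper. This is a minimal sketch: the function names are hypothetical, and it only reads fields that `docker inspect` exposes under `.HostConfig` and `.Config.Labels`.

```sh
# Print the settings that decide whether Docker itself or compose
# could remove the container without the backend asking for it.
engine_reap_profile() {
  docker inspect --format \
    'auto_remove={{.HostConfig.AutoRemove}} restart={{.HostConfig.RestartPolicy.Name}} compose_project={{index .Config.Labels "com.docker.compose.project"}}' \
    "$1"
}

# Succeeds if the container carries a compose project label, i.e. if
# `docker compose up --remove-orphans` could consider it at all.
compose_managed() {
  [ -n "$(docker inspect \
      --format '{{index .Config.Labels "com.docker.compose.project"}}' \
      "$1")" ]
}
```

An empty `compose_project=` together with `auto_remove=false` matches the healthy engine shown above.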

`backend/internal/runtime/service.go` only removes engine
containers from the explicit `runStop` / `runRestart` / `runPatch`
paths. There is no `runtime.Service.Shutdown` that proactively
kills containers on backend exit, so a graceful SIGTERM to
`galaxy-dev-backend` will not touch its child engine containers.

### Host-side hypotheses considered and rejected by the owner

The natural follow-up suspects after compose was cleared — host-side
`docker prune` cron jobs, a manual `docker rm`, an out-of-band
`dockerd` restart, and an idle-state engine crash — were all
rejected by the project owner: the dev host runs none of those
periodic cleanups, no one manually removed the container, dockerd
was not restarted in the window, and the engine binary does not
crash while idling on API calls.

### Best remaining suspicion

Something the `dev-deploy.yaml` CI run does between successful
image builds and the final `docker compose up -d --wait
--remove-orphans` clobbers the previously-spawned engine container.
The sequence the job executes is:

1. `docker build -t galaxy-engine:dev -f game/Dockerfile .`
2. `docker compose build galaxy-backend galaxy-api`
3. `docker run --rm` alpine for the UI volume seed
4. `docker compose up -d --wait --remove-orphans`

None of these *should* touch an unmanaged engine container, but
the reproduction window falls squarely inside this sequence. The
next reproduction should arm `docker events --since 0` *before* the
deploy starts and keep it running for the entire job, capturing the
whole run on the dev host rather than only the portion after the
backend is recreated. That would pin which step emits the `destroy`
on the engine.
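
One way to arm the capture for the whole job is to run the recorder in the background and stop it once the deploy finishes. This is a sketch under assumptions: the function names and log path are hypothetical, and it relies only on the `--since`, `--filter type=container`, and `--format '{{json .}}'` options of `docker events`.

```sh
# Start recording container events in the background.
# --since 0 also replays whatever the daemon still has buffered.
# Sets events_pid so the recorder can be stopped later.
arm_docker_events() {
  events_log=$1
  docker events --since 0 --filter type=container \
    --format '{{json .}}' >"$events_log" 2>&1 &
  events_pid=$!
}

# Stop the recorder and print only the destroy events it captured.
disarm_docker_events() {
  kill "$events_pid" 2>/dev/null || true
  wait "$events_pid" 2>/dev/null || true
  grep '"Action":"destroy"' "$1"
}
```

Call `arm_docker_events /tmp/dev-deploy-events.log` before dispatching, then `disarm_docker_events /tmp/dev-deploy-events.log` after the job ends. A `destroy` line naming the engine container, read against the job's step timestamps, would identify the responsible step.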

### Update 2026-05-19: integration preclean identified as one cause

A live reproduction during the post-merge auto-deploy cycle (Gitea
run #188 dev-deploy plus parallel run #190 integration) pinned one
clobbering source: `integration/scripts/preclean.sh` was unscoped
and removed *every* container labelled `galaxy.backend=1`, including
the dev-deploy engine. Timeline from the dev host:

```text
23:10:40  backend pre-bootstrap reconciler tick: engine alive
23:10:40  dev_sandbox bootstrap: status=running
23:10:56  preclean: removing 1 backend-managed engine containers  ← integration run #190
23:11:40  reconciler: container disappeared → game cancelled
```

Fix landed: `BACKEND_STACK_LABEL=integration` is now passed to
every integration backend (see
`integration/testenv/backend.go`) and `preclean.sh` AND-combines
`galaxy.backend=1` with `galaxy.stack=integration`, so dev-deploy /
local-dev engines stamped with different stack values are no longer
collateral.
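
The scoping can be illustrated as follows. This is not the contents of `preclean.sh`, only a sketch of the AND-combination it describes. The function name is hypothetical, and it assumes that repeated `label=` filters on `docker ps` must all match.

```sh
# Remove only containers that carry BOTH galaxy.backend=1 and the
# given galaxy.stack value; engines from other stacks are untouched.
preclean_stack() {
  stack=$1
  ids=$(docker ps -aq \
    --filter 'label=galaxy.backend=1' \
    --filter "label=galaxy.stack=${stack}")
  # Nothing matched: do not call `docker rm` with an empty list.
  [ -z "$ids" ] && return 0
  # Word-splitting on $ids is intended: one argument per container id.
  docker rm -f $ids
}
```

With this shape, `preclean_stack integration` cannot match a dev-deploy engine unless that engine was also stamped `galaxy.stack=integration`.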

This covers **push**-triggered cycles where `dev-deploy.yaml` and
`integration.yaml` run on the same Gitea host. The originally
reported case (a solo `workflow_dispatch` run of `dev-deploy` also
losing the engine) is *not* explained by the integration fix, because
manual dispatches do not trigger `integration.yaml`. Keep this entry
open until a solo-dispatch reproduction confirms whether the symptom
still occurs.

### Status

Partially fixed (push-triggered cycles). A solo `workflow_dispatch`
reproduction is still outstanding. If the symptom recurs now that the
integration fix has landed, capture `docker events --since 0` for the
full dispatch window and attach the output here.

### Workaround in use today

When the sandbox game flips to `cancelled`, redispatch `dev-deploy`:

```sh
curl -X POST -n -H 'Content-Type: application/json' \
  -d '{"ref":"<branch>"}' \
  https://gitea.iliadenisov.ru/api/v1/repos/developer/galaxy-game/actions/workflows/dev-deploy.yaml/dispatches
```

The next boot's `purgeTerminalSandboxGames` removes the cancelled
row, `findOrCreateSandboxGame` creates a fresh one, and
`ensureMembershipsAndDrive` puts the new game back to `running`.

### Owner

Unassigned. File an issue once the runtime / reconciler analysis
above is complete; reference this section in the issue body so future
redeploys can short-circuit the diagnostic loop.

## `docker restart galaxy-dev-backend` fails after the CI runner cleans up

### Symptom

`docker restart galaxy-dev-backend` from the host fails with:

```text
Error response from daemon: ... error mounting
"/home/runner/.cache/act/<workspace>/hostexecutor/pkg/geoip/test-data/test-data/GeoIP2-Country-Test.mmdb"
to rootfs at "/var/lib/galaxy/geoip.mmdb": ... not a directory
```

The container ends up `Exited (127)` and never comes back.

### Cause

`tools/dev-deploy/docker-compose.yml` mounts the geoip database via
a path relative to the compose file
(`../../pkg/geoip/test-data/test-data/GeoIP2-Country-Test.mmdb`). When
the `dev-deploy.yaml` Gitea runner invokes `docker compose up` it
resolves that relative path against the runner's ephemeral workspace
under `/home/runner/.cache/act/<hash>/hostexecutor/tools/dev-deploy/`,
so the bind-mount source baked into the running container points at
that ephemeral path. The runner deletes the workspace once the
workflow ends, the source disappears, and the next `docker restart`
fails to remount it.
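
Whether a container is exposed to this failure can be checked before restarting it, by comparing its bind-mount sources against the host filesystem. This is a minimal sketch with hypothetical function names, reading the `.Mounts` array that `docker inspect` reports.

```sh
# Print the host-side source path of every bind mount, one per line.
bind_mount_sources() {
  docker inspect \
    --format '{{range .Mounts}}{{if eq .Type "bind"}}{{println .Source}}{{end}}{{end}}' \
    "$1"
}

# Print only the bind-mount sources that no longer exist on the host.
missing_mount_sources() {
  bind_mount_sources "$1" | while IFS= read -r src; do
    [ -n "$src" ] || continue          # skip the trailing blank line
    [ -e "$src" ] || printf '%s\n' "$src"
  done
}
```

If `missing_mount_sources galaxy-dev-backend` prints a path under `/home/runner/.cache/act/`, a `docker restart` of that container is expected to fail as described.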

### Workaround

Bring the stack back up from a stable workspace, which re-binds the
mount source to the persistent checkout:

```sh
make -C tools/dev-deploy up
```

This restarts every service (including the broken `galaxy-dev-backend`)
with a stable source path.

### Status

Open. The clean fix is either to bake the geoip test fixture into
the backend image (no host bind-mount) or to copy it onto a named
volume during `dev-deploy.yaml` and bind that instead. Either change
removes the runner-workspace dependency entirely.
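
The named-volume option could look roughly like the sketch below, which copies the fixture through a throwaway `alpine` container. The volume name `galaxy-dev-geoip`, the function name, and the in-container paths are all assumptions. The compose file would also have to mount the volume at a directory and point the backend at the file inside it, which is not shown.

```sh
# Copy the geoip fixture from the (ephemeral) workspace into a named
# volume, so the backend no longer depends on the workspace path.
# $1 must be an absolute host path to the .mmdb file.
seed_geoip_volume() {
  src_file=$1
  volume=${2:-galaxy-dev-geoip}   # hypothetical volume name
  docker volume create "$volume" >/dev/null
  docker run --rm \
    -v "${src_file}:/src/geoip.mmdb:ro" \
    -v "${volume}:/dst" \
    alpine cp /src/geoip.mmdb /dst/geoip.mmdb
}
```

Because the volume outlives the runner workspace, a later `docker restart` would remount it without needing the original checkout.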