f70258849f
`docker restart galaxy-dev-backend` failed with "not a directory"
after every dev-deploy workflow run. Root cause: the compose file
bind-mounted the geoip database via a relative path
(`../../pkg/geoip/test-data/test-data/GeoIP2-Country-Test.mmdb`).
When the Gitea runner invoked `docker compose up`, the path
resolved against the runner's ephemeral workspace under
`/home/runner/.cache/act/<hash>/hostexecutor/...`. The bind source
baked into the running container therefore pointed at that
ephemeral path; the runner deleted the workspace once the workflow
finished, and any later `docker restart` could not remount.

Replace the bind with a named volume `galaxy-dev-geoip-data`,
seeded at deploy time:

- `tools/dev-deploy/docker-compose.yml`: mount
  `galaxy-dev-geoip-data:/var/lib/galaxy:ro` instead of a relative
  bind. Declare the volume in the top-level `volumes:` block.
- `.gitea/workflows/dev-deploy.yaml`: new `Seed geoip volume` step
  (placed right after the existing UI-volume seed) copies the
  fixture from `pkg/geoip/test-data/test-data/` into the named
  volume via an ephemeral alpine container, the same pattern UI
  seeding already uses.
- `tools/dev-deploy/Makefile`: new `seed-geoip` target performs
  the same copy from the persistent checkout. `up` and `rebuild`
  now depend on it, so a hand-run `make -C tools/dev-deploy up`
  populates the volume without operator action.
- `tools/dev-deploy/README.md`: updated the make-targets table to
  list `seed-geoip`.
- `tools/dev-deploy/KNOWN-ISSUES.md`: the entry for the restart
  failure is downgraded to a "fixed" postmortem; the symptom,
  cause, and where the fix lives are kept for future reference.

Verification on the dev host (this branch checked out):

    $ make -C tools/dev-deploy up # populates the volume, brings stack healthy
    $ docker restart galaxy-dev-backend # used to error "not a directory"
    $ until [ "$(docker inspect -f '{{.State.Health.Status}}' galaxy-dev-backend)" = "healthy" ]; do sleep 2; done
    $ echo "ok" # backend up 6s, healthy

The pre-existing sandbox engine `galaxy-game-80f3ce86-...` survived
both `make up` and `docker restart` untouched.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

203 lines
8.3 KiB
Markdown
# `tools/dev-deploy/` — known issues

Issues that surface in the long-lived dev environment but are not yet
fixed. Each entry lists the observed symptom, the diagnostic evidence,
the working hypothesis, and the open questions that have to be
answered before a fix lands.

## Dev Sandbox game flips to `cancelled` after a `dev-deploy` redispatch

### Symptom

A previously `running` "Dev Sandbox" game (created by
`backend/internal/devsandbox`) transitions to `cancelled` ~15 minutes
after a `dev-deploy.yaml` workflow_dispatch run finishes. The user's
browser session survives (the same `device_session_id` keeps working),
but the lobby shows no game because the only game it had is now
terminal. `purgeTerminalSandboxGames` does pick it up on the **next**
boot and creates a fresh sandbox, but the first redispatch leaves
the user with an empty lobby until the backend restarts again.

### Diagnostic evidence

Backend logs from the broken cycle (timestamps abbreviated):

```text
20:24:40 dev_sandbox: purged terminal sandbox game game_id=<prev> status=cancelled
20:24:40 dev_sandbox: memberships ensured count=20 game_id=<new>
20:24:40 dev_sandbox: bootstrap complete user_id=<owner> game_id=<new> status=starting
...
20:25:09 user mail sent failed (diplomail tables missing; unrelated)
...
20:39:40 lobby: game cancelled by runtime reconciler game_id=<new>
         op=reconcile status=removed message="container disappeared"
```

Between 20:24:40 (`status=starting`) and 20:39:40 (reconciler cancel)
the backend logs are silent on the runtime / engine paths: no
`engine spawned`, no `engine container started`, no
`runtime transition` lines. The reconciler then fires and reports the
engine container as missing.

`docker ps -a --filter 'label=org.opencontainers.image.title=galaxy-game-engine'`
returns no rows during this window: the engine container is neither
running nor stopped on the host, so it either was never spawned or
was removed before the host snapshot.

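The host snapshot described above can be repeated with a small helper. This is a minimal sketch, not part of the repository: the function names `list_engine_containers` and `engine_presence` are hypothetical, and only the label filter is taken from the text.

```sh
# Hypothetical helper: list every engine container, running or stopped,
# that carries the image-title label quoted in the text.
list_engine_containers() {
  docker ps -a \
    --filter 'label=org.opencontainers.image.title=galaxy-game-engine' \
    --format '{{.ID}} {{.Names}} {{.Status}}'
}

# Print the rows, or a one-line verdict when the listing is empty
# (the "neither running nor stopped" case).
engine_presence() {
  rows=$(list_engine_containers)
  if [ -z "$rows" ]; then
    echo "no engine container on host"
  else
    printf '%s\n' "$rows"
  fi
}
```

Calling `engine_presence` at a few points inside the 15-minute window would show whether the container ever existed, which is the distinction the log silence cannot make on its own.
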
### What has been ruled out

A live `docker inspect` on a healthy engine container shows:

```text
Labels: galaxy.backend=1, galaxy.engine_version=0.1.0,
        galaxy.game_id=<uuid>,
        org.opencontainers.image.title=galaxy-game-engine,
        com.galaxy.{cpu_quota,memory,pids_limit}
AutoRemove: false
RestartPolicy: on-failure
NetworkMode: galaxy-dev-internal
```

There are no `com.docker.compose.*` labels and `AutoRemove=false`,
so `--remove-orphans` cannot reap the engine and a `--rm`-style
self-destruct is not in play. Two redispatches captured under
`docker events --filter event=create,start,die,destroy,kill,stop`
also confirmed it: across both runs the only `die` / `destroy`
events were for `galaxy-dev-{backend,api,caddy}`. The live engine
container survived both redispatches, and the reconciler that
fires 60 seconds after the new backend boots correctly matched
it through `byGameID` / `byContainerID`.

`backend/internal/runtime/service.go` only removes engine
containers from the explicit `runStop` / `runRestart` / `runPatch`
paths. There is no `runtime.Service.Shutdown` that proactively
kills containers on backend exit, so a graceful SIGTERM to
`galaxy-dev-backend` will not touch its child engine containers.

### Host-side hypotheses considered and rejected by the owner

The natural follow-up suspects after compose was cleared (host-side
`docker prune` cron jobs, a manual `docker rm`, an out-of-band
`dockerd` restart, and an idle-state engine crash) were all
rejected by the project owner: the dev host runs none of those
periodic cleanups, no one manually removed the container, dockerd
was not restarted in the window, and the engine binary does not
crash while idling on API calls.

### Best remaining suspicion

Something the `dev-deploy.yaml` CI run does between successful
image builds and the final `docker compose up -d --wait
--remove-orphans` clobbers the previously-spawned engine container.
At runtime the chain is:

1. `docker build -t galaxy-engine:dev -f game/Dockerfile .`
2. `docker compose build galaxy-backend galaxy-api`
3. `docker run --rm` alpine for the UI volume seed
4. `docker compose up -d --wait --remove-orphans`

None of these *should* touch an unmanaged engine container, but
the reproduction window points squarely inside this sequence. A
deliberate next reproduction with `docker events --since 0` armed
*before* the deploy starts and live for the entire job (captured
end-to-end on the dev host, not just the chunk after backend
recreate) would pin which step emits the `destroy` on the engine.

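One way to arm the capture for the whole job is a wrapper that starts `docker events` in the background, runs the deploy command, and then prints any `destroy` events. This is a sketch under assumptions: the name `capture_events_during` and the log path are hypothetical, and the event list is written as repeated `--filter event=...` flags, which Docker documents as a logical OR for a repeated key.

```sh
# Hypothetical wrapper: record container lifecycle events for the whole
# duration of a command, then print the destroy events that were seen.
# $1 is the log file; the remaining arguments are the command to run.
capture_events_during() {
  log=$1
  shift
  # --since 0 replays the daemon's retained history first, so events that
  # fire before the stream attaches are not lost.
  docker events --since 0 \
    --filter type=container \
    --filter event=create --filter event=start \
    --filter event=die --filter event=destroy \
    --filter event=kill --filter event=stop \
    >"$log" 2>&1 &
  events_pid=$!
  status=0
  "$@" || status=$?
  kill "$events_pid" 2>/dev/null || true
  wait "$events_pid" 2>/dev/null || true
  # Show which containers were destroyed while the command ran.
  grep ' destroy ' "$log" || true
  return "$status"
}
```

A possible invocation on the dev host would be `capture_events_during /tmp/dev-deploy-events.log make -C tools/dev-deploy up`. The wrapper returns the wrapped command's exit status, so it can sit in front of an existing step without changing how failures propagate.
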
### Update 2026-05-19: integration preclean identified as one cause

A live reproduction during the post-merge auto-deploy cycle (Gitea
run #188 dev-deploy plus parallel run #190 integration) pinned one
clobbering source: `integration/scripts/preclean.sh` was unscoped
and removed *every* container labelled `galaxy.backend=1`, including
the dev-deploy engine. Timeline from the dev host:

```text
23:10:40 backend pre-bootstrap reconciler tick: engine alive
23:10:40 dev_sandbox bootstrap: status=running
23:10:56 preclean: removing 1 backend-managed engine containers ← integration run #190
23:11:40 reconciler: container disappeared → game cancelled
```

Fix landed: `BACKEND_STACK_LABEL=integration` is now passed to
every integration backend (see
`integration/testenv/backend.go`) and `preclean.sh` AND-combines
`galaxy.backend=1` with `galaxy.stack=integration`, so dev-deploy /
local-dev engines stamped with different stack values are no longer
collateral.

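The scoping described above could look roughly like the following. It is an illustrative sketch, not the contents of `preclean.sh`: the function name `scoped_preclean` and its default argument are hypothetical, and it relies on `docker ps` treating several `label` filters as a conjunction, which is worth confirming against the Docker version on the host.

```sh
# Hypothetical scoped cleanup: select only containers that carry BOTH
# labels, then force-remove them. $1 is the stack value (assumed default).
scoped_preclean() {
  stack=${1:-integration}
  ids=$(docker ps -aq \
    --filter 'label=galaxy.backend=1' \
    --filter "label=galaxy.stack=$stack")
  if [ -z "$ids" ]; then
    echo "preclean: nothing to remove for stack=$stack"
    return 0
  fi
  # Word splitting on $ids is intentional: one argument per container ID.
  # shellcheck disable=SC2086
  docker rm -f $ids
}
```

With this shape, an engine stamped with a different `galaxy.stack` value never appears in the ID list, so it cannot be removed as a side effect of an integration run.
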
This covers **push**-triggered cycles where `dev-deploy.yaml` and
`integration.yaml` run on the same Gitea host. The original
hypothesis (a `workflow_dispatch dev-deploy` solo run also losing
the engine) is *not* explained by the integration fix: manual
dispatches do not trigger `integration.yaml`. Keep this entry open
until a solo-dispatch reproduction confirms whether the symptom
still occurs.

### Status

Partially fixed (push-triggered cycles). Solo `workflow_dispatch`
reproductions are still open. If the symptom recurs now that the
integration fix has landed, capture `docker events --since 0` for
the full dispatch window and attach the output here.

### Workaround in use today

When the sandbox game flips to `cancelled`, redispatch `dev-deploy`:

```sh
curl -X POST -n -H 'Content-Type: application/json' \
  -d '{"ref":"<branch>"}' \
  https://gitea.iliadenisov.ru/api/v1/repos/developer/galaxy-game/actions/workflows/dev-deploy.yaml/dispatches
```

The next boot's `purgeTerminalSandboxGames` removes the cancelled
row, `findOrCreateSandboxGame` creates a fresh one, and
`ensureMembershipsAndDrive` puts the new game back to `running`.

### Owner

Unassigned. File an issue once we have the runtime / reconciler
analysis above; reference this section in the issue body so future
redeploys can short-circuit the diagnostic loop.

## `docker restart galaxy-dev-backend` fails after the CI runner cleans up

**Status: fixed (2026-05-19).** Kept here as a postmortem in case
the symptom resurfaces in a different form.

### Symptom

`docker restart galaxy-dev-backend` from the host failed with:

```text
Error response from daemon: ... error mounting
"/home/runner/.cache/act/<workspace>/hostexecutor/pkg/geoip/test-data/test-data/GeoIP2-Country-Test.mmdb"
to rootfs at "/var/lib/galaxy/geoip.mmdb": ... not a directory
```

The container ended up `Exited (127)` and never came back.

### Cause

`tools/dev-deploy/docker-compose.yml` used to mount the geoip
database via a path relative to the compose file
(`../../pkg/geoip/test-data/test-data/GeoIP2-Country-Test.mmdb`). When
the `dev-deploy.yaml` Gitea runner invoked `docker compose up`, it
resolved that relative path against the runner's ephemeral workspace
under `/home/runner/.cache/act/<hash>/hostexecutor/tools/dev-deploy/`,
so the bind-mount source baked into the running container pointed at
that ephemeral path. The runner deleted the workspace once the
workflow ended, the source disappeared, and the next `docker restart`
failed to remount it.

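A container in this state can be spotted before a restart is attempted by comparing its recorded bind sources with what still exists on the host. The helpers below are a hedged sketch: `bind_sources` and `missing_bind_sources` are hypothetical names, and the check assumes the invoking user can stat the recorded paths.

```sh
# Print the host-side source of every bind mount recorded on a container.
bind_sources() {
  docker inspect \
    -f '{{range .Mounts}}{{if eq .Type "bind"}}{{println .Source}}{{end}}{{end}}' \
    "$1"
}

# Report recorded bind sources that no longer exist on the host. These are
# the mounts a later `docker restart` would fail to re-establish.
missing_bind_sources() {
  bind_sources "$1" | while IFS= read -r src; do
    [ -n "$src" ] || continue
    [ -e "$src" ] || echo "missing: $src"
  done
}
```

Running `missing_bind_sources galaxy-dev-backend` after a workflow has finished would list any source that lived in the runner workspace and has since been deleted.
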
### Fix

Replaced the bind-mount with a named volume,
`galaxy-dev-geoip-data`, seeded by the `dev-deploy.yaml` workflow
(and by the new `make seed-geoip` target) at deploy time. The
backend mounts the volume as `/var/lib/galaxy:ro`, so the mount
source is a Docker-managed volume (independent of the runner
workspace) and survives a `docker restart`. See
`.gitea/workflows/dev-deploy.yaml` ("Seed geoip volume" step) and
`tools/dev-deploy/Makefile` (`seed-geoip` target).

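The seeding step can be pictured as follows. This is a sketch of the pattern, not the actual workflow step or Makefile recipe: the function name `seed_geoip_volume` and the in-container paths `/src` and `/dst` are hypothetical, and the target file name `geoip.mmdb` is an assumption inferred from the old mount target in the error message.

```sh
# Hypothetical seeding helper. $1 is the host directory holding the fixture
# (for example a checkout's pkg/geoip/test-data/test-data directory).
seed_geoip_volume() {
  # Docker needs an absolute path for the bind side of the copy.
  src_dir=$(cd "$1" && pwd) || return 1
  # Creating a volume that already exists is a no-op.
  docker volume create galaxy-dev-geoip-data >/dev/null
  # The bind of $src_dir lives only as long as this throwaway container,
  # so it may safely point into an ephemeral workspace.
  docker run --rm \
    -v "$src_dir":/src:ro \
    -v galaxy-dev-geoip-data:/dst \
    alpine \
    cp /src/GeoIP2-Country-Test.mmdb /dst/geoip.mmdb
}
```

The design point is that the workspace path appears only in the short-lived copy container. The long-lived backend container records the volume name alone, which is why a later restart no longer depends on the runner's checkout.
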