`docker restart galaxy-dev-backend` failed with "not a directory"
after every dev-deploy workflow run. Root cause: the compose file
bind-mounted the geoip database via a relative path
(`../../pkg/geoip/test-data/test-data/GeoIP2-Country-Test.mmdb`).
When the Gitea runner invoked `docker compose up`, the path
resolved against the runner's ephemeral workspace under
`/home/runner/.cache/act/<hash>/hostexecutor/...`. The bind source
baked into the running container therefore pointed at that
ephemeral path; the runner deleted the workspace once the workflow
finished, and any later `docker restart` could not remount.
Replace the bind with a named volume `galaxy-dev-geoip-data`,
seeded at deploy time:
- `tools/dev-deploy/docker-compose.yml`: mount
`galaxy-dev-geoip-data:/var/lib/galaxy:ro` instead of a relative
bind. Declare the volume in the top-level `volumes:` block.
- `.gitea/workflows/dev-deploy.yaml`: new `Seed geoip volume` step
(placed right after the existing UI-volume seed) copies the
fixture from `pkg/geoip/test-data/test-data/` into the named
volume via an ephemeral alpine container, the same pattern UI
seeding already uses.
- `tools/dev-deploy/Makefile`: new `seed-geoip` target performs
the same copy from the persistent checkout. `up` and `rebuild`
now depend on it, so a hand-run `make -C tools/dev-deploy up`
populates the volume without operator action.
- `tools/dev-deploy/README.md`: updated the make-targets table to
list `seed-geoip`.
- `tools/dev-deploy/KNOWN-ISSUES.md`: the entry for the restart
failure is downgraded to a "fixed" postmortem; the symptom,
cause, and where the fix lives are kept for future reference.
Verification on the dev host (this branch checked out):
$ make -C tools/dev-deploy up # populates the volume, brings stack healthy
$ docker restart galaxy-dev-backend # used to error "not a directory"
$ until [ "$(docker inspect -f '{{.State.Health.Status}}' galaxy-dev-backend)" = "healthy" ]; do sleep 2; done
$ echo "ok" # backend up 6s, healthy
The pre-existing sandbox engine `galaxy-game-80f3ce86-...` survived
both `make up` and `docker restart` untouched.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
tools/dev-deploy/ — known issues
Issues that surface in the long-lived dev environment. Open entries list the observed symptom, the diagnostic evidence, the working hypothesis, and the open questions that have to be answered before a fix lands; fixed entries are kept as postmortems.
Dev Sandbox game flips to cancelled after a dev-deploy redispatch
Symptom
A previously running "Dev Sandbox" game (created by
backend/internal/devsandbox) transitions to cancelled ~15 minutes
after a dev-deploy.yaml workflow_dispatch run finishes. The user's
browser session survives (the same device_session_id keeps working),
but the lobby shows no game because the only game it had is now
terminal. purgeTerminalSandboxGames does pick it up on the next
boot and creates a fresh sandbox — but the first redispatch leaves
the user with an empty lobby until backend restarts again.
Diagnostic evidence
Backend logs from the broken cycle (timestamps abbreviated):
20:24:40 dev_sandbox: purged terminal sandbox game game_id=<prev> status=cancelled
20:24:40 dev_sandbox: memberships ensured count=20 game_id=<new>
20:24:40 dev_sandbox: bootstrap complete user_id=<owner> game_id=<new> status=starting
...
20:25:09 user mail sent failed (diplomail tables missing — unrelated)
...
20:39:40 lobby: game cancelled by runtime reconciler game_id=<new>
op=reconcile status=removed message="container disappeared"
Between 20:24:40 (status=starting) and 20:39:40 (reconciler cancel)
the backend logs are silent on the runtime / engine paths — no
engine spawned, no engine container started, no runtime transition lines. The reconciler then fires and reports the engine
container as missing.
docker ps -a --filter 'label=org.opencontainers.image.title=galaxy-game-engine'
returns no rows during this window — the engine container is neither
running nor stopped on the host, so it either was never spawned or
was removed before the host snapshot.
What has been ruled out
A live docker inspect on a healthy engine container shows:
Labels: galaxy.backend=1, galaxy.engine_version=0.1.0,
galaxy.game_id=<uuid>,
org.opencontainers.image.title=galaxy-game-engine,
com.galaxy.{cpu_quota,memory,pids_limit}
AutoRemove: false
RestartPolicy: on-failure
NetworkMode: galaxy-dev-internal
There are no com.docker.compose.* labels and AutoRemove=false,
so --remove-orphans cannot reap the engine and a --rm-style
self-destruct is not in play. Two redispatches captured under
docker events (filtered to the create, start, die, destroy, kill, and stop events)
also confirmed it: across both runs the only die / destroy
events were for galaxy-dev-{backend,api,caddy}. The live engine
container survived both redispatches, and the reconciler that
fires 60 seconds after the new backend boots correctly matched
it through byGameID / byContainerID.
backend/internal/runtime/service.go only removes engine
containers from the explicit runStop / runRestart / runPatch
paths. There is no runtime.Service.Shutdown that proactively
kills containers on backend exit, so a graceful SIGTERM to
galaxy-dev-backend will not touch its child engine containers.
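The inspect fields quoted in this section can be pulled in one call. The following is a minimal sketch, not a script from the repository: the function name `engine_facts` is hypothetical, and the container name is whatever `docker ps` reports for the engine.

```shell
#!/bin/sh
# Sketch: print the fields used above to rule out compose and --rm cleanup.
# engine_facts is a hypothetical helper name, not part of the repository.
engine_facts() {
    # $1: engine container name or ID
    docker inspect -f \
        'Labels={{json .Config.Labels}} AutoRemove={{.HostConfig.AutoRemove}} RestartPolicy={{.HostConfig.RestartPolicy.Name}} NetworkMode={{.HostConfig.NetworkMode}}' \
        "$1"
}
# usage: engine_facts <engine-container-name>
```

A container that compose manages would show `com.docker.compose.*` keys in the Labels map, and `AutoRemove=true` would indicate a `--rm`-style container.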
Host-side hypotheses considered and rejected by the owner
The natural follow-up suspects after compose was cleared — host-side
docker prune cron jobs, a manual docker rm, an out-of-band
dockerd restart, and an idle-state engine crash — were all
rejected by the project owner: the dev host runs none of those
periodic cleanups, no one manually removed the container, dockerd
was not restarted in the window, and the engine binary does not
crash while idling on API calls.
Best remaining suspicion
Something the dev-deploy.yaml CI run does between successful
image builds and the final docker compose up -d --wait --remove-orphans clobbers the previously-spawned engine container.
The chain at runtime contains:
- docker build -t galaxy-engine:dev -f game/Dockerfile .
- docker compose build galaxy-backend galaxy-api
- docker run --rm alpine for the UI volume seed
- docker compose up -d --wait --remove-orphans
None of these should touch an unmanaged engine container, but
the reproduction window points squarely inside this sequence. A
deliberate next reproduction with docker events --since 0 armed
before the deploy starts and live for the entire job — captured
end-to-end on the dev host, not just the chunk after backend
recreate — would pin which step emits the destroy on the engine.
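One way to arm such a capture is sketched below. It assumes the command runs on the dev host before the dispatch and stays alive until the job ends; the function name `capture_container_events` and the log path in the usage line are hypothetical. Each event type gets its own `--filter event=...` flag, and the JSON output keeps every attribute of the event.

```shell
#!/bin/sh
# Sketch: stream container lifecycle events as JSON, one event per line.
# capture_container_events is a hypothetical helper name.
capture_container_events() {
    docker events \
        --filter type=container \
        --filter event=create --filter event=start \
        --filter event=die --filter event=destroy \
        --filter event=kill --filter event=stop \
        --format '{{json .}}'
}
# usage (start before the dispatch, stop after the job finishes):
#   capture_container_events > /tmp/dev-deploy-events.log &
```

Searching the resulting log for the engine's container name alongside a `destroy` action, and comparing the timestamp with the workflow step log, should show which step was running when the engine was removed.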
Update 2026-05-19: integration preclean identified as one cause
A live reproduction during the post-merge auto-deploy cycle (Gitea
run #188 dev-deploy plus parallel run #190 integration) pinned one
clobbering source: integration/scripts/preclean.sh was unscoped
and removed every container labelled galaxy.backend=1, including
the dev-deploy engine. Timeline from the dev host:
23:10:40 backend pre-bootstrap reconciler tick: engine alive
23:10:40 dev_sandbox bootstrap: status=running
23:10:56 preclean: removing 1 backend-managed engine containers ← integration run #190
23:11:40 reconciler: container disappeared → game cancelled
Fix landed: BACKEND_STACK_LABEL=integration is now passed to
every integration backend (see
integration/testenv/backend.go) and preclean.sh AND-combines
galaxy.backend=1 with galaxy.stack=integration, so dev-deploy /
local-dev engines stamped with different stack values are no longer
collateral.
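The scoping can be illustrated with repeated label filters, which Docker combines with AND. This is a sketch under assumptions, not the contents of integration/scripts/preclean.sh; the function name `list_stack_engines` is hypothetical.

```shell
#!/bin/sh
# Sketch: list IDs of containers that carry BOTH labels.
# list_stack_engines is a hypothetical helper name.
list_stack_engines() {
    # $1: stack label value, defaults to "integration"
    docker ps -aq \
        --filter label=galaxy.backend=1 \
        --filter "label=galaxy.stack=${1:-integration}"
}
# usage:
#   ids=$(list_stack_engines)
#   [ -n "$ids" ] && docker rm -f $ids
```

With only the `galaxy.backend=1` filter, the same listing would also return engines started by dev-deploy, which is the unscoped behaviour described above.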
This covers push-triggered cycles where dev-deploy.yaml and
integration.yaml run on the same Gitea host. The original
hypothesis (a workflow_dispatch dev-deploy solo run also losing
the engine) is not explained by the integration fix — manual
dispatches do not trigger integration.yaml. Keep this entry open
until a solo-dispatch reproduction confirms whether the symptom
still occurs.
Status
Partially fixed (push-triggered cycles). Solo workflow_dispatch
reproductions still open. If the symptom recurs now that the
integration fix has landed, capture docker events --since 0 for the
full dispatch window and attach the output here.
Workaround in use today
When the sandbox game flips to cancelled, redispatch dev-deploy:
curl -X POST -n -H 'Content-Type: application/json' \
-d '{"ref":"<branch>"}' \
https://gitea.iliadenisov.ru/api/v1/repos/developer/galaxy-game/actions/workflows/dev-deploy.yaml/dispatches
The next boot's purgeTerminalSandboxGames removes the cancelled
row, findOrCreateSandboxGame creates a fresh one, and
ensureMembershipsAndDrive puts the new game back to running.
Owner
Unassigned. File an issue once we have the runtime / reconciler analysis above; reference this section in the issue body so future redeploys can short-circuit the diagnostic loop.
docker restart galaxy-dev-backend fails after the CI runner cleans up
Status: fixed (2026-05-19). Kept here as a postmortem in case the symptom resurfaces in a different form.
Symptom
docker restart galaxy-dev-backend from the host failed with:
Error response from daemon: ... error mounting
"/home/runner/.cache/act/<workspace>/hostexecutor/pkg/geoip/test-data/test-data/GeoIP2-Country-Test.mmdb"
to rootfs at "/var/lib/galaxy/geoip.mmdb": ... not a directory
The container ended up Exited (127) and never came back.
Cause
tools/dev-deploy/docker-compose.yml used to mount the geoip
database via a path relative to the compose file
(../../pkg/geoip/test-data/test-data/GeoIP2-Country-Test.mmdb). When
the dev-deploy.yaml Gitea runner invoked docker compose up, it
resolved that relative path against the runner's ephemeral workspace
under /home/runner/.cache/act/<hash>/hostexecutor/tools/dev-deploy/,
so the bind-mount source baked into the running container pointed at
that ephemeral path. The runner deleted the workspace once the
workflow ended, the source disappeared, and the next docker restart
failed to remount it.
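A stale bind source of this kind can be spotted while the container is still running by listing its mounts. The sketch below is illustrative; the function name `show_mounts` is hypothetical.

```shell
#!/bin/sh
# Sketch: print "<type> <source> -> <destination>" for each mount.
# show_mounts is a hypothetical helper name.
show_mounts() {
    # $1: container name or ID
    docker inspect -f \
        '{{range .Mounts}}{{.Type}} {{.Source}} -> {{.Destination}}{{println}}{{end}}' \
        "$1"
}
# usage: show_mounts galaxy-dev-backend
```

A line of type `bind` whose source sits under the runner's cache directory is the pattern that broke here, while a named volume is reported with type `volume`.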
Fix
Replaced the bind-mount with a named volume,
galaxy-dev-geoip-data, seeded by the dev-deploy.yaml workflow
(and by the new make seed-geoip target) at deploy time. The
backend mounts the volume as /var/lib/galaxy:ro, so the mount
source is a Docker-managed volume that does not depend on the
runner workspace and survives a docker restart. See
.gitea/workflows/dev-deploy.yaml ("Seed geoip volume" step) and
tools/dev-deploy/Makefile (seed-geoip target).
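The seeding pattern can be sketched as follows. This is an approximation, not the actual workflow step or Makefile target: the function name `seed_geoip_volume` is hypothetical, and the in-volume file name `geoip.mmdb` is an assumption taken from the mount target in the error message above.

```shell
#!/bin/sh
# Sketch: copy the geoip fixture into a named volume via a throwaway container.
# seed_geoip_volume is a hypothetical helper name; geoip.mmdb as the target
# file name is an assumption based on the old mount target.
seed_geoip_volume() {
    src_dir=$1                          # absolute host path holding the fixture
    volume=${2:-galaxy-dev-geoip-data}  # named volume to populate
    docker volume create "$volume" >/dev/null &&
    docker run --rm \
        -v "$src_dir:/src:ro" \
        -v "$volume:/dst" \
        alpine cp /src/GeoIP2-Country-Test.mmdb /dst/geoip.mmdb
}
# usage, from the repository root:
#   seed_geoip_volume "$PWD/pkg/geoip/test-data/test-data"
```

The host path is needed only while the copy runs, so the backend container never records a source inside the runner workspace.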