fix(dev-deploy): recycle engine containers on galaxy-engine:dev SHA drift

`backend`'s reconciler adopts pre-existing `galaxy-game-*` containers
without comparing their image SHA against the freshly-built
`galaxy-engine:dev`, so a long-lived sandbox keeps serving the
previous engine code after a redeploy. Issue #59 surfaced
this: after the per-command-rejection fix was deployed via
`workflow_dispatch`, the running sandbox container was still on the
old image SHA and the browser kept seeing the 503/unavailable response.

Adds a `Recycle engine containers on image drift` step right before
`Reap stray dev-deploy containers`. The step compares the new
`galaxy-engine:dev` SHA against every running `galaxy-game-*`
container and, on drift, stops the backend, removes the container,
wipes the bind-mounted per-game state directory (Engine.Init() writes
turn-0 over any pre-existing `turn-N` files — silent state corruption
otherwise), and cascade-deletes the lobby `games` row. The
`dev-sandbox` bootstrap on the next backend boot finds no live
sandbox and provisions a fresh one on the new engine image.
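
For illustration, the final cleanup can be sketched as a small bash
helper that builds the `DELETE` statement from the drifted game IDs.
This is a minimal sketch, not the step's actual code: the function
name `build_delete_sql` and the ID character check are assumptions
added here, on the premise that game IDs are UUID-like tokens.

```bash
#!/usr/bin/env bash

# Hypothetical helper (not part of the workflow): print a DELETE
# statement covering every game ID passed as an argument.
build_delete_sql() {
  local id ids_csv
  # Refuse an empty list: printf would otherwise emit a lone ''.
  [ "$#" -gt 0 ] || return 1
  # Assumption: IDs are UUID-like, so anything outside this character
  # set is rejected rather than interpolated into SQL.
  for id in "$@"; do
    [[ $id =~ ^[A-Za-z0-9_-]+$ ]] || return 1
  done
  # printf reuses its format for each argument: 'a','b',
  ids_csv=$(printf "'%s'," "$@")
  # Drop the trailing comma.
  ids_csv=${ids_csv%,}
  printf 'DELETE FROM backend.games WHERE game_id IN (%s);\n' "$ids_csv"
}

build_delete_sql 1111 2222
# prints: DELETE FROM backend.games WHERE game_id IN ('1111','2222');
```

The statement only names `backend.games`; removal of the dependent
rows relies on the cascading foreign keys described above.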

When the engine sources are unchanged, the BuildKit cache hits and
the SHA stays the same — the recycle step is a no-op and the running
games keep their state across the deploy. Verified end-to-end against
the live dev environment.
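
The no-op versus recycle decision can be sketched without a Docker
daemon. The helper below is hypothetical (`drifted_ids` is a name
chosen for this example): it reads `name image` pairs on stdin, which
in the real step come from `docker ps` and `docker inspect`, and
prints the game ID of each container whose image differs from the
fresh SHA. The SHAs shown are made up.

```bash
#!/usr/bin/env bash

# Hypothetical helper: $1 is the fresh image SHA; stdin carries one
# "container-name image-sha" pair per line.
drifted_ids() {
  local new_sha=$1 name image
  while read -r name image; do
    # Skip blank lines.
    [ -n "$name" ] || continue
    if [ "$image" != "$new_sha" ]; then
      # Strip the container-name prefix to recover the game ID.
      printf '%s\n' "${name#galaxy-game-}"
    fi
  done
}

printf '%s\n' \
  'galaxy-game-alpha sha256:aaa' \
  'galaxy-game-beta sha256:bbb' | drifted_ids sha256:aaa
# prints: beta
```

An empty result corresponds to the cache-hit case: nothing is stopped,
wiped, or deleted.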

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Ilia Denisov
2026-05-29 10:47:25 +02:00
parent af30846091
commit e038ea6154
2 changed files with 92 additions and 3 deletions
+69 -3
@@ -148,14 +148,80 @@ jobs:
-v "${{ gitea.workspace }}/pkg/geoip/test-data/test-data:/src:ro" \
alpine sh -c 'cp /src/GeoIP2-Country-Test.mmdb /dst/geoip.mmdb'
- name: Recycle engine containers on image drift
run: |
# Compare the freshly-built `galaxy-engine:dev` SHA against
# every running `galaxy-game-*` container. The backend
# reconciler adopts pre-existing labelled engine containers
# without checking image drift, so a running sandbox would
# otherwise keep serving the previous engine code until the
# container is recycled by hand. This step makes the recycle
# automatic but only when it is actually needed:
#
# * BuildKit cache hit on the `Build galaxy-engine image`
# step → `galaxy-engine:dev` keeps its previous SHA →
# no drift → no-op (no engine source change to deploy).
# * engine source change → fresh SHA → for each drifted
# container we stop the backend, remove the container,
# wipe its bind-mounted state directory (Engine.Init()
# writes turn-0 over any pre-existing `turn-N` files —
# silent state corruption otherwise), and cascade-delete
# the lobby `games` row (the FKs in `00001_init.sql`
# drop the matching `runtime_records`, `memberships`,
# `player_mappings`, etc. in the same write). The
# `dev-sandbox` bootstrap on the next backend boot finds
# no live sandbox and provisions a fresh one on the new
# engine image.
#
# Backend is stopped first to keep the reconciler from
# racing the recycle (mid-stream adoption / restart). The
# subsequent `Bring up the stack` step restarts it.
set -u
new_sha=$(docker image inspect galaxy-engine:dev --format '{{.Id}}')
echo "fresh galaxy-engine:dev = $new_sha"
drift=()
for c in $(docker ps --filter "name=galaxy-game-" --format '{{.Names}}'); do
cur=$(docker inspect "$c" --format '{{.Image}}')
if [ "$cur" != "$new_sha" ]; then
drift+=("${c#galaxy-game-}")
echo " drift: $c was on $cur"
else
echo " match: $c"
fi
done
if [ ${#drift[@]} -eq 0 ]; then
echo "no drift detected — recycle skipped"
else
docker stop -t 30 galaxy-dev-backend >/dev/null 2>&1 || true
state_root="$HOME/.galaxy-dev/game-state"
for gid in "${drift[@]}"; do
echo "recycling $gid"
docker rm -f "galaxy-game-$gid" >/dev/null 2>&1 || true
# Wipe the per-game state dir as root inside a throwaway
# container so we can remove files left behind by the
# engine container even when its uid differs from the
# runner's.
docker run --rm -v "$state_root:/state" alpine \
sh -c "rm -rf -- /state/$gid"
done
ids_csv=$(printf "'%s'," "${drift[@]}")
ids_csv=${ids_csv%,}
docker exec galaxy-dev-postgres psql -v ON_ERROR_STOP=1 \
-U galaxy -d galaxy_backend \
-c "DELETE FROM backend.games WHERE game_id IN (${ids_csv});"
fi
- name: Reap stray dev-deploy containers
run: |
# Remove any non-running compose-managed containers from
# earlier deploys before `compose up`. Filter by the stack
# label so we never touch unrelated workloads on the same
-# daemon. Running containers (incl. engine instances backend
-# spawned itself with the same label) are left intact
-# those are reattached by the backend reconciler on boot.
+# daemon. Running engine containers spawned by backend with
+# the same label are left intact when their image SHA still
+# matches the freshly-built `galaxy-engine:dev` (handled by
+# the preceding `Recycle engine containers on image drift`
+# step); the reconciler reattaches them on backend boot.
ids=$(docker ps -aq \
--filter "label=galaxy.stack=dev-deploy" \
--filter "status=exited" \