fix(dev-deploy): recycle engine containers on galaxy-engine:dev SHA drift
Tests · Integration / integration (pull_request) Successful in 1m48s
Tests · Go / test (pull_request) Successful in 2m1s

`backend`'s reconciler adopts pre-existing `galaxy-game-*` containers
without comparing their image SHA against the freshly-built
`galaxy-engine:dev`, so a long-lived sandbox would otherwise keep
serving the previous engine code after a redeploy. Issue #59 surfaced
this: after the per-command-rejection fix was deployed via
`workflow_dispatch`, the running sandbox container was still on the
old image SHA and the browser kept seeing the 503/unavailable response.

Adds a `Recycle engine containers on image drift` step right before
`Reap stray dev-deploy containers`. The step compares the new
`galaxy-engine:dev` SHA against every running `galaxy-game-*`
container and, on drift, stops the backend, removes the container,
wipes the bind-mounted per-game state directory (Engine.Init() writes
turn-0 over any pre-existing `turn-N` files — silent state corruption
otherwise), and cascade-deletes the lobby `games` row. The
`dev-sandbox` bootstrap on the next backend boot finds no live
sandbox and provisions a fresh one on the new engine image.

When the engine sources are unchanged, the BuildKit cache hits and
the SHA stays the same — the recycle step is a no-op and the running
games keep their state across the deploy. Verified end-to-end against
the live dev environment.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
Ilia Denisov
2026-05-29 10:47:25 +02:00
parent af30846091
commit e038ea6154
2 changed files with 92 additions and 3 deletions
+23
View File
@@ -235,6 +235,29 @@ The deploy is idempotent — when the PR later merges into
healthcheck steps, overwriting whatever the manual dispatch left
behind. There is no separate state to clean up between the two paths.
### Engine image drift recycle
`backend` spawns one engine container per game (the long-lived "Dev
Sandbox" plus any user-created games) and the reconciler reattaches
to whatever it finds with the `galaxy.stack=dev-deploy` label. That
reattach does not check the running container's image SHA against the
freshly-built `galaxy-engine:dev` tag, so an unchanged container would
otherwise keep serving the previous engine code after a redeploy.
The `dev-deploy.yaml` workflow handles this in the
`Recycle engine containers on image drift` step. When `docker build`
produces a new `galaxy-engine:dev` SHA, the step compares it against
every running `galaxy-game-*` container and, for each drifted one,
stops the backend, removes the container, wipes its bind-mounted
state directory (Engine.Init() writes turn-0 over any pre-existing
`turn-N` files), and cascade-deletes the lobby `games` row. The
`dev-sandbox` bootstrap on the next backend boot finds no live
sandbox and provisions a fresh one on the new engine image.
When the engine sources are unchanged, the BuildKit cache hits and
the SHA stays the same — the recycle step is a no-op and the running
games keep their state across the deploy.
## Relationship to other infrastructure
- `tools/local-dev/` — single-developer playground, host-port mapped,