chore(dev-deploy): KNOWN-ISSUES entry for sandbox-cancel after redispatch #12

Merged
developer merged 3 commits from chore/dev-sandbox-cancel-todo into development 2026-05-16 21:17:37 +00:00

Captures the diagnostic notes for a long-running dev environment issue: the auto-provisioned Dev Sandbox game flips to `cancelled` ~15 minutes after a `dev-deploy.yaml` redispatch. No `engine spawned` log appears between bootstrap and the reconcile cancel, and the engine container is missing from `docker ps -a` entirely.

This is a docs-only change. `tools/dev-deploy/KNOWN-ISSUES.md` records:

  • The symptom and the relevant log excerpt from the original bad window (2026-05-16 20:24-20:39 UTC).
  • The diagnostic evidence that already cleared the obvious suspects: the engine container has no compose labels and `AutoRemove=false`, `runtime.Service` has no `Shutdown` that reaps engines on backend exit, and two controlled redispatches under `docker events` capture failed to reproduce the destroy.
  • The host-side hypotheses (`docker prune` cron, manual `docker rm`, `dockerd` restart, idle-state engine crash), which the project owner rejected after the investigation.
  • The remaining suspicion that the issue lives somewhere inside the `dev-deploy.yaml` CI job sequence, plus the concrete next step (a `docker events --since 0` capture armed before the deploy starts).
  • A `Status: parked` note. The bug is mildly disruptive and a `tools/dev-deploy/` rewrite is on the medium-term list; reopen if the symptom recurs with a fresh trace attached.

No code changes.

developer added 3 commits 2026-05-16 21:17:05 +00:00
Capture the diagnostic notes for the issue we hit after every
`dev-deploy.yaml` redispatch: the freshly-bootstrapped "Dev Sandbox"
game ends up `cancelled` ~15 minutes later, with the runtime
reconciler reporting "container disappeared". The engine never
shows up in `docker ps -a --filter label=galaxy-game-engine`, so
either it never spawned or it was removed before any host-side
snapshot.

`KNOWN-ISSUES.md` records the symptom, the log excerpt, three
working hypotheses (runtime spawn race, `--remove-orphans`
interaction, engine `--rm` lifecycle), and an investigation
checklist to work through before opening an issue. The README gets
a one-line pointer so future redeploys land on the doc immediately.

No code change — this is the placeholder so the next person
investigating the cancellation pattern does not have to
rediscover the diagnostic from scratch.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
KNOWN-ISSUES: rule out compose orphan reap; narrow to host-side reap
Tests · UI / test (push) Successful in 2m36s
Tests · Go / test (push) Successful in 2m38s
cadb72b412
A live `docker inspect` of an engine container and two redispatch
runs with `docker events` captured confirm:

- Engine has no `com.docker.compose.*` labels and `AutoRemove=false`,
  so `--remove-orphans` cannot reap it.
- Two consecutive `dev-deploy.yaml` redispatches with an engine
  already running emitted `die` / `destroy` events only for
  `galaxy-dev-{backend,api,caddy}` — never for the engine.
- The reconciler tick that fires 60s after backend recreate
  correctly matched the surviving engine in both cases
  (`status=running` in both `games` and `runtime_records`).
- `runtime.Service` has no `Shutdown` that proactively removes
  engine containers, so a graceful backend exit also leaves them
  alone.
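
The first check above can be repeated against any container. The following is a minimal sketch, not part of this change: it assumes the JSON array printed by `docker inspect` is passed in as a string, and the helper name `reap_risks` is hypothetical.

```python
import json

# Compose identifies the containers it manages by labels under this prefix.
COMPOSE_LABEL_PREFIX = "com.docker.compose."


def reap_risks(inspect_json: str) -> dict[str, list[str]]:
    """Map container name -> reasons it could be removed automatically.

    `inspect_json` is the JSON array printed by `docker inspect <container>...`.
    An empty list means neither check found a reason.
    """
    risks: dict[str, list[str]] = {}
    for container in json.loads(inspect_json):
        labels = (container.get("Config") or {}).get("Labels") or {}
        host_config = container.get("HostConfig") or {}
        reasons = []
        if any(key.startswith(COMPOSE_LABEL_PREFIX) for key in labels):
            reasons.append("has compose labels")
        if host_config.get("AutoRemove"):
            reasons.append("AutoRemove=true")
        # `docker inspect` reports names with a leading slash.
        risks[container.get("Name", "?").lstrip("/")] = reasons
    return risks
```

An engine container matching the findings above (no `com.docker.compose.*` labels, `AutoRemove=false`) would map to an empty list. The sketch covers only these two removal paths; it says nothing about host-side cleanup.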

The original bad window therefore needs a separate trigger, one
that removed the engine container outside of compose. The new
hypotheses point at host-side `docker prune` jobs, a `dockerd`
restart that lost the container, or an early `Engine.Init` failure
that exited the engine before `status=running` reached the runtime
row. The investigation list now leads with `journalctl -u docker`
and the host crontab, the cheapest checks to confirm or rule out
next.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
After the live investigation, the project owner confirms that none
of the host-side cleanup paths apply: no `docker prune` cron, no
manual `docker rm`, no `dockerd` restart in the window, and the
engine binary does not crash while idling on API calls.

Replace the host-side hypothesis list with a one-line note that
they were considered and rejected, narrow the open suspicion to
the `dev-deploy.yaml` job sequence (`docker build` + `docker
compose build` + the alpine `docker run --rm` for UI seeding +
`docker compose up -d --wait --remove-orphans`), and park the
entry. Reopen if the symptom recurs with a fresh
`docker events --since 0` capture armed before the deploy
starts.
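
One way to read such a capture afterwards, as a hedged sketch rather than anything this change adds: it assumes the events were recorded one JSON object per line (for example with `docker events --since 0 --filter type=container --format '{{json .}}'` redirected to a file), and the helper name `unexpected_removals` is hypothetical.

```python
import json

# Container actions that mean a container stopped or was removed.
REMOVAL_ACTIONS = {"die", "destroy"}


def unexpected_removals(event_lines, expected_names):
    """Return (time, action, name) for die/destroy events on containers
    that the deploy is NOT expected to recreate.

    `event_lines` is an iterable of JSON strings, one Docker event each.
    `expected_names` is the set of container names compose recreates.
    """
    hits = []
    for line in event_lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        if event.get("Type") != "container":
            continue
        if event.get("Action") not in REMOVAL_ACTIONS:
            continue
        attributes = (event.get("Actor") or {}).get("Attributes") or {}
        name = attributes.get("name", "")
        if name not in expected_names:
            hits.append((event.get("time"), event["Action"], name))
    return hits
```

Called with `expected_names={"galaxy-dev-backend", "galaxy-dev-api", "galaxy-dev-caddy"}`, any returned tuple would name a container that died or was destroyed outside the expected compose recreate, which is the event the two controlled redispatches never showed for the engine.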

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
developer merged commit 5eec7013ba into development 2026-05-16 21:17:37 +00:00

Reference: developer/galaxy-game#12