# Flows

This document collects the lifecycle and observability flows that span Runtime Manager and its synchronous and asynchronous neighbours. Narrative descriptions of the rules these flows enforce live in ../README.md; the diagrams here focus on message order across those boundaries. Design-rationale records linked from each section explain the why.

## Start (happy path)

```mermaid
sequenceDiagram
    participant Lobby as Lobby publisher
    participant Stream as runtime:start_jobs
    participant Consumer as startjobsconsumer
    participant Service as startruntime
    participant Lease as Redis lease
    participant Docker
    participant PG as Postgres
    participant Health as runtime:health_events
    participant Results as runtime:job_results

    Lobby->>Stream: XADD {game_id, image_ref, requested_at_ms}
    Consumer->>Stream: XREAD
    Consumer->>Service: Handle(game_id, image_ref, OpSourceLobbyStream, entry_id)
    Service->>Lease: SET NX PX rtmanager:game_lease:{game_id}
    Service->>PG: SELECT runtime_records WHERE game_id
    Service->>Docker: PullImage(image_ref) per pull policy
    Service->>Docker: InspectImage → resource limits
    Service->>Service: prepareStateDir(<root>/{game_id})
    Service->>Docker: ContainerCreate + ContainerStart
    Service->>PG: Upsert runtime_records (status=running)
    Service->>PG: INSERT operation_log (op_kind=start, outcome=success)
    Service->>Health: XADD container_started
    Service-->>Consumer: Result{Outcome=success, ContainerID, EngineEndpoint}
    Consumer->>Results: XADD {outcome=success, container_id, engine_endpoint}
    Service->>Lease: DEL rtmanager:game_lease:{game_id}
```

REST callers (Game Master, Admin Service) drive the same service through POST /api/v1/internal/runtimes/{game_id}/start; the diagram's last two arrows collapse to an HTTP 200 response carrying the runtime record. Sources: ../README.md §Lifecycles → Start, services.md §3.

## Start failure (image pull)

```mermaid
sequenceDiagram
    participant Service as startruntime
    participant Docker
    participant PG as Postgres
    participant Intents as notification:intents
    participant Results as runtime:job_results

    Service->>Docker: PullImage(image_ref)
    Docker-->>Service: error
    Service->>PG: INSERT operation_log (op_kind=start, outcome=failure, error_code=image_pull_failed)
    Service->>Intents: XADD runtime.image_pull_failed {game_id, image_ref, error_code, error_message, attempted_at_ms}
    Service-->>Service: Result{Outcome=failure, ErrorCode=image_pull_failed}
    Service->>Results: XADD {outcome=failure, error_code=image_pull_failed}
```

The same shape applies to configuration-validation failures (start_config_invalid from EnsureNetwork(ErrNetworkMissing), prepareStateDir, or an invalid image_ref shape) and to the Docker create/start failure (container_start_failed); only the error code and the matching runtime.* notification type differ. Three failure codes never raise an admin notification: conflict, service_unavailable, and internal_error (services.md §4).

## Start failure (orphan / Upsert-after-Run rollback)

```mermaid
sequenceDiagram
    participant Service as startruntime
    participant Docker
    participant PG as Postgres
    participant Intents as notification:intents

    Service->>Docker: ContainerCreate + ContainerStart
    Docker-->>Service: container running
    Service->>PG: Upsert runtime_records
    PG-->>Service: error (transport / constraint)
    Note over Service: container is now an orphan<br/>(running, no PG record)
    Service->>Docker: Remove(container_id) [fresh background context]
    Docker-->>Service: ok or logged failure
    Service->>PG: INSERT operation_log (outcome=failure, error_code=container_start_failed)
    Service->>Intents: XADD runtime.container_start_failed
    Service-->>Service: Result{Outcome=failure, ErrorCode=container_start_failed}
```

The Docker adapter already removes the container when Run itself fails after a successful ContainerCreate (adapters.md §3); the start service adds the post-Run rollback for the Upsert path. A Remove failure is logged but not propagated; the reconciler adopts surviving orphans on its periodic pass (services.md §5).

## Stop

```mermaid
sequenceDiagram
    participant Caller as Lobby / GM / Admin
    participant Service as stopruntime
    participant Lease as Redis lease
    participant PG as Postgres
    participant Docker
    participant Results as runtime:job_results

    Caller->>Service: stop(game_id, reason)
    Service->>Lease: SET NX PX rtmanager:game_lease:{game_id}
    Service->>PG: SELECT runtime_records WHERE game_id
    alt status in {stopped, removed}
        Service->>PG: INSERT operation_log (outcome=success, error_code=replay_no_op)
        Service-->>Caller: success / replay_no_op
    else status = running
        Service->>Docker: ContainerStop(container_id, RTMANAGER_CONTAINER_STOP_TIMEOUT_SECONDS)
        Docker-->>Service: ok
        Service->>PG: UpdateStatus running→stopped (CAS by container_id)
        Service->>PG: INSERT operation_log (op_kind=stop, outcome=success)
        Service-->>Caller: success
    end
    Service->>Lease: DEL rtmanager:game_lease:{game_id}
```

Lobby callers receive the outcome through runtime:job_results; REST callers receive an HTTP 200. The reason enum (orphan_cleanup | cancelled | finished | admin_request | timeout) is recorded in operation_log and is otherwise opaque to the stop service — RTM does not branch on the reason in v1 (services.md §15, §17).

## Restart

```mermaid
sequenceDiagram
    participant Admin as GM / Admin
    participant Service as restartruntime
    participant Stop as stopruntime.Run
    participant Start as startruntime.Run
    participant Docker
    participant PG as Postgres

    Admin->>Service: POST /restart
    Service->>PG: SELECT runtime_records WHERE game_id
    Note over Service: capture current image_ref
    Service->>Service: acquire per-game lease (held across both inner ops)
    Service->>Stop: Run(game_id) [lease bypass]
    Stop->>Docker: ContainerStop
    Stop->>PG: UpdateStatus running→stopped
    Service->>Docker: ContainerRemove
    Service->>Start: Run(game_id, image_ref) [lease bypass]
    Start->>Docker: PullImage / Run
    Start->>PG: Upsert runtime_records (status=running)
    Service->>PG: INSERT operation_log (op_kind=restart, outcome=success, source_ref=correlation_id)
    Service-->>Admin: 200 {runtime_record}
    Service->>Service: release lease
```

The lease is acquired by restartruntime and held across both inner operations; stopruntime.Run and startruntime.Run are lease-bypass entry points that skip the inner lease acquisition (services.md §12). The single operation_log row uses Input.SourceRef as a correlation id linking the implicit stop and start entries (services.md §13).

## Patch

```mermaid
sequenceDiagram
    participant Admin as GM / Admin
    participant Service as patchruntime
    participant Restart as restartruntime.Run

    Admin->>Service: POST /patch {image_ref: "galaxy/game:1.4.2"}
    Service->>Service: parse new image_ref + current image_ref
    alt either ref not semver
        Service-->>Admin: 422 image_ref_not_semver
    else major or minor differ
        Service-->>Admin: 422 semver_patch_only
    else major.minor match, patch differs (or equal)
        Service->>Restart: Run(game_id, new_image_ref)
        Restart-->>Service: Result
        Service-->>Admin: 200 {runtime_record}
    end
```

The semver gate uses the tag fragment of the Docker reference; the extraction strategy is recorded in services.md §14. The restart delegate already owns the lease, the inner stop/start, the operation log, and the runtime:health_events container_started emission (workers.md §1).

## Cleanup TTL

```mermaid
sequenceDiagram
    participant Worker as containercleanup worker
    participant PG as Postgres
    participant Service as cleanupcontainer
    participant Lease as Redis lease
    participant Docker

    loop every RTMANAGER_CLEANUP_INTERVAL
        Worker->>PG: SELECT runtime_records WHERE status='stopped' AND last_op_at < now - retention
        loop per game
            Worker->>Service: cleanup(game_id, op_source=auto_ttl)
            Service->>Lease: SET NX PX rtmanager:game_lease:{game_id}
            Service->>PG: re-read runtime_records WHERE game_id
            alt status = running
                Service-->>Worker: refused / conflict
            else status in {stopped, removed}
                Service->>Docker: ContainerRemove(container_id)
                Service->>PG: UpdateStatus stopped→removed (CAS)
                Service->>PG: INSERT operation_log (op_kind=cleanup_container)
                Service-->>Worker: success
            end
            Service->>Lease: DEL rtmanager:game_lease:{game_id}
        end
    end

Admin-driven cleanup follows the same path through DELETE /api/v1/internal/runtimes/{game_id}/container with op_source=admin_rest instead of auto_ttl. The host state directory is never removed by this flow (../README.md §Cleanup, services.md §17, workers.md §19).

## Reconcile drift adopt

```mermaid
sequenceDiagram
    participant Reconciler as reconcile worker
    participant Docker
    participant PG as Postgres
    participant Lease as Redis lease

    Note over Reconciler: read pass (lockless)
    Reconciler->>Docker: List({label=com.galaxy.owner=rtmanager})
    Reconciler->>PG: ListByStatus(running)
    Note over Reconciler: write pass (per-game lease)
    loop per Docker container without matching record
        Reconciler->>Lease: SET NX PX rtmanager:game_lease:{game_id}
        Reconciler->>PG: re-read runtime_records WHERE game_id
        alt record now exists
            Reconciler-->>Reconciler: skip (state changed since read pass)
        else record still missing
            Reconciler->>PG: Upsert runtime_records (status=running, image_ref, started_at)
            Reconciler->>PG: INSERT operation_log (op_kind=reconcile_adopt, op_source=auto_reconcile)
        end
        Reconciler->>Lease: DEL rtmanager:game_lease:{game_id}
    end
```

The reconciler never stops or removes an unrecorded container — operators may have started one manually for diagnostics. The reconcile_dispose and observed_exited paths follow the same read-pass / write-pass split, with dispose updating the orphaned record to removed and emitting container_disappeared, and observed_exited updating to stopped and emitting container_exited (../README.md §Reconciliation, workers.md §14–§16).

## Health probe hysteresis

```mermaid
sequenceDiagram
    participant Worker as healthprobe worker
    participant State as in-memory probe state
    participant Engine as galaxy-game-{id}:8080
    participant Health as runtime:health_events

    loop every RTMANAGER_PROBE_INTERVAL
        Worker->>Worker: ListByStatus(running)
        Worker->>State: prune entries for games no longer running
        loop per game (semaphore cap = 16)
            Worker->>Engine: GET /healthz (RTMANAGER_PROBE_TIMEOUT)
            alt success
                State->>State: consecutiveFailures = 0
                opt failurePublished was true
                    Worker->>Health: XADD probe_recovered {prior_failure_count}
                    State->>State: failurePublished = false
                end
            else failure
                State->>State: consecutiveFailures++
                opt consecutiveFailures == RTMANAGER_PROBE_FAILURES_THRESHOLD AND not failurePublished
                    Worker->>Health: XADD probe_failed {consecutive_failures, last_status, last_error}
                    State->>State: failurePublished = true
                end
            end
        end
    end
```

Hysteresis prevents a single transient failure from emitting a probe_failed event, and prevents repeated emission while the failure persists. State is non-persistent: a process restart re-establishes the counters from scratch; a game's state is pruned when it transitions out of the running list (workers.md §5–§6).