multica/server/migrations/080_agent_task_queue_queued_index.down.sql
Multica Eve a2dd80d4f6 feat(autopilot): skip dispatch when assignee runtime is offline (MUL-1899) (#2311)
* feat(autopilot): skip dispatch when assignee runtime is offline (MUL-1899)

Prevents scheduled autopilots from accumulating doomed tasks against
offline / archived / unbound agents. Before this change, a paused laptop
or crashed daemon would let a 5-minute-cron autopilot pile up thousands
of queued agent_task_queue rows that no runtime would ever drain — this
is the dominant source of the 89k stuck-task backlog flagged in MUL-1899.

DispatchAutopilot now performs a pre-flight admission check on the
assignee agent's runtime status. If the runtime is not 'online' (or the
agent is archived / has no runtime bound / has no assignee), the run is
recorded as 'skipped' with a failure_reason and no task is enqueued.
Skipped runs still emit autopilot:run.done so the UI / activity feed
reflect that the trigger fired and was evaluated.
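
A minimal sketch of the admission check described above, not the actual
DispatchAutopilot code. The `Agent` type, its fields, and the reason
strings are hypothetical names chosen for illustration; only the four
skip conditions (no assignee, archived, no runtime bound, runtime not
'online') come from this description.

```go
package main

import "fmt"

// Agent is a hypothetical, simplified view of an assignee agent.
type Agent struct {
	Archived      bool
	RuntimeID     string // empty means no runtime is bound
	RuntimeStatus string // for example "online" or "offline"
}

// admissionSkipReason returns a non-empty reason when the run should be
// recorded as 'skipped' instead of enqueuing a task. A nil agent models
// an autopilot with no assignee. The reason strings are assumptions.
func admissionSkipReason(a *Agent) string {
	switch {
	case a == nil:
		return "no_assignee"
	case a.Archived:
		return "agent_archived"
	case a.RuntimeID == "":
		return "no_runtime_bound"
	case a.RuntimeStatus != "online":
		return "runtime_offline"
	default:
		return "" // admitted: the caller may enqueue the task
	}
}

func main() {
	offline := &Agent{RuntimeID: "rt-1", RuntimeStatus: "offline"}
	fmt.Println(admissionSkipReason(offline)) // prints "runtime_offline"
}
```

In this sketch an empty return value means the dispatch proceeds; any
other value would be stored as the run's failure_reason.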

Skipped runs are deliberately NOT counted toward the failure-ratio
auto-pause: a user who closes their laptop overnight should not have
their autopilot paused. Sustained server-side failures keep their
existing pause path via the failure monitor.

Tests: added an integration test that creates an offline runtime and
asserts DispatchAutopilot records a skipped run with no task enqueued.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: multica-agent <github@multica.ai>

* feat(scheduler): expire stale queued tasks via TTL sweeper (MUL-1899)

Companion to the dispatch-time admission gate added in this PR. The
admission gate prevents *new* tasks from being enqueued against an
offline runtime, but it does not drain the historical backlog
(~89k stuck queued rows observed at MUL-1899 baseline) and does not
help when a runtime goes offline *after* a task has already been
queued. This adds a passive TTL sweeper:

- New SQL query `ExpireStaleQueuedTasks` transitions queued tasks
  older than the TTL to status='failed' with
  failure_reason='queued_expired' and a clear error message.
- Sweep is capped per tick (`queuedExpireBatchSize`, default 500) via
  a CTE+LIMIT so that draining a large backlog cannot monopolise the
  DB on a single tick. At 30s ticks that caps the drain rate at
  60k rows/hour (500 rows x 120 ticks).
- Wired into the existing 30s `runRuntimeSweeper` loop alongside
  `sweepStaleTasks` and reuses `taskSvc.HandleFailedTasks` so the
  expired tasks broadcast `task:failed` events, reconcile agent
  status, and roll back any in-progress issues — same lifecycle as
  any other failed task.
- Default TTL = 2h. Conservatively above any reasonable
  "queued behind a long-running task" window (default agent timeout
  is 2h, sweeper runs every 30s) so legitimate work isn't expired.
- Integration tests cover the happy path (stale → expired, fresh →
  left alone, correct status/reason/error) and the per-tick batch cap.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: multica-agent <github@multica.ai>

* fix(autopilot): address review blockers from PR #2311 (MUL-1899)

GPT-Boy review of the offline-runtime + queued-TTL PR flagged four
blockers; this commit addresses them all.

1. Restore the 'skipped' autopilot_run status in the DB constraint.
   Migration 043 had removed 'skipped' along with the now-defunct
   concurrency_policy feature, so the new admission gate's INSERT of
   status='skipped' violated `autopilot_run_status_check` and broke
   `TestAutopilotDispatchSkipsWhenRuntimeOffline` in CI. New
   migration 079 re-adds 'skipped' to the CHECK list. The down
   migration migrates skipped → failed before re-tightening,
   mirroring what 043 did for the original removal.

2. Make `ExpireStaleQueuedTasks` race-safe.
   The CTE-then-UPDATE pattern could clobber a task that the daemon
   claimed between victim selection and the outer update. Two
   guards added:
     - `FOR UPDATE SKIP LOCKED` in the CTE so we never wait on a
       row that's currently being claimed (and never block the
       claim path either).
     - The outer UPDATE now re-checks `t.status = 'queued'` AND the
       TTL predicate so even if a row's lock is released after a
       successful claim, we cannot transition a now-dispatched/
       running task to 'failed'.

3. Add a partial index for the queued-TTL sweeper.
   `idx_agent_task_queue_queued_created_at` on `created_at WHERE
   status = 'queued'` — keeps the 30s sweep query (status=queued
   AND created_at < ... ORDER BY created_at LIMIT 500) cheap even
   when historical terminal rows accumulate (~89k+ at MUL-1899
   baseline). The partial predicate keeps the index tiny because
   only in-flight rows live in 'queued'.

4. Fix the failure-monitor denominator.
   `SelectAutopilotsExceedingFailureThreshold` had been counting
   'skipped' toward total runs, which would have diluted the failure
   ratio: a 100%-failing autopilot could mask itself behind a wall
   of admission skips. With 'skipped' restored as a real status,
   the auto-pause monitor must explicitly exclude it from BOTH
   numerator and denominator — admission skips are neither a
   success nor a failure.
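
To illustrate item 4, here is a hedged sketch of a failure ratio that
drops 'skipped' runs from both numerator and denominator. It is written
in Go for illustration only; the real logic lives in the
`SelectAutopilotsExceedingFailureThreshold` SQL query, and the function
name below is hypothetical.

```go
package main

import "fmt"

// failureRatio returns failed / (all runs except 'skipped').
// ok is false when no countable runs remain, so the caller can avoid
// pausing an autopilot whose recent runs were all admission skips.
func failureRatio(statuses []string) (ratio float64, ok bool) {
	var failed, counted int
	for _, s := range statuses {
		if s == "skipped" {
			continue // neither a success nor a failure
		}
		counted++
		if s == "failed" {
			failed++
		}
	}
	if counted == 0 {
		return 0, false
	}
	return float64(failed) / float64(counted), true
}

func main() {
	// Two real failures hidden behind eight admission skips.
	runs := []string{
		"failed", "failed",
		"skipped", "skipped", "skipped", "skipped",
		"skipped", "skipped", "skipped", "skipped",
	}
	fmt.Println(failureRatio(runs)) // prints "1 true"
}
```

Had the skips been counted in the denominator, the same runs would
yield 2/10 = 0.2 and the fully failing autopilot would look healthy.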

Verified: `go test ./cmd/server/... ./internal/service/...` passes
(including TestAutopilotDispatchSkipsWhenRuntimeOffline,
TestExpireStaleQueuedTasks,
TestExpireStaleQueuedTasksRespectsBatchLimit).
`go build ./... && go vet ./...` clean.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: multica-agent <github@multica.ai>

* fix(migrations): split queued-task TTL index into concurrent migration

Per PR #2311 review: agent_task_queue is a hot table, so building the
new partial index with plain CREATE INDEX inside migration 079 would
hold a SHARE lock on the queue, blocking writes (and therefore
dispatch) during deploy.

The migration runner does not allow CONCURRENTLY to share a file with
other statements (documented in 068), so split the index into its own
single-statement file 080 — matching the existing pattern in 035 /
067 / 074 / 075 / 078. Migration 079 keeps the autopilot_run
constraint change.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: multica-agent <github@multica.ai>

---------

Co-authored-by: Eve <eve@multica-ai.local>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: multica-agent <github@multica.ai>
2026-05-09 15:07:57 +08:00

2 lines
74 B
SQL

DROP INDEX CONCURRENTLY IF EXISTS idx_agent_task_queue_queued_created_at;