Files
multica/server/cmd
Multica Eve eca6102365 refactor(autopilot): autopilot_run.planned_at + DispatchAutopilotForPlan (2/3 MUL-3551) (#4443)
* refactor(scheduler): add PlansForScope hook for non-cadence jobs

The current Manager.plansForTick assumes a uniform Cadence grid:
plan_times are derived via FloorPlan(db_now, Cadence). That works for
rollup_task_usage_hourly but not for the upcoming Autopilot schedule
dispatch job, where each trigger has its own cron expression and the
plan_times do not snap to a single global grid.

This change adds an optional JobSpec.PlansForScope hook. When set:

  * Manager loads the latest stored plan for (job, scope) and passes
    a new LatestPlanInfo to the hook (exported from the previously
    private latestPlanInfo). The hook returns the plan_times to attempt
    this tick.
  * Cadence, CatchUpMode and CatchUpWindow are bypassed; the hook is
    in full control of plan_time selection.
  * MaxPlansPerTick still acts as a safety cap on the hook's output.
  * All other timing fields (RunTimeout / StaleTimeout /
    HeartbeatInterval / MaxAttempts / RetryBackoff / AllowStaleReentry)
    and the lease/heartbeat/terminal-write SQL primitives are reused
    unchanged.

JobSpec.validate now allows Cadence=0 when PlansForScope is set, and
makes the every_plan MaxPlansPerTick > 0 invariant fire only on
Cadence-driven every_plan jobs. Existing rollup_task_usage_hourly
behaviour is unchanged — that JobSpec leaves PlansForScope nil.

Tests:
  * TestJobSpecValidatePlansForScopeRelaxesCadence — validate() rules.
  * TestManagerPlansForScopeHookDrivesPlans — end-to-end hook delegation
    through the manager (DB-backed), proving that hook-returned
    plan_times go through the same tryClaim path, MaxPlansPerTick
    truncates without erroring, and LatestPlanInfo is populated on the
    second tick.
  * TestManagerPlansForScopeHookEmptyIsNoOp — empty hook output is a
    valid no-op.

No behaviour change for callers that don't set PlansForScope.

Refs MUL-3551

Co-authored-by: multica-agent <github@multica.ai>

* refactor(autopilot): add planned_at + DispatchAutopilotForPlan for occurrence idempotency

PR 2 of 3 for the scheduled-Autopilot refactor on MUL-3551.

Adds dispatch-layer idempotency for scheduled triggers. This is the
second line of defence behind the primary uq_sys_cron_execution
guarantee in sys_cron_executions: if a runner crashes between
"create autopilot_run" and "write SUCCESS in sys_cron_executions",
the next stale-steal retry re-enters dispatch with the SAME
(trigger_id, planned_at). Without a row-level guard, that retry
would create a duplicate autopilot_run, issue, and task.

Changes:

  * Migration 124: ALTER TABLE autopilot_run ADD COLUMN planned_at
    TIMESTAMPTZ + partial unique index on (trigger_id, planned_at)
    WHERE both are NOT NULL. Manual / webhook / api dispatch leaves
    planned_at NULL so they keep the existing semantics unchanged.

  * autopilot.sql: CreateAutopilotRun now takes planned_at;
    GetAutopilotRunByTriggerAndPlanned is the fast-path lookup used
    by DispatchAutopilotForPlan to detect a prior attempt's row
    without burning an INSERT.

  * service.DispatchAutopilotForPlan: new entry point for scheduled
    triggers that already know the canonical UTC plan_time of the
    occurrence they are firing. Looks up an existing run for
    (trigger_id, planned_at) and reuses it on a stale-steal retry;
    otherwise dispatches normally with planned_at stamped on the
    new run.

  * service.DispatchAutopilot keeps its current signature for
    manual / webhook / api callers (planned_at stays NULL).

  * recordSkippedRun also threads planned_at so the skip path
    participates in the same partial-unique guarantee.

  * sqlc v1.31.1 regenerated autopilot.sql.go + models.go.
    Unrelated workspace.sql.go drift restored.

Tests (against local Postgres):

  * TestDispatchAutopilotForPlanIsIdempotent — first call creates a
    run; second call with same (trigger, planned_at) reuses it
    (autopilot_run row count stays at 1); third call with a different
    planned_at on the same trigger creates a second run (proves we
    are not collapsing legitimate occurrences).

  * TestDispatchAutopilotForPlanRejectsZeroArgs — invalid trigger_id
    and zero planned_at both fail loudly so callers cannot silently
    disable the idempotency guard.

  * Existing autopilot listener / squad / dispatch tests all still
    pass.

This PR has no scheduler / handler / UI behaviour change on its own:
the new entry point exists but is not yet wired into the schedule
goroutine. PR 3 will register the autopilot_schedule_dispatch
JobSpec that consumes it and remove the legacy
cmd/server/autopilot_scheduler.go path.

Refs MUL-3551

Co-authored-by: multica-agent <github@multica.ai>

* fix(autopilot): DispatchAutopilotForPlan recovers partial-state runs

Review fix for #4443 (MUL-3551).

Before this change, DispatchAutopilotForPlan returned ANY existing
autopilot_run for (trigger_id, planned_at), including the
half-written rows produced when a runner crashed between
"CreateAutopilotRun" and "create downstream issue/task". The
scheduler handler would then write SUCCESS in sys_cron_executions
even though no issue or agent task was ever created, silently
losing the scheduled occurrence.

Fix:

  * New isAutopilotRunComplete helper classifies an existing run:
      - terminal status (completed / failed / skipped) → reuse.
      - issue_created with valid issue_id → reuse (the issue
        listener owns task creation from here).
      - running with valid task_id → reuse (the task is queued).
      - anything else → partial; must NOT short-circuit.

  * New SQL RecoverPartialAutopilotRun marks a partial row FAILED
    with a recovery reason AND clears its planned_at. The cleared
    planned_at releases the partial-unique slot in
    uq_autopilot_run_trigger_planned, letting the fresh dispatch
    INSERT a new row at the same (trigger_id, planned_at) without
    conflict.

  * DispatchAutopilotForPlan now branches on the lookup:
    complete run → return; partial run → recover-then-fresh-
    dispatch; not-found → fresh dispatch. The fresh dispatch path
    still goes through dispatchAutopilot, so the new row carries
    the real issue_id / task_id by the time the handler returns.

  * Tests: TestDispatchAutopilotForPlanRecoversPartialRun seeds a
    partial run (status='running', task_id=NULL for run_only;
    status='issue_created', issue_id=NULL for create_issue) and
    asserts the retry:
      - returns a DIFFERENT run row (no false reuse);
      - leaves the partial row in status='failed', planned_at=NULL,
        with a non-empty failure_reason for ops;
      - produces a fresh row with planned_at preserved AND the
        appropriate downstream linkage (task_id for run_only,
        issue_id for create_issue);
      - exactly one live row at (trigger_id, planned_at) after
        recovery, so the partial-unique constraint is honoured.

Existing TestDispatchAutopilotForPlanIsIdempotent and
TestDispatchAutopilotForPlanRejectsZeroArgs still pass — the
complete-reuse path is unchanged for the realistic SUCCESS-state
case.

Refs MUL-3551

Co-authored-by: multica-agent <github@multica.ai>

---------

Co-authored-by: Eve <eve@multica-ai.local>
Co-authored-by: multica-agent <github@multica.ai>
2026-06-24 12:01:10 +08:00
..