multica

mirror of https://github.com/multica-ai/multica.git synced 2026-07-05 13:29:44 +02:00

Files

Multica Eve eca6102365 refactor(autopilot): autopilot_run.planned_at + DispatchAutopilotForPlan (2/3 MUL-3551) (#4443 )

* refactor(scheduler): add PlansForScope hook for non-cadence jobs

The current Manager.plansForTick assumes a uniform Cadence grid:
plan_times are derived via FloorPlan(db_now, Cadence). That works for
rollup_task_usage_hourly but not for the upcoming Autopilot schedule
dispatch job, where each trigger has its own cron expression and the
plan_times do not snap to a single global grid.

This change adds an optional JobSpec.PlansForScope hook. When set:

* Manager loads the latest stored plan for (job, scope) and passes
a new LatestPlanInfo to the hook (exported from the previously
private latestPlanInfo). The hook returns the plan_times to attempt
this tick.
* Cadence, CatchUpMode and CatchUpWindow are bypassed; the hook is
in full control of plan_time selection.
* MaxPlansPerTick still acts as a safety cap on the hook's output.
* All other timing fields (RunTimeout / StaleTimeout /
HeartbeatInterval / MaxAttempts / RetryBackoff / AllowStaleReentry)
and the lease/heartbeat/terminal-write SQL primitives are reused
unchanged.

JobSpec.validate now allows Cadence=0 when PlansForScope is set, and
makes the every_plan MaxPlansPerTick > 0 invariant fire only on
Cadence-driven every_plan jobs. Existing rollup_task_usage_hourly
behaviour is unchanged — that JobSpec leaves PlansForScope nil.

Tests:
* TestJobSpecValidatePlansForScopeRelaxesCadence — validate() rules.
* TestManagerPlansForScopeHookDrivesPlans — end-to-end hook delegation
through the manager (DB-backed), proving that hook-returned
plan_times go through the same tryClaim path, MaxPlansPerTick
truncates without erroring, and LatestPlanInfo is populated on the
second tick.
* TestManagerPlansForScopeHookEmptyIsNoOp — empty hook output is a
valid no-op.

No behaviour change for callers that don't set PlansForScope.

Refs MUL-3551

Co-authored-by: multica-agent <github@multica.ai>

* refactor(autopilot): add planned_at + DispatchAutopilotForPlan for occurrence idempotency

PR 2 of 3 for the scheduled-Autopilot refactor on MUL-3551.

Adds dispatch-layer idempotency for scheduled triggers. This is the
second line of defence behind the primary uq_sys_cron_execution
guarantee in sys_cron_executions: if a runner crashes between
"create autopilot_run" and "write SUCCESS in sys_cron_executions",
the next stale-steal retry re-enters dispatch with the SAME
(trigger_id, planned_at). Without a row-level guard, that retry
would create a duplicate autopilot_run, issue, and task.

Changes:

* Migration 124: ALTER TABLE autopilot_run ADD COLUMN planned_at
TIMESTAMPTZ + partial unique index on (trigger_id, planned_at)
WHERE both are NOT NULL. Manual / webhook / api dispatch leaves
planned_at NULL so they keep the existing semantics unchanged.

* autopilot.sql: CreateAutopilotRun now takes planned_at;
GetAutopilotRunByTriggerAndPlanned is the fast-path lookup used
by DispatchAutopilotForPlan to detect a prior attempt's row
without burning an INSERT.

* service.DispatchAutopilotForPlan: new entry point for scheduled
triggers that already know the canonical UTC plan_time of the
occurrence they are firing. Looks up an existing run for
(trigger_id, planned_at) and reuses it on a stale-steal retry;
otherwise dispatches normally with planned_at stamped on the
new run.

* service.DispatchAutopilot keeps its current signature for
manual / webhook / api callers (planned_at stays NULL).

* recordSkippedRun also threads planned_at so the skip path
participates in the same partial-unique guarantee.

* sqlc v1.31.1 regenerated autopilot.sql.go + models.go.
Unrelated workspace.sql.go drift restored.

Tests (against local Postgres):

* TestDispatchAutopilotForPlanIsIdempotent — first call creates a
run; second call with same (trigger, planned_at) reuses it
(autopilot_run row count stays at 1); third call with a different
planned_at on the same trigger creates a second run (proves we
are not collapsing legitimate occurrences).

* TestDispatchAutopilotForPlanRejectsZeroArgs — invalid trigger_id
and zero planned_at both fail loudly so callers cannot silently
disable the idempotency guard.

* Existing autopilot listener / squad / dispatch tests all still
pass.

This PR has no scheduler / handler / UI behaviour change on its own:
the new entry point exists but is not yet wired into the schedule
goroutine. PR 3 will register the autopilot_schedule_dispatch
JobSpec that consumes it and remove the legacy
cmd/server/autopilot_scheduler.go path.

Refs MUL-3551

Co-authored-by: multica-agent <github@multica.ai>

* fix(autopilot): DispatchAutopilotForPlan recovers partial-state runs

Review fix for #4443 (MUL-3551).

Before this change, DispatchAutopilotForPlan returned ANY existing
autopilot_run for (trigger_id, planned_at), including the
half-written rows produced when a runner crashed between
"CreateAutopilotRun" and "create downstream issue/task". The
scheduler handler would then write SUCCESS in sys_cron_executions
even though no issue or agent task was ever created, silently
losing the scheduled occurrence.

Fix:

* New isAutopilotRunComplete helper classifies an existing run:
- terminal status (completed / failed / skipped) → reuse.
- issue_created with valid issue_id → reuse (the issue
listener owns task creation from here).
- running with valid task_id → reuse (the task is queued).
- anything else → partial; must NOT short-circuit.

* New SQL RecoverPartialAutopilotRun marks a partial row FAILED
with a recovery reason AND clears its planned_at. The cleared
planned_at releases the partial-unique slot in
uq_autopilot_run_trigger_planned, letting the fresh dispatch
INSERT a new row at the same (trigger_id, planned_at) without
conflict.

* DispatchAutopilotForPlan now branches on the lookup:
complete run → return; partial run → recover-then-fresh-
dispatch; not-found → fresh dispatch. The fresh dispatch path
still goes through dispatchAutopilot, so the new row carries
the real issue_id / task_id by the time the handler returns.

* Tests: TestDispatchAutopilotForPlanRecoversPartialRun seeds a
partial run (status='running', task_id=NULL for run_only;
status='issue_created', issue_id=NULL for create_issue) and
asserts the retry:
- returns a DIFFERENT run row (no false reuse);
- leaves the partial row in status='failed', planned_at=NULL,
with a non-empty failure_reason for ops;
- produces a fresh row with planned_at preserved AND the
appropriate downstream linkage (task_id for run_only,
issue_id for create_issue);
- exactly one live row at (trigger_id, planned_at) after
recovery, so the partial-unique constraint is honoured.

Existing TestDispatchAutopilotForPlanIsIdempotent and
TestDispatchAutopilotForPlanRejectsZeroArgs still pass — the
complete-reuse path is unchanged for the realistic SUCCESS-state
case.

Refs MUL-3551

Co-authored-by: multica-agent <github@multica.ai>

---------

Co-authored-by: Eve <eve@multica-ai.local>
Co-authored-by: multica-agent <github@multica.ai>

2026-06-24 12:01:10 +08:00

cmd

refactor(autopilot): autopilot_run.planned_at + DispatchAutopilotForPlan (2/3 MUL-3551) (#4443 )

2026-06-24 12:01:10 +08:00

internal

refactor(autopilot): autopilot_run.planned_at + DispatchAutopilotForPlan (2/3 MUL-3551) (#4443 )

2026-06-24 12:01:10 +08:00

migrations