mirror of
https://github.com/multica-ai/multica.git
synced 2026-07-05 13:29:44 +02:00
* refactor(scheduler): add PlansForScope hook for non-cadence jobs
The current Manager.plansForTick assumes a uniform Cadence grid:
plan_times are derived via FloorPlan(db_now, Cadence). That works for
rollup_task_usage_hourly but not for the upcoming Autopilot schedule
dispatch job, where each trigger has its own cron expression and the
plan_times do not snap to a single global grid.
This change adds an optional JobSpec.PlansForScope hook. When set:
* Manager loads the latest stored plan for (job, scope) and passes
a new LatestPlanInfo to the hook (exported from the previously
private latestPlanInfo). The hook returns the plan_times to attempt
this tick.
* Cadence, CatchUpMode and CatchUpWindow are bypassed; the hook is
in full control of plan_time selection.
* MaxPlansPerTick still acts as a safety cap on the hook's output.
* All other timing fields (RunTimeout / StaleTimeout /
HeartbeatInterval / MaxAttempts / RetryBackoff / AllowStaleReentry)
and the lease/heartbeat/terminal-write SQL primitives are reused
unchanged.
JobSpec.validate now allows Cadence=0 when PlansForScope is set, and
makes the every_plan MaxPlansPerTick > 0 invariant fire only on
Cadence-driven every_plan jobs. Existing rollup_task_usage_hourly
behaviour is unchanged — that JobSpec leaves PlansForScope nil.
Tests:
* TestJobSpecValidatePlansForScopeRelaxesCadence — validate() rules.
* TestManagerPlansForScopeHookDrivesPlans — end-to-end hook delegation
through the manager (DB-backed), proving that hook-returned
plan_times go through the same tryClaim path, MaxPlansPerTick
truncates without erroring, and LatestPlanInfo is populated on the
second tick.
* TestManagerPlansForScopeHookEmptyIsNoOp — empty hook output is a
valid no-op.
No behaviour change for callers that don't set PlansForScope.
Refs MUL-3551
Co-authored-by: multica-agent <github@multica.ai>
* refactor(autopilot): add planned_at + DispatchAutopilotForPlan for occurrence idempotency
PR 2 of 3 for the scheduled-Autopilot refactor on MUL-3551.
Adds dispatch-layer idempotency for scheduled triggers. This is the
second line of defence behind the primary uq_sys_cron_execution
guarantee in sys_cron_executions: if a runner crashes between
"create autopilot_run" and "write SUCCESS in sys_cron_executions",
the next stale-steal retry re-enters dispatch with the SAME
(trigger_id, planned_at). Without a row-level guard, that retry
would create a duplicate autopilot_run, issue, and task.
Changes:
* Migration 124: ALTER TABLE autopilot_run ADD COLUMN planned_at
TIMESTAMPTZ + partial unique index on (trigger_id, planned_at)
WHERE both are NOT NULL. Manual / webhook / api dispatch leaves
planned_at NULL so they keep the existing semantics unchanged.
* autopilot.sql: CreateAutopilotRun now takes planned_at;
GetAutopilotRunByTriggerAndPlanned is the fast-path lookup used
by DispatchAutopilotForPlan to detect a prior attempt's row
without burning an INSERT.
* service.DispatchAutopilotForPlan: new entry point for scheduled
triggers that already know the canonical UTC plan_time of the
occurrence they are firing. Looks up an existing run for
(trigger_id, planned_at) and reuses it on a stale-steal retry;
otherwise dispatches normally with planned_at stamped on the
new run.
* service.DispatchAutopilot keeps its current signature for
manual / webhook / api callers (planned_at stays NULL).
* recordSkippedRun also threads planned_at so the skip path
participates in the same partial-unique guarantee.
* sqlc v1.31.1 regenerated autopilot.sql.go + models.go.
Unrelated workspace.sql.go drift restored.
Tests (against local Postgres):
* TestDispatchAutopilotForPlanIsIdempotent — first call creates a
run; second call with same (trigger, planned_at) reuses it
(autopilot_run row count stays at 1); third call with a different
planned_at on the same trigger creates a second run (proves we
are not collapsing legitimate occurrences).
* TestDispatchAutopilotForPlanRejectsZeroArgs — invalid trigger_id
and zero planned_at both fail loudly so callers cannot silently
disable the idempotency guard.
* Existing autopilot listener / squad / dispatch tests all still
pass.
This PR has no scheduler / handler / UI behaviour change on its own:
the new entry point exists but is not yet wired into the schedule
goroutine. PR 3 will register the autopilot_schedule_dispatch
JobSpec that consumes it and remove the legacy
cmd/server/autopilot_scheduler.go path.
Refs MUL-3551
Co-authored-by: multica-agent <github@multica.ai>
* fix(autopilot): DispatchAutopilotForPlan recovers partial-state runs
Review fix for #4443 (MUL-3551).
Before this change, DispatchAutopilotForPlan returned ANY existing
autopilot_run for (trigger_id, planned_at), including the
half-written rows produced when a runner crashed between
"CreateAutopilotRun" and "create downstream issue/task". The
scheduler handler would then write SUCCESS in sys_cron_executions
even though no issue or agent task was ever created, silently
losing the scheduled occurrence.
Fix:
* New isAutopilotRunComplete helper classifies an existing run:
- terminal status (completed / failed / skipped) → reuse.
- issue_created with valid issue_id → reuse (the issue
listener owns task creation from here).
- running with valid task_id → reuse (the task is queued).
- anything else → partial; must NOT short-circuit.
* New SQL RecoverPartialAutopilotRun marks a partial row FAILED
with a recovery reason AND clears its planned_at. The cleared
planned_at releases the partial-unique slot in
uq_autopilot_run_trigger_planned, letting the fresh dispatch
INSERT a new row at the same (trigger_id, planned_at) without
conflict.
* DispatchAutopilotForPlan now branches on the lookup:
complete run → return; partial run → recover-then-fresh-
dispatch; not-found → fresh dispatch. The fresh dispatch path
still goes through dispatchAutopilot, so the new row carries
the real issue_id / task_id by the time the handler returns.
* Tests: TestDispatchAutopilotForPlanRecoversPartialRun seeds a
partial run (status='running', task_id=NULL for run_only;
status='issue_created', issue_id=NULL for create_issue) and
asserts the retry:
- returns a DIFFERENT run row (no false reuse);
- leaves the partial row in status='failed', planned_at=NULL,
with a non-empty failure_reason for ops;
- produces a fresh row with planned_at preserved AND the
appropriate downstream linkage (task_id for run_only,
issue_id for create_issue);
- exactly one live row at (trigger_id, planned_at) after
recovery, so the partial-unique constraint is honoured.
Existing TestDispatchAutopilotForPlanIsIdempotent and
TestDispatchAutopilotForPlanRejectsZeroArgs still pass — the
complete-reuse path is unchanged for the realistic SUCCESS-state
case.
Refs MUL-3551
Co-authored-by: multica-agent <github@multica.ai>
---------
Co-authored-by: Eve <eve@multica-ai.local>
Co-authored-by: multica-agent <github@multica.ai>