mirror of
https://github.com/multica-ai/multica.git
synced 2026-07-05 13:29:44 +02:00
fix/description-image-cdn-src
21 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
8abdc77961 |
MUL-2489 fix(runtime): delete archived squads before runtime teardown (#2955)
* fix(runtime): delete squads referencing archived agents before runtime teardown The DeleteAgentRuntime handler was failing with 500 'failed to clean up archived agents' because squad.leader_id has an ON DELETE RESTRICT FK on agent(id). When an archived agent was still referenced as a squad leader (even on an archived squad), the DELETE FROM agent query was blocked. Fix: add DeleteSquadsByArchivedAgentsOnRuntime query that removes squads whose leader_id points to an archived agent on the target runtime, and call it before DeleteArchivedAgentsByRuntime in the handler. Closes TMI-85 * test(runtime): cover squad cleanup before archived-agent deletion Adds four tests around the DeleteSquadsByArchivedAgentsOnRuntime fix: * TestDeleteSquadsByArchivedAgentsOnRuntime_Query — query-level: deletes squads whose leader is an archived agent on the target runtime, leaves squads with active leaders or archived leaders on a different runtime alone, and is safe to call when nothing matches. Covers the archived- squad case that originally hid the FK blocker from `multica squad list`. * TestDeleteAgentRuntime_RemovesSquadsLedByArchivedAgents — handler end-to-end regression for TMI-85. Reverting the handler change makes this fail with the exact 500 'failed to clean up archived agents' the user reported. * TestDeleteAgentRuntime_NoSquadsRegression — happy path for runtimes whose archived agents were never squad leaders, ensuring the new step is a no-op there. * TestDeleteAgentRuntime_StillBlockedByActiveAgents — preserves the 409 CountActiveAgentsByRuntime guard so the active-agent contract isn't silently regressed by the new cleanup ordering. Refs TMI-85 * chore: remove internal issue tracker references from test comments * fix(runtime): keep active squads during runtime teardown * fix(runtime): block runtime delete on active archived-leader squads * fix(runtime): make runtime delete 409 path a no-op --------- Co-authored-by: Kiro <kiro@multica.ai> |
||
|
|
341ce7bfa5 |
feat: support local working directory for projects (MUL-2618 v1) (#3283)
* feat(project): add local_directory project_resource type (MUL-2662)
Adds a second project_resource type alongside github_repo so a project
can be pinned to an existing directory on a specific daemon (the v1 of
the local-working-directory flow tracked in MUL-2618). The ref schema is
{ local_path, daemon_id, label? }; local_path must be absolute and
daemon_id is required. The same (daemon_id, local_path) pair is allowed
on multiple projects by design — no UNIQUE constraint is added.
Implementation reuses the existing project_resource API surface: the new
type is wired through the validator switch with no migration, no new
events, and no daemon-handler changes (daemon already passes through
arbitrary resource types via ProjectResources). The CLI gains
--local-path / --daemon-id / --ref-label shortcuts so
`multica project resource add --type local_directory` mirrors the
existing `--type github_repo --url ...` ergonomics; the generic --ref
flag still works for both types.
Tests cover the full CRUD lifecycle, the same-path-across-projects
allowance, the same-path-same-project conflict, the validator rejections
(missing/blank/relative path, missing daemon_id, wrong payload type),
and the cross-platform isAbsoluteLocalPath helper.
Co-authored-by: multica-agent <github@multica.ai>
* feat(project): add update endpoint + label-shadow guard for project_resource (MUL-2662)
Addresses the Elon review on PR #3263:
- Add PUT /api/projects/{id}/resources/{resourceId} with sqlc query,
matching handler, CLI `project resource update`, and a new
EventProjectResourceUpdated WS event. resource_type stays immutable;
ref/label/position are all individually optional.
- Catch same-project (daemon_id, local_path) collisions where only the
embedded label differs — the row-level UNIQUE only matches the full
ref JSON, so a label typo would otherwise let the same working
directory bind twice.
- Tests cover the update lifecycle (label-only / ref / clear / 404 /
invalid path) and the label-shadow conflict on both create and
update; the in-place rename still succeeds because the conflict
scan ignores the row being edited.
Incidental: regenerating sqlc picked up a missing skills_local scan in
UpdateAgentCustomEnv that drifted in from #3200.
Co-authored-by: multica-agent <github@multica.ai>
* fix(project): close bundled-create label-shadow gap + merge resource_ref on CLI update (MUL-2662)
Two follow-ups from MUL-2662 review round 2:
- CreateProject inline resources path now dedupes local_directory entries on
(daemon_id, local_path) before opening the transaction. The DB-level
UNIQUE(project_id, resource_type, resource_ref) constraint only fires on a
full JSON match, so two rows with the same target but different `label`
would otherwise slip past. Standalone POST/PUT already cover this via
findLocalDirectoryConflict; bundled create was the missing surface.
- `multica project resource update` now seeds resource_ref from the existing
row before applying per-type shortcut flags, so `--default-branch-hint x`
on its own no longer constructs a payload missing `url` (which the server
400s on). Local_directory partial edits get the same merge behavior.
Co-authored-by: multica-agent <github@multica.ai>
* feat(desktop): local_directory project_resource UI (MUL-2665) (#3273)
* feat(desktop): local_directory project_resource UI (MUL-2665)
First UI surface for the local-working-directory flow tracked in MUL-2618.
Lets users on the desktop pin a project to an existing folder on this
machine; web stays read-only since the per-daemon check can't be done in
the browser.
What's new for the renderer:
- ProjectResourcesSection grows a desktop-only "Add local directory"
button next to the existing GitHub-repo popover. Clicking it opens
Electron's native folder picker, validates the path through a new
IPC pair (existence + r/w), and submits a project_resource of
resource_type=local_directory with daemon_id pulled live from
daemonAPI.getStatus.
- LocalDirectoryRow renders the rename pencil + path tooltip, and
greys out when ref.daemon_id != this machine's daemon_id (with a
"only available on the machine that registered this directory"
tooltip). Delete stays enabled so users can drop stale registrations
from any device.
- LocalDirectoryHint sits above the issue-detail comment composer and
shows "Agent will work in-place at {label} ({path})" when the issue's
project has a local_directory matching this daemon. Hidden on web.
- TaskStatusPill picks up a new "waiting_for_directory_release" stage
that the daemon will publish when it dequeues a task but can't
acquire the path lock. The render is in place now so the daemon
sibling subtask can wire the status string without an additional UI
PR.
Plumbing:
- @multica/core/types gains LocalDirectoryResourceRef +
UpdateProjectResourceRequest, and the api client gets the matching
PUT method backed by the server endpoint that landed in
|
||
|
|
bf8a346cf0 |
feat(runtimes): cascade-archive agents on runtime delete (MUL-2667) (#3266)
* feat(runtimes): cascade-archive agents on runtime delete (MUL-2667) Replace the bare 409 "cannot delete runtime: it has active agents" with a structured response carrying the blocking agent list, and wire a cascade endpoint that archives those agents, cancels their tasks, pauses dangling autopilots and deletes the runtime in a single transaction. The unified DeleteRuntimeDialog opens directly in cascade mode when the runtime has bound agents, pivots from light to cascade if the strict DELETE refuses with runtime_has_active_agents, and re-prompts when the cascade refuses with runtime_delete_plan_changed (live agent set drifted while the dialog was open). The online-local self-healing rule is preserved at the affordance level (kebab hidden, Diagnostics button disabled with tooltip) and re-checked at confirm time as defence in depth. Co-authored-by: multica-agent <github@multica.ai> * fix(runtimes): close cascade race + i18n delete dialog (PR #3266 review) - Acquire FOR UPDATE on the runtime row at the top of the cascade tx so FK-validated agent INSERTs/UPDATEs that would point at this runtime block until commit, and lock each currently-active agent row via ListActiveAgentsByRuntimeForUpdate so a concurrent archive/move of an existing active row also blocks. - Switch the bulk archive from runtime-keyed (ArchiveAgentsByRuntime) to ID-keyed (ArchiveAgentsByIDs), narrowed to the user-confirmed expected_active_agent_ids set. Combined with the runtime row lock, this guarantees no agent outside the confirmed plan can be silently archived between plan-compare and archive even at read-committed. - Wire delete-runtime-dialog.tsx to runtimes locale via useT(); add detail.delete_dialog.{light,cascade} keys (EN with _one/_other plurals, zh-Hans _other) covering titles, descriptions, warning, notices, checkbox, buttons, table headers, presence labels, and toasts. Resolves the i18next/no-literal-string CI failure. - Locale parity test passes (51 tests). All 4 dialog test cases pass unmodified (EN copy preserves original wording). Full views vitest: 91 files / 792 tests green; full server go test: green. Co-authored-by: multica-agent <github@multica.ai> --------- Co-authored-by: multica-agent <github@multica.ai> |
||
|
|
614dfae884 |
MUL-2488 feat(timezone): Scheduling / Viewing two-layer timezone architecture (#2968)
* docs(timezone): add scheduling/viewing timezone architecture RFC * feat(db): replace daily rollups with task_usage_hourly, add user.timezone Migrations 100-104: add "user".timezone (Viewing tz), build the UTC hourly task_usage_hourly rollup with its pipeline, drop the legacy task_usage_daily / task_usage_dashboard_daily pipelines, and drop the agent_runtime.timezone column. Report queries now slice day boundaries at read time by the caller-supplied @tz instead of materialising in a fixed tz. Regenerate sqlc. * feat(server): add task_usage_hourly backfill command Replace the two legacy backfill commands (daily / dashboard_daily) with a single backfill_task_usage_hourly that loads historical task_usage into the new UTC hourly rollup, sliced per workspace. * refactor(server): resolve viewing timezone in report handlers Report handlers resolve the Viewing tz per request (?tz query param, then user.timezone, then UTC) and pass it to the hourly-rollup queries. Drop the UseDailyRollup feature flags and the old raw-scan/daily-rollup dual paths, remove the /api/usage endpoints, and stop the daemon from reporting and the runtime handler from accepting host timezone. * refactor(core): switch report queries to viewing timezone API client and dashboard/runtime queries send ?tz with each report request, the user schema/types carry the new timezone field, and the runtime timezone field/mutation is removed. * feat(views): add viewing timezone preference and UI Add the useViewingTimezone hook and a Timezone setting in Preferences; report charts and the dashboard week boundary follow the viewer tz. Remove the runtime detail timezone editor and its locale strings. * fix(test): update fixtures and stabilize tests for timezone refactor The timezone architecture refactor changed several types without updating dependent test code: - RuntimeDevice no longer has a timezone field — drop it from the create-agent-dialog runtime fixture. - User now requires a timezone field — add it to the apps/web mockUser fixture. - The PreferencesTab timezone tests asserted on the async save handler (PATCH then store update) with a bare expect, racing the mutation's settle callback, and timed out querying the Select's ~600-option IANA list on a loaded CI runner. Wrap the assertions in waitFor and extend the timeout for those three tests. * docs(timezone): document self-host migration order and trigger invariant Add a SELF-HOST UPGRADE ORDER runbook to the backfill command's package comment: applying migrations 100-104 in a single migrate-up drops the legacy daily rollups before the hourly backfill runs, leaving dashboards empty until cron catches up. Add an INVARIANT comment on trg_atq_dirty_hourly noting that agent_id must be added to the trigger's OF list if it ever becomes mutable, otherwise dirty buckets for the old agent_id are silently missed. * style(runtimes): drop trailing blank line in runtime-detail |
||
|
|
fc8528d64d |
feat(autopilot): support assigning to a squad (MUL-2429) (#2888)
* feat(autopilot): support assigning autopilot to a squad (MUL-2429) Path A (Squad-as-Leader) from the RFC: when an autopilot's assignee is a squad, dispatch resolves to squad.leader_id and executes against the leader's runtime — semantics match a human manually assigning the issue to that squad, no fan-out. Backend scope only; frontend picker change is a follow-up PR. Changes: - 096_autopilot_squad_assignee migration: drop agent FK on autopilot.assignee_id, add assignee_type column (default 'agent'), add autopilot_run.squad_id attribution column. - service.AgentReadiness: single source of truth for archived / runtime-bound / runtime-online checks. Shared by autopilot admission gate, run_only dispatch, and isSquadLeaderReady. - service.resolveAutopilotLeader: translates assignee_type/id to the agent that actually runs the work. - dispatchCreateIssue: stamps issue with assignee_type='squad' for squad autopilots and enqueues via EnqueueTaskForSquadLeader. - dispatchRunOnly: belt-and-braces readiness re-check after resolving squad → leader so a leader that went offline between admission and dispatch produces a clean failure instead of a doomed task. - handler.CreateAutopilot / UpdateAutopilot: accept assignee_type with squad/agent existence + leader-archived validation. Backward-compatible default of "agent" preserves the contract for older clients. - Analytics: AutopilotRunStarted/Completed/Failed events carry assignee_type and squad_id; PostHog can now group autopilot runs by squad without joining back to the autopilot row. Co-authored-by: multica-agent <github@multica.ai> * fix(autopilot): reject archived squads, route post-admission skips, cleanup dangling-agent autopilots (MUL-2429) Addresses three review findings on PR #2888: 1. Archived squad handling: validateAutopilotAssignee now rejects squads with archived_at set; resolveAutopilotLeader returns errSquadArchived so the admission gate fails closed; DeleteSquad now mirrors the issue transfer for autopilot rows (TransferSquadAutopilotsToLeader) so surviving autopilots flip to assignee_type='agent' (leader) instead of dangling at the archived squad. 2. dispatchRunOnly post-admission readiness: introduces errDispatchSkipped sentinel, recognised by DispatchAutopilot via handleDispatchSkip so the run is recorded as `skipped` (not `failed`). Manual triggers no longer 500 when the leader's runtime goes offline between admission and task creation. New TestManualTriggerDoesNotErrorOnPostAdmissionSkip locks the behaviour in. 3. Dangling agent assignee after migration 096 dropped the FK: shouldSkipDispatch now distinguishes pgx.ErrNoRows / errSquadArchived (hard skip — retrying won't help) from transient DB errors (fail-open). DeleteAgentRuntime pauses autopilots that target agents about to be hard-deleted (ListArchivedAgentIDsByRuntime + PauseAutopilotsByAgentAssignees) so the breakage surfaces as a paused row in the UI instead of a quiet skip-burning loop. Unit tests cover the sentinel unwrap contract and errSquadArchived errors.Is behaviour. Integration test TestAutopilotDispatchSkipsWhenRuntimeOffline re-verified against a fresh DB with migration 096 applied. Co-authored-by: multica-agent <github@multica.ai> * fix(autopilot): bump last_run_at on post-admission skip (MUL-2429) Match recordSkippedRun (pre-flight skip) and the success path so the scheduler / "last seen" UI both reflect that this tick evaluated the trigger, even when the post-admission readiness gate caught a late regression. Addresses Emacs review caveat #1 on PR #2888. Co-authored-by: multica-agent <github@multica.ai> * feat(autopilot): mixed agent/squad assignee picker in dialog (MUL-2429) End-to-end UI for assigning an autopilot to a squad. Closes the PR #2888 backend gap: the squad-as-assignee feature was already wired in Go (Path A, RFC §4) but the desktop dialog never offered the choice. - core/types/autopilot: add `AutopilotAssigneeType`, surface `assignee_type` on `Autopilot` + Create/Update request payloads. - views/autopilots/pickers/agent-picker: switch to a polymorphic AssigneeSelection (`{type, id}`); render agents and squads as two grouped sections with shared pinyin search. - views/autopilots/autopilot-dialog: maintain `assigneeType` state, send it on create/update, render the trigger avatar / hover dot with `assignee.type`. - views/autopilots/autopilots-page + autopilot-detail-page: render the assignee row using `autopilot.assignee_type` so squad-typed autopilots show the squad avatar + name, not a broken agent lookup. - locales: add `agents_group` / `squads_group` / `select_assignee` keys (en + zh-Hans), keep legacy `select_agent` for callers that still reference it. Co-authored-by: multica-agent <github@multica.ai> --------- Co-authored-by: Lambda <lambda@multica.ai> Co-authored-by: multica-agent <github@multica.ai> |
||
|
|
63d215e1c3 |
feat(runtime): visibility (public/private) gate on CreateAgent / UpdateAgent (#2419)
* feat(runtime): visibility (public/private) gate on CreateAgent / UpdateAgent Closes the hole where a plain workspace member could pick another member's runtime in the Create Agent dialog and bind an agent to it — the backend wasn't checking runtime ownership, so the agent ran on someone else's hardware / tokens. Reported on GH #1804. Schema - Migration 083 adds agent_runtime.visibility ('private' default, 'public') with a CHECK constraint. Existing rows default to private — same ownership semantics as before, no behavior change for legacy data. Backend - canUseRuntimeForAgent predicate: allow when caller is workspace owner/admin, the runtime owner, or the runtime is public. - CreateAgent and UpdateAgent both gate on it: UpdateAgent matters because a plain member could otherwise create on their own runtime, then re-bind to a private one. - PATCH /api/runtimes/:id accepts { visibility } — owner/admin only, validated against the same private/public allow-list. Frontend - Create-agent dialog renders other-owned private runtimes disabled with a Lock badge + tooltip explaining who to ask. - Inspector runtime-picker disables the same set so re-binding fails the same way at the UI layer. - Runtime detail diagnostics gains a Visibility editor (owner/admin) or read-only chip (everyone else). - Runtime list shows a private/public chip next to the name. Tests - Go: canUseRuntimeForAgent truth table; CreateAgent / UpdateAgent end-to-end gate tests (admin / runtime owner / plain member); PATCH visibility owner / admin / member / invalid-value coverage. - Vitest: create-agent dialog disabled state on private/public runtimes, default-runtime selection skips locked rows; runtime detail visibility editor → mutation, read-only fallback. Migrating runtimes: existing rows default to private to preserve the "owner only" status quo. Owners switch to public via the detail page diagnostics card. Co-authored-by: multica-agent <github@multica.ai> * fix(runtime): apply timezone+visibility atomically; don't seed locked template runtime Two issues surfaced in review of MUL-2062: 1. PATCH /api/runtimes/:id ran the timezone branch first, which: - returned early on a tz no-op, silently dropping a concurrent `visibility` patch in the same body; - committed the timezone mutation (+ usage rollup rebuild) before validating visibility, so an invalid visibility left the row half-updated. Validate every field first, then run the mutations in order. The no-op short-circuit now only triggers when nothing else is requested. 2. The Create Agent dialog in duplicate mode unconditionally seeded `template.runtime_id` as the selected runtime, even when that runtime is now private and owned by someone else — the user saw a selected row they couldn't submit (Create → backend 403). Fall back to the first usable runtime when the template's runtime is locked, and gate the Create button on `selectedRuntimeLocked` as defense in depth. Tests: - Go: TestUpdateAgentRuntime_CombinedPatchAppliesBoth (tz no-op + visibility flip), TestUpdateAgentRuntime_InvalidVisibilityDoesNotMutateTimezone (atomic-fail invariant). - Vitest: duplicate template pointing at a locked runtime now seeds the first usable one; Create button stays disabled when no usable alternative exists. Co-authored-by: multica-agent <github@multica.ai> --------- Co-authored-by: multica-agent <github@multica.ai> |
||
|
|
f5c2994aed |
feat(workspace): revoke a member's runtimes when they leave or are removed (#2401)
* feat(workspace): revoke a member's runtimes when they leave or are removed Previously, leaving or being removed from a workspace only deleted the member row — every runtime the departed user owned in that workspace remained in the DB, kept its daemon_token valid, and stayed reachable to the workspace's other members. The departed user lost access but their machine kept doing work. This change converges the runtime state in the same transaction as the member-row deletion: agents pinned to those runtimes are archived, in-flight tasks are cancelled (so the daemon's per-task status poller interrupts the running agent gracefully), the runtimes are forced offline, and the daemon_token rows are deleted. After commit the DaemonTokenCache is invalidated and agent:archived / daemon:register events fire so connected clients reconcile immediately. Server-side state convergence is the production safety net; the daemon_token revoke takes effect once the mdt_ flow is live (today most daemons fall back to PAT/JWT, and the member-row deletion is what stops those requests via requireWorkspaceMember). Daemon-side handling (recognising the resulting 401/404 and tearing down the local pairing for that workspace) lands in a follow-up. Co-authored-by: multica-agent <github@multica.ai> * fix(workspace): also cancel tasks for archived agents on member revoke CancelAgentTasksByRuntime only matched tasks whose runtime_id was in the revoked set, missing a real path: agent.runtime_id can be reassigned via UpdateAgent, but agent_task_queue.runtime_id keeps the value from when the task was queued. So an agent currently bound to the leaving member's runtime gets archived correctly, but its older tasks still pinned to a prior runtime stay 'queued' — and ClaimAgentTask does not gate on agent.archived_at, so those orphaned tasks remain claimable by the prior runtime. Replace CancelAgentTasksByRuntime with CancelAgentTasksByRuntimeOrAgent, which OR-matches runtime_ids and the archived agent IDs in one UPDATE. Pass the archived agent IDs through from revokeAndRemoveMember. Adds TestDeleteMember_CancelsTasksFromAgentReassignment as a regression guard: same agent, two runtimes, the older task on the surviving runtime must end up cancelled while the surviving runtime stays online. Co-authored-by: multica-agent <github@multica.ai> --------- Co-authored-by: multica-agent <github@multica.ai> |
||
|
|
d6349c16ec |
feat(runtime): per-runtime timezone for token-usage aggregation (MUL-1950) (#2394)
* feat: per-runtime timezone for token usage aggregation The runtime token-usage charts (daily and hourly tabs on the runtime-detail page) bucketed every event by the Postgres session timezone, which is UTC in production. For an operator in UTC+8 that meant a Tuesday afternoon's tasks landed in Tuesday early-morning's bar — the chart was always one off. Fix: store an IANA timezone on agent_runtime and aggregate under it. * migrations 081 / 082 add agent_runtime.timezone (TEXT NOT NULL DEFAULT 'UTC') and rebuild the rollup pipeline (window function and both trigger functions) to compute bucket_date with AT TIME ZONE rt.timezone instead of bare DATE(). * No historical backfill — task_usage_daily rows already on disk keep their UTC bucket_date; only future writes / re-touches recompute under the new tz. (Product call from MUL-1950: 'guarantee future correctness'.) * runtime_usage.sql gains a @tz parameter on ListRuntimeUsage and GetRuntimeUsageByHour and threads tz through GetRuntimeTaskHourly Activity. ListRuntimeUsageDaily reads bucket_date as-is since the rollup already wrote it in tz. * parseSinceParamInTZ replaces the raw N×24h cutoff with start-of- day-N in the runtime's tz so 'last 7 days' lines up with bucket boundaries. * Daemon registration sends the host's IANA tz (TZ env, then time.Local), and UpsertAgentRuntime preserves any user override via a CASE-on-existing-value pattern so a daemon reconnect can't silently revert the operator's setting. * New PATCH /api/runtimes/:id endpoint (UpdateAgentRuntime) lets the runtime detail page edit the tz; the editor seeds with the browser tz on first interaction. Refs: MUL-1950 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Co-authored-by: multica-agent <github@multica.ai> * fix: harden runtime timezone rollups Co-authored-by: multica-agent <github@multica.ai> * fix: address runtime timezone review nits Co-authored-by: multica-agent <github@multica.ai> --------- Co-authored-by: Eve <eve@multica.ai> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Co-authored-by: multica-agent <github@multica.ai> Co-authored-by: Eve <eve@multica-ai.local> |
||
|
|
ce00e05169 |
Add canonical PostHog core metrics events (#2302)
* Add canonical PostHog core metrics events Co-authored-by: multica-agent <github@multica.ai> * Address analytics review feedback Co-authored-by: multica-agent <github@multica.ai> * Tighten analytics review follow-ups Co-authored-by: multica-agent <github@multica.ai> --------- Co-authored-by: Devv <devv@Devvs-Mac-mini.local> Co-authored-by: multica-agent <github@multica.ai> |
||
|
|
cc527c34be |
perf(heartbeat): batch runtime last_seen_at writes (#2213)
Batches runtime heartbeat last_seen_at updates while preserving the 60s flush / 150s sweeper stale-window invariant. Also drains pending heartbeat writes during graceful shutdown. |
||
|
|
09f04847d3 | feat(server): redis-backed runtime liveness with DB fallback (#2121) | ||
|
|
0b1333fb00 |
feat(server): orphan-task recovery + auto-retry + manual rerun (MUL-1128) (#1476)
* feat(server): orphan-task recovery + auto-retry + manual rerun (MUL-1128)
When the daemon process crashed mid-task the issue was stuck at
in_progress for up to 2.5h: the in-flight task timeout was the only
mechanism that ever moved the row, and the runtime heartbeat sweeper
only fires after the runtime stays offline for 45s — a quick restart
beats both windows.
This change implements the A+B plan from the issue thread:
A. lifecycle hygiene
- migration 055 adds attempt / max_attempts / parent_task_id /
failure_reason / last_heartbeat_at to agent_task_queue
- new daemon-auth endpoint POST /runtimes/{id}/recover-orphans:
daemon calls it on every register so the server fails any
dispatched/running tasks the previous process left behind
- new daemon-auth endpoint POST /tasks/{id}/session: persists the
agent's session_id + work_dir mid-flight so a crash doesn't
lose the resume pointer (claude+codex emit MessageStatus with
SessionID; daemon forwards on the first one it sees)
- FailAgentTask / FailStaleTasks / FailTasksForOfflineRuntimes
now set failure_reason ('agent_error' / 'timeout' /
'runtime_offline')
B. auto-retry with resume context
- TaskService.MaybeRetryFailedTask spawns a fresh queued attempt
carrying parent's session_id/work_dir when the failure reason
is infrastructure-shaped (timeout, runtime_offline,
runtime_recovery) and attempt < max_attempts; skips autopilot
- wired into the runtime sweeper paths and TaskService.FailTask
so the user transparently sees a new in_progress run instead of
a stuck row
- new user-auth POST /api/issues/{id}/rerun + multica issue rerun
CLI for the manual escape hatch
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* fix(server): address PR review for orphan-task recovery (MUL-1128)
Three review-must-fix items on top of the A+B implementation:
1. recover-orphans now funnels through TaskService.HandleFailedTasks,
the same shared post-failure pipeline used by the runtime sweeper.
This guarantees task:failed events are emitted, agent status is
reconciled, and issues stuck in_progress with no remaining active
task are reset to todo even when no auto-retry is created
(max_attempts exhausted, autopilot, non-retryable reason).
2. RerunIssue now uses CancelAgentTasksByIssueAndAgent, scoped to the
issue's current assignee. The previous implementation called
CancelAgentTasksByIssue, which would collateral-cancel parallel
@-mention agents on the same issue.
3. GetLastTaskSession now considers both completed and failed tasks
(mirroring GetLastChatTaskSession), ordering by the most recent
timestamp. With UpdateAgentTaskSession pinning session_id/work_dir
mid-flight, an auto-retry or manual rerun of a daemon-crash failure
now actually resumes the prior conversation context instead of
starting fresh — matching the stated B-branch behaviour.
go build / go vet pass; the existing service and agent test suites pass.
runtime_sweeper / handler integration tests require a local DB with the
055 migration (and the pre-existing 050 first_executed_at column).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
---------
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
||
|
|
637bdc8eb3 |
feat(analytics): full PostHog pipeline + 6 funnel events (MUL-1122) (#1367)
* feat(analytics): add PostHog client with async batch shipping Introduces server/internal/analytics, the shipping layer for the product funnel defined in docs/analytics.md. Capture is non-blocking — events are enqueued into a bounded channel and a background worker batches them to PostHog's /batch/ endpoint. A broken backend drops events rather than blocking request handlers. Local dev and self-hosted instances run a noop client until the operator sets POSTHOG_API_KEY. This is PR 1 of MUL-1122; signup and workspace_created emission land in the follow-up commit so this change is independently reviewable. * feat(server): emit signup and workspace_created analytics events Wires analytics.Client through handler.New and main, then emits the first two funnel events: - signup fires from findOrCreateUser (which now reports isNew), covering both the verification-code and Google OAuth entry points — a single emission site guarantees Google signups aren't missed. - workspace_created fires after the CreateWorkspace transaction commits, with is_first_workspace computed from a post-commit ListWorkspaces count so we can distinguish fresh-user activation from returning-user expansion. Tests use analytics.NoopClient so nothing ships from test runs. PR 1 of MUL-1122; runtime_registered and issue_executed follow in later PRs per the plan. * refactor(analytics): drop is_first_workspace from workspace_created Stamping "is this the user's first workspace?" at emit time races under concurrent CreateWorkspace requests: two transactions committing close together can both read a post-commit count greater than one and both emit false. Fixing it at the SQL layer requires a schema change we don't want in PR 1. PostHog answers the same question exactly from the event stream (funnel on "first time user does X" / cohort on $initial_event), so removing the property loses no information and makes the emit side race-free. * docs(analytics): document self-host safety defaults Spell out why self-hosted instances never ship events upstream by default (empty POSTHOG_API_KEY → noop client) and explain how operators can point at their own PostHog project without any code change. * feat(analytics): emit runtime_registered, issue_executed, team_invite_* Three server-side funnel events, all gated on first-time state transitions so retries and re-runs don't inflate the WAW buckets: - runtime_registered fires from DaemonRegister when UpsertAgentRuntime reports (xmax = 0) — i.e. the row was inserted, not updated. Heartbeats and re-registrations stay silent. - issue_executed fires from CompleteTask after an atomic UPDATE issue SET first_executed_at = now() WHERE id = $1 AND first_executed_at IS NULL flips the column for the first time. Retries, re-assignments, and comment-triggered follow-up tasks hit the WHERE clause and no-op. Carries nth_issue_for_workspace so the ≥1/≥2/≥5/≥10 buckets filter without extra queries. - team_invite_sent fires from CreateInvitation and team_invite_accepted from AcceptInvitation, closing the expansion funnel. Adds a 050 migration for issue.first_executed_at plus a partial index so the workspace-scoped executed-count query doesn't scan the never-executed tail. * feat(config): surface PostHog key via /api/config Extends AppConfig with posthog_key / posthog_host sourced from env on every request (so operators can rotate the key via secret refresh without a restart). Reading the key off the server — rather than baking it into the frontend bundle via NEXT_PUBLIC_* — means self-hosted instances inherit the blank key automatically and never ship events upstream. * feat(analytics): wire posthog-js identify + UTM capture on the client Adds @multica/core/analytics — a thin wrapper around posthog-js that owns attribution capture and identity merge. Posthog-js config comes from /api/config (not NEXT_PUBLIC_*), so self-hosted instances whose server returns an empty key automatically run the SDK inert. captureSignupSource stamps a multica_signup_source cookie with UTM params and the referrer's origin (never the full referrer — that can leak OAuth code/state in the callback URL). The backend signup event reads this cookie on new-user creation. Identity flows: - auth-initializer fires identify() right after getMe() resolves, on both cookie and token paths. A getConfig/getMe race is handled by buffering a pending identify inside the analytics module and flushing it once initAnalytics finishes. - auth store calls identify() on verifyCode / loginWithGoogle / loginWithToken and resetAnalytics() on logout so the next login merges cleanly without bleeding events. * docs(analytics): describe runtime_registered, issue_executed, invite events Fills in the schema for the remaining funnel events. Captures the design commentary that belongs next to the contract rather than in a PR description — in particular why issue_executed uses the atomic first_executed_at flip instead of counting task-terminal events, and why runtime_registered relies on xmax = 0 rather than a query-then-write. * fix(analytics): drop non-atomic nth_issue_for_workspace from issue_executed Computing the workspace's Nth-issue ordinal at emit time is not atomic under concurrent first-completions — two transactions can both run MarkIssueFirstExecuted, then both run CountExecutedIssuesInWorkspace, and both observe count=1 before either has committed, so both events go out stamped as n=1. Serialising it would mean a per-workspace advisory lock or a SERIALIZABLE-isolated tx; PostHog answers the same question exactly at query time via row_number() partitioned by workspace_id, so the emit-time property adds risk without adding information. Removes the property from analytics.IssueExecuted, deletes the unused CountExecutedIssuesInWorkspace query, and regenerates sqlc. The partial index stays — any future workspace-scoped executed-issue query will want it. * fix(analytics): wire $pageview and harden signup_source cookie payload Two frontend fixes from the PR review: - PageviewTracker, mounted under WebProviders, fires capturePageview on every Next.js App Router path / query-string change. Without this the capturePageview helper in @multica/core/analytics was never called and the acquisition funnel's / → signup step was empty. - captureSignupSource now caps each UTM / referrer value at 96 chars *before* JSON.stringify, and drops the whole cookie when the serialised payload still exceeds 512 chars. Previously the overall slice(0, 256) could leave a half-JSON string on the wire that neither the backend nor PostHog could parse. Both capturePageview and identify now buffer a single pending call when fired before initAnalytics resolves — otherwise the initial "/" pageview and same-turn login identify race the /api/config fetch and get dropped. resetAnalytics clears both buffers so a logout→login cycle stays clean. * fix(analytics): URL-decode signup_source cookie on read Go does not URL-decode Cookie.Value automatically, so the frontend's JSON-then-encodeURIComponent payload was landing in PostHog as percent-encoded garbage (%7B%22utm_source...). Unescape on read so the backend receives the original JSON string the frontend intended, and drop values that fail to decode or exceed the server-side cap — sending truncated garbage is worse than sending nothing. Oversized-cookie guard matches the frontend's SIGNUP_SOURCE_MAX_LEN. * docs(analytics): reflect nth-issue drop, $pageview wiring, cookie encoding Pulls the schema doc back in line with the code: issue_executed no longer advertises nth_issue_for_workspace (with a note about why PostHog derives it at query time instead), the frontend $pageview section names the actual PageviewTracker component that fires it, and the signup_source section documents the per-value cap / overall drop rule and the encode-on-write / decode-on-read contract. --------- Co-authored-by: Jiang Bohan <bhjiang@outlook.com> |
||
|
|
a73336dcf8 |
feat(daemon): persistent UUID identity + legacy-id merge at register-time (#1220)
* feat(daemon): persistent UUID identity + legacy-id merge at register-time daemon_id is now a stable UUID persisted to `<profile-dir>/daemon.id` on first start, replacing the hostname-derived id that drifted whenever `.local` appeared/disappeared, a system was renamed, or a profile switched — each of which used to mint a fresh `agent_runtime` row and strand agents on the old one. To migrate existing installs without operator intervention, the daemon reports every legacy id it may have registered under previously (`host`, `host` with `.local` stripped, and `host[-profile]` variants for both). At register-time the server looks up each candidate row scoped to (workspace, provider), re-points its agents and tasks onto the new UUID-keyed row, records which legacy id was subsumed in the new `legacy_daemon_id` column for audit, and deletes the stale row. Result: users running `xxx.local`-keyed runtimes today transparently land on the new UUID row on next daemon restart. The hostname-prefix `MigrateAgentsToRuntime` / `daemon_id LIKE '...-%'` compatibility shim is no longer needed and has been removed along with the handler call that invoked it. * fix(daemon): handle bidirectional .local drift and case drift in legacy merge Review on #1220 flagged two gaps in the legacy-id migration candidate set: 1. Reverse .local: LegacyDaemonIDs only added the stripped variant when the current hostname ended in `.local`. The opposite direction — DB has `foo.local`, current host is `foo` — was missed, so runtimes registered under the `.local` variant stayed orphaned after upgrade. Now both variants (`foo` and `foo.local`) are always emitted, regardless of what `os.Hostname()` currently returns, plus their `-<profile>` suffix forms. 2. Case drift: os.Hostname() has been observed returning different casings on the same machine across mDNS/reboot state. A case-sensitive `=` comparison stranded rows like `Jiayuans-MacBook-Pro.local` when the daemon later reported `jiayuans-macbook-pro.local`. FindLegacyRuntimeByDaemonID now uses `LOWER(daemon_id) = LOWER(@daemon_id)` on both sides, so casing differences merge rather than orphan. The (workspace_id, provider) prefix still bounds the scan to a tiny set of rows so the non-indexed LOWER() comparison has negligible cost. Tests: TestLegacyDaemonIDs gets the mixed-case + reverse-direction cases; daemon_test.go adds TestDaemonRegister_MergesLegacyDaemonIDRuntime_ReverseDotLocal and TestDaemonRegister_MergesLegacyDaemonIDRuntime_CaseDrift. * fix(daemon): consolidate every case-duplicate legacy runtime, not just the first Follow-up review on #1220: after switching to `LOWER(daemon_id) = LOWER(@daemon_id)`, the single-row lookup still only merged one legacy row per candidate. If a machine already had two rows in the DB that differed only in casing (e.g. `Jiayuans-MacBook-Pro.local` AND `jiayuans-macbook-pro.local` coexisting because earlier hostname drift already minted a duplicate), only one of them got consolidated and the other stayed orphaned — violating the "no duplicate runtime per machine after backfill" acceptance. - FindLegacyRuntimeByDaemonID → FindLegacyRuntimesByDaemonID (:many) - mergeLegacyRuntimes iterates every returned row and dedupes across overlapping legacy candidates so `foo` and `foo.local` both resolving to the same stored row don't double-process Test: TestDaemonRegister_MergesAllCaseDuplicateLegacyRuntimes seeds two case-duplicate rows with one agent each and confirms both rows are deleted and both agents end up on the new UUID-keyed row. |
||
|
|
ff5f6ac2ee |
fix(daemon): prevent duplicate runtime registration on profile switch (#906)
* fix(daemon): prevent duplicate runtime registration on profile switch The daemon_id included a profile name suffix (e.g. "hostname-staging"), so switching profiles created a new daemon_id that bypassed the UPSERT dedup constraint, leaving orphaned runtime records in the database. Three changes: - Remove profile suffix from daemon_id — use stable hostname only. The unique constraint (workspace_id, daemon_id, provider) already prevents collisions within the same workspace. - Auto-migrate agents from old offline runtimes to the newly registered runtime during DaemonRegister (same workspace/provider/owner). - Add TTL-based GC in the runtime sweeper to delete offline runtimes with no active agents after 7 days. Closes MUL-695 * fix(daemon): address code review issues on PR #906 1. Move gcRuntimes() to the main sweep loop — previously it was inside sweepStaleRuntimes() after an early return, so it only ran when new runtimes were marked stale. Now it runs every sweep cycle independently. 2. Fix DeleteStaleOfflineRuntimes to exclude runtimes with ANY agent reference (not just active ones). The FK agent.runtime_id is ON DELETE RESTRICT, so archived agents also block deletion. 3. Scope MigrateAgentsToRuntime to the same machine by matching daemon_id LIKE '<current_daemon_id>-%'. This prevents cross-machine agent migration when the same user has multiple devices. |
||
|
|
6209e2f3ae |
fix(server): allow deleting runtimes when all bound agents are archived (#589)
Previously, runtimes could never be deleted once an agent was created because agents can only be archived (not deleted) and the count check included archived agents. Now the check only counts active agents, and archived agents are cleaned up before runtime deletion. |
||
|
|
ff27a249cc |
feat(runtime): add owner tracking, filtering, and delete (#535)
Add owner_id to agent_runtime table to track who registered each runtime. Backend: new delete endpoint with role-based permissions (owner/admin can delete any, members only their own), list filtering by owner (?owner=me), and agent dependency check before deletion. Frontend: Mine/All filter toggle in runtime list, owner display in list items and detail view, delete button with AlertDialog confirmation. Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> |
||
|
|
b112d1f1ae |
feat(tasks): add coalescing queue and task lifecycle guards
- Coalescing queue: use HasPendingTaskForIssue (queued/dispatched only) instead of HasActiveTaskForIssue so comments during a running task enqueue exactly one follow-up task that picks up all new comments. - Stale task cleanup: runtime sweeper now fails orphaned tasks when their runtime goes offline (daemon crash/network partition). - Cancel-aware daemon: handleTask checks task status after execution and discards results if the task was cancelled mid-run (e.g. reassign). - Terminal issue guard: ClaimTaskForRuntime auto-cancels pending tasks for done/cancelled issues instead of executing them. - Race condition safety net: unique partial index ensures at most one pending task per issue at the DB level. |
||
|
|
b3bbf92a1d |
fix(runtime): add server-side sweeper to detect stale runtimes
The only path to marking a runtime offline was the daemon's deregister call on graceful shutdown. If the daemon crashed, was killed, or lost network, the status stayed "online" forever. Add a background goroutine that sweeps every 30s and marks runtimes offline after 45s without a heartbeat (3 missed intervals). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> |
||
|
|
38d595d81d |
feat(cli): restructure CLI commands for better UX
- Add top-level `multica login` that combines auth + workspace auto-discovery - Restructure daemon into subcommands: start, stop, status, logs - Add background daemon mode with PID management - Add daemon deregistration on shutdown (new API endpoint + SQL query) - Remove unused commands: runtime list, status, agent get/delete/stop - Make `config` show config directly instead of requiring `config show` - Update README to reflect new CLI structure Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> |
||
|
|
cdfa63af15 | feat(runtime): add local codex daemon pairing |