mirror of
https://github.com/multica-ai/multica.git
synced 2026-07-05 13:29:44 +02:00
e7aecfc574c90ed67ead1fa00a06e08a3a39e7ea
1163 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
f59cb2f494 |
MUL-3834: harden daemon websocket reconnect (#4699)
* MUL-3834 harden daemon websocket reconnect
Co-authored-by: multica-agent <github@multica.ai>
* MUL-3834 stabilize daemon websocket liveness tests
Co-authored-by: multica-agent <github@multica.ai>
---------
Co-authored-by: Eve <eve@multica-ai.local>
Co-authored-by: multica-agent <github@multica.ai>
|
||
|
|
63eb6f73ad |
feat(analytics): anonymous self-host onboarding source beacon (MUL-3708) (#4691)
* feat(analytics): anonymous self-host onboarding source beacon (MUL-3708) Production self-host servers now report the anonymous onboarding "how did you hear about us" channel to Multica's public write-only ingest, so the self-host source distribution becomes visible alongside official cloud. Official cloud keeps its existing PostHog capture unchanged; this is a submit-time beacon, not a background telemetry pipeline. - server/internal/sourcebeacon: ShouldSend gate (production + non-local + non-*.multica.ai app host, fail-closed — judged by the app/frontend host, not the backend URL, which official often leaves unset), per-instance salted hashing, deterministic event uuid, fire-and-forget sender. - POST /api/telemetry/self-host-source: public, write-only, per-IP rate-limited, 4 KiB body cap, channel allowlist, strict unknown-field rejection. Lands in PostHog as self_host_source_channel with a deterministic uuid (best-effort dedup), $process_person_profile=false, and deployment=self_host — a distinct event name so it never pollutes the official onboarding funnel. - Hook in PatchOnboarding fires once when the source is first set; never blocks onboarding. Only channel enum(s) + two per-instance hashes leave the box — never user_id/email/name/workspace/org/domain/role/use_case/the source_other free-text/IP. - migration 128: system_settings singleton holding instance_salt. - frontend: self-host-only anonymous-collection notice on the source step, gated by a new /api/config self_host_source_notice flag (en/zh-Hans/ko/ja). - analytics.Event gains an optional top-level uuid; docs/analytics.md, SELF_HOSTING.md and .env.example document exactly what is/isn't sent and how to disable it (ANALYTICS_DISABLED). Also fixes the long-standing team_size→source drift in docs/analytics.md. Verified locally: go build/vet, go test (sourcebeacon, analytics, handler), pnpm typecheck (all packages), locale parity (157), step-source (6) + core config/schema (69) vitest, lint (0 errors). 
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Co-authored-by: multica-agent <github@multica.ai> * fix(analytics): wire self-host source beacon through metrics, guard nil pool (MUL-3708) Addresses Howard CI blockers on #4691 (no product-direction change): - loadInstanceSalt returns "" on nil pool; salt is only loaded when ShouldSendFromEnv() is true, via a bounded (5s) context — restores the "router constructible without a DB" invariant (nil-pool routing tests). - Add multica_self_host_source_channel_total counter (by source) + an IncForEvent case, so every analytics event is paired with a Prometheus counter. NormalizeSourceChannel reuses sourcebeacon allowlist (no 3rd copy). - Beacon handler now builds the event via the analytics.SelfHostSourceChannel helper and ships it through obsmetrics.RecordEvent (no naked Capture); not IsMetricsOnly, so it still reaches PostHog. - Prime the new family in the registry-families test. Verified: go build/vet, go test ./internal/metrics ./internal/sourcebeacon ./internal/handler ./cmd/server (incl. the 3 named blockers + registry + record-event-helper lints) all green. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Co-authored-by: multica-agent <github@multica.ai> --------- Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com> Co-authored-by: multica-agent <github@multica.ai> |
||
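The ShouldSend gate and per-instance salted hashing described in the commit above can be sketched roughly as follows. This is a minimal illustration, not the code in server/internal/sourcebeacon: the names `shouldSend` and `saltedHash`, the exact definition of a "local" host, and the hash input layout are all assumptions.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"net"
	"net/url"
	"strings"
)

// shouldSend is a hypothetical fail-closed gate: it returns true only for a
// production deployment whose app (frontend) URL parses cleanly and points at
// a host that is neither local nor under multica.ai.
func shouldSend(appEnv, appURL string) bool {
	if appEnv != "production" {
		return false
	}
	u, err := url.Parse(appURL)
	if err != nil || u.Hostname() == "" {
		return false // unparsable or missing host: fail closed
	}
	host := strings.ToLower(u.Hostname())
	// Assumption: "local" means localhost-style names and non-public IPs.
	if host == "localhost" || strings.HasSuffix(host, ".localhost") || strings.HasSuffix(host, ".local") {
		return false
	}
	if ip := net.ParseIP(host); ip != nil && (ip.IsLoopback() || ip.IsPrivate() || ip.IsUnspecified()) {
		return false
	}
	if host == "multica.ai" || strings.HasSuffix(host, ".multica.ai") {
		return false // official cloud keeps its own capture path
	}
	return true
}

// saltedHash hashes a value together with a per-instance salt, so the same
// value hashes differently on different instances. The NUL separator and
// hex encoding are illustrative choices.
func saltedHash(salt, value string) string {
	sum := sha256.Sum256([]byte(salt + "\x00" + value))
	return hex.EncodeToString(sum[:])
}
```

Under these assumptions, `shouldSend("production", "https://app.multica.ai")` is false and an empty app URL is also false, which is the fail-closed behaviour the commit message describes.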
|
|
11a3cf206b |
feat(slack): bring-your-own-app install + per-installation Socket Mode (MUL-3666) (#4566)
* feat(slack): single app-level Socket Mode connection routed by team_id (MUL-3666) Reshape the Slack adapter from the stage-3 per-installation Socket Mode model into the multi-tenant B2 connection model: ONE deployment-level Socket Mode connection (app-level xapp- token, env MULTICA_SLACK_APP_TOKEN) receives the Events API stream for every installed workspace and routes each inbound event to its channel_installation by team_id — the existing GetChannelInstallationByAppID routing, unchanged. - AppConnector: the single shared connection (slack/app_connector.go). No leader election — per the design "one (or a few)" connections are fine: each replica opens one, Slack delivers each event to one of them, and the existing (installation, message_id) two-phase dedup guarantees exactly-once processing. Resolves the per-team bot user id (via the same app_id query) to detect/strip @-mentions, since one connection serves many workspaces. - Inbound translation (Events API -> channel.InboundMessage) extracted to slack/inbound.go as free functions parameterized by the per-team bot identity. - channel.go trimmed to the outbound Send-only sender; per-installation config (config.go) no longer carries an app-level token — installs hold only the per-workspace bot token (xoxb-) for outbound, since xapp- can't be OAuth'd. - engine.Supervisor now skips channel types with no registered Factory, so Slack installs (driven by the app-level connector, not per-installation channels) no longer churn the lease/Build loop. - Wiring: router.go builds the connector when MULTICA_SLACK_APP_TOKEN is set; main.go runs it alongside the Supervisor. Feishu untouched; channel_* schema unchanged. Verified: go build ./..., go vet ./..., gofmt, and go test ./internal/integrations/... all pass. 
Co-authored-by: multica-agent <github@multica.ai> * feat(slack): OAuth self-serve install backend (MUL-3666) Add the in-product OAuth install flow that creates Slack installations, the keystone the B2 connector consumes. - slack.InstallService: Begin (build authorize URL, seal workspace/agent/ initiator into the OAuth state), Complete (verify state, exchange code via oauth.v2.access, upsert channel_type='slack' install with the bot token encrypted at rest, auto-bind the installer's Slack id so their first message is not dropped), plus List/Get/Revoke. State is stateless: sealed with the deployment secretbox + an embedded expiry, no session store. - HTTP handlers (handler/slack.go): member-visible list, admin-only begin + revoke, and the public OAuth callback (recovers context from the sealed state, redirects the browser back to Settings → Integrations with a result flag). - Routes + wiring: workspace-scoped list/begin/revoke mirror the Lark admin/member split; the callback is a public route like GitHub's. Built from MULTICA_SLACK_CLIENT_ID/SECRET (+ redirect derived from MULTICA_PUBLIC_URL, override MULTICA_SLACK_REDIRECT_URL; scopes via MULTICA_SLACK_SCOPES). - Realtime: slack_installation:created / :revoked events. Verified: go build ./..., go vet, gofmt, and go test ./internal/integrations/slack/... all pass (new install_test.go covers state sign/verify/expiry/tamper, authorize URL, code exchange + encrypted upsert + installer bind, and oauth error paths). Co-authored-by: multica-agent <github@multica.ai> * feat(slack): in-product OAuth install UI for web + desktop (MUL-3666) Add the "Connect Slack" self-serve install UI mirroring the Feishu/Lark integration, completing the in-product install half of B2. Slack's OAuth flow is a redirect (not a device-code QR poll), so the UI is simpler than Lark's. 
- core: SlackInstallation / List / Begin types; api.listSlackInstallations / beginSlackInstall / deleteSlackInstallation; slackKeys + slackInstallationsOptions query; realtime invalidation on slack_installation:* events. - views: slack-tab.tsx (SlackTab settings panel + per-agent SlackAgentBindButton + connected badge + disconnect confirm). Connect calls beginSlackInstall and hands the authorize URL to openExternal (system browser on desktop, new tab on web); Slack bounces to the backend callback which lands the install, and the realtime event refreshes the list. Wired into the Settings → Integrations tab and the agent-detail Integrations tab alongside Lark. - i18n: en + zh-Hans settings.slack.* strings. Verified: pnpm typecheck (full monorepo, 6/6) and pnpm lint (@multica/core, @multica/views — 0 errors) pass. Co-authored-by: multica-agent <github@multica.ai> * feat(slack): outbound Replier + user-binding redeem flow (MUL-3666) Fill the stage-3 Replier=nil tail so non-installer Slack users can onboard and get status feedback — completing B2 end to end. - slack.OutboundReplier (engine.OutboundReplier): on NeedsBinding it mints a single-use binding token and DMs/replies a "link your account" prompt with the redeem URL (wrapped as <url|label> so formatMrkdwn doesn't mangle the base64url token); on AgentOffline/AgentArchived it posts a status notice; on an /issue-created Ingest it confirms the new issue. Plain chat stays silent (the agent's own reply lands via EventChatDone). Reuses the bot-token Send path and reads the installation row from ResolvedInstallation.Platform — no new transport. - slack.BindingTokenService: Mint + transactional RedeemAndBind over the generic channel_binding_token / channel_user_binding queries (channel_type='slack'), mirroring lark.BindingTokenService. 15-min TTL, SHA256-hashed tokens, the three typed failure modes (invalid/expired, already-assigned, not-member). 
- HTTP: POST /api/slack/binding/redeem (public, session-authed) maps the failures to 410/409/403. NewSlackResolverSet now takes the replier (nil disables it). - Frontend: /slack/bind redeem page (packages/views/slack + apps/web route) + api.redeemSlackBindingToken + en/zh slack_bind copy. Verified: go build ./..., go vet, gofmt, go test ./internal/integrations/... (new replier_test.go covers all outcome branches + the prompt URL), plus full pnpm typecheck (6/6) and pnpm lint (0 errors). Co-authored-by: multica-agent <github@multica.ai> * fix(slack): address review must-fixes — connector leak, team-keyed install, /issue copy (MUL-3666) Three fixes from Niko's review: 1. AppConnector.connectOnce leaked the Socket Mode goroutine/connection on a handler error: it ran sm.RunContext on the long-lived ctx and returned the error without cancelling it, so a transient DB/router error left the old connection alive (consuming events into an unread channel) while Run opened a second one. Each connection now runs under its own cancellable context and a deferred cancel + join tears it down on every exit path before reconnect. 2. Slack re-install collided with the (channel_type, app_id) unique index: connecting the same Slack team to a different agent failed because the upsert conflict key was (workspace_id, agent_id, channel_type). Add a team-keyed UpsertChannelInstallationByAppID (ON CONFLICT on the (channel_type, app_id) index, updating agent_id) and use it for the Slack OAuth install, so re-connecting a workspace moves the bot to the chosen agent instead of erroring. Feishu's per-agent upsert is unchanged. 3. /issue clarified: it is not a registered Slack slash command (no `commands` scope), so Slack never routes one to us. Issue creation runs through the message path — `@bot /issue <title>` in a channel or `/issue <title>` in a DM — which the engine parser handles. Documented in the connector and the user-facing copy (en + zh). 
Verified: go build ./..., go vet, gofmt, go test ./internal/integrations/..., make sqlc, plus pnpm typecheck (6/6) and pnpm lint (0 errors). Co-authored-by: multica-agent <github@multica.ai> * fix(slack): make OAuth install transactional — agent-move binding consistency + cross-workspace guard (MUL-3666) Address Elon's review: the team-keyed upsert kept the same installation row and only flipped agent_id, but engine session reuse matches purely on (installation_id, channel_chat_id) and each chat_session is permanently tied to the agent it was created under — so after moving a Slack team from Agent A to Agent B, existing DMs/threads kept routing to Agent A; only brand-new channels/threads reached B. Cross-workspace re-install was worse: the SQL also moved workspace_id while the application-layer user/chat-session bindings stayed behind, inheriting the previous workspace's relations. InstallService.Complete now runs one transaction (lookup → upsert → retire → installer-bind), all application-layer per the no-FK rule: - Look up the existing installation by team_id (config->>'app_id'). - Reject a silent cross-workspace ownership change (ErrTeamOwnedByAnotherWorkspace → callback redirects with slack_error=team_in_other_workspace). The owning workspace must disconnect first. - On an agent change within the same workspace, retire the installation's chat-session bindings (new DeleteChannelChatSessionBindingsByInstallation) so the next message creates a fresh session under the new agent. The chat_session rows are preserved for history; user bindings stay valid (same users/workspace). - Installer auto-bind moves into the tx; an already-bound-elsewhere id is a benign skip, a real DB error aborts the whole install. InstallService now takes a TxStarter; the queries seam gains WithTx (dbInstallQueries adapter) so Complete stays unit-testable with a fake tx. Verified: make sqlc, go build ./..., go vet, gofmt, go test ./internal/integrations/... 
(new tests: agent-move retire, same-agent no-retire, cross-workspace reject, fresh-install no-retire). Co-authored-by: multica-agent <github@multica.ai> * fix(slack): atomic cross-workspace install guard + green up frontend CI (MUL-3666) Two things: address Elon's review and fix the failing frontend CI job. Review (atomic cross-workspace guard): the previous guard was a SELECT before the upsert, which loses the concurrent-OAuth race — two workspaces can both read no rows, one inserts, the other's ON CONFLICT update then silently re-points the team. Move the guard into the upsert itself: ON CONFLICT ... DO UPDATE ... WHERE channel_installation.workspace_id = EXCLUDED.workspace_id, and map the empty RETURNING (pgx.ErrNoRows) to ErrTeamOwnedByAnotherWorkspace. The pre-SELECT now only feeds the agent-change cleanup. Also corrected the error copy: a team stays bound to its first Multica workspace (revoke is soft, keeping the row + unique index), so migration is an operator action, not "disconnect first". CI (frontend vitest, @multica/views#test): - The agent IntegrationsTab now renders the real SlackAgentBindButton, whose connected badge calls useQueryClient — absent from integrations-tab.test.tsx's react-query mock. Hoisted the owner/admin gate above the per-platform sections (one role notice instead of one per platform), made the agents members_note generic (en/zh/ja/ko), and updated the test (mock @multica/core/slack, stub SlackAgentBindButton, assert both platforms). - Added slack-tab.test.tsx covering the real SlackAgentBindButton / SlackTab. - locale parity: added the slack (settings) + slack_bind (common) blocks to ja and ko so every EN key has a translated counterpart. Verified: make sqlc, go build ./..., go vet, gofmt, go test ./internal/integrations/...; pnpm --filter @multica/views test (1478 pass), pnpm typecheck (6/6), pnpm lint (0 errors). 
Co-authored-by: multica-agent <github@multica.ai> * fix(slack): surface agent-page Slack entry points when Lark is off (MUL-3666) The agent-detail Integrations tab and the inspector's Integrations section only considered Lark, so a Slack-only deployment (Lark disabled) showed neither the Integrations tab nor a Connect-Slack button — the per-agent entry points were unreachable. - agent-overview-pane: gate the Integrations tab on Lark OR Slack configured (new slackInstallationsOptions query), not Lark alone. - agent-detail-inspector: render SlackAgentBindButton alongside LarkAgentBindButton in the Integrations section. - regression test: the Integrations tab appears when only Slack is configured. Verified: pnpm typecheck (6/6), pnpm --filter @multica/views test (1478+ pass), pnpm lint (0 errors). Co-authored-by: multica-agent <github@multica.ai> * feat(slack): BYO-app install backend — paste xoxb+xapp, per-app install keyed by real app id (MUL-3666) Adds the bring-your-own-app install path so multiple agents can each have their own bot identity in the SAME Slack workspace (hosted B2 caps at one agent/workspace). User pastes their app's bot token (xoxb-) + app-level token (xapp-); we validate the bot token via auth.test, parse the real Slack app id from the xapp- token, encrypt both tokens, and persist a per-app installation keyed by that app id (real 'A…' ids never collide with hosted 'T…' team ids in the existing unique index — no schema change). 
- config.go: add app_token_encrypted (BYO discriminator + per-app socket token) - install.go: extract shared persistInstall (atomic cross-ws guard + agent-move retire) - byo_install.go: RegisterBYO + auth.test + app-id parse - handler + route: POST /api/workspaces/{id}/slack/install/byo (admin-only) - tests: keying, encryption, invalid tokens, auth.test failure, cross-ws, agent move Follow-ups (separate commits): per-app Socket Mode connector that consumes the stored app token; in-product BYO install dialog (video + paste form). Co-authored-by: multica-agent <github@multica.ai> * refactor(slack): drop OAuth, unify on BYO per-installation model (MUL-3666) Per product decision, Slack drops the hosted-app OAuth path entirely and unifies on bring-your-own-app (BYO): every installation carries its OWN app-level token and gets its OWN Socket Mode connection, so multiple agents can each have a distinct bot identity in one Slack workspace. - Remove OAuth install (Begin/Complete/code-exchange/sealed state/OAuthConfig/ default scopes), the OAuth callback + begin handlers + routes, and the MULTICA_SLACK_CLIENT_ID/SECRET/REDIRECT/APP_TOKEN env wiring. - Replace the single deployment-level AppConnector with a per-installation slackChannel (authenticated with its own xapp- token) registered as a channel Factory, so the engine Supervisor drives one Socket Mode connection per installation (exactly like Feishu). inbound/outbound/resolvers reused as-is. - Route inbound by the event's api_app_id (== the installation's real app id), not team_id. - InstallService slims to at-rest encryption + the shared persistInstall + list/get/revoke; install is the BYO paste path only (byo_install.go). - Tests: drop the OAuth tests; slack + handler + engine all green. Follow-up (frontend): replace the OAuth "Connect Slack" button with the BYO paste dialog (the begin endpoint it calls is now gone). 
Co-authored-by: multica-agent <github@multica.ai> * fix(slack): verify BYO bot + app tokens are from the same app, and the app token is live (MUL-3666) Niko review: RegisterBYO only parsed the app id from the xapp string and auth.test'd the bot token, so pasting app A's bot token with app B's app token would 'connect' but be broken (inbound on B's socket, outbound with A's identity). Now: resolve the bot's owning app id via bots.info (on the bot_id from auth.test) and require it to equal the xapp's app id; and live- validate the app token via apps.connections.open. Reject (no persist) on mismatch or a dead app token. Co-authored-by: multica-agent <github@multica.ai> * feat(slack): in-product BYO install dialog (paste bot + app tokens) (MUL-3666) The OAuth begin endpoint was removed server-side, so the "Connect Slack" button now opens a dialog where the admin pastes the bot token (xoxb-) and app-level token (xapp-) of the Slack app they created, and submits to the BYO install endpoint. Includes an optional setup-video link (URL constant, left empty until the walkthrough is recorded). - core: drop beginSlackInstall / BeginSlackInstallResponse; add registerSlackBYO + RegisterSlackBYORequest. - views: SlackAgentBindButton opens the BYO dialog; refreshed comments and install_supported docs (now means "configured", no OAuth). - i18n: new slack.byo_* keys + refreshed page_description in en/zh-Hans/ja/ko. - tests: dialog submit path; views vitest (1479), typecheck, lint, locale parity all green. Co-authored-by: multica-agent <github@multica.ai> * fix(slack): Elon review — team_id routing guard, per-agent reconnect, users:read hint (MUL-3666) 1. Inbound routing keys on api_app_id (the APP, not the Slack workspace), so additionally require the event's team_id to match the installation's stored team. A distributed BYO app installed into another Slack workspace emits the same app id and would otherwise mis-route to this Multica installation. 
Extracted installationServesTeam() + unit test.
2. BYO install is now agent-keyed (UpsertChannelInstallation, conflict on workspace_id+agent_id+channel_type): one bot per agent. Disconnect → reconnect a NEW app for the SAME agent now UPDATES that agent's row in place instead of violating the (workspace, agent, channel) unique. A unique violation on the (channel_type, app_id) routing index → ErrTeamOwnedByAnotherWorkspace (the app is already connected to another agent/workspace). No chat-session retire is needed: a row's agent_id never changes.
3. UX: bots.info (the same-app check) needs the users:read scope; the connect dialog now lists the required bot scopes including it, and the error text says so.
Backend build/vet/gofmt/test + views vitest + typecheck + locale parity green.
Co-authored-by: multica-agent <github@multica.ai>
* fix(slack): publish slack_installation:created on BYO connect; refresh stale comments (MUL-3666)
Niko final review: RegisterSlackBYO wrote the response but never published EventSlackInstallationCreated, so only the installer's own tab refreshed; other open clients (Settings, Agent Integrations, other tabs) did not see the new bot in realtime, inconsistent with the revoke event and Lark. Now publishes it on success via a small publishSlackInstallationCreated helper, with a unit test (Bus.Publish is synchronous).
Also refreshed comments that still described the removed hosted-OAuth / single deployment-level AppConnector model (handler SlackInstall field, channel.go / inbound.go / outbound.go / byo_install.go). PR title updated separately to the BYO per-installation Socket Mode model.
Co-authored-by: multica-agent <github@multica.ai>
---------
Co-authored-by: J <j@multica.ai>
Co-authored-by: multica-agent <github@multica.ai>
|
||
|
|
4fb6c0fb0e |
fix(daemon): bound runtime --version probe so one wedged CLI can't block all runtimes (MUL-3812) (#4685)
* fix(daemon): bound runtime --version probe so one wedged CLI can't block all runtimes
A CLI whose `--version` never returns (e.g. a brew-installed claude wedged by a bun regression) stalled the daemon's sequential runtime registration loop forever. Registration runs inside the blocking preflight that gates /health, so the daemon never flipped from "starting" to "running" and every runtime on the host appeared disconnected, not just the broken one.
detectCLIVersion now derives a 10s timeout context and sets cmd.WaitDelay so a node/bun shim that leaves a child holding the stdout pipe open can't defeat the timeout. A wedged probe now fails fast; the existing per-agent error skip isolates the broken runtime and the rest register normally.
MUL-3812
Co-authored-by: multica-agent <github@multica.ai>
* test(agent): reap the hang script's orphaned child instead of leaking it
The MUL-3812 regression test spawned a background `sleep 60` that outlived the killed parent and lingered for up to 60s on CI. The hang script now records the child's PID and t.Cleanup reaps it, so no temporary process is left behind.
Co-authored-by: multica-agent <github@multica.ai>
---------
Co-authored-by: J <j@multica.ai>
Co-authored-by: multica-agent <github@multica.ai>
|
||
|
|
0c2f93bcd1 | fix: allow same-origin attachment previews (#4679) | ||
|
|
78d668a2f2 |
fix(agent): clarify Antigravity daemon mode
Fixes MUL-3726 |
||
|
|
e2103a240d |
fix(server): emit issue:updated when failed-task handler resets stuck issue (#4662)
HandleFailedTasks resets a stuck in_progress issue back to todo via a direct UpdateIssueStatus, bypassing the HTTP handler that emits issue:updated. Without that event the frontend realtime reconcile never runs, so status-filtered board columns/lists stay stale until the next write. Publish issue:updated (status_changed + prev_status) after the reset. Fixes #4648 (MUL-3782). |
||
|
|
24754f091b |
fix: allow framed attachment redirects (#4635)
Co-authored-by: J <j@multica.ai> Co-authored-by: multica-agent <github@multica.ai> |
||
|
|
a252f47337 |
fix(scheduler): advance autopilot next_run_at after each scheduled dispatch (MUL-3749) (#4618)
* fix(scheduler): advance autopilot next_run_at after each scheduled dispatch
The display-only autopilot_trigger.next_run_at column was written only on trigger create/update and never advanced afterward, so for a recurring schedule it froze at a past slot and the list rendered it as a 'next run' in the past (e.g. '53m ago'). The intended AdvanceTriggerNextRun query was dead code with zero callers.
Wire it up at the scheduler's existing post-dispatch seam (replacing the last_fired_at-only TouchAutopilotTriggerFiredAt bump, which AdvanceTriggerNextRun already supersets). The advanced value is computed on the app local clock via ComputeNextRun (the same path create/update use), so the whole next_run_at display column is owned by one clock and stays consistent; scheduling itself is untouched and still runs off DB time via NextOccurrencesUTC. On a cron/timezone parse failure we fall back to the last_fired_at-only bump.
Adds a deterministic regression test for the reported scenario (hourly cron in America/New_York) and documents the local-clock ownership on ComputeNextRun.
MUL-3749
Co-authored-by: multica-agent <github@multica.ai>
* fix(scheduler): floor next_run_at advance at plan_time to survive clock skew
Addresses review feedback on the next_run_at write-back (MUL-3749):
- The post-dispatch advance computed the value from time.Now() alone. The handler is entered only after DB time judged the plan due, so if this app instance's clock lags the DB clock at a period boundary, time.Now() could recompute the slot that just fired and next_run_at would not advance (the original staleness bug, at the boundary). Extract advancedNextRun, which anchors at max(now, plan_time) via NextOccurrenceAfterUTC so the written value is always strictly after the fired plan_time while still tracking the local clock in the normal case.
- Add scheduler-layer tests asserting the written value is strictly after plan_time across skew / on-slot / normal cases.
The previous service-layer test only exercised the helper with an explicit after, not this path. - Sync the stale ListSchedulableAutopilotTriggers comment: the scheduler now writes last_fired_at via AdvanceTriggerNextRun (sqlc regenerated). MUL-3749 Co-authored-by: multica-agent <github@multica.ai> --------- Co-authored-by: J <j@multica.ai> Co-authored-by: multica-agent <github@multica.ai> |
||
|
|
256a0a9b27 |
fix(squad): skip leader on reply that inherits parent @mention (MUL-3744)
Refs MUL-3744. |
||
|
|
3692b6a862 |
fix(squad): inject leader briefing by task flag, not issue assignee (MUL-3730) (#4606)
* fix(squad): inject leader briefing by task flag, not issue assignee Key squad-leader briefing injection off task.IsLeaderTask + task.SquadID instead of issue.AssigneeType=='squad'. The old gate missed the most common path — an @squad mention in a comment on an issue assigned to a plain agent (MUL-3724) — so the leader booted with zero squad context and did the work itself instead of orchestrating. - migration 127: add agent_task_queue.squad_id (no FK) + partial index - sqlc: CreateAgentTask stamps squad_id; CreateRetryTask inherits it - service: thread squadID through EnqueueTaskForSquadLeader(+WithHandoff), enqueueMentionTask, and the rerun path; all 5 call sites pass the squad id - daemon claim: unified injection keyed on leader-task + squad_id, with a defensive leader-identity re-check; quick-create block retained (it serves issue-less tasks and sets resp.SquadID/SquadName) - briefing: strengthen leader Operating Protocol opening - tests: claim-time injection (comment-mention/non-leader/null-squad), squad_id enqueue stamping, retry inheritance; existing fixture updated Co-authored-by: multica-agent <github@multica.ai> * test+docs(squad): dangling squad_id regression + clarify quick-create path Address review nits on #4606: - Add TestClaim_LeaderTaskWithDanglingSquadID_NoBriefing: squad hard-deleted after enqueue leaves task.squad_id dangling (no FK); claim still 200 and skips injection via the err!=nil guard. This is the load-bearing contract for dropping the FK. - Rewrite the daemon.go injection comment to state quick-create does NOT use the is_leader_task/squad_id columns — it routes squad via the context JSON branch (qc.SquadID) and must not be folded into the column-based path. Co-authored-by: multica-agent <github@multica.ai> --------- Co-authored-by: 魏和尚 <agent@multica.ai> Co-authored-by: multica-agent <github@multica.ai> |
||
|
|
8d0ea04fb0 |
feat(composio): add standalone Go SDK client (MVP) (#4603)
* feat(composio): add standalone Go SDK client (MVP)
Adds server/pkg/composio — a self-contained Go SDK for the Composio v3.1
REST API. Built on go-resty/resty v2; zero coupling to other Multica
packages so it can be vendored or extracted later without surgery.
MVP surface (just the endpoints Stage 2 needs):
- POST /connected_accounts/link Client.CreateLink
- POST /tool_router/session Client.CreateSession
- GET /connected_accounts Client.ListConnectedAccounts
- POST /connected_accounts/{id}/revoke Client.RevokeConnection
- DELETE /connected_accounts/{id} Client.DeleteConnectedAccount
(404 -> nil, idempotent)
- GET /toolkits Client.ListToolkits
- GET /toolkits/{slug} Client.GetToolkit
- POST /tools/execute/{slug} Client.ExecuteTool
- Webhook HMAC-SHA256 verification composio.VerifyWebhook /
VerifyHTTPRequest + ParseEvent
Other notes:
- Auth via x-api-key header (Composio v3.1 contract).
- Typed *APIError envelope with IsNotFound / IsUnauthorized /
IsRateLimited helpers; falls back to raw body when upstream returns
non-JSON.
- Webhook signature accepts the official "v1,<sig>" format and any
comma-separated multi-version list; 300s replay tolerance by default,
honors an injectable clock for tests; RFC3339 timestamps tolerated.
- README.md documents all public APIs and design choices.
Tests:
- All exercise httptest.NewServer - no real Composio calls.
- 36 tests covering happy paths, validation, 404 idempotence, error
decoding, signature verify (good / tampered / stale / multi-version /
bare / RFC3339 / missing headers / empty secret).
- go test ./pkg/composio/... -cover -> 82.2%, exceeds the >=80% bar.
Follow-ups (separate PRs):
- server/internal/integrations/composio - DB schema, REST handlers,
registration_service (CSRF), dispatch hook (MUL-3720 remainder).
- Pagination iterators, retry middleware, proxy execute, triggers.
Refs: MUL-3720, MUL-3715
Co-authored-by: multica-agent <github@multica.ai>
* fix(composio): align SDK with v3.1 wire contract (PR #4603 review)
Addresses GPT-Boy's review on PR #4603:
Must-fix
- ListConnectedAccountsRequest: switch UserID/AuthConfigID singular fields
to plural slices (UserIDs, AuthConfigIDs, ToolkitSlugs, ConnectedAccountIDs,
Statuses). The Composio v3.1 spec ships these as `*_ids` array params;
our singular form silently dropped the user-filter on the wire. Also
surfaces order_by / order_direction / account_type from the same spec.
- ExecuteToolRequest: rename ToolkitVersions -> Version with json tag
`version` (the actual v3.1 body field). Marks AllowTracing as
deprecated per the spec.
Nits
- ListToolkitsRequest.SortBy comment: `popular | alphabetical` -> the
real enum `usage | alphabetically`.
- Client constructor: when Options.HTTPClient is provided, use
resty.NewWithClient(hc) so the caller's Transport, Jar, CheckRedirect,
and Timeout all carry through — the prior code only forwarded
Transport + Timeout despite the field comment promising the full
*http.Client.
Tests
- TestListConnectedAccounts_QueryString now asserts plural query keys
(user_ids, auth_config_ids, connected_account_ids, statuses) and
explicitly guards that the legacy singular keys do not leak.
- TestExecuteTool_Success decodes the body as a raw map and asserts the
wire key is `version` (not `toolkit_versions`).
- New TestExecuteToolRequest_VersionSerialization locks the json tag.
- New TestNewClient_HonorsInjectedHTTPClient drives a request through a
recordingTransport and asserts the inbound *http.Client actually
handled it.
Verification
- go test ./pkg/composio/... -cover -> 85.1% (38 tests; was 82.2% / 36).
- go vet, go build, gofmt -l all clean.
Refs: PR #4603 review, MUL-3720
Co-authored-by: multica-agent <github@multica.ai>
---------
Co-authored-by: Eve <eve@multica-ai.local>
Co-authored-by: multica-agent <github@multica.ai>
|
||
|
|
9e807efc62 |
feat(sidebar): per-workspace switcher dot + count unread per issue (MUL-3695) (#4591)
* feat(sidebar): mark which workspace has unread in the switcher dropdown (MUL-3695)
The aggregate avatar dot only says "some other workspace has unread". When
the user opened the workspace switcher, they couldn't tell which one. Add a
per-row brand dot next to each OTHER workspace that has unread inbox items,
in the same right-edge slot as the active-workspace check (the active
workspace is excluded - its unread is the Inbox nav count - so dot and check
never collide on one row).
Reuses the existing cross-workspace summary data; no backend change. New pure
helper unreadWorkspaceIds() + unit tests, and AppSidebar dropdown tests
covering: dot only on the other unread workspace, no dot at count 0, and
never on the active workspace.
Co-authored-by: multica-agent <github@multica.ai>
* fix(inbox): count switcher unread per issue, matching the inbox dedup (MUL-3695)
The unread-summary that drives the workspace-switcher dot counted raw unread
inbox_item rows, but the inbox UI deduplicates notifications per issue and
treats an issue as read when its NEWEST non-archived item is read. Opening an
issue marks only that newest item read (markInboxRead is per-item; only
archive cascades to siblings), so older siblings stay unread in the DB.
Result: a workspace whose inbox the user sees as empty still lit the dot
(reported on bohan-personal showing a dot for Multica AI with no unread).
Rewrite CountUnreadInboxByWorkspace to pick the newest non-archived item per
(workspace, issue-or-id group) via DISTINCT ON and count only groups whose
newest item is unread - the exact semantics of
deduplicateInboxItems(...).filter(!read) on the client. No schema/handler
change; query-only.
Adds TestInboxUnreadSummaryDedupesByIssue covering the read-newest /
unread-older case and its inverse.
Co-authored-by: multica-agent <github@multica.ai>
---------
Co-authored-by: J <j@multica.ai>
Co-authored-by: multica-agent <github@multica.ai>
|
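The per-issue counting rule described in the second commit can be sketched in Go roughly as follows. This is an in-memory illustration of the semantics only, not the SQL query from the commit; the inboxItem type, its field names, and countUnreadByWorkspace are hypothetical.

```go
package main

import "fmt"

// inboxItem is a hypothetical, trimmed-down inbox row; field names are assumptions.
type inboxItem struct {
	ID          string
	WorkspaceID string
	IssueID     string // empty when the notification is not tied to an issue
	CreatedAt   int64  // larger means newer
	Read        bool
	Archived    bool
}

// groupKey identifies one (workspace, issue-or-id) group.
type groupKey struct{ workspace, group string }

// countUnreadByWorkspace keeps the newest non-archived item per group and
// counts only the groups whose newest item is unread.
func countUnreadByWorkspace(items []inboxItem) map[string]int {
	newest := make(map[groupKey]inboxItem)
	for _, it := range items {
		if it.Archived {
			continue
		}
		// Prefix the group so an item ID can never collide with an issue ID.
		group := "issue:" + it.IssueID
		if it.IssueID == "" {
			group = "item:" + it.ID
		}
		k := groupKey{it.WorkspaceID, group}
		if cur, ok := newest[k]; !ok || it.CreatedAt > cur.CreatedAt {
			newest[k] = it
		}
	}
	counts := make(map[string]int)
	for k, it := range newest {
		if !it.Read {
			counts[k.workspace]++
		}
	}
	return counts
}

func main() {
	items := []inboxItem{
		// ws1: newest item is read, older sibling is unread -> not counted.
		{ID: "a", WorkspaceID: "ws1", IssueID: "ISS-1", CreatedAt: 1, Read: false},
		{ID: "b", WorkspaceID: "ws1", IssueID: "ISS-1", CreatedAt: 2, Read: true},
		// ws2: the inverse, newest item is unread -> counted once.
		{ID: "c", WorkspaceID: "ws2", IssueID: "ISS-9", CreatedAt: 1, Read: true},
		{ID: "d", WorkspaceID: "ws2", IssueID: "ISS-9", CreatedAt: 2, Read: false},
	}
	fmt.Println(countUnreadByWorkspace(items))
}
```

Counting raw unread rows would report one unread item for each workspace here; grouping first is what keeps a workspace with a visually empty inbox from lighting the dot.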
||
|
|
54145ad72e |
feat(sidebar): dot the workspace switcher when other workspaces have unread inbox (MUL-3695) (#4577)
Adds a cross-workspace unread summary so the workspace switcher shows the
existing brand dot when a workspace OTHER than the active one has unread
inbox items. The active workspace's own unread stays on the Inbox nav count
to avoid a duplicate signal, and the dot is shared with the
pending-invitation indicator.
Backend: new GET /api/inbox/unread-summary returns per-workspace unread
counts for the user, scoped via a member join so a left workspace can't
light the dot. One account-level query instead of N per-workspace inbox
fetches.
Frontend: schema-guarded api.getInboxUnreadSummary, a single account-level
TanStack Query, and a derived "other workspace has unread" boolean in
AppSidebar (shared by web + desktop). Inbox WS events
(new/read/archived/batch) and reconnect invalidate the summary, so the dot
appears and clears in realtime even for events from a non-active workspace.
Closes multica-ai/multica#3773
Co-authored-by: J <j@multica.ai>
Co-authored-by: multica-agent <github@multica.ai>
|
||
|
|
7d0c73d11f |
MUL-3417: tolerate OpenClaw config file CLI mismatch
Closes MUL-3417 Fixes #4299 |
||
|
|
8e7d28bff1 |
fix(issues): emit project_changed so moved issues leave the old project list (MUL-3669) (#4571)
The per-project issue list rides the filtered myAll cache. Changing an issue's project is a membership change, but the surgical patch (patchIssueInBuckets) is filter-blind and never removes a card that no longer matches the list's project filter — so a moved issue stayed visible in the old project's list until a manual refetch (#4548 / MUL-3669). Root cause: project_id was the only membership-affecting field with no server *_changed flag. The WS handler fell back to diffing project_id against its own cache, which breaks once onMutate has optimistically overwritten the cached value on a local move. - server: stamp project_changed on issue:updated (UpdateIssue + Batch), alongside status_changed / assignee_changed. - events.ts: surface project_changed (optional, additive — old clients ignore). - ws-updaters: prefer the server flag, fall back to the cache diff only when absent (older backend) so a new frontend on an old backend does not regress. - mutations: onSettled invalidates myAll when project_id changed — a local safety net that never depends on the WS echo (update + batch). Tests: WS flag wins over a matching cache (local-move repro), explicit false suppresses the legacy diff, the cache-diff fallback still fires, and both mutations invalidate myAll on a project change. Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com> |
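The "prefer the server flag, fall back to the cache diff" rule can be sketched as below. The real logic lives in the frontend ws-updaters; this Go version only illustrates the decision, and projectChanged and boolPtr are hypothetical names.

```go
package main

import "fmt"

// boolPtr is a small helper for building optional flags in this sketch.
func boolPtr(b bool) *bool { return &b }

// projectChanged decides whether an issue:updated event moved the issue to
// another project. A nil serverFlag models an older backend that does not
// send project_changed; only then is the cache diff consulted.
func projectChanged(serverFlag *bool, cachedProjectID, incomingProjectID string) bool {
	if serverFlag != nil {
		return *serverFlag
	}
	return cachedProjectID != incomingProjectID
}

func main() {
	// Local move: the optimistic update already rewrote the cache, so the
	// diff sees no change, but the server flag still reports the move.
	fmt.Println(projectChanged(boolPtr(true), "p2", "p2"))
	// An explicit false suppresses the legacy diff.
	fmt.Println(projectChanged(boolPtr(false), "p1", "p2"))
	// Flag absent (older backend): fall back to the cache diff.
	fmt.Println(projectChanged(nil, "p1", "p2"))
}
```

The three calls mirror the test cases listed in the commit message: the flag wins over a matching cache, an explicit false suppresses the diff, and the fallback still fires when the flag is missing.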
||
|
|
35e5455953 |
fix: allow split-origin attachment previews (#4539)
Co-authored-by: J <j@multica.ai> Co-authored-by: multica-agent <github@multica.ai> |
||
|
|
33967611b2 |
fix(cli): add daemon signal check to prevent silent PAT fallback
Closes #4204 Add a daemon signal guard in resolveToken so daemon-managed subprocesses fail closed instead of falling back to the user-global config token when agent/task env markers are missing. Also adds boundary tests for MULTICA_DAEMON_PORT, MULTICA_SERVER_URL, explicit MULTICA_TOKEN priority, and normal config fallback. |
||
|
|
93ed3dc131 |
fix(lark): use app URL for web links (MUL-3679)
Point Lark binding and issue-created links at the web app URL (MULTICA_APP_URL) instead of the backend public URL, so users on split-domain / self-host deployments get working links. appURLFromEnv falls back to FRONTEND_ORIGIN, matching the backend's existing app-URL resolution. |
||
|
|
aa4478af52 |
fix(codex): unhang cleanup after stdout scanner overflow (#4520) (#4563)
When codex emits a single stdout line larger than the daemon's 10 MB
bufio.Scanner cap (typical trigger: thread/resume on a long-history
session), the reader goroutine returns scanner.Err()="token too long",
markProcessExited fails the in-flight RPC, and the lifecycle goroutine
enters its failure path. That path calls drainAndWait() — stdin.Close()
+ cmd.Wait() — before sending the failed Result. But cmd.Wait() never
returns: codex is alive and blocked writing the rest of the oversized
line into a stdout pipe nobody is reading, so it never reaches its
stdin read syscall and never sees the EOF. The lifecycle goroutine
therefore never sends Result{failed} to its caller, the outer daemon
blocks on the result channel, and the existing PriorSessionID-with-
empty-SessionID fallback never fires — the task is permanently
stalled and codex (Node wrapper + native Rust app-server) leaks until
the OS reaps them.
The cancel() that would have unblocked things via cmd.WaitDelay's
SIGKILL was registered as a defer AFTER drainAndWait, so LIFO defer
order put cancel last — drainAndWait blocks first, cancel never runs.
Fix:
1. drainAndWait now runs the existing graceful-then-cancel pattern
itself, in two bounded phases. Phase 1 waits for readerDone (capped
by codexGracefulShutdownTimeout, so we still give codex its OTEL
flush window on clean exits); on timeout it cancels the runCtx so
cmd.Cancel kills the tree and the reader unblocks. Phase 2 bounds
cmd.Wait() the same way for the scanner-overflow case, where
readerDone closed early but the process is still alive on a full
stdout pipe. The success-path cleanup that previously duplicated the
graceful-cancel pattern around readerDone collapses to a single
drainAndWait() call.
2. cmd.Cancel is set to send SIGKILL to the whole codex process group
(Setpgid via configureProcessGroup, signalProcessGroup on cancel)
instead of just the leader. This addresses YOMXXX's
orphaned-Codex-child concern: the Node wrapper and the native
app-server it spawns now both die when cleanup forces the kill,
rather than the native binary leaking as an orphan reparented to
init. configureProcessGroup is a no-op on Windows.
3. codexGracefulShutdownTimeoutNanos atomic.Int64 mirrors
opencodeTerminateGraceNanos so the regression test can shrink the
grace window from 10 s to 500 ms. Production code is unchanged
(default 10 s).
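The two bounded phases from item 1 can be sketched without a real child process. This is a simplified model under stated assumptions: channels stand in for readerDone and for cmd.Wait() returning, cancel is assumed to force both to finish, and waitOrCancel, drainAndWait (as written here), and simulateScannerOverflow are illustrative names, not the daemon's actual code.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// waitOrCancel waits for done for at most grace. On timeout it calls cancel
// and keeps waiting, on the assumption that cancel forces done to close.
// It reports whether done closed within the grace window.
func waitOrCancel(done <-chan struct{}, grace time.Duration, cancel func()) bool {
	timer := time.NewTimer(grace)
	defer timer.Stop()
	select {
	case <-done:
		return true
	case <-timer.C:
		cancel()
		<-done
		return false
	}
}

// drainAndWait bounds both phases: first the reader, then the process exit.
func drainAndWait(readerDone, processDone <-chan struct{}, grace time.Duration, cancel func()) (readerGraceful, processGraceful bool) {
	readerGraceful = waitOrCancel(readerDone, grace, cancel)
	processGraceful = waitOrCancel(processDone, grace, cancel)
	return readerGraceful, processGraceful
}

// simulateScannerOverflow models the bug scenario: the reader has already
// stopped, but the process only exits once the run context is cancelled.
func simulateScannerOverflow() (bool, bool) {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	readerDone := make(chan struct{})
	close(readerDone) // reader returned early with "token too long"

	processDone := make(chan struct{})
	go func() {
		<-ctx.Done() // stands in for the kill triggered by cancellation
		close(processDone)
	}()

	return drainAndWait(readerDone, processDone, 50*time.Millisecond, cancel)
}

func main() {
	reader, process := simulateScannerOverflow()
	fmt.Println(reader, process) // true false
}
```

In the overflow case phase 1 returns immediately and phase 2 is the one that times out and cancels, which is why bounding only the reader wait would not have been enough.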
Outer daemon (daemon.go) already retries with a fresh session when
result.Status == "failed" && PriorSessionID != "" && result.SessionID
== ""; the failed Result now actually reaches it, so the recovery
fires on its own without any daemon-side change.
Tests:
- New regression TestCodexExecuteCleansUpWhenScannerOverflowsOnResume
spawns a fake codex that emits an 11 MB single-line thread/resume
response (trips the scanner cap) and then sleeps without re-reading
stdin. With the original drainAndWait body it blocks at the 10 s
executeFakeCodex deadline ("timeout waiting for result") — verified
by temporarily reverting just the helper body — and with the fix it
completes in ~1.3 s with Result.Status="failed",
Result.SessionID="" so the outer fallback can fire.
- Full codex test suite, full agent package, daemon + execenv +
repocache packages, go build ./..., and go vet on agent/daemon all
pass.
Out of scope (deferred to follow-up per YOMXXX): bumping the 10 MB
bufio.Scanner cap on codex / claude / copilot / cursor / hermes /
kimi / kiro / codebuddy / antigravity / qoder / openclaw / opencode
(pi already sits at 32 MB), and the shared bounded JSON-RPC line
reader that would eliminate the single-line-overflow risk class
entirely. Buffer size alone is not the fix — recovery behaviour is.
Refs: GH#4520
Co-authored-by: Eve <eve@multica-ai.local>
Co-authored-by: multica-agent <github@multica.ai>
|
||
|
|
cb6616f530 |
feat(slack): Socket Mode channel.Channel adapter (MUL-3516) (#4523)
* feat(slack): Socket Mode channel.Channel adapter (MUL-3516) First slice of the Slack adapter: implements channel.Channel (Type/Connect/Disconnect/Send/Capabilities) over Slack Socket Mode, normalizes inbound events to channel.InboundMessage (DM, channel @mention, thread reply; bot-loop + edit/delete guards), decodes the per-installation config/secret blob, and registers the Factory under TypeSlack. No engine, core, or channel_* schema change. Unit-tested (translation, capabilities, config decode, chunking, Send via httptest). Resolvers + engine wiring + Block Kit binding replier follow. Co-authored-by: multica-agent <github@multica.ai> * fix(slack): address adapter review (MUL-3516) - Propagate InboundHandler errors through dispatchEventsAPI/handleSocketEvent to Connect so an infra failure tears down the connection for Supervisor reconnect/backoff instead of being silently swallowed (ACK still happens first). - Capabilities: declare only CapText | CapThreadReply; drop CapRichCard/CapAttachment/CapMessageEdit until those Send paths are wired. - slackChatType: map mpim (multi-party DM) to group, not p2p, so the 'must address bot' filter applies; only 1:1 im is p2p. - Document the group-addressing decision: explicit @bot mention required in groups; mention-free thread continuation deferred to the session-aware layer. - Tests: handler-error propagation, slackChatType table, mpim-requires-mention, capabilities negative assertions. Co-authored-by: multica-agent <github@multica.ai> * refactor(channel): shared channel-agnostic ChatSession service (MUL-3516) Extract the session/append//issue machinery — currently locked inside the Feishu-pinned lark.chatSessionService — into a shared engine.ChatSession parameterized by channel_type + session titles, so every IM adapter reuses it instead of re-implementing it. 
Logic is verbatim (find-or-create session+binding with unique-violation race re-read; append+touch+reply-target+in-tx dedup Mark; /issue parse with bare-command previous-message fallback) but channel-neutral: command-parse source is supplied by the adapter (enrichment is platform-specific). Backed by a narrow SessionQueries interface so it is unit-tested with an in-memory fake (no DB). /issue parser moved to engine.ParseIssueCommand. Next: migrate Feishu onto it and wire Slack's ResolverSet, removing the lark duplicate. Co-authored-by: multica-agent <github@multica.ai> * fix(channel): decouple session binding key from outbound target (MUL-3516) Addresses Elon's round-2 review. engine.ChatSession.EnsureSession previously keyed the binding on a raw chat id (EnsureSessionInput.ChatID), so a resolver wiring Slack straight through would collapse every @bot thread in one channel into a single chat_session and overwrite last_thread_id. Make the API un-misusable: - EnsureSessionInput.ChatID -> BindingKey: the explicit session-isolation key (Feishu: chat id; Slack DM: channel id; Slack channel: channel id + thread root), documented so a raw threaded-platform chat id is never passed straight through. - Add EnsureSessionInput.BindingConfig (opaque) persisted on the binding's config column, so the real outbound channel/thread is preserved when BindingKey is composite — outbound routing stays separate from the isolation key. - channel.sql CreateChannelChatSessionBinding now writes config (additive, uses the existing NOT NULL column; lark caller passes '{}', no schema change, no Feishu regression). - Tests: TestEnsureSession_ThreadRootIsolation (two thread roots in one channel -> two sessions; same root reuses) and TestEnsureSession_StoresBindingConfig. No production wiring change yet (per review, the not-yet-wired shared service is an accepted preparatory state); this makes the API correct before Feishu/Slack are migrated onto it. 
Co-authored-by: multica-agent <github@multica.ai> * feat(slack): Slack ResolverSet with thread-root session isolation (MUL-3516) Wires Slack into the channel-agnostic engine.Router via a ResolverSet built on the generic channel_* queries (installation route by team_id, identity + workspace-membership recheck, two-phase dedup, audit) plus the shared engine.ChatSession. No new query, no schema change. slackSessionRouting is the per-message isolation rule (Elon round-2 / Niko round-3): a DM is one session per channel; a channel/group message is isolated by thread root (key = channel:threadRoot, root = inbound thread_ts or the message ts for a top-level @mention), so two @bot threads in one channel are two sessions. The real channel id rides in BindingConfig for outbound; the reply thread is returned separately. Tests cover DM/channel/thread routing, config, and that distinct thread roots isolate while a same-thread follow-up reuses its key. Not yet wired into router.go (still a preparatory commit, per review); Feishu migration onto the shared service, router/config wiring, and the Slack outbound path follow. Co-authored-by: multica-agent <github@multica.ai> * feat(slack): Markdown->mrkdwn outbound formatting (MUL-3516) Slack renders mrkdwn, not Markdown, so an unconverted agent reply shows literal ** , ## and [text](url). Add formatMrkdwn — a faithful Go port of Hermes Agent's slack format_message (MIT) — and apply it in slackChannel.Send before chunking/posting. Protects fenced+inline code, converted links, and existing Slack entities behind placeholders; converts headers/bold/italic/strike/links; escapes control chars. Unit tests cover each construct plus fenced-code protection and a link nested in bold. Co-authored-by: multica-agent <github@multica.ai> * docs(slack): preserve Hermes MIT notice for ported mrkdwn converter (MUL-3516) Addresses Niko's review. 
formatMrkdwn is a substantial port of Hermes Agent's slack format_message; MIT requires preserving the copyright + permission notice. Add the full Hermes MIT copyright/permission notice + source URL as a header on mrkdwn.go (no repo-level third-party notice file exists, and the header cannot get separated from the ported code). Also add the suggested Send-layer regression test (TestSend_AppliesMrkdwn) that pins the wiring: slackChannel.Send converts Markdown to mrkdwn before posting. Co-authored-by: multica-agent <github@multica.ai> * refactor(lark): migrate Feishu onto shared engine.ChatSession, drop duplicate (MUL-3516) Completes 'every IM reuses one shared session service' and removes the dual-path the reviewers flagged as temporary. Feishu's ResolverSet now drives the channel-agnostic engine.ChatSession (channel_type=feishu, Lark session titles preserved) instead of the Feishu-specific lark.chatSessionService, which is deleted. Behavior is unchanged: engine.ChatSession is the verbatim port of the old logic and is unit-tested; the new Feishu binder param-mapping (BindingKey=chat id, CommandText=un-enriched CommandBody from Raw) is covered by feishu_resolvers_test.go. - Delete chat_service.go (chatSessionService + helpers) and issue_command.go/_test.go (parser now engine.ParseIssueCommand). Relocate the shared TxStarter interface to tx.go (still used by binding-token + registration services). - chat.go keeps only the AuditLogger seam; remove the now-dead ChatSessionService / EnsureChatSessionParams / AppendUserMessageParams / AppendResult / IssueCommand types. - router.go constructs engine.NewChatSession for Feishu; inbound_enricher_test + doc.go updated. make-test parity: go build ./..., go vet, gofmt, and go test ./internal/integrations/{lark,channel/...,slack} all pass (full Feishu suite green). 
Co-authored-by: multica-agent <github@multica.ai> * feat(slack): wire Slack adapter + ResolverSet + outbound into router (MUL-3516) Activates the full Slack pipeline, gated by MULTICA_SLACK_SECRET_KEY (the bot/app-token decryption key). When unset the block is skipped, so existing deployments are unaffected and Feishu is untouched. - router.go registers slack.RegisterSlack (Socket Mode connect/send Factory) + channelRouter.Register(TypeSlack, NewSlackResolverSet) (inbound pipeline) + slack.NewOutbound(...).Register(bus) (outbound). - New slack/outbound.go: an EventChatDone subscriber mirroring the Feishu Patcher. It finds the Slack chat binding for the finished session, recovers the real channel from the binding config (the channel_chat_id may be a composite thread-isolation key) + the reply thread from last_thread_id, and posts via slackChannel.Send (reusing formatMrkdwn / chunking / threading). Sessions with no Slack binding are ignored, so it coexists with the Feishu Patcher on the shared bus. - Tests: posts to the bound channel/thread with the real channel id; ignores non-Slack sessions, empty completions, revoked installations, and non-chat events. Slack now shares engine.ChatSession, channel_* tables, IssueService and TaskService with Feishu. Remaining: config-driven installation provisioning (an operator currently creates the channel_type='slack' row; the config block shape — which workspace/agent — is a product decision) and a live end-to-end smoke. go build ./..., go vet, gofmt, and go test ./internal/integrations/{slack,channel/...,lark} all pass. Co-authored-by: multica-agent <github@multica.ai> --------- Co-authored-by: J <j@multica.ai> Co-authored-by: multica-agent <github@multica.ai> |
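The thread-root isolation rule behind slackSessionRouting can be sketched as follows. The routing struct and the slackRouting signature are hypothetical; only the key shapes (channel id for a DM, channel:threadRoot otherwise) come from the commit messages, and leaving the DM reply thread empty is an assumption.

```go
package main

import "fmt"

// routing is a hypothetical result type: the isolation key is kept separate
// from the real outbound channel and reply thread.
type routing struct {
	BindingKey  string // session-isolation key
	Channel     string // real channel id, preserved for outbound
	ReplyThread string // thread to reply in; empty for a DM (assumption)
}

// slackRouting derives the per-message routing. chatType is "p2p" for a 1:1
// DM and "group" for channels and multi-party DMs.
func slackRouting(chatType, channelID, threadTS, messageTS string) routing {
	if chatType == "p2p" {
		// A DM is one session per channel.
		return routing{BindingKey: channelID, Channel: channelID}
	}
	// The thread root is the inbound thread_ts, or the message ts for a
	// top-level @mention that starts a new thread.
	root := threadTS
	if root == "" {
		root = messageTS
	}
	return routing{
		BindingKey:  channelID + ":" + root,
		Channel:     channelID,
		ReplyThread: root,
	}
}

func main() {
	first := slackRouting("group", "C1", "", "100.1")       // top-level @mention
	reply := slackRouting("group", "C1", "100.1", "100.5")  // follow-up in that thread
	other := slackRouting("group", "C1", "", "200.7")       // second thread, same channel
	dm := slackRouting("p2p", "D1", "", "300.0")
	fmt.Println(first.BindingKey, reply.BindingKey, other.BindingKey, dm.BindingKey)
}
```

Two @bot threads in one channel get distinct keys while a follow-up in the same thread reuses its key, and the real channel id stays available for the outbound path.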
||
|
|
f4dba5d6b0 |
fix(cli): setup self-host respects existing config and shows URL changes (#4537)
Honor an already-configured self-host server_url/app_url when re-running `multica setup self-host` without flags, instead of silently resetting to http://localhost:8080. The existing config is added as a fallback step in the URL resolution chain (flag → env → existing config → localhost), and the overwrite confirmation now shows URL changes as `old -> new`. Closes #4536 MUL-3660 |
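The resolution chain (flag, then env, then existing config, then localhost) can be sketched like this. The function names are hypothetical and the CLI's real code may be structured differently; only the precedence order, the default URL, and the `old -> new` display come from the commit message.

```go
package main

import "fmt"

const defaultServerURL = "http://localhost:8080"

// firstNonEmpty returns the first non-empty candidate, or "" if none is set.
func firstNonEmpty(candidates ...string) string {
	for _, c := range candidates {
		if c != "" {
			return c
		}
	}
	return ""
}

// resolveServerURL applies the precedence: flag, env, existing config, default.
func resolveServerURL(flagValue, envValue, existingConfig string) string {
	return firstNonEmpty(flagValue, envValue, existingConfig, defaultServerURL)
}

// describeChange renders a URL change for the overwrite confirmation, or ""
// when nothing changes.
func describeChange(oldURL, newURL string) string {
	if oldURL == newURL {
		return ""
	}
	return oldURL + " -> " + newURL
}

func main() {
	existing := "https://multica.example.internal" // hypothetical saved config
	// Re-running setup without flags keeps the saved URL.
	fmt.Println(resolveServerURL("", "", existing))
	// An explicit flag still wins, and the change is shown to the user.
	next := resolveServerURL("https://new.example.internal", "", existing)
	fmt.Println(describeChange(existing, next))
}
```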
||
|
|
b71d9d0ab9 |
MUL-3674: Preserve Kiro goal completion on close error (#4560)
* fix(daemon): preserve Kiro goal completion on close error Co-authored-by: multica-agent <github@multica.ai> * fix(daemon): require completed Kiro goal marker Co-authored-by: multica-agent <github@multica.ai> --------- Co-authored-by: Eve <eve@multica-ai.local> Co-authored-by: multica-agent <github@multica.ai> |
||
|
|
d9bf4b85c9 | test(cli): cover additional subcommands (#4555) | ||
|
|
0d3b49f2c7 |
fix(webhook): use unique ZSET member in Redis rate limiter (#4546)
The sliding-window Lua script used the nanosecond timestamp as both the ZSET score and member. Two requests landing in the same nanosecond collided on an identical member, so ZADD updated in place instead of inserting and the window under-counted — letting requests through past the limit. This surfaced as a flaky CI failure in TestRedisWebhookIPRateLimiter_HasSeparateBudgetFromTokenLimiter. Keep the timestamp as the score (so ZREMRANGEBYSCORE trimming is unchanged) and use a per-request UUID as the member so each admitted request is counted exactly once. Co-authored-by: J <j@multica.ai> Co-authored-by: multica-agent <github@multica.ai> |
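The under-count can be reproduced with a small in-memory model of a sorted set. This is a sketch, not the Lua script from the commit: the zset type, admit, and countAdmitted are hypothetical, the trim-then-count-then-add order is an assumption, and a counter suffix stands in for the per-request UUID.

```go
package main

import "fmt"

// zset models a Redis sorted set as member -> score. Adding an existing
// member overwrites its score instead of creating a second entry.
type zset map[string]int64

// admit trims entries that fell out of the window, rejects when the window
// is full, and otherwise records the request under the given member.
func admit(z zset, member string, now, window int64, limit int) bool {
	for m, score := range z {
		if score <= now-window {
			delete(z, m)
		}
	}
	if len(z) >= limit {
		return false
	}
	z[member] = now // the timestamp stays the score in both variants
	return true
}

// countAdmitted sends five requests in the same nanosecond against a limit
// of three and reports how many were let through.
func countAdmitted(uniqueMembers bool) int {
	const (
		now    int64 = 1000
		window int64 = 100
		limit        = 3
	)
	z := zset{}
	admitted := 0
	for i := 0; i < 5; i++ {
		member := fmt.Sprintf("%d", now) // old behaviour: timestamp as member
		if uniqueMembers {
			member = fmt.Sprintf("%d-%d", now, i) // stand-in for a UUID
		}
		if admit(z, member, now, window, limit) {
			admitted++
		}
	}
	return admitted
}

func main() {
	fmt.Println(countAdmitted(false)) // 5: colliding members under-count
	fmt.Println(countAdmitted(true))  // 3: each admitted request is counted once
}
```

Because the score is unchanged, score-based trimming behaves the same in both variants; only the member identity differs.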
||
|
|
a03055b07d |
fix(agent): terminate opencode process group before closing stdout (#4533) (#4541)
On cancellation/timeout the opencode backend closed the stdout read end immediately, leaving the child writing into a closed pipe. Every write then returns EPIPE and, per anomalyco/opencode#33653, can spin an orphaned process at 100% CPU — surfacing as high idle CPU after a cancelled task or daemon restart (MUL-3655). Cleanup now runs opencode in its own process group and, on cancel, drives a graceful group-wide SIGTERM → grace → SIGKILL, closing the stdout pipe only as a last-resort unblock once the tree has been signalled (SIGKILL is uncatchable, so no member can write again — no EPIPE window). The group signal also reaps tool subprocesses opencode spawned instead of orphaning them. WaitDelay remains the hard backstop. Adds unix tests covering the graceful path and the SIGTERM-ignored → SIGKILL escalation, asserting the whole process group is reaped and the run never deadlocks on the scanner. Windows behaviour is unchanged (no process groups). Co-authored-by: J <j@multica.ai> Co-authored-by: multica-agent <github@multica.ai> |
||
|
|
bea028784a | fix(labels): reject control characters in label names (#4531) | ||
|
|
dfa384ffa2 |
fix(daemon): resolve skill bundles per-skill with size-scaled timeout (#4505) (#4530)
* fix(daemon): resolve skill bundles per-skill with size-scaled timeout (MUL-3650, #4505) Cold-start skill resolution downloaded the agent's entire bundle in one atomic request bounded by the shared 30s control-plane http.Client timeout. On a slow/jittery link a large bundle (15+ skills) could not finish the body read in 30s, and because the cache was only written after the whole batch succeeded, nothing was persisted on failure — so every dispatch re-downloaded the full bundle and timed out again, never converging. Resolve each missing bundle in its own request and cache it the moment it arrives: - daemon: per-skill resolve with a deadline scaled to the bundle's declared size (floor 30s, cap 5m, ~50KB/s floor throughput) instead of the fixed control-plane timeout; each success is persisted independently, so a dispatch that fails on one skill still caches the rest and the next dispatch only re-fetches what is missing. - client: dedicated bundleClient with no fixed Timeout (deadline comes from ctx), a singular ResolveSkillBundle, and a short transient-retry schedule. Tests cover the size-scaled timeout and the cross-dispatch incremental caching / convergence (a failed skill does not discard its siblings, and cached skills are not re-fetched). Co-authored-by: multica-agent <github@multica.ai> * fix(daemon): accept server-side skill updates in per-skill resolve (MUL-3650) Address review on #4530: resolveSkillBundle validated the returned bundle against the claim-time ref, which pinned it to the requested hash. The resolve endpoint intentionally serves the agent's current bundle and hash when the requested hash is stale (the skill can be edited between claim and prepare), so a legitimate updated bundle was rejected as invalid and the task failed. Confirm only that the server returned the requested skill (source/id), then validate self-consistency against a ref derived from the returned bundle and cache it under its own hash — matching the documented endpoint contract. 
Adds a regression test covering a stale-hash request answered with an updated bundle. Co-authored-by: multica-agent <github@multica.ai> --------- Co-authored-by: J <j@multica.ai> Co-authored-by: multica-agent <github@multica.ai> |
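The size-scaled deadline from the first commit (floor 30s, cap 5m, roughly 50KB/s floor throughput) can be sketched as below. The constant and function names are hypothetical, and the daemon may round or pad differently.

```go
package main

import (
	"fmt"
	"time"
)

const (
	minBundleTimeout = 30 * time.Second // floor from the commit message
	maxBundleTimeout = 5 * time.Minute  // cap from the commit message
	floorBytesPerSec = 50 * 1024        // ~50KB/s assumed worst-case throughput
)

// bundleTimeout scales the download deadline with the bundle's declared
// size, clamped to [minBundleTimeout, maxBundleTimeout].
func bundleTimeout(sizeBytes int64) time.Duration {
	if sizeBytes < 0 {
		sizeBytes = 0
	}
	secs := sizeBytes / floorBytesPerSec
	// Clamp in seconds first so a huge declared size cannot overflow.
	if secs >= int64(maxBundleTimeout/time.Second) {
		return maxBundleTimeout
	}
	d := time.Duration(secs) * time.Second
	if d < minBundleTimeout {
		return minBundleTimeout
	}
	return d
}

func main() {
	fmt.Println(bundleTimeout(200 * 1024))       // small bundle: floor applies
	fmt.Println(bundleTimeout(5 * 1024 * 1024))  // 5 MiB at 50KB/s
	fmt.Println(bundleTimeout(1024 * 1024 * 1024)) // very large: cap applies
}
```

The resulting duration would typically feed a context deadline for the single-skill request, since the dedicated client has no fixed Timeout of its own.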
||
|
|
5e824a94ae |
chore(channel): remove the one-time MULTICA_LARK_HUB_DISABLED cutover switch (#4527)
The lark_*->channel_* cutover (MUL-3515) is deployed to prod, and the MULTICA_LARK_HUB_DISABLED park-switch was a one-time scaffold for that rollout — the end state intentionally does not use it (prod never set the env). Remove the env-gated branch from cmd/server/main.go so the channel supervisor always starts when built; its existing nil-guard and shutdown join are unchanged. Trim migration 124's now-obsolete switch runbook to a short historical note (comment-only; 124 is already applied, so this does not re-run). Refs MUL-3515 Co-authored-by: J <j@multica.ai> Co-authored-by: multica-agent <github@multica.ai> |
||
|
|
34bd115808 |
test(execenv): fix stale test name reference in comment (#3028)
Co-authored-by: multica-agent <github@multica.ai> |
||
|
|
3adfaf4285 |
fix(execenv): support OpenClaw 2026.6.x agents schema (#3028) (#4319)
Adapts OpenClaw execenv prep to the 2026.6.x agents schema (agents.list config path removed; agents live in a sqlite registry). Case-insensitive key-missing guard + registry fallback on read, version-aware emission on write so per-task workspace pinning keeps working. Closes #3028 MUL-3643 |
||
|
|
c3d8529ba5 |
fix(channel): eliminate WaitGroup Add/Wait data race in engine.Supervisor (MUL-3620) (#4517)
The race detector caught a flaky failure on main (passes on retry): Supervisor.startSupervisor does s.wg.Add(1) under s.mu, while Supervisor.Wait calls s.wg.Wait() with no lock. Calling WaitGroup.Add concurrently with WaitGroup.Wait is a data race and undefined per the WaitGroup contract — so it only trips occasionally (it passed locally and in PR CI). Wait now blocks on stopChan (closed by Run's defer when Run returns) before calling wg.Wait(). Run is the sole caller of startSupervisor, so once Run has returned no further Add can happen and wg.Wait is race-free. WaitWithTimeout inherits the fix (it calls Wait), and its timer still bounds shutdown. This latent race existed in the original lark.Hub.Wait too; fixed properly in the generalized Supervisor. Verified: go test -race -count=300 on the flagged test and -count=8 on the whole engine package, all clean; no deadlock from the stopChan gate (every caller pairs Wait with a started Run + cancelled ctx). Co-authored-by: J <j@multica.ai> Co-authored-by: multica-agent <github@multica.ai> |
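The stopChan gate can be sketched with a reduced supervisor. This is an illustration under assumptions, not the engine's code: the type is stripped down, workers simply wait for cancellation, and runAndStop is a hypothetical driver for the example.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"sync/atomic"
)

// supervisor is a reduced model: Run is the only caller of start.
type supervisor struct {
	mu       sync.Mutex
	wg       sync.WaitGroup
	stopChan chan struct{} // closed when Run returns
	finished atomic.Int64
}

func newSupervisor() *supervisor {
	return &supervisor{stopChan: make(chan struct{})}
}

// start registers one worker. Add happens only while Run is still running.
func (s *supervisor) start(ctx context.Context) {
	s.mu.Lock()
	s.wg.Add(1)
	s.mu.Unlock()
	go func() {
		defer s.wg.Done()
		<-ctx.Done()
		s.finished.Add(1)
	}()
}

// Run starts the workers and blocks until the context is cancelled.
func (s *supervisor) Run(ctx context.Context, workers int) {
	defer close(s.stopChan)
	for i := 0; i < workers; i++ {
		s.start(ctx)
	}
	<-ctx.Done()
}

// Wait first blocks until Run has returned, so no further Add can race with
// the wg.Wait call that follows.
func (s *supervisor) Wait() {
	<-s.stopChan
	s.wg.Wait()
}

// runAndStop starts a supervisor, cancels it, and reports how many workers
// had finished by the time Wait returned.
func runAndStop(workers int) int64 {
	ctx, cancel := context.WithCancel(context.Background())
	s := newSupervisor()
	go s.Run(ctx, workers)
	cancel()
	s.Wait()
	return s.finished.Load()
}

func main() {
	fmt.Println(runAndStop(3)) // 3
}
```

As in the commit, the gate relies on every caller pairing Wait with a started Run and a cancelled context; calling Wait without ever running Run would block on stopChan.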
||
|
|
9db80a0940 |
fix(daemon): forbid mid-run progress comments in runtime brief (#4516)
A run could post running progress/plan narration as issue comments, and a review run surfaced its in-progress narration as the result instead of a conclusion (MUL-3605). Add one rule to the Output section's issue-task branch, in both the legacy and slim briefs: post exactly one comment per run — the final result, before the turn exits — and keep plans/progress in the agent's own reasoning. The pre-existing "Final results MUST be delivered … a task that finishes without a result comment is invisible" line already makes the comment mandatory, and "state the outcome, not the process" already rules out progress dumps, so no second rule is added. Chat / quick-create / autopilot keep their own delivery channels. Adds a regression test across both brief paths. Co-authored-by: J <j@multica.ai> Co-authored-by: multica-agent <github@multica.ai> |
||
|
|
3e21e58df0 |
feat(channel): channel-agnostic engine (Supervisor + Router); Feishu as channel.Channel (MUL-3620) (#4512)
* feat(channel): add channel-agnostic engine Supervisor (MUL-3620) Stage-1 (MUL-3515) shipped the channel abstraction but nothing drove it. Add the generic engine that does: - channel.InboundHandler + Config.Handler: the single shared inbound entry the engine injects into every adapter (Hermes set_message_handler model). - channel.Channel.Connect now blocks for the connection lifetime (doc), so the supervisor can tie lease renewal to connection liveness. - new package channel/engine: Supervisor, generalized out of lark.Hub. It enumerates active installations across ALL channel types (no hard-coded feishu), fences each behind the WS lease CAS, builds the platform Channel via channel.Registry, drives Connect/Disconnect with backoff+jitter, and restarts on credential rotation. Knows nothing about any platform. channel.Channel is now driven by an engine; integrations/channel has an external consumer. Feishu adapter + boot cutover follow next. Tests: supervisor_test.go covers lease CAS, reclaim, reap-on-revoke, rotation restart + token fencing, backoff on build error, lease-loss teardown, bounded release, shutdown timeout. Race-clean. Co-authored-by: multica-agent <github@multica.ai> * feat(lark): drive Feishu through the channel engine; remove lark.Hub (MUL-3620) Refactor Feishu into the first channel.Channel and cut boot over to the channel-agnostic engine.Supervisor, removing the Feishu-only Hub. - feishuChannel implements channel.Channel: Connect runs the existing WS long-conn connector for one installation; Send posts a text reply via the Lark IM API; Capabilities declares Feishu's feature set. RegisterFeishu wires it to channel.TypeFeishu — adding a platform is now 'register a Factory', no engine edit. - FeishuRuntime extracts the former Hub.handleEvent / scheduleReply: runs the Dispatcher and drives the detached typing indicator + OutcomeReplier off the connector ACK path. main.go drains it on shutdown after the supervisor stops delivering events. 
- channelInstallationStore (engine.InstallationStore) enumerates active installations across ALL channel types via the new de-hardcoded query ListAllActiveChannelInstallations; the Supervisor routes each row to its registered Factory by channel_type. Generic per-row fingerprint replaces the feishu-specific one. - boot: engine.Supervisor replaces lark.Hub.Run; MULTICA_LARK_HUB_DISABLED keeps its name for runbook compatibility. - delete hub.go / hub_pgx.go / hub_test.go; relocate the connector contract (EventConnector/EventEmitter), uuidString, and the reply-path tests (-> feishu_runtime_test.go) so coverage is preserved. No channel_* schema change. Feishu behaviour unchanged; lark + channel + engine tests green under -race; go build/vet ./... clean. Remaining (follow-up): lift the Dispatcher pipeline into a channel- agnostic engine.Router over channel.InboundMessage + resolver interfaces, so the inbound core stops being Lark-shaped and adding a channel needs zero core edits (validated by Slack, MUL-3516). Co-authored-by: multica-agent <github@multica.ai> * feat(channel): add channel-agnostic engine.Router (inbound pipeline) (MUL-3620) Generalize lark.Dispatcher's inbound pipeline into engine.Router: the single shared channel.InboundHandler the Supervisor injects into every Channel. It routes by ChannelType to a registered ResolverSet and runs the same ordered pipeline for every platform (install route -> two-phase dedup -> group @bot filter -> identity+membership -> ensure session -> append+mark -> /issue -> debounced run), then drives the detached OutboundReplier + typing indicator. Platform specifics live behind resolver interfaces (InstallationResolver, IdentityResolver, Deduper, SessionBinder, Auditor, OutboundReplier, TypingNotifier) + shared services (IssueCreator/TaskEnqueuer/SessionReader). Adding a platform is 'register a ResolverSet', not 'edit the Router'. Outcome / DropReason values match the legacy lark ones 1:1. 
Additive: lark.Dispatcher untouched and still wired; the feishu ResolverSet, the cutover, and the old-path removal land next. channel.InboundMessage gains ForceFresh (the normalized /fresh affordance). Batcher moved into engine. router_test.go covers the pipeline invariants (routing, dedup finalize states, group filter, identity, membership, ensure/append, /issue, debounce, flush offline, force-fresh, drain) with generic fakes; race-clean. Co-authored-by: multica-agent <github@multica.ai> * feat(lark): cut Feishu over to engine.Router; remove lark.Dispatcher; core no longer Lark-shaped (MUL-3620) Wire the channel-agnostic engine.Router (added in the prior commit) as the shared inbound handler and refactor Feishu into a ResolverSet, completing the generic-engine cutover. The inbound core (engine.Router) now contains zero platform specifics. - Feishu ResolverSet (feishu_resolvers.go): InstallationResolver, IdentityResolver, Deduper, SessionBinder, Auditor, OutboundReplier, TypingNotifier — each backed by the existing ChannelStore / ChatSessionService / OutcomeReplier / typing indicator, translating at the channel.InboundMessage boundary (platform fields read from Raw). origin_type stays 'lark_chat'. - feishuChannel now produces a normalized channel.InboundMessage and hands it to the engine handler via channel.Config.Handler; the old Raw round-trip through lark.Dispatcher is gone. - Remove lark.Dispatcher, FeishuRuntime, and lark's pending_batcher (the engine owns the pipeline + batcher now); their behavioural coverage moved to engine.Router tests. Surviving native types (InboundMessage / Outcome / DispatchResult) relocated to feishu_types.go. elon review nits addressed: - The channel engine (Registry + Router + Supervisor) is now built UNCONDITIONALLY, outside the MULTICA_LARK_SECRET_KEY gate, so a non-Lark deployment runs it; Feishu registers its Factory + ResolverSet only when its key is present. 
- channel.Config.Raw is now genuinely the platform config JSONB (channel_installation.config): the feishu factory builds a credentials-only Installation from it, and the workspace/agent identity is resolved per message by the Router — no full-db-row marshaling. - feishuChannel gains direct unit tests: factory config decode, Send text + reply-target mapping, Capabilities, inbound normalization + Raw round-trip, msg-type + result mapping. No channel_* schema change. go build/vet ./... clean; channel + engine + lark green under -race. Feishu behaviour preserved (pipeline logic lifted verbatim, only generalized). Co-authored-by: multica-agent <github@multica.ai> * docs(channel): fix stale comments on the channel engine boot (MUL-3620) Address Elon's review nit: three comments still described the pre-cutover behavior. - handler.go: ChannelSupervisor is built UNCONDITIONALLY now, not nil when MULTICA_LARK_SECRET_KEY is unset. - main.go: same — the supervisor always exists; only MULTICA_LARK_HUB_DISABLED parks it. - router.go: with no platform registered the store still lists active rows; Registry.Build returns ErrUnknownType and the supervisor backs off (it does not 'find no installations'). Comment-only; no behavior change. Co-authored-by: multica-agent <github@multica.ai> --------- Co-authored-by: J <j@multica.ai> Co-authored-by: multica-agent <github@multica.ai> |
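The "register a Factory" model and the `ErrUnknownType` back-off case described above can be sketched as follows. This is a minimal illustration, not the repository's code: the `Channel` interface is trimmed to one method, and the `Factory` signature, the string type key, and the helper names are assumptions.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ErrUnknownType borrows the error name from the commit message; the value is illustrative.
var ErrUnknownType = errors.New("channel: unknown type")

// Channel is a hypothetical, trimmed-down stand-in for channel.Channel.
type Channel interface {
	Type() string
}

// Factory builds a Channel from an opaque platform config blob (assumed signature).
type Factory func(rawConfig []byte) (Channel, error)

// Registry maps a channel type to its Factory and is safe for concurrent use.
type Registry struct {
	mu        sync.RWMutex
	factories map[string]Factory
}

func NewRegistry() *Registry {
	return &Registry{factories: make(map[string]Factory)}
}

// Register installs a Factory; a later registration for the same type replaces the earlier one.
func (r *Registry) Register(channelType string, f Factory) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.factories[channelType] = f
}

// Build looks up the Factory for channelType and constructs the Channel.
// An unregistered type yields ErrUnknownType so a caller can back off and retry.
func (r *Registry) Build(channelType string, rawConfig []byte) (Channel, error) {
	r.mu.RLock()
	f, ok := r.factories[channelType]
	r.mu.RUnlock()
	if !ok {
		return nil, fmt.Errorf("%w: %q", ErrUnknownType, channelType)
	}
	return f(rawConfig)
}

// isUnknownType reports whether err wraps ErrUnknownType.
func isUnknownType(err error) bool { return errors.Is(err, ErrUnknownType) }

// fakeChannel is a test double standing in for a real platform adapter.
type fakeChannel struct{ typ string }

func (c fakeChannel) Type() string { return c.typ }

func main() {
	reg := NewRegistry()
	reg.Register("feishu", func([]byte) (Channel, error) { return fakeChannel{"feishu"}, nil })

	ch, err := reg.Build("feishu", []byte(`{}`))
	fmt.Println(ch.Type(), err)

	_, err = reg.Build("slack", nil)
	fmt.Println(isUnknownType(err))
}
```

With this shape, a supervisor only ever calls `Build` with the row's `channel_type`; adding a platform means one more `Register` call at boot and no change to the supervising loop.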
||
|
|
4d7111d396 |
MUL-3605: Fix serialization of agent task claims by capacity
* fix(server): serialize agent task claims by capacity * test(server): clean up claim race fixture --------- Co-authored-by: Yanziqing25 <1519319045@qq.com> |
||
|
|
20eecfb093 | fix(projects): honor repo resource checkout refs (MUL-3593) (#4470) | ||
|
|
1ac3a03e5d |
MUL-3618: dispatch daemon feature flag snapshots (#4509)
* MUL-3618: dispatch daemon feature flag snapshots Co-authored-by: multica-agent <github@multica.ai> * MUL-3618: narrow daemon flag snapshots to process scope Co-authored-by: multica-agent <github@multica.ai> --------- Co-authored-by: Eve <eve@multica-ai.local> Co-authored-by: multica-agent <github@multica.ai> |
||
|
|
79c9158097 |
fix(issue): order sub-issues by number ASC instead of position (#4511)
ListChildIssues and ListChildrenByParents ordered by `position ASC, created_at DESC`. position is assigned by NextTopPosition as MIN(position)-1 scoped to (workspace, status), not relative to siblings, so a parent's children interleave unpredictably across creation batches and statuses. Order by `number` (a per-workspace monotonic counter) instead. ASC keeps sub-issues in stable creation order (oldest first), so a parent's plan reads top-to-bottom in the order tasks were added. Adds ordering tests covering both queries with scrambled positions and mixed statuses. Closes #4232 MUL-3362 Co-authored-by: J <j@multica.ai> Co-authored-by: multica-agent <github@multica.ai> |
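The actual fix lives in the SQL `ORDER BY`; the in-memory sketch below only illustrates why sorting by `number` gives a stable creation order when `position` values are scrambled. The `Issue` struct and its field types are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// Issue is a hypothetical, trimmed-down row; only the fields relevant to ordering are shown.
type Issue struct {
	Number   int     // per-workspace monotonic counter
	Position float64 // assigned per (workspace, status), not relative to siblings
	Title    string
}

// sortChildrenByNumber orders sub-issues oldest-first, ignoring Position entirely.
func sortChildrenByNumber(children []Issue) {
	sort.Slice(children, func(i, j int) bool {
		return children[i].Number < children[j].Number
	})
}

func main() {
	// Positions are deliberately scrambled relative to creation order.
	children := []Issue{
		{Number: 12, Position: -7, Title: "write tests"},
		{Number: 10, Position: 3, Title: "design schema"},
		{Number: 11, Position: -2, Title: "implement query"},
	}
	sortChildrenByNumber(children)
	for _, c := range children {
		fmt.Println(c.Number, c.Title)
	}
}
```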
||
|
|
cb7cc82ecb |
fix: allow same-origin attachment previews (#4504)
Co-authored-by: J <j@multica.ai> Co-authored-by: multica-agent <github@multica.ai> |
||
|
|
76c58a4ee8 |
MUL-3617: remove Gemini CLI runtime (#4503)
* fix: remove gemini cli runtime Co-authored-by: multica-agent <github@multica.ai> * fix: skip unsupported custom runtime profiles Co-authored-by: multica-agent <github@multica.ai> --------- Co-authored-by: J <j@multica.ai> Co-authored-by: multica-agent <github@multica.ai> |
||
|
|
af34b8f83a |
feat(lark): add proxy support for WebSocket connections (#4166)
Add ProxyURL field to GorillaDialer so deployments behind a corporate proxy can route Lark WebSocket connections through an HTTP CONNECT proxy. - GorillaDialer.ProxyURL: optional proxy URL parsed and applied to the underlying gorilla/websocket dialer before each DialContext call. Empty value preserves the default ProxyFromEnvironment behaviour. - Router reads MULTICA_LARK_WS_PROXY_URL env var and sets it on the production dialer. - Three new unit tests cover invalid URL, proxy-applied, and empty-URL default paths. Closes #4032 Co-authored-by: multica-agent <github@multica.ai> |
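The proxy selection described above can be sketched with the standard library alone. `proxyFor` and `resolveProxy` are hypothetical helper names; the returned function has the same signature as `http.Transport.Proxy`, which is also the shape of gorilla/websocket's `Dialer.Proxy` field. The real `GorillaDialer` wiring is not shown in the commit message and is not reproduced here.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
)

// ProxyFunc matches the signature of http.Transport.Proxy.
type ProxyFunc func(*http.Request) (*url.URL, error)

// proxyFor returns the proxy selector for a configured proxy URL.
// An empty value keeps the default environment-based behaviour.
func proxyFor(proxyURL string) (ProxyFunc, error) {
	if proxyURL == "" {
		return http.ProxyFromEnvironment, nil
	}
	u, err := url.Parse(proxyURL)
	if err != nil {
		return nil, fmt.Errorf("invalid proxy URL %q: %w", proxyURL, err)
	}
	return http.ProxyURL(u), nil
}

// resolveProxy reports which proxy host would be used to reach target.
// An empty string means a direct connection.
func resolveProxy(proxyURL, target string) (string, error) {
	pf, err := proxyFor(proxyURL)
	if err != nil {
		return "", err
	}
	req, err := http.NewRequest(http.MethodGet, target, nil)
	if err != nil {
		return "", err
	}
	u, err := pf(req)
	if err != nil {
		return "", err
	}
	if u == nil {
		return "", nil
	}
	return u.Host, nil
}

func main() {
	host, err := resolveProxy("http://proxy.internal:3128", "https://example.com/ws")
	fmt.Println(host, err)

	_, err = proxyFor("://bad")
	fmt.Println(err != nil)
}
```

The three cases here mirror the three unit tests the commit mentions: invalid URL, proxy applied, and the empty-URL default.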
||
|
|
8ad673fdb7 |
MUL-3560: gate slim runtime brief behind runtime_brief_slim feature flag (#4449)
The MUL-3560 slim runtime brief — kind-driven dispatcher, per-section
gating, prose compression for ~7k chars saved on the typical
comment-triggered task — now ships behind the `runtime_brief_slim`
feature flag wired via the framework-level service from MUL-3615.
Default: OFF in every environment (production stays on the legacy
brief that has shipped for ~2 years). Staging opts in via the YAML
rule set; ops can override per-process with `FF_RUNTIME_BRIEF_SLIM=true`.
Production is held back until staging has burned in long enough that
we are confident the slim brief does not regress agent behaviour.
Architecture (one toggle point, two code paths, both fully tested):
  buildMetaSkillContent (runtime_config.go)
    │
    ├─ useSlimBrief() → false (default)
    │    → fall through to the legacy verbose body that ships on
    │      main today, byte-for-byte unchanged, no migration risk
    │
    └─ useSlimBrief() → true
         → buildMetaSkillContentSlim (runtime_config_sections.go)
         → classifyTask → 5-way kind switch → per-section writers
BuildCommentReplyInstructions takes the same gate, so the per-turn
comment prompt and the runtime brief stay in sync on which template
they emit.
What's in this PR:
- runtime_config_flag.go (new): package-scope `runtimeFlags` atomic
pointer + `SetFeatureFlags` setter + `useSlimBrief` toggle point.
Nil-safe: a daemon that forgets to wire the service falls back to
legacy, no panic.
- runtime_config_kind.go (new): `taskKind` enum + `classifyTask` +
`hasIssueContext` predicate. Used only by the slim path.
- runtime_config_sections.go (new): the slim brief itself —
`buildMetaSkillContentSlim` + per-section `writeXxx` helpers
+ `writeAvailableCommandsQuickCreate` minimal variant +
`writeBackgroundTaskSafetySlim` compressed safety section. The
Section × Kind matrix is documented inline on
`buildMetaSkillContentSlim` and the test below checks the
dispatcher does not diverge from the spec.
- reply_instructions.go: `BuildCommentReplyInstructions` gains a
short slim-or-legacy prelude; new `buildCommentReplyInstructionsSlim`
is the compressed cookbook (defers the shell-hazard rationale to
`## Comment Formatting`).
- runtime_config.go: `buildMetaSkillContent` gains a 2-line
dispatcher at the top; the legacy body is otherwise untouched.
- runtime_config_kind_test.go (new): canaries for both paths.
- TestClassifyTask: 5 kinds + 3 tiebreak cases.
- TestTaskKindHasIssueContext: predicate semantics.
- TestSlimFlagOffUsesLegacy: nil flag service → legacy path
(renders "Get full issue details.", a legacy-only substring).
- TestSlimFlagOnUsesSlim: flag on → slim path (renders "full
issue.", a slim-only one-liner) AND must NOT render legacy
"Get full issue details.".
- TestBuildMetaSkillContentSlimKindMatrix: locks the per-kind
section set; heading match is line-anchored so inline references
don't trip absence assertions.
- TestSlimQuickCreateAvailableCommands: locks the minimal-variant
content for quick-create (issue create present, every other
Core command absent).
- TestSlimBriefIsSubstantiallyShorter: ≥ 30% reduction guard so
a future change can't accidentally re-bloat the slim path back
to legacy levels.
- cmd/server/main.go: now calls `execenv.SetFeatureFlags(flags)`
immediately after constructing the feature flag service.
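The nil-safe toggle point listed above (package-scope atomic pointer, setter, single `useSlimBrief` gate) can be sketched as follows. The identifiers `runtimeFlags`, `SetFeatureFlags`, and `useSlimBrief` come from the commit message; `FlagService`, its `IsEnabled` signature, and `buildBrief` are simplified assumptions standing in for the real feature flag service and brief builder.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// FlagService is a hypothetical stand-in for the feature flag service.
type FlagService struct {
	enabled map[string]bool
}

// IsEnabled is nil-safe: a nil service or a missing key returns the caller's default.
func (s *FlagService) IsEnabled(key string, def bool) bool {
	if s == nil {
		return def
	}
	v, ok := s.enabled[key]
	if !ok {
		return def
	}
	return v
}

// runtimeFlags holds the process-wide service; its zero value loads as nil.
var runtimeFlags atomic.Pointer[FlagService]

// SetFeatureFlags wires the service in at startup.
func SetFeatureFlags(s *FlagService) { runtimeFlags.Store(s) }

// useSlimBrief is the single toggle point; an unwired daemon falls back to legacy.
func useSlimBrief() bool {
	return runtimeFlags.Load().IsEnabled("runtime_brief_slim", false)
}

// buildBrief dispatches to one of the two code paths.
func buildBrief() string {
	if useSlimBrief() {
		return "slim"
	}
	return "legacy"
}

func main() {
	fmt.Println(buildBrief()) // service never wired
	SetFeatureFlags(&FlagService{enabled: map[string]bool{"runtime_brief_slim": true}})
	fmt.Println(buildBrief())
}
```

Keeping the nil check inside the service method means the toggle point needs no guard of its own, so forgetting to call the setter degrades to the default instead of panicking.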
Measured impact (slim vs legacy, claude provider, realistic fixture
with 2 repos + 2 skills + member initiator):
legacy = 19567 chars
slim = 11868 chars Δ = -7699 (-39.3%)
Verification:
- go vet ./internal/daemon/... ./cmd/server/... ok
- go test ./internal/daemon/... ok
- go test ./pkg/featureflag/... ok
- TestSlimBriefIsSubstantiallyShorter logs the 39.3% ratio
- TestSlimFlagOffUsesLegacy + TestSlimFlagOnUsesSlim pass both
directions, so the dispatcher is locked in code.
The pre-existing `internal/handler` test failures
(TestLeaveWorkspace_RevokesOwnRuntimes,
TestDeleteMember_CancelsTasksFromAgentReassignment,
TestDeleteMember_NoRuntimes_DeletesMember) reproduce on plain
`origin/main` with the same `relation "channel_user_binding" does
not exist` SQL error — they are a missing-migration bug from the
recent channels foundation PR (
|
||
|
|
b92e4a53fb |
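DH-106 Add the /new session command to the Feishu integration (MUL-3503) (#4396)
Lark/Feishu inbound messages gain a /new first-line command, parsed as force_fresh_session, reusing the existing daemon session-continuation gate. Co-authored-by: Wilson-G <Wilson-G@users.noreply.github.com> |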
||
|
|
4a8210912a |
feat(featureflag): framework-level feature flag system (MUL-3615) (#4496)
* feat(featureflag): framework-level feature flag system (MUL-3615) Introduces a reusable feature flag framework so future features can adopt flags without writing infrastructure code. Backend: server/pkg/featureflag (Go) - Service / Provider / Decision separation per Martin Fowler's Toggle Point / Toggle Router / Toggle Configuration pattern. - Providers: StaticProvider (rules in source control), EnvProvider (FF_<KEY> overrides for ops kill switches), ChainProvider (first-hit-wins composition). - EvalContext carried through context.Context with WithEvalContext / EvalContextFrom; supports user_id, workspace_id, free-form attributes. - PercentRollout via deterministic FNV-1a bucketing; same user always lands in the same bucket so experiments do not flap between requests. - Nil-safe Service: a nil *Service or missing flag returns the caller's default so business code never panics on a missing flag. - 100% unit-test coverage with -race; go vet clean. Frontend: packages/core/feature-flags (TypeScript) - Same vocabulary as the Go side (Decision, EvalContext, Rule, PercentRollout). FNV-1a parity ensures cross-tier bucket agreement. - FeatureFlagService + StaticProvider + ChainProvider in pure TS. - React glue: FeatureFlagsProvider, useFlag(key, default), useVariant(key, default). Hooks fall back to the default when no provider is mounted so Storybook / unit tests stay simple. - Vitest tests for service, providers, hash, and React hooks. Docs: docs/feature-flags.md — wiring, EvalContext, toggle points, backend-protection note, and the standard best-practice checklist. The framework intentionally has no third-party Go deps and no API surface beyond what real callers will need. New providers (DB, remote config, LaunchDarkly) plug in by implementing Provider; no existing caller has to change. Co-authored-by: multica-agent <github@multica.ai> * fix(featureflag): cross-tier hash parity + variant only when enabled (MUL-3615) Two must-fix issues from the PR review on #4496: 1. 
TS hash had a trailing zero separator that Go did not emit, so the same (key, identifier) bucketed differently on the two tiers. The "user lands in the same bucket on server and client" promise was broken. For example billing_new_invoice/user-42 was bucket 97 in Go and bucket 11 in TS. Fix: TS fnv1a now emits the zero separator BETWEEN parts only, never after the last one, matching Go's hash.Write byte stream exactly. Verified by parallel golden tests on both sides that pin five (key, identifier) -> bucket triples; if either side drifts both tests fail and one must be brought back in sync. 2. StaticProvider returned `Rule.Variant` regardless of whether the rule evaluated to enabled=true. A 0%-rollout user, a deny-listed user, or a default-off user would see variant="experiment-v2", so callers branching on Variant() would route control users into the experiment arm. Fix: Rule.Variant is now the ON-variant only. When the rule evaluates to enabled=false the Decision's variant is the canonical "off", regardless of what Rule.Variant says. Documented as a behavior contract in the Rule godoc / JSDoc and covered by regression tests on both sides. Tests: - go test -race ./pkg/featureflag/... : all green (1.58s). - pnpm --filter @multica/core test : 661/661 (3 new). - pnpm --filter @multica/core typecheck: clean. Co-authored-by: multica-agent <github@multica.ai> * fix(featureflag): hash UTF-8 bytes on the TS side for cross-tier parity (MUL-3615) Follow-up review on PR #4496 caught that the previous hash fix was only correct for ASCII input. The TS side used `charCodeAt`, which returns UTF-16 code units, while the Go side hashes the UTF-8 byte representation. Any non-ASCII flag key or identifier — Chinese flag names, accented user IDs, emoji — would bucket differently on backend vs frontend, silently breaking the "same user, same bucket" promise the PR description makes. 
Concretely: flag/é Go 53 vs TS-old 68 flag/🦄 Go 82 vs TS-old 75 实验/user-1 Go 90 vs TS-old 4 flag/用户-1 Go 95 vs TS-old 2 Fix: replace per-char charCodeAt with a module-level `TextEncoder` ('utf-8') and hash each encoded byte. After the fix all four cases above match Go exactly, and the existing ASCII cases continue to match. The cross-language golden tables on both sides now include the 5 new non-ASCII cases alongside the 5 ASCII cases, so any future regression that swaps UTF-8 for charCodeAt (or vice versa) will fail loudly on both Go and TS simultaneously. TextEncoder is part of WHATWG Encoding and is available in every evergreen browser, in Node 11+, and in Hermes (React Native) >= 0.74, which covers every runtime that imports @multica/core/feature-flags. Tests: - go test -race ./pkg/featureflag/... : all green. - pnpm --filter @multica/core test : 661/661. - pnpm --filter @multica/core typecheck : clean. Co-authored-by: multica-agent <github@multica.ai> * feat(featureflag): wire into main app config — YAML file + env override (MUL-3615) Follow-up requested by Yushen on PR #4496: make the feature flag framework configurable through the existing main-program config system instead of requiring Go code edits. multica's main app is purely env-var driven (see .env.example) with optional MULTICA_*_FILE knobs for richer config; feature flags now follow the same pattern. server/pkg/featureflag/config.go - LoadRulesFromYAMLFile(path) parses a YAML rule set into runtime Rule structs. Empty files are a valid "no flags yet" state; missing or malformed files surface a hard error so operators see misconfig the same way DATABASE_URL parse errors do. - NewServiceFromEnv composes the standard provider chain: 1. EnvProvider("FF_") (runtime kill-switch path) 2. 
StaticProvider from YAML file (declarative rule set) When MULTICA_FEATURE_FLAGS_FILE is unset, only the env layer is active and every IsEnabled call falls through to the caller's default, so the server can boot before any flag is authored. server/cmd/server/main.go - Construct the Service once at startup right after env-var warnings, fail loudly on malformed YAML, log the loaded rule count via the Service logger. The Service is held in a local `flags` variable ready to be threaded into handler.Handler / service constructors when the first flag user lands. Threading is deferred to the PR that adds the first business consumer so this PR stays a pure framework + config layer. .env.example - New "Feature flags" section documents MULTICA_FEATURE_FLAGS_FILE and the FF_<KEY> override convention, with a minimal YAML schema example inline. docs/feature-flags.md - Replace the "build a provider manually" example with the NewServiceFromEnv pattern that now matches what main.go actually does. Show the YAML schema in one place. Note the on-variant / off semantics from the previous review round. server/pkg/featureflag/doc.go - Update package doc to mention the gopkg.in/yaml.v3 dependency (already a server-level dep) instead of the now-inaccurate "no third-party dependencies" claim. Tests: - go test -race -count=1 ./pkg/featureflag/... all green; new config_test.go covers: simple YAML, full-shape YAML, empty file, missing file, malformed YAML, no env var, file-only, env-beats-file, bad file surfaces error. - go test -race -count=1 -run TestHealth ./cmd/server/... sanity check that the main.go boot path with the new wiring still passes. - go vet ./... clean. Co-authored-by: multica-agent <github@multica.ai> --------- Co-authored-by: Eve <eve@multica-ai.local> Co-authored-by: multica-agent <github@multica.ai> |
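The deterministic percentage rollout described above (FNV-1a over the UTF-8 bytes of each part, with a zero separator between parts only) can be sketched as below. The 32-bit hash width, the modulo-100 bucketing, and the function names are assumptions; the commit message does not state them, so this sketch is not guaranteed to reproduce the bucket numbers quoted above.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// bucket hashes the parts with FNV-1a and maps the result into [0, 100).
// A zero byte is written BETWEEN parts only, never after the last one.
// Go strings are already UTF-8, so []byte(p) yields the UTF-8 byte stream.
func bucket(parts ...string) uint32 {
	h := fnv.New32a() // assumption: 32-bit variant
	for i, p := range parts {
		if i > 0 {
			h.Write([]byte{0})
		}
		h.Write([]byte(p))
	}
	return h.Sum32() % 100
}

// inRollout reports whether identifier falls inside the first `percent` buckets for key.
func inRollout(key, identifier string, percent uint32) bool {
	return bucket(key, identifier) < percent
}

func main() {
	// The same (key, identifier) pair always lands in the same bucket.
	fmt.Println(bucket("billing_new_invoice", "user-42") == bucket("billing_new_invoice", "user-42"))
	fmt.Println(inRollout("billing_new_invoice", "user-42", 100))
}
```

A TypeScript counterpart would need to feed the same byte stream, which is why the follow-up fix hashes `TextEncoder` output instead of `charCodeAt` values.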
||
|
|
00b9668cd2 |
fix(autopilot): cold-start planner honors trigger.last_fired_at (MUL-3551) (#4495)
Post-deploy of the new scheduled-dispatch scheduler (PR #4444), an autopilot configured for "weekdays 17:10 Asia/Shanghai" fired at ~12:30 Beijing the day after deploy, ~4h 38m before the next scheduled time the UI showed. Traced to a cold-start regression in the planner hook:

Old behaviour
-------------
On the first tick after migration the hook found no `sys_cron_executions` row for the trigger (`latestPlan(...).Found == false`) and anchored on the trigger's `created_at`, then applied the 24h replay cap:

    after := cfg.CreatedAt
    if oldest := now.Add(-replayWindow); after.Before(oldest) {
        after = oldest // now - 24h
    }

For a trigger created days/weeks earlier and last fired by the legacy goroutine at Mon 17:10 Beijing (= Mon 09:10 UTC), this set `after = Tue 04:13 UTC - 24h ≈ Mon 04:13 UTC`. The half-open enumeration `(Mon 04:13 UTC, Tue 04:13 UTC]` STILL contained Mon 09:10 UTC, the occurrence the legacy code had already handled, so the new scheduler dispatched it again the moment it took over. The result: a SCHEDULED-source autopilot_run with planned_at = Mon 17:10 Beijing but a wall-clock dispatch at Tue ~12:30 Beijing. Timezone math was correct; the bug was purely the cold-start anchor not respecting prior-fire history.

Fix
---
The `autopilot_trigger.last_fired_at` column is maintained by both the legacy goroutine and the new scheduler (via TouchAutopilotTriggerFiredAt), so it is the authoritative "most-recent successful fire" cursor across the migration boundary. The planner hook now anchors cold-start enumeration on it:

    case latest.Found:        after = latest.PlanTime
    case lastFiredAt != zero: after = lastFiredAt
    default:                  after = cfg.CreatedAt

For the regressed case, `after = Mon 17:10 Beijing`, the next enumeration window is `(Mon 17:10, Tue 12:30]`, and Tue 17:10 is in the future: the hook returns nothing and the trigger waits quietly for Tue 17:10 as the UI promised.
For brand-new triggers (last_fired_at NULL), the original `created_at` path still applies. For long-dormant triggers the `replayWindow` cap remains.

Changes
-------
* `ListSchedulableAutopilotTriggers` SQL now returns `last_fired_at`.
* `autopilotTriggerConfig.LastFiredAt` is populated by the scope provider on every tick.
* `autopilotPlansForScope` cold-start branch uses the new anchor.

Tests
-----
* TestAutopilotScheduleJobColdStartHonorsLastFiredAt: seeds the exact dev-environment shape (created 3 days ago, last_fired_at 5 hours ago, no sys_cron_executions row), runs a tick, asserts zero exec rows AND zero autopilot_run rows. Without the fix this test produces one of each at a historical plan_time.
* TestAutopilotScheduleJobColdStartBrandNewTriggerStillFires: asserts a brand-new trigger (last_fired_at NULL) still fires its first due occurrence on cold start.

All existing `TestAutopilotScheduleJob*` tests still pass.

Refs MUL-3551
Co-authored-by: multica-agent <github@multica.ai>
Co-authored-by: Eve <eve@multica-ai.local> |
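The anchor selection from this commit can be worked through with concrete timestamps. The sketch below is an assumption-laden illustration: `coldStartAnchor`, `mustParse`, and `zeroTime` are hypothetical names, a zero `latestPlan` stands in for `latest.Found == false`, and applying the replay cap after every branch is inferred from the note that the cap remains for long-dormant triggers.

```go
package main

import (
	"fmt"
	"time"
)

// replayWindow is the 24h replay cap mentioned in the commit message.
const replayWindow = 24 * time.Hour

// zeroTime represents "no value" (no prior plan row, or last_fired_at NULL).
var zeroTime time.Time

// coldStartAnchor picks the exclusive lower bound of the enumeration window.
// Priority: latest planned execution, then last_fired_at, then created_at.
// The result is never older than now - replayWindow (assumed to apply to every branch).
func coldStartAnchor(latestPlan, lastFiredAt, createdAt, now time.Time) time.Time {
	var after time.Time
	switch {
	case !latestPlan.IsZero():
		after = latestPlan
	case !lastFiredAt.IsZero():
		after = lastFiredAt
	default:
		after = createdAt
	}
	if oldest := now.Add(-replayWindow); after.Before(oldest) {
		after = oldest
	}
	return after
}

// mustParse parses an RFC 3339 timestamp and panics on malformed input.
func mustParse(s string) time.Time {
	t, err := time.Parse(time.RFC3339, s)
	if err != nil {
		panic(err)
	}
	return t
}

func main() {
	now := mustParse("2026-06-30T04:13:00Z")       // Tue 04:13 UTC
	lastFired := mustParse("2026-06-29T09:10:00Z") // Mon 09:10 UTC = Mon 17:10 Beijing
	created := mustParse("2026-06-26T00:00:00Z")   // created days earlier

	after := coldStartAnchor(zeroTime, lastFired, created, now)
	nextDue := mustParse("2026-06-30T09:10:00Z") // Tue 17:10 Beijing

	// The window is (after, now]; the next occurrence is still in the future.
	fmt.Println(after.Format(time.RFC3339), nextDue.After(now))
}
```

With the old anchor (`created_at` capped to `now - 24h`) the window would start at Mon 04:13 UTC and include the already-handled Mon 09:10 UTC occurrence; anchoring on `last_fired_at` makes the window start exactly at that occurrence, and the half-open lower bound excludes it.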
||
|
|
ce28d0aa0e |
feat(integrations): add platform-agnostic channel foundation (MUL-3515) (#4412)
* feat(integrations): add platform-agnostic channel foundation Introduce server/internal/integrations/channel — the contract every inbound IM integration implements, so the core never learns a platform's event JSON. Four pieces: - Channel interface (Type/Connect/Disconnect/Send/Capabilities) + Factory + Config (channel_type + opaque JSON blob, maps to channel_installation). - Normalized InboundMessage/OutboundMessage envelopes + Source/MediaRef/ ReplyCtx/MsgType/ChatType. Envelope holds only cross-platform-true fields; platform specifics live in Raw, read only by the adapter. - Capability bitmask: declaration only, no degrade logic in core. - Registry: Type->Factory map, last-writer-wins, concurrency-safe. Pure package (no DB/network/platform deps). Foundation for MUL-3515; the lark cutover + lark_*->channel_* generalization land in follow-up PRs. MUL-3515 Co-authored-by: multica-agent <github@multica.ai> * feat(channel): generalize lark_* tables into channel_* (DB layer) Migration 123 creates channel_installation / channel_user_binding / channel_chat_session_binding / channel_inbound_message_dedup / channel_inbound_audit / channel_outbound_card_message / channel_binding_token. Each carries a channel_type discriminator and a JSONB config for platform-specific identifiers/credentials; cross-platform columns stay flat. Existing Feishu rows are backfilled (channel_type= 'feishu', app_secret_encrypted via base64). NO foreign keys / cascades (MUL-3515 §4) — integrity moves to the app layer in the cutover. queries/channel.sql ports the lark query surface to channel_*, JSONB-aware, plus DeleteChannelUserBindingsByWorkspaceMember / DeleteChannelChatSessionBindingBySession for the app-layer cleanup that replaces the removed cascades. lark_* tables/queries are left in place here and removed once the Go cutover lands, so this commit ships green on its own. 
Verified: sqlc generate, go build ./..., full migrate chain (1..123) on Postgres 17, and a real-data backfill spot-check (base64 round-trip, NULL-strip, functional unique index on (channel_type, app_id)). MUL-3515 Co-authored-by: multica-agent <github@multica.ai> * fix(channel): name app_id query param + multi-IM install key + null-safe binding merge Addresses review on MUL-3515 (PR #4412): - GetChannelInstallationByAppID: explicitly name params and cast app_id to ::text so sqlc emits AppID string. A bare $2 next to `config ->> 'app_id'` was mis-attributed to the JSONB config column, generating Config []byte. - channel_installation uniqueness -> (workspace_id, agent_id, channel_type), with the UpsertChannelInstallation conflict key matched. Lets one agent hold one installation per IM (feishu + slack + ...) instead of a later install clobbering an earlier one. Behaviorally identical in the current feishu-only world; "one agent, at most one IM overall" stays an app-layer rule per MUL-3515 §4, not a DB constraint. - CreateChannelUserBinding merges jsonb_strip_nulls(EXCLUDED.config) so a re-bind carrying {"union_id": null} no longer erases an already-captured union_id, restoring the old COALESCE(EXCLUDED.union_id, ...) semantics. Regenerated with sqlc v1.31.1. Verified on PG17: re-install replaces in place, feishu+slack coexist, null re-bind keeps union_id, real union_id wins. Co-authored-by: multica-agent <github@multica.ai> * feat(lark): channel-backed Feishu store + fix base64 backfill wrapping Cutover step 1 of switching the lark Go code from lark_* onto the channel_* tables (MUL-3515). Introduces the JSONB config boundary the rest of the cutover sits on, and fixes a latent backfill bug surfaced while building it. - migration 123: strip newlines from the app_secret_encrypted base64 backfill. PostgreSQL encode(...,'base64') MIME-wraps at 76 chars, and a secretbox- sealed ~72-byte secret exceeds that. 
Go's encoding/json decodes a JSON string into []byte with base64.StdEncoding, which rejects embedded newlines, so without the strip every migrated installation would fail to decrypt its app secret once reads move to channel_installation.config. - store.go: flat domain types (Installation / UserBinding / ChatSessionBinding) with field parity to the retired db.Lark* rows, plus the feishu config codec. Row->domain mappers decode the JSONB config; the secret decoder is whitespace-tolerant so legacy MIME-wrapped data still round-trips, while the encoder emits unwrapped base64. Binding config encodes an absent union_id as "{}" so the upsert's jsonb_strip_nulls merge never clobbers a stored union_id. - store_test.go: 72-byte secret round-trip, MIME-wrapped tolerance, optional null-strip, and flat-column preservation. Verified on PG17. Field parity keeps the upcoming ~190 db.LarkInstallation call sites a mechanical rename. No call sites switched yet; behavior unchanged. Co-authored-by: multica-agent <github@multica.ai> * feat(lark): route inbound integration onto channel_* + explicit membership checks Cutover step 2 (MUL-3515): switch the Feishu Go code from the lark_* queries to channel_* via a ChannelStore adapter, and replace the removed member foreign key with explicit application-layer membership checks. No user-visible behavior change. - channel_store.go: ChannelStore embeds *db.Queries and SHADOWS the ~24 lark query methods with channel_*-backed equivalents, keeping the db.Lark* signatures so the dispatcher/hub/services and their ~20k lines of tests stay untouched; the feishu JSONB config is (de)coded by store.go. Adds IsWorkspaceMember and a tx-aware WithTx. Only production wiring swaps *db.Queries for *ChannelStore. 
- Membership re-check (§4 removed the lark_user_binding -> member FK, so a binding row no longer proves current membership): * the dispatcher inbound identity step verifies membership after the binding lookup; a former member's stale binding is dropped as non_workspace_member + audited and never reaches chat_session (§4.3 safety property). * RedeemAndBind and BindInstallerTx replace the now-dead FK (23503) branch with an explicit IsWorkspaceMember gate, preserving the existing ErrBindingNotWorkspaceMember outcome without burning the token. - router wires the ChannelStore into the patcher, typing indicator, dispatcher, hub, and the union_id/region backfills; constructor-based services wrap *db.Queries internally so their signatures and nil-check tests are unchanged. Verified: go build ./... ; go vet ; gofmt ; go test -race ./internal/integrations/... (full lark suite green unchanged + new membership drop/error tests). Adapter field mappings (secret base64, union_id RMW, chat-id/open-id remaps, dedup, token, card) checked end-to-end against a PG17 channel_* schema. lark_* tables and queries remain (unused at runtime) until the S3 cleanup-hooks and S4 drop-tables/rename commits. Co-authored-by: multica-agent <github@multica.ai> * fix(channel): renumber generalization migration 123 -> 124 main merged 123_issue_stage after this branch forked, so the branch's 123_channel_generalization now collides on the migration number. The runner keys schema_migrations by full version string and would still apply both, but a duplicate number is a merge hazard and convention violation, so move the channel migration to the next free slot (124). issue_stage (ALTER issue ADD COLUMN stage) and the channel generalization touch disjoint tables; verified on PG17 that 123_issue_stage applies cleanly on a DB already carrying 124_channel_generalization, so the two are order-independent. sqlc regenerated (v1.31.1): only the migration-number comment changed. 
MUL-3515 Co-authored-by: multica-agent <github@multica.ai> * feat(channel): prune channel bindings on member removal + chat session delete MUL-3515 §4 dropped every channel_* foreign key, so the old ON DELETE CASCADE that cleared a user's channel_user_binding when they left a workspace, and a chat's channel_chat_session_binding when its chat_session was deleted, no longer fires. Re-establish that integrity in the application layer, inside the existing transactions: revokeAndRemoveMember -> DeleteChannelUserBindingsByWorkspaceMember, DeleteChatSession -> DeleteChannelChatSessionBindingBySession. Adds real-DB tests for both paths, including a scoping check that a remaining member's binding survives the prune. Verified on PG17: both new tests plus the existing revocation tests and the full handler package pass. MUL-3515 Co-authored-by: multica-agent <github@multica.ai> * fix(channel): scope Lark/Feishu store reads to channel_type='feishu' The S2 cutover routed the Feishu integration onto channel_*, but the Lark-facing ChannelStore wrappers read installation / chat-session-binding / outbound-card rows across ALL channel_type values. Once a second IM exists, that would let the Lark hub supervise a non-Feishu installation, the Lark install list show it, /lark/installations/{id} revoke another channel's row, and the outbound patcher / typing indicator act on a non-Feishu chat binding or card. Add a channel_type predicate to the six read/list channel queries and pass channelTypeFeishu from every wrapper: GetChannelInstallation, GetChannelInstallationInWorkspace, ListChannelInstallationsByWorkspace, ListActiveChannelInstallations, GetChannelChatSessionBindingBySession, GetChannelOutboundCardByTask. The S3 cleanup deletes (DeleteChannelUserBindingsByWorkspaceMember / DeleteChannelChatSessionBindingBySession) stay all-channel on purpose: a member leaving or a chat_session being deleted should clear every IM's binding. 
Adds a real-DB test that seeds a Slack installation/binding/card next to the Feishu ones and asserts the Lark wrappers never return them. MUL-3515 Co-authored-by: multica-agent <github@multica.ai> * refactor(channel): replace db.Lark* translation layer with lark domain types S2 introduced ChannelStore as a translation layer that read/wrote channel_* but kept the retired db.Lark* struct/param shapes so the dispatcher/hub/services and their ~20k lines of tests did not have to change. This collapses that layer: the store now takes and returns the package's flat domain types (Installation, UserBinding, ChatSessionBinding, InboundMessageDedup, BindingTokenRow, OutboundCardMessage) and the *Params types in params.go, with channel-neutral field names (ChannelUserID / ChannelChatID / ...). All call sites, fakes, and tests move to the domain types. No behavior change: only channel_* is read/written (as before); db.Lark* is now unused, and the lark_* tables + queries/lark.sql are removed in the next commit. Verified on PG17: go build / vet / gofmt clean, go test -race ./internal/integrations/... green (the ~20k-line fake suite), and the lark + handler suites pass. MUL-3515 Co-authored-by: multica-agent <github@multica.ai> * refactor(channel): drop lark_* tables and queries (remove old path) The Go cutover (previous commit) moved the lark package entirely onto channel_* and the domain types, leaving the lark_* tables, queries/lark.sql, and the generated db.Lark* models unused. Remove them per the design (§5: replace, do not keep both): migration 125 drops the seven lark_* tables (data already lives in channel_* since migration 124), and queries/lark.sql is deleted + sqlc regenerated, removing the db.Lark* models and lark query methods. The 125 down recreates the authoritative pre-drop schema (bot_union_id, region, per-installation dedup PK, thread-reply columns). 
Verified on PG17: fresh migrate up ends with lark_* gone + channel_* present; isolated 125 down/up round-trips correctly; go build / vet / gofmt clean; go test -race ./internal/integrations/... and the handler suite pass.
MUL-3515
Co-authored-by: multica-agent <github@multica.ai>
* fix(migrations): remove trailing blank line at EOF of 125 down migration
git diff --check flagged a blank line at EOF of 125_drop_lark_tables.down.sql (a pg_dump-generation artifact). Whitespace only; the recreate SQL is unchanged.
MUL-3515
Co-authored-by: multica-agent <github@multica.ai>
* refactor(channel): defer lark_* table drop to a follow-up migration
Preflight deploy review: dropping lark_* in the same release that cuts over (old migration 125) is not rollback/rolling-safe — the v0.3.27 release still reads lark_*, so a rolling deploy or a post-deploy code rollback would hit "relation does not exist".
Remove the drop and keep the old tables for one release (standard expand/contract): migration 124 already backfilled lark_* -> channel_*, the new code reads/writes only channel_*, and the physical drop moves to a separate cleanup migration once this ships and is observed.
The lark_* tables remain in the schema, so sqlc regenerates the (now unused) db.Lark* models; queries/lark.sql stays deleted (the new code uses channel_*).
No code path reads lark_* — only the destructive drop is deferred, keeping the design's no-compat-layer / no-dual-write rule while being deploy-safe.
MUL-3515
Co-authored-by: multica-agent <github@multica.ai>
* fix(channel): skip orphaned installations in hub-boot active scan
Preflight deploy review: channel_installation dropped the workspace/agent FK (MUL-3515 §4), so unlike lark_installation it does not cascade away when its workspace is deleted or its agent is hard-deleted (e.g. runtime teardown).
The hub-boot query then keeps opening a WebSocket for a bot whose owner is gone.
JOIN ListActiveChannelInstallations to live workspace + agent so an orphaned installation is never connected, uniformly for every deletion path.
The JOIN matches the old ON DELETE CASCADE semantics (row existence, not agent archival), so an archived-but-present agent's installation is still listed; the orphaned row's encrypted secret is thereby never decrypted/used.
Tests: a real-DB handler test asserts a deleted-workspace/agent installation and a non-Feishu one are both excluded; the lark scope test's active-list assertion moved there since the JOIN now needs real workspace/agent fixtures.
(Physically deleting dormant orphaned channel rows on workspace/agent deletion is a separate app-layer-cleanup follow-up.)
MUL-3515
Co-authored-by: multica-agent <github@multica.ai>
* docs(channel): document non-rolling cutover constraint for the lark->channel migration
Elon deploy review: keeping the lark_* tables (deferred drop) stops old v0.3.27 code from crashing, but is not full expand/contract.
Migration 124 is a one-time backfill; afterwards new code runs on channel_* (lease + dedup on channel_*) while pre-cutover code runs on lark_* (lease + dedup on lark_*).
If both run concurrently during a rolling deploy, each side claims the same Feishu bot's WS lease on its own table and double-processes inbound events.
This release therefore requires a NON-ROLLING cutover (stop the old hub before applying migration 124 + starting new code; rollback is not lossless once new code writes channel_*).
Documented where deployers/reviewers see it: migration 124 header gains a ROLLOUT note; the channel_store.go header is corrected (lark_* tables are retained one release for rollback safety, not "gone"; the store still never touches them).
Comment-only — no schema/codegen/behavior change.
MUL-3515
Co-authored-by: multica-agent <github@multica.ai>
* feat(lark): add MULTICA_LARK_HUB_DISABLED switch for the channel cutover
The lark_*->channel_* cutover needs a way to make the Feishu bot briefly unavailable WITHOUT taking down the whole multica-api process — the Lark hub is a goroutine inside it, not a separate Deployment.
MULTICA_LARK_HUB_DISABLED=true parks the hub at startup: the API serves HTTP normally but never claims a WS lease or opens a Feishu connection.
Rollout (see migration 124 ROLLOUT note): ship the new release with the flag SET so new pods run API-only while old pods (hub on lark_*) drain during the rolling deploy — the two hubs never overlap.
After the old pods are gone and migration 124 has run, flip the flag off; the new hub comes up on channel_*.
The old backend does NOT need this switch — its hub stops when k8s terminates the old pods, not via a flag.
Nil-ing LarkHub reuses the existing not-configured path so both the startup start and the shutdown join skip it.
MUL-3515
Co-authored-by: multica-agent <github@multica.ai>
* docs(channel): point migration 124 ROLLOUT note at the hub-disable switch
Refine the rollout note to use MULTICA_LARK_HUB_DISABLED for a bot-only cutover (new pods serve API with the hub parked while old pods drain; flip the switch off after the migration), instead of the earlier whole-API recreate.
Comment-only.
MUL-3515
Co-authored-by: multica-agent <github@multica.ai>
* docs(channel): fix migration 124 rollout order and document self-host cutover
The previous ROLLOUT note shipped the new (channel_*) build before running migration 124, so the channel_*-backed HTTP paths (installation list/install/revoke, chat-session delete, member revoke) would 500 in the window between new-pod boot and the deferred migration.
Restate the runbook around two explicit invariants — channel_* must exist before the new build serves those paths, and the old/new hubs must never overlap — and order the steps so channel_* is created first (park old hub -> snapshot -> deploy parked new build -> unpark).
Document that default self-host (entrypoint migrate + single-replica Recreate) satisfies both invariants automatically and needs no manual steps; only prd / multi-replica rolling self-host needs the switch procedure.
Clarify in main.go that the hub-park switch is generation-agnostic (parks whichever hub the build carries), which is what enables the preparatory release.
Refs MUL-3515
Co-authored-by: multica-agent <github@multica.ai>
---------
Co-authored-by: J <j@multica.ai>
Co-authored-by: multica-agent <github@multica.ai>
|
||
|
|
3103ed1082 |
fix(agent): surface Antigravity provider log failures (#4494)
Co-authored-by: J <j@multica.ai>
Co-authored-by: multica-agent <github@multica.ai>
|
||
|
|
131ca80a6c |
refactor(autopilot): migrate scheduled dispatch to scheduler.Manager (3/3 MUL-3551) (#4444)
* refactor(autopilot): migrate scheduled dispatch to scheduler.Manager
PR 3 of 3 for the scheduled-Autopilot refactor on MUL-3551.
Replaces the legacy cmd/server/autopilot_scheduler.go goroutine
(30 s app-clock polling, app-time cron advancement, weak crash
recovery) with a JobSpec registered on the existing
scheduler.Manager. sys_cron_executions is now the lease + audit
table for scheduled Autopilot occurrences, and the unique key on
(job_name, scope_kind, scope_id, plan_time) is the primary
guarantee that the same planned fire time cannot produce two runs.
What changed
* server/internal/scheduler/jobs_autopilot.go
New AutopilotScheduleDispatchJob factory:
- scope_kind = "autopilot_trigger", scope_id = trigger.id
- PlansForScope hook (from PR 1) enumerates cron occurrences
in (lastPlan, dbNow] and collapses missed fires to the most
recent one (CatchUpLatestOnly — same policy the legacy
goroutine had, now provable via a one-row-per-tick audit).
- The handler re-loads trigger + autopilot on every invocation, so
a between-tick state change (paused, disabled, deleted) takes
effect immediately and is recorded as a no-op SUCCESS row
with skipped_reason in the result JSON.
- Calls AutopilotService.DispatchAutopilotForPlan (from PR 2)
for the actual run creation; that path is itself idempotent
on (trigger_id, planned_at), so a stale-steal retry reuses
the run created by the prior attempt instead of duplicating.
- RunTimeout=2m, StaleTimeout=5m, HeartbeatInterval=30s,
AllowStaleReentry=true, MaxAttempts=3, RetryBackoff
[1m, 5m, 15m], MaxPlansPerTick=5 (safety cap).
* server/internal/scheduler/manager.go
Manager.runOnce promoted to RunOnce (exported) so external test
packages can drive deterministic ticks; existing call sites in
this package + cmd/server tests updated.
* server/internal/service/cron.go
NextOccurrenceAfterUTC and NextOccurrencesUTC: cron evaluators
that take an explicit "now" instant. Callers pass dbNow() so
schedule decisions stay consistent across app instances with
clock skew. Legacy ComputeNextRun is preserved (delegating to
NextOccurrenceAfterUTC with time.Now()) for the display-only
autopilot_trigger.next_run_at write path — scheduling decisions
no longer use it.
* server/pkg/db/queries/autopilot.sql
ListSchedulableAutopilotTriggers replaces the legacy
ClaimDueScheduleTriggers (the new path no longer mutates
autopilot_trigger.next_run_at on claim). RecoverLostTriggers
removed — sys_cron_executions lease theft now handles crash
recovery without an in-handler restart sweep.
* server/cmd/server/main.go
The "go runAutopilotScheduler(...)" line is gone. The new
JobSpec is registered alongside TaskUsageHourlyJob on the
existing schedulerMgr (still using sweepCtx for lifecycle).
* server/cmd/server/autopilot_scheduler.go DELETED.
Tests
* server/internal/service/cron_test.go — unit tests for the cron
helpers: timezone-aware enumeration, half-open (after, until]
window, plan_time-exclusive "after", invalid inputs surface
parse errors, and the "ignores wall clock" property the
scheduler relies on.
* server/cmd/server/autopilot_schedule_job_test.go — DB-backed
integration tests:
- DispatchesOnce: one tick → 1 SUCCESS exec row + 1
autopilot_run with planned_at set; a second tick leaves
both counts unchanged.
- MissedSchedulesCollapse: an hour of missed */5 fires
produce a single autopilot_run, not 12.
- CrashRecovery: simulated stale RUNNING lease at the same
plan_time → second tick reclaims it and DOES NOT duplicate
autopilot_run.
- TwoRunnersSingleWinner: two concurrent
scheduler.Manager instances on the same trigger →
per-plan_time uniqueness holds (sys_cron_executions never
has two RUNNING rows at the same plan_time, autopilot_run
count == exec row count).
- DisabledTriggerSkips: a trigger disabled between
scope-list and tick produces no exec row.
- PausedAutopilotSkipsAtHandler: an autopilot paused after
the first tick does not produce a new exec row.
- BadCronFailsLoudly: an invalid cron expression never fires
dispatch (parse error surfaces in the plan hook).
Existing autopilot listener / squad / dispatch tests still
pass.
* server/internal/scheduler/plans_for_scope_test.go from PR 1
still passes (RunOnce rename only).
Verification
* go build ./...
* go vet ./...
* go test ./internal/scheduler ./internal/service ./cmd/server
./internal/handler — all green.
Rollback
* Reverting this commit re-introduces the legacy goroutine.
Migration 124 (PR 2) and the scheduler hook (PR 1) stay in
place. Autopilot data on disk is forward- and backward-
compatible: planned_at columns are nullable, the legacy
goroutine never reads planned_at and the new job never reads
autopilot_trigger.next_run_at.
Refs MUL-3551
Co-authored-by: multica-agent <github@multica.ai>
* fix(autopilot): scheduler hook retries FAILED plans + tighten tests
Review fix for #4444 (MUL-3551).
Blocker: hook planner skipped the FAILED-with-retry plan_time
`autopilotPlansForScope` unconditionally set
`after = latest.PlanTime` when `latest.Found`, then enumerated cron
occurrences in the half-open interval `(after, dbNow]`. That
EXCLUDED the FAILED plan_time itself, so `tryClaim`'s
"FAILED-with-retry" branch — which only fires when the planner
returns the same plan_time — never ran. A claim + crash sequence
left the FAILED row stuck at attempt<max_attempts forever and the
scheduled occurrence was lost (MUL-3551 acceptance ③).
Fix: hook now branches on `latest.RetryEligible(now)` BEFORE
computing `after`. When the most recent stored row is FAILED with
attempts remaining and next_retry_at <= dbNow, the hook returns
`[latest.PlanTime]` unchanged. tryClaim's retry-from-FAILED path
fires, attempt increments, the run is retried, and the audit row
reaches SUCCESS at the same plan_time. Mirrors the cadence
planner's `info.RetryEligible(now)` branch in manager.plansForTick.
Tests
* TestAutopilotScheduleJobCrashRecovery rewritten to actually
pin the retry contract instead of just "no duplicate run":
- assert first attempt completes at attempt=1 with a real
task_id linkage (the "complete" snapshot the retry must
reuse);
- simulate a crash mid-dispatch (status=RUNNING, expired
stale_after, ghost lease_token);
- assert tick 2 transitions the SAME exec row (same plan_time)
to status=SUCCESS at attempt=2 (proving the planner did
NOT skip past the FAILED bucket);
- assert autopilot_run stays at exactly one row, reused from
the first attempt — proving DispatchAutopilotForPlan's
complete-run reuse path is what closes the loop.
* TestAutopilotScheduleJobPausedAutopilotSkipsAtHandler rewritten
to invoke `job.Handler` directly (the previous version drove
`mgr.RunOnce` which short-circuited at the scope-list SQL
filter and never reached the handler). The new test pauses the
autopilot AFTER setup, calls the handler with a fabricated
HandlerInput, and asserts the handler returns
skipped_reason=autopilot_inactive without creating an
autopilot_run.
* TestAutopilotScheduleJobBadCronFailsLoudly renamed to
TestAutopilotScheduleJobBadCronStaysSilent and updated to
match the real implementation: a parse error in the plan hook
surfaces as a manager-level warning log, NOT a
sys_cron_executions row (no plan_time was ever claimed). The
test now asserts zero exec rows AND zero autopilot_run rows,
documenting that bad cron is a permanent configuration error
(caught at HTTP create/update time first), not a transient
failure that belongs in the retry envelope.
Refs MUL-3551
Co-authored-by: multica-agent <github@multica.ai>
---------
Co-authored-by: Eve <eve@multica-ai.local>
Co-authored-by: multica-agent <github@multica.ai>
|
||
|
|
eca6102365 |
refactor(autopilot): autopilot_run.planned_at + DispatchAutopilotForPlan (2/3 MUL-3551) (#4443)
* refactor(scheduler): add PlansForScope hook for non-cadence jobs
The current Manager.plansForTick assumes a uniform Cadence grid:
plan_times are derived via FloorPlan(db_now, Cadence). That works for
rollup_task_usage_hourly but not for the upcoming Autopilot schedule
dispatch job, where each trigger has its own cron expression and the
plan_times do not snap to a single global grid.
This change adds an optional JobSpec.PlansForScope hook. When set:
* Manager loads the latest stored plan for (job, scope) and passes
a new LatestPlanInfo to the hook (exported from the previously
private latestPlanInfo). The hook returns the plan_times to attempt
this tick.
* Cadence, CatchUpMode and CatchUpWindow are bypassed; the hook is
in full control of plan_time selection.
* MaxPlansPerTick still acts as a safety cap on the hook's output.
* All other timing fields (RunTimeout / StaleTimeout /
HeartbeatInterval / MaxAttempts / RetryBackoff / AllowStaleReentry)
and the lease/heartbeat/terminal-write SQL primitives are reused
unchanged.
JobSpec.validate now allows Cadence=0 when PlansForScope is set, and
makes the every_plan MaxPlansPerTick > 0 invariant fire only on
Cadence-driven every_plan jobs. Existing rollup_task_usage_hourly
behaviour is unchanged — that JobSpec leaves PlansForScope nil.
Tests:
* TestJobSpecValidatePlansForScopeRelaxesCadence — validate() rules.
* TestManagerPlansForScopeHookDrivesPlans — end-to-end hook delegation
through the manager (DB-backed), proving that hook-returned
plan_times go through the same tryClaim path, MaxPlansPerTick
truncates without erroring, and LatestPlanInfo is populated on the
second tick.
* TestManagerPlansForScopeHookEmptyIsNoOp — empty hook output is a
valid no-op.
No behaviour change for callers that don't set PlansForScope.
Refs MUL-3551
Co-authored-by: multica-agent <github@multica.ai>
* refactor(autopilot): add planned_at + DispatchAutopilotForPlan for occurrence idempotency
PR 2 of 3 for the scheduled-Autopilot refactor on MUL-3551.
Adds dispatch-layer idempotency for scheduled triggers. This is the
second line of defence behind the primary uq_sys_cron_execution
guarantee in sys_cron_executions: if a runner crashes between
"create autopilot_run" and "write SUCCESS in sys_cron_executions",
the next stale-steal retry re-enters dispatch with the SAME
(trigger_id, planned_at). Without a row-level guard, that retry
would create a duplicate autopilot_run, issue, and task.
Changes:
* Migration 124: ALTER TABLE autopilot_run ADD COLUMN planned_at
TIMESTAMPTZ + partial unique index on (trigger_id, planned_at)
WHERE both are NOT NULL. Manual / webhook / api dispatch leaves
planned_at NULL so they keep the existing semantics unchanged.
* autopilot.sql: CreateAutopilotRun now takes planned_at;
GetAutopilotRunByTriggerAndPlanned is the fast-path lookup used
by DispatchAutopilotForPlan to detect a prior attempt's row
without burning an INSERT.
* service.DispatchAutopilotForPlan: new entry point for scheduled
triggers that already know the canonical UTC plan_time of the
occurrence they are firing. Looks up an existing run for
(trigger_id, planned_at) and reuses it on a stale-steal retry;
otherwise dispatches normally with planned_at stamped on the
new run.
* service.DispatchAutopilot keeps its current signature for
manual / webhook / api callers (planned_at stays NULL).
* recordSkippedRun also threads planned_at so the skip path
participates in the same partial-unique guarantee.
* sqlc v1.31.1 regenerated autopilot.sql.go + models.go.
Unrelated workspace.sql.go drift restored.
Tests (against local Postgres):
* TestDispatchAutopilotForPlanIsIdempotent — first call creates a
run; second call with same (trigger, planned_at) reuses it
(autopilot_run row count stays at 1); third call with a different
planned_at on the same trigger creates a second run (proves we
are not collapsing legitimate occurrences).
* TestDispatchAutopilotForPlanRejectsZeroArgs — invalid trigger_id
and zero planned_at both fail loudly so callers cannot silently
disable the idempotency guard.
* Existing autopilot listener / squad / dispatch tests all still
pass.
This PR has no scheduler / handler / UI behaviour change on its own:
the new entry point exists but is not yet wired into the schedule
goroutine. PR 3 will register the autopilot_schedule_dispatch
JobSpec that consumes it and remove the legacy
cmd/server/autopilot_scheduler.go path.
Refs MUL-3551
Co-authored-by: multica-agent <github@multica.ai>
* fix(autopilot): DispatchAutopilotForPlan recovers partial-state runs
Review fix for #4443 (MUL-3551).
Before this change, DispatchAutopilotForPlan returned ANY existing
autopilot_run for (trigger_id, planned_at), including the
half-written rows produced when a runner crashed between
"CreateAutopilotRun" and "create downstream issue/task". The
scheduler handler would then write SUCCESS in sys_cron_executions
even though no issue or agent task was ever created, silently
losing the scheduled occurrence.
Fix:
* New isAutopilotRunComplete helper classifies an existing run:
- terminal status (completed / failed / skipped) → reuse.
- issue_created with valid issue_id → reuse (the issue
listener owns task creation from here).
- running with valid task_id → reuse (the task is queued).
- anything else → partial; must NOT short-circuit.
* New SQL RecoverPartialAutopilotRun marks a partial row FAILED
with a recovery reason AND clears its planned_at. The cleared
planned_at releases the partial-unique slot in
uq_autopilot_run_trigger_planned, letting the fresh dispatch
INSERT a new row at the same (trigger_id, planned_at) without
conflict.
* DispatchAutopilotForPlan now branches on the lookup:
complete run → return; partial run → recover-then-fresh-
dispatch; not-found → fresh dispatch. The fresh dispatch path
still goes through dispatchAutopilot, so the new row carries
the real issue_id / task_id by the time the handler returns.
* Tests: TestDispatchAutopilotForPlanRecoversPartialRun seeds a
partial run (status='running', task_id=NULL for run_only;
status='issue_created', issue_id=NULL for create_issue) and
asserts the retry:
- returns a DIFFERENT run row (no false reuse);
- leaves the partial row in status='failed', planned_at=NULL,
with a non-empty failure_reason for ops;
- produces a fresh row with planned_at preserved AND the
appropriate downstream linkage (task_id for run_only,
issue_id for create_issue);
- exactly one live row at (trigger_id, planned_at) after
recovery, so the partial-unique constraint is honoured.
Existing TestDispatchAutopilotForPlanIsIdempotent and
TestDispatchAutopilotForPlanRejectsZeroArgs still pass — the
complete-reuse path is unchanged for the realistic SUCCESS-state
case.
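Illustration only: the classification rules listed for isAutopilotRunComplete, written as a minimal sketch. The autopilotRun struct is hypothetical and uses empty strings where the real rows would hold NULL ids.

```go
package main

import "fmt"

// autopilotRun is a hypothetical, trimmed-down view of an autopilot_run row.
// An empty IssueID or TaskID stands for a NULL column.
type autopilotRun struct {
	Status  string
	IssueID string
	TaskID  string
}

// isRunComplete reports whether an existing run can be reused as-is on a
// retry. A false result marks a partial row that must be recovered before
// a fresh dispatch.
func isRunComplete(r autopilotRun) bool {
	switch r.Status {
	case "completed", "failed", "skipped":
		return true // terminal status
	case "issue_created":
		return r.IssueID != "" // the issue listener owns task creation
	case "running":
		return r.TaskID != "" // the task is already queued
	default:
		return false // anything else is partial
	}
}

func main() {
	fmt.Println(isRunComplete(autopilotRun{Status: "running", TaskID: "task-1"}))
	fmt.Println(isRunComplete(autopilotRun{Status: "running"}))
}
```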
Refs MUL-3551
Co-authored-by: multica-agent <github@multica.ai>
---------
Co-authored-by: Eve <eve@multica-ai.local>
Co-authored-by: multica-agent <github@multica.ai>
|
||
|
|
f9ed62f075 |
refactor(scheduler): add PlansForScope hook for non-cadence jobs (#4442)
The current Manager.plansForTick assumes a uniform Cadence grid:
plan_times are derived via FloorPlan(db_now, Cadence). That works for
rollup_task_usage_hourly but not for the upcoming Autopilot schedule
dispatch job, where each trigger has its own cron expression and the
plan_times do not snap to a single global grid.
This change adds an optional JobSpec.PlansForScope hook. When set:
* Manager loads the latest stored plan for (job, scope) and passes
a new LatestPlanInfo to the hook (exported from the previously
private latestPlanInfo). The hook returns the plan_times to attempt
this tick.
* Cadence, CatchUpMode and CatchUpWindow are bypassed; the hook is
in full control of plan_time selection.
* MaxPlansPerTick still acts as a safety cap on the hook's output.
* All other timing fields (RunTimeout / StaleTimeout /
HeartbeatInterval / MaxAttempts / RetryBackoff / AllowStaleReentry)
and the lease/heartbeat/terminal-write SQL primitives are reused
unchanged.
JobSpec.validate now allows Cadence=0 when PlansForScope is set, and
makes the every_plan MaxPlansPerTick > 0 invariant fire only on
Cadence-driven every_plan jobs. Existing rollup_task_usage_hourly
behaviour is unchanged — that JobSpec leaves PlansForScope nil.
Tests:
* TestJobSpecValidatePlansForScopeRelaxesCadence — validate() rules.
* TestManagerPlansForScopeHookDrivesPlans — end-to-end hook delegation
through the manager (DB-backed), proving that hook-returned
plan_times go through the same tryClaim path, MaxPlansPerTick
truncates without erroring, and LatestPlanInfo is populated on the
second tick.
* TestManagerPlansForScopeHookEmptyIsNoOp — empty hook output is a
valid no-op.
No behaviour change for callers that don't set PlansForScope.
Refs MUL-3551
Co-authored-by: Eve <eve@multica-ai.local>
Co-authored-by: multica-agent <github@multica.ai>
|