multica

mirror of https://github.com/multica-ai/multica.git synced 2026-06-17 03:38:32 +02:00

Author	SHA1	Message	Date
Bohan Jiang	c8ab73d38d	MUL-3244: Bind quick-create attachments to created issues (#4062) * fix: bind quick-create attachments to created issues Co-authored-by: multica-agent <github@multica.ai> * test: use real image markdown in quick-create attachment test Co-authored-by: multica-agent <github@multica.ai> --------- Co-authored-by: J <j@multica.ai> Co-authored-by: multica-agent <github@multica.ai>	2026-06-12 16:45:38 +08:00
Bohan Jiang	c510515da7	fix: suggest daemon profiles for empty disk usage - suggest other profile workspace roots when disk-usage sees an empty selected root - include the default profile in reverse suggestions and shell-quote profile arguments - keep JSON output and explicit --workspaces-root behavior unchanged MUL-3232	2026-06-12 13:37:35 +08:00
Bohan Jiang	e4ec9dc425	MUL-2802: add skill import conflict strategies (#3997 ) * feat(skills): structured conflict + overwrite path for local skill re-import Local-skill re-import previously failed (or silently skipped) on a same-name collision and, on delete+reimport, changed the skill UUID and dropped agent bindings. This adds a structured conflict result and a creator-only overwrite write path so a re-import can update the existing skill in place. - New terminal import status `conflict` carrying { existing_skill_id, existing_created_by, can_overwrite }; can_overwrite = requester is the skill creator (canOverwriteSkillByLocalImport — intentionally narrower than canManageSkill: admins edit in-app, not via re-import). - Conflict is detected at daemon-report time (the effective name is only known once the bundle arrives) via GetSkillByWorkspaceAndName, with the unique constraint as a race backstop. - Import requests carry action=overwrite + target_skill_id, persisted through both the in-memory and Redis LocalSkillImportStore (the heartbeat → daemon payload is unchanged; overwrite is resolved server-side). - overwriteSkillWithFiles updates by target_skill_id in one tx: re-checks existence (workspace-scoped) and creator permission, then replaces description/content/config and fully replaces files (pruning files absent from the new bundle). Preserves id, created_by, created_at, name, and agent_skill bindings. Publishes skill:updated (not skill:created). - Boundaries: target deleted or permission lost → failed (no fallback to create-by-name); any mid-write error rolls back the tx, leaving the original skill untouched. Retrying a terminal request is a no-op. Tests cover: creator/non-creator conflict (can_overwrite), overwrite preserves UUID + agent binding + prunes removed files, non-creator overwrite fails, deleted target fails without create fallback, retry idempotency, and Redis round-trip of the new fields. Backend half of MUL-2701. 
Contract change: same-name local imports now return status `conflict` instead of `failed` — the Desktop/core client must be updated to consume it (sibling task). MUL-2800 Co-authored-by: multica-agent <github@multica.ai> * fix(skills): gate structured conflict behind client opt-in; guard overwrite target name Addresses review feedback on PR #3498 (MUL-2800). Backward compatibility: a same-name local import now returns the new `conflict` status only when the initiating client opts in via `supports_conflict` (an overwrite request implies it). Older clients — already-installed Desktop builds whose poll loop only understands `failed`/`timeout` — keep the legacy `failed` + "a skill with this name already exists" behavior, so upgrading the backend ahead of the client no longer regresses the import UX. This is the installed-app API-compat boundary the repo's CLAUDE.md calls out. Also: the overwrite write path now verifies the incoming effective name matches the target skill's current name (errSkillOverwriteNameMismatch -> failed), preventing a stale/wrong target_skill_id from writing one skill's content onto another. Creator-only + workspace scoping already prevent privilege escalation; this narrows the API so it can't be misused. Refactored LocalSkillImportStore.Create to a LocalSkillImportRequestInput params struct (the signature had grown to 8 positional args; the opt-in flag pushed it over). supports_conflict is persisted in both the in-memory and Redis stores. Tests: conflict tests now opt in; added a legacy-client test (no flag -> failed + legacy message) and an overwrite name-mismatch test. 
MUL-2800 Co-authored-by: multica-agent <github@multica.ai> * feat(skills): resolve local import conflicts in desktop Co-authored-by: multica-agent <github@multica.ai> * fix(skills): preserve bulk flow after conflict resolution Co-authored-by: multica-agent <github@multica.ai> * feat(cli): add skill import conflict strategies Co-authored-by: multica-agent <github@multica.ai> * fix(i18n): sync skill import locale keys Co-authored-by: multica-agent <github@multica.ai> * docs: explain skill import conflict handling Co-authored-by: multica-agent <github@multica.ai> * docs: refresh skill import source map anchors Co-authored-by: multica-agent <github@multica.ai> --------- Co-authored-by: J <j@multica.ai> Co-authored-by: multica-agent <github@multica.ai>	2026-06-11 13:00:56 +08:00
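The conflict and overwrite rules described in the commit above can be sketched as one decision function. This is a minimal sketch, not the repository's code: the type names, field names, and the `created` / `updated` statuses are assumptions made for illustration; only the `conflict` and `failed` statuses, the `supports_conflict` opt-in, the creator-only rule, the name-match guard, and the legacy error text come from the commit message.

```go
package main

import "fmt"

// Hypothetical types; the real request and result shapes are not shown in the log.
type skill struct {
	ID, Name, CreatedBy string
}

type importRequest struct {
	RequesterID      string
	Name             string // effective name, known once the bundle arrives
	Action           string // "" or "overwrite"
	TargetSkillID    string
	SupportsConflict bool // client opt-in for the structured conflict status
}

type importResult struct {
	Status            string
	ExistingSkillID   string
	ExistingCreatedBy string
	CanOverwrite      bool
	Error             string
}

func failed(msg string) importResult {
	return importResult{Status: "failed", Error: msg}
}

// resolveImport decides the terminal status of a local skill import.
func resolveImport(req importRequest, skills []skill) importResult {
	if req.Action == "overwrite" {
		var target *skill
		for i := range skills {
			if skills[i].ID == req.TargetSkillID {
				target = &skills[i]
			}
		}
		switch {
		case target == nil:
			// Deleted target: fail, never fall back to create-by-name.
			return failed("target skill not found")
		case target.CreatedBy != req.RequesterID:
			return failed("only the skill creator can overwrite")
		case target.Name != req.Name:
			return failed("incoming name does not match target skill")
		}
		// "updated" is an assumed status name for a successful overwrite.
		return importResult{Status: "updated", ExistingSkillID: target.ID}
	}
	for _, s := range skills {
		if s.Name != req.Name {
			continue
		}
		if !req.SupportsConflict {
			// Legacy clients keep the old failure shape.
			return failed("a skill with this name already exists")
		}
		return importResult{
			Status:            "conflict",
			ExistingSkillID:   s.ID,
			ExistingCreatedBy: s.CreatedBy,
			CanOverwrite:      s.CreatedBy == req.RequesterID,
		}
	}
	// "created" is an assumed status name for a fresh import.
	return importResult{Status: "created"}
}

func main() {
	skills := []skill{{ID: "s1", Name: "deploy", CreatedBy: "alice"}}
	req := importRequest{RequesterID: "bob", Name: "deploy", SupportsConflict: true}
	fmt.Printf("%+v\n", resolveImport(req, skills))
}
```

Keeping the overwrite branch separate from the by-name lookup mirrors the stated boundary that a missing target fails outright instead of silently creating a new skill.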
Naiyuan Qing	906f70a3e2	Add comment trigger preview suppression (#3792 ) * Add comment trigger preview suppression Co-authored-by: multica-agent <github@multica.ai> * Use TanStack Query for trigger preview Co-authored-by: multica-agent <github@multica.ai> * Test note comments skip create triggers Co-authored-by: multica-agent <github@multica.ai> * feat(issues): redesign comment trigger chips as avatar chips Single agent renders as avatar + presence dot + full sentence; several agents collapse to an overlapping stack + active count, mirroring the header working chip. Per-agent skip moves into a click-opened popover (hover layers stay read-only tooltips); suppression reads as brightness, not a ban glyph. Loading and preview errors render nothing. Also: share one tooltip body across chip and popover rows, invalidate cached previews after a comment lands (the enqueued task changes the dedup answer), move the preview query key into issueKeys, and drop the now-unconsumed status field from useCommentTriggerPreview. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * refactor(server): drop comment trigger wrappers kept only for tests enqueueMentionedAgentTasks and shouldEnqueueSquadLeaderOnComment had no production callers after the compute/enqueue split — the comment path goes through computeCommentAgentTriggers. Tests now exercise the compute functions directly via package-local helpers, so the legacy adapters cannot drift from the real path. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * docs(skills): sync mentioning/squads source maps with shared trigger computation The squads source map still pointed the comment-trigger contract at the pre-refactor call chain (comment.go:940 -> shouldEnqueueSquadLeaderOnComment), and the mentioning skill referenced the deleted wrapper. Re-anchor both to computeCommentAgentTriggers / computeAssignedSquadLeaderCommentTrigger / computeMentionedAgentCommentTriggers with current line numbers. 
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> --------- Co-authored-by: multica-agent <github@multica.ai> Co-authored-by: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 16:27:07 +08:00
LinYushen	70ccbd9bce	Revert "MUL-3132: harden /uploads/* (auth, no listing, nosniff, tight CSP) (#…" (#3944) This reverts commit `8ff68502fc`.	2026-06-09 14:50:56 +08:00
Bohan Jiang	998ebe97e4	fix(autopilot): fail create-issue runs on any terminal task failure (#3943) Generalize SyncRunFromLinkedIssueTask beyond Codex no-progress: any terminal create-issue task failure with no retry still in flight now fails the linked autopilot run, so it can no longer hang in issue_created (invisible to the failure-rate auto-pause monitor). - fail the linked run for any terminal task failure, gated by the existing HasActiveTaskForIssue wait-for-retry guard - remove the isNoProgressTaskFailure classifier (subsumed; drops duplicated pkg/agent marker literals) - drop the redundant GetIssue/origin lookup; GetAutopilotRunByIssue leads and short-circuits ordinary failures in one query - tests: keep no-progress regression, add agent_error (non-retryable) and retry-pending cases Follow-up to #3927. VEN-661 / VEN-662 / MUL-3164	2026-06-09 14:48:20 +08:00
Multica Eve	13e9485a3b	MUL-3130: persist stable /api/attachments/<id>/download URL in comment markdown (#3937 ) * MUL-3130: persist a stable attachment download URL in comment markdown Comment image attachments rendered as broken placeholders ~30 minutes after upload because the editor was persisting a short-lived HMAC-signed URL into the comment body. After PR #3903 (MUL-3132) hardened /uploads/* with auth, `attachmentToResponse` started signing `attachment.url` as `/uploads/<key>?exp=<unix>&sig=<HMAC>` for LocalStorage so token-auth clients could keep loading inline images. The signature has a 30-min TTL by design — but `useFileUpload` was returning that signed value as `link` and the editor was writing `![file](signed-url)` straight into the markdown, so the comment permanently captured a URL that stopped working as soon as the signature expired. The fix is to persist a stable per-attachment URL that the server can re-sign on every request: * `useFileUpload` now returns `link = /api/attachments/<id>/download` (avatar uploads without an id still fall back to `att.url` so the pre-attachment-row code paths keep working). * `DownloadAttachment` self-resolves the workspace from the attachment row instead of reading X-Workspace-Slug / X-Workspace-ID headers, and the route is registered under the auth-only group so a native browser <img>/<video> resource load (which cannot attach those headers) succeeds. Membership is checked inside the handler with a 404 deny shape so the route does not act as an IDOR oracle. * A new `GetAttachmentByIDOnly` SQL query supports the workspace- derivation step. * `AttachmentDownloadProvider` now extracts the attachment id from the stable URL when matching markdown refs to attachment records, with a fallback to the existing url-equality check for legacy comments (and S3/CloudFront markdown that points straight at the CDN). 
* `contentReferencesAttachment` covers both URL shapes for the composer / standalone-list dedup paths so an attachment uploaded before the fix and one uploaded after both deduplicate cleanly. Tests: - New unit tests for the URL helpers (16 tests, packages/core). - Backend regression test: bare `<img src>`-style request without workspace headers now succeeds for a member (200) and 404s for a non-member, replacing the previous "400 without workspace context" contract. - Existing TestDownload, TestServeLocalUpload, TestAttachmentToResponse* and the 1220 frontend views tests all pass. Refs: MUL-3130, GitHub issue #3891 Co-authored-by: multica-agent <github@multica.ai> * MUL-3130: address PR review — split markdown link from upload link, swap render src Two follow-ups from GPT-Boy's review on PR #3937. (1) Don't reroute every upload consumer through the workspace-gated download endpoint. The previous change made `useFileUpload`'s `link` field unconditionally return `/api/attachments/<id>/download` whenever the upload had an id. But `useFileUpload` is also used by avatar / logo pickers (account-tab, workspace-tab, agents/avatar-picker, squads/squad-detail-page) that persist `result.link` directly into `avatar_url`. Avatars are referenced cross-workspace (mention chips, member lists, inbox items), so binding their URL to a workspace-membership-gated endpoint would silently break cross-workspace avatar visibility. The fix splits the URL into two semantically distinct fields: - `link` — same as `att.url` (legacy contract). Avatar / logo callers continue to use this and remain on whatever URL semantics the storage backend dictates. - `markdownLink` — the stable per-attachment URL `/api/attachments/<id>/download`. Only the editor's markdown-persisting flow consumes this. Falls back to `link` for the no-workspace upload branch (where there is no attachment-row id to address). `editor/extensions/file-upload.ts` switches `image.src` and `fileCard.href` to `markdownLink ?? 
link` so comment markdown gets the stable shape while avatar callers stay on `link` unchanged. (2) Make the render-time img src loadable for token-mode clients. Persisting the stable `/api/attachments/<id>/download` URL fixes the expiry problem but the path itself sits behind `middleware.Auth`, which expects either a `multica_auth` cookie or a Bearer token in `Authorization`. Native `<img>`/`<video>` resource loads from token-mode clients (Electron's default mode, the mobile app, legacy-token web sessions) cannot attach the Authorization header, so the bare URL would 401 immediately rather than 30 minutes later. `Attachment.normalize` now runs the resolved record through a new `pickInlineMediaURL` helper that returns: - `record.download_url` when it's an absolute URL with a recognised CDN signature query (CloudFront-signed `Signature` / `Expires` / `Key-Pair-Id`, or `X-Amz-Signature` for raw S3 presigns) — these load as native resource src in any client. - else `record.url`, which on the LocalStorage backend carries a freshly-minted `/uploads/<key>?exp&sig` query whose signature IS the auth (token-mode-loadable). On non-CF S3 backends this is the raw stored URL — same behaviour as today. - else the original input URL (legacy / unresolved markdown keeps its existing path). This gives the same effect for both `kind: "record"` and `kind: "url"` attachment inputs: once a record is in hand, the rendered media src is whichever URL the current backend exposes a working signature on. Tests: - New `file-upload.test.ts` regression pinning that `markdownLink` is what lands in the markdown body when the upload result returns both a short-lived storage URL and a stable download path. 
- Updated `attachment.test.tsx` to reflect the new render-time swap (the rendered img src now follows the freshly signed URL, not the raw storage URL) and added a record-mode regression pinning the LocalStorage default — when `download_url` is the bare /api/attachments/<id>/download path, the renderer must fall through to the signed `record.url`. - Updated `chat-input.test.tsx` makeUpload helper for the new `markdownLink` UploadResult field. - 1222 frontend views tests + 507 core tests + typecheck across @multica/{core,ui,views} all pass. Refs: MUL-3130, GitHub issue #3891. Builds on `a740f7a35`. Co-authored-by: multica-agent <github@multica.ai> * MUL-3130: chat upload map keys on persisted markdownLink, not the short-lived link GPT-Boy's second-round review on PR #3937 caught a chat-only blocker left over from the previous fix. After the previous commit split `UploadResult.link` into `link` (legacy avatar/logo URL) and `markdownLink` (stable per-attachment URL persisted into markdown), the comment editor's image src + file card href correctly switched to `markdownLink ?? link`. But chat input still kept the upload-map key on the old `link`: uploadMapRef.current.set(result.link, result.id) … if (content.includes(url)) activeIds.push(id) In the LocalStorage backend `link` is the short-lived `/uploads/<key>?exp=&sig=` URL. The editor persists the stable `/api/attachments/<id>/download` URL into the message body, so `content.includes(url)` never matches and the send call drops `attachment_ids`. The attachment ends up bound only to the chat session, not to the message — agents reading message-level metadata see no attachments. Fix: key the upload map on the same value the editor actually wrote into the markdown body (`markdownLink || link`). The `content.includes(url)` check then matches and the attachment id is correctly forwarded on send. 
Tests: - Updated the chat-input mock editor to insert `markdownLink || link` into its value, mirroring the real editor's persisted-URL choice (uploadAndInsertFile in editor/extensions/file-upload.ts). Without this the mock would silently paper over the bug. - Added a regression test where the upload result returns a short-lived `link = /uploads/...?exp&sig` and a stable `markdownLink = /api/attachments/<id>/download`. Asserts (a) the message body carries the stable URL and never the signed query, and (b) the bound `attachment_ids` includes the attachment id. All 1223 frontend views tests pass (was 1222, +1 new regression). Typecheck and 507 core tests still green. Refs: MUL-3130, PR #3937 review by GPT-Boy. Builds on `f66a522d0`. Co-authored-by: multica-agent <github@multica.ai> --------- Co-authored-by: Eve <eve@multica-ai.local> Co-authored-by: multica-agent <github@multica.ai>	2026-06-09 14:26:36 +08:00
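The two URL shapes handled by the commit above (the stable `/api/attachments/<id>/download` path and a legacy stored URL) can be matched as in the sketch below. The helpers named in the commit live in the TypeScript packages; this Go version is only an illustration, and the function bodies, the `attachment` struct, and the id character set in the regular expression are assumptions.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Assumption: attachment ids consist of letters, digits, and hyphens (UUID-like).
var stableDownloadURL = regexp.MustCompile(`/api/attachments/([A-Za-z0-9-]+)/download`)

// Hypothetical record shape: an id plus the legacy stored URL.
type attachment struct {
	ID  string
	URL string
}

// attachmentIDFromURL extracts the id from the stable download path, if present.
func attachmentIDFromURL(u string) (string, bool) {
	m := stableDownloadURL.FindStringSubmatch(u)
	if m == nil {
		return "", false
	}
	return m[1], true
}

// contentReferencesAttachment reports whether markdown content points at the
// attachment through either the stable URL or the legacy stored URL.
func contentReferencesAttachment(content string, att attachment) bool {
	for _, m := range stableDownloadURL.FindAllStringSubmatch(content, -1) {
		if m[1] == att.ID {
			return true
		}
	}
	return att.URL != "" && strings.Contains(content, att.URL)
}

func main() {
	att := attachment{ID: "3f2a-77", URL: "/uploads/workspaces/w1/cat.png"}
	body := "![file](/api/attachments/3f2a-77/download)"
	fmt.Println(contentReferencesAttachment(body, att))
}
```

Matching on the id first and falling back to URL equality keeps comments written before the fix deduplicating against the same attachment rows.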
stevenayl	ee6200de25	fix(autopilot): fail no-progress issue runs (#3927) Fail create-issue autopilot runs that hang in issue_created after a Codex no-progress / semantic-inactivity task failure, so they surface as failed and count toward the failure-rate auto-pause monitor. - route failed create-issue issue tasks (no direct autopilot_run_id) into linked run sync - fail linked runs only for Codex no-progress / semantic-inactivity failures - wait when an active retry task still exists for the issue - add classifier coverage + a DB-backed listener regression VEN-661 / VEN-662 / MUL-3164	2026-06-09 14:25:04 +08:00
Bohan Jiang	42251b42fc	fix(cli): honor MULTICA_SERVER_URL in setup self-host (#3912 ) (#3938 ) * fix(cli): honor MULTICA_SERVER_URL in setup self-host `multica setup self-host` resolved the backend URL only from the --server-url flag, falling back to http://localhost:8080 when the flag was absent. It never consulted MULTICA_SERVER_URL, even though that env var is documented on the root --server-url flag and in `multica --help`, and is honored by every other command via resolveServerURL. A self-host user who set the env var instead of the flag still hit localhost and got "Server at http://localhost:8080 is not reachable". Route server-url and app-url through cli.FlagOrEnv so the documented env vars (MULTICA_SERVER_URL / MULTICA_APP_URL) are honored when the matching flag is not set, with the flag still taking precedence. userProvided now reflects flag-or-env, so an env-sourced remote URL still triggers the explicit app_url prompt. Not platform-specific despite the report. Fixes GitHub #3912. Co-authored-by: multica-agent <github@multica.ai> * fix(cli): normalize MULTICA_SERVER_URL in setup self-host MULTICA_SERVER_URL is documented as a ws:// daemon address (ws://localhost:8080/ws) and every other command normalizes it via NormalizeServerBaseURL before use. setup self-host consumed the resolved value raw and probed <url>/health, so a self-hoster who set the documented ws:// form would still fail the reachability check. Run the flag/env value through normalizeAPIBaseURL (ws->http, wss->https, strip /ws) so the documented form works and the stored server_url stays a clean http(s) base. Add a normalization test case and a focused test for the MULTICA_APP_URL env path (review nit). 
Co-authored-by: multica-agent <github@multica.ai> * docs(self-host): note setup self-host honors MULTICA_SERVER_URL / MULTICA_APP_URL Document that `setup self-host` reads the env vars when the matching flag is omitted (flag wins), and that MULTICA_SERVER_URL accepts the ws://…/ws daemon form. Added to en/zh/ja/ko quickstart for parity. Co-authored-by: multica-agent <github@multica.ai> --------- Co-authored-by: J <j@multica.ai> Co-authored-by: multica-agent <github@multica.ai>	2026-06-09 14:02:06 +08:00
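The flag-over-env precedence and the ws-to-http normalization described in the commit above might look roughly like this. It is a sketch under assumptions: `flagOrEnv` and `normalizeBaseURL` are hypothetical stand-ins for the `cli.FlagOrEnv` and `normalizeAPIBaseURL` helpers named in the message, whose real signatures are not shown in the log.

```go
package main

import (
	"fmt"
	"net/url"
	"os"
	"strings"
)

// flagOrEnv returns the flag value when set, else the env var, else the fallback.
func flagOrEnv(flagValue, envKey, fallback string) string {
	if flagValue != "" {
		return flagValue
	}
	if v := os.Getenv(envKey); v != "" {
		return v
	}
	return fallback
}

// normalizeBaseURL maps ws->http and wss->https and strips a trailing /ws,
// so a daemon-style address becomes a clean http(s) base URL.
func normalizeBaseURL(raw string) (string, error) {
	u, err := url.Parse(strings.TrimSpace(raw))
	if err != nil {
		return "", err
	}
	switch u.Scheme {
	case "ws":
		u.Scheme = "http"
	case "wss":
		u.Scheme = "https"
	case "http", "https":
		// already an API base scheme
	default:
		return "", fmt.Errorf("unsupported scheme %q", u.Scheme)
	}
	if u.Host == "" {
		return "", fmt.Errorf("missing host in %q", raw)
	}
	u.Path = strings.TrimSuffix(strings.TrimRight(u.Path, "/"), "/ws")
	return strings.TrimRight(u.String(), "/"), nil
}

func main() {
	raw := flagOrEnv("", "MULTICA_SERVER_URL", "http://localhost:8080")
	base, err := normalizeBaseURL(raw)
	if err != nil {
		fmt.Println("invalid server URL:", err)
		return
	}
	fmt.Println("probing", base+"/health")
}
```

Normalizing before the reachability probe means the documented `ws://localhost:8080/ws` form and a plain `http://localhost:8080` resolve to the same `/health` URL.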
Bohan Jiang	7dc05d28bc	fix(projects): validate project status/priority — return 400 instead of 500 (#3925 ) (#3939 ) * fix(projects): return 400 (not 500) for invalid project status/priority CreateProject/UpdateProject passed an unvalidated status/priority straight to the INSERT, so an unknown value (e.g. --status active) tripped the table's CHECK constraint and surfaced as a blanket 500 'failed to create project' with no server-side log to diagnose it (#3925). Pre-validate both enums against the column CHECK lists and return a 400 with the allowed values. Back it with isCheckViolation -> 400 for any other constrained column, and log the underlying error on genuine 500s so transient DB failures are diagnosable. MUL-3153 Co-authored-by: multica-agent <github@multica.ai> * fix(cli): validate project --status in create/update project create and project update forwarded --status to the server without checking it, while project status already validated. Share a single validateProjectStatus helper across all three so a typo fails fast with the valid list instead of a server round-trip. MUL-3153 Co-authored-by: multica-agent <github@multica.ai> --------- Co-authored-by: J <j@multica.ai> Co-authored-by: multica-agent <github@multica.ai>	2026-06-09 13:54:53 +08:00
LinYushen	9ff801f926	docs(cli): error-message conventions + sign-in copy (PR3, MUL-3104) (#3900 ) * docs(cli): add Error Messages conventions + refine sign-in copy (PR3) Final pass of the CLI error-message work (MUL-3104). - CLI_AND_DAEMON.md: new "Error Messages" section documenting the user-facing contract — friendly single-line messages, server validation passthrough, English default with automatic Chinese on a zh locale, the tiered exit codes (0/1/2/3/4/5), --debug / MULTICA_DEBUG for the full chain, and MULTICA_HTTP_TIMEOUT. - cmd_auth.go: clarify three high-frequency sign-in errors so the message states what failed and the next step — local login-callback server start (hints at port/firewall), access-token creation, and token verification (suggests retrying `multica login` and checking the token is valid/not expired). All keep %w so exit-code tiering and --debug detail are preserved. cmd_id_resolver.go is left as-is — its not-found / ambiguous-prefix messages already point at `list --full-id` and need no change. The user-facing FormatError layer is unchanged, so its existing PR1/PR2 test coverage still applies; no test asserted the old verb strings. Refs MUL-3104. PR3 of 3 (final). Co-authored-by: multica-agent <github@multica.ai> * fix(cli): make login failure guidance visible via typed user-message wrapper Addresses 张大彪's PR3 review: the refined sign-in copy was wrapped with %w, so FormatError returned the centralized HTTPError/NetworkError copy and the new guidance only appeared under --debug. - Add cli.UserMessageError + cli.WithUserMessage: a typed wrapper carrying a user-facing message that FormatError surfaces by default, recognized before the network/http branches. Unwrap() is preserved, so ExitCodeFor still classifies by the underlying typed error and --debug still prints the full original chain. 
- cmd_auth.go: wrap the OAuth access-token-creation and PAT-verification failures with WithUserMessage (OAuth copy no longer mentions a passed token, since that flow has none), and move the token-specific 'valid / not expired' hint to the real 'Enter your personal access token:' verification site (was the generic 'invalid token: %w'). - Focused tests: under a wrapped HTTPError(401) the default FormatError shows the login hint, ExitCodeFor returns ExitAuth, and --debug retains the raw chain; a wrapped NetworkError still classifies as ExitNetwork. - CLI_AND_DAEMON.md: narrow 'every error' to command errors returned to the top-level handler, noting commands like setup's fast /health probe bypass it. Refs MUL-3104, PR #3900. Co-authored-by: multica-agent <github@multica.ai> --------- Co-authored-by: multica-agent <github@multica.ai>	2026-06-09 13:15:51 +08:00
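The typed user-message wrapper from the commit above can be illustrated with Go's standard `errors.As` / `Unwrap` machinery. This is a sketch, not the repository's `cli` package: the lower-case names, the `httpError` type, the generic fallback copy, and the debug output format are assumptions; the auth exit code of 3 is the tier stated in the PR1 commit message.

```go
package main

import (
	"errors"
	"fmt"
)

// httpError is a hypothetical stand-in for the CLI's typed HTTP error.
type httpError struct{ Status int }

func (e *httpError) Error() string { return fmt.Sprintf("http status %d", e.Status) }

// userMessageError carries user-facing copy while preserving the wrapped chain.
type userMessageError struct {
	Message string
	Err     error
}

func (e *userMessageError) Error() string { return e.Message + ": " + e.Err.Error() }
func (e *userMessageError) Unwrap() error { return e.Err }

func withUserMessage(err error, msg string) error {
	if err == nil {
		return nil
	}
	return &userMessageError{Message: msg, Err: err}
}

// formatError prefers the wrapper's message over the generic per-type copy,
// and appends the raw chain only in debug mode.
func formatError(err error, debug bool) string {
	msg := "Something went wrong."
	var um *userMessageError
	var he *httpError
	switch {
	case errors.As(err, &um):
		msg = um.Message
	case errors.As(err, &he):
		msg = fmt.Sprintf("The server returned status %d.", he.Status)
	}
	if debug {
		msg += "\n[debug] " + err.Error()
	}
	return msg
}

// exitCodeFor classifies by the underlying typed error, so wrapping does not
// change the exit tier.
func exitCodeFor(err error) int {
	var he *httpError
	if errors.As(err, &he) && he.Status == 401 {
		return 3 // auth tier
	}
	return 1
}

func main() {
	err := withUserMessage(&httpError{Status: 401}, "Token verification failed. Run `multica login` again.")
	fmt.Println(formatError(err, false))
	fmt.Println("exit code:", exitCodeFor(err))
}
```

Because the wrapper implements `Unwrap`, the message layer and the exit-code layer can inspect the same error value independently, which is the property the review fix relies on.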
Multica Eve	8ff68502fc	MUL-3132: harden /uploads/* (auth, no listing, nosniff, tight CSP) (#3903 ) * MUL-3132: harden /uploads/* (auth, no listing, nosniff, tight CSP) Closes the open hardening items from the SVG XSS disclosure (security-findings-2026-06-02). The primary chain (PR #3023 / #3050) is intact; this PR addresses every remaining recommendation from the disclosure's hardening list except 'serve uploads from a separate origin' (a structural change beyond this fix). Changes: - /uploads/* now requires authentication. The route is wrapped in middleware.Auth so anonymous internet users can no longer fetch workspace attachments by guessing the URL. A new ServeLocalUpload handler then enforces the second layer: - workspaces/{wsID}/* paths require membership in wsID (uses MembershipCache for the hot path); - users/{userID}/* paths allow any authenticated user (avatars are referenced cross-workspace); - any other prefix returns 404, so a future feature cannot drop content under /uploads/<other-prefix>/ and inherit a relaxed policy by accident. Non-members see 404 (not 403) so the route does not act as an IDOR oracle for workspace IDs. - Directory listing on /uploads/* is rejected at the storage layer: empty keys, trailing-slash keys, and any key that resolves to a directory return 404 before http.ServeFile would render an HTML index. UUID filenames were obscurity, but enumerating them shouldn't be free. - Every successful /uploads/* response carries X-Content-Type-Options: nosniff and a tight per-response CSP (default-src 'none'; sandbox; frame-ancestors 'none'), overriding the application-wide CSP. This is belt-and-suspenders if a future regression weakens the Content-Disposition: attachment path. - UploadFile rejects HTML-family uploads at the edge (.html, .htm, .xhtml, .shtml, .xht, .phtml, plus a content-type denylist for text/html and application/xhtml+xml so renamed payloads cannot bypass the extension check). 
SVG and JS remain allowed because their existing serve-side defenses neutralize them and source-code attachments preview as text/plain via /api/attachments/{id}/content. Tests: - storage: TestLocalStorage_ServeFile_RejectsDirectoryListing, TestLocalStorage_ServeFile_HardeningHeaders. - handler: TestIsUploadDenied (pure), TestUploadFile_RejectsHTMLByExtension, TestUploadFile_RejectsHTMLByContentType, TestUploadFile_AllowsLegitimateImage, and the full ServeLocalUpload matrix (RequiresAuth, MemberCanRead, NonMemberDenied, RejectsDirectoryInPath, UnknownPrefixDenied, UserPrefixAllowsAnyAuthedUser). - Full server test suite passes. Co-authored-by: multica-agent <github@multica.ai> * MUL-3132: HMAC-signed query auth for /uploads/* (token-auth client compat) Addresses J's Request Changes review on PR #3903. Problem: PR #3903 wrapped /uploads/* in middleware.Auth, but native <img>/<video>/<iframe> resource loads cannot attach Authorization headers. Token-auth clients (Desktop default, legacy-token Web sessions, mobile) were breaking on inline attachment rendering even though the API itself authenticated fine. Fix: implement HMAC-signed query parameters for /uploads/, mirroring S3 + CloudFront presigned URLs. - storage.SignLocalUploadURL(rawURL, key, secret, expiry) appends '?exp=<unix>&sig=<HMAC-SHA256(key|exp)>' query params; signature is bound to one specific key, has a TTL matching CloudFront mode (defaultAttachmentDownloadURLTTL = 30 min), constant-time compared on verify. - storage.VerifyLocalUploadSignature(key, exp, sig, secret, now) rejects expired, tampered, wrong-secret, and key-mismatched signatures. - ServeLocalUpload now has two auth paths: signed-query (no Auth middleware needed; signature itself is the authority) and Bearer/cookie (membership-gated as before). Partial signed-query fails closed. - The route in router.go dispatches between the two: if both exp+sig query params are present, route to inner handler unwrapped; else wrap in middleware.Auth. 
- attachmentToResponse appends signed query to URL when the storage backend is LocalStorage. CloudFront-signed download URLs and S3 paths are unchanged. Tests: - storage: TestSignAndVerifyLocalUploadURL_RoundTrip, TestVerifyLocalUploadSignature_RejectsExpired, _RejectsTamperedSig, _BoundToKey, _RejectsWrongSecret, TestSignLocalUploadURL_PreservesExistingQuery, TestLocalUploadSignatureFromQuery_EmptyOnAbsence (7 pure tests). - handler: TestServeLocalUpload_{SignedQueryBypassesAuth, SignedQueryRejectsExpired, SignedQueryRejectsTampered, SignedQueryBoundToOneKey, PartialSignedQueryFailsClosed}, TestAttachmentToResponse_LocalStorageMintsSignedURL. Full server test suite passes. Co-authored-by: multica-agent <github@multica.ai> --------- Co-authored-by: Eve <eve@multica-ai.local> Co-authored-by: multica-agent <github@multica.ai>	2026-06-09 11:59:00 +08:00
chyax98	26ca943d45	feat(lark): add typing indicator lifecycle for inbound messages (#3860) When a message is successfully ingested, send a Typing reaction to the user's message. When the agent replies (EventChatDone) or fails (EventTaskFailed), clear the reaction before the reply is visible. - Add AddMessageReaction / DeleteMessageReaction to APIClient - Implement reaction HTTP calls in httpAPIClient - Introduce TypingIndicatorManager for per-session state tracking - Wire into Hub (add on ingest) and Patcher (clear before reply) - Skip typing for messages older than 2 minutes (WS replay guard) Co-authored-by: miaolong001 <miaolong@xd.com>	2026-06-08 19:27:08 +08:00
LinYushen	b83b41ff44	feat(cli): per-status error copy with actionable hints (PR2, MUL-3104) (#3897 ) * feat(cli): refine per-status error copy with actionable hints (PR2) Builds on PR1's translation layer. Each HTTP-status message now carries an actionable next step, in both English and Chinese: - 401: run `multica login`; plus a self-hosted / non-OAuth fallback telling the user to ask their administrator for valid credentials - 403: check the workspace / ask an admin to grant access - 404: check the ID or run the matching `list` command - 409: re-fetch the latest state and retry - 422: check values / run with --help - 429: wait and retry; reduce call frequency if it persists - 5xx: retry, contact support, and re-run with --debug for the raw response Also adds ErrorKind.String() (stable snake_case identifiers) and uses it in --debug output instead of the raw int, and clears the pre-existing gofmt dirt Eve flagged in cmd_config.go, cmd_version.go, and help.go. Tests: TestErrorKindString (all kinds + uniqueness + out-of-range fallback) and TestFormatErrorActionableHints (locks the per-status hints in EN and ZH). Refs MUL-3104. PR2 of 3. Co-authored-by: multica-agent <github@multica.ai> * test(cli): cover validation (400/422) actionable hint TestFormatErrorActionableHints omitted KindValidation, so deleting the 400/422 hint would have gone unnoticed. Add 400 and 422 cases (no server message, so the generic validation copy is used) asserting EN contains --help / expected format and ZH contains --help / 格式 / 参数. Refs MUL-3104, PR #3897. Co-authored-by: multica-agent <github@multica.ai> --------- Co-authored-by: multica-agent <github@multica.ai>	2026-06-08 16:02:09 +08:00
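The per-status classification and stable kind identifiers described in the commit above could be modeled as follows. This is a sketch under assumptions: the constant names, the snake_case strings, the out-of-range fallback format, and the exact hint wording are illustrative paraphrases, not the repository's copy.

```go
package main

import "fmt"

// errorKind is a hypothetical stand-in for the CLI's ErrorKind.
type errorKind int

const (
	kindUnknown errorKind = iota
	kindUnauthorized
	kindForbidden
	kindNotFound
	kindConflict
	kindValidation
	kindRateLimited
	kindServer
)

// Assumed snake_case identifiers, indexed by kind.
var kindNames = [...]string{
	"unknown", "unauthorized", "forbidden", "not_found",
	"conflict", "validation", "rate_limited", "server_error",
}

// String returns a stable identifier, with a fallback for out-of-range values.
func (k errorKind) String() string {
	if k < 0 || int(k) >= len(kindNames) {
		return fmt.Sprintf("error_kind(%d)", int(k))
	}
	return kindNames[k]
}

// kindForStatus maps an HTTP status code onto a kind.
func kindForStatus(status int) errorKind {
	switch {
	case status == 401:
		return kindUnauthorized
	case status == 403:
		return kindForbidden
	case status == 404:
		return kindNotFound
	case status == 409:
		return kindConflict
	case status == 400 || status == 422:
		return kindValidation
	case status == 429:
		return kindRateLimited
	case status >= 500 && status <= 599:
		return kindServer
	}
	return kindUnknown
}

// hintFor returns a paraphrased English next step for each kind.
func hintFor(k errorKind) string {
	switch k {
	case kindUnauthorized:
		return "Run `multica login`."
	case kindForbidden:
		return "Check the workspace or ask an admin to grant access."
	case kindNotFound:
		return "Check the ID or run the matching `list` command."
	case kindConflict:
		return "Re-fetch the latest state and retry."
	case kindValidation:
		return "Check the values or run with --help."
	case kindRateLimited:
		return "Wait and retry."
	case kindServer:
		return "Retry, or re-run with --debug for the raw response."
	}
	return ""
}

func main() {
	for _, status := range []int{401, 404, 422, 503} {
		k := kindForStatus(status)
		fmt.Printf("%d %s: %s\n", status, k, hintFor(k))
	}
}
```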
LinYushen	28de8b8bde	feat(cli): central error translation layer (PR1, MUL-3104) (#3892 ) * feat(cli): add central error translation layer (PR1) Introduce server/internal/cli/errors.go, a single user-facing error translation layer that collapses raw transport errors, HTTP status errors, and internal verb-wrapped chains into clear, localized messages. - ErrorKind classification (network timeout/DNS/refused/TLS/offline, 401/403/404/409/400+422/429/5xx, unknown) - NetworkError wraps transport errors and strips the raw URL from the user-facing message; classifyNetworkError categorizes via errors.As/Is with string fallbacks - HTTPError.Kind() maps status codes onto ErrorKind - FormatError: bilingual output (English default, auto-switch to Chinese on a zh LC_ALL/LC_MESSAGES/LANG locale), validation errors surface the server message; --debug / MULTICA_DEBUG appends the full raw chain - ExitCodeFor: tiered exit codes (network=2, auth=3, 404=4, validation=5, other=1) - client.go: default HTTP timeout 15s -> 30s, overridable via MULTICA_HTTP_TIMEOUT; wrap every transport Do() error as NetworkError - main.go: route errors through FormatError + ExitCodeFor, add persistent --debug flag Unit tests cover every ErrorKind, classification, language detection, exit codes, server-message extraction, and timeout parsing. Refs MUL-3104. PR1 of 3; PR2/PR3 (status-code copy refinement and per-command customization) follow separately. Co-authored-by: multica-agent <github@multica.ai> fix(cli): address review — unify command timeouts and classify all helper errors Must-fix 1: command-level contexts no longer truncate MULTICA_HTTP_TIMEOUT. Added cli.APITimeout/AtLeastAPITimeout/APIContext (budget = transport timeout + small grace, honoring MULTICA_HTTP_TIMEOUT) and replaced the hardcoded 15s context.WithTimeout in every API command (14 files, 92 sites) with cli.APIContext. The issue-create/comment path now uses APITimeout() with a 60s floor for attachment uploads. 
Must-fix 2: all API helpers now return HTTPError on status >= 400. Added a shared newHTTPError(method, path, resp) and routed GetJSON, GetJSONWithHeaders, PostJSON, PutJSON, PatchJSON, DeleteJSON, DeleteJSONWithBody, UploadFile, UploadFileWithURL, DownloadFile (and HealthCheck) through it, so issue update/status/metadata (PUT), comment list (GetJSONWithHeaders), project/label/ comment delete (DELETE) and agent/workspace/autopilot update (PUT/PATCH) all get HTTPError.Kind() classification, friendly copy, and the tiered exit code instead of the raw string + exit 1. Tests: new errors_integration_test.go drives the real helpers against a fake server and asserts FormatError copy + ExitCodeFor for 401/403/404/422/500 across all 10 helpers, plus a slow-server test proving the command context does not cancel before the transport timeout. Updated the UploadFileWithURL assertion to check for HTTPError. Refs MUL-3104, PR #3892. Co-authored-by: multica-agent <github@multica.ai> * fix(cli): make remaining fixed-timeout API commands honor MULTICA_HTTP_TIMEOUT Closes out the timeout work: the last API command paths still used a hardcoded context deadline that capped MULTICA_HTTP_TIMEOUT. Converted them to cli.AtLeastAPITimeout(<original floor>) so the env override scales them up while preserving each original lower bound: - cmd_autopilot.go autopilot trigger 30s -> AtLeastAPITimeout(30s) - cmd_attachment.go attachment download 60s -> AtLeastAPITimeout(60s) - cmd_agent.go avatar upload 60s -> AtLeastAPITimeout(60s) - cmd_skill.go skill import / search 60s -> AtLeastAPITimeout(60s) - cmd_runtime.go runtime update 150s -> AtLeastAPITimeout(150s) - cmd_login.go workspace-creation poll 10s -> AtLeastAPITimeout(10s) The login poll keeps a short 10s floor to stay responsive within its 5-minute loop, but it is NOT a silent exception: AtLeastAPITimeout means it still scales with MULTICA_HTTP_TIMEOUT. Documented in code and covered by a new subtest in TestAPITimeoutRespectsEnv. 
Refs MUL-3104, PR #3892. Co-authored-by: multica-agent <github@multica.ai> * style(cli): gofmt cmd_attachment.go to unblock backend CI Co-authored-by: multica-agent <github@multica.ai> --------- Co-authored-by: multica-agent <github@multica.ai>	2026-06-08 15:34:59 +08:00
xiaoyue26	10076ae773	MUL-3123 fix(realtime): support X-Forwarded-Host in WebSocket checkOrigin	2026-06-08 14:43:20 +08:00
LinYushen	b89b9cb4d6	test(migrate): concurrent migration race test using real Postgres (MUL-2956) (#3712 ) * test(migrate): add concurrent migration race test using real Postgres (MUL-2956) Follow-up to MUL-2923 / #3658, which added a Postgres advisory lock to serialize the migration loop across concurrent runners (multi-replica backend startup, scale-up, manual `migrate up` overlap). That PR shipped without a test because cmd/migrate/ had no harness; this commit adds it. Refactor: extract runMigrations(ctx, pool, runOptions) from main(), with the lock key, the bookkeeping table, and the file list now injectable. main() behavior is unchanged. Identifier interpolation goes through pgx.Identifier{}.Sanitize so callers can pass "schema.schema_migrations" safely. Tests (cmd/migrate/migrate_concurrent_test.go) — every case isolates itself in a unique throwaway schema and a unique lock key, so they never touch the real schema_migrations table or block real production runners that share the database. Skip cleanly when DATABASE_URL is unreachable, matching the pattern already used in internal/handler/handler_test.go and internal/metrics/business_sampler_pgsleep_test.go. - TestRunMigrationsConcurrentPending: 16 goroutines apply 5 deliberately non-idempotent migrations (bare CREATE TABLE + ALTER TABLE ADD COLUMN). Without the lock, concurrent CREATE TABLE races trip "duplicate key value violates unique constraint pg_type_typname_nsp_index" — proving the lock is doing its job. - TestRunMigrationsConcurrentAlreadyApplied: 16 goroutines hit the EXISTS no-op path against a pre-populated bookkeeping table; the state must be unchanged. - TestRunMigrationsAdvisoryLockSerializes: an external connection holds the same advisory lock; we assert that zero of the 16 runners complete during a 1 s observation window, then release the side lock and let them all finish. Catches the original MUL-2923 bug where the lock got attached to a random pooled connection. 
- TestRunMigrationsConcurrentMixedPoolStress: same pending case but with a deliberately small pool (runners/2), forcing pgxpool.Acquire contention to overlap with pg_advisory_lock contention. Verified locally: `go test -race -count=10 ./cmd/migrate/` passes in ~15 s. Mutation test (lock acquire/release replaced with `SELECT 1`) confirms the pending and lock-serializes tests both fail loudly, catching the regression they were written to detect. go.mod tidy promotes golang.org/x/sync to a direct dependency (now imported by the test for errgroup) and incidentally fixes a stale `// indirect` annotation on prometheus/client_model, which is already imported directly by internal/metrics/testutil.go. Co-authored-by: multica-agent <github@multica.ai> * test(migrate): gofmt + address review nits (MUL-2956) - gofmt -w cmd/migrate/migrate_concurrent_test.go: fixture struct field alignment. - quoteQualifiedIdentifier: actually reject identifiers with more than one dot (the previous version split on the first dot only and would silently sanitize "a.b.c" into "a"."b.c", contradicting the comment). Inline the splitter via strings.Split now that we explicitly check the component count. - Soften the test's lock-key comment from "never collide" to the accurate probabilistic statement (~1 in 2^62 collision odds with the production constant). go test -race -count=10 ./cmd/migrate/ still passes (~15 s). Co-authored-by: multica-agent <github@multica.ai> * test(migrate): direction whitelist + tidy go.mod (MUL-2956) Address two follow-ups from review: - runMigrations now whitelist-checks opts.Direction up-front and returns an error for anything that is not "up" or "down". The previous shape relied on `opts.Direction == "up"` and an else branch, so a typo or empty string would silently fall through to the rollback path. 
Add TestRunMigrationsRejectsInvalidDirection covering the empty string, "UP"/"DOWN" case mismatches, "rollback", and a whitespace-padded value; the check fires before any pool work, so the test runs without Postgres. - go mod tidy: promotes google.golang.org/protobuf to a direct dependency (it is imported directly elsewhere in the module and was stale-marked indirect). go test -race -count=10 ./cmd/migrate/ green (~15.7 s, 50/50). Co-authored-by: multica-agent <github@multica.ai> --------- Co-authored-by: wei-heshang <wei-heshang@multica.ai> Co-authored-by: multica-agent <github@multica.ai>	2026-06-08 13:33:16 +08:00
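The two validation fixes in the commit above (rejecting identifiers with more than one dot, and whitelisting the migration direction) can be illustrated with a standard-library sketch. The real code sanitizes through `pgx.Identifier{}.Sanitize`; the `quoteIdent` helper below is a simplified stand-in, and the empty-component check and the `validateDirection` name are assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// quoteIdent double-quotes one identifier component and doubles embedded
// quotes. Simplified stand-in for pgx.Identifier{}.Sanitize.
func quoteIdent(part string) string {
	return `"` + strings.ReplaceAll(part, `"`, `""`) + `"`
}

// quoteQualifiedIdentifier accepts "table" or "schema.table" and rejects
// anything with more than one dot instead of silently mis-splitting it.
func quoteQualifiedIdentifier(name string) (string, error) {
	parts := strings.Split(name, ".")
	if len(parts) > 2 {
		return "", fmt.Errorf("identifier %q has more than one dot", name)
	}
	for i, p := range parts {
		if p == "" { // assumption: empty components are also rejected
			return "", fmt.Errorf("identifier %q has an empty component", name)
		}
		parts[i] = quoteIdent(p)
	}
	return strings.Join(parts, "."), nil
}

// validateDirection is an exact-match whitelist, so a typo, an empty
// string, or a case mismatch cannot fall through to the rollback path.
func validateDirection(d string) error {
	switch d {
	case "up", "down":
		return nil
	default:
		return fmt.Errorf("invalid direction %q: want \"up\" or \"down\"", d)
	}
}

func main() {
	q, err := quoteQualifiedIdentifier("schema.schema_migrations")
	fmt.Println(q, err) // "schema"."schema_migrations" <nil>
	fmt.Println(validateDirection("UP"))
}
```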
Prince Pal	d6e00e0909	fix(daemon): fail loudly when self-restart spawn fails (#2503) * fix(daemon): fail loudly when self-restart spawn fails * fix(daemon): surface log reopen failures on restart	2026-06-08 13:02:31 +08:00
Xinmin Zeng	270d177475	fix: broken "Add a computer" command on Multica Cloud + two CLI amplifiers (MUL-3087) (#3817 ) * fix(server): recognize official cloud by frontend host in daemon setup config The 'Add a computer' dialog builds its command from /api/config's daemon_server_url/daemon_app_url, falling back to 'multica setup' when both are empty. The official cloud is meant to omit them, but the omission only fired when MULTICA_PUBLIC_URL=https://api.multica.ai. When that env is unset the server URL defaults to the frontend origin and the old guard (which required serverURL host == api.multica.ai) didn't match, so the dialog emitted 'multica setup self-host --server-url https://multica.ai' — pointing the daemon backend at the frontend (no /health, no WebSocket proxy). Identify the official cloud by its frontend host alone (multica.ai / app.multica.ai) so a missing or misconfigured MULTICA_PUBLIC_URL can no longer leak the broken self-host command. Regression from #3474. * fix(cli): probe before persisting self-host config to preserve auth on failure setup self-host wrote a fresh CLIConfig{ServerURL, AppURL} (a full overwrite that drops the saved token) and only then probed the server, returning early on failure. A failed probe therefore logged the user out and left them unconnected, with no recovery in the same command. Probe first via persistSelfHostConfigIfReachable: an unreachable server leaves the existing config — and its token — untouched (failed setup = no-op). The prober is injected so both branches are unit-tested. * fix(daemon): serve health before preflight so daemon start readiness is accurate The CLI's 'daemon start' polls the health endpoint for 15s expecting status=running, but the daemon only began serving health after preflightAuth, whose initial workspace sync detects every configured agent's version by exec'ing it (~20s cold with 8 agents). Health served too late, so a perfectly healthy daemon printed 'may not have started successfully'. 
Start the health server right after resolveAuth (which still fails fast on a missing token) and before the slow preflight, so readiness reflects the daemon core being up rather than agent-version detection finishing. * fix(daemon): gate /health readiness so daemon start can't report a false start Serving health before preflightAuth fixed the false-negative (a healthy daemon printed "may not have started"), but health still returned status:"running" unconditionally — before preflight (PAT renew + workspace sync + runtime registration) had completed. `daemon start` and the desktop treat "running" as ready, so a slow or failing preflight could be misreported as a started daemon: setup prints "connected", then the process exits or hangs in agent-version detection with no runtime registered. That is harder to diagnose than the original false-negative. Split liveness from readiness: bind/serve the health port early (so callers see a live "starting" daemon instead of connection-refused), but report status:"starting" until d.ready is set after preflight, then "running". - daemon.go: add d.ready (atomic.Bool); set it true after the background loops launch, before pollLoop. - health.go: healthHandler reports "starting" until ready, else "running". - cmd_daemon.go: `daemon start` waits for "running" with a deadline raised to 45s (covers cold-start agent detection) and a clearer "still starting" message; new daemonAlive() helper treats both "running" and "starting" as a live daemon, so the already-running guard, restart, and stop act on a starting daemon and don't double-spawn or race its listener; `daemon status` shows "starting" distinctly. Older CLIs/desktop that only know "running" safely treat "starting" as not-ready (status != "running"), so no boundary break. Tests: health reports starting-then-running; daemonAlive truth table. 
Co-authored-by: multica-agent <github@multica.ai> * fix(desktop): handle daemon "starting" health status in lifecycle The daemon now reports /health status:"starting" until preflight completes (liveness/readiness split). That made "starting" a new external contract of /health, but the Desktop daemon-manager only knew "running", so the readiness fix would have moved the CLI's false-negative into a Desktop start regression: - `daemon start` now blocks up to 45s waiting for readiness, but the Desktop spawned it via execFile({ timeout: 20_000 }). On a cold start (the ~20s agent detection this PR targets) Electron killed the CLI supervisor at 20s and reported a start failure, even though the detached daemon child kept booting — the UI flashed "stopped" then "running". Raise the timeout to 60s (must exceed the CLI's 45s startupTimeout). - The Desktop treated only raw status === "running" as a live daemon, so a daemon that was still "starting" (booting on its own or started via the CLI) showed as "stopped", and startDaemon() would spawn a second one — which the new CLI rejects as "already running", surfacing as a start error. Add daemonStatusAlive() (shared, pure, unit-tested) mirroring the Go daemonAlive() and use it for liveness: fetchHealth() surfaces a daemon-reported "starting" as state "starting" regardless of our own currentState; startDaemon()'s already-running guard and the restart-on-user-switch guard treat "starting" as an existing daemon. version-decision stays gated on "running" (readiness, not liveness) — unchanged. Verified: desktop typecheck, eslint, full vitest suite (193 tests) all pass. Co-authored-by: multica-agent <github@multica.ai> --------- Co-authored-by: J <j@multica.ai> Co-authored-by: multica-agent <github@multica.ai>	2026-06-05 17:01:23 +08:00
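The liveness/readiness split described in the commit above can be sketched as a handler that reports "starting" until a ready flag is set, plus a helper that treats both states as a live daemon. The struct layout, the JSON shape beyond the `status` field, and the `probe` helper are assumptions for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// daemon holds only the readiness flag; the real Daemon type has more state.
type daemon struct {
	ready atomic.Bool // set to true once preflight has completed
}

// healthHandler answers as soon as the port is bound (liveness) but only
// reports "running" after ready is set (readiness).
func (d *daemon) healthHandler(w http.ResponseWriter, _ *http.Request) {
	status := "starting"
	if d.ready.Load() {
		status = "running"
	}
	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(map[string]string{"status": status})
}

// daemonAlive treats both states as an existing daemon, so callers do not
// spawn a second one while the first is still starting.
func daemonAlive(status string) bool {
	return status == "running" || status == "starting"
}

// probe is a hypothetical helper that calls the handler in-process.
func probe(d *daemon) string {
	rec := httptest.NewRecorder()
	d.healthHandler(rec, httptest.NewRequest(http.MethodGet, "/health", nil))
	var body struct {
		Status string `json:"status"`
	}
	_ = json.NewDecoder(rec.Body).Decode(&body)
	return body.Status
}

func main() {
	d := &daemon{}
	fmt.Println(probe(d), daemonAlive(probe(d))) // starting true
	d.ready.Store(true)
	fmt.Println(probe(d)) // running
}
```

A caller that only understands "running" sees "starting" as not ready, which matches the compatibility note in the commit.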
Bohan Jiang	6d0b9e3918	feat(lark): prefetch surrounding group context on @-mention (MUL-3084) (#3819 ) * feat(lark): prefetch surrounding group context on @-mention (MUL-3084) In Feishu group chats the Bot only saw the single message that @-mentioned it — never the surrounding conversation — because the inbound enricher only inlined context the user explicitly attached (a quoted reply or a merge_forward), and the API client had no way to list a chat's history. Add APIClient.ListChatMessages (GET /open-apis/im/v1/messages, container_id_type=chat, ByCreateTimeDesc, page_size clamped to Lark's 50 cap) and, for a group message addressed to the Bot, prefetch a bounded window of recent messages and inline them as a <recent_context> block ahead of the user's own message. The trigger and any quoted parent are excluded so nothing is duplicated; speakers are labeled positionally (User 1/2 / Bot); failures degrade to a visible placeholder and never block ingestion. Window size is configurable via InboundEnricherConfig.RecentContextSize (<=0 disables); production wires DefaultRecentContextSize (20). One list call per addressed turn keeps the fetch within the inbound ACK / EnrichTimeout budget. Co-authored-by: multica-agent <github@multica.ai> * feat(lark): anchor group context window to trigger time, default 10 Address review feedback on MUL-3084: - Anchor the recent-context prefetch to the trigger message's time: thread the message create_time through InboundMessage and pass it as the list end_time (millis -> seconds), so the window is the conversation up to the @-mention rather than whatever is newest when the slightly-later prefetch HTTP call runs. end_time is omitted when the time is missing/unparseable (falls back to newest N). - Lower DefaultRecentContextSize from 20 to 10. 
Co-authored-by: multica-agent <github@multica.ai> * docs(lark): clarify recent-context persistence stance and fetch-window semantics Co-authored-by: multica-agent <github@multica.ai> * fix(lark): region-aware doJSON for ListChatMessages after rebase origin/main merged #3815 (Lark dual-region support), which changed doJSON to take a per-call baseURL resolved via resolveBaseURL(creds). Adapt the new ListChatMessages call to that signature so the backend build passes against latest main, and refresh the now-stale ListMessagesParams comment (EndTime is exposed). Co-authored-by: multica-agent <github@multica.ai> --------- Co-authored-by: J <j@multica.ai> Co-authored-by: multica-agent <github@multica.ai>	2026-06-05 16:37:49 +08:00
Bohan Jiang	6ac8314711	feat(lark): support both Feishu and Lark from one deployment (MUL-3083) (#3815 ) * feat(lark): serve Feishu and Lark from one deployment, per installation The Lark integration was locked to a single open-platform host chosen deployment-wide (MULTICA_LARK_HTTP_BASE_URL / _CALLBACK_BASE_URL, defaulting to open.feishu.cn), so one deployment could talk to only the mainland Feishu cloud OR Lark international — never both. Teams on the other tenant could not use the integration at all. Make the host per-installation. The device-flow installer already auto-detects the tenant (Lark emits tenant_brand="lark" mid-poll); we now persist that as lark_installation.region, carry it on InstallationCredentials.Region, and resolve the open-platform host per call (REST + WS bootstrap) from the region. An explicit cfg.BaseURL (env / httptest) still overrides every region, so existing tests and staging/proxy setups keep working. - migration 116: lark_installation.region TEXT NOT NULL DEFAULT 'feishu' CHECK (region IN ('feishu','lark')) — existing rows are all mainland. - lark.Region enum + OpenPlatformBaseURL/RegionOrDefault helpers. - registration: thread the detected region into finishSuccess so the install-time GetBotInfo hits the right cloud AND the row records it. - every credential-build site (patcher, replier, WS provider, union_id backfill) copies region off the installation row. - region is part of the WS supervisor fingerprint so a re-install that switches cloud restarts the connection. - API: surface region on the installation listing DTO. MUL-3083 Co-authored-by: multica-agent <github@multica.ai> * feat(lark): surface installation region in settings UI Read the per-installation region off the listings response: build the "Manage in Lark" dev-console host from it (open.feishu.cn vs open.larksuite.com instead of a hardcoded mainland host) and render a Feishu / Lark badge on each connected bot. 
The field is optional and defaults to Feishu when an older server omits it (API-compat). Adds the region_feishu / region_lark labels to all four locales. MUL-3083 Co-authored-by: multica-agent <github@multica.ai> * docs(lark): document simultaneous Feishu + Lark support The cloud each bot belongs to is now auto-detected at install and stored per installation, so one deployment serves both. Replace the old "point MULTICA_LARK_HTTP_BASE_URL at larksuite for international tenants" guidance (now just an optional override) in all four locales. MUL-3083 Co-authored-by: multica-agent <github@multica.ai> * fix(lark): repair legacy Lark-international installs on upgrade Review follow-up (MUL-3083). Migration 116 backfilled every existing lark_installation to region='feishu', assuming all historical rows were mainland. But self-host deployments could already run Lark international via the deployment-wide MULTICA_LARK_HTTP_BASE_URL override, so those rows are really Lark — clearing the override after upgrade (which the new docs invite) would route them to open.feishu.cn and break them. Add a one-shot startup repair, BackfillRegionFromLegacyOverride, fired off the hot path like BackfillBotUnionIDs: when the deployment's global base-URL override targets open.larksuite.com, relabel the still-default 'feishu' rows to 'lark'. Gating on the deployment-wide override is what makes it safe — every pre-existing install on such a deployment was Lark. Idempotent; no-op on mainland / fresh deployments. Verified end-to-end against a scratch DB (flip then 0-row idempotent re-run). Also document that a Lark/飞书 app_id is globally unique across both clouds, which is what makes the app_id-keyed token cache and the UNIQUE(app_id) constraint safe across regions (review nit). MUL-3083 Co-authored-by: multica-agent <github@multica.ai> * docs(lark): fix ops guidance to match auto per-installation region Review follow-up (MUL-3083). 
.env.example and docker-compose.selfhost.yml still told operators that international Lark requires pointing both base URLs at open.larksuite.com — now wrong, and it would push a fresh deployment back into a single-cloud override. Rewrite them: the base URLs are optional deployment-wide overrides; normal dual-cloud operation keeps them empty. Document the first-boot auto-relabel for deployments migrating off the old single-cloud override, across the integration docs (en/zh/ja/ko). MUL-3083 Co-authored-by: multica-agent <github@multica.ai> --------- Co-authored-by: J <j@multica.ai> Co-authored-by: multica-agent <github@multica.ai>	2026-06-05 16:03:13 +08:00
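The per-installation host resolution described in the commit above might look roughly like this. The helper names `RegionOrDefault`, `OpenPlatformBaseURL`, and `resolveBaseURL` appear in the log, but their signatures, the `https://` scheme, and the treatment of unknown values as Feishu are assumptions.

```go
package main

import "fmt"

// Region identifies which cloud an installation belongs to.
type Region string

const (
	RegionFeishu Region = "feishu"
	RegionLark   Region = "lark"
)

// RegionOrDefault maps an empty or unknown value to Feishu, matching the
// migration default for pre-existing rows.
func RegionOrDefault(r Region) Region {
	if r == RegionLark {
		return RegionLark
	}
	return RegionFeishu
}

// OpenPlatformBaseURL returns the open-platform host for a region.
func OpenPlatformBaseURL(r Region) string {
	if RegionOrDefault(r) == RegionLark {
		return "https://open.larksuite.com"
	}
	return "https://open.feishu.cn"
}

// resolveBaseURL lets an explicit deployment-wide override win over every
// region; otherwise the installation's region decides the host.
func resolveBaseURL(override string, r Region) string {
	if override != "" {
		return override
	}
	return OpenPlatformBaseURL(r)
}

func main() {
	fmt.Println(resolveBaseURL("", RegionLark))                    // https://open.larksuite.com
	fmt.Println(resolveBaseURL("", ""))                            // https://open.feishu.cn
	fmt.Println(resolveBaseURL("http://127.0.0.1:9000", RegionLark)) // http://127.0.0.1:9000
}
```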
Bohan Jiang	3708fb0f07	fix(daemon): inactivity-based agent run timeout, no wall-clock guillotine (MUL-3064) Active long-running sessions are no longer killed by a fixed wall-clock deadline. Liveness is delegated to the idle watchdog (MULTICA_AGENT_IDLE_WATCHDOG, default 30m) with a larger in-flight-tool budget (MULTICA_AGENT_TOOL_WATCHDOG, default 2h). MULTICA_AGENT_TIMEOUT is an opt-in absolute cap (default 0 = no cap). The server-side 2.5h sweeper is unchanged as a coarse backstop. Fixes #3745.	2026-06-05 15:06:07 +08:00
Bohan Jiang	62925b97f1	chore(cli): remove the --from-template flag from agent create (#3805 ) * chore(cli): remove the --from-template flag from agent create The `--from-template` CLI flag was an untaught, immature surface (the built-in skill's source-map explicitly marked the template path "out of scope"). It also silently ignored sibling create flags (--custom-env, --mcp-config, etc.) by short-circuiting before body assembly. Remove the flag and its runAgentCreateFromTemplate handler from the CLI. Scope is CLI-only. The agent-template product feature stays intact: - registry server/internal/agenttmpl/ (embedded curated templates) - handler server/internal/handler/agent_template.go - routes GET /api/agent-templates, GET /api/agent-templates/{slug}, POST /api/agents/from-template - the onboarding "create from template" flow (packages/views/onboarding) The onboarding flow calls the API directly and does not depend on the CLI flag, so removing the flag does not affect it. Updates the multica-creating-agents source map accordingly. MUL-3070 Co-authored-by: multica-agent <github@multica.ai> * fix: correct source-map note on agent-template usage + guard --from-template Review of #3805 (MUL-3070) flagged a factual error in the source-map note: it claimed onboarding uses the agent-template backend. It does not. `packages/views/onboarding/steps/step-agent.tsx` builds four hardcoded local presets (i18n-resolved) and creates via plain `POST /api/agents` (`createAgent`), never `POST /api/agents/from-template`. The whole agent-template stack (registry, handler, routes, `packages/core` client + query wrappers) is orphaned — the removed CLI flag was its only non-test caller. Rewrite the note to say so. Also add a regression test asserting `agent create` exposes no `--from-template` flag, so it can't be silently re-added. 
MUL-3070 Co-authored-by: multica-agent <github@multica.ai> --------- Co-authored-by: J <j@multica.ai> Co-authored-by: multica-agent <github@multica.ai>	2026-06-05 14:21:11 +08:00
LinYushen	3caba86b09	feat(scheduler): DB-backed execution-record scheduler [MUL-2957]	2026-06-05 13:46:26 +08:00
Bohan Jiang	18a5224fe8	feat(cli): add --mcp-config flags to agent create/update (#3799 ) Agents already support an mcp_config field (consumed by the daemon → provider at task time) and the agent-settings UI exposes an MCP tab, but the CLI had no way to set it. This adds the missing CLI surface, mirroring the existing custom-env pattern: - `agent create` and `agent update` gain --mcp-config / --mcp-config-stdin / --mcp-config-file. The stdin/file channels keep MCP server tokens out of shell history and 'ps'; the three channels are mutually exclusive. - The value is validated as a JSON object (or the literal `null` to clear, on update), matching the agent-settings MCP tab. Empty stdin/file input errors instead of silently clearing a secret-bearing field. - Unlike custom_env, mcp_config IS settable via `agent update` — it is persisted through the generic UpdateAgent endpoint (no dedicated audited endpoint), so both create and update expose the flags. Adds parser/resolver unit tests (incl. secret-leak sanitization) and updates the multica-creating-agents built-in skill + source map. MUL-3070 Co-authored-by: J <j@multica.ai> Co-authored-by: multica-agent <github@multica.ai>	2026-06-05 13:39:38 +08:00
Bohan Jiang	f3ab29cdfc	fix(lark): publish lark_installation:created at row-commit, not on status poll (#3770) The agent Integrations tab's "已连接到飞书" ("Connected to Feishu") connection badge only updated after a manual page refresh. lark_installation:created had a single emit site — the status-poll handler GetLarkInstallStatus — so it only fired while a browser was actively polling the install dialog to success. Every other surface (a second admin, the inspector sidebar, the Settings panel, or the installer whose dialog closed before the success poll) never received the invalidation frame, and under the QueryClient defaults (staleTime: Infinity) the installations cache stayed stale until a full page refresh. Publish the event from RegistrationService.finishSuccess at the row-commit point, mirroring the already-correct revoke path, so every workspace client refreshes the moment the install lands. Wire the bus via an optional SetEventBus (keeps the constructor and its validation tests untouched, nil-safe) and remove the now-redundant poll-handler emit. MUL-3059 Co-authored-by: J <j@multica.ai> Co-authored-by: multica-agent <github@multica.ai>	2026-06-04 19:29:01 +08:00
LinYushen	5e1a6c4853	fix(cli): degrade 'issue metadata list' to {} on /metadata 404 (#3757 ) Self-hosted backends without the per-issue metadata route (older builds, unapplied 105_issue_metadata migration, or proxy/ingress misroutes) reply 404 to GET /api/issues/:id/metadata. The agent runtime bootstrap calls 'multica issue metadata list <issue> --output json' best-effort, but a non-zero exit was being escalated by Hermes into a failed agent run even when the rest of the work succeeded. This makes only the 'list' verb best-effort: a 404 from /metadata now prints {} (or an empty table) and exits 0. Other status codes (401, 500, etc.) keep real error semantics, and 'metadata get / set / delete' are unaffected — those represent explicit caller intent. To support the status-code check without changing the user-facing error string, GetJSON now returns *cli.HTTPError on HTTP failures (the format 'GET <path> returned <code>: <body>' is preserved by HTTPError.Error()). Refs GitHub issue #3711. Co-authored-by: multica-agent <github@multica.ai>	2026-06-04 16:13:27 +08:00
Multica Eve	ae27058b0a	fix(attachments): unified download endpoint with mode + presign + proxy (MUL-2976) (#3747 ) Fix attachment download for self-hosted deployments using private S3-compatible buckets without CloudFront. Closes #3721. Server - New unified `GET /api/attachments/{id}/download` endpoint that picks CloudFront / S3 presign / server proxy at request time. - `ATTACHMENT_DOWNLOAD_MODE=auto\|cloudfront\|presign\|proxy` and `ATTACHMENT_DOWNLOAD_URL_TTL` env knobs; `auto` routes Docker hostnames / localhost / private IPs through the proxy and public S3 endpoints through presign. - `Storage.PresignGet` capability; S3 implementation generates presigned GET URLs. - `attachmentToResponse` returns the unified relative endpoint instead of leaking raw unsigned S3 URLs when CloudFront is not configured. Proxy path streams via `io.Copy` with `Content-Disposition` / `Content-Length` / `Cache-Control: no-store` / `X-Content-Type-Options: nosniff`. Clients - CLI / Desktop / Mobile resolve relative `download_url` values against the configured API base. Desktop covers the Electron native download bridge and the media preview modal; Mobile covers `Linking.openURL`, the markdown image RN loader, and the composer's completed non-image file chip. - Mobile gains a minimal Node-environment vitest lane wired into `mobile-verify.yml`. Docs - `.env.example`, `docker-compose.selfhost.yml`, `SELF_HOSTING_ADVANCED.md`, and the `environment-variables` doc set updated with the new env keys and the `ATTACHMENT_DOWNLOAD_MODE=proxy` recommendation for Docker / VPC-internal object stores. Tests - `internal/storage`, `internal/cli`, `internal/handler` (download endpoint, mode selection, proxy header, `/content` non-regression), `cmd/server` (trusted proxy parser). - `packages/views/editor/use-download-attachment.test.tsx` and `attachment-preview-modal.test.tsx` exercise relative URL resolution + absolute pass-through. 
- `apps/mobile/lib/attachment-url.test.ts` covers every helper branch plus the composer non-image chip case.	2026-06-04 14:52:57 +08:00
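The `auto` routing rule described in the commit above (Docker hostnames, localhost, and private IPs go through the proxy; public S3 endpoints get presigned URLs) might be approximated as follows. The function name, the "hostname without a dot means Docker service name" heuristic, and the fallback to proxy on an unparseable endpoint are all assumptions rather than the server's actual logic.

```go
package main

import (
	"fmt"
	"net"
	"net/url"
	"strings"
)

// autoDownloadMode picks "proxy" or "presign" from the storage endpoint.
func autoDownloadMode(endpoint string) string {
	u, err := url.Parse(endpoint)
	if err != nil || u.Hostname() == "" {
		return "proxy" // assumption: when unsure, keep traffic on the server
	}
	host := u.Hostname()
	if host == "localhost" {
		return "proxy"
	}
	if ip := net.ParseIP(host); ip != nil {
		// Loopback and RFC 1918 / unique-local addresses are unreachable
		// from a browser outside the deployment network.
		if ip.IsLoopback() || ip.IsPrivate() {
			return "proxy"
		}
		return "presign"
	}
	if !strings.Contains(host, ".") {
		return "proxy" // heuristic for Docker service names such as "minio"
	}
	return "presign"
}

func main() {
	fmt.Println(autoDownloadMode("http://minio:9000"))                  // proxy
	fmt.Println(autoDownloadMode("http://10.0.0.5:9000"))               // proxy
	fmt.Println(autoDownloadMode("https://s3.us-east-1.amazonaws.com")) // presign
}
```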
Bohan Jiang	6e004149a8	feat(lark): debounce inbound run trigger per chat session (MUL-2968) (#3742 ) A forwarded transcript plus a follow-up note arrive as two separate Lark messages, each of which synchronously called EnqueueChatTask — so the bot ran twice (once on the bare forward, before the note arrived). The chat task already reads the whole session history at run time, so the messages never needed stitching; only the run TRIGGER did. Introduce pendingBatcher: a per-chat_session debouncer that collapses a burst into one agent run on a 3s silence window. Each message is still appended, deduped, and ACKed synchronously and individually; step 8 of the dispatcher now schedules a debounced flush instead of enqueuing inline. Because EnqueueChatTask's agent-offline / agent-archived verdict is now only known at flush, the dispatcher emits that notice itself via an injected FlushReply (wired to OutcomeReplier.Reply) rather than returning it synchronously to the hub. Infra failures are logged, not surfaced — the inbound frame was ACKed long ago. The hub drains the batcher on graceful shutdown so a normal restart does not drop a pending window. Out of scope (owner-aligned): group-chat multi-speaker batching, restart recovery for the in-process window, and forwarded-sender real-name resolution. Co-authored-by: J <j@multica.ai> Co-authored-by: multica-agent <github@multica.ai>	2026-06-04 12:48:02 +08:00
Bohan Jiang	5eba94ee25	feat(lark): inbound context enrichment — post / merge_forward / quoted-reply (MUL-2951) (#3724 ) Expand an inbound Lark bot message's body before dispatch with the context a user explicitly attached, so the agent sees a semantically complete conversation instead of a bare "@bot 总结一下". - post: flatten rich-text (title + paragraphs, links, @-mentions) to plain text synchronously in the decoder. - merge_forward: inline the forwarded transcript via a single GetMessage — GET /open-apis/im/v1/messages/{id} returns the forward sentinel plus the bundled children. (The issue's container_id_type=merge_forward query is undocumented; this avoids it and also handles a forwarded quoted parent.) - quoted reply: prepend the parent_id message as a <quoted_message> block; a parent that is itself a forward nests a <forwarded_messages> block. - new InboundEnricher runs in the WS connector between decode and emit, bounded by EnrichTimeout and degrading to "[unable to fetch]" placeholders so it never blocks the ~3s long-conn ACK budget. /issue stays parseable on a quote-reply by parsing the command from the user's own text (CommandBody) rather than the enriched body. Short-window debounce batching (issue item #4) is tracked as a follow-up. Co-authored-by: J <j@multica.ai> Co-authored-by: multica-agent <github@multica.ai>	2026-06-04 11:58:16 +08:00
Bohan Jiang	598a6c51f2	refactor(server/lark): collapse HTTP_ENABLED + WS_ENABLED into the SECRET_KEY gate (MUL-2671) (#3717 ) MULTICA_LARK_HTTP_ENABLED and MULTICA_LARK_WS_ENABLED were staging knobs from the multi-PR rollout of the Lark MVP — they let the DB schema + inbound dispatcher land before the HTTP wire was real, and before the WS long-conn protocol was wired. Now that the MVP has shipped end-to-end, "I set SECRET_KEY but I don't want to talk to Lark" is not a useful production state: setting the at-rest master key is the operator's opt-in for the integration as a whole. Collapse the gate down to MULTICA_LARK_SECRET_KEY alone. When the key is present, wire the real HTTPAPIClient + the real WSLongConnConnector. CI / integration tests that want stub-style behaviour can point MULTICA_LARK_HTTP_BASE_URL at a mock server (already supported) instead of toggling a separate flag. Host overrides (HTTP_BASE_URL, REGISTRATION_DOMAIN, CALLBACK_BASE_URL) stay — those are real ops needs for international tenants / staging. stubAPIClient + NoopConnectorFactory remain exported because the test suite uses them directly; only the router boot path stops reaching for them. The connector factory keeps its noop fallback for the case where the endpoint fetcher fails to construct, so a malformed MULTICA_LARK_CALLBACK_BASE_URL degrades gracefully (visible as "connector=noop" in the boot log) instead of panicking the server. Lark integration + handler tests still pass; go vet clean. Co-authored-by: J <j@multica.ai> Co-authored-by: multica-agent <github@multica.ai>	2026-06-03 19:36:49 +08:00
Bohan Jiang	8c98940b79	Lark Bot integration MVP: migration + service boundary (MUL-2671) (#3277 ) * feat(db): add Lark integration migration (MUL-2671) Introduces seven tables for the 飞书 Bot integration MVP — per-agent PersonalAgent installations, user/chat bindings, inbound dedup + non-content drop audit, outbound card mapping, and short-lived single-use member binding tokens. Schema notes: - chat_session schema unchanged; Lark routes through a separate binding table rather than adding a metadata JSONB column. - Outbound card mapping is task/message scoped so multiple runs on the same session can't stomp each other's cards. - lark_inbound_audit stores routing / identity / drop_reason ONLY, never message body — the audit channel for unbound users and group messages that don't address the Bot. - app_secret stores ciphertext (encryption helper lands in a follow-up commit on this branch); DB never sees plaintext. Co-authored-by: multica-agent <github@multica.ai> * feat(util): add secretbox AES-256-GCM helper for at-rest secrets First consumer is lark_installation.app_secret (MUL-2671 §4.4), but the helper is intentionally generic — future per-tenant secrets that must not appear in a DB dump can reuse it. Construction: AES-256-GCM with a per-message random nonce, providing authenticated encryption. Tampered ciphertext fails Open instead of silently decrypting to garbage. Master key loaded from a base64 env var via LoadKey; key rotation is not in scope yet. Co-authored-by: multica-agent <github@multica.ai> * refactor(issues): extract IssueService.Create as single create entry (MUL-2671) Establishes the service-layer boundary mandated by Elon's 二审 of MUL-2671 §4.8: issue creation no longer lives inside the HTTP handler. Both the HTTP POST /issues handler and the future Lark /issue command call into service.IssueService.Create, so duplicate guard, issue numbering, attachment linking, broadcast, analytics, and agent/squad enqueue stay aligned. 
Handler responsibilities shrink to parsing the HTTP request, doing actor resolution / validation (transport-specific), and converting service results into the IssueResponse + 201. The transaction-wrapped core, attachment link, event publish, analytics capture, and agent/squad enqueue all move into service.IssueService.Create. A BroadcastPayload callback on the service keeps the WS broadcast shape (the full IssueResponse) without forcing the service to depend on handler-layer response types. Co-authored-by: multica-agent <github@multica.ai> * feat(integrations): add Lark package skeleton (MUL-2671) Establishes the architectural boundaries Elon's 二审 mandated as first-PR blockers without dragging in OAuth, WebSocket, or card-patching code (those land in follow-up PRs): - ChatSessionService interface — channel-aware chat-session entry point for Lark, deliberately separate from the HTTP SendChatMessage handler. The HTTP handler's single-creator guard (creator_id == request user_id) is correct for the browser client but rejects group chat_sessions by construction; Lark needs its own service. - AuditLogger interface — the only path for recording dropped events. Its signature deliberately omits message body, enforcing the drop-audit policy (MUL-2671 §4.7) at the type level: unbound users and non-addressed group messages can't accidentally end up in chat_session. - Typed IDs (OpenID, ChatID) prevent UUIDs from being conflated with Lark-side identifiers at compile time. - DropReason constants align dashboard/audit queries across callers. Co-authored-by: multica-agent <github@multica.ai> * refactor(issues): move parent/project workspace check into IssueService (MUL-2671) Parent existence and project workspace membership now live inside IssueService.Create, inside the same transaction as the duplicate guard and counter increment. 
The HTTP handler stops re-implementing the lookup; every future create entry (Lark /issue, MCP, API keys) inherits the same boundary without copy-pasting the SQL. Adds two error sentinels (ErrParentIssueNotFound, ErrProjectNotFound) so transports can translate to their own error shapes. Handler-level cross-workspace tests guard the boundary against future regressions. Co-authored-by: multica-agent <github@multica.ai> * fix(db): harden Lark migration safety foundation: TTL cap + workspace FK (MUL-2671) Two storage-layer hardenings that move the must-fix line off "the app layer enforces it" and onto the schema itself, so future write paths or hand-inserted rows cannot regress the invariants. 1) lark_binding_token TTL cap. The DB CHECK was 1 hour as defense-in-depth while the app constant was 15 minutes; the CHECK now matches the product cap (15 minutes). Application constant docstring updated to reflect that storage enforces the same bound. 2) lark_user_binding workspace membership. The table previously only FK'd to workspace / user / installation independently, so a binding could exist for a user no longer in the workspace, or claim a workspace different from its installation's. Two composite FKs close the gap structurally: * (installation_id, workspace_id) → lark_installation(id, workspace_id); this guarantees a binding's workspace_id always matches its installation's workspace_id. A new UNIQUE (id, workspace_id) on lark_installation is added as the FK target. * (workspace_id, multica_user_id) → member(workspace_id, user_id) with ON DELETE CASCADE; when a user is removed from the workspace, the binding cascades away in the same transaction. There is no longer a path where lark_user_binding outlives workspace membership. These two FKs are the schema-level proof for §4.3's "unbound or non-workspace members cannot leak content into chat_session" invariant. 
Co-authored-by: multica-agent <github@multica.ai> * feat(integrations/lark): inbound services + /issue dispatcher (MUL-2671) Lands the inbound service layer for the Lark Bot MVP, sitting on top of the migration + service-boundary scaffold from the previous commits. What ships: - sqlc queries for all seven lark_* tables (idempotent dedup insert, CAS WS-lease, single-use binding-token consume, etc.) plus GetMostRecentUserChatMessage for the /issue fallback. - AuditLogger backed by lark_inbound_audit; signature deliberately body-free so callers cannot leak content into the drop log. - ChatSessionService: find-or-create chat_session via the binding table (winner-takes-all on the UNIQUE race), append-with-dedup, /issue parser, "previous user message" fallback for bare `/issue` invocation. - Dispatcher orchestrates the inbound pipeline in one place: installation routing → group-mention filter → identity check → ensure session → append+dedup → /issue → enqueue chat task. Group sessions use the installer as creator (stable workspace identity); p2p uses the sender. Agent-offline path falls through with OutcomeAgentOffline so the WS adapter can reply with the offline notice from §4.6. - BindingTokenService: random URL-safe token, SHA-256 stored hash, 15-min TTL pinned at the application AND the DB CHECK; Redeem returns the same opaque error for all rejection cases (no timing oracle on replay). - Unit tests for the parser (13 cases), dispatcher (8 cases via fake Queries/Chat/Audit/IssueCreator/Enqueuer), and binding-token hash/entropy. Real-DB integration tests for OAuth + token redeem land alongside the HTTP handlers in the next commit. Out of scope for this commit (next ones on the same feature branch): OAuth callback, HTTP routes, WebSocket hub, outbound card patcher, frontend. 
Co-authored-by: multica-agent <github@multica.ai> * feat(integrations/lark): installation HTTP surface + secretbox-gated wiring (MUL-2671) Lands the HTTP boundary on top of the inbound services from the previous commit. What ships: - InstallationService.Upsert: the only path that writes lark_installation. Encrypts app_secret with the secretbox passed in at construction time; refuses to fall back to plaintext storage (returns an error from the constructor if no Box is supplied), so a misconfigured dev environment cannot accidentally land a row with cleartext credentials. Revoke flips status without DELETE so audit trail survives. - HTTP handlers under /api/workspaces/{id}/lark/: * GET /installations — member-visible (Integrations tab renders for non-admins). Soft 200 with empty list + configured:false when MULTICA_LARK_SECRET_KEY is unset, so the tab does not error on self-host that has not opted in. * POST /installations — admin-only; 503 when not configured. Re-validates agent_id ∈ workspace before accepting credentials so a cross-workspace agent UUID is rejected. * DELETE /installations/{id} — admin-only; workspace-scoped lookup so one workspace cannot revoke another's installation by UUID guess. - POST /api/lark/binding/redeem (user-scoped, no workspace context): the only path that mints a lark_user_binding row from user action. Redeemer identity comes from the session, not the token, so a stolen link cannot bind an open_id to an attacker's Multica user. The composite FK on lark_user_binding cascades the binding away if the user is not (or no longer) a workspace member, so a non-member who steals the link gets 403 at the DB layer. - Two new event-bus types in protocol.events: EventLarkInstallationCreated, EventLarkInstallationRevoked. - Router wiring: MULTICA_LARK_SECRET_KEY drives a conditional initialization of h.LarkInstallations + h.LarkBindingTokens. When unset, the integration disables itself with an INFO log and the rest of the server boots normally. 
- Handler tests cover all four not-configured short-circuits. Happy-path integration tests (real DB, full create→list→revoke cycle and token mint→redeem) ship alongside the WS hub PR. Co-authored-by: multica-agent <github@multica.ai> * fix(integrations/lark): close binding-token rebind & typed task errors (MUL-2671) Two must-fixes from PR review on HEAD `87ad15e1`: 1. Binding-token redeem could be used to grab an already-bound Lark open_id. Two changes harden the path: - lark.sql `CreateLarkUserBinding` now gates ON CONFLICT DO UPDATE on `multica_user_id = EXCLUDED.multica_user_id`, so a cross-user rebind via a second valid token returns zero rows instead of silently switching ownership. - `BindingTokenService.RedeemAndBind` consumes the token and writes the binding row inside one transaction. A failed bind no longer burns the token; a successful bind never leaves a consumed-but- unused token. Distinct typed errors: ErrBindingTokenInvalid (410), ErrBindingAlreadyAssigned (409), ErrBindingNotWorkspaceMember (403). The handler maps each to its own status code. 2. Dispatcher collapsed every `EnqueueChatTask` error to `OutcomeAgentOffline`, hiding infra failure and misusing the "offline" label for cases (e.g. archived agent) where it doesn't fit. Now: - `service.EnqueueChatTask` returns `ErrChatTaskAgentNoRuntime` and `ErrChatTaskAgentArchived` as sentinel errors; DB / load / insert failures stay wrapped as ordinary errors. - Dispatcher uses `errors.Is` to map only the productizable cases (`OutcomeAgentOffline`, new `OutcomeAgentArchived`); any other error is returned to the WS adapter so it can retry or page instead of disguising the outage as an offline card. A daemon that's merely disconnected is still NOT an error — as long as `agent.runtime_id` is set the chat task enqueues and waits for the daemon to claim it on next online (returns `OutcomeIngested`). 
Co-authored-by: multica-agent <github@multica.ai> * ci: re-trigger workflow on lark MVP must-fix HEAD Co-authored-by: multica-agent <github@multica.ai> * ci: re-trigger workflow on lark MVP must-fix HEAD (retry) Co-authored-by: multica-agent <github@multica.ai> * test(integrations/lark): guard binding-token sentinel contract (MUL-2671) Two unit tests that document and protect the must-fix invariants without requiring a DB: 1. TestRedeemAndBindRequiresTxStarter — if a future refactor wires up BindingTokenService without a TxStarter, RedeemAndBind must fail fast with a clear error rather than nil-panic on Begin. The atomicity contract (consume + bind commit together) depends on that transaction existing. 2. TestBindingErrorSentinelsAreDistinct — the HTTP handler maps ErrBindingTokenInvalid → 410, ErrBindingAlreadyAssigned → 409, ErrBindingNotWorkspaceMember → 403. Accidentally aliasing them (e.g. var ErrBindingAlreadyAssigned = ErrBindingTokenInvalid) would silently regress the response codes without any other test catching it. Co-authored-by: multica-agent <github@multica.ai> * feat(integrations/lark): WS hub orchestrator + outbound card patcher (MUL-2671) The hub owns one supervisor goroutine per active installation. Each supervisor acquires the WS lease via the existing CAS query, runs an EventConnector (interface — real Lark wire protocol lands in a follow-up behind it), renews the lease on a tighter cadence than the TTL, and backs off (with jitter) on connector failure. Lease loss tears the connector down cleanly; revocation is reaped on the next sweep. Per- process node id satisfies §4.4 multi-replica safety: at most one Hub globally holds the lease for any installation. The patcher subscribes to task / chat-done events on the existing events.Bus and keeps the per-task Lark interactive card in sync (thinking → streaming → final \| error). 
Card binding is per-task as required by §4.5; throttled patches via an in-memory last-patched map; final / error transitions bypass the throttle so the user always sees the terminal state. The Renderer is plug-replaceable so the product card template can evolve without touching transport. The APIClient interface centralizes the Lark Open Platform surface this package needs (send card, patch card, send binding prompt, exchange OAuth code). The default stubAPIClient returns ErrAPIClientNotConfigured for every transport call so a misconfigured deployment fails loudly instead of dropping cards silently. Real implementation lands in a follow-up; OAuth callback + frontend entries land in the next commits on this branch. Co-authored-by: multica-agent <github@multica.ai> * feat(integrations/lark): OAuth install start / callback (MUL-2671) OAuthService builds a signed-state Lark authorization URL the frontend can render as a QR (or open directly), then on callback verifies the HMAC-protected state, exchanges the OAuth code for installation credentials via APIClient.ExchangeOAuthCode, and persists the row via InstallationService.Upsert (which keeps app_secret encryption inside a single chokepoint). State token format: workspaceID.agentID.initiatorID.expiresUnix.nonce.sig — HMAC-SHA256 over the first five fields with a deployment-level secret. TTL defaults to 10 minutes (covered by tests). Three failure modes (invalid state / expired state / missing code) map to typed errors so the HTTP handler can emit a single lark_error= query param the frontend uses to pick copy. Both endpoints degrade cleanly: the at-rest key gate (already in place) returns 503 from /install/start when the InstallationService is nil, and the OAuth gate (MULTICA_LARK_OAUTH_APP_ID / _SECRET / _REDIRECT_URI / _STATE_SECRET) returns configured:false from /install/start so the frontend can render "configure manually instead" without an error banner. 
/install/callback always finishes with a redirect to /settings?tab=lark carrying either lark_installed=1 or lark_error=<code>. Tests cover signed-URL shape, missing-config rejection, tampered state, expired state, propagated exchange error, and the no-config redirect path on the HTTP handler. Co-authored-by: multica-agent <github@multica.ai> * feat(views/lark): settings tab + agent bind button + /lark/bind redemption page (MUL-2671) Adds the user-facing Lark surface across the shared packages: - packages/core/types/lark.ts — wire shapes that mirror server/internal/ handler/lark.go. Optional fields default to undefined so older desktop builds keep parsing if the server adds new keys (CLAUDE.md → API Response Compatibility). - packages/core/lark/{queries,index}.ts — Tanstack Query options keyed by workspace id; realtime sync invalidates `installations(wsId)` on `lark_installation:` events. - packages/core/api/client.ts — listLarkInstallations, getLarkInstallURL, deleteLarkInstallation, redeemLarkBindingToken. - packages/views/settings/components/lark-tab.tsx — Settings → Lark panel. Listing is member-visible (matches backend); disconnect is admin-only. Empty state points users at the per-Agent bind entry, matching the (workspace_id, agent_id) UNIQUE: there is no "pick an agent" UI here because the bind URL is per-agent. - LarkAgentBindButton (same file) is the per-Agent CTA the Agent detail page imports. Opens the OAuth URL in a new tab; the callback bounces back to /settings?tab=lark with a query param the panel reads for inline confirmation copy. - packages/views/lark/bind-page.tsx — the Bot's "you need to bind" destination. Requires session before redeeming, distinguishes the 410/409/403 backend responses into distinct copy. - apps/web/app/lark/bind/page.tsx — Next.js route wrapping the shared bind page in a Suspense boundary (Next 15 useSearchParams rule). 
i18n: all user-facing strings land in en/zh-Hans, settings tab nav includes a Sparkles-iconed Lark entry, bind-page copy lives under common.lark_bind so it works pre-workspace-context too. typecheck + lint clean. Co-authored-by: multica-agent <github@multica.ai> chore(integrations/lark): wire outbound Patcher into server bootstrap (MUL-2671) Constructs the Patcher next to the existing Installation/BindingToken wiring in router.go and Register()s it on the event bus. With the stub APIClient any actual transport call surfaces ErrAPIClientNotConfigured; once the real Lark client lands, swap NewStubAPIClient for the real implementation here without touching the Patcher's subscription logic. doc.go updated to reflect everything the package now contains (Hub, Patcher, OAuthService, APIClient interface). The Hub itself is NOT booted here yet — it needs an EventConnector implementation for the Lark long-connection wire protocol, which lands in a follow-up; the orchestrator code and its unit tests are in place so that follow-up can focus on the WS protocol rather than lifecycle plumbing. Co-authored-by: multica-agent <github@multica.ai> * fix(integrations/lark): address Elon 二审 5 must-fix items (MUL-2671) - Hub: renewer cancels run ctx on lease loss so the connector exits even if its wire I/O is blocked, keeping the §4.4 ownership invariant intact under lease theft. - Hub: EventEmitter returns (DispatchResult, error) so the real connector can post the matching Lark-side card (needs_binding, agent_offline, agent_archived) and react to infra failures instead of silently logging at the seam. - Dispatcher: top-level message_id dedup runs before group filter and identity check, so a reconnect storm cannot re-fire binding prompts or re-spam not_addressed_in_group audit rows; the in- AppendUserMessage dedup is removed since the table-level UNIQUE is the ultimate backstop. 
- OAuth: HandleCallback auto-binds the installer via the new InstallerBinder seam (BindingTokenService implements it), so the §2.1 "scan to bind, you're done" promise holds end-to-end. validateExchangeResult now requires installer open_id; new error reason codes wired through the callback redirect. - Frontend / handler: install_supported listing field + StartLark- Install short-circuit on stub APIClient hide install entry points (Settings tab + per-agent button) while no real Lark HTTP client is wired, so users do not land in an OAuth flow that fails at exchange. Includes tests for each fix (lease-loss cancel, emit error propagation, dedup ordering, OAuth installer-bind contract, stub- client install gate) and i18n strings for the new preview state. Co-authored-by: multica-agent <github@multica.ai> * fix(integrations/lark): two-phase dedup so infra failures do not swallow messages (MUL-2671) The pre-fix top-level dedup wrote the lark_inbound_message_dedup row before EnsureChatSession / AppendUserMessage. An infra error in either step left the row in place and a WS-adapter retry was mis-classified as a duplicate, so the user's Lark message was permanently lost without ever landing in chat_session. Make dedup two-phase: - ClaimLarkInboundDedup acquires an in-flight claim (processed_at NULL). Stale claims older than 60 s are re-takeable so a process crash does not strand the message_id. - MarkLarkInboundDedupProcessed flips processed_at on durable success (audit row OR chat_message + session touch). - ReleaseLarkInboundDedup deletes the in-flight row on infra failure before any durable side effect, so the retry can re-claim immediately. Dispatcher.Handle now finalizes the claim exactly once based on whether the inner pipeline reached a durable outcome — chat_message commit being the transition point (errors past it Mark, errors before it Release). 
Regression tests cover the two failure variants Elon flagged plus the inverse invariants (durable-error Marks, drops Mark, in-flight replays drop, stale claims re-claim). Co-authored-by: multica-agent <github@multica.ai> * fix(integrations/lark): owner-fence dedup claim to close the double-write windows (MUL-2671) The two-phase Claim/Mark/Release fix from the previous commit closed the "infra error swallows a replay" gap but left two windows that could still write a chat_message twice for the same Lark message_id: 1. Stale-reclaim race. Worker A claims at t=0, runs slowly past the 60 s staleness TTL but is still alive. Worker B sees the row as stale and re-takes the claim. A reaches AppendUserMessage and commits a second chat_message. 2. Mark window. Worker A commits chat_message but the post-pipeline MarkLarkInboundDedupProcessed fails (DB hiccup) or the process crashes before it runs. 60 s later a retry treats the in-flight row as stale, re-claims it, and writes a second chat_message. Close both with owner fencing + same-tx Mark: - lark_inbound_message_dedup now carries a `claim_token` UUID; ClaimLarkInboundDedup mints a fresh one on insert and on stale re-take, so a reclaim ROTATES the token. - MarkLarkInboundDedupProcessed and ReleaseLarkInboundDedup are fenced on (message_id, claim_token, processed_at IS NULL) and return rowsAffected. Zero means our token is no longer live, and the caller treats it as a no-op (not an error). - AppendUserMessage invokes MarkLarkInboundDedupProcessed INSIDE its chat_message+session tx (qtx). If the token has been rotated by a concurrent reclaim, the Mark matches zero rows and the method returns ErrClaimLost; the deferred Rollback unwinds the chat_message insert, so the other holder is the sole writer. The durable write and the Mark therefore commit (or roll back) atomically — there is no "committed but not yet Marked" window for a crash or retry to exploit. 
Dispatcher.processClaimed now returns a tri-state dedupFinalize directive (none / mark / release): finalizeNone for the in-tx Mark path (and ErrClaimLost), finalizeMark for audit-drop branches and the defensive post-Append-success fallback, finalizeRelease for pre-durable infra errors. ErrClaimLost is translated into OutcomeDropped + DropReason- Duplicate at the Handle boundary, matching what the WS adapter expects for a "another worker is the writer" outcome. Regression tests: - TestDispatcher_StaleReclaimRaceDoesNotDoubleWrite injects worker B's reclaim via a beforeAppend hook so the claim_token rotates between Claim and AppendUserMessage. Asserts worker A's AppendUserMessage returns ErrClaimLost (no chat_message committed), the dispatcher surfaces a duplicate drop, the token rotated to a value distinct from A's original, and a follow-up replay still duplicate-drops. - TestDispatcher_InTxMarkPreventsPostCommitReclaim verifies the "Mark window" case is unreachable: a successful in-tx Mark produces exactly one Mark call (no post-finalize duplicate), the row is terminal, and a retry with dedupReclaim=true still duplicate-drops without re-rotating the token. - TestDispatcher_InTxMarkSucceedsAndSkipsPostFinalize pins the positive contract: DedupMarked=true must make applyFinalize a no-op (no extra Mark, no Release). fakeQueries gains a fakeDedupRow model carrying (processed, token, rotations) so the test seam matches production's UPDATE-with-WHERE semantics; fakeChat gains a beforeAppend hook to inject race timing. go test ./... and go vet ./... pass. Co-authored-by: multica-agent <github@multica.ai> * feat(integrations/lark): real Lark HTTP APIClient for IM v1 send/patch (MUL-2671) Lands the production Lark Open Platform HTTP APIClient that replaces the stub for outbound transport. The patcher's "thinking → streaming → final \| error" card lifecycle and the dispatcher's binding-prompt card both now reach Lark for real once MULTICA_LARK_HTTP_ENABLED=true. 
Scope of this stage: - tenant_access_token retrieval via /open-apis/auth/v3/ tenant_access_token/internal, cached in-process per app_id with a 60s safety margin against Lark's `expire` value. Sub-2-minute expires are clamped to 120s so we never cache an entry that's already past its safe window. - SendInteractiveCard: POST /open-apis/im/v1/messages?receive_id_type=chat_id returning the Lark message_id the Patcher persists in lark_outbound_card_message for later patches. - PatchInteractiveCard: PATCH /open-apis/im/v1/messages/:id with the full re-rendered card body (Lark's update endpoint replaces, not deep-merges). - SendBindingPromptCard: open_id-targeted interactive card with a primary "去绑定" CTA pointing at the redemption URL. Template is co-located with the transport so the dispatcher never has to know about Lark's card schema. - Token-error invalidation: Lark codes 99991663 (expired) / 99991664 (invalid) drop the cached token so the next call refreshes from /tenant_access_token/internal instead of looping on a stale entry. Out of scope (deferred to follow-up stages): - ExchangeOAuthCode stays unimplemented behind ErrAPIClientNotConfigured. The PersonalAgent install handshake's response shape (returning per-installation app credentials in a single call) is not yet verified against the production endpoint, and a silent mis-fill of OAuthExchangeResult would corrupt lark_installation rows past validateExchangeResult. Operators continue to use the manual-paste InstallationService path until the OAuth stage lands. - Inbound WS EventConnector — Hub's ConnectorFactory still needs a real wire-protocol implementation. Wiring: - MULTICA_LARK_HTTP_ENABLED=true switches router.go from the stub to the real client. MULTICA_LARK_HTTP_BASE_URL overrides the default open.feishu.cn host (set to open.larksuite.com for the Lark international tenant, or to an httptest URL for integration tests). 
- The OAuth handler now also receives the real client (its ExchangeOAuthCode still surfaces ErrAPIClientNotConfigured, so callback behavior is unchanged until that stage lands). Tests (19 new cases against an httptest.Server fake): - happy path send/patch/binding-prompt round trips, asserting URL query params, body shape, Authorization header - token cache: 3 sends share one /tenant_access_token/internal hit - token refresh after clock-driven expiry - sub-margin expire clamping (10s expire → cached for >= safety margin of wall-clock) - Lark error code surfacing (230001 send, 230002 patch, 10003 auth) - token-expired (99991663) invalidates the cache; caller's retry re-fetches and succeeds - non-2xx HTTP status surfaces "http 500: …" - input validation: missing chat_id short-circuits BEFORE auth round-trip, missing card json / open_id / bind url all fail pre-flight without hitting Lark - ExchangeOAuthCode still returns ErrAPIClientNotConfigured - binding-prompt template carries the BindURL and the localized "去绑定" CTA in valid JSON go build ./..., go vet ./..., and go test ./internal/integrations/lark/... pass. Pre-existing handler/router integration tests that require a real Postgres connection are unaffected by this change. Co-authored-by: multica-agent <github@multica.ai> * fix(integrations/lark): split outbound vs OAuth-install capability + card update_multi (MUL-2671) Address Elon's two must-fix items from the HEAD `a09993b1` review: 1. HTTP outbound and OAuth-install are now distinct APIClient capabilities. The new SupportsOAuthInstall() reports whether the install flow can succeed end-to-end (i.e. ExchangeOAuthCode is implemented); the real httpAPIClient still returns IsConfigured() = true (send / patch / binding prompt work) but SupportsOAuthInstall() = false until the PersonalAgent install-time response shape is pinned. 
Handler-side `install_supported` and StartLarkInstall now gate on SupportsOAuthInstall, so a half-wired client never reveals the scan-to-bind UI. larkOAuthErrorReason also maps ErrAPIClientNotConfigured to a dedicated `oauth_exchange_unimplemented` reason so a raw callback hit no longer masquerades as `internal_error`. 2. defaultRenderer now emits config.update_multi=true on every Kind. Lark refuses to apply PatchInteractiveCard to a card whose initial config doesn't declare it shared/updatable, so the absent flag would make every patch after the first send silently no-op on the wire while the local outbound status row still flipped to streaming/final. Tests cover both halves of each fix: - TestHTTPClient_SupportsOAuthInstall_FalseUntilExchangeLands + TestHTTPClient_StubReportsBothCapabilitiesFalse pin the new capability surface. - TestStartLarkInstall_TransportOnlyClientReportsNotConfigured + TestListLarkInstallations_TransportOnlyClientReportsInstallNotSupported pin the handler gate at exactly the half-wired state. - TestLarkOAuthErrorReason_APIClientNotConfigured pins the mapping for both the bare sentinel and the fmt.Errorf-wrapped form HandleCallback produces. - TestDefaultRendererConfigCarriesUpdateMulti covers every CardKind. - TestHTTPClient_(Send\|Patch)InteractiveCard_DefaultRendererBodyHasUpdateMulti verify the wire body Lark actually receives carries update_multi through both send and patch transport paths. Co-authored-by: multica-agent <github@multica.ai> * feat(integrations/lark): real OAuth code exchange + agent-detail bind entry (MUL-2671) Stages the install side of the MVP critical path on top of the real HTTP outbound work: - httpAPIClient.ExchangeOAuthCode runs the production Lark v2 OAuth flow: POST /authen/v2/oauth/token to swap the authorization code for the installer's open_id, then GET /bot/v3/info under the parent app's tenant_access_token to fetch bot_open_id. 
Result feeds InstallationParams unchanged so OAuthService.HandleCallback's auto-bind step lights up automatically. - HTTPClientConfig gains OAuthAppID/OAuthAppSecret, read from the same MULTICA_LARK_OAUTH_APP_ID/_APP_SECRET env vars the OAuthConfig consumes. SupportsOAuthInstall now mirrors that pair so the install capability gate is honest: outbound transport without OAuth creds reports configured-but-not-install-supported, exactly like before. - Agent detail inspector wires the LarkAgentBindButton in a new Integrations section, viewer-hidden by canEdit. The button still self-hides when SupportsOAuthInstall is false, so a deployment without OAuth creds renders the section empty rather than CTA-broken. - Capability wording cleaned across handler / router / lark-tab to say "OAuth-install capability" instead of "real APIClient wired", and the misleading TransportOnly... test was renamed/refocused on the early-return branch it actually exercises (Elon non-blocking note). Co-authored-by: multica-agent <github@multica.ai> * fix(integrations/lark): identity-only OAuth + atomic bind (MUL-2671) Addresses Elon's round-4 must-fix items on PR #3277: 1. OAuth v2 token → user_info chain now matches Lark's official user-OAuth shape. `httpAPIClient.ExchangeOAuthCode` POSTs /open-apis/authen/v2/oauth/token (RFC 6749: top-level access_token, NO open_id), then GETs /open-apis/authen/v1/user_info with the user_access_token as Bearer to obtain the installer's open_id / union_id. The test fixture now reflects the real wire shape (separate user_info handler; no synthetic open_id in the token response). 2. `OAuthExchangeResult` is identity-only — drops the synthesized shared-parent AppID / AppSecret / BotOpenID return that broke the UNIQUE(app_id) constraint and the dispatcher's per-app_id routing. 
`OAuthService.HandleCallback` no longer Upserts an installation row: it looks up the lark_installation already provisioned via the manual-paste POST /lark/installations route and binds the installer onto it. Two new typed errors — ErrInstallationNotProvisioned and ErrInstallationRevoked — map to `installation_not_provisioned` / `installation_revoked` reasons at the HTTP boundary so the UI can guide the admin. The PersonalAgent install API (which would deliver per-installation bot credentials at scan time) remains a follow-up; until it lands the OAuth flow is identity-binding only and the agent-detail bind button stays hidden on deployments without OAuth env (capability gate unchanged). 3. The installation lookup + installer bind run inside a single DB transaction so a concurrent revoke / re-provision between the read and the binding insert cannot leak a half-applied state. `InstallerBinder.BindInstaller` is renamed to `BindInstallerTx` and accepts the OAuth-service-owned transaction's qtx; the binding_token redemption path is unchanged. 4. `validateExchangeResult` is simplified to require only the installer's open_id; the obsolete ErrExchangeMissingAppID / AppSecret / BotOpenID sentinels are removed (no caller can trip them now). The oauth_test suite is rewritten to use a stub failTxStarter so tests covering state-token verification and exchange-error propagation remain DB-free, while a new TestOAuthCallbackOpensTxAfterValidExchange pins the post-must-fix order (state ok + exchange ok ⇒ Begin runs before any lookup or bind, and a Begin failure aborts cleanly with no bind). Verified locally: - go build ./... / go vet ./... clean - go test ./internal/integrations/lark/... ✓ - go test ./internal/handler -run 'Lark\|Binding\|OAuth' ✓ - go test ./internal/util/secretbox/... ./internal/service/... 
✓ Co-authored-by: multica-agent <github@multica.ai> * feat(integrations/lark): device-flow scan-to-install (MUL-2671) Replaces the manual paste-credentials install path + identity-only OAuth callback (rejected in product review: too many steps before a user sees value) with a true single-step scan-to-install built on Lark's RFC 8628 device-flow registration endpoint (POST accounts.feishu.cn/oauth/v1/app/registration) — the same protocol the official larksuite/oapi-sdk-go/scene/registration package and zarazhangrui/feishu-claude-code-bridge use. User journey: admin clicks "Bind to Lark" on the Agent detail page → QR dialog opens → admin scans in the Lark app on their phone → authorizes the new PersonalAgent → dialog auto-closes with the new installation visible. No app_id / app_secret to copy, no Lark developer console visit, no Multica-side OAuth env to configure. Backend (server/internal/integrations/lark): - registration.go — inline ~280-line RFC 8628 client. Begin posts archetype=PersonalAgent / auth_method=client_secret / request_user_info=open_id; Poll follows the upstream SDK's state machine including the tenant-brand mid-stream domain swap to accounts.larksuite.com when a Lark-international account authorizes. SDK is NOT vendored — one endpoint isn't worth dragging the full oapi-sdk-go + transitive deps. - registration_service.go — owns the in-process session store + background polling goroutine. On success calls APIClient.GetBotInfo (the new IM-side endpoint added below) and writes lark_installation + the installer's lark_user_binding inside one DB transaction so a half-applied install can never land. Stable error_reason codes (expired / access_denied / lark_protocol_error / bot_info_failed / installation_conflict / installer_bind_failed / internal_error) drive the UI copy without parsing prose. 
- client.go / http_client.go — drops ExchangeOAuthCode and SupportsOAuthInstall (no longer applicable: device-flow returns identity alongside credentials in one response); adds GetBotInfo which mints a tenant_access_token from the freshly-minted client_id / client_secret and calls /open-apis/bot/v3/info for the bot_open_id. install_supported now gates on IsConfigured() (real HTTP client wired) instead of a separate OAuth capability. - binding_token.go — absorbs InstallerBindParams / InstallerBinder (previously in oauth.go), retargets the doc-comment from the OAuth caller to the device-flow caller. - Deletes oauth.go + oauth_test.go entirely. Handler & router (server/internal/handler, server/cmd/server): - POST /api/workspaces/{id}/lark/install/begin — opens a new registration session, returns {session_id, qr_code_url, expires_in_seconds, poll_interval_seconds}. Admin-only. - GET /api/workspaces/{id}/lark/install/{sessionId}/status — polling endpoint, returns {status, installation_id?, error_reason?, error_message?}. Workspace-scoped lookup so a stolen session_id cannot be polled from another workspace. Admin-only. - Removes POST /lark/installations (paste form), GET /lark/install/start (OAuth-redirect entry), and GET /api/lark/install/callback (OAuth redirect target). - Removes MULTICA_LARK_OAUTH_APP_ID / _APP_SECRET / _REDIRECT_URI / _STATE_SECRET / _AUTHORIZE_URL / _SUCCESS_URL env vars. Self-host operators no longer need a parent Lark app at all. Frontend (packages/core, packages/views): - New types BeginLarkInstallResponse / LarkInstallStatusResponse + matching API methods (beginLarkInstall / getLarkInstallStatus); drops getLarkInstallURL. - LarkAgentBindButton opens LarkInstallDialog instead of a window.open() to Lark's authorize page. The dialog uses react-qr-code (catalog) to render the verification_uri_complete inline as SVG (no external CDN image), polls status at the server-supplied cadence, auto-closes on success, offers "scan again" on terminal failure. 
Per CLAUDE.md "Enum drift downgrades, not crashes", error_reason switch has a default fallback so an older desktop build on a newer server still renders the generic failure copy. - Adds the device-flow strings to en + zh-Hans settings.json; removes the obsolete OAuth-not-configured copy. Verified locally: - go build ./... / go vet ./... clean - go test ./internal/integrations/lark/... — all green (existing tests + 15 new registration / GetBotInfo tests) - go test ./internal/handler -run 'Lark|Binding' — all green - pnpm typecheck — all 6 packages clean - pnpm lint — 0 errors (15 pre-existing warnings, none in changed files) - pnpm --filter @multica/views test — 859/859 pass Pre-existing failures in server/internal/middleware (column "profile_description" missing from local test DB) reproduce against the parent commit and are unrelated to this change. Co-authored-by: multica-agent <github@multica.ai> * fix(integrations/lark): gate bind CTA to workspace admins, terminate QR polling on 4xx (MUL-2671) Two frontend must-fixes from the second-round review of PR #3277: 1. LarkAgentBindButton now self-hides for non-admin viewers in addition to the existing install_supported check. The agent-detail page mounts the button under `canEdit`, and canEditAgent lets agent owners through even when they are not workspace admins — but the backend gates POST /lark/install/begin and the status poll on owner/admin (router.go:478-487), so the previous behavior shipped a CTA that was guaranteed to 403. The new gate reads workspace role from the same member list the settings tab already uses. 2. The status polling loop now terminates on 404 (session gone — server restarted, multi-instance routing, or in-process GC swept it) and 403/401 (permission revoked mid-session). Previously every error path scheduled another setTimeout, which trapped the user on a stale QR forever.
ApiError gives us the HTTP status verbatim; terminal responses set status=error with stable error_reason codes (session_lost, forbidden) that flow through the existing dialog switch + retry/close affordances. 5xx + network blips still retry. i18n: new install_error_session_lost / install_error_forbidden in en and zh-Hans, with default fallback preserved per the enum-drift rule. Coverage: 6 new vitest cases — admin/owner allow, member deny, unsupported-install deny, and the two terminal-error polling paths using fake timers to assert the loop stops scheduling. Also clears a handful of stale OAuth/manual-install doc comments flagged in the review (non-blocker cleanup): doc.go's §10 now points at RegistrationService, installation.go's input-shape doc loses the OAuth-callback half, and client.go's stubAPIClient comments no longer reference OAuth callbacks. Co-authored-by: multica-agent <github@multica.ai> * docs(integrations/lark): describe gate as device-flow install in agent-detail integrations comment (MUL-2671) The comment block above the agent-detail Integrations section still described the capability gate as 'server-side OAuth-install'. The OAuth path is gone — install is now device-flow per RFC 8628 — so the comment now reads 'server-side device-flow install capability gate'. Pure comment change; behavior is unchanged. Cleans up the nit Elon called out in the second-round review of PR #3277 (MUL-2671). Co-authored-by: multica-agent <github@multica.ai> * feat(integrations/lark): wire inbound pipeline + WS Hub at boot (MUL-2671) Stage 3.a of MUL-2671. Hub class, Dispatcher, ChatSessionService and AuditLogger have all been implemented and tested in prior PRs but none of them was constructed at boot, so the in-process plumbing was never exercised end-to-end.
This change wires them together behind the same `MULTICA_LARK_SECRET_KEY` gate that already gates InstallationService / RegistrationService, and starts the Hub under the existing `sweepCtx` so it winds down alongside the other long-running workers after HTTP drain. The real long-conn EventConnector is still pending; the factory hands every supervisor a shared NoopConnector that holds the lease and emits nothing. That lets staging exercise the lease / supervisor / shutdown lifecycle against real DB rows without committing to the Lark wire protocol implementation. Swapping in the real connector is a single line change in the same router block; the Dispatcher / ChatSessionService / Hub seams stay frozen. ## Why a noop placeholder, not a stub-or-skip The Hub's value is mostly its lifecycle: §4.4 ownership lease, LeaseRenewInterval / LeaseTTL, supervisor reap on revoke, clean release on shutdown. None of that runs unless the Hub is actually started. Holding off until the real connector lands means the next PR has to debut both pieces simultaneously; wiring the supervisor loop first lets the real connector PR be a focused, reviewable swap. ## Changes - `internal/integrations/lark/noop_connector.go` — `NoopConnector` implementing `EventConnector`: blocks on ctx until the Hub cancels (lease loss / shutdown / revoke), emits no events, logs on enter/exit so operators see exactly which installation the supervisor is holding the lease for. - `internal/integrations/lark/noop_connector_test.go` — verifies the connector blocks until ctx cancel, returns nil on clean exit, never invokes the emit callback, and the factory shares a single connector instance across installations. - `internal/handler/handler.go` — new `LarkHub lark.Hub` field on `Handler`. Nil when the Lark integration is disabled. 
- `cmd/server/router.go` — inside the existing Lark wiring block, construct `AuditLogger`, `ChatSessionService` (with `pgxpool.Pool` for the in-tx dedup Mark), `Dispatcher` (wiring `h.IssueService` and `h.TaskService` so `/issue`-created issues share counter / duplicate guard / project boundary / broadcast / analytics with the rest of the product), and the `Hub` with the `NoopConnectorFactory`. `NewRouterWithOptions` now returns `(chi.Router, handler.Handler)` so main.go can drive Hub lifecycle; `NewRouter` discards the handler. - `cmd/server/main.go` — start the Hub under `sweepCtx` after the other background workers, and `Wait` on it after HTTP drain + sweep cancel so the lease renewer can issue a final release before exit. Skipped entirely when `h.LarkHub == nil`. ## Test plan - [x] `go build ./...` clean - [x] `go vet ./...` clean - [x] `go test ./internal/integrations/lark/...` (new noop tests + existing hub / dispatcher / chat_service / registration / binding_token / outbound / issue_command suites) — all pass - [x] `go test ./internal/handler -run 'TestLark\|TestRedeemLarkBinding'` pass — handler-side Lark surfaces unchanged - [x] `go test ./internal/service/... ./internal/util/secretbox/...` pass - [x] `pnpm --filter @multica/views exec vitest run settings/components/lark-tab` pass (6/6) — frontend lark surfaces unchanged - [ ] Local broad `go test ./internal/handler/...` still blocked by the pre-existing test DB schema drift Elon flagged in the previous round (`column "metadata" does not exist`, unrelated to this change); CI is the authoritative check. - [ ] Manual end-to-end deferred until the real long-conn EventConnector lands (next stage). MUL-2671 Co-authored-by: multica-agent <github@multica.ai> fix(integrations/lark): bound Hub lease release + shutdown wait (MUL-2671) Lease release used context.Background(); a stalled DB pool could pin shutdown indefinitely. 
Add LeaseReleaseTimeout (5s default) and ShutdownTimeout (15s default) to HubConfig, route releaseLease through a bounded context, and expose WaitWithTimeout for main.go so a wedged supervisor degrades to LeaseTTL expiry on the next replica instead of blocking process exit. Also correct the LarkHub field comment in handler.go: the Hub is wired whenever the at-rest secret key is set, independent of whether the outbound HTTP APIClient is configured. Co-authored-by: multica-agent <github@multica.ai> * feat(integrations/lark): real WS long-conn connector + ctx-cancel-breaks-read (MUL-2671) Replaces NoopConnectorFactory with a production EventConnector that opens Lark's event-subscription WebSocket. Gated behind MULTICA_LARK_WS_ENABLED so staging boots stay on the noop path until operators opt in, and falls back to noop with a warning when the WS flag is set without MULTICA_LARK_HTTP_ENABLED (the real connector needs the cached tenant_access_token). Why this connector exists separately from the Hub: gorilla/websocket ReadMessage blocks on the underlying TCP socket and does not observe context. The watchdog goroutine inside WSLongConnConnector.Run closes the conn the moment ctx fires, so lease loss / shutdown breaks the blocking read in bounded time — exactly the invariant Hub renewLeaseUntil's runCancel depends on for the "at most one active WS per installation across replicas" guarantee. Tests cover this explicitly (TestWSConnectorRunReturnsOnCtxCancelEvenWhenReadIsBlocked). The Lark wire surface is split into three swappable seams so the transport layer stays tested in isolation: - EndpointFetcher (POST /event-subscription/v1/connection_token) resolves a one-shot wss URL per Run. No caching — replaying a one-shot token would look like a Lark outage. - FrameDecoder turns one raw JSON envelope into an InboundMessage or a "control / heartbeat / drop" verdict. Decoder errors log + drop the frame; they do NOT tear down the connection. 
- CredentialsProvider wraps InstallationService.DecryptAppSecret so plaintext app_secret lives in memory only during a Run. Also fixes the handler.go LarkHub comment: it still said "joins on Wait during graceful shutdown" but main.go has used WaitWithTimeout (bounded wait) for several commits. Comment now matches. Co-authored-by: multica-agent <github@multica.ai> * feat(integrations/lark): align WS to official binary Frame protocol + DispatchResult outbound replies (MUL-2671) Two must-fix items from Elon's review of PR #3277: 1. WS protocol layer rewritten to match the official Lark Go SDK (`larksuite/oapi-sdk-go/v3/ws`): - Bootstrap is `POST /callback/ws/endpoint` with AppID/AppSecret in the body (no tenant_access_token bearer). Response carries wss URL + ClientConfig (PingInterval / ReconnectInterval / ReconnectNonce / ReconnectCount). - `service_id` is parsed from the wss URL query and used as Frame.Service on every outbound frame. - Wire envelope is the binary protobuf `pbbp2.Frame` (hand-rolled via protowire to avoid pulling the whole SDK in, byte-identical field tags). JSON payloads are nested inside Frame.Payload. - Inbound data frames are ACKed with a `Response{code:200,...}` JSON payload that reuses the inbound headers; infra failures produce code=500 so Lark retries. - Ping is the app-layer binary `NewPingFrame(serviceID)` at the server-supplied cadence; WebSocket protocol PING is removed (Lark ignores it). Server-initiated pings get a pong reply. - ctx-cancel-breaks-read invariant preserved via the watchdog goroutine that closes the conn on ctx.Done; the read loop and ping goroutine serialize their writes through a single mutex. 2. `DispatchResult` outbound replies wired via a new `OutcomeReplier`: - `OutcomeNeedsBinding` mints a one-shot binding token and sends the binding prompt card to the sender's open_id. - `OutcomeAgentOffline` / `OutcomeAgentArchived` push a notice card into the chat with the agent name + Chinese copy matching §4.6. 
- `OutcomeIngested` stays owned by the Patcher; `OutcomeDropped` is silent. - The replier is best-effort: outbound failures are logged and swallowed so a Lark outage cannot stall the inbound pipeline. - Hub installs the noop replier by default; router wires the production `LarkOutcomeReplier` when APIClient.IsConfigured(). PersonalAgent long-conn risk surfaced (open per Feishu docs, which state that long-connection mode only supports enterprise self-built apps: `长连接模式仅支持企业自建应用`). The implementation works for any app archetype; the open question is whether `/callback/ws/endpoint` accepts PersonalAgent credentials in practice. The Lark code+msg from the bootstrap response is surfaced verbatim so an operator running the smoke test sees the exact failure rather than a generic timeout. Co-authored-by: multica-agent <github@multica.ai> * fix(integrations/lark): byte-compat Frame marshal, chunk reassembly, ACK off reply critical path (MUL-2671) Three protocol blockers from Elon's review of `9540008a`: 1. Frame.Marshal is now byte-identical to oapi-sdk-go/v3/ws/pbbp2.Frame: - SeqID/LogID/Service/Method (proto2 req) emit unconditionally even at zero - PayloadEncoding/PayloadType/LogIDNew emit unconditionally per gogo generated MarshalToSizedBuffer (no zero-guard) - Payload uses the SDK's `!= nil` guard (nil omits, []byte{} emits 0-length) - ACK payload JSON matches SDK's NewResponseByCode + json.Marshal output ({"code":N,"headers":null,"data":null}) Golden tests pin exact byte sequences for ping/pong/ACK/full/zero frames; verified against the real SDK pbbp2.pb.go MarshalToSizedBuffer producing identical bytes. 2. Multi-frame events (sum>1) are reassembled via the new chunkAssembler: - 5s sliding TTL (matches SDK combine() cache TTL) - Lazy GC on admit (no separate sweeper goroutine) - Out-of-order seq + duplicate seq idempotent - Partial chunks are NOT ACKed (SDK behaviour: only the final chunk's ACK confirms the whole event so Lark can retry on partial loss) - Connector wires assembler per-Run; state dies with the session 3.
OutcomeReplier detached from ACK critical path: - HubConfig.ReplyTimeout default 2.5s, strictly under Lark's 3s ACK deadline - handleEvent dispatches synchronously (fast DB path), then spawns the replier under a fresh background ctx with WithTimeout(ReplyTimeout) - Hub.replyWg tracks in-flight replies; Hub.Wait / WaitWithTimeout drain them so shutdown is bounded - Noop replier short-circuits inline (no goroutine cost when outbound APIClient isn't configured) Proof tests: - TestHubScheduleReplyReturnsImmediately: scheduleReply with a 10s slow replier returns in <50ms - TestHubReplyTimeoutCancelsHungReplier: hung replier ctx fires at ReplyTimeout - TestHubWaitDrainsInFlightReplies: Wait blocks until replies finish - TestHubACKNotBlockedByOutboundReply: end-to-end through the connector — data-frame ACK lands within 500ms even when the replier hangs 5s PersonalAgent real-env smoke remains Bohan's decision; this PR closes the technical blockers Elon flagged. Co-authored-by: multica-agent <github@multica.ai> * docs(service/issue): narrow position concurrency claim to create-create (MUL-2671) Elon's review of the merge resolution flagged that the comment on the new NextTopPosition call promised more than the code guarantees: concurrent manual reorder via UpdateIssue(position) does NOT take the workspace row lock that IncrementIssueCounter holds, so a create racing a reorder can still land on the same position. Rewrite the comment to only claim create-create serialization, which is the behaviour the lock actually delivers. No code change. Co-authored-by: multica-agent <github@multica.ai> * fix(integrations/lark): keep device-flow polling on RFC 8628 HTTP 400 (MUL-2671) Lark's device-flow polling endpoint returns HTTP 400 with the JSON body `{"error":"authorization_pending"}` while the user hasn't scanned the QR yet — this is the RFC 8628 spec, and the upstream oapi-sdk-go implements the same handling. 
Our previous doForm treated ANY non-2xx as a terminal protocol error, so every install session was killed by the first poll (~5s after begin) and the install dialog appeared silently empty: the frontend received status=error + lark_protocol_error before the user could even read the description. Fix: doForm now decodes the JSON body first; if it parses, the caller (Begin / Poll) routes on the body's `error` field, where the existing switch correctly maps authorization_pending / slow_down to "keep polling" and access_denied / expired_token to terminal failure. Only unparseable bodies (5xx HTML proxy pages, gateway timeouts) still surface as a typed http_NNN RegistrationError. Three regression tests pin the new behaviour: - HTTP 400 + authorization_pending → res.Status="authorization_pending" - HTTP 400 + access_denied → res.Err.Code="access_denied" (terminal) - HTTP 502 + HTML body → http_502 RegistrationError Verified against the live local env: install/begin -> 200, status stays "pending" through the first poll cycle, no longer flips to "error" within seconds. Co-authored-by: multica-agent <github@multica.ai> * fix(views/lark): reset closedRef on every mount so StrictMode double-mount renders QR (MUL-2671) Empty QR dialog body in the dev env: Bohan opened the bind dialog and got an empty white area where the QR should have been — no QR, no "starting" placeholder, no error text. Backend was returning the QR URL correctly; the bug was on the frontend. Root cause: React 19 / Next.js dev StrictMode mounts every component twice (mount → cleanup → mount). The component instance is REUSED across the simulated remount, which means useRef objects are preserved. The dialog's `closedRef` lifecycle: 1. Mount #1: closedRef={current:false}, beginSession() kicked off (HTTP request still in flight) 2. Cleanup runs: closedRef.current=true 3. Mount #2: beginSession() kicked off again, BUT the ref still reads {current:true} from step 2 4. Both promises resolve. 
Both hit the post-await guard `if (closedRef.current) return;` and bail out before setSession(). 5. Result: session stays null forever. Every conditional in the dialog body (beginning/session-pending/success/error) is false → empty body. Fix: reset closedRef.current=false at the START of the effect, not just at component construction. The cleanup-then-mount pair now re-arms the guard so subsequent setSession calls actually land. Regression test wraps the dialog in <StrictMode> and asserts the QR appears within 2s with the correct value — fails closed if anyone removes the reset. Co-authored-by: multica-agent <github@multica.ai> * fix(integrations/lark): drop EventTaskCompleted subscription so the chat reply doesn't get overwritten by "Done." (MUL-2671) Bohan reproduced on the live dev env: agent replies show only a card saying "Done." in Lark, even though Multica's own chat panel has the real "Hello! I'm cc…" reply. Tasks succeed end-to-end, but the user loses the reply on the Lark side. Root cause: TaskService.CompleteTask publishes two events for every chat task IN ORDER: 1. broadcastChatDone(...) → ChatDonePayload{Content: "Hello!..."} 2. broadcastTaskEvent(Completed) → map[string]any{task_id, agent_id,...} (no `content` key) The Patcher subscribed to BOTH and routed each to finalize(). The first patch correctly rendered the reply text, the second patched the same card with an empty payload — chatDoneContent() returned "" and the renderer fell back to "Done." (default empty-body copy). The second patch wins because Lark stores whatever was last applied. Fix: stop subscribing to EventTaskCompleted in the Patcher and remove the corresponding switch arm. EventChatDone is the canonical "agent finished replying" signal for the Lark card path; EventTaskCompleted is still emitted to the bus for other listeners (web UI, analytics, task usage) where the lack of content doesn't matter. 
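The overwrite described above can be reproduced in a few lines. This is a minimal sketch, not the real Patcher: the `Event`, `Card`, `Patcher`, and `completeTask` names are hypothetical stand-ins, and `finalize` only imitates the "empty content falls back to Done." behaviour the commit describes.

```go
package main

import "fmt"

// EventType and Event are hypothetical stand-ins for the bus events
// named in the commit message.
type EventType string

const (
	EventChatDone      EventType = "chat_done"
	EventTaskCompleted EventType = "task_completed"
)

type Event struct {
	Type    EventType
	Content string // empty for task_completed: that payload has no content key
}

// Card stands in for the remote card, which keeps whatever was applied last.
type Card struct {
	Body    string
	Patches int
}

// finalize imitates the old renderer: empty content falls back to "Done.".
func finalize(c *Card, content string) {
	if content == "" {
		content = "Done."
	}
	c.Body = content
	c.Patches++
}

// Patcher only reacts to the event types it subscribed to.
type Patcher struct {
	subscribed map[EventType]bool
	card       *Card
}

func NewPatcher(card *Card, types ...EventType) *Patcher {
	p := &Patcher{subscribed: make(map[EventType]bool), card: card}
	for _, t := range types {
		p.subscribed[t] = true
	}
	return p
}

func (p *Patcher) Handle(ev Event) {
	if !p.subscribed[ev.Type] {
		return
	}
	finalize(p.card, ev.Content)
}

// completeTask publishes the two events in the order the commit describes.
func completeTask(p *Patcher, reply string) {
	p.Handle(Event{Type: EventChatDone, Content: reply})
	p.Handle(Event{Type: EventTaskCompleted})
}

func main() {
	before := &Card{}
	completeTask(NewPatcher(before, EventChatDone, EventTaskCompleted), "Hello!")
	fmt.Printf("both subscribed: %q after %d patches\n", before.Body, before.Patches)

	after := &Card{}
	completeTask(NewPatcher(after, EventChatDone), "Hello!")
	fmt.Printf("chat_done only:  %q after %d patches\n", after.Body, after.Patches)
}
```

With both subscriptions the second, content-less event wins; with only the chat-done subscription the reply text survives and the card is patched once.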
Regression test TestPatcherIgnoresEventTaskCompletedForChatTasks emits ChatDone followed by TaskCompleted on a streaming card and asserts: exactly one patch, body contains the agent reply, body does NOT contain "Done.". If anyone re-adds the EventTaskCompleted subscription, this fails immediately. Co-authored-by: multica-agent <github@multica.ai> * feat(integrations/lark): chat replies as plain text IM messages, not card chrome (MUL-2671) Bohan reported on the live dev env that even with the agent's reply shown correctly, every message is wrapped in an interactive card with the agent name as the header — it feels like a system notification, not a normal chat reply. He wants the reply to land as a regular Lark text bubble. Changes: - Add APIClient.SendTextMessage backed by Lark's /open-apis/im/v1/messages with msg_type=text. JSON-encodes the {"text": ...} envelope Lark requires so callers pass raw strings. - Patcher.Register no longer subscribes to EventTaskQueued / EventTaskRunning. There is no more thinking → running → final card lifecycle on the success path: it added card chrome without buying anything for free-form chat. - On EventChatDone, the new sendChatReply path posts the assistant message content as plain text. Empty content is silently dropped rather than rendered as "Done." (the prior fallback that confused Bohan). - Failure path keeps a one-shot error card on EventTaskFailed — the visual distinction from a normal reply is genuinely useful, and failures are rare enough that the chrome isn't noisy. - Throttle / lastPatched map / MinPatchInterval / shouldPatch / markPatched / loadCardOrSkip are all removed; nothing in the new flow patches. Tests: - TestPatcherSendsPlainTextOnChatDone pins the new contract: exactly one SendTextMessage call, no card sends or patches, content matches the ChatDonePayload. - TestPatcherDropsEmptyChatReply pins the "no more Done. fallback" decision — empty content drops, period. 
- TestPatcherFailEventSendsErrorCard pins the failure path still uses a card (one-shot, no patching). - TestPatcherIgnoresEventTaskCompletedForChatTasks rewritten for text path: ChatDone then TaskCompleted yields exactly one text send, no duplicate. - TestPatcherSkipsWhenNoChatSessionBinding and TestPatcherSwallowsInstallationLoadErrors rewritten to drive EventChatDone (the new entry point) instead of TaskQueued. - TestPatcherSendsThinkingCardOnTaskQueued deleted (no more thinking card). Co-authored-by: multica-agent <github@multica.ai> * feat(integrations/lark): pre-fill PersonalAgent bot name as "<agent> - Multica" (MUL-2823) (#3520) The device-flow install left the bot at Lark's auto-generated "{用户姓名}的智能助手". Lark's registration scene supports pre-filling the name via a `name` query param on the verification/QR URL (mirrors the upstream SDK's AppPreset.Name) — a user-editable default that rides on the QR URL, not the begin POST body (which has no name field). BeginInstall already loads the agent for its ownership check, so we keep it and thread `<agent.Name> - Multica` through Begin → decorateQRCodeURL. A blank name degrades to plain "Multica". There is no post-install rename API (bot/v3 is read-only; no bot/v3/update), so the install-time pre-fill is the only programmatic lever; the user can still edit the name on the creation form. Co-authored-by: J <j@multica.ai> Co-authored-by: multica-agent <github@multica.ai> * fix(integrations/lark): restore /issue confirmation + pin SendTextMessage wire (MUL-2671) Two recovered/added contracts off Trump's review of HEAD `fe381a07`: 1) /issue confirmation in Lark was a casualty of the plain-text refactor. The pre-refactor `RenderInput.IssueNumber` field was declared but never actually rendered into the card body, so even in the original card-based flow the user never saw a "Created [MUL-42]" confirmation. 
Now the OutcomeReplier handles OutcomeIngested + IssueID.Valid by sending a plain text message: Created MUL-42 — fix login bug https://multica.example/issues/MUL-42 Composed from a new DispatchResult.IssueIdentifier + IssueTitle, populated by the Dispatcher from workspace.IssuePrefix + issue.Number / issue.Title. Workspace lookup is best-effort: a Postgres blip on workspace gets a "#42" fallback rather than silently dropping the confirmation. The agent's own chat reply (if any) continues to land separately via the Patcher on EventChatDone — these are two semantically distinct messages and the user benefits from seeing both. 2) SendTextMessage is the wire layer Trump flagged for missing coverage. Three new wire tests pin: - happy path: POST /open-apis/im/v1/messages?receive_id_type=chat_id, msg_type=text, Bearer <tenant_access_token>, double-JSON content envelope - special-character round trip: newlines, double quotes, backslashes, tabs, Chinese + emoji, JSON-lookalike strings. The inner {"text": ...} is encoded once at JSON.Marshal time and once again when the outer body serializes; losing either pass corrupts the message and the bug is invisible without a contract pin. - Lark error path: non-zero `code` surfaces as a wrapped error with the code embedded. Tests: - TestDispatcher_IssueCreationFromCommand asserts IssueIdentifier ("MUL-42") and IssueTitle propagate through DispatchResult. - TestDispatcher_IssueIdentifierFallsBackToNumberOnWorkspaceLookupErr pins the "#7" degrade-graceful fallback. - TestLarkOutcomeReplierIssueCreatedSendsConfirmation pins the text body (identifier + title + deep link) and asserts no card send on this path. - TestLarkOutcomeReplierOutcomeIngestedSilentWithoutIssue pins the silent-on-plain-chat default so we don't accidentally start emitting a confirmation for every message. - TestHTTPClient_SendTextMessage_* covers the wire contract. Frontend locale parity (en + zh-Hans, 53 tests) is currently green on this HEAD; no changes needed. 
Co-authored-by: multica-agent <github@multica.ai> * fix(views/locales): add missing ko keys for Lark MVP (MUL-2671) Trump flagged on PR #3277 review that the ko bundle was missing the Lark-MVP-only keys that en + zh-Hans both carry. The parity test caught it cleanly after main was merged in (Korean PR landed on main between the prior review and this one): common.lark_bind.* (13 keys) settings.page.tabs.lark (1 key) settings.lark.* (45 keys) agents.inspector.section_integrations (1 key) Korean translations are professional/concise — "Lark" stays as the brand name (matches how en keeps "Lark" + "(飞书)" parenthetically; ko users searching for the product expect "Lark"), and product copy follows the zh-Hans tone where Multica nouns ("에이전트", "워크스페이스") are transliterated loan words consistent with the rest of the ko bundle. Slot ordering preserved against EN: - page.tabs.lark sits between github and integrations - inspector.section_integrations sits right after section_skills Verified: pnpm exec vitest run locales/parity → 105/105 pass. Co-authored-by: multica-agent <github@multica.ai> * fix(integrations/lark): /issue origin_type CHECK + Hub restart on credentials rotation (MUL-2671) Two live-env bugs Bohan reproduced: 1) /issue command crashed the WS connector. Dispatcher writes origin_type='lark_chat' on issues born from `/issue`, but the issue_origin_type_check CHECK constraint was last extended in migration 060 for quick_create — it doesn't list lark_chat, so every Lark /issue tripped SQLSTATE 23514 and bubbled up as an infra error. The infra error tore down the WS connector, Lark retried the same message, the new connector tripped the same constraint and crashed again. Repro in the live env: three crashes from the same /issue event over ~40s, each leaving the user with no confirmation in Lark. Migration 111 extends the CHECK list: CHECK (origin_type IN ('autopilot', 'quick_create', 'lark_chat')) 2) Re-scanning an already-bound agent silenced the bot.
The device flow re-registers with Lark, which mints a brand-new bot (fresh app_id + app_secret); RegistrationService.finishSuccess upserts into lark_installation by agent_id, so the row's credentials rotate in place. But the running supervisor held the OLD inst struct by value and kept a WS open against the OLD bot's app_id, so all events to the NEW bot went nowhere. Bohan's "claude code can no longer reply in Feishu" symptom maps exactly to this: log timeline: 16:29:57 cc connector connected with app_id=cli_aa9398dd... (OLD) 16:34:07 lark registration: install complete (rotation) → row.app_id is now cli_aa93f36f... (NEW) → old WS still subscribed to OLD app_id; new app_id receives nothing Fix: Hub.sweep now compares each installation row's credentials fingerprint (app_id + bot_open_id + sha256(app_secret_encrypted)) against the snapshot the running supervisor was started with. On diff, cancel the old supervisor and start a fresh one inline. A monotonic gen counter on the supervisor entry disambiguates the old goroutine's deferred cleanup from the new entry the rotation path already swapped in. Tests: - TestHubRestartsSupervisorOnCredentialsRotation pins the new path: starts hub on app_one, rotates the row to app_two, asserts the connector factory is called again with the fresh AppID. - TestHubDoesNotRestartSupervisorOnUnchangedRow pins the negative case so an unchanged row doesn't degenerate into a per-sweep busy-loop. - Existing hub tests (lease, supervise, shutdown, ACK timing, noop replier) all green. Verification: - go test ./internal/integrations/lark/... -race -count=1 ok - go build ./... 
clean - migration applied locally; \d+ issue confirms lark_chat in CHECK Co-authored-by: multica-agent <github@multica.ai> * fix(integrations/lark): per-supervisor lease token to fence rotation handoff (MUL-2671) Elon flagged a race in HEAD be8d4cef's rotation path: both the old and the new supervisors of the same Hub used the hub-wide nodeID as their WS lease token, so an old supervisor's post-cancel releaseLease(nodeID) would CAS-match the lease row the successor had just acquired with the SAME token and DELETE it. Symptom would be a silently empty lease row a few hundred ms after every device-flow re-scan — no replica owning the install, no events delivered, the "bot goes quiet" pattern Bohan hit the first time but now from the fencing side rather than the credentials side. Fix: leaseToken(nodeID, gen) composes "<nodeID>-g<gen>", where gen is the monotonic counter already attached to each supervisorEntry. The nodeID prefix keeps cross-replica observability (an operator inspecting lark_installation.ws_lease_token can still map back to a process) while the -g suffix makes the OLD supervisor's release target the OLD row state. Once the rotation path swaps in the new supervisor, the row's CurrentToken is the new -g(N+1) token, so the old -gN release's WHERE clause no-ops instead of clobbering. acquireLease / renewLeaseUntil / releaseLease now take an explicit token argument; supervise threads its leaseToken through. The plumbing isn't pretty, but having an explicit argument at every call site is the only way the rotation invariant survives subsequent refactors — without it, a future caller could quietly reintroduce "just use h.nodeID" and the race is back. Two regression tests: - TestHubRotationStaleReleaseDoesNotClearSuccessorLease drives the fake lease state machine directly: 1. old acquires(tokenA) 2. rotation lands; new acquires(tokenB) 3. old's stale release(tokenA) fires Asserts owner ends up still tokenB. 
Hub-wide-nodeID code would fail step 3 by clearing the entry. - TestHubRotationEndToEndKeepsSuccessorLeased runs the same scenario through the live supervise loop: starts hub, rotates the row, waits for sup2 to take over with a distinct token, sleeps past sup1's unwind, asserts the row is still held by a non-sup1 token. Catches the bug even when the goroutine timing is non-deterministic. Verification: go test ./internal/integrations/lark/... -race -count=1 ok go build ./... clean go vet ./... clean Co-authored-by: multica-agent <github@multica.ai> * fix(integrations/lark): route group @-mentions via union_id, not open_id (MUL-2671) In a Lark group with multiple Multica bots installed, the bot whose WS received the event sometimes failed to recognize that it was the @-target while the OTHER bot's supervisor falsely fired. Bohan's controlled three- message test (only @A, only @B, @both) hit this: @A and @B alone went unanswered, @both got picked up by A only. Root cause: the `mentions[].id.open_id` field Lark puts on the WS event is structurally INVERSE to `/bot/v3/info`'s `bot.open_id` across the two WSes. From A's WS perspective, the wire-form open_id for "A was @-ed" is NOT equal to A's API-side open_id, but IS equal to what B's WS sees on its side, and vice versa. The decoder's `mention.open_id == inst.BotOpenID` match therefore fires on the wrong bot in multi-bot groups. Only `union_id` (the Lark-tenant-scoped stable identifier) is consistent across both WSes. Changes: - migration 112 adds nullable `lark_installation.bot_union_id` - sqlc query exposes UpsertLarkInstallation/CreateLarkInstallation with bot_union_id, plus a focused SetLarkInstallationBotUnionID for the backfill path - httpAPIClient.GetBotInfo now follows /bot/v3/info with /contact/v3/ users/{open_id}?user_id_type=open_id and returns both identifiers on BotInfo. 
Soft-fails on contact-scope denial: install still succeeds with an empty UnionID, and the decoder falls back to the legacy open_id match for single-bot deployments. - RegistrationService.finishSuccess persists union_id alongside open_id during the device-flow finalize. - ws_frame_decoder.containsMention prefers union_id and only walks open_id when the installation row has not been backfilled yet. - BackfillBotUnionIDs runs once at server boot for installations created before migration 112; bounded per-row 10s timeout and a pure soft-fail policy so a slow Lark round-trip cannot block startup. - regression tests cover the three decoder paths: union_id match wins over open_id mismatch, union_id mismatch overrides open_id match, and open_id fallback when union_id is unknown. Co-authored-by: multica-agent <github@multica.ai> * chore: drop trailing blank lines at EOF on four files (MUL-2671) git diff --check origin/main..origin/pr-3277 flagged these as new blank lines at EOF; clearing so the diff stays clean for review. Co-authored-by: multica-agent <github@multica.ai> * fix(views/locales): add missing ja keys for Lark MVP + section_integrations (MUL-2671) CI frontend job tripped on the ja locale parity check: ja is missing the lark_bind block in common.json, the lark block + page.tabs.lark in settings.json, and inspector.section_integrations in agents.json. The ko fix earlier covered Korean; ja was added separately on main and the merge surfaced these gaps. Translations mirror the en source and follow the same voice as the existing ja bundle. Co-authored-by: multica-agent <github@multica.ai> * fix(integrations/lark): rewrite @_user_N placeholders into clean body (MUL-2671) When Lark dispatches a group `im.message.receive_v1`, the message text contains opaque `@_user_1`, `@_user_2`, … placeholders and the real identity is in `mentions[]`. 
We were forwarding the raw text to the agent, so a Bohan-typed "@Bot ping test" arrived as "@_user_1 ping test" — neither human-readable nor useful as LLM context, and the agent was paying tokens to figure out which `@_user_N` was even itself. The new resolveMentions pass: * strips the bot's own mention entirely (the dispatcher already routes the event on AddressedToBot; re-emitting @<self> in front of every message adds zero signal and pollutes context), * substitutes other participants with `@<displayName>` so a follow-up "@Alice" reads naturally, * collapses horizontal whitespace introduced by the strip while preserving original newlines. Bot identity check uses the same union_id-preferred + open_id fallback as containsMention, so the rewrite stays consistent with the routing path. Tests cover the four shapes: bot self-mention, mixed bot + other-user mention, multi-line body with stripped mention, and a no-mention body that should be left untouched. Co-authored-by: multica-agent <github@multica.ai> * fix(integrations/lark): union_id-first self mention strip + token-aware scan + local whitespace cleanup (MUL-2671) Three review blockers on the mention rewrite from PR review: 1. isBotMention now mirrors containsMention's union_id-first policy. When the installation row knows our union_id, we trust it exclusively (open_id is structurally inverted in multi-bot groups — matching on it would re-introduce the routing bug we fixed two commits ago). open_id fallback fires only when union_id is absent. New tests: @-ing both bots in one message correctly strips only self and renders the sibling as @<name>; open_id-matches-but-union_id-differs does NOT strip. 2. resolveMentions no longer collapses or trims whitespace globally. Indentation, tabs, code blocks, tables — all preserved verbatim. 
When the self mention is removed we eat exactly one adjacent horizontal space (the one after the placeholder, or, when the mention sits at end-of-input, a single space already emitted right before it). New test exercises a multi-line indented + tabbed body and asserts the whole shape survives. 3. Prefix-collision-safe replacement. A chat with 11+ participants exposes both `@_user_1` and `@_user_10`; naive ReplaceAll for `@_user_1` would mangle the substring of `@_user_10`. The resolver now does a single-pass token scan with the mention list sorted longest-key-first, so the longer placeholder always wins at any scan position. New test covers the @_user_1 / @_user_10 case explicitly. Also drops the temporary INFO-level diag logging the previous commit added — root cause was confirmed (union_id swap in the manual backfill; not a decoder bug). Co-authored-by: multica-agent <github@multica.ai> * fix(integrations/lark): scope inbound dedup per (installation_id, message_id) (MUL-2671) Root cause of the residual "@Cc gets dropped as not_addressed_in_group" even after the union_id swap landed: lark_inbound_message_dedup was keyed on `message_id` alone. In a Lark group chat where the workspace has multiple Multica bots installed, Lark delivers the SAME message_id to every bot's WS supervisor. Whichever WS claimed first then ran its own AddressedToBot check; the bot that was actually @-ed lost the dedup race, found the row already terminal (`processed_at IS NOT NULL`), and was dropped as `duplicate` BEFORE it could evaluate its own mention. Net: every @ silently disappeared if Lark happened to route the OTHER bot's WS first. The dedup gate's original purpose (idempotency against WS reconnect replay) is per-installation by definition, so the right key is composite (installation_id, message_id). 
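The composite dedup key argued for above can be illustrated with an in-memory store, in the spirit of a test fake keyed on (installation_id, message_id). This is a sketch only: the type and method names are hypothetical, and the real gate is a Postgres table, not a map.

```go
package main

import (
	"fmt"
	"sync"
)

// dedupKey is the composite key: one entry per installation per message.
type dedupKey struct {
	InstallationID string
	MessageID      string
}

// dedupStore is a hypothetical in-memory stand-in for the dedup table.
type dedupStore struct {
	mu      sync.Mutex
	claimed map[dedupKey]struct{}
}

func newDedupStore() *dedupStore {
	return &dedupStore{claimed: make(map[dedupKey]struct{})}
}

// Claim reports true only for the first delivery of messageID to the given
// installation. A delivery to a different installation is independent.
func (s *dedupStore) Claim(installationID, messageID string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	k := dedupKey{InstallationID: installationID, MessageID: messageID}
	if _, seen := s.claimed[k]; seen {
		return false
	}
	s.claimed[k] = struct{}{}
	return true
}

func main() {
	s := newDedupStore()
	fmt.Println(s.Claim("inst-A", "om_1")) // first delivery to bot A
	fmt.Println(s.Claim("inst-B", "om_1")) // same message reaching bot B
	fmt.Println(s.Claim("inst-A", "om_1")) // replay to bot A
}
```

With a key on message_id alone, the second call would be rejected as a duplicate before bot B could evaluate its own mention, which is the failure mode described above.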
Changes: - migration 113 drops + recreates lark_inbound_message_dedup with installation_id NOT NULL REFERENCES lark_installation(id) ON DELETE CASCADE and PRIMARY KEY (installation_id, message_id). The table is a 24h transient cache, so dropping existing rows is safe. - sqlc queries: ClaimLarkInboundDedup / MarkLarkInboundDedupProcessed / ReleaseLarkInboundDedup all now take installation_id. - AppendUserMessageParams carries InstallationID through to the in-tx Mark call so the chat_message+dedup atomicity stays intact. - Dispatcher passes inst.ID to claim + applyFinalize + AppendUserMessage. - Test fakes key dedup state on (installation_id, message_id) via a composite map key; all existing pre-seeded rows use a seedDedupKey helper bound to the default activeInstallation fixture so the prior staleness / token-rotation / in-tx mark tests still exercise the same regression they did before. - New regression TestDispatcher_DedupIsScopedPerInstallation pins the multi-bot invariant: a row pre-seeded for installation A does NOT block installation B's first delivery of the same message_id; B runs through its own group-filter / identity / ingest pipeline. Co-authored-by: multica-agent <github@multica.ai> * feat(integrations/lark): render markdown chat replies via schema-2.0 card (MUL-2671) The agent's chat replies were going out as msg_type=text, so every `bold`, fenced code block, list, table, and link in the body showed up as literal markdown characters in Lark — the user saw raw asterisks, hashes, pipes instead of formatted text. Bohan reported this and pointed at zarazhangrui/lark-coding-agent-bridge as the shape to emulate. The bridge repo uses Lark interactive cards with the schema-2.0 envelope and a `tag: "markdown"` body element; Lark's client renders that to formatted text (GFM-ish: bold/italic, headings, lists, links, fenced code blocks, tables, blockquotes). 
They expose multiple reply modes (card / markdown-as-post / text) gated by user config; we go a step simpler — auto-detect markdown syntax in the agent's body and route accordingly: - containsMarkdown(): cheap substring + regex pass for fenced code blocks, headings, list markers, bold/italic, tables, links, blockquotes, horizontal rules, inline code. Biases toward false- positive — wrapping prose in a card still renders fine, but missing a real markdown block leaves raw characters visible. - APIClient gains SendMarkdownCard / SendMarkdownCardParams. Implementation marshals the schema-2.0 envelope verbatim: {schema:"2.0", body:{elements:[{tag:"markdown", content: md}]}}. Stub returns ErrAPIClientNotConfigured. - Patcher.sendChatReply now branches on containsMarkdown: markdown → SendMarkdownCard, plain prose → SendTextMessage. A one-liner "sure, on it" stays as a normal IM bubble (no card chrome); anything with markdown gets the rendered card. Tests: TestContainsMarkdown pins the heuristic across plain prose and ten markdown shapes; TestPatcherRoutesMarkdownReplyToCard and TestPatcherRoutesPlainReplyToText cover the router; new HTTP wire test TestHTTPClient_SendMarkdownCard_HappyPath contract-pins the card envelope (msg_type=interactive, schema 2.0, markdown tag, verbatim body). Full lark suite passes. Co-authored-by: multica-agent <github@multica.ai> * fix(service/issue): route analytics.IssueCreated through obsmetrics.RecordEvent (MUL-2671) CI's TestNoNakedAnalyticsCaptureInHandlersOrServices guard caught the post-merge analytics call in IssueService.captureCreatedAnalytics that still used s.Analytics.Capture(...) directly. Main added that lint to prevent the Prometheus and PostHog sides from drifting — any new analytics.* event must go through obsmetrics.RecordEvent so the business-metrics collector and the PostHog client fire from the same call site. 
Fix mirrors how TaskService handles it: IssueService gains a Metrics obsmetrics.BusinessMetrics field (router wires it via h.IssueService.Metrics = opts.BusinessMetrics next to the existing TaskService line), and the in-service Capture call becomes obsmetrics.RecordEvent(s.Analytics, s.Metrics, ...). This is nil-safe by construction: RecordEvent treats a nil Metrics as PostHog-only. Co-authored-by: multica-agent <github@multica.ai> feat(views/lark): swap Bind CTA for Connected+Manage link when agent already has an installation (MUL-2671) Bohan reported the agent-detail Bind button keeps inviting the user to re-scan the QR even when the agent already has an active Lark PersonalAgent connected, and re-scanning silently upserts the installation row, leaving the previously-created Lark bot dangling as a zombie. Frustrating UX and an actual product footgun. Anti-zombie guard at the only entry point: LarkAgentBindButton now checks the cached installations listing for an active row pinned to this agent_id. When one exists, the install CTA is gone, replaced by a small Connected pill + a "Manage in Lark" link that opens the Bot's app page in Lark's developer console (open.feishu.cn/app/<app_id>) in a new tab. That's where scopes, display name, and additional permission requests actually live; re-scanning never was the right answer for managing an existing bot. Scoping is per-agent: an active installation on a DIFFERENT agent in the same workspace doesn't affect this agent's button, and a revoked installation falls back to the bind CTA so the user can re-create. Tests cover all four states (own-active / own-revoked / other-agent-active / no-installation) and pin the Manage link's href + target=_blank + noopener. i18n: three new keys in settings.json (en / zh-Hans / ja / ko): agent_bot_connected_label, agent_bot_manage_link, agent_bot_manage_tooltip. Locale parity test still 157/157. 
The dev console host is hardcoded to open.feishu.cn — operators on the Lark international tenant currently get the wrong host; future-proof fix wants the backend to surface a per-installation dev_console_url on the listings response, called out in a code comment. Co-authored-by: multica-agent <github@multica.ai> * feat(views/settings): collapse Lark into Integrations + render agent identity (MUL-2671) Lark was its own top-level workspace settings tab while Integrations sat empty next to it. As more integrations land, the sidebar would balloon with one tab per provider. Move the Lark surface into Integrations as the first hosted integration; the old ?tab=lark URL redirects through LEGACY_WORKSPACE_TAB_REDIRECTS so bookmarks still resolve. The Connected bots list was leaking the raw Lark app_id (cli_…) as the row title with bot_open_id (ou_…) underneath — meaningless to product users. Since the binding is 1:1 with a Multica Agent, join on agent_id and render the agent's avatar + name via the workspace-standard ActorAvatar + useActorName.getAgentName. Deleted agents fall back to "Unknown Agent" so the row is still actionable for cleanup. Tests: stub useActorName + ActorAvatar in lark-tab.test.tsx and add LarkTab connected-bot tests covering the agent identity render and the deleted-agent fallback. Drop the now-dead integrations.* + page.tabs.lark + lark.bot_open_id_label keys across all four locales — parity still 157/157, views suite 1141/1141. Co-authored-by: multica-agent <github@multica.ai> * feat(views/settings): wrap Lark in a named section inside Integrations (MUL-2671) Integrations is meant to host multiple providers (Slack, Linear etc. as they land), so the Lark content should sit under a Lark heading rather than fill the tab directly — otherwise the first additional integration would feel like it broke the IA. 
Add a "Lark" / "飞书" section heading above LarkTab using the same h2 chrome the other settings tabs use, and pin lark.section_title across all four locales (parity 169/169). Co-authored-by: multica-agent <github@multica.ai> --------- Co-authored-by: multica-agent <github@multica.ai> Co-authored-by: J <j@multica.ai>	2026-06-03 19:12:14 +08:00
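The per-supervisor lease token from the rotation-handoff fix in this commit can be sketched as below. The "<nodeID>-g<gen>" format comes from the commit message; the leaseRow type is a hypothetical in-memory stand-in for the lease row, and the sketch assumes the successor can take over unconditionally (expiry handling is omitted).

```go
package main

import (
	"fmt"
	"sync"
)

// leaseToken composes "<nodeID>-g<gen>" so that two supervisors on the same
// node never share a token.
func leaseToken(nodeID string, gen uint64) string {
	return fmt.Sprintf("%s-g%d", nodeID, gen)
}

// leaseRow is a hypothetical fake of the lease state for one installation.
type leaseRow struct {
	mu    sync.Mutex
	owner string
}

// acquire records token as the current owner (takeover rules omitted).
func (l *leaseRow) acquire(token string) {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.owner = token
}

// release clears the row only when token still matches the current owner,
// mimicking a compare-and-delete WHERE clause.
func (l *leaseRow) release(token string) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.owner != token {
		return false
	}
	l.owner = ""
	return true
}

func (l *leaseRow) current() string {
	l.mu.Lock()
	defer l.mu.Unlock()
	return l.owner
}

func main() {
	row := &leaseRow{}
	oldTok := leaseToken("node-1", 1)
	newTok := leaseToken("node-1", 2)

	row.acquire(oldTok)             // old supervisor holds the lease
	row.acquire(newTok)             // rotation: successor takes over
	released := row.release(oldTok) // old supervisor's late cleanup

	fmt.Println(released, row.current())
}
```

If both supervisors passed the bare node ID instead, the late release would match the successor's row and clear it, which is the race the commit describes.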
LinYushen	9c9afd4a66	feat(metrics): BusinessSamplerCollector for active users / queued / runtime gauges (MUL-2947) (#3706 ) * feat(metrics): scrape-time BusinessSamplerCollector for active users / queued / runtime gauges (MUL-2947) Adds an opt-in prometheus.Collector that runs a fixed set of read-only SQL queries on every /metrics scrape and exposes the results as gauges: - multica_active_users{window=5m\|1h\|24h} - multica_active_workspaces{window=...} - multica_agent_task_queued{source} - multica_agent_task_running{source,runtime_mode} - multica_agent_task_stuck_total{source} - multica_runtime_online{runtime_mode,provider} - multica_runtime_heartbeat_age_seconds{runtime_mode} (histogram) - multica_workspace_total Plus a self-introspection histogram multica_business_sampler_query_seconds{name=...} and a counter multica_business_sampler_query_errors_total{name=...} so the sampler's own behaviour is observable on /metrics. Production-safety contract per the PR4 brief: - every query runs in its own BEGIN READ ONLY tx with SET LOCAL statement_timeout = '500ms' (configurable) - the sampler takes a dedicated pgxpool.Pool option so operators can isolate it from business traffic - successful results are cached for 5–10s (default 8s) to absorb concurrent scrapes from multiple Prometheus replicas - every SQL has a hard LIMIT 100 fallback - all label values flow through the existing BusinessMetrics NormalizeTaskSource / NormalizeRuntimeMode / NormalizeRuntimeProvider whitelists, so a misbehaving runtime cannot inflate cardinality - sampler is OPT-IN via RegistryOptions.BusinessSampler — existing callers that only pass Pool keep their current behaviour and never start hitting the DB on /metrics Tests cover: emit shape, TTL cache (one DB call per N scrapes), bounded cardinality under malicious labels, opt-out (no leakage), and DB-hang isolation (unreachable host -> /metrics returns within 5s, query_errors_total advances). 
Refs MUL-2947 (depends on PR2 / MUL-2948, merged in #3695). Co-authored-by: multica-agent <github@multica.ai> fix(metrics): address PR4 review — wire sampler in main.go, fix LIMIT bug, add live-DB statement_timeout test Three fixes from 大彪's review on #3706: 1. main.go was building NewRegistry without the BusinessSampler option, so the collector was effectively dead code in prod. Now constructs a dedicated 2-conn pgxpool (newSamplerDBPool) from the same DATABASE_URL when METRICS_ADDR is set, plumbs it into RegistryOptions.BusinessSampler, and defers Close() at shutdown. A pool-build failure logs and disables the sampler instead of taking down the server. 2. queryActiveUsers / queryActiveWorkspaces previously wrapped the distinct-user/workspace subquery in a 'LIMIT 100', then COUNT()'d the result — capping the active-user gauge at 100 regardless of reality. Removed the inner LIMIT; the COUNT scalar is one row anyway, and metric cardinality is bounded by the fixed samplerWindows allow-list, not by the SQL shape. 3. The previous DB-hang test only exercised the acquire-fails path. Added business_sampler_pgsleep_test.go which connects to a live Postgres (skips cleanly when DATABASE_URL is not set), runs SELECT pg_sleep(2) inside a sampler-style tx with SET LOCAL statement_timeout = '500ms', and asserts: - the call returns in well under 1.5 s (proving the server-side cancellation, not just our caller-side context) - query_errors_total{name=pg_sleep_canary} advances - the duration histogram records the cancellation Verified locally: 550 ms, SQLSTATE 57014 'canceling statement due to statement timeout' — exactly the safety net the PR claims. Refs MUL-2947 / PR #3706. Co-authored-by: multica-agent <github@multica.ai> test(metrics): assert SQLSTATE 57014 on pg_sleep cancellation The previous assertion only checked that the query was cut off in well under the sleep duration, which a caller-side context cancellation would also satisfy. 
Capturing the inner pgconn.PgError and asserting Code == "57014" ("query_canceled") nails down that Postgres itself cancelled the statement because of the SET LOCAL statement_timeout — so a regression that drops the SET LOCAL line fails this test loudly instead of silently passing on context cancellation. Refs MUL-2947 / PR #3706 review nit. Co-authored-by: multica-agent <github@multica.ai> --------- Co-authored-by: multica-agent <github@multica.ai>	2026-06-03 17:50:11 +08:00
LinYushen	de900b2ba6	feat(server): funnel/community/commercial business metrics + PostHog pairing (MUL-2949) (#3698 ) * feat(server): funnel/community/commercial business metrics + PostHog pairing (MUL-2949) PR3 of the Grafana board metrics split (parent MUL-2328). Adds 23 new Prometheus counter/histogram families to the PR2 BusinessMetrics collector covering the activation/community/commercial funnels, and binds every PostHog event emission to a matching metric increment so the two sides cannot drift. Funnel: signup, workspace_created, team_invite_sent/accepted, onboarding_, cloud_waitlist_joined. Content: issue_created, chat_message_sent, agent_created, squad_created, autopilot_created, issue_executed. Runtime: runtime_registered/ready/failed/offline + ready_seconds histogram, daemon_ws_message_received_total. Autopilot: autopilot_run_started/terminal/skipped. Webhook/GitHub: webhook_delivery_total, github_event_received_total, github_pr_review_total, github_pr_merge_seconds histogram. CloudRuntime: cloudruntime_request_total + duration histogram, wired through a small RequestRecorder interface so the cloudruntime package stays decoupled from metrics. Commercial: feedback_submitted, contact_sales_submitted. The pairing helper metrics.RecordEvent(client, m, ev) emits the PostHog event AND increments the matching counter via IncForEvent dispatch, reading labels from the analytics event Properties. Every existing h.Analytics.Capture(analytics.X(...)) call site has been migrated to the helper across handler/, service/, and cmd/server/runtime_sweeper.go. Lint enforcement (server/internal/metrics/business_pairing_test.go): - TestEveryAnalyticsEventHasPrometheusCounter: every Event constant in analytics/events.go either dispatches via IncForEvent or is in the taskMetricEvents allow-list (PR2 typed RecordTask* methods). - TestNoNakedAnalyticsCaptureInHandlersOrServices: AST-walks handler/ service/cmd-server for direct Analytics.Capture(...) 
calls — only service/task.go's captureTaskEvent helper is allow-listed. - TestEveryAnalyticsRecordEventTakesAnalyticsHelper: validates the third arg of every metrics.RecordEvent call is built from analytics.. Cardinality protection: all new label values pass through fixed allow-lists in labels_pr3.go; unknown values collapse to 'other'/'unknown'/'error'. Refs: - Spec MUL-2328 / MUL-2949. - Builds on PR2 (MUL-2948) — collectors registered through the same BusinessMetrics struct, no separate Registry. - Uses PR1's taskfailure.Reason (MUL-2946) for runtime_failed's failure_reason label via NormalizeFailureReason. Out of scope: Sampler-class metrics (PR4 / MUL-2947), pr_review_total emission point (no review event handler exists yet — counter is defined, TODO to wire up when /api/webhooks/github grows pull_request_review handling). Co-authored-by: multica-agent <github@multica.ai> fix(server): tighten PR3 review items — signup_source bucket, fill platform/kind/form_source enums, onboarding_started server emission, lint scope (MUL-2949) Addresses 张大彪's review on #3698: 1. signup_source: NormalizeSignupSource added to labels_pr3.go with a fixed allow-list bucket (direct/google/twitter/linkedin/.../other). Parses JSON cookie payload for utm_source/source/referrer fields, strips URL schemes, maps well-known hostnames to channel buckets. PostHog event still ships the raw cookie value for analytics; only the Prometheus label is bucketed. 2. Filled the unknown/other label gaps: - analytics.IssueCreated and analytics.ChatMessageSent now take a platform parameter sourced from middleware.ClientMetadataFromContext (X-Client-Platform header) at the handler. Autopilot-originated issues stamp PlatformServer. - analytics.FeedbackSubmitted now takes a kind parameter; CreateFeedback reads req.Kind (default "general") so the picker selection lights up the metric's kind label instead of long-term "other". 
- analytics.ContactSalesSubmitted now takes a formSource (page / onboarding / agents_page); CreateContactSales reads req.Source. The metric reads ev.Properties["form_source"] so the analytics CoreProperties.Source ("marketing_contact_sales") stays backward-compat for PostHog dashboards. 3. analytics.OnboardingStarted helper added; server-side emission lives in PatchOnboarding, fired exactly once per user on the first PATCH that carries a non-empty questionnaire payload (firstTouch logic compares prior bytes against {} / null). Frontend onboarding_started keeps firing on page open; the server emission is what guarantees the Prometheus counter exists so Grafana can be cross-checked against the PostHog funnel without depending on the SDK roundtrip. 4. business_pairing_test.go tightened: - TestNoNakedAnalyticsCaptureInHandlersOrServices now allow-lists at function granularity (just captureTaskEvent in service/task.go), not whole-file. Any future naked Capture in the same file fails CI. - TestEveryAnalyticsRecordEventTakesAnalyticsHelper now does def-use tracking inside the enclosing FuncDecl: when RecordEvent's third arg is an ast.Ident, the test walks the function body for the assignment that defined it and confirms the RHS is an analytics.<Helper>(...) call. Bare local idents that didn't originate from analytics are now caught. 5. gofmt -w applied across the touched files; gofmt -l clean. Tests: go test ./internal/metrics/... ./internal/analytics/... pass. Pre-existing TestClaimTask_/TestWebhook_MergedPR/TestDeleteIssueByIdentifier failures on origin/main are DB-environment-dependent and not regressions from this change. Co-authored-by: multica-agent <github@multica.ai> fix(server): normalise onboarding_started platform label + regression test (MUL-2949) Addresses 张大彪's last review nit: - IncForEvent's EventOnboardingStarted case now wraps the platform property with NormalizePlatform, matching every other platform-bearing metric. 
A misbehaving frontend can no longer leak a raw X-Client-Platform header value into the multica_onboarding_started_total{platform=...} series. - New labels_pr3_test.go covers every PR3 normalizer with both a happy-path value and an unknown value, asserting the unknown collapses to the documented fallback bucket. Includes a focused regression for onboarding_started: emits one event with an attacker-shaped platform string and asserts the metric only exposes web + unknown label values (no raw header bleed). - testutil.go gains a small GatherForTest helper so the regression test can pull the typed MetricFamily map without re-implementing the registry-walk dance. Co-authored-by: multica-agent <github@multica.ai> * fix(server): NormalizeTaskSource on workspace_created + document lint limitations (MUL-2949) Final review touch-ups before merge: - IncForEvent's EventWorkspaceCreated case wraps source through NormalizeTaskSource, matching the other source-bearing dispatches (issue_created, agent_created, issue_executed). Closes the last raw property leak in the dispatcher table. - business_pairing_test.go inline docstrings now spell out the two known limitations of the lint gate that 张大彪 / Eve flagged: analyticsBackedIdents matches by ident NAME (not SSA def-use, so a nested-scope shadow could pass) and isMetricsRecordEvent hard-codes the import alias set. PR description carries a Follow-ups section with the same two items so the work is visible after merge. Co-authored-by: multica-agent <github@multica.ai> --------- Co-authored-by: 魏和尚 <agent+wei@multica.ai> Co-authored-by: multica-agent <github@multica.ai>	2026-06-03 16:39:06 +08:00
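The pairing idea from the commit above (one call site that both ships the analytics event and increments a counter whose labels go through an allow-list) can be sketched as follows. Everything here is a simplified stand-in: the interfaces, the recordEvent signature, and the platform allow-list are assumptions, not the repository's metrics.RecordEvent or NormalizePlatform.

```go
package main

import "fmt"

// Event is a hypothetical, reduced analytics event.
type Event struct {
	Name       string
	Properties map[string]string
}

// Capturer stands in for the PostHog client; Counters for the Prometheus side.
type Capturer interface{ Capture(Event) }
type Counters interface{ Inc(event, platform string) }

// knownPlatforms is an assumed allow-list; the real set may differ.
var knownPlatforms = map[string]struct{}{
	"web": {}, "desktop": {}, "mobile": {}, "cli": {}, "server": {},
}

// normalizePlatform collapses anything outside the allow-list to "unknown"
// so a raw header value can never become a new label value.
func normalizePlatform(raw string) string {
	if _, ok := knownPlatforms[raw]; ok {
		return raw
	}
	return "unknown"
}

// recordEvent sends the raw event to analytics and increments the paired
// counter with normalized labels. A nil Counters means analytics-only.
func recordEvent(c Capturer, m Counters, ev Event) {
	if c != nil {
		c.Capture(ev)
	}
	if m != nil {
		m.Inc(ev.Name, normalizePlatform(ev.Properties["platform"]))
	}
}

// recordingCapturer and mapCounters are small fakes for demonstration.
type recordingCapturer struct{ events []Event }

func (r *recordingCapturer) Capture(ev Event) { r.events = append(r.events, ev) }

type mapCounters map[string]int

func (m mapCounters) Inc(event, platform string) { m[event+"|"+platform]++ }

func main() {
	rc := &recordingCapturer{}
	counters := mapCounters{}
	recordEvent(rc, counters, Event{Name: "onboarding_started", Properties: map[string]string{"platform": "web"}})
	recordEvent(rc, counters, Event{Name: "onboarding_started", Properties: map[string]string{"platform": "x'; DROP"}})
	fmt.Println(len(rc.events), counters)
}
```

The analytics side keeps the raw property while the metric label is bucketed, matching the split the commit describes for signup_source.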
LinYushen	24ea169d89	fix(migrate): serialize startup migrations with pg advisory lock (#3658) cmd/migrate previously ran a check-then-apply loop on a *pgxpool.Pool with no locking, so two backend pods starting at the same time (multi-replica Deployment, scale-up, or a manual run overlapping with pod startup) could both pass the EXISTS check on a pending migration and race on the DDL or the schema_migrations INSERT, crashing the loser. Take a single connection from the pool, hold a session-level pg_advisory_lock for the entire migration loop, and release it on the way out. We use the blocking variant so a late arriver queues behind the current runner and then no-ops on the EXISTS checks instead of crash-looping. The loop deliberately stays outside a transaction so existing CREATE INDEX CONCURRENTLY migrations keep working. Also refresh the values.yaml / backend.yaml comments next to backend.replicas: the chart still ships replicas: 1 by default, but that is now a recommendation (Recreate strategy, no leader split), not a correctness requirement. Refs https://github.com/multica-ai/multica/issues/3647 Co-authored-by: multica-agent <github@multica.ai>	2026-06-03 15:51:03 +08:00
Bohan Jiang	5900d8b637	fix(issues): make start_date/due_date timezone-stable calendar days (#3618 ) (#3692 ) * fix(issues): store start_date/due_date as DATE, not timestamp (MUL-2925) These fields are calendar days (the pickers offer no time-of-day), but were stored as TIMESTAMPTZ. A client serializing local midnight via toISOString() folded its timezone into the instant, so the day shifted by the local offset (GH #3618). Migrate the columns to DATE and parse/serialize date-only "YYYY-MM-DD". ParseCalendarDate still accepts legacy RFC3339 (truncated to the UTC day) so older clients keep working. Co-authored-by: multica-agent <github@multica.ai> * fix(issues): render start_date/due_date as timezone-stable calendar days (MUL-2925) Pickers now emit date-only "YYYY-MM-DD" (local calendar day) instead of toISOString(), and every read formats via the shared @multica/core/issues/date helpers with timeZone:"UTC" so the day never shifts with the viewer's offset. The Gantt's existing UTC bucketing is now correct. Covers web/desktop pickers, quick-set menu, list/board/detail/activity, and the mobile due-date picker. Co-authored-by: multica-agent <github@multica.ai> * fix(issues): address date-only review — loud-fail ambiguous dates, finish display sweep (MUL-2925) Review follow-ups on #3692: - ParseCalendarDate no longer silently truncates a legacy non-midnight RFC3339 to the wrong UTC day; it accepts only YYYY-MM-DD or an exact UTC-midnight instant and rejects ambiguous ones loudly. Adds util unit tests. - migration 112 pins the TIMESTAMPTZ->DATE conversion to UTC explicitly via AT TIME ZONE 'UTC' (was session-timezone dependent); down migration too. - Convert remaining date-change display sites to formatDateOnly: inbox detail label (web) and mobile activity + inbox labels (were new Date()+local format). - CLI --start-date/--due-date help now says YYYY-MM-DD, not RFC3339. 
Co-authored-by: multica-agent <github@multica.ai> --------- Co-authored-by: J <j@multica.ai> Co-authored-by: multica-agent <github@multica.ai>	2026-06-03 14:34:01 +08:00
Multica Eve	a72fb020de	Add business metrics collectors (#3695) Co-authored-by: Eve <eve@multica-ai.local> Co-authored-by: multica-agent <github@multica.ai>	2026-06-03 14:32:44 +08:00
Naiyuan Qing	f2f17e3355	Optimize chat message loading (#3685) * Optimize chat message loading Co-authored-by: multica-agent <github@multica.ai> * Fix chat history cursor pagination Co-authored-by: multica-agent <github@multica.ai> * Fix chat session list remount key Co-authored-by: multica-agent <github@multica.ai> * fix(chat): fall back to legacy /messages when paged endpoint 404s Deployment-order compatibility: a backend deployed before the /messages/page endpoint existed returns 404 for the unknown route. The cursorless initial page now falls back to the legacy full-list /messages endpoint and wraps it in a single has_more:false page, so chat never white-screens regardless of which side deploys first. A 404 on a cursor request still propagates to avoid duplicating the full list. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Co-authored-by: multica-agent <github@multica.ai> --------- Co-authored-by: multica-agent <github@multica.ai> Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-03 13:47:30 +08:00
Naiyuan Qing	e1a5310780	feat(cli): add skill content file and stdin input (#3652) * feat(cli): add skill content file and stdin input Co-authored-by: multica-agent <github@multica.ai> * test(cli): set skill server env for flag validation Co-authored-by: multica-agent <github@multica.ai> --------- Co-authored-by: multica-agent <github@multica.ai>	2026-06-02 17:25:37 +08:00
Naiyuan Qing	e36f874c86	feat: add additive agent skill assignment (#3642) * feat: add additive agent skill assignment Co-authored-by: multica-agent <github@multica.ai> * test: cover cross-workspace agent skill add Co-authored-by: multica-agent <github@multica.ai> --------- Co-authored-by: multica-agent <github@multica.ai>	2026-06-02 15:02:24 +08:00
Naiyuan Qing	dd4d58f20e	feat: add skill search CLI (#3601) Co-authored-by: multica-agent <github@multica.ai>	2026-06-01 15:19:42 +08:00
Naiyuan Qing	2b2888c23a	Handle duplicate skill imports as structured results (#3599) Co-authored-by: multica-agent <github@multica.ai>	2026-06-01 14:45:16 +08:00
Naiyuan Qing	3c8645e546	feat(cli): add squad member set-role (#3583) Co-authored-by: multica-agent <github@multica.ai>	2026-06-01 12:51:15 +08:00
Naiyuan Qing	cb2aab2f5c	feat(cli): list issue pull requests (#3581) Co-authored-by: multica-agent <github@multica.ai>	2026-06-01 09:44:59 +08:00
LinYushen	e024348c1f	fix(cli/login): accept mcn_ Cloud Node PATs alongside mul_ (MUL-2815) (#3518) * fix(cli/login): accept mcn_ Cloud Node PATs alongside mul_ (MUL-2815) multica login --token rejected anything not starting with mul_, so users with a Multica Cloud Node PAT (mcn_ prefix) hit "invalid token format: must start with mul_" even though the server middleware verifies both kinds. Replace the inline literal check with validateLoginTokenPrefix(), backed by a small loginTokenPrefixes list ({mul_, auth.CloudPATPrefix}) so the accepted set has one source of truth. Add unit-test coverage so adding a new prefix in future is an obvious one-line edit. Co-authored-by: multica-agent <github@multica.ai> * fix(cli/login): mention mcn_ Cloud Node PATs in --token help and comments Follow-up to `47e423c4`: the login command now accepts mcn_ tokens but the help string and surrounding comments still only documented mul_, so a user running 'multica login --help' couldn't tell that mcn_ was supported. Update the --token help string and the cobra Args / NoOptDefVal comments to list both mul_... and mcn_... prefixes. Co-authored-by: multica-agent <github@multica.ai> --------- Co-authored-by: multica-agent <github@multica.ai>	2026-05-29 15:55:09 +08:00
Bohan Jiang	75b5be3f8e	feat(comments): roots-only thread stats + summary projection for comment list (MUL-2809) (#3505) * feat(comments): roots-only thread stats + summary projection for comment list Enrich the roots_only read so each root carries reply_count (recursive descendant count) and last_activity_at (MAX created_at over the subtree), letting an agent triage which thread to open without fetching any replies. Add an orthogonal summary=true projection (--summary) that clips each returned comment's content to a fixed budget and sets content_truncated, so an agent can scan a list cheaply before pulling a full body. It composes with every read mode (default, since, thread, recent, roots_only). New response fields are optional (omitempty) and only populated for the agent-facing query params, so the default response shape is unchanged for the desktop/web and existing CLI callers. Co-authored-by: multica-agent <github@multica.ai> * test(comments): cover roots_only + summary composition end-to-end The summary projection composing with roots_only is the spec's headline "table of contents" read, but it was only exercised at the CLI param-forwarding level; no handler test asserted that a roots_only response both clips content AND keeps reply_count / last_activity_at. A refactor moving the clip into a per-mode branch would silently break that composition with no failing test. Add TestListComments_RootsOnlySummaryComposes: a long root + a reply, read via roots_only=true&summary=true, asserting the root is clipped (content_truncated=true) while its subtree stats still surface. Co-authored-by: multica-agent <github@multica.ai> * refactor(comments): address review nits on roots stats + summary - ListRootComments[Since]ForIssue: scope the recursive membership walk to a selected_roots CTE (the @row_limit page, with the @since cut applied up front) so stats are only computed over the subtrees of the roots actually returned, instead of every thread in the issue.
- summarizeContent: scan by rune and stop at the budget+1th rune instead of allocating a full []rune for the whole body, so a pathologically long comment costs only the budget under summary mode. Add a multi-byte (CJK) test to lock rune-boundary clipping. Co-authored-by: multica-agent <github@multica.ai> --------- Co-authored-by: J <j@multica.ai> Co-authored-by: multica-agent <github@multica.ai>	2026-05-29 12:59:53 +08:00
Fangfei	c730e906b9	feat(cli): add roots-only issue comment listing (MUL-2805) (#3288)	2026-05-29 12:03:38 +08:00
Bohan Jiang	90ddfb04e2	feat(self-host): DISABLE_WORKSPACE_CREATION env var (MUL-2777) (#3441) * feat(self-host): DISABLE_WORKSPACE_CREATION env var (MUL-2777, #3433) When self-hosters set DISABLE_WORKSPACE_CREATION=true, POST /api/workspaces returns 403 for every caller and the UI hides every "Create workspace" affordance (sidebar, modal, /workspaces/new page, onboarding Step 2). This closes the gap where ALLOW_SIGNUP=false still let any signed-in user open an isolated workspace the platform admin couldn't see. - server: new Config.DisableWorkspaceCreation, gate in CreateWorkspace, workspace_creation_disabled in /api/config, Go tests. - frontend: new workspaceCreationDisabled in configStore, hide sidebar entry, swap NewWorkspacePage / CreateWorkspaceModal / onboarding StepWorkspace to a "creation disabled, ask for invite" state when the flag is on, EN + zh-Hans locale strings. - ops: .env.example, docker-compose.selfhost, helm values + configmap, SELF_HOSTING.md, SELF_HOSTING_ADVANCED.md, environment-variables docs (EN + zh). Co-authored-by: multica-agent <github@multica.ai> * fix(onboarding): drive create path off workspaceCreationAllowed (#3433) PR #3441 review: when DISABLE_WORKSPACE_CREATION=true and the user already has a workspace, StepWorkspace still walked the resume copy (`headline_resume` / `lede_resume` mentioning "or start another") and `creatingActive` ignored the flag, leaving a stale clickable create CTA possible if /api/config arrived late. Refactor StepWorkspace to derive a single `workspaceCreationAllowed` boolean from the config store. It now drives: - Initial `mode` state (defaults to "existing" when disabled + reusing so the CTA is pre-armed for the only valid action). - `creatingActive` so the footer CTA cannot fall back into the create branch even mid-render. - Eyebrow / headline / lede strings: adds `creation_disabled_{eyebrow,headline,lede}_resume` (EN + zh-Hans) for the disabled + reusing variant.
Tests: cover the three reachable shapes — flag off + no existing, flag on + no existing, flag on + existing. Co-authored-by: multica-agent <github@multica.ai> --------- Co-authored-by: J <j@multica.ai> Co-authored-by: multica-agent <github@multica.ai>	2026-05-28 16:42:08 +08:00
LinYushen	3943358e67	feat(billing): proxy /api/cloud-billing/* + Stripe webhook to multica-cloud (#3434)	2026-05-28 16:05:19 +08:00
Bohan Jiang	4864831721	MUL-2744: feat(auth): auto-renew daemon PAT in-place within 7-day window (#3360) * MUL-2744: feat(auth): auto-renew daemon PAT in-place within 7-day window Daemons currently hold a 90-day PAT and have no renewal path: once the token's expires_at passes, every request 401s and the user has to find the silent failure in the daemon log and re-run `multica login`. This adds an in-place renewal: - New `POST /api/tokens/current/renew` (Auth-protected, mul_ only). The server checks remaining lifetime: ≥ 7 days is a no-op; < 7 days bumps expires_at to now + 90 days via a guarded UPDATE that makes concurrent renews idempotent (the WHERE expires_at < $2 clause means only one writer wins; the loser sees pgx.ErrNoRows and reports the already-extended value). No raw token rotation: the same secret stays in every CLI/daemon process sharing the config. - Daemon-side `tokenRenewalLoop`: fires once on startup (covers machine-was-off cases) and then every 3 days. With a 7-day server threshold this gives at least two renewal attempts before the window closes, so a single network blip can't push the token out. - 401 fallback: when the renew call comes back 401 (token already revoked/expired), the daemon logs a user-actionable WARN telling the operator to run `multica login`, instead of the current silent failure mode. Loop keeps running so the warning repeats until fixed. PAT cache (auth.AuthCacheTTL = 10m) doesn't need invalidation: the next miss after the UPDATE re-reads the row and re-caches with the bumped TTL automatically. Co-authored-by: multica-agent <github@multica.ai> * MUL-2744: fix(auth): renew PAT before first sync; CAS against renewal threshold Addresses the two issues Elon raised on #3360. Must-fix: if the PAT is already revoked/expired when the daemon starts, syncWorkspacesFromAPI 401s and Run returns before the background tokenRenewalLoop ever fires its initial renewal.
The operator only sees a generic auth failure in the workspace-sync log with no hint that 'multica login' is the fix. Now the startup path runs an inline tryRenewToken first, surfacing the existing 401 WARN before anything else gets a chance to fail. Pulled the renew + first-sync pair into preflightAuth so the ordering invariant is enforced at one site and tests can exercise the failure modes without spinning up the full Run setup. Removed the redundant initial tryRenewToken from tokenRenewalLoop — startup now owns the first call. Nit: the previous WHERE clause on ExtendPersonalAccessTokenExpiry (expires_at < $2) did not actually make concurrent renews idempotent the way the comment claimed. Two callers race-computing $2 = now + 90d produce strictly-different values, and the second writer's $2 always exceeds the row the first writer just wrote, so the UPDATE re-matches and bumps again. Switched to a CAS against the renewal threshold (expires_at <= $renew_threshold_at, i.e. now + 7d): once writer A pushes expires_at past the threshold, writer B's UPDATE matches zero rows and the loser falls back to reporting the already-extended value as a no-op. Tests: - TestPreflightAuth_RenewsBeforeWorkspaceSyncOnExpiredToken locks in the call ordering — renew endpoint is hit before workspaces, and the re-login WARN appears even though both endpoints 401. - TestPreflightAuth_SyncProceedsWhenRenewIsNoOp covers steady-state startup: a renew=false no-op must still progress to workspace sync. - TestPreflightAuth_TransientRenewFailureDoesNotBlockStartup covers a 500 from the renew endpoint — startup must continue, no WARN. - TestRenewPAT_ParallelRenewExtendsExactlyOnce fires N=8 concurrent renews at one row and asserts exactly one returns renewed=true with the others reporting the same already-extended expires_at, plus the DB carries only that single bumped value. 
Co-authored-by: multica-agent <github@multica.ai> --------- Co-authored-by: J <j@multica.ai> Co-authored-by: multica-agent <github@multica.ai>	2026-05-27 22:22:26 +08:00


436 Commits