multica/server/internal/handler/onboarding.go
LinYushen de900b2ba6 feat(server): funnel/community/commercial business metrics + PostHog pairing (MUL-2949) (#3698)
* feat(server): funnel/community/commercial business metrics + PostHog pairing (MUL-2949)

PR3 of the Grafana board metrics split (parent MUL-2328).

Adds 23 new Prometheus counter/histogram families to the PR2 BusinessMetrics
collector covering the activation/community/commercial funnels, and binds
every PostHog event emission to a matching metric increment so the two sides
cannot drift.

Funnel: signup, workspace_created, team_invite_sent/accepted, onboarding_*,
cloud_waitlist_joined.
Content: issue_created, chat_message_sent, agent_created, squad_created,
autopilot_created, issue_executed.
Runtime: runtime_registered/ready/failed/offline + ready_seconds histogram,
daemon_ws_message_received_total.
Autopilot: autopilot_run_started/terminal/skipped.
Webhook/GitHub: webhook_delivery_total, github_event_received_total,
github_pr_review_total, github_pr_merge_seconds histogram.
CloudRuntime: cloudruntime_request_total + duration histogram, wired through
a small RequestRecorder interface so the cloudruntime package stays decoupled
from metrics.
Commercial: feedback_submitted, contact_sales_submitted.
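
The CloudRuntime decoupling mentioned above can be pictured roughly as follows. This is a minimal sketch, not the real code: the `RecordRequest` method name, its parameters, and the `Client`/`countingRecorder` types are assumptions for illustration, since the actual RequestRecorder signature is not shown here.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// RequestRecorder is a hypothetical shape for the interface named above.
// The method name and parameters are assumptions for illustration.
type RequestRecorder interface {
	RecordRequest(operation, outcome string, elapsed time.Duration)
}

// Client stands in for the cloudruntime client. It depends only on the
// interface, never on a metrics package.
type Client struct {
	recorder RequestRecorder // nil means "do not record"
}

// Do runs fn, then reports the operation, its outcome and its duration.
func (c *Client) Do(operation string, fn func() error) error {
	start := time.Now()
	err := fn()
	if c.recorder != nil {
		outcome := "ok"
		if err != nil {
			outcome = "error"
		}
		c.recorder.RecordRequest(operation, outcome, time.Since(start))
	}
	return err
}

// countingRecorder is an in-memory stand-in for a Prometheus-backed recorder.
type countingRecorder struct {
	counts map[string]int
}

func (r *countingRecorder) RecordRequest(operation, outcome string, _ time.Duration) {
	r.counts[operation+"/"+outcome]++
}

var errBoom = errors.New("boom")

func main() {
	rec := &countingRecorder{counts: map[string]int{}}
	c := &Client{recorder: rec}
	_ = c.Do("create", func() error { return nil })
	_ = c.Do("create", func() error { return errBoom })
	fmt.Println(rec.counts["create/ok"], rec.counts["create/error"])
}
```

The metrics side would implement the interface and be injected at wiring time, so the client package never imports it.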

The pairing helper metrics.RecordEvent(client, m, ev) emits the PostHog
event AND increments the matching counter via IncForEvent dispatch, reading
labels from the analytics event Properties. Every existing
h.Analytics.Capture(analytics.X(...)) call site has been migrated to the
helper across handler/, service/, and cmd/server/runtime_sweeper.go.
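
As a rough illustration of the pairing idea, the sketch below routes every event through one function that both captures it and bumps a counter. The `Event`, `Capturer`, and `Metrics` types, the metric names, and the `has_reason` property are stand-ins invented for this example; the real helper works against the analytics and Prometheus types.

```go
package main

import "fmt"

// Event is a stand-in for an analytics event: a name plus properties.
// The field names are assumptions, not the real analytics package types.
type Event struct {
	Name       string
	Properties map[string]any
}

// Capturer is a hypothetical stand-in for the PostHog client.
type Capturer interface {
	Capture(ev Event)
}

// Metrics is an in-memory stand-in for the Prometheus collector.
type Metrics struct {
	counts map[string]int
}

// IncForEvent dispatches on the event name and reads label values from
// ev.Properties. It reports whether the event has a matching counter.
func (m *Metrics) IncForEvent(ev Event) bool {
	switch ev.Name {
	case "signup":
		m.counts["signup_total"]++
	case "cloud_waitlist_joined":
		// The label comes from the event itself, so both sides see the same value.
		hasReason, _ := ev.Properties["has_reason"].(bool)
		m.counts[fmt.Sprintf("cloud_waitlist_joined_total{has_reason=%t}", hasReason)]++
	default:
		return false
	}
	return true
}

// RecordEvent is the single call site that keeps the two sides paired:
// every captured event also goes through the counter dispatch.
func RecordEvent(c Capturer, m *Metrics, ev Event) {
	if c != nil {
		c.Capture(ev)
	}
	if m != nil {
		m.IncForEvent(ev)
	}
}

// sliceCapturer records captured events in memory for the demo.
type sliceCapturer struct{ events []Event }

func (s *sliceCapturer) Capture(ev Event) { s.events = append(s.events, ev) }

func main() {
	capt := &sliceCapturer{}
	m := &Metrics{counts: map[string]int{}}
	RecordEvent(capt, m, Event{
		Name:       "cloud_waitlist_joined",
		Properties: map[string]any{"has_reason": true},
	})
	fmt.Println(len(capt.events), m.counts["cloud_waitlist_joined_total{has_reason=true}"])
}
```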

Lint enforcement (server/internal/metrics/business_pairing_test.go):
- TestEveryAnalyticsEventHasPrometheusCounter: every Event* constant in
  analytics/events.go either dispatches via IncForEvent or is in the
  taskMetricEvents allow-list (PR2 typed RecordTask* methods).
- TestNoNakedAnalyticsCaptureInHandlersOrServices: AST-walks handler/,
  service/, and cmd/server for direct Analytics.Capture(...) calls; only
  service/task.go's captureTaskEvent helper is allow-listed.
- TestEveryAnalyticsRecordEventTakesAnalyticsHelper: validates the third
  arg of every metrics.RecordEvent call is built from analytics.*.
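
The naked-Capture check described in the list above can be approximated with the standard go/ast package. The sketch below is an assumption about the general approach, not the test's actual code: `findNakedCaptures` and the sample source are made up, and it only matches the syntactic pattern `<expr>.Analytics.Capture(...)`.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
)

// sample is illustrative source; parsing does not type-check, so the
// undefined names inside it are fine.
const sample = `package handler

func (h *Handler) Bad() { h.Analytics.Capture(ev()) }

func (h *Handler) Good() { RecordEvent(h.Analytics, h.Metrics, ev()) }

func captureTaskEvent(h *Handler) { h.Analytics.Capture(ev()) }
`

// findNakedCaptures parses src and returns the names of functions that
// call <anything>.Analytics.Capture(...) directly, skipping functions
// whose names are allow-listed.
func findNakedCaptures(src string, allowed map[string]bool) ([]string, error) {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, "sample.go", src, 0)
	if err != nil {
		return nil, err
	}
	var offenders []string
	for _, decl := range file.Decls {
		fn, ok := decl.(*ast.FuncDecl)
		if !ok || fn.Body == nil || allowed[fn.Name.Name] {
			continue
		}
		found := false
		ast.Inspect(fn.Body, func(n ast.Node) bool {
			call, ok := n.(*ast.CallExpr)
			if !ok {
				return true
			}
			sel, ok := call.Fun.(*ast.SelectorExpr)
			if !ok || sel.Sel.Name != "Capture" {
				return true
			}
			// The receiver expression must itself end in ".Analytics".
			if recv, ok := sel.X.(*ast.SelectorExpr); ok && recv.Sel.Name == "Analytics" {
				found = true
			}
			return true
		})
		if found {
			offenders = append(offenders, fn.Name.Name)
		}
	}
	return offenders, nil
}

func main() {
	got, err := findNakedCaptures(sample, map[string]bool{"captureTaskEvent": true})
	if err != nil {
		panic(err)
	}
	fmt.Println(got) // only Bad is reported
}
```

Allow-listing by function name rather than by file is what lets a second, unapproved Capture call in the same file still fail the check.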

Cardinality protection: all new label values pass through fixed allow-lists
in labels_pr3.go; unknown values collapse to 'other'/'unknown'/'error'.
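
A minimal sketch of this kind of allow-list normalizer follows. The platform values and the `normalizeLabel` helper are illustrative assumptions; the real lists live in labels_pr3.go and are not reproduced here.

```go
package main

import (
	"fmt"
	"strings"
)

// allowedPlatforms is an illustrative allow-list, not the real one.
var allowedPlatforms = map[string]struct{}{
	"web": {}, "desktop": {}, "cli": {}, "server": {},
}

// normalizeLabel returns raw (trimmed, lower-cased) when it is allow-listed
// and the fallback bucket otherwise, so a label can take at most
// len(allowed)+1 distinct values no matter what clients send.
func normalizeLabel(raw string, allowed map[string]struct{}, fallback string) string {
	v := strings.ToLower(strings.TrimSpace(raw))
	if _, ok := allowed[v]; ok {
		return v
	}
	return fallback
}

func main() {
	fmt.Println(normalizeLabel(" Web ", allowedPlatforms, "unknown"))
	fmt.Println(normalizeLabel(`x"} evil{`, allowedPlatforms, "unknown"))
}
```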

Refs:
- Spec MUL-2328 / MUL-2949.
- Builds on PR2 (MUL-2948) — collectors registered through the same
  BusinessMetrics struct, no separate Registry.
- Uses PR1's taskfailure.Reason (MUL-2946) for runtime_failed's failure_reason
  label via NormalizeFailureReason.

Out of scope: Sampler-class metrics (PR4 / MUL-2947) and the
github_pr_review_total emission point (no review event handler exists yet;
the counter is defined, with a TODO to wire it up when
/api/webhooks/github grows pull_request_review handling).

Co-authored-by: multica-agent <github@multica.ai>

* fix(server): tighten PR3 review items — signup_source bucket, fill platform/kind/form_source enums, onboarding_started server emission, lint scope (MUL-2949)

Addresses 张大彪's review on #3698:

1. signup_source: NormalizeSignupSource added to labels_pr3.go with a
   fixed allow-list bucket (direct/google/twitter/linkedin/.../other).
   Parses JSON cookie payload for utm_source/source/referrer fields,
   strips URL schemes, maps well-known hostnames to channel buckets.
   PostHog event still ships the raw cookie value for analytics; only
   the Prometheus label is bucketed.

2. Filled the unknown/other label gaps:
   - analytics.IssueCreated and analytics.ChatMessageSent now take a
     platform parameter sourced from middleware.ClientMetadataFromContext
     (X-Client-Platform header) at the handler. Autopilot-originated
     issues stamp PlatformServer.
   - analytics.FeedbackSubmitted now takes a kind parameter; CreateFeedback
     reads req.Kind (default "general") so the picker selection populates
     the metric's kind label instead of landing in "other" indefinitely.
   - analytics.ContactSalesSubmitted now takes a formSource (page /
     onboarding / agents_page); CreateContactSales reads req.Source.
     The metric reads ev.Properties["form_source"] so the analytics
     CoreProperties.Source ("marketing_contact_sales") stays
     backward-compatible for PostHog dashboards.

3. analytics.OnboardingStarted helper added; server-side emission lives
   in PatchOnboarding, fired exactly once per user on the first PATCH
   that carries a non-empty questionnaire payload (firstTouch logic
   compares prior bytes against {} / null). Frontend onboarding_started
   keeps firing on page open; the server emission is what guarantees the
   Prometheus counter exists so Grafana can be cross-checked against the
   PostHog funnel without depending on the SDK roundtrip.

4. business_pairing_test.go tightened:
   - TestNoNakedAnalyticsCaptureInHandlersOrServices now allow-lists at
     function granularity (just captureTaskEvent in service/task.go), not
     whole-file. Any future naked Capture in the same file fails CI.
   - TestEveryAnalyticsRecordEventTakesAnalyticsHelper now does def-use
     tracking inside the enclosing FuncDecl: when RecordEvent's third
     arg is an *ast.Ident, the test walks the function body for the
     assignment that defined it and confirms the RHS is an
     analytics.<Helper>(...) call. Bare local idents that didn't
     originate from analytics are now caught.

5. gofmt -w applied across the touched files; gofmt -l clean.
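
The signup_source bucketing from item 1 can be sketched roughly as below. Everything concrete here is an assumption for illustration: the function name is lower-cased to mark it as a stand-in, the hostname table is invented, and treating an empty value as "direct" is a guess at the intended behaviour.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// hostBuckets maps well-known hosts to channel buckets. Illustrative only;
// the real allow-list lives in labels_pr3.go.
var hostBuckets = map[string]string{
	"google": "google", "google.com": "google",
	"twitter": "twitter", "twitter.com": "twitter", "x.com": "twitter", "t.co": "twitter",
	"linkedin": "linkedin", "linkedin.com": "linkedin",
}

// normalizeSignupSource reduces a raw cookie value to a fixed bucket.
func normalizeSignupSource(cookie string) string {
	raw := strings.TrimSpace(cookie)

	// If the cookie is a JSON object, take the first non-empty known field.
	var payload map[string]any
	if json.Unmarshal([]byte(raw), &payload) == nil {
		raw = ""
		for _, key := range []string{"utm_source", "source", "referrer"} {
			if s, ok := payload[key].(string); ok && strings.TrimSpace(s) != "" {
				raw = strings.TrimSpace(s)
				break
			}
		}
	}
	if raw == "" {
		return "direct" // assumption: no attribution means direct traffic
	}

	v := strings.ToLower(raw)
	// Strip the URL scheme, then anything after the host.
	if i := strings.Index(v, "://"); i >= 0 {
		v = v[i+3:]
	}
	if i := strings.IndexAny(v, "/?#:"); i >= 0 {
		v = v[:i]
	}
	v = strings.TrimPrefix(v, "www.")

	if bucket, ok := hostBuckets[v]; ok {
		return bucket
	}
	return "other"
}

func main() {
	fmt.Println(normalizeSignupSource(`{"utm_source":"https://www.google.com/search?q=x"}`))
	fmt.Println(normalizeSignupSource("https://t.co/abc"))
	fmt.Println(normalizeSignupSource(`{"referrer":"http://evil.example/<script>"}`))
	fmt.Println(normalizeSignupSource(""))
}
```

Matching on the exact host rather than a substring avoids accidental hits such as "t.co" inside an unrelated hostname.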

Tests: go test ./internal/metrics/... ./internal/analytics/... pass.
Pre-existing TestClaimTask_/TestWebhook_MergedPR/TestDeleteIssueByIdentifier
failures on origin/main are DB-environment-dependent and not regressions
from this change.

Co-authored-by: multica-agent <github@multica.ai>

* fix(server): normalise onboarding_started platform label + regression test (MUL-2949)

Addresses 张大彪's last review nit:

- IncForEvent's EventOnboardingStarted case now wraps the platform
  property with NormalizePlatform, matching every other platform-bearing
  metric. A misbehaving frontend can no longer leak a raw X-Client-Platform
  header value into the multica_onboarding_started_total{platform=...}
  series.

- New labels_pr3_test.go covers every PR3 normalizer with both a happy-path
  value and an unknown value, asserting the unknown collapses to the
  documented fallback bucket. Includes a focused regression for
  onboarding_started: emits one event with an attacker-shaped platform
  string and asserts the metric only exposes web + unknown label values
  (no raw header bleed).

- testutil.go gains a small GatherForTest helper so the regression test
  can pull the typed MetricFamily map without re-implementing the
  registry-walk dance.

Co-authored-by: multica-agent <github@multica.ai>

* fix(server): NormalizeTaskSource on workspace_created + document lint limitations (MUL-2949)

Final review touch-ups before merge:

- IncForEvent's EventWorkspaceCreated case wraps source through
  NormalizeTaskSource, matching the other source-bearing dispatches
  (issue_created, agent_created, issue_executed). Closes the last raw
  property leak in the dispatcher table.

- business_pairing_test.go inline docstrings now spell out the two
  known limitations of the lint gate that 张大彪 / Eve flagged:
  analyticsBackedIdents matches by ident NAME (not SSA def-use, so a
  nested-scope shadow could pass) and isMetricsRecordEvent hard-codes
  the import alias set. PR description carries a Follow-ups section
  with the same two items so the work is visible after merge.
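
To make the first limitation concrete, here is a rough sketch of name-based tracking with go/ast. It is not the test's code: `thirdArgsOK`, the sample source, and `analytics.Signup` are hypothetical, and the sketch matches any `<pkg>.RecordEvent` call regardless of import alias.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
)

const sample = `package handler

func direct() { metrics.RecordEvent(c, m, analytics.Signup("u")) }

func viaLocal() { ev := analytics.Signup("u"); metrics.RecordEvent(c, m, ev) }

func bare() { ev := makeEvent(); metrics.RecordEvent(c, m, ev) }

func shadowed() {
	ev := analytics.Signup("u")
	_ = ev
	{
		ev := makeEvent()
		metrics.RecordEvent(c, m, ev)
	}
}
`

// isAnalyticsCall reports whether e has the form analytics.X(...).
func isAnalyticsCall(e ast.Expr) bool {
	call, ok := e.(*ast.CallExpr)
	if !ok {
		return false
	}
	sel, ok := call.Fun.(*ast.SelectorExpr)
	if !ok {
		return false
	}
	pkg, ok := sel.X.(*ast.Ident)
	return ok && pkg.Name == "analytics"
}

// analyticsBackedNames collects every identifier NAME assigned from an
// analytics.X(...) call anywhere in body. Scope is ignored on purpose.
func analyticsBackedNames(body *ast.BlockStmt) map[string]bool {
	names := map[string]bool{}
	ast.Inspect(body, func(n ast.Node) bool {
		as, ok := n.(*ast.AssignStmt)
		if !ok || len(as.Lhs) != len(as.Rhs) {
			return true
		}
		for i, lhs := range as.Lhs {
			if id, ok := lhs.(*ast.Ident); ok && isAnalyticsCall(as.Rhs[i]) {
				names[id.Name] = true
			}
		}
		return true
	})
	return names
}

// thirdArgsOK reports, per function, whether every RecordEvent call's
// third argument is an analytics call or an analytics-backed name.
func thirdArgsOK(src string) (map[string]bool, error) {
	file, err := parser.ParseFile(token.NewFileSet(), "sample.go", src, 0)
	if err != nil {
		return nil, err
	}
	result := map[string]bool{}
	for _, decl := range file.Decls {
		fn, isFn := decl.(*ast.FuncDecl)
		if !isFn || fn.Body == nil {
			continue
		}
		backed := analyticsBackedNames(fn.Body)
		pass := true
		ast.Inspect(fn.Body, func(n ast.Node) bool {
			call, isCall := n.(*ast.CallExpr)
			if !isCall || len(call.Args) != 3 {
				return true
			}
			sel, isSel := call.Fun.(*ast.SelectorExpr)
			if !isSel || sel.Sel.Name != "RecordEvent" {
				return true
			}
			if id, isIdent := call.Args[2].(*ast.Ident); isIdent {
				pass = pass && backed[id.Name]
			} else {
				pass = pass && isAnalyticsCall(call.Args[2])
			}
			return true
		})
		result[fn.Name.Name] = pass
	}
	return result, nil
}

func main() {
	res, err := thirdArgsOK(sample)
	if err != nil {
		panic(err)
	}
	for _, name := range []string{"direct", "viaLocal", "bare", "shadowed"} {
		fmt.Println(name, res[name])
	}
}
```

In this sketch `bare` is rejected, but `shadowed` passes even though the inner `ev` never came from analytics, which is the nested-scope false negative the docstrings call out.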

Co-authored-by: multica-agent <github@multica.ai>

---------

Co-authored-by: 魏和尚 <agent+wei@multica.ai>
Co-authored-by: multica-agent <github@multica.ai>
2026-06-03 16:39:06 +08:00


package handler

import (
	"encoding/json"
	"log/slog"
	"net/http"
	"net/mail"
	"strings"

	"github.com/jackc/pgx/v5/pgtype"
	"github.com/multica-ai/multica/server/internal/analytics"
	"github.com/multica-ai/multica/server/internal/logger"
	obsmetrics "github.com/multica-ai/multica/server/internal/metrics"
	"github.com/multica-ai/multica/server/internal/middleware"
	db "github.com/multica-ai/multica/server/pkg/db/generated"
)

// Upper bound on free-text fields. `cloudWaitlistReasonMaxLen` is a
// product cap ("we don't need an essay for a waitlist"); the body-size
// cap further down is defense in depth against arbitrary storage
// abuse via the JSON body.
const (
	cloudWaitlistReasonMaxLen = 500

	// PatchOnboarding body is a tiny JSON with at most a 3-question
	// questionnaire. 16 KiB is ~10x the realistic ceiling; it's the
	// minimum that keeps the door open for future fields without
	// letting a malicious user stuff the JSONB column.
	patchOnboardingBodyLimit = 16 * 1024
)

// completeOnboardingRequest carries the client's view of which exit the
// user took from the flow. Used purely as an analytics dimension: server
// state (onboarded_at) flips the same way regardless. Unknown / missing
// → OnboardingPathUnknown so legacy clients still complete cleanly, just
// without a funnel-ready label.
//
// `workspace_id` is retained for analytics enrichment; the v2 code path
// used it to seed an install-runtime issue inside the same transaction,
// but in v3 all workspace-content seeding lives in the frontend
// welcome hook (see packages/views/workspace/welcome-after-onboarding.tsx).
type completeOnboardingRequest struct {
	CompletionPath string `json:"completion_path,omitempty"`
	WorkspaceID    string `json:"workspace_id,omitempty"`
}

var validCompletionPaths = map[string]struct{}{
	analytics.OnboardingPathFull:           {},
	analytics.OnboardingPathRuntimeSkipped: {},
	analytics.OnboardingPathCloudWaitlist:  {},
	analytics.OnboardingPathSkipExisting:   {},
	analytics.OnboardingPathInviteAccept:   {},
}

// CompleteOnboarding marks the authenticated user as having completed
// onboarding. Idempotent: the underlying query uses COALESCE so the
// original timestamp is preserved if called more than once.
//
// Emits `onboarding_completed` exactly once: on the first call that
// actually flips `onboarded_at` from NULL. Subsequent calls still
// return 200 OK (for client-side retries) but skip the event so the
// funnel counts only genuine first completions.
//
// V3 has no in-handler seeding side effect: workspace content (Helper
// agent, starter issues, install-runtime guides) is created by the
// frontend welcome hook via the generic CreateAgent / CreateIssue
// endpoints. This handler does one thing: flip the field.
func (h *Handler) CompleteOnboarding(w http.ResponseWriter, r *http.Request) {
	userID, ok := requireUserID(w, r)
	if !ok {
		return
	}

	// Body is optional; an empty body is a legal legacy call.
	var req completeOnboardingRequest
	if r.ContentLength > 0 {
		if err := json.NewDecoder(r.Body).Decode(&req); err != nil && err.Error() != "EOF" {
			writeError(w, http.StatusBadRequest, "invalid request body")
			return
		}
	}

	// Validate workspace_id if supplied; we don't write with it, but a
	// malformed value should fail fast rather than silently land in
	// PostHog as a junk dimension.
	if req.WorkspaceID != "" {
		wsUUID, ok := parseUUIDOrBadRequest(w, req.WorkspaceID, "workspace_id")
		if !ok {
			return
		}
		req.WorkspaceID = uuidToString(wsUUID)
	}

	before, err := h.Queries.GetUser(r.Context(), parseUUID(userID))
	if err != nil {
		writeError(w, http.StatusInternalServerError, "failed to complete onboarding")
		return
	}
	firstCompletion := !before.OnboardedAt.Valid

	user, err := h.Queries.MarkUserOnboarded(r.Context(), parseUUID(userID))
	if err != nil {
		slog.Warn("complete onboarding: mark user onboarded failed", append(logger.RequestAttrs(r), "error", err)...)
		writeError(w, http.StatusInternalServerError, "failed to complete onboarding")
		return
	}

	if firstCompletion {
		path := req.CompletionPath
		if _, ok := validCompletionPaths[path]; !ok {
			path = analytics.OnboardingPathUnknown
		}
		onboardedAt := ""
		if user.OnboardedAt.Valid {
			onboardedAt = user.OnboardedAt.Time.UTC().Format("2006-01-02T15:04:05Z07:00")
		}
		obsmetrics.RecordEvent(h.Analytics, h.Metrics, analytics.OnboardingCompleted(
			userID,
			req.WorkspaceID,
			path,
			onboardedAt,
			user.CloudWaitlistEmail.Valid,
		))
	}

	writeJSON(w, http.StatusOK, userToResponse(user))
}

type patchOnboardingRequest struct {
	Questionnaire *json.RawMessage `json:"questionnaire,omitempty"`
}

// stringOrSlice decodes a JSON value that is either an array of strings
// (the current shape) or a bare string. The bare-string form covers
// pre-array rows that wrote a single string into the JSONB column;
// `json.Unmarshal` would otherwise fail on the type mismatch when
// reading those back.
type stringOrSlice []string

func (s *stringOrSlice) UnmarshalJSON(data []byte) error {
	// Empty / null both decode to nil slice.
	if len(data) == 0 || string(data) == "null" {
		*s = nil
		return nil
	}
	// Try array first (current shape).
	var arr []string
	if err := json.Unmarshal(data, &arr); err == nil {
		*s = arr
		return nil
	}
	// Fall back to single string (pre-array shape from before this
	// column held a slice). Empty string means "unanswered": keep nil.
	var single string
	if err := json.Unmarshal(data, &single); err != nil {
		return err
	}
	if single == "" {
		*s = nil
		return nil
	}
	*s = []string{single}
	return nil
}

// questionnaireAnswers mirrors the frontend's `QuestionnaireAnswers`
// shape. `use_case` is multi-select (Step 3 allows picking several);
// `source` is single-select (primary acquisition channel) but kept
// as `stringOrSlice` for back-compat with v2 multi-select rows; the
// client now always commits a one-element array. `role` stays
// single-select.
type questionnaireAnswers struct {
	Source         stringOrSlice `json:"source"`
	SourceOther    string        `json:"source_other"`
	SourceSkipped  bool          `json:"source_skipped"`
	Role           string        `json:"role"`
	RoleOther      string        `json:"role_other"`
	RoleSkipped    bool          `json:"role_skipped"`
	UseCase        stringOrSlice `json:"use_case"`
	UseCaseOther   string        `json:"use_case_other"`
	UseCaseSkipped bool          `json:"use_case_skipped"`
	Version        int           `json:"version"`
}

func (q questionnaireAnswers) sourceResolved() bool {
	return len(q.Source) > 0 || q.SourceSkipped
}

func (q questionnaireAnswers) roleResolved() bool {
	return q.Role != "" || q.RoleSkipped
}

func (q questionnaireAnswers) useCaseResolved() bool {
	return len(q.UseCase) > 0 || q.UseCaseSkipped
}

// questionnaireSchemaVersion is the schema this handler understands.
// `complete()` and the funnel event are scoped to this version so a
// future v3 row can't be silently mis-counted against v2 semantics.
const questionnaireSchemaVersion = 2

func (q questionnaireAnswers) complete() bool {
	if q.Version != questionnaireSchemaVersion {
		return false
	}
	return q.sourceResolved() && q.roleResolved() && q.useCaseResolved()
}

// PatchOnboarding persists the user's questionnaire answers. The
// field is optional; an omitted questionnaire is preserved. Which
// step the user is on is deliberately not persisted: every
// onboarding entry starts at Welcome.
//
// Emits `onboarding_questionnaire_submitted` exactly once per user:
// the first PATCH that transitions the answers from "at least one
// slot empty" to "all three filled". Revisions past that point don't
// re-emit; the funnel counts users, not edits.
//
// Also emits `onboarding_started` from the server side on the first
// PATCH that carries a non-empty questionnaire while no prior
// answers are stored.
func (h *Handler) PatchOnboarding(w http.ResponseWriter, r *http.Request) {
	userID, ok := requireUserID(w, r)
	if !ok {
		return
	}

	// Bound the body so the JSONB column can't be weaponized as bulk
	// storage; otherwise every subsequent `/api/me` read would have
	// to return the bloat.
	r.Body = http.MaxBytesReader(w, r.Body, patchOnboardingBodyLimit)

	var req patchOnboardingRequest
	if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
		writeError(w, http.StatusBadRequest, "invalid request body")
		return
	}

	// Read prior answers so we can detect the NULL/partial → complete
	// transition after the update. An errored decode on the prior row
	// is treated as "incomplete"; worst case we emit once more than
	// we should, never twice for the same transition.
	var before questionnaireAnswers
	beforeRaw := []byte("{}")
	if beforeUser, err := h.Queries.GetUser(r.Context(), parseUUID(userID)); err == nil {
		beforeRaw = beforeUser.OnboardingQuestionnaire
		_ = json.Unmarshal(beforeRaw, &before)
	}
	// firstTouch is true when the user has never written any
	// onboarding state on the server before this PATCH. Used to fire
	// onboarding_started exactly once per user from the server side.
	firstTouch := len(beforeRaw) == 0 || string(beforeRaw) == "null" || string(beforeRaw) == "{}"

	params := db.PatchUserOnboardingParams{ID: parseUUID(userID)}
	if req.Questionnaire != nil {
		params.Questionnaire = []byte(*req.Questionnaire)
	}

	user, err := h.Queries.PatchUserOnboarding(r.Context(), params)
	if err != nil {
		slog.Warn("patch onboarding failed", append(logger.RequestAttrs(r), "error", err)...)
		writeError(w, http.StatusInternalServerError, "failed to update onboarding")
		return
	}

	// Server-side onboarding_started: fire on the first PATCH that
	// actually carries a questionnaire payload. The frontend also
	// emits its own onboarding_started on page open; the two together
	// let Grafana cross-check the funnel against PostHog.
	if firstTouch && req.Questionnaire != nil && len(*req.Questionnaire) > 0 && string(*req.Questionnaire) != "{}" {
		platform, _, _ := middleware.ClientMetadataFromContext(r.Context())
		obsmetrics.RecordEvent(h.Analytics, h.Metrics, analytics.OnboardingStarted(userID, platform))
	}

	var after questionnaireAnswers
	_ = json.Unmarshal(user.OnboardingQuestionnaire, &after)
	if after.complete() && !before.complete() {
		obsmetrics.RecordEvent(h.Analytics, h.Metrics, analytics.OnboardingQuestionnaireSubmitted(
			userID,
			[]string(after.Source),
			after.Role,
			[]string(after.UseCase),
			after.SourceSkipped,
			after.RoleSkipped,
			after.UseCaseSkipped,
			after.SourceOther != "",
			after.RoleOther != "",
			after.UseCaseOther != "",
		))
	}

	writeJSON(w, http.StatusOK, userToResponse(user))
}

type joinCloudWaitlistRequest struct {
	Email  string `json:"email"`
	Reason string `json:"reason"`
}

// JoinCloudWaitlist records a user's interest in cloud runtimes.
// Pure side effect: it does NOT complete onboarding. The user still
// has to pick a real Step 3 path (CLI with a detected runtime) or
// Skip to move on. Repeating the call overwrites email + reason.
func (h *Handler) JoinCloudWaitlist(w http.ResponseWriter, r *http.Request) {
	userID, ok := requireUserID(w, r)
	if !ok {
		return
	}

	var req joinCloudWaitlistRequest
	if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
		writeError(w, http.StatusBadRequest, "invalid request body")
		return
	}

	// RFC 5321 caps email at 254 chars; the column is VARCHAR(254) and
	// the format check below rejects anything net/mail can't parse.
	email := strings.ToLower(strings.TrimSpace(req.Email))
	if email == "" {
		writeError(w, http.StatusBadRequest, "email is required")
		return
	}
	if len(email) > 254 {
		writeError(w, http.StatusBadRequest, "email is too long")
		return
	}
	if _, err := mail.ParseAddress(email); err != nil {
		writeError(w, http.StatusBadRequest, "email is invalid")
		return
	}

	reason := strings.TrimSpace(req.Reason)
	if len(reason) > cloudWaitlistReasonMaxLen {
		writeError(w, http.StatusBadRequest, "reason is too long")
		return
	}

	reasonParam := pgtype.Text{}
	if reason != "" {
		reasonParam = pgtype.Text{String: reason, Valid: true}
	}

	user, err := h.Queries.JoinCloudWaitlist(r.Context(), db.JoinCloudWaitlistParams{
		ID:                  parseUUID(userID),
		CloudWaitlistEmail:  pgtype.Text{String: email, Valid: true},
		CloudWaitlistReason: reasonParam,
	})
	if err != nil {
		writeError(w, http.StatusInternalServerError, "failed to join waitlist")
		return
	}

	obsmetrics.RecordEvent(h.Analytics, h.Metrics, analytics.CloudWaitlistJoined(userID, reason != ""))

	writeJSON(w, http.StatusOK, userToResponse(user))
}