Compare commits

...

2 Commits

Author SHA1 Message Date
J
84750a0598 docs(analytics): correct agent_task Prometheus metric contract (MUL-2967)
Address PR review: the agent_task_* "Prometheus-only" banner claimed the old
PostHog event properties (task_id, agent_id, duration_ms, error_type,
will_retry, ...) were the metric label set. They are not — the real labels are
only source/runtime_mode/provider/terminal_status/failure_reason.

- Replace the agent_task_* sections with the actual metric names and labels
  (multica_agent_task_*; see business.go / labels.go), and explain that
  completed/failed/cancelled are terminal_status values on
  multica_agent_task_terminal_total, with wall-clock in the *_seconds
  histograms.
- Tighten the runtime_*/autopilot_run_* banners so id properties aren't
  mistaken for labels.
- Drop the stale AgentTask allow-list reference from the pairing lint test
  header comment.

Co-authored-by: multica-agent <github@multica.ai>
2026-06-03 23:54:48 +08:00
J
7b52767803 chore(analytics): stop shipping operational events to PostHog (MUL-2967)
Operational / execution-lifecycle telemetry dominated PostHog event volume
and drove the bill: runtime_offline alone was ~54% of ~22.6M events/mo, and
~99% of events were billed at the higher identified-event rate. These signals
already have Prometheus counters (Grafana), so the PostHog copies were
redundant cost.

- Add analytics.IsMetricsOnly; metrics.RecordEvent now skips the PostHog
  Capture for runtime_* and autopilot_run_* while still incrementing their
  Prometheus counter (their analytics.Event constructors are retained to feed
  the metric label set via IncForEvent).
- Remove the agent_task_* PostHog path entirely: drop captureTaskEvent and the
  AgentTask* constructors/constants. Their Prometheus side is unchanged via the
  typed BusinessMetrics.RecordTask* methods. Also remove the now-dead
  taskDurationMS / willRetryTask helpers.
- Update the pairing lint test (no agent_task allow-list, no naked-Capture
  exception), add a RecordEvent skip test + IsMetricsOnly test, and update
  docs/analytics.md (taxonomy, per-event banners, reconciliation).

Product/funnel events (signup, onboarding, issue_created, issue_executed,
chat_message_sent, agent_created, autopilot_created, etc.) are unchanged and
still ship to PostHog.

Co-authored-by: multica-agent <github@multica.ai>
2026-06-03 22:40:34 +08:00
7 changed files with 209 additions and 257 deletions

View File

@@ -6,6 +6,20 @@ drives our weekly Active Workspaces (WAW) north-star metric.
See [MUL-1122](https://github.com/multica-ai/multica) for the design context.
> **PostHog is reserved for user/product-behaviour events.** High-volume
> operational / execution-lifecycle telemetry — runtime lifecycle
> (`runtime_registered` / `runtime_ready` / `runtime_failed` /
> `runtime_offline`), the agent task lifecycle (`agent_task_*`), and autopilot
> run lifecycle (`autopilot_run_started` / `autopilot_run_completed` /
> `autopilot_run_failed`) — is **Prometheus-only** and is **not** shipped to
> PostHog. Grafana already covers it and the per-event PostHog ingestion cost
> (these events dominate volume and bill at the identified-event rate) is not
> justified. The runtime/autopilot events are flagged by
> `analytics.IsMetricsOnly`, which `metrics.RecordEvent` consults to skip the
> PostHog `Capture` while still incrementing the Prometheus counter; the
> `agent_task_*` lifecycle is recorded straight to Prometheus via the typed
> `BusinessMetrics.RecordTask*` methods and has no `analytics.Event` at all.
## Configuration
All analytics shipping is toggled by environment variables (see `.env.example`):
@@ -89,15 +103,18 @@ Every event is assigned to one dashboard category:
| Category | Events |
|---|---|
| `core_loop` | `workspace_created`, `runtime_registered`, `runtime_ready`, `runtime_failed`, `runtime_offline`, `agent_created`, `issue_created`, `chat_message_sent`, `agent_task_queued`, `agent_task_dispatched`, `agent_task_started`, `agent_task_completed`, `agent_task_failed`, `agent_task_cancelled`, `autopilot_run_started`, `autopilot_run_completed`, `autopilot_run_failed` |
| `core_loop` | `workspace_created`, `agent_created`, `issue_created`, `chat_message_sent`, `issue_executed`, `autopilot_created`, `squad_created` |
| `onboarding_support` | `onboarding_started`, `onboarding_questionnaire_submitted`, `onboarding_completed`, `onboarding_runtime_path_selected`, `onboarding_runtime_detected` |
| `acquisition` | `signup`, `download_intent_expressed`, `download_page_viewed`, `download_initiated`, `cloud_waitlist_joined`, `contact_sales_submitted` |
| `ops_feedback` | `feedback_opened`, `feedback_submitted` |
| `system/noise` | `$pageview`, `$set`, `$identify`, `$autocapture`, `$rageclick` |
| `operational (Prometheus-only — NOT in PostHog)` | `runtime_registered`, `runtime_ready`, `runtime_failed`, `runtime_offline`, `agent_task_queued`, `agent_task_dispatched`, `agent_task_started`, `agent_task_completed`, `agent_task_failed`, `agent_task_cancelled`, `autopilot_run_started`, `autopilot_run_completed`, `autopilot_run_failed` |
The v0 core dashboard must use only `core_loop` plus the specific
`onboarding_support` steps used by the activation funnel. Acquisition,
feedback, and system/noise events stay in separate dashboards.
feedback, and system/noise events stay in separate dashboards. The
`operational` row is **not shipped to PostHog** — those signals live in
Grafana via `multica_*` business counters (see `server/internal/metrics`).
## Standard core properties
@@ -165,6 +182,13 @@ funnel with "first time user does X" or a cohort on
### `runtime_registered`
> **Prometheus-only — not shipped to PostHog** (see the note at the top of this
> doc). The `analytics.Event` is still constructed so `metrics.IncForEvent` can
> derive the Prometheus counter; the fields below are that **event** shape, not
> a PostHog contract. Only the low-cardinality fields (`runtime_mode`,
> `provider`) become Prometheus labels — ids like `runtime_id` / `daemon_id`
> are not labels.
Fires the first time a `(workspace_id, daemon_id, provider)` tuple is
upserted. Heartbeats and repeat registrations never re-emit. First-time
detection uses Postgres `xmax = 0` on the upsert RETURNING clause — no
@@ -186,6 +210,8 @@ under a single "anonymous" person.
### `runtime_ready`
> **Prometheus-only — not shipped to PostHog.**
Fires when a runtime is first registered in an online/ready state. This is the
activation-funnel step that should replace treating `runtime_registered` as
proof of readiness. The backend emits this only on the INSERT path for a new
@@ -203,6 +229,8 @@ distinct `runtime_id`.
### `runtime_failed`
> **Prometheus-only — not shipped to PostHog.**
Fires when runtime setup/registration fails before a ready runtime can be
recorded. Today this is scoped to backend registration persistence failures;
future setup flows should reuse it for provider detection or daemon boot
@@ -218,6 +246,8 @@ failures.
### `runtime_offline`
> **Prometheus-only — not shipped to PostHog.**
Fires when a runtime is explicitly deregistered or the backend sweeper marks it
offline after missed heartbeats. This is not an activation step; it supports
local runtime retention and drop-off diagnosis.
@@ -247,40 +277,52 @@ is queued.
| `agent_id` | string (UUID) | Chat agent. |
| `source` | string | Always `chat`. |
### `agent_task_queued` / `agent_task_dispatched` / `agent_task_started` / `agent_task_completed`
### agent task lifecycle (Prometheus-only)
Canonical task lifecycle events emitted from `agent_task_queue` state
transitions. `agent_task_dispatched` fires when the backend claims a queued
task for a runtime, before the daemon marks it running with
`agent_task_started`. These events replace `issue_executed` for core loop
success metrics and allow the activation funnel to split queue backlog from
claim/start handoff.
> **Not shipped to PostHog and has no `analytics.Event`.** The agent task
> lifecycle is recorded directly to Prometheus by the typed
> `BusinessMetrics.RecordTask*` methods in `server/internal/service/task.go`.
> The old PostHog event names (`agent_task_queued` / `dispatched` / `started` /
> `completed` / `failed` / `cancelled`) and their properties (`task_id`,
> `agent_id`, `issue_id`, `chat_session_id`, `autopilot_run_id`, `duration_ms`,
> `error_type`, `will_retry`) no longer exist anywhere — those high-cardinality
> ids were never Prometheus labels and must not be used in dashboards or
> reconciliation.
| Property | Type | Description |
The actual metrics (defined in `server/internal/metrics/business.go`; label
sets in `server/internal/metrics/labels.go`):
| Metric | Type | Labels |
|---|---|---|
| `task_id` | string (UUID) | `agent_task_queue.id`; required. |
| `agent_id` | string (UUID) | Owning agent. |
| `issue_id` | string (UUID) | Present for issue-linked tasks. |
| `chat_session_id` | string (UUID) | Present for chat tasks. |
| `autopilot_run_id` | string (UUID) | Present for run-only autopilot tasks. |
| `source` | string | `manual`, `chat`, or `autopilot`. |
| `runtime_mode` | string | `local` / `cloud`. |
| `provider` | string | Runtime provider. |
| `duration_ms` | int64 | Terminal events only; measured from `started_at` when available. |
| `multica_agent_task_enqueued_total` | counter | `source`, `runtime_mode` |
| `multica_agent_task_dispatched_total` | counter | `source`, `runtime_mode` |
| `multica_agent_task_started_total` | counter | `source`, `runtime_mode`, `provider` |
| `multica_agent_task_terminal_total` | counter | `source`, `runtime_mode`, `terminal_status` |
| `multica_agent_task_failed_total` | counter | `source`, `runtime_mode`, `failure_reason` |
| `multica_agent_task_queue_wait_seconds` | histogram | `source`, `runtime_mode` |
| `multica_agent_task_run_seconds` | histogram | `source`, `runtime_mode`, `terminal_status` |
| `multica_agent_task_total_seconds` | histogram | `source`, `runtime_mode`, `terminal_status` |
### `agent_task_failed` / `agent_task_cancelled`
Terminal task lifecycle events. They use the same join fields as
`agent_task_completed`. `agent_task_failed` also carries:
| Property | Type | Description |
|---|---|---|
| `failure_reason` | string | Stable reason from `agent_task_queue.failure_reason`, default `agent_error`. |
| `error_type` | string | Stable coarse classifier, e.g. `runtime`, `timeout`, `agent_output`, `cancelled`, `agent_error`. |
| `will_retry` | bool | Whether the backend auto-retry policy will create another task attempt. |
- `terminal_status` is the task's final `agent_task_queue.status`
`completed` / `failed` / `cancelled`. There is **no** separate
completed/cancelled metric: all three land on
`multica_agent_task_terminal_total{terminal_status=…}`. Failures
additionally increment `multica_agent_task_failed_total` carrying the coarse
`failure_reason` (`agent_task_queue.failure_reason`, default `agent_error`).
- Task wall-clock lives in the `*_seconds` histograms (queue wait / run /
total), replacing the old `duration_ms` event property.
- `source` / `runtime_mode` / `provider` are the normalized label values
(`NormalizeTaskSource` / `NormalizeRuntimeMode` / `NormalizeRuntimeProvider`).
### `autopilot_run_started` / `autopilot_run_completed` / `autopilot_run_failed`
> **Prometheus-only — not shipped to PostHog.** The `analytics.*` constructors
> are retained only so `metrics.IncForEvent` can derive the Prometheus counter;
> `analytics.IsMetricsOnly` keeps them out of PostHog. Only `cadence`,
> `trigger_kind`, and `terminal_status` become Prometheus labels — the
> `autopilot_id` / `autopilot_run_id` / `agent_id` fields below are event shape,
> not labels.
Fires from `autopilot_run` lifecycle changes. `source` is always
`autopilot`; the trigger origin is carried in `trigger_source` (`manual`,
`schedule`, `webhook`, or `api`).
@@ -329,9 +371,11 @@ emit `n=1`. PostHog answers the same question at query time via
and funnel steps of the form "workspace has had ≥2 `issue_executed`
events" are expressible without the property. No information is lost.
Compatibility: `issue_executed` remains a historical compatibility event for
old dashboards. New core-loop success dashboards should use
`agent_task_completed` and filter by `source`/`issue_id` as needed.
`issue_executed` is the canonical **PostHog** core-loop success signal (the
`agent_task_*` lifecycle that previously served per-task success dashboards is
now Prometheus-only). Per-task completion counts live in Grafana via
`BusinessMetrics.RecordTaskTerminal`; use `issue_executed` for the
PostHog-side activation funnel and filter by `source` as needed.
### `team_invite_sent`
@@ -604,8 +648,10 @@ sent from a pre-workspace surface.
## Reconciliation
`agent_task_completed` is the canonical PostHog-side task success event. It
should reconcile daily against the operational source of truth:
Per-task completion is no longer shipped to PostHog. Task success now
reconciles **DB ↔ Prometheus** instead of DB ↔ PostHog: the
`BusinessMetrics.RecordTaskTerminal` counter (exported as a `multica_*` task
metric) should track the operational source of truth:
```sql
SELECT date_trunc('day', completed_at AT TIME ZONE 'UTC') AS day,
@@ -617,22 +663,13 @@ GROUP BY 1
ORDER BY 1;
```
Equivalent HogQL:
Compare against the equivalent Prometheus counter in Grafana. The expected
difference should be near zero; sustained drift means either an emission site
is missing or the metrics pipeline is unhealthy.
```sql
SELECT toStartOfDay(timestamp) AS day,
count() AS posthog_completed_tasks
FROM events
WHERE event = 'agent_task_completed'
AND properties.environment = 'production'
AND timestamp >= now() - interval 30 day
GROUP BY day
ORDER BY day
```
The expected difference should be near zero. Allow a small delay window for
PostHog ingestion and backend analytics queue drops; sustained drift means
either an emission site is missing or PostHog shipping is unhealthy.
On the PostHog side, `issue_executed` remains the product-level success signal
(at most one per issue) and can be reconciled against
`issue.first_executed_at` if needed.
## Governance

View File

@@ -13,12 +13,6 @@ const (
EventIssueExecuted = "issue_executed"
EventIssueCreated = "issue_created"
EventChatMessageSent = "chat_message_sent"
EventAgentTaskQueued = "agent_task_queued"
EventAgentTaskDispatched = "agent_task_dispatched"
EventAgentTaskStarted = "agent_task_started"
EventAgentTaskCompleted = "agent_task_completed"
EventAgentTaskFailed = "agent_task_failed"
EventAgentTaskCancelled = "agent_task_cancelled"
EventAutopilotRunStarted = "autopilot_run_started"
EventAutopilotRunCompleted = "autopilot_run_completed"
EventAutopilotRunFailed = "autopilot_run_failed"
@@ -37,6 +31,35 @@ const (
const EventSchemaVersion = 2
// metricsOnlyEvents are operational / execution-lifecycle events that are
// recorded to Prometheus (via metrics.IncForEvent, for Grafana) but are
// deliberately NOT shipped to PostHog. They are high-volume runtime/autopilot
// telemetry whose per-event PostHog ingestion cost is not justified — Grafana
// already carries the equivalent counters. metrics.RecordEvent consults this
// set and skips the PostHog Capture for these names while still incrementing
// the counter. PostHog is reserved for user/product-behaviour events.
//
// Note: agent_task_* lifecycle events are also Prometheus-only, but their
// Prometheus side is handled by typed BusinessMetrics.RecordTask* methods, so
// they never build an analytics.Event in the first place and don't need an
// entry here.
var metricsOnlyEvents = map[string]struct{}{
EventRuntimeRegistered: {},
EventRuntimeReady: {},
EventRuntimeFailed: {},
EventRuntimeOffline: {},
EventAutopilotRunStarted: {},
EventAutopilotRunCompleted: {},
EventAutopilotRunFailed: {},
}
// IsMetricsOnly reports whether an event name is operational telemetry that
// must be counted in Prometheus but not sent to PostHog. See metricsOnlyEvents.
func IsMetricsOnly(name string) bool {
_, ok := metricsOnlyEvents[name]
return ok
}
const (
SourceOnboarding = "onboarding"
SourceManual = "manual"
@@ -306,39 +329,6 @@ func ChatMessageSent(userID, workspaceID, chatSessionID, taskID, agentID, runtim
}
}
func AgentTaskQueued(ctx TaskContext) Event {
return agentTaskEvent(EventAgentTaskQueued, ctx, nil)
}
func AgentTaskDispatched(ctx TaskContext) Event {
return agentTaskEvent(EventAgentTaskDispatched, ctx, nil)
}
func AgentTaskStarted(ctx TaskContext) Event {
return agentTaskEvent(EventAgentTaskStarted, ctx, nil)
}
func AgentTaskCompleted(ctx TaskContext, durationMS int64) Event {
return agentTaskEvent(EventAgentTaskCompleted, ctx, map[string]any{
"duration_ms": durationMS,
})
}
func AgentTaskFailed(ctx TaskContext, durationMS int64, failureReason, errorType string, willRetry bool) Event {
return agentTaskEvent(EventAgentTaskFailed, ctx, map[string]any{
"duration_ms": durationMS,
"failure_reason": failureReason,
"error_type": errorType,
"will_retry": willRetry,
})
}
func AgentTaskCancelled(ctx TaskContext, durationMS int64) Event {
return agentTaskEvent(EventAgentTaskCancelled, ctx, map[string]any{
"duration_ms": durationMS,
})
}
// AutopilotAssignee describes the autopilot's configured target. agent_id is
// always the agent that will actually execute the work (the squad leader for
// squad autopilots) so funnels grouping by agent stay consistent. assignee_*
@@ -657,16 +647,6 @@ func AutopilotCreated(actorID, workspaceID, autopilotID, cadence, triggerKind st
}
}
func agentTaskEvent(name string, ctx TaskContext, extra map[string]any) Event {
props := withCoreProperties(extra, CoreProperties(ctx))
return Event{
Name: name,
DistinctID: distinctID(ctx.UserID, ctx.WorkspaceID, ctx.AgentID),
WorkspaceID: ctx.WorkspaceID,
Properties: props,
}
}
func autopilotRunEvent(name, actorID, workspaceID, autopilotID, runID, cadence string, assignee AutopilotAssignee, triggerSource string, extra map[string]any) Event {
if extra == nil {
extra = map[string]any{}
@@ -733,20 +713,6 @@ func withCoreProperties(props map[string]any, core CoreProperties) map[string]an
return props
}
func distinctID(userID, workspaceID, agentID string) string {
if userID != "" {
return userID
}
// Synthetic PostHog distinct IDs are namespace-prefixed; user UUIDs are not.
if agentID != "" {
return "agent:" + agentID
}
if workspaceID != "" {
return "workspace:" + workspaceID
}
return ""
}
func nonAgentUserID(distinct string) string {
if distinct == "" || strings.Contains(distinct, ":") {
return ""

View File

@@ -15,21 +15,6 @@ func TestRuntimeReadyOmitsUnmeasuredDuration(t *testing.T) {
}
func TestFailedEventsUseWillRetry(t *testing.T) {
ctx := TaskContext{
UserID: "user-1",
WorkspaceID: "workspace-1",
AgentID: "agent-1",
TaskID: "task-1",
Source: SourceManual,
}
taskEv := AgentTaskFailed(ctx, 10, "runtime_offline", "runtime", true)
if got := taskEv.Properties["will_retry"]; got != true {
t.Fatalf("task will_retry = %v, want true", got)
}
if _, ok := taskEv.Properties["recoverable"]; ok {
t.Fatalf("task failure should not emit recoverable")
}
runEv := AutopilotRunFailed("user-1", "workspace-1", "autopilot-1", "run-1", "manual", AutopilotAssignee{AgentID: "agent-1", AssigneeType: "agent"}, "manual", "task failed", "task_error", false, 10)
if got := runEv.Properties["will_retry"]; got != false {
t.Fatalf("autopilot will_retry = %v, want false", got)
@@ -39,29 +24,24 @@ func TestFailedEventsUseWillRetry(t *testing.T) {
}
}
func TestAgentTaskDispatchedUsesTaskCoreProperties(t *testing.T) {
ctx := TaskContext{
UserID: "user-1",
WorkspaceID: "workspace-1",
AgentID: "agent-1",
TaskID: "task-1",
IssueID: "issue-1",
Source: SourceManual,
RuntimeMode: "local",
Provider: "codex",
func TestIsMetricsOnly(t *testing.T) {
// Operational / execution-lifecycle events are Prometheus-only and must
// not be shipped to PostHog.
for _, name := range []string{
EventRuntimeRegistered, EventRuntimeReady, EventRuntimeFailed, EventRuntimeOffline,
EventAutopilotRunStarted, EventAutopilotRunCompleted, EventAutopilotRunFailed,
} {
if !IsMetricsOnly(name) {
t.Errorf("IsMetricsOnly(%q) = false, want true (operational event must stay out of PostHog)", name)
}
}
ev := AgentTaskDispatched(ctx)
if ev.Name != EventAgentTaskDispatched {
t.Fatalf("event name = %q, want %q", ev.Name, EventAgentTaskDispatched)
}
if got := ev.WorkspaceID; got != "workspace-1" {
t.Fatalf("workspace_id = %q, want workspace-1", got)
}
if got := ev.Properties["task_id"]; got != "task-1" {
t.Fatalf("task_id = %v, want task-1", got)
}
if got := ev.Properties["runtime_mode"]; got != "local" {
t.Fatalf("runtime_mode = %v, want local", got)
// Product-behaviour events must still reach PostHog.
for _, name := range []string{
EventSignup, EventWorkspaceCreated, EventIssueCreated, EventIssueExecuted,
EventChatMessageSent, EventAgentCreated, EventAutopilotCreated,
} {
if IsMetricsOnly(name) {
t.Errorf("IsMetricsOnly(%q) = true, want false (product event must reach PostHog)", name)
}
}
}

View File

@@ -239,13 +239,18 @@ func (e *businessEventMetrics) collectors() []prometheus.Collector {
// `m = nil` (no metrics) safely; both sides are best-effort and never block
// the request path.
//
// Operational / execution-lifecycle events flagged by analytics.IsMetricsOnly
// (runtime_*, autopilot_run_*) still increment their Prometheus counter but are
// NOT shipped to PostHog — Grafana already covers them and their high volume is
// not worth the per-event PostHog ingestion cost. PostHog is reserved for
// user/product-behaviour events.
//
// This is the canonical way to emit any of the funnel / community / commercial
// PostHog events from server code. Direct analytics.Client.Capture(...) with
// an event constructed from analytics.* is rejected by the lint test in
// business_pairing_test.go — see that test for the allow-list of events whose
// Prometheus side is handled by typed BusinessMetrics methods (multica_agent_task_*).
// business_pairing_test.go.
func RecordEvent(client analytics.Client, m *BusinessMetrics, ev analytics.Event) {
if client != nil {
if client != nil && !analytics.IsMetricsOnly(ev.Name) {
client.Capture(ev)
}
if m != nil {
@@ -344,11 +349,11 @@ func (m *BusinessMetrics) IncForEvent(ev analytics.Event) {
case analytics.EventContactSalesSubmitted:
m.events.contactSalesSubmitted.WithLabelValues(NormalizeContactSalesSource(stringProp(ev.Properties, "form_source"))).Inc()
default:
// AgentTask* events are intentionally not dispatched here — their
// Prometheus side is handled by typed BusinessMetrics.RecordTask*
// methods that take queue/run/total seconds the analytics event
// does not carry. The lint test allow-lists those names.
// Anything else is a missing case and the lint test will fail CI.
// agent_task_* lifecycle telemetry is recorded straight to Prometheus
// via the typed BusinessMetrics.RecordTask* methods (they take
// queue/run/total seconds that an analytics.Event does not carry), so
// there is no analytics.Event to dispatch here. Anything else reaching
// this default is a missing case and the lint test will fail CI.
}
}

View File

@@ -4,8 +4,10 @@ package metrics_test
// server/internal/analytics/events.go has a paired Prometheus counter
// reachable through metrics.RecordEvent — and that every
// h.Analytics.Capture(analytics.<Helper>(...)) call site goes through
// metrics.RecordEvent (no naked Capture allowed except for the AgentTask*
// allow-list whose Prometheus side is handled by typed PR2 methods).
// metrics.RecordEvent (no naked Capture allowed). The agent task lifecycle is
// no longer an analytics.Event — it is recorded straight to Prometheus via the
// typed BusinessMetrics.RecordTask* methods — so there is no longer an
// AgentTask* allow-list here.
import (
"go/ast"
@@ -21,22 +23,6 @@ import (
"github.com/multica-ai/multica/server/internal/metrics"
)
// taskMetricEvents are emitted via the typed PR2 methods (RecordTaskEnqueued,
// RecordTaskDispatched, RecordTaskStarted, RecordTaskTerminal, RecordTaskFailed)
// instead of the generic RecordEvent dispatcher because their Prometheus side
// needs queue/run/total seconds that the analytics event does not carry.
//
// These names are still required to be paired — the lint test verifies they
// have a typed RecordTask* hit in service/task.go.
var taskMetricEvents = map[string]string{
analytics.EventAgentTaskQueued: "RecordTaskEnqueued",
analytics.EventAgentTaskDispatched: "RecordTaskDispatched",
analytics.EventAgentTaskStarted: "RecordTaskStarted",
analytics.EventAgentTaskCompleted: "RecordTaskTerminal",
analytics.EventAgentTaskFailed: "RecordTaskFailed",
analytics.EventAgentTaskCancelled: "RecordTaskTerminal",
}
// frontendOnlyEvents are declared in events.go but emitted from the frontend,
// not from server code. They still need a Prometheus counter (so a future
// server-side emission point lights up the same label set) but the server
@@ -46,10 +32,13 @@ var frontendOnlyEvents = map[string]bool{
}
// TestEveryAnalyticsEventHasPrometheusCounter asserts that every Event*
// constant declared in analytics/events.go either:
// - is dispatched by metrics.IncForEvent (verified by sending a synthetic
// event through RecordEvent and observing a counter delta), or
// - is in the typed taskMetricEvents allow-list.
// constant declared in analytics/events.go is dispatched by
// metrics.IncForEvent (verified by sending a synthetic event through
// RecordEvent and observing a counter delta).
//
// Note: agent_task_* lifecycle telemetry is Prometheus-only via the typed
// BusinessMetrics.RecordTask* methods and is no longer declared as an
// analytics.Event, so there are no agent_task constants to exempt here.
func TestEveryAnalyticsEventHasPrometheusCounter(t *testing.T) {
t.Parallel()
@@ -57,9 +46,6 @@ func TestEveryAnalyticsEventHasPrometheusCounter(t *testing.T) {
m := metrics.NewBusinessMetrics()
for name := range declared {
if _, allowed := taskMetricEvents[name]; allowed {
continue
}
// Build a minimal event with the required label properties that the
// dispatcher reads. Since IncForEvent reads via stringProp helpers,
// a nil Properties map is acceptable for events with empty label
@@ -78,11 +64,10 @@ func TestEveryAnalyticsEventHasPrometheusCounter(t *testing.T) {
// TestNoNakedAnalyticsCaptureInHandlersOrServices walks every Go file under
// server/internal/handler and server/internal/service and asserts that every
// `<x>.Analytics.Capture(analytics.<Helper>(...))` call has been migrated to
// metrics.RecordEvent. The only exception is the body of
// `service/task.go`'s captureTaskEvent function — a function-granular
// allow-list, not a whole-file one, so any new naked Capture added to the
// same file fails CI.
// `<x>.Analytics.Capture(analytics.<Helper>(...))` call goes through
// metrics.RecordEvent. There are no exceptions: every server-side PostHog
// event must flow through RecordEvent so the Prometheus and PostHog sides
// cannot drift.
func TestNoNakedAnalyticsCaptureInHandlersOrServices(t *testing.T) {
t.Parallel()
@@ -93,13 +78,10 @@ func TestNoNakedAnalyticsCaptureInHandlersOrServices(t *testing.T) {
}
// allowedFunctions is keyed by absolute file path, valued by the set of
// function names whose bodies are allowed to call Analytics.Capture
// directly. Granularity is per-function, not per-file: anything else
// in the same file still trips the check.
allowedFunctions := map[string]map[string]struct{}{
filepath.Join(repoRoot(t), "internal", "service", "task.go"): {
"captureTaskEvent": {},
},
}
// directly. Granularity is per-function, not per-file. Currently empty —
// no server code is permitted to call Analytics.Capture outside
// metrics.RecordEvent.
allowedFunctions := map[string]map[string]struct{}{}
var offenders []string
fset := token.NewFileSet()

View File

@@ -0,0 +1,44 @@
package metrics_test
import (
"testing"
"github.com/multica-ai/multica/server/internal/analytics"
"github.com/multica-ai/multica/server/internal/metrics"
)
// captureSpy records the names of every event handed to Capture so tests can
// assert which events actually reach PostHog.
type captureSpy struct{ names []string }
func (c *captureSpy) Capture(e analytics.Event) { c.names = append(c.names, e.Name) }
func (c *captureSpy) Close() {}
// TestRecordEventSkipsPostHogForMetricsOnly verifies that operational events
// flagged by analytics.IsMetricsOnly still increment a Prometheus counter but
// are NOT shipped to PostHog, while product-behaviour events are shipped.
func TestRecordEventSkipsPostHogForMetricsOnly(t *testing.T) {
spy := &captureSpy{}
m := metrics.NewBusinessMetrics()
// Operational event: Prometheus counter moves, PostHog gets nothing.
before := metrics.SumAllCounters(m)
metrics.RecordEvent(spy, m, analytics.RuntimeOffline("user-1", "ws-1", "rt-1", "daemon-1", "claude"))
if len(spy.names) != 0 {
t.Fatalf("runtime_offline shipped %d events to PostHog, want 0: %v", len(spy.names), spy.names)
}
if metrics.SumAllCounters(m) <= before {
t.Fatalf("runtime_offline did not increment a Prometheus counter")
}
metrics.RecordEvent(spy, m, analytics.AutopilotRunCompleted("user-1", "ws-1", "ap-1", "run-1", "manual", analytics.AutopilotAssignee{AgentID: "agent-1", AssigneeType: "agent"}, "manual", 10))
if len(spy.names) != 0 {
t.Fatalf("autopilot_run_completed shipped to PostHog, want 0: %v", spy.names)
}
// Product-behaviour event: still shipped to PostHog.
metrics.RecordEvent(spy, m, analytics.WorkspaceCreated("user-1", "ws-1"))
if len(spy.names) != 1 || spy.names[0] != analytics.EventWorkspaceCreated {
t.Fatalf("workspace_created was not shipped to PostHog exactly once: %v", spy.names)
}
}

View File

@@ -140,7 +140,6 @@ func (s *TaskService) captureTaskQueued(ctx context.Context, task db.AgentTaskQu
source, runtimeMode, _ := s.taskMetricsContext(ctx, task)
s.Metrics.RecordTaskEnqueued(source, runtimeMode)
}
s.captureTaskEvent(ctx, analytics.AgentTaskQueued(s.taskAnalyticsContext(ctx, task)))
}
func (s *TaskService) captureTaskDispatched(ctx context.Context, task db.AgentTaskQueue) {
@@ -148,7 +147,6 @@ func (s *TaskService) captureTaskDispatched(ctx context.Context, task db.AgentTa
source, runtimeMode, _ := s.taskMetricsContext(ctx, task)
s.Metrics.RecordTaskDispatched(util.UUIDToString(task.ID), source, runtimeMode, taskQueueWaitSeconds(task))
}
s.captureTaskEvent(ctx, analytics.AgentTaskDispatched(s.taskAnalyticsContext(ctx, task)))
}
func (s *TaskService) AnalyticsContextForTask(ctx context.Context, task db.AgentTaskQueue) analytics.TaskContext {
@@ -160,7 +158,6 @@ func (s *TaskService) captureTaskStarted(ctx context.Context, task db.AgentTaskQ
source, runtimeMode, provider := s.taskMetricsContext(ctx, task)
s.Metrics.RecordTaskStarted(source, runtimeMode, provider)
}
s.captureTaskEvent(ctx, analytics.AgentTaskStarted(s.taskAnalyticsContext(ctx, task)))
}
func (s *TaskService) captureTaskCompleted(ctx context.Context, task db.AgentTaskQueue) {
@@ -168,10 +165,6 @@ func (s *TaskService) captureTaskCompleted(ctx context.Context, task db.AgentTas
source, runtimeMode, _ := s.taskMetricsContext(ctx, task)
s.Metrics.RecordTaskTerminal(util.UUIDToString(task.ID), source, runtimeMode, task.Status, taskRunSeconds(task), taskTotalSeconds(task), task.Attempt)
}
s.captureTaskEvent(ctx, analytics.AgentTaskCompleted(
s.taskAnalyticsContext(ctx, task),
taskDurationMS(task),
))
}
func (s *TaskService) captureTaskFailed(ctx context.Context, task db.AgentTaskQueue) {
@@ -181,13 +174,6 @@ func (s *TaskService) captureTaskFailed(ctx context.Context, task db.AgentTaskQu
s.Metrics.RecordTaskTerminal(util.UUIDToString(task.ID), source, runtimeMode, task.Status, taskRunSeconds(task), taskTotalSeconds(task), task.Attempt)
s.Metrics.RecordTaskFailed(source, runtimeMode, failureReason)
}
s.captureTaskEvent(ctx, analytics.AgentTaskFailed(
s.taskAnalyticsContext(ctx, task),
taskDurationMS(task),
failureReason,
taskErrorType(failureReason),
s.willRetryTask(task),
))
}
func (s *TaskService) captureTaskCancelled(ctx context.Context, task db.AgentTaskQueue) {
@@ -195,10 +181,6 @@ func (s *TaskService) captureTaskCancelled(ctx context.Context, task db.AgentTas
source, runtimeMode, _ := s.taskMetricsContext(ctx, task)
s.Metrics.RecordTaskTerminal(util.UUIDToString(task.ID), source, runtimeMode, task.Status, taskRunSeconds(task), taskTotalSeconds(task), task.Attempt)
}
s.captureTaskEvent(ctx, analytics.AgentTaskCancelled(
s.taskAnalyticsContext(ctx, task),
taskDurationMS(task),
))
// Revoke any mat_ task tokens minted for this task. Cancellation is
// a terminal transition, so the running agent process no longer
// needs to call back; eagerly deleting the token closes the
@@ -239,16 +221,6 @@ func (s *TaskService) CaptureLeaseExpiredTasks(ctx context.Context, tasks []db.A
}
}
func (s *TaskService) captureTaskEvent(ctx context.Context, event analytics.Event) {
if s.Analytics == nil {
return
}
if event.WorkspaceID == "" {
return
}
s.Analytics.Capture(event)
}
func (s *TaskService) cachedTaskAnalyticsContext(task db.AgentTaskQueue) (analytics.TaskContext, bool) {
key := taskAnalyticsContextKey(task)
if key == "" {
@@ -410,26 +382,6 @@ func (s *TaskService) taskAnalyticsContext(ctx context.Context, task db.AgentTas
return tc
}
func taskDurationMS(task db.AgentTaskQueue) int64 {
if !task.CompletedAt.Valid {
return 0
}
start := task.CreatedAt
if task.StartedAt.Valid {
start = task.StartedAt
} else if task.DispatchedAt.Valid {
start = task.DispatchedAt
}
if !start.Valid {
return 0
}
ms := task.CompletedAt.Time.Sub(start.Time).Milliseconds()
if ms < 0 {
return 0
}
return ms
}
func taskQueueWaitSeconds(task db.AgentTaskQueue) float64 {
return durationSeconds(task.CreatedAt, task.DispatchedAt)
}
@@ -475,20 +427,6 @@ func taskErrorType(reason string) string {
}
}
func (s *TaskService) willRetryTask(task db.AgentTaskQueue) bool {
reason := taskFailureReason(task)
if !retryableReasons[reason] {
return false
}
if task.Attempt >= task.MaxAttempts {
return false
}
if task.AutopilotRunID.Valid {
return false
}
return task.IssueID.Valid || task.ChatSessionID.Valid
}
// EnqueueTaskForIssue creates a queued task for an agent-assigned issue.
// No context snapshot is stored — the agent fetches all data it needs at
// runtime via the multica CLI.