mirror of
https://github.com/multica-ai/multica.git
synced 2026-06-28 18:09:14 +02:00
Compare commits
2 Commits
codex/agen
...
agent/j/f5
| Author | SHA1 | Date | |
|---|---|---|---|
|
|
84750a0598 | ||
|
|
7b52767803 |
@@ -6,6 +6,20 @@ drives our weekly Active Workspaces (WAW) north-star metric.
|
||||
|
||||
See [MUL-1122](https://github.com/multica-ai/multica) for the design context.
|
||||
|
||||
> **PostHog is reserved for user/product-behaviour events.** High-volume
|
||||
> operational / execution-lifecycle telemetry — runtime lifecycle
|
||||
> (`runtime_registered` / `runtime_ready` / `runtime_failed` /
|
||||
> `runtime_offline`), the agent task lifecycle (`agent_task_*`), and autopilot
|
||||
> run lifecycle (`autopilot_run_started` / `autopilot_run_completed` /
|
||||
> `autopilot_run_failed`) — is **Prometheus-only** and is **not** shipped to
|
||||
> PostHog. Grafana already covers it and the per-event PostHog ingestion cost
|
||||
> (these events dominate volume and bill at the identified-event rate) is not
|
||||
> justified. The runtime/autopilot events are flagged by
|
||||
> `analytics.IsMetricsOnly`, which `metrics.RecordEvent` consults to skip the
|
||||
> PostHog `Capture` while still incrementing the Prometheus counter; the
|
||||
> `agent_task_*` lifecycle is recorded straight to Prometheus via the typed
|
||||
> `BusinessMetrics.RecordTask*` methods and has no `analytics.Event` at all.
|
||||
|
||||
## Configuration
|
||||
|
||||
All analytics shipping is toggled by environment variables (see `.env.example`):
|
||||
@@ -89,15 +103,18 @@ Every event is assigned to one dashboard category:
|
||||
|
||||
| Category | Events |
|
||||
|---|---|
|
||||
| `core_loop` | `workspace_created`, `runtime_registered`, `runtime_ready`, `runtime_failed`, `runtime_offline`, `agent_created`, `issue_created`, `chat_message_sent`, `agent_task_queued`, `agent_task_dispatched`, `agent_task_started`, `agent_task_completed`, `agent_task_failed`, `agent_task_cancelled`, `autopilot_run_started`, `autopilot_run_completed`, `autopilot_run_failed` |
|
||||
| `core_loop` | `workspace_created`, `agent_created`, `issue_created`, `chat_message_sent`, `issue_executed`, `autopilot_created`, `squad_created` |
|
||||
| `onboarding_support` | `onboarding_started`, `onboarding_questionnaire_submitted`, `onboarding_completed`, `onboarding_runtime_path_selected`, `onboarding_runtime_detected` |
|
||||
| `acquisition` | `signup`, `download_intent_expressed`, `download_page_viewed`, `download_initiated`, `cloud_waitlist_joined`, `contact_sales_submitted` |
|
||||
| `ops_feedback` | `feedback_opened`, `feedback_submitted` |
|
||||
| `system/noise` | `$pageview`, `$set`, `$identify`, `$autocapture`, `$rageclick` |
|
||||
| `operational (Prometheus-only — NOT in PostHog)` | `runtime_registered`, `runtime_ready`, `runtime_failed`, `runtime_offline`, `agent_task_queued`, `agent_task_dispatched`, `agent_task_started`, `agent_task_completed`, `agent_task_failed`, `agent_task_cancelled`, `autopilot_run_started`, `autopilot_run_completed`, `autopilot_run_failed` |
|
||||
|
||||
The v0 core dashboard must use only `core_loop` plus the specific
|
||||
`onboarding_support` steps used by the activation funnel. Acquisition,
|
||||
feedback, and system/noise events stay in separate dashboards.
|
||||
feedback, and system/noise events stay in separate dashboards. The
|
||||
`operational` row is **not shipped to PostHog** — those signals live in
|
||||
Grafana via `multica_*` business counters (see `server/internal/metrics`).
|
||||
|
||||
## Standard core properties
|
||||
|
||||
@@ -165,6 +182,13 @@ funnel with "first time user does X" or a cohort on
|
||||
|
||||
### `runtime_registered`
|
||||
|
||||
> **Prometheus-only — not shipped to PostHog** (see the note at the top of this
|
||||
> doc). The `analytics.Event` is still constructed so `metrics.IncForEvent` can
|
||||
> derive the Prometheus counter; the fields below are that **event** shape, not
|
||||
> a PostHog contract. Only the low-cardinality fields (`runtime_mode`,
|
||||
> `provider`) become Prometheus labels — ids like `runtime_id` / `daemon_id`
|
||||
> are not labels.
|
||||
|
||||
Fires the first time a `(workspace_id, daemon_id, provider)` tuple is
|
||||
upserted. Heartbeats and repeat registrations never re-emit. First-time
|
||||
detection uses Postgres `xmax = 0` on the upsert RETURNING clause — no
|
||||
@@ -186,6 +210,8 @@ under a single "anonymous" person.
|
||||
|
||||
### `runtime_ready`
|
||||
|
||||
> **Prometheus-only — not shipped to PostHog.**
|
||||
|
||||
Fires when a runtime is first registered in an online/ready state. This is the
|
||||
activation-funnel step that should replace treating `runtime_registered` as
|
||||
proof of readiness. The backend emits this only on the INSERT path for a new
|
||||
@@ -203,6 +229,8 @@ distinct `runtime_id`.
|
||||
|
||||
### `runtime_failed`
|
||||
|
||||
> **Prometheus-only — not shipped to PostHog.**
|
||||
|
||||
Fires when runtime setup/registration fails before a ready runtime can be
|
||||
recorded. Today this is scoped to backend registration persistence failures;
|
||||
future setup flows should reuse it for provider detection or daemon boot
|
||||
@@ -218,6 +246,8 @@ failures.
|
||||
|
||||
### `runtime_offline`
|
||||
|
||||
> **Prometheus-only — not shipped to PostHog.**
|
||||
|
||||
Fires when a runtime is explicitly deregistered or the backend sweeper marks it
|
||||
offline after missed heartbeats. This is not an activation step; it supports
|
||||
local runtime retention and drop-off diagnosis.
|
||||
@@ -247,40 +277,52 @@ is queued.
|
||||
| `agent_id` | string (UUID) | Chat agent. |
|
||||
| `source` | string | Always `chat`. |
|
||||
|
||||
### `agent_task_queued` / `agent_task_dispatched` / `agent_task_started` / `agent_task_completed`
|
||||
### agent task lifecycle (Prometheus-only)
|
||||
|
||||
Canonical task lifecycle events emitted from `agent_task_queue` state
|
||||
transitions. `agent_task_dispatched` fires when the backend claims a queued
|
||||
task for a runtime, before the daemon marks it running with
|
||||
`agent_task_started`. These events replace `issue_executed` for core loop
|
||||
success metrics and allow the activation funnel to split queue backlog from
|
||||
claim/start handoff.
|
||||
> **Not shipped to PostHog and has no `analytics.Event`.** The agent task
|
||||
> lifecycle is recorded directly to Prometheus by the typed
|
||||
> `BusinessMetrics.RecordTask*` methods in `server/internal/service/task.go`.
|
||||
> The old PostHog event names (`agent_task_queued` / `dispatched` / `started` /
|
||||
> `completed` / `failed` / `cancelled`) and their properties (`task_id`,
|
||||
> `agent_id`, `issue_id`, `chat_session_id`, `autopilot_run_id`, `duration_ms`,
|
||||
> `error_type`, `will_retry`) no longer exist anywhere — those high-cardinality
|
||||
> ids were never Prometheus labels and must not be used in dashboards or
|
||||
> reconciliation.
|
||||
|
||||
| Property | Type | Description |
|
||||
The actual metrics (defined in `server/internal/metrics/business.go`; label
|
||||
sets in `server/internal/metrics/labels.go`):
|
||||
|
||||
| Metric | Type | Labels |
|
||||
|---|---|---|
|
||||
| `task_id` | string (UUID) | `agent_task_queue.id`; required. |
|
||||
| `agent_id` | string (UUID) | Owning agent. |
|
||||
| `issue_id` | string (UUID) | Present for issue-linked tasks. |
|
||||
| `chat_session_id` | string (UUID) | Present for chat tasks. |
|
||||
| `autopilot_run_id` | string (UUID) | Present for run-only autopilot tasks. |
|
||||
| `source` | string | `manual`, `chat`, or `autopilot`. |
|
||||
| `runtime_mode` | string | `local` / `cloud`. |
|
||||
| `provider` | string | Runtime provider. |
|
||||
| `duration_ms` | int64 | Terminal events only; measured from `started_at` when available. |
|
||||
| `multica_agent_task_enqueued_total` | counter | `source`, `runtime_mode` |
|
||||
| `multica_agent_task_dispatched_total` | counter | `source`, `runtime_mode` |
|
||||
| `multica_agent_task_started_total` | counter | `source`, `runtime_mode`, `provider` |
|
||||
| `multica_agent_task_terminal_total` | counter | `source`, `runtime_mode`, `terminal_status` |
|
||||
| `multica_agent_task_failed_total` | counter | `source`, `runtime_mode`, `failure_reason` |
|
||||
| `multica_agent_task_queue_wait_seconds` | histogram | `source`, `runtime_mode` |
|
||||
| `multica_agent_task_run_seconds` | histogram | `source`, `runtime_mode`, `terminal_status` |
|
||||
| `multica_agent_task_total_seconds` | histogram | `source`, `runtime_mode`, `terminal_status` |
|
||||
|
||||
### `agent_task_failed` / `agent_task_cancelled`
|
||||
|
||||
Terminal task lifecycle events. They use the same join fields as
|
||||
`agent_task_completed`. `agent_task_failed` also carries:
|
||||
|
||||
| Property | Type | Description |
|
||||
|---|---|---|
|
||||
| `failure_reason` | string | Stable reason from `agent_task_queue.failure_reason`, default `agent_error`. |
|
||||
| `error_type` | string | Stable coarse classifier, e.g. `runtime`, `timeout`, `agent_output`, `cancelled`, `agent_error`. |
|
||||
| `will_retry` | bool | Whether the backend auto-retry policy will create another task attempt. |
|
||||
- `terminal_status` is the task's final `agent_task_queue.status` —
|
||||
`completed` / `failed` / `cancelled`. There is **no** separate
|
||||
completed/cancelled metric: all three land on
|
||||
`multica_agent_task_terminal_total{terminal_status=…}`. Failures
|
||||
additionally increment `multica_agent_task_failed_total` carrying the coarse
|
||||
`failure_reason` (`agent_task_queue.failure_reason`, default `agent_error`).
|
||||
- Task wall-clock lives in the `*_seconds` histograms (queue wait / run /
|
||||
total), replacing the old `duration_ms` event property.
|
||||
- `source` / `runtime_mode` / `provider` are the normalized label values
|
||||
(`NormalizeTaskSource` / `NormalizeRuntimeMode` / `NormalizeRuntimeProvider`).
|
||||
|
||||
### `autopilot_run_started` / `autopilot_run_completed` / `autopilot_run_failed`
|
||||
|
||||
> **Prometheus-only — not shipped to PostHog.** The `analytics.*` constructors
|
||||
> are retained only so `metrics.IncForEvent` can derive the Prometheus counter;
|
||||
> `analytics.IsMetricsOnly` keeps them out of PostHog. Only `cadence`,
|
||||
> `trigger_kind`, and `terminal_status` become Prometheus labels — the
|
||||
> `autopilot_id` / `autopilot_run_id` / `agent_id` fields below are event shape,
|
||||
> not labels.
|
||||
|
||||
Fires from `autopilot_run` lifecycle changes. `source` is always
|
||||
`autopilot`; the trigger origin is carried in `trigger_source` (`manual`,
|
||||
`schedule`, `webhook`, or `api`).
|
||||
@@ -329,9 +371,11 @@ emit `n=1`. PostHog answers the same question at query time via
|
||||
and funnel steps of the form "workspace has had ≥2 `issue_executed`
|
||||
events" are expressible without the property. No information is lost.
|
||||
|
||||
Compatibility: `issue_executed` remains a historical compatibility event for
|
||||
old dashboards. New core-loop success dashboards should use
|
||||
`agent_task_completed` and filter by `source`/`issue_id` as needed.
|
||||
`issue_executed` is the canonical **PostHog** core-loop success signal (the
|
||||
`agent_task_*` lifecycle that previously served per-task success dashboards is
|
||||
now Prometheus-only). Per-task completion counts live in Grafana via
|
||||
`BusinessMetrics.RecordTaskTerminal`; use `issue_executed` for the
|
||||
PostHog-side activation funnel and filter by `source` as needed.
|
||||
|
||||
### `team_invite_sent`
|
||||
|
||||
@@ -604,8 +648,10 @@ sent from a pre-workspace surface.
|
||||
|
||||
## Reconciliation
|
||||
|
||||
`agent_task_completed` is the canonical PostHog-side task success event. It
|
||||
should reconcile daily against the operational source of truth:
|
||||
Per-task completion is no longer shipped to PostHog. Task success now
|
||||
reconciles **DB ↔ Prometheus** instead of DB ↔ PostHog: the
|
||||
`BusinessMetrics.RecordTaskTerminal` counter (exported as a `multica_*` task
|
||||
metric) should track the operational source of truth:
|
||||
|
||||
```sql
|
||||
SELECT date_trunc('day', completed_at AT TIME ZONE 'UTC') AS day,
|
||||
@@ -617,22 +663,13 @@ GROUP BY 1
|
||||
ORDER BY 1;
|
||||
```
|
||||
|
||||
Equivalent HogQL:
|
||||
Compare against the equivalent Prometheus counter in Grafana. The expected
|
||||
difference should be near zero; sustained drift means either an emission site
|
||||
is missing or the metrics pipeline is unhealthy.
|
||||
|
||||
```sql
|
||||
SELECT toStartOfDay(timestamp) AS day,
|
||||
count() AS posthog_completed_tasks
|
||||
FROM events
|
||||
WHERE event = 'agent_task_completed'
|
||||
AND properties.environment = 'production'
|
||||
AND timestamp >= now() - interval 30 day
|
||||
GROUP BY day
|
||||
ORDER BY day
|
||||
```
|
||||
|
||||
The expected difference should be near zero. Allow a small delay window for
|
||||
PostHog ingestion and backend analytics queue drops; sustained drift means
|
||||
either an emission site is missing or PostHog shipping is unhealthy.
|
||||
On the PostHog side, `issue_executed` remains the product-level success signal
|
||||
(at most one per issue) and can be reconciled against
|
||||
`issue.first_executed_at` if needed.
|
||||
|
||||
## Governance
|
||||
|
||||
|
||||
@@ -13,12 +13,6 @@ const (
|
||||
EventIssueExecuted = "issue_executed"
|
||||
EventIssueCreated = "issue_created"
|
||||
EventChatMessageSent = "chat_message_sent"
|
||||
EventAgentTaskQueued = "agent_task_queued"
|
||||
EventAgentTaskDispatched = "agent_task_dispatched"
|
||||
EventAgentTaskStarted = "agent_task_started"
|
||||
EventAgentTaskCompleted = "agent_task_completed"
|
||||
EventAgentTaskFailed = "agent_task_failed"
|
||||
EventAgentTaskCancelled = "agent_task_cancelled"
|
||||
EventAutopilotRunStarted = "autopilot_run_started"
|
||||
EventAutopilotRunCompleted = "autopilot_run_completed"
|
||||
EventAutopilotRunFailed = "autopilot_run_failed"
|
||||
@@ -37,6 +31,35 @@ const (
|
||||
|
||||
const EventSchemaVersion = 2
|
||||
|
||||
// metricsOnlyEvents are operational / execution-lifecycle events that are
|
||||
// recorded to Prometheus (via metrics.IncForEvent, for Grafana) but are
|
||||
// deliberately NOT shipped to PostHog. They are high-volume runtime/autopilot
|
||||
// telemetry whose per-event PostHog ingestion cost is not justified — Grafana
|
||||
// already carries the equivalent counters. metrics.RecordEvent consults this
|
||||
// set and skips the PostHog Capture for these names while still incrementing
|
||||
// the counter. PostHog is reserved for user/product-behaviour events.
|
||||
//
|
||||
// Note: agent_task_* lifecycle events are also Prometheus-only, but their
|
||||
// Prometheus side is handled by typed BusinessMetrics.RecordTask* methods, so
|
||||
// they never build an analytics.Event in the first place and don't need an
|
||||
// entry here.
|
||||
var metricsOnlyEvents = map[string]struct{}{
|
||||
EventRuntimeRegistered: {},
|
||||
EventRuntimeReady: {},
|
||||
EventRuntimeFailed: {},
|
||||
EventRuntimeOffline: {},
|
||||
EventAutopilotRunStarted: {},
|
||||
EventAutopilotRunCompleted: {},
|
||||
EventAutopilotRunFailed: {},
|
||||
}
|
||||
|
||||
// IsMetricsOnly reports whether an event name is operational telemetry that
|
||||
// must be counted in Prometheus but not sent to PostHog. See metricsOnlyEvents.
|
||||
func IsMetricsOnly(name string) bool {
|
||||
_, ok := metricsOnlyEvents[name]
|
||||
return ok
|
||||
}
|
||||
|
||||
const (
|
||||
SourceOnboarding = "onboarding"
|
||||
SourceManual = "manual"
|
||||
@@ -306,39 +329,6 @@ func ChatMessageSent(userID, workspaceID, chatSessionID, taskID, agentID, runtim
|
||||
}
|
||||
}
|
||||
|
||||
func AgentTaskQueued(ctx TaskContext) Event {
|
||||
return agentTaskEvent(EventAgentTaskQueued, ctx, nil)
|
||||
}
|
||||
|
||||
func AgentTaskDispatched(ctx TaskContext) Event {
|
||||
return agentTaskEvent(EventAgentTaskDispatched, ctx, nil)
|
||||
}
|
||||
|
||||
func AgentTaskStarted(ctx TaskContext) Event {
|
||||
return agentTaskEvent(EventAgentTaskStarted, ctx, nil)
|
||||
}
|
||||
|
||||
func AgentTaskCompleted(ctx TaskContext, durationMS int64) Event {
|
||||
return agentTaskEvent(EventAgentTaskCompleted, ctx, map[string]any{
|
||||
"duration_ms": durationMS,
|
||||
})
|
||||
}
|
||||
|
||||
func AgentTaskFailed(ctx TaskContext, durationMS int64, failureReason, errorType string, willRetry bool) Event {
|
||||
return agentTaskEvent(EventAgentTaskFailed, ctx, map[string]any{
|
||||
"duration_ms": durationMS,
|
||||
"failure_reason": failureReason,
|
||||
"error_type": errorType,
|
||||
"will_retry": willRetry,
|
||||
})
|
||||
}
|
||||
|
||||
func AgentTaskCancelled(ctx TaskContext, durationMS int64) Event {
|
||||
return agentTaskEvent(EventAgentTaskCancelled, ctx, map[string]any{
|
||||
"duration_ms": durationMS,
|
||||
})
|
||||
}
|
||||
|
||||
// AutopilotAssignee describes the autopilot's configured target. agent_id is
|
||||
// always the agent that will actually execute the work (the squad leader for
|
||||
// squad autopilots) so funnels grouping by agent stay consistent. assignee_*
|
||||
@@ -657,16 +647,6 @@ func AutopilotCreated(actorID, workspaceID, autopilotID, cadence, triggerKind st
|
||||
}
|
||||
}
|
||||
|
||||
func agentTaskEvent(name string, ctx TaskContext, extra map[string]any) Event {
|
||||
props := withCoreProperties(extra, CoreProperties(ctx))
|
||||
return Event{
|
||||
Name: name,
|
||||
DistinctID: distinctID(ctx.UserID, ctx.WorkspaceID, ctx.AgentID),
|
||||
WorkspaceID: ctx.WorkspaceID,
|
||||
Properties: props,
|
||||
}
|
||||
}
|
||||
|
||||
func autopilotRunEvent(name, actorID, workspaceID, autopilotID, runID, cadence string, assignee AutopilotAssignee, triggerSource string, extra map[string]any) Event {
|
||||
if extra == nil {
|
||||
extra = map[string]any{}
|
||||
@@ -733,20 +713,6 @@ func withCoreProperties(props map[string]any, core CoreProperties) map[string]an
|
||||
return props
|
||||
}
|
||||
|
||||
func distinctID(userID, workspaceID, agentID string) string {
|
||||
if userID != "" {
|
||||
return userID
|
||||
}
|
||||
// Synthetic PostHog distinct IDs are namespace-prefixed; user UUIDs are not.
|
||||
if agentID != "" {
|
||||
return "agent:" + agentID
|
||||
}
|
||||
if workspaceID != "" {
|
||||
return "workspace:" + workspaceID
|
||||
}
|
||||
return ""
|
||||
}
|
||||
|
||||
func nonAgentUserID(distinct string) string {
|
||||
if distinct == "" || strings.Contains(distinct, ":") {
|
||||
return ""
|
||||
|
||||
@@ -15,21 +15,6 @@ func TestRuntimeReadyOmitsUnmeasuredDuration(t *testing.T) {
|
||||
}
|
||||
|
||||
func TestFailedEventsUseWillRetry(t *testing.T) {
|
||||
ctx := TaskContext{
|
||||
UserID: "user-1",
|
||||
WorkspaceID: "workspace-1",
|
||||
AgentID: "agent-1",
|
||||
TaskID: "task-1",
|
||||
Source: SourceManual,
|
||||
}
|
||||
taskEv := AgentTaskFailed(ctx, 10, "runtime_offline", "runtime", true)
|
||||
if got := taskEv.Properties["will_retry"]; got != true {
|
||||
t.Fatalf("task will_retry = %v, want true", got)
|
||||
}
|
||||
if _, ok := taskEv.Properties["recoverable"]; ok {
|
||||
t.Fatalf("task failure should not emit recoverable")
|
||||
}
|
||||
|
||||
runEv := AutopilotRunFailed("user-1", "workspace-1", "autopilot-1", "run-1", "manual", AutopilotAssignee{AgentID: "agent-1", AssigneeType: "agent"}, "manual", "task failed", "task_error", false, 10)
|
||||
if got := runEv.Properties["will_retry"]; got != false {
|
||||
t.Fatalf("autopilot will_retry = %v, want false", got)
|
||||
@@ -39,29 +24,24 @@ func TestFailedEventsUseWillRetry(t *testing.T) {
|
||||
}
|
||||
}
|
||||
|
||||
func TestAgentTaskDispatchedUsesTaskCoreProperties(t *testing.T) {
|
||||
ctx := TaskContext{
|
||||
UserID: "user-1",
|
||||
WorkspaceID: "workspace-1",
|
||||
AgentID: "agent-1",
|
||||
TaskID: "task-1",
|
||||
IssueID: "issue-1",
|
||||
Source: SourceManual,
|
||||
RuntimeMode: "local",
|
||||
Provider: "codex",
|
||||
func TestIsMetricsOnly(t *testing.T) {
|
||||
// Operational / execution-lifecycle events are Prometheus-only and must
|
||||
// not be shipped to PostHog.
|
||||
for _, name := range []string{
|
||||
EventRuntimeRegistered, EventRuntimeReady, EventRuntimeFailed, EventRuntimeOffline,
|
||||
EventAutopilotRunStarted, EventAutopilotRunCompleted, EventAutopilotRunFailed,
|
||||
} {
|
||||
if !IsMetricsOnly(name) {
|
||||
t.Errorf("IsMetricsOnly(%q) = false, want true (operational event must stay out of PostHog)", name)
|
||||
}
|
||||
}
|
||||
ev := AgentTaskDispatched(ctx)
|
||||
|
||||
if ev.Name != EventAgentTaskDispatched {
|
||||
t.Fatalf("event name = %q, want %q", ev.Name, EventAgentTaskDispatched)
|
||||
}
|
||||
if got := ev.WorkspaceID; got != "workspace-1" {
|
||||
t.Fatalf("workspace_id = %q, want workspace-1", got)
|
||||
}
|
||||
if got := ev.Properties["task_id"]; got != "task-1" {
|
||||
t.Fatalf("task_id = %v, want task-1", got)
|
||||
}
|
||||
if got := ev.Properties["runtime_mode"]; got != "local" {
|
||||
t.Fatalf("runtime_mode = %v, want local", got)
|
||||
// Product-behaviour events must still reach PostHog.
|
||||
for _, name := range []string{
|
||||
EventSignup, EventWorkspaceCreated, EventIssueCreated, EventIssueExecuted,
|
||||
EventChatMessageSent, EventAgentCreated, EventAutopilotCreated,
|
||||
} {
|
||||
if IsMetricsOnly(name) {
|
||||
t.Errorf("IsMetricsOnly(%q) = true, want false (product event must reach PostHog)", name)
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
@@ -239,13 +239,18 @@ func (e *businessEventMetrics) collectors() []prometheus.Collector {
|
||||
// `m = nil` (no metrics) safely; both sides are best-effort and never block
|
||||
// the request path.
|
||||
//
|
||||
// Operational / execution-lifecycle events flagged by analytics.IsMetricsOnly
|
||||
// (runtime_*, autopilot_run_*) still increment their Prometheus counter but are
|
||||
// NOT shipped to PostHog — Grafana already covers them and their high volume is
|
||||
// not worth the per-event PostHog ingestion cost. PostHog is reserved for
|
||||
// user/product-behaviour events.
|
||||
//
|
||||
// This is the canonical way to emit any of the funnel / community / commercial
|
||||
// PostHog events from server code. Direct analytics.Client.Capture(...) with
|
||||
// an event constructed from analytics.* is rejected by the lint test in
|
||||
// business_pairing_test.go — see that test for the allow-list of events whose
|
||||
// Prometheus side is handled by typed BusinessMetrics methods (multica_agent_task_*).
|
||||
// business_pairing_test.go.
|
||||
func RecordEvent(client analytics.Client, m *BusinessMetrics, ev analytics.Event) {
|
||||
if client != nil {
|
||||
if client != nil && !analytics.IsMetricsOnly(ev.Name) {
|
||||
client.Capture(ev)
|
||||
}
|
||||
if m != nil {
|
||||
@@ -344,11 +349,11 @@ func (m *BusinessMetrics) IncForEvent(ev analytics.Event) {
|
||||
case analytics.EventContactSalesSubmitted:
|
||||
m.events.contactSalesSubmitted.WithLabelValues(NormalizeContactSalesSource(stringProp(ev.Properties, "form_source"))).Inc()
|
||||
default:
|
||||
// AgentTask* events are intentionally not dispatched here — their
|
||||
// Prometheus side is handled by typed BusinessMetrics.RecordTask*
|
||||
// methods that take queue/run/total seconds the analytics event
|
||||
// does not carry. The lint test allow-lists those names.
|
||||
// Anything else is a missing case and the lint test will fail CI.
|
||||
// agent_task_* lifecycle telemetry is recorded straight to Prometheus
|
||||
// via the typed BusinessMetrics.RecordTask* methods (they take
|
||||
// queue/run/total seconds that an analytics.Event does not carry), so
|
||||
// there is no analytics.Event to dispatch here. Anything else reaching
|
||||
// this default is a missing case and the lint test will fail CI.
|
||||
}
|
||||
}
|
||||
|
||||
|
||||
@@ -4,8 +4,10 @@ package metrics_test
|
||||
// server/internal/analytics/events.go has a paired Prometheus counter
|
||||
// reachable through metrics.RecordEvent — and that every
|
||||
// h.Analytics.Capture(analytics.<Helper>(...)) call site goes through
|
||||
// metrics.RecordEvent (no naked Capture allowed except for the AgentTask*
|
||||
// allow-list whose Prometheus side is handled by typed PR2 methods).
|
||||
// metrics.RecordEvent (no naked Capture allowed). The agent task lifecycle is
|
||||
// no longer an analytics.Event — it is recorded straight to Prometheus via the
|
||||
// typed BusinessMetrics.RecordTask* methods — so there is no longer an
|
||||
// AgentTask* allow-list here.
|
||||
|
||||
import (
|
||||
"go/ast"
|
||||
@@ -21,22 +23,6 @@ import (
|
||||
"github.com/multica-ai/multica/server/internal/metrics"
|
||||
)
|
||||
|
||||
// taskMetricEvents are emitted via the typed PR2 methods (RecordTaskEnqueued,
|
||||
// RecordTaskDispatched, RecordTaskStarted, RecordTaskTerminal, RecordTaskFailed)
|
||||
// instead of the generic RecordEvent dispatcher because their Prometheus side
|
||||
// needs queue/run/total seconds that the analytics event does not carry.
|
||||
//
|
||||
// These names are still required to be paired — the lint test verifies they
|
||||
// have a typed RecordTask* hit in service/task.go.
|
||||
var taskMetricEvents = map[string]string{
|
||||
analytics.EventAgentTaskQueued: "RecordTaskEnqueued",
|
||||
analytics.EventAgentTaskDispatched: "RecordTaskDispatched",
|
||||
analytics.EventAgentTaskStarted: "RecordTaskStarted",
|
||||
analytics.EventAgentTaskCompleted: "RecordTaskTerminal",
|
||||
analytics.EventAgentTaskFailed: "RecordTaskFailed",
|
||||
analytics.EventAgentTaskCancelled: "RecordTaskTerminal",
|
||||
}
|
||||
|
||||
// frontendOnlyEvents are declared in events.go but emitted from the frontend,
|
||||
// not from server code. They still need a Prometheus counter (so a future
|
||||
// server-side emission point lights up the same label set) but the server
|
||||
@@ -46,10 +32,13 @@ var frontendOnlyEvents = map[string]bool{
|
||||
}
|
||||
|
||||
// TestEveryAnalyticsEventHasPrometheusCounter asserts that every Event*
|
||||
// constant declared in analytics/events.go either:
|
||||
// - is dispatched by metrics.IncForEvent (verified by sending a synthetic
|
||||
// event through RecordEvent and observing a counter delta), or
|
||||
// - is in the typed taskMetricEvents allow-list.
|
||||
// constant declared in analytics/events.go is dispatched by
|
||||
// metrics.IncForEvent (verified by sending a synthetic event through
|
||||
// RecordEvent and observing a counter delta).
|
||||
//
|
||||
// Note: agent_task_* lifecycle telemetry is Prometheus-only via the typed
|
||||
// BusinessMetrics.RecordTask* methods and is no longer declared as an
|
||||
// analytics.Event, so there are no agent_task constants to exempt here.
|
||||
func TestEveryAnalyticsEventHasPrometheusCounter(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
@@ -57,9 +46,6 @@ func TestEveryAnalyticsEventHasPrometheusCounter(t *testing.T) {
|
||||
|
||||
m := metrics.NewBusinessMetrics()
|
||||
for name := range declared {
|
||||
if _, allowed := taskMetricEvents[name]; allowed {
|
||||
continue
|
||||
}
|
||||
// Build a minimal event with the required label properties that the
|
||||
// dispatcher reads. Since IncForEvent reads via stringProp helpers,
|
||||
// a nil Properties map is acceptable for events with empty label
|
||||
@@ -78,11 +64,10 @@ func TestEveryAnalyticsEventHasPrometheusCounter(t *testing.T) {
|
||||
|
||||
// TestNoNakedAnalyticsCaptureInHandlersOrServices walks every Go file under
|
||||
// server/internal/handler and server/internal/service and asserts that every
|
||||
// `<x>.Analytics.Capture(analytics.<Helper>(...))` call has been migrated to
|
||||
// metrics.RecordEvent. The only exception is the body of
|
||||
// `service/task.go`'s captureTaskEvent function — a function-granular
|
||||
// allow-list, not a whole-file one, so any new naked Capture added to the
|
||||
// same file fails CI.
|
||||
// `<x>.Analytics.Capture(analytics.<Helper>(...))` call goes through
|
||||
// metrics.RecordEvent. There are no exceptions: every server-side PostHog
|
||||
// event must flow through RecordEvent so the Prometheus and PostHog sides
|
||||
// cannot drift.
|
||||
func TestNoNakedAnalyticsCaptureInHandlersOrServices(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
@@ -93,13 +78,10 @@ func TestNoNakedAnalyticsCaptureInHandlersOrServices(t *testing.T) {
|
||||
}
|
||||
// allowedFunctions is keyed by absolute file path, valued by the set of
|
||||
// function names whose bodies are allowed to call Analytics.Capture
|
||||
// directly. Granularity is per-function, not per-file: anything else
|
||||
// in the same file still trips the check.
|
||||
allowedFunctions := map[string]map[string]struct{}{
|
||||
filepath.Join(repoRoot(t), "internal", "service", "task.go"): {
|
||||
"captureTaskEvent": {},
|
||||
},
|
||||
}
|
||||
// directly. Granularity is per-function, not per-file. Currently empty —
|
||||
// no server code is permitted to call Analytics.Capture outside
|
||||
// metrics.RecordEvent.
|
||||
allowedFunctions := map[string]map[string]struct{}{}
|
||||
|
||||
var offenders []string
|
||||
fset := token.NewFileSet()
|
||||
|
||||
44
server/internal/metrics/record_event_test.go
Normal file
44
server/internal/metrics/record_event_test.go
Normal file
@@ -0,0 +1,44 @@
|
||||
package metrics_test
|
||||
|
||||
import (
|
||||
"testing"
|
||||
|
||||
"github.com/multica-ai/multica/server/internal/analytics"
|
||||
"github.com/multica-ai/multica/server/internal/metrics"
|
||||
)
|
||||
|
||||
// captureSpy records the names of every event handed to Capture so tests can
|
||||
// assert which events actually reach PostHog.
|
||||
type captureSpy struct{ names []string }
|
||||
|
||||
func (c *captureSpy) Capture(e analytics.Event) { c.names = append(c.names, e.Name) }
|
||||
func (c *captureSpy) Close() {}
|
||||
|
||||
// TestRecordEventSkipsPostHogForMetricsOnly verifies that operational events
|
||||
// flagged by analytics.IsMetricsOnly still increment a Prometheus counter but
|
||||
// are NOT shipped to PostHog, while product-behaviour events are shipped.
|
||||
func TestRecordEventSkipsPostHogForMetricsOnly(t *testing.T) {
|
||||
spy := &captureSpy{}
|
||||
m := metrics.NewBusinessMetrics()
|
||||
|
||||
// Operational event: Prometheus counter moves, PostHog gets nothing.
|
||||
before := metrics.SumAllCounters(m)
|
||||
metrics.RecordEvent(spy, m, analytics.RuntimeOffline("user-1", "ws-1", "rt-1", "daemon-1", "claude"))
|
||||
if len(spy.names) != 0 {
|
||||
t.Fatalf("runtime_offline shipped %d events to PostHog, want 0: %v", len(spy.names), spy.names)
|
||||
}
|
||||
if metrics.SumAllCounters(m) <= before {
|
||||
t.Fatalf("runtime_offline did not increment a Prometheus counter")
|
||||
}
|
||||
|
||||
metrics.RecordEvent(spy, m, analytics.AutopilotRunCompleted("user-1", "ws-1", "ap-1", "run-1", "manual", analytics.AutopilotAssignee{AgentID: "agent-1", AssigneeType: "agent"}, "manual", 10))
|
||||
if len(spy.names) != 0 {
|
||||
t.Fatalf("autopilot_run_completed shipped to PostHog, want 0: %v", spy.names)
|
||||
}
|
||||
|
||||
// Product-behaviour event: still shipped to PostHog.
|
||||
metrics.RecordEvent(spy, m, analytics.WorkspaceCreated("user-1", "ws-1"))
|
||||
if len(spy.names) != 1 || spy.names[0] != analytics.EventWorkspaceCreated {
|
||||
t.Fatalf("workspace_created was not shipped to PostHog exactly once: %v", spy.names)
|
||||
}
|
||||
}
|
||||
@@ -140,7 +140,6 @@ func (s *TaskService) captureTaskQueued(ctx context.Context, task db.AgentTaskQu
|
||||
source, runtimeMode, _ := s.taskMetricsContext(ctx, task)
|
||||
s.Metrics.RecordTaskEnqueued(source, runtimeMode)
|
||||
}
|
||||
s.captureTaskEvent(ctx, analytics.AgentTaskQueued(s.taskAnalyticsContext(ctx, task)))
|
||||
}
|
||||
|
||||
func (s *TaskService) captureTaskDispatched(ctx context.Context, task db.AgentTaskQueue) {
|
||||
@@ -148,7 +147,6 @@ func (s *TaskService) captureTaskDispatched(ctx context.Context, task db.AgentTa
|
||||
source, runtimeMode, _ := s.taskMetricsContext(ctx, task)
|
||||
s.Metrics.RecordTaskDispatched(util.UUIDToString(task.ID), source, runtimeMode, taskQueueWaitSeconds(task))
|
||||
}
|
||||
s.captureTaskEvent(ctx, analytics.AgentTaskDispatched(s.taskAnalyticsContext(ctx, task)))
|
||||
}
|
||||
|
||||
func (s *TaskService) AnalyticsContextForTask(ctx context.Context, task db.AgentTaskQueue) analytics.TaskContext {
|
||||
@@ -160,7 +158,6 @@ func (s *TaskService) captureTaskStarted(ctx context.Context, task db.AgentTaskQ
|
||||
source, runtimeMode, provider := s.taskMetricsContext(ctx, task)
|
||||
s.Metrics.RecordTaskStarted(source, runtimeMode, provider)
|
||||
}
|
||||
s.captureTaskEvent(ctx, analytics.AgentTaskStarted(s.taskAnalyticsContext(ctx, task)))
|
||||
}
|
||||
|
||||
func (s *TaskService) captureTaskCompleted(ctx context.Context, task db.AgentTaskQueue) {
|
||||
@@ -168,10 +165,6 @@ func (s *TaskService) captureTaskCompleted(ctx context.Context, task db.AgentTas
|
||||
source, runtimeMode, _ := s.taskMetricsContext(ctx, task)
|
||||
s.Metrics.RecordTaskTerminal(util.UUIDToString(task.ID), source, runtimeMode, task.Status, taskRunSeconds(task), taskTotalSeconds(task), task.Attempt)
|
||||
}
|
||||
s.captureTaskEvent(ctx, analytics.AgentTaskCompleted(
|
||||
s.taskAnalyticsContext(ctx, task),
|
||||
taskDurationMS(task),
|
||||
))
|
||||
}
|
||||
|
||||
func (s *TaskService) captureTaskFailed(ctx context.Context, task db.AgentTaskQueue) {
|
||||
@@ -181,13 +174,6 @@ func (s *TaskService) captureTaskFailed(ctx context.Context, task db.AgentTaskQu
|
||||
s.Metrics.RecordTaskTerminal(util.UUIDToString(task.ID), source, runtimeMode, task.Status, taskRunSeconds(task), taskTotalSeconds(task), task.Attempt)
|
||||
s.Metrics.RecordTaskFailed(source, runtimeMode, failureReason)
|
||||
}
|
||||
s.captureTaskEvent(ctx, analytics.AgentTaskFailed(
|
||||
s.taskAnalyticsContext(ctx, task),
|
||||
taskDurationMS(task),
|
||||
failureReason,
|
||||
taskErrorType(failureReason),
|
||||
s.willRetryTask(task),
|
||||
))
|
||||
}
|
||||
|
||||
func (s *TaskService) captureTaskCancelled(ctx context.Context, task db.AgentTaskQueue) {
|
||||
@@ -195,10 +181,6 @@ func (s *TaskService) captureTaskCancelled(ctx context.Context, task db.AgentTas
|
||||
source, runtimeMode, _ := s.taskMetricsContext(ctx, task)
|
||||
s.Metrics.RecordTaskTerminal(util.UUIDToString(task.ID), source, runtimeMode, task.Status, taskRunSeconds(task), taskTotalSeconds(task), task.Attempt)
|
||||
}
|
||||
s.captureTaskEvent(ctx, analytics.AgentTaskCancelled(
|
||||
s.taskAnalyticsContext(ctx, task),
|
||||
taskDurationMS(task),
|
||||
))
|
||||
// Revoke any mat_ task tokens minted for this task. Cancellation is
|
||||
// a terminal transition, so the running agent process no longer
|
||||
// needs to call back; eagerly deleting the token closes the
|
||||
@@ -239,16 +221,6 @@ func (s *TaskService) CaptureLeaseExpiredTasks(ctx context.Context, tasks []db.A
|
||||
}
|
||||
}
|
||||
|
||||
func (s *TaskService) captureTaskEvent(ctx context.Context, event analytics.Event) {
|
||||
if s.Analytics == nil {
|
||||
return
|
||||
}
|
||||
if event.WorkspaceID == "" {
|
||||
return
|
||||
}
|
||||
s.Analytics.Capture(event)
|
||||
}
|
||||
|
||||
func (s *TaskService) cachedTaskAnalyticsContext(task db.AgentTaskQueue) (analytics.TaskContext, bool) {
|
||||
key := taskAnalyticsContextKey(task)
|
||||
if key == "" {
|
||||
@@ -410,26 +382,6 @@ func (s *TaskService) taskAnalyticsContext(ctx context.Context, task db.AgentTas
|
||||
return tc
|
||||
}
|
||||
|
||||
func taskDurationMS(task db.AgentTaskQueue) int64 {
|
||||
if !task.CompletedAt.Valid {
|
||||
return 0
|
||||
}
|
||||
start := task.CreatedAt
|
||||
if task.StartedAt.Valid {
|
||||
start = task.StartedAt
|
||||
} else if task.DispatchedAt.Valid {
|
||||
start = task.DispatchedAt
|
||||
}
|
||||
if !start.Valid {
|
||||
return 0
|
||||
}
|
||||
ms := task.CompletedAt.Time.Sub(start.Time).Milliseconds()
|
||||
if ms < 0 {
|
||||
return 0
|
||||
}
|
||||
return ms
|
||||
}
|
||||
|
||||
func taskQueueWaitSeconds(task db.AgentTaskQueue) float64 {
|
||||
return durationSeconds(task.CreatedAt, task.DispatchedAt)
|
||||
}
|
||||
@@ -475,20 +427,6 @@ func taskErrorType(reason string) string {
|
||||
}
|
||||
}
|
||||
|
||||
func (s *TaskService) willRetryTask(task db.AgentTaskQueue) bool {
|
||||
reason := taskFailureReason(task)
|
||||
if !retryableReasons[reason] {
|
||||
return false
|
||||
}
|
||||
if task.Attempt >= task.MaxAttempts {
|
||||
return false
|
||||
}
|
||||
if task.AutopilotRunID.Valid {
|
||||
return false
|
||||
}
|
||||
return task.IssueID.Valid || task.ChatSessionID.Valid
|
||||
}
|
||||
|
||||
// EnqueueTaskForIssue creates a queued task for an agent-assigned issue.
|
||||
// No context snapshot is stored — the agent fetches all data it needs at
|
||||
// runtime via the multica CLI.
|
||||
|
||||
Reference in New Issue
Block a user