mirror of
https://github.com/multica-ai/multica.git
synced 2026-06-17 03:38:32 +02:00
* feat(metrics): scrape-time BusinessSamplerCollector for active users / queued / runtime gauges (MUL-2947)
Adds an opt-in prometheus.Collector that runs a fixed set of read-only
SQL queries on every /metrics scrape and exposes the results as gauges:
- multica_active_users{window=5m|1h|24h}
- multica_active_workspaces{window=...}
- multica_agent_task_queued{source}
- multica_agent_task_running{source,runtime_mode}
- multica_agent_task_stuck_total{source}
- multica_runtime_online{runtime_mode,provider}
- multica_runtime_heartbeat_age_seconds{runtime_mode} (histogram)
- multica_workspace_total
Plus a self-introspection histogram
multica_business_sampler_query_seconds{name=...} and a counter
multica_business_sampler_query_errors_total{name=...} so the sampler's
own behaviour is observable on /metrics.
Production-safety contract per the PR4 brief:
- every query runs in its own BEGIN READ ONLY tx with
SET LOCAL statement_timeout = '500ms' (configurable)
- the sampler takes a dedicated *pgxpool.Pool option so operators
can isolate it from business traffic
- successful results are cached for 5–10s (default 8s) to absorb
concurrent scrapes from multiple Prometheus replicas
- every SQL has a hard LIMIT 100 fallback
- all label values flow through the existing BusinessMetrics
NormalizeTaskSource / NormalizeRuntimeMode / NormalizeRuntimeProvider
whitelists, so a misbehaving runtime cannot inflate cardinality
- sampler is OPT-IN via RegistryOptions.BusinessSampler — existing
callers that only pass Pool keep their current behaviour and never
start hitting the DB on /metrics
Tests cover: emit shape, TTL cache (one DB call per N scrapes),
bounded cardinality under malicious labels, opt-out (no leakage), and
DB-hang isolation (unreachable host -> /metrics returns within 5s,
query_errors_total advances).
Refs MUL-2947 (depends on PR2 / MUL-2948, merged in #3695).
Co-authored-by: multica-agent <github@multica.ai>
* fix(metrics): address PR4 review — wire sampler in main.go, fix LIMIT bug, add live-DB statement_timeout test
Three fixes from 大彪's review on #3706:
1. main.go was building NewRegistry without the BusinessSampler option,
so the collector was effectively dead code in prod. Now constructs a
dedicated 2-conn pgxpool (newSamplerDBPool) from the same DATABASE_URL
when METRICS_ADDR is set, plumbs it into RegistryOptions.BusinessSampler,
and defers Close() at shutdown. A pool-build failure logs and disables
the sampler instead of taking down the server.
2. queryActiveUsers / queryActiveWorkspaces previously wrapped the
distinct-user/workspace subquery in a 'LIMIT 100', then COUNT(*)'d
the result — capping the active-user gauge at 100 regardless of
reality. Removed the inner LIMIT; the COUNT scalar is one row anyway,
and metric cardinality is bounded by the fixed samplerWindows
allow-list, not by the SQL shape.
3. The previous DB-hang test only exercised the acquire-fails path. Added
business_sampler_pgsleep_test.go which connects to a live Postgres
(skips cleanly when DATABASE_URL is not set), runs SELECT pg_sleep(2)
inside a sampler-style tx with SET LOCAL statement_timeout = '500ms',
and asserts:
- the call returns in well under 1.5 s (proving the server-side
cancellation, not just our caller-side context)
- query_errors_total{name=pg_sleep_canary} advances
- the duration histogram records the cancellation
Verified locally: 550 ms, SQLSTATE 57014 'canceling statement due to
statement timeout' — exactly the safety net the PR claims.
Refs MUL-2947 / PR #3706.
Co-authored-by: multica-agent <github@multica.ai>
* test(metrics): assert SQLSTATE 57014 on pg_sleep cancellation
The previous assertion only checked that the query was cut off in well
under the sleep duration, which a caller-side context cancellation
would also satisfy. Capturing the inner pgconn.PgError and asserting
Code == "57014" ("query_canceled") nails down that Postgres itself
cancelled the statement because of the SET LOCAL statement_timeout —
so a regression that drops the SET LOCAL line fails this test loudly
instead of silently passing on context cancellation.
Refs MUL-2947 / PR #3706 review nit.
Co-authored-by: multica-agent <github@multica.ai>
---------
Co-authored-by: multica-agent <github@multica.ai>