mirror of
https://github.com/multica-ai/multica.git
synced 2026-06-17 03:38:32 +02:00
Lift MUL-1949's offline backfill failure_reason taxonomy into a shared in-flight classifier so the agent_task_queue.failure_reason column is written with refined values (provider_auth_or_access, context_overflow, provider_capacity_or_rate_limit, …) at write time rather than waiting on SQL backfill to re-classify after the fact.

PR1 of the Grafana board plan in MUL-2328; the upcoming PR2 reuses pkg/taskfailure.AllReasons() to pre-warm the Prometheus failure_reason label set.

* server/pkg/taskfailure: new package with the canonical 21 Reason constants (7 platform-side + 14 agent_error.* sub-reasons), AllReasons() returning a defensive copy, IsAgentError() prefix check, and Classify(rawError) Reason mirroring the SQL CASE rules from MUL-1949 (db-boy's analysis). 100% statement coverage.
* server/internal/daemon/daemon.go: route the 'agent_error' coarse fallback paths (StartTask error, runTask early-return error, CompleteTask permanent rejection, reportTaskResult default branch) and the executeAndDrain default error case (chained after classifyPoisonedError) through taskfailure.Classify so blocked / timeout / unknown-status results all carry a refined reason on the wire.
* server/internal/service/task.go: FailTask classifies errMsg when the daemon-supplied failureReason is empty, eliminating the legacy COALESCE(.., 'agent_error') landing.
* server/internal/daemon/poisoned.go: alias FailureReasonIterationLimit and FailureReasonAPIInvalidRequest to the canonical taskfailure constants. agent_fallback_message and codex_semantic_inactivity are pre-existing operational reasons not in the canonical 21; they are kept as literals for now and revisited in a follow-up PR.

Backfill SQL from MUL-1949 stays as the authoritative offline source of truth; this PR keeps the in-flight classifier in lock-step with the SQL CASE expression so historical and future rows share the same taxonomy.

No behavior change for the platform-side reasons (queued_expired, runtime_offline, runtime_recovery, timeout, etc.), which already align with the canonical set.

Co-authored-by: Eve <eve@multica-ai.local>
Co-authored-by: multica-agent <github@multica.ai>
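The chaining order described for the executeAndDrain default error case (poisoned-error check first, general classifier second) can be illustrated with a minimal, self-contained sketch. It is not the daemon's code: `refineReason` and `classifyFallback` are hypothetical names, `classifyFallback` is only a stand-in for `taskfailure.Classify` (whose real rules mirror the MUL-1949 SQL CASE and are not shown here), and the `reasonAPIInvalidRequest` string is a placeholder for whatever value `taskfailure.ReasonAPIInvalidRequest` actually holds.

```go
package main

import (
	"fmt"
	"strings"
)

// Placeholder value: the real string lives in taskfailure.ReasonAPIInvalidRequest.
const reasonAPIInvalidRequest = "api_invalid_request"

// classifyPoisonedError applies the same narrow check as poisoned.go:
// both "invalid_request_error" and "400" must appear in the message.
func classifyPoisonedError(errMsg string) (string, bool) {
	if errMsg == "" {
		return "", false
	}
	lowered := strings.ToLower(errMsg)
	if strings.Contains(lowered, "invalid_request_error") && strings.Contains(lowered, "400") {
		return reasonAPIInvalidRequest, true
	}
	return "", false
}

// classifyFallback is a hypothetical stand-in for taskfailure.Classify.
// The single rule below is invented for illustration only.
func classifyFallback(errMsg string) string {
	lowered := strings.ToLower(errMsg)
	if strings.Contains(lowered, "429") || strings.Contains(lowered, "rate limit") {
		return "provider_capacity_or_rate_limit"
	}
	return "agent_error" // coarse bucket when nothing more specific matches
}

// refineReason (hypothetical) tries the poisoned-session check first and
// only then falls through to the general classifier.
func refineReason(errMsg string) string {
	if reason, ok := classifyPoisonedError(errMsg); ok {
		return reason
	}
	return classifyFallback(errMsg)
}

func main() {
	msgs := []string{
		`API Error: 400 {"type":"error","error":{"type":"invalid_request_error"}}`,
		"API Error: 429 rate limited",
		"something unexpected",
	}
	for _, m := range msgs {
		fmt.Printf("%-32s <- %s\n", refineReason(m), m)
	}
}
```

Running the poisoned check first matters because a poisoned reason also controls whether the session may be resumed, so it should not be shadowed by a broader match in the general classifier.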
135 lines
6.2 KiB
Go
package daemon

import (
	"strings"

	"github.com/multica-ai/multica/server/pkg/agent"
	"github.com/multica-ai/multica/server/pkg/taskfailure"
)

// FailureReason values for tasks whose session is "poisoned", i.e.
// resuming the same conversation on a follow-up task would deterministically
// reproduce the same failure. Listed here so the server-side query
// GetLastTaskSession can filter them out and the next task starts from
// a fresh agent session instead of inheriting the bad state.
//
// Three flavors:
//   - Output-side: agent "completed" with output that is actually a known
//     fallback marker (gave up mid-thought, emitted a meta message). Detected
//     via classifyPoisonedOutput.
//   - Error-side: the LLM API itself rejected the request with a 400
//     invalid_request_error (oversized payload, malformed image, etc.).
//     The bad message is already baked into the conversation history, so
//     every resume hits the same 400. Detected via classifyPoisonedError.
//   - Timeout-side: Codex reported semantic inactivity after the session got
//     stuck without agent progress. Resuming that Codex session can replay the
//     same stuck state, while a fresh manual rerun may succeed. Detected via
//     classifyResumeUnsafeTimeout.
//
// MUL-2946: FailureReasonIterationLimit and FailureReasonAPIInvalidRequest
// are aliased to the canonical taskfailure values so the daemon and the
// in-flight classifier (used by every other failure path) share a single
// source of truth. agent_fallback_message and codex_semantic_inactivity are
// pre-existing operational reasons not in the canonical 21; they are kept as
// string literals here until a follow-up PR migrates them or extends
// the taxonomy.
const (
	FailureReasonIterationLimit          = string(taskfailure.ReasonIterationLimit)
	FailureReasonAgentFallbackMsg        = "agent_fallback_message"
	FailureReasonAPIInvalidRequest       = string(taskfailure.ReasonAPIInvalidRequest)
	FailureReasonCodexSemanticInactivity = "codex_semantic_inactivity"
)

// poisonedOutputMaxLen caps how long an output can be and still be
// classified as a poisoned fallback. Real fallback messages are short,
// one-sentence affairs; a long output that happens to mention a marker
// is almost certainly a real conclusion (e.g. a code-review reply
// quoting these strings, like the one currently quoting them in
// MUL-1630). The cap intentionally errs on the side of NOT classifying:
// a missed poisoned task gets retried by user action, but a
// false positive turns a successful task into a failure and a system
// comment.
const poisonedOutputMaxLen = 320

// poisonedMarkers maps a substring fingerprint of a known agent fallback
// terminal message to its failure_reason classifier. Match is
// case-insensitive and substring-based; the cap above prevents long outputs
// that quote a marker from being misclassified. Substring values must be
// lowercase because the output is lowercased before matching.
var poisonedMarkers = []struct {
	Substring string
	Reason    string
}{
	{"i reached the iteration limit", FailureReasonIterationLimit},
	{"put your final update inside the content string", FailureReasonAgentFallbackMsg},
}

// classifyPoisonedOutput reports whether output matches a known agent
// fallback terminal message and, if so, returns the failure_reason that
// should be persisted on the task row. Long outputs are never
// classified: a real fallback is the agent's only utterance for the
// turn, so anything beyond ~one paragraph is treated as a real result
// even if it contains a marker substring.
func classifyPoisonedOutput(output string) (string, bool) {
	trimmed := strings.TrimSpace(output)
	// len counts bytes, not runes, so the cap is a byte budget.
	if trimmed == "" || len(trimmed) > poisonedOutputMaxLen {
		return "", false
	}
	lowered := strings.ToLower(trimmed)
	for _, m := range poisonedMarkers {
		if strings.Contains(lowered, m.Substring) {
			return m.Reason, true
		}
	}
	return "", false
}

// classifyPoisonedError reports whether an agent error message indicates
// the LLM API itself rejected the request body, i.e. the conversation
// history contains content the API will not accept (oversized image,
// malformed base64, prompt-too-long, etc.). The conversation cannot be
// resumed: every retry replays the same body and reproduces the same 400.
// The classifier returns FailureReasonAPIInvalidRequest so GetLastTaskSession
// excludes the task from the (agent_id, issue_id) resume lookup, and the
// next task on the issue starts a fresh session instead of permanently
// inheriting the bad state.
//
// Match shape: the Claude Code SDK and similar backends surface upstream
// API failures verbatim, e.g.
//
//	API Error: 400 {"type":"error","error":{"type":"invalid_request_error","message":"Could not process image"},"request_id":"..."}
//
// Matching on both "400" and "invalid_request_error" keeps the classifier
// narrow: 429 rate-limits, 5xx overloads, and tool-shaped errors are
// transient and SHOULD resume on retry.
func classifyPoisonedError(errMsg string) (string, bool) {
	if errMsg == "" {
		return "", false
	}
	lowered := strings.ToLower(errMsg)
	// Both markers must be present: "400" alone is too generic (a tool
	// could surface a 400 from anywhere) and "invalid_request_error"
	// alone could in theory appear in non-poisoning contexts. The
	// combination is the canonical Anthropic error shape and indicates
	// the request body (i.e. the conversation history) is the problem.
	if strings.Contains(lowered, "invalid_request_error") && strings.Contains(lowered, "400") {
		return FailureReasonAPIInvalidRequest, true
	}
	return "", false
}

// classifyResumeUnsafeTimeout reports whether a timeout means the recorded
// session should not be resumed. Keep this intentionally provider-specific:
// ordinary daemon/backend timeouts are infrastructure-shaped and should keep
// the resume pointer so retries can continue the in-flight conversation.
func classifyResumeUnsafeTimeout(provider, errMsg string) (string, bool) {
	if strings.ToLower(strings.TrimSpace(provider)) != "codex" || errMsg == "" {
		return "", false
	}
	lowered := strings.ToLower(errMsg)
	if strings.Contains(lowered, strings.ToLower(agent.CodexSemanticInactivityMarker)) ||
		strings.Contains(lowered, strings.ToLower(agent.CodexFirstTurnNoProgressMarker)) {
		return FailureReasonCodexSemanticInactivity, true
	}
	return "", false
}