Files
multica/server/internal/handler/agent_thinking_test.go
Bohan Jiang 2bec2221d2 feat(agent): per-agent thinking_level for claude + codex (MUL-2339) (#2865)
* feat(agent): persist thinking_level per agent (MUL-2339)

Adds a nullable `thinking_level` column to the `agent` table so the
backend can route a runtime-native reasoning/effort token (e.g. Claude's
`xhigh`, Codex's `minimal`) through to the agent CLI on every dispatch.

The column is intentionally TEXT rather than an enum — Claude and Codex
publish overlapping but distinct vocabularies and we want the persisted
value to round-trip exactly through whichever CLI receives it. NULL is
the "use runtime default" sentinel that every downstream consumer reads
as "do not inject --effort / reasoning_effort".

This commit is just the storage layer (migration + sqlc); subsequent
commits wire it through the API, daemon, and agent backends.

Co-authored-by: multica-agent <github@multica.ai>

* feat(agent-backend): inject reasoning effort for claude + codex (MUL-2339)

Extends ExecOptions with a runtime-native ThinkingLevel string and wires
it into the Claude and Codex backends. Discovery is driven by the local
CLI so the daemon advertises whatever the host install supports rather
than a hand-maintained list that goes stale.

Per Elon's PR1 review:
- Claude: parses `claude --help` to learn the `--effort` superset and
  projects through a per-model allow-list (xhigh is Opus-only; max is
  session-only on the smaller models). Falls back to a conservative
  static list when the binary is missing or help drift hides the line.
- Codex: drives `codex debug models --output json` so per-model
  reasoning subsets and the documented default come directly from the
  CLI. The older config-error probe trick is gone — the JSON path is
  stable and doesn't pollute stderr with an intentional misconfig.
- Cache key includes (provider, executablePath, cliVersion) so a CLI
  upgrade invalidates entries that referenced the older help / catalog.

Per Trump's PR1 constraint, all three Codex injection points
(thread/start.config, thread/resume.config, turn/start.effort) flow
through one helper (`applyCodexReasoningEffort`) so they cannot drift
independently. The shared `codexReasoningCases` fixture in
`thinking_test.go` asserts the same value→{shape, key} contract at
each site for every level the runtimes know about.

Claude's `--effort` is also added to `claudeBlockedArgs` so a user
custom_args entry can't silently outvote the daemon-injected value.

Co-authored-by: multica-agent <github@multica.ai>

* feat(api): wire thinking_level through API + daemon contract (MUL-2339)

End-to-end plumbing for the per-agent reasoning/effort setting:

- AgentResponse / TaskAgentData now carry `thinking_level`; the daemon's
  claim response includes it and the daemon's executor passes it through
  to agent.ExecOptions, where the Claude and Codex backends already know
  what to do with it.
- ModelEntry on the runtime-models wire format gains a `thinking` block
  carrying `supported_levels` + `default_level` per model so the UI can
  render a runtime-aware picker without the server having to know about
  the local CLI install. `handleModelList` projects the agent-package
  catalog (including the new Thinking field) into the wire shape.
- CreateAgent / UpdateAgent gate the field with a synchronous provider
  enum check (claude / codex only today). UpdateAgent is tri-state:
  field omitted = no change, "" = explicit clear (new
  `ClearAgentThinkingLevel` query, mirrors the existing mcp_config null
  pattern), non-empty = validate then set.

Per Trump's PR1 review, the API NEVER auto-clears on a runtime/model
swap and ALWAYS returns 400 on an unknown literal value — same shape
across CreateAgent, UpdateAgent, and combined patches that move
runtime + level in one request. Per-model combination failures (e.g.
`xhigh` against a model that only supports up to `high`) surface as a
daemon-side task error, not a silent server-side rewrite.

TS types follow the same shape: `Agent.thinking_level`,
`CreateAgentRequest`/`UpdateAgentRequest` add the field, `RuntimeModel`
grows a `thinking` block. Older backends omit the field, which the
front-end treats as "no picker for this model" — installed desktop
builds keep working.

Co-authored-by: multica-agent <github@multica.ai>

* fix(agent): correct codex debug models argv + pin via runner test (MUL-2339)

`codex debug models --output json` is rejected by codex-cli 0.131.0 —
the subcommand emits JSON on stdout by default and has no `--output`
flag. Drop the flag and add `--bundled` to skip the network refresh
discovery doesn't need. Move the argv to a package-level var and add
a test that runs a fake `codex` to assert the binary actually
receives exactly `debug models --bundled`, so the contract can't
silently drift on the next refactor.

Also teach ValidateThinkingLevel to resolve an empty model to the
provider's default model entry. Without this, every default-model
task with a persisted thinking_level would be misjudged "unknown
model" by the daemon guard.

Co-authored-by: multica-agent <github@multica.ai>

* fix(api): reject runtime switch that would leave invalid thinking_level (MUL-2339)

A PATCH that changed `runtime_id` without touching `thinking_level`
used to silently keep the existing value, so a Claude agent storing
`max` could land on a Codex runtime where `max` is not a recognised
token at all, and the daemon would receive a literal-invalid level.

Hold the same "always 400 on literal-invalid, never silent coerce"
rule on this implicit path. When runtime_id changes and the existing
value is not in the new provider's enum, return 400 with the
recovery options (clear via `thinking_level=""` or re-set in the
same PATCH).

Add coverage for both the kept-when-still-valid and the rejected
cases, plus the two recovery paths (clear and replace).

Co-authored-by: multica-agent <github@multica.ai>

* fix(daemon): guard runTask with per-model thinking_level validator (MUL-2339)

ValidateThinkingLevel existed but had no call site — `task.Agent.
ThinkingLevel` flowed straight into ExecOptions, so `xhigh` configured
on a non-Opus Claude model, or API-side stale values that escaped the
provider enum gate, would be injected anyway.

Run the validator before building ExecOptions. Invalid combinations
log a warning and drop the level instead of failing the task: the
agent still runs, just at the runtime's default reasoning effort.
Discovery errors fail open (keep the level, let the CLI surface any
objection) so a transient `claude --help` failure can't strand work.

Empty model is forwarded as-is; the validator resolves it to the
provider's default model internally per the cross-package contract.

Co-authored-by: multica-agent <github@multica.ai>

* chore(agent): drop stale `--output json` comments + unused scanner (MUL-2339)

Codex CLI's `debug models` subcommand emits JSON without an `--output`
flag, and `parseCodexDebugModels` never read from the bufio.Scanner.
Sync the comments with the actual invocation and remove the dead init.

Co-authored-by: multica-agent <github@multica.ai>

---------

Co-authored-by: multica-agent <github@multica.ai>
2026-05-20 12:30:10 +08:00

373 lines
14 KiB
Go

package handler
import (
"context"
"encoding/json"
"net/http"
"net/http/httptest"
"testing"
)
// TestCreateAgent_ThinkingLevel_ValidationConsistency exercises the
// MUL-2339 invariant: when an HTTP caller sends a literal-invalid
// thinking_level the API MUST return 400, regardless of which other
// field combination the same request mutates. The constraint comes
// from Trump's PR1 review: "invalid value 的 API 行为请保持一致,
// 不要同一类变更有时 400、有时静默清空".
func TestCreateAgent_ThinkingLevel_ValidationConsistency(t *testing.T) {
if testHandler == nil {
t.Skip("database not available")
}
ctx := context.Background()
claudeRuntimeID := createClaudeProviderRuntime(t)
t.Cleanup(func() {
testPool.Exec(ctx,
`DELETE FROM agent WHERE workspace_id = $1 AND name LIKE 'thinking-test-%'`,
testWorkspaceID,
)
})
t.Run("empty value succeeds", func(t *testing.T) {
body := map[string]any{
"name": "thinking-test-empty",
"runtime_id": claudeRuntimeID,
"visibility": "private",
"max_concurrent_tasks": 1,
"thinking_level": "",
}
w := httptest.NewRecorder()
testHandler.CreateAgent(w, newRequest(http.MethodPost, "/api/agents", body))
if w.Code != http.StatusCreated {
t.Fatalf("empty thinking_level: expected 201, got %d: %s", w.Code, w.Body.String())
}
})
t.Run("known claude value succeeds", func(t *testing.T) {
body := map[string]any{
"name": "thinking-test-known",
"runtime_id": claudeRuntimeID,
"visibility": "private",
"max_concurrent_tasks": 1,
"thinking_level": "high",
}
w := httptest.NewRecorder()
testHandler.CreateAgent(w, newRequest(http.MethodPost, "/api/agents", body))
if w.Code != http.StatusCreated {
t.Fatalf("thinking_level=high: expected 201, got %d: %s", w.Code, w.Body.String())
}
var resp map[string]any
_ = json.NewDecoder(w.Body).Decode(&resp)
if resp["thinking_level"] != "high" {
t.Errorf("expected thinking_level=high in response, got %v", resp["thinking_level"])
}
})
t.Run("codex-only token rejected for claude runtime", func(t *testing.T) {
// `none` is a valid Codex token but NOT a Claude token. The
// gate must always 400 regardless of which other fields the
// request also tried to change.
body := map[string]any{
"name": "thinking-test-codex-only",
"runtime_id": claudeRuntimeID,
"visibility": "private",
"max_concurrent_tasks": 1,
"thinking_level": "none",
}
w := httptest.NewRecorder()
testHandler.CreateAgent(w, newRequest(http.MethodPost, "/api/agents", body))
if w.Code != http.StatusBadRequest {
t.Fatalf("codex-only thinking_level on claude runtime: expected 400, got %d: %s", w.Code, w.Body.String())
}
})
t.Run("garbage value rejected", func(t *testing.T) {
body := map[string]any{
"name": "thinking-test-garbage",
"runtime_id": claudeRuntimeID,
"visibility": "private",
"max_concurrent_tasks": 1,
"thinking_level": "supersonic",
}
w := httptest.NewRecorder()
testHandler.CreateAgent(w, newRequest(http.MethodPost, "/api/agents", body))
if w.Code != http.StatusBadRequest {
t.Fatalf("garbage thinking_level: expected 400, got %d: %s", w.Code, w.Body.String())
}
})
}
// TestUpdateAgent_ThinkingLevel_TriState covers the three modes of
// the field on PATCH:
// - field omitted → leave the existing value alone (the silent-clear
// anti-pattern flagged by Trump's review must NOT happen here)
// - explicit "" → clear back to NULL
// - non-empty → validate against the CURRENT runtime's provider enum
//
// All three branches share the same 400 / 200 outcome rule: validation
// failures are always 400, never auto-clear.
func TestUpdateAgent_ThinkingLevel_TriState(t *testing.T) {
if testHandler == nil {
t.Skip("database not available")
}
ctx := context.Background()
claudeRuntimeID := createClaudeProviderRuntime(t)
agentID := createAgentOnRuntime(t, "thinking-update-test", claudeRuntimeID, "high")
t.Cleanup(func() {
testPool.Exec(ctx, `DELETE FROM agent WHERE id = $1`, agentID)
})
// 1. Omitted field — name-only update must NOT touch thinking_level.
t.Run("omitted field leaves value alone", func(t *testing.T) {
body := map[string]any{
"name": "thinking-update-test-renamed",
}
w := httptest.NewRecorder()
req := withURLParam(newRequest(http.MethodPatch, "/api/agents/"+agentID, body), "id", agentID)
testHandler.UpdateAgent(w, req)
if w.Code != http.StatusOK {
t.Fatalf("name-only update: expected 200, got %d: %s", w.Code, w.Body.String())
}
var resp map[string]any
_ = json.NewDecoder(w.Body).Decode(&resp)
if resp["thinking_level"] != "high" {
t.Errorf("name-only update silently changed thinking_level: got %v, want high", resp["thinking_level"])
}
})
// 2. Explicit "" — must clear.
t.Run("empty string clears", func(t *testing.T) {
body := map[string]any{
"thinking_level": "",
}
w := httptest.NewRecorder()
req := withURLParam(newRequest(http.MethodPatch, "/api/agents/"+agentID, body), "id", agentID)
testHandler.UpdateAgent(w, req)
if w.Code != http.StatusOK {
t.Fatalf("clear update: expected 200, got %d: %s", w.Code, w.Body.String())
}
var resp map[string]any
_ = json.NewDecoder(w.Body).Decode(&resp)
if resp["thinking_level"] != "" {
t.Errorf("empty thinking_level should clear: got %v", resp["thinking_level"])
}
})
// 3. Garbage value — always 400, never silently clear.
t.Run("garbage value is always 400", func(t *testing.T) {
body := map[string]any{
"thinking_level": "warp-speed",
}
w := httptest.NewRecorder()
req := withURLParam(newRequest(http.MethodPatch, "/api/agents/"+agentID, body), "id", agentID)
testHandler.UpdateAgent(w, req)
if w.Code != http.StatusBadRequest {
t.Fatalf("garbage thinking_level: expected 400, got %d: %s", w.Code, w.Body.String())
}
})
// 4. Codex-only token while bound to a Claude runtime → 400. This
// is the "consistency" case from Trump's review: the API does
// NOT auto-clear or coerce; the same token that's valid for a
// Codex runtime is rejected here.
t.Run("codex token on claude runtime is 400, not silent clear", func(t *testing.T) {
body := map[string]any{
"thinking_level": "minimal",
}
w := httptest.NewRecorder()
req := withURLParam(newRequest(http.MethodPatch, "/api/agents/"+agentID, body), "id", agentID)
testHandler.UpdateAgent(w, req)
if w.Code != http.StatusBadRequest {
t.Fatalf("codex token on claude runtime: expected 400, got %d: %s", w.Code, w.Body.String())
}
})
}
// TestUpdateAgent_RuntimeSwitch_PreservesValidValueRejectsInvalid covers
// the gap Elon flagged in PR1 review: a PATCH that switches `runtime_id`
// without explicitly touching `thinking_level` used to silently keep
// the existing value, so a Claude agent storing `max` could land on a
// Codex runtime where `max` is not a recognised token at all, and the
// daemon would receive a literal-invalid level.
//
// The contract the test pins, matching the existing "always 400 on
// literal-invalid" rule:
//
// - existing value still valid for the new runtime → 200, value kept
// - existing value invalid for the new runtime → 400, never silent
// clear or coerce
// - caller can recover by re-sending with `thinking_level: ""` to clear
// in the same PATCH
func TestUpdateAgent_RuntimeSwitch_PreservesValidValueRejectsInvalid(t *testing.T) {
if testHandler == nil {
t.Skip("database not available")
}
ctx := context.Background()
claudeRuntimeID := createClaudeProviderRuntime(t)
codexRuntimeID := createCodexProviderRuntime(t)
t.Cleanup(func() {
testPool.Exec(ctx, `DELETE FROM agent WHERE workspace_id = $1 AND name LIKE 'runtime-switch-%'`, testWorkspaceID)
})
t.Run("existing value still valid for new runtime is kept", func(t *testing.T) {
// `high` is valid for both Claude and Codex enums — switching
// runtime without touching thinking_level should keep it.
agentID := createAgentOnRuntime(t, "runtime-switch-keep", claudeRuntimeID, "high")
body := map[string]any{
"runtime_id": codexRuntimeID,
}
w := httptest.NewRecorder()
req := withURLParam(newRequest(http.MethodPatch, "/api/agents/"+agentID, body), "id", agentID)
testHandler.UpdateAgent(w, req)
if w.Code != http.StatusOK {
t.Fatalf("expected 200 when existing value is still valid, got %d: %s", w.Code, w.Body.String())
}
var resp map[string]any
_ = json.NewDecoder(w.Body).Decode(&resp)
if resp["thinking_level"] != "high" {
t.Errorf("expected thinking_level=high preserved across runtime switch, got %v", resp["thinking_level"])
}
})
t.Run("existing value invalid for new runtime is 400, not silent", func(t *testing.T) {
// `max` is Claude-only; switching to Codex must NOT silently
// keep it. Behaviour stays consistent with the explicit-set
// path: always 400 on literal-invalid.
agentID := createAgentOnRuntime(t, "runtime-switch-reject", claudeRuntimeID, "max")
body := map[string]any{
"runtime_id": codexRuntimeID,
}
w := httptest.NewRecorder()
req := withURLParam(newRequest(http.MethodPatch, "/api/agents/"+agentID, body), "id", agentID)
testHandler.UpdateAgent(w, req)
if w.Code != http.StatusBadRequest {
t.Fatalf("expected 400 when existing value is invalid for new runtime, got %d: %s", w.Code, w.Body.String())
}
})
t.Run("simultaneous explicit clear lets the switch through", func(t *testing.T) {
// The 400 above is recoverable: pass `thinking_level: ""` in
// the same PATCH and the switch goes through with a cleared
// value. This is the documented escape hatch in the error
// message; the test pins it so the contract holds.
agentID := createAgentOnRuntime(t, "runtime-switch-clear", claudeRuntimeID, "max")
body := map[string]any{
"runtime_id": codexRuntimeID,
"thinking_level": "",
}
w := httptest.NewRecorder()
req := withURLParam(newRequest(http.MethodPatch, "/api/agents/"+agentID, body), "id", agentID)
testHandler.UpdateAgent(w, req)
if w.Code != http.StatusOK {
t.Fatalf("expected 200 with simultaneous clear, got %d: %s", w.Code, w.Body.String())
}
var resp map[string]any
_ = json.NewDecoder(w.Body).Decode(&resp)
if resp["thinking_level"] != "" {
t.Errorf("expected thinking_level cleared, got %v", resp["thinking_level"])
}
})
t.Run("simultaneous explicit set to valid value lets the switch through", func(t *testing.T) {
// The other recovery: caller picks a value valid for the new
// runtime. Same PATCH, no need for a separate roundtrip.
agentID := createAgentOnRuntime(t, "runtime-switch-replace", claudeRuntimeID, "max")
body := map[string]any{
"runtime_id": codexRuntimeID,
"thinking_level": "minimal",
}
w := httptest.NewRecorder()
req := withURLParam(newRequest(http.MethodPatch, "/api/agents/"+agentID, body), "id", agentID)
testHandler.UpdateAgent(w, req)
if w.Code != http.StatusOK {
t.Fatalf("expected 200 with simultaneous set, got %d: %s", w.Code, w.Body.String())
}
var resp map[string]any
_ = json.NewDecoder(w.Body).Decode(&resp)
if resp["thinking_level"] != "minimal" {
t.Errorf("expected thinking_level=minimal, got %v", resp["thinking_level"])
}
})
}
// createCodexProviderRuntime mirrors createClaudeProviderRuntime but for
// the codex provider, so runtime-switch tests can exercise a real
// cross-provider transition.
func createCodexProviderRuntime(t *testing.T) string {
t.Helper()
var runtimeID string
err := testPool.QueryRow(context.Background(), `
INSERT INTO agent_runtime (
workspace_id, daemon_id, name, runtime_mode, provider, status,
device_info, metadata, last_seen_at, owner_id
)
VALUES ($1, NULL, $2, 'cloud', 'codex', 'online', $3, '{}'::jsonb, now(), $4)
RETURNING id
`, testWorkspaceID, "Codex Thinking Runtime", "Codex thinking-level test runtime", testUserID).Scan(&runtimeID)
if err != nil {
t.Fatalf("create codex runtime: %v", err)
}
t.Cleanup(func() {
testPool.Exec(context.Background(), `DELETE FROM agent_runtime WHERE id = $1`, runtimeID)
})
return runtimeID
}
// createClaudeProviderRuntime stands up a runtime row with provider
// "claude" so the thinking_level gate runs against the real Claude
// enum (the default test runtime uses a fake provider). The runtime
// is workspace-private but visible to the test owner.
func createClaudeProviderRuntime(t *testing.T) string {
t.Helper()
var runtimeID string
err := testPool.QueryRow(context.Background(), `
INSERT INTO agent_runtime (
workspace_id, daemon_id, name, runtime_mode, provider, status,
device_info, metadata, last_seen_at, owner_id
)
VALUES ($1, NULL, $2, 'cloud', 'claude', 'online', $3, '{}'::jsonb, now(), $4)
RETURNING id
`, testWorkspaceID, "Claude Thinking Runtime", "Claude thinking-level test runtime", testUserID).Scan(&runtimeID)
if err != nil {
t.Fatalf("create claude runtime: %v", err)
}
t.Cleanup(func() {
testPool.Exec(context.Background(), `DELETE FROM agent_runtime WHERE id = $1`, runtimeID)
})
return runtimeID
}
// createAgentOnRuntime seeds an agent row bound to the given runtime
// with the given initial thinking_level (empty for NULL).
func createAgentOnRuntime(t *testing.T, name, runtimeID, level string) string {
t.Helper()
var agentID string
var levelArg any
if level == "" {
levelArg = nil
} else {
levelArg = level
}
err := testPool.QueryRow(context.Background(), `
INSERT INTO agent (
workspace_id, name, description, runtime_mode, runtime_config,
runtime_id, visibility, max_concurrent_tasks, owner_id,
instructions, custom_env, custom_args, thinking_level
)
VALUES ($1, $2, '', 'cloud', '{}'::jsonb, $3, 'private', 1, $4, '', '{}'::jsonb, '[]'::jsonb, $5)
RETURNING id
`, testWorkspaceID, name, runtimeID, testUserID, levelArg).Scan(&agentID)
if err != nil {
t.Fatalf("create agent on runtime %s: %v", runtimeID, err)
}
t.Cleanup(func() {
testPool.Exec(context.Background(), `DELETE FROM agent WHERE id = $1`, agentID)
})
return agentID
}