multica/server/internal/handler/runtime_models.go
Bohan Jiang 2bec2221d2 feat(agent): per-agent thinking_level for claude + codex (MUL-2339) (#2865)
* feat(agent): persist thinking_level per agent (MUL-2339)

Adds a nullable `thinking_level` column to the `agent` table so the
backend can route a runtime-native reasoning/effort token (e.g. Claude's
`xhigh`, Codex's `minimal`) through to the agent CLI on every dispatch.

The column is intentionally TEXT rather than an enum — Claude and Codex
publish overlapping but distinct vocabularies and we want the persisted
value to round-trip exactly through whichever CLI receives it. NULL is
the "use runtime default" sentinel that every downstream consumer reads
as "do not inject --effort / reasoning_effort".

This commit is just the storage layer (migration + sqlc); subsequent
commits wire it through the API, daemon, and agent backends.

Co-authored-by: multica-agent <github@multica.ai>

* feat(agent-backend): inject reasoning effort for claude + codex (MUL-2339)

Extends ExecOptions with a runtime-native ThinkingLevel string and wires
it into the Claude and Codex backends. Discovery is driven by the local
CLI so the daemon advertises whatever the host install supports rather
than a hand-maintained list that goes stale.

Per Elon's PR1 review:
- Claude: parses `claude --help` to learn the `--effort` superset and
  projects through a per-model allow-list (xhigh is Opus-only; max is
  session-only on the smaller models). Falls back to a conservative
  static list when the binary is missing or help drift hides the line.
- Codex: drives `codex debug models --output json` so per-model
  reasoning subsets and the documented default come directly from the
  CLI. The older config-error probe trick is gone — the JSON path is
  stable and doesn't pollute stderr with an intentional misconfig.
- Cache key includes (provider, executablePath, cliVersion) so a CLI
  upgrade invalidates entries that referenced the older help / catalog.
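
The cache-key idea above can be sketched as follows. This is a minimal illustration, not the daemon's code: `cacheKey`, `catalogCache`, the executable path, and the older version string are hypothetical, and the sketch is not safe for concurrent use.

```go
package main

import "fmt"

// cacheKey is a hypothetical composite key; the field names are assumptions.
// Go structs with comparable fields can be used directly as map keys.
type cacheKey struct {
	Provider       string
	ExecutablePath string
	CLIVersion     string
}

// catalogCache maps a key to the thinking levels discovered for it.
type catalogCache struct {
	entries map[cacheKey][]string
}

func newCatalogCache() *catalogCache {
	return &catalogCache{entries: make(map[cacheKey][]string)}
}

func (c *catalogCache) put(k cacheKey, levels []string) { c.entries[k] = levels }

func (c *catalogCache) get(k cacheKey) ([]string, bool) {
	levels, ok := c.entries[k]
	return levels, ok
}

func main() {
	cache := newCatalogCache()
	old := cacheKey{Provider: "codex", ExecutablePath: "/usr/local/bin/codex", CLIVersion: "0.130.0"}
	cache.put(old, []string{"minimal", "high"})

	// Same provider and path, newer CLI version: a different key, so a miss.
	upgraded := old
	upgraded.CLIVersion = "0.131.0"
	_, hit := cache.get(upgraded)
	fmt.Println("hit after upgrade:", hit) // prints "hit after upgrade: false"
}
```

Because the version is part of the key, no explicit invalidation step is needed; stale entries are simply never looked up again.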

Per Trump's PR1 constraint, all three Codex injection points
(thread/start.config, thread/resume.config, turn/start.effort) flow
through one helper (`applyCodexReasoningEffort`) so they cannot drift
independently. The shared `codexReasoningCases` fixture in
`thinking_test.go` asserts the same value→{shape, key} contract at
each site for every level the runtimes know about.

Claude's `--effort` is also added to `claudeBlockedArgs` so a user
custom_args entry can't silently outvote the daemon-injected value.

Co-authored-by: multica-agent <github@multica.ai>

* feat(api): wire thinking_level through API + daemon contract (MUL-2339)

End-to-end plumbing for the per-agent reasoning/effort setting:

- AgentResponse / TaskAgentData now carry `thinking_level`; the daemon's
  claim response includes it and the daemon's executor passes it through
  to agent.ExecOptions, where the Claude and Codex backends already know
  what to do with it.
- ModelEntry on the runtime-models wire format gains a `thinking` block
  carrying `supported_levels` + `default_level` per model so the UI can
  render a runtime-aware picker without the server having to know about
  the local CLI install. `handleModelList` projects the agent-package
  catalog (including the new Thinking field) into the wire shape.
- CreateAgent / UpdateAgent gate the field with a synchronous provider
  enum check (claude / codex only today). UpdateAgent is tri-state:
  field omitted = no change, "" = explicit clear (new
  `ClearAgentThinkingLevel` query, mirrors the existing mcp_config null
  pattern), non-empty = validate then set.
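
The tri-state rule for UpdateAgent can be sketched with a pointer field. This is an assumption-laden illustration: `updateAgentPatch`, `classifyPatch`, and the level lists in `providerLevels` are hypothetical and do not reproduce the real handler or the real vocabularies. Note that with `encoding/json` an explicit JSON `null` also decodes to a nil pointer, so this sketch treats it like an omitted field.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

type patchAction int

const (
	actionNoChange patchAction = iota // field omitted
	actionClear                       // field present and ""
	actionSet                         // field present, non-empty, valid
)

// updateAgentPatch is a hypothetical request body. A *string lets the
// decoder distinguish "omitted" (nil) from "empty string" (pointer to "").
type updateAgentPatch struct {
	ThinkingLevel *string `json:"thinking_level"`
}

// providerLevels is illustrative only, not the real per-provider enum.
var providerLevels = map[string][]string{
	"claude": {"high", "xhigh", "max"},
	"codex":  {"minimal", "high"},
}

var errUnknownLevel = errors.New("unknown thinking_level for provider")

// classifyPatch decides what a PATCH body asks for. A caller would map
// errUnknownLevel to HTTP 400.
func classifyPatch(provider string, raw []byte) (patchAction, string, error) {
	var p updateAgentPatch
	if err := json.Unmarshal(raw, &p); err != nil {
		return actionNoChange, "", err
	}
	if p.ThinkingLevel == nil {
		return actionNoChange, "", nil
	}
	if *p.ThinkingLevel == "" {
		return actionClear, "", nil
	}
	for _, v := range providerLevels[provider] {
		if v == *p.ThinkingLevel {
			return actionSet, v, nil
		}
	}
	return actionNoChange, "", errUnknownLevel
}

func main() {
	bodies := []string{`{}`, `{"thinking_level":""}`, `{"thinking_level":"xhigh"}`, `{"thinking_level":"turbo"}`}
	for _, body := range bodies {
		action, level, err := classifyPatch("claude", []byte(body))
		fmt.Println(body, action, level, err)
	}
}
```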

Per Trump's PR1 review, the API NEVER auto-clears on a runtime/model
swap and ALWAYS returns 400 on an unknown literal value — same shape
across CreateAgent, UpdateAgent, and combined patches that move
runtime + level in one request. Per-model combination failures (e.g.
`xhigh` against a model that only supports up to `high`) surface as a
daemon-side task error, not a silent server-side rewrite.

TS types follow the same shape: `Agent.thinking_level`,
`CreateAgentRequest`/`UpdateAgentRequest` add the field, `RuntimeModel`
grows a `thinking` block. Older backends omit the field, which the
front-end treats as "no picker for this model" — installed desktop
builds keep working.
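
To make the "omitted field means no picker" behaviour concrete, the sketch below marshals copies of the `ModelEntry`, `ModelThinking`, and `ThinkingLevel` wire types from this file. The `encode` helper and the model and level values are made up for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Copies of the wire types defined in runtime_models.go.
type ThinkingLevel struct {
	Value       string `json:"value"`
	Label       string `json:"label"`
	Description string `json:"description,omitempty"`
}

type ModelThinking struct {
	SupportedLevels []ThinkingLevel `json:"supported_levels"`
	DefaultLevel    string          `json:"default_level,omitempty"`
}

type ModelEntry struct {
	ID       string         `json:"id"`
	Label    string         `json:"label"`
	Provider string         `json:"provider,omitempty"`
	Default  bool           `json:"default,omitempty"`
	Thinking *ModelThinking `json:"thinking,omitempty"`
}

// encode is a hypothetical helper that returns the JSON text for an entry.
func encode(e ModelEntry) string {
	b, err := json.Marshal(e)
	if err != nil {
		return "error: " + err.Error()
	}
	return string(b)
}

func main() {
	// A nil Thinking pointer is dropped by omitempty, which is the same
	// shape an older daemon would produce.
	plain := ModelEntry{ID: "model-a", Label: "Model A"}
	fmt.Println(encode(plain))
	// {"id":"model-a","label":"Model A"}

	withThinking := plain
	withThinking.Thinking = &ModelThinking{
		SupportedLevels: []ThinkingLevel{{Value: "high", Label: "High"}},
		DefaultLevel:    "high",
	}
	fmt.Println(encode(withThinking))
	// {"id":"model-a","label":"Model A","thinking":{"supported_levels":[{"value":"high","label":"High"}],"default_level":"high"}}
}
```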

Co-authored-by: multica-agent <github@multica.ai>

* fix(agent): correct codex debug models argv + pin via runner test (MUL-2339)

`codex debug models --output json` is rejected by codex-cli 0.131.0 —
the subcommand emits JSON on stdout by default and has no `--output`
flag. Drop the flag and add `--bundled` to skip the network refresh
that discovery doesn't need. Move the argv to a package-level var and add
a test that runs a fake `codex` to assert the binary actually
receives exactly `debug models --bundled`, so the contract can't
silently drift on the next refactor.

Also teach ValidateThinkingLevel to resolve an empty model to the
provider's default model entry. Without this, every default-model
task with a persisted thinking_level would be misjudged "unknown
model" by the daemon guard.

Co-authored-by: multica-agent <github@multica.ai>

* fix(api): reject runtime switch that would leave invalid thinking_level (MUL-2339)

A PATCH that changed `runtime_id` without touching `thinking_level`
used to silently keep the existing value, so a Claude agent storing
`max` could land on a Codex runtime where `max` is not a recognised
token at all, and the daemon would receive a literal-invalid level.

Hold the same "always 400 on literal-invalid, never silent coerce"
rule on this implicit path. When runtime_id changes and the existing
value is not in the new provider's enum, return 400 with the
recovery options (clear via `thinking_level=""` or re-set in the
same PATCH).

Add coverage for both the kept-when-still-valid and the rejected
cases, plus the two recovery paths (clear and replace).
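
A rough sketch of the runtime-switch rule, under stated assumptions: `checkRuntimeSwitch`, `levelKnown`, and the contents of `providerLevels` are hypothetical, and the real handler works on database rows rather than plain strings.

```go
package main

import (
	"fmt"
	"net/http"
)

// providerLevels is illustrative only, not the real per-provider enum.
var providerLevels = map[string][]string{
	"claude": {"high", "xhigh", "max"},
	"codex":  {"minimal", "high"},
}

func levelKnown(provider, level string) bool {
	for _, v := range providerLevels[provider] {
		if v == level {
			return true
		}
	}
	return false
}

// checkRuntimeSwitch returns the HTTP status for a PATCH that moves an agent
// to newProvider. existing is the stored level ("" stands for NULL); patched
// is nil when the request did not mention thinking_level.
func checkRuntimeSwitch(newProvider, existing string, patched *string) int {
	if patched != nil {
		// Explicit clear or a replacement that is valid for the new provider.
		if *patched == "" || levelKnown(newProvider, *patched) {
			return http.StatusOK
		}
		return http.StatusBadRequest
	}
	// Implicit path: the stored value must still be valid after the switch.
	if existing == "" || levelKnown(newProvider, existing) {
		return http.StatusOK
	}
	return http.StatusBadRequest
}

func main() {
	clear, replace := "", "minimal"
	fmt.Println(checkRuntimeSwitch("codex", "max", nil))      // 400: max is unknown to codex
	fmt.Println(checkRuntimeSwitch("codex", "high", nil))     // 200: still valid
	fmt.Println(checkRuntimeSwitch("codex", "max", &clear))   // 200: cleared in the same PATCH
	fmt.Println(checkRuntimeSwitch("codex", "max", &replace)) // 200: replaced in the same PATCH
}
```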

Co-authored-by: multica-agent <github@multica.ai>

* fix(daemon): guard runTask with per-model thinking_level validator (MUL-2339)

ValidateThinkingLevel existed but had no call site:
`task.Agent.ThinkingLevel` flowed straight into ExecOptions, so `xhigh`
configured on a non-Opus Claude model, or API-side stale values that
escaped the provider enum gate, would be injected anyway.

Run the validator before building ExecOptions. Invalid combinations
log a warning and drop the level instead of failing the task: the
agent still runs, just at the runtime's default reasoning effort.
Discovery errors fail open (keep the level, let the CLI surface any
objection) so a transient `claude --help` failure can't strand work.

Empty model is forwarded as-is; the validator resolves it to the
provider's default model internally per the cross-package contract.

Co-authored-by: multica-agent <github@multica.ai>

* chore(agent): drop stale `--output json` comments + unused scanner (MUL-2339)

Codex CLI's `debug models` subcommand emits JSON without an `--output`
flag, and `parseCodexDebugModels` never read from the bufio.Scanner.
Sync the comments with the actual invocation and remove the dead init.

Co-authored-by: multica-agent <github@multica.ai>

---------

Co-authored-by: multica-agent <github@multica.ai>
2026-05-20 12:30:10 +08:00

398 lines
14 KiB
Go

package handler

import (
	"context"
	"encoding/json"
	"log/slog"
	"net/http"
	"sync"
	"time"

	"github.com/go-chi/chi/v5"
)

// ---------------------------------------------------------------------------
// Model list request store
// ---------------------------------------------------------------------------
//
// The server cannot call the daemon directly (the daemon is behind the user's
// NAT and only polls the server). So "list models for this runtime" uses a
// pending-request pattern: a frontend POST creates a pending request, the
// daemon pops it on the next heartbeat, executes locally, and reports the
// result back.
//
// The store is the cross-cutting state for that flow. It MUST stay coherent
// across API replicas — POST, heartbeat and poll can each land on a different
// node, and they all need to see the same request lifecycle. The single-node
// in-memory implementation is fine for self-hosted dev; multi-node deploys
// (Multica Cloud) MUST use the Redis-backed implementation, otherwise the
// pending request is invisible to whichever replica receives the next call
// and the picker shows "No models available" (regression: see issue
// review on multica-ai/multica#2009).

// ModelListStatus represents the lifecycle of a model list request.
type ModelListStatus string

const (
	ModelListPending   ModelListStatus = "pending"
	ModelListRunning   ModelListStatus = "running"
	ModelListCompleted ModelListStatus = "completed"
	ModelListFailed    ModelListStatus = "failed"
	ModelListTimeout   ModelListStatus = "timeout"
)
// ModelListRequest represents a pending or completed model list request.
// Supported is false when the provider ignores per-agent model
// selection entirely (currently: hermes). The UI uses this to
// disable its dropdown rather than silently accepting a value the
// backend will drop.
//
// RunStartedAt is set when PopPending claims the request. It is
// `json:"-"` because it's a server-side bookkeeping field — the UI only
// needs Status / UpdatedAt to drive the polling loop.
type ModelListRequest struct {
	ID           string          `json:"id"`
	RuntimeID    string          `json:"runtime_id"`
	Status       ModelListStatus `json:"status"`
	Models       []ModelEntry    `json:"models,omitempty"`
	Supported    bool            `json:"supported"`
	Error        string          `json:"error,omitempty"`
	CreatedAt    time.Time       `json:"created_at"`
	UpdatedAt    time.Time       `json:"updated_at"`
	RunStartedAt *time.Time      `json:"-"`
}

// ModelEntry mirrors agent.Model for the wire. `Default` tags the
// model the runtime advertises as its preferred pick (e.g. Claude
// Code's shipped default, or hermes' currentModelId) so the UI can
// badge it — don't drop it when marshalling.
//
// `Thinking` carries the per-model reasoning-effort catalog discovered
// by the daemon for runtimes that support it (claude, codex — see
// MUL-2339). nil means "no picker for this model"; the UI hides the
// thinking_level selector. Older daemons (pre-2026-05) won't send this
// field, which is fine: the UI hides the selector and the agent runs
// with the runtime default.
type ModelEntry struct {
	ID       string         `json:"id"`
	Label    string         `json:"label"`
	Provider string         `json:"provider,omitempty"`
	Default  bool           `json:"default,omitempty"`
	Thinking *ModelThinking `json:"thinking,omitempty"`
}

// ModelThinking is the wire shape for the per-model thinking catalog.
// Mirrors agent.ModelThinking so the daemon's report passes through
// without remapping.
type ModelThinking struct {
	SupportedLevels []ThinkingLevel `json:"supported_levels"`
	DefaultLevel    string          `json:"default_level,omitempty"`
}

// ThinkingLevel is the wire shape for a single entry in a model's
// reasoning-effort catalog. `Value` is the literal token the daemon
// passes to the CLI; `Label` is the human-readable display string;
// `Description` is optional helper copy (Codex's debug-models output
// includes one per level).
type ThinkingLevel struct {
	Value       string `json:"value"`
	Label       string `json:"label"`
	Description string `json:"description,omitempty"`
}
const (
	// modelListPendingTimeout bounds how long a pending request can sit in
	// the store before the UI is told "daemon didn't pick this up".
	modelListPendingTimeout = 30 * time.Second

	// modelListRunningTimeout bounds how long a claimed (running) request
	// can stay claimed before the UI is told "daemon picked this up but
	// never reported a result". This matters when the heartbeat response
	// carrying `pending_model_list` is lost in transit (e.g. HTTP client
	// timeout after PopPending already mutated store state): without this
	// transition the UI would keep polling a record that is stuck in
	// `running` until retention sweeps it.
	modelListRunningTimeout = 60 * time.Second

	// modelListStoreRetention bounds how long any stored request lives in
	// the backing store. The Redis backend uses it as a TTL; the in-memory
	// backend GCs on Create. The window is deliberately wider than the
	// running/pending timeouts so terminal records are still readable when
	// the UI's last poll arrives.
	modelListStoreRetention = 2 * time.Minute
)
// ModelListStore is the contract every backend (in-memory single-node,
// Redis multi-node) must satisfy. Methods take a context so the Redis
// implementation can honour the heartbeat-side timeout that gates a
// slow shared store from stalling the rest of the heartbeat.
type ModelListStore interface {
	Create(ctx context.Context, runtimeID string) (*ModelListRequest, error)
	Get(ctx context.Context, id string) (*ModelListRequest, error)
	// HasPending is a cheap read-only probe used by the heartbeat hot path
	// to gate the side-effecting PopPending. A spurious "true" is fine —
	// PopPending handles "queue empty after probe" by returning nil.
	HasPending(ctx context.Context, runtimeID string) (bool, error)
	PopPending(ctx context.Context, runtimeID string) (*ModelListRequest, error)
	Complete(ctx context.Context, id string, models []ModelEntry, supported bool) error
	Fail(ctx context.Context, id string, errMsg string) error
}
// applyModelListTimeout transitions a request to ModelListTimeout when it has
// been stuck in a non-terminal state past its threshold. Returns true when
// the record was modified so callers can persist the change. The pending
// threshold catches "daemon never picked this up"; the running threshold
// catches "daemon picked it up but the result report was lost" — without
// the running escape, only retention sweep ends the polling loop.
func applyModelListTimeout(req *ModelListRequest, now time.Time) bool {
	switch req.Status {
	case ModelListPending:
		if now.Sub(req.CreatedAt) > modelListPendingTimeout {
			req.Status = ModelListTimeout
			req.Error = "daemon did not respond within 30 seconds"
			req.UpdatedAt = now
			return true
		}
	case ModelListRunning:
		if req.RunStartedAt != nil && now.Sub(*req.RunStartedAt) > modelListRunningTimeout {
			req.Status = ModelListTimeout
			req.Error = "daemon did not finish within 60 seconds"
			req.UpdatedAt = now
			return true
		}
	}
	return false
}
// InMemoryModelListStore is the single-node implementation. Adequate for
// self-hosted dev and the test suite, but unsafe in multi-node deploys
// (each replica gets its own map and the pending request is invisible to
// every replica that didn't receive the POST).
type InMemoryModelListStore struct {
	mu       sync.Mutex
	requests map[string]*ModelListRequest
}

func NewInMemoryModelListStore() *InMemoryModelListStore {
	return &InMemoryModelListStore{requests: make(map[string]*ModelListRequest)}
}

func (s *InMemoryModelListStore) Create(_ context.Context, runtimeID string) (*ModelListRequest, error) {
	s.mu.Lock()
	defer s.mu.Unlock()

	// Garbage-collect stale entries so the map can't grow unbounded.
	for id, req := range s.requests {
		if time.Since(req.CreatedAt) > modelListStoreRetention {
			delete(s.requests, id)
		}
	}

	now := time.Now()
	req := &ModelListRequest{
		ID:        randomID(),
		RuntimeID: runtimeID,
		Status:    ModelListPending,
		// Default to true; the daemon overrides this in the report
		// for providers that don't support per-agent model selection.
		Supported: true,
		CreatedAt: now,
		UpdatedAt: now,
	}
	s.requests[req.ID] = req
	return req, nil
}

func (s *InMemoryModelListStore) Get(_ context.Context, id string) (*ModelListRequest, error) {
	s.mu.Lock()
	defer s.mu.Unlock()

	req, ok := s.requests[id]
	if !ok {
		return nil, nil
	}
	applyModelListTimeout(req, time.Now())
	return req, nil
}

func (s *InMemoryModelListStore) HasPending(_ context.Context, runtimeID string) (bool, error) {
	s.mu.Lock()
	defer s.mu.Unlock()

	now := time.Now()
	for _, req := range s.requests {
		applyModelListTimeout(req, now)
		if req.RuntimeID == runtimeID && req.Status == ModelListPending {
			return true, nil
		}
	}
	return false, nil
}

func (s *InMemoryModelListStore) PopPending(_ context.Context, runtimeID string) (*ModelListRequest, error) {
	s.mu.Lock()
	defer s.mu.Unlock()

	var oldest *ModelListRequest
	now := time.Now()
	for _, req := range s.requests {
		applyModelListTimeout(req, now)
		if req.RuntimeID == runtimeID && req.Status == ModelListPending {
			if oldest == nil || req.CreatedAt.Before(oldest.CreatedAt) {
				oldest = req
			}
		}
	}
	if oldest != nil {
		oldest.Status = ModelListRunning
		startedAt := now
		oldest.RunStartedAt = &startedAt
		oldest.UpdatedAt = now
	}
	return oldest, nil
}

func (s *InMemoryModelListStore) Complete(_ context.Context, id string, models []ModelEntry, supported bool) error {
	s.mu.Lock()
	defer s.mu.Unlock()

	if req, ok := s.requests[id]; ok {
		req.Status = ModelListCompleted
		req.Models = models
		req.Supported = supported
		req.UpdatedAt = time.Now()
	}
	return nil
}

func (s *InMemoryModelListStore) Fail(_ context.Context, id string, errMsg string) error {
	s.mu.Lock()
	defer s.mu.Unlock()

	if req, ok := s.requests[id]; ok {
		req.Status = ModelListFailed
		req.Error = errMsg
		req.UpdatedAt = time.Now()
	}
	return nil
}

func modelListRequestTerminal(status ModelListStatus) bool {
	return status == ModelListCompleted || status == ModelListFailed || status == ModelListTimeout
}
// ---------------------------------------------------------------------------
// Handlers
// ---------------------------------------------------------------------------

// InitiateListModels creates a pending model list request for a runtime.
// Called by the frontend; the daemon picks it up on its next heartbeat.
func (h *Handler) InitiateListModels(w http.ResponseWriter, r *http.Request) {
	runtimeID := chi.URLParam(r, "runtimeId")
	runtimeUUID, ok := parseUUIDOrBadRequest(w, runtimeID, "runtime_id")
	if !ok {
		return
	}

	rt, err := h.Queries.GetAgentRuntime(r.Context(), runtimeUUID)
	if err != nil {
		writeError(w, http.StatusNotFound, "runtime not found")
		return
	}

	if _, ok := h.requireWorkspaceMember(w, r, uuidToString(rt.WorkspaceID), "runtime not found"); !ok {
		return
	}

	if rt.Status != "online" {
		writeError(w, http.StatusServiceUnavailable, "runtime is offline")
		return
	}

	req, err := h.ModelListStore.Create(r.Context(), uuidToString(rt.ID))
	if err != nil {
		writeError(w, http.StatusInternalServerError, "failed to enqueue model list request: "+err.Error())
		return
	}
	writeJSON(w, http.StatusOK, req)
}

// GetModelListRequest returns the status of a model list request.
func (h *Handler) GetModelListRequest(w http.ResponseWriter, r *http.Request) {
	requestID := chi.URLParam(r, "requestId")

	req, err := h.ModelListStore.Get(r.Context(), requestID)
	if err != nil {
		writeError(w, http.StatusInternalServerError, "failed to load request: "+err.Error())
		return
	}
	if req == nil {
		writeError(w, http.StatusNotFound, "request not found")
		return
	}
	writeJSON(w, http.StatusOK, req)
}
// ReportModelListResult receives the list result from the daemon.
func (h *Handler) ReportModelListResult(w http.ResponseWriter, r *http.Request) {
	runtimeID := chi.URLParam(r, "runtimeId")
	if _, ok := h.requireDaemonRuntimeAccess(w, r, runtimeID); !ok {
		return
	}

	requestID := chi.URLParam(r, "requestId")

	// Fetch first so we can ignore stale reports for already-terminal
	// requests (e.g. the heartbeat response that triggered the daemon
	// run was a retry, and the original report already landed).
	existing, err := h.ModelListStore.Get(r.Context(), requestID)
	if err != nil {
		writeError(w, http.StatusInternalServerError, "failed to load request: "+err.Error())
		return
	}
	if existing == nil || existing.RuntimeID != runtimeID {
		writeError(w, http.StatusNotFound, "request not found")
		return
	}
	if modelListRequestTerminal(existing.Status) {
		slog.Debug("ignoring stale model list report", "runtime_id", runtimeID, "request_id", requestID, "status", existing.Status)
		writeJSON(w, http.StatusOK, map[string]string{"status": "ok"})
		return
	}

	var body struct {
		Status    string       `json:"status"` // "completed" or "failed"
		Models    []ModelEntry `json:"models"`
		Supported *bool        `json:"supported"`
		Error     string       `json:"error"`
	}
	if err := json.NewDecoder(r.Body).Decode(&body); err != nil {
		writeError(w, http.StatusBadRequest, "invalid request body")
		return
	}

	if body.Status == "completed" {
		// Older daemons may omit `supported`; default to true to keep
		// the UI usable while they haven't been redeployed yet.
		supported := true
		if body.Supported != nil {
			supported = *body.Supported
		}
		if err := h.ModelListStore.Complete(r.Context(), requestID, body.Models, supported); err != nil {
			// Surface the store failure as 5xx so the daemon can retry instead
			// of swallowing the report (leaves the request stuck in running
			// until the server-side timeout, which is exactly the "looks OK
			// but nothing happens" class of bug we're trying to avoid).
			slog.Error("ModelListStore Complete failed", "error", err, "request_id", requestID)
			writeError(w, http.StatusInternalServerError, "failed to persist completion")
			return
		}
	} else {
		if err := h.ModelListStore.Fail(r.Context(), requestID, body.Error); err != nil {
			slog.Error("ModelListStore Fail failed", "error", err, "request_id", requestID)
			writeError(w, http.StatusInternalServerError, "failed to persist failure")
			return
		}
	}

	slog.Debug("model list report", "runtime_id", runtimeID, "request_id", requestID, "status", body.Status, "count", len(body.Models))
	writeJSON(w, http.StatusOK, map[string]string{"status": "ok"})
}