Files
multica/server/pkg/agent
Multica Eve aa4478af52 fix(codex): unhang cleanup after stdout scanner overflow (#4520) (#4563)
When codex emits a single stdout line larger than the daemon's 10 MB
bufio.Scanner cap (typical trigger: thread/resume on a long-history
session), the reader goroutine returns scanner.Err()="token too long",
markProcessExited fails the in-flight RPC, and the lifecycle goroutine
enters its failure path. That path calls drainAndWait() — stdin.Close()
+ cmd.Wait() — before sending the failed Result. But cmd.Wait() never
returns: codex is alive and blocked writing the rest of the oversized
line into a stdout pipe nobody is reading, so it never reaches its
stdin read syscall and never sees the EOF. The lifecycle goroutine
therefore never sends Result{failed} to its caller, the outer daemon
blocks on the result channel, and the existing PriorSessionID-with-
empty-SessionID fallback never fires — the task is permanently
stalled and codex (Node wrapper + native Rust app-server) leaks until
the OS reaps them.

The cancel() that would have unblocked things via cmd.WaitDelay's
SIGKILL was registered as a defer AFTER drainAndWait, so LIFO defer
order put cancel last — drainAndWait blocks first, cancel never runs.

Fix:

1. drainAndWait now runs the existing graceful-then-cancel pattern
   itself, in two bounded phases. Phase 1 waits for readerDone (capped
   by codexGracefulShutdownTimeout, so we still give codex its OTEL
   flush window on clean exits); on timeout it cancels the runCtx so
   cmd.Cancel kills the tree and the reader unblocks. Phase 2 bounds
   cmd.Wait() the same way for the scanner-overflow case, where
   readerDone closed early but the process is still alive on a full
   stdout pipe. The success-path cleanup that previously duplicated the
   graceful-cancel pattern around readerDone collapses to a single
   drainAndWait() call.

2. cmd.Cancel is set to send SIGKILL to the whole codex process group
   (Setpgid via configureProcessGroup, signalProcessGroup on cancel)
   instead of just the leader. This addresses YOMXXX's
   orphaned-Codex-child concern: the Node wrapper and the native
   app-server it spawns now both die when cleanup forces the kill,
   rather than the native binary leaking as an orphan reparented to
   init. configureProcessGroup is a no-op on Windows.

3. codexGracefulShutdownTimeoutNanos atomic.Int64 mirrors
   opencodeTerminateGraceNanos so the regression test can shrink the
   grace window from 10 s to 500 ms. Production code is unchanged
   (default 10 s).

Outer daemon (daemon.go) already retries with a fresh session when
result.Status == "failed" && PriorSessionID != "" && result.SessionID
== ""; the failed Result now actually reaches it, so the recovery
fires on its own without any daemon-side change.

Tests:

- New regression TestCodexExecuteCleansUpWhenScannerOverflowsOnResume
  spawns a fake codex that emits an 11 MB single-line thread/resume
  response (trips the scanner cap) and then sleeps without re-reading
  stdin. With the original drainAndWait body it blocks at the 10 s
  executeFakeCodex deadline ("timeout waiting for result") — verified
  by temporarily reverting just the helper body — and with the fix it
  completes in ~1.3 s with Result.Status="failed",
  Result.SessionID="" so the outer fallback can fire.
- Full codex test suite, full agent package, daemon + execenv +
  repocache packages, go build ./..., and go vet on agent/daemon all
  pass.

Out of scope (deferred to follow-up per YOMXXX): bumping the 10 MB
bufio.Scanner cap on codex / claude / copilot / cursor / hermes /
kimi / kiro / codebuddy / antigravity / qoder / openclaw / opencode
(pi already sits at 32 MB), and the shared bounded JSON-RPC line
reader that would eliminate the single-line-overflow risk class
entirely. Buffer size alone is not the fix — recovery behaviour is.

Refs: GH#4520

Co-authored-by: Eve <eve@multica-ai.local>
Co-authored-by: multica-agent <github@multica.ai>
2026-06-25 14:46:20 +08:00
..