Files
multica/server
Bohan Jiang c3d8529ba5 fix(channel): eliminate WaitGroup Add/Wait data race in engine.Supervisor (MUL-3620) (#4517)
The race detector caught a flaky failure on main (passes on retry):
Supervisor.startSupervisor does s.wg.Add(1) under s.mu, while Supervisor.Wait
calls s.wg.Wait() with no lock. Calling WaitGroup.Add concurrently with
WaitGroup.Wait is a data race and undefined per the WaitGroup contract — so it
only trips occasionally (it passed locally and in PR CI).

Wait now blocks on stopChan (closed by Run's defer when Run returns) before
calling wg.Wait(). Run is the sole caller of startSupervisor, so once Run has
returned no further Add can happen and wg.Wait is race-free. WaitWithTimeout
inherits the fix (it calls Wait), and its timer still bounds shutdown.

This latent race existed in the original lark.Hub.Wait too; fixed properly in
the generalized Supervisor.

Verified: go test -race -count=300 on the flagged test and -count=8 on the
whole engine package, all clean; no deadlock from the stopChan gate (every
caller pairs Wait with a started Run + cancelled ctx).

Co-authored-by: J <j@multica.ai>
Co-authored-by: multica-agent <github@multica.ai>
2026-06-24 17:21:03 +08:00
..