Compare commits

...

1 Commits

Author SHA1 Message Date
J
e93217400a fix(channel): eliminate WaitGroup Add/Wait data race in engine.Supervisor (MUL-3620)
The race detector caught a flaky failure on main (passes on retry):
Supervisor.startSupervisor does s.wg.Add(1) under s.mu, while Supervisor.Wait
calls s.wg.Wait() with no lock. Calling WaitGroup.Add concurrently with
WaitGroup.Wait is a data race and undefined per the WaitGroup contract — so it
only trips occasionally (it passed locally and in PR CI).

Wait now blocks on stopChan (closed by Run's defer when Run returns) before
calling wg.Wait(). Run is the sole caller of startSupervisor, so once Run has
returned no further Add can happen and wg.Wait is race-free. WaitWithTimeout
inherits the fix (it calls Wait), and its timer still bounds shutdown.

This latent race existed in the original lark.Hub.Wait too; fixed properly in
the generalized Supervisor.

Verified: go test -race -count=300 on the flagged test and -count=8 on the
whole engine package, all clean; no deadlock from the stopChan gate (every
caller pairs Wait with a started Run + cancelled ctx).

Co-authored-by: multica-agent <github@multica.ai>
2026-06-24 17:17:36 +08:00

View File

@@ -287,10 +287,19 @@ func (s *Supervisor) Run(ctx context.Context) {
// Wait blocks until every supervisor goroutine the Supervisor started has
// exited. Call this AFTER cancelling Run's context.
//
// It first waits for Run to return (stopChan closed) and only then joins the
// supervisor WaitGroup. This ordering is load-bearing: Run is the sole caller
// of startSupervisor, which does s.wg.Add(1), and calling WaitGroup.Add
// concurrently with WaitGroup.Wait is a data race (and undefined per the
// WaitGroup contract). Once Run has returned no further Add can happen, so the
// wg.Wait below is race-free. (Run always closes stopChan via defer, even on
// panic; callers always pair Wait with a started Run + cancelled ctx.)
//
// Prefer WaitWithTimeout in shutdown paths so a stuck supervisor (typically
// a hung lease release on a frozen DB pool) cannot block process exit
// indefinitely.
func (s *Supervisor) Wait() {
<-s.stopChan
s.wg.Wait()
}