mirror of
https://github.com/multica-ai/multica.git
synced 2026-07-05 21:39:54 +02:00
fix(channel): eliminate WaitGroup Add/Wait data race in engine.Supervisor (MUL-3620) (#4517)
The race detector caught a flaky failure on main (passes on retry): Supervisor.startSupervisor does s.wg.Add(1) under s.mu, while Supervisor.Wait calls s.wg.Wait() with no lock. Calling WaitGroup.Add concurrently with WaitGroup.Wait is a data race and undefined per the WaitGroup contract — so it only trips occasionally (it passed locally and in PR CI). Wait now blocks on stopChan (closed by Run's defer when Run returns) before calling wg.Wait(). Run is the sole caller of startSupervisor, so once Run has returned no further Add can happen and wg.Wait is race-free. WaitWithTimeout inherits the fix (it calls Wait), and its timer still bounds shutdown. This latent race existed in the original lark.Hub.Wait too; fixed properly in the generalized Supervisor. Verified: go test -race -count=300 on the flagged test and -count=8 on the whole engine package, all clean; no deadlock from the stopChan gate (every caller pairs Wait with a started Run + cancelled ctx). Co-authored-by: J <j@multica.ai> Co-authored-by: multica-agent <github@multica.ai>
This commit is contained in:
@@ -287,10 +287,19 @@ func (s *Supervisor) Run(ctx context.Context) {
|
||||
// Wait blocks until every supervisor goroutine the Supervisor started has
|
||||
// exited. Call this AFTER cancelling Run's context.
|
||||
//
|
||||
// It first waits for Run to return (stopChan closed) and only then joins the
|
||||
// supervisor WaitGroup. This ordering is load-bearing: Run is the sole caller
|
||||
// of startSupervisor, which does s.wg.Add(1), and calling WaitGroup.Add
|
||||
// concurrently with WaitGroup.Wait is a data race (and undefined per the
|
||||
// WaitGroup contract). Once Run has returned no further Add can happen, so the
|
||||
// wg.Wait below is race-free. (Run always closes stopChan via defer, even on
|
||||
// panic; callers always pair Wait with a started Run + cancelled ctx.)
|
||||
//
|
||||
// Prefer WaitWithTimeout in shutdown paths so a stuck supervisor (typically
|
||||
// a hung lease release on a frozen DB pool) cannot block process exit
|
||||
// indefinitely.
|
||||
func (s *Supervisor) Wait() {
|
||||
<-s.stopChan
|
||||
s.wg.Wait()
|
||||
}
|
||||
|
||||
|
||||
Reference in New Issue
Block a user