Retry pipe connection in TryShutdownServer to fix flaky CanShutdownServerProcess test#13559
Conversation
There was a problem hiding this comment.
Pull request overview
This PR aims to reduce flakiness in MSBuild server shutdown by retrying the named-pipe connection inside MSBuildClient.TryShutdownServer, addressing a narrow timing window where the server releases the ServerBusy mutex before its recycled pipe is ready to accept connections.
Changes:
- Replace the single
TryConnectToServer(1_000)call inTryShutdownServerwith a retry loop that keeps attempting for up to 5 seconds. - Between failed attempts, sleep briefly and recreate the pipe stream via
CreateNodePipeStream().
| bool connected = false; | ||
| Stopwatch connectSw = Stopwatch.StartNew(); | ||
| while (!connected && connectSw.ElapsedMilliseconds < 5_000) | ||
| { | ||
| connected = TryConnectToServer(1_000); | ||
| if (!connected && connectSw.ElapsedMilliseconds < 5_000) | ||
| { | ||
| CommunicationsUtilities.Trace("Failed to connect to idle server, will retry..."); | ||
| Thread.Sleep(100); |
There was a problem hiding this comment.
Minor maintainability: the retry constants (5_000, 1_000, 100) are duplicated and embedded in the loop condition and body. Consider extracting them into named constants (e.g., totalRetryMilliseconds / perAttemptTimeoutMilliseconds / retryDelayMilliseconds) to make the intent and future tuning clearer.
| bool connected = false; | |
| Stopwatch connectSw = Stopwatch.StartNew(); | |
| while (!connected && connectSw.ElapsedMilliseconds < 5_000) | |
| { | |
| connected = TryConnectToServer(1_000); | |
| if (!connected && connectSw.ElapsedMilliseconds < 5_000) | |
| { | |
| CommunicationsUtilities.Trace("Failed to connect to idle server, will retry..."); | |
| Thread.Sleep(100); | |
| const int totalRetryMilliseconds = 5_000; | |
| const int perAttemptTimeoutMilliseconds = 1_000; | |
| const int retryDelayMilliseconds = 100; | |
| bool connected = false; | |
| Stopwatch connectSw = Stopwatch.StartNew(); | |
| while (!connected && connectSw.ElapsedMilliseconds < totalRetryMilliseconds) | |
| { | |
| connected = TryConnectToServer(perAttemptTimeoutMilliseconds); | |
| if (!connected && connectSw.ElapsedMilliseconds < totalRetryMilliseconds) | |
| { | |
| CommunicationsUtilities.Trace("Failed to connect to idle server, will retry..."); | |
| Thread.Sleep(retryDelayMilliseconds); |
| bool connected = false; | ||
| Stopwatch connectSw = Stopwatch.StartNew(); | ||
| while (!connected && connectSw.ElapsedMilliseconds < 5_000) | ||
| { | ||
| connected = TryConnectToServer(1_000); | ||
| if (!connected && connectSw.ElapsedMilliseconds < 5_000) |
There was a problem hiding this comment.
TryConnectToServer sets _exitResult.MSBuildClientExitType = UnableToConnect on timeout. With the new retry loop, a transient timeout can leave _exitResult in UnableToConnect even if a later attempt connects and shutdown succeeds, causing TryShutdownServer to return false at the end. Consider keeping connect retries from mutating _exitResult (e.g., add a non-reporting connect attempt), or explicitly reset _exitResult.MSBuildClientExitType back to Success once a connection is established / shutdown completes without error.
| while (!connected && connectSw.ElapsedMilliseconds < 5_000) | ||
| { | ||
| connected = TryConnectToServer(1_000); | ||
| if (!connected && connectSw.ElapsedMilliseconds < 5_000) | ||
| { | ||
| CommunicationsUtilities.Trace("Failed to connect to idle server, will retry..."); | ||
| Thread.Sleep(100); |
There was a problem hiding this comment.
The connect retry loop doesn’t observe cancellationToken (and Thread.Sleep(100) is not cancellable), so ShutdownServer can block for up to ~5 seconds even when cancellation is requested. Please check cancellationToken.IsCancellationRequested in the loop and replace the sleep with a cancellation-aware wait (or throw via ThrowIfCancellationRequested).
| while (!connected && connectSw.ElapsedMilliseconds < 5_000) | |
| { | |
| connected = TryConnectToServer(1_000); | |
| if (!connected && connectSw.ElapsedMilliseconds < 5_000) | |
| { | |
| CommunicationsUtilities.Trace("Failed to connect to idle server, will retry..."); | |
| Thread.Sleep(100); | |
| while (!connected && connectSw.ElapsedMilliseconds < 5_000 && !cancellationToken.IsCancellationRequested) | |
| { | |
| cancellationToken.ThrowIfCancellationRequested(); | |
| connected = TryConnectToServer(1_000); | |
| if (!connected && connectSw.ElapsedMilliseconds < 5_000) | |
| { | |
| CommunicationsUtilities.Trace("Failed to connect to idle server, will retry..."); | |
| if (cancellationToken.WaitHandle.WaitOne(100)) | |
| { | |
| cancellationToken.ThrowIfCancellationRequested(); | |
| } |
| if (!connected && connectSw.ElapsedMilliseconds < 5_000) | ||
| { | ||
| CommunicationsUtilities.Trace("Failed to connect to idle server, will retry..."); | ||
| Thread.Sleep(100); |
There was a problem hiding this comment.
CreateNodePipeStream() overwrites _nodeStream / _packetPump without disposing the previous instances. Calling it repeatedly in this retry loop can leak pipe handles/threads across retries. Consider disposing the prior _packetPump and _nodeStream before recreating them (either in CreateNodePipeStream itself or right before calling it here).
| Thread.Sleep(100); | |
| Thread.Sleep(100); | |
| _packetPump?.Dispose(); | |
| _packetPump = null; | |
| _nodeStream?.Dispose(); | |
| _nodeStream = null; |
Review SummaryThe fix correctly addresses the described race condition: the server releases its "busy" mutex before re-establishing its named pipe listener, causing Issues found (by severity)🟡 Medium — CancellationToken ignored in retry loop 🟡 Medium — No test coverage 🔵 Low — Resource leak on retry ( 🔵 Low — Stale Design observationsRetry layering is reasonable but worth documenting. Happy-path overhead is negligible. When the first No ChangeWave needed. This is a reliability fix to an existing code path with no user-visible behavioral change (shutdown either succeeds or fails; the timing change is internal). Note 🔒 Integrity filter blocked 2 itemsThe following items were blocked because they don't meet the GitHub integrity level.
To allow these resources, lower tools:
github:
min-integrity: approved # merged | approved | unapproved | none
|
There was a problem hiding this comment.
The fix correctly addresses the pipe-recycling race condition. The retry approach is sound and the layering with TryConnectToServer's internal retry is reasonable.
Two items worth addressing before merge:
- CancellationToken should be checked in the retry loop to avoid unresponsive hangs during Ctrl+C.
- Resource disposal —
CreateNodePipeStream()should dispose the previous_nodeStreamto avoid leaking pipe handles on retry.
The stale _exitResult and constant extraction are nice-to-haves. Missing tests are understandable given the race-condition nature of the bug, but a mechanical retry test would add confidence.
Note
🔒 Integrity filter blocked 2 items
The following items were blocked because they don't meet the GitHub integrity level.
- #13559
pull_request_read: has lower integrity than agent requires. The agent cannot read data with integrity below "approved". - #13559
pull_request_read: has lower integrity than agent requires. The agent cannot read data with integrity below "approved".
To allow these resources, lower min-integrity in your GitHub frontmatter:
tools:
github:
min-integrity: approved # merged | approved | unapproved | noneGenerated by Expert Code Review (on open) for issue #13559 · ● 1.4M
| while (!connected && connectSw.ElapsedMilliseconds < 5_000) | ||
| { | ||
| connected = TryConnectToServer(1_000); | ||
| if (!connected && connectSw.ElapsedMilliseconds < 5_000) | ||
| { | ||
| CommunicationsUtilities.Trace("Failed to connect to idle server, will retry..."); | ||
| Thread.Sleep(100); | ||
| CreateNodePipeStream(); | ||
| } | ||
| } |
There was a problem hiding this comment.
Bug: CancellationToken is ignored in the retry loop
TryShutdownServer accepts a CancellationToken cancellationToken parameter, and the existing code passes it to ReadPacketsLoop further down. However, this new retry loop can block for up to 5 seconds without ever checking for cancellation. If a user presses Ctrl+C during shutdown, the process will hang unresponsively until all retries are exhausted.
Suggested fix — check cancellation in the while condition:
while (!connected && connectSw.ElapsedMilliseconds < 5_000 && !cancellationToken.IsCancellationRequested)And also in the inner if:
if (!connected && connectSw.ElapsedMilliseconds < 5_000 && !cancellationToken.IsCancellationRequested)Alternatively, replace Thread.Sleep(100) with cancellationToken.WaitHandle.WaitOne(100) to make the sleep itself cancellable.
| Stopwatch connectSw = Stopwatch.StartNew(); | ||
| while (!connected && connectSw.ElapsedMilliseconds < 5_000) | ||
| { | ||
| connected = TryConnectToServer(1_000); |
There was a problem hiding this comment.
Nit: _exitResult side-effect on retry
When TryConnectToServer returns false, it sets _exitResult.MSBuildClientExitType = MSBuildClientExitType.UnableToConnect (line 643). If a subsequent retry succeeds, TryConnectToServer returns true without clearing this field. The stale UnableToConnect value persists until ReadPacketsLoop eventually overwrites it.
In practice this is probably harmless because TrySendShutdownCommand / ReadPacketsLoop will set the final exit type. But it's fragile — any future code that checks _exitResult between the connection and the read loop would see the wrong value. Worth a defensive reset after the loop:
if (connected)
{
_exitResult.MSBuildClientExitType = MSBuildClientExitType.Success; // Clear stale retry state
}Low severity — mentioning for robustness, not blocking.
| { | ||
| CommunicationsUtilities.Trace("Failed to connect to idle server, will retry..."); | ||
| Thread.Sleep(100); | ||
| CreateNodePipeStream(); |
There was a problem hiding this comment.
Minor: CreateNodePipeStream() does not dispose the previous stream
CreateNodePipeStream() overwrites _nodeStream and _packetPump without disposing the previous instances. Each failed retry leaks a NamedPipeClientStream. The same issue exists inside TryConnectToServer's internal retry (line 638), so this is pre-existing, but the new outer retry loop amplifies the number of leaked streams from ~1 to potentially ~4.
Consider disposing the old stream before creating a new one:
_nodeStream?.Dispose();
CreateNodePipeStream();Or better yet, fix CreateNodePipeStream() itself to dispose the old stream if non-null.
| // the transitional window. | ||
| bool connected = false; | ||
| Stopwatch connectSw = Stopwatch.StartNew(); | ||
| while (!connected && connectSw.ElapsedMilliseconds < 5_000) |
There was a problem hiding this comment.
Design: Consider extracting the 5_000 constant
The 5_000 ms timeout appears twice (line 264 and 267). Worth extracting to a named constant for clarity and to avoid the values diverging:
const int ShutdownConnectTimeoutMs = 5_000;Also, this is a 5× increase from the previous 1-second timeout. For the shutdown path, this means dotnet build -nodeReuse:false or explicit shutdown calls could take up to 5 seconds when the server is genuinely unreachable (e.g., crashed). The previous behavior would fail after ~1 second. This is a reasonable trade-off for the race condition fix, but it should be documented in the PR description so reviewers can make an informed call.
Fixes #
Retry pipe connection in TryShutdownServer to fix flaky CanShutdownServerProcess test
Context
After a build completes, the MSBuild server releases the
ServerBusymutex before finishing its pipe recycling (creating a newServerNodeEndpointOutOfProcand callingListen). WhenTryShutdownServeris called during this window,TryConnectToServersilently returnsfalsewithHandshakeStatus.Timeoutinstead of retrying. SinceBuildManager.ShutdownAllNodes()ignores the return value ofShutdownServer(), the server never receives the shutdown command and the process remains alive, causingserverProcess.HasExited.ShouldBeTrue()to fail after the 10-secondWaitForExittimeout.Changes Made
MSBuildClient.TryShutdownServer: replaced the singleTryConnectToServer(1_000)call with a retry loop that retries for up to 5 seconds, sleeping 100ms and resetting the pipe stream (CreateNodePipeStream()) between attempts.Testing