
Retry pipe connection in TryShutdownServer to fix flaky CanShutdownServerProcess test #13559

Open
huulinhnguyen-dev wants to merge 2 commits into dotnet:main from huulinhnguyen-dev:dev/huulinhnguyen/fix-flaky-canshutdownserverprocess

Conversation

@huulinhnguyen-dev
Contributor

Fixes #
Retry pipe connection in TryShutdownServer to fix flaky CanShutdownServerProcess test

Context

After a build completes, the MSBuild server releases the ServerBusy mutex before finishing its pipe recycling (creating a new ServerNodeEndpointOutOfProc and calling Listen). When TryShutdownServer is called during this window, TryConnectToServer silently returns false with HandshakeStatus.Timeout instead of retrying. Since BuildManager.ShutdownAllNodes() ignores the return value of ShutdownServer(), the server never receives the shutdown command and the process remains alive, causing serverProcess.HasExited.ShouldBeTrue() to fail after the 10-second WaitForExit timeout.

Changes Made

  • In MSBuildClient.TryShutdownServer: replaced the single TryConnectToServer(1_000) call with a retry loop that retries for up to 5 seconds, sleeping 100ms and resetting the pipe stream (CreateNodePipeStream()) between attempts.

Testing

Copilot AI review requested due to automatic review settings April 17, 2026 11:26
Contributor

Copilot AI left a comment


Pull request overview

This PR aims to reduce flakiness in MSBuild server shutdown by retrying the named-pipe connection inside MSBuildClient.TryShutdownServer, addressing a narrow timing window where the server releases the ServerBusy mutex before its recycled pipe is ready to accept connections.

Changes:

  • Replace the single TryConnectToServer(1_000) call in TryShutdownServer with a retry loop that keeps attempting for up to 5 seconds.
  • Between failed attempts, sleep briefly and recreate the pipe stream via CreateNodePipeStream().

Comment on lines +262 to +270
bool connected = false;
Stopwatch connectSw = Stopwatch.StartNew();
while (!connected && connectSw.ElapsedMilliseconds < 5_000)
{
    connected = TryConnectToServer(1_000);
    if (!connected && connectSw.ElapsedMilliseconds < 5_000)
    {
        CommunicationsUtilities.Trace("Failed to connect to idle server, will retry...");
        Thread.Sleep(100);

Copilot AI Apr 17, 2026


Minor maintainability: the retry constants (5_000, 1_000, 100) are duplicated and embedded in the loop condition and body. Consider extracting them into named constants (e.g., totalRetryMilliseconds / perAttemptTimeoutMilliseconds / retryDelayMilliseconds) to make the intent and future tuning clearer.

Suggested change

- bool connected = false;
- Stopwatch connectSw = Stopwatch.StartNew();
- while (!connected && connectSw.ElapsedMilliseconds < 5_000)
- {
-     connected = TryConnectToServer(1_000);
-     if (!connected && connectSw.ElapsedMilliseconds < 5_000)
-     {
-         CommunicationsUtilities.Trace("Failed to connect to idle server, will retry...");
-         Thread.Sleep(100);
+ const int totalRetryMilliseconds = 5_000;
+ const int perAttemptTimeoutMilliseconds = 1_000;
+ const int retryDelayMilliseconds = 100;
+ bool connected = false;
+ Stopwatch connectSw = Stopwatch.StartNew();
+ while (!connected && connectSw.ElapsedMilliseconds < totalRetryMilliseconds)
+ {
+     connected = TryConnectToServer(perAttemptTimeoutMilliseconds);
+     if (!connected && connectSw.ElapsedMilliseconds < totalRetryMilliseconds)
+     {
+         CommunicationsUtilities.Trace("Failed to connect to idle server, will retry...");
+         Thread.Sleep(retryDelayMilliseconds);

Comment on lines +262 to +267
bool connected = false;
Stopwatch connectSw = Stopwatch.StartNew();
while (!connected && connectSw.ElapsedMilliseconds < 5_000)
{
    connected = TryConnectToServer(1_000);
    if (!connected && connectSw.ElapsedMilliseconds < 5_000)

Copilot AI Apr 17, 2026


TryConnectToServer sets _exitResult.MSBuildClientExitType = UnableToConnect on timeout. With the new retry loop, a transient timeout can leave _exitResult in UnableToConnect even if a later attempt connects and shutdown succeeds, causing TryShutdownServer to return false at the end. Consider keeping connect retries from mutating _exitResult (e.g., add a non-reporting connect attempt), or explicitly reset _exitResult.MSBuildClientExitType back to Success once a connection is established / shutdown completes without error.

Comment on lines +264 to +270
while (!connected && connectSw.ElapsedMilliseconds < 5_000)
{
    connected = TryConnectToServer(1_000);
    if (!connected && connectSw.ElapsedMilliseconds < 5_000)
    {
        CommunicationsUtilities.Trace("Failed to connect to idle server, will retry...");
        Thread.Sleep(100);

Copilot AI Apr 17, 2026


The connect retry loop doesn’t observe cancellationToken (and Thread.Sleep(100) is not cancellable), so ShutdownServer can block for up to ~5 seconds even when cancellation is requested. Please check cancellationToken.IsCancellationRequested in the loop and replace the sleep with a cancellation-aware wait (or throw via ThrowIfCancellationRequested).

Suggested change

- while (!connected && connectSw.ElapsedMilliseconds < 5_000)
- {
-     connected = TryConnectToServer(1_000);
-     if (!connected && connectSw.ElapsedMilliseconds < 5_000)
-     {
-         CommunicationsUtilities.Trace("Failed to connect to idle server, will retry...");
-         Thread.Sleep(100);
+ while (!connected && connectSw.ElapsedMilliseconds < 5_000 && !cancellationToken.IsCancellationRequested)
+ {
+     cancellationToken.ThrowIfCancellationRequested();
+     connected = TryConnectToServer(1_000);
+     if (!connected && connectSw.ElapsedMilliseconds < 5_000)
+     {
+         CommunicationsUtilities.Trace("Failed to connect to idle server, will retry...");
+         if (cancellationToken.WaitHandle.WaitOne(100))
+         {
+             cancellationToken.ThrowIfCancellationRequested();
+         }

if (!connected && connectSw.ElapsedMilliseconds < 5_000)
{
    CommunicationsUtilities.Trace("Failed to connect to idle server, will retry...");
    Thread.Sleep(100);

Copilot AI Apr 17, 2026


CreateNodePipeStream() overwrites _nodeStream / _packetPump without disposing the previous instances. Calling it repeatedly in this retry loop can leak pipe handles/threads across retries. Consider disposing the prior _packetPump and _nodeStream before recreating them (either in CreateNodePipeStream itself or right before calling it here).

Suggested change

  Thread.Sleep(100);
+ _packetPump?.Dispose();
+ _packetPump = null;
+ _nodeStream?.Dispose();
+ _nodeStream = null;

@github-actions
Contributor

Review Summary

The fix correctly addresses the described race condition: the server releases its "busy" mutex before re-establishing its named pipe listener, causing TryConnectToServer to time out and TryShutdownServer to give up. A retry loop is the right approach.

Issues found (by severity)

🟡 Medium — CancellationToken ignored in retry loop
The retry loop can block for up to 5 seconds without checking cancellationToken. See inline comment for fix suggestion.

🟡 Medium — No test coverage
No tests were added for the new retry behavior. This race condition is inherently hard to unit test, but a test that verifies the retry loop's mechanics (e.g., mocking TryConnectToServer to fail N times then succeed) would build confidence, especially since this is in the BackEnd execution engine where regressions are difficult to diagnose.

🔵 Low — Resource leak on retry (CreateNodePipeStream)
Each retry creates a new NamedPipeClientStream without disposing the old one. Pre-existing issue but amplified by the outer retry. See inline comment.

🔵 Low — Stale _exitResult between retries
TryConnectToServer sets UnableToConnect on failure; successful retry doesn't clear it. Unlikely to cause issues today but fragile. See inline comment.
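The mechanical retry test mentioned above could be sketched roughly like this, assuming the connect attempt were first extracted behind an injectable delegate (the `TryConnectWithRetry` helper and `tryConnect` parameter are hypothetical; MSBuildClient has no such seam today). The test style follows the Shouldly assertions already used elsewhere in the repo:

```
// Hypothetical helper: the retry loop with the connect attempt injected,
// so a test can simulate failures without a real named pipe.
internal static bool TryConnectWithRetry(Func<int, bool> tryConnect, int totalMs, int attemptMs, int delayMs)
{
    bool connected = false;
    var sw = System.Diagnostics.Stopwatch.StartNew();
    while (!connected && sw.ElapsedMilliseconds < totalMs)
    {
        connected = tryConnect(attemptMs);
        if (!connected && sw.ElapsedMilliseconds < totalMs)
        {
            System.Threading.Thread.Sleep(delayMs);
        }
    }
    return connected;
}

[Fact]
public void RetriesUntilConnectSucceeds()
{
    int attempts = 0;

    // Fail twice, then succeed — simulates the pipe-recycling window
    // closing while the client is retrying.
    bool connected = TryConnectWithRetry(
        _ => ++attempts >= 3,
        totalMs: 5_000,
        attemptMs: 1_000,
        delayMs: 10);

    connected.ShouldBeTrue();
    attempts.ShouldBe(3);
}
```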

Design observations

Retry layering is reasonable but worth documenting. TryConnectToServer already retries internally for non-Timeout handshake failures (wrong pipe owner, version mismatch, etc.). The new outer retry specifically handles HandshakeStatus.Timeout — the case where no pipe exists yet. This two-level retry is correct, but the interplay is subtle. A short code comment in TryConnectToServer noting that the Timeout path is intentionally not retried internally (because the caller may want control) would help future readers.

Happy-path overhead is negligible. When the first TryConnectToServer(1_000) succeeds, the only added cost is a Stopwatch allocation and one boolean check. No regression concern.

No ChangeWave needed. This is a reliability fix to an existing code path with no user-visible behavioral change (shutdown either succeeds or fails; the timing change is internal).

Note

🔒 Integrity filter blocked 2 items

The following items were blocked because they don't meet the GitHub integrity level.

To allow these resources, lower min-integrity in your GitHub frontmatter:

tools:
  github:
    min-integrity: approved  # merged | approved | unapproved | none

Generated by Expert Code Review (on open) for issue #13559 · ● 1.4M

Contributor

@github-actions bot left a comment


The fix correctly addresses the pipe-recycling race condition. The retry approach is sound and the layering with TryConnectToServer's internal retry is reasonable.

Two items worth addressing before merge:

  1. CancellationToken should be checked in the retry loop to avoid unresponsive hangs during Ctrl+C.
  2. Resource disposal: CreateNodePipeStream() should dispose the previous _nodeStream to avoid leaking pipe handles on retry.

The stale _exitResult and constant extraction are nice-to-haves. Missing tests are understandable given the race-condition nature of the bug, but a mechanical retry test would add confidence.
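Taken together, the four inline suggestions would shape the retry loop roughly as follows. This is a sketch of the combined fixes, not the committed code; member and type names (`_packetPump`, `_nodeStream`, `_exitResult`, `MSBuildClientExitType`) are taken from the diff excerpts and review comments above:

```
const int totalRetryMilliseconds = 5_000;
const int perAttemptTimeoutMilliseconds = 1_000;
const int retryDelayMilliseconds = 100;

bool connected = false;
Stopwatch connectSw = Stopwatch.StartNew();
while (!connected && connectSw.ElapsedMilliseconds < totalRetryMilliseconds)
{
    cancellationToken.ThrowIfCancellationRequested();
    connected = TryConnectToServer(perAttemptTimeoutMilliseconds);
    if (!connected && connectSw.ElapsedMilliseconds < totalRetryMilliseconds)
    {
        CommunicationsUtilities.Trace("Failed to connect to idle server, will retry...");

        // Cancellation-aware delay instead of a bare Thread.Sleep.
        if (cancellationToken.WaitHandle.WaitOne(retryDelayMilliseconds))
        {
            cancellationToken.ThrowIfCancellationRequested();
        }

        // Dispose the stale pipe and pump before recreating them,
        // so repeated retries don't leak handles or threads.
        _packetPump?.Dispose();
        _packetPump = null;
        _nodeStream?.Dispose();
        _nodeStream = null;
        CreateNodePipeStream();
    }
}

if (connected)
{
    // Clear any stale UnableToConnect left behind by a failed earlier attempt.
    _exitResult.MSBuildClientExitType = MSBuildClientExitType.Success;
}
```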

Note

🔒 Integrity filter blocked 2 items

The following items were blocked because they don't meet the GitHub integrity level.

  • #13559 pull_request_read: has lower integrity than agent requires. The agent cannot read data with integrity below "approved".

To allow these resources, lower min-integrity in your GitHub frontmatter:

tools:
  github:
    min-integrity: approved  # merged | approved | unapproved | none

Generated by Expert Code Review (on open) for issue #13559 · ● 1.4M

Comment on lines +264 to +273
while (!connected && connectSw.ElapsedMilliseconds < 5_000)
{
    connected = TryConnectToServer(1_000);
    if (!connected && connectSw.ElapsedMilliseconds < 5_000)
    {
        CommunicationsUtilities.Trace("Failed to connect to idle server, will retry...");
        Thread.Sleep(100);
        CreateNodePipeStream();
    }
}
Contributor


Bug: CancellationToken is ignored in the retry loop

TryShutdownServer accepts a CancellationToken cancellationToken parameter, and the existing code passes it to ReadPacketsLoop further down. However, this new retry loop can block for up to 5 seconds without ever checking for cancellation. If a user presses Ctrl+C during shutdown, the process will hang unresponsively until all retries are exhausted.

Suggested fix — check cancellation in the while condition:

while (!connected && connectSw.ElapsedMilliseconds < 5_000 && !cancellationToken.IsCancellationRequested)

And also in the inner if:

if (!connected && connectSw.ElapsedMilliseconds < 5_000 && !cancellationToken.IsCancellationRequested)

Alternatively, replace Thread.Sleep(100) with cancellationToken.WaitHandle.WaitOne(100) to make the sleep itself cancellable.

Stopwatch connectSw = Stopwatch.StartNew();
while (!connected && connectSw.ElapsedMilliseconds < 5_000)
{
    connected = TryConnectToServer(1_000);
Contributor


Nit: _exitResult side-effect on retry

When TryConnectToServer returns false, it sets _exitResult.MSBuildClientExitType = MSBuildClientExitType.UnableToConnect (line 643). If a subsequent retry succeeds, TryConnectToServer returns true without clearing this field. The stale UnableToConnect value persists until ReadPacketsLoop eventually overwrites it.

In practice this is probably harmless because TrySendShutdownCommand / ReadPacketsLoop will set the final exit type. But it's fragile — any future code that checks _exitResult between the connection and the read loop would see the wrong value. Worth a defensive reset after the loop:

if (connected)
{
    _exitResult.MSBuildClientExitType = MSBuildClientExitType.Success; // Clear stale retry state
}

Low severity — mentioning for robustness, not blocking.

{
    CommunicationsUtilities.Trace("Failed to connect to idle server, will retry...");
    Thread.Sleep(100);
    CreateNodePipeStream();
Contributor


Minor: CreateNodePipeStream() does not dispose the previous stream

CreateNodePipeStream() overwrites _nodeStream and _packetPump without disposing the previous instances. Each failed retry leaks a NamedPipeClientStream. The same issue exists inside TryConnectToServer's internal retry (line 638), so this is pre-existing, but the new outer retry loop amplifies the number of leaked streams from ~1 to potentially ~4.

Consider disposing the old stream before creating a new one:

_nodeStream?.Dispose();
CreateNodePipeStream();

Or better yet, fix CreateNodePipeStream() itself to dispose the old stream if non-null.

// the transitional window.
bool connected = false;
Stopwatch connectSw = Stopwatch.StartNew();
while (!connected && connectSw.ElapsedMilliseconds < 5_000)
Contributor


Design: Consider extracting the 5_000 constant

The 5_000 ms timeout appears twice (lines 264 and 267). Worth extracting it to a named constant for clarity and to keep the two values from diverging:

const int ShutdownConnectTimeoutMs = 5_000;

Also, this is a 5× increase from the previous 1-second timeout. For the shutdown path, this means dotnet build -nodeReuse:false or explicit shutdown calls could take up to 5 seconds when the server is genuinely unreachable (e.g., crashed). The previous behavior would fail after ~1 second. This is a reasonable trade-off for the race condition fix, but it should be documented in the PR description so reviewers can make an informed call.
