Bug: 100% CPU Starvation Loop on File Descriptor Exhaustion (Linux) #2335

## Description

When `nim-libp2p` reaches the operating system's file descriptor limit (e.g., `ulimit -n`), the `accept` loop in the TCP transport layer falls into an infinite, non-yielding busy loop. This causes the node to instantly peg the CPU at 100%, starving the `chronos` async event loop and effectively deadlocking the entire application.

This is a critical stability issue for production nodes (which primarily run on Linux) under heavy connection load or targeted connection-spam attacks.

## Root Cause & Platform Differences

The bug stems from how the async engine (`chronos`) interacts with the underlying OS event polling mechanisms (`epoll` on Linux vs. `kqueue` on macOS) when the `accept()` system call fails with `EMFILE` (Too many open files).

1. **The TCP Transport `accept` Loop (`tcptransport.nim`)**
   When `tcptransport.accept()` encounters `TransportTooManyError` from Chronos, it catches the exception, logs a `debug` message, and returns `nil`.
   
2. **The Switch `accept` Loop (`switch.nim`)**
   When the `switch` receives a `nil` connection from the transport layer, it calls `continue` and retries `accept()` immediately, with no yield or backoff.

3. **The Platform Difference (Why it hides on macOS but kills Linux nodes)**
   - **On Linux (`epoll`):** Chronos uses level-triggered `epoll` (`EPOLLIN`). Because the kernel's listen backlog still holds pending TCP connections that could not be accepted, `epoll` reports the listen socket as readable again right away and wakes the event loop. The loop calls `accept()`, hits `EMFILE`, returns `nil`, and retries immediately. This cycle repeats thousands of times per second and consumes 100% CPU.
   - **On macOS (`kqueue`):** Chronos uses `kqueue`. When `accept()` fails, Chronos removes and re-adds the socket reader. Because no *new* state change has occurred on the listen socket since it was re-added, `kqueue` does not wake the event loop. The loop naturally pauses until a new connection arrives, masking the 100% CPU bug on local Mac development machines.
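
The interaction in steps 1 and 2 can be sketched as a small simulation. This is not the actual `nim-libp2p` code: `FakeConn`, `FakeTooManyError`, `Stats`, `rawAccept`, `transportAccept`, and `switchAcceptLoop` are hypothetical names, and `rawAccept` fails on every call to stand in for `accept()` returning `EMFILE`. It assumes the `chronos` package is installed.

```nim
import chronos

type
  FakeConn = ref object                        # hypothetical stand-in for a connection
  FakeTooManyError = object of CatchableError  # stand-in for TransportTooManyError
  Stats = ref object
    attempts: int                              # number of failed accept attempts

proc rawAccept(stats: Stats): Future[FakeConn] {.async.} =
  # Simulates accept() failing with EMFILE on every call.
  inc stats.attempts
  raise newException(FakeTooManyError, "Too many open files")

proc transportAccept(stats: Stats): Future[FakeConn] {.async.} =
  # Mirrors step 1: catch the error and return nil.
  try:
    return await rawAccept(stats)
  except FakeTooManyError:
    return nil

proc switchAcceptLoop(stats: Stats, maxIterations: int) {.async.} =
  # Mirrors step 2: on nil, retry immediately with no backoff.
  # maxIterations exists only so that the simulation terminates.
  for _ in 0 ..< maxIterations:
    let conn = await transportAccept(stats)
    if conn.isNil:
      continue

when isMainModule:
  let stats = Stats()
  waitFor switchAcceptLoop(stats, 1_000)
  echo "failed accept attempts: ", stats.attempts
```

Every iteration in this sketch ends in a failed accept, and nothing in the loop waits on a timer or on I/O. In the real transport, the level-triggered `epoll` readiness plays the role of the always-failing `rawAccept`: each wake-up leads straight to another `EMFILE`.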

## Steps to Reproduce (on Linux)

1. Check out this commit: https://github.com/vacp2p/nim-libp2p/tree/2a1411ddc06ccabf33b93a914ddd09a43953a7f5
2. Run `docker build -t reproduce-emfile -f Dockerfile.emfile . && docker run --rm reproduce-emfile`
3. The output fills with repeated errors such as `Server accept error (x1830000): [EMFILE] Too many open files in the process`

## Proposed Solution

Introduce an explicit async backoff mechanism in `tcptransport.nim` when `TransportTooManyError` is caught. This yields control back to the event loop, allowing the application to process existing connections, close old ones, and eventually free up file descriptors.

**File:** `libp2p/transports/tcptransport.nim`
```diff
    except TransportTooManyError as exc:
      debug "Too many files opened", description = exc.msg
+     await sleepAsync(100.milliseconds)
      return nil
```

With a 100 ms `sleepAsync`, the accept path suspends on a timer after each `EMFILE` failure instead of retrying immediately. This gives the event loop time to process other events between retries and should eliminate the CPU starvation.
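
The effect of the backoff can be illustrated with a minimal simulation, again assuming the `chronos` package is installed. The names `Stats`, `failingAccept`, `acceptWithBackoff`, `heartbeat`, and `simulate` are hypothetical and do not exist in `nim-libp2p`. The `heartbeat` task stands in for unrelated work (existing connections, timers) that needs the event loop to make progress.

```nim
import chronos

type
  FakeTooManyError = object of CatchableError  # stand-in for TransportTooManyError
  Stats = ref object
    attempts: int   # failed accept attempts
    ticks: int      # progress made by unrelated work

proc failingAccept(stats: Stats) {.async.} =
  # Simulates accept() failing with EMFILE on every call.
  inc stats.attempts
  raise newException(FakeTooManyError, "Too many open files")

proc acceptWithBackoff(stats: Stats, delay: Duration): Future[bool] {.async.} =
  # Returns false when no connection was accepted.
  try:
    await failingAccept(stats)
    return true
  except FakeTooManyError:
    # The backoff: suspend on a timer so other tasks can run.
    await sleepAsync(delay)
    return false

proc heartbeat(stats: Stats) {.async.} =
  # Unrelated work that only progresses when the event loop runs.
  while true:
    await sleepAsync(1.milliseconds)
    inc stats.ticks

proc simulate(retries: int, delay: Duration): Future[Stats] {.async.} =
  let stats = Stats()
  let hb = heartbeat(stats)
  for _ in 0 ..< retries:
    discard await acceptWithBackoff(stats, delay)
  await hb.cancelAndWait()   # stop the background task before returning
  return stats

when isMainModule:
  let stats = waitFor simulate(5, 10.milliseconds)
  echo "failed accepts: ", stats.attempts, ", heartbeat ticks: ", stats.ticks
```

In this sketch the number of failed accepts is bounded by the delay rather than by CPU speed, and the heartbeat keeps ticking while the accept path sleeps. The delay used here is shortened only to keep the simulation fast; the proposal above uses 100 ms.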

