Bug: 100% CPU Starvation Loop on File Descriptor Exhaustion (Linux) #2335

@radiken

Description

When nim-libp2p reaches the operating system's file descriptor limit (e.g., the value set by ulimit -n), the accept loop in the TCP transport layer falls into an infinite, non-yielding busy loop. The node immediately pegs the CPU at 100%, which starves the chronos async event loop and effectively deadlocks the entire application.

This is a critical stability issue for production nodes (which primarily run on Linux) under heavy connection load or targeted connection-spam attacks.

Root Cause & Platform Differences

The bug stems from how the async engine (chronos) interacts with the underlying OS event polling mechanisms (epoll on Linux vs. kqueue on macOS) when the accept() system call fails with EMFILE (Too many open files).

  1. The TCP Transport accept Loop (tcptransport.nim)
    When tcptransport.accept() encounters TransportTooManyError from Chronos, it catches the exception, logs a debug message, and returns nil.

  2. The Switch accept Loop (switch.nim)
    When the switch receives a nil connection from the transport layer, it simply calls continue to instantly retry the accept() call without yielding or backing off.

  3. The Platform Difference (Why it hides on macOS but kills Linux nodes)

    • On Linux (epoll): Chronos uses level-triggered epoll (EPOLLIN). Because there are still pending TCP connections in the kernel's listen backlog that we couldn't accept, epoll instantly wakes up the event loop again. The loop calls accept(), hits EMFILE, returns nil, and loops again immediately. This loops thousands of times a second, consuming 100% CPU.
    • On macOS (kqueue): Chronos uses kqueue. When accept() fails, Chronos removes and re-adds the socket reader. Because no new state change has occurred on the listen socket since it was re-added, kqueue does not wake the event loop. The loop naturally pauses until a new connection arrives, masking the 100% CPU bug on local Mac development machines.
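The interaction between steps 1 and 2 can be sketched in isolation. This is a minimal simulation, not the nim-libp2p code: FakeConn, TooManyFilesError, rawAccept, transportAccept and switchLoop are hypothetical stand-ins, the failing accept is faked rather than driven by epoll, and the loop is bounded by maxSpins only so that the demo terminates.

```nim
import chronos

type
  FakeConn = ref object                         # hypothetical stand-in for a connection
  TooManyFilesError = object of CatchableError  # stand-in for TransportTooManyError

proc rawAccept(): Future[FakeConn] {.async.} =
  ## Simulates an accept() that fails with EMFILE on every call.
  raise newException(TooManyFilesError, "Too many open files")

proc transportAccept(): Future[FakeConn] {.async.} =
  ## Step 1: swallow the error and return nil, with no delay.
  try:
    return await rawAccept()
  except TooManyFilesError:
    return nil

proc switchLoop(maxSpins: int): Future[int] {.async.} =
  ## Step 2: retry immediately on nil. Bounded here so the demo terminates;
  ## the loop described above has no such bound.
  var spins = 0
  while spins < maxSpins:
    let conn = await transportAccept()
    if conn.isNil:
      inc spins
      continue
    break
  return spins

# Every iteration fails and is retried at once, with no pause in between.
echo "failed accepts without any pause: ", waitFor switchLoop(100_000)
# prints: failed accepts without any pause: 100000
```

Nothing in either proc waits for time to pass or for descriptors to be released, so the retry rate is limited only by how fast the loop can spin.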

Steps to Reproduce (on Linux)

  1. Check out this commit: https://github.com/vacp2p/nim-libp2p/tree/2a1411ddc06ccabf33b93a914ddd09a43953a7f5
  2. Run docker build -t reproduce-emfile -f Dockerfile.emfile . && docker run --rm reproduce-emfile
  3. The same error is logged continuously while its counter keeps climbing, for example: Server accept error (x1830000): [EMFILE] Too many open files in the process

Proposed Solution

Introduce an explicit async backoff mechanism in tcptransport.nim when TransportTooManyError is caught. This yields control back to the event loop, allowing the application to process existing connections, close old ones, and eventually free up file descriptors.

File: libp2p/transports/tcptransport.nim

    except TransportTooManyError as exc:
      debug "Too many files opened", description = exc.msg
+     await sleepAsync(100.milliseconds)
      return nil

With the 100ms sleepAsync in place, the failing path yields to the event loop on every EMFILE and each accept loop retries at most about ten times per second, so other events keep being processed and the CPU starvation no longer occurs.
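The effect of the backoff can be checked with a small simulation. This is a sketch under stated assumptions, not the patched transport: FakeConn, FakeListener, acceptOrNil, heartbeat and acceptLoop are hypothetical names, descriptor exhaustion is faked with a counter, and the heartbeat task stands in for any other work that depends on the event loop running.

```nim
import chronos

type
  FakeConn = ref object      # hypothetical stand-in for a libp2p connection
  FakeListener = ref object  # hypothetical listener that fails a fixed number of times
    failuresLeft: int

proc acceptOrNil(l: FakeListener): Future[FakeConn] {.async.} =
  ## Stand-in for the patched transport accept: on simulated descriptor
  ## exhaustion it sleeps before returning nil.
  if l.failuresLeft > 0:
    dec l.failuresLeft
    await sleepAsync(100.milliseconds)  # the backoff from the proposed fix
    return nil
  return FakeConn()

proc heartbeat(ticks: ref int) {.async.} =
  ## Unrelated work that only advances when the event loop gets to run.
  while true:
    await sleepAsync(10.milliseconds)
    inc ticks[]

proc acceptLoop(l: FakeListener): Future[int] {.async.} =
  ## Same shape as the switch loop: retry on nil, return the attempt count.
  var attempts = 0
  while true:
    inc attempts
    let conn = await l.acceptOrNil()
    if not conn.isNil:
      return attempts

let ticks = new int
let hb = heartbeat(ticks)
let attempts = waitFor acceptLoop(FakeListener(failuresLeft: 3))
waitFor hb.cancelAndWait()  # stop the background task before exiting

echo "accepted after ", attempts, " attempts"  # three failures, then one success
echo "heartbeat ticks during backoff: ", ticks[]
```

The heartbeat counter is expected to be non-zero, because each sleepAsync in the failing path hands control back to the event loop. The fixed 100ms delay mirrors the proposal above; whether a fixed or a growing delay is preferable is a separate design question that this sketch does not address.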
