
lib/cgroup_bw: cap task-stall latency under cpu.max #3554

Open
multics69 wants to merge 10 commits into sched-ext:main from multics69:cpu-max-task-stall-v7

Conversation

@multics69 (Contributor) commented May 6, 2026

This series caps the wall-clock latency a task can experience when its
cgroup runs out of cpu.max bandwidth. It was driven by stalls
observed while stress-testing scx_lavd under heavy cpu.max throttling
with concurrent cgroup churn, SIGSTOP/SIGCONT cycles, and deep cgroup
hierarchies. The first six commits are diagnostics, robustness, and
refactoring prep work; the last three carry the latency mitigation.

Bounding task-stall latency

Tasks already running when a cgroup throttles hold the CPU for the
full base time slice, blocking the scheduler from rechecking the
throttle for that long. scx_cgroup_bw_pressure() exposes a 1024-
scale pressure hint that combines a hyperbolic budget term with a
linear BTQ-backlog term; scx_lavd consumes it in calc_time_slice()
to scale slices down by up to 16x under heavy pressure.
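
A minimal sketch of the consumer side; the argument list of
scx_cgroup_bw_pressure() and the helper name below are illustrative,
not scx_lavd's exact code:

  static u64 scale_slice_by_pressure(struct task_struct *p, u64 base_slice_ns)
  {
      /* 1024 means "no pressure"; the hint is clamped to [1024, 16384]. */
      u64 pressure = scx_cgroup_bw_pressure(p);

      /* Worst case cuts the slice to base_slice_ns / 16. */
      return base_slice_ns * 1024 / pressure;
  }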

A throttled task's BTQ position is its scheduler vtime alone, so a
task with a large vtime can starve indefinitely behind a steady
stream of smaller-vtime arrivals. Blend the upper 32 bits of the
wall-clock nanosecond timestamp with the lower 32 bits of vtime so
that after at most one ~4.29 s epoch, a waiting task orders ahead of
every newer arrival regardless of vtime.
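
A sketch of the blended key, assuming the even 32/32 mask split
described above (the mask names follow the library; the values are
implied by the split):

  #define CBW_BTQ_VTIME_UPPER_MASK 0xFFFFFFFF00000000ULL
  #define CBW_BTQ_VTIME_LOWER_MASK 0x00000000FFFFFFFFULL

  static u64 cbw_btq_key(u64 vtime)
  {
      /* Upper 32 bits: wall-clock epoch (one epoch ~= 4.29 s).
       * Lower 32 bits: scheduler vtime for intra-epoch ordering. */
      return (scx_bpf_now() & CBW_BTQ_VTIME_UPPER_MASK) |
             (vtime & CBW_BTQ_VTIME_LOWER_MASK);
  }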

The kernel sometimes wakes a task specifically so it can observe a
pending kernel-mediated transition (group SIGSTOP, ptrace trap,
cgroup-v2 freeze, group exit). Throttling such a task into the BTQ
delays the user-visible operation by the throttle window -- SIGSTOP
appears delayed for seconds, cgroup.freeze appears to hang. Extend
the existing PF_EXITING bypass in scx_cgroup_bw_throttled() to cover
SIGNAL_GROUP_EXIT and JOBCTL_PENDING_MASK | JOBCTL_TRAP_FREEZE.
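
A hedged sketch of the extended check; the helper name is illustrative,
while the flags and the READ_ONCE()/unlikely() usage follow the
matching commit below:

  static bool cbw_bypass_throttle(struct task_struct *p)
  {
      if (unlikely(p->flags & PF_EXITING))
          return true;
      if (unlikely(READ_ONCE(p->signal->flags) & SIGNAL_GROUP_EXIT))
          return true;
      if (unlikely(READ_ONCE(p->jobctl) &
                   (JOBCTL_PENDING_MASK | JOBCTL_TRAP_FREEZE)))
          return true;
      return false;
  }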

  • [7/10] ec4a516 lib/cgroup_bw, scx_lavd: add scx_cgroup_bw_pressure() API
  • [8/10] d680de0 lib/cgroup_bw: blend wall-clock time into BTQ vtime to bound throttle delay
  • [9/10] e27e6e5 lib/cgroup_bw: bypass throttling for transient kernel-mediated task states

Supporting changes

cpu_max_bench.py quantifies cpu.max overhead under varying
depth/quota/load and plots eevdf vs scx_lavd side-by-side. The
verbose cbw_dbg* tracing flooded trace_pipe and slowed every hot
path, so it is dropped; cbw_err / cbw_warn still cover the
actionable cases.

  • [1/10] 49364e4 scripts: add cpu_max_bench.py to measure cpu.max overhead
  • [2/10] 9059248 lib/cgroup_bw: drop verbose cbw_dbg* tracing

The deferred-BTQ destruction ring at 256 slots could wrap under
heavy cgroup churn and trigger a UAF; resized to
CBW_NR_CGRP_LLC_MAX * 2. CBW_CGRP_TREE_HEIGHT_MAX was 32 with a
misleading "matches the kernel" comment (kernel default is INT_MAX);
bumped to 64 and scx_cgroup_bw_init() now rejects deeper trees with
-ENOMEM instead of indexing tree_levels[] out of bounds.

  • [3/10] 6ed19e8 lib/cgroup_bw: size deferred-BTQ ring to CBW_NR_CGRP_LLC_MAX * 2
  • [4/10] 041da9c lib/cgroup_bw: raise cgroup tree height cap to 64 and reject deeper trees

scx_cgroup_bw_dump() was hard-coded to bpf_printk(); add a mode
parameter so ops.dump*() callbacks can route output to the SCX dump
buffer. Factor the open-coded taskc-cached cgx/llcx lookup pattern
into helpers so the new pressure API picks up the caching for free.

  • [5/10] ad0767f lib/cgroup_bw, scx_lavd: route scx_cgroup_bw_dump() to bpf_printk or scx_bpf_dump
  • [6/10] e9baee0 lib/cgroup_bw: factor taskc-cached cgx/llcx accessors into helpers

arena_spin_lock() -ETIMEDOUT leaves the MCS chain with stale ->next
links; a running scheduler must tear down rather than retry against
an inconsistent queue, but a fresh load reinitialises the MCS state.
Surface the failure as scx_bpf_exit(SCX_ECODE_ACT_RESTART, ...) so
user-space orchestration can respawn the scheduler automatically.

  • [10/10] a87edcf lib/atq: request scheduler restart on arena_spin_lock -ETIMEDOUT

Signed-off-by: Changwoo Min changwoo@igalia.com

@multics69 (Contributor, Author) commented:

PR 3554 includes the changes from PR 3552 (the ATQ -ETIMEDOUT PR) for ease of testing. Once PR 3552 lands, I will rebase this one properly.

@multics69 force-pushed the cpu-max-task-stall-v7 branch 2 times, most recently from 49aac20 to c5b6f4f on May 6, 2026 22:47
Outdated review thread on scheds/rust/scx_lavd/src/bpf/main.bpf.c
@multics69 force-pushed the cpu-max-task-stall-v7 branch 6 times, most recently from b8f63d0 to a84c653 on May 9, 2026 03:08
@multics69 (Contributor, Author) commented:

Rebased onto master.

@bboymimi (Contributor) left a comment:

LGTM, thanks!

multics69 added 10 commits May 13, 2026 23:21
scripts: add cpu_max_bench.py to measure cpu.max overhead

Introduce a benchmark script that quantifies the kernel-mode CPU overhead
imposed by cgroup cpu.max bandwidth enforcement.

The script runs stress-ng --cpu inside a configurable-depth cgroup
hierarchy with cpu.max set at every level, and captures system-wide
perf stat counters (cycles, cycles:k, cache-misses, stalled-cycles-backend,
instructions) over the full duration.  The ratio (cycles:k / cycles) *
nproc is reported as the overhead expressed in equivalent CPUs.

Key features:
- Configurable cgroup depth (0 = system root, no per-run cgroup), quota
  (percent of nproc), load factor, duration, and scheduler
  (eevdf or scx_lavd).
- Per-second CPU utilisation sampled from cgroup cpu.stat and rendered
  as PNG/SVG plots, with a distinct marker glyph per scheduler so
  overlapping traces stay distinguishable.
- Batch mode via an INI config file with -S/--select fnmatch filter;
  configurations sharing the same (depth, quota, load_factor) are
  grouped in the report.
- Markdown report (report.md) with GFM tables and embedded CPU
  utilisation graphs.
- Dependency check at startup with install instructions for Ubuntu,
  Arch Linux, and Fedora / Amazon Linux.

Cleanup discipline:
- teardown() writes 1 to leaf/cgroup.kill and waits for cgroup.events
  populated=0 before rmdir, so leftover stress-ng workers cannot make
  rmdir fail with EBUSY and leak cgroups.  rmdir failures are logged
  loudly instead of being swallowed.
- A SIGTERM / SIGINT handler and an atexit hook run a best-effort
  teardown of any cgroups currently set up, covering hard exits where
  the normal try/finally would not get a chance to run.
- bench_id includes a per-process monotonic counter alongside PID and
  millisecond timestamp, eliminating collision risk between back-to-back
  runs.

Also add cpu_max_bench.ini as an example configuration covering
root cgroup, baseline, varying load factors, varying cgroup depths,
and 50%-quota runs, for both eevdf and scx_lavd.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
lib/cgroup_bw: drop verbose cbw_dbg* tracing

The cbw_dbg / cbw_dbg_cgrp / dbg_cgx prints were useful at the very
early development stage but no longer carry their weight: they fire on
every cgroup init/exit/move, throttle/consume, replenish tick, and BTQ
pop, flooding trace_pipe and slowing the hot paths under any non-
trivial workload.

cbw_err / cbw_warn still cover the actionable cases.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
lib/cgroup_bw: size deferred-BTQ ring to CBW_NR_CGRP_LLC_MAX * 2

schedule_atq_destroy() defers BTQ destruction through a lock-free
circular ring of CBW_DEFERRED_BTQ_SIZE slots.  When tail wraps around
to an occupied slot, the incumbent BTQ is evicted and destroyed
immediately.  A use-after-free occurs if a reader still holds a pointer
to the evicted BTQ — the window between READ_ONCE(llcx->btq) in
cbw_drain_btq_batch() and the arena_spin_lock() inside scx_atq_pop().

The original size of 256 is too small: with heavy cgroup churn (e.g.
during scheduler teardown), more than 256 BTQs can be queued before a
reader has released its pointer, wrapping the ring and triggering the
use-after-free observed as repeated "freeing nonexistent idx" errors.

Since there are at most CBW_NR_CGRP_LLC_MAX llcx objects, at most that
many BTQs can ever be live at once.  Setting the ring size to
CBW_NR_CGRP_LLC_MAX * 2 means the ring cannot wrap before the entire
BTQ pool has turned over twice.  Any reader that snapshots a pointer
will find its slot still intact — the window of vulnerability is a few
instructions, far shorter than 2 * CBW_NR_CGRP_LLC_MAX concurrent BTQ
destructions.
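
A simplified model of the ring described above; CBW_DEFERRED_BTQ_SIZE
and schedule_atq_destroy() come from this commit, but the array, the
tail counter, and destroy_btq() are illustrative rather than the
library's actual code:

  #define CBW_DEFERRED_BTQ_SIZE (CBW_NR_CGRP_LLC_MAX * 2)

  extern void destroy_btq(void *btq);   /* stands in for the real destructor */

  static void *deferred_btq[CBW_DEFERRED_BTQ_SIZE];
  static u64 deferred_btq_tail;

  static void schedule_atq_destroy(void *btq)
  {
      u64 slot = __sync_fetch_and_add(&deferred_btq_tail, 1) %
                 CBW_DEFERRED_BTQ_SIZE;

      /* Claim the slot; if it was still occupied, the incumbent is a
       * full ring lap old and can be destroyed immediately. */
      void *victim = __sync_lock_test_and_set(&deferred_btq[slot], btq);

      if (victim)
          destroy_btq(victim);
  }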

Signed-off-by: Changwoo Min <changwoo@igalia.com>
lib/cgroup_bw: raise cgroup tree height cap to 64 and reject deeper trees

CBW_CGRP_TREE_HEIGHT_MAX bounds the per-CPU tree_levels[] array used by
cbw_update_runtime_total_sloppy() to walk the cgroup hierarchy.  It was
set to 32 with a comment claiming this matched the kernel's
CGROUPS_DEPTH_MAX, but the kernel default is actually much larger
(cgroup_max_depth = INT_MAX), so the comment was misleading and the cap
was tighter than necessary.

Two changes:

 - Bump the cap from 32 to 64 to give more headroom for genuinely deep
   hierarchies seen in container-on-container setups.

 - Reject cgroups whose level exceeds the cap at scx_cgroup_bw_init()
   instead of silently proceeding and indexing tree_levels[] out of
   bounds.  Returning -ENOMEM from init makes the failure explicit and
   safe.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
lib/cgroup_bw, scx_lavd: route scx_cgroup_bw_dump() to bpf_printk or scx_bpf_dump

bpf_printk() was hard-coded, which is wrong when called from an
ops.dump*() callback, where output should land in the SCX dump buffer
instead of trace_pipe.

Add a mode parameter:

  enum scx_cgroup_bw_dump_mode {
      SCX_CGROUP_BW_DUMP_PRINTK = 0,
      SCX_CGROUP_BW_DUMP_SCX    = 1,
  };

A cbw_dump_line(mode, fmt, ...) macro dispatches to the chosen helper.
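
An illustrative expansion of the dispatch; argument handling in the
real macro may differ:

  #define cbw_dump_line(mode, fmt, ...) do {                        \
          if ((mode) == SCX_CGROUP_BW_DUMP_SCX)                     \
                  scx_bpf_dump(fmt, ##__VA_ARGS__);                 \
          else                                                      \
                  bpf_printk(fmt, ##__VA_ARGS__);                   \
  } while (0)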

Also normalise a hard-coded cgroup id of 1 to the runtime-detected
ROOT_CGID so namespaced callers find the right root.

scx_lavd uses PRINTK in lavd_dump to avoid flooding the dump buffer
with the full hierarchy, and SCX mode in lavd_dump_task to surface the
offending cgroup state next to the throttled task.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
lib/cgroup_bw: factor taskc-cached cgx/llcx accessors into helpers

The same cache-lookup pattern was open-coded in cbw_cgroup_bw_throttled(),
scx_cgroup_bw_consume(), and scx_cgroup_bw_pressure(); the matching
invalidation pattern in cbw_drain_atq_to_root(), cbw_free_llc_ctx(), and
scx_cgroup_bw_move().

Add three static __always_inline helpers:

  cbw_taskc_get_cgx_raw(taskc, cgrp_id)
  cbw_taskc_get_llcx_raw(taskc, cgrp_id, llc_id)
  cbw_taskc_invalidate(taskc)

The getters accept a possibly-NULL taskc and return 0 on miss so each
caller keeps its own miss policy.  cbw_taskc_invalidate() centralises
the __sync_lock_test_and_set workaround for the arena-pointer fields,
letting scx_cgroup_bw_move() drop its local `volatile` qualifier.
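
An illustrative shape of the invalidation helper; the context type and
the field names on taskc are assumptions, not the library's layout:

  static __always_inline void cbw_taskc_invalidate(struct cbw_task_ctx *taskc)
  {
      if (!taskc)
          return;

      /* Atomic exchange instead of a plain store for the arena-pointer
       * fields, per the workaround mentioned above. */
      __sync_lock_test_and_set(&taskc->cgx, 0);
      __sync_lock_test_and_set(&taskc->llcx, 0);
      taskc->cgrp_id = 0;
  }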

No semantic change.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
lib/cgroup_bw, scx_lavd: add scx_cgroup_bw_pressure() API

When a cgroup is throttled, tasks already running hold the CPU for their
full time slice before the scheduler can recheck the throttle state.  This
causes task-stall latency that grows with the configured slice length.

Add scx_cgroup_bw_pressure() to expose a 1024-scale pressure hint that BPF
schedulers can use to shorten time slices proportionally:

  slice = (base_slice * 1024) / pressure

Pressure is computed at each replenishment boundary from two signals that
are combined by addition so that both contribute independently:

  Budget pressure: a hyperbolic curve that rises steeply below 25% of the
  replenished period_budget.  A small budget after replenishment also
  indicates accumulated debt from prior over-consumption, so high pressure
  is correct in that case too.

  Backlog pressure: a linear term proportional to the number of tasks
  queued in the BTQ across all LLC domains.  A growing backlog signals that
  the reenqueue path cannot drain fast enough; shorter slices reduce the
  time any single task monopolises the CPU.

The combined pressure is clamped to [1024, 16384], limiting the maximum
reduction to 1/16 of the base slice.
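
One plausible shape for this calculation; the 1024 scale and the
[1024, 16384] clamp come from this commit, while the coefficients
(256 for the budget term, 64 per queued task) are illustrative
assumptions:

  static u64 cbw_calc_pressure(u64 budget, u64 period_budget, u64 nr_btq_tasks)
  {
      /* Hyperbolic in the remaining budget: 1024 at 25% of
       * period_budget, rising steeply as the budget shrinks. */
      u64 budget_p = (period_budget << 8) / (budget ? budget : 1);

      /* Linear in the BTQ backlog across all LLC domains. */
      u64 backlog_p = nr_btq_tasks * 64;

      u64 p = budget_p + backlog_p;

      return p < 1024 ? 1024 : (p > 16384 ? 16384 : p);
  }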

scx_lavd adopts the new API in calc_time_slice(): pressure is fetched once
per scheduling decision, slice boost is suppressed under any throttle
pressure, and the final slice is scaled by the pressure before being
assigned to the task.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
lib/cgroup_bw: blend wall-clock time into BTQ vtime to bound throttle delay

A throttled task's position in the BTQ is determined by its scheduler
vtime.  If other tasks are continuously enqueued with smaller vtimes,
a task with a large vtime can be delayed arbitrarily long even though
it has been waiting in the queue.

Fix this by blending the wall-clock time into the BTQ key:

    btq_vtime = (scx_bpf_now() & CBW_BTQ_VTIME_UPPER_MASK) |
                (vtime & CBW_BTQ_VTIME_LOWER_MASK)

The upper 32 bits come from the current nanosecond timestamp; the lower
32 bits come from the scheduler-provided vtime.  The 64-bit key is split
evenly so each side contributes 32 bits.  Tasks enqueued within the same
~4-second window (2^32 ns ~= 4.29 s) still compete by their scheduler
vtime, preserving relative fairness.  Once a new wall-clock epoch
begins, earlier-queued tasks take priority regardless of their vtime,
guaranteeing that no task waits more than ~4 seconds in the BTQ due to
vtime ordering alone.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
lib/cgroup_bw: bypass throttling for transient kernel-mediated task states

scx_cgroup_bw_throttled() already bypasses throttling for PF_EXITING
tasks because the BTQ drain's bpf_task_from_pid() returns NULL once the
kernel unhashes an exiting task, losing the task from all runqueues.
Real-world workloads with frequent SIGSTOP/SIGCONT cycles exhibit a
related stall: a task woken specifically so it can observe a pending
group stop is parked in the BTQ, the throttle window elapses, and the
user-visible SIGSTOP appears delayed by seconds.  Cgroup-v2 freeze and
ptrace traps share the same shape -- the kernel-side operation cannot
converge until the scheduler lets the task run briefly.

Extend the bypass to cover both flavours:

  Correctness -- task is leaving SCX before drain can find it:
    PF_EXITING            (already handled)
    SIGNAL_GROUP_EXIT     SIGKILL / exit_group() propagating; narrow
                          window where the group flag is set but
                          PF_EXITING has not yet landed on a sibling.

  Latency -- task wants a short kernel-mediated transition:
    JOBCTL_STOP_PENDING   group SIGSTOP delivery
    JOBCTL_TRAP_STOP      ptrace stop trap
    JOBCTL_TRAP_NOTIFY    ptrace notify trap (seccomp, PTRACE_EVENT_*)
    JOBCTL_TRAP_FREEZE    cgroup-v2 freezer trap

JOBCTL_PENDING_MASK already groups STOP_PENDING with the trap bits;
TRAP_FREEZE is outside the mask and gets its own bit-test.  Quota
impact is negligible: tasks in any of these states consume essentially
no CPU before leaving SCX.

Each branch is marked unlikely() since the steady state is "throttle
normally", and READ_ONCE() is used for p->jobctl and p->signal->flags
because those are written under siglock on a different CPU.

vmlinux.h carries types but not CPP macros, so the SIGNAL_GROUP_EXIT
and JOBCTL_* bit definitions are mirrored from the kernel headers
near the top of cgroup_bw.bpf.c.
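
The mirrored definitions look like the following; the bit values should
be double-checked against the target kernel's <linux/sched/jobctl.h>
and <linux/sched/signal.h>:

  #define SIGNAL_GROUP_EXIT   0x00000004
  #define JOBCTL_STOP_PENDING (1UL << 17)
  #define JOBCTL_TRAP_STOP    (1UL << 19)
  #define JOBCTL_TRAP_NOTIFY  (1UL << 20)
  #define JOBCTL_TRAP_FREEZE  (1UL << 23)
  #define JOBCTL_TRAP_MASK    (JOBCTL_TRAP_STOP | JOBCTL_TRAP_NOTIFY)
  #define JOBCTL_PENDING_MASK (JOBCTL_STOP_PENDING | JOBCTL_TRAP_MASK)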

Signed-off-by: Changwoo Min <changwoo@igalia.com>
lib/atq: request scheduler restart on arena_spin_lock -ETIMEDOUT

arena_spin_lock() -ETIMEDOUT means a bounded spin loop in the slow
path gave up, leaving the MCS chain with stale ->next links.  The
running scheduler must tear down (retrying races against an
inconsistent queue).

Use scx_bpf_exit(SCX_ECODE_ACT_RESTART, ...) instead of
scx_bpf_error() so user-space orchestration can respawn the
scheduler automatically rather than treating it as a bug.
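
A sketch of the failure path; the lock expression and return convention
are assumptions about the call site rather than lib/atq's exact code:

  if (arena_spin_lock(&atq->lock) == -ETIMEDOUT) {
      scx_bpf_exit(SCX_ECODE_ACT_RESTART,
                   "arena_spin_lock() timed out, restarting scheduler");
      return -ETIMEDOUT;
  }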

Signed-off-by: Changwoo Min <changwoo@igalia.com>
@multics69 force-pushed the cpu-max-task-stall-v7 branch from a84c653 to 5daa5f8 on May 13, 2026 14:22
