lib/cgroup_bw: cap task-stall latency under cpu.max #3554
Open
multics69 wants to merge 10 commits
Conversation
multics69 (Contributor, Author):
PR 3554 includes the changes in PR 3552 (ATQ, -ETIMEDOUT PR) for ease of testing. Once PR 3552 lands, I will rebase it properly.
daidavid reviewed on May 6, 2026.
multics69 (Contributor, Author):
Rebased to master.
Introduce a benchmark script that quantifies the kernel-mode CPU overhead
imposed by cgroup cpu.max bandwidth enforcement. The script runs
stress-ng --cpu inside a configurable-depth cgroup hierarchy with cpu.max
set at every level, and captures system-wide perf stat counters (cycles,
cycles:k, cache-misses, stalled-cycles-backend, instructions) over the
full duration. The ratio cycles:k / cycles * nproc is reported as the
overhead expressed in equivalent CPUs.

Key features:
- Configurable cgroup depth (0 = system root, no per-run cgroup), quota
  (percent of nproc), load factor, duration, and scheduler (eevdf or
  scx_lavd).
- Per-second CPU utilisation sampled from cgroup cpu.stat and rendered as
  PNG/SVG plots, with a distinct marker glyph per scheduler so overlapping
  traces stay distinguishable.
- Batch mode via an INI config file with -S/--select fnmatch filter;
  configurations sharing the same (depth, quota, load_factor) are grouped
  in the report.
- Markdown report (report.md) with GFM tables and embedded CPU utilisation
  graphs.
- Dependency check at startup with install instructions for Ubuntu,
  Arch Linux, and Fedora / Amazon Linux.

Cleanup discipline:
- teardown() writes 1 to leaf/cgroup.kill and waits on cgroup.events
  populated=0 before rmdir, so leftover stress-ng workers cannot leak
  cgroups via EBUSY. rmdir failures are logged loudly instead of
  swallowed.
- A SIGTERM / SIGINT handler and an atexit hook run a best-effort
  teardown of any cgroups currently set up, covering hard exits where
  the normal try/finally would not get a chance to run.
- bench_id includes a per-process monotonic counter alongside PID and
  millisecond timestamp, eliminating collision risk between
  back-to-back runs.

Also add cpu_max_bench.ini as an example configuration covering root
cgroup, baseline, varying load factors, varying cgroup depths, and
50%-quota runs, for both eevdf and scx_lavd.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
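As a worked example of the reported metric (the numbers are illustrative, not from an actual run): if perf stat records cycles = 40e9 and cycles:k = 2e9 over the run on a 16-CPU machine, the script reports 2e9 / 40e9 * 16 = 0.8 equivalent CPUs of kernel-mode overhead.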
The cbw_dbg / cbw_dbg_cgrp / dbg_cgx prints were useful at the very early
development stage but no longer carry their weight: they fire on every
cgroup init/exit/move, throttle/consume, replenish tick, and BTQ pop,
flooding trace_pipe and slowing the hot paths under any non-trivial
workload. cbw_err / cbw_warn still cover the actionable cases.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
schedule_atq_destroy() defers BTQ destruction through a lock-free
circular ring of CBW_DEFERRED_BTQ_SIZE slots. When tail wraps around to
an occupied slot, the incumbent BTQ is evicted and destroyed immediately.
A use-after-free occurs if a reader still holds a pointer to the evicted
BTQ -- the window between READ_ONCE(llcx->btq) in cbw_drain_btq_batch()
and the arena_spin_lock() inside scx_atq_pop().

The original size of 256 is too small: with heavy cgroup churn (e.g.
during scheduler teardown), more than 256 BTQs can be queued before a
reader has released its pointer, wrapping the ring and triggering the
use-after-free observed as repeated "freeing nonexistent idx" errors.

Since there are at most CBW_NR_CGRP_LLC_MAX llcx objects, at most that
many BTQs can ever be live at once. Setting the ring size to
CBW_NR_CGRP_LLC_MAX * 2 means the ring cannot wrap before the entire BTQ
pool has turned over twice. Any reader that snapshots a pointer will find
its slot still intact -- the window of vulnerability is a few
instructions, far shorter than 2 * CBW_NR_CGRP_LLC_MAX concurrent BTQ
destructions.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
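A minimal sketch of the deferred-destruction ring described above, under the stated sizing. All names here (the slot array, tail counter, element type, and destroy call) are illustrative, not the library's actual identifiers:

  /* Ring sized so it cannot wrap while a reader still holds a pointer. */
  #define CBW_DEFERRED_BTQ_SIZE  (CBW_NR_CGRP_LLC_MAX * 2)

  static struct cbw_btq *deferred_btq[CBW_DEFERRED_BTQ_SIZE];  /* hypothetical */
  static u64 deferred_tail;

  static void schedule_btq_destroy(struct cbw_btq *btq)        /* hypothetical */
  {
          u64 idx = __sync_fetch_and_add(&deferred_tail, 1) % CBW_DEFERRED_BTQ_SIZE;
          struct cbw_btq *old = __sync_lock_test_and_set(&deferred_btq[idx], btq);

          /*
           * Tail wrapped onto an occupied slot: evict and destroy the
           * incumbent immediately. With the larger ring this can only
           * happen after the whole BTQ pool has turned over twice.
           */
          if (old)
                  cbw_btq_destroy(old);                        /* hypothetical */
  }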
…rees

CBW_CGRP_TREE_HEIGHT_MAX bounds the per-CPU tree_levels[] array used by
cbw_update_runtime_total_sloppy() to walk the cgroup hierarchy. It was
set to 32 with a comment claiming this matched the kernel's
CGROUPS_DEPTH_MAX, but the kernel default is actually much larger
(cgroup_max_depth = INT_MAX), so the comment was misleading and the cap
was tighter than necessary.

Two changes:
- Bump the cap from 32 to 64 to give more headroom for genuinely deep
  hierarchies seen in container-on-container setups.
- Reject cgroups whose level exceeds the cap at scx_cgroup_bw_init()
  instead of silently proceeding and indexing tree_levels[] out of
  bounds. Returning -ENOMEM from init makes the failure explicit and
  safe.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
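A sketch of the init-time rejection; the helper name and the exact comparison against the cap are illustrative:

  #define CBW_CGRP_TREE_HEIGHT_MAX  64

  /* Hypothetical helper called from scx_cgroup_bw_init(). */
  static int cbw_check_cgrp_depth(struct cgroup *cgrp)
  {
          /* Explicit, safe failure instead of indexing tree_levels[]
           * out of bounds later on a too-deep hierarchy. */
          if (cgrp->level >= CBW_CGRP_TREE_HEIGHT_MAX)
                  return -ENOMEM;
          return 0;
  }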
…scx_bpf_dump
bpf_printk() was hard-coded, which is wrong from an ops.dump*() callback
where output should land in the SCX dump buffer instead of trace_pipe.
Add a mode parameter:
enum scx_cgroup_bw_dump_mode {
        SCX_CGROUP_BW_DUMP_PRINTK = 0,
        SCX_CGROUP_BW_DUMP_SCX = 1,
};
A cbw_dump_line(mode, fmt, ...) macro dispatches to the chosen helper.
Also normalise a hard-coded cgroup id of 1 to the runtime-detected
ROOT_CGID so namespaced callers find the right root.
scx_lavd uses PRINTK in lavd_dump to avoid flooding the dump buffer
with the full hierarchy, and SCX mode in lavd_dump_task to surface the
offending cgroup state next to the throttled task.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
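A sketch of how such a dispatch macro could look; the shipped cbw_dump_line() may differ in detail, but bpf_printk() and scx_bpf_dump() are the standard trace_pipe and SCX dump-buffer output macros:

  /* Illustrative dispatch between trace_pipe and the SCX dump buffer. */
  #define cbw_dump_line(mode, fmt, args...)               \
          do {                                            \
                  if ((mode) == SCX_CGROUP_BW_DUMP_SCX)   \
                          scx_bpf_dump(fmt, ##args);      \
                  else                                    \
                          bpf_printk(fmt, ##args);        \
          } while (0)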
The same cache-lookup pattern was open-coded in
cbw_cgroup_bw_throttled(), scx_cgroup_bw_consume(), and
scx_cgroup_bw_pressure(); the matching invalidation pattern in
cbw_drain_atq_to_root(), cbw_free_llc_ctx(), and scx_cgroup_bw_move().
Add three static __always_inline helpers:

  cbw_taskc_get_cgx_raw(taskc, cgrp_id)
  cbw_taskc_get_llcx_raw(taskc, cgrp_id, llc_id)
  cbw_taskc_invalidate(taskc)

The getters accept a possibly-NULL taskc and return 0 on miss so each
caller keeps its own miss policy. cbw_taskc_invalidate() centralises the
__sync_lock_test_and_set workaround for the arena-pointer fields, letting
scx_cgroup_bw_move() drop its local `volatile` qualifier.

No semantic change.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
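A sketch of the getter/invalidate shape; the task-context type and field names (cbw_task_ctx, cgrp_id, cgx, llcx) are assumptions for illustration, not the library's actual layout:

  static __always_inline u64
  cbw_taskc_get_cgx_raw(struct cbw_task_ctx *taskc, u64 cgrp_id)
  {
          /* Tolerate a NULL context; 0 means "miss", caller decides policy. */
          if (!taskc || taskc->cgrp_id != cgrp_id)
                  return 0;
          return taskc->cgx;
  }

  static __always_inline void cbw_taskc_invalidate(struct cbw_task_ctx *taskc)
  {
          if (!taskc)
                  return;
          /* Atomic-store workaround for the arena-pointer fields. */
          __sync_lock_test_and_set(&taskc->cgx, 0);
          __sync_lock_test_and_set(&taskc->llcx, 0);
  }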
When a cgroup is throttled, tasks already running hold the CPU for their
full time slice before the scheduler can recheck the throttle state.
This causes task-stall latency that grows with the configured slice
length.

Add scx_cgroup_bw_pressure() to expose a 1024-scale pressure hint that
BPF schedulers can use to shorten time slices proportionally:

  slice = (base_slice * 1024) / pressure

Pressure is computed at each replenishment boundary from two signals that
are combined by addition so that both contribute independently:

Budget pressure: a hyperbolic curve that rises steeply below 25% of the
replenished period_budget. A small budget after replenishment also
indicates accumulated debt from prior over-consumption, so high pressure
is correct in that case too.

Backlog pressure: a linear term proportional to the number of tasks
queued in the BTQ across all LLC domains. A growing backlog signals that
the reenqueue path cannot drain fast enough; shorter slices reduce the
time any single task monopolises the CPU.

The combined pressure is clamped to [1024, 16384], limiting the maximum
reduction to 1/16 of the base slice.

scx_lavd adopts the new API in calc_time_slice(): pressure is fetched
once per scheduling decision, slice boost is suppressed under any
throttle pressure, and the final slice is scaled by the pressure before
being assigned to the task.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
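A sketch of the consumer side of the formula above; the helper name is illustrative and how the pressure value is fetched per task is not shown, only the slice scaling:

  static __always_inline u64 scale_slice_by_pressure(u64 base_slice, u64 pressure)
  {
          /* The library clamps pressure to [1024, 16384]; guard anyway. */
          if (pressure < 1024)
                  pressure = 1024;

          /* 1024 => full base slice; 16384 => base_slice / 16. */
          return (base_slice * 1024) / pressure;
  }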
… delay
A throttled task's position in the BTQ is determined by its scheduler
vtime. If other tasks are continuously enqueued with smaller vtimes,
a task with a large vtime can be delayed arbitrarily long even though
it has been waiting in the queue.
Fix this by blending the wall-clock time into the BTQ key:
  btq_vtime = (scx_bpf_now() & CBW_BTQ_VTIME_UPPER_MASK) |
              (vtime & CBW_BTQ_VTIME_LOWER_MASK)
The upper 32 bits come from the current nanosecond timestamp; the lower
32 bits come from the scheduler-provided vtime. The 64-bit key is split
evenly so each side contributes 32 bits. Tasks enqueued within the same
~4-second window (2^32 ns ~= 4.29 s) still compete by their scheduler
vtime, preserving relative fairness. Once a new wall-clock epoch
begins, earlier-queued tasks take priority regardless of their vtime,
guaranteeing that no task waits more than ~4 seconds in the BTQ due to
vtime ordering alone.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
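A sketch of the key construction; the mask values follow the even 32/32 split described above, and the helper name is illustrative:

  #define CBW_BTQ_VTIME_UPPER_MASK  0xffffffff00000000ULL
  #define CBW_BTQ_VTIME_LOWER_MASK  0x00000000ffffffffULL

  static __always_inline u64 cbw_btq_key(u64 now_ns, u64 vtime)
  {
          /* Upper 32 bits: ~4.29 s wall-clock epoch; lower 32 bits: vtime. */
          return (now_ns & CBW_BTQ_VTIME_UPPER_MASK) |
                 (vtime & CBW_BTQ_VTIME_LOWER_MASK);
  }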
…tates
scx_cgroup_bw_throttled() already bypasses throttling for PF_EXITING
tasks because the BTQ drain's bpf_task_from_pid() returns NULL once the
kernel unhashes an exiting task, losing the task from all runqueues.
Real-world workloads with frequent SIGSTOP/SIGCONT cycles exhibit a
related stall: a task woken specifically so it can observe a pending
group stop is parked in the BTQ, the throttle window elapses, and the
user-visible SIGSTOP appears delayed by seconds. Cgroup-v2 freeze and
ptrace traps share the same shape -- the kernel-side operation cannot
converge until the scheduler lets the task run briefly.
Extend the bypass to cover both flavours:
Correctness -- task is leaving SCX before drain can find it:

  PF_EXITING           (already handled)
  SIGNAL_GROUP_EXIT    SIGKILL / exit_group() propagating; narrow window
                       where the group flag is set but PF_EXITING has
                       not yet landed on a sibling.

Latency -- task wants a short kernel-mediated transition:

  JOBCTL_STOP_PENDING  group SIGSTOP delivery
  JOBCTL_TRAP_STOP     ptrace stop trap
  JOBCTL_TRAP_NOTIFY   ptrace notify trap (seccomp, PTRACE_EVENT_*)
  JOBCTL_TRAP_FREEZE   cgroup-v2 freezer trap
JOBCTL_PENDING_MASK already groups STOP_PENDING with the trap bits;
TRAP_FREEZE is outside the mask and gets its own bit-test. Quota
impact is negligible: tasks in any of these states consume essentially
no CPU before leaving SCX.
Each branch is marked unlikely() since the steady state is "throttle
normally", and READ_ONCE() is used for p->jobctl and p->signal->flags
because those are written under siglock on a different CPU.
vmlinux.h carries types but not CPP macros, so the SIGNAL_GROUP_EXIT
and JOBCTL_* bit definitions are mirrored from the kernel headers
near the top of cgroup_bw.bpf.c.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
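A sketch of the combined bypass test; the helper name is illustrative, and the mirrored bit values below are copied from the mainline headers (include/linux/sched.h, sched/signal.h, sched/jobctl.h) and should be re-checked against the target kernel:

  #define PF_EXITING            0x00000004
  #define SIGNAL_GROUP_EXIT     0x00000004
  #define JOBCTL_STOP_PENDING   (1UL << 17)
  #define JOBCTL_TRAP_STOP      (1UL << 19)
  #define JOBCTL_TRAP_NOTIFY    (1UL << 20)
  #define JOBCTL_TRAP_FREEZE    (1UL << 23)
  #define JOBCTL_TRAP_MASK      (JOBCTL_TRAP_STOP | JOBCTL_TRAP_NOTIFY)
  #define JOBCTL_PENDING_MASK   (JOBCTL_STOP_PENDING | JOBCTL_TRAP_MASK)

  static __always_inline bool cbw_bypass_throttle(struct task_struct *p)
  {
          /* Correctness: the task is leaving SCX before drain can find it. */
          if (unlikely(p->flags & PF_EXITING))
                  return true;
          if (unlikely(READ_ONCE(p->signal->flags) & SIGNAL_GROUP_EXIT))
                  return true;

          /* Latency: a pending stop / trap / freeze needs a brief run to converge. */
          if (unlikely(READ_ONCE(p->jobctl) &
                       (JOBCTL_PENDING_MASK | JOBCTL_TRAP_FREEZE)))
                  return true;

          return false;
  }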
arena_spin_lock() -ETIMEDOUT means a bounded spin loop in the slow path
gave up, leaving the MCS chain with stale ->next links. The running
scheduler must tear down (retrying would race against an inconsistent
queue). Use scx_bpf_exit(SCX_ECODE_ACT_RESTART, ...) instead of
scx_bpf_error() so user-space orchestration can respawn the scheduler
automatically rather than treating it as a bug.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
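A sketch of the error path; the wrapper name and lock argument are illustrative, while scx_bpf_exit() and SCX_ECODE_ACT_RESTART are the sched_ext exit interface named in the commit:

  static __always_inline int cbw_lock_or_restart(arena_spinlock_t __arena *lock)
  {
          int ret = arena_spin_lock(lock);

          if (unlikely(ret == -ETIMEDOUT)) {
                  /*
                   * The MCS chain may hold stale ->next links; a fresh
                   * scheduler load reinitialises the lock state, so ask
                   * user space to respawn us rather than flag a bug via
                   * scx_bpf_error().
                   */
                  scx_bpf_exit(SCX_ECODE_ACT_RESTART,
                               "arena_spin_lock() timed out");
          }
          return ret;
  }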
This series caps the wall-clock latency a task can experience when its
cgroup runs out of cpu.max bandwidth. It was driven by stalls
observed while stress-testing scx_lavd under heavy cpu.max throttling
with concurrent cgroup churn, SIGSTOP/SIGCONT cycles, and deep cgroup
hierarchies. The first six commits are diagnostics, robustness, and
refactoring prep work; the last three carry the latency mitigation.
Bounding task-stall latency
Tasks already running when a cgroup throttles hold the CPU for the
full base time slice, blocking the scheduler from rechecking the
throttle for that long. scx_cgroup_bw_pressure() exposes a 1024-
scale pressure hint that combines a hyperbolic budget term with a
linear BTQ-backlog term; scx_lavd consumes it in calc_time_slice()
to scale slices down by up to 16x under heavy pressure.
A throttled task's BTQ position is its scheduler vtime alone, so a
task with a large vtime can starve indefinitely behind a steady
stream of smaller-vtime arrivals. Blend the upper 32 bits of
wall-clock nanoseconds with the lower 32 bits of vtime so any task
waits at most ~4.29 s before its epoch makes it the head of the
queue.
The kernel sometimes wakes a task specifically so it can observe a
pending kernel-mediated transition (group SIGSTOP, ptrace trap,
cgroup-v2 freeze, group exit). Throttling such a task into the BTQ
delays the user-visible operation by the throttle window -- SIGSTOP
appears delayed for seconds, cgroup.freeze appears to hang. Extend
the existing PF_EXITING bypass in scx_cgroup_bw_throttled() to cover
SIGNAL_GROUP_EXIT and JOBCTL_PENDING_MASK | JOBCTL_TRAP_FREEZE.
Supporting changes
cpu_max_bench.py quantifies cpu.max overhead under varying
depth/quota/load and plots eevdf vs scx_lavd side-by-side. The
verbose cbw_dbg* tracing flooded trace_pipe and slowed every hot
path; cbw_err / cbw_warn still cover the actionable cases.
The deferred-BTQ destruction ring at 256 slots could wrap under
heavy cgroup churn and trigger a UAF; resized to
CBW_NR_CGRP_LLC_MAX * 2. CBW_CGRP_TREE_HEIGHT_MAX was 32 with a
misleading "matches the kernel" comment (kernel default is INT_MAX);
bumped to 64 and scx_cgroup_bw_init() now rejects deeper trees with
-ENOMEM instead of indexing tree_levels[] out of bounds.
scx_cgroup_bw_dump() was hard-coded to bpf_printk(); add a mode
parameter so ops.dump*() callbacks can route output to the SCX dump
buffer. Factor the open-coded taskc-cached cgx/llcx lookup pattern
into helpers so the new pressure API picks up the caching for free.
arena_spin_lock() -ETIMEDOUT leaves the MCS chain with stale ->next
links; a running scheduler must tear down rather than retry against
an inconsistent queue, but a fresh load reinitialises the MCS state.
Surface the failure as scx_bpf_exit(SCX_ECODE_ACT_RESTART, ...) so
user-space orchestration can respawn the scheduler automatically.
Signed-off-by: Changwoo Min <changwoo@igalia.com>