lib/cgroup_bw: cap task-stall latency under cpu.max #3554
Open
multics69 wants to merge 10 commits
Conversation
multics69 (Contributor, Author):
PR 3554 includes the changes in PR 3552 (ATQ, -ETIMEDOUT PR) for ease of testing. Once PR 3552 lands, I will rebase it properly.
daidavid reviewed on May 6, 2026.
multics69 (Contributor, Author):
Rebased to master.
Introduce a benchmark script that quantifies the kernel-mode CPU overhead
imposed by cgroup cpu.max bandwidth enforcement. The script runs
stress-ng --cpu inside a configurable-depth cgroup hierarchy with cpu.max
set at every level, and captures system-wide perf stat counters (cycles,
cycles:k, cache-misses, stalled-cycles-backend, instructions) over the
full duration. The ratio cycles:k / cycles * nproc is reported as the
overhead expressed in equivalent CPUs.

Key features:
- Configurable cgroup depth (0 = system root, no per-run cgroup), quota
  (percent of nproc), load factor, duration, and scheduler (eevdf or
  scx_lavd).
- Per-second CPU utilisation sampled from cgroup cpu.stat and rendered as
  PNG/SVG plots, with a distinct marker glyph per scheduler so overlapping
  traces stay distinguishable.
- Batch mode via an INI config file with -S/--select fnmatch filter;
  configurations sharing the same (depth, quota, load_factor) are grouped
  in the report.
- Markdown report (report.md) with GFM tables and embedded CPU utilisation
  graphs.
- Dependency check at startup with install instructions for Ubuntu,
  Arch Linux, and Fedora / Amazon Linux.

Cleanup discipline:
- teardown() writes 1 to leaf/cgroup.kill and waits on cgroup.events
  populated=0 before rmdir, so leftover stress-ng workers cannot leak
  cgroups via EBUSY. rmdir failures are logged loudly instead of
  swallowed.
- A SIGTERM / SIGINT handler and an atexit hook run a best-effort
  teardown of any cgroups currently set up, covering hard exits where
  the normal try/finally would not get a chance to run.
- bench_id includes a per-process monotonic counter alongside PID and
  millisecond timestamp, eliminating collision risk between
  back-to-back runs.

Also add cpu_max_bench.ini as an example configuration covering root
cgroup, baseline, varying load factors, varying cgroup depths, and
50%-quota runs, for both eevdf and scx_lavd.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
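As a worked example of the reported metric (the numbers are illustrative, not from an actual run): if perf stat records cycles = 40e9 and cycles:k = 2e9 over the run on a 16-CPU machine, the script reports 2e9 / 40e9 * 16 = 0.8 equivalent CPUs of kernel-mode overhead.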
The cbw_dbg / cbw_dbg_cgrp / dbg_cgx prints were useful at the very early
development stage but no longer carry their weight: they fire on every
cgroup init/exit/move, throttle/consume, replenish tick, and BTQ pop,
flooding trace_pipe and slowing the hot paths under any non-trivial
workload. cbw_err / cbw_warn still cover the actionable cases.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
schedule_atq_destroy() defers BTQ destruction through a lock-free
circular ring of CBW_DEFERRED_BTQ_SIZE slots. When tail wraps around to
an occupied slot, the incumbent BTQ is evicted and destroyed immediately.
A use-after-free occurs if a reader still holds a pointer to the evicted
BTQ -- the window between READ_ONCE(llcx->btq) in cbw_drain_btq_batch()
and the arena_spin_lock() inside scx_atq_pop().

The original size of 256 is too small: with heavy cgroup churn (e.g.
during scheduler teardown), more than 256 BTQs can be queued before a
reader has released its pointer, wrapping the ring and triggering the
use-after-free observed as repeated "freeing nonexistent idx" errors.

Since there are at most CBW_NR_CGRP_LLC_MAX llcx objects, at most that
many BTQs can ever be live at once. Setting the ring size to
CBW_NR_CGRP_LLC_MAX * 2 means the ring cannot wrap before the entire BTQ
pool has turned over twice. Any reader that snapshots a pointer will find
its slot still intact -- the window of vulnerability is a few
instructions, far shorter than 2 * CBW_NR_CGRP_LLC_MAX concurrent BTQ
destructions.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
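A minimal sketch of the deferred-destruction ring described above, under the stated sizing. All names here (the slot array, tail counter, element type, and destroy call) are illustrative, not the library's actual identifiers:

  /* Ring sized so it cannot wrap while a reader still holds a pointer. */
  #define CBW_DEFERRED_BTQ_SIZE  (CBW_NR_CGRP_LLC_MAX * 2)

  static struct cbw_btq *deferred_btq[CBW_DEFERRED_BTQ_SIZE];  /* hypothetical */
  static u64 deferred_tail;

  static void schedule_btq_destroy(struct cbw_btq *btq)        /* hypothetical */
  {
          u64 idx = __sync_fetch_and_add(&deferred_tail, 1) % CBW_DEFERRED_BTQ_SIZE;
          struct cbw_btq *old = __sync_lock_test_and_set(&deferred_btq[idx], btq);

          /*
           * Tail wrapped onto an occupied slot: evict and destroy the
           * incumbent immediately. With the larger ring this can only
           * happen after the whole BTQ pool has turned over twice.
           */
          if (old)
                  cbw_btq_destroy(old);                        /* hypothetical */
  }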
…rees

CBW_CGRP_TREE_HEIGHT_MAX bounds the per-CPU tree_levels[] array used by
cbw_update_runtime_total_sloppy() to walk the cgroup hierarchy. It was
set to 32 with a comment claiming this matched the kernel's
CGROUPS_DEPTH_MAX, but the kernel default is actually much larger
(cgroup_max_depth = INT_MAX), so the comment was misleading and the cap
was tighter than necessary.

Two changes:
- Bump the cap from 32 to 64 to give more headroom for genuinely deep
  hierarchies seen in container-on-container setups.
- Reject cgroups whose level exceeds the cap at scx_cgroup_bw_init()
  instead of silently proceeding and indexing tree_levels[] out of
  bounds. Returning -ENOMEM from init makes the failure explicit and
  safe.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
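A sketch of the init-time rejection; the helper name and the exact comparison against the cap are illustrative:

  #define CBW_CGRP_TREE_HEIGHT_MAX  64

  /* Hypothetical helper called from scx_cgroup_bw_init(). */
  static int cbw_check_cgrp_depth(struct cgroup *cgrp)
  {
          /* Explicit, safe failure instead of indexing tree_levels[]
           * out of bounds later on a too-deep hierarchy. */
          if (cgrp->level >= CBW_CGRP_TREE_HEIGHT_MAX)
                  return -ENOMEM;
          return 0;
  }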
…scx_bpf_dump
bpf_printk() was hard-coded, which is wrong from an ops.dump*() callback
where output should land in the SCX dump buffer instead of trace_pipe.
Add a mode parameter:
enum scx_cgroup_bw_dump_mode {
        SCX_CGROUP_BW_DUMP_PRINTK = 0,
        SCX_CGROUP_BW_DUMP_SCX = 1,
};
A cbw_dump_line(mode, fmt, ...) macro dispatches to the chosen helper.
Also normalise a hard-coded cgroup id of 1 to the runtime-detected
ROOT_CGID so namespaced callers find the right root.
scx_lavd uses PRINTK in lavd_dump to avoid flooding the dump buffer
with the full hierarchy, and SCX mode in lavd_dump_task to surface the
offending cgroup state next to the throttled task.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
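A sketch of how such a dispatch macro could look; the shipped cbw_dump_line() may differ in detail, but bpf_printk() and scx_bpf_dump() are the standard trace_pipe and SCX dump-buffer output macros:

  /* Illustrative dispatch between trace_pipe and the SCX dump buffer. */
  #define cbw_dump_line(mode, fmt, args...)               \
          do {                                            \
                  if ((mode) == SCX_CGROUP_BW_DUMP_SCX)   \
                          scx_bpf_dump(fmt, ##args);      \
                  else                                    \
                          bpf_printk(fmt, ##args);        \
          } while (0)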
The same cache-lookup pattern was open-coded in
cbw_cgroup_bw_throttled(), scx_cgroup_bw_consume(), and
scx_cgroup_bw_pressure(); the matching invalidation pattern in
cbw_drain_atq_to_root(), cbw_free_llc_ctx(), and scx_cgroup_bw_move().
Add three static __always_inline helpers:

  cbw_taskc_get_cgx_raw(taskc, cgrp_id)
  cbw_taskc_get_llcx_raw(taskc, cgrp_id, llc_id)
  cbw_taskc_invalidate(taskc)

The getters accept a possibly-NULL taskc and return 0 on miss so each
caller keeps its own miss policy. cbw_taskc_invalidate() centralises the
__sync_lock_test_and_set workaround for the arena-pointer fields, letting
scx_cgroup_bw_move() drop its local `volatile` qualifier.

No semantic change.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
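A sketch of the getter/invalidate shape; the task-context type and field names (cbw_task_ctx, cgrp_id, cgx, llcx) are assumptions for illustration, not the library's actual layout:

  static __always_inline u64
  cbw_taskc_get_cgx_raw(struct cbw_task_ctx *taskc, u64 cgrp_id)
  {
          /* Tolerate a NULL context; 0 means "miss", caller decides policy. */
          if (!taskc || taskc->cgrp_id != cgrp_id)
                  return 0;
          return taskc->cgx;
  }

  static __always_inline void cbw_taskc_invalidate(struct cbw_task_ctx *taskc)
  {
          if (!taskc)
                  return;
          /* Atomic-store workaround for the arena-pointer fields. */
          __sync_lock_test_and_set(&taskc->cgx, 0);
          __sync_lock_test_and_set(&taskc->llcx, 0);
  }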
When a cgroup is throttled, tasks already running hold the CPU for their
full time slice before the scheduler can recheck the throttle state.
This causes task-stall latency that grows with the configured slice
length.

Add scx_cgroup_bw_pressure() to expose a 1024-scale pressure hint that
BPF schedulers can use to shorten time slices proportionally:

  slice = (base_slice * 1024) / pressure

Pressure is computed at each replenishment boundary from two signals that
are combined by addition so that both contribute independently:

Budget pressure: a hyperbolic curve that rises steeply below 25% of the
replenished period_budget. A small budget after replenishment also
indicates accumulated debt from prior over-consumption, so high pressure
is correct in that case too.

Backlog pressure: a linear term proportional to the number of tasks
queued in the BTQ across all LLC domains. A growing backlog signals that
the reenqueue path cannot drain fast enough; shorter slices reduce the
time any single task monopolises the CPU.

The combined pressure is clamped to [1024, 16384], limiting the maximum
reduction to 1/16 of the base slice.

scx_lavd adopts the new API in calc_time_slice(): pressure is fetched
once per scheduling decision, slice boost is suppressed under any
throttle pressure, and the final slice is scaled by the pressure before
being assigned to the task.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
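A sketch of the consumer side of the formula above; the helper name is illustrative and how the pressure value is fetched per task is not shown, only the slice scaling:

  static __always_inline u64 scale_slice_by_pressure(u64 base_slice, u64 pressure)
  {
          /* The library clamps pressure to [1024, 16384]; guard anyway. */
          if (pressure < 1024)
                  pressure = 1024;

          /* 1024 => full base slice; 16384 => base_slice / 16. */
          return (base_slice * 1024) / pressure;
  }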
… delay
A throttled task's position in the BTQ is determined by its scheduler
vtime. If other tasks are continuously enqueued with smaller vtimes,
a task with a large vtime can be delayed arbitrarily long even though
it has been waiting in the queue.
Fix this by blending the wall-clock time into the BTQ key:
  btq_vtime = (scx_bpf_now() & CBW_BTQ_VTIME_UPPER_MASK) |
              (vtime & CBW_BTQ_VTIME_LOWER_MASK)
The upper 32 bits come from the current nanosecond timestamp; the lower
32 bits come from the scheduler-provided vtime. The 64-bit key is split
evenly so each side contributes 32 bits. Tasks enqueued within the same
~4-second window (2^32 ns ~= 4.29 s) still compete by their scheduler
vtime, preserving relative fairness. Once a new wall-clock epoch
begins, earlier-queued tasks take priority regardless of their vtime,
guaranteeing that no task waits more than ~4 seconds in the BTQ due to
vtime ordering alone.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
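A sketch of the key construction; the mask values follow the even 32/32 split described above, and the helper name is illustrative:

  #define CBW_BTQ_VTIME_UPPER_MASK  0xffffffff00000000ULL
  #define CBW_BTQ_VTIME_LOWER_MASK  0x00000000ffffffffULL

  static __always_inline u64 cbw_btq_key(u64 now_ns, u64 vtime)
  {
          /* Upper 32 bits: ~4.29 s wall-clock epoch; lower 32 bits: vtime. */
          return (now_ns & CBW_BTQ_VTIME_UPPER_MASK) |
                 (vtime & CBW_BTQ_VTIME_LOWER_MASK);
  }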
…tates
scx_cgroup_bw_throttled() already bypasses throttling for PF_EXITING
tasks because the BTQ drain's bpf_task_from_pid() returns NULL once the
kernel unhashes an exiting task, losing the task from all runqueues.
Real-world workloads with frequent SIGSTOP/SIGCONT cycles exhibit a
related stall: a task woken specifically so it can observe a pending
group stop is parked in the BTQ, the throttle window elapses, and the
user-visible SIGSTOP appears delayed by seconds. Cgroup-v2 freeze and
ptrace traps share the same shape -- the kernel-side operation cannot
converge until the scheduler lets the task run briefly.
Extend the bypass to cover both flavours:
Correctness -- task is leaving SCX before drain can find it:

  PF_EXITING           (already handled)
  SIGNAL_GROUP_EXIT    SIGKILL / exit_group() propagating; narrow window
                       where the group flag is set but PF_EXITING has
                       not yet landed on a sibling.

Latency -- task wants a short kernel-mediated transition:

  JOBCTL_STOP_PENDING  group SIGSTOP delivery
  JOBCTL_TRAP_STOP     ptrace stop trap
  JOBCTL_TRAP_NOTIFY   ptrace notify trap (seccomp, PTRACE_EVENT_*)
  JOBCTL_TRAP_FREEZE   cgroup-v2 freezer trap
JOBCTL_PENDING_MASK already groups STOP_PENDING with the trap bits;
TRAP_FREEZE is outside the mask and gets its own bit-test. Quota
impact is negligible: tasks in any of these states consume essentially
no CPU before leaving SCX.
Each branch is marked unlikely() since the steady state is "throttle
normally", and READ_ONCE() is used for p->jobctl and p->signal->flags
because those are written under siglock on a different CPU.
vmlinux.h carries types but not CPP macros, so the SIGNAL_GROUP_EXIT
and JOBCTL_* bit definitions are mirrored from the kernel headers
near the top of cgroup_bw.bpf.c.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
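A sketch of the combined bypass test; the helper name is illustrative, and the mirrored bit values below are copied from the mainline headers (include/linux/sched.h, sched/signal.h, sched/jobctl.h) and should be re-checked against the target kernel:

  #define PF_EXITING            0x00000004
  #define SIGNAL_GROUP_EXIT     0x00000004
  #define JOBCTL_STOP_PENDING   (1UL << 17)
  #define JOBCTL_TRAP_STOP      (1UL << 19)
  #define JOBCTL_TRAP_NOTIFY    (1UL << 20)
  #define JOBCTL_TRAP_FREEZE    (1UL << 23)
  #define JOBCTL_TRAP_MASK      (JOBCTL_TRAP_STOP | JOBCTL_TRAP_NOTIFY)
  #define JOBCTL_PENDING_MASK   (JOBCTL_STOP_PENDING | JOBCTL_TRAP_MASK)

  static __always_inline bool cbw_bypass_throttle(struct task_struct *p)
  {
          /* Correctness: the task is leaving SCX before drain can find it. */
          if (unlikely(p->flags & PF_EXITING))
                  return true;
          if (unlikely(READ_ONCE(p->signal->flags) & SIGNAL_GROUP_EXIT))
                  return true;

          /* Latency: a pending stop / trap / freeze needs a brief run to converge. */
          if (unlikely(READ_ONCE(p->jobctl) &
                       (JOBCTL_PENDING_MASK | JOBCTL_TRAP_FREEZE)))
                  return true;

          return false;
  }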
arena_spin_lock() -ETIMEDOUT means a bounded spin loop in the slow path
gave up, leaving the MCS chain with stale ->next links. The running
scheduler must tear down (retrying would race against an inconsistent
queue). Use scx_bpf_exit(SCX_ECODE_ACT_RESTART, ...) instead of
scx_bpf_error() so user-space orchestration can respawn the scheduler
automatically rather than treating it as a bug.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
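A sketch of the error path; the wrapper name and lock argument are illustrative, while scx_bpf_exit() and SCX_ECODE_ACT_RESTART are the sched_ext exit interface named in the commit:

  static __always_inline int cbw_lock_or_restart(arena_spinlock_t __arena *lock)
  {
          int ret = arena_spin_lock(lock);

          if (unlikely(ret == -ETIMEDOUT)) {
                  /*
                   * The MCS chain may hold stale ->next links; a fresh
                   * scheduler load reinitialises the lock state, so ask
                   * user space to respawn us rather than flag a bug via
                   * scx_bpf_error().
                   */
                  scx_bpf_exit(SCX_ECODE_ACT_RESTART,
                               "arena_spin_lock() timed out");
          }
          return ret;
  }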
This series caps the wall-clock latency a task can experience when its
cgroup runs out of cpu.max bandwidth. It was driven by stalls
observed while stress-testing scx_lavd under heavy cpu.max throttling
with concurrent cgroup churn, SIGSTOP/SIGCONT cycles, and deep cgroup
hierarchies. The first six commits are diagnostics, robustness, and
refactoring prep work; the last three carry the latency mitigation.
Bounding task-stall latency
Tasks already running when a cgroup throttles hold the CPU for the
full base time slice, blocking the scheduler from rechecking the
throttle for that long. scx_cgroup_bw_pressure() exposes a 1024-
scale pressure hint that combines a hyperbolic budget term with a
linear BTQ-backlog term; scx_lavd consumes it in calc_time_slice()
to scale slices down by up to 16x under heavy pressure.
A throttled task's BTQ position is its scheduler vtime alone, so a
task with a large vtime can starve indefinitely behind a steady
stream of smaller-vtime arrivals. Blend the upper 32 bits of
wall-clock nanoseconds with the lower 32 bits of vtime so any task
waits at most ~4.29 s before its epoch makes it the head of the
queue.
The kernel sometimes wakes a task specifically so it can observe a
pending kernel-mediated transition (group SIGSTOP, ptrace trap,
cgroup-v2 freeze, group exit). Throttling such a task into the BTQ
delays the user-visible operation by the throttle window -- SIGSTOP
appears delayed for seconds, cgroup.freeze appears to hang. Extend
the existing PF_EXITING bypass in scx_cgroup_bw_throttled() to cover
SIGNAL_GROUP_EXIT and JOBCTL_PENDING_MASK | JOBCTL_TRAP_FREEZE.
Supporting changes
cpu_max_bench.py quantifies cpu.max overhead under varying
depth/quota/load and plots eevdf vs scx_lavd side-by-side. The
verbose cbw_dbg* tracing flooded trace_pipe and slowed every hot
path; cbw_err / cbw_warn still cover the actionable cases.
The deferred-BTQ destruction ring at 256 slots could wrap under
heavy cgroup churn and trigger a UAF; resized to
CBW_NR_CGRP_LLC_MAX * 2. CBW_CGRP_TREE_HEIGHT_MAX was 32 with a
misleading "matches the kernel" comment (kernel default is INT_MAX);
bumped to 64 and scx_cgroup_bw_init() now rejects deeper trees with
-ENOMEM instead of indexing tree_levels[] out of bounds.
scx_cgroup_bw_dump() was hard-coded to bpf_printk(); add a mode
parameter so ops.dump*() callbacks can route output to the SCX dump
buffer. Factor the open-coded taskc-cached cgx/llcx lookup pattern
into helpers so the new pressure API picks up the caching for free.
arena_spin_lock() -ETIMEDOUT leaves the MCS chain with stale ->next
links; a running scheduler must tear down rather than retry against
an inconsistent queue, but a fresh load reinitialises the MCS state.
Surface the failure as scx_bpf_exit(SCX_ECODE_ACT_RESTART, ...) so
user-space orchestration can respawn the scheduler automatically.
Signed-off-by: Changwoo Min <changwoo@igalia.com>