board/raspberrypi: fix Pi 4/Yellow network freeze during cloud backup (reserve 256M CMA)#4652
board/raspberrypi: fix Pi 4/Yellow network freeze during cloud backup (reserve 256M CMA)#4652ocalvo wants to merge 7 commits intohome-assistant:devfrom
Conversation
On HA Yellow (CM4), the default 64 MiB CMA pool is fully consumed at boot by the VideoCore shared-memory driver (vc_sm_cma). Under sustained high-throughput I/O such as cloud backup uploads, the BCM GENET ethernet driver allocates additional DMA ring buffers from CMA; with 0 free pages these allocations fail silently and the interface stalls. The device remains network-dead (CPU alive, watchdog serviced) until a hardware power cycle. gpu_mem cannot be lowered below 32 MiB on Yellow (firmware codecs). Increasing the CMA reservation via cma=256M gives the GPU its memory while leaving headroom for ethernet DMA. Ref: home-assistant#4651
📝 WalkthroughWalkthroughMultiple Raspberry Pi board configurations receive kernel command-line parameter Changes
Estimated code review effort🎯 1 (Trivial) | ⏱️ ~3 minutes Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Same root cause as Yellow: the BCM2712 VideoCore shared-memory driver consumes the default 64 MiB CMA pool at boot, leaving 0 free pages for BCM GENET DMA and any other CMA user under sustained I/O load. Ref: home-assistant#4651
Same root cause as Yellow: the BCM2711 VideoCore shared-memory driver consumes the default 64 MiB CMA pool at boot, leaving 0 free pages for BCM GENET DMA and any other CMA user under sustained I/O load. Previously rpi4-64 had no cmdline.txt of its own and fell back to the parent raspberrypi/cmdline.txt. This change copies that content verbatim and appends cma=256M, so rpi3-64 (which still falls back to the parent) is unaffected. Ref: home-assistant#4651
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@buildroot-external/board/raspberrypi/cmdline.txt`:
- Line 1: rpi3-64 is unintentionally inheriting "cma=256M" via the
hassos-hook.sh fallback, so add a cmdline.txt for the rpi3-64 board that mirrors
the parent cmdline parameters but omits "cma=256M"; specifically create a
rpi3-64/cmdline.txt containing the same kernel cmdline entries shown in the
parent (e.g., dwc_otg.lpm_enable=0 console=tty0 usb-storage.quirks=... ) but
remove the "cma=256M" token so rpi3-64 does not receive the 256MiB CMA
reservation.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 5df28450-a234-4ea6-a0a9-5fb85f4b6f3c
📒 Files selected for processing (2)
buildroot-external/board/raspberrypi/cmdline.txtbuildroot-external/board/raspberrypi/rpi3-64/cmdline.txt
✅ Files skipped from review due to trivial changes (1)
- buildroot-external/board/raspberrypi/rpi3-64/cmdline.txt
|
cc @geerlingguy — flagging because your |
Validation on stock rpi4-64 HAOS (BCM2711, bcmgenet,
|
| Stock | cma=256M |
|
|---|---|---|
CmaTotal |
64 MiB | 256 MiB |
CmaFree idle |
68 kB | 256 092 kB |
| Survival | 3:12 — crash | 20:00 — clean |
| Uploads completed | 17 | 153 |
| Data egressed | ≈1.7 GiB | ≈15.3 GiB |
| Recovery | PoE power-cycle required | n/a |
Caveat on root cause
I did not capture a clean dmesg "cma: alloc failed" line for the crash itself — the kernel went away hard enough that systemd-journald's buffer was lost before flush, and HAOS does not configure /sys/fs/pstore by default, so I can't point at a kernel oops line. What I have is circumstantial but strong: stock-config device with all the preconditions present crashes at 3:12 under this workload and recovers only after a cold power cycle; the same device with cma=256M applied runs the identical workload for 20 minutes with CmaFree rock-steady.
For what it's worth, I also verified that ip link set end0 down / up is not on its own a trigger — bcmgenet retains its DMA rings across ndo_stop/ndo_open, so the CMA alloc path doesn't get exercised by a link bounce alone. The RX refill path under sustained TX load is what pulls from CMA and is what this PR's larger pool is protecting.
Harness usage (TL;DR)
mkdir -p /homeassistant/pof_eth_cma
cp uploadtest.sh status.sh start.sh stop.sh /homeassistant/pof_eth_cma/
chmod +x /homeassistant/pof_eth_cma/*.sh
/homeassistant/pof_eth_cma/start.sh 1200 # 20-minute run
/homeassistant/pof_eth_cma/status.sh # peek
/homeassistant/pof_eth_cma/stop.sh # kill early20 minutes is enough to detonate the bug on stock rpi4-64 or to demonstrate the fix holding. Non-destructive by design, but on affected hardware expect a network outage mid-run — plan to power-cycle.
LGTM from a validation standpoint. Happy to run additional configurations (different cma= values, longer durations, ARM64ec) if anyone wants.
sairon
left a comment
There was a problem hiding this comment.
This is not the right approach for several reasons. It tries to mitigate an issue in a network driver which doesn't appear to be directly connected with CMA, as explained in the linked issue. While this change may lead to improvement, it's not a real fix.
Using cma kernel parameter isn't an ideal fix either, as it effectively replaces any settings coming from device trees/overlays. For example, on Pi 4, default config contains the vc4-fkms-v3d overlay, which sets CMA to ~512M. This change will effectively shrink it, and make it impossible to change using the cma-XXX dtparam.
Last not least, I was at first unable to reproduce the bug, until I realized it's only affecting specific hardware setups (which are using DRAM-less NVMe boot drives), so while we could opt in for a better default for Yellow (as I also wrote in the linked issue), changing it across all RPi targets is unnecessary.
|
Please take a look at the requested changes, and use the Ready for review button when you are done, thanks 👍 |
|
Thanks for the review @sairon — a few points where I think the picture is different than described, based on the hardware in hand. 1. "Only specific hardware setups (DRAM-less NVMe)" Repro matrix on Pi 4 / BCM2711:
The gating condition isn''t DRAM-less NVMe — the lab reproduction in the validation gist is a plain Pi 4 on microSD with no NVMe involved. What appears load-bearing is BCM2711 + 2. " On stock HAOS rpi4-64 (17.2), the parent 64 MiB total, 68 kB free at idle — not 512 MiB. The overlay exposes For anyone who prefers to tune via 3. "Not directly connected with CMA" bcmgenet allocates its TX/RX BD rings via 4. Scope Happy to narrow if preferred — e.g., drop the parent |
|
Follow-up with the full reproduction / validation matrix (should have led with this — thanks @sairon for the nudge):
Key point: the failure reproduces on Pi 4 with and without NVMe. The lab repro in the validation gist is specifically a plain Pi 4 on microSD to rule out NVMe/HMB pressure as the gating condition — and it still detonates at the same 3-minute mark under the same TLS upload workload. NVMe on the production box likely accelerates CMA pressure but is not required to trigger the stall. The post-fix PASS rows are empirical: same hardware, same workload, |
|
There's clearly something wrong in your setup. Like I said before, stock RPi 4 defaults to ~512MiB (actually 508 MiB), so there's no way clean Basically the same applies to Yellow with no NVMe, no stall there either: I am also doubtful of the claims made in the linked issue - there is no direct use of contiguous allocation and my understanding of the code corroborated with AI analysis confirms it:
I don't have the nerves to argue with an AI agent which keeps making up invalid claims. The attempt to link this issue with #4162 is also totally moot - that install was actually using wireless connection and using ethernet supposedly fixed the issues. This PR is using invalid approach - if it's possible to trigger ethernet driver lockups with default RPi device tree, the workaround should at least aim at the device tree itself, not |
|
I am using clean Hass os installs on both platforms, my RPi4 on SD card and RPi4 yellow with nvme On both platforms an install of 17.2 of the os defaults the cma to 64mb, your own run of the script confirms a cma of 64 contradicting your own assertion that cma is set to 512. What am I missing? |
No contradiction, as I wrote, on RPi 4 it was overridden with
On Yellow (the second snippet) it's indeed 64M by default but at the same time I'm not able to trigger the bug without an DRAM-less NVMe (and I don't have it available for testing to confirm it's reproducible at least in the - currently most probable - scenario).
Posting full unprocessed |
|
@sairon quick question — what''s the proper way to get a reproduction image of HAOS? All 3 of my test systems (RPi4 on microSD, RPi4 Yellow with NVMe, CM5) report |
|
I think we're both right and both wrong, and I finally see why we were talking past each other. Where you're right: fresh 2021.10+ installs get Where I'm right: That explains both our observations: your fresh Pi 4 reproduces with 508 MiB CMA and survives; my older installs have 64 MiB and freeze. It also explains why this isn't reported more widely — the affected population is narrow (pre-2021.10 installs that have ridden upgrades forward). Your own #3973 ( Happy to rework this PR along those lines if that's acceptable. |
|
Update with more data — I've extracted the 17.2 release artifacts for each Pi-class variant and parsed every shipped DTB. We were both partially right.
Method: extracted Where that leaves us:
Precedent for the shape of the fix: #3973 added a guarded Happy to rework this along either of these shapes:
Whichever you prefer. |
|
Please stop shoveling LLM output from your AI agent to this PR and re-read the last paragraph of my previous reply. |
Impact
Affects paying Nabu Casa cloud-backup customers on
rpi4-64and HA Yellow (CM4) running stock HAOS: sustained outbound TLS (the nightly cloud backup traffic pattern) exhausts the 64 MiB CMA pool after ~3 minutes,end0goes silently dead, and recovery requires a cold power-cycle. Reproducer and full before/after validated on stock HAOS 17.2 hardware over a 20-minute workload (9× the failure window) — see validation comment.Summary
Append
cma=256Mto the parentraspberrypi/cmdline.txtplus the per-board overrides foryellowandrpi5-64, so the CMA pool on all Broadcom-based HAOS targets has headroom for BCM GENET ethernet DMA ring allocations under sustained high-throughput I/O.Changes:
buildroot-external/board/raspberrypi/cmdline.txtbuildroot-external/board/raspberrypi/yellow/cmdline.txtcma=256Mappendedbuildroot-external/board/raspberrypi/rpi5-64/cmdline.txtcma=256Mappendedrpi3-64has no owncmdline.txtand will therefore inheritcma=256Mfrom the parent viahassos-hook.sh. This is acceptable and intentional:rpi3-64is out of support in HAOS, so no dedicated override is provided to opt it out. (An earlier revision of this PR did add anrpi3-64/cmdline.txtopt-out override and it was reverted in commit5273b8efor this reason.)Fixes the network-dead symptom reported in #4651.
Problem
On HA Yellow (CM4), the default 64 MiB CMA pool is fully consumed at boot by the VideoCore shared-memory driver (
vc_sm_cma):CMA stays at 0 free of 16384 total pages for the full uptime. BCM GENET comes up at 1 Gbps fine, but under sustained high-throughput upload (cloud backup, ~20 min) its attempts to allocate additional DMA ring buffers fail silently and the ethernet controller stalls. CPU stays alive and services the hardware watchdog, so no reboot — the device just becomes unreachable until a PoE power cycle.
The same root cause (Broadcom VideoCore + GENET sharing a 64 MiB CMA pool) affects all BCM2711/BCM2712-based HAOS targets: Yellow, rpi4-64, rpi5-64.
Why
cma=256Mand not lowergpu_memYellow''s
config.txtalready has:gpu_memcan''t be lowered further. Increasing the CMA reservation on the kernel command line is the minimal, safe change:gpu_memstays untouched and GENET DMA gets plenty of headroom. 256 MiB on a 4–8 GB board is ~3–6 % of RAM.On rpi5-64 the pressure is if anything higher (
dtoverlay=vc4-kms-v3d+max_framebuffers=2+ camera/display auto-detect all pull from CMA), so the same reservation applies there.Why not HA Green
HA Green uses a Rockchip SoC (
BOARD_ID=green, SPL boot,ttyS2@1500000), not a Broadcom chip — it doesn''t loadvc_sm_cmaand isn''t affected by this bug.Test plan
Discovery (CM4, production): The bug was hit on a HA Yellow (BCM2711 / CM4) running HAOS 17.2 during a nightly cloud backup. See #4651 for the full dmesg, reproduction steps, and failure chain. The device went network-dead mid-upload and required a PoE power cycle to recover.
Validation (CM5, lab): The fix was validated on a CM5 (Raspberry Pi Compute Module 5, BCM2712) on a Waveshare CM5 carrier board, 8 GB RAM, kernel
6.12.47-haos-raspi, withcma=256Mapplied to/proc/cmdlinevia manual edit matching this PR''s effect.Post-boot state:
CmaTotal = 262144 kB(256 MiB — matchescma=256M)CmaFree ≈ 250512 kB(~244 MiB free) — VideoCore takes only ~12 MiB now that it has roomWorkload: sustained outbound HTTPS POST to
speed.cloudflare.com/__upin a tight loop, streaming 100 MiB random chunks viacurl --data-binary @-. Approximates the cloud-backup traffic pattern from #4651.Results over a 20-minute run (2026-04-16 20:03 → 20:23 UTC):
CmaFreeunique values during run250512 kB— single value, zero drift, zero CMA allocationsend0operstate samplesup, zero dropscma: __cma_alloc: ... alloc failedin dmesgCmaTotal=262144 kB,CmaFree=250512 kB(unchanged from pre-test)Counter-test (reverting
cma=256Mto reproduce the failure on the same CM5) is running overnight; results will be appended to this PR.rpi4-64 and rpi5-64 share the same root cause as Yellow (Broadcom VideoCore + GENET on a shared CMA pool), so the fix is symmetric. No hardware-flashed test on rpi4-64 was performed — the author does not have a spare Yellow/rpi4 for the validation slot.
Related
vc_sm_cmapresent, still reproduces on v17. Different surface symptoms (wlan0 drops, CIFS warnings) but shares the Broadcom VideoCore + CMA exhaustion pattern. Users on HA crashing on Pi 4 after HAOS 16.0 upgrade #4162 may want to test acma=256MHAOS build.Side note: why isn''t
cma=256Mthe default upstream?A fair question, and worth leaving here for any other Pi-based project that hits this pattern. Short answer: historical inertia plus a 1 GB-SKU floor.
The 64 MiB CMA default dates from the Pi 2 era (≈2015). VideoCore IV was modest, most boards had 1 GB RAM, and 64 MiB was enough. Nobody revisited the default when boards got more RAM or when VideoCore got hungrier.
The Pi Foundation still ships a 1 GB Pi 4B.
cma=256Mis 25 % of a 1 GB board — not acceptable as a global default. Upstream has to pick one number for everything from 1 GB Pi 4B to 16 GB Pi 5, and 64 MiB is the lowest common denominator.BCM2711/BCM2712 got much more CMA-hungry than VC4.
gpu_mem=inconfig.txt— not from CMA. Tunable and predictable.vc4-kms-v3d+vc_sm_cma): V3D, KMS framebuffers, camera, HEVC all allocate from CMA at runtime.gpu_mem=no longer helps — VideoCore pulls from the 64 MiB CMA pool regardless of the split. The default never caught up with this transition.The bug is latent on almost every workload. Pi-hole, RetroPie, camera projects, desktop use — none of them sustain 20+ minutes at ~30 MB/s outbound hard enough to force GENET to grow a new DMA ring. Home Assistant''s Nabu Casa cloud-backup is a near-pathological workload for this bug; most Pi users will never trip it, so reports to the Pi Foundation stay rare and the default stays.
HAOS can do better than upstream defaults because it knows its targets. The Pi Foundation ships a single default for every SKU; HAOS ships per-board configs and can opt the higher-memory Broadcom targets into
cma=256Mwithout touching the 1 GB SKU.If you maintain another Pi-based distribution or project that does sustained networking (NAS, IoT gateway, stream ingestion, backup server), you probably want
cma=256Mon BCM2711/BCM2712 targets too. The symptom — ethernet silently stalling minutes into a long upload with no kernel panic and the CPU still alive — is easy to misdiagnose as a cable, switch, or Wi-Fi issue.