Symptom
docker exec scitex-cloud-prod-django-1 ps aux shows 95 [srun] <defunct> zombie processes, all with PPID 1 (daphne).
95 1 ← PPID 1 = daphne; zombies have been accumulating since Apr 15
Root cause
Code that spawns srun via subprocess never waits on (reaps) its children. A fire-and-forget subprocess.Popen is the likely culprit, since subprocess.run waits for its child before returning. Daphne (ASGI) runs as PID 1 and does not install a SIGCHLD handler or subprocess reaper, so exited children remain as zombies.
Impact
Fix
- Find the srun invocation site:
  grep -rn "srun" src/ apps/
- Replace the fire-and-forget subprocess.Popen('srun ...') with either:
  - asyncio.create_subprocess_exec(...) + await proc.wait() in async views
  - subprocess.run(..., check=False) in sync paths
- Alternatively, run daphne under tini (docker run --init) so that PID 1 reaps orphans: add init: true to the django service in docker-compose.prod.yml.
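
The two replacement patterns from the list above could look roughly like the sketch below. It is not the project's actual code: the real srun command line is not shown in this issue, so DEMO_CMD is a hypothetical stand-in that runs a trivial Python child, and run_sync / run_async are illustrative names. The async variant uses proc.communicate(), which waits for exit just as proc.wait() does but also drains the pipes, so a chatty child cannot block on a full pipe.

```python
import asyncio
import subprocess
import sys

# Hypothetical stand-in for the real srun command line (not shown in the issue).
# A trivial Python child keeps the sketch runnable on a machine without Slurm.
DEMO_CMD = [sys.executable, "-c", "print('job done')"]


def run_sync(cmd):
    """Sync path: subprocess.run blocks until the child exits, so it is reaped."""
    result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    return result.returncode, result.stdout


async def run_async(cmd):
    """Async path: awaiting communicate() waits for exit and reaps the child."""
    proc = await asyncio.create_subprocess_exec(
        *cmd,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    stdout, _stderr = await proc.communicate()
    return proc.returncode, stdout.decode()
```

In a sync view this would be called as run_sync(["srun", ...]), and in an async view as await run_async(["srun", ...]). Passing the command as a list avoids shell=True and the quoting issues that come with it.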
Acceptance
After 24 h of prod traffic, ps -eo stat | grep -c ^Z run in the django container prints 0.
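
If the acceptance check should run from Python (for example in a health check or a scheduled test) rather than through ps, a minimal sketch could read the process state straight from /proc. This assumes a Linux container with /proc mounted. count_zombies is a hypothetical helper, not something already in the codebase, and the proc_root parameter exists only so the function can be pointed at a fake directory in tests.

```python
from pathlib import Path


def count_zombies(proc_root="/proc"):
    """Count processes whose state field in <proc_root>/<pid>/stat is 'Z'."""
    zombies = 0
    for entry in Path(proc_root).iterdir():
        if not entry.name.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        try:
            stat = (entry / "stat").read_text(errors="replace")
        except OSError:
            continue  # the process vanished between listing and reading
        # Line format is "pid (comm) state ...". comm may itself contain
        # spaces or ')', so split on the last ')' before reading the state.
        fields = stat.rsplit(")", 1)[-1].split()
        if fields and fields[0] == "Z":
            zombies += 1
    return zombies
```

The acceptance criterion then becomes count_zombies() == 0 when evaluated inside the django container.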