bug(slurm): 95 zombie [srun] processes leaking under daphne PID 1 #148

@ywatanabe1989

Description

Symptom

docker exec scitex-cloud-prod-django-1 ps aux shows 95 [srun] <defunct> zombie processes, all with PPID 1 (daphne).

     95 1   ← count and PPID; PPID 1 = daphne; zombies have been accumulating since Apr 15

Root cause

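Code that spawns srun as a child process is not waiting on (reaping) it. subprocess.run waits for its child itself, so the likelier culprit is a fire-and-forget subprocess.Popen whose wait() / communicate() is never called. Daphne (ASGI) runs as PID 1 and does not install a SIGCHLD handler or any other subprocess reaper, so exited children stay in the process table as zombies.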
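The mechanism can be reproduced outside the container. The following is a minimal sketch, assuming Linux (it reads /proc) and using a short-lived Python child as a stand-in for srun. The helper names child_state and demo are hypothetical and do not come from the scitex-cloud code base.

```python
import os
import subprocess
import sys


def child_state(pid: int) -> str:
    """Return the one-letter process state from /proc/<pid>/stat (Linux only)."""
    with open(f"/proc/{pid}/stat", errors="replace") as f:
        # The state field is the first token after the ")" closing the command name.
        return f.read().rsplit(")", 1)[1].split()[0]


def demo():
    # Stand-in for srun: a child that exits immediately.
    proc = subprocess.Popen([sys.executable, "-c", "pass"])

    # Block until the child has exited, but leave it unreaped (WNOWAIT),
    # which is the situation a fire-and-forget Popen ends up in.
    os.waitid(os.P_PID, proc.pid, os.WEXITED | os.WNOWAIT)
    state_before_wait = child_state(proc.pid)  # zombie: exited, not yet reaped

    # Reaping the child collects its exit status and removes the zombie.
    returncode = proc.wait()
    return state_before_wait, returncode


if __name__ == "__main__":
    print(demo())
```

The child stays a zombie until its parent calls wait(); in the container that parent is daphne, which never does.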

Impact

Fix

  1. Find the srun invocation site:
    grep -rn "srun" src/ apps/
  2. Replace subprocess.Popen('srun ...') fire-and-forget with either:
    • asyncio.create_subprocess_exec(...) + await proc.wait() in async views
    • subprocess.run(..., check=False) in sync paths
  3. Alternatively run daphne under tini (docker run --init) so PID 1 reaps orphans: add init: true in the django service of docker-compose.prod.yml.
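Options 1 and 2 of step 2 can be sketched as follows. This is an illustration under assumptions, not the actual call site: the function names run_async and run_sync are hypothetical, and the srun arguments in the usage comment are placeholders.

```python
import asyncio
import subprocess
import sys


async def run_async(cmd):
    """Async path: spawn the command and await its exit so it is reaped."""
    proc = await asyncio.create_subprocess_exec(*cmd)
    # wait() returns the exit code once the event loop has reaped the child.
    return await proc.wait()


def run_sync(cmd):
    """Sync path: subprocess.run blocks until the child exits and reaps it."""
    # check=False: a non-zero exit code is returned instead of raising.
    return subprocess.run(cmd, check=False).returncode


if __name__ == "__main__":
    # Placeholder command; in the real code this would be the srun invocation,
    # e.g. ["srun", "--ntasks=1", "hostname"] (arguments are an assumption).
    cmd = [sys.executable, "-c", "pass"]
    print(asyncio.run(run_async(cmd)))
    print(run_sync(cmd))
```

Both variants pass the command as an argument list rather than a single shell string, so no intermediate shell process sits between daphne and srun.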

Acceptance

After 24 h of prod traffic: ps -eo stat | grep -c ^Z in the django container is 0.
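If the check needs to run from Python (for example in a health check), the same count can be taken from /proc. This is a minimal sketch assuming a Linux container; count_zombies is a hypothetical helper, not an existing function in the project.

```python
import os


def count_zombies(proc_root="/proc"):
    """Count processes whose state is Z, like `ps -eo stat | grep -c ^Z`."""
    count = 0
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only numeric directories are PIDs
        try:
            with open(os.path.join(proc_root, entry, "stat"), errors="replace") as f:
                # The state field is the first token after the ")" closing the command name.
                state = f.read().rsplit(")", 1)[1].split()[0]
        except (OSError, IndexError):
            continue  # process vanished, is unreadable, or the line is malformed
        if state == "Z":
            count += 1
    return count


if __name__ == "__main__":
    print(count_zombies())
```

The acceptance criterion corresponds to count_zombies() returning 0 inside the django container.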

Labels: scitex-cloud (touches scitex-cloud / scitex.ai)