fix(lora): wait for in-flight Civitai downloads before pipeline __init__ (#937) #940

Open
livepeer-tessa wants to merge 3 commits into main from fix/937-lora-download-race

@livepeer-tessa
Contributor

Problem

On session reinitialisation the frontend concurrently:

  1. Calls POST /api/v1/loras to re-download the Civitai LoRA (async)
  2. Calls POST /api/v1/pipeline/load, which triggers LongLivePipeline.__init__ -> _init_loras() (sync)

If the pipeline load wins the race the file doesn't exist yet and PeftLoRAStrategy.load_adapters_from_list raises FileNotFoundError, surfacing as a spurious "Some pipelines failed to load" error. The session self-heals on retry once the download completes (~60–90 s), but wastes time and pollutes the error logs.
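The race can be reproduced in miniature with a background thread standing in for the download. This is only an illustrative sketch: `fake_download`, `fake_load_adapter`, and `reproduce_race` are hypothetical stand-ins, not functions from the codebase, and the delay is shortened from the observed 60-90 s to a fraction of a second.

```python
import tempfile
import threading
import time
from pathlib import Path


def fake_download(dest: Path, delay_s: float) -> None:
    """Hypothetical stand-in for the async Civitai download."""
    time.sleep(delay_s)
    dest.write_bytes(b"fake-lora-weights")


def fake_load_adapter(path: Path) -> int:
    """Hypothetical stand-in for the sync loader: fails if the file is absent."""
    if not path.exists():
        raise FileNotFoundError(path)
    return len(path.read_bytes())


def reproduce_race(delay_s: float = 0.2) -> tuple[bool, int]:
    """Return (first_attempt_failed, bytes_loaded_on_retry)."""
    with tempfile.TemporaryDirectory() as tmp:
        lora_path = Path(tmp) / "adapter.safetensors"
        download = threading.Thread(target=fake_download, args=(lora_path, delay_s))
        download.start()  # (1) async re-download begins
        try:
            fake_load_adapter(lora_path)  # (2) sync load runs before the file lands
            first_attempt_failed = False
        except FileNotFoundError:
            first_attempt_failed = True
        download.join()  # the download eventually completes
        size = fake_load_adapter(lora_path)  # the retry succeeds
        return first_attempt_failed, size
```

With the default delay the first load attempt is expected to fail and the retry to succeed, which mirrors the self-healing behaviour described above.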

This is distinct from:

Observed in job a8a03ca5-6fce-4cdd-8bca-580e8fbafeeb (scope-app--prod) on 2026-04-13 ~23:38 UTC.

Fix

Add _wait_for_lora_files() to LoRAEnabledPipeline._init_loras() in mixin.py.

Before delegating to LoRAManager, the function polls for each missing LoRA file with a 2 s interval, up to 120 s.

Negligible overhead on the normal (warm cache) path: files that already exist are filtered out by a Path.exists() check before the poll loop is entered, so no sleep occurs.

After the timeout a warning is logged and execution continues — the strategy loader still raises for genuinely missing files, so error behaviour for permanent failures is unchanged.

# New helper in mixin.py
from pathlib import Path

def _wait_for_lora_files(lora_configs, timeout_s=120, poll_s=2.0):
    pending = [cfg["path"] for cfg in lora_configs
               if cfg.get("path") and not Path(cfg["path"]).exists()]
    if not pending:
        return  # fast path — no wait
    # poll until files appear or timeout
    ...

Testing

  • Warm-cache path (files present): No change in behaviour; pending is empty and the function returns immediately.
  • Race path (file downloading): Pipeline init blocks in the poll loop (2 s interval) until the file lands, then proceeds normally.
  • Genuinely missing file: After the 120 s timeout the helper logs a warning and falls through; PeftLoRAStrategy still raises FileNotFoundError, which propagates as before.

Closes #937

Tessa (livepeer-tessa) added 3 commits April 14, 2026 06:25
…peline ID (#936)

Signed-off-by: Tessa (livepeer-tessa) <tessa@livepeer.org>
…FoundError

Fixes #937.

When a session is re-initialised with a Civitai-hosted LoRA, the local
Civitai download is re-triggered asynchronously while pipeline_manager
concurrently calls LongLivePipeline.__init__.  load_lora_weights was
called synchronously before the download completed, causing a spurious
FileNotFoundError that failed the pipeline load and left the session
needing a manual retry.

Fix: add _wait_for_lora_file() in lora/utils.py that polls for the
file's existence before raising.  The timeout defaults to 120 s and is
configurable via SCOPE_LORA_DOWNLOAD_WAIT_TIMEOUT.  No change to
callers; existing FileNotFoundError semantics are preserved when the
file genuinely does not appear within the timeout.

Also adds tests/test_lora_wait_for_file.py covering:
- file already present → returns immediately
- file appears during wait → returns True
- file never appears → returns False / raises after timeout
- timeout=0 → disables waiting
- SCOPE_LORA_DOWNLOAD_WAIT_TIMEOUT env var override

Signed-off-by: Tessa (livepeer-tessa) <tessa@livepeer.org>
On session reinitialisation the frontend concurrently (a) re-downloads a
Civitai-hosted LoRA and (b) calls POST /api/v1/pipeline/load, which
triggers LongLivePipeline.__init__ -> _init_loras().  If the pipeline
__init__ wins the race the file doesn't exist yet and
PeftLoRAStrategy.load_adapters_from_list raises FileNotFoundError, which
surfaces as a spurious 'Some pipelines failed to load' error.  The
session recovers on the next retry (~60-90 s later) once the download
completes.

Fix: add _wait_for_lora_files() to LoRAEnabledPipeline._init_loras().
Before delegating to LoRAManager it polls for each missing LoRA file up
to 120 s (poll every 2 s).  Files already present are skipped
immediately, so there is zero overhead on the normal (warm cache) path.
After the timeout a warning is logged and execution continues — the
strategy loader still raises its own error for genuinely missing files.

Fixes #937

Signed-off-by: Tessa (livepeer-tessa) <tessa@livepeer.org>
@coderabbitai

coderabbitai Bot commented Apr 14, 2026

Important

Review skipped

Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

@github-actions
Contributor

🚀 fal.ai Preview Deployment

App ID daydream/scope-pr-940--preview
WebSocket wss://fal.run/daydream/scope-pr-940--preview/ws
Commit 7c70769

Livepeer Runner

App ID daydream/scope-livepeer-pr-940--preview
WebSocket wss://fal.run/daydream/scope-livepeer-pr-940--preview/ws
Auth private

Testing Livepeer Mode

SCOPE_CLOUD_MODE=livepeer SCOPE_CLOUD_APP_ID="daydream/scope-livepeer-pr-940--preview/ws" uv run daydream-scope

Development

Successfully merging this pull request may close these issues.

[fal.ai] longlive: LoRA file not found on session reinit — Civitai download races pipeline __init__ for [flux.2.klein] LoRA