
update vllm dockerfile for ray #1392

Open
wedu-nvidia wants to merge 3 commits into main from wedu/update-ray

Conversation

@wedu-nvidia
Collaborator

@wedu-nvidia wedu-nvidia commented Apr 21, 2026

Summary by CodeRabbit

  • New Features

    • Multi-node deployments now automatically enable the distributed execution backend for improved scalability.
  • Chores

    • Updated container base image to a newer release.
    • Added performance-related packages to the container build to improve runtime behavior.

Signed-off-by: Wei Du <wedu@nvidia.com>
@wedu-nvidia wedu-nvidia requested a review from Kipok April 21, 2026 22:33
@coderabbitai
Contributor

coderabbitai Bot commented Apr 21, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: dc97fd16-7797-4c91-b08a-19f0ebc9c8f0

📥 Commits

Reviewing files that changed from the base of the PR and between 2a64753 and a3b8eed.

📒 Files selected for processing (1)
  • nemo_skills/inference/server/serve_vllm.py

📝 Walkthrough


Upgraded the vLLM Docker base image and added a new Python dependency; adjusted the vLLM server launcher to inject a Ray distributed-executor flag when running with multiple nodes unless an explicit backend flag is already provided.

Changes

dockerfiles/Dockerfile.vllm (Docker Base Image & Dependencies)
Bumped the base image from vllm/vllm-openai:v0.14.1 to vllm/vllm-openai:v0.18.1 and added pip install "ray[cgraph]" alongside the existing installs.

nemo_skills/inference/server/serve_vllm.py (vLLM Serve CLI Behavior)
When args.num_nodes > 1, detect whether --distributed-executor-backend was supplied (in exact or = form); if not, inject --distributed-executor-backend=ray into the constructed vLLM api_server command.
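
The multi-node flag injection described above can be sketched as follows. This is an illustrative sketch, not the actual serve_vllm.py code; the function name and `FLAG` constant are assumptions for the example.

```python
# Sketch of the described behavior: for multi-node runs, append the Ray
# executor backend flag unless the user already supplied one, in either
# "--flag value" or "--flag=value" form. Names are illustrative.
FLAG = "--distributed-executor-backend"

def maybe_add_ray_backend(num_nodes, extra_args):
    """Return extra_args, possibly extended with the Ray backend flag."""
    has_backend = any(a == FLAG or a.startswith(FLAG + "=") for a in extra_args)
    if num_nodes > 1 and not has_backend:
        return extra_args + [FLAG + "=ray"]
    return extra_args
```

For example, `maybe_add_ray_backend(2, [])` yields a command line that includes `--distributed-executor-backend=ray`, while a user-supplied `--distributed-executor-backend=mp` is left untouched.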

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 0.00%, which is below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.

✅ Passed checks (4 passed)

  • Description Check ✅ Passed: Check skipped because CodeRabbit's high-level summary is enabled.
  • Title check ✅ Passed: The title 'update vllm dockerfile for ray' accurately describes the main change: upgrading the vLLM Docker image and adding Ray support. It is concise and specific.
  • Linked Issues check ✅ Passed: Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check ✅ Passed: Check skipped because no linked issues were found for this pull request.




Signed-off-by: Wei Du <wedu@nvidia.com>
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
dockerfiles/Dockerfile.vllm (1)

2-2: Pin ray[cgraph] to a specific version to avoid dependency drift.

Line 2 installs Ray without a version pin, making rebuilds non-reproducible. Ray's transitive dependencies (particularly aiohttp) can vary across releases, introducing unexpected conflicts.

♻️ Proposed change
+ARG RAY_VERSION=2.55.0
-RUN pip install "ray[cgraph]"
+RUN pip install --no-cache-dir "ray[cgraph]==${RAY_VERSION}"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@dockerfiles/Dockerfile.vllm` at line 2, The Dockerfile.vllm currently runs
pip install "ray[cgraph]" without a version, which leads to non-reproducible
builds; update the RUN pip install step that references "ray[cgraph]" to pin Ray
to the project's tested version (e.g., add an explicit version specifier or use
your constraints file) so transitive deps like aiohttp remain stable, and ensure
you update any corresponding documentation/constraints (requirements.txt or
constraints.txt) and CI images to match the pinned version.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 9abc8c5a-b5b4-43a3-bfa3-9ca13996b899

📥 Commits

Reviewing files that changed from the base of the PR and between 94478e5 and 2a64753.

📒 Files selected for processing (1)
  • dockerfiles/Dockerfile.vllm

Comment on lines +1 to 5
FROM vllm/vllm-openai:v0.18.1
RUN pip install "ray[cgraph]"
RUN pip install "vllm[audio]"
# Required by vLLM for Qwen-VL model family (runtime dependency, not directly imported)
RUN pip install qwen-vl-utils
Contributor


⚠️ Potential issue | 🟠 Major

Run the container as a non-root user.

No explicit USER is set, so runtime may default to root. Please set a non-root user to reduce container breakout impact.

🔒 Example hardening
 FROM vllm/vllm-openai:v0.18.1
 RUN pip install "ray[cgraph]"
 RUN pip install "vllm[audio]"
 # Required by vLLM for Qwen-VL model family (runtime dependency, not directly imported)
 RUN pip install qwen-vl-utils
+RUN useradd --create-home --uid 10001 appuser
+USER appuser
🧰 Tools
🪛 Trivy (0.69.3)

[error] 1-1: Image user should not be 'root'

Specify at least 1 USER command in Dockerfile with non-root user as argument

Rule: DS-0002

Learn more

(IaC/Dockerfile)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@dockerfiles/Dockerfile.vllm` around lines 1 - 5, The Dockerfile currently
runs as root (base image FROM vllm/vllm-openai:v0.18.1) and lacks a USER
directive; add steps to create a non-root user (e.g., create a group and user,
set a home or workdir, chown any needed directories after installing packages),
switch to that user with a USER <username> instruction before finishing the
image, and ensure any runtime writable paths (WORKDIR or volumes) are owned by
that user so the container does not default to root at runtime.

Collaborator

@Kipok Kipok left a comment


Can you run GPU and Slurm tests on this?

args, unknown = parser.parse_known_args()

extra_arguments = join(unknown)
has_distributed_backend = any(
    arg == "--distributed-executor-backend" or arg.startswith("--distributed-executor-backend=")
    for arg in unknown
)
Collaborator


We shouldn't need this change, I think: the user args will always override the default anyway. So we can just check for multinode and, if it's set, always add the argument.
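
The suggested simplification could look roughly like this (a sketch under the reviewer's assumption that a later user-supplied flag overrides the default when vLLM parses the command; `build_backend_args` is a hypothetical name):

```python
# Illustrative sketch of the suggested simplification: for multi-node
# runs, always emit the Ray backend flag, with no detection step, and
# rely on any user-supplied backend flag appended later to take precedence.
def build_backend_args(num_nodes):
    if num_nodes > 1:
        return ["--distributed-executor-backend=ray"]
    return []
```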

Collaborator Author


Sure, we can get rid of this.

@Kipok
Collaborator

Kipok commented Apr 21, 2026

we could combine with upgrade to sglang and trtllm as well so that we do them all in one batch of tests

@wedu-nvidia
Collaborator Author

> we could combine with upgrade to sglang and trtllm as well so that we do them all in one batch of tests

Why did I not see sglang and trtllm dockerfiles?

Do you suggest revising these?

_containers = {
"trtllm": "nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc1",
"sglang": "lmsysorg/sglang:v0.5.8",
}

@Kipok
Collaborator

Kipok commented Apr 21, 2026

yes, as well as all configs
