Only update KV prefix cache on a good cache hit #1817
Merged
Conversation
Force-pushed dd89ace to dbed914
Force-pushed dbed914 to a6f3ffe
Evanev7 (Member) approved these changes on Mar 30, 2026 and left a comment:

requires testing but code looks good!
ttupper92618 pushed a commit to Foxlight-Foundation/Skulk that referenced this pull request on Apr 1, 2026:

Addresses exo-explore#1816

Update the prefix cache only when the prefix hit length > min_prefix_hit_length **and** the hit ratio > _MIN_PREFIX_HIT_RATIO_TO_UPDATE.
min_prefix_hit_length = max(1000, system prompt length) -> system prompts must match exactly.

Test on OpenCode and Claude Code
ttupper92618 added a commit to Foxlight-Foundation/Skulk that referenced this pull request on Apr 1, 2026:
* send error finish reason on failing to parse a tool call (exo-explore#1785)

  A simplification of exo-explore#1757, which is now stale.

* Add SSE-keepalive to not time out on long prefill on clients (exo-explore#1803)

* fix: DeepSeek V3.2 warmup crash and tool calling + add catalog cards (exo-explore#1769)

  DeepSeek V3.2 (`DeepseekV32ForCausalLM`) is already supported by exo's inference engine (architecture whitelisted in `model_cards.py`, DSML encoding added in exo-explore#1548), but **doesn't work out of the box** due to two bugs:

  1. `warmup_inference()` in `generate.py` accepts `model_id: ModelId` as a parameter but creates `TextGenerationTaskParams(model=ModelId(""), ...)` instead of using it. Since `_needs_dsml_encoding()` checks `"deepseek-v3.2" in task_params.model.lower()`, the empty string never matches → falls back to `tokenizer.apply_chat_template()` → **ValueError** because V3.2 has no Jinja chat template. **Fix:** `model=ModelId("")` → `model=model_id` (one line).
  2. `_needs_dsml_encoding()` returns `True` only when `task_params.tools` is present or tool messages exist in `chat_template_messages`. For warmup and regular chat requests without tools → `return False` → Jinja fallback → **ValueError**. Unlike V3.1 (which has a `.jinja` chat template file that transformers picks up automatically), V3.2 **has no Jinja template at all** — it uses Python-based DSML encoding for all message types. **Fix:** For V3.2, always return `True` — DSML encoding handles all message types.

  Added inference model cards for:
  - `mlx-community/DeepSeek-V3.2-8bit`
  - `mlx-community/DeepSeek-V3.2-4bit`

  Parameters taken from model `config.json` on HuggingFace, storage sizes from the HF API. Capabilities include `thinking_toggle` (related: exo-explore#1456).

  Note: the model ID string matching approach (`"deepseek-v3.2" in model.lower()`) is acknowledged tech debt — see exo-explore#1371 for the planned architecture-based approach.

  Test plan:
  - [x] Start exo with a DeepSeek V3.2 model → warmup should complete without crash
  - [x] Send a regular chat message (no tools) → should get a response
  - [x] Send a chat message with tools → should work as before
  - [x] V3.2 cards should appear in the dashboard model catalog

  Co-authored-by: user <user@m1.note>
  Co-authored-by: Ryuichi Leo Takashige <leo@exolabs.net>
  Co-authored-by: Evan <evanev7@gmail.com>

* Fix Nemotron cache leak upstream (exo-explore#1819)

  Nemotron Cascade and Nano were failing at long decodes. Fixed upstream; just change pyproject and uv lock here. Tested with a reproduce script upstream.

* Improve batch performance and stats reporting (exo-explore#1777)

  Batch generation reports incorrect statistics, as mlx-lm never clears the original stats, meaning they get polluted over time. The dashboard also seems considerably slower than bench statistics. We also have a large discrepancy between B=1 batch generation and mlx_generate. Extracting logprobs is massively expensive, causing up to a 25% slowdown compared to pure batching.

  ```
  [ 12:02:01.1240AM | INFO ] step overhead: 3.49ms (next=12.49ms total=15.99ms)
  [ 12:02:02.1600AM | INFO ] step overhead: 3.23ms (next=13.01ms total=16.24ms)
  [ 12:02:03.2228AM | INFO ] step overhead: 3.28ms (next=13.38ms total=16.66ms)
  [ 12:02:04.2798AM | INFO ] step overhead: 3.25ms (next=12.84ms total=16.10ms)
  [ 12:02:05.3152AM | INFO ] step overhead: 3.18ms (next=12.61ms total=15.79ms)
  [ 12:02:06.3522AM | INFO ] step overhead: 3.41ms (next=12.83ms total=16.25ms)
  [ 12:02:07.3987AM | INFO ] step overhead: 3.38ms (next=13.14ms total=16.52ms)
  [ 12:02:08.4537AM | INFO ] step overhead: 1.84ms (next=19.44ms total=21.28ms)
  ```

  Changes:
  1. Report stats ourselves instead of using mlx-lm's stats for batch generation (they use perf_counter anyway).
  2. Adjust exo bench to match.
  3. Improve logprobs extraction speed by 10x, improving tps for the dashboard and any requests for logprobs.
  4. Use an SSE comment to align the speed to the real numbers at the end of generation.
  5. Patch mlx for several optimizations given our assumptions and use cases (e.g. use vllm-style RoPE).
  6. Switch the MLX LM version to latest main, including support for Nemotron Super and some Qwen3.5 fixes.

  Why it works:
  1. Exo bench no longer reports polluted stats.
  2. Exo bench now handles the reported per-request stats rather than the aggregate stats.
  3. The decode speed now jumps back to a real number at the end of the generation.
  4. Large batch speedup for rotating KV cache models + 1:1 matching cache with vllm.

  Needs testing on OpenCode and CC. Needs eval testing.

  Only going to show the performance optimization difference after the accurate reporting:

  **GPT OSS 20B MXFP4 Q8 (large change)**
  Before:
  <img width="2466" height="1534" alt="image" src="https://github.com/user-attachments/assets/88b50637-fca2-4db4-9413-b9eee6e2057e" />
  <img width="2410" height="1240" alt="image" src="https://github.com/user-attachments/assets/21e5c76a-2f5f-44d2-8953-121b3ebdbd68" />
  After:
  <img width="2476" height="1472" alt="image" src="https://github.com/user-attachments/assets/fec5cfbd-fff8-430a-b12e-a329410107a2" />
  <img width="2454" height="1236" alt="image" src="https://github.com/user-attachments/assets/0400344b-a4a6-42c0-a9dd-4ee91ade714a" />

  **Qwen 3.5 35B A3B 8bit (no change)**
  Before:
  <img width="2414" height="1396" alt="image" src="https://github.com/user-attachments/assets/e75f0b38-df5d-49fd-ab90-bc1667d981b3" />
  After:
  <img width="2346" height="1234" alt="image" src="https://github.com/user-attachments/assets/eabfb59c-851f-4d88-b927-e1e699a75cc6" />

  **Llama 3.2 1B Instruct 4bit (small change)**
  Before:
  <img width="2516" height="1220" alt="image" src="https://github.com/user-attachments/assets/c2873655-acff-4536-8263-fb8aea33db80" />
  After:
  <img width="2566" height="1370" alt="image" src="https://github.com/user-attachments/assets/15f95c75-1c2f-4474-85a2-88c4d0a32543" />

* Only update KV prefix cache on a good cache hit (exo-explore#1817)

  Addresses exo-explore#1816

  Update the prefix cache only when the prefix hit length > min_prefix_hit_length **and** the hit ratio > _MIN_PREFIX_HIT_RATIO_TO_UPDATE.
  min_prefix_hit_length = max(1000, system prompt length) -> system prompts must match exactly.

  Test on OpenCode and Claude Code

* Prefer higher model download % for placement (exo-explore#1767)

  Motivation: when placing a model instance across the cluster, the master previously only considered available RAM. This meant it could pick a node that hasn't downloaded the model yet, even when another node already has it (or is further along in downloading it).

  Changes:
  - Added a download_status parameter to place_instance() in placement.py
  - Added _get_node_download_fraction() to compute 0.0–1.0 download progress per node/model
  - Added _cycle_download_score() to sum download fractions across a cycle's nodes
  - Cycle selection now uses a (download_score, available_ram) tuple key — download progress is the primary sort, RAM is the tiebreaker
  - Passed self.state.downloads into place_instance() from master/main.py

  Why it works: Python's tuple comparison gives download progress strict priority over RAM, so a node with the model already downloaded will always be preferred over one with more free RAM but no download (see the tuple-key sketch after this commit message).

  Automated testing: 3 new tests cover completed download preferred, higher partial progress preferred, and failed download not preferred over a no-download node.

* Prefer higher % downloaded nodes for API placement previews (exo-explore#1795)

  Follow-up to exo-explore#1767: same thing for placement previews through the API.

* docs: add module and function docstrings to cherry-picked patches

  New mlx-lm patch modules from upstream lacked documentation for docs generation. Adds module-level and public function docstrings to all four patch files explaining their purpose and behaviour.

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: verify model exists on disk before re-emitting DownloadCompleted

  When an embedding model instance was deleted and re-run, the download coordinator still had a cached DownloadCompleted status and would re-emit it without checking whether the model directory still existed on disk. This caused "Model not found on disk" errors requiring a manual node cache purge.

  Now _start_download validates that the model directory and config.json still exist before re-emitting DownloadCompleted. If the directory is gone, the stale cache entry is cleared and a fresh download begins (see the on-disk check sketch after this commit message).

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: typos in comments and log messages

  - "compatability" -> "compatibility" in YarnRoPE docstring
  - "Fakse" -> "False" in utils_mlx comment
  - "succesfully" -> "successfully" in runner_supervisor logs

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Evan Quiney <evanev7@gmail.com>
Co-authored-by: rltakashige <rl.takashige@gmail.com>
Co-authored-by: vskiwi <141816715+vskiwi@users.noreply.github.com>
Co-authored-by: user <user@m1.note>
Co-authored-by: Ryuichi Leo Takashige <leo@exolabs.net>
Co-authored-by: ciaranbor <81697641+ciaranbor@users.noreply.github.com>
Co-authored-by: Thomas Tupper <kite3@kite3.local>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
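For reference, the tuple-key cycle selection described in exo-explore#1767 above can be illustrated with a small, self-contained Python sketch. The helper names and data shapes below are hypothetical stand-ins, not the actual `placement.py` code; only the `(download_score, available_ram)` ordering comes from the commit message.

```python
# Illustration of tuple-key cycle selection: download progress first,
# available RAM as the tiebreaker. Names and data shapes are hypothetical.


def cycle_download_score(cycle_nodes: list[str], download_fraction: dict[str, float]) -> float:
    """Sum of per-node download fractions (0.0-1.0) across the cycle's nodes."""
    return sum(download_fraction.get(node, 0.0) for node in cycle_nodes)


def cycle_available_ram(cycle_nodes: list[str], ram_gb: dict[str, float]) -> float:
    """Total free RAM across the cycle's nodes."""
    return sum(ram_gb.get(node, 0.0) for node in cycle_nodes)


def pick_cycle(cycles, download_fraction, ram_gb):
    # max() on a tuple key compares element-wise: a cycle with more of the model
    # already downloaded always wins; RAM only matters to break ties.
    return max(
        cycles,
        key=lambda c: (
            cycle_download_score(c, download_fraction),
            cycle_available_ram(c, ram_gb),
        ),
    )


if __name__ == "__main__":
    cycles = [["node-a"], ["node-b"]]
    fractions = {"node-a": 1.0, "node-b": 0.0}  # node-a already has the model
    ram = {"node-a": 32.0, "node-b": 128.0}     # node-b has more free RAM
    assert pick_cycle(cycles, fractions, ram) == ["node-a"]
```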
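Likewise, the on-disk validation from the DownloadCompleted fix amounts to checking that the model directory and its config.json still exist before trusting a cached status. A minimal sketch, with hypothetical function names and status strings (only the "directory plus config.json must still exist" rule comes from the commit message):

```python
from pathlib import Path


def cached_download_is_still_valid(model_dir: Path) -> bool:
    """Check the model is really on disk before trusting a cached DownloadCompleted."""
    return model_dir.is_dir() and (model_dir / "config.json").is_file()


def resolve_download_status(model_dir: Path, cached_status: str | None) -> str:
    # Status strings are hypothetical, for illustration only.
    if cached_status == "DownloadCompleted" and not cached_download_is_still_valid(model_dir):
        # Stale cache entry: the files were purged, so fall back to a fresh download.
        return "DownloadNeeded"
    return cached_status or "DownloadNeeded"
```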
Motivation
Addresses #1816
Changes
Only update the prefix cache when the prefix hit length is greater than min_prefix_hit_length and the hit ratio is greater than _MIN_PREFIX_HIT_RATIO_TO_UPDATE.
min_prefix_hit_length = max(1000, system prompt length), so system prompts must match exactly.
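A rough sketch of this rule, assuming token counts as the unit; the function and argument names and the threshold value are illustrative only, since the PR itself only names min_prefix_hit_length and _MIN_PREFIX_HIT_RATIO_TO_UPDATE:

```python
# Minimal sketch of the update gate, not exo's actual implementation.

_MIN_PREFIX_HIT_RATIO_TO_UPDATE = 0.8  # assumed value for illustration


def should_update_prefix_cache(
    prefix_hit_tokens: int,
    prompt_tokens: int,
    system_prompt_tokens: int,
) -> bool:
    """Return True only when the cache hit is good enough to overwrite the stored prefix."""
    # The hit must exceed max(1000, system prompt length), which in practice
    # means the system prompt matched exactly.
    min_prefix_hit_length = max(1000, system_prompt_tokens)
    if prefix_hit_tokens <= min_prefix_hit_length:
        return False
    # The hit must also cover a large enough fraction of the incoming prompt.
    hit_ratio = prefix_hit_tokens / max(prompt_tokens, 1)
    return hit_ratio > _MIN_PREFIX_HIT_RATIO_TO_UPDATE
```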
Test Plan
Manual Testing
Test on OpenCode and Claude Code