
Only update KV prefix cache on a good cache hit #1817

Merged: Evanev7 merged 1 commit into main from leo/fix-prefix-cache-updates on Mar 30, 2026

Conversation

rltakashige (Collaborator) commented on Mar 30, 2026

## Motivation

Addresses #1816

## Changes

- Update the prefix cache only when the prefix hit length is greater than min_prefix_hit_length **and** the hit ratio is greater than _MIN_PREFIX_HIT_RATIO_TO_UPDATE.
- min_prefix_hit_length = max(1000, system prompt length), so system prompts must match exactly.
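
A minimal sketch of the gating condition, assuming hypothetical names (`hit_length`, `prompt_length`, `system_prompt_length`) and an illustrative value for the ratio threshold; only the two thresholds themselves come from this PR:

```python
_MIN_PREFIX_HIT_RATIO_TO_UPDATE = 0.8  # illustrative value, not the repo's

def should_update_prefix_cache(
    hit_length: int, prompt_length: int, system_prompt_length: int
) -> bool:
    # The hit must be long enough that the system prompt matched exactly:
    # any divergence inside the system prompt caps the hit below this floor.
    min_prefix_hit_length = max(1000, system_prompt_length)
    if hit_length <= min_prefix_hit_length:
        return False
    # The hit must also cover a large enough fraction of the new prompt,
    # so a short accidental overlap can't evict a better cached prefix.
    return hit_length / max(prompt_length, 1) > _MIN_PREFIX_HIT_RATIO_TO_UPDATE
```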

## Test Plan

### Manual Testing

Test on OpenCode and Claude Code

rltakashige force-pushed the leo/fix-prefix-cache-updates branch from dd89ace to dbed914 on March 30, 2026 at 11:18
rltakashige force-pushed the leo/fix-prefix-cache-updates branch from dbed914 to a6f3ffe on March 30, 2026 at 13:29
Evanev7 (Member) left a comment:


requires testing but code looks good!

Evanev7 merged commit c6815bf into main on Mar 30, 2026 (6 checks passed)
Evanev7 deleted the leo/fix-prefix-cache-updates branch on March 30, 2026 at 14:04
ttupper92618 pushed a commit to Foxlight-Foundation/Skulk that referenced this pull request Apr 1, 2026
Addresses exo-explore#1816

Update the prefix cache only when the prefix hit length is greater than
min_prefix_hit_length **and** the hit ratio is greater than
_MIN_PREFIX_HIT_RATIO_TO_UPDATE.
min_prefix_hit_length = max(1000, system prompt length), so system
prompts must match exactly.

Test on OpenCode and Claude Code
ttupper92618 added a commit to Foxlight-Foundation/Skulk that referenced this pull request Apr 1, 2026
* send error finish reason on failing to parse a tool call (exo-explore#1785)

a simplification of exo-explore#1757, which is now stale

* Add SSE-keepalive to not time out on long prefill on clients (exo-explore#1803)
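
For context, SSE allows comment frames beginning with `:`, which clients ignore; emitting one periodically keeps the connection open while prefill produces no tokens. A minimal sketch of the idea, not the PR's actual code (names and interval are illustrative):

```python
import asyncio
from typing import AsyncIterator

KEEPALIVE_INTERVAL_S = 15.0  # illustrative interval

async def with_sse_keepalive(frames: AsyncIterator[str]) -> AsyncIterator[str]:
    """Forward SSE frames, inserting comment frames while upstream is quiet."""
    it = frames.__aiter__()
    while True:
        nxt = asyncio.ensure_future(it.__anext__())
        while True:
            try:
                chunk = await asyncio.wait_for(
                    asyncio.shield(nxt), KEEPALIVE_INTERVAL_S
                )
            except asyncio.TimeoutError:
                # SSE comment frame: clients ignore it, but proxies and
                # client timeouts see live traffic during long prefill.
                yield ": keepalive\n\n"
                continue
            except StopAsyncIteration:
                return
            yield chunk
            break
```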


* fix: DeepSeek V3.2 warmup crash and tool calling + add catalog cards (exo-explore#1769)

DeepSeek V3.2 (`DeepseekV32ForCausalLM`) is already supported by exo's
inference engine (architecture whitelisted in `model_cards.py`, DSML
encoding added in exo-explore#1548), but **doesn't work out of the box** due to two
bugs:

**Bug 1:** `warmup_inference()` in `generate.py` accepts `model_id: ModelId` as a
parameter but creates `TextGenerationTaskParams(model=ModelId(""), ...)`
instead of using it. Since `_needs_dsml_encoding()` checks
`"deepseek-v3.2" in task_params.model.lower()`, the empty string never
matches → falls back to `tokenizer.apply_chat_template()` →
**ValueError** because V3.2 has no Jinja chat template.

**Fix:** `model=ModelId("")` → `model=model_id` (one line).

**Bug 2:** `_needs_dsml_encoding()` returns `True` only when `task_params.tools` is
present or tool messages exist in `chat_template_messages`. For warmup
and regular chat requests without tools → `return False` → Jinja
fallback → **ValueError**.

Unlike V3.1 (which has a `.jinja` chat template file that transformers
picks up automatically), V3.2 **has no Jinja template at all**; it uses
Python-based DSML encoding for all message types.

**Fix:** For V3.2, always return `True`; DSML encoding handles all
message types.
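
A sketch of both fixes together; the types are trimmed stand-ins for exo's real ones, and the helper logic is paraphrased from the description above rather than copied from the repo:

```python
from dataclasses import dataclass, field

ModelId = str  # stand-in for exo's real ModelId type

@dataclass
class TextGenerationTaskParams:  # trimmed to the fields that matter here
    model: ModelId
    tools: list = field(default_factory=list)
    chat_template_messages: list = field(default_factory=list)

def _needs_dsml_encoding(task_params: TextGenerationTaskParams) -> bool:
    # Fix 2: V3.2 ships no Jinja chat template at all, so DSML encoding
    # must handle every message type, not only tool-calling requests.
    if "deepseek-v3.2" in task_params.model.lower():
        return True
    has_tool_messages = any(
        m.get("role") == "tool" for m in task_params.chat_template_messages
    )
    return bool(task_params.tools) or has_tool_messages

def warmup_inference(model_id: ModelId) -> TextGenerationTaskParams:
    # Fix 1: thread the real model id through instead of ModelId(""),
    # otherwise the "deepseek-v3.2" check above can never match on warmup.
    return TextGenerationTaskParams(model=model_id)
```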

Added inference model cards for:
- `mlx-community/DeepSeek-V3.2-8bit`
- `mlx-community/DeepSeek-V3.2-4bit`

Parameters taken from model `config.json` on HuggingFace, storage sizes
from HF API. Capabilities include `thinking_toggle` (related: exo-explore#1456).

- The model ID string matching approach (`"deepseek-v3.2" in
model.lower()`) is acknowledged tech debt — see exo-explore#1371 for the planned
architecture-based approach.

- [x] Start exo with DeepSeek V3.2 model → warmup should complete
without crash
- [x] Send a regular chat message (no tools) → should get a response
- [x] Send a chat message with tools → should work as before
- [x] V3.2 cards should appear in the dashboard model catalog

---------

Co-authored-by: user <user@m1.note>
Co-authored-by: Ryuichi Leo Takashige <leo@exolabs.net>
Co-authored-by: Evan <evanev7@gmail.com>

* Fix Nemotron cache leak upstream (exo-explore#1819)

Nemotron Cascade and Nano were failing at long decodes.

Fixed upstream; this change just bumps pyproject and the uv lock.

Tested with a reproduction script upstream.

* Improve batch performance and stats reporting (exo-explore#1777)

Batch generation reports incorrect statistics: mlx lm never clears the
original stats, so they get polluted over time.
The dashboard also appears considerably slower than the bench statistics.
There is also a large discrepancy between B=1 batch generation and
mlx_generate.
Extracting logprobs is massively expensive, causing up to a 25% slowdown
compared to pure batching.
```
[ 12:02:01.1240AM | INFO    ] step overhead: 3.49ms (next=12.49ms total=15.99ms)
[ 12:02:02.1600AM | INFO    ] step overhead: 3.23ms (next=13.01ms total=16.24ms)
[ 12:02:03.2228AM | INFO    ] step overhead: 3.28ms (next=13.38ms total=16.66ms)
[ 12:02:04.2798AM | INFO    ] step overhead: 3.25ms (next=12.84ms total=16.10ms)
[ 12:02:05.3152AM | INFO    ] step overhead: 3.18ms (next=12.61ms total=15.79ms)
[ 12:02:06.3522AM | INFO    ] step overhead: 3.41ms (next=12.83ms total=16.25ms)
[ 12:02:07.3987AM | INFO    ] step overhead: 3.38ms (next=13.14ms total=16.52ms)
[ 12:02:08.4537AM | INFO    ] step overhead: 1.84ms (next=19.44ms total=21.28ms)
```

1. Report stats ourselves instead of using mlx lm's stats for batch
generation (they use perf_counter anyway).
2. Adjust exo bench to match
3. Improve logprobs extraction speed by 10x, improving tps for the dashboard
and any requests for logprobs (see the sketch after this list)
4. Use an SSE comment to align the speed to the real numbers at the end
of generation
5. Patch mlx for several optimizations given our assumptions and use
cases (e.g. use vllm style RoPE).
6. Switch MLX LM version to latest main, including support for Nemotron
Super and some Qwen3.5 fixes.
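
One plausible way to get the logprob speedup in point 3, shown as a sketch under assumptions rather than the PR's actual code: extract only the sampled token's logprob via a per-row logsumexp instead of materializing the full vocab-sized log-softmax:

```python
import mlx.core as mx

def sampled_logprobs(logits: mx.array, tokens: mx.array) -> mx.array:
    """logits: (batch, vocab) at the sampled step; tokens: (batch,) token ids."""
    # Per-row normalizer; never materializes a (batch, vocab) log-softmax.
    log_z = mx.logsumexp(logits, axis=-1)
    # Gather just the logit of each sampled token.
    picked = mx.take_along_axis(logits, mx.expand_dims(tokens, -1), axis=-1)
    return picked[:, 0] - log_z
```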

1. Exo bench no longer reports polluted stats
2. Exo bench now handles the reported per-request stats rather than the
aggregate stats
3. The decode speed now jumps back to a real number at the end of the
generation
4. Large batch speedup for rotating KV cache models + 1:1 matching cache
with vllm

Needs testing on OpenCode and CC
Needs eval testing

Only showing the performance optimization difference as measured after the
accurate-reporting fix:

**GPT OSS 20B MXFP4 Q8 (large change)**
Before:
![before](https://github.com/user-attachments/assets/88b50637-fca2-4db4-9413-b9eee6e2057e)
![before](https://github.com/user-attachments/assets/21e5c76a-2f5f-44d2-8953-121b3ebdbd68)

After:
![after](https://github.com/user-attachments/assets/fec5cfbd-fff8-430a-b12e-a329410107a2)
![after](https://github.com/user-attachments/assets/0400344b-a4a6-42c0-a9dd-4ee91ade714a)

**Qwen 3.5 35B A3B 8bit (No change)**
Before:
![before](https://github.com/user-attachments/assets/e75f0b38-df5d-49fd-ab90-bc1667d981b3)

After:
![after](https://github.com/user-attachments/assets/eabfb59c-851f-4d88-b927-e1e699a75cc6)

**Llama 3.2 1B Instruct 4bit (small change)**
Before:
![before](https://github.com/user-attachments/assets/c2873655-acff-4536-8263-fb8aea33db80)

After:
![after](https://github.com/user-attachments/assets/15f95c75-1c2f-4474-85a2-88c4d0a32543)

* Only update KV prefix cache on a good cache hit (exo-explore#1817)

Addresses exo-explore#1816

Update the prefix cache only when the prefix hit length is greater than
min_prefix_hit_length **and** the hit ratio is greater than
_MIN_PREFIX_HIT_RATIO_TO_UPDATE.
min_prefix_hit_length = max(1000, system prompt length), so system
prompts must match exactly.

Test on OpenCode and Claude Code

* Prefer higher model download % for placement (exo-explore#1767)

## Motivation

When placing a model instance across the cluster, the master previously
only considered available RAM. This meant it could pick a node that
hasn't downloaded the model yet, even when another node already has it
(or is further along in downloading it).

## Changes

- Added download_status parameter to place_instance() in placement.py
- Added _get_node_download_fraction() to compute 0.0–1.0 download
progress per node/model
- Added _cycle_download_score() to sum download fractions across a
cycle's nodes
- Cycle selection now uses a (download_score, available_ram) tuple key —
download progress is the primary sort, RAM is the tiebreaker
- Passed self.state.downloads into place_instance() from master/main.py

## Why It Works

Python's tuple comparison gives download progress strict priority over
RAM, so a node with the model already downloaded will always be
preferred over one with more free RAM but no download.
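
A compact sketch of that selection rule; the node and cycle shapes, and how cycle RAM is aggregated, are guesses for illustration, and only the tuple ordering comes from the PR:

```python
from typing import Callable, Iterable, Sequence

Node = str  # illustrative stand-in for exo's node type

def _cycle_download_score(
    cycle: Sequence[Node], download_fraction: Callable[[Node], float]
) -> float:
    # Sum each node's 0.0-1.0 progress for the target model across the cycle.
    return sum(download_fraction(node) for node in cycle)

def pick_cycle(
    cycles: Iterable[Sequence[Node]],
    download_fraction: Callable[[Node], float],
    available_ram: Callable[[Node], int],
) -> Sequence[Node]:
    # Tuple comparison gives download progress strict priority; RAM only
    # breaks ties, so a node that already has the model always wins over
    # a node with more free memory and no download.
    return max(
        cycles,
        key=lambda cycle: (
            _cycle_download_score(cycle, download_fraction),
            sum(available_ram(node) for node in cycle),
        ),
    )
```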

## Test Plan

### Automated Testing

3 new tests cover: completed download preferred, higher partial progress
preferred, failed download not preferred over no-download node

* Prefer higher % downloaded nodes for API placement previews (exo-explore#1795)

Follow-up to exo-explore#1767: the same change, applied to placement previews served through the API.

* docs: add module and function docstrings to cherry-picked patches

New mlx-lm patch modules from upstream lacked documentation for
docs generation. Adds module-level and public function docstrings
to all four patch files explaining their purpose and behaviour.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: verify model exists on disk before re-emitting DownloadCompleted

When an embedding model instance was deleted and re-run, the download
coordinator still had a cached DownloadCompleted status and would
re-emit it without checking if the model directory still existed on
disk. This caused "Model not found on disk" errors requiring a manual
node cache purge.

Now _start_download validates that the model directory and config.json
still exist before re-emitting DownloadCompleted. If the directory is
gone, the stale cache entry is cleared and a fresh download begins.
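
A sketch of the validation step, using hypothetical names for the coordinator's internals; only the directory-plus-config.json check comes from the description:

```python
from pathlib import Path

def _cached_completion_still_valid(model_dir: Path) -> bool:
    # A cached DownloadCompleted is only trustworthy if the weights are
    # actually still on disk alongside their config.json.
    return model_dir.is_dir() and (model_dir / "config.json").is_file()
```

In `_start_download`, the cached status would then be re-emitted only when this returns True; otherwise the stale entry is dropped and a fresh download begins.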

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: typos in comments and log messages

- "compatability" -> "compatibility" in YarnRoPE docstring
- "Fakse" -> "False" in utils_mlx comment
- "succesfully" -> "successfully" in runner_supervisor logs

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Evan Quiney <evanev7@gmail.com>
Co-authored-by: rltakashige <rl.takashige@gmail.com>
Co-authored-by: vskiwi <141816715+vskiwi@users.noreply.github.com>
Co-authored-by: user <user@m1.note>
Co-authored-by: Ryuichi Leo Takashige <leo@exolabs.net>
Co-authored-by: ciaranbor <81697641+ciaranbor@users.noreply.github.com>
Co-authored-by: Thomas Tupper <kite3@kite3.local>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
