
Only update KV prefix cache on a good cache hit #1817

Merged: Evanev7 merged 1 commit into main from leo/fix-prefix-cache-updates on Mar 30, 2026

Conversation

rltakashige (Collaborator) commented on Mar 30, 2026

## Motivation

Addresses #1816

## Changes

- Update the prefix cache only when the prefix hit length is greater than min_prefix_hit_length **and** the hit ratio is greater than _MIN_PREFIX_HIT_RATIO_TO_UPDATE.
- min_prefix_hit_length = max(1000, system prompt length), so system prompts must match exactly.
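
A minimal sketch of the gating condition, assuming hypothetical names (`hit_length`, `prompt_length`, `system_prompt_length`) and an illustrative value for the ratio threshold; only the two thresholds themselves come from this PR:

```python
_MIN_PREFIX_HIT_RATIO_TO_UPDATE = 0.8  # illustrative value, not the repo's

def should_update_prefix_cache(
    hit_length: int, prompt_length: int, system_prompt_length: int
) -> bool:
    # The hit must be long enough that the system prompt matched exactly:
    # any divergence inside the system prompt caps the hit below this floor.
    min_prefix_hit_length = max(1000, system_prompt_length)
    if hit_length <= min_prefix_hit_length:
        return False
    # The hit must also cover a large enough fraction of the new prompt,
    # so a short accidental overlap can't evict a better cached prefix.
    return hit_length / max(prompt_length, 1) > _MIN_PREFIX_HIT_RATIO_TO_UPDATE
```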

## Test Plan

### Manual Testing

Test on OpenCode and Claude Code

rltakashige force-pushed the leo/fix-prefix-cache-updates branch from dd89ace to dbed914 on March 30, 2026 at 11:18
rltakashige force-pushed the leo/fix-prefix-cache-updates branch from dbed914 to a6f3ffe on March 30, 2026 at 13:29
Evanev7 (Member) left a comment:


requires testing but code looks good!

Evanev7 merged commit c6815bf into main on Mar 30, 2026 (6 checks passed)
Evanev7 deleted the leo/fix-prefix-cache-updates branch on March 30, 2026 at 14:04
ttupper92618 pushed a commit to Foxlight-Foundation/Skulk that referenced this pull request Apr 1, 2026
Addresses exo-explore#1816

Update the prefix cache only when the prefix hit length is greater than
min_prefix_hit_length **and** the hit ratio is greater than
_MIN_PREFIX_HIT_RATIO_TO_UPDATE.
min_prefix_hit_length = max(1000, system prompt length), so system
prompts must match exactly.

Test on OpenCode and Claude Code
ttupper92618 added a commit to Foxlight-Foundation/Skulk that referenced this pull request Apr 1, 2026
* send error finish reason on failing to parse a tool call (exo-explore#1785)

a simplification of exo-explore#1757, which is now stale

* Add SSE-keepalive to not time out on long prefill on clients (exo-explore#1803)
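
For context, SSE allows comment frames beginning with `:`, which clients ignore; emitting one periodically keeps the connection open while prefill produces no tokens. A minimal sketch of the idea, not the PR's actual code (names and interval are illustrative):

```python
import asyncio
from typing import AsyncIterator

KEEPALIVE_INTERVAL_S = 15.0  # illustrative interval

async def with_sse_keepalive(frames: AsyncIterator[str]) -> AsyncIterator[str]:
    """Forward SSE frames, inserting comment frames while upstream is quiet."""
    it = frames.__aiter__()
    while True:
        nxt = asyncio.ensure_future(it.__anext__())
        while True:
            try:
                chunk = await asyncio.wait_for(
                    asyncio.shield(nxt), KEEPALIVE_INTERVAL_S
                )
            except asyncio.TimeoutError:
                # SSE comment frame: clients ignore it, but proxies and
                # client timeouts see live traffic during long prefill.
                yield ": keepalive\n\n"
                continue
            except StopAsyncIteration:
                return
            yield chunk
            break
```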


* fix: DeepSeek V3.2 warmup crash and tool calling + add catalog cards (exo-explore#1769)

DeepSeek V3.2 (`DeepseekV32ForCausalLM`) is already supported by exo's
inference engine (architecture whitelisted in `model_cards.py`, DSML
encoding added in exo-explore#1548), but **doesn't work out of the box** due to two
bugs:

**Bug 1:** `warmup_inference()` in `generate.py` accepts `model_id: ModelId` as a
parameter but creates `TextGenerationTaskParams(model=ModelId(""), ...)`
instead of using it. Since `_needs_dsml_encoding()` checks
`"deepseek-v3.2" in task_params.model.lower()`, the empty string never
matches → falls back to `tokenizer.apply_chat_template()` →
**ValueError** because V3.2 has no Jinja chat template.

**Fix:** `model=ModelId("")` → `model=model_id` (one line).

**Bug 2:** `_needs_dsml_encoding()` returns `True` only when `task_params.tools` is
present or tool messages exist in `chat_template_messages`. For warmup
and regular chat requests without tools → `return False` → Jinja
fallback → **ValueError**.

Unlike V3.1 (which has a `.jinja` chat template file that transformers
picks up automatically), V3.2 **has no Jinja template at all**; it uses
Python-based DSML encoding for all message types.

**Fix:** For V3.2, always return `True`; DSML encoding handles all
message types.
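
A sketch of both fixes together; the types are trimmed stand-ins for exo's real ones, and the helper logic is paraphrased from the description above rather than copied from the repo:

```python
from dataclasses import dataclass, field

ModelId = str  # stand-in for exo's real ModelId type

@dataclass
class TextGenerationTaskParams:  # trimmed to the fields that matter here
    model: ModelId
    tools: list = field(default_factory=list)
    chat_template_messages: list = field(default_factory=list)

def _needs_dsml_encoding(task_params: TextGenerationTaskParams) -> bool:
    # Fix 2: V3.2 ships no Jinja chat template at all, so DSML encoding
    # must handle every message type, not only tool-calling requests.
    if "deepseek-v3.2" in task_params.model.lower():
        return True
    has_tool_messages = any(
        m.get("role") == "tool" for m in task_params.chat_template_messages
    )
    return bool(task_params.tools) or has_tool_messages

def warmup_inference(model_id: ModelId) -> TextGenerationTaskParams:
    # Fix 1: thread the real model id through instead of ModelId(""),
    # otherwise the "deepseek-v3.2" check above can never match on warmup.
    return TextGenerationTaskParams(model=model_id)
```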

Added inference model cards for:
- `mlx-community/DeepSeek-V3.2-8bit`
- `mlx-community/DeepSeek-V3.2-4bit`

Parameters taken from model `config.json` on HuggingFace, storage sizes
from HF API. Capabilities include `thinking_toggle` (related: exo-explore#1456).

- The model ID string matching approach (`"deepseek-v3.2" in
model.lower()`) is acknowledged tech debt — see exo-explore#1371 for the planned
architecture-based approach.

- [x] Start exo with DeepSeek V3.2 model → warmup should complete
without crash
- [x] Send a regular chat message (no tools) → should get a response
- [x] Send a chat message with tools → should work as before
- [x] V3.2 cards should appear in the dashboard model catalog

---------

Co-authored-by: user <user@m1.note>
Co-authored-by: Ryuichi Leo Takashige <leo@exolabs.net>
Co-authored-by: Evan <evanev7@gmail.com>

* Fix Nemotron cache leak upstream (exo-explore#1819)

Nemotron Cascade and Nano were failing at long decodes.

Fixed upstream; this change just bumps pyproject and the uv lock.

Tested with a reproduction script upstream.

* Improve batch performance and stats reporting (exo-explore#1777)

Batch generation reports incorrect statistics: mlx lm never clears the
original stats, so they get polluted over time.
The dashboard also appears considerably slower than the bench statistics.
There is also a large discrepancy between B=1 batch generation and
mlx_generate.
Extracting logprobs is massively expensive, causing up to a 25% slowdown
compared to pure batching.
```
[ 12:02:01.1240AM | INFO    ] step overhead: 3.49ms (next=12.49ms total=15.99ms)
[ 12:02:02.1600AM | INFO    ] step overhead: 3.23ms (next=13.01ms total=16.24ms)
[ 12:02:03.2228AM | INFO    ] step overhead: 3.28ms (next=13.38ms total=16.66ms)
[ 12:02:04.2798AM | INFO    ] step overhead: 3.25ms (next=12.84ms total=16.10ms)
[ 12:02:05.3152AM | INFO    ] step overhead: 3.18ms (next=12.61ms total=15.79ms)
[ 12:02:06.3522AM | INFO    ] step overhead: 3.41ms (next=12.83ms total=16.25ms)
[ 12:02:07.3987AM | INFO    ] step overhead: 3.38ms (next=13.14ms total=16.52ms)
[ 12:02:08.4537AM | INFO    ] step overhead: 1.84ms (next=19.44ms total=21.28ms)
```

1. Report stats ourselves instead of using mlx lm's stats for batch
generation (they use perf_counter anyway).
2. Adjust exo bench to match
3. Improve logprobs extraction speed by 10x, improving tps for the dashboard
and any requests for logprobs (see the sketch after this list)
4. Use an SSE comment to align the speed to the real numbers at the end
of generation
5. Patch mlx for several optimizations given our assumptions and use
cases (e.g. use vllm style RoPE).
6. Switch MLX LM version to latest main, including support for Nemotron
Super and some Qwen3.5 fixes.
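
One plausible way to get the logprob speedup in point 3, shown as a sketch under assumptions rather than the PR's actual code: extract only the sampled token's logprob via a per-row logsumexp instead of materializing the full vocab-sized log-softmax:

```python
import mlx.core as mx

def sampled_logprobs(logits: mx.array, tokens: mx.array) -> mx.array:
    """logits: (batch, vocab) at the sampled step; tokens: (batch,) token ids."""
    # Per-row normalizer; never materializes a (batch, vocab) log-softmax.
    log_z = mx.logsumexp(logits, axis=-1)
    # Gather just the logit of each sampled token.
    picked = mx.take_along_axis(logits, mx.expand_dims(tokens, -1), axis=-1)
    return picked[:, 0] - log_z
```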

1. Exo bench no longer reports polluted stats
2. Exo bench now handles the reported per-request stats rather than the
aggregate stats
3. The decode speed now jumps back to a real number at the end of the
generation
4. Large batch speedup for rotating KV cache models + 1:1 matching cache
with vllm

Needs testing on OpenCode and CC
Needs eval testing

Only showing the performance optimization difference as measured after the
accurate-reporting fix:

**GPT OSS 20B MXFP4 Q8 (large change)**
Before:
![before](https://github.com/user-attachments/assets/88b50637-fca2-4db4-9413-b9eee6e2057e)
![before](https://github.com/user-attachments/assets/21e5c76a-2f5f-44d2-8953-121b3ebdbd68)

After:
![after](https://github.com/user-attachments/assets/fec5cfbd-fff8-430a-b12e-a329410107a2)
![after](https://github.com/user-attachments/assets/0400344b-a4a6-42c0-a9dd-4ee91ade714a)

**Qwen 3.5 35B A3B 8bit (No change)**
Before:
![before](https://github.com/user-attachments/assets/e75f0b38-df5d-49fd-ab90-bc1667d981b3)

After:
![after](https://github.com/user-attachments/assets/eabfb59c-851f-4d88-b927-e1e699a75cc6)

**Llama 3.2 1B Instruct 4bit (small change)**
Before:
![before](https://github.com/user-attachments/assets/c2873655-acff-4536-8263-fb8aea33db80)

After:
![after](https://github.com/user-attachments/assets/15f95c75-1c2f-4474-85a2-88c4d0a32543)

* Only update KV prefix cache on a good cache hit (exo-explore#1817)

Addresses exo-explore#1816

Update the prefix cache only when the prefix hit length is greater than
min_prefix_hit_length **and** the hit ratio is greater than
_MIN_PREFIX_HIT_RATIO_TO_UPDATE.
min_prefix_hit_length = max(1000, system prompt length), so system
prompts must match exactly.

Test on OpenCode and Claude Code

* Prefer higher model download % for placement (exo-explore#1767)

## Motivation

When placing a model instance across the cluster, the master previously
only considered available RAM. This meant it could pick a node that
hasn't downloaded the model yet, even when another node already has it
(or is further along in downloading it).

## Changes

- Added download_status parameter to place_instance() in placement.py
- Added _get_node_download_fraction() to compute 0.0–1.0 download
progress per node/model
- Added _cycle_download_score() to sum download fractions across a
cycle's nodes
- Cycle selection now uses a (download_score, available_ram) tuple key —
download progress is the primary sort, RAM is the tiebreaker
- Passed self.state.downloads into place_instance() from master/main.py

## Why It Works

Python's tuple comparison gives download progress strict priority over
RAM, so a node with the model already downloaded will always be
preferred over one with more free RAM but no download.
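
A compact sketch of that selection rule; the node and cycle shapes, and how cycle RAM is aggregated, are guesses for illustration, and only the tuple ordering comes from the PR:

```python
from typing import Callable, Iterable, Sequence

Node = str  # illustrative stand-in for exo's node type

def _cycle_download_score(
    cycle: Sequence[Node], download_fraction: Callable[[Node], float]
) -> float:
    # Sum each node's 0.0-1.0 progress for the target model across the cycle.
    return sum(download_fraction(node) for node in cycle)

def pick_cycle(
    cycles: Iterable[Sequence[Node]],
    download_fraction: Callable[[Node], float],
    available_ram: Callable[[Node], int],
) -> Sequence[Node]:
    # Tuple comparison gives download progress strict priority; RAM only
    # breaks ties, so a node that already has the model always wins over
    # a node with more free memory and no download.
    return max(
        cycles,
        key=lambda cycle: (
            _cycle_download_score(cycle, download_fraction),
            sum(available_ram(node) for node in cycle),
        ),
    )
```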

## Test Plan

### Automated Testing

3 new tests cover: completed download preferred, higher partial progress
preferred, failed download not preferred over no-download node

* Prefer higher % downloaded nodes for API placement previews (exo-explore#1795)

Follow-up to exo-explore#1767: the same change, applied to placement previews served through the API.

* docs: add module and function docstrings to cherry-picked patches

New mlx-lm patch modules from upstream lacked documentation for
docs generation. Adds module-level and public function docstrings
to all four patch files explaining their purpose and behaviour.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: verify model exists on disk before re-emitting DownloadCompleted

When an embedding model instance was deleted and re-run, the download
coordinator still had a cached DownloadCompleted status and would
re-emit it without checking if the model directory still existed on
disk. This caused "Model not found on disk" errors requiring a manual
node cache purge.

Now _start_download validates that the model directory and config.json
still exist before re-emitting DownloadCompleted. If the directory is
gone, the stale cache entry is cleared and a fresh download begins.
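
A sketch of the validation step, using hypothetical names for the coordinator's internals; only the directory-plus-config.json check comes from the description:

```python
from pathlib import Path

def _cached_completion_still_valid(model_dir: Path) -> bool:
    # A cached DownloadCompleted is only trustworthy if the weights are
    # actually still on disk alongside their config.json.
    return model_dir.is_dir() and (model_dir / "config.json").is_file()
```

In `_start_download`, the cached status would then be re-emitted only when this returns True; otherwise the stale entry is dropped and a fresh download begins.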

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: typos in comments and log messages

- "compatability" -> "compatibility" in YarnRoPE docstring
- "Fakse" -> "False" in utils_mlx comment
- "succesfully" -> "successfully" in runner_supervisor logs

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Evan Quiney <evanev7@gmail.com>
Co-authored-by: rltakashige <rl.takashige@gmail.com>
Co-authored-by: vskiwi <141816715+vskiwi@users.noreply.github.com>
Co-authored-by: user <user@m1.note>
Co-authored-by: Ryuichi Leo Takashige <leo@exolabs.net>
Co-authored-by: ciaranbor <81697641+ciaranbor@users.noreply.github.com>
Co-authored-by: Thomas Tupper <kite3@kite3.local>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
