Live tracking doc for the 0.9.x rewrite line. Updated as work lands.
Current state: 0.9.0-dev.1 tagged. Apple xcframework, macOS dylib, Android CPU AAR, and Android Hexagon AAR all build green in CI and attach to GitHub Releases. No on-device validation yet — that's the next gate.
| ID | Milestone | Status | Notes |
|---|---|---|---|
| M1 | Skeleton + macOS dylib loader | ✅ shipped | 4 modules, smoke test |
| M2 | Generation core | ✅ shipped | Sampler/Request/Generator/Session, streaming token API |
| M3 | Isolate engine | ✅ shipped | LlamaEngine + EngineSession; the public API |
| M4 | Chat (template-driven) | ✅ shipped | EngineChat, ChatTemplate, KnownChatTemplates |
| M5 | Multimodal (vision + audio) | ✅ shipped | LlamaMedia, MultimodalContext, mtmd_helper_eval_chunks |
| M6 | Persistence | ✅ shipped | StateCodec with metadata-validated reload |
| M6.5 | Context shift | ✅ shipped | llama-server-style auto-shift, gated on canShift |
| M7 | iOS xcframework | ✅ shipped | 3 slices: ios-arm64, ios-arm64-simulator, macos-arm64 |
| M8 | Android CPU AAR | ✅ shipped | arm64-v8a, ~2 MB stripped |
| M8.5 | Android Hexagon AAR | ✅ shipped | 6 HTP variants v68→v81, ~3.7 MB |
| M9 | Release pipeline | ✅ shipped | CI green; v0.9.0-dev.1 attached artifacts |
| M10 | On-device validation | 🟡 in progress | Android-Hexagon proven on Galaxy S23 Ultra; iOS + vision smoke + binding-gap fixes still pending |
| M11 | Real Jinja chat templates | 🟡 deferred | needed for Gemma-4-Unsloth and similar |
| M12 | pub.dev publish | 🟡 deferred | needs dart pub publish from CI |
Status icons: ✅ done · 🟡 in progress / deferred.
- On-device validation (M10). Everything we ship for iOS / Android is unverified on real hardware. Likely surprises:
  - Library-load path edge cases on iOS (xcframework + `loadFromProcess`).
  - Metal shader compile time on first-run iPhones.
  - Hexagon HTP runtime variant selection (which `libggml-htp-v*.so` actually loads on a given Snapdragon).
  - Android model file path access (assets vs. external storage).
- Real Jinja for chat templates (M11). `llama_chat_apply_template` substring matching covers ~55 families but fails on Gemma-4-Unsloth and similar custom Jinja. Today's workaround is manual `EngineSession.generate(prompt:)` formatting — works but `EngineChat.addUser` etc. silently break for those models. Two implementation paths:
  - Ship a minimal Jinja subset in Dart (~500 lines).
  - C shim around `common_chat_templates_apply` (links libllama-common into the AAR/xcframework).
- pub.dev publishing (M12). Add `dart pub publish --dry-run` to CI on tag, then a manual confirm.
- Stop strings. Token-level stop only today (`StopEog`/`StopMaxTokens`). Real chat apps want `stopAt: ['<|user|>', 'User:']` — see the sketch after this list.
- Thread-safe log silencing in worker. `LlamaLog.silence()` is off in the worker because `Pointer.fromFunction` callbacks crash when ggml's Metal init logs from a non-Dart thread. Switch to `NativeCallable.isolateGroupShared`.
- Multi-image / multi-audio per turn. The implementation supports it (media is a list); no probe or test exercises it.
- Concurrent decode per engine. One in-flight generate per worker today.
- Benchmarking probe. No tok/s harness for tuning.
- Cosmetic Metal residency assert at process exit. Annoying but harmless; needs a refcounted backend init.
- Context shift on multimodal turns. llama-server disables it; hugind raises `n_keep` to cover the multimodal prefix instead — adopt that when a real long-multimodal use case appears.
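A minimal sketch of the stop-string item above, implemented app-side over the streamed `TokenEvent.text` API from the snapshot further down. `withStopStrings` and its buffering logic are hypothetical, not part of the package:

```dart
import 'dart:math';

import 'package:llama_cpp_dart/llama_cpp_dart.dart';

/// Hypothetical app-side stop-string filter (not the eventual `stopAt:`
/// API): wraps the token stream, holds back just enough text to catch a
/// stop string split across tokens, and ends the stream when one appears.
Stream<String> withStopStrings(
  Stream<GenerationEvent> events,
  List<String> stops,
) async* {
  final maxLen = stops.map((s) => s.length).reduce(max);
  var pending = '';
  await for (final ev in events) {
    if (ev is! TokenEvent) continue;
    pending += ev.text;
    for (final s in stops) {
      final hit = pending.indexOf(s);
      if (hit >= 0) {
        yield pending.substring(0, hit); // emit text before the stop
        return; // returning here cancels the underlying generate stream
      }
    }
    // Flush whatever can no longer be the start of a stop string.
    final flushable = pending.length - (maxLen - 1);
    if (flushable > 0) {
      yield pending.substring(0, flushable);
      pending = pending.substring(flushable);
    }
  }
  yield pending; // generation ended without hitting a stop
}
```

Usage: `withStopStrings(session.generate(prompt: p, maxTokens: 256), ['<|user|>', 'User:'])`.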
These were called out as goals/non-goals at the start of the rewrite. Recording them so we don't re-litigate.
- Target platforms: iOS + Android primary, macOS as dev/test target. Linux / Windows / desktop CUDA / Vulkan-desktop dropped.
- Distribution: the Dart package contains no binaries. Native artifacts ship from GitHub Releases (`.dylib`, xcframework, `.aar`). Users embed at app build time.
- Acceleration: Apple Metal, Android Hexagon NPU + OpenCL + CPU. No Vulkan on Android. Hexagon is upstream-merged (`ggml-hexagon`); we don't use the haozixu fork.
- Single ffigen output covers every platform. `mtmd-helper.h` included for image/audio file decoding.
- Architecture: modular FFI. No god-class. Coupling flows one direction: `session → context → model → ffi`. Each `*Params` DTO is immutable, JSON-serializable, no FFI types.
- Chat templates: rendered via `llama_chat_apply_template` against the GGUF's embedded Jinja. Per-format Dart classes (ChatML/Gemma/Alpaca/Harmony) all dropped. `KnownChatTemplates` provides sentinel substrings as a fallback for misclassified models.
- No MCP, no HTTP server, no agent runtime, no `TextChunker` — application-layer concerns.
- Single in-flight generate per engine. Multi-session uses sequence ids in the same context.
- Persistence file format carries metadata (model identity, context params, mmproj fingerprint, token checksum) so loading into an incompatible engine throws with a discriminator instead of corrupting state (sketched after this list).
- Context shift is opt-in (`ContextShiftPolicy.auto`), gated on `engine.canShift`. Recurrent and iSWA caches return false and the policy throws.
- Versioning: `0.9.x` is the rewrite line. Public API may break once more before 1.0.
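The persistence decision above, as a shape sketch. `StateHeader`, its fields, and `StateMismatchException` are illustrative names, not the actual StateCodec layout; the point is the intent — compare saved metadata against the live engine and throw with a named discriminator before touching any state:

```dart
/// Illustrative shape only — field and type names here are assumptions,
/// not the actual StateCodec layout.
class StateHeader {
  final String modelId;     // model identity
  final String mmprojHash;  // mmproj fingerprint
  final int nCtx;           // context params (one of several in practice)
  const StateHeader(this.modelId, this.mmprojHash, this.nCtx);
}

class StateMismatchException implements Exception {
  final String field;
  final Object saved, live;
  StateMismatchException(this.field, this.saved, this.live);
  @override
  String toString() =>
      'state mismatch on $field: file has $saved, engine has $live';
}

/// Fail fast with a named discriminator instead of restoring bytes into
/// an incompatible engine. (The token checksum would be verified
/// separately, after the token history is decoded.)
void validateHeader(StateHeader file, StateHeader engine) {
  if (file.modelId != engine.modelId) {
    throw StateMismatchException('model', file.modelId, engine.modelId);
  }
  if (file.nCtx != engine.nCtx) {
    throw StateMismatchException('nCtx', file.nCtx, engine.nCtx);
  }
  if (file.mmprojHash != engine.mmprojHash) {
    throw StateMismatchException('mmproj', file.mmprojHash, engine.mmprojHash);
  }
}
```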
Reported by the demo-app instance. Same Gemma-4-E2B-it-Q8_0 + mmproj-F16 on both AAR flavors, push-and-go from `/sdcard/Android/data/<app>/files/`.

| Test | CPU AAR | Hexagon AAR |
|---|---|---|
| Spawn (warm) | 3.7 s | 17 s |
| Decode (67-tok reply) | 7.8 tok/s | 12.2 tok/s (1.6×) |
| `canShift` | true | true |
| `supportsVision` / `supportsAudio` | true | true |
| Reply quality (T=0.3) | identical | identical |
Conclusion: the Hexagon AAR is the right default. The 1.6× compares the Hexagon AAR to the CPU AAR as a whole; the gain could come from the NPU, OpenCL, or both, and attributing it to a specific backend needs the binding gaps below closed.
Queued for 0.9.0-dev.2:
- `ContextParams` missing `typeK`/`typeV` — no Dart knob for KV cache quantization. Patch: add both fields, thread to `llama_context_params`.
- HANDOFF.md said `minSdk 26` — the Hexagon AAR ships with manifest `minSdkVersion 31` because OpenCL needs API 31's `<uses-native-library>` opt-in. Fixed in HANDOFF.md.
- Hexagon `libllama.so` has a hard `DT_NEEDED` on `libggml-opencl.so`. Apps need `<uses-native-library android:name="libOpenCL.so" android:required="false" />` in their manifest or `dlopen` of `libllama.so` fails on first load. Documented in HANDOFF.md; consider whether a soft-load path is worth it on the binding side.
- No backend-inspection API. ✅ shipped. `engine.devices` returns the full list captured at spawn; `BackendDevice` carries name / description / type (cpu/gpu/igpu/accel/meta) / registry / memory. `engine.hasAccelerator` and `engine.primaryAcceleratorName` give the common-case shortcut. Probes: `example/probes/{list_backends,engine_backends}.dart`. Doesn't identify the specific device that ran a single generation (ggml-backend's scheduler distributes ops in one graph) but answers "is Hexagon loaded?" / "is OpenCL loaded?" cleanly — usage sketched below.
- No log redirect. `LlamaLog` has `silence()`/`useDefault()` but no `onMessage(cb)` — backend selection messages don't reach logcat. Patch: switch worker log to `NativeCallable.isolateGroupShared`, expose a Dart callback.
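A quick usage sketch of the backend-inspection item above. `engine.devices`, `hasAccelerator`, and `primaryAcceleratorName` are the shipped names per that item; the exact `BackendDevice` accessors follow its description and may differ in detail:

```dart
import 'package:llama_cpp_dart/llama_cpp_dart.dart';

Future<void> main() async {
  final engine = await LlamaEngine.spawn(
    libraryPath: 'libllama.so',
    modelParams: ModelParams(path: '...'),
    contextParams: const ContextParams(nCtx: 4096),
  );
  // Full device list captured at spawn.
  for (final d in engine.devices) {
    print('${d.name} [${d.type}]: ${d.description}');
  }
  // Common-case shortcut: "did an accelerator load at all?"
  if (engine.hasAccelerator) {
    print('accelerated: ${engine.primaryAcceleratorName}');
  } else {
    print('CPU only');
  }
}
```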
- Vision smoke on device (push JPEG, describe via `LlamaMedia.imageFile`) — see the probe sketch after this list.
- iOS validation (xcframework load, Metal shader cold-start time, model on disk via app bundle).
- Q4_K_M tok/s rerun once a smaller model is on hand. Expectation: ~22–28 tok/s on Hexagon (model bytes per token, and so bandwidth pressure, roughly halve vs Q8_0).
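A sketch of that vision smoke probe, composed from the API snapshot further down (spawn with `MultimodalParams`, `LlamaMedia.imageFile`, streamed `TokenEvent`s). The file names under the push-and-go directory are illustrative:

```dart
import 'dart:io';

import 'package:llama_cpp_dart/llama_cpp_dart.dart';

Future<void> main() async {
  // Illustrative file names under the push-and-go directory from the
  // benchmark above.
  const base = '/sdcard/Android/data/<app>/files';
  final engine = await LlamaEngine.spawn(
    libraryPath: 'libllama.so',
    modelParams: ModelParams(path: '$base/model.gguf'),
    contextParams: const ContextParams(nCtx: 4096),
    multimodalParams: MultimodalParams(mmprojPath: '$base/mmproj.gguf'),
  );
  final chat = await engine.createChat();
  chat.addUser('Describe this image.',
      media: [LlamaMedia.imageFile('$base/test.jpg')]);
  await for (final ev in chat.generate(maxTokens: 128)) {
    if (ev is TokenEvent) stdout.write(ev.text);
  }
}
```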
- Context shifting went from "skipped — too complex" → "shipped" (M6.5) once we found the llama-server algorithm is ~10 lines on top of the existing primitives (sketched below).
- Hexagon NPU went from "post-1.0, blocked on SDK" → "shipped, pending validation" (M8.5) once we realized the upstream Snapdragon Docker image bundles the Hexagon SDK and works on Apple Silicon under Rosetta 2.
- Three xcframework slices (ios-arm64 + ios-arm64-simulator + macos-arm64) instead of two. The macOS slice goes in the xcframework so iOS app developers can also use macOS for native debugging without a separate dylib.
- Custom Jinja support went from "drop the Dart prompt formats, llama.cpp handles it" → "llama.cpp's matcher misses some custom templates; manual prompt rendering is the workaround". The Jinja gap is the headline thing 0.9.x has to address.
- Audio support wasn't on the original radar; it turned out mtmd handles audio with the same API surface as vision (`LlamaMedia` + `mtmd_helper_eval_chunks`), so we ship audio for free with vision.
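For reference, a Dart-flavored sketch of the llama-server shift algorithm mentioned in the first item. The `kvCacheSeqRm` / `kvCacheSeqAdd` wrapper names are assumptions standing in for llama.cpp's `llama_kv_cache_seq_rm` / `llama_kv_cache_seq_add`; the arithmetic is the point:

```dart
/// Keep the first nKeep tokens, drop the oldest half of everything
/// after them, slide the survivors down so positions stay contiguous.
void shiftContext(LlamaContext ctx, int seqId, int nKeep, int nPast) {
  final nLeft = nPast - nKeep;
  final nDiscard = nLeft ~/ 2;
  // Evict the oldest half of the shiftable region...
  ctx.kvCacheSeqRm(seqId, nKeep, nKeep + nDiscard);
  // ...then shift the remaining cells' positions down by nDiscard.
  ctx.kvCacheSeqAdd(seqId, nKeep + nDiscard, nPast, -nDiscard);
  // Caller continues decoding at nPast - nDiscard.
}
```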
- Hexagon device selection UX. Should the Dart API auto-pick the highest-scoring backend, or always require an explicit `LlamaBackend.list()` choice? Currently auto via ggml-backend, which seems right but is unverified on hardware.
- Android packaging of `libggml-htp-v*.so`. All six HTP variants in one AAR — works in our local build, but Play Store APK / AAB delivery hasn't been tested. May need per-device splits if app stores complain about bundle size or unused variants.
- Real Jinja path. Dart subset vs C shim — the trade-off depends on how much we want libllama-common bound. Wait for users to push back before deciding. (The manual-formatting workaround is sketched after this list.)
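The manual-formatting workaround referenced above, sketched for a Gemma-style template. The turn markers are the stock Gemma ones; for a truly custom Jinja template you would transcribe its markers the same way. `role` / `content` accessors on `ChatMessage` are assumed:

```dart
import 'package:llama_cpp_dart/llama_cpp_dart.dart';

/// Render a chat history by hand when llama_chat_apply_template
/// misclassifies the model. Markers below are the stock Gemma ones;
/// adjust to whatever the GGUF's embedded template actually expects.
String renderGemma(List<ChatMessage> history) {
  final buf = StringBuffer();
  for (final m in history) {
    final role = m.role == 'assistant' ? 'model' : 'user';
    buf
      ..write('<start_of_turn>$role\n')
      ..write(m.content)
      ..write('<end_of_turn>\n');
  }
  buf.write('<start_of_turn>model\n'); // cue the assistant reply
  return buf.toString();
}

// Then bypass EngineChat and stream through the raw session API:
//   session.generate(prompt: renderGemma(history), maxTokens: 256)
```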
```
lib/
  llama_cpp_dart.dart — public exports
  src/
    ffi/         bindings.dart, library_loader.dart, log.dart
    model/       LlamaModel + vocab + ModelParams
    context/     LlamaContext + ContextParams
    batch/       LlamaBatch
    sampling/    Sampler + SamplerFactory + SamplerParams
    tokenizer/   Tokenizer + Utf8Accumulator
    generation/  Generator + Request + GenerationEvent + ContextShift
    chat/        ChatMessage + ChatTemplate + KnownChatTemplates
    multimodal/  MultimodalContext + LlamaMedia + MultimodalParams
    session/     LlamaSession + StateCodec
    isolate/     LlamaaEngine + EngineSession + EngineChat + worker
    types/       exception hierarchy
tool/
  build_native.sh              — macOS dylib for dev/test
  build_apple_xcframework.sh   — iOS device + iOS sim + macOS slices
  build_android_aar.sh         — CPU AAR (arm64-v8a)
  build_android_hexagon_aar.sh — CPU + OpenCL + Hexagon NPU AAR (Docker)
example/probes/    — runnable scripts per subsystem
test/              — pure-Dart and integration tests
.github/workflows/ — test, build-apple, build-android
```
Module rules:
- Each module owns one llama.cpp concept. No cross-cutting state.
- All FFI pointers are wrapped in finalizable handle objects (sketched below).
- DTOs (`*Params`) are immutable, JSON-serializable, no FFI types.
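A sketch of the finalizable-handle rule, assuming `llama_free` as the native free symbol; the actual wrappers live in the module folders above and may differ in detail:

```dart
import 'dart:ffi';

// Native free signature shared by the finalizer and explicit dispose.
typedef _FreeNative = Void Function(Pointer<Void>);

final DynamicLibrary _lib = DynamicLibrary.process();

/// One owner per native pointer: frees on dispose(), and the
/// NativeFinalizer frees on GC if dispose() was never called.
final class ContextHandle implements Finalizable {
  static final NativeFinalizer _finalizer =
      NativeFinalizer(_lib.lookup<NativeFunction<_FreeNative>>('llama_free'));
  static final void Function(Pointer<Void>) _free =
      _lib.lookupFunction<_FreeNative, void Function(Pointer<Void>)>('llama_free');

  Pointer<Void>? _ptr;

  ContextHandle(Pointer<Void> ptr) : _ptr = ptr {
    _finalizer.attach(this, ptr, detach: this);
  }

  /// Use-after-dispose surfaces as a Dart error, not a native crash.
  Pointer<Void> get ptr =>
      _ptr ?? (throw StateError('ContextHandle used after dispose'));

  void dispose() {
    final p = _ptr;
    if (p == null) return;
    _ptr = null;
    _finalizer.detach(this); // avoid double-free from the finalizer
    _free(p);
  }
}
```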
```dart
// dev / dart test (macOS dylib)
final engine = await LlamaEngine.spawn(
libraryPath: '/path/to/libllama.dylib',
modelParams: ModelParams(path: '/path/to/model.gguf', gpuLayers: 99),
contextParams: const ContextParams(nCtx: 4096),
);
// iOS / macOS app (xcframework embedded by Xcode)
final engine = await LlamaEngine.spawnFromProcess(
modelParams: ModelParams(path: '...'),
contextParams: const ContextParams(nCtx: 4096),
);
// Android (AAR or jniLibs/)
final engine = await LlamaEngine.spawn(
libraryPath: 'libllama.so', // basename — Android's linker resolves
modelParams: ModelParams(path: '...'),
contextParams: const ContextParams(nCtx: 4096),
);
// raw streaming
final session = await engine.createSession();
await for (final ev in session.generate(prompt: 'Hello', maxTokens: 64)) {
if (ev is TokenEvent) stdout.write(ev.text);
}
// chat
final chat = await engine.createChat();
chat.addSystem('You are concise.');
chat.addUser('Hi');
await for (final ev in chat.generate(maxTokens: 64)) { ... }
// multimodal (vision + audio)
final engineMM = await LlamaEngine.spawn(
...,
multimodalParams: const MultimodalParams(mmprojPath: 'mmproj.gguf'),
);
chat.addUser('Describe.', media: [LlamaMedia.imageFile('cat.jpg')]);
// persistence
await session.saveState('/tmp/state.lcdc');
await otherSession.loadState('/tmp/state.lcdc');
// context shift
session.generate(
prompt: '...',
shiftPolicy: ContextShiftPolicy.auto,
);
```

Lower-level types (`LlamaModel`, `LlamaContext`, `LlamaSession`, `LlamaBatch`, `Tokenizer`, `Sampler`) are exported for advanced use, but `LlamaaEngine` is the recommended entry point.
- `.github/workflows/test.yml` — `dart analyze` + `dart format --set-exit-if-changed` + pure-Dart utf8 accumulator test on every push/PR.
- `.github/workflows/build-apple.yml` — `macos-latest`. Builds macOS dylib zip + 3-slice xcframework zip. Attaches to release on `v*` tag.
- `.github/workflows/build-android.yml` — `ubuntu-latest` with two parallel jobs: `cpu` (NDK r27c) and `hexagon` (Snapdragon Docker image). Both attach AAR + sha256 to release on `v*` tag.
All workflows have `permissions: contents: write` so `softprops/action-gh-release` can create / update the release.
35 tests across 8 files, all passing on macOS:

| File | Tests | Coverage |
|---|---|---|
| `smoke_test.dart` | 4 | model load, vocab, batch, decode |
| `utf8_accumulator_test.dart` | 7 | streaming UTF-8 boundary detection (pure Dart) |
| `generation_test.dart` | 3 | Generator stream, greedy determinism, Session bookkeeping |
| `engine_test.dart` | 3 | Engine spawn, mid-stream cancel, multi-session |
| `chat_test.dart` | 4 + 1 skip | EngineChat history, multi-turn, templateOverride |
| `multimodal_test.dart` | 4 | image via session + chat, imageBytes, capability flags |
| `persistence_test.dart` | 7 | codec round-trip, mismatch detection, EngineSession + EngineChat round-trips |
| `context_shift_test.dart` | 3 | canShift, shiftContext primitive |
`/Users/adel/Workspace/hugind` (Rust) is the canonical solved-form for these problems. Modules mirrored in Dart (scaled down for single-app mobile):
- module-level separation: `model`, `context`, `batch`, `sampling`, `tokenizer`, `multimodal`, `chat`, `generation`, `session`
- per-request sampler (not singleton)
- `Request` as a value type with state machine
- multimodal behind its own context, not interleaved with text
- KV tiering as a dedicated manager (we use it lighter — RAM/disk only, no VRAM tier on mobile)

We don't mirror hugind's continuous-batching scheduler (mobile = single active session) or its multi-process server architecture.
Pinned at the b8965 era (`src/llama.cpp` submodule). Bumps require regen via `dart run ffigen` and a full test pass before retag.