An Android application that brings local LLM chat to your phone, fully on-device, private, and fast.
It supports ONNX-based Qwen models and LiteRT-based Qwen3 and Gemma 4 models, with streaming responses, persistent local chat history, markdown-rendered replies, on-device model downloads, and in-app model switching.
Unlike older builds, the app now ships as a small base APK. Users choose and download one or more models after installation, then switch between them later from inside the app.
- Switched to a smaller base APK
- Models are no longer bundled inside the APK
- Added first-launch model selection with on-device model downloads
- Added support for downloading multiple models
- Added support for switching between downloaded models inside the app
- Retained thinking text in a collapsible section for supported models
- Added basic UI and usability improvements
Related project: local-document-intelligence, a privacy-first offline document intelligence system with persistent local RAG, hybrid retrieval, and source-grounded answers.
- 📱 Fully on-device LLM inference for private offline use
- 📦 Smaller base APK with on-demand model downloads
- 📥 First-launch model picker after install
- 🔁 Download multiple models and switch between them inside the app
- 🧠 Supports Qwen2.5, Qwen3, Qwen3 LiteRT, Gemma 4 E2B, and Gemma 4 E4B
- ⚡ Supports ONNX and LiteRT backends
- 🚀 Hardware-accelerated LiteRT inference on supported devices
- 🔤 Hugging Face-compatible tokenizer support for ONNX Qwen models
- 💬 Persistent multi-turn chat with local history and Previous Chats
- 📝 Markdown rendering for assistant replies, including table support
- 🤔 Thinking Mode for supported models
- 🗂️ Retained thinking text in a collapsible section
- 🎨 Built-in themes and adjustable chat font size
- 🛑 Stop-generation support
- 🔐 Offline after model download, with no telemetry
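The streaming and stop-generation features above can be sketched as a simple token callback, independent of backend. This is a minimal illustrative JVM sketch, not the app's actual code; all names here are hypothetical.

```kotlin
// Minimal sketch of streaming generation: the backend pushes partial
// text to the UI through a callback, and a volatile stop flag models
// the app's stop-generation button. All names are illustrative.
class StreamingSession(private val tokens: List<String>) {
    @Volatile var stopped = false

    // Emits the growing reply after each token; returns the final text.
    fun generate(onPartial: (String) -> Unit): String {
        val sb = StringBuilder()
        for (t in tokens) {
            if (stopped) break          // stop-generation support
            sb.append(t)
            onPartial(sb.toString())    // stream the partial reply to the UI
        }
        return sb.toString()
    }

    fun stop() { stopped = true }
}
```

In the real app the token source would be the ONNX or LiteRT runtime rather than a fixed list, but the UI contract (partial-text callback plus a stop flag) is the same.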
| Chat inference | Theme switching | Thinking mode |
| --- | --- | --- |
| ![]() | ![]() | ![]() |

Figure: App interface showing local LLM chat and streaming responses on Android.
The app now ships as a single smaller base APK.
➡️ Download APK
Models are no longer bundled inside the APK. After installation, the app prompts the user to select a model and download it directly on device.
Users can download multiple models and switch between them later from inside the app, without reinstalling.
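The on-device download flow described above can be sketched as a streamed copy with progress reporting, so the UI can show a percentage while a model downloads. This is a hedged sketch using plain `java.net`; the actual app may use any HTTP client, and `downloadModel` is a hypothetical name.

```kotlin
import java.io.File
import java.net.URL

// Sketch of the in-app model download: stream the remote file to local
// storage while reporting (bytesCopied, totalBytes) so the UI can show
// progress. totalBytes is -1 when the server omits Content-Length.
fun downloadModel(url: URL, dest: File, onProgress: (Long, Long) -> Unit) {
    val conn = url.openConnection()
    val total = conn.contentLengthLong
    var copied = 0L
    conn.getInputStream().use { input ->
        dest.outputStream().use { output ->
            val buf = ByteArray(64 * 1024)
            while (true) {
                val n = input.read(buf)
                if (n < 0) break
                output.write(buf, 0, n)
                copied += n
                onProgress(copied, total)
            }
        }
    }
}
```

Writing to a temporary file and renaming on completion would make interrupted downloads easier to detect and resume; that detail is omitted here for brevity.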
- Gemma 4 E4B LiteRT - Best for flagship devices
- Gemma 4 E2B LiteRT - Best for mid-range to high-end devices
- Qwen3 0.6B LiteRT - Best for low-end devices
- Qwen3 0.6B Q4F16 ONNX - Good for low- to mid-range devices
- Qwen2.5 0.5B ONNX - Best for mid- to high-end devices, full precision
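The device tiers above suggest a simple in-app model registry. The sketch below shows one way to map device RAM to a recommended model; the identifiers mirror the list above, but the RAM thresholds and all function names are illustrative assumptions, not the app's actual values.

```kotlin
// Hypothetical model registry matching the tiers listed above.
enum class Backend { ONNX, LITERT }

data class ModelInfo(val name: String, val backend: Backend, val minRamGb: Int)

val MODELS = listOf(
    ModelInfo("Gemma 4 E4B LiteRT", Backend.LITERT, minRamGb = 8),
    ModelInfo("Gemma 4 E2B LiteRT", Backend.LITERT, minRamGb = 6),
    ModelInfo("Qwen3 0.6B LiteRT", Backend.LITERT, minRamGb = 3),
    ModelInfo("Qwen3 0.6B Q4F16 ONNX", Backend.ONNX, minRamGb = 4),
    ModelInfo("Qwen2.5 0.5B ONNX", Backend.ONNX, minRamGb = 6),
)

// Suggest the most demanding model that still fits the device's RAM,
// or null if nothing fits.
fun suggestModel(deviceRamGb: Int): ModelInfo? =
    MODELS.filter { it.minRamGb <= deviceRamGb }.maxByOrNull { it.minRamGb }
```

A first-launch picker could use this to preselect a default while still letting the user download any model in the list.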
Note: internet is required only for downloading models. Chat and inference remain fully on-device after the model is installed.
This app supports ONNX-based Qwen models and LiteRT-based Qwen3 and Gemma 4 models.
- ONNX backend: supports Qwen2.5 and Qwen3
- LiteRT backend: supports Qwen3 and Gemma 4
- Qwen3 and Gemma 4 support Thinking Mode
- The toggle is shown only for models that support it
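Separating thinking text into a collapsible section can be sketched as a small parsing step on the raw reply. This assumes the model wraps its reasoning in `<think>...</think>` tags, as Qwen3-style models do; the `Reply` type and function name are hypothetical.

```kotlin
// Sketch of splitting a raw model reply into thinking text (rendered
// in a collapsible section) and the visible answer. Assumes
// <think>...</think> delimiters around the reasoning.
data class Reply(val thinking: String?, val answer: String)

fun splitThinking(raw: String): Reply {
    val m = Regex("(?s)<think>(.*?)</think>").find(raw)
        ?: return Reply(null, raw.trim())          // no thinking section
    val thinking = m.groupValues[1].trim()
    val answer = raw.removeRange(m.range).trim()   // drop the tagged span
    return Reply(thinking.ifEmpty { null }, answer)
}
```

During streaming, the same check can decide whether the collapsible section is shown at all, which matches hiding the toggle for models that never emit thinking tags.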
LiteRT is a strong fit for fast local Android chat because:
- It is designed for high-performance on-device LLM deployment
- It supports hardware acceleration, including GPU and NPU acceleration on supported devices
- It helps reduce startup and generation latency for local chat workloads
- It expands the range of practical Android model builds beyond a single backend path
- It fits well with a privacy-first app design focused on fully offline usage
Note: model capability and performance still depend on the specific model build and the hardware of the target Android device.
- Android Studio
- A physical Android device for deployment and testing
- 4 GB of RAM or more for smaller models
- More RAM is recommended for larger models such as Gemma 4 E2B and Gemma 4 E4B
- A temporary internet connection for downloading models inside the app
- Real hardware is preferred; emulators are mainly useful for UI checks
- Clone this repository.
- Install the latest Android Studio.
- Open the Android project folder in Android Studio: pocket_llm_src/
- Build and install the app on your Android device.
- Launch the app.
- On first launch, choose a model from the built-in model picker.
- Download the selected model directly inside the app.
- Start chatting locally on device.
Gemma 4 is provided by Google under the Apache License 2.0. Google's Gemma documentation also states that Gemma models are provided with open weights and support responsible commercial use.
- Gemma 4 license: https://ai.google.dev/gemma/apache_2
- Gemma 4 overview: https://ai.google.dev/gemma/docs/core
Qwen model files follow the upstream Qwen license terms.
Please review the original model license before redistribution or commercial use.


