Autonomous Responsible Intelligence Architecture
One protocol. Every CPU. Every NPU. Every model.
ARIA is a universal distributed inference protocol. A single peer-to-peer network routes queries to the right model on the right hardware, whether that's a 1.58-bit ternary model running on a low-power laptop, a standard 4B GGUF model on a desktop, or a code/reasoning/vision specialist on a machine with a real NPU. Nodes advertise the tiers they serve in a v2 handshake and the smart router matches every query to a peer that can answer it.
| Tier | Models | Backend | Memory floor | Use case |
|---|---|---|---|---|
| 🌱 Efficiency | BitNet b1.58, Falcon-E, Falcon3 1.58-bit (8 models) | bitnet.cpp | 0.4 GB | Always-on chat on any CPU; low-power laptops; background nodes |
| ⚡ Quality | Gemma 4, Qwen 3.5, SmolLM3, Phi-4 mini (5 models) | mainline llama.cpp | 1.1 GB | Multilingual chat, longer context, multimodal (vision+audio) |
| 🛠️ Specialist | Qwen2.5-Coder, DeepSeek-R1-Distill, MiniCPM-V (3 models) | mainline llama.cpp | 1.9 GB | Code generation, chain-of-thought reasoning, vision |
A node operator picks one of five profile presets — minimal,
efficient (default), balanced, full, or specialist_only — and the
profile decides which tiers light up. The default keeps nodes lean
(efficiency only); balanced adds Quality on machines with 8 GB+ RAM;
full enables all three tiers on 16 GB+ workstations.
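As a rough sketch of how the presets might translate into enabled tiers, here is an illustrative mapping. The names and exact values below are assumptions based on the description above, not the contents of the ARIA codebase:

```python
# Illustrative only: a hypothetical preset-to-tier mapping mirroring the
# five profile presets described above. Not the actual ARIA source.
PROFILE_PRESETS: dict[str, list[str]] = {
    "minimal":         ["efficiency"],                            # smallest footprint
    "efficient":       ["efficiency"],                            # default: 1.58-bit only
    "balanced":        ["efficiency", "quality"],                 # ~8 GB+ RAM
    "full":            ["efficiency", "quality", "specialist"],   # 16 GB+ workstations
    "specialist_only": ["specialist"],                            # code/reasoning/vision node
}

def tiers_for(profile: str) -> list[str]:
    """Return the tiers a node would advertise for a given preset."""
    return PROFILE_PRESETS[profile]
```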
See docs/MODELS.md for the full catalog, license
matrix, and HuggingFace URLs.
```bash
pip install aria-protocol
```

```bash
# Default profile is "efficient" — 1.58-bit only, ~4 GB RAM, any CPU
aria node start --port 8765

# OpenAI-compatible API
aria api start --port 3000
```

```bash
# Tier Efficiency (default)
aria model download BitNet-b1.58-2B-4T

# Tier Quality (after switching profile)
aria node profile set balanced
aria model download Gemma-4-E2B

# Tier Specialist (after switching profile)
aria node profile set full
aria model download Qwen2.5-Coder-7B-Instruct
```

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:3000/v1", api_key="aria")
response = client.chat.completions.create(
    model="BitNet-b1.58-2B-4T",
    messages=[{"role": "user", "content": "What is quantum computing?"}],
)
print(response.choices[0].message.content)
```

The smart router picks a model automatically when none is specified — pass the catalog ID (`BitNet-b1.58-2B-4T`, `Gemma-4-E2B`, `Qwen2.5-Coder-7B-Instruct`, …) when you want a specific one.
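Since the API is OpenAI-compatible, standard client features should carry over. As a sketch, assuming the ARIA API relays OpenAI-style streaming chunks (not confirmed here), token streaming looks like this:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:3000/v1", api_key="aria")

# Stream tokens as they arrive. Assumes the ARIA API implements the
# standard OpenAI streaming protocol (stream=True); this is a sketch.
stream = client.chat.completions.create(
    model="BitNet-b1.58-2B-4T",
    messages=[{"role": "user", "content": "Summarize the ARIA tier system."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```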
For the full walkthrough see docs/getting-started.md.
ARIA v0.9.0 ships with 16 models across three tiers. Every entry
passes a strict license gate at import — the catalog only contains
models under MIT, Apache 2.0, or TII Falcon 2.0 licenses to keep P2P
redistribution friction-free. Models considered and rejected on
licensing grounds (Llama 3.x, Gemma 3, Mistral research, Yi, Command-R)
are listed in docs/MODELS.md with the rejection reasoning.
| Tier | # models | License surface |
|---|---|---|
| Efficiency | 8 | MIT, TII Falcon 2.0 |
| Quality | 5 | Apache 2.0, MIT |
| Specialist | 3 | Apache 2.0, MIT |
Adding a model is a pull request against aria/model_catalog.py. The
gate refuses non-permissive licenses at import time so the roster
cannot drift.
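The gate is small enough to sketch. The snippet below is not the real `aria/model_catalog.py`, just an illustration of the pattern: an entry type with a license field, plus an import-time check that rejects anything outside the allow-list.

```python
from dataclasses import dataclass

# Licenses considered P2P-redistribution friendly (mirrors the README).
PERMITTED_LICENSES = {"MIT", "Apache 2.0", "TII Falcon 2.0"}

@dataclass(frozen=True)
class ModelEntry:
    model_id: str   # catalog ID, e.g. "BitNet-b1.58-2B-4T"
    tier: str       # "efficiency" | "quality" | "specialist"
    license: str    # license name as listed in docs/MODELS.md

CATALOG = [
    ModelEntry("BitNet-b1.58-2B-4T", "efficiency", "MIT"),
]

# Import-time gate: a non-permissive license makes the module fail to import,
# so a pull request adding such a model cannot pass CI.
for entry in CATALOG:
    if entry.license not in PERMITTED_LICENSES:
        raise ValueError(
            f"{entry.model_id}: license {entry.license!r} is not allowed in the catalog"
        )
```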
Full table: docs/MODELS.md.
ARIA detects the local CPU/NPU at startup and ships the snapshot in the v2 peer hello so remote routers can prefer hardware-friendly peers:
```bash
aria hardware info
```

v0.9.0 ships NPU detection only — the protocol learns about AMD XDNA/XDNA2, Intel NPU, Qualcomm Hexagon, and Apple ANE devices, but inference itself still runs on the CPU. Real NPU acceleration ships in v1.0 via vendor-specific stubs (OpenVINO for Intel, QNN for Qualcomm, Core ML for Apple).
CPU detection covers Intel/AMD/Apple Silicon/Qualcomm Snapdragon,
including AVX-512 capability used by bitnet.cpp for native 512-bit
ternary kernels.
See docs/NPU_SUPPORT.md for the per-vendor
roadmap and how to verify detection on your machine.
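As a mental model of what that snapshot could look like next to the advertised tiers, here is an illustrative payload. The field names are assumptions; the real wire format is defined in docs/protocol-spec.md.

```python
# Illustrative shape of a v2 HELLO payload carrying a hardware snapshot.
# Field names are hypothetical, not the actual ARIA wire format.
hello = {
    "protocol_version": 2,
    "tiers": ["efficiency", "quality"],   # what this node serves
    "hardware": {
        "cpu": {
            "vendor": "AMD",
            "arch": "zen4",
            "avx512": True,               # used by bitnet.cpp ternary kernels
            "cores": 12,
            "threads": 24,
        },
        "npu": {
            "present": True,
            "vendor": "AMD",
            "family": "XDNA",
            "accelerated": False,         # v0.9.0: detection only, CPU inference
        },
    },
}
```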
Real numbers, reproducible from the repo. All measurements on a single host so they're comparable to each other; cross-host comparisons should treat the absolute throughput as indicative.
Hardware: AMD Ryzen 9 7845HX (12C/24T, Zen 4, 64 GB DDR5)
Build: bitnet.cpp + Clang 20.1.8, AVX-512 VNNI+VBMI enabled
Protocol: 8 threads, 256 tokens, 5 runs per model, median selected
| Model | Params | Type | tok/s |
|---|---|---|---|
| BitNet-b1.58-large | 0.7B | post-quantized | 118.25 |
| Falcon-E-1B-Instruct | 1.0B | native 1-bit | 80.19 |
| Falcon3-1B-Instruct | 1.0B | post-quantized | 56.31 |
| Falcon-E-3B-Instruct | 3.0B | native 1-bit | 49.80 |
| BitNet-b1.58-2B-4T | 2.4B | native 1-bit | 37.76 |
| Falcon3-3B-Instruct | 3.0B | post-quantized | 33.21 |
| Falcon3-7B-Instruct | 7.0B | post-quantized | 19.89 |
| Falcon3-10B-Instruct | 10.0B | post-quantized | 15.12 |
Key finding: Models natively trained in 1-bit (Falcon-E) outperform post-training quantized models by +42% at 1B and +50% at 3B on identical hardware. Native ternary training matters more than absolute parameter count below 7B.
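Those two percentages follow directly from the table; a quick check using the median tok/s values above:

```python
# Native 1-bit vs. post-training quantized, same hardware, values from the table.
falcon_e_1b, falcon3_1b = 80.19, 56.31   # tok/s at ~1B params
falcon_e_3b, falcon3_3b = 49.80, 33.21   # tok/s at ~3B params

print(f"1B: +{(falcon_e_1b / falcon3_1b - 1) * 100:.0f}%")   # -> +42%
print(f"3B: +{(falcon_e_3b / falcon3_3b - 1) * 100:.0f}%")   # -> +50%
```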
Benchmarked on AMD Ryzen AI 9 HX 370 (Zen 5, native 512-bit AVX-512). Average improvement: +35% across 7 models.
| Model | Zen 4 (t/s) | Zen 5 (t/s) | Δ |
|---|---|---|---|
| Falcon-E-1B | 80.19 | 103.59 | +29% |
| Falcon3-1B | 56.31 | 78.16 | +39% |
| BitNet-2B-4T | 37.76 | 51.82 | +37% |
| Falcon-E-3B | 49.80 | 65.19 | +31% |
| Falcon3-3B | 33.21 | 46.77 | +41% |
| Falcon3-7B | 19.89 | 28.45 | +43% |
| Falcon3-10B | 15.12 | 19.39 | +28% |
Big.LITTLE CPUs require model-size-aware thread tuning: 1B peaks at 6 threads, 7B peaks at 20 threads.
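One way a node could act on that observation is a small size-based heuristic. The sketch below is anchored only to the two measured points (6 threads at 1B, 20 at 7B) and interpolates the rest, so treat the cutoffs as illustrative rather than tuning code from benchmarks/.

```python
def suggested_threads(params_billion: float, hw_threads: int = 24) -> int:
    """Illustrative heuristic: small models saturate early, so give them fewer
    threads; larger models benefit from more, up to the hardware limit.
    Anchored to the measured 1B -> 6 and 7B -> 20 points; the 3B value is interpolated."""
    if params_billion <= 1.5:
        return min(6, hw_threads)
    if params_billion <= 3.5:
        return min(12, hw_threads)
    return min(20, hw_threads)
```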
Full results and reproduction harness: benchmarks/.
```
┌────────────────────────────────────────────────────────────────────┐
│                        ARIA PROTOCOL v0.9.0                        │
├────────────────────────────────────────────────────────────────────┤
│ SERVICE    OpenAI-compatible API · Desktop App · CLI · Dashboard   │
├────────────────────────────────────────────────────────────────────┤
│ CONSENSUS  Provenance Ledger · Proof of Useful Work · Proof of     │
│            Sobriety · Consent Contracts                            │
├────────────────────────────────────────────────────────────────────┤
│ COMPUTE    SmartRouterV2 → ┬→ BitnetBackend   (port 8081)          │
│                            └→ LlamacppBackend (port 8082)          │
│            P2P Network (WebSocket, Kademlia DHT, NAT traversal,    │
│            Ed25519 auth, Protocol v2 with tier capabilities)       │
└────────────────────────────────────────────────────────────────────┘
```
The router is a pure function: it takes a query, runs it through the
classifier, picks (tier, model_id) from a routing table, and returns
a RoutingDecision with a fallback chain. The two backends — one
wrapping bitnet.cpp's llama-server, one wrapping mainline
llama.cpp's llama-server — are independent processes the router
dispatches to by model ID.
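In rough Python terms, the contract described above could be sketched like this. The routing table entries and helper names are illustrative; the real classifier and table live in `aria/smart_router.py`.

```python
from dataclasses import dataclass

@dataclass
class RoutingDecision:
    tier: str                 # "efficiency" | "quality" | "specialist"
    model_id: str             # catalog ID to dispatch to
    fallbacks: list[str]      # model IDs to try if the primary is busy or down

# Illustrative routing table: query class -> (tier, model_id). Entries are
# examples drawn from the catalog, not the repo's actual table.
ROUTING_TABLE = {
    "chat":      ("efficiency", "BitNet-b1.58-2B-4T"),
    "code":      ("specialist", "Qwen2.5-Coder-7B-Instruct"),
    "vision":    ("specialist", "MiniCPM-V"),
    "reasoning": ("specialist", "DeepSeek-R1-Distill"),
}

def route(query: str, classify) -> RoutingDecision:
    """Pure function: classify the query, look up (tier, model_id), and return
    a decision with a fallback chain (here: fall back to the efficiency tier)."""
    tier, model_id = ROUTING_TABLE.get(classify(query), ROUTING_TABLE["chat"])
    fallbacks = [] if tier == "efficiency" else ["BitNet-b1.58-2B-4T"]
    return RoutingDecision(tier=tier, model_id=model_id, fallbacks=fallbacks)
```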
Detailed view: docs/architecture.md.
P2P wire format: docs/protocol-spec.md.
| Document | Description |
|---|---|
| Getting Started | Install, first node, per-tier examples |
| Models | Full catalog, license matrix, HuggingFace URLs |
| NPU Support | Per-vendor roadmap (AMD / Intel / Qualcomm / Apple) |
| Architecture | Three-tier compute, dual backend, P2P v2 |
| Protocol Spec | WebSocket protocol v2 with HELLO message |
| Migration v0.8 → v0.9 | Breaking changes and how to upgrade |
| API Reference | OpenAI-compatible HTTP endpoints |
| Threat Model | Security analysis, tier-specific threats |
| Security Architecture | Defense-in-depth model |
| Smart Router | Routing table, classifier, fallback logic |
| Benchmarks | Methodology and full result sets |
| Roadmap | All versions and tasks |
Download latest release — Windows, macOS (Intel + Apple Silicon), Linux.
Built with Electron (primary) and Tauri 2.0 (alternative). Includes a 3-tier badge in the header, profile preset switcher, hardware panel, chat-centric interface, and 4 layout presets. Mode switch separates Chat (AI conversations) from Node (Dashboard, Models, Energy, Network).
See desktop/README.md for build instructions.
| Problem | Solution |
|---|---|
| AI requires expensive GPUs | 1.58-bit and 4-bit models run efficiently on any CPU |
| One model can't cover every workload | Three tiers (efficiency / quality / specialist) routed per query |
| Centralized inference burns energy | Distributed across existing consumer devices with sobriety proofs |
| Outputs are untraceable | Every inference recorded on the provenance ledger |
| Models depend on one provider | 8+ organizations contribute to the catalog (Microsoft, TII, Google, Alibaba, DeepSeek, OpenBMB, HuggingFace, Microsoft Research) |
| Licenses leak into redistribution | Hard license gate — only MIT / Apache 2.0 / TII Falcon 2.0 enter the catalog |
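The exact contents of a ledger entry are defined in the protocol and security docs, not here. Purely as an illustration of the provenance idea, a per-inference record might carry fields like these:

```python
# Hypothetical provenance record; field names are illustrative, not the
# ledger's real schema (see the Threat Model and Architecture docs).
provenance_record = {
    "request_id": "a3f9...",           # truncated for the example
    "node_id": "ed25519:4fd2...",      # signing node's public key
    "model_id": "BitNet-b1.58-2B-4T",
    "tier": "efficiency",
    "prompt_hash": "sha256:...",       # hash only, never the raw prompt
    "output_hash": "sha256:...",
    "tokens_generated": 256,
    "timestamp": "2026-01-15T12:00:00Z",
    "signature": "ed25519:...",
}
```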
Pull requests welcome. Areas where help is most useful:
- NPU stubs — wiring real inference for AMD XDNA, Intel NPU, Qualcomm Hexagon, or Apple ANE under `aria/backends/`
- New models — add a `ModelEntry` to `aria/model_catalog.py` (the license gate enforces P2P-compatible licenses at import)
- Routing improvements — better classifiers or routing tables in `aria/smart_router.py`
- Mobile — React Native or native iOS/Android app
- Docs and examples — every example in `examples/` is welcome
```bash
git clone https://github.com/spmfrance-cloud/aria-protocol.git
cd aria-protocol
pip install -e ".[dev]"
make test
```

- Fork the repository
- Create a feature branch (`git checkout -b feature/your-change`)
- Write tests for your changes
- Ensure the suite passes (`make test`)
- Open a pull request with a clear description
Code style: PEP 8, type hints on public APIs, focused functions, tests alongside the change.
```bash
make test                              # full suite
make test-verbose                      # verbose output
make test-cov                          # with coverage report
pytest tests/test_smart_router.py -v
```

MIT. See LICENSE.
```bibtex
@misc{aria2026,
  author = {Anthony MURGO},
  title  = {ARIA: Autonomous Responsible Intelligence Architecture},
  year   = {2026},
  url    = {https://github.com/spmfrance-cloud/aria-protocol}
}
```

- Microsoft Research BitNet — 1.58-bit ternary research and `bitnet.cpp`
- TII Falcon — Falcon-Edge and Falcon3 1.58-bit families
- ggml-org — mainline `llama.cpp` and Gemma 4 GGUF builds
- Qwen team — Qwen 3.5 and Qwen2.5-Coder
- DeepSeek — R1 distilled reasoning weights
- OpenBMB — MiniCPM-V vision specialist
One protocol. Every CPU. Every NPU. Every model.