ARIA Protocol


Autonomous Responsible Intelligence Architecture

One protocol. Every CPU. Every NPU. Every model.

ARIA is a universal distributed inference protocol. A single peer-to-peer network routes queries to the right model on the right hardware, whether that's a 1.58-bit ternary model running on a low-power laptop, a standard 4B GGUF model on a desktop, or a code/reasoning/vision specialist on a machine with a real NPU. Nodes advertise the tiers they serve in a v2 handshake and the smart router matches every query to a peer that can answer it.
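On the wire, a node's tier capabilities travel in the v2 HELLO. The sketch below is illustrative only; the field names are hypothetical and the actual wire format is defined in docs/protocol-spec.md:

import json

# Hypothetical v2 HELLO payload: a node advertises which tiers it serves and a
# hardware snapshot, so the smart router can prefer peers that fit the query.
hello = {
    "type": "HELLO",
    "protocol_version": 2,
    "node_id": "ed25519:3f9a...",               # public-key identity (Ed25519 auth)
    "tiers": ["efficiency", "quality"],          # tiers this node serves
    "hardware": {"cpu": "zen4", "avx512": True, "npu": None},
}
print(json.dumps(hello, indent=2))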


Three tiers, one protocol

| Tier | Models | Backend | Memory floor | Use case |
|---|---|---|---|---|
| 🌱 Efficiency | BitNet b1.58, Falcon-E, Falcon3 1.58-bit (8 models) | bitnet.cpp | 0.4 GB | Always-on chat on any CPU; low-power laptops; background nodes |
| Quality | Gemma 4, Qwen 3.5, SmolLM3, Phi-4 mini (5 models) | mainline llama.cpp | 1.1 GB | Multilingual chat, longer context, multimodal (vision+audio) |
| 🛠️ Specialist | Qwen2.5-Coder, DeepSeek-R1-Distill, MiniCPM-V (3 models) | mainline llama.cpp | 1.9 GB | Code generation, chain-of-thought reasoning, vision |

A node operator picks one of five profile presets: minimal, efficient (default), balanced, full, or specialist_only. The profile decides which tiers light up: the default keeps nodes lean (efficiency only); balanced adds Quality on machines with 8 GB+ RAM; full enables all three tiers on 16 GB+ workstations.
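As a rough mental model only (the authoritative mapping lives in the node configuration; which tiers minimal and specialist_only enable is inferred here from their names, not from the code):

# Illustrative profile-to-tier mapping; RAM hints mirror the prose above.
PROFILES = {
    "minimal":         ["efficiency"],                            # leanest preset
    "efficient":       ["efficiency"],                            # default
    "balanced":        ["efficiency", "quality"],                 # 8 GB+ RAM
    "full":            ["efficiency", "quality", "specialist"],   # 16 GB+ RAM
    "specialist_only": ["specialist"],
}

print(PROFILES["balanced"])  # ['efficiency', 'quality']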

See docs/MODELS.md for the full catalog, license matrix, and HuggingFace URLs.


Quick start

Install

pip install aria-protocol

Start a node

# Default profile is "efficient" — 1.58-bit only, ~4 GB RAM, any CPU
aria node start --port 8765

# OpenAI-compatible API
aria api start --port 3000

Load a model

# Tier Efficiency (default)
aria model download BitNet-b1.58-2B-4T

# Tier Quality (after switching profile)
aria node profile set balanced
aria model download Gemma-4-E2B

# Tier Specialist (after switching profile)
aria node profile set full
aria model download Qwen2.5-Coder-7B-Instruct

Use with the OpenAI client

from openai import OpenAI

client = OpenAI(base_url="http://localhost:3000/v1", api_key="aria")

response = client.chat.completions.create(
    model="BitNet-b1.58-2B-4T",
    messages=[{"role": "user", "content": "What is quantum computing?"}],
)
print(response.choices[0].message.content)

The smart router picks a model automatically when none is specified — pass the catalog ID (BitNet-b1.58-2B-4T, Gemma-4-E2B, Qwen2.5-Coder-7B-Instruct, …) when you want a specific one.
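For example, to send a coding query straight to the specialist tier (assuming the full profile is active and the model has been downloaded as shown above):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:3000/v1", api_key="aria")

# Explicit catalog ID: bypasses auto-routing and targets the code specialist.
response = client.chat.completions.create(
    model="Qwen2.5-Coder-7B-Instruct",
    messages=[{"role": "user", "content": "Write a Python function that merges two sorted lists."}],
)
print(response.choices[0].message.content)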

For the full walkthrough see docs/getting-started.md.


Supported models

ARIA v0.9.0 ships with 16 models across three tiers. Every entry passes a strict license gate at import — the catalog only contains models under MIT, Apache 2.0, or TII Falcon 2.0 licenses to keep P2P redistribution friction-free. Models considered and rejected on licensing grounds (Llama 3.x, Gemma 3, Mistral research, Yi, Command-R) are listed in docs/MODELS.md with the rejection reasoning.

| Tier | # models | License surface |
|---|---|---|
| Efficiency | 8 | MIT, TII Falcon 2.0 |
| Quality | 5 | Apache 2.0, MIT |
| Specialist | 3 | Apache 2.0, MIT |

Adding a model is a pull request against aria/model_catalog.py. The gate refuses non-permissive licenses at import time so the roster cannot drift.
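A minimal sketch of how such an import-time gate can work; the names below are illustrative, not the actual contents of aria/model_catalog.py:

from dataclasses import dataclass

ALLOWED_LICENSES = {"MIT", "Apache-2.0", "Falcon-LLM-2.0"}  # P2P-redistributable only

@dataclass(frozen=True)
class ModelEntry:
    model_id: str
    tier: str      # "efficiency" | "quality" | "specialist"
    license: str

    def __post_init__(self):
        # Evaluated when the catalog module is imported, so a non-permissive
        # entry fails immediately instead of drifting into the roster.
        if self.license not in ALLOWED_LICENSES:
            raise ValueError(f"{self.model_id}: license {self.license!r} is not P2P-redistributable")

CATALOG = [
    ModelEntry("BitNet-b1.58-2B-4T", "efficiency", "MIT"),
    ModelEntry("Qwen2.5-Coder-7B-Instruct", "specialist", "Apache-2.0"),
]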

Full table: docs/MODELS.md.


Hardware

ARIA detects the local CPU/NPU at startup and ships the snapshot in the v2 peer hello so remote routers can prefer hardware-friendly peers:

aria hardware info

v0.9.0 ships NPU detection only — the protocol learns about AMD XDNA/XDNA2, Intel NPU, Qualcomm Hexagon, and Apple ANE devices, but inference itself still runs on the CPU. Real NPU acceleration is planned for v1.0, when the vendor-specific backend stubs (OpenVINO for Intel, QNN for Qualcomm, Core ML for Apple) get wired to real inference.

CPU detection covers Intel/AMD/Apple Silicon/Qualcomm Snapdragon, including AVX-512 capability used by bitnet.cpp for native 512-bit ternary kernels.
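For a quick manual sanity check of those flags on Linux (this is not ARIA's detection code, just a rough cross-check against what aria hardware info reports):

import platform

# Look for the AVX-512 flags that bitnet.cpp's 512-bit ternary kernels can use.
if platform.system() == "Linux":
    flags = open("/proc/cpuinfo").read()
    for flag in ("avx512f", "avx512_vnni", "avx512_vbmi"):
        print(f"{flag}: {'yes' if flag in flags else 'no'}")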

See docs/NPU_SUPPORT.md for the per-vendor roadmap and how to verify detection on your machine.


Benchmarks

Real numbers, reproducible from the repo. Each benchmark set is measured on a single host, so the numbers within a set are directly comparable; cross-host comparisons should treat absolute throughput as indicative.

v0.5.5 — Ecosystem benchmark (Zen 4)

Hardware: AMD Ryzen 9 7845HX (12C/24T, Zen 4, 64 GB DDR5)
Build: bitnet.cpp + Clang 20.1.8, AVX-512 VNNI+VBMI enabled
Protocol: 8 threads, 256 tokens, 5 runs per model, median selected

| Model | Params | Type | tok/s |
|---|---|---|---|
| BitNet-b1.58-large | 0.7B | post-quantized | 118.25 |
| Falcon-E-1B-Instruct | 1.0B | native 1-bit | 80.19 |
| Falcon3-1B-Instruct | 1.0B | post-quantized | 56.31 |
| Falcon-E-3B-Instruct | 3.0B | native 1-bit | 49.80 |
| BitNet-b1.58-2B-4T | 2.4B | native 1-bit | 37.76 |
| Falcon3-3B-Instruct | 3.0B | post-quantized | 33.21 |
| Falcon3-7B-Instruct | 7.0B | post-quantized | 19.89 |
| Falcon3-10B-Instruct | 10.0B | post-quantized | 15.12 |

Key finding: Models natively trained in 1-bit (Falcon-E) outperform post-training quantized models by +42% at 1B and +50% at 3B on identical hardware. Native ternary training matters more than absolute parameter count below 7B.
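The percentages follow directly from the table:

# Native-1-bit vs post-quantized throughput at matched parameter counts (tok/s from the table above).
pairs = {
    "1B": ("Falcon-E-1B-Instruct", 80.19, "Falcon3-1B-Instruct", 56.31),
    "3B": ("Falcon-E-3B-Instruct", 49.80, "Falcon3-3B-Instruct", 33.21),
}
for size, (native, n_tps, quant, q_tps) in pairs.items():
    print(f"{size}: {native} is +{(n_tps / q_tps - 1) * 100:.0f}% over {quant}")
# 1B: +42%    3B: +50%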

Zen 5 cross-generation (April 2026)

Benchmarked on AMD Ryzen AI 9 HX 370 (Zen 5, native 512-bit AVX-512). Average improvement: +35% across 7 models.

| Model | Zen 4 (t/s) | Zen 5 (t/s) | Δ |
|---|---|---|---|
| Falcon-E-1B | 80.19 | 103.59 | +29% |
| Falcon3-1B | 56.31 | 78.16 | +39% |
| BitNet-2B-4T | 37.76 | 51.82 | +37% |
| Falcon-E-3B | 49.80 | 65.19 | +31% |
| Falcon3-3B | 33.21 | 46.77 | +41% |
| Falcon3-7B | 19.89 | 28.45 | +43% |
| Falcon3-10B | 15.12 | 19.39 | +28% |

Big.LITTLE CPUs require model-size-aware thread tuning: 1B peaks at 6 threads, 7B peaks at 20 threads.
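A toy heuristic that captures the observation; the two breakpoints are the measured optima above, and everything between them is a guess, not part of the benchmark harness:

def suggested_threads(params_b: float) -> int:
    # Interpolates between the two measured optima (1B -> 6 threads, 7B -> 20 threads)
    # and clamps outside that range. Illustrative only.
    if params_b <= 1.0:
        return 6
    if params_b >= 7.0:
        return 20
    return round(6 + (params_b - 1.0) / 6.0 * 14)

print(suggested_threads(3.0))  # ~11 with this heuristic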

Full results and reproduction harness: benchmarks/.


Architecture

┌──────────────────────────────────────────────────────────────────┐
│                     ARIA PROTOCOL v0.9.0                          │
├──────────────────────────────────────────────────────────────────┤
│  SERVICE     OpenAI-compatible API · Desktop App · CLI · Dashboard│
├──────────────────────────────────────────────────────────────────┤
│  CONSENSUS   Provenance Ledger · Proof of Useful Work · Proof of  │
│              Sobriety · Consent Contracts                         │
├──────────────────────────────────────────────────────────────────┤
│  COMPUTE     SmartRouterV2 → ┬→ BitnetBackend  (port 8081)        │
│                              └→ LlamacppBackend (port 8082)       │
│              P2P Network (WebSocket, Kademlia DHT, NAT traversal, │
│              Ed25519 auth, Protocol v2 with tier capabilities)    │
└──────────────────────────────────────────────────────────────────┘

The router is a pure function: it takes a query, runs it through the classifier, picks (tier, model_id) from a routing table, and returns a RoutingDecision with a fallback chain. The two backends — one wrapping bitnet.cpp's llama-server, one wrapping mainline llama.cpp's llama-server — are independent processes the router dispatches to by model ID.
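In sketch form, with a toy keyword classifier and routing table standing in for the real ones in aria/smart_router.py:

from dataclasses import dataclass, field

@dataclass
class RoutingDecision:
    tier: str
    model_id: str
    fallbacks: list = field(default_factory=list)  # tried in order if the first choice is unavailable

# Toy routing table; the real table and classifier live in aria/smart_router.py.
ROUTING_TABLE = {
    "code":    ("specialist", "Qwen2.5-Coder-7B-Instruct"),
    "general": ("efficiency", "BitNet-b1.58-2B-4T"),
}

def classify(query: str) -> str:
    return "code" if any(k in query.lower() for k in ("def ", "function", "refactor", "bug")) else "general"

def route(query: str) -> RoutingDecision:
    tier, model_id = ROUTING_TABLE[classify(query)]
    return RoutingDecision(tier, model_id, fallbacks=[ROUTING_TABLE["general"]])

print(route("Fix this bug in my parser"))  # RoutingDecision(tier='specialist', ...)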

Detailed view: docs/architecture.md. P2P wire format: docs/protocol-spec.md.


Documentation

| Document | Description |
|---|---|
| Getting Started | Install, first node, per-tier examples |
| Models | Full catalog, license matrix, HuggingFace URLs |
| NPU Support | Per-vendor roadmap (AMD / Intel / Qualcomm / Apple) |
| Architecture | Three-tier compute, dual backend, P2P v2 |
| Protocol Spec | WebSocket protocol v2 with HELLO message |
| Migration v0.8 → v0.9 | Breaking changes and how to upgrade |
| API Reference | OpenAI-compatible HTTP endpoints |
| Threat Model | Security analysis, tier-specific threats |
| Security Architecture | Defense-in-depth model |
| Smart Router | Routing table, classifier, fallback logic |
| Benchmarks | Methodology and full result sets |
| Roadmap | All versions and tasks |

Desktop app

Download latest release — Windows, macOS (Intel + Apple Silicon), Linux.

Built with Electron (primary) and Tauri 2.0 (alternative). It includes a three-tier badge in the header, a profile preset switcher, a hardware panel, a chat-centric interface, and four layout presets. A mode switch separates Chat (AI conversations) from Node (Dashboard, Models, Energy, Network).

See desktop/README.md for build instructions.


Why ARIA?

| Problem | Solution |
|---|---|
| AI requires expensive GPUs | 1.58-bit and 4-bit models run efficiently on any CPU |
| One model can't cover every workload | Three tiers (efficiency / quality / specialist) routed per query |
| Centralized inference burns energy | Distributed across existing consumer devices with sobriety proofs |
| Outputs are untraceable | Every inference recorded on the provenance ledger |
| Models depend on one provider | 8+ organizations contribute to the catalog (Microsoft, TII, Google, Alibaba, DeepSeek, OpenBMB, HuggingFace, Microsoft Research) |
| Licenses leak into redistribution | Hard license gate — only MIT / Apache 2.0 / TII Falcon 2.0 enter the catalog |

Contributing

Pull requests welcome. Areas where help is most useful:

  • NPU stubs — wiring real inference for AMD XDNA, Intel NPU, Qualcomm Hexagon, or Apple ANE under aria/backends/
  • New models — add a ModelEntry to aria/model_catalog.py (the license gate enforces P2P-compatible licenses at import)
  • Routing improvements — better classifiers or routing tables in aria/smart_router.py
  • Mobile — React Native or native iOS/Android app
  • Docs and examples — new examples under examples/ are always welcome

Development setup

git clone https://github.com/spmfrance-cloud/aria-protocol.git
cd aria-protocol
pip install -e ".[dev]"
make test

Guidelines

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/your-change)
  3. Write tests for your changes
  4. Ensure the suite passes (make test)
  5. Open a pull request with a clear description

Code style: PEP 8, type hints on public APIs, focused functions, tests alongside the change.


Running tests

make test               # full suite
make test-verbose       # verbose output
make test-cov           # with coverage report
pytest tests/test_smart_router.py -v

License

MIT. See LICENSE.

Citation

@misc{aria2026,
  author = {Anthony MURGO},
  title  = {ARIA: Autonomous Responsible Intelligence Architecture},
  year   = {2026},
  url    = {https://github.com/spmfrance-cloud/aria-protocol}
}

Acknowledgments


One protocol. Every CPU. Every NPU. Every model.

About

Peer-to-peer distributed AI inference using 1-bit quantized models. CPU-only, 70-82% energy savings, 103+ tokens/sec. Validated on Zen 4 & Zen 5 (+35% cross-gen improvement).
