A hierarchical multi-agent reasoning system that tackles complex QA, math, and logic tasks using Templated Graph Reasoning (TGR), Retrieval-Augmented Generation (RAG), swarm consensus, code-backed research, and web evidence. The system supports both template-guided DAG execution (TGR fast-path) and a standard Supervisor–Worker–Verifier pipeline.
- High-Level Overview
- Goals & Methodology
- System Architecture
- Execution Flow
- Core Components (Low-Level)
- Data & Configuration
- Running the System
- Concepts & Extensibility
MAS combines:
| Layer | Role |
|---|---|
| Supervisor | Decomposes problems into subtasks, critiques worker outputs, synthesizes final answers, enforces question-type-aware output policies. |
| Swarm Workers | Parallel multi-model ensemble (math, logic, QA) with cooperative reconciliation and early return on quorum. |
| Research Worker | Code-first “Ouroboros” loop: generate Python → execute in sandbox → observe → refine (timeout-aware). |
| Verifier | Independent numeric recomputation; returns a bare number when used. |
| Templated Graph Reasoning (TGR) | Buffer-of-Thought templates + Graph-of-Thought controller: structured DAG execution with definition / enumeration / calculation / aggregation / verification (and optional retrieval) nodes. |
| RAG | Hybrid fusion (semantic + BM25, RRF) over a LanceDB vector store; optional seed augmentation and mid-reasoning retrieval nodes. |
| Web Evidence | Optional real-time search (DuckDuckGo) and page fetching; evidence injected into workers and synthesis, with grounding checks. |
Entry point: solve_with_budget(problem, config_path, timeout_s, ...) in apps/mas/graph/plan_graph.py. It either runs the TGR fast-path (when a template matches with score ≥ 5) or the standard path (decompose → dispatch → critique → synthesize → verify).
- Solve complex, multi-step reasoning with verifiable outputs (numeric, boolean, multi-value, explanatory).
- Reduce hallucinations via template-guided graphs (TGR), consensus, and verification.
- Combine modalities: structured decomposition, multi-model consensus, code-backed experiments, RAG, and optional web evidence.
- Question-type-aware synthesis: numeric → bare number; boolean → yes/no; multi-value → compact text or JSON when requested; explanatory → prose; factual → concise answer with optional citations.
- Cold-start mitigation: TGR templates (and optional dynamic generation) provide domain-specific blueprints so the system does not start from scratch on math-heavy or procedural tasks.
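As a rough illustration of the question-type-aware output policy listed above, the sketch below maps a raw synthesis string to a final answer. The function name `format_answer`, the `want_json` flag, and the regex heuristics are assumptions made for this example and are not taken from `supervisor.py`; the boolean and numeric rules are deliberately crude.

```python
import json
import re

_NUMBER = r"-?\d+(?:\.\d+)?"


def format_answer(question_type: str, raw: str, want_json: bool = False) -> str:
    """Hypothetical formatter: reduce raw model text to the expected answer shape."""
    text = raw.strip()
    if question_type == "numeric":
        # Drop thousands separators, then keep the last number as the bare answer.
        numbers = re.findall(_NUMBER, text.replace(",", ""))
        return numbers[-1] if numbers else text
    if question_type == "boolean":
        # Use the first explicit yes/no/true/false token found in the text.
        match = re.search(r"\b(yes|no|true|false)\b", text.lower())
        if match:
            return "yes" if match.group(1) in ("yes", "true") else "no"
        return text
    if question_type == "multi_quantity" and want_json:
        # Collect "name = value" pairs into a compact JSON object.
        pairs = re.findall(rf"(\w+)\s*=\s*({_NUMBER})", text)
        return json.dumps({name: float(value) for name, value in pairs})
    # Explanatory and factual answers pass through as prose.
    return text
```

For instance, `format_answer("numeric", "The total is 1,234 apples.")` reduces the sentence to a bare number, while explanatory text is returned unchanged.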
apps/mas/
├── agents/ # Worker agents and supervisor
│ ├── supervisor.py # Decomposition, critique, synthesis, output policies
│ ├── websearch.py # Web evidence (DuckDuckGo, extraction, grounding)
│ ├── swarm_worker.py # Multi-model parallel consensus
│ ├── worker_math.py # Math prompts
│ ├── worker_logic.py # Logic prompts
│ ├── worker_qa.py # QA prompts
│ ├── worker_researcher.py # Code-first Ouroboros loop
│ ├── verifier.py # Numeric verification
│ └── latent/ # Optional inter-agent hidden state (embedding, attention)
│
├── graph/ # Orchestration and TGR
│ ├── plan_graph.py # solve_with_budget(), standard path, parallel dispatch
│ ├── template_distiller.py # Template selection (keyword + RAG + dynamic gen)
│ ├── template_generator.py # LLM-based dynamic template creation
│ ├── got_controller.py # Graph-of-Thought execution (TGR DAG)
│ ├── node_verifier.py # Type-specific node output verification
│ ├── backtrack_manager.py # Retry and state management for TGR
│ └── archetype_verifier.py # Domain-specific answer clamping
│
├── rag/ # Retrieval-Augmented Generation
│ ├── embeddings.py # Codestral embedder (1536-dim)
│ ├── indexer.py # Wikipedia → LanceDB ingestion
│ ├── retriever.py # Hybrid fusion search (RRF)
│ ├── chunker.py # Document chunking
│ └── evidence.py # RAGEvidencePack, quality detection, query expansion
│
├── learning/ # Optional distillation loop
│ ├── trace_recorder.py # Execution trace capture
│ ├── trace_store.py # Trace persistence
│ ├── pattern_analyzer.py # Pattern extraction from traces
│ ├── prompt_enhancer.py # Prompt augmentation with patterns
│ └── distillation_manager.py # Coordination
│
├── infra/ # LLM and env
│ └── openrouter/client.py # LLM API client (retries, optional caching)
│
├── tools/ # Execution and web
│ ├── executor.py # Sandboxed Python execution
│ ├── search.py # DuckDuckGo web search
│ ├── fetch.py # URL fetch, concurrent fetch, relevance extraction
│ └── timeline.py # Timeline extraction and constraint solving
│
├── configs/ # YAML config and TGR templates
│ ├── openrouter.yaml # Models, swarm, TGR, RAG, parallel, caching
│ ├── learning.yaml # Distillation, backtracking, latent
│ └── templates/*.json # TGR template blueprints
│
├── benchmarks/ # Evaluation
│ ├── gsm8k.py, hotpotqa.py, drop.py, gpqa.py, bbh.py
│ └── ...
└── web/ # UI
    └── chat_ui.py # Gradio chat (with web toggle)
flowchart TB
subgraph input [Input]
Q[User Query]
end
subgraph rag_layer [RAG & Web Layer]
RAG[RAG Template Distiller / Seed Retrieval]
WEB[Web Evidence Optional]
end
subgraph routing [Routing]
TGR_CHECK{TGR enabled & template score ≥ 5?}
end
subgraph tgr_path [TGR Fast-Path]
GOT[GoTController]
NODES[Definition / Enum / Calc / Agg / Verify / Retrieval Nodes]
GOT --> NODES
end
subgraph std_path [Standard Path]
DEC[Supervisor.decompose]
DISP[Dispatch: Swarm + ResearchWorker]
CRIT[Supervisor.critique]
SYN[Supervisor.synthesize]
DEC --> DISP --> CRIT --> SYN
end
subgraph post [Post-Processing]
VERIFY[Verifier for numeric]
FMT[Question-type-aware formatting]
VERIFY --> FMT
end
Q --> RAG
Q --> WEB
RAG --> TGR_CHECK
TGR_CHECK -->|yes| tgr_path
TGR_CHECK -->|no| std_path
tgr_path --> FMT
std_path --> post
FMT --> OUT[Final Answer]
sequenceDiagram
participant User
participant PG as plan_graph.solve_with_budget
participant WS as WebSearchAgent
participant Sup as SupervisorAgent
participant Swarm as SwarmWorkerManager
participant Res as ResearchWorker
participant Ver as VerifierAgent
User->>PG: problem
PG->>WS: build_evidence(problem) [if web_enabled]
WS-->>PG: WebEvidencePack
PG->>Sup: decompose(problem)
Sup-->>PG: Plan(SubTasks)
loop For each (ready) SubTask
alt role == research
PG->>Res: run(instruction, context)
Res-->>PG: result
else role in qa/logic/math
PG->>Swarm: run(instruction, role, context + web_evidence)
Swarm-->>PG: responses[]
end
end
PG->>Sup: critique(problem, results, web_evidence)
Sup-->>PG: critique_text
PG->>Sup: synthesize(problem, results, web_evidence)
Sup-->>PG: final_answer
opt Critique indicates issues
PG->>Sup: resynthesize_with_critique(...)
end
opt Numeric question
PG->>Ver: verify_numeric(problem, candidate, context)
Ver-->>PG: verified_candidate?
end
PG-->>User: final answer
flowchart LR
subgraph TGR [TGR DAG]
N1[definition] --> N2[enumeration]
N1 --> N3[calculation]
N2 --> N4[aggregation]
N3 --> N4
N4 --> N5[verification]
end
N1 & N2 & N3 --> Swarm[SwarmWorker]
N1 & N2 & N3 --> Res[ResearchWorker]
N4 --> Swarm
N5 --> Verifier[VerifierAgent]
Templates (e.g. hotel_toggle.json, spectral_cayley.json) define nodes (id, type, role, instruction) and edges. The GoTController topologically sorts nodes, runs same-level nodes in parallel where possible, and uses Swarm/ResearchWorker/Verifier by node type and role.
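The level-by-level ordering described above can be sketched with Kahn's algorithm. This is a minimal illustration, not the code in `got_controller.py`: the function name `topological_levels` and the representation of edges as `(source, target)` pairs are assumptions. Nodes that share a level have no dependencies on each other, so they are the ones that could run in parallel.

```python
from collections import defaultdict


def topological_levels(node_ids, edges):
    """Group DAG nodes into levels; every node depends only on earlier levels."""
    indegree = {node: 0 for node in node_ids}
    children = defaultdict(list)
    for src, dst in edges:
        children[src].append(dst)
        indegree[dst] += 1

    level = [node for node in node_ids if indegree[node] == 0]
    levels, seen = [], 0
    while level:
        levels.append(level)
        seen += len(level)
        next_level = []
        for node in level:
            for child in children[node]:
                indegree[child] -= 1
                if indegree[child] == 0:  # all prerequisites have run
                    next_level.append(child)
        level = next_level

    if seen != len(node_ids):
        raise ValueError("graph contains a cycle")
    return levels


# The five-node DAG from the diagram above.
NODES = ["N1", "N2", "N3", "N4", "N5"]
EDGES = [("N1", "N2"), ("N1", "N3"), ("N2", "N4"), ("N3", "N4"), ("N4", "N5")]
print(topological_levels(NODES, EDGES))
# [['N1'], ['N2', 'N3'], ['N4'], ['N5']]
```

Here the enumeration and calculation nodes (N2, N3) land in the same level, which matches the parallel execution described in the text.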
flowchart LR
subgraph RA_TGR [RA-TGR]
Q[Problem] --> TS[Template Selection + RAG]
TS --> SEEDS[Augment knowledge_seeds with RAG]
SEEDS --> GOT[GoTController]
GOT --> RN[Retrieval nodes optional]
RN --> GOT
end
- Template selection: RAG can boost template scores using retrieved context.
- Seed augmentation: Knowledge seeds in the template can be augmented with RAG retrieval before DAG execution.
- Retrieval nodes: A node of type `retrieval` (or role `rag`) runs HybridRetriever during graph execution and injects the results into context.
User Query
|
v
+------------------------------------------+
| RAG seed retrieval (optional) |
| Web evidence (optional) |
+------------------------------------------+
|
v
+------------------------------------------+
| TGR? (template score >= 5) |
+------------------------------------------+
| yes | no
v v
+-------------+ +----------------------------------+
| GoTController| | Supervisor.decompose -> Plan |
| (DAG nodes) | | Dispatch (Swarm + Research) |
| -> final | | Supervisor.critique |
+-------------+ | Supervisor.synthesize |
| | [grounding check if web evidence] |
| | Verifier (if numeric) |
| +----------------------------------+
| |
+---------------------+
|
v
Final Answer
- Template selection: RAGTemplateDistiller (or TemplateDistiller) selects a template from `configs/templates/` using keyword scoring plus an optional RAG boost; a template can optionally be generated dynamically if none matches.
- Score threshold: A template is used only if its score is ≥ 5 (this avoids misrouting factual QA to math templates).
- GoTController.run(): Template is hydrated into a DAG; knowledge seeds can be augmented with RAG; each node runs via Swarm or ResearchWorker; verification nodes call VerifierAgent.
- Early exit: If TGR returns a non-empty `final_answer`, it is returned and the standard path is skipped.
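To make the selection and threshold steps above concrete, here is a minimal sketch of keyword scoring with an optional RAG boost. Only the threshold of 5 comes from the text; the 2-points-per-tag weighting, the function names, the `rag_boosts` mapping, and the example tags are hypothetical and may differ from `template_distiller.py`.

```python
import re

SCORE_THRESHOLD = 5  # from the text: a template is used only if score >= 5


def score_template(problem: str, template: dict, rag_boost: float = 0.0) -> float:
    """Hypothetical scoring: 2 points per domain tag found in the problem, plus a boost."""
    words = set(re.findall(r"[a-z0-9]+", problem.lower()))
    hits = sum(1 for tag in template["domain_tags"] if tag.lower() in words)
    return 2 * hits + rag_boost


def select_template(problem: str, templates: list, rag_boosts: dict = None):
    """Return the best-scoring template, or None to fall back to the standard path."""
    rag_boosts = rag_boosts or {}
    best, best_score = None, float("-inf")
    for template in templates:
        boost = rag_boosts.get(template["template_id"], 0.0)
        score = score_template(problem, template, boost)
        if score > best_score:
            best, best_score = template, score
    if best is not None and best_score >= SCORE_THRESHOLD:
        return best
    return None


# Illustrative template entry; the tags are invented for this example.
TEMPLATES = [
    {"template_id": "hotel_toggle", "domain_tags": ["hotel", "toggle", "rooms", "doors"]},
]
```

With these example tags, a problem that mentions a hotel, rooms, doors, and toggling clears the threshold, while a factual question such as "Who wrote Hamlet?" scores 0 and returns `None`, which corresponds to taking the standard path.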
- Decomposition: Supervisor builds a `Plan` of subtasks (roles: math, logic, qa, research) with optional dependencies. Numeric/simulation patterns can auto-inject math/research tasks.
- Web evidence (optional): If `web_enabled`, WebSearchAgent builds a WebEvidencePack (intent, extracted answer, sources); it is used as context for workers and synthesis, and for a later grounding check.
- Dispatch: Independent subtasks can run in parallel. Research subtasks → ResearchWorker (code execution). Others → SwarmWorkerManager (multi-model, cooperative rounds, early termination on quorum).
- Context: Workers receive dependency context, optional RAG evidence (fusion search), optional web evidence, and (when web-enabled) fetched Wikipedia pages from RAG URLs.
- Critique: Supervisor critiques worker outputs for consistency.
- Synthesis: Supervisor synthesizes the final answer; question type (numeric, boolean, multi_quantity, explanatory, factual) drives output policy. If critique indicates issues, resynthesize_with_critique; JSON repair can be rejected when not allowed.
- Grounding check: If web evidence contains an extracted answer, the final answer must contain it or a strict repair / deterministic fallback is applied.
- Verification: For single-number questions, VerifierAgent independently recomputes; candidate can be replaced if the verifier disagrees.
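The grounding check in the list above can be sketched as a containment test followed by a deterministic fallback. This is an assumption-laden illustration: `enforce_grounding` is a hypothetical name, the whitespace and case normalization is a guess, and the LLM-based strict repair step mentioned in the text is omitted so that only the fallback is shown.

```python
def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so the comparison ignores formatting."""
    return " ".join(text.lower().split())


def enforce_grounding(final_answer: str, extracted_answer: str = None) -> str:
    """Keep the final answer only if it contains the web-extracted answer."""
    if not extracted_answer:
        return final_answer  # no web evidence, nothing to check
    if _normalize(extracted_answer) in _normalize(final_answer):
        return final_answer  # answer is grounded in the evidence
    # Deterministic fallback: return the extracted answer itself.
    return extracted_answer.strip()
```

A substring test is the simplest possible policy; it will reject paraphrases of the extracted answer, which is why the text describes a repair step before the fallback.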
- Overall: e.g. 300s default; configurable.
- TGR: Per-node timeout (e.g. 90s), overall TGR timeout (e.g. 240s).
- Standard path: Decomposition, per-subtask, synthesis, and verification each get a fraction of the remaining budget (see `plan_graph.py`).
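One way to hand each stage a fraction of the remaining budget is a small deadline tracker, sketched below. The `Budget` class, its method names, and the fractions in the usage note are assumptions for illustration; the actual split lives in `plan_graph.py` and is not reproduced here.

```python
import time


class Budget:
    """Hypothetical wall-clock budget that hands out slices of the time left."""

    def __init__(self, total_s: float, clock=time.monotonic):
        self._clock = clock  # injectable so the class can be tested without sleeping
        self._deadline = clock() + total_s

    def remaining(self) -> float:
        """Seconds left before the overall deadline, never negative."""
        return max(0.0, self._deadline - self._clock())

    def slice(self, fraction: float, floor_s: float = 0.0) -> float:
        """Timeout for one stage: a fraction of what is left, capped at what is left."""
        left = self.remaining()
        return min(left, max(floor_s, fraction * left))
```

A caller might create `Budget(300)` for the default overall timeout and then request, say, `budget.slice(0.2)` for decomposition before dispatching subtasks. Because each slice is computed from the time still remaining, a slow early stage automatically shrinks the timeouts of later stages.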
| Component | File | Purpose |
|---|---|---|
| SupervisorAgent | agents/supervisor.py | decompose(), critique(), synthesize(), resynthesize_with_critique(); question-type detection; output policies. |
| WebSearchAgent | agents/websearch.py | build_evidence(): intent detection, multi-hop queries, extraction, confidence; WebEvidencePack for workers and grounding. |
| SwarmWorkerManager | agents/swarm_worker.py | Parallel LLM calls, consensus, cooperative reconciliation, optional early termination when quorum agrees. |
| ResearchWorker | agents/worker_researcher.py | Ouroboros loop: generate code → execute (executor) → observe → refine; timeout-aware. |
| VerifierAgent | agents/verifier.py | verify_numeric(): independent low-temperature recomputation; returns bare number. |
| TemplateDistiller / RAGTemplateDistiller | graph/template_distiller.py | Keyword scoring + optional RAG boost + optional dynamic template generation. |
| GoTController | graph/got_controller.py | Load template → augment seeds (optional RAG) → topological execution of nodes (parallel by level) → Swarm/Research/Verifier per node. |
| NodeVerifier | graph/node_verifier.py | Type-specific checks on node outputs (definition, enumeration, calculation, aggregation, verification). |
| BacktrackManager | graph/backtrack_manager.py | Retry strategies and state management when node verification fails. |
| HybridRetriever | rag/retriever.py | Semantic + lexical search, RRF fusion. |
| CodestralEmbedder | rag/embeddings.py | Dense embeddings (e.g. 1536-dim) for RAG. |
| OpenRouterClient | infra/openrouter/client.py | LLM API with retries; optional response caching. |
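The RRF fusion used by HybridRetriever can be illustrated with the standard reciprocal rank fusion formula, where each document scores the weighted sum of `1 / (k + rank)` over the rankings it appears in. The sketch below is generic rather than a copy of `rag/retriever.py`: the function name, the per-ranking `weights` argument, and `k = 60` (a commonly used constant) are assumptions.

```python
from collections import defaultdict


def rrf_fuse(rankings, weights=None, k=60, top_k=5):
    """Fuse ranked lists of document ids with weighted reciprocal rank fusion."""
    weights = weights or [1.0] * len(rankings)
    scores = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (k + rank)
    # Highest fused score first; ties broken by id for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))[:top_k]


semantic_hits = ["a", "b", "c"]  # e.g. from the vector search
bm25_hits = ["b", "d", "a"]      # e.g. from the lexical search
print(rrf_fuse([semantic_hits, bm25_hits]))
# ['b', 'a', 'd', 'c']
```

Document `b` wins because it ranks near the top of both lists, even though it is first in only one of them. RRF needs only ranks, not raw scores, so the semantic and BM25 scores do not have to be put on a common scale.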
- Config: `apps/mas/configs/openrouter.yaml`: model family, models, swarm (models, min responses, cooperative rounds), TGR (enabled, templates path, node/overall timeouts), RAG (enabled, db path, top_k, RRF weights, augment_seeds), parallel (concurrent subtasks/TGR nodes/fetches, early termination, speculative prefetch), caching.
- Templates: `apps/mas/configs/templates/*.json`: template_id, domain_tags, description, knowledge_seeds, graph_blueprint (entrypoint, nodes, edges).
- Learning: `apps/mas/configs/learning.yaml`: distillation, backtracking, latent communication (optional).
- RAG store: LanceDB at `rag_db_path` (e.g. `apps/mas/data/wiki_lance`); ingestion via scripts (e.g. `scripts/index_wikipedia.py`).
- Sandbox: Python executor with configurable timeout for ResearchWorker code.
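The template fields listed above suggest a shape like the one below, paired with a small structural check. The top-level keys and the node fields (id, type, role, instruction) come from this document; the example values, the encoding of edges as `[source, target]` pairs, and the `validate_template` helper are assumptions and may not match the real template files.

```python
REQUIRED_KEYS = {"template_id", "domain_tags", "description",
                 "knowledge_seeds", "graph_blueprint"}
NODE_TYPES = {"definition", "enumeration", "calculation",
              "aggregation", "verification", "retrieval"}


def validate_template(template: dict) -> list:
    """Return a list of structural problems; an empty list means the template looks sane."""
    missing = REQUIRED_KEYS - template.keys()
    if missing:
        return [f"missing keys: {sorted(missing)}"]
    blueprint = template["graph_blueprint"]
    node_ids = {node["id"] for node in blueprint["nodes"]}
    errors = []
    for node in blueprint["nodes"]:
        if node["type"] not in NODE_TYPES:
            errors.append(f"node {node['id']}: unknown type {node['type']!r}")
    if blueprint["entrypoint"] not in node_ids:
        errors.append(f"entrypoint {blueprint['entrypoint']!r} is not a node")
    for src, dst in blueprint["edges"]:  # assumed [source, target] pairs
        for endpoint in (src, dst):
            if endpoint not in node_ids:
                errors.append(f"edge ({src}, {dst}): unknown node {endpoint!r}")
    return errors


# Invented example values; only the field names follow the text above.
EXAMPLE = {
    "template_id": "example_sum",
    "domain_tags": ["sum", "series"],
    "description": "Define terms, compute, then verify.",
    "knowledge_seeds": ["An arithmetic series has a constant difference."],
    "graph_blueprint": {
        "entrypoint": "n1",
        "nodes": [
            {"id": "n1", "type": "definition", "role": "math", "instruction": "Define the terms."},
            {"id": "n2", "type": "calculation", "role": "math", "instruction": "Compute the sum."},
            {"id": "n3", "type": "verification", "role": "math", "instruction": "Recompute and compare."},
        ],
        "edges": [["n1", "n2"], ["n2", "n3"]],
    },
}
```

Running a check like this when templates are loaded would surface dangling edges or mistyped node types before a DAG is executed.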
- Environment: Set `OPENROUTER_API_KEY` (or compatible) and ensure the Python dependencies from `requirements.txt` are installed.
- Chat UI: `python -m apps.mas.web.chat_ui --config apps/mas/configs/openrouter.yaml --server-name 127.0.0.1 --server-port 7860`. Use the web toggle to enable/disable web search and page fetching.
- Benchmarks:
  - Humanity’s Last Exam: `python scripts/test_humanity_exam.py` (respects config timeouts and TGR).
  - HotpotQA/GSM8K/etc.: run the corresponding module under `apps/mas/benchmarks/` or scripts in `apps/mas/scripts/`.
- RAG indexing: `python scripts/index_wikipedia.py --arrow-path <path> --max-docs 500` (see script and docs for options).
- Swarm consensus: Parallel LLM calls with reconciliation and optional early return on quorum.
- Code-backed reasoning: Prefer executable simulation/enumeration (ResearchWorker) for brittle domains.
- Verification: Independent numeric recomputation to catch drift.
- Template-guided graphs: Buffer-of-Thought templates drive Graph-of-Thought execution to avoid cold starts and enforce domain structure.
- RA-TGR: RAG augments template selection, seed augmentation, and optional retrieval nodes in the TGR DAG.
- Timeout & budgeting: Per-node and overall budgets keep the system responsive.
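The swarm consensus idea above (parallel calls with early return on quorum) can be sketched with a thread pool. This is a simplified stand-in for `swarm_worker.py`: the models are plain callables, answers are compared by exact string match after stripping, the cooperative reconciliation rounds are left out, and the name `swarm_answer` is hypothetical. `cancel_futures` requires Python 3.9 or later.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor, as_completed


def swarm_answer(models, prompt, quorum):
    """Query all models in parallel; return as soon as `quorum` of them agree."""
    votes = Counter()
    pool = ThreadPoolExecutor(max_workers=max(1, len(models)))
    try:
        futures = [pool.submit(model, prompt) for model in models]
        for future in as_completed(futures):
            try:
                answer = future.result().strip()
            except Exception:
                continue  # a failed model simply does not vote
            votes[answer] += 1
            if votes[answer] >= quorum:
                return answer  # early return on quorum
        # No quorum reached: fall back to the most common answer, if any.
        return votes.most_common(1)[0][0] if votes else None
    finally:
        # Do not wait for stragglers; drop calls that have not started yet.
        pool.shutdown(wait=False, cancel_futures=True)


# Stub "models" standing in for real LLM calls.
stub_models = [lambda p: "42", lambda p: "42 ", lambda p: "41"]
print(swarm_answer(stub_models, "What is 6 * 7?", quorum=2))
# 42
```

Calls that are already running when the quorum is reached are not interrupted; they finish in the background and their results are discarded.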
Extensibility:
- Add new templates under `configs/templates/` (nodes, edges, seeds).
- Improve the distiller (e.g. semantic or embedding-based retrieval for template selection).
- Swap or add models in `openrouter.yaml` without changing core code.
For deeper technical detail, see docs/ARCHITECTURE.md and docs/SYSTEM_DOCUMENTATION.md.