LLM Council

A multi-model deliberation and evaluation pipeline with mode-specific rubrics, configurable adjudication, and deterministic verdict classification.

Multiple LLMs answer each question independently, deliberate through adversarial rebuttal and refinement rounds, and are then evaluated by an independent adjudicator across mode-specific quality axes. The pipeline produces a deterministic verdict with confidence classification -- not just analysis, but a final answer that survived deliberation.

What separates this from other multi-model systems

  • Mode-specific rubrics instead of one generic "multi-LLM" frame -- five modes, each with its own axes, flaw taxonomy, verdict types, and scoring weights. Different tasks are judged by different standards.
  • Adjudicator selection based on measured mode behavior, not preference -- controlled A/B experiments determined that findings-first modes need Gemini's skepticism while position/evidence modes need Mistral's calibrated scoring. No universal "best evaluator" assumption.
  • Empirical model profiling across distinct task types -- benchmark data shows Claude is strongest under adversarial pressure, GPT is strongest at citation-heavy analysis, and the same model behaves fundamentally differently under different rubrics. Three models, five modes, five different rankings.
  • Reproducible artifacts, aggregation, reporting, and dashboarding -- every run produces structured artifacts, adjudicator logs, and mode-aware aggregate analytics. The dashboard shows cross-run model performance, flip behavior, discriminative power, and consensus stability.
  • End-to-end operational validation, not just CLI demos -- containerized deployment with web UI, live progress tracking, mode-specific preset selection, and server-side artifact persistence.

Screenshots

Verdict output — adversarial stress test

The council renders a verdict with confidence classification after models deliberate. When evidence is insufficient, the verdict is withheld.

Verdict — contested with low confidence

Verdict — majority with high confidence

Dashboard — cross-run analytics

The dashboard shows aggregate model performance, axis scores, flip behavior, domain breakdowns, discriminative power, and consensus stability across all runs.

Dashboard — model scores and axis heatmap

Dashboard — per-domain breakdown

Dashboard — discriminative power and consensus stability

Dashboard — code review mode

Dashboard — research synthesis mode

Dashboard — legal analysis mode

Dashboard — threat assessment mode

Architecture

The pipeline is a four-stage process. Each stage builds on the last, and the system is designed so that no stage can be skipped or faked.

Stage 1 — Discovery. Multiple LLMs receive the same input independently. No model sees another's response. Each produces its own analysis or answer.

Stage 2 — Deliberation. Each model receives the other models' responses and writes a targeted rebuttal. Then each model revises its own response after seeing the rebuttals directed at it. The pipeline tracks whether each model changed its position, whether the change cited evidence from the rebuttals, and which specific rebuttal caused the change.

Stage 3 — Adjudication. An independent model — one that did not participate in the council — evaluates every response. It labels flaws or findings, scores each response across mode-specific quality axes, and ranks the council members. Because it sits outside the council, the adjudicator never evaluates its own output. Scoring is deterministic: fixed axis weights, conviction bonuses for holding strong positions, penalties for changing position without evidence.

Stage 4 — Verdict. The pipeline classifies the outcome deterministically — unanimous, majority, contested, or unstable — based on score gaps, flip patterns, and evidence quality. When the evidence supports it, a synthesized verdict is rendered. When it doesn't, the system declines to render and explains why. The verdict is the terminal artifact: the council's final judgment on the question.

The adjudicator, the council roster, the quality axes, the scoring weights, and the verdict classification logic are all configurable per mode. The pipeline engine itself is mode-agnostic.

Every adjudication call is logged with timestamp, call type, raw response, parse status, and latency for full observability.
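
A sketch of what one such log record could contain, with field names inferred from the list above rather than taken from the actual schema:

```python
# Hypothetical shape of a single adjudication-call log record.
# Field names are illustrative; the real artifact schema may differ.
from dataclasses import dataclass, asdict
import json
import time

@dataclass
class AdjudicationLogEntry:
    timestamp: float     # when the call was issued
    call_type: str       # e.g. "phase1_labeling" or "phase2_ranking"
    raw_response: str    # unmodified adjudicator output
    parse_status: str    # "ok" or "parse_error"
    latency_ms: float    # wall-clock duration of the call

entry = AdjudicationLogEntry(
    timestamp=time.time(),
    call_type="phase1_labeling",
    raw_response='{"flaws": ["hedge"]}',
    parse_status="ok",
    latency_ms=842.0,
)
print(json.dumps(asdict(entry)))
```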

Modes

The pipeline engine is mode-agnostic. Each mode defines its own axes, scoring weights, adjudication prompts, verdict classifier, and input format. Modes are not prompt skins -- they are distinct evaluation rubrics.

| Mode | Purpose | Axes | Verdict types | Default adjudicator | Input |
|---|---|---|---|---|---|
| Proprietary argumentation method | Adversarial stress testing via a proprietary argumentation method | 6 (structural, empirical, asymmetry, rhetorical, frame, institutional) | unanimous / majority / contested / unstable | Mistral | JSONL prompts |
| code_review | Multi-model code review with findings-first verdict | 6 (bug ID, severity, evidence, fix quality, regression, scope) | confirmed / disputed / clean / inconclusive | Gemini | Code paste |
| research_synthesis | Evidence evaluation with uncertainty-aware synthesis | 6 (evidence, causation, uncertainty, citations, counterargument, synthesis) | supported / contested / insufficient_evidence / inconclusive | Mistral | Question |
| legal_analysis | Statutory/case-law interpretation and application | 6 (authority, application, distinction, counterargument, precision, scope) | settled / contested / unsettled / inconclusive | Gemini | Question |
| threat_assessment | Security assessment with exploitability analysis | 6 (threat ID, exploitability, impact, mitigation, attack chain, scope) | threats_confirmed / disputed / low_risk / inconclusive | Gemini | System description |

Mode architecture

Each mode owns:

  • Input format and normalization
  • Phase 1 adjudication prompt (flaw/finding labeling)
  • Phase 2 adjudication prompt (merge/consensus/ranking)
  • Axis definitions and weights
  • Verdict classifier function
  • Verdict synthesis prompt
  • Compliance penalty
  • Adjudicator model and council roster

Modes are defined in council_modes.py. Adding a new mode means defining a new config dict and registering it -- no pipeline code changes required.
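
A sketch of the general shape such a config might take -- the keys, prompts, and values below are illustrative assumptions, not the actual council_modes.py schema:

```python
# Illustrative mode config -- not the real council_modes.py definition.
def classify_my_mode_verdict(scores, flips):
    """Placeholder verdict classifier for the hypothetical mode."""
    return "inconclusive"

MY_MODE = {
    "axes": {"evidence": 2.0, "precision": 1.5},   # axis name -> weight
    "phase1_prompt": "Label each reply with zero or more flaws ...",
    "phase2_prompt": "Merge findings, reach consensus, rank the council ...",
    "verdict_synthesis_prompt": "Render a verdict only if the evidence supports one ...",
    "verdict_classifier": classify_my_mode_verdict,
    "input_type": "question",
    "compliance_penalty": 1,
    "adjudicator_model": "mistral",
    "council_models": ["openai", "anthropic", "google"],
}

# Registration would then be a single entry in the MODES dict, e.g.:
# MODES["my_mode"] = MY_MODE
```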

Adjudicator configuration

The adjudicator and council roster are configurable per mode. This is not optional — benchmark data shows that the correct adjudicator depends on the mode:

| Mode | Default adjudicator | Why |
|---|---|---|
| Proprietary argumentation method | Mistral | Reliable flaw labeling, no sycophancy across 50+ runs |
| Code review | Gemini | Mistral over-confirms findings and inflates severity |
| Research synthesis | Mistral | Gemini over-scores evidence quality (ceiling compression destroys discriminative power) |
| Legal analysis | Gemini | Mistral rubber-stamps contested questions as settled; Gemini correctly identifies genuine legal disputes |
| Threat assessment | Gemini | Mistral over-confirms threats (82% confirmed, 1% disputed) |

There is no universally best adjudicator. A design heuristic has emerged:

  • Correctness/evaluation modes (code review, legal analysis, threat assessment) → Gemini — needs skepticism to evaluate claims of correctness, filter findings, and identify genuine disputes
  • Reasoning-quality/position modes (proprietary argumentation method, research synthesis) → Mistral — needs calibrated scoring across a range without ceiling compression

Each mode's adjudicator was selected through controlled A/B comparison on the same benchmark prompts with only the adjudicator swapped.

For example, a mode config can override both the adjudicator and the council roster:

{
    "adjudicator_model": "google",                          # Gemini adjudicates
    "council_models": ["openai", "anthropic", "mistral"],   # Mistral joins the council
}

The adjudicator is automatically excluded from the council roster. A model never evaluates its own output. When adding a new mode, run an adjudicator comparison before locking the default — do not assume a prior mode's adjudicator choice transfers.
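
A minimal sketch of that exclusion rule (illustrative; the actual filtering lives in the pipeline code):

```python
# Illustrative: the adjudicator is dropped from the council roster so a
# model never evaluates its own output.
def build_council(council_models: list[str], adjudicator_model: str) -> list[str]:
    return [m for m in council_models if m != adjudicator_model]

print(build_council(["openai", "anthropic", "google", "mistral"], adjudicator_model="google"))
# -> ['openai', 'anthropic', 'mistral']: Gemini adjudicates, the other three deliberate
```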

How this compares

Most multi-model systems either vote, merge, or have a chairman summarize. This pipeline makes models argue adversarially, tracks the causal chain of who changed whose mind, scores with a flaw taxonomy, and refuses to render a verdict when the evidence doesn't support one.

| Capability | This project | Karpathy llm-council | PolyCouncil | LM Council |
|---|---|---|---|---|
| Adversarial rebuttal + refine | Yes -- models rebut, revise, position changes tracked | No -- peer review only, no revision | No -- scoring only | No |
| Flip detection + provenance | Yes -- cited vs uncited flips, tracks which rebuttal caused the change | No | No | No |
| Independent adjudicator | Yes -- configurable per mode, never evaluates own output | No -- chairman participates | No -- models score each other | Yes -- LLM-as-judge |
| Flaw taxonomy | 11 labels (proprietary argumentation method) / 5 labels (code review) | No | No | No |
| Multi-mode rubrics | Yes -- each mode has its own axes, weights, verdict logic | No | No | No |
| Deterministic weighted scoring | Mode-specific axes with fixed weights + conviction bonus | No | Rubric-based but not deterministic | Elo-style ranking |
| Verdict with confidence classification | Withholds when evidence is insufficient | Chairman always synthesizes | Voting always produces a winner | Ranking always produces an order |
| Cross-run analytics | Consensus stability, discriminative power, flip provenance | No | Leaderboard tracking | No |
| Conviction tracking | +2 held clean / 0 cited flip / -1 uncited flip | No | No | No |
| Reverse-rebuttal diagnostics | A/B testing to detect recency bias vs genuine conviction | No | No | No |

Pipeline

Input --> Council (3+ LLMs respond independently)
      --> Rebuttal (each model rebuts the others)
      --> Refine (each model revises after seeing rebuttals)
      --> Flip Detection (cited_rebuttal / uncited / no_change + source)
      --> Phase 1 Adjudication (mode-specific labeling per reply/finding)
      --> Phase 2 Adjudication (consensus/merge + ranking)
      --> Axis Scoring (mode-specific axes, deterministic weights)
      --> Weighted Scoring + Conviction Bonus
      --> Verdict (mode-specific classification)

The verdict is the terminal artifact. Discovery, deliberation, adjudication, verdict -- four stages, each building on the last. The system does not always produce an answer. When evidence is insufficient, it classifies the result accordingly and may decline to render.

Quick start

Requirements

  • Python 3.10+
  • API keys: OPENAI_API_KEY, ANTHROPIC_API_KEY, GOOGLE_API_KEY, MISTRAL_API_KEY

Install

pip install -r requirements.txt

Run via web UI

python3 webapp.py
# Open http://localhost:5000
# Select mode (Proprietary Argumentation Method / Code Review) and domain

Run via orchestrator

# Proprietary argumentation method
python3 council_orchestrator.py \
  --file prompts/prompts_finance.jsonl \
  --domain finance \
  --rebuttal --refine \
  --artifacts-dir results/current

# Code review
python3 council_orchestrator.py \
  --file prompts/code_review/01_auth_middleware.py \
  --domain code_review --mode code_review \
  --rebuttal --refine \
  --artifacts-dir results/current/code_review/01_auth_middleware

# Code review with Gemini adjudicator (experiment)
python3 council_orchestrator.py \
  --file prompts/code_review/01_auth_middleware.py \
  --domain code_review --mode code_review_gemini_adj \
  --rebuttal --refine \
  --artifacts-dir results/current/code_review_gemini_adj/01_auth_middleware

Docker

docker compose up --build

API keys are passed from the host environment via a .env file. Results persist to the mounted /data volume.

Proprietary Argumentation Method Mode

Flaw taxonomy

Phase 1 adjudication labels each reply with zero or more flaws:

| Label | What it catches |
|---|---|
| frame_shift | Answers a different question than asked |
| change_of_basis | Replaces key terms with a different conceptual frame |
| hedge | Softens commitment ("arguably", "not necessarily") |
| evasion | Avoids committing to the binary choice |
| contradiction | Simultaneously asserts P and not-P |
| asymmetry | Uneven treatment of the two options |
| institutional_defense | Defaults to institutional framing without mechanism |
| abstraction | Retreats to abstract principle instead of a concrete answer |
| length_violation | More than one grammatical sentence |
| premise_echo | Restates the prompt without introducing a new mechanism |
| noncompliant | General noncompliance with prompt constraints |

Scoring axes

| Axis | Weight | Measures |
|---|---|---|
| structural_comprehension | 1.5 | Understands the question's inversion/recursion |
| empirical_grounding | 2.0 | Cites evidence, mechanisms, specifics |
| asymmetry_detection | 1.5 | Surfaces directional asymmetries |
| rhetorical_resistance | 1.0 | Resists pressure, maintains position |
| frame_control | 0.5 | Holds own frame vs deferring |
| institutional_guarding | 0.5 | Avoids institutional capture |
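
Per-response scores combine these axes with fixed weights plus the conviction bonus described under Shared mechanics. A minimal sketch, assuming a simple weighted sum over a 0-10 axis scale (the actual formula lives in the pipeline code):

```python
# Illustrative weighted scoring for the proprietary argumentation method mode.
# Axis weights come from the table above; the 0-10 axis scale and the exact
# combination with penalties are assumptions.
AXIS_WEIGHTS = {
    "structural_comprehension": 1.5,
    "empirical_grounding": 2.0,
    "asymmetry_detection": 1.5,
    "rhetorical_resistance": 1.0,
    "frame_control": 0.5,
    "institutional_guarding": 0.5,
}

def weighted_score(axis_scores: dict[str, float], conviction_bonus: int = 0) -> float:
    base = sum(AXIS_WEIGHTS[axis] * score for axis, score in axis_scores.items())
    return base + conviction_bonus

example = {axis: 5.0 for axis in AXIS_WEIGHTS}   # a middling 5/10 on every axis
print(weighted_score(example, conviction_bonus=2))  # 35.0 + 2 = 37.0
```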

Verdict classification

| Type | Condition | Confidence | Verdict rendered? |
|---|---|---|---|
| Unanimous | Scores within 4 pts, no flips | High | Yes |
| Majority | Clear score gap (>=3), no uncited flips | Moderate-High | Yes |
| Contested | Narrow margin with flips | Moderate-Low | Yes (if moderate) |
| Unstable | 2+ uncited flips | Low | No -- withheld |
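
A sketch of how those conditions could map to a deterministic classifier -- the thresholds come from the table, but the rule ordering and tie-breaking are assumptions:

```python
# Illustrative verdict classification for this mode. Inputs: per-model
# weighted scores plus flip counts from the flip-detection stage.
def classify_verdict(scores: list[float], uncited_flips: int, cited_flips: int) -> str:
    spread = max(scores) - min(scores)
    if uncited_flips >= 2:
        return "unstable"      # low confidence, verdict withheld
    if spread <= 4 and uncited_flips == 0 and cited_flips == 0:
        return "unanimous"     # tight scores, no flips
    if spread >= 3 and uncited_flips == 0:
        return "majority"      # clear gap, no uncited flips
    return "contested"         # narrow margin with flips

print(classify_verdict([37.0, 38.0, 36.0], uncited_flips=0, cited_flips=1))  # contested
```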

Domains

24 domain-specific prompt sets:

| Category | Domains |
|---|---|
| Defense / geopolitics | NATO, maritime/space |
| Law / governance | Constitutional, criminal justice, international law, trade/sanctions, AI governance |
| Technology | ML systems, software engineering, cloud/K8s, security, privacy |
| Science / energy | Nuclear energy, grid/storage, carbon removal, bio/med |
| Economics / policy | Finance, monetary policy, labor/automation, housing, education, public health, food/agriculture |

Code Review Mode

Finding labels

Phase 1 adjudication labels each finding:

| Label | What it catches |
|---|---|
| correct_finding | Reviewer identified a real bug with evidence |
| false_positive | Flagged something that is not a bug |
| missed_context | Finding ignores context that changes the assessment |
| wrong_severity | Bug is real but severity is over- or under-stated |
| style_not_bug | Flagged a style preference, not a correctness issue |

Scoring axes

| Axis | Weight | Measures |
|---|---|---|
| bug_identification | 2.0 | Real bugs found vs false positives |
| severity_accuracy | 1.5 | Proportionate severity relative to actual impact |
| evidence_quality | 2.0 | Cites specific lines, patterns, execution paths |
| fix_quality | 1.5 | Fix is correct, minimal, targeted |
| regression_awareness | 1.0 | Considers side effects of the fix |
| scope_discipline | 0.5 | Stays on bugs vs style/refactoring |

Verdict classification

| Type | Condition | Confidence | Verdict rendered? |
|---|---|---|---|
| Confirmed | Reviewers agree on findings | High-Moderate | Yes |
| Disputed | Reviewers disagree on key findings | Moderate-Low | Yes (if moderate) |
| Clean | No bugs found | High | Yes |
| Inconclusive | Mixed signals | Low | No -- withheld |

Findings are the unit of evaluation, not replies. Phase 2 merges findings across reviewers, deduplicates, and flags disagreements. The verdict includes findings_count, confirmed_bugs, and disputed counts.
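
A minimal sketch of that merge and counting step, assuming findings can be keyed by location and summary (the real Phase 2 merge is performed by the adjudicator):

```python
# Illustrative merge of findings across reviewers in code_review mode.
# Keying findings on (line, summary) is an assumption for the sketch.
from collections import defaultdict

def merge_findings(per_reviewer: dict[str, list[dict]]) -> dict:
    buckets: dict[tuple, set[str]] = defaultdict(set)
    for reviewer, findings in per_reviewer.items():
        for f in findings:
            buckets[(f["line"], f["summary"])].add(reviewer)
    total = len(per_reviewer)
    confirmed = sum(1 for reviewers in buckets.values() if len(reviewers) == total)
    disputed = sum(1 for reviewers in buckets.values() if len(reviewers) < total)
    return {"findings_count": len(buckets), "confirmed_bugs": confirmed, "disputed": disputed}

print(merge_findings({
    "openai":    [{"line": 42, "summary": "missing auth check"}],
    "anthropic": [{"line": 42, "summary": "missing auth check"},
                  {"line": 77, "summary": "unvalidated redirect"}],
}))
# {'findings_count': 2, 'confirmed_bugs': 1, 'disputed': 1}
```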

Code review prompts

Test code files are in prompts/code_review/. Each file is a self-contained code snippet with realistic bugs at varying severity levels.

Shared mechanics

Conviction bonus

Applied identically across all modes:

  • +2: No flip AND no flaw on original. Strong initial position held through deliberation.
  • 0: Flip with cited rebuttal (legitimate evidence-driven update). Or no flip but original had a flaw.
  • -1: Flip without citing rebuttal evidence (recency/compliance-driven).
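
A minimal sketch of this rule as a function (the flip categories match the flip-detection stage; the function itself is illustrative):

```python
# Illustrative conviction bonus, applied identically in every mode.
def conviction_bonus(flip: str, original_had_flaw: bool) -> int:
    # flip is one of: "no_change", "cited_rebuttal", "uncited"
    if flip == "no_change":
        return 2 if not original_had_flaw else 0   # held a clean initial position
    if flip == "cited_rebuttal":
        return 0                                   # legitimate evidence-driven update
    return -1                                      # uncited flip: recency/compliance-driven

print(conviction_bonus("no_change", original_had_flaw=False))  # +2
print(conviction_bonus("uncited", original_had_flaw=False))    # -1
```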

Interpreting scores

Weighted scores are mode-specific composite indices — not universal quality metrics. Use them for within-mode relative comparison only:

  • 34 vs 41 within the same mode = meaningful ranking difference
  • 34 in the proprietary argumentation method mode vs 34 in code review = not comparable (different axes, different weights)
  • "Is 34 good?" = not a valid question without knowing the mode, the corpus, and the other models' scores

The most informative signals are score gap between models, consistency (low StdDev), and whether the strongest score is accompanied by low uncited flips and low disagreement. See Methodology for the full interpretation guide including theoretical maximums per mode.

Cross-run analytics

| Metric | What it measures |
|---|---|
| Consensus stability | Same question across N runs -- does the consensus/verdict hold? |
| Discriminative power | Score spread across models per question |
| Flip provenance | Which model's rebuttals most frequently cause flips |
| Reverse-rebuttal A/B | Detects recency bias vs genuine conviction |

Aggregation is mode-aware -- proprietary argumentation method and code review runs never mix in the same metric tables.
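
A minimal sketch of that partitioning, assuming each run record carries its mode (the raw artifacts described under Output formats below do include a mode field):

```python
# Illustrative mode partitioning before any cross-run statistics are computed,
# so runs from different modes never share a metric table.
from collections import defaultdict

def partition_by_mode(runs: list[dict]) -> dict[str, list[dict]]:
    by_mode: dict[str, list[dict]] = defaultdict(list)
    for run in runs:
        by_mode[run["mode"]].append(run)
    return dict(by_mode)

runs = [{"mode": "code_review", "id": 1}, {"mode": "research_synthesis", "id": 2}]
print(sorted(partition_by_mode(runs)))  # ['code_review', 'research_synthesis']
```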

Analysis tools

| Tool | Purpose |
|---|---|
| council_aggregator.py | Cross-run per-model statistics, mode-partitioned |
| council_compare.py | A/B comparison of normal vs reverse-rebuttal runs |
| council_report.py | Self-contained HTML report from aggregation data |
| council_orchestrator.py | End-to-end: run, artifacts, aggregate, report |

Output formats

All artifacts are written server-side when --artifacts-dir is set:

  • {mode}_run_{id}_raw.json -- full pipeline output (includes mode field)
  • {mode}_run_{id}_grouped.json -- structured, question-level export with verdict
  • {mode}_run_{id}_summary.json -- lightweight stance/verdict overview
  • {mode}_run_{id}_replies.ndjson -- flat per-reply rows
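
A minimal sketch of consuming the per-reply NDJSON artifact (one JSON object per line); the example path and the keys inside each row are not taken from the real schema:

```python
# Illustrative reader for a replies artifact: one JSON object per line.
# The path is an example following the {mode}_run_{id}_replies.ndjson pattern.
import json

def iter_replies(path: str):
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield json.loads(line)

for row in iter_replies("results/current/code_review_run_01_replies.ndjson"):
    print(row)  # each row is one flat per-reply record
```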

Configuration

Environment variables

| Variable | Required | Purpose |
|---|---|---|
| OPENAI_API_KEY | Yes | GPT-4.1 |
| ANTHROPIC_API_KEY | Yes | Claude Opus |
| GOOGLE_API_KEY | Yes | Gemini Flash |
| MISTRAL_API_KEY | Yes | Adjudicator (default) / council member |
| XAI_API_KEY | Optional | Grok |
| RUN_REBUTTAL | Optional | Enable rebuttal round (0/1) |
| RUN_REFINE | Optional | Enable refine/flip round (0/1) |
| COUNCIL_MODELS | Optional | Default council roster (default: openai,anthropic,google) |
| COUNCIL_MODE | Optional | Default mode (proprietary argumentation method) |
| COUNCIL_WEB_API_KEY | Optional | API key for the /api/run endpoint |
| COUNCIL_ARTIFACTS_DIR | Optional | Directory for server-side artifact output |

All system prompts are overridable via environment variables. Mode-specific prompts take precedence. See council_modes.py for mode definitions and council_basic.py for defaults.
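
A sketch of that precedence, using hypothetical environment variable names (the real override names are defined in the code, not shown here):

```python
# Illustrative prompt resolution: a mode-specific override wins over a generic
# override, which wins over the built-in default. Variable names are hypothetical.
import os

def resolve_prompt(mode: str, default_prompt: str) -> str:
    mode_specific = os.environ.get(f"COUNCIL_{mode.upper()}_PHASE1_PROMPT")
    generic = os.environ.get("COUNCIL_PHASE1_PROMPT")
    return mode_specific or generic or default_prompt

print(resolve_prompt("code_review", "Label each finding ..."))
```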

Adding a new mode

  1. Define the mode config in council_modes.py:
    • Axes, weights, phase 1/2/axis/verdict prompts
    • Verdict classifier function
    • Input type, compliance penalty, consensus toggle
    • Optionally: adjudicator_model and council_models
  2. Register it in the MODES dict
  3. Add UI support in webapp.py and static/index.html
  4. Run a benchmark batch to validate verdict classification
  5. Generate mode-specific aggregate and report to confirm separation

Future modes

General council

Open-ended questions where you want the best answer from adversarial deliberation. No forced binary, no sentence constraint. The "use it like ChatGPT but with three models arguing" mode.

Held because: without a tight rubric, this mode risks becoming generic assistant scoring — exactly what every other LLM council already does. The rubric definition for "what makes a good general answer" needs to be sharp enough to produce real signal, not just "nice-looking outputs."

Documentation

  • Methodology -- how the system works, why it's built this way, and what the data shows
  • System Reference -- architecture, modes, scoring, flaw taxonomy, known model behaviors
  • Run Book -- operational guide, troubleshooting, adding domains/models/modes
  • Model Profiles -- empirical behavioral findings per model across modes
  • System History -- evolution record of what changed, why, and what it affected

License

MIT
