# agent-eval

The credit rating agency for AI agents: an open-source evaluation platform with two layers, Task Eval (agent task completion) and Tool Eval (MCP server quality).
## Development Commands

```bash
# Install dependencies
bun install

# Build CLI
bun run --filter agent-eval build

# Run CLI tests
bun run --filter agent-eval test

# Lint (all packages)
bun run lint

# Run CLI locally
node packages/agent-eval/dist/cli.js --help
```

## Project Structure

```
agent-eval/
├── packages/
│   ├── agent-eval/          # CLI tool (npm: agent-eval)
│   │   ├── src/
│   │   │   ├── cli.ts       # Commander.js entry point
│   │   │   ├── commands/    # CLI subcommands (init, run, report)
│   │   │   ├── config/      # Config schema (Zod) and loader
│   │   │   ├── protocols/   # Protocol adapters (MCP, A2A, REST)
│   │   │   ├── eval/        # Evaluation engine, task generation, scoring
│   │   │   └── report/      # Report generation
│   │   └── tests/           # Unit tests (Vitest)
│   └── web/                 # Website (Next.js, Phase 2)
├── docs/
│   ├── PRD.md               # Product requirements document
│   └── IMPLEMENTATION_PLAN.md
└── turbo.json
```

## Tech Stack
- Runtime: Node.js 20+, Bun as package manager
- Language: TypeScript (strict mode)
- CLI Build: tsup (ESM output)
- Testing: Vitest
- Linting: Biome
- Monorepo: Turborepo + Bun workspaces
- LLM: Anthropic Claude API (for task generation + LLM-as-judge scoring)
- Agent protocols: MCP SDK (@modelcontextprotocol/sdk), A2A (HTTP)
## Coding Conventions

- All code comments in English, detailed enough for newcomers to understand
- Prefer mature open-source solutions over reinventing the wheel
- All new code must have corresponding unit tests in the `tests/` directory
- Tests must pass before committing
- Use Zod for all runtime validation (config, API responses, etc.); see the sketch after this list
- ESM-only (`"type": "module"` in `package.json`)
- Use the `node:` prefix for Node.js built-in imports
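A minimal sketch of the Zod convention, which also shows the `node:` prefix and ESM style. The config shape and field names here are illustrative assumptions, not the project's actual schema in `src/config/`:

```typescript
import { readFile } from "node:fs/promises"; // note the node: prefix
import { z } from "zod";

// Hypothetical config shape -- the real schema lives in src/config/.
export const EvalConfigSchema = z.object({
  agent: z.object({
    name: z.string(),
    protocol: z.enum(["mcp", "a2a", "rest"]),
    endpoint: z.string().url(),
  }),
  taskSetVersion: z.string().default("latest"),
});

export type EvalConfig = z.infer<typeof EvalConfigSchema>;

// Parse and validate a JSON config file; throws a ZodError with
// field-level details when the file does not match the schema.
export async function loadConfig(path: string): Promise<EvalConfig> {
  const raw = JSON.parse(await readFile(path, "utf8"));
  return EvalConfigSchema.parse(raw);
}
```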
## Content Generation Rules

- CRITICAL: when generating any content (blog posts, reports, documentation, eval configs):
  - Check the current system date first
  - Verify all model names/IDs are the LATEST versions as of that date (e.g. Claude Sonnet 4.6, not Sonnet 4; Opus 4.6, not Opus 4; check system context for current model IDs)
  - Re-read the latest results/ files; never use stale data from earlier in the conversation
  - Include the generation date in all published content
  - Reference specific evaluation data with exact numbers from the latest results
- Model versions: always use the latest model IDs when creating agent configs or referencing models; outdated model names make the product look stale. Check the assistant system prompt for current model IDs before writing.
## Architecture Notes

- Protocol adapters are pluggable: each protocol (MCP, A2A, REST) implements a common `ProtocolAdapter` interface (see the sketches after this list)
- LLM-as-judge: quality scoring uses Claude with human-written YAML rubrics. The LLM scores outputs; it does not define what "good" means
- Verification layer: when users publish results, the platform re-runs 20% of tests to prevent score manipulation
- Task sets are versioned: scores reference which task set version was used, ensuring fair comparison
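A sketch of what the pluggable interface could look like. The method names and task shape below are illustrative assumptions, not the actual API in `src/protocols/`:

```typescript
// Illustrative adapter shape; the real interface lives in
// packages/agent-eval/src/protocols/.
export interface ProtocolAdapter {
  /** Protocol identifier, e.g. "mcp", "a2a", or "rest". */
  readonly protocol: "mcp" | "a2a" | "rest";
  /** Open a session to the agent or tool under test. */
  connect(endpoint: string): Promise<void>;
  /** Send one evaluation task and return the raw agent output. */
  invoke(task: { id: string; prompt: string }): Promise<string>;
  /** Tear down the session. */
  close(): Promise<void>;
}

// The evaluation engine can then stay protocol-agnostic:
export async function runTask(
  adapter: ProtocolAdapter,
  endpoint: string,
  task: { id: string; prompt: string },
): Promise<string> {
  await adapter.connect(endpoint);
  try {
    return await adapter.invoke(task);
  } finally {
    await adapter.close();
  }
}
```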
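A hedged sketch of the LLM-as-judge step. The rubric handling, prompt wording, and score extraction are assumptions, and the model ID is a placeholder to be replaced with the current one per the content rules above:

```typescript
import { readFile } from "node:fs/promises";
import Anthropic from "@anthropic-ai/sdk";
import { parse } from "yaml";

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

// Score one output against a human-written YAML rubric.
// The rubric defines what "good" means; Claude only applies it.
export async function judgeOutput(
  rubricPath: string,
  output: string,
): Promise<number> {
  const rubric = parse(await readFile(rubricPath, "utf8"));
  const response = await client.messages.create({
    model: "claude-sonnet-4-5", // placeholder: check the latest model ID first
    max_tokens: 256,
    messages: [
      {
        role: "user",
        content:
          `Score the output below from 0 to 100 against this rubric. ` +
          `Reply with only the number.\n\n` +
          `Rubric:\n${JSON.stringify(rubric, null, 2)}\n\nOutput:\n${output}`,
      },
    ],
  });
  const block = response.content[0];
  const text = block?.type === "text" ? block.text : "0";
  return Number.parseInt(text.trim(), 10);
}
```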
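The verification layer's 20% re-run could be as simple as a uniform random sample; the selection logic below is a sketch, not the platform's actual policy:

```typescript
// Pick a random ~20% of tasks to re-run when results are published.
// Fisher-Yates shuffle, then take the first fifth (at least one task).
export function sampleForVerification<T>(tasks: T[], fraction = 0.2): T[] {
  const count = Math.max(1, Math.ceil(tasks.length * fraction));
  const pool = [...tasks];
  for (let i = pool.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [pool[i], pool[j]] = [pool[j], pool[i]];
  }
  return pool.slice(0, count);
}
```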
## Environment Setup

Copy `.env.example` to `.env`:

```bash
ANTHROPIC_API_KEY=sk-ant-...   # Required for eval scoring (LLM-as-judge)
```
## Deployment

- Website: Vercel (`--scope orris`), subdomain: eval.agenthunter.io
- CLI: published to the npm registry as `agent-eval`