Multimodal AI & Agent API Evaluation Framework (GSoC 2026 PoC)#76

Open
uddalak2005 wants to merge 10 commits into foss42:main from uddalak2005:main
@uddalak2005

EvalForge

EvalForge is an evaluation framework for AI APIs that covers four core areas:

  • Text (MMLU)
  • Multimodal (VQA)
  • Agent Tool-Call Tracing (TFS)
  • MCP App Integration

Core Features

Multimodal VQA

Evaluates vision-language models on visual question answering (VQA), with support for base64-encoded images.
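
As an illustration of base64 image support, here is a minimal sketch of how one VQA sample might be packaged for an API call. The payload schema and the names `encode_image` and `build_vqa_request` are assumptions made for this example, not taken from the PR.

```python
import base64
import json


def encode_image(image_bytes: bytes, mime_type: str = "image/png") -> str:
    """Return a data URL carrying the image as base64 text."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def build_vqa_request(question: str, image_bytes: bytes) -> str:
    """Serialize one VQA sample as JSON (hypothetical schema)."""
    payload = {"question": question, "image": encode_image(image_bytes)}
    return json.dumps(payload)


# Placeholder bytes standing in for the contents of a real image file.
request_body = build_vqa_request("What color is the car?", b"placeholder image bytes")
```

Sending the image inline keeps each sample self-contained, at the cost of a payload roughly one third larger than the raw bytes.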

Agent Trajectory Fidelity Score (TFS)

An original metric that measures how faithfully an agent's tool-call sequence follows a gold-standard sequence.
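
This description does not give the TFS formula. The sketch below shows one plausible way to score a trajectory against a gold standard: the longest common subsequence (LCS) of tool names, normalized by the longer sequence. The LCS formulation and the function names are assumptions for illustration and may differ from the metric implemented in the PR.

```python
from typing import Sequence


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def trajectory_fidelity(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """Score in [0, 1]; 1.0 means the tool calls match the gold order exactly.

    Assumed definition: LCS length divided by the longer sequence length,
    so both skipped and extra tool calls lower the score.
    """
    if not predicted and not gold:
        return 1.0
    return lcs_length(predicted, gold) / max(len(predicted), len(gold))


gold_calls = ["search", "fetch", "summarize"]
score = trajectory_fidelity(["search", "summarize"], gold_calls)
```

Under this definition an agent that skips `fetch` keeps the two calls that appear in the right order and is penalized for the missing one.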

MCP Apps Integration

The evaluation dashboard is itself an MCP App, allowing researchers to:

  • Trigger evaluations
  • View results directly inside an AI agent chat window

Live Analytics

A real-time React dashboard with:

  • Accuracy tracking
  • Latency monitoring
  • Token cost analysis
  • Historical insights
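
The dashboard metrics listed above could be derived from per-request records along these lines. This is a sketch only: the `EvalRecord` fields, the `summarize` helper, and the per-1,000-token pricing parameters are hypothetical and not taken from the PR.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Sequence


@dataclass
class EvalRecord:
    """One evaluated request (hypothetical record shape)."""
    correct: bool
    latency_ms: float
    prompt_tokens: int
    completion_tokens: int


def summarize(
    records: Sequence[EvalRecord],
    usd_per_1k_prompt: float,
    usd_per_1k_completion: float,
) -> dict:
    """Aggregate accuracy, mean latency, and total token cost for a run."""
    if not records:
        raise ValueError("at least one record is required")
    prompt_total = sum(r.prompt_tokens for r in records)
    completion_total = sum(r.completion_tokens for r in records)
    cost = (
        prompt_total / 1000 * usd_per_1k_prompt
        + completion_total / 1000 * usd_per_1k_completion
    )
    return {
        "accuracy": sum(r.correct for r in records) / len(records),
        "mean_latency_ms": mean(r.latency_ms for r in records),
        "total_cost_usd": cost,
    }
```

Storing one summary per run, keyed by timestamp, would be enough to plot the historical trends.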
