This repository contains lesson materials, code examples, and evaluation scripts for Unit 7 of the Agentic AI Developer Certification Program by Ready Tensor. This week is all about evaluating agentic systems — how to measure performance, diagnose behavior, and build trust in AI that thinks and acts.
- Why traditional evaluation breaks down for adaptive, agentic systems
- A practical toolkit of evaluation methods — from LLM-as-judge to red teaming
- How to select the right metrics based on your system’s goals and design
- How to use RAGAS for scoring retrieval-augmented generation pipelines
- How to use DeepEval for multi-step evaluation with custom metrics
- How to evaluate multi-agent systems — measuring coordination, not just components
Learn why adaptive AI systems require new thinking in evaluation. We go beyond test cases and accuracy scores to ask: What does “success” look like for a system that reasons, adapts, and collaborates?
Explore seven hands-on methods — from human review to red teaming to automated scoring. Understand strengths, weaknesses, and best-fit use cases for each.
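To make one of these methods concrete, here is a minimal LLM-as-judge sketch. It is illustrative only, not the course's implementation: the rubric, the 1–5 scale, and the model name are assumptions, and it expects an `OPENAI_API_KEY` in your environment (see the setup steps below).

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """You are an impartial evaluator.
Rate the RESPONSE to the QUESTION on a 1-5 scale for helpfulness.
Reply with only the integer score.

QUESTION: {question}
RESPONSE: {response}"""

def judge(question: str, response: str) -> int:
    """Ask an LLM to score a response; returns a 1-5 integer."""
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, response=response)}],
        temperature=0,
    )
    return int(completion.choices[0].message.content.strip())

print(judge(
    "What is RAG?",
    "Retrieval-augmented generation grounds LLM answers in retrieved documents.",
))
```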
Choose evaluation metrics that actually matter. We walk through seven design dimensions (like output type and interaction mode) to help you tailor metrics to your own system.
Hands-on walkthrough of RAGAS — a framework for evaluating RAG pipelines. Learn how to generate test sets, define evaluation workflows, and create custom metrics.
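As a preview of what the Lesson 4 script covers, here is a minimal RAGAS scoring sketch. It assumes the 0.1-style `ragas.evaluate` API (newer releases restructure the metric imports), a configured OpenAI API key, and a made-up single-row dataset; the actual `code/run_lesson4_ragas_eval.py` script is more complete.

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

# A tiny, made-up evaluation set: one question with its retrieved
# contexts and the pipeline's generated answer.
data = {
    "question": ["What does RAGAS evaluate?"],
    "contexts": [["RAGAS scores retrieval-augmented generation pipelines."]],
    "answer": ["RAGAS evaluates RAG pipelines."],
}
dataset = Dataset.from_dict(data)

# Each metric calls an LLM under the hood, so an API key must be set.
result = evaluate(dataset, metrics=[faithfulness, answer_relevancy])
print(result)
```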
Dive into DeepEval, a flexible evaluation toolkit for real-world LLM apps. Learn how to create structured evaluation flows, define correctness and faithfulness, and integrate with your dev workflow.
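Here is a minimal DeepEval sketch showing a custom correctness metric defined with `GEval`. The criteria wording, threshold, and sample test case are illustrative assumptions, not taken from `code/run_lesson5_deepeval_demo.py`, and running it requires an LLM API key for the judge model.

```python
from deepeval import evaluate
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# A custom "correctness" metric described in natural language (G-Eval style).
correctness = GEval(
    name="Correctness",
    criteria="Is the actual output factually consistent with the expected output?",
    evaluation_params=[
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT,
    ],
    threshold=0.7,  # illustrative pass/fail cutoff
)

test_case = LLMTestCase(
    input="What does DeepEval do?",
    actual_output="DeepEval scores LLM app outputs with configurable metrics.",
    expected_output="DeepEval is a framework for evaluating LLM applications.",
)

evaluate(test_cases=[test_case], metrics=[correctness])
```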
A real case study for evaluating a multi-agent system using golden datasets, coordination metrics, and system-level scoring. Learn how to assess collaboration — not just component quality.
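Since there is no off-the-shelf coordination metric, here is a plain-Python sketch of one possible approach: scoring whether agents handed off work in the order a golden dataset expects. The `AgentTurn` structure and the scoring rule are hypothetical, not the ones used in the case study.

```python
from dataclasses import dataclass

@dataclass
class AgentTurn:
    agent: str    # which agent produced this turn
    task_id: str  # which golden-dataset task the turn addressed
    output: str   # the agent's output for that turn

def coordination_score(turns: list[AgentTurn], expected_order: list[str]) -> float:
    """Hypothetical metric: fraction of hand-offs that occurred in the
    agent order the golden dataset expects for a single task."""
    actual_order = [t.agent for t in turns]
    hits = sum(1 for a, e in zip(actual_order, expected_order) if a == e)
    return hits / max(len(expected_order), 1)

# Toy golden expectation: researcher hands off to writer, then reviewer.
turns = [
    AgentTurn("researcher", "t1", "found 3 sources"),
    AgentTurn("writer", "t1", "drafted summary"),
    AgentTurn("reviewer", "t1", "approved draft"),
]
print(coordination_score(turns, ["researcher", "writer", "reviewer"]))  # 1.0
```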
```txt
rt-agentic-ai-cert-unit7/
├── code/
│   ├── llm.py                                # LLM utility wrapper
│   ├── paths.py                              # Standardized file path management
│   ├── prompt_builder.py                     # Modular prompt construction functions
│   ├── run_lesson4_ragas_eval.py             # Lesson 4: Example script for RAGAS-based evaluation
│   ├── run_lesson5_deepeval_demo.py          # Lesson 5: Evaluation pipeline using DeepEval
│   ├── run_lesson6_multiagent_case_study.py  # Lesson 6: Multi-agent evaluation case study
│   └── utils.py                              # Common utilities
├── config/                                   # Configuration files
├── data/                                     # Input data for code examples
├── outputs/                                  # Output files from code examples
├── lessons/                                  # Lesson content and assets
├── .env.example                              # Sample environment file (e.g., for API keys)
├── .gitignore
├── LICENSE
├── README.md                                 # You are here
└── requirements.txt                          # Python dependencies for evaluation tools
```
- Clone the repository:

  ```bash
  git clone https://github.com/readytensor/rt-agentic-ai-cert-week7.git
  cd rt-agentic-ai-cert-week7
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Set up your environment variables:

  Copy `.env.example` to `.env` and fill in the required values (e.g., OpenAI or Groq API keys):

  ```bash
  cp .env.example .env
  ```
Each code example is runnable as a standalone script:

- Lesson 4 – RAGAS Evaluation:

  ```bash
  python code/run_lesson4_ragas_eval.py
  ```

- Lesson 5 – DeepEval Evaluation:

  ```bash
  python code/run_lesson5_deepeval_demo.py
  ```

- Lesson 6 – Multi-Agent Evaluation:

  ```bash
  python code/run_lesson6_multiagent_case_study.py
  ```
Evaluation reports will be saved to the `outputs/evaluation_reports/` folder.
This project is licensed under the CC BY-NC-SA 4.0 License – see the LICENSE file for details.
Ready Tensor, Inc.
- Email: contact at readytensor dot com
- Issues & Contributions: Open an issue or PR on this repo
- Website: https://readytensor.ai