This guide explains how to use the codebase to conduct multi-turn dialogue evaluation. The evaluation process consists of three main steps:
- Generate multi-turn dialogue tests using scripts in the `pred` folder
- Merge output files using `merge.py`
- Evaluate the results using scripts in the `judge` folder
Requirements: Python 3.x.

Install all required dependencies using the provided `requirements.txt` file:

```bash
pip install -r requirements.txt
```

Alternatively, you can install the packages individually as needed:

- `openai`
- `httpx`
- `pandas`
- `openpyxl` (for Excel file handling)
Use `pred/main.py` to generate multi-turn dialogue tests. This script executes three types of QA tests: MT_Inter, MT_Cog, and MT_App.
```bash
cd pred
python main.py \
--data-dir ../data \
--output-dir ../output \
--model-name your_model_name \
--attack-api-key your_attack_api_key \
--attack-base-url your_attack_base_url \
--attack-model-name your_attack_model_name \
--defense-api-key your_defense_api_key \
--defense-base-url your_defense_base_url \
--defense-model-name your_defense_model_name
```

- `--data-dir`: Input data directory path (default: `./data`). Should contain `MT_Inter.xlsx`, `MT_Cog.xlsx`, and `MT_App.xlsx`.
- `--output-dir`: Output directory path (default: `./output`)
- `--model-name`: Test model name, used as the output subfolder name (default: `model_name`)
- `--attack-api-key`: API key for the attack model
- `--attack-base-url`: Base URL for the attack model API
- `--attack-model-name`: Name of the attack model
- `--defense-api-key`: API key for the defense model
- `--defense-base-url`: Base URL for the defense model API
- `--defense-model-name`: Name of the defense model
The script generates its output files in the `{output-dir}/{model-name}/` directory. Each test type (MT_Inter, MT_Cog, MT_App) produces multiple JSONL files containing the dialogue records.
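Before merging, it can help to spot-check what Step 1 produced. Below is a minimal sketch using only the standard library; the model folder name is a placeholder, and field names other than `id` are not guaranteed by this guide:

```python
import json
from pathlib import Path

# Spot-check the dialogue records generated in Step 1.
# "your_model_name" is a placeholder for the --model-name you used.
output_dir = Path("../output/your_model_name")

for jsonl_file in sorted(output_dir.glob("*.jsonl")):
    with jsonl_file.open(encoding="utf-8") as f:
        records = [json.loads(line) for line in f if line.strip()]
    print(f"{jsonl_file.name}: {len(records)} records")
    if records:
        # Show which fields the first record carries; the exact field
        # names depend on the generation scripts.
        print("  keys:", sorted(records[0].keys()))
```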
An example generation run:

```bash
python main.py \
--data-dir ../data \
--output-dir ../output \
--model-name deepseek-v3 \
--attack-api-key sk-xxx \
--attack-base-url https://api.example.com/v1 \
--attack-model-name deepseek-r1 \
--defense-api-key sk-yyy \
--defense-base-url https://api.example.com/v1 \
--defense-model-name deepseek-v3
```

Use `pred/merge.py` to merge all JSONL files in each subdirectory of the output directory into a single `merged.jsonl` file.
```bash
cd pred
python merge.py --output-dir ../output
```

- `--output-dir`: Output directory path (if omitted, the script falls back to its built-in default)
The script:

- Scans all subdirectories in the output directory
- Finds all JSONL files in each subdirectory (excluding `merged.jsonl`)
- Merges records by `id`, handling round-specific fields (e.g., `round_num`, `estimated_risk_escalation`, `strategy_chosen`, `reason`)
- Outputs `merged.jsonl` in each subdirectory
Each subdirectory in the output directory will contain a `merged.jsonl` file with all dialogue records merged by ID.
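For reference, the id-based grouping can be pictured roughly as below. This is an illustrative sketch, not `merge.py`'s actual code; in particular, collecting the round-specific fields into a `rounds` list is an assumption about the merged schema:

```python
import json
from collections import defaultdict
from pathlib import Path

# Round-specific fields named in the description above.
ROUND_FIELDS = {"round_num", "estimated_risk_escalation", "strategy_chosen", "reason"}

def merge_subdir(subdir: Path) -> None:
    """Merge every JSONL file in `subdir` into merged.jsonl, grouped by id."""
    merged = defaultdict(lambda: {"rounds": []})
    for jsonl_file in sorted(subdir.glob("*.jsonl")):
        if jsonl_file.name == "merged.jsonl":  # skip a previous merge output
            continue
        for line in jsonl_file.read_text(encoding="utf-8").splitlines():
            if not line.strip():
                continue
            rec = json.loads(line)
            entry = merged[rec["id"]]
            # Per-round fields go into a list; the rest is record-level metadata.
            entry["rounds"].append({k: rec[k] for k in ROUND_FIELDS if k in rec})
            entry.update({k: v for k, v in rec.items() if k not in ROUND_FIELDS})
    with (subdir / "merged.jsonl").open("w", encoding="utf-8") as f:
        for rec in merged.values():
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")
```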
An example merge run:

```bash
python merge.py --output-dir ../output
```

Use `judge/evaluate.py` to evaluate the merged dialogue records. This script uses an LLM judge model to assess the defense performance in multi-turn dialogues.
```bash
python -m judge.evaluate \
--output-dir ./output \
--judge-api-key your_judge_api_key \
--judge-base-url your_judge_base_url \
--judge-model-name your_judge_model_name
```

- `--output-dir`: Output directory path (default: `./output`)
- `--judge-api-key`: API key for the evaluation model (optional; a default value is provided)
- `--judge-base-url`: Base URL for the evaluation model API (optional; a default value is provided)
- `--judge-model-name`: Name of the evaluation model (optional; a default value is provided)
- `--start-from`: Start from the specified subdirectory (alphabetical order, inclusive)
- `--end-at`: End at the specified subdirectory (alphabetical order, inclusive)
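The `--start-from`/`--end-at` range is alphabetical and inclusive at both ends, which can be illustrated like this (the subdirectory names below are hypothetical):

```python
# Inclusive alphabetical range selection, as described for
# --start-from / --end-at. Names below are made up for illustration.
subdirs = sorted(["deepseek-v3", "gpt-4o", "qwen3-235b"])
start, end = "deepseek-v3", "gpt-4o"

selected = [d for d in subdirs
            if (start is None or d >= start) and (end is None or d <= end)]
print(selected)  # ['deepseek-v3', 'gpt-4o'] (both endpoints included)
```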
The script:

- Scans all subdirectories in the output directory
- Looks for `merged.jsonl` or `attack_mergered.jsonl` in each subdirectory
- Builds the conversation history from the records
- Selects the appropriate scoring rules based on the `dataset` field: `MT_Inter` → endogenous rules module, `MT_App` → application rules module, `MT_Cog` → cognitive rules module
- Calls the LLM judge API with the complete prompt
- Parses the JSON response and extracts the evaluation results
- Saves the results to `evaluation.jsonl` in each subdirectory
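To make the dataset-to-rules dispatch and the JSON-parsing step concrete, here is a minimal sketch built on the `openai` client from the dependency list. The prompt wording and the `judge_one` helper are illustrative stand-ins, not `evaluate.py`'s actual implementation:

```python
import json
from openai import OpenAI

# Dataset tag -> scoring-rules label, mirroring the selection step above.
RULES_BY_DATASET = {
    "MT_Inter": "endogenous rules",
    "MT_App": "application rules",
    "MT_Cog": "cognitive rules",
}

def judge_one(record: dict, client: OpenAI, model: str) -> dict:
    """Score one merged record with the LLM judge and return its JSON verdict."""
    rules = RULES_BY_DATASET[record["dataset"]]
    prompt = (
        f"Evaluate the defense below using the {rules}.\n"
        f"Conversation record:\n{json.dumps(record, ensure_ascii=False)}\n"
        "Respond with a single JSON object containing scores, deductions, and reasons."
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    # The judge is instructed to answer in JSON; this raises if it does not.
    return json.loads(resp.choices[0].message.content)
```

A client would be constructed with `OpenAI(api_key=..., base_url=...)` using the judge credentials; in practice `evaluate.py` assembles its prompts from the rules modules and the templates in `judge/prompt.py` rather than from an inline string like this.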
Each subdirectory will contain an `evaluation.jsonl` file with the evaluation results. The file includes:
- Original dialogue records
- Evaluation scores and deductions
- Detailed deduction reasons
- Defense assessment
Additionally, the script generates JSON and CSV summary files with statistics.
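`evaluate.py` already writes these summaries; if you want to slice the results yourself, here is a short `pandas` sketch (reading a `score` field is an assumption about the evaluation schema):

```python
import json
from pathlib import Path

import pandas as pd

# Collect one row per evaluated record across all model subdirectories.
rows = []
for eval_file in Path("./output").glob("*/evaluation.jsonl"):
    for line in eval_file.read_text(encoding="utf-8").splitlines():
        if line.strip():
            rec = json.loads(line)
            # "score" is a hypothetical field name; adapt it to the real schema.
            rows.append({"model": eval_file.parent.name, "score": rec.get("score")})

df = pd.DataFrame(rows)
print(df.groupby("model")["score"].describe())
df.to_csv("evaluation_summary.csv", index=False)
```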
An example evaluation run:

```bash
python -m judge.evaluate \
--output-dir ./output \
--judge-api-key sk-zzz \
--judge-base-url https://api.example.com/v1 \
--judge-model-name qwen3-235b
```

Here's a complete example of running the entire evaluation pipeline:
```bash
# Step 1: Generate multi-turn dialogues
cd pred
python main.py \
--data-dir ../data \
--output-dir ../output \
--model-name test-model \
--attack-api-key sk-attack \
--attack-base-url https://api.example.com/v1 \
--attack-model-name deepseek-r1 \
--defense-api-key sk-defense \
--defense-base-url https://api.example.com/v1 \
--defense-model-name deepseek-v3

# Step 2: Merge output files
python merge.py --output-dir ../output

# Step 3: Evaluate results
cd ..
python -m judge.evaluate \
--output-dir ./output \
--judge-api-key sk-judge \
--judge-base-url https://api.example.com/v1 \
--judge-model-name qwen3-235b
```

Project structure:

```
multi-turn/
├── data/                       # Input data files
│   ├── MT_Inter.xlsx
│   ├── MT_Cog.xlsx
│   └── MT_App.xlsx
├── pred/                       # Prediction/generation scripts
│   ├── main.py                 # Main script to generate dialogues
│   ├── merge.py                # Script to merge output files
│   └── ...
├── judge/                      # Evaluation scripts
│   ├── evaluate.py             # Main evaluation script
│   ├── prompt.py               # Prompt templates
│   └── README.md               # Detailed evaluation guide
├── output/                     # Output directory (created after Step 1)
│   └── {model-name}/           # Subdirectory for each model
│       ├── *.jsonl             # Generated dialogue files
│       ├── merged.jsonl        # Merged file (after Step 2)
│       └── evaluation.jsonl    # Evaluation results (after Step 3)
└── README.md                   # This file
```
Notes:

- API Configuration: Make sure to configure the correct API keys and base URLs for:
  - the attack model (used to generate adversarial questions)
  - the defense model (the model being evaluated)
  - the judge model (used for evaluation)
- File Requirements: The input Excel files in the `data/` directory must follow the expected format. Refer to the data files for the required structure.
- Error Handling: If any step fails, check the error messages and ensure that:
  - API keys are valid
  - network connectivity is available
  - input files are in the correct format
  - the output directory has write permissions
- Incremental Processing: The evaluation script processes all subdirectories. You can use the `--start-from` and `--end-at` parameters to process specific ranges if needed.
- Output Files:
  - After Step 1: multiple JSONL files per test type
  - After Step 2: `merged.jsonl` in each subdirectory
  - After Step 3: `evaluation.jsonl` and summary files (JSON/CSV) in each subdirectory
Troubleshooting:

- Import Errors: Make sure you're running scripts from the correct directory, or use the `-m` flag for module imports
- API Errors: Verify that API keys and base URLs are correct, and check network connectivity
- File Not Found: Ensure the input data files exist in the `data/` directory
- Permission Errors: Check that the output directory has write permissions
For more detailed information about the evaluation process, see `judge/README.md`.