MedicalAiBenchEval
A comprehensive medical AI evaluation framework based on the GAPS methodology. Features an automated assessment pipeline, a thoracic surgery dataset (92 cases), multi-model support, parallel processing, and clinical scoring. Supports OpenAI-compatible APIs for GPT, Claude, Gemini, Qwen, and more, with detailed analytics and visualization.
Medical AI Evaluation Framework: Clinical Benchmark Dataset and Automated Assessment Pipeline
📋 Table of Contents
- Overview
- Key Features
- Quick Start
- System Architecture
- Data Format
- Configuration
- Usage Guide
- Evaluation Framework
- Dataset
- Advanced Features
- Development
- Community
- License
Overview
This Medical AI Evaluation Framework provides a comprehensive evaluation system designed specifically for assessing AI models in clinical scenarios. Based on the GAPS (Grounded, Automated, Personalized, Scalable) methodology, this framework includes both a curated clinical benchmark dataset and an automated assessment pipeline for medical AI systems.
The framework addresses the critical need for standardized evaluation of AI clinical decision-making by providing:
- Clinically Grounded Assessment: Evaluation criteria based on real medical guidelines and expert knowledge
- Automated Pipeline: Streamlined processing from raw responses to detailed performance metrics
- Multi-Model Support: Simultaneous evaluation of multiple AI models with comparative analysis
- Scalable Architecture: Efficient processing of large datasets with parallel execution capabilities
Key Features
- 🏥 Medical-Specific Evaluation: Specialized rubrics for clinical scenarios with positive/negative scoring
- 🔄 Parallel Processing: Simultaneous execution of Met/Not Met review and irrelevant content detection
- 📊 Comprehensive Analytics: Detailed statistical analysis with visualization reports
- 🎯 Multi-Model Assessment: Support for evaluating multiple AI models simultaneously
- ⚙️ Flexible Configuration: Customizable models, voting strategies, and evaluation parameters
- 📈 Rich Visualization: Automated generation of performance charts and comparative analysis
- 🔧 Modular Design: Independent modules for different evaluation stages
- 📋 Standardized Output: Consistent Excel-based reporting with detailed metrics
Quick Start
Installation
```bash
# Clone the repository
git clone <repository-url>
cd medical-ai-bench-eval

# Install dependencies
pip install -r requirements.txt
```
Basic Usage
```bash
# Run the complete evaluation pipeline
python medical_evaluation_pipeline.py input_data.xlsx

# With a custom output file
python medical_evaluation_pipeline.py input_data.xlsx -o results/evaluation_results.xlsx

# Enable verbose logging
python medical_evaluation_pipeline.py input_data.xlsx -v
```
System Architecture
The system processes medical AI responses through a sophisticated pipeline that includes:
- Input Processing: Reads Excel files containing medical questions, evaluation rubrics, and AI model responses
- Parallel Evaluation: Simultaneously executes Met/Not Met review and irrelevant content detection
- Intelligent Scoring: Calculates comprehensive scores based on clinical evaluation criteria
- Analysis & Reporting: Generates detailed statistical reports and visualizations
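The parallel-evaluation step can be sketched with `asyncio.gather`. This is a minimal illustration, not the framework's actual code: `met_review`, `irrelevant_content_check`, and `evaluate_answer` are hypothetical names, and the judge-model calls are replaced with placeholders.

```python
import asyncio

# Hypothetical stand-ins for the two review stages; the real pipeline
# calls LLM judges here.
async def met_review(answer: str, rubric: list) -> dict:
    await asyncio.sleep(0)  # placeholder for an async judge-model call
    # Placeholder logic: treat every positive (A-level) point as Met.
    return {"met_ids": [p["id"] for p in rubric if p["level"].startswith("A")]}

async def irrelevant_content_check(answer: str) -> dict:
    await asyncio.sleep(0)  # placeholder for an async judge-model call
    return {"irrelevant_count": 0}

async def evaluate_answer(answer: str, rubric: list) -> dict:
    # Step 1 and Step 2 run concurrently; their results are then merged.
    met, irrelevant = await asyncio.gather(
        met_review(answer, rubric),
        irrelevant_content_check(answer),
    )
    return {**met, **irrelevant}

rubric = [{"id": 1, "claim": "Need to ask about symptom duration", "level": "A1"}]
result = asyncio.run(evaluate_answer("Ask how long the pain has lasted.", rubric))
```

The key point is that the two review stages are independent, so running them concurrently roughly halves wall-clock time per answer when both are API-bound.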
```
Input Excel File
          ↓
┌─────────────────────────────────────────────────────┐
│ Parallel Processing Phase                           │
├──────────────────────────┬──────────────────────────┤
│ Step 1: Met Review       │ Step 2: Irrelevant       │
│ - Multi-model review     │   Content Detection      │
│ - Voting decision        │ - Content extraction     │
│ - Result summary         │ - Level assessment       │
│                          │ - Voting classification  │
└──────────────────────────┴──────────────────────────┘
          ↓
┌─────────────────────────────────────────────────────┐
│ Result Merging                                      │
│ - Intelligent merging of parallel results           │
│ - Data integrity check                              │
└─────────────────────────────────────────────────────┘
          ↓
┌─────────────────────────────────────────────────────┐
│ Step 3: Score Calculation                           │
│ - Multi-dimensional score statistics                │
│ - Irrelevant content deduction                      │
│ - Normalization processing                          │
└─────────────────────────────────────────────────────┘
          ↓
┌─────────────────────────────────────────────────────┐
│ Step 4: Data Analysis & Visualization               │
│ - Statistical analysis reports                      │
│ - Diverse charts                                    │
│ - CSV data export                                   │
└─────────────────────────────────────────────────────┘
          ↓
Output Excel File + Analysis Report
```
Data Format
Input Format
The system accepts Excel files with the following structure:
| Column Name | Description | Example |
|-------------|-------------|---------|
| `question` | Medical question | "Patient presents with chest pain symptoms, how to diagnose?" |
| `final_merged_json` | Evaluation points JSON | `[{"id":1,"claim":"Need to inquire about symptoms","level":"A1"}]` |
| `gpt_5_answer` | GPT model response | "First need to ask the patient about symptoms in detail..." |
| `gemini_2_5_pro_answer` | Gemini model response | "Recommend performing ECG examination..." |
| `claude_opus_4_answer` | Claude model response | "Should consider acute coronary syndrome..." |
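An input row with this schema can be loaded and unpacked with pandas. The sketch below builds a one-row frame in memory for illustration; in practice you would load the real file with `pd.read_excel("input_data.xlsx")`.

```python
import json
import pandas as pd

# A one-row frame mirroring the expected input schema (normally loaded
# from the Excel file instead).
df = pd.DataFrame([{
    "question": "Patient presents with chest pain symptoms, how to diagnose?",
    "final_merged_json": '[{"id": 1, "claim": "Need to inquire about symptoms", "level": "A1"}]',
    "gpt_5_answer": "First need to ask the patient about symptoms in detail...",
}])

row = df.iloc[0]
rubric = json.loads(row["final_merged_json"])  # evaluation points as Python dicts

# Each *_answer column holds one model's free-text response.
answer_cols = [c for c in df.columns if c.endswith("_answer")]
```

Parsing `final_merged_json` up front lets the pipeline iterate over structured evaluation points rather than raw strings.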
Evaluation Points JSON Format
```json
[
  {
    "id": 1,
    "claim": "Need to ask about symptom duration",
    "level": "A1",
    "desc": "Detailed inquiry about chest pain duration, nature, etc."
  },
  {
    "id": 2,
    "claim": "Should avoid mentioning unrelated treatment plans",
    "level": "S2",
    "desc": "Should not mention treatments unrelated to chest pain"
  }
]
```
Level Description:
- A1-A3: Positive points (A1=5 points, A2=3 points, A3=1 point)
- S1-S4: Negative points (S1=-1 point, S2=-2 points, S3=-3 points, S4=-4 points)
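The level-to-point mapping above can be applied directly once the Met/Not Met judgments are in. A minimal sketch (`score_points` and the `met_ids` set are hypothetical names, not the framework's API):

```python
# Point values implied by the level description above; S-levels subtract.
LEVEL_POINTS = {"A1": 5, "A2": 3, "A3": 1, "S1": -1, "S2": -2, "S3": -3, "S4": -4}

def score_points(points: list, met_ids: set) -> int:
    """Sum the values of the evaluation points judged 'Met'."""
    return sum(LEVEL_POINTS[p["level"]] for p in points if p["id"] in met_ids)

rubric = [
    {"id": 1, "claim": "Need to ask about symptom duration", "level": "A1"},
    {"id": 2, "claim": "Should avoid mentioning unrelated treatment plans", "level": "S2"},
]

# Maximum achievable score counts only the positive (A-level) points.
max_possible = sum(LEVEL_POINTS[p["level"]] for p in rubric
                   if LEVEL_POINTS[p["level"]] > 0)

# An answer that hits point 1 (A1, +5) and trips point 2 (S2, -2):
total = score_points(rubric, met_ids={1, 2})
```

Note that hitting a negative point lowers the total, which is how the rubric penalizes answers that mention contraindicated or unrelated content.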
Output Format
The system generates multiple output files:
1. Final Score Excel (`processed_result_final_YYYYMMDD_HHMMSS.xlsx`)
   - Contains all original data
   - Met/Not Met review results
   - Irrelevant content detection results
   - Detailed scoring statistics
2. Data Analysis Report (`data/output/analysis/`)
   - `medical_evaluation_report_YYYYMMDD_HHMMSS.png`: visualization charts
   - `medical_analysis_report_YYYYMMDD_HHMMSS.txt`: detailed analysis report
   - `model_performance_summary_YYYYMMDD_HHMMSS.csv`: performance summary
Scoring Metrics
| Metric | Description |
|--------|-------------|
| `max_possible` | Theoretical maximum score (sum of all positive points) |
| `final_total_score` | Actual score (Met items score minus irrelevant content deduction) |
| `normalized` | Normalized score (between 0 and 1) |
| `positive_total_count` | Number of positive point hits |
| `rubric_total_count` | Number of negative point hits |
| `irrelevant_total_count` | Total irrelevant content count |
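One plausible reading of the `normalized` metric is `final_total_score / max_possible`, clamped to [0, 1]. The clamping is an assumption on our part (the irrelevant-content deduction can push the raw score below zero), not something the source spells out:

```python
def normalize(final_total_score: float, max_possible: float) -> float:
    """Map a raw score into [0, 1].

    Assumption: scores below 0 (possible after irrelevant-content
    deductions) are clamped to 0 rather than reported as negative.
    """
    if max_possible <= 0:
        return 0.0
    return max(0.0, min(1.0, final_total_score / max_possible))

print(normalize(3, 9))   # a partially correct answer
print(normalize(-2, 9))  # deductions exceeded the Met score
```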
Configuration
Basic Configuration
Environment Requirements
- Python: 3.12+
- Operating System: Windows, macOS, Linux
- Core Dependencies: pandas, numpy, matplotlib, seaborn, asyncio, langchain, openpyxl, xlsxwriter
Installation Steps
1. Clone the repository:
   ```bash
   git clone <repository-url>
   cd medical-ai-bench-eval
   ```
2. Install dependencies:
   ```bash
   pip install -r requirements.txt
   ```
3. Configure API keys (if using external AI model services).
API Configuration Guide
Environment Variables Setup
Linux/macOS:
```bash
# OpenAI
export OPENAI_API_KEY="sk-your-actual-key"
export OPENAI_BASE_URL="https://api.openai.com/v1"

# Claude (Anthropic)
export ANTHROPIC_API_KEY="sk-ant-your-actual-key"
export ANTHROPIC_BASE_URL="https://api.anthropic.com"

# Google Gemini
export GOOGLE_API_KEY="your-google-api-key"
export GOOGLE_BASE_URL="https://generativelanguage.googleapis.com/v1beta"

# Moonshot Kimi
export MOONSHOT_API_KEY="sk-your-moonshot-key"
export MOONSHOT_BASE_URL="https://api.moonshot.cn/v1"

# Alibaba Qwen
export DASHSCOPE_API_KEY="your-dashscope-key"
export DASHSCOPE_BASE_URL="https://dashscope.aliyuncs.com/api/v1"

# Baichuan
export BAICHUAN_API_KEY="your-baichuan-key"
export BAICHUAN_BASE_URL="https://api.baichuan-ai.com/v1"

# DeepSeek
export DEEPSEEK_API_KEY="sk-your-deepseek-key"
export DEEPSEEK_BASE_URL="https://api.deepseek.com/v1"

# Zhipu ChatGLM
export ZHIPU_API_KEY="your-zhipu-key"
export ZHIPU_BASE_URL="https://open.bigmodel.cn/api/paas/v4"
```
Windows PowerShell:
```powershell
$Env:OPENAI_API_KEY="sk-your-actual-key"
$Env:OPENAI_BASE_URL="https://api.openai.com/v1"
```
Model Configuration
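Inside the pipeline, per-provider credentials can be resolved from these environment variables by prefix. This is an illustrative helper, not the framework's actual config loader; `provider_config` and the `DEMO` prefix are made up for the example:

```python
import os

def provider_config(prefix: str) -> dict:
    """Resolve one provider's key and base URL from the environment.

    Variable names follow the <PREFIX>_API_KEY / <PREFIX>_BASE_URL
    convention used in the guide above.
    """
    return {
        "api_key": os.environ.get(f"{prefix}_API_KEY", ""),
        "base_url": os.environ.get(f"{prefix}_BASE_URL", ""),
    }

# Simulated environment for demonstration only.
os.environ["DEMO_API_KEY"] = "sk-demo"
os.environ["DEMO_BASE_URL"] = "https://example.com/v1"
cfg = provider_config("DEMO")
```

Keeping keys in the environment (rather than in a config file) avoids accidentally committing credentials to the repository.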
⚠️ IMPORTANT: OpenAI-Compatible API Only
This system currently supports ONLY OpenAI-compatible API interfaces. All models must provide OpenAI-compatible endpoints, regardless of the actual provider.
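Concretely, "OpenAI-compatible" means every model is called with the same chat-completions request shape, whatever the real provider. A minimal sketch of that request body (the model name and messages are illustrative):

```python
# The same payload shape is POSTed to {BASE_URL}/chat/completions for
# every configured model, with the matching provider's API key.
payload = {
    "model": "gpt-5",  # any model name the endpoint accepts
    "messages": [
        {"role": "system", "content": "You are a clinical reviewer."},
        {"role": "user",
         "content": "Patient presents with chest pain symptoms, how to diagnose?"},
    ],
    "temperature": 0,  # deterministic judging is usually preferred
}
```

Models such as Claude, Gemini, or Qwen must therefore be reached through an endpoint (native or gateway) that accepts this shape.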
🌐 Third-Party Multi-Model Platforms (Recommended)
For models that don't natively support
