MedicalAiBenchEval
A comprehensive medical AI evaluation framework based on the GAPS methodology. Features an automated assessment pipeline, a thoracic surgery dataset (92 cases), multi-model support, parallel processing, and clinical scoring. Supports OpenAI-compatible APIs for GPT, Claude, Gemini, Qwen, and more, with detailed analytics and visualization.
Medical AI Evaluation Framework: Clinical Benchmark Dataset and Automated Assessment Pipeline
📋 Table of Contents
- Overview
- Key Features
- Quick Start
- System Architecture
- Data Format
- Configuration
- Usage Guide
- Evaluation Framework
- Dataset
- Advanced Features
- Development
- Community
- License
Overview
This Medical AI Evaluation Framework provides a comprehensive evaluation system designed specifically for assessing AI models in clinical scenarios. Based on the GAPS (Grounded, Automated, Personalized, Scalable) methodology, this framework includes both a curated clinical benchmark dataset and an automated assessment pipeline for medical AI systems.
The framework addresses the critical need for standardized evaluation of AI clinical decision-making by providing:
- Clinically Grounded Assessment: Evaluation criteria based on real medical guidelines and expert knowledge
- Automated Pipeline: Streamlined processing from raw responses to detailed performance metrics
- Multi-Model Support: Simultaneous evaluation of multiple AI models with comparative analysis
- Scalable Architecture: Efficient processing of large datasets with parallel execution capabilities
Key Features
- 🏥 Medical-Specific Evaluation: Specialized rubrics for clinical scenarios with positive/negative scoring
- 🔄 Parallel Processing: Simultaneous execution of Met/Not Met review and irrelevant content detection
- 📊 Comprehensive Analytics: Detailed statistical analysis with visualization reports
- 🎯 Multi-Model Assessment: Support for evaluating multiple AI models simultaneously
- ⚙️ Flexible Configuration: Customizable models, voting strategies, and evaluation parameters
- 📈 Rich Visualization: Automated generation of performance charts and comparative analysis
- 🔧 Modular Design: Independent modules for different evaluation stages
- 📋 Standardized Output: Consistent Excel-based reporting with detailed metrics
Quick Start
Installation
```bash
# Clone the repository
git clone <repository-url>
cd medical-ai-bench-eval

# Install dependencies
pip install -r requirements.txt
```
Basic Usage
```bash
# Run the complete evaluation pipeline
python medical_evaluation_pipeline.py input_data.xlsx

# With a custom output file
python medical_evaluation_pipeline.py input_data.xlsx -o results/evaluation_results.xlsx

# Enable verbose logging
python medical_evaluation_pipeline.py input_data.xlsx -v
```
System Architecture
The system processes medical AI responses through a sophisticated pipeline that includes:
- Input Processing: Reads Excel files containing medical questions, evaluation rubrics, and AI model responses
- Parallel Evaluation: Simultaneously executes Met/Not Met review and irrelevant content detection
- Intelligent Scoring: Calculates comprehensive scores based on clinical evaluation criteria
- Analysis & Reporting: Generates detailed statistical reports and visualizations
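The parallel-evaluation step can be sketched with `asyncio.gather`. This is a minimal illustration, not the framework's actual code: `met_review`, `irrelevant_content_check`, and `evaluate_answer` are hypothetical names, and the judge-model calls are replaced with placeholders.

```python
import asyncio

# Hypothetical stand-ins for the two review stages; the real pipeline
# calls LLM judges here.
async def met_review(answer: str, rubric: list) -> dict:
    await asyncio.sleep(0)  # placeholder for an async judge-model call
    # Placeholder logic: treat every positive (A-level) point as Met.
    return {"met_ids": [p["id"] for p in rubric if p["level"].startswith("A")]}

async def irrelevant_content_check(answer: str) -> dict:
    await asyncio.sleep(0)  # placeholder for an async judge-model call
    return {"irrelevant_count": 0}

async def evaluate_answer(answer: str, rubric: list) -> dict:
    # Step 1 and Step 2 run concurrently; their results are then merged.
    met, irrelevant = await asyncio.gather(
        met_review(answer, rubric),
        irrelevant_content_check(answer),
    )
    return {**met, **irrelevant}

rubric = [{"id": 1, "claim": "Need to ask about symptom duration", "level": "A1"}]
result = asyncio.run(evaluate_answer("Ask how long the pain has lasted.", rubric))
```

The key point is that the two review stages are independent, so running them concurrently roughly halves wall-clock time per answer when both are API-bound.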
```
Input Excel File
          ↓
┌─────────────────────────────────────────────────────┐
│ Parallel Processing Phase                           │
├──────────────────────────┬──────────────────────────┤
│ Step 1: Met Review       │ Step 2: Irrelevant       │
│ - Multi-model review     │   Content Detection      │
│ - Voting decision        │ - Content extraction     │
│ - Result summary         │ - Level assessment       │
│                          │ - Voting classification  │
└──────────────────────────┴──────────────────────────┘
          ↓
┌─────────────────────────────────────────────────────┐
│ Result Merging                                      │
│ - Intelligent merging of parallel results           │
│ - Data integrity check                              │
└─────────────────────────────────────────────────────┘
          ↓
┌─────────────────────────────────────────────────────┐
│ Step 3: Score Calculation                           │
│ - Multi-dimensional score statistics                │
│ - Irrelevant content deduction                      │
│ - Normalization processing                          │
└─────────────────────────────────────────────────────┘
          ↓
┌─────────────────────────────────────────────────────┐
│ Step 4: Data Analysis & Visualization               │
│ - Statistical analysis reports                      │
│ - Diverse charts                                    │
│ - CSV data export                                   │
└─────────────────────────────────────────────────────┘
          ↓
Output Excel File + Analysis Report
```
Data Format
Input Format
The system accepts Excel files with the following structure:
| Column Name | Description | Example |
|-------------|-------------|---------|
| `question` | Medical question | "Patient presents with chest pain symptoms, how to diagnose?" |
| `final_merged_json` | Evaluation points JSON | `[{"id":1,"claim":"Need to inquire about symptoms","level":"A1"}]` |
| `gpt_5_answer` | GPT model response | "First need to ask the patient about symptoms in detail..." |
| `gemini_2_5_pro_answer` | Gemini model response | "Recommend performing ECG examination..." |
| `claude_opus_4_answer` | Claude model response | "Should consider acute coronary syndrome..." |
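An input row with this schema can be loaded and unpacked with pandas. The sketch below builds a one-row frame in memory for illustration; in practice you would load the real file with `pd.read_excel("input_data.xlsx")`.

```python
import json
import pandas as pd

# A one-row frame mirroring the expected input schema (normally loaded
# from the Excel file instead).
df = pd.DataFrame([{
    "question": "Patient presents with chest pain symptoms, how to diagnose?",
    "final_merged_json": '[{"id": 1, "claim": "Need to inquire about symptoms", "level": "A1"}]',
    "gpt_5_answer": "First need to ask the patient about symptoms in detail...",
}])

row = df.iloc[0]
rubric = json.loads(row["final_merged_json"])  # evaluation points as Python dicts

# Each *_answer column holds one model's free-text response.
answer_cols = [c for c in df.columns if c.endswith("_answer")]
```

Parsing `final_merged_json` up front lets the pipeline iterate over structured evaluation points rather than raw strings.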
Evaluation Points JSON Format
```json
[
  {
    "id": 1,
    "claim": "Need to ask about symptom duration",
    "level": "A1",
    "desc": "Detailed inquiry about chest pain duration, nature, etc."
  },
  {
    "id": 2,
    "claim": "Should avoid mentioning unrelated treatment plans",
    "level": "S2",
    "desc": "Should not mention treatments unrelated to chest pain"
  }
]
```
Level Description:
- A1-A3: Positive points (A1=5 points, A2=3 points, A3=1 point)
- S1-S4: Negative points (S1=-1 point, S2=-2 points, S3=-3 points, S4=-4 points)
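The level-to-point mapping above can be applied directly once the Met/Not Met judgments are in. A minimal sketch (`score_points` and the `met_ids` set are hypothetical names, not the framework's API):

```python
# Point values implied by the level description above; S-levels subtract.
LEVEL_POINTS = {"A1": 5, "A2": 3, "A3": 1, "S1": -1, "S2": -2, "S3": -3, "S4": -4}

def score_points(points: list, met_ids: set) -> int:
    """Sum the values of the evaluation points judged 'Met'."""
    return sum(LEVEL_POINTS[p["level"]] for p in points if p["id"] in met_ids)

rubric = [
    {"id": 1, "claim": "Need to ask about symptom duration", "level": "A1"},
    {"id": 2, "claim": "Should avoid mentioning unrelated treatment plans", "level": "S2"},
]

# Maximum achievable score counts only the positive (A-level) points.
max_possible = sum(LEVEL_POINTS[p["level"]] for p in rubric
                   if LEVEL_POINTS[p["level"]] > 0)

# An answer that hits point 1 (A1, +5) and trips point 2 (S2, -2):
total = score_points(rubric, met_ids={1, 2})
```

Note that hitting a negative point lowers the total, which is how the rubric penalizes answers that mention contraindicated or unrelated content.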
Output Format
The system generates multiple output files:
1. Final Score Excel (`processed_result_final_YYYYMMDD_HHMMSS.xlsx`)
   - Contains all original data
   - Met/Not Met review results
   - Irrelevant content detection results
   - Detailed scoring statistics
2. Data Analysis Report (`data/output/analysis/`)
   - `medical_evaluation_report_YYYYMMDD_HHMMSS.png`: visualization charts
   - `medical_analysis_report_YYYYMMDD_HHMMSS.txt`: detailed analysis report
   - `model_performance_summary_YYYYMMDD_HHMMSS.csv`: performance summary
Scoring Metrics
| Metric | Description |
|--------|-------------|
| `max_possible` | Theoretical maximum score (sum of all positive points) |
| `final_total_score` | Actual score (Met items score minus irrelevant content deduction) |
| `normalized` | Normalized score (between 0 and 1) |
| `positive_total_count` | Number of positive point hits |
| `rubric_total_count` | Number of negative point hits |
| `irrelevant_total_count` | Total irrelevant content count |
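One plausible reading of the `normalized` metric is `final_total_score / max_possible`, clamped to [0, 1]. The clamping is an assumption on our part (the irrelevant-content deduction can push the raw score below zero), not something the source spells out:

```python
def normalize(final_total_score: float, max_possible: float) -> float:
    """Map a raw score into [0, 1].

    Assumption: scores below 0 (possible after irrelevant-content
    deductions) are clamped to 0 rather than reported as negative.
    """
    if max_possible <= 0:
        return 0.0
    return max(0.0, min(1.0, final_total_score / max_possible))

print(normalize(3, 9))   # a partially correct answer
print(normalize(-2, 9))  # deductions exceeded the Met score
```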
Configuration
Basic Configuration
Environment Requirements
- Python: 3.12+
- Operating System: Windows, macOS, Linux
- Core Dependencies: pandas, numpy, matplotlib, seaborn, asyncio, langchain, openpyxl, xlsxwriter
Installation Steps
1. Clone the repository:
   ```bash
   git clone <repository-url>
   cd medical-ai-bench-eval
   ```
2. Install dependencies:
   ```bash
   pip install -r requirements.txt
   ```
3. Configure API keys (if using external AI model services).
API Configuration Guide
Environment Variables Setup
Linux/macOS:
```bash
# OpenAI
export OPENAI_API_KEY="sk-your-actual-key"
export OPENAI_BASE_URL="https://api.openai.com/v1"

# Claude (Anthropic)
export ANTHROPIC_API_KEY="sk-ant-your-actual-key"
export ANTHROPIC_BASE_URL="https://api.anthropic.com"

# Google Gemini
export GOOGLE_API_KEY="your-google-api-key"
export GOOGLE_BASE_URL="https://generativelanguage.googleapis.com/v1beta"

# Moonshot Kimi
export MOONSHOT_API_KEY="sk-your-moonshot-key"
export MOONSHOT_BASE_URL="https://api.moonshot.cn/v1"

# Alibaba Qwen
export DASHSCOPE_API_KEY="your-dashscope-key"
export DASHSCOPE_BASE_URL="https://dashscope.aliyuncs.com/api/v1"

# Baichuan
export BAICHUAN_API_KEY="your-baichuan-key"
export BAICHUAN_BASE_URL="https://api.baichuan-ai.com/v1"

# DeepSeek
export DEEPSEEK_API_KEY="sk-your-deepseek-key"
export DEEPSEEK_BASE_URL="https://api.deepseek.com/v1"

# Zhipu ChatGLM
export ZHIPU_API_KEY="your-zhipu-key"
export ZHIPU_BASE_URL="https://open.bigmodel.cn/api/paas/v4"
```
Windows PowerShell:
```powershell
$Env:OPENAI_API_KEY="sk-your-actual-key"
$Env:OPENAI_BASE_URL="https://api.openai.com/v1"
```
Model Configuration
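Inside the pipeline, per-provider credentials can be resolved from these environment variables by prefix. This is an illustrative helper, not the framework's actual config loader; `provider_config` and the `DEMO` prefix are made up for the example:

```python
import os

def provider_config(prefix: str) -> dict:
    """Resolve one provider's key and base URL from the environment.

    Variable names follow the <PREFIX>_API_KEY / <PREFIX>_BASE_URL
    convention used in the guide above.
    """
    return {
        "api_key": os.environ.get(f"{prefix}_API_KEY", ""),
        "base_url": os.environ.get(f"{prefix}_BASE_URL", ""),
    }

# Simulated environment for demonstration only.
os.environ["DEMO_API_KEY"] = "sk-demo"
os.environ["DEMO_BASE_URL"] = "https://example.com/v1"
cfg = provider_config("DEMO")
```

Keeping keys in the environment (rather than in a config file) avoids accidentally committing credentials to the repository.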
⚠️ IMPORTANT: OpenAI-Compatible API Only
This system currently supports ONLY OpenAI-compatible API interfaces. All models must provide OpenAI-compatible endpoints, regardless of the actual provider.
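Concretely, "OpenAI-compatible" means every model is called with the same chat-completions request shape, whatever the real provider. A minimal sketch of that request body (the model name and messages are illustrative):

```python
# The same payload shape is POSTed to {BASE_URL}/chat/completions for
# every configured model, with the matching provider's API key.
payload = {
    "model": "gpt-5",  # any model name the endpoint accepts
    "messages": [
        {"role": "system", "content": "You are a clinical reviewer."},
        {"role": "user",
         "content": "Patient presents with chest pain symptoms, how to diagnose?"},
    ],
    "temperature": 0,  # deterministic judging is usually preferred
}
```

Models such as Claude, Gemini, or Qwen must therefore be reached through an endpoint (native or gateway) that accepts this shape.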
🌐 Third-Party Multi-Model Platforms (Recommended)
For models that don't natively support
