MedicalAiBenchEval

A comprehensive medical AI evaluation framework based on GAPS methodology. Features automated assessment pipeline, thoracic surgery dataset (92 cases), multi-model support, parallel processing, and clinical scoring. Supports OpenAI-compatible APIs including GPT, Claude, Gemini, Qwen with detailed analytics and visualization.

Install / Use

/learn @AQ-MedAI/MedicalAiBenchEval
Supported Platforms

Claude Code
Claude Desktop
Gemini CLI

README

Medical AI Evaluation Framework: Clinical Benchmark Dataset and Automated Assessment Pipeline

License: MIT Python 3.12+

Overview

This Medical AI Evaluation Framework provides a comprehensive evaluation system designed specifically for assessing AI models in clinical scenarios. Based on the GAPS (Grounded, Automated, Personalized, Scalable) methodology, this framework includes both a curated clinical benchmark dataset and an automated assessment pipeline for medical AI systems.

The framework addresses the critical need for standardized evaluation of AI clinical decision-making by providing:

  • Clinically Grounded Assessment: Evaluation criteria based on real medical guidelines and expert knowledge
  • Automated Pipeline: Streamlined processing from raw responses to detailed performance metrics
  • Multi-Model Support: Simultaneous evaluation of multiple AI models with comparative analysis
  • Scalable Architecture: Efficient processing of large datasets with parallel execution capabilities

Key Features

  • 🏥 Medical-Specific Evaluation: Specialized rubrics for clinical scenarios with positive/negative scoring
  • 🔄 Parallel Processing: Simultaneous execution of Met/Not Met review and irrelevant content detection
  • 📊 Comprehensive Analytics: Detailed statistical analysis with visualization reports
  • 🎯 Multi-Model Assessment: Support for evaluating multiple AI models simultaneously
  • ⚙️ Flexible Configuration: Customizable models, voting strategies, and evaluation parameters
  • 📈 Rich Visualization: Automated generation of performance charts and comparative analysis
  • 🔧 Modular Design: Independent modules for different evaluation stages
  • 📋 Standardized Output: Consistent Excel-based reporting with detailed metrics

Quick Start

Installation

# Clone the repository
git clone <repository-url>
cd medical-ai-bench-eval

# Install dependencies
pip install -r requirements.txt

Basic Usage

# Run complete evaluation pipeline
python medical_evaluation_pipeline.py input_data.xlsx

# With custom output file
python medical_evaluation_pipeline.py input_data.xlsx -o results/evaluation_results.xlsx

# Enable verbose logging
python medical_evaluation_pipeline.py input_data.xlsx -v

System Architecture

The system processes medical AI responses through a four-stage pipeline:

  1. Input Processing: Reads Excel files containing medical questions, evaluation rubrics, and AI model responses
  2. Parallel Evaluation: Simultaneously executes Met/Not Met review and irrelevant content detection
  3. Intelligent Scoring: Calculates comprehensive scores based on clinical evaluation criteria
  4. Analysis & Reporting: Generates detailed statistical reports and visualizations
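
The parallel phase (steps 1 and 2) can be sketched with `asyncio.gather`. The coroutine names below (`run_met_review`, `detect_irrelevant_content`, `evaluate_row`) are illustrative placeholders, not the framework's actual API:

```python
import asyncio

async def run_met_review(row: dict) -> dict:
    # Step 1 placeholder: multi-model Met/Not Met review with voting.
    return {"met_results": []}

async def detect_irrelevant_content(row: dict) -> dict:
    # Step 2 placeholder: irrelevant-content extraction and level assessment.
    return {"irrelevant": []}

async def evaluate_row(row: dict) -> dict:
    # Steps 1 and 2 are independent, so they run concurrently.
    met, irrelevant = await asyncio.gather(
        run_met_review(row),
        detect_irrelevant_content(row),
    )
    # Result merging: combine the parallel outputs before scoring.
    return {**row, **met, **irrelevant}

merged = asyncio.run(evaluate_row({"question": "chest pain?"}))
print(sorted(merged))
```

Because the two review steps only read the input row, running them concurrently halves the wall-clock cost of the slower model calls without any shared-state coordination.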

Input Excel File
                           ↓
┌──────────────────────────────────────────────────────┐
│              Parallel Processing Phase               │
├──────────────────────────┬───────────────────────────┤
│ Step 1: Met Review       │ Step 2: Irrelevant Content│
│ - Multi-model Review     │ - Content Extraction      │
│ - Voting Decision        │ - Level Assessment        │
│ - Result Summary         │ - Voting Classification   │
└──────────────────────────┴───────────────────────────┘
                           ↓
┌──────────────────────────────────────────────────────┐
│                    Result Merging                    │
│   - Intelligent merging of parallel results          │
│   - Data integrity check                             │
└──────────────────────────────────────────────────────┘
                           ↓
┌──────────────────────────────────────────────────────┐
│              Step 3: Score Calculation               │
│   - Multi-dimensional score statistics               │
│   - Irrelevant content deduction                     │
│   - Normalization processing                         │
└──────────────────────────────────────────────────────┘
                           ↓
┌──────────────────────────────────────────────────────┐
│        Step 4: Data Analysis & Visualization         │
│   - Statistical analysis reports                     │
│   - Diverse charts                                   │
│   - CSV data export                                  │
└──────────────────────────────────────────────────────┘
                           ↓
Output Excel File + Analysis Report

Data Format

Input Format

The system accepts Excel files with the following structure:

| Column Name | Description | Example |
|-------------|-------------|---------|
| question | Medical question | "Patient presents with chest pain symptoms, how to diagnose?" |
| final_merged_json | Evaluation points JSON | [{"id":1,"claim":"Need to inquire about symptoms","level":"A1"}] |
| gpt_5_answer | GPT model response | "First need to ask the patient about symptoms in detail..." |
| gemini_2_5_pro_answer | Gemini model response | "Recommend performing ECG examination..." |
| claude_opus_4_answer | Claude model response | "Should consider acute coronary syndrome..." |
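
As a sketch of reading this schema, the one-row frame below stands in for `pd.read_excel("input_data.xlsx")`; the `rubric` column name is an illustrative choice, not part of the documented format:

```python
import json

import pandas as pd

# Hedged sketch: a one-row frame matching the input schema above stands in
# for pd.read_excel("input_data.xlsx").
df = pd.DataFrame([{
    "question": "Patient presents with chest pain symptoms, how to diagnose?",
    "final_merged_json": '[{"id": 1, "claim": "Need to inquire about symptoms", "level": "A1"}]',
    "gpt_5_answer": "First need to ask the patient about symptoms in detail...",
}])

# Parse the rubric JSON once per row so scoring can iterate over point dicts.
df["rubric"] = df["final_merged_json"].apply(json.loads)
print(df.loc[0, "rubric"][0]["level"])  # → A1
```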

Evaluation Points JSON Format

[
  {
    "id": 1,
    "claim": "Need to ask about symptom duration",
    "level": "A1",
    "desc": "Detailed inquiry about chest pain duration, nature, etc."
  },
  {
    "id": 2,
    "claim": "Should avoid mentioning unrelated treatment plans",
    "level": "S2",
    "desc": "Should not mention treatments unrelated to chest pain"
  }
]

Level Description:

  • A1-A3: Positive points (A1=5 points, A2=3 points, A3=1 point)
  • S1-S4: Negative points (S1=-1 point, S2=-2 points, S3=-3 points, S4=-4 points)
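
A minimal sketch of this scoring scheme: the point values mirror the list above, but `score_rubric` and its arguments are illustrative, not the framework's actual implementation:

```python
# Point values per level, as documented above.
LEVEL_POINTS = {
    "A1": 5, "A2": 3, "A3": 1,               # positive points
    "S1": -1, "S2": -2, "S3": -3, "S4": -4,  # negative points
}

def score_rubric(points, met_ids, irrelevant_deduction=0):
    """Compute (final_total_score, normalized) for one response.

    points: list of {"id": ..., "level": ...} dicts from final_merged_json.
    met_ids: ids of points the reviewers judged as Met/triggered.
    """
    # max_possible sums only the positive (A-level) points.
    max_possible = sum(LEVEL_POINTS[p["level"]] for p in points
                       if p["level"].startswith("A"))
    # Met items contribute their signed value; irrelevant content deducts more.
    final_total = sum(LEVEL_POINTS[p["level"]] for p in points
                      if p["id"] in met_ids) - irrelevant_deduction
    # Normalize to [0, 1] against the theoretical maximum.
    normalized = max(0.0, final_total / max_possible) if max_possible else 0.0
    return final_total, normalized

points = [
    {"id": 1, "level": "A1"},  # +5 if met
    {"id": 2, "level": "S2"},  # -2 if triggered
]
print(score_rubric(points, met_ids={1}))     # (5, 1.0)
print(score_rubric(points, met_ids={1, 2}))  # (3, 0.6)
```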

Output Format

The system generates multiple output files:

  1. Final Score Excel (processed_result_final_YYYYMMDD_HHMMSS.xlsx)

    • Contains all original data
    • Met/Not Met review results
    • Irrelevant content detection results
    • Detailed scoring statistics
  2. Data Analysis Report (data/output/analysis/)

    • medical_evaluation_report_YYYYMMDD_HHMMSS.png - Visualization charts
    • medical_analysis_report_YYYYMMDD_HHMMSS.txt - Detailed analysis report
    • model_performance_summary_YYYYMMDD_HHMMSS.csv - Performance summary

Scoring Metrics

| Metric | Description |
|--------|-------------|
| max_possible | Theoretical maximum score (sum of all positive points) |
| final_total_score | Actual score (Met items score - irrelevant content deduction) |
| normalized | Normalized score (between 0-1) |
| positive_total_count | Number of positive point hits |
| rubric_total_count | Number of negative point hits |
| irrelevant_total_count | Total irrelevant content count |

Configuration

Basic Configuration

Environment Requirements

  • Python: 3.12+
  • Operating System: Windows, macOS, Linux
  • Core Dependencies: pandas, numpy, matplotlib, seaborn, asyncio, langchain, openpyxl, xlsxwriter

Installation Steps

  1. Clone the repository:
git clone <repository-url>
cd medical-ai-bench-eval
  2. Install dependencies:
pip install -r requirements.txt
  3. Configure API Keys (if using external AI model services):

    API Configuration Guide

    Environment Variables Setup

    Linux/Mac:

    # OpenAI
    export OPENAI_API_KEY="sk-your-actual-key"
    export OPENAI_BASE_URL="https://api.openai.com/v1"
    
    # Claude (Anthropic)
    export ANTHROPIC_API_KEY="sk-ant-your-actual-key"
    export ANTHROPIC_BASE_URL="https://api.anthropic.com"
    
    # Google Gemini
    export GOOGLE_API_KEY="your-google-api-key"
    export GOOGLE_BASE_URL="https://generativelanguage.googleapis.com/v1beta"
    
    # Moonshot Kimi
    export MOONSHOT_API_KEY="sk-your-moonshot-key"
    export MOONSHOT_BASE_URL="https://api.moonshot.cn/v1"
    
    # Alibaba Qwen
    export DASHSCOPE_API_KEY="your-dashscope-key"
    export DASHSCOPE_BASE_URL="https://dashscope.aliyuncs.com/api/v1"
    
    # Baichuan
    export BAICHUAN_API_KEY="your-baichuan-key"
    export BAICHUAN_BASE_URL="https://api.baichuan-ai.com/v1"
    
    # DeepSeek
    export DEEPSEEK_API_KEY="sk-your-deepseek-key"
    export DEEPSEEK_BASE_URL="https://api.deepseek.com/v1"
    
    # Zhipu ChatGLM
    export ZHIPU_API_KEY="your-zhipu-key"
    export ZHIPU_BASE_URL="https://open.bigmodel.cn/api/paas/v4"
    

    Windows PowerShell:

    $Env:OPENAI_API_KEY="sk-your-actual-key"
    $Env:OPENAI_BASE_URL="https://api.openai.com/v1"
    

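
As a sketch, the variables above can be resolved per provider by prefix. The `provider_config` helper below is hypothetical; its output is the kind of dict an OpenAI-compatible client constructor would accept:

```python
import os

def provider_config(prefix: str = "OPENAI") -> dict:
    """Read <PREFIX>_API_KEY and <PREFIX>_BASE_URL for one provider."""
    return {
        "api_key": os.environ.get(f"{prefix}_API_KEY", ""),
        "base_url": os.environ.get(f"{prefix}_BASE_URL", ""),
    }

# Demo values only; in practice these come from your shell environment.
os.environ["DEEPSEEK_API_KEY"] = "sk-your-deepseek-key"
os.environ["DEEPSEEK_BASE_URL"] = "https://api.deepseek.com/v1"

cfg = provider_config("DEEPSEEK")
print(cfg["base_url"])  # → https://api.deepseek.com/v1
```

Switching providers then only means swapping the prefix, which matches the OpenAI-compatible-only constraint described below.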
    Model Configuration

    ⚠️ IMPORTANT: OpenAI-Compatible API Only

    This system currently supports ONLY OpenAI-compatible API interfaces. All models must provide OpenAI-compatible endpoints, regardless of the actual provider.

    🌐 Third-Party Multi-Model Platforms (Recommended)

    For models that don't natively support an OpenAI-compatible endpoint, a third-party multi-model platform can expose them behind a single OpenAI-compatible interface; point the corresponding *_BASE_URL at the platform's endpoint.
