DramaBench
<div align="center">
A Six-Dimensional Evaluation Framework for Drama Script Continuation
🌐 Website • ✨ Interactive Demo • 📊 Leaderboard • 🤗 Dataset
</div>

📋 Table of Contents
- Overview
- Quick Start
- Project Components
- Web Demo
- Dataset
- Evaluation Framework
- Leaderboard
- Documentation
- Contributing
- Citation
- License
<a id="overview"></a>
🎯 Overview
DramaBench is a comprehensive benchmark for evaluating drama script continuation capabilities of large language models. It provides:
Core Components
- 🌐 Project Website - Interactive showcase with evaluation results and case studies
- ✨ Interactive Demo - Try script continuation with multiple LLM models (user-provided API key)
- 💾 Large-Scale Dataset - 1,103 drama scripts with human annotations
- 📊 Evaluation Framework - 6 independent dimensions with rigorous metrics
- 🏆 Model Leaderboard - Compare 8 SOTA language models
- 📝 Case Studies - 24 curated examples with detailed analysis
- 🔧 Evaluation Prompts - LLM-based labeling templates for all 6 dimensions
Six Evaluation Dimensions
- Format Standards (Rule-based) - Screenplay format compliance
- Narrative Efficiency (LLM-labeled) - Story progression effectiveness
- Character Consistency (LLM-labeled) - Character voice and behavior
- Emotional Depth (LLM-labeled) - Emotional arc development
- Logic Consistency (LLM-labeled) - Factual coherence and continuity
- Conflict Handling (LLM-labeled) - Conflict development quality
Key Statistics
- 1,103 unique drama scripts
- 8,824 total evaluations (1,103 scripts × 8 models)
- 8 state-of-the-art language models
- 6 independent evaluation dimensions
- 252 statistical significance tests (65.9% significant)
- 24 curated case studies
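The kind of pairwise comparison behind the significance tests can be illustrated with a paired sign test. This is a hedged sketch only: DramaBench's actual statistical procedure is not described here, and the per-script scores below are invented.

```python
from math import comb

def sign_test_p(wins: int, losses: int) -> float:
    """Two-sided p-value for a paired sign test (ties dropped):
    probability of a win/loss split at least this extreme under a fair coin."""
    n = wins + losses
    k = max(wins, losses)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Invented per-script scores for two hypothetical models on 10 scripts.
model_a = [4.1, 3.8, 4.5, 3.9, 4.2, 4.0, 3.7, 4.4, 4.1, 3.9]
model_b = [3.9, 3.6, 4.1, 4.0, 3.8, 3.7, 3.5, 4.0, 3.9, 3.6]
wins = sum(a > b for a, b in zip(model_a, model_b))
losses = sum(a < b for a, b in zip(model_a, model_b))
print(wins, losses, round(sign_test_p(wins, losses), 4))
```

With one such test per model pair and dimension, counting how many fall below a significance threshold yields a "% significant" figure like the one reported above.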
<a id="quick-start"></a>
🚀 Quick Start
Prerequisites
- Python 3.10+
- Web browser (Chrome, Safari, Firefox, or Edge)
Launch Web Demo
Method 1: One-Click Start (Easiest)
```bash
cd DramaBench
./start_demo.sh
```
This will automatically:
- ✅ Start a local HTTP server on port 8000
- ✅ Open the demo in your default browser
- ✅ Navigate to http://localhost:8000
Method 2: Manual Server Start
```bash
cd DramaBench

# Using uv (if available)
uv run python -m http.server 8000

# Or using Python 3 directly
python3 -m http.server 8000

# Then open http://localhost:8000 in your browser
```
⚠️ Important Note
Due to browser security restrictions on the `file://` protocol, you must view the demo through a local HTTP server. Opening the HTML files directly will cause data-loading errors.
<a id="project-components"></a>
🧩 Project Components
1. Project Website & Interactive Demo
An interactive, Apple-inspired web interface for exploring evaluation results and trying script continuation.
Website Features:
- 📊 Interactive leaderboard with dimension filters
- 📝 Case studies explorer with 24 examples
- 🎨 Premium dark gradient design
- 📱 Fully responsive (mobile/tablet/desktop)
- ⚡ Pure HTML/CSS/JavaScript (no frameworks)
Interactive Demo Features:
- ✨ Try script continuation with 4 SOTA models (GPT-5.2, Gemini 3, GLM-4.7, MiniMax M2.1)
- 🔑 User-provided OpenRouter API key (stored locally)
- 📜 500 drama scripts from DramaBench dataset
- 🎭 Official prompt template for generation
- 📊 Compare AI-generated vs ground truth continuations
- 🎨 Matching Apple-style design
Pages:
- `index.html` - Main landing page
- `web/leaderboard.html` - Model rankings
- `web/cases.html` - Case studies browser
- `web/demo.html` - Interactive script continuation demo
→ View Live Website | → Try Interactive Demo
2. Dataset
🎉 Now Available on Hugging Face!
The DramaBench dataset is being released progressively to ensure quality and gather community feedback.
Current Release (v2.0):
- ✅ 500 Drama Scripts - Available now on Hugging Face
- 📥 Download: FutureMa/DramaBench
- 📄 Format: JSONL with structured metadata
- 🔓 License: MIT License
- 📊 Usage: Load with the `datasets` library
Quick Start:
```python
from datasets import load_dataset

# Load dataset
dataset = load_dataset("FutureMa/DramaBench", split="train")

# Access samples
sample = dataset[0]
print(sample['title'])
print(sample['context'])
print(sample['continuation'])
```
Release Roadmap:

| Version | Samples | Status | Expected Release |
|---------|---------|--------|------------------|
| v1.0 | 100 | ✅ Released | 2025-12-23 |
| v2.0 | 500 | ✅ Available | 2026-01-01 |
| v3.0 (Full) | 1,103 | 📋 Planned | Q2 2026 |
Full Dataset Contents (v3.0):
- 1,103 drama script contexts and continuations
- Model-generated continuations (8 SOTA models)
- Human annotations and quality assessments
- Multi-dimensional evaluation metrics
- Error taxonomy and classification
3. Evaluation Prompts
✅ Now Available: LLM-based evaluation prompt templates for all 6 dimensions.
Location: prompts/ directory
Contents:
- `narrative_efficiency_prompt.txt` - Story progression effectiveness
- `character_consistency_prompt.txt` - Character voice and behavior consistency
- `emotional_depth_prompt.txt` - Emotional arc development
- `logic_consistency_prompt.txt` - Factual coherence and continuity
- `conflict_handling_prompt.txt` - Conflict development and resolution
- `dialogue_quality_prompt.txt` - Dialogue naturalness and purpose
Quick Start:
```python
import json

# Load a prompt template
with open('prompts/narrative_efficiency_prompt.txt', 'r') as f:
    prompt = f.read()

# Fill placeholders
prompt = prompt.replace('{CONTEXT}', script_context)
prompt = prompt.replace('{CONTINUATION}', generated_continuation)
prompt = prompt.replace('{MODEL}', 'GPT-4')
prompt = prompt.replace('{SCRIPT_ID}', 'script_001')

# Send to LLM and parse the structured JSON output
response = llm_api_call(prompt)
evaluation = json.loads(response)
```
See prompts/README.md for detailed usage instructions.
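Running all six templates over one continuation is a short loop. A minimal sketch, assuming the file naming shown above; `llm_api_call` is the same hypothetical helper as in the snippet, and error handling is omitted:

```python
import json
from pathlib import Path

DIMENSIONS = [
    "narrative_efficiency", "character_consistency", "emotional_depth",
    "logic_consistency", "conflict_handling", "dialogue_quality",
]

def evaluate_all(prompts_dir, llm_api_call, context, continuation,
                 model="GPT-4", script_id="script_001"):
    """Fill every dimension template and collect the parsed JSON verdicts."""
    results = {}
    for dim in DIMENSIONS:
        template = Path(prompts_dir, f"{dim}_prompt.txt").read_text()
        prompt = (template.replace("{CONTEXT}", context)
                          .replace("{CONTINUATION}", continuation)
                          .replace("{MODEL}", model)
                          .replace("{SCRIPT_ID}", script_id))
        results[dim] = json.loads(llm_api_call(prompt))
    return results
```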
Coming Soon: Full evaluation pipeline including:
- Statistical analysis scripts
- Visualization generation tools
- Reproducibility automation scripts
<a id="web-demo"></a>
🌐 Website & Interactive Demo
Live Website
Visit dramabench.pages.dev to explore:
- Homepage - Project overview and statistics
- Leaderboard - Compare 8 SOTA models across 6 dimensions
- Case Studies - Browse 24 curated examples with detailed analysis
- Interactive Demo - Try script continuation yourself
Interactive Demo
Try it now: dramabench.pages.dev/web/demo.html
Experience drama script continuation with state-of-the-art language models:
Features:
- 🎭 500 Drama Scripts - Select from DramaBench v2.0 dataset
- 🤖 4 SOTA Models - GPT-5.2, Gemini 3 Flash, GLM-4.7, MiniMax M2.1
- 🔑 Your API Key - Uses OpenRouter API (bring your own key)
- 📊 Compare Results - View AI-generated vs ground truth side-by-side
- 🎨 Apple Design - Beautiful, responsive interface
How to Use:
- Get your free API key from OpenRouter
- Visit the demo page
- Enter your API key (stored locally in your browser)
- Select a script from 500 options
- Choose your preferred model
- Generate and compare continuations
Cost: Pay-as-you-go through OpenRouter (typically $0.01-0.10 per generation)
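Under the hood, generation goes through OpenRouter's OpenAI-compatible chat completions endpoint. A hedged Python sketch of building an equivalent request (the demo itself is browser-side JavaScript, and the model ID and prompt wording here are illustrative, not the demo's official template):

```python
import json

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"

def build_request(api_key: str, model: str, script_context: str) -> tuple:
    """Return (headers, body) for an OpenRouter chat completion request."""
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    body = json.dumps({
        "model": model,
        "messages": [
            {"role": "user",
             "content": f"Continue this drama script:\n\n{script_context}"},
        ],
    })
    return headers, body
```

POSTing `body` with `headers` to `OPENROUTER_URL` returns the continuation in the usual chat-completions response shape.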
Website Features
Interactive Leaderboard
- Filter by dimension (overall + 6 dimensions)
- Expandable model details with per-dimension scores
- Rank badges (gold/silver/bronze)
- Real-time filtering and sorting
Case Studies Explorer
- 24 curated success/failure examples
- Filter by dimension and type
- Script excerpts with metrics
- Analysis insights and takeaways
Design
- Apple-inspired UI with premium dark gradients
- SF Pro font family (system fonts)
- Glassmorphism effects
- Smooth animations and transitions
- Fully responsive layout
Technologies
- Pure HTML/CSS/JavaScript (no frameworks)
- Apple Design Language principles
- CSS Grid & Flexbox layouts
- Backdrop filters for glassmorphism
- CSS animations for smooth transitions
Local Development
Regenerate web demo data from source:
```bash
cd DramaBench
uv run python web/scripts/process_data.py
```
This processes:
- 6 dimension metrics CSV files (8,824 evaluations)
- 24 case studies with detailed analysis
- Generates web-friendly JSON in `web/data/`
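As a hedged illustration of the kind of transformation `process_data.py` performs (the column names and aggregation are invented for this sketch, not taken from the actual script):

```python
import csv
import json
from io import StringIO

def csv_to_leaderboard_json(csv_text: str) -> str:
    """Aggregate per-evaluation CSV rows into mean scores per model."""
    totals, counts = {}, {}
    for row in csv.DictReader(StringIO(csv_text)):
        model = row["model"]
        totals[model] = totals.get(model, 0.0) + float(row["score"])
        counts[model] = counts.get(model, 0) + 1
    means = {m: round(totals[m] / counts[m], 3) for m in totals}
    return json.dumps(means, indent=2)

sample = "model,score\ngpt,4.0\ngpt,3.0\nglm,5.0\n"
print(csv_to_leaderboard_json(sample))
```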
<a id="dataset"></a>
💾 Dataset
Dataset Access
🤗 Hugging Face Dataset: FutureMa/DramaBench
Current Release: v2.0 (500 samples) - Available Now!
Quick Start
Load with Datasets Library:
```python
from datasets import load_dataset

# Load the dataset
dataset = load_dataset("FutureMa/DramaBench", split="train")

# Access a sample
sample = dataset[0]
print(f"Title: {sample['title']}")
print(f"Context: {sample['context'][:200]}...")
print(f"Continuation: {sample['continuation'][:200]}...")
print(f"Stats: {sample['stats']}")
```
Analyze Dataset:
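A minimal sketch of corpus-level analysis, assuming the `context`/`continuation` fields from the loading example above (the real dataset requires network access to download):

```python
def length_stats(samples):
    """Mean character lengths of the context and continuation fields."""
    ctx = [len(s["context"]) for s in samples]
    cont = [len(s["continuation"]) for s in samples]
    return {
        "n": len(samples),
        "mean_context_chars": sum(ctx) / len(ctx),
        "mean_continuation_chars": sum(cont) / len(cont),
    }

# With the real dataset (requires network):
# from datasets import load_dataset
# stats = length_stats(load_dataset("FutureMa/DramaBench", split="train"))
```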
