DramaBench
<div align="center">
A Six-Dimensional Evaluation Framework for Drama Script Continuation
🌐 Website • ✨ Interactive Demo • 📊 Leaderboard • 🤗 Dataset
</div>

📋 Table of Contents
- Overview
- Quick Start
- Project Components
- Web Demo
- Dataset
- Evaluation Framework
- Leaderboard
- Documentation
- Contributing
- Citation
- License
<a id="overview"></a>
🎯 Overview
DramaBench is a comprehensive benchmark for evaluating drama script continuation capabilities of large language models. It provides:
Core Components
- 🌐 Project Website - Interactive showcase with evaluation results and case studies
- ✨ Interactive Demo - Try script continuation with multiple LLM models (user-provided API key)
- 💾 Large-Scale Dataset - 1,103 drama scripts with human annotations
- 📊 Evaluation Framework - 6 independent dimensions with rigorous metrics
- 🏆 Model Leaderboard - Compare 8 SOTA language models
- 📝 Case Studies - 24 curated examples with detailed analysis
- 🔧 Evaluation Prompts - LLM-based labeling templates for all 6 dimensions
Six Evaluation Dimensions
- Format Standards (Rule-based) - Screenplay format compliance
- Narrative Efficiency (LLM-labeled) - Story progression effectiveness
- Character Consistency (LLM-labeled) - Character voice and behavior
- Emotional Depth (LLM-labeled) - Emotional arc development
- Logic Consistency (LLM-labeled) - Factual coherence and continuity
- Conflict Handling (LLM-labeled) - Conflict development quality
Key Statistics
- 1,103 unique drama scripts
- 8,824 total evaluations (1,103 scripts × 8 models)
- 8 state-of-the-art language models
- 6 independent evaluation dimensions
- 252 statistical significance tests (65.9% significant)
- 24 curated case studies
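The kind of pairwise comparison behind the significance tests can be illustrated with a paired sign test. This is a hedged sketch only: DramaBench's actual statistical procedure is not described here, and the per-script scores below are invented.

```python
from math import comb

def sign_test_p(wins: int, losses: int) -> float:
    """Two-sided p-value for a paired sign test (ties dropped):
    probability of a win/loss split at least this extreme under a fair coin."""
    n = wins + losses
    k = max(wins, losses)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Invented per-script scores for two hypothetical models on 10 scripts.
model_a = [4.1, 3.8, 4.5, 3.9, 4.2, 4.0, 3.7, 4.4, 4.1, 3.9]
model_b = [3.9, 3.6, 4.1, 4.0, 3.8, 3.7, 3.5, 4.0, 3.9, 3.6]
wins = sum(a > b for a, b in zip(model_a, model_b))
losses = sum(a < b for a, b in zip(model_a, model_b))
print(wins, losses, round(sign_test_p(wins, losses), 4))
```

With one such test per model pair and dimension, counting how many fall below a significance threshold yields a "% significant" figure like the one reported above.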
<a id="quick-start"></a>
🚀 Quick Start
Prerequisites
- Python 3.10+
- Web browser (Chrome, Safari, Firefox, or Edge)
Launch Web Demo
Method 1: One-Click Start (Easiest)
```bash
cd DramaBench
./start_demo.sh
```
This will automatically:
- ✅ Start a local HTTP server on port 8000
- ✅ Open the demo in your default browser
- ✅ Navigate to http://localhost:8000
Method 2: Manual Server Start
```bash
cd DramaBench

# Using uv (if available)
uv run python -m http.server 8000

# Or using Python 3 directly
python3 -m http.server 8000

# Then open http://localhost:8000 in your browser
```
⚠️ Important Note
Due to browser security restrictions on the `file://` protocol, you must view the demo through a local HTTP server. Opening the HTML files directly will cause data-loading errors.
<a id="project-components"></a>
🧩 Project Components
1. Project Website & Interactive Demo
An interactive, Apple-inspired web interface for exploring evaluation results and trying script continuation.
Website Features:
- 📊 Interactive leaderboard with dimension filters
- 📝 Case studies explorer with 24 examples
- 🎨 Premium dark gradient design
- 📱 Fully responsive (mobile/tablet/desktop)
- ⚡ Pure HTML/CSS/JavaScript (no frameworks)
Interactive Demo Features:
- ✨ Try script continuation with 4 SOTA models (GPT-5.2, Gemini 3, GLM-4.7, MiniMax M2.1)
- 🔑 User-provided OpenRouter API key (stored locally)
- 📜 500 drama scripts from DramaBench dataset
- 🎭 Official prompt template for generation
- 📊 Compare AI-generated vs ground truth continuations
- 🎨 Matching Apple-style design
Pages:
- `index.html` - Main landing page
- `web/leaderboard.html` - Model rankings
- `web/cases.html` - Case studies browser
- `web/demo.html` - Interactive script continuation demo
→ View Live Website | → Try Interactive Demo
2. Dataset
🎉 Now Available on Hugging Face!
The DramaBench dataset is being released progressively to ensure quality and gather community feedback.
Current Release (v2.0):
- ✅ 500 Drama Scripts - Available now on Hugging Face
- 📥 Download: FutureMa/DramaBench
- 📄 Format: JSONL with structured metadata
- 🔓 License: MIT License
- 📊 Usage: Load with the `datasets` library
Quick Start:
```python
from datasets import load_dataset

# Load dataset
dataset = load_dataset("FutureMa/DramaBench", split="train")

# Access samples
sample = dataset[0]
print(sample['title'])
print(sample['context'])
print(sample['continuation'])
```
Release Roadmap:

| Version | Samples | Status | Expected Release |
|---------|---------|--------|------------------|
| v1.0 | 100 | ✅ Released | 2025-12-23 |
| v2.0 | 500 | ✅ Available | 2026-01-01 |
| v3.0 (Full) | 1,103 | 📋 Planned | Q2 2026 |
Full Dataset Contents (v3.0):
- 1,103 drama script contexts and continuations
- Model-generated continuations (8 SOTA models)
- Human annotations and quality assessments
- Multi-dimensional evaluation metrics
- Error taxonomy and classification
3. Evaluation Prompts
✅ Now Available: LLM-based evaluation prompt templates for all 6 dimensions.
Location: prompts/ directory
Contents:
- `narrative_efficiency_prompt.txt` - Story progression effectiveness
- `character_consistency_prompt.txt` - Character voice and behavior consistency
- `emotional_depth_prompt.txt` - Emotional arc development
- `logic_consistency_prompt.txt` - Factual coherence and continuity
- `conflict_handling_prompt.txt` - Conflict development and resolution
- `dialogue_quality_prompt.txt` - Dialogue naturalness and purpose
Quick Start:
```python
import json

# Load a prompt template
with open('prompts/narrative_efficiency_prompt.txt', 'r') as f:
    prompt = f.read()

# Fill placeholders
prompt = prompt.replace('{CONTEXT}', script_context)
prompt = prompt.replace('{CONTINUATION}', generated_continuation)
prompt = prompt.replace('{MODEL}', 'GPT-4')
prompt = prompt.replace('{SCRIPT_ID}', 'script_001')

# Send to LLM and parse the structured JSON output
response = llm_api_call(prompt)
evaluation = json.loads(response)
```
See prompts/README.md for detailed usage instructions.
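Running all six templates over one continuation is a short loop. A minimal sketch, assuming the file naming shown above; `llm_api_call` is the same hypothetical helper as in the snippet, and error handling is omitted:

```python
import json
from pathlib import Path

DIMENSIONS = [
    "narrative_efficiency", "character_consistency", "emotional_depth",
    "logic_consistency", "conflict_handling", "dialogue_quality",
]

def evaluate_all(prompts_dir, llm_api_call, context, continuation,
                 model="GPT-4", script_id="script_001"):
    """Fill every dimension template and collect the parsed JSON verdicts."""
    results = {}
    for dim in DIMENSIONS:
        template = Path(prompts_dir, f"{dim}_prompt.txt").read_text()
        prompt = (template.replace("{CONTEXT}", context)
                          .replace("{CONTINUATION}", continuation)
                          .replace("{MODEL}", model)
                          .replace("{SCRIPT_ID}", script_id))
        results[dim] = json.loads(llm_api_call(prompt))
    return results
```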
Coming Soon: Full evaluation pipeline including:
- Statistical analysis scripts
- Visualization generation tools
- Reproducibility automation scripts
<a id="web-demo"></a>
🌐 Website & Interactive Demo
Live Website
Visit dramabench.pages.dev to explore:
- Homepage - Project overview and statistics
- Leaderboard - Compare 8 SOTA models across 6 dimensions
- Case Studies - Browse 24 curated examples with detailed analysis
- Interactive Demo - Try script continuation yourself
Interactive Demo
Try it now: dramabench.pages.dev/web/demo.html
Experience drama script continuation with state-of-the-art language models:
Features:
- 🎭 500 Drama Scripts - Select from DramaBench v2.0 dataset
- 🤖 4 SOTA Models - GPT-5.2, Gemini 3 Flash, GLM-4.7, MiniMax M2.1
- 🔑 Your API Key - Uses OpenRouter API (bring your own key)
- 📊 Compare Results - View AI-generated vs ground truth side-by-side
- 🎨 Apple Design - Beautiful, responsive interface
How to Use:
- Get your free API key from OpenRouter
- Visit the demo page
- Enter your API key (stored locally in your browser)
- Select a script from 500 options
- Choose your preferred model
- Generate and compare continuations
Cost: Pay-as-you-go through OpenRouter (typically $0.01-0.10 per generation)
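Under the hood, generation goes through OpenRouter's OpenAI-compatible chat completions endpoint. A hedged Python sketch of building an equivalent request (the demo itself is browser-side JavaScript, and the model ID and prompt wording here are illustrative, not the demo's official template):

```python
import json

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"

def build_request(api_key: str, model: str, script_context: str) -> tuple:
    """Return (headers, body) for an OpenRouter chat completion request."""
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    body = json.dumps({
        "model": model,
        "messages": [
            {"role": "user",
             "content": f"Continue this drama script:\n\n{script_context}"},
        ],
    })
    return headers, body
```

POSTing `body` with `headers` to `OPENROUTER_URL` returns the continuation in the usual chat-completions response shape.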
Website Features
Interactive Leaderboard
- Filter by dimension (overall + 6 dimensions)
- Expandable model details with per-dimension scores
- Rank badges (gold/silver/bronze)
- Real-time filtering and sorting
Case Studies Explorer
- 24 curated success/failure examples
- Filter by dimension and type
- Script excerpts with metrics
- Analysis insights and takeaways
Design
- Apple-inspired UI with premium dark gradients
- SF Pro font family (system fonts)
- Glassmorphism effects
- Smooth animations and transitions
- Fully responsive layout
Technologies
- Pure HTML/CSS/JavaScript (no frameworks)
- Apple Design Language principles
- CSS Grid & Flexbox layouts
- Backdrop filters for glassmorphism
- CSS animations for smooth transitions
Local Development
Regenerate web demo data from source:
```bash
cd DramaBench
uv run python web/scripts/process_data.py
```
This processes:
- 6 dimension metrics CSV files (8,824 evaluations)
- 24 case studies with detailed analysis
- Generates web-friendly JSON in `web/data/`
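As a hedged illustration of the kind of transformation `process_data.py` performs (the column names and aggregation are invented for this sketch, not taken from the actual script):

```python
import csv
import json
from io import StringIO

def csv_to_leaderboard_json(csv_text: str) -> str:
    """Aggregate per-evaluation CSV rows into mean scores per model."""
    totals, counts = {}, {}
    for row in csv.DictReader(StringIO(csv_text)):
        model = row["model"]
        totals[model] = totals.get(model, 0.0) + float(row["score"])
        counts[model] = counts.get(model, 0) + 1
    means = {m: round(totals[m] / counts[m], 3) for m in totals}
    return json.dumps(means, indent=2)

sample = "model,score\ngpt,4.0\ngpt,3.0\nglm,5.0\n"
print(csv_to_leaderboard_json(sample))
```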
<a id="dataset"></a>
💾 Dataset
Dataset Access
🤗 Hugging Face Dataset: FutureMa/DramaBench
Current Release: v2.0 (500 samples) - Available Now!
Quick Start
Load with Datasets Library:
```python
from datasets import load_dataset

# Load the dataset
dataset = load_dataset("FutureMa/DramaBench", split="train")

# Access a sample
sample = dataset[0]
print(f"Title: {sample['title']}")
print(f"Context: {sample['context'][:200]}...")
print(f"Continuation: {sample['continuation'][:200]}...")
print(f"Stats: {sample['stats']}")
```
Analyze Dataset:
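A minimal sketch of corpus-level analysis, assuming the `context`/`continuation` fields from the loading example above (the real dataset requires network access to download):

```python
def length_stats(samples):
    """Mean character lengths of the context and continuation fields."""
    ctx = [len(s["context"]) for s in samples]
    cont = [len(s["continuation"]) for s in samples]
    return {
        "n": len(samples),
        "mean_context_chars": sum(ctx) / len(ctx),
        "mean_continuation_chars": sum(cont) / len(cont),
    }

# With the real dataset (requires network):
# from datasets import load_dataset
# stats = length_stats(load_dataset("FutureMa/DramaBench", split="train"))
```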
