
Writing

This benchmark tests how well LLMs incorporate a set of ten mandatory story elements (characters, objects, core concepts, attributes, motivations, etc.) into a short creative story.

Install / Use

/learn @lechmazur/Writing
About this skill

Quality Score

0/100

Supported Platforms

Claude Code
Claude Desktop
Gemini CLI

README

LLM Creative Story‑Writing Benchmark V4

This benchmark evaluates how well large language models (LLMs) follow a creative brief while still producing engaging fiction. Every story must meaningfully incorporate ten required elements: character, object, concept, attribute, action, method, setting, timeframe, motivation, and tone. With these building blocks standardized and length tightly controlled, differences in constraint satisfaction and literary quality become directly comparable. Multiple independent “grader” LLMs score each story on an 18‑question rubric, and we aggregate those judgments into model‑level results.


Overall scores


What’s measured

1) Craft and coherence (Q1–Q8)

Eight questions focus on narrative craft: character depth and motivation, plot structure and coherence, world building and atmosphere, story impact, originality, thematic cohesion, voice/point‑of‑view, and line‑level prose quality.

2) Element integration (Q9A–Q9J)

Ten questions check whether the story organically uses each required element: the specified character, object, core concept, attribute, action, method, setting, timeframe, motivation, and tone. If a category in the prompt is “None,” graders mark the corresponding 9‑series item as N/A.

3) Overall story score

We score each story per grader with a 60/40 weighted power mean (Hölder mean, p = 0.5) over the 18 rubric items (Q1–Q8 = 60%, 9A–9J = 40%, split evenly within each group). Compared with a plain average, p = 0.5 acts as a soft minimum: it sits closer to the lowest-scoring dimensions, so a weakness pulls the score down more than a strength can pull it up, and well-rounded craft is rewarded. The final story score is the mean of the per-grader scores.
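The weighted Hölder mean can be sketched in a few lines of Python. This is a hypothetical illustration (the benchmark's actual handling of N/A items and rounding may differ), but it shows why p = 0.5 behaves like a soft minimum:

```python
def weighted_power_mean(scores, weights, p=0.5):
    """Weighted Hölder mean: (sum_i w_i * x_i**p) ** (1/p), weights renormalized to 1."""
    total = sum(weights)
    return sum((w / total) * (x ** p) for x, w in zip(scores, weights)) ** (1 / p)

# Per-question weights: Q1-Q8 split 60% evenly, 9A-9J split 40% evenly.
weights = [0.60 / 8] * 8 + [0.40 / 10] * 10

craft = [8.0] * 8                 # hypothetical Q1-Q8 grades
elements = [9.0] * 9 + [3.0]      # hypothetical 9A-9J grades with one weak element
scores = craft + elements

plain_avg = sum(x * w for x, w in zip(scores, weights))   # ordinary weighted mean
soft_min = weighted_power_mean(scores, weights)           # p = 0.5 "soft minimum"
# soft_min lands below plain_avg: the single 3.0 drags the Hölder mean down harder.
```

With uniform scores the two means coincide; the gap opens only when some dimension lags, which is exactly the property the rubric wants.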


Results

Overall model means

The top bar chart summarizes mean story quality for each model with uncertainty bands. (Grader‑unweighted means; questions weighted 60/40.)

Full overall leaderboard

| Rank | LLM | Mean Score | Samples | SEM |
|-----:|------------------------|-----------:|--------:|----:|
| 1 | Claude Opus 4.6 Thinking 16K | 8.561 | 2795 | 0.0118 |
| 2 | Claude Opus 4.6 (no reasoning) | 8.533 | 2796 | 0.0123 |
| 3 | GPT-5.2 (medium reasoning) | 8.511 | 2796 | 0.0145 |
| 4 | GPT-5 Pro | 8.474 | 2796 | 0.0158 |
| 5 | GPT-5.1 (medium reasoning) | 8.438 | 2796 | 0.0134 |
| 6 | GPT-5 (medium reasoning) | 8.434 | 2796 | 0.0162 |
| 7 | Kimi K2-0905 | 8.331 | 2796 | 0.0199 |
| 8 | Gemini 3 Pro Preview | 8.221 | 2796 | 0.0170 |
| 9 | Gemini 2.5 Pro | 8.219 | 2796 | 0.0169 |
| 10 | Mistral Medium 3.1 | 8.201 | 2796 | 0.0185 |
| 11 | Claude Opus 4.5 Thinking 16K | 8.200 | 2796 | 0.0170 |
| 12 | Claude Opus 4.5 (no reasoning) | 8.195 | 2796 | 0.0172 |
| 13 | Claude Sonnet 4.5 Thinking 16K | 8.169 | 2796 | 0.0176 |
| 14 | Claude Sonnet 4.5 (no reasoning) | 8.112 | 2796 | 0.0179 |
| 15 | Qwen 3 Max Preview | 8.091 | 2796 | 0.0233 |
| 16 | Kimi K2.5 Thinking | 8.068 | 2796 | 0.0220 |
| 17 | Claude Opus 4.1 (no reasoning) | 8.068 | 2796 | 0.0197 |
| 18 | Qwen3 Max (2026-01-23) | 7.842 | 2796 | 0.0256 |
| 19 | MiniMax-M2.1 | 7.777 | 2795 | 0.0226 |
| 20 | Kimi K2 Thinking | 7.687 | 2796 | 0.0286 |
| 21 | Deepseek V3.2 | 7.601 | 2796 | 0.0279 |
| 22 | Mistral Large 3 | 7.595 | 2796 | 0.0215 |
| 23 | Grok 4.1 Fast Reasoning | 7.567 | 2796 | 0.0297 |
| 24 | Baidu Ernie 4.5 300B A47B | 7.506 | 2796 | 0.0252 |
| 25 | GLM-4.6 | 7.452 | 2796 | 0.0285 |
| 26 | Deepseek V3.2 Exp | 7.159 | 2796 | 0.0322 |
| 27 | GLM-4.5 | 7.120 | 2796 | 0.0315 |
| 28 | GPT-OSS-120B | 7.030 | 2796 | 0.0336 |
| 29 | Cohere Command A | 6.794 | 2796 | 0.0302 |
| 30 | Llama 4 Maverick | 5.777 | 2796 | 0.0304 |

Full normalized leaderboard

| Rank | LLM | Normalized Mean |
|-----:|------------------------|-----------------:|
| 1 | GPT-5.2 (medium reasoning) | 0.719 |
| 2 | Claude Opus 4.6 Thinking 16K | 0.711 |
| 3 | GPT-5 Pro | 0.705 |
| 4 | Claude Opus 4.6 (no reasoning) | 0.684 |
| 5 | GPT-5 (medium reasoning) | 0.666 |
| 6 | Kimi K2-0905 | 0.588 |
| 7 | GPT-5.1 (medium reasoning) | 0.584 |
| 8 | Gemini 3 Pro Preview | 0.399 |
| 9 | Mistral Medium 3.1 | 0.396 |
| 10 | Gemini 2.5 Pro | 0.394 |
| 11 | Qwen 3 Max Preview | 0.371 |
| 12 | Claude Opus 4.5 (no reasoning) | 0.359 |
| 13 | Claude Opus 4.5 Thinking 16K | 0.345 |
| 14 | Claude Sonnet 4.5 Thinking 16K | 0.331 |
| 15 | Kimi K2.5 Thinking | 0.293 |
| 16 | Claude Sonnet 4.5 (no reasoning) | 0.251 |
| 17 | Claude Opus 4.1 (no reasoning) | 0.238 |
| 18 | Qwen3 Max (2026-01-23) | 0.095 |
| 19 | Kimi K2 Thinking | -0.094 |
| 20 | MiniMax-M2.1 | -0.122 |
| 21 | Grok 4.1 Fast Reasoning | -0.198 |
| 22 | Deepseek V3.2 | -0.244 |
| 23 | Mistral Large 3 | -0.384 |
| 24 | Baidu Ernie 4.5 300B A47B | -0.393 |
| 25 | GLM-4.6 | -0.397 |
| 26 | Deepseek V3.2 Exp | -0.708 |
| 27 | GLM-4.5 | -0.772 |
| 28 | GPT-OSS-120B | -0.844 |
| 29 | Cohere Command A | -1.257 |
| 30 | Llama 4 Maverick | -2.715 |

Element integration only (9A–9J)

A valid concern is whether LLM graders can accurately score questions 1 to 8 (Major Story Aspects), such as Character Development & Motivation. However, questions 9A to 9J (Element Integration) are clearly easier for graders to evaluate reliably. We observe high correlation between the per‑(grader, LLM) means for craft (Q1–Q8) and element‑fit (9A–9J), and a strong overall correlation aggregated across all files. While we cannot be certain these ratings are correct without human validation, their consistency suggests that something real is being measured. For an element‑only view, you can ignore Q1–Q8 and use only 9A–9J:

Element integration (9A–9J)

Normalized view (per‑grader z‑scores):

Element integration — normalized (9A–9J)

Craft only (Q1–Q8)

Craft (Q1–Q8)

Normalized view (per‑grader z‑scores):

Craft — normalized (Q1–Q8)


LLM vs. Question (Detailed)

The detailed heatmap shows each model’s mean score on each rubric question. It is a fast way to spot models that excel at voice and prose but trail on plot or element integration, or vice versa.

LLM per question


Which model “wins” the most prompts?

For every prompt, we rank models by their cross‑grader story score and tally the number of #1 finishes. This captures consistency at the very top rather than just the average.

#1 stories pie chart
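The tally itself is a simple argmax-and-count over per-prompt scores. A minimal Python sketch (data shape and names hypothetical):

```python
from collections import Counter

def tally_first_places(scores_by_prompt):
    """scores_by_prompt: {prompt_id: {model_name: cross-grader story score}}.
    Returns a Counter of #1 finishes per model."""
    wins = Counter()
    for per_model in scores_by_prompt.values():
        wins[max(per_model, key=per_model.get)] += 1  # model with the top score wins
    return wins

demo = {
    "prompt_1": {"model_a": 8.6, "model_b": 8.2},
    "prompt_2": {"model_a": 7.9, "model_b": 8.4},
    "prompt_3": {"model_a": 8.8, "model_b": 8.1},
}
wins = tally_first_places(demo)   # Counter({'model_a': 2, 'model_b': 1})
```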


Grader ↔ LLM interactions

We publish two complementary views:

  • Mean heatmap (Grader × LLM). Useful for seeing whether any model is especially favored or disfavored by a particular grader.

Grader vs LLM

  • Normalized heatmap. Z‑scores each grader’s scale so only relative preferences remain.

Grader vs LLM normalized

Additional view: grader–grader correlation (how graders align with each other).

Grader correlation


Method Summary

Stories and length. Each model contributes short stories that must land in a strict 600–800 word range. We verify counts, flag outliers, and generate compliance charts before any grading.
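A length check of this kind reduces to counting words and comparing against the band. A naive sketch (the benchmark's exact tokenization rules may differ):

```python
def check_word_count(story: str, lo: int = 600, hi: int = 800):
    """Whitespace word count against the required 600-800 range."""
    n = len(story.split())
    return n, lo <= n <= hi

n, ok = check_word_count("word " * 700)   # (700, True)
```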

Grading. Each story is scored independently by seven grader LLMs using the 18‑question rubric above.

Aggregation. For every story: compute the power mean (Hölder mean) with p = 0.5 across the 18 questions with a 60/40 per‑question weighting (Q1–Q8 vs. 9A–9J), then average across graders. For every model: average across its stories. We also compute per‑question means so readers can see where a model is strong (e.g., prose) or weak (e.g., plot or tone fit).
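The two-level aggregation (Hölder mean per grader, then average across graders, then across stories) can be sketched as follows; function names and data shapes are hypothetical:

```python
from statistics import mean

WEIGHTS = [0.60 / 8] * 8 + [0.40 / 10] * 10   # Q1-Q8 vs 9A-9J, even within groups

def holder_mean(xs, weights, p=0.5):
    """Weighted power mean with exponent p over one grader's 18 question scores."""
    total = sum(weights)
    return sum((w / total) * (x ** p) for x, w in zip(xs, weights)) ** (1 / p)

def story_score(per_grader):
    """per_grader: one 18-item score list per grader; mean of per-grader means."""
    return mean(holder_mean(g, WEIGHTS) for g in per_grader)

def model_score(stories):
    """stories: list of per-story grader score lists; average across stories."""
    return mean(story_score(s) for s in stories)
```

A model that scores a flat 7.0 on every question from every grader comes out at exactly 7.0, as expected for any power mean.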

Grading LLMs

The following grader models scored stories:

  • Claude Sonnet 4.5 (no reasoning)
  • DeepSeek V3.2 Exp
  • Gemini 3 Pro Preview
  • GPT-5.1 (low reasoning)
  • Grok 4.1 Fast Reasoning
  • Kimi K2-0905
  • Qwen 3 Max

How the ten required elements are chosen

We use a two‑stage LLM‑assisted pipeline that starts from large curated pools and converges on one coherent set per prompt:

  • The ten categories are defined in the elements catalog (character, object, core concept, attribute, action, method, setting, timeframe, motivation, tone).
  • Seed prompts with candidates: For each seed index, we randomly sample ten options per category (plus the literal option “None”) from those pools and write a selection prompt.
  • Proposer selection: Multiple proposer LLMs each pick exactly one element per category, allowing “None” in at most one category when that improves coherence. Each proposer returns a complete 10‑line set.
  • Rate for fit: We deduplicate sets per seed and have several independent rater LLMs score how well each set “hangs together” (1–10). Scores are z‑normalized per rater to remove leniency differences and then averaged.
  • Choose the winner: For each seed we take the top normalized‑mean set (ties are broken consistently). That set becomes the “required elements” block for the final story prompt.
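The rate-and-choose steps above hinge on per-rater z-normalization, which removes each rater's leniency offset before averaging. A minimal sketch, assuming a hypothetical `{rater: {set_id: score}}` data shape:

```python
from statistics import mean, pstdev

def z_normalize_fit_ratings(ratings):
    """Z-score each rater's 1-10 fit scores, then average per candidate set."""
    z_by_set = {}
    for per_set in ratings.values():
        m, s = mean(per_set.values()), pstdev(per_set.values())
        for set_id, raw in per_set.items():
            # Guard against a rater who gives every set the same score.
            z_by_set.setdefault(set_id, []).append((raw - m) / s if s else 0.0)
    return {set_id: mean(zs) for set_id, zs in z_by_set.items()}

demo = {
    "lenient_rater": {"set_1": 9, "set_2": 8, "set_3": 7},
    "strict_rater":  {"set_1": 5, "set_2": 6, "set_3": 3},
}
normalized = z_normalize_fit_ratings(demo)
winner = max(normalized, key=normalized.get)   # "set_1"
```

Note that the lenient rater's raw 9 and the strict rater's raw 5 both count as above-average votes for `set_1` once normalized.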

Notes

  • Within a prompt, categories never repeat; one category may be “None,” which means that element is not required for that prompt.
  • There is no separate cross‑prompt coverage optimizer. Variety comes from the breadth of the curated pools and independent per‑seed sampling plus LLM selection and rating. As a result, duplicates across different prompts are possible but uncommon.

Scoring scale

  • Scale: 0.0–10.0 per question, in 0.1 increments (e.g., 7.3).
  • Story score: Power mean (Hölder mean) with p = 0.5 across the 18 questions with 60/40 per‑question weights (Q1–Q8 vs. 9A–9J), then averaged across graders.
  • Model score: average of its story scores. Uncertainty bands reflect the standard error of the mean (SEM) across stories.

View on GitHub
GitHub Stars: 357
Category: Content
Updated: 1d ago
Forks: 8

Languages

Batchfile

Security Score

85/100

Audited on Mar 30, 2026

No findings