LLM Creative Story‑Writing Benchmark V4
This benchmark evaluates how well large language models (LLMs) follow a creative brief while still producing engaging fiction. Every story must meaningfully incorporate ten required elements: character, object, concept, attribute, action, method, setting, timeframe, motivation, and tone. With these building blocks standardized and length tightly controlled, differences in constraint satisfaction and literary quality become directly comparable. Multiple independent “grader” LLMs score each story on an 18‑question rubric, and we aggregate those judgments into model‑level results.

What’s measured
1) Craft and coherence (Q1–Q8)
Eight questions focus on narrative craft: character depth and motivation, plot structure and coherence, world building and atmosphere, story impact, originality, thematic cohesion, voice/point‑of‑view, and line‑level prose quality.
2) Element integration (Q9A–Q9J)
Ten questions check whether the story organically uses each required element: the specified character, object, core concept, attribute, action, method, setting, timeframe, motivation, and tone. If a category in the prompt is “None,” graders mark the corresponding 9‑series item as N/A.
3) Overall story score
We score each story per grader with a 60/40 weighted power mean (Hölder mean, p = 0.5) over the 18 rubric items (Q1–Q8 = 60%, 9A–9J = 40%, split evenly within each group). Compared with a plain average, p = 0.5 acts like a soft minimum: the result sits closer to the lowest-scoring dimensions, so a weakness drags the score down more than a strength can pull it up, and well‑rounded craft is rewarded. The final story score is the mean of the per‑grader scores.
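A short sketch of this scoring rule in Python: the weights and question IDs follow the description above, while the function names and the handling of N/A items (dropping them and renormalizing the remaining weights) are assumptions, not the benchmark's actual code.

```python
# Per-question weights: Q1–Q8 share 60% equally, 9A–9J share 40% equally.
CRAFT_QS = [f"Q{i}" for i in range(1, 9)]       # 8 craft questions
ELEMENT_QS = [f"9{c}" for c in "ABCDEFGHIJ"]    # 10 element-integration questions
WEIGHTS = {**{q: 0.60 / 8 for q in CRAFT_QS},
           **{q: 0.40 / 10 for q in ELEMENT_QS}}

def story_score_per_grader(scores: dict[str, float], p: float = 0.5) -> float:
    """Weighted power (Hölder) mean with p = 0.5 over the 18 rubric items.

    `scores` maps question ID -> 0.0–10.0 rating. Items graded N/A are simply
    omitted here, and the remaining weights are renormalized (an assumption).
    """
    items = {q: s for q, s in scores.items() if q in WEIGHTS}
    total_w = sum(WEIGHTS[q] for q in items)
    weighted_sum = sum((WEIGHTS[q] / total_w) * (s ** p) for q, s in items.items())
    return weighted_sum ** (1.0 / p)

def story_score(per_grader_scores: list[dict[str, float]]) -> float:
    """Final story score: plain mean of the per-grader power means."""
    return sum(story_score_per_grader(g) for g in per_grader_scores) / len(per_grader_scores)
```

Because p < 1, this weighted power mean can never exceed the corresponding weighted arithmetic mean, which is exactly the soft-minimum behavior described above.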
Results
Overall model means
The top bar chart summarizes mean story quality for each model with uncertainty bands. (Grader‑unweighted means; questions weighted 60/40.)
Full overall leaderboard
| Rank | LLM | Mean Score | Samples | SEM |
|-----:|------------------------|-----------:|--------:|----:|
| 1 | Claude Opus 4.6 Thinking 16K | 8.561 | 2795 | 0.0118 |
| 2 | Claude Opus 4.6 (no reasoning) | 8.533 | 2796 | 0.0123 |
| 3 | GPT-5.2 (medium reasoning) | 8.511 | 2796 | 0.0145 |
| 4 | GPT-5 Pro | 8.474 | 2796 | 0.0158 |
| 5 | GPT-5.1 (medium reasoning) | 8.438 | 2796 | 0.0134 |
| 6 | GPT-5 (medium reasoning) | 8.434 | 2796 | 0.0162 |
| 7 | Kimi K2-0905 | 8.331 | 2796 | 0.0199 |
| 8 | Gemini 3 Pro Preview | 8.221 | 2796 | 0.0170 |
| 9 | Gemini 2.5 Pro | 8.219 | 2796 | 0.0169 |
| 10 | Mistral Medium 3.1 | 8.201 | 2796 | 0.0185 |
| 11 | Claude Opus 4.5 Thinking 16K | 8.200 | 2796 | 0.0170 |
| 12 | Claude Opus 4.5 (no reasoning) | 8.195 | 2796 | 0.0172 |
| 13 | Claude Sonnet 4.5 Thinking 16K | 8.169 | 2796 | 0.0176 |
| 14 | Claude Sonnet 4.5 (no reasoning) | 8.112 | 2796 | 0.0179 |
| 15 | Qwen 3 Max Preview | 8.091 | 2796 | 0.0233 |
| 16 | Kimi K2.5 Thinking | 8.068 | 2796 | 0.0220 |
| 17 | Claude Opus 4.1 (no reasoning) | 8.068 | 2796 | 0.0197 |
| 18 | Qwen3 Max (2026-01-23) | 7.842 | 2796 | 0.0256 |
| 19 | MiniMax-M2.1 | 7.777 | 2795 | 0.0226 |
| 20 | Kimi K2 Thinking | 7.687 | 2796 | 0.0286 |
| 21 | Deepseek V3.2 | 7.601 | 2796 | 0.0279 |
| 22 | Mistral Large 3 | 7.595 | 2796 | 0.0215 |
| 23 | Grok 4.1 Fast Reasoning | 7.567 | 2796 | 0.0297 |
| 24 | Baidu Ernie 4.5 300B A47B | 7.506 | 2796 | 0.0252 |
| 25 | GLM-4.6 | 7.452 | 2796 | 0.0285 |
| 26 | Deepseek V3.2 Exp | 7.159 | 2796 | 0.0322 |
| 27 | GLM-4.5 | 7.120 | 2796 | 0.0315 |
| 28 | GPT-OSS-120B | 7.030 | 2796 | 0.0336 |
| 29 | Cohere Command A | 6.794 | 2796 | 0.0302 |
| 30 | Llama 4 Maverick | 5.777 | 2796 | 0.0304 |
Full normalized leaderboard
| Rank | LLM | Normalized Mean |
|-----:|------------------------|-----------------:|
| 1 | GPT-5.2 (medium reasoning) | 0.719 |
| 2 | Claude Opus 4.6 Thinking 16K | 0.711 |
| 3 | GPT-5 Pro | 0.705 |
| 4 | Claude Opus 4.6 (no reasoning) | 0.684 |
| 5 | GPT-5 (medium reasoning) | 0.666 |
| 6 | Kimi K2-0905 | 0.588 |
| 7 | GPT-5.1 (medium reasoning) | 0.584 |
| 8 | Gemini 3 Pro Preview | 0.399 |
| 9 | Mistral Medium 3.1 | 0.396 |
| 10 | Gemini 2.5 Pro | 0.394 |
| 11 | Qwen 3 Max Preview | 0.371 |
| 12 | Claude Opus 4.5 (no reasoning) | 0.359 |
| 13 | Claude Opus 4.5 Thinking 16K | 0.345 |
| 14 | Claude Sonnet 4.5 Thinking 16K | 0.331 |
| 15 | Kimi K2.5 Thinking | 0.293 |
| 16 | Claude Sonnet 4.5 (no reasoning) | 0.251 |
| 17 | Claude Opus 4.1 (no reasoning) | 0.238 |
| 18 | Qwen3 Max (2026-01-23) | 0.095 |
| 19 | Kimi K2 Thinking | -0.094 |
| 20 | MiniMax-M2.1 | -0.122 |
| 21 | Grok 4.1 Fast Reasoning | -0.198 |
| 22 | Deepseek V3.2 | -0.244 |
| 23 | Mistral Large 3 | -0.384 |
| 24 | Baidu Ernie 4.5 300B A47B | -0.393 |
| 25 | GLM-4.6 | -0.397 |
| 26 | Deepseek V3.2 Exp | -0.708 |
| 27 | GLM-4.5 | -0.772 |
| 28 | GPT-OSS-120B | -0.844 |
| 29 | Cohere Command A | -1.257 |
| 30 | Llama 4 Maverick | -2.715 |
Element integration only (9A–9J)
A valid concern is whether LLM graders can accurately score questions 1 to 8 (Major Story Aspects), such as Character Development & Motivation. However, questions 9A to 9J (Element Integration) are clearly easier for graders to evaluate reliably. We observe high correlation between the per‑(grader, LLM) means for craft (Q1–Q8) and element‑fit (9A–9J), and a strong overall correlation aggregated across all files. While we cannot be certain these ratings are correct without human validation, their consistency suggests that something real is being measured. For an element‑only view, you can ignore Q1–Q8 and use only 9A–9J:

Normalized view (per‑grader z‑scores):

Craft only (Q1–Q8)

Normalized view (per‑grader z‑scores):

LLM vs. Question (Detailed)
The detailed heatmap shows each model’s mean score on each rubric question. It is a fast way to spot models that excel at voice and prose but trail on plot or element integration, or vice versa.

Which model “wins” the most prompts?
For every prompt, we rank models by their cross‑grader story score and tally the number of #1 finishes. This captures consistency at the very top rather than just the average.
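Assuming per-prompt, cross-grader story scores are stored in a simple nested mapping, the tally could be computed roughly as below (the data layout, tie handling, and names are illustrative, not the benchmark's actual code):

```python
from collections import Counter

def count_first_place_finishes(scores: dict[str, dict[str, float]]) -> Counter:
    """scores[prompt_id][model] -> cross-grader story score for that prompt.

    Returns how many prompts each model finishes #1 on; ties credit every
    tied model (how the benchmark handles ties is not specified here).
    """
    wins: Counter = Counter()
    for by_model in scores.values():
        best = max(by_model.values())
        for model, score in by_model.items():
            if score == best:
                wins[model] += 1
    return wins
```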

Grader ↔ LLM interactions
We publish two complementary views:
- Mean heatmap (Grader × LLM). Useful for seeing whether any model is especially favored or disfavored by a particular grader.

- Normalized heatmap. Z‑scores each grader’s scale so only relative preferences remain.

Additional view: grader–grader correlation (how graders align with each other).
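The per-grader z-scoring behind the normalized views above could look roughly like this (a sketch, not the project's code; the pandas layout and column names are assumptions):

```python
import pandas as pd

def add_per_grader_zscores(df: pd.DataFrame) -> pd.DataFrame:
    """Expects one row per (grader, llm, story) with a numeric 'score' column.

    Z-scores every rating against that grader's own mean and standard
    deviation, so only relative preferences remain.
    """
    out = df.copy()
    out["z"] = df.groupby("grader")["score"].transform(
        lambda s: (s - s.mean()) / s.std(ddof=0)
    )
    return out

# Normalized heatmap cells would then be the mean z per (grader, llm) pair, e.g.:
# add_per_grader_zscores(df).pivot_table(index="grader", columns="llm", values="z")
```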

Method Summary
Stories and length. Each model contributes short stories that must land in a strict 600–800 word range. We verify counts, flag outliers, and generate compliance charts before any grading.
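A minimal version of that length check (whitespace tokenization is an assumption; the benchmark's exact counting rules are not spelled out here):

```python
def check_length(story: str, low: int = 600, high: int = 800) -> tuple[int, bool]:
    """Return (word_count, in_range) using simple whitespace word counting."""
    n = len(story.split())
    return n, low <= n <= high
```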
Grading. Each story is scored independently by seven grader LLMs using the 18‑question rubric above.
Aggregation. For every story: compute the power mean (Hölder mean) with p = 0.5 across the 18 questions with a 60/40 per‑question weighting (Q1–Q8 vs. 9A–9J), then average across graders. For every model: average across its stories. We also compute per‑question means so readers can see where a model is strong (e.g., prose) or weak (e.g., plot or tone fit).
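Given one model's per-story scores (each already a cross-grader mean), the leaderboard's mean and SEM columns could be reproduced roughly like this (treating SEM as the sample standard deviation over √n is an assumption about the exact formula used):

```python
import statistics

def model_summary(story_scores: list[float]) -> dict[str, float]:
    """Collapse one model's per-story scores into leaderboard-style columns."""
    n = len(story_scores)
    mean = statistics.fmean(story_scores)
    sem = statistics.stdev(story_scores) / n ** 0.5 if n > 1 else 0.0
    return {"mean_score": mean, "samples": n, "sem": sem}
```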
Grading LLMs
The following grader models scored stories:
- Claude Sonnet 4.5 (no reasoning)
- DeepSeek V3.2 Exp
- Gemini 3 Pro Preview
- GPT-5.1 (low reasoning)
- Grok 4.1 Fast Reasoning
- Kimi K2-0905
- Qwen 3 Max
How the ten required elements are chosen
We use a two‑stage LLM‑assisted pipeline that starts from large curated pools and converges on one coherent set per prompt:
- The ten categories are defined in the elements catalog (character, object, core concept, attribute, action, method, setting, timeframe, motivation, tone).
- Seed prompts with candidates: For each seed index, we randomly sample ten options per category (plus the literal option “None”) from those pools and write a selection prompt.
- Proposer selection: Multiple proposer LLMs each pick exactly one element per category, allowing “None” in at most one category when that improves coherence. Each proposer returns a complete 10‑line set.
- Rate for fit: We deduplicate sets per seed and have several independent rater LLMs score how well each set “hangs together” (1–10). Scores are z‑normalized per rater to remove leniency differences and then averaged.
- Choose the winner: For each seed we take the top normalized‑mean set (ties are broken consistently). That set becomes the “required elements” block for the final story prompt.
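The rating-and-selection step at the end of this pipeline could be sketched as follows (the proposer and rater LLM calls are left out; the data shapes, names, and tie-breaking rule are illustrative):

```python
import statistics
from collections import defaultdict

def pick_winning_set(ratings: dict[str, dict[tuple, float]]) -> tuple:
    """ratings[rater_id][candidate_set] -> 1-10 coherence score.

    Z-normalizes each rater's scores to remove leniency differences,
    averages the normalized scores per candidate set, and returns the
    top set (ties broken by sorting the candidates, i.e. consistently).
    """
    z_by_set: dict[tuple, list[float]] = defaultdict(list)
    for by_set in ratings.values():
        vals = list(by_set.values())
        mu = statistics.fmean(vals)
        sd = statistics.pstdev(vals) or 1.0   # guard against a rater with zero spread
        for cand, score in by_set.items():
            z_by_set[cand].append((score - mu) / sd)

    means = {cand: statistics.fmean(zs) for cand, zs in z_by_set.items()}
    return max(sorted(means), key=means.get)
```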
Notes
- Within a prompt, categories never repeat; one category may be “None,” which means that element is not required for that prompt.
- There is no separate cross‑prompt coverage optimizer. Variety comes from the breadth of the curated pools and independent per‑seed sampling plus LLM selection and rating. As a result, duplicates across different prompts are possible but uncommon.
Scoring scale
- Scale: 0.0–10.0 per question, in 0.1 increments (e.g., 7.3).
- Story score: Power mean (Hölder mean) with p = 0.5 across the 18 questions with 60/40 per‑question weights (Q1–Q8 vs. 9A–9J), then averaged across graders.
- Model score: average of its story scores. Uncertainty bands reflect the standard error of the mean (SEM) over those story scores.