LLM Divergent Thinking Creativity Benchmark

Open-ended divergent thinking tests, which ask individuals to list words as distinct from one another as possible, are often used to evaluate originality and fluency. This benchmark presents a more challenging variation of the divergent thinking test, where LLMs are provided with an initial list of 50 random words. The task requires the LLMs to generate 25 words that are not only highly distinct from each other but also unrelated to the initial 50 words, with the additional constraint that each word must start with a specified letter.


Method

  • Each LLM generates 25 words that are as distinct as possible from one another and from the initial 50 words, with each word beginning with a specified letter
  • The first letters used are: a, b, c, d, e, f, g, h, i, k, l, m, n, o, p, r, s, t, u, w, y, and one of v, x, z, j, or q.
  • Each LLM is prompted 88 times to generate 25 words, resulting in a total of 2,200 words generated per LLM.
  • Each pair of potentially related words (1,209,932 unique combinations) is evaluated by four LLMs: GPT-4o, Claude 3.5 Sonnet (2024-10-22), Grok 2 (12-12), and Gemini 1.5 Pro, on a scale of 0 to 10. The score for each generated word is the average, across the LLM judges, of the minimum divergence between this word and the other words.
  • Each generated word is also evaluated by these four LLMs to determine how well it follows the rules (e.g., no proper nouns, real English words, no hyphens).
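The per-word scoring described above can be sketched as follows. This is a minimal illustration, not the benchmark's actual code: the function name `word_score`, the `judge_scores` layout, and the judge labels are hypothetical, and it assumes that a higher pair rating means the two words are more divergent.

```python
from statistics import mean


def word_score(word, others, judge_scores):
    """Average over judges of the minimum divergence between `word` and `others`.

    judge_scores maps a judge name to a dict of pair ratings (0 to 10),
    keyed by frozenset({word_a, word_b}). This layout is an assumption
    made for the example.
    """
    per_judge_minimum = []
    for pair_ratings in judge_scores.values():
        # The least divergent (most related) pairing determines the word's score.
        lowest = min(pair_ratings[frozenset((word, other))] for other in others)
        per_judge_minimum.append(lowest)
    return mean(per_judge_minimum)


# Hypothetical ratings from two judges for the word "apple".
example_ratings = {
    "judge_a": {frozenset(("apple", "anvil")): 8, frozenset(("apple", "orchard")): 2},
    "judge_b": {frozenset(("apple", "anvil")): 6, frozenset(("apple", "orchard")): 4},
}
example_score = word_score("apple", ["anvil", "orchard"], example_ratings)
```

Taking the minimum means that a single closely related word anywhere in the comparison set pulls the score down, no matter how distinct the remaining words are.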

Results

Higher scores indicate better performance.

[Chart: scores]

For completeness, a chart with the Y-axis starting at 0 is also provided:

[Chart: scores, Y-axis starting at 0]

| Model | Score |
|-------|-------|
| o1-preview | 4.79 |
| Gemini 2.0 Flash Exp | 4.65 |
| Claude 3 Opus | 4.47 |
| Grok 2 12-12 | 4.45 |
| Llama 3.3 70B | 4.44 |
| Gemini 2.0 Flash Thinking Exp | 4.41 |
| Claude 3.5 Sonnet 2024-10-22 | 4.41 |
| Gemma 2 27B | 4.37 |
| o1-mini | 4.20 |
| Claude 3.5 Haiku | 4.16 |
| Mistral Large 2 | 4.14 |
| GPT-4o mini | 4.12 |
| Gemini 1.5 Flash | 4.09 |
| Gemini 1.5 Pro (Sept) | 4.07 |
| Claude 3 Haiku | 3.98 |
| Qwen 2.5 72B | 3.89 |
| Llama 3.1 405B | 3.83 |
| DeepSeek-V2.5 | 3.76 |
| GPT-4o | 3.73 |

The table below highlights the percentage of repeated words, which helps explain GPT-4o's poor performance:

[Chart: percentage of repeated words]

| Model Name | % Repeats |
|------------|-----------|
| Llama 3.3 70B | 0.00% |
| o1-preview | 0.00% |
| Gemini 2.0 Flash Thinking Exp | 0.18% |
| Claude 3.5 Sonnet 2024-10-22 | 0.50% |
| Gemini 1.5 Flash | 0.68% |
| Gemini 1.5 Pro (Sept) | 0.95% |
| Llama 3.1 405B | 1.00% |
| Gemma 2 27B | 1.14% |
| o1-mini | 2.23% |
| Gemini 2.0 Flash Exp | 2.64% |
| Claude 3 Opus | 4.00% |
| Mistral Large 2 | 5.05% |
| Claude 3.5 Haiku | 5.45% |
| Grok 2 12-12 | 5.64% |
| GPT-4o mini | 8.45% |
| Qwen 2.5 72B | 11.23% |
| Claude 3 Haiku | 16.36% |
| GPT-4o | 23.68% |
| DeepSeek-V2.5 | 25.09% |
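A repeat percentage of this kind could be computed along the following lines. The README does not define what counts as a repeat, so this sketch assumes a word is a repeat when it has already appeared earlier in the same response, compared case-insensitively. The function name `repeat_percentage` and the sample data are hypothetical.

```python
def repeat_percentage(responses):
    """Return the share of generated words that repeat an earlier word, in percent.

    `responses` is a list of word lists, one list per prompt. Counting repeats
    only within a single response is an assumption, not the documented rule.
    """
    total = 0
    repeats = 0
    for words in responses:
        seen = set()
        for word in words:
            key = word.strip().lower()  # treat "Apple" and "apple" as the same word
            total += 1
            if key in seen:
                repeats += 1
            else:
                seen.add(key)
    return 100.0 * repeats / total if total else 0.0


# Two hypothetical responses: "apple" appears twice in the first one.
sample_responses = [["apple", "anvil", "Apple"], ["arch"]]
sample_percentage = repeat_percentage(sample_responses)
```

With 2,200 words generated per model, each repeated word accounts for roughly 0.045 percentage points under this definition.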


Other multi-agent benchmarks

Other benchmarks


Updates
