Behavioral Metacognition Under Social Pressure

MEDLEY-BENCH measures how AI models revise beliefs, resist unjustified pressure, and ground reasoning in evidence.

35 Models Evaluated
130 Instances
5 Domains
12.6pt MMS Range

Does it know when it's wrong?

Most AI benchmarks ask whether a model got the answer right. MEDLEY-BENCH asks something harder: does it know when it might be wrong?

By testing 35 models under real social pressure, we uncovered a universal knowing-doing gap — AI can recognise good reasoning in others but consistently fails to apply that standard to itself. We also reveal that models fall into distinct cognitive profiles: monitoring-dominant models that track their own uncertainty well, control-dominant models that adjust strategy under pressure, self-regulation-dominant models that manage their reasoning process, and statistics-followers that simply defer to consensus regardless of argument quality.

Scale doesn't fix these differences. Newer models don't escape them. MEDLEY-BENCH gives researchers, developers, and regulators a principled way to evaluate AI metacognition before it matters in the real world.


Podcast: The Metacognition Gap in AI

~18 min

Listen to a deep dive into the research methodology and the "Knowing-Doing Gap" discovered across 35 frontier models.


The Metacognition Paradox

1080p HD

A visual exploration of how models cluster into cognitive signatures and why evaluation is the universal bottleneck.


MEDLEY-BENCH Explainer

YouTube

A walkthrough of the benchmark design, three-step protocol, and how to run MEDLEY-BENCH on your own models.

Install & Run

MEDLEY-BENCH is available as a Python library on PyPI. The 130-instance dataset is bundled — no separate download needed.

$ pip install medley-bench

# Benchmark any model — cloud or local
$ medley-bench benchmark --models "anthropic/claude-haiku-4.5"
$ medley-bench benchmark --models "ollama/gemma3:12b"

# Add a live judge for full scoring
$ medley-bench benchmark --models "ollama/gemma3:12b" \
    --judge-model gemini-2.5-flash

# View results
$ medley-bench leaderboard --results results/
Provider     Model ID             API Key
Anthropic    claude-*             ANTHROPIC_API_KEY
OpenAI       gpt-*, o1-*, o3-*    OPENAI_API_KEY
Google       gemini-*             GOOGLE_API_KEY
Ollama       ollama/model         None (local)
OpenRouter   org/model            OPENROUTER_API_KEY
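
Before running, export the API key for each cloud provider you plan to benchmark. The variable names come from the table above; the key values below are placeholders, not real credentials:

```shell
# Placeholder values; substitute your real credentials
export ANTHROPIC_API_KEY="your-anthropic-key"
export OPENAI_API_KEY="your-openai-key"
export GOOGLE_API_KEY="your-google-key"
export OPENROUTER_API_KEY="your-openrouter-key"
# Ollama models run locally and need no key
```

Only the keys for providers you actually target need to be set.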

Full run = 520 API calls (390 target + 130 judge). ~1 hr on fast APIs, several hours on slower providers. Results save incrementally and runs are resumable.
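
The call count follows from the three-step protocol: each of the 130 bundled instances needs one target-model call per protocol step, plus one judge call. A quick sanity check of that arithmetic (illustrative only, not part of the CLI):

```shell
# Per-run API call count: 130 instances x 3 protocol steps, plus one judge call each
instances=130
steps=3
target_calls=$((instances * steps))          # 390 target-model calls
judge_calls=$instances                       # 130 judge calls
echo $((target_calls + judge_calls))         # prints 520
```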

Paper & Citation

Citation

Farhad Abtahi, Abdolamir Karbalaie, Eduardo Illueca-Fernandez, Fernando Seoane

SMAILE Core Facility, Karolinska Institutet, Stockholm, Sweden

MEDLEY-BENCH: Scale Buys Evaluation but Not Control in AI Metacognition (2026)

BibTeX

@article{abtahi2026medleybench,
  title={MEDLEY-BENCH: Scale Buys Evaluation but Not Control in AI Metacognition},
  author={Abtahi, Farhad and Karbalaie, Abdolamir and
          Illueca-Fernandez, Eduardo and Seoane, Fernando},
  year={2026},
  note={Preprint}
}
bash — medley-bench install
$ pip install medley-bench[all-providers]
$ medley-bench benchmark --track metacognition \
    --models "openai/gpt-4.1"

Ready to evaluate?

Use as a Python library or CLI tool. 5 provider backends, 95 tests, async API. Optional provider dependencies — install only what you need.

PyPI Package · GitHub Repo