MEDLEY-BENCH measures how AI models revise beliefs, resist unjustified pressure, and ground reasoning in evidence.
Most AI benchmarks ask whether a model got the answer right. MEDLEY-BENCH asks something harder: does it know when it might be wrong?
By testing 35 models under real social pressure, we uncovered a universal knowing-doing gap: AI can recognise good reasoning in others but consistently fails to apply the same standard to itself. We also found that models fall into distinct cognitive profiles:

- **Monitoring-dominant** models track their own uncertainty well.
- **Control-dominant** models adjust strategy under pressure.
- **Self-regulation-dominant** models manage their reasoning process.
- **Statistics-followers** simply defer to consensus regardless of argument quality.
Scale doesn't fix these differences. Newer models don't escape them. MEDLEY-BENCH gives researchers, developers, and regulators a principled way to evaluate AI metacognition before it matters in the real world.
Listen to a deep dive into the research methodology and the "Knowing-Doing Gap" discovered across 35 frontier models.
A visual exploration of how models cluster into cognitive signatures and why evaluation is the universal bottleneck.
A walkthrough of the benchmark design, three-step protocol, and how to run MEDLEY-BENCH on your own models.
MEDLEY-BENCH is available as a Python library on PyPI. The 130-instance dataset is bundled — no separate download needed.
pip install medley-bench
| Provider | Model ID | API Key |
|---|---|---|
| Anthropic | claude-* | ANTHROPIC_API_KEY |
| OpenAI | gpt-*, o1-*, o3-* | OPENAI_API_KEY |
| Google | gemini-* | GOOGLE_API_KEY |
| Ollama | ollama/model | None (local) |
| OpenRouter | org/model | OPENROUTER_API_KEY |
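Putting the table together with the CLI shown in the quick-start: export the key for your chosen provider, then run the benchmark. (The key value below is a placeholder; substitute your own.)

```shell
# Set the provider credential named in the table above (OpenAI here),
# then invoke the benchmark with the flags from the quick-start.
export OPENAI_API_KEY="sk-..."   # placeholder; use your real key
medley-bench benchmark --track metacognition \
    --models "openai/gpt-4.1"
```

For local Ollama models, no key is needed; just pass `--models "ollama/model"`.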
A full run makes 520 API calls (390 to the target model + 130 to the judge): roughly 1 hour on fast APIs, several hours on slower providers. Results save incrementally, and interrupted runs are resumable.
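The 520-call figure follows directly from the bundled 130-instance dataset and the three-step protocol. A quick back-of-the-envelope check (plain Python, not part of the medley-bench API):

```python
# Sketch of the cost arithmetic, assuming one target-model call per
# protocol step and one judge call per instance (as the figures above imply).
INSTANCES = 130        # bundled dataset size
PROTOCOL_STEPS = 3     # three-step protocol per instance

target_calls = INSTANCES * PROTOCOL_STEPS  # calls to the model under test
judge_calls = INSTANCES                    # one judge call per instance
total_calls = target_calls + judge_calls

print(target_calls, judge_calls, total_calls)  # 390 130 520
```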
Farhad Abtahi, Abdolamir Karbalaie, Eduardo Illueca-Fernandez, Fernando Seoane
SMAILE Core Facility, Karolinska Institutet, Stockholm, Sweden
MEDLEY-BENCH: Scale Buys Evaluation but Not Control in AI Metacognition (2026)
@article{abtahi2026medleybench,
  title={MEDLEY-BENCH: Scale Buys Evaluation but Not Control in AI Metacognition},
  author={Abtahi, Farhad and Karbalaie, Abdolamir and Illueca-Fernandez, Eduardo and Seoane, Fernando},
  year={2026},
  note={Preprint}
}
$ pip install "medley-bench[all-providers]"
$ medley-bench benchmark --track metacognition \
--models "openai/gpt-4.1"
Use it as a Python library or a CLI tool. 5 provider backends, 95 tests, async API. Provider dependencies are optional: install only what you need.