Behavioral Metacognition Under Social Pressure

MEDLEY-BENCH measures how AI models revise beliefs, resist unjustified pressure, and ground reasoning in evidence.

35 Models Evaluated
130 Instances
5 Domains
12.6pt MMS Range

Does it know when it's wrong?

Most AI benchmarks ask whether a model got the answer right. MEDLEY-BENCH asks something harder: does it know when it might be wrong?

By testing 35 models under real social pressure, we uncovered a universal knowing-doing gap — AI can recognise good reasoning in others but consistently fails to apply that standard to itself. We also reveal that models fall into distinct cognitive profiles: monitoring-dominant models that track their own uncertainty well, control-dominant models that adjust strategy under pressure, self-regulation-dominant models that manage their reasoning process, and statistics-followers that simply defer to consensus regardless of argument quality.

Scale doesn't fix these differences. Newer models don't escape them. MEDLEY-BENCH gives researchers, developers, and regulators a principled way to evaluate AI metacognition before it matters in the real world.


Podcast: The Metacognition Gap in AI

~18 min

Listen to a deep dive into the research methodology and the "Knowing-Doing Gap" discovered across 35 frontier models.


The Metacognition Paradox

1080p HD

A visual exploration of how models cluster into cognitive signatures and why evaluation is the universal bottleneck.


MEDLEY-BENCH Explainer

YouTube

A walkthrough of the benchmark design, three-step protocol, and how to run MEDLEY-BENCH on your own models.

Install & Run

MEDLEY-BENCH is available as a Python library on PyPI. The 130-instance dataset is bundled — no separate download needed.

$ pip install medley-bench

# Benchmark any model — cloud or local
$ medley-bench benchmark --models "anthropic/claude-haiku-4.5"
$ medley-bench benchmark --models "ollama/gemma3:12b"

# Add a live judge for full scoring
$ medley-bench benchmark --models "ollama/gemma3:12b" \
    --judge-model gemini-2.5-flash

# View results
$ medley-bench leaderboard --results results/
Provider     Model ID             API Key
Anthropic    claude-*             ANTHROPIC_API_KEY
OpenAI       gpt-*, o1-*, o3-*    OPENAI_API_KEY
Google       gemini-*             GOOGLE_API_KEY
Ollama       ollama/model         None (local)
OpenRouter   org/model            OPENROUTER_API_KEY
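
Before running, export the API key for each cloud provider you plan to benchmark. The variable names come from the table above; the key values below are placeholders, not real credentials:

```shell
# Placeholder values; substitute your real credentials
export ANTHROPIC_API_KEY="your-anthropic-key"
export OPENAI_API_KEY="your-openai-key"
export GOOGLE_API_KEY="your-google-key"
export OPENROUTER_API_KEY="your-openrouter-key"
# Ollama models run locally and need no key
```

Only the keys for providers you actually target need to be set.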

Full run = 520 API calls (390 target + 130 judge). ~1 hr on fast APIs, several hours on slower providers. Results save incrementally and runs are resumable.
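
The call count follows from the three-step protocol: each of the 130 bundled instances needs one target-model call per protocol step, plus one judge call. A quick sanity check of that arithmetic (illustrative only, not part of the CLI):

```shell
# Per-run API call count: 130 instances x 3 protocol steps, plus one judge call each
instances=130
steps=3
target_calls=$((instances * steps))          # 390 target-model calls
judge_calls=$instances                       # 130 judge calls
echo $((target_calls + judge_calls))         # prints 520
```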

Paper & Citation

Citation

Farhad Abtahi, Abdolamir Karbalaie, Eduardo Illueca-Fernandez, Fernando Seoane

SMAILE Core Facility, Karolinska Institutet, Stockholm, Sweden

MEDLEY-BENCH: Scale Buys Evaluation but Not Control in AI Metacognition (2026)

BibTeX

@article{abtahi2026medleybench,
  title={MEDLEY-BENCH: Scale Buys Evaluation but Not Control in AI Metacognition},
  author={Abtahi, Farhad and Karbalaie, Abdolamir and
          Illueca-Fernandez, Eduardo and Seoane, Fernando},
  year={2026},
  note={Preprint}
}
bash — medley-bench install
$ pip install medley-bench[all-providers]
$ medley-bench benchmark --track metacognition \
    --models "openai/gpt-4.1"

Ready to evaluate?

Use as a Python library or CLI tool. 5 provider backends, 95 tests, async API. Optional provider dependencies — install only what you need.

PyPI Package · GitHub Repo