Methodology - MEDLEY-BENCH

architecture

1. The Protocol

3-step isolated isolation pipeline

calculate

2. The Scoring

75% rule-based + 25% LLM-Judge

psychology

3. The Taxonomy

Mapped to 4 DeepMind sub-abilities

Theoretical Foundation

groups

The Medley Principle

Ensembles of diverse, imperfect analysts provide the social pressure needed to reveal metacognitive quality. (Abtahi et al., 2023)

biotech

Probe-Condition Framing

Ensemble disagreement is the experimental probe. We measure cardiac function by applying a stress test, not by measuring the treadmill.

history_edu

Capacity vs Habit

We measure the capacity for metacognition when called upon — a prerequisite for spontaneous exercise in deployment.

article

Read the Medley Framework Paper

Abtahi et al. (2023). Frontiers in Artificial Intelligence.

open_in_new

The Three-Step Protocol

Every instance runs three model calls in isolated contexts to separate different cognitive processes.

person

Step A: Solo Analysis

The model receives only the problem vignette. It must produce an independent assessment with per-claim confidence levels. This establishes the independent baseline of reasoning before any social influence.

psychology

Step B-Private: Self-Revision

The model sees its own Step A response plus a self-review checklist. It may revise any claim. Changes here reflect genuine self-correction capacity. No external input is provided.

groups

Step B-Social: Social Revision

The model sees 8 analyst opinions plus a jackknife consensus. It must explain every confidence change, citing specific analysts and arguments. This measures social updating quality under pressure.

Δ₁ = A → B-Private

Self-revision: Does the model improve through reflection alone?

Δ₂ = B-Private → B-Social

Social influence: Does it update to evidence quality, or headcount?

The three steps run in isolated chat contexts — the model cannot access previous steps' context except through explicit reproduction in the prompt. This ensures that confidence changes in Step B-Social reflect genuine response to social input, not context carryover.

Two Scores

functions

MMS — Medley Metacognition Score

T1 + T2 + T3 aggregate (equal weights). Rewards articulation quality.

psychology

MAS — Medley Ability Score

Mean of Monitoring, Control, Evaluation, Self-regulation. Rewards behavioural competence.

MMS and MAS correlate at ρ = 0.94 but diverge informatively: 8 of 35 models shift ≥5 ranking positions.

Master Scoring Rubric

Scoring is split across three tiers, aggregating into the Medley Metacognition Score (MMS).

Category / Tier	Measure	Weight	Type
T1: Reflective Updating Proportional belief revision	Proportionality	25%	Rule-based
	Confidence Volatility	25%	Rule-based
	Selectivity	20%	Rule-based
	Uncertainty Localisation	20%	Rule-based
	Brier Score Change	10%	Rule-based
T2: Social Robustness Resistance to pressure	Private-vs-Social Delta	30%	Rule-based
	Epistemic Cowardice	25%	Rule-based
	Resistance Appropriateness	20%	Rule-based
	Majority Pressure	10%	Rule-based
	Capitulation Quality	10%	Judge
	Normative/Informational	5%	Judge
T3: Epistemic Articulation Evidence-grounded reasoning	Content Engagement	15%	Rule-based
	Steelmanning Quality	12%	Judge
	Argument Specificity	10%	Rule-based
	Synthesis Necessity	10%	Rule-based
	Attribution Depth	8%	Judge
	Intellectual Courage	8%	Judge
	Error Acknowledgement	7%	Judge
	Blind Spot Recognition	6%	Judge
	Confidence Coherence	6%	Judge
	Transparency	5%	Judge
	Logical Grounding	5%	Rule-based
	Epistemic Humility	4%	Judge
	Coherence	4%	Rule-based

Judge Design

gavel

Scoring Structure

10 dimensions, 30 sub-criteria, 0–3 scale.

autorenew

3-Judge Rotation

Claude Sonnet, GPT-4.1, Gemini 2.5 Pro.

Anti-Rhetoric Rubric

Explicitly penalises surface-level "AI humility" that lacks substantive reasoning.

format_quote

Generic Humility

Boilerplate hedging without reasoning.

warning

Decorative Caveats

Caveats that add no epistemic content.

link_off

Vague Attribution

Citing without engaging arguments.

groups

Headcount Bias

Deferring to majority over evidence.

Mapping to DeepMind Cognitive Taxonomy

Our 10 judge dimensions map directly to the four sub-abilities identified by DeepMind for measuring progress toward AGI.

visibility

Monitoring

Tracking reasoning quality and detecting potential errors in real-time.

Attribution Depth Steelmanning Quality

tune

Control

Adjusting reasoning strategies and maintaining epistemic integrity under pressure.

Logical Grounding Capitulation Quality Normative/Informational

fact_check

Evaluation

Assessing the validity of own conclusions and calibrating confidence.

Transparency Intellectual Courage Confidence Coherence

self_improvement

Self-regulation

Correcting identified errors and managing the overall reasoning process.

Error Acknowledgement Blind Spot Recognition

Ipsative scoring removes the PC1 general factor, revealing each model's relative ability profile independent of overall performance level.

Anti-Gaming Controls

Six mechanisms prevent models from inflating scores through surface-level strategies

visibility_off

Consensus Masking

Models never see the ground-truth consensus, only individual analyst opinions.

person_off

Anonymised Analysts

Analyst identities are stripped so models cannot game based on model reputation.

science

Known-Answer Traps

30 instances with verified correct answers detect sycophantic capitulation to wrong consensus.

fact_check

Per-Claim Verified-Wrong Scoring

Scoring operates at the individual claim level, not just overall position.

sync

3-Judge Circularity Rotation

No model family judges itself. Three independent judge models rotate across evaluations.

rule

Anti-Rhetoric Rubric

Penalises generic humility, decorative caveats, attribution without specificity, and headcount reasoning.

Instance Generation Pipeline

Every instance in MEDLEY-BENCH is generated through a fully automated pipeline with multi-stage quality control. No human annotation is required.

lightbulb

Seed Design

25 frontier models generate problem seeds across 5 domains, filtered for genuine ambiguity.

diversity_3

Analyst Collection

28 mid-range models (8B–72B) produce independent responses. 8 are adaptively selected per instance to maximise ensemble diversity (+19.9% vs random selection).

handshake

Consensus Building

Jackknife consensus with leave-one-out robustness testing. Per-claim confidence and disagreement scores computed.

verified

Quality Verification

3 premium judges verify 650 claims (130 instances × 5 claims). 79% agreement rate. Per-claim verified-wrong labels assigned.

package_2

Export

Instances packaged with vignette, analyst responses, consensus, and ground truth. 15 quality gates (G1–G15) validated.

The pipeline produces ~50 new instances at ~$15 per batch. Generation time depends on the API providers and models used. The dataset can be extended to new domains without re-annotation.