Theoretical Foundation Protocol Tiers Anti-Gaming Pipeline
architecture

1. The Protocol

3-step isolated isolation pipeline

calculate

2. The Scoring

75% rule-based + 25% LLM-Judge

psychology

3. The Taxonomy

Mapped to 4 DeepMind sub-abilities

Theoretical Foundation

groups

The Medley Principle

Ensembles of diverse, imperfect analysts provide the social pressure needed to reveal metacognitive quality. (Abtahi et al., 2023)

biotech

Probe-Condition Framing

Ensemble disagreement is the experimental probe. We measure cardiac function by applying a stress test, not by measuring the treadmill.

history_edu

Capacity vs Habit

We measure the capacity for metacognition when called upon — a prerequisite for spontaneous exercise in deployment.

article

Read the Medley Framework Paper

Abtahi et al. (2023). Frontiers in Artificial Intelligence.

open_in_new

The Three-Step Protocol

Every instance runs three model calls in isolated contexts to separate different cognitive processes.

1
person

Step A: Solo Analysis

The model receives only the problem vignette. It must produce an independent assessment with per-claim confidence levels. This establishes the independent baseline of reasoning before any social influence.

2
psychology

Step B-Private: Self-Revision

The model sees its own Step A response plus a self-review checklist. It may revise any claim. Changes here reflect genuine self-correction capacity. No external input is provided.

3
groups

Step B-Social: Social Revision

The model sees 8 analyst opinions plus a jackknife consensus. It must explain every confidence change, citing specific analysts and arguments. This measures social updating quality under pressure.

Δ₁ = A → B-Private

Self-revision: Does the model improve through reflection alone?

Δ₂ = B-Private → B-Social

Social influence: Does it update to evidence quality, or headcount?

The three steps run in isolated chat contexts — the model cannot access previous steps' context except through explicit reproduction in the prompt. This ensures that confidence changes in Step B-Social reflect genuine response to social input, not context carryover.

Two Scores

functions

MMS — Medley Metacognition Score

T1 + T2 + T3 aggregate (equal weights). Rewards articulation quality.

psychology

MAS — Medley Ability Score

Mean of Monitoring, Control, Evaluation, Self-regulation. Rewards behavioural competence.

MMS and MAS correlate at ρ = 0.94 but diverge informatively: 8 of 35 models shift ≥5 ranking positions.

Master Scoring Rubric

Scoring is split across three tiers, aggregating into the Medley Metacognition Score (MMS).

Category / Tier Measure Weight Type
T1: Reflective Updating
Proportional belief revision
Proportionality25%Rule-based
Confidence Volatility25%Rule-based
Selectivity20%Rule-based
Uncertainty Localisation20%Rule-based
Brier Score Change10%Rule-based
T2: Social Robustness
Resistance to pressure
Private-vs-Social Delta30%Rule-based
Epistemic Cowardice25%Rule-based
Resistance Appropriateness20%Rule-based
Majority Pressure10%Rule-based
Capitulation Quality10%Judge
Normative/Informational5%Judge
T3: Epistemic Articulation
Evidence-grounded reasoning
Content Engagement15%Rule-based
Steelmanning Quality12%Judge
Argument Specificity10%Rule-based
Synthesis Necessity10%Rule-based
Attribution Depth8%Judge
Intellectual Courage8%Judge
Error Acknowledgement7%Judge
Blind Spot Recognition6%Judge
Confidence Coherence6%Judge
Transparency5%Judge
Logical Grounding5%Rule-based
Epistemic Humility4%Judge
Coherence4%Rule-based

Judge Design

gavel

Scoring Structure

10 dimensions, 30 sub-criteria, 0–3 scale.

autorenew

3-Judge Rotation

Claude Sonnet, GPT-4.1, Gemini 2.5 Pro.

Anti-Rhetoric Rubric

Explicitly penalises surface-level "AI humility" that lacks substantive reasoning.

format_quote

Generic Humility

Boilerplate hedging without reasoning.

warning

Decorative Caveats

Caveats that add no epistemic content.

link_off

Vague Attribution

Citing without engaging arguments.

groups

Headcount Bias

Deferring to majority over evidence.

Mapping to DeepMind Cognitive Taxonomy

Our 10 judge dimensions map directly to the four sub-abilities identified by DeepMind for measuring progress toward AGI.

visibility

Monitoring

Tracking reasoning quality and detecting potential errors in real-time.

Attribution Depth Steelmanning Quality
tune

Control

Adjusting reasoning strategies and maintaining epistemic integrity under pressure.

Logical Grounding Capitulation Quality Normative/Informational
fact_check

Evaluation

Assessing the validity of own conclusions and calibrating confidence.

Transparency Intellectual Courage Confidence Coherence
self_improvement

Self-regulation

Correcting identified errors and managing the overall reasoning process.

Error Acknowledgement Blind Spot Recognition

Ipsative scoring removes the PC1 general factor, revealing each model's relative ability profile independent of overall performance level.

Anti-Gaming Controls

Six mechanisms prevent models from inflating scores through surface-level strategies

visibility_off

Consensus Masking

Models never see the ground-truth consensus, only individual analyst opinions.

person_off

Anonymised Analysts

Analyst identities are stripped so models cannot game based on model reputation.

science

Known-Answer Traps

30 instances with verified correct answers detect sycophantic capitulation to wrong consensus.

fact_check

Per-Claim Verified-Wrong Scoring

Scoring operates at the individual claim level, not just overall position.

sync

3-Judge Circularity Rotation

No model family judges itself. Three independent judge models rotate across evaluations.

rule

Anti-Rhetoric Rubric

Penalises generic humility, decorative caveats, attribution without specificity, and headcount reasoning.

Instance Generation Pipeline

Every instance in MEDLEY-BENCH is generated through a fully automated pipeline with multi-stage quality control. No human annotation is required.

1
lightbulb

Seed Design

25 frontier models generate problem seeds across 5 domains, filtered for genuine ambiguity.

2
diversity_3

Analyst Collection

28 mid-range models (8B–72B) produce independent responses. 8 are adaptively selected per instance to maximise ensemble diversity (+19.9% vs random selection).

3
handshake

Consensus Building

Jackknife consensus with leave-one-out robustness testing. Per-claim confidence and disagreement scores computed.

4
verified

Quality Verification

3 premium judges verify 650 claims (130 instances × 5 claims). 79% agreement rate. Per-claim verified-wrong labels assigned.

5
package_2

Export

Instances packaged with vignette, analyst responses, consensus, and ground truth. 15 quality gates (G1–G15) validated.

The pipeline produces ~50 new instances at ~$15 per batch. Generation time depends on the API providers and models used. The dataset can be extended to new domains without re-annotation.

tml>