Hybrid scoring: 75% deterministic + 25% LLM judge
3-step isolated isolation pipeline
75% rule-based + 25% LLM-Judge
Mapped to 4 DeepMind sub-abilities
Ensembles of diverse, imperfect analysts provide the social pressure needed to reveal metacognitive quality. (Abtahi et al., 2023)
Ensemble disagreement is the experimental probe. We measure cardiac function by applying a stress test, not by measuring the treadmill.
We measure the capacity for metacognition when called upon — a prerequisite for spontaneous exercise in deployment.
Abtahi et al. (2023). Frontiers in Artificial Intelligence.
Every instance runs three model calls in isolated contexts to separate different cognitive processes.
The model receives only the problem vignette. It must produce an independent assessment with per-claim confidence levels. This establishes the independent baseline of reasoning before any social influence.
The model sees its own Step A response plus a self-review checklist. It may revise any claim. Changes here reflect genuine self-correction capacity. No external input is provided.
The model sees 8 analyst opinions plus a jackknife consensus. It must explain every confidence change, citing specific analysts and arguments. This measures social updating quality under pressure.
Self-revision: Does the model improve through reflection alone?
Social influence: Does it update to evidence quality, or headcount?
The three steps run in isolated chat contexts — the model cannot access previous steps' context except through explicit reproduction in the prompt. This ensures that confidence changes in Step B-Social reflect genuine response to social input, not context carryover.
T1 + T2 + T3 aggregate (equal weights). Rewards articulation quality.
Mean of Monitoring, Control, Evaluation, Self-regulation. Rewards behavioural competence.
MMS and MAS correlate at ρ = 0.94 but diverge informatively: 8 of 35 models shift ≥5 ranking positions.
Scoring is split across three tiers, aggregating into the Medley Metacognition Score (MMS).
| Category / Tier | Measure | Weight | Type |
|---|---|---|---|
| T1: Reflective Updating Proportional belief revision | Proportionality | 25% | Rule-based |
| Confidence Volatility | 25% | Rule-based | |
| Selectivity | 20% | Rule-based | |
| Uncertainty Localisation | 20% | Rule-based | |
| Brier Score Change | 10% | Rule-based | |
| T2: Social Robustness Resistance to pressure | Private-vs-Social Delta | 30% | Rule-based |
| Epistemic Cowardice | 25% | Rule-based | |
| Resistance Appropriateness | 20% | Rule-based | |
| Majority Pressure | 10% | Rule-based | |
| Capitulation Quality | 10% | Judge | |
| Normative/Informational | 5% | Judge | |
| T3: Epistemic Articulation Evidence-grounded reasoning | Content Engagement | 15% | Rule-based |
| Steelmanning Quality | 12% | Judge | |
| Argument Specificity | 10% | Rule-based | |
| Synthesis Necessity | 10% | Rule-based | |
| Attribution Depth | 8% | Judge | |
| Intellectual Courage | 8% | Judge | |
| Error Acknowledgement | 7% | Judge | |
| Blind Spot Recognition | 6% | Judge | |
| Confidence Coherence | 6% | Judge | |
| Transparency | 5% | Judge | |
| Logical Grounding | 5% | Rule-based | |
| Epistemic Humility | 4% | Judge | |
| Coherence | 4% | Rule-based |
10 dimensions, 30 sub-criteria, 0–3 scale.
Claude Sonnet, GPT-4.1, Gemini 2.5 Pro.
Explicitly penalises surface-level "AI humility" that lacks substantive reasoning.
Boilerplate hedging without reasoning.
Caveats that add no epistemic content.
Citing without engaging arguments.
Deferring to majority over evidence.
Our 10 judge dimensions map directly to the four sub-abilities identified by DeepMind for measuring progress toward AGI.
Tracking reasoning quality and detecting potential errors in real-time.
Adjusting reasoning strategies and maintaining epistemic integrity under pressure.
Assessing the validity of own conclusions and calibrating confidence.
Correcting identified errors and managing the overall reasoning process.
Ipsative scoring removes the PC1 general factor, revealing each model's relative ability profile independent of overall performance level.
Six mechanisms prevent models from inflating scores through surface-level strategies
Models never see the ground-truth consensus, only individual analyst opinions.
Analyst identities are stripped so models cannot game based on model reputation.
30 instances with verified correct answers detect sycophantic capitulation to wrong consensus.
Scoring operates at the individual claim level, not just overall position.
No model family judges itself. Three independent judge models rotate across evaluations.
Penalises generic humility, decorative caveats, attribution without specificity, and headcount reasoning.
Every instance in MEDLEY-BENCH is generated through a fully automated pipeline with multi-stage quality control. No human annotation is required.
25 frontier models generate problem seeds across 5 domains, filtered for genuine ambiguity.
28 mid-range models (8B–72B) produce independent responses. 8 are adaptively selected per instance to maximise ensemble diversity (+19.9% vs random selection).
Jackknife consensus with leave-one-out robustness testing. Per-claim confidence and disagreement scores computed.
3 premium judges verify 650 claims (130 instances × 5 claims). 79% agreement rate. Per-claim verified-wrong labels assigned.
Instances packaged with vignette, analyst responses, consensus, and ground truth. 15 quality gates (G1–G15) validated.
The pipeline produces ~50 new instances at ~$15 per batch. Generation time depends on the API providers and models used. The dataset can be extended to new domains without re-annotation.