Explorer - MEDLEY-BENCH

Monitoring & Control

Models plotted in the two-dimensional space of independent analysis (Monitoring) and strategic belief revision (Control). Point size indicates overall MMS.

Aggregate profiles by model family. Select families to compare their cognitive signatures.

F-01

Evaluation scales with size. Control does not.

Evaluation improves by 5–12 points per family as model size increases. Control shows no trend with scale, a dissociation replicated across all 12 families.

F-02

Two behavioural archetypes emerge.

Argument-evaluators revise their beliefs according to the logic of the arguments presented (Anthropic). Statistics-followers revise toward the majority position (xAI, GPT-5.x).

F-03

Judge dimension predicts robustness.

The normative/informational judge axis correlates with adversarial robustness at ρ = −0.82 (p = 0.002), the strongest predictor found.
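
For readers who want to reproduce this kind of statistic, the following is a minimal sketch of a rank correlation, assuming ρ here denotes Spearman's rank correlation (the page does not say which coefficient is used). The function names and the input values are hypothetical illustrations, not MEDLEY-BENCH data or code.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical per-model values, invented for illustration only.
judge_axis = [0.1, 0.4, 0.2, 0.9, 0.7]
robustness = [0.8, 0.5, 0.9, 0.1, 0.3]
rho = spearman_rho(judge_axis, robustness)
```

A negative ρ means that models ranking higher on the judge axis tend to rank lower on robustness. The p-value reported above would need a separate significance test, which this sketch does not cover.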

F-04

Evaluation is universally weakest.

Across all 35 models, Evaluation is the most negative ipsative ability, with no exception found. Self-evaluation is the systematic bottleneck.
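
As an illustration of what an ipsative comparison involves, here is a minimal sketch, assuming "ipsative" means each ability score is centred on the model's own mean across abilities. The choice of abilities in the profile and all scores are hypothetical, chosen only to show the arithmetic, and are not taken from the benchmark.

```python
def ipsative(scores):
    """Centre each ability score on the model's own mean across abilities."""
    own_mean = sum(scores.values()) / len(scores)
    return {ability: value - own_mean for ability, value in scores.items()}


# Hypothetical profile for one model, invented for illustration only.
profile = {"Monitoring": 62.0, "Control": 58.0, "Evaluation": 45.0}

centered = ipsative(profile)
# The ability with the most negative centred score is the model's
# relative weak point, regardless of its absolute performance level.
weakest = min(centered, key=centered.get)
```

Because the centred scores sum to zero within each model, an ipsative profile describes relative strengths and weaknesses inside a model and says nothing about how models compare with one another in absolute terms.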