Dataset

130 instances across 5 domains, fully synthetic, no human annotation

Overview

130 Instances
5 Domains
650 Claims (130×5)
28 Analyst Models
30 Known-Answer Traps
~$15 Per 50 New Instances

Entirely synthetic data with multi-stage quality control. No copyrighted material, no human subjects.

Domains


Medical Diagnosis (27)

Evidential reasoning with contradictory clinical evidence


System Troubleshooting (26)

Causal reasoning through diagnostic layers


Code Review (27)

Contextual reasoning where severity depends on threat model


Architecture Design (25)

Tradeoff reasoning with no single right answer


Statistical Reasoning (25)

Formal reasoning where same data supports different frameworks

Instance Structure

Each instance contains the following fields.

vignette: Problem scenario (~500 words)
key_claims: 5 claims (C1–C5) with disagreement scores
ensemble_outputs: 8 analyst responses drawn from the 28-model pool
consensus: Jackknife consensus with per-claim confidence
_verified_wrong_claims: Per-claim ground truth from 3-judge verification
is_known_answer: Flag marking the 30 known-answer (KA) instances used for sycophancy detection
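The fields above can be sketched as a plain Python dict. The field names match the documented schema, but the inner structure of each field shown here is an assumption; consult a real file under data/ for the exact shape.

```python
# Minimal sketch of one instance; field names follow the documented schema,
# inner structures are illustrative assumptions.
instance = {
    "vignette": "A 54-year-old patient presents with ...",  # ~500 words in real data
    "key_claims": [{"id": f"C{i}", "disagreement": 0.0} for i in range(1, 6)],
    "ensemble_outputs": [],        # 8 analyst responses drawn from the 28-model pool
    "consensus": {},               # jackknife consensus with per-claim confidence
    "_verified_wrong_claims": [],  # per-claim ground truth from 3-judge verification
    "is_known_answer": False,      # True for the 30 known-answer trap instances
}

claim_ids = [c["id"] for c in instance["key_claims"]]
print(claim_ids)  # ['C1', 'C2', 'C3', 'C4', 'C5']
```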

Download

The dataset is included in the GitHub repository under data/. It is also available as a standalone Kaggle dataset.

Extensibility

The fully automated pipeline produces ~50 new instances at ~$15 per batch (generation time depends on API provider throughput), with zero human annotation, enabling continuous validation.

medley-bench init              # initialise a new benchmark workspace
medley-bench load-seeds        # load the seed vignettes
medley-bench expand            # generate new instances from the seeds
medley-bench collect           # collect ensemble outputs from the analyst models
medley-bench build-consensus   # compute the jackknife consensus
medley-bench validate          # run quality-control checks
medley-bench export            # export the finished dataset

Python Library

Install from PyPI and use MEDLEY-BENCH as a library in your own projects.

# Core (scoring, analysis)
pip install medley-bench

# With a provider
pip install medley-bench[openai]
pip install medley-bench[anthropic]
pip install medley-bench[all-providers]

# Everything
pip install medley-bench[full]

Easy API Key Setup

Set environment variables, use a .env file, or ~/.medley-bench.env. Ollama works with no key at all. Clear error messages if a key is missing.
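A minimal sketch of both options. The variable names below follow common provider conventions (OPENROUTER_API_KEY, OPENAI_API_KEY, ANTHROPIC_API_KEY) and are assumptions; check the library's error messages or docs for the exact names it reads.

```shell
# Option 1: export a key in the shell environment
# (OPENROUTER_API_KEY is an assumed name following provider conventions)
export OPENROUTER_API_KEY="sk-or-..."

# Option 2: keep keys in a project-local .env file
cat > .env <<'EOF'
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
EOF
```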


Python API

Run individual instances, compute scores, analyse results, and build ipsative profiles programmatically. Full async provider support.


CLI Interface

Run benchmarks, view leaderboards, generate instances, and export results from the command line with medley-bench.

Quick Example

# Score a single instance
import asyncio

from src.core.providers import get_provider
from src.tracks.metacognition.prompts.step_a import build_prompt

provider = get_provider("openai/gpt-4.1")
step_a = asyncio.run(provider.complete(build_prompt(vignette="...")))

# Compute measures (step_bs and instance come from the full pipeline run)
from src.tracks.metacognition.scoring.measures import update_proportionality

prop = update_proportionality(step_a, step_bs, instance)

# Full docs: docs/LIBRARY.md

5 Provider Backends

OpenRouter (one key, 200+ models) · Anthropic (Claude) · OpenAI (GPT, o-series) · Google (Gemini) · Ollama (local, free)


95 Tests, 0 API Keys Needed

Unit tests for all scoring measures, integration tests for the pipeline, and scoring validation tests for weight constraints and score bounds. CI on Python 3.10–3.12.