130 instances across 5 domains, fully synthetic, no human annotation
Entirely synthetic data with multi-stage quality control. No copyrighted material, no human subjects.
- Evidential reasoning with contradictory clinical evidence
- Causal reasoning through diagnostic layers
- Contextual reasoning where severity depends on the threat model
- Tradeoff reasoning with no single right answer
- Formal reasoning where the same data supports different frameworks
Each instance contains the following fields.
| Field | Description |
|---|---|
| `vignette` | Problem scenario (~500 words) |
| `key_claims` | 5 claims (C1–C5) with disagreement scores |
| `ensemble_outputs` | 8 analyst responses from a 28-model pool |
| `consensus` | Jackknife consensus with per-claim confidence |
| `_verified_wrong_claims` | Per-claim ground truth from 3-judge verification |
| `is_known_answer` | 30 known-answer (KA) instances for sycophancy detection |
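As a rough illustration of these fields, a single instance might look like the sketch below. All field values here are made up for illustration; only the field names come from the schema above.

```python
# Illustrative sketch of one instance; every value below is invented.
instance = {
    "vignette": "A ~500-word problem scenario...",
    "key_claims": [
        {"id": "C1", "text": "...", "disagreement_score": 0.6},
        # ... C2 through C5
    ],
    "ensemble_outputs": ["analyst response 1", "..."],  # 8 responses
    "consensus": {"C1": {"label": "supported", "confidence": 0.8}},
    "_verified_wrong_claims": ["C3"],  # per-claim ground truth
    "is_known_answer": False,
}

# Basic shape check against the documented field names
expected = {"vignette", "key_claims", "ensemble_outputs",
            "consensus", "_verified_wrong_claims", "is_known_answer"}
assert expected.issubset(instance)
```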
The dataset is included in the GitHub repository under `data/`. It is also available as a standalone Kaggle dataset.
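Assuming the instances are stored as one JSON file each under `data/` (the exact on-disk layout is an assumption, not documented here), loading them might look like this:

```python
import json
from pathlib import Path


def load_instances(data_dir="data"):
    """Hypothetical loader: reads every *.json file under data_dir
    and returns the parsed instances in filename order."""
    instances = []
    for path in sorted(Path(data_dir).glob("*.json")):
        with open(path) as f:
            instances.append(json.load(f))
    return instances
```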
The fully automated pipeline produces ~50 new instances at ~$15 per batch (generation time depends on API provider throughput), with zero human annotation, enabling continuous validation.
```shell
medley-bench init
medley-bench load-seeds
medley-bench expand
medley-bench collect
medley-bench build-consensus
medley-bench validate
medley-bench export
```
Install from PyPI and use MEDLEY-BENCH as a library in your own projects.
```shell
# Core (scoring, analysis)
pip install medley-bench

# With a provider
pip install medley-bench[openai]
pip install medley-bench[anthropic]
pip install medley-bench[all-providers]

# Everything
pip install medley-bench[full]
```
Set environment variables, use a `.env` file, or `~/.medley-bench.env`. Ollama works with no key at all, and a clear error message is shown if a required key is missing.
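For example, a key file might look like the fragment below. The variable names follow each provider SDK's usual convention; the exact names MEDLEY-BENCH expects are an assumption here, and the key values are placeholders.

```shell
# ~/.medley-bench.env -- example only; key names assumed from each SDK's convention
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
OPENROUTER_API_KEY=sk-or-...
# Ollama runs locally and needs no key.
```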
Run individual instances, compute scores, analyse results, and build ipsative profiles programmatically. Full async provider support.
Run benchmarks, view leaderboards, generate instances, and export results from the command line with `medley-bench`.
```python
# Score a single instance
import asyncio

from src.core.providers import get_provider
from src.tracks.metacognition.prompts.step_a import build_prompt

provider = get_provider("openai/gpt-4.1")
raw_a = asyncio.run(provider.complete(build_prompt(vignette="...")))

# Compute measures
from src.tracks.metacognition.scoring.measures import *

prop = update_proportionality(step_a, step_bs, instance)

# Full docs: docs/LIBRARY.md
```
OpenRouter (one key, 200+ models) · Anthropic (Claude) · OpenAI (GPT, o-series) · Google (Gemini) · Ollama (local, free)
Unit tests for all scoring measures, integration tests for the pipeline, and scoring validation tests for weight constraints and score bounds. CI on Python 3.10–3.12.