Dataset

130 instances across 5 domains, fully synthetic, no human annotation

Overview

130 Instances
5 Domains
650 Claims (130×5)
28 Analyst Models
30 Known-Answer Traps
~$15 Per 50 New Instances

Entirely synthetic data with multi-stage quality control. No copyrighted material, no human subjects.

Domains


Medical Diagnosis (27)

Evidential reasoning with contradictory clinical evidence


System Troubleshooting (26)

Causal reasoning through diagnostic layers


Code Review (27)

Contextual reasoning where severity depends on threat model


Architecture Design (25)

Tradeoff reasoning with no single right answer


Statistical Reasoning (25)

Formal reasoning where same data supports different frameworks

Instance Structure

Each instance contains the following fields.

vignette: Problem scenario (~500 words)
key_claims: 5 claims (C1–C5) with disagreement scores
ensemble_outputs: 8 analyst responses drawn from the 28-model pool
consensus: Jackknife consensus with per-claim confidence
_verified_wrong_claims: Per-claim ground truth from 3-judge verification
is_known_answer: Flag marking the 30 known-answer (KA) instances used for sycophancy detection
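The fields above can be sketched as a plain Python dict. The field names match the documented schema, but the inner structure of each field shown here is an assumption; consult a real file under data/ for the exact shape.

```python
# Minimal sketch of one instance; field names follow the documented schema,
# inner structures are illustrative assumptions.
instance = {
    "vignette": "A 54-year-old patient presents with ...",  # ~500 words in real data
    "key_claims": [{"id": f"C{i}", "disagreement": 0.0} for i in range(1, 6)],
    "ensemble_outputs": [],        # 8 analyst responses drawn from the 28-model pool
    "consensus": {},               # jackknife consensus with per-claim confidence
    "_verified_wrong_claims": [],  # per-claim ground truth from 3-judge verification
    "is_known_answer": False,      # True for the 30 known-answer trap instances
}

claim_ids = [c["id"] for c in instance["key_claims"]]
print(claim_ids)  # ['C1', 'C2', 'C3', 'C4', 'C5']
```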

Download

The dataset is included in the GitHub repository under data/. It is also available as a standalone Kaggle dataset.

Extensibility

The fully automated pipeline produces ~50 new instances at ~$15 per batch (generation time depends on API provider throughput), with zero human annotation, enabling continuous validation.

medley-bench init              # initialise a new benchmark workspace
medley-bench load-seeds        # load the seed vignettes
medley-bench expand            # generate new instances from the seeds
medley-bench collect           # collect ensemble outputs from the analyst models
medley-bench build-consensus   # compute the jackknife consensus
medley-bench validate          # run quality-control checks
medley-bench export            # export the finished dataset

Python Library

Install from PyPI and use MEDLEY-BENCH as a library in your own projects.

# Core (scoring, analysis)
pip install medley-bench

# With a provider
pip install medley-bench[openai]
pip install medley-bench[anthropic]
pip install medley-bench[all-providers]

# Everything
pip install medley-bench[full]

Easy API Key Setup

Set environment variables, use a .env file, or ~/.medley-bench.env. Ollama works with no key at all. Clear error messages if a key is missing.
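A minimal sketch of both options. The variable names below follow common provider conventions (OPENROUTER_API_KEY, OPENAI_API_KEY, ANTHROPIC_API_KEY) and are assumptions; check the library's error messages or docs for the exact names it reads.

```shell
# Option 1: export a key in the shell environment
# (OPENROUTER_API_KEY is an assumed name following provider conventions)
export OPENROUTER_API_KEY="sk-or-..."

# Option 2: keep keys in a project-local .env file
cat > .env <<'EOF'
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
EOF
```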


Python API

Run individual instances, compute scores, analyse results, and build ipsative profiles programmatically. Full async provider support.


CLI Interface

Run benchmarks, view leaderboards, generate instances, and export results from the command line with medley-bench.

Quick Example

# Score a single instance
import asyncio

from src.core.providers import get_provider
from src.tracks.metacognition.prompts.step_a import build_prompt

provider = get_provider("openai/gpt-4.1")
step_a = asyncio.run(provider.complete(build_prompt(vignette="...")))

# Compute measures (step_bs and instance come from the full pipeline run)
from src.tracks.metacognition.scoring.measures import update_proportionality

prop = update_proportionality(step_a, step_bs, instance)

# Full docs: docs/LIBRARY.md

5 Provider Backends

OpenRouter (one key, 200+ models) · Anthropic (Claude) · OpenAI (GPT, o-series) · Google (Gemini) · Ollama (local, free)


95 Tests, 0 API Keys Needed

Unit tests for all scoring measures, integration tests for the pipeline, and scoring validation tests for weight constraints and score bounds. CI on Python 3.10–3.12.