Data leakage occurs when information from the test set inadvertently influences the training process, leading to overly optimistic performance estimates that fail to generalize. In medical machine learning, where patient safety depends on honest validation, detecting and preventing leakage is critical.
Data leakage occurs when information from the test set inadvertently influences the training process, leading to overly optimistic performance estimates that don't generalize to real-world data. In medical ML, this can mean the difference between a model that appears to work perfectly in validation but fails catastrophically in clinical practice. trustcv provides comprehensive tools to detect, report, and prevent all major forms of data leakage.
Four categories of leakage commonly encountered in medical machine learning
The same patient's data appears in both training and test sets. This is the most common form of leakage in healthcare ML, occurring when patients have multiple samples (repeated measurements, multiple images, multi-modal data like CT + MRI).
Using future information to predict past events. This happens when time series data is shuffled or when post-treatment data is used to predict pre-treatment outcomes.
Spatially correlated samples end up in both train and test sets. Adjacent tissue samples, neighboring geographic regions, or overlapping image patches share information that inflates apparent performance.
Related samples through hierarchical structure appear in both sets. Hospital-specific practices, scanner-specific artifacts, or site-specific protocols create hidden correlations.
Side-by-side comparison of wrong vs. correct approaches for each leakage type
from sklearn.model_selection import KFold
import numpy as np

# Multiple images per patient: 1000 images across 200 patients,
# so each patient contributes 5 images.
n_images = 1000
n_patients = 200
patient_ids = np.repeat(
    np.arange(n_patients),
    n_images // n_patients
)
# NOTE: 1000 x 224x224x3 float64 arrays is ~1.2 GB of RAM;
# shrink n_images or the image size for a quick local demo.
X = np.random.randn(n_images, 224, 224, 3)
y = np.random.randint(0, 2, n_images)

# WRONG: ignores patient grouping
standard_cv = KFold(n_splits=5)
for train_idx, test_idx in standard_cv.split(X):
    # Same patient can be in train AND test!
    pass

from trustcv import GroupKFoldMedical
from trustcv import DataLeakageChecker

checker = DataLeakageChecker()

# Verify the CV setup once, OUTSIDE the fold loop: the check covers
# all folds, so repeating the identical call per fold is redundant.
report = checker.check(
    X, y, groups=patient_ids, n_splits=5
)

# CORRECT: groups by patient ID
grouped_cv = GroupKFoldMedical(n_splits=5)
for train_idx, test_idx in grouped_cv.split(
    X, y, groups=patient_ids
):
    # No leakage: patients stay together
    pass
from sklearn.model_selection import KFold

# Daily measurements: 365 timepoints for each of 100 patients.
n_timepoints = 365
n_patients = 100
timestamps = np.repeat(
    np.arange(n_timepoints), n_patients
)
X = np.random.randn(
    n_timepoints * n_patients, 20
)
y = np.random.randint(
    0, 2, n_timepoints * n_patients
)

# WRONG: shuffles time series data!
cv = KFold(n_splits=5, shuffle=True)
for train_idx, test_idx in cv.split(X):
    # Future data leaks into training!
    pass

from trustcv import PurgedKFoldCV

# CORRECT: respects temporal order with gap
purged_cv = PurgedKFoldCV(
    n_splits=5,
    purge_gap=7  # 7-day gap
)
for train_idx, test_idx in purged_cv.split(
    X, y, groups=timestamps
):
    # Purged CV prevents temporal leakage:
    # no future data leaks into training.
    pass
from sklearn.model_selection import KFold

# Image patches with 2-D spatial positions.
n_patches = 500
coordinates = np.random.randn(
    n_patches, 2
) * 100  # x, y positions
X = np.random.randn(n_patches, 2048)
y = np.random.randint(0, 2, n_patches)

# WRONG: ignores spatial correlation
random_cv = KFold(n_splits=5, shuffle=True)
for train_idx, test_idx in random_cv.split(X):
    # Adjacent patches in train and test!
    pass

from trustcv import BufferedSpatialCV

# CORRECT: spatial blocking with buffer
spatial_cv = BufferedSpatialCV(
    n_splits=5,
    spatial_coordinates=coordinates,
    buffer_size=10  # 10-unit buffer
)
for train_idx, test_idx in spatial_cv.split(X):
    # Buffer zones prevent spatial leakage:
    # no adjacent patches in train and test.
    pass
from sklearn.model_selection import KFold

# Multi-center clinical trial data:
# countries -> hospitals -> patients -> samples.
# Synthetic example: 4 countries x 5 hospitals x 10 patients = 200 samples.
countries = np.repeat(np.arange(4), 50)
hospitals = np.repeat(np.arange(20), 10)
patients = np.arange(200)

# WRONG: ignores hierarchical structure
cv = KFold(n_splits=5)
for train_idx, test_idx in cv.split(
    range(len(patients))
):
    # Leakage at patient, hospital,
    # and country levels!
    pass

from trustcv import HierarchicalGroupKFold

# CORRECT: respects hierarchy levels
hierarchical_cv = HierarchicalGroupKFold(
    n_splits=5,
    hierarchy_levels=[
        'country', 'hospital', 'patient'
    ]
)
# Properly separates at the chosen level:
# entire hospitals kept together,
# no cross-contamination between groups.
Use DataLeakageChecker and LeakageDetectionCallback to automatically detect leakage during cross-validation
Standalone checker you can use to verify any train/test split for all types of leakage.
from trustcv import DataLeakageChecker

# verbose=True makes the checker print diagnostics as it runs.
checker = DataLeakageChecker(verbose=True)

# A single call screens for every leakage type: patient grouping,
# temporal ordering, and spatial correlation.
report = checker.check(
    X,
    y,
    groups=patient_ids,       # patient grouping
    timestamps=timestamps,    # temporal order
    coordinates=coordinates,  # spatial data
    n_splits=5,
    random_state=42,
)

# Inspect the findings.
print(report.summary)
print(f"Has leakage: {report.has_leakage}")
print(f"Severity: {report.severity}")
print(f"Types: {report.leakage_types}")
Integrate leakage detection directly into your UniversalCVRunner pipeline for automatic fold-by-fold checking.
from trustcv import UniversalCVRunner
from trustcv.core.callbacks import LeakageDetectionCallback

# Build the callback from the metadata that describes your samples.
leakage_cb = LeakageDetectionCallback(
    patient_ids=patient_ids,
    timestamps=timestamps,
    coordinates=spatial_coords,
)

# Attach it to the runner so every fold is screened automatically.
runner = UniversalCVRunner(cv_splitter=your_cv)
results = runner.run(
    model=your_model,
    data=(X, y),
    callbacks=[leakage_cb],
)

# The callback automatically:
# - checks each fold on_fold_start,
# - reports severity and violations,
# - prints a summary on_cv_end.
Quick reference for preventing each type of leakage
When you have multiple samples per patient, always group by patient ID to keep all of a patient's data on the same side of the split.
# Keep every one of a patient's samples on the same side of the split.
cv = GroupKFoldMedical(n_splits=5)
for fold_train, fold_test in cv.split(
    X, y, groups=patient_ids
):
    # Safe from patient leakage
    pass
For time series data, add a purge gap between training and test periods to prevent temporal information from leaking across folds.
# Time series: leave a purge gap between training and test windows.
cv = PurgedKFoldCV(
    purge_gap=30,  # 30-day gap
    n_splits=5,
)
For spatially correlated data, add buffer zones between train and test regions to eliminate spatial autocorrelation.
# Spatial data: buffer zones keep train and test regions apart.
cv = BufferedSpatialCV(
    buffer_size=100,  # 100m buffer
)
In multi-center studies, group by the highest hierarchical level (e.g., hospital) to prevent cross-contamination.
# Multi-center studies: split at the highest hierarchy level.
cv = GroupKFoldMedical(n_splits=5)
for fold_train, fold_test in cv.split(
    X, y, groups=hospital_ids
):
    # Entire hospitals kept together
    pass
Frequently encountered mistakes and how to fix them
| Pitfall | Consequence | Solution |
|---|---|---|
| Using patient's left and right eye images in different sets | Model learns patient-specific features | Group by patient ID |
| Training on 2023 data, testing on 2022 | Evaluates a scenario impossible in real-world deployment | Use PurgedKFoldCV (or TimeSeriesSplit) |
| Adjacent tissue samples in train/test | Spatial correlation inflates performance | Use BufferedSpatialCV |
| Same MRI scanner in all training data | Model learns scanner artifacts | Stratify by scanner |
| Data augmentation before splitting | Augmented versions in both sets | Augment after splitting |