Quick Reference

The most important do's and don'ts at a glance.

Always Do

  • Use stratified splits for imbalanced medical datasets
  • Keep patient data together — never split records from the same patient
  • Check for data leakage before training
  • Report confidence intervals, not just mean performance
  • Save random seeds for reproducibility
  • Document preprocessing steps for regulatory submissions
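Two of the items above — confidence intervals and saved seeds — take only a few lines. A minimal sketch (the `cv_confidence_interval` helper and the example fold scores are illustrative, not part of trustcv):

```python
import numpy as np
from scipy import stats

def cv_confidence_interval(scores, confidence=0.95):
    """t-based confidence interval for the mean CV score (illustrative helper)."""
    scores = np.asarray(scores, dtype=float)
    mean = scores.mean()
    # Half-width of the interval from the Student t distribution
    half = stats.sem(scores) * stats.t.ppf((1 + confidence) / 2, len(scores) - 1)
    return mean - half, mean + half

SEED = 42  # record this alongside the results for reproducibility
fold_scores = [0.81, 0.84, 0.79, 0.86, 0.82]  # example fold accuracies
lo, hi = cv_confidence_interval(fold_scores)
print(f"mean={np.mean(fold_scores):.3f}, 95% CI=[{lo:.3f}, {hi:.3f}]")
```

Reporting the interval rather than the bare mean makes it visible when two models' CV scores are statistically indistinguishable.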

Never Do

  • Preprocess before splitting — causes data leakage
  • Ignore temporal order in longitudinal data
  • Use regular k-fold with grouped patient data
  • Trust a single train-test split — high variance estimates
  • Mix validation and test sets — keep a separate holdout

Method Selection

Follow this decision tree to choose the right cross-validation strategy.

Is your data temporal (time-series)?

  • Yes → Use TimeSeriesSplit
      • Multiple patients? → Use GroupedTimeSeriesSplit
  • No → Multiple records per patient?
      • Yes → Use GroupKFold (imbalanced? use StratifiedGroupKFold)
      • No → Imbalanced (>70/30)?
          • Yes → Use StratifiedKFold
          • No → Use standard KFold
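The tree above can be sketched as a small helper that maps dataset properties to a splitter name (`choose_cv_method` is illustrative, not a trustcv API):

```python
def choose_cv_method(temporal, multiple_patients=False,
                     multiple_records_per_patient=False,
                     imbalanced=False):
    """Map dataset properties to a CV splitter name, following the decision tree."""
    if temporal:
        # Temporal data: respect time order; group by patient if needed
        return "GroupedTimeSeriesSplit" if multiple_patients else "TimeSeriesSplit"
    if multiple_records_per_patient:
        # Grouped data: keep each patient's records in one fold
        return "StratifiedGroupKFold" if imbalanced else "GroupKFold"
    return "StratifiedKFold" if imbalanced else "KFold"

print(choose_cv_method(temporal=True, multiple_patients=True))   # GroupedTimeSeriesSplit
print(choose_cv_method(temporal=False, imbalanced=True))         # StratifiedKFold
```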

Common Pitfalls

Side-by-side comparison of wrong and correct approaches.

1. Data Leakage — Preprocessing Before Splitting

Wrong
# Preprocessing before splitting - LEAKAGE!
X_scaled = StandardScaler().fit_transform(X)
X_train, X_test = train_test_split(X_scaled)
Correct
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Split first, then preprocess
X_train, X_test = train_test_split(X)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # Only transform!

2. Patient Data Mixing

Wrong
from sklearn.model_selection import KFold

# Regular k-fold ignores patient grouping
kf = KFold(n_splits=5)
for train, test in kf.split(X):
    ...  # Same patient can be in both sets!
Correct
from trustcv.splitters import GroupKFold

pgkf = GroupKFold(n_splits=5)
for train, test in pgkf.split(X, groups=patient_ids):
    ...  # Patient data stays together

3. Ignoring Class Imbalance

Wrong
from sklearn.model_selection import KFold, cross_val_score

# With 95% negative, 5% positive cases
cv_scores = cross_val_score(model, X, y, cv=KFold(n_splits=5))
# Some folds might have almost NO positive cases!
Correct
from sklearn.model_selection import StratifiedKFold, cross_val_score

skf = StratifiedKFold(n_splits=5)
cv_scores = cross_val_score(model, X, y, cv=skf)
# Each fold maintains the 95/5 ratio

4. Temporal Leakage

Wrong
from sklearn.model_selection import train_test_split

# Random splitting of time-series data
X_train, X_test = train_test_split(
    temporal_data, random_state=42
)
# Future data leaks into training!
Correct
from trustcv.splitters import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=5)
for train, test in tscv.split(X, timestamps=dates):
    ...  # Always train on past, test on future

Medical-Specific Considerations

Key factors unique to clinical and biomedical machine learning.

Sample Size Requirements

Minimum samples per class per fold:

  • Binary classification: at least 30 per class
  • Multi-class: at least 10 per class
  • Rare diseases: consider LOOCV or bootstrap

import warnings
import numpy as np

def check_sample_size(y, n_splits=5):
    """Check if the sample size is adequate for CV."""
    unique, counts = np.unique(y, return_counts=True)
    min_class_size = counts.min()
    samples_per_fold = min_class_size // n_splits

    if samples_per_fold < 30:
        warnings.warn(
            f"Only {samples_per_fold} samples/fold "
            "for minority class. Consider fewer "
            "splits or different method."
        )
    return samples_per_fold
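As a quick illustration of why this guideline matters: a 95/5 class split with 400 samples leaves only 4 minority-class samples per fold at n_splits=5 (the dataset here is synthetic):

```python
import numpy as np

y = np.array([0] * 380 + [1] * 20)  # 95% negative, 5% positive
_, counts = np.unique(y, return_counts=True)
samples_per_fold = counts.min() // 5
print(samples_per_fold)  # → 4, far below the 30-per-class guideline
```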

Multi-Site Clinical Trials

When data comes from multiple hospitals or sites, ensure site effects do not bias results by splitting at the site level.

from trustcv.splitters import HierarchicalGroupKFold

# Ensure site effects don't bias results
hgkf = HierarchicalGroupKFold(
    n_splits=5,
    hierarchy_level='site'  # Split by site
)

# Prevents overfitting to site-specific patterns

Longitudinal Patient Data

For repeated measurements over time, you must respect both patient grouping and temporal ordering.

from trustcv import TrustCVValidator

validator = TrustCVValidator(
    method='grouped_temporal',
    patient_grouping=True,
    temporal_ordering=True
)

# Each patient's visits stay together
# and temporal order is preserved

Rare Disease Classification

For extremely imbalanced datasets (<1% positive), use fewer folds and apply oversampling only on training data.

from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold

model = RandomForestClassifier(n_estimators=100)
skf = StratifiedKFold(n_splits=3)  # Fewer splits for very rare positives

scores = []
for train_idx, test_idx in skf.split(X, y):
    X_train, y_train = X[train_idx], y[train_idx]
    X_test, y_test = X[test_idx], y[test_idx]

    # SMOTE on training data only -- never resample the test fold
    smote = SMOTE(random_state=42)
    X_bal, y_bal = smote.fit_resample(X_train, y_train)

    model.fit(X_bal, y_bal)
    scores.append(model.score(X_test, y_test))

Complete Pipeline

A full best-practice pipeline combining leakage checks, preprocessing, medical-aware validation, and regulatory reporting.

from trustcv import TrustCVValidator
from trustcv.checkers import DataLeakageChecker
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# 1. Check for data leakage
checker = DataLeakageChecker()
leakage_report = checker.check_cv_splits(
    X_train, X_test,
    patient_ids_train, patient_ids_test
)

if leakage_report.has_leakage:
    raise ValueError(f"Data leakage detected: {leakage_report}")

# 2. Create preprocessing pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier(n_estimators=100))
])

# 3. Set up medical-aware validation
validator = TrustCVValidator(
    method='stratified_group_kfold',
    n_splits=5,
    check_leakage=True,
    check_balance=True,
    compliance='FDA'
)

# 4. Perform validation
results = validator.validate(
    model=pipeline, X=X, y=y,
    groups=patient_ids
)

# 5. Get comprehensive results
print(results.summary())

# 6. Export results
print(results.to_dict())

Medical ML Project Checklist

Verify all items before training any model.