Evidence-based guidelines for rigorous, reproducible, and regulation-ready cross-validation in clinical machine learning projects.
The most important do's and don'ts at a glance.
Follow this decision tree to choose the right cross-validation strategy.
Side-by-side comparison of wrong and correct approaches.
1. Data Leakage — Preprocessing Before Splitting
# --- WRONG: fit_transform on the full matrix lets test-set statistics
# (mean/std) influence the scaled training data.
# Preprocessing before splitting - LEAKAGE!
X_scaled = StandardScaler().fit_transform(X)
X_train, X_test = train_test_split(X_scaled)
# --- RIGHT: fit the scaler on the training split only, then apply the
# frozen transform to the test split.
# Split first, then preprocess
X_train, X_test = train_test_split(X)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test) # Only transform!
2. Patient Data Mixing
# --- WRONG: plain KFold can place different samples from the same patient
# into both train and test folds, inflating performance estimates.
# Regular k-fold ignores patient grouping
kf = KFold(n_splits=5)
for train, test in kf.split(X):
# Same patient can be in both sets!
# --- RIGHT: group-aware splitting keeps every sample of a patient in a
# single fold. NOTE(review): presumably mirrors sklearn's GroupKFold —
# confirm against the trustcv docs.
from trustcv.splitters import GroupKFold
pgkf = GroupKFold(n_splits=5)
for train, test in pgkf.split(X, groups=patient_ids):
# Patient data stays together
3. Ignoring Class Imbalance
# --- WRONG (for unstratified splitters): a rare class can vanish from
# some folds, making per-fold metrics undefined or wildly unstable.
# NOTE(review): sklearn's cross_val_score already uses StratifiedKFold
# when cv is an int and the estimator is a classifier — the failure mode
# shown here applies to plain KFold / regression-style splitting; verify
# the intended framing.
# With 95% negative, 5% positive cases
cv_scores = cross_val_score(model, X, y, cv=5)
# Some folds might have NO positive cases!
# --- RIGHT: explicit stratification preserves the class ratio per fold.
skf = StratifiedKFold(n_splits=5)
cv_scores = cross_val_score(model, X, y, cv=skf)
# Each fold maintains 95/5 ratio
4. Temporal Leakage
# --- WRONG: a random split shuffles time, so the model trains on samples
# that occur after its test samples.
# Random splitting of time-series data
X_train, X_test = train_test_split(
temporal_data, random_state=42
)
# Future data leaks into training!
# --- RIGHT: forward-chaining splits always evaluate on data strictly
# later than the training window.
# NOTE(review): unlike sklearn's TimeSeriesSplit, this variant takes an
# explicit timestamps= argument — confirm in the trustcv docs.
from trustcv.splitters import TimeSeriesSplit
tscv = TimeSeriesSplit(n_splits=5)
for train, test in tscv.split(X, timestamps=dates):
# Always train on past, test on future
Key factors unique to clinical and biomedical machine learning.
Rule of thumb: before choosing the number of folds, check that each class retains a minimum number of samples per fold — the helper below warns when the minority class falls short:
def check_sample_size(y, n_splits=5):
    """Warn when the minority class is too small for ``n_splits``-fold CV.

    Parameters
    ----------
    y : array-like
        Class labels, one per sample.
    n_splits : int, optional
        Number of cross-validation folds (default 5).

    Returns
    -------
    int
        Approximate minority-class samples per fold
        (smallest class size // ``n_splits``).

    Raises
    ------
    ValueError
        If ``y`` is empty (explicit message instead of numpy's opaque
        "zero-size array to reduction operation" error).
    """
    unique, counts = np.unique(y, return_counts=True)
    if counts.size == 0:
        raise ValueError("y must contain at least one sample")
    min_class_size = int(counts.min())
    samples_per_fold = min_class_size // n_splits
    # Rule of thumb: <30 minority samples per fold gives unstable fold
    # metrics; 0 means some folds would contain no minority samples at all.
    if samples_per_fold < 30:
        warnings.warn(
            f"Only {samples_per_fold} samples/fold "
            "for minority class. Consider fewer "
            "splits or different method.",
            stacklevel=2,  # point the warning at the caller, not this helper
        )
    return samples_per_fold
When data comes from multiple hospitals or sites, ensure site effects do not bias results by splitting at the site level.
# Multi-site data: hold out whole sites so the model is evaluated on
# hospitals it never saw during training, not just unseen patients.
from trustcv.splitters import HierarchicalGroupKFold
# Ensure site effects don't bias results
hgkf = HierarchicalGroupKFold(
n_splits=5,
hierarchy_level='site' # Split by site
)
# Prevents overfitting to site-specific patterns
For repeated measurements over time, you must respect both patient grouping and temporal ordering.
# Longitudinal data: combine patient grouping with temporal ordering so a
# patient's later visits never train a model tested on earlier ones.
# NOTE(review): exact semantics of 'grouped_temporal' — confirm in the
# trustcv documentation.
from trustcv import TrustCVValidator
validator = TrustCVValidator(
method='grouped_temporal',
patient_grouping=True,
temporal_ordering=True
)
# Each patient's visits stay together
# and temporal order is preserved
For extremely imbalanced datasets (<1% positive), use fewer folds and apply oversampling only on training data.
# Severe imbalance: fewer folds keep enough positives per fold, and SMOTE
# is fit inside the loop on the training split only — oversampling before
# splitting would leak synthetic copies of test samples into training.
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=3) # Fewer splits
for train_idx, test_idx in skf.split(X, y):
X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]
# SMOTE on training data only
smote = SMOTE(random_state=42)
X_bal, y_bal = smote.fit_resample(
X_train, y_train
)
# NOTE(review): `model` should be (re)constructed per fold — reusing a
# model fitted in a previous fold can carry over learned state.
model.fit(X_bal, y_bal)
score = model.score(X_test, y_test)
A full best-practice pipeline combining leakage checks, preprocessing, medical-aware validation, and regulatory reporting.
# End-to-end example: leakage check -> leak-safe Pipeline -> medical-aware
# grouped/stratified CV -> compliance-oriented reporting.
from trustcv import TrustCVValidator
from trustcv.checkers import DataLeakageChecker
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
# 1. Check for data leakage
# NOTE(review): X_train/X_test and the patient-id arrays are assumed to
# come from an earlier split that is not shown in this snippet.
checker = DataLeakageChecker()
leakage_report = checker.check_cv_splits(
X_train, X_test,
patient_ids_train, patient_ids_test
)
if leakage_report.has_leakage:
raise ValueError(f"Data leakage detected: {leakage_report}")
# 2. Create preprocessing pipeline
# Keeping the scaler inside the Pipeline means it is re-fit on each
# training fold only — no test-fold statistics can leak in.
pipeline = Pipeline([
('scaler', StandardScaler()),
('classifier', RandomForestClassifier(n_estimators=100))
])
# 3. Set up medical-aware validation
# NOTE(review): compliance='FDA' presumably tailors the report for
# regulatory submission — confirm supported values in the trustcv docs.
validator = TrustCVValidator(
method='stratified_group_kfold',
n_splits=5,
check_leakage=True,
check_balance=True,
compliance='FDA'
)
# 4. Perform validation
results = validator.validate(
model=pipeline, X=X, y=y,
groups=patient_ids
)
# 5. Get comprehensive results
print(results.summary())
# 6. Export results
print(results.to_dict())
Click each item to mark it as done. Verify all items before training any model.