πŸŽ“ Interactive Cross-Validation Tutorials

Learn medical CV methods through hands-on exercises


🚨 Understanding Data Leakage

Learn how the wrong CV method can dangerously overestimate model performance

Beginner

🎯 Choosing the Right CV Method

Interactive decision tree to select appropriate CV for your data

Beginner

πŸ‘₯ Patient-Level Grouping

Handle multiple records per patient correctly

Intermediate

⏰ Temporal Cross-Validation

Validate time series medical data properly

Intermediate

πŸ”„ Nested Cross-Validation

Unbiased hyperparameter tuning for ML

Advanced

πŸ—ΊοΈ Spatial Cross-Validation

Handle geographic data and disease spread

Advanced

🚨 Understanding Data Leakage

Data leakage is one of the most dangerous mistakes in ML. Let's explore how it happens and its consequences.

Interactive Demo: Patient Data Leakage

You have 100 patients, each with 5 visits. Compare what happens under different CV methods:

⚠️ Real-World Impact

A 25% overestimation means a model that appears 90% accurate is actually only 65% accurate in deployment. This could mean hundreds of missed diagnoses!

```python
from trustcv.splitters import KFold, GroupKFold
from sklearn.model_selection import cross_val_score

# WRONG: Standard K-Fold (patient leakage)
cv_wrong = KFold(n_splits=5)
score_wrong = cross_val_score(model, X, y, cv=cv_wrong)
print(f"Biased AUC: {score_wrong.mean():.3f}")  # 0.92 (inflated!)

# CORRECT: Grouped K-Fold (no leakage)
cv_correct = GroupKFold(n_splits=5)
score_correct = cross_val_score(model, X, y, cv=cv_correct, groups=patient_ids)
print(f"True AUC: {score_correct.mean():.3f}")  # 0.73 (realistic)
```

Quick Check

Why does standard K-fold cause data leakage with patient data?

🎯 Choosing the Right CV Method

Answer these questions to find the perfect CV method for your medical data:

Interactive CV Selector

1. What type of data structure do you have?


πŸ’‘ Pro Tip

When in doubt, use the more conservative CV method. It's better to underestimate performance than to deploy an inadequate model!
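As a rough sketch, the decision logic behind the selector might look like the following. The rules and the returned method names are illustrative assumptions, not part of the trustcv API:

```python
def recommend_cv(repeated_patients: bool, time_ordered: bool, spatial: bool) -> str:
    """Toy version of the interactive decision tree above.

    The precedence (temporal > spatial > grouped) is a common heuristic,
    not an official recommendation from trustcv.
    """
    if time_ordered:
        # Forward-chaining splits; combine with patient grouping if needed
        return "Temporal CV (blocked/rolling splits)"
    if spatial:
        return "Spatial CV with buffer zones"
    if repeated_patients:
        return "GroupKFold (groups=patient_id)"
    return "Stratified K-Fold"

print(recommend_cv(repeated_patients=True, time_ordered=False, spatial=False))
# GroupKFold (groups=patient_id)
```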

πŸ‘₯ Patient-Level Grouping

Many medical datasets have multiple records per patient. Learn how to handle this correctly.

Try It Yourself

Given this patient data structure, identify the correct grouping variable:

```python
data = {
    'patient_id':     [1, 1, 1, 2, 2, 3, 3, 3, 3],
    'visit_date':     ['2023-01', '2023-02', '2023-03',
                       '2023-01', '2023-02',
                       '2023-01', '2023-02', '2023-03', '2023-04'],
    'blood_pressure': [120, 125, 130, 110, 115, 140, 145, 150, 155],
    'diagnosis':      [0, 0, 1, 0, 0, 1, 1, 1, 1],
}
```

What should you use as the 'groups' parameter in GroupKFold?

πŸ’‘ Key Principle

All records from the same patient must stay together - either all in training or all in test, never split!
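You can verify this principle directly with a group-aware splitter. A minimal sketch using scikit-learn's GroupKFold (rather than trustcv's wrapper) on the toy data above:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# The patient data from the snippet above, as arrays
patient_id = np.array([1, 1, 1, 2, 2, 3, 3, 3, 3])
blood_pressure = np.array([120, 125, 130, 110, 115, 140, 145, 150, 155]).reshape(-1, 1)
diagnosis = np.array([0, 0, 1, 0, 0, 1, 1, 1, 1])

# 3 distinct patients -> at most 3 group-aware folds
gkf = GroupKFold(n_splits=3)
for train_idx, test_idx in gkf.split(blood_pressure, diagnosis, groups=patient_id):
    # Every visit of a given patient lands entirely in train or entirely in test
    assert set(patient_id[train_idx]).isdisjoint(patient_id[test_idx])
    print("test patients:", sorted(set(patient_id[test_idx])))
```

Passing `groups=patient_id` is the whole fix: the splitter partitions patients, not rows.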

⏰ Temporal Cross-Validation

Time series medical data requires special handling to prevent future information from leaking into training.

Visualize Temporal Splits

See how different temporal CV methods split your data:

⚠️ Common Mistake

Using random K-fold on time series data allows the model to "see the future" - training on tomorrow's data to predict today!
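A minimal forward-chaining sketch using scikit-learn's TimeSeriesSplit; trustcv's own temporal splitters are not shown on this page, so this illustrates the principle rather than the library's API:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 12 observations in chronological order (e.g. monthly labs)
X = np.arange(12).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
for train_idx, test_idx in tscv.split(X):
    # Training data always precedes test data: no "seeing the future"
    assert train_idx.max() < test_idx.min()
    print(f"train={train_idx.tolist()} test={test_idx.tolist()}")
# first split: train=[0, 1, 2] test=[3, 4, 5]
```

Each fold trains only on the past and evaluates on the next block of time, which is exactly the constraint random K-fold violates.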

πŸ”„ Nested Cross-Validation

Learn how to perform unbiased hyperparameter tuning with nested CV.

```python
from trustcv.splitters import GroupKFold, NestedGroupedCV
from sklearn.model_selection import GridSearchCV

# Option 1: Manual nested CV with group awareness
outer_cv = GroupKFold(n_splits=5)
inner_cv = GroupKFold(n_splits=3)

for train_idx, test_idx in outer_cv.split(X, y, groups):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]

    # Tune hyperparameters on the training set only
    grid_search = GridSearchCV(model, param_grid, cv=inner_cv)
    grid_search.fit(X_train, y_train, groups=groups[train_idx])

    # Evaluate on the held-out test set
    score = grid_search.score(X_test, y_test)

# Option 2: Use trustcv's NestedGroupedCV
nested_cv = NestedGroupedCV(outer_cv=GroupKFold(5), inner_cv=GroupKFold(3))
```

πŸ’‘ Why Nested CV?

Without nesting, you're using the same data for both selecting hyperparameters AND evaluating performance - leading to optimistic bias!

πŸ—ΊοΈ Spatial Cross-Validation

Geographic data requires special CV methods to handle spatial autocorrelation.

Spatial Autocorrelation Demo

See why random splits fail with spatial data:

πŸ’‘ Buffer Zones

Use buffered spatial CV to create "no data zones" between training and test regions, preventing spatial leakage.
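trustcv's buffered spatial splitter is not shown on this page, so here is a hand-rolled sketch of the buffer-zone idea. The function name, the Euclidean distance on projected coordinates, and the toy points are all illustrative assumptions:

```python
import numpy as np

def buffered_spatial_split(coords, test_mask, buffer_dist):
    """Return (train_idx, test_idx), dropping any training point that
    lies within `buffer_dist` of a test point.

    Toy sketch: plain Euclidean distance stands in for proper
    geographic (e.g. haversine) distance.
    """
    test_pts = coords[test_mask]
    # Distance from every point to its nearest test point
    d = np.sqrt(((coords[:, None, :] - test_pts[None, :, :]) ** 2).sum(-1)).min(axis=1)
    train_mask = (~test_mask) & (d > buffer_dist)
    return np.where(train_mask)[0], np.where(test_mask)[0]

coords = np.array([[0, 0], [1, 0], [5, 0], [10, 0]], dtype=float)
test_mask = np.array([True, False, False, False])
train_idx, test_idx = buffered_spatial_split(coords, test_mask, buffer_dist=2.0)
print(train_idx)  # [2 3] -- the point at (1, 0) falls inside the buffer and is excluded
```

The excluded ring of points is the "no data zone": nearby observations are too correlated with the test region to be fair training data.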

πŸ“ Test Your Knowledge

Comprehensive CV Quiz

Question 1: You have ICU data with hourly measurements. Which CV method should you use?

Question 2: Multi-center trial data should use:

Question 3: Data leakage can cause performance overestimation of: