πŸŽ“ Interactive Cross-Validation Tutorials

Learn medical CV methods through hands-on exercises


🚨 Understanding Data Leakage

Learn how the wrong CV method can dangerously overestimate model performance

Beginner

🎯 Choosing the Right CV Method

Interactive decision tree to select appropriate CV for your data

Beginner

πŸ‘₯ Patient-Level Grouping

Handle multiple records per patient correctly

Intermediate

⏰ Temporal Cross-Validation

Validate time series medical data properly

Intermediate

πŸ”„ Nested Cross-Validation

Unbiased hyperparameter tuning for ML

Advanced

πŸ—ΊοΈ Spatial Cross-Validation

Handle geographic data and disease spread

Advanced

🚨 Understanding Data Leakage

Data leakage is one of the most dangerous mistakes in ML. Let's explore how it happens and its consequences.

Interactive Demo: Patient Data Leakage

You have 100 patients, each with 5 visits. Compare what happens under different CV methods:

⚠️ Real-World Impact

A 25% overestimation means a model that appears 90% accurate is actually only 65% accurate in deployment. This could mean hundreds of missed diagnoses!

```python
from trustcv.splitters import KFold, GroupKFold
from sklearn.model_selection import cross_val_score

# WRONG: Standard K-Fold (patient leakage)
cv_wrong = KFold(n_splits=5)
score_wrong = cross_val_score(model, X, y, cv=cv_wrong)
print(f"Biased AUC: {score_wrong.mean():.3f}")  # 0.92 (inflated!)

# CORRECT: Grouped K-Fold (no leakage)
cv_correct = GroupKFold(n_splits=5)
score_correct = cross_val_score(model, X, y, cv=cv_correct, groups=patient_ids)
print(f"True AUC: {score_correct.mean():.3f}")  # 0.73 (realistic)
```

Quick Check

Why does standard K-fold cause data leakage with patient data?

🎯 Choosing the Right CV Method

Answer these questions to find the perfect CV method for your medical data:

Interactive CV Selector

1. What type of data structure do you have?


πŸ’‘ Pro Tip

When in doubt, use the more conservative CV method. It's better to underestimate performance than to deploy an inadequate model!
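As a rough sketch, the decision logic behind the selector might look like the following. The rules and the returned method names are illustrative assumptions, not part of the trustcv API:

```python
def recommend_cv(repeated_patients: bool, time_ordered: bool, spatial: bool) -> str:
    """Toy version of the interactive decision tree above.

    The precedence (temporal > spatial > grouped) is a common heuristic,
    not an official recommendation from trustcv.
    """
    if time_ordered:
        # Forward-chaining splits; combine with patient grouping if needed
        return "Temporal CV (blocked/rolling splits)"
    if spatial:
        return "Spatial CV with buffer zones"
    if repeated_patients:
        return "GroupKFold (groups=patient_id)"
    return "Stratified K-Fold"

print(recommend_cv(repeated_patients=True, time_ordered=False, spatial=False))
# GroupKFold (groups=patient_id)
```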

πŸ‘₯ Patient-Level Grouping

Many medical datasets have multiple records per patient. Learn how to handle this correctly.

Try It Yourself

Given this patient data structure, identify the correct grouping variable:

```python
data = {
    'patient_id':     [1, 1, 1, 2, 2, 3, 3, 3, 3],
    'visit_date':     ['2023-01', '2023-02', '2023-03',
                       '2023-01', '2023-02',
                       '2023-01', '2023-02', '2023-03', '2023-04'],
    'blood_pressure': [120, 125, 130, 110, 115, 140, 145, 150, 155],
    'diagnosis':      [0, 0, 1, 0, 0, 1, 1, 1, 1],
}
```

What should you use as the 'groups' parameter in GroupKFold?

πŸ’‘ Key Principle

All records from the same patient must stay together - either all in training or all in test, never split!
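You can verify this principle directly with a group-aware splitter. A minimal sketch using scikit-learn's GroupKFold (rather than trustcv's wrapper) on the toy data above:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# The patient data from the snippet above, as arrays
patient_id = np.array([1, 1, 1, 2, 2, 3, 3, 3, 3])
blood_pressure = np.array([120, 125, 130, 110, 115, 140, 145, 150, 155]).reshape(-1, 1)
diagnosis = np.array([0, 0, 1, 0, 0, 1, 1, 1, 1])

# 3 distinct patients -> at most 3 group-aware folds
gkf = GroupKFold(n_splits=3)
for train_idx, test_idx in gkf.split(blood_pressure, diagnosis, groups=patient_id):
    # Every visit of a given patient lands entirely in train or entirely in test
    assert set(patient_id[train_idx]).isdisjoint(patient_id[test_idx])
    print("test patients:", sorted(set(patient_id[test_idx])))
```

Passing `groups=patient_id` is the whole fix: the splitter partitions patients, not rows.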

⏰ Temporal Cross-Validation

Time series medical data requires special handling to prevent future information from leaking into training.

Visualize Temporal Splits

See how different temporal CV methods split your data:

⚠️ Common Mistake

Using random K-fold on time series data allows the model to "see the future" - training on tomorrow's data to predict today!
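A minimal forward-chaining sketch using scikit-learn's TimeSeriesSplit; trustcv's own temporal splitters are not shown on this page, so this illustrates the principle rather than the library's API:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 12 observations in chronological order (e.g. monthly labs)
X = np.arange(12).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
for train_idx, test_idx in tscv.split(X):
    # Training data always precedes test data: no "seeing the future"
    assert train_idx.max() < test_idx.min()
    print(f"train={train_idx.tolist()} test={test_idx.tolist()}")
# first split: train=[0, 1, 2] test=[3, 4, 5]
```

Each fold trains only on the past and evaluates on the next block of time, which is exactly the constraint random K-fold violates.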

πŸ”„ Nested Cross-Validation

Learn how to perform unbiased hyperparameter tuning with nested CV.

```python
from trustcv.splitters import GroupKFold, NestedGroupedCV
from sklearn.model_selection import GridSearchCV

# Option 1: Manual nested CV with group awareness
outer_cv = GroupKFold(n_splits=5)
inner_cv = GroupKFold(n_splits=3)

for train_idx, test_idx in outer_cv.split(X, y, groups):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]

    # Tune hyperparameters on the training set only
    grid_search = GridSearchCV(model, param_grid, cv=inner_cv)
    grid_search.fit(X_train, y_train, groups=groups[train_idx])

    # Evaluate on the held-out test set
    score = grid_search.score(X_test, y_test)

# Option 2: Use trustcv's NestedGroupedCV
nested_cv = NestedGroupedCV(outer_cv=GroupKFold(5), inner_cv=GroupKFold(3))
```

πŸ’‘ Why Nested CV?

Without nesting, you're using the same data for both selecting hyperparameters AND evaluating performance - leading to optimistic bias!

πŸ—ΊοΈ Spatial Cross-Validation

Geographic data requires special CV methods to handle spatial autocorrelation.

Spatial Autocorrelation Demo

See why random splits fail with spatial data:

πŸ’‘ Buffer Zones

Use buffered spatial CV to create "no data zones" between training and test regions, preventing spatial leakage.
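trustcv's buffered spatial splitter is not shown on this page, so here is a hand-rolled sketch of the buffer-zone idea. The function name, the Euclidean distance on projected coordinates, and the toy points are all illustrative assumptions:

```python
import numpy as np

def buffered_spatial_split(coords, test_mask, buffer_dist):
    """Return (train_idx, test_idx), dropping any training point that
    lies within `buffer_dist` of a test point.

    Toy sketch: plain Euclidean distance stands in for proper
    geographic (e.g. haversine) distance.
    """
    test_pts = coords[test_mask]
    # Distance from every point to its nearest test point
    d = np.sqrt(((coords[:, None, :] - test_pts[None, :, :]) ** 2).sum(-1)).min(axis=1)
    train_mask = (~test_mask) & (d > buffer_dist)
    return np.where(train_mask)[0], np.where(test_mask)[0]

coords = np.array([[0, 0], [1, 0], [5, 0], [10, 0]], dtype=float)
test_mask = np.array([True, False, False, False])
train_idx, test_idx = buffered_spatial_split(coords, test_mask, buffer_dist=2.0)
print(train_idx)  # [2 3] -- the point at (1, 0) falls inside the buffer and is excluded
```

The excluded ring of points is the "no data zone": nearby observations are too correlated with the test region to be fair training data.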

πŸ“ Test Your Knowledge

Comprehensive CV Quiz

Question 1: You have ICU data with hourly measurements. Which CV method should you use?

Question 2: Multi-center trial data should use:

Question 3: Data leakage can cause performance overestimation of: