Learn cross-validation (CV) methods for medical data through hands-on exercises
Learn how wrong CV methods can dangerously overestimate model performance (Beginner)
Interactive decision tree to select appropriate CV for your data (Beginner)
Handle multiple records per patient correctly (Intermediate)
Validate time series medical data properly (Intermediate)
Unbiased hyperparameter tuning for ML (Advanced)
Handle geographic data and disease spread (Advanced)
Data leakage is one of the most dangerous mistakes in ML. Let's explore how it happens and its consequences.
You have 100 patients, each with 5 visits. Let's see what happens with different CV methods:
A 25-percentage-point overestimation means a model that appears 90% accurate is actually only 65% accurate in deployment. In a clinical setting, that gap could mean hundreds of missed diagnoses!
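To make the 100-patient, 5-visit scenario concrete, here is a minimal stdlib-only sketch. It does not train a model; it only counts how often a record-level random K-fold puts the same patient on both sides of the split. The function names (`make_records`, `random_kfold`, `leaked_fraction`) and the seed are illustrative assumptions, not part of any library.

```python
import random

def make_records(n_patients=100, visits_per_patient=5):
    """Return one patient ID per record (one record per visit)."""
    return [pid for pid in range(n_patients) for _ in range(visits_per_patient)]

def random_kfold(n_records, k=5, seed=0):
    """Shuffle record indices and cut them into k test folds (record-level split)."""
    idx = list(range(n_records))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def leaked_fraction(patient_ids, test_idx):
    """Fraction of test records whose patient also has a record in training."""
    test_set = set(test_idx)
    train_patients = {p for i, p in enumerate(patient_ids) if i not in test_set}
    leaked = sum(patient_ids[i] in train_patients for i in test_idx)
    return leaked / len(test_idx)

patient_ids = make_records()
folds = random_kfold(len(patient_ids))
fractions = [leaked_fraction(patient_ids, fold) for fold in folds]

for fold_no, frac in enumerate(fractions, start=1):
    print(f"fold {fold_no}: {frac:.0%} of test records share a patient with training")
```

With five visits scattered at random, almost every test record has a sibling visit from the same patient in the training set, so the model is partly scored on patients it has already seen.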
Why does standard K-fold cause data leakage with patient data?
Answer these questions to find the most appropriate CV method for your medical data:
When in doubt, use the more conservative CV method. It's better to underestimate performance than to deploy an inadequate model!
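The interactive questions themselves are not reproduced here, so the following is only a rough sketch of the kind of mapping such a decision tree encodes. The function name `cv_requirements` and its four flags are hypothetical; the constraints it returns are the ones described in the modules of this tutorial.

```python
def cv_requirements(repeated_patients=False, temporal=False, spatial=False, tuning=False):
    """Return the CV constraints implied by the properties of a dataset."""
    reqs = []
    if repeated_patients:
        reqs.append("grouped CV: keep all records of a patient in the same fold")
    if temporal:
        reqs.append("temporal CV: train only on data that precedes the test period")
    if spatial:
        reqs.append("buffered spatial CV: leave a gap between training and test regions")
    if tuning:
        reqs.append("nested CV: select hyperparameters in an inner loop only")
    # With independent records and no tuning, plain K-fold has nothing to leak.
    return reqs or ["standard K-fold"]

# Example: hourly ICU measurements, several per patient, with model tuning.
for requirement in cv_requirements(repeated_patients=True, temporal=True, tuning=True):
    print(requirement)
```

Returning every applicable constraint, rather than a single method name, follows the "more conservative" advice above: when several properties hold, all of the corresponding constraints apply.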
Many medical datasets have multiple records per patient. Learn how to handle this correctly.
Given this patient data structure, identify the correct grouping variable:
What should you use as the 'groups' parameter in GroupKFold?
All records from the same patient must stay together - either all in training or all in test, never split!
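The "never split a patient" rule can be sketched in a few lines of plain Python. This is an illustrative re-implementation under simple assumptions (round-robin assignment of patients to folds, made-up patient IDs), not the scikit-learn code; in scikit-learn the same idea is available as `GroupKFold`, whose `split(X, y, groups)` method takes the patient ID column as `groups`.

```python
from collections import defaultdict

def group_kfold(groups, k=5):
    """Yield (train_idx, test_idx) so that each group lands in exactly one test fold."""
    by_group = defaultdict(list)
    for i, g in enumerate(groups):
        by_group[g].append(i)
    unique = sorted(by_group)
    for fold in range(k):
        test_groups = unique[fold::k]          # round-robin assignment of patients
        held_out = set(test_groups)
        test_idx = [i for g in test_groups for i in by_group[g]]
        train_idx = [i for i, g in enumerate(groups) if g not in held_out]
        yield train_idx, test_idx

# Hypothetical patient IDs, one entry per record (illustrative only).
patient_id = ["P001", "P001", "P002", "P003", "P003",
              "P003", "P004", "P005", "P005", "P006"]
splits = list(group_kfold(patient_id, k=3))

for train_idx, test_idx in splits:
    train_patients = {patient_id[i] for i in train_idx}
    test_patients = {patient_id[i] for i in test_idx}
    print(sorted(test_patients), "overlap with training:", train_patients & test_patients)
```

Because whole patients are assigned to folds, fold sizes in records can be uneven when patients have different numbers of visits.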
Time series medical data requires special handling to prevent future information from leaking into training.
See how different temporal CV methods split your data:
Using random K-fold on time series data allows the model to "see the future" - training on tomorrow's data to predict today!
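One common way to avoid "seeing the future" is an expanding-window (forward-chaining) split, sketched below. It assumes the records are already sorted by time; the function name and the default test size are illustrative choices. scikit-learn offers a comparable splitter as `TimeSeriesSplit`.

```python
def expanding_window_splits(n_samples, n_splits=4, test_size=None):
    """Yield (train_idx, test_idx); every training index precedes every test index."""
    if test_size is None:
        test_size = n_samples // (n_splits + 1)
    for i in range(n_splits):
        test_start = n_samples - (n_splits - i) * test_size
        train_idx = list(range(test_start))                      # everything earlier
        test_idx = list(range(test_start, test_start + test_size))
        yield train_idx, test_idx

# Ten time-ordered records, e.g. ten consecutive days (illustrative only).
splits = list(expanding_window_splits(n_samples=10, n_splits=4))

for train_idx, test_idx in splits:
    print("train:", train_idx, "test:", test_idx)
```

Each successive split trains on a longer history and tests on the block that immediately follows it, which mirrors how the model would be used in deployment.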
Learn how to perform unbiased hyperparameter tuning with nested CV.
Without nesting, you're using the same data for both selecting hyperparameters AND evaluating performance - leading to optimistic bias!
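The nesting can be sketched with a toy model. Everything below is an illustrative assumption: a 1-D k-nearest-neighbour classifier, synthetic data, strided folds, and a candidate list of `k` values. The point is only the structure: `select_k` sees the outer training part alone, and the outer test part is used once, for scoring. In scikit-learn the usual equivalent is to pass a `GridSearchCV` object to `cross_val_score`.

```python
import random

def knn_predict(train, x, k):
    """Majority vote among the k training points nearest to x (1-D feature)."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if votes * 2 > len(nearest) else 0

def accuracy(train, test, k):
    return sum(knn_predict(train, x, k) == y for x, y in test) / len(test)

def folds(data, n_folds):
    """Yield (train, test) lists by striding through the data."""
    for f in range(n_folds):
        test = data[f::n_folds]
        train = [p for i, p in enumerate(data) if i % n_folds != f]
        yield train, test

def select_k(data, candidates, n_folds=3):
    """Inner loop: pick the k with the best mean CV accuracy on `data` only."""
    def mean_score(k):
        scores = [accuracy(tr, te, k) for tr, te in folds(data, n_folds)]
        return sum(scores) / len(scores)
    return max(candidates, key=mean_score)

def nested_cv(data, candidates, outer_folds=5):
    """Outer loop: tune on the outer training part, score on the untouched test part."""
    scores = []
    for outer_train, outer_test in folds(data, outer_folds):
        best_k = select_k(outer_train, candidates)
        scores.append(accuracy(outer_train, outer_test, best_k))
    return scores

# Synthetic toy data: label 1 becomes more likely as x grows (illustrative only).
rng = random.Random(0)
data = []
for _ in range(60):
    x = rng.random()
    data.append((x, int(x + rng.gauss(0, 0.3) > 0.5)))

outer_scores = nested_cv(data, candidates=[1, 3, 5])
print("outer-fold accuracies:", [round(s, 2) for s in outer_scores])
```

The toy data has one record per observation. With repeated patient records, both the inner and outer folds would also need to be grouped by patient.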
Geographic data requires special CV methods to handle spatial autocorrelation.
See why random splits fail with spatial data:
Use buffered spatial CV to create "no data zones" between training and test regions, preventing spatial leakage.
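A buffer can be sketched as a distance filter: any non-test location that lies closer than the buffer to a test location is left out of training. The sketch below assumes planar coordinates in kilometres and Euclidean distance; latitude/longitude data would need a geodesic distance instead. The function name, the coordinates, and the 3 km buffer are all hypothetical.

```python
import math

def buffered_split(coords, test_idx, buffer_km):
    """Return (train_idx, test_idx, dropped_idx).

    Non-test points closer than buffer_km to any test point are dropped
    rather than used for training, which creates the "no data zone".
    """
    test_set = set(test_idx)
    train, dropped = [], []
    for i, point in enumerate(coords):
        if i in test_set:
            continue
        nearest = min(math.dist(point, coords[j]) for j in test_idx)
        (dropped if nearest < buffer_km else train).append(i)
    return train, list(test_idx), dropped

# Hypothetical clinic locations on a flat map, in km (illustrative only).
coords = [(0, 0), (1, 0), (2, 0), (5, 0), (9, 0), (10, 0)]
train_idx, test_idx, dropped_idx = buffered_split(coords, test_idx=[0, 1], buffer_km=3)

print("train:", train_idx, "test:", test_idx, "dropped:", dropped_idx)
```

The buffer trades training data for independence: a wider buffer removes more spatially correlated neighbours but leaves fewer points to train on, so its size is usually chosen from the range over which the outcome is spatially correlated.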
Question 1: You have ICU data with hourly measurements. Which CV method should you use?
Question 2: Multi-center trial data should use:
Question 3: Data leakage can cause performance overestimation of: