Cross-Validation
Cross-validation helps measure model performance more reliably by using multiple subsets of the data instead of a single validation set.
Why Not Use a Single Validation Set?
- Using only one validation set can give noisy, luck-dependent results: the score depends on which rows happen to land in the split.
- Example: In a dataset with 5000 rows, holding out 1000 rows as validation leaves only 4000 for training, and a different choice of those 1000 rows can produce a noticeably different score.
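This luck-dependence can be demonstrated with a small sketch. The dataset here is synthetic (generated with scikit-learn's `make_regression` purely for illustration); the same model is scored on three different 1000-row validation splits, and the MAE varies with the split.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the 5000-row example.
X, y = make_regression(n_samples=5000, n_features=10, noise=10.0, random_state=0)

# Score the same model on three different 1000-row validation splits.
for seed in (0, 1, 2):
    X_tr, X_val, y_tr, y_val = train_test_split(
        X, y, test_size=1000, random_state=seed
    )
    model = RandomForestRegressor(n_estimators=50, random_state=0)
    model.fit(X_tr, y_tr)
    mae = mean_absolute_error(y_val, model.predict(X_val))
    print(f"split seed={seed}: MAE={mae:.2f}")
```

Each run prints a somewhat different MAE even though the model and data are identical; only the split changed.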
How Cross-Validation Works
- Split the data into k folds (e.g., 5 folds, each 20% of the data).
- For each fold:
  - Use that fold as the validation set.
  - Train on the remaining k−1 folds.
- Repeat for all folds so every row is used for validation exactly once.
- Average the performance metric across all folds for a more reliable score.
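The steps above can be written out by hand with scikit-learn's `KFold` splitter (synthetic data is assumed here just to make the sketch runnable):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold

# Synthetic data for illustration only.
X, y = make_regression(n_samples=500, n_features=5, noise=5.0, random_state=0)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = []
for train_idx, val_idx in kf.split(X):
    # Train on k-1 folds, validate on the held-out fold.
    model = RandomForestRegressor(n_estimators=50, random_state=0)
    model.fit(X[train_idx], y[train_idx])
    preds = model.predict(X[val_idx])
    fold_scores.append(mean_absolute_error(y[val_idx], preds))

print("Fold MAEs:", [f"{s:.2f}" for s in fold_scores])
print("Average MAE:", np.mean(fold_scores))
```

In practice the `cross_val_score` helper shown below does this loop for you; the manual version is only to make the mechanics explicit.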
When to Use Cross-Validation
- Small datasets: recommended. The extra compute is cheap, and every row contributes to both training and validation.
- Large datasets: a single validation set is often sufficient, and cross-validation may be needlessly slow.
Implementation Example (Python)
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# X (features) and y (target) are assumed to be loaded already.
my_pipeline = Pipeline(steps=[
    ('preprocessor', SimpleImputer()),
    ('model', RandomForestRegressor(n_estimators=50, random_state=0))
])

# 5-fold cross-validation. scikit-learn returns *negative* MAE
# (so that higher is always better), hence the -1 to recover
# positive error values.
scores = -1 * cross_val_score(
    my_pipeline, X, y, cv=5, scoring='neg_mean_absolute_error'
)
print("MAE scores:", scores)
print("Average MAE:", scores.mean())
Summary
- Reduces randomness in model evaluation.
- Gives a more accurate measure of performance.
- Very useful for small datasets or when comparing models.
- Pipelines simplify cross-validation and make code cleaner.
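For the model-comparison use case mentioned above, the averaged cross-validation score gives a stable basis for ranking candidates. A small sketch (synthetic data and arbitrary `n_estimators` values, chosen only for illustration):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic data for illustration only.
X, y = make_regression(n_samples=500, n_features=5, noise=5.0, random_state=0)

# Compare candidate models by average cross-validated MAE.
for n in (50, 100, 200):
    model = RandomForestRegressor(n_estimators=n, random_state=0)
    scores = -1 * cross_val_score(
        model, X, y, cv=5, scoring='neg_mean_absolute_error'
    )
    print(f"n_estimators={n}: average MAE = {scores.mean():.2f}")
```

Because each candidate is scored on the same folds, differences in average MAE reflect the models rather than the luck of a single split.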