Model Validation Explained
In machine learning, model validation is the process of checking how well your model makes predictions on new, unseen data. The goal is simple: we want the model to perform well in the real world, not just on the data it was trained on.
Why Do We Need Model Validation?
A common mistake is evaluating a model using the same data it was trained on. This is called an "in-sample evaluation". Because the model has already seen those examples, its score will be overly optimistic: it may simply have memorized patterns in the training data rather than learned patterns that generalize.
Simple Example
Imagine your dataset shows that houses with green doors are expensive. The model may learn this pattern and assume all green-door houses are expensive.
- This pattern may only exist in your training data
- It may not be true in real-world data
- So the model will fail when used in practice
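The gap described above can be shown with a small synthetic sketch (the data, the noise level, and the model settings here are made up purely for illustration). An unrestricted decision tree memorizes its training rows, so its in-sample error is essentially zero, yet its error on fresh rows drawn from the same process is far larger:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Training data: a noisy linear relationship (illustrative only)
X = rng.uniform(0, 10, size=(100, 1))
y = X.ravel() + rng.normal(0, 2, size=100)

# An unrestricted tree can memorize every training row
model = DecisionTreeRegressor()
model.fit(X, y)

# In-sample error: essentially zero, because the tree memorized the data
train_mae = np.mean(np.abs(y - model.predict(X)))

# Fresh data from the same process: the error is much larger
X_new = rng.uniform(0, 10, size=(100, 1))
y_new = X_new.ravel() + rng.normal(0, 2, size=100)
new_mae = np.mean(np.abs(y_new - model.predict(X_new)))

print(train_mae, new_mae)
```

The perfect in-sample score tells us nothing useful; only the error on unseen rows reflects real-world performance.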
The Solution: Train–Validation Split
To fix this, we split the dataset into two parts:
- Training Data: Used to build the model
- Validation Data: Used to test the model on unseen data
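As a quick sketch of the split itself (the tiny arrays here are made up): scikit-learn's train_test_split holds out 25% of the rows for validation by default, and random_state makes the split reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # 10 rows of features (illustrative)
y = np.arange(10)                 # 10 matching target values

# By default, 25% of the rows go to the validation set
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

print(len(train_X), len(val_X))  # 7 training rows, 3 validation rows
```

Passing a different test_size (e.g. test_size=0.2) changes the proportion if a 75/25 split does not suit your dataset.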
Measuring Accuracy: Mean Absolute Error (MAE)
One common way to measure model performance is Mean Absolute Error (MAE).
MAE tells us:
"On average, how far off are our predictions?"
Python Example
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor
# X holds the features and y the target; they are assumed to be loaded already
# Split data into training and validation sets
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)
# Create model (random_state makes the result reproducible)
model = DecisionTreeRegressor(random_state=1)
# Train model on the training data only
model.fit(train_X, train_y)
# Predict on validation data the model has never seen
val_predictions = model.predict(val_X)
# Calculate error
mae = mean_absolute_error(val_y, val_predictions)
print(mae)
Summary
A model that scores well only on its training data may simply have memorized it. This means it is not reliable for real-world predictions.
- Always evaluate models on unseen data
- Never trust training accuracy alone
- Use validation data to estimate real-world performance
- Lower MAE means better predictions