At the core of machine learning lies a key challenge: building models that perform well not just on training data, but also on new, unseen data.
Two common problems that affect model performance are underfitting and overfitting.
Overfitting:
Overfitting occurs when a model learns the training data too well, including noise and random fluctuations.
In decision trees, this happens when the tree becomes very deep, creating many splits and leaves.
Each leaf ends up containing very few data points, so predictions become highly specific to the training data.
While this leads to very accurate results on training data, the model performs poorly on new or validation data because it fails to generalize.
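This gap between training and validation performance can be seen directly. Below is a minimal sketch (assuming scikit-learn and synthetic data invented for illustration): a tree grown without depth limits memorizes noisy training labels perfectly, yet scores noticeably worse on held-out data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: the label depends on one feature, plus 20% random label noise.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(400, 2))
y = (X[:, 0] > 0).astype(int)
flip = rng.random(400) < 0.2
y[flip] = 1 - y[flip]

X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# No depth limit: the tree keeps splitting until every leaf is pure,
# so each noisy training label is memorized.
deep = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("train accuracy:", deep.score(X_train, y_train))  # perfect on training data
print("val accuracy:  ", deep.score(X_val, y_val))      # noticeably lower
```

Because 20% of the labels are pure noise, no model can score well above 80% on the validation set, yet the deep tree still hits 100% on training data: the difference is exactly the memorized noise.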
Underfitting:
Underfitting happens when a model is too simple to capture important patterns in the data.
For example, a very shallow decision tree with only a few splits groups many dissimilar data points together.
Such a tree lumps many dissimilar data points into the same leaf, so they all receive the same prediction.
As a result, predictions are inaccurate even on the training data itself, and performance remains poor on validation data as well.
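The hallmark of underfitting is poor accuracy on the training set itself. A minimal sketch (again assuming scikit-learn, with a synthetic XOR-style pattern chosen because it needs more than one split):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic XOR-style data: the class depends on the signs of BOTH features,
# so no single split can separate it.
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(400, 2))
y = ((X[:, 0] > 0) ^ (X[:, 1] > 0)).astype(int)

# A depth-1 "stump" is too simple: it scores poorly even on its own training data.
stump = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X, y)
print("stump train accuracy: ", stump.score(X, y))   # barely better than chance

# A slightly deeper tree can isolate the four quadrants and do much better.
deeper = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print("deeper train accuracy:", deeper.score(X, y))
```

The key observation: the stump's low score is measured on the training data, not held-out data, which is what distinguishes underfitting from overfitting.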
The Trade-off:
There is an inherent trade-off between underfitting and overfitting.
- Increasing model complexity (e.g., more tree depth or more leaf nodes) reduces underfitting but increases the risk of overfitting.
- Decreasing complexity reduces overfitting but may lead to underfitting.
The goal is to find a “sweet spot” where the model captures meaningful patterns without memorizing noise.
This is typically achieved by evaluating model performance on validation data, which is not used during training.
Controlling Model Complexity:
In decision trees, an effective way to control this balance is to tune parameters such as max_leaf_nodes, which caps how many leaves the tree may grow.
- Fewer leaf nodes → simpler model → risk of underfitting
- More leaf nodes → complex model → risk of overfitting
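A common way to find the sweet spot is to try several candidate values of max_leaf_nodes and keep the one with the lowest validation error. A minimal sketch (assuming scikit-learn; the data and candidate list are invented for illustration):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data: a smooth pattern (sine) plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(500, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=500)

X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

def get_mae(max_leaf_nodes):
    """Fit on training data, evaluate on validation data (never seen in training)."""
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(X_train, y_train)
    return mean_absolute_error(y_val, model.predict(X_val))

# Very few leaves -> underfit; very many leaves -> overfit.
scores = {n: get_mae(n) for n in [2, 5, 25, 100, 375]}
best = min(scores, key=scores.get)
print(scores)
print("best max_leaf_nodes:", best)
```

Typically the validation error is high at both extremes (2 leaves underfits the sine pattern; 375 leaves lets the tree chase noise) and lowest at an intermediate value, which is the sweet spot described above.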
Conclusion:
A good model should generalize well to new data.
Avoiding underfitting and overfitting is essential for building accurate and reliable machine learning systems.