Gradient Boosting and XGBoost
1. Ensemble Methods Recap
Random Forests combine many decision trees by averaging predictions. Gradient Boosting is another ensemble method, but instead of averaging, it adds models sequentially, each one correcting the errors of the previous ones.
2. How Gradient Boosting Works
- Start with a simple model (can be inaccurate).
- Predict values and compute a loss function (like mean squared error).
- Train a new model to correct the errors of the current ensemble.
- Add the new model to the ensemble.
- Repeat iteratively; this sequential correction of errors is what "boosting" refers to.
The "gradient" part comes from gradient descent: each new model is fit to the negative gradient of the loss with respect to the current predictions. For squared error, that gradient is simply the residuals (true values minus current predictions).
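The steps above can be sketched in a few lines using plain scikit-learn trees. This is a minimal illustration, not XGBoost itself; the toy dataset, tree depth, learning rate, and round count are all assumptions chosen for the example:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy data: y = x^2 plus noise (illustrative assumption)
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = X[:, 0] ** 2 + rng.normal(0, 0.1, size=200)

learning_rate = 0.1
n_rounds = 50

# Step 1: start with a simple (inaccurate) model -- the mean of y
prediction = np.full_like(y, y.mean())
trees = []

for _ in range(n_rounds):
    # Step 2: for squared error, the negative gradient is the residual
    residuals = y - prediction
    # Step 3: train a new small model to correct the current errors
    tree = DecisionTreeRegressor(max_depth=2)
    tree.fit(X, residuals)
    # Step 4: add the new model's (shrunken) predictions to the ensemble
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)

print("final training MSE:", float(np.mean((y - prediction) ** 2)))
```

Each pass through the loop reduces the training error a little, which is exactly the "each model corrects the previous ones" idea.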
3. XGBoost
XGBoost is a high-performance implementation of gradient boosting. Optimized for speed and accuracy, it works especially well on standard tabular data (such as Pandas DataFrames).
4. Model Fitting Example
```python
from xgboost import XGBRegressor
from sklearn.metrics import mean_absolute_error

# X_train, y_train, X_valid, y_valid are assumed to be prepared already
my_model = XGBRegressor()
my_model.fit(X_train, y_train)

predictions = my_model.predict(X_valid)
# sklearn's convention is (y_true, y_pred); MAE is symmetric, but keep the order
print("MAE:", mean_absolute_error(y_valid, predictions))
```
MAE (Mean Absolute Error) is the average absolute difference between predictions and true values. The model adds trees iteratively to drive this error down.
5. Important Parameters
- n_estimators – Number of trees in the ensemble. Too low → underfitting, too high → overfitting.
- early_stopping_rounds – Stop adding trees when the validation score has not improved for the given number of rounds (requires passing an eval_set when fitting). Lets you set a high n_estimators and find a good number of trees automatically.
- learning_rate – Shrinks each tree’s contribution to reduce overfitting. Smaller learning rate + more trees usually improves accuracy.
- n_jobs – Number of CPU cores for parallel computation (speeds up training).
6. Summary
- Gradient boosting builds sequential models, each correcting the last.
- XGBoost is fast, accurate, and widely used in practice.
- Parameter tuning (n_estimators, learning_rate, early_stopping_rounds) is essential for optimal performance.
- Best for tabular data, not images or unstructured data.
Gradient boosting iteratively improves models by focusing on mistakes, and XGBoost is a powerful, optimized library to do this efficiently.
Random Forest vs Gradient Boosting
Random Forest builds many decision trees independently and averages their predictions. Each tree may be weak alone, but averaging reduces variance and improves accuracy.
Gradient Boosting builds trees sequentially, where each new tree focuses on correcting the mistakes of the previous ones. This step-by-step approach gradually reduces bias and improves predictions.
Random Forest improves predictions by averaging multiple trees, while Gradient Boosting improves predictions by learning from errors sequentially.
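The contrast can be seen side by side with scikit-learn's two ensemble regressors. This is a minimal sketch: the synthetic dataset and model settings are assumptions for illustration, not a benchmark:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

# Synthetic regression data (illustrative assumption)
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

# Random Forest: trees built independently, predictions averaged
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)
# Gradient Boosting: trees built sequentially, each fit to the residuals
gb = GradientBoostingRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)

print("RF MAE:", mean_absolute_error(y_valid, rf.predict(X_valid)))
print("GB MAE:", mean_absolute_error(y_valid, gb.predict(X_valid)))
```

Which one wins depends on the dataset; the point is the structural difference, parallel averaging versus sequential error correction.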