Random Forest Classifier
Decision trees can leave you with a difficult choice.
A deep tree with many leaves tends to overfit because each prediction comes from only a few examples in its leaf.
On the other hand, a shallow tree with fewer leaves tends to underfit, failing to capture enough distinctions in the raw data.
Even today's most sophisticated models face this trade-off between underfitting and overfitting. Many models, however, use clever techniques to soften it; the random forest is one example.
The random forest combines many decision trees, making predictions by averaging the results of all the trees.
It generally achieves much better predictive accuracy than a single tree and works well with default parameters.
With further modeling and parameter tuning, you can achieve even better performance, though some models are sensitive to parameter choices.
Example in Python:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
# X = features, y = target (assumed to be defined already)
# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# Train a single decision tree
tree_model = DecisionTreeRegressor(random_state=1)
tree_model.fit(X_train, y_train)
tree_predictions = tree_model.predict(X_test)
# Train a random forest
forest_model = RandomForestRegressor(n_estimators=100, random_state=1)
forest_model.fit(X_train, y_train)
forest_predictions = forest_model.predict(X_test)
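To make the accuracy claim above concrete, the two models can be compared by mean absolute error on held-out data. A minimal sketch, using a synthetic dataset from make_regression as a stand-in for the X and y assumed above:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data standing in for your own X, y
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# A single deep tree overfits the noise; the forest averages it away
tree = DecisionTreeRegressor(random_state=1).fit(X_train, y_train)
forest = RandomForestRegressor(n_estimators=100, random_state=1).fit(X_train, y_train)

tree_mae = mean_absolute_error(y_test, tree.predict(X_test))
forest_mae = mean_absolute_error(y_test, forest.predict(X_test))
print(f"Single tree MAE:   {tree_mae:.1f}")
print(f"Random forest MAE: {forest_mae:.1f}")
```

On data like this, the forest's MAE should come out noticeably lower than the single tree's, which is the "much better predictive accuracy" claim in miniature.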
A Random Forest Classifier is a machine learning algorithm used for classification tasks (predicting categories or labels). It builds upon the concept of decision trees, combining many of them to make more robust predictions.
1. Core Idea
- A single decision tree splits data based on feature values to reach a prediction.
- Decision trees are expressive but unstable: small changes in the training data can produce a very different tree and very different predictions.
- Random Forest creates many trees and lets them vote on the final prediction.
2. How it Works
- Bootstrap sampling (Bagging): Each tree trains on a random subset of the training data (with replacement).
- Random feature selection: At each split, only a random subset of features is considered.
- Voting (Majority Rule): each tree predicts a class for the sample, and the class that receives the most votes becomes the forest's prediction.
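The three steps above can be sketched in plain Python. This is a toy illustration, not scikit-learn's implementation: the one-level "stump" used here is a deliberately simplified stand-in for a real decision-tree learner, and the dataset is made up for the example.

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Step 1: draw len(data) rows at random, with replacement."""
    return [rng.choice(data) for _ in data]

def train_stump(data, n_features, rng):
    """Toy 'tree': split once on a randomly chosen feature (step 2),
    predicting the majority label on each side of the split."""
    f = rng.randrange(n_features)
    threshold = sum(x[f] for x, _ in data) / len(data)
    left = [label for x, label in data if x[f] <= threshold]
    right = [label for x, label in data if x[f] > threshold]
    def majority(labels):
        return Counter(labels).most_common(1)[0][0] if labels else 0
    return (f, threshold, majority(left), majority(right))

def predict_stump(stump, x):
    f, threshold, left_label, right_label = stump
    return left_label if x[f] <= threshold else right_label

def forest_predict(forest, x):
    """Step 3: every tree votes; the most common prediction wins."""
    votes = [predict_stump(s, x) for s in forest]
    return Counter(votes).most_common(1)[0][0]

rng = random.Random(42)
# Tiny 2-feature dataset: label 1 when both features are "high"
data = [((0.1, 0.2), 0), ((0.2, 0.1), 0), ((0.9, 0.8), 1),
        ((0.8, 0.9), 1), ((0.15, 0.25), 0), ((0.85, 0.95), 1)]
forest = [train_stump(bootstrap_sample(data, rng), n_features=2, rng=rng)
          for _ in range(25)]
print(forest_predict(forest, (0.9, 0.9)))  # a clearly "high" point
print(forest_predict(forest, (0.1, 0.1)))  # a clearly "low" point
```

Any individual stump can be misled by an unlucky bootstrap sample, but the majority vote across 25 of them is stable — the same effect that makes a real random forest more robust than one tree.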
3. Advantages
- Handles high-dimensional data well.
- Reduces overfitting compared to a single decision tree.
- Can estimate feature importance.
- Relatively robust to noise and outliers in the data.
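The feature-importance point can be seen directly in scikit-learn through a fitted model's feature_importances_ attribute. A small sketch on a synthetic dataset (via make_classification) in which only the first two of five features carry signal:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# 5 features, only the first 2 informative (shuffle=False keeps them first)
X, y = make_classification(n_samples=400, n_features=5, n_informative=2,
                           n_redundant=0, shuffle=False, random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Importances sum to 1; the informative features should dominate
for i, imp in enumerate(rf.feature_importances_):
    print(f"feature {i}: importance {imp:.3f}")
```

The printed importances should concentrate on features 0 and 1, matching how the data was generated.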
4. Disadvantages
- Less interpretable than a single decision tree.
- Slower for very large datasets.
5. Example in Python (Scikit-learn)
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# X = features, y = labels
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Create Random Forest with 100 trees
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
# Predict on test data
y_pred = rf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
Random Forest combines many individually unstable decision trees into a strong, stable classifier that is far less prone to overfitting than any single tree.