Main Differences Between MSE Loss and Cross-Entropy Loss
This document gives a concise mathematical comparison of Mean Squared Error (MSE) loss and Cross-Entropy loss, two loss functions commonly used in machine learning and deep learning.
1. Type of Problems They Are Used For
- Cross-Entropy Loss: Used for classification (binary or multi-class).
- MSE Loss: Used mainly for regression.
2. Mathematical Formulas
Mean Squared Error (MSE) Loss
The MSE loss for targets \( y_i \) and predictions \( \hat{y}_i \) over \( n \) samples is:
\[ \text{MSE} = \frac{1}{n} \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 \]
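As a concrete sanity check, here is a minimal NumPy sketch of this formula (the function name and example values are illustrative, not from the original text):

```python
import numpy as np

def mse_loss(y_true, y_pred):
    """Mean squared error averaged over all n samples."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)

# Regression example: three targets vs. three predictions
print(mse_loss([3.0, -0.5, 2.0], [2.5, 0.0, 2.0]))  # 0.1666...
```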
Cross-Entropy Loss
Binary Cross-Entropy (BCE)
\[ \text{BCE} = - \left[ y \log(\hat{y}) + (1-y)\log(1-\hat{y}) \right] \]
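A minimal NumPy sketch of BCE (illustrative names and values; the probability is clipped to avoid log(0)):

```python
import numpy as np

def bce_loss(y_true, y_prob, eps=1e-12):
    """Binary cross-entropy for a single prediction.

    y_true is 0 or 1; y_prob is the predicted probability of class 1.
    """
    p = np.clip(y_prob, eps, 1.0 - eps)
    return -(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p))

print(bce_loss(1, 0.9))  # ~0.105: confident and correct -> small loss
print(bce_loss(1, 0.1))  # ~2.303: confident and wrong   -> large loss
```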
Categorical Cross-Entropy (Multi-Class)
For the predicted probability \( \hat{p}_{y} \) of the correct class \( y \):
\[ \text{CE} = - \log(\hat{p}_{y}) \]
Equivalently, with one-hot encoded targets \( t_i \) over \( C \) classes:
\[ \text{CE} = - \sum_{i=1}^{C} t_i \log(\hat{p}_i) \]
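With one-hot targets the sum picks out exactly \(-\log\) of the probability assigned to the correct class, as this illustrative NumPy sketch shows:

```python
import numpy as np

def categorical_ce(t_onehot, p_pred, eps=1e-12):
    """Categorical cross-entropy for one-hot targets t and predicted probabilities p."""
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1.0)
    return -np.sum(np.asarray(t_onehot, dtype=float) * np.log(p))

# 3-class example where class 1 is the correct class
t = [0, 1, 0]
p = [0.2, 0.7, 0.1]
print(categorical_ce(t, p))  # -log(0.7) ~ 0.357
```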
3. Gradient Behavior
MSE Gradient
For a sigmoid output applied to the logit \( z \):
\[ \hat{y} = \sigma(z) \]
The gradient with respect to \( z \) becomes (up to a constant factor of 2):
\[ \frac{\partial \text{MSE}}{\partial z} = (\hat{y} - y)\hat{y}(1-\hat{y}) \]
When the sigmoid saturates (\( \hat{y} \) close to 0 or 1), \( \hat{y}(1-\hat{y}) \approx 0 \), so the gradient becomes very small and learning slows down.
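The following NumPy sketch (illustrative logits and targets) shows how saturation shrinks the MSE gradient even when the prediction is badly wrong:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mse_grad_wrt_logit(z, y):
    """d(MSE)/dz for a sigmoid output, up to the constant factor of 2."""
    y_hat = sigmoid(z)
    return (y_hat - y) * y_hat * (1.0 - y_hat)

# Badly wrong but saturated: target is 1, logit is strongly negative
print(mse_grad_wrt_logit(-6.0, 1.0))  # ~ -0.0025: almost no learning signal
# Mildly wrong, near the decision boundary
print(mse_grad_wrt_logit(-0.5, 1.0))  # ~ -0.146: much larger gradient
```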
Cross-Entropy Gradient
For sigmoid + BCE:
\[ \frac{\partial \text{CE}}{\partial z} = \hat{y} - y \]
This avoids the gradient shrinkage above: the gradient is directly proportional to the prediction error, giving stable, strong gradients and faster training.
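For comparison, the same saturated prediction under sigmoid + BCE still receives a large gradient (illustrative sketch, mirroring the values above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_grad_wrt_logit(z, y):
    """d(BCE)/dz for a sigmoid output: simply y_hat - y."""
    return sigmoid(z) - y

# The same saturated, badly wrong prediction as above
print(bce_grad_wrt_logit(-6.0, 1.0))  # ~ -0.998: a strong learning signal
print(bce_grad_wrt_logit(-0.5, 1.0))  # ~ -0.622
```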
4. Output Layer Compatibility
- Cross-Entropy: Works naturally with softmax (multi-class) and sigmoid (binary).
- MSE: Not ideal for classification; when paired with sigmoid or softmax outputs its gradients can vanish, and it penalizes confident mistakes only weakly.
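As a hedged illustration, assuming PyTorch as the framework, this is roughly how each loss pairs with a softmax output (tensor values are made up for the example):

```python
import torch
import torch.nn as nn

logits = torch.tensor([[2.0, 0.5, -1.0]])  # raw scores for 3 classes, batch of 1
target = torch.tensor([0])                 # index of the correct class

# CrossEntropyLoss applies log-softmax internally, so it takes raw logits
ce = nn.CrossEntropyLoss()(logits, target)

# Using MSE on the same problem needs an explicit softmax and one-hot targets,
# and suffers from the weak gradients discussed above
probs = torch.softmax(logits, dim=1)
onehot = torch.tensor([[1.0, 0.0, 0.0]])
mse = nn.MSELoss()(probs, onehot)

print(ce.item(), mse.item())
```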
5. Interpretation
Cross-Entropy
Measures how far the predicted probabilities are from the true label distribution; with one-hot targets it reduces to the negative log-likelihood of the correct class.
\[ \text{If } \hat{p}_{y} \rightarrow 0,\quad \text{CE} \rightarrow \infty \]
Strong penalty for confident wrong predictions.
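A quick numeric comparison (illustrative probabilities for the true class) makes this concrete: cross-entropy grows without bound as \( \hat{p}_{y} \to 0 \), while the squared error is capped at 1:

```python
import numpy as np

# Predicted probability assigned to the true class
for p in [0.9, 0.5, 0.1, 0.01]:
    ce = -np.log(p)        # cross-entropy penalty
    mse = (1.0 - p) ** 2   # squared-error penalty against a target of 1
    print(f"p={p:<5} CE={ce:6.3f}  MSE={mse:6.3f}")
```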
MSE
Measures squared Euclidean distance:
\[ (y - \hat{y})^2 \]
This distance is not very meaningful when the targets are discrete class labels rather than continuous values.
Summary Table
| Feature | MSE Loss | Cross-Entropy Loss |
|---|---|---|
| Best for | Regression | Classification |
| Formula | \(\frac{1}{n}\sum_i (y_i-\hat{y}_i)^2\) | \(-\sum_i t_i \log(\hat{p}_i)\) |
| Output type | Continuous values | Probabilities |
| Gradient strength | Weak (can vanish) | Strong and stable |
| Convergence speed | Slow | Fast |
| Penalty for confident mistakes | Weak | Strong |
| Works with Softmax/Sigmoid? | Poorly suited | Naturally suited |
- MSE → Regression tasks
- Cross-Entropy → Classification tasks
- Cross-Entropy provides better gradients, faster convergence, and typically better accuracy for classification.