Softmax + NLL and Cross-Entropy Theory
1. Key Purpose of Using Softmax + NLL
Main idea:
To convert raw model outputs into a valid probability distribution and then measure how well the model matches the true class — in a numerically stable way.
1.1 Softmax Turns Logits into Probabilities
A neural network outputs raw scores (logits), for example:
[2.3, -1.2, 0.7]
These are not probabilities. Softmax converts them into a probability distribution:

\[
\text{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j} e^{z_j}}
\]
This is required because NLL operates on probabilities.
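A minimal sketch of this in NumPy (the `softmax` helper below is just an illustrative function written for this example, not a specific library API):

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; this does not change the result.
    z = z - np.max(z)
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

logits = np.array([2.3, -1.2, 0.7])   # the example logits from above
probs = softmax(logits)
print(probs)        # approximately [0.81, 0.02, 0.16]
print(probs.sum())  # 1.0 -- a valid probability distribution
```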
1.2 NLL Measures the Likelihood of the Correct Class
NLL looks only at the predicted probability \( \hat{p}_{y} \) of the true class \( y \):

\[
\text{NLL} = -\log \hat{p}_{y}
\]
- High probability → small loss
- Low probability → large loss
Thus, it encourages the model to increase the probability of the correct class.
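A quick numeric check (NumPy, with probabilities made up for illustration) makes this asymmetry concrete:

```python
import numpy as np

# Predicted probability assigned to the true class in two scenarios (illustrative values).
high_prob_for_true_class = 0.9
low_prob_for_true_class = 0.01

print(-np.log(high_prob_for_true_class))  # ~0.105 -> small loss
print(-np.log(low_prob_for_true_class))   # ~4.605 -> large loss
```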
1.3 Combined Effect: Probabilistic Learning
Softmax + NLL together lead to:
- Sharper probability distributions
- Higher confidence in correct predictions
- Better class separation
1.4 Numerical Stability
Frameworks compute:
log_softmax → NLL
instead of:
softmax → log → NLL
to avoid overflow/underflow issues with exponentials and logarithms.
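A sketch of why this matters, using NumPy and deliberately extreme logits chosen for illustration: the naive path overflows, while the log-sum-exp formulation used inside log_softmax stays finite.

```python
import numpy as np

logits = np.array([1000.0, 0.0, -1000.0])  # deliberately extreme values

# Naive path: softmax -> log. exp(1000) overflows to inf, destroying the result.
naive_probs = np.exp(logits) / np.exp(logits).sum()
print(np.log(naive_probs))   # [nan -inf -inf], plus overflow warnings

# Stable path: log_softmax via the log-sum-exp trick (shift by the max first).
shifted = logits - logits.max()
log_probs = shifted - np.log(np.exp(shifted).sum())
print(log_probs)             # [0. -1000. -2000.] -- finite and essentially exact
```

Frameworks apply the same trick internally; in PyTorch, for example, `F.log_softmax` + `F.nll_loss`, or the combined `F.cross_entropy`, work directly on logits for this reason.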
1.5 One-Sentence Summary
Softmax produces probabilities, and NLL evaluates how well they match the true class — giving a strong and stable learning signal.
2. Theory of Cross-Entropy
2.1 Origin from Information Theory
Cross-entropy measures how different two probability distributions are: a true distribution \( p \) and a predicted distribution \( q \):

\[
H(p, q) = -\sum_{i} p_i \log q_i
\]
Interpretation: How many bits are needed to encode data from \( p \) when using a code optimized for \( q \)?
2.2 Cross-Entropy in Machine Learning
For classification, the true distribution is usually one-hot. Example:
p = [0, 0, 1, 0]
Cross-entropy simplifies to:

\[
H(p, q) = -\log q_{c}
\]

where \( c \) is the index of the correct class (the position of the 1 in \( p \)).
This is exactly the negative log-likelihood (NLL). Thus:
Cross-entropy = Softmax + NLL.
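A minimal NumPy check, with an arbitrary predicted distribution `q` chosen for illustration, confirms that the full cross-entropy sum collapses to the negative log-probability of the single correct class:

```python
import numpy as np

p = np.array([0.0, 0.0, 1.0, 0.0])      # one-hot true distribution (correct class at index 2)
q = np.array([0.1, 0.2, 0.6, 0.1])      # some predicted distribution (illustrative)

cross_entropy = -np.sum(p * np.log(q))  # full definition: -sum_i p_i * log(q_i)
nll = -np.log(q[2])                     # NLL of the correct class only

print(cross_entropy, nll)               # both ~0.511 -- identical
```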
2.3 Why Use Cross-Entropy?
- Rewards high probability for the correct class
- Penalizes confident wrong predictions
2.4 Relationship with Entropy and KL Divergence
Entropy:

\[
H(p) = -\sum_{i} p_i \log p_i
\]

Cross-entropy:

\[
H(p, q) = -\sum_{i} p_i \log q_i
\]

KL divergence:

\[
D_{\mathrm{KL}}(p \parallel q) = \sum_{i} p_i \log \frac{p_i}{q_i} = H(p, q) - H(p)
\]
In classification, \( H(p) \) is a constant of the data (zero for one-hot labels), so minimizing \( H(p,q) \) also minimizes the KL divergence.
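A small NumPy sketch verifying the decomposition \( H(p, q) = H(p) + D_{\mathrm{KL}}(p \parallel q) \), using a soft (non-one-hot) true distribution chosen for illustration so that \( H(p) \) is nonzero:

```python
import numpy as np

def entropy(p):
    return -np.sum(p * np.log(p))

def cross_entropy(p, q):
    return -np.sum(p * np.log(q))

def kl_divergence(p, q):
    return np.sum(p * np.log(p / q))

p = np.array([0.05, 0.05, 0.85, 0.05])   # soft "true" distribution (illustrative)
q = np.array([0.10, 0.20, 0.60, 0.10])   # predicted distribution (illustrative)

print(cross_entropy(p, q))                # H(p, q)
print(entropy(p) + kl_divergence(p, q))   # H(p) + KL(p || q) -- same value
```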
2.5 Why Cross-Entropy Is the Best Loss for Classification
- Correct probabilistic interpretation
- Convex for logistic regression
- Provides strong, informative gradients (with respect to the logits, the gradient is simply \( q - p \); see the sketch after this list)
- Derived directly from maximum likelihood estimation
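A short PyTorch sketch (illustrative logits and target) makes the gradient claim concrete: the gradient of the cross-entropy loss with respect to the logits equals \( q - p \), the predicted probabilities minus the one-hot target.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.3, -1.2, 0.7]], requires_grad=True)  # shape (batch=1, classes=3)
target = torch.tensor([0])                                     # true class index (illustrative)

loss = F.cross_entropy(logits, target)   # log_softmax + NLL in one call
loss.backward()

q = F.softmax(logits.detach(), dim=1)
p = F.one_hot(target, num_classes=3).float()

print(logits.grad)   # gradient w.r.t. the logits
print(q - p)         # identical: softmax(logits) - one_hot(target)
```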
2.6 Cross-Entropy as Maximum Likelihood
Softmax model:

\[
q_k = \frac{e^{z_k}}{\sum_{j} e^{z_j}}
\]

Likelihood of true class \( y \):

\[
\mathcal{L} = q_{y}
\]

Maximizing likelihood is equivalent to minimizing:

\[
-\log q_{y}
\]
This is exactly cross-entropy.
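Over a dataset, the likelihood is the product of the per-example probabilities of the true class, so maximizing it is the same as minimizing the sum of the per-example losses. A tiny numeric sanity check (NumPy, made-up probabilities for three training examples):

```python
import numpy as np

# Predicted probability of the *true* class for three training examples (illustrative values).
q_true = np.array([0.8, 0.6, 0.9])

likelihood = np.prod(q_true)            # quantity MLE maximizes
total_nll = -np.sum(np.log(q_true))     # summed cross-entropy loss the optimizer minimizes

print(-np.log(likelihood), total_nll)   # same number: -log(product) == sum of -log terms
```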
2.7 Summary Table
| Concept | Meaning |
|---|---|
| Softmax | Converts logits into probabilities |
| NLL | Penalizes low probability for correct class |
| Cross-entropy | Measures mismatch between true and predicted distributions |
| Theory | From information theory + maximum likelihood |